Week 5 Assignment_Drills with R

School

Cumberland University

Course

441

Subject

Statistics

Date

Apr 3, 2024

Week 5 Assignment
Manisha Reddy Yerla
Master's in Data Science, University of Cumberland’s
2023 Fall - Statistics for Data Science (MSDS-531-B02) - Second Bi-term
Dr. Mina Richards
November 22, 2023
Drills with R on generalized linear models

Question 1: For the Houses data at the Index of Datasets, consider y = selling price, x1 = tax bill (in dollars), and x2 = whether the house is new.

a. Form the scatter plot of y and x1. Does the normal GLM structure of constant variability in y seem appropriate? If not, how does it seem to be violated?

The R code for this part analyzes a dataset of houses with the variables case, price, size, new, taxes, bedrooms, and baths. It starts by loading the dataset from a specified URL into the data frame “data”. It then creates a scatter plot with the plot function, placing the "taxes" column on the x-axis (data$taxes) and the "price" column on the y-axis (data$price). The xlab, ylab, and main arguments label the x-axis and y-axis and give the plot a title.

The scatter plot shows the relationship between the tax bill (taxes) and the selling price (price). Each point corresponds to one house in the dataset. The spread of the points increases as taxes increase, which suggests that the constant-variability assumption of the normal GLM is violated: the variability of the residuals is not constant across the levels of the predictor. This violation could be addressed by transforming the response variable or by using a different modelling approach.

b. Using the identity link function, fit (i) the normal GLM and (ii) the gamma GLM. (iii) For each model, interpret the effect of x2.

The R code for this part loads the same dataset from a specified URL into the data frame “houses”. The glm function is then used to fit a generalized linear model (GLM) with a normal distribution and the identity link function. The formula “price ~ taxes + new” specifies that the response variable price is modelled using the predictors taxes and new. The argument family = gaussian(link = "identity") specifies the normal distribution and the identity link function. A second call to glm fits another GLM that uses a gamma distribution instead of a normal distribution. The argument family = Gamma(link = "identity") specifies the gamma distribution and the identity link function.

The summary function is then used to display detailed information about each model, including coefficients, standard errors, t-values, and p-values. The focus is on the coefficients of the variables taxes and new, and the interpretation considers whether being a new house has a significant effect on the house price. The line coef(gamma_glm)["new"] extracts the coefficient of the predictor new from the fitted gamma GLM, and the line coef(normal_glm)["new"] extracts the coefficient of new from the fitted normal GLM.

Now, let's interpret the effect of x2 for both models.

Normal GLM (Gaussian distribution with identity link): With the identity link, each coefficient is the change in the mean response for a one-unit change in the predictor, holding the other predictor fixed. The coefficient for new is therefore the estimated difference in mean selling price between a new house and a house that is not new, at the same tax bill. In this case the coefficient is positive, suggesting a higher mean selling price for new houses.

Gamma GLM (Gamma distribution with identity link): Because this model also uses the identity link, its coefficients have the same additive interpretation as in the normal GLM. A multiplicative interpretation would apply only under a log link. The coefficient for new is the estimated difference in mean selling price between a new house and a house that is not new, at the same tax bill. The two models differ in how they treat variability, not in the form of the mean: the gamma GLM allows the spread of selling prices to grow with the mean. In this case the coefficient is again positive, suggesting a higher mean selling price for new houses.
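The original code listings for parts (a) and (b) did not survive in this copy, so the following is only a minimal sketch of the steps described above. Because the dataset URL is not given here, the sketch runs on a small simulated stand-in called houses_demo; that data frame, its coefficients, and its sample size are assumptions made up for illustration and are not the Houses data. Passing the normal GLM coefficients as starting values for the gamma fit is also an added choice, commonly used to help the identity-link gamma fit get started.

```r
# Simulated stand-in for the Houses data (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 100
taxes <- runif(n, 500, 6000)                    # tax bill in dollars
new <- rbinom(n, 1, 0.2)                        # 1 = new house, 0 = not new
mu <- 50 + 0.04 * taxes + 40 * new              # assumed mean price (thousands of dollars)
price <- rgamma(n, shape = 10, rate = 10 / mu)  # gamma response: spread grows with the mean
houses_demo <- data.frame(price, taxes, new)

# Part (a): scatter plot of selling price against tax bill.
plot(houses_demo$taxes, houses_demo$price,
     xlab = "Tax bill (dollars)",
     ylab = "Selling price (thousands of dollars)",
     main = "Selling price vs. tax bill")

# Part (b)(i): normal GLM with identity link.
normal_glm <- glm(price ~ taxes + new,
                  family = gaussian(link = "identity"),
                  data = houses_demo)

# Part (b)(ii): gamma GLM with identity link.
# Starting values from the normal fit are an added assumption, not part of the original.
gamma_glm <- glm(price ~ taxes + new,
                 family = Gamma(link = "identity"),
                 data = houses_demo,
                 start = coef(normal_glm))

# Coefficients, standard errors, t-values, and p-values for each model.
print(summary(normal_glm))
print(summary(gamma_glm))

# Part (b)(iii): additive effect of being a new house on the mean price.
print(coef(normal_glm)["new"])
print(coef(gamma_glm)["new"])
```

With the identity link, both printed coefficients for new are on the same scale (thousands of dollars in this sketch), so they can be compared directly.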
c. For each model, describe how the estimated variability in selling prices varies as the mean selling price varies from 100 thousand to 500 thousand dollars.

The R code for this part analyzes the same dataset of houses, with the variables case, price, size, new, taxes, bedrooms, and baths. It starts by loading the dataset from a specified URL into the data frame “houses”. The code then fits two generalized linear models (GLMs): one with a normal distribution (using the gaussian family) and another with a gamma distribution (using the Gamma family). Both models predict the selling price (price) from the predictors taxes and new, and both use the identity link function (link = "identity").

Next, "mean_selling_prices" creates a sequence of 100 mean selling prices ranging from 100 to 500 thousand dollars. The "simulation_data" data frame is created to hold the simulated data. It holds taxes and new constant at their mean values from the original dataset, and the sequence of mean selling prices is added as a new column named price. The "normal_predictions" and "gamma_predictions" lines generate predicted selling prices from the normal and gamma models for the simulated data.

The "plot" call creates an initial plot with the mean selling prices on the x-axis and the normal GLM's predicted selling prices on the y-axis. The type = "l" argument specifies a line plot, and col = "blue" sets the color to blue. The "lines" call adds the gamma GLM predictions to the same plot in red. The "legend" call adds a legend to the top-right corner of the plot, indicating which line corresponds to the normal GLM and which to the gamma GLM.
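In the plot, the blue line (normal GLM) and the red line (gamma GLM) follow similar patterns as the mean selling price increases. Both lines show predicted means, however, so the plot by itself does not describe variability; that comes from each model's variance function. For the normal GLM, the estimated variability is constant: the standard deviation of selling prices is the same whether the mean is 100 thousand or 500 thousand dollars. For the gamma GLM, the standard deviation is proportional to the mean, so the estimated variability is five times as large at a mean of 500 thousand dollars as at a mean of 100 thousand dollars. This agrees with the increasing spread seen in the scatter plot in part (a).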
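One way to put numbers on the variability in part (c) is to use each model's estimated dispersion. A normal GLM assumes a constant variance, while a gamma GLM assumes a variance proportional to the squared mean, so its standard deviation is a constant multiple of the mean. The sketch below is not the original code: it uses a simulated stand-in named houses_demo (a hypothetical data frame, not the Houses data), and the helper names sigma_hat, cv_hat, and sd_table are introduced only for illustration.

```r
# Simulated stand-in for the Houses data (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 100
taxes <- runif(n, 500, 6000)
new <- rbinom(n, 1, 0.2)
mu <- 50 + 0.04 * taxes + 40 * new              # assumed mean price (thousands of dollars)
price <- rgamma(n, shape = 10, rate = 10 / mu)
houses_demo <- data.frame(price, taxes, new)

normal_model <- glm(price ~ taxes + new,
                    family = gaussian(link = "identity"), data = houses_demo)
gamma_model <- glm(price ~ taxes + new,
                   family = Gamma(link = "identity"), data = houses_demo,
                   start = coef(normal_model))   # starting values: an added assumption

# Normal GLM: the dispersion estimates the constant variance sigma^2.
sigma_hat <- sqrt(summary(normal_model)$dispersion)

# Gamma GLM: variance = dispersion * mean^2, so sd = sqrt(dispersion) * mean.
cv_hat <- sqrt(summary(gamma_model)$dispersion)

# Estimated standard deviation at mean prices from 100 to 500 (thousands of dollars).
mean_prices <- seq(100, 500, by = 100)
sd_table <- data.frame(
  mean_price = mean_prices,
  normal_sd  = rep(sigma_hat, length(mean_prices)),  # same at every mean
  gamma_sd   = cv_hat * mean_prices                  # grows in proportion to the mean
)
print(sd_table)
```

In the resulting table the normal_sd column repeats a single value, while the gamma_sd column at a mean of 500 is five times its value at a mean of 100.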
d. Which model is preferred according to AIC?

The Akaike Information Criterion (AIC) is a measure used for model comparison. In general, a lower AIC value indicates a better trade-off between fit and model complexity. To determine which model is preferred, we compare the AIC values of the normal GLM and the gamma GLM.

The R code for this part loads the same dataset of houses from a specified URL into the data frame “houses” and fits the two generalized linear models (GLMs): one with a normal distribution (using the gaussian family) and another with a gamma distribution (using the Gamma family). The lines AIC(normal_model) and AIC(gamma_model) calculate the AIC values for the normal GLM (normal_model) and the gamma GLM (gamma_model), respectively. cat(...) prints the AIC value of each model along with a label, and "\n" adds a new line after each print statement to make the output more readable.

The model with the lower AIC value is preferred. In this case the gamma GLM is preferred: its AIC is 1106.705, which is lower than the AIC of 1162.178 for the normal GLM.
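The AIC comparison described above can be sketched as follows. The original listing is not available in this copy, so this is an illustration under assumptions: houses_demo is a simulated stand-in (not the Houses data), and the names aic_values and preferred are hypothetical. The AIC values it prints will therefore differ from the 1106.705 and 1162.178 reported for the real data.

```r
# Simulated stand-in for the Houses data (hypothetical values, NOT the real dataset).
set.seed(1)
n <- 100
taxes <- runif(n, 500, 6000)
new <- rbinom(n, 1, 0.2)
mu <- 50 + 0.04 * taxes + 40 * new              # assumed mean price (thousands of dollars)
price <- rgamma(n, shape = 10, rate = 10 / mu)
houses_demo <- data.frame(price, taxes, new)

normal_model <- glm(price ~ taxes + new,
                    family = gaussian(link = "identity"), data = houses_demo)
gamma_model <- glm(price ~ taxes + new,
                   family = Gamma(link = "identity"), data = houses_demo,
                   start = coef(normal_model))   # starting values: an added assumption

# AIC for each model; lower is preferred.
aic_values <- c(normal = AIC(normal_model), gamma = AIC(gamma_model))
cat("AIC for normal GLM:", aic_values["normal"], "\n")
cat("AIC for gamma GLM:", aic_values["gamma"], "\n")

# Name of the model with the smaller AIC.
preferred <- names(which.min(aic_values))
cat("Preferred model according to AIC:", preferred, "\n")
```

Comparing AIC across the two families is reasonable here because both models are fitted to the same response, price, on its original scale.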