Stats 101A HW 6

School

University of California, Los Angeles

Course

101A

Subject

Statistics

Date

Jan 9, 2024

Type

Rmd

Pages

4

Uploaded by ucladsp

---
title: "Stats 101A HW 5"
author: 'Ian Zhang UID: 205702810'
date: "2023-05-12"
output: pdf_document
---

## Question 1 - Chapter 3 1B

At first glance, the ordinary straight-line regression model of Fare on Distance seems to fit the data well: the scatter plot looks linear, so the model appears valid. However, the residual plot shows a clear upside-down U pattern, meaning that the linear regression model is actually not valid for this data.

To improve the model, we can transform one of the variables. A log() transformation of the distance variable is a reasonable choice because its values range from 0 to 2000, which spans more than one order of magnitude. We also need to check whether the outlier at around (2000, 500) is a bad leverage point, since it clearly does not follow the quadratic trend; if it is, we can remove it.

## Question 2

## Part A

### a

```{r a}
ads <- read.csv("AdRevenue.csv")
ad.m1 <- lm(AdRevenue ~ Circulation, data = ads)
plot(ad.m1)

# Log transformation of both variables
log.data <- transform(ads, logCirc = log(Circulation), logAd = log(AdRevenue))
ad.log <- lm(logAd ~ logCirc, data = log.data)
plot(ad.log)
summary(ad.log)
```

This model predicts advertising revenue per page from circulation. The residual plot from the first linear regression model shows a clear pattern in the points, which makes that model invalid. After applying the log transformation to both advertising revenue and circulation, the diagnostic plots indicate that the model is now valid. The new residuals vs fitted plot removes most of the pattern that was present in the original plot, and the QQ plot is linear, supporting the normality condition. The scale-location plot supports the constant variance condition, as the values are spread roughly equally around the line with no clear pattern.
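Because the model above is fit to log(AdRevenue) and log(Circulation), a prediction for a given circulation takes the log of that circulation as input, and the resulting interval is on the log scale until it is back-transformed with exp(). The following is a minimal sketch of that workflow under stated assumptions: the data are simulated stand-ins (not AdRevenue.csv), and the names `sim`, `fit.log`, `pi.log`, and `pi.orig` are hypothetical.

```r
# Simulated stand-in for the magazine data (hypothetical values, not AdRevenue.csv)
set.seed(1)
sim <- data.frame(Circulation = exp(runif(50, log(0.5), log(20))))
sim$AdRevenue <- exp(4.5 + 0.5 * log(sim$Circulation) + rnorm(50, sd = 0.2))

# Same log-log specification as ad.log
sim.log <- transform(sim, logCirc = log(Circulation), logAd = log(AdRevenue))
fit.log <- lm(logAd ~ logCirc, data = sim.log)

# Circulations of 0.5 and 20 on the original scale, converted to the log scale
new.circ <- data.frame(logCirc = log(c(0.5, 20)))

# 95% prediction intervals on the log(AdRevenue) scale
pi.log <- predict(fit.log, new.circ, interval = "prediction", level = 0.95)

# Back-transform fit, lwr, and upr to the original AdRevenue scale
pi.orig <- exp(pi.log)
pi.orig
```

Since exp() is monotone, back-transforming the endpoints keeps the 95% coverage of the interval, although the back-transformed fit behaves like a median rather than a mean on the original scale.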

### b

```{r b}
# Note: logCirc is on the log scale, so a circulation of 0.5 (million) corresponds
# to logCirc = log(0.5). The intervals reported below are for the values as coded
# and are on the log(AdRevenue) scale.

# 95% prediction interval from 4.583774 to 5.294453
predict(ad.log, data.frame(logCirc = 0.5), interval = "predict")
# 95% prediction interval from 14.27041 to 16.22938
predict(ad.log, data.frame(logCirc = 20), interval = "predict")
```

### c

One weakness of the log model is that the residuals are not as randomly scattered as they could be; one could argue that there is a slight pattern in the residual plot. However, it is definitely a better fit than the original model, and at first glance the plot looks randomly scattered. The same could be said of the scale-location plot: the line is not perfectly horizontal and the points are arguably not evenly spread, but it is again much better than the original, which had all the points concentrated in one area. There are also some high leverage points, but since they do not fall outside the Cook's distance contours, they are not considered bad leverage points.

## Part B

### a

```{r B}
# Quadratic model
ads.second <- lm(AdRevenue ~ Circulation + I(Circulation^2), data = ads)
plot(ads.second)
summary(ads.second)

# Third order model
ads.third <- lm(AdRevenue ~ Circulation + I(Circulation^2) + I(Circulation^3), data = ads)
plot(ads.third)
summary(ads.third)
```

### b

```{r b2}
# Quadratic model: 95% prediction interval from 19.47858 to 186.1802
predict(ads.second, data.frame(Circulation = 0.5), interval = "predict")
# Quadratic model: 95% prediction interval from 490.5858 to 674.188
predict(ads.second, data.frame(Circulation = 20), interval = "predict")

# Third order model: 95% prediction interval from 14.92314 to 153.4138
predict(ads.third, data.frame(Circulation = 0.5), interval = "predict")
# Third order model: 95% prediction interval from 418.179 to 580.8878
predict(ads.third, data.frame(Circulation = 20), interval = "predict")
```

### c

Quadratic

There are some weaknesses in this model. For one, the residual plot does not look randomly scattered, as there is a cluster of points at the lower x values that clearly slopes up. There is also a slight fan shape in the residuals, indicating non-constant variance. The QQ plot does not follow a linear trend either: it trails higher near the top of the data, which indicates non-normality. The scale-location plot also shows that the model is weak, as the points are not distributed evenly around the line but instead cluster near the left of the graph, and the line is not horizontal at all but increasing.
There are also bad leverage points that lie outside the Cook's distance contours, which pull the fitted model away from the rest of the data.

Third order

This model also has weaknesses, but they are slightly less severe than the quadratic model's. First, the residual plot looks slightly better, as there is less of an upward slope and seemingly more random scatter. However, the points still cluster toward the left of the graph. The plot also fans out from left to right, meaning that the constant variance condition is violated. The QQ plot does not follow the line either, as the tails stray away from it, which is another weakness of this model. The scale-location plot also shows weaknesses, as the line is not horizontal at all. The points are better distributed around the line, but the scatter is still not completely random. The leverage plot also shows bad leverage points outside the Cook's distance contours. This model is slightly better than the quadratic model, but still has its weaknesses.

Ultimately, we would select the third order model over the quadratic model, as its residual plots are stronger.

## Part C

### a

When comparing the three models (linear, quadratic, and third order), we can conclude that the linear model, fit to the log-transformed data, is the most appropriate for the data. Its residual plot has the most random scatter and no very clear trends. It also doesn't show any fan shape, which the other two do. Its QQ plot is much better than those of the quadratic and third order models, as there is significantly less deviation from the line near the ends. Its scale-location plot has the most horizontal line of the three, and the points are the most randomly scattered around the line. The leverage plot also shows no bad leverage points, whereas the quadratic and third order plots both have points outside the Cook's distance contours.

### b

For 0.5 million:

- Linear model provides an interval from 4.583774 to 5.294453
- Third order model provides an interval from 14.92314 to 153.4138

For 20 million:

- Linear model provides an interval from 14.27041 to 16.22938
- Third order model provides an interval from 418.179 to 580.8878

Note that the linear model's intervals are on the log(AdRevenue) scale, while the third order model's intervals are on the original scale.

I would recommend the prediction interval of the linear model, since it is the narrowest. The intervals of the quadratic and third order models are much wider, which shows that those models cannot narrow down the prediction at the 95% level. The linear model is able to provide a narrower range, which allows for higher precision.
However, if the aim is to be cautious, the third order model allows a wider range of values to be included.

## Question 3

### a

You can calculate the standard deviation of Y at each value of X by applying the formula sqrt(sum((y - mean(y))^2)/n) separately for each value of x, where y is the vector of all y values observed at that x value and n is the number of those y values.

### b

This is a special case, and because of it we are able to directly calculate the standard deviation at each value of x. Since x is a discrete variable, there are multiple y values for each x value, so the standard deviation can be calculated at each x value. We cannot directly calculate the standard deviation for the dataset in Parts A and B because that dataset is continuous: there is typically only one y value for each x value, so a standard deviation cannot be calculated at each x value. In that case you would need to use linear regression to estimate the standard deviation of the data.
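
The per-x calculation described in Question 3 can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated with a discrete x, and the names `sd.n`, `sd.by.x`, and `sd.sample.by.x` are hypothetical.

```r
# Simulated data with a discrete x (hypothetical values for illustration)
set.seed(2)
x <- rep(1:5, each = 8)                      # 5 distinct x values, 8 y values each
y <- 2 + 3 * x + rnorm(length(x), sd = x)    # spread of y grows with x

# Standard deviation using the formula from part a (divides by n)
sd.n <- function(v) sqrt(sum((v - mean(v))^2) / length(v))

# One standard deviation per distinct x value
sd.by.x <- tapply(y, x, sd.n)
sd.by.x

# For comparison, R's built-in sd() divides by n - 1 instead of n
sd.sample.by.x <- tapply(y, x, sd)
sd.sample.by.x
```

The two versions differ only by the factor sqrt((n - 1)/n), so with 8 observations per x value the divide-by-n version is always slightly smaller.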