6501 hw 02_15_23

School: Georgia Institute Of Technology
Course: 6501
Subject: Statistics
Date: Feb 20, 2024
Type: pdf
Pages: 6
Uploaded by MajorOtterMaster1158
Question 8.1

Describe a situation or problem from your job, everyday life, current events, etc., for which a linear regression model would be appropriate. List some (up to 5) predictors that you might use.

I would use a linear regression model to predict Tesla's stock price from Elon Musk's tweets. Using Twitter metrics such as likes, replies, and retweets (or other metrics Twitter may provide) as predictors, we could test whether tweet engagement has a measurable relationship with the stock price. For a more detailed perspective, we would pair the metrics of Musk's most recent tweet with the stock price at that moment. The main programmatic issue is the frequency of his tweets: sometimes he tweets 20 times in 10 minutes, and sometimes he does not tweet for hours, so exponential smoothing might help even out the series. With enough past data, we could estimate the likelihood of Tesla stock rising or falling after a given tweet.

Question 8.2

Using crime data from http://www.statsci.org/data/general/uscrime.txt (file uscrime.txt, description at http://www.statsci.org/data/general/uscrime.html), use regression (a useful R function is lm or glm) to predict the observed crime rate in a city with the following data:

M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5, LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1, U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1, Prob = 0.04, Time = 39.0

Show your model (factors used and their coefficients), the software output, and the quality of fit. Note that because there are only 47 data points and 15 predictors, you'll probably notice some overfitting. We'll see ways of dealing with this sort of problem later in the course.

We used the lm function in R to fit a linear regression model to this data. The summary of the model is as follows:

Call:
lm(formula = Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 + Wealth + Ineq + Prob + Time, data = crime_data)
Residuals:
    Min      1Q  Median      3Q     Max
-395.74  -98.09   -6.69  112.99  512.67

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.984e+03  1.628e+03  -3.675 0.000893 ***
M            8.783e+01  4.171e+01   2.106 0.043443 *
So          -3.803e+00  1.488e+02  -0.026 0.979765
Ed           1.883e+02  6.209e+01   3.033 0.004861 **
Po1          1.928e+02  1.061e+02   1.817 0.078892 .
Po2         -1.094e+02  1.175e+02  -0.931 0.358830
LF          -6.638e+02  1.470e+03  -0.452 0.654654
M.F          1.741e+01  2.035e+01   0.855 0.398995
Pop         -7.330e-01  1.290e+00  -0.568 0.573845
NW           4.204e+00  6.481e+00   0.649 0.521279
U1          -5.827e+03  4.210e+03  -1.384 0.176238
U2           1.678e+02  8.234e+01   2.038 0.050161 .
Wealth       9.617e-02  1.037e-01   0.928 0.360754
Ineq         7.067e+01  2.272e+01   3.111 0.003983 **
Prob        -4.855e+03  2.272e+03  -2.137 0.040627 *
Time        -3.479e+00  7.165e+00  -0.486 0.630708
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 209.1 on 31 degrees of freedom
Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07

This output shows the multiple linear regression model we fit in R. The residuals are the differences between the observed values of the dependent variable (here, the crime rate) and the values predicted by the model; more specifically, a residual is the vertical distance between an observed data point and the regression line. Ideally, residuals should be small and randomly distributed around zero. Ours are centered near zero but quite large, which suggests one of two things: either the model does not fit the data well and something is missing, or the crime rate and its predictors are simply on very different orders of magnitude, making the residuals large in absolute terms. Scaling all of the variables to similar orders of magnitude would help determine which of these is the case. The coefficient table has four columns. An explanation of the full output is as follows:

1. Estimate - the estimated value of the regression coefficient for each predictor variable. It represents the change in the dependent variable (crime rate) associated with a one-unit change in that predictor, holding all other variables constant.
2. Std. Error - the standard error of the coefficient estimate. It represents how much the estimate is expected to vary from the true population value across different samples; a smaller standard error indicates a more precise estimate.

3. t value - the t-statistic for the coefficient estimate: the ratio of the estimated coefficient to its standard error. It measures how far the estimate deviates from zero relative to its variability; a larger t value indicates a larger deviation from zero and a higher degree of statistical significance.

4. Pr(>|t|) - the p-value associated with the t-statistic. It is the probability of observing a t-statistic as extreme or more extreme than the observed value, assuming the null hypothesis (that the coefficient is zero) is true. A smaller p-value means the estimate is less likely to be observed by chance, i.e., a higher degree of statistical significance.

5. Signif. codes - the symbols that summarize each coefficient's statistical significance, based on its p-value. Per the legend in the output, '***' means p < 0.001, '**' means p < 0.01, '*' means p < 0.05, and '.' means p < 0.1.

6. Residual standard error - the estimated standard deviation of the error terms (residuals) in the regression model. Here it is 209.1, meaning that on average the predicted crime rate is expected to differ from the observed crime rate by about 209.1.

7. Multiple R-squared - the proportion of the variance in the dependent variable (crime rate) that is explained by the independent variables in the model.
In this case, the multiple R-squared value is 0.8031, meaning the independent variables in the model explain 80.31% of the variance in the crime rate.

8. Adjusted R-squared - a modified version of the multiple R-squared that adjusts for the number of independent variables in the model, giving a more conservative estimate of the variance explained. Here it is 0.7078, somewhat lower than the multiple R-squared because the model includes 15 predictor variables.

9. F-statistic - a test of the overall significance of the regression model, comparing the variance explained by the model to the variance left unexplained. A larger F-statistic indicates a more significant fit of the model to the data. In this case, the F-statistic is
8.429 with 15 and 31 degrees of freedom, which means the regression model as a whole is statistically significant.

10. p-value - the probability of obtaining an F-statistic as extreme or more extreme than the observed one, assuming the null hypothesis that all regression coefficients are zero. Here the p-value is 3.539e-07, which is very small and indicates strong evidence against that null hypothesis.

Plotting the data shows that it follows an approximately normal distribution, meaning scaling isn't strictly necessary for this data set, though scaling may still have beneficial effects.
The data also holds its linearity fairly well, meaning a linear regression is appropriate. The plots, produced with the car package, each graph a single predictor variable against the response; a fitted line with a slope further from zero indicates a stronger relationship between the two variables.
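As a quick sanity check on the explanation of the output above, several of the reported quantities can be reproduced by hand from the coefficient table. This is an illustrative sketch (not part of the original submission); the numbers are copied from the summary output, and 31 is the residual degrees of freedom (47 observations minus 15 predictors minus 1).

```r
# t value = Estimate / Std. Error, e.g. for predictor M:
t_M <- 8.783e+01 / 4.171e+01          # ~2.106, as reported

# Two-sided p-value from the t distribution with 31 residual df:
p_M <- 2 * pt(-abs(t_M), df = 31)     # close to the reported 0.043443

# F-statistic from R-squared, 15 predictors, 47 observations:
r2 <- 0.8031
f_stat <- (r2 / 15) / ((1 - r2) / (47 - 15 - 1))   # ~8.429, as reported
```

These identities are why the t value, p-value, and F-statistic columns move together: each is a deterministic function of the estimates, standard errors, and R-squared.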
For the prediction, we loaded the new data point manually as a one-row data frame. With the given values, the model predicts a crime rate of about 155 offenses per 100,000 population (the units of the 1960 data). If I were to do this again, I would scale the data: I don't think it would change the prediction at all, but it would make it much more obvious which factors were affecting the crime rate.
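The workflow described above can be sketched as follows. This reconstruction assumes the data is read directly from the statsci.org URL given in the question (the original submission's code is not shown, though the formula and the data frame name crime_data match the Call in the output above).

```r
# Read the crime data (tab-delimited with a header row).
crime_data <- read.table("http://www.statsci.org/data/general/uscrime.txt",
                         header = TRUE)

# Fit the regression on all 15 predictors, as in the summary output above.
model <- lm(Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW +
              U1 + U2 + Wealth + Ineq + Prob + Time, data = crime_data)
summary(model)

# The new city, loaded manually as a one-row data frame:
new_city <- data.frame(M = 14.0, So = 0, Ed = 10.0, Po1 = 12.0, Po2 = 15.5,
                       LF = 0.640, M.F = 94.0, Pop = 150, NW = 1.1,
                       U1 = 0.120, U2 = 3.6, Wealth = 3200, Ineq = 20.1,
                       Prob = 0.04, Time = 39.0)
predict(model, new_city)  # ~155 offenses per 100,000 population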
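The scaling idea raised twice above can be illustrated on R's built-in mtcars data (used here only so the example is self-contained; the crime data behaves the same way): standardizing the predictors changes the coefficient magnitudes, making them directly comparable, but leaves the fitted values and predictions unchanged.

```r
# Fit on raw predictors, then on standardized (mean 0, sd 1) predictors.
raw_fit    <- lm(mpg ~ wt + hp + disp, data = mtcars)
scaled     <- data.frame(mpg = mtcars$mpg,
                         scale(mtcars[, c("wt", "hp", "disp")]))
scaled_fit <- lm(mpg ~ wt + hp + disp, data = scaled)

# Coefficients differ (now in "per standard deviation" units) ...
coef(raw_fit)["hp"]; coef(scaled_fit)["hp"]
# ... but the fitted values are identical:
max(abs(fitted(raw_fit) - fitted(scaled_fit)))  # effectively 0
```

This is why scaling would not change the ~155 prediction above while still making it clearer which factors drive the response.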