AD699: Data Mining for Business Analytics
Individual Assignment #2, Spring 2023
Due by: Friday, 3 Mar @ 11:59 p.m.

Simple Linear Regression

Q1 Bring the dataset into the R environment.

The dataset has been loaded into the environment correctly.

Q2 Use the str() function and identify the data types.

Numerical: enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc, elpct, readscr, mathscr
Categorical: distcod, county, district, grspan

A minimal sketch of the loading and inspection steps is shown below.
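This sketch assumes the data come from a CSV file named caschool.csv; the actual file name is not shown in the assignment.

# Load the dataset and inspect its structure
Caschool <- read.csv("caschool.csv", stringsAsFactors = TRUE)

# str() lists each column's type, which is how the numerical vs.
# categorical split above was identified
str(Caschool)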
Q3 Filter the dataset so that only the 16 most common counties remain.

I first create a new data frame called 'county_counts' that shows the number of school districts in each county. Next, I select only the counties with 10 or more school districts and extract those county names into a vector called 'common'. Finally, I keep only the rows of Caschool whose county appears in 'common', as sketched below.
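The names county_counts and common follow the write-up; the data frame name Caschool is assumed.

# Count the number of school districts in each county
county_counts <- as.data.frame(table(Caschool$county))
colnames(county_counts) <- c("county", "n_districts")

# Keep the names of counties with 10 or more districts (16 counties)
common <- county_counts$county[county_counts$n_districts >= 10]

# Keep only the rows of Caschool that belong to those common counties
Caschool <- Caschool[Caschool$county %in% common, ]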
Q4 Partition

Training set:
Validation set:

Partitioning the data into a training set and a validation set helps us build predictive models that generalize well to new data. The main advantages are preventing overfitting, evaluating model performance on data the model has not seen, and tuning model hyperparameters. A sketch of one way to perform the partition is shown below.
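The 60/40 split, the seed, and the object names train.df and valid.df are assumptions; the write-up does not state them.

# Partition the filtered data into training and validation sets
set.seed(699)   # seed chosen only for reproducibility
train.rows <- sample(rownames(Caschool), round(0.6 * nrow(Caschool)))
train.df <- Caschool[train.rows, ]
valid.df <- Caschool[setdiff(rownames(Caschool), train.rows), ]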
Q5 readscr vs mealpct

The percentage of students in a district who qualify for free or reduced-price lunches is inversely related to the district's average reading score. This makes intuitive sense to me based on what the scatter plot shows: there appears to be a strong negative relationship between readscr and mealpct in the training set. The plotting step is sketched below.
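A sketch of the scatter plot, assuming the training set is stored in train.df as above.

# Scatter plot of average reading score against the percentage of students
# qualifying for free/reduced-price lunch, using the training set
plot(readscr ~ mealpct, data = train.df,
     xlab = "Percent qualifying for reduced-price lunch (mealpct)",
     ylab = "Average reading score (readscr)")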
Q6 Correlation between readscr and mealpct

The correlation coefficient between readscr and mealpct is -0.8925, which indicates a very strong negative correlation between the two variables. The p-value is less than 2.2e-16, which suggests that the correlation is statistically significant at the 5% level.

Q7 Simple linear regression

On average, a one-unit increase in mealpct decreases readscr by 0.6469. The p-value for mealpct is <2e-16, which is highly significant and suggests that there is a strong linear relationship between mealpct and readscr. The R-squared value of 0.7966 indicates that approximately 79.66% of the variation in readscr can be explained by mealpct. The correlation test and the model fit are sketched below.
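A sketch of the correlation test and the simple linear regression fit; the model object name slr is an assumption.

# Correlation between readscr and mealpct on the training set
cor.test(train.df$readscr, train.df$mealpct)   # r = -0.8925, p < 2.2e-16

# Simple linear regression of readscr on mealpct
slr <- lm(readscr ~ mealpct, data = train.df)
summary(slr)   # intercept 683.76628, slope -0.64690, R-squared 0.7966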
Q8 Residuals

The minimum residual is -22.52644 and the maximum residual is 20.83557.
a. According to the results above, the model underpredicted the average reading score for this district by 20.8356 points.
b. According to the results above, the model overpredicted the average reading score for this district by 22.52644 points.
c. This has been my concern since Question 5: I cannot fully explain why mealpct has such a strong impact on readscr. While mealpct appears to be a significant predictor of reading scores, there are many other factors that contribute to a student's academic success. I would try other predictors, such as the student-teacher ratio, expenditure per student, or the percentage of English learners, and explore their relationships with readscr. One way to inspect the extreme residuals is sketched after this list.
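A sketch of pulling out the extreme residuals and the districts they belong to, using the slr object assumed above.

# Residuals of the simple linear regression model
res <- residuals(slr)
range(res)   # approximately -22.53 to 20.84

# District with the largest under-prediction (largest positive residual)
train.df[which.max(res), c("district", "county", "readscr", "mealpct")]
# District with the largest over-prediction (most negative residual)
train.df[which.min(res), c("district", "county", "readscr", "mealpct")]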
Q9 Hypothetical input and outcome

According to Question 7, the equation is: readscr = 683.76628 - 0.64690 * mealpct. This equation tells us that for every one-unit increase in mealpct, there is an expected decrease of 0.64690 in the average reading score. Let's say we have a hypothetical input value of 50 for mealpct. Plugging this value into the regression equation, we get: readscr = 683.76628 - 0.64690 * 50 = 683.76628 - 32.345 = 651.42128. This means that our model predicts an average reading score of approximately 651.42 for a district with a mealpct of 50.

Q10 Comparison

The purpose of comparing the accuracy of the model on both the training and validation sets is to assess its ability to generalize to new data. Based on the accuracy values, we can see that the model performs similarly on the two sets. The RMSE and MAE values are slightly lower on the validation set, which indicates that the model is not overfitting to the training set. Because the differences in the accuracy measures are small, the model appears to generalize well to new data.

Q11 RMSE and sd

If the RMSE is larger than the standard deviation of the response, the model is not capturing the variability in the data or making accurate predictions. On the other hand, if the RMSE is smaller than the standard deviation, the model is doing a better job than simply predicting the mean reading score for every district. In this case, the RMSE of the model on the training set is 8.986514, while the standard deviation of readscr in the training set is 19.98554. The RMSE is much smaller than the standard deviation, indicating that the model captures a substantial share of the variability in the data and makes reasonably accurate predictions. The calculations behind Q9-Q11 are sketched below.
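A sketch of these calculations; accuracy() is from the forecast package, and the object names follow the earlier assumptions.

# Q9: predicted reading score for a hypothetical district with mealpct = 50
predict(slr, data.frame(mealpct = 50))   # about 651.42

# Q10: accuracy on the training and validation sets
library(forecast)
accuracy(predict(slr, train.df), train.df$readscr)
accuracy(predict(slr, valid.df), valid.df$readscr)

# Q11: compare the training RMSE (8.986514) with the standard deviation of readscr
sd(train.df$readscr)   # 19.98554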
Multiple Linear Regression

Q1 Remove variables

a. b. I decided to remove the distcod (district code) and district variables because they have too many unique values and both of them are categorical variables. A sketch of this step is shown below.
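The object names train.mlr and valid.mlr are assumptions.

# Remove the high-cardinality categorical identifiers before building the MLR model
drop.cols <- c("distcod", "district")
train.mlr <- train.df[, !(names(train.df) %in% drop.cols)]
valid.mlr <- valid.df[, !(names(valid.df) %in% drop.cols)]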
Q2 Correlation table

In the first correlation table, enrltot, teachers, computer, and mealpct are removed.
enrltot/teachers/computer: each one has a high correlation with the other two.
mealpct: as we discussed in the simple linear regression part, mealpct may not perfectly predict reading scores.
A sketch of how the correlation table can be produced is shown below.
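A sketch of the correlation table and the subsequent drop of the four variables, under the object-name assumptions above.

# Correlation table for the numeric columns of the training set
num.cols <- sapply(train.mlr, is.numeric)
round(cor(train.mlr[, num.cols]), 2)

# Drop the highly correlated trio and mealpct from the candidate predictors
train.mlr <- train.mlr[, !(names(train.mlr) %in%
                           c("enrltot", "teachers", "computer", "mealpct"))]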
Q4 Dummy variables

Dummy variables are variables that take on a value of either 0 or 1, representing the absence or presence of a particular categorical attribute. They are created in statistical modeling to allow categorical variables to be included in a regression analysis, where they represent the effect of the categorical variable on the outcome variable. Essentially, they allow categorical information to enter a numerical model by representing the categorical variable as a set of binary variables.

Q4 Building a model with a categorical input.

a. Orange County
b. c. d. The model predicts the reading score for Orange County to be 654.65 (obtained by adding the Orange County coefficient, -14.500, to the intercept, 669.150). However, it is important to note that the model has a relatively low adjusted R-squared value of 0.3213, which suggests that county alone may not be a strong predictor of reading scores.
e. They both equal 654.65. This makes sense because the model was built with county as the only predictor, so the fitted value for each county is simply the mean reading score of that county's districts in the training set: the intercept is the mean for the reference county, and each county coefficient is that county's offset from it. A sketch of this model and the dummy coding behind it is shown below.
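The model name county.lm and the factor level name "Orange" are assumptions based on the write-up.

# lm() automatically converts the county factor into 0/1 dummy variables,
# one for each county except the reference level
county.lm <- lm(readscr ~ county, data = train.mlr)
summary(county.lm)

# model.matrix() shows the dummy columns explicitly
head(model.matrix(~ county, data = train.mlr))

# Predicted reading score for an Orange County district:
# intercept + Orange County coefficient = 669.150 + (-14.500) = 654.65
predict(county.lm, data.frame(county = "Orange"))

# This equals the mean reading score of Orange County districts in the training set
mean(train.mlr$readscr[train.mlr$county == "Orange"])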
Q5 Backward elimination
This model includes the variables calwpct, str, avginc, and elpct, and it has the lowest AIC of 762. calwpct, str, and elpct have negative coefficients, so the reading score increases when they decrease; avginc has a positive coefficient, so readscr increases when avginc goes up.
a. The summary shows the final multiple linear regression model with four predictors. The p-values of all the predictors are statistically significant at the 0.05 level or lower, indicating that they have a significant relationship with the response variable readscr. The backward elimination step is sketched below.
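A sketch of backward elimination with step(); exactly which candidate predictors entered the full model is an assumption based on the variables kept after the earlier screening.

# Full model with the remaining candidate predictors (assumed set)
full.lm <- lm(readscr ~ calwpct + compstu + expnstu + str + avginc + elpct,
              data = train.mlr)

# Backward elimination by AIC
back.lm <- step(full.lm, direction = "backward")
summary(back.lm)   # final predictors: calwpct, str, avginc, elpct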
Q6 Model metrics

a. b.
c. This value is likely the R-squared value for my multiple linear regression model.

Q7 t-distribution plot

From the output, we can see that the t-value for "str" is -2.558 and the model has 159 degrees of freedom. 98.9% of the curve is shaded, which corresponds to the central region between -2.558 and 2.558; the unshaded tails represent the probability of getting a t-value at least as extreme as the observed one under the null hypothesis, i.e. the p-value of roughly 1.1%. A quick check of these probabilities is sketched below.
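A quick check using the t-distribution functions in base R.

# Two-tailed probability of a t-value at least as extreme as -2.558 with 159 df
2 * pt(-2.558, df = 159)       # about 0.011, the area in the two tails
1 - 2 * pt(-2.558, df = 159)   # about 0.989, the shaded central region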
Q8 F-statistic

In the ANOVA table, an F-statistic is reported for each predictor variable, alongside the residual (unexplained) variation. The F-statistic for each predictor tests the null hypothesis that the corresponding regression coefficient is equal to zero, i.e. that the predictor does not explain a significant share of the variation in readscr beyond the residual error.

Q9 Fictional school district "BU School District"

calwpct = 40%
str = 20
avginc = $60,000
elpct = 10%

Because avginc is recorded in thousands of dollars, $60,000 enters the model as avginc = 60:

readscr = 674.51 - 0.45 * 40 - 1.16 * 20 + 1.09 * 60 - 0.48 * 10
readscr = 674.51 - 18 - 23.2 + 65.4 - 4.8
readscr = 693.91

BU School District is predicted to have an average reading score of about 693.91. A sketch of this prediction with predict() is shown below.
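A sketch of the same prediction on the stepwise model; back.lm is the assumed model name from the previous step, and the avginc unit (thousands of dollars) is the assumption noted above.

# Predicted reading score for the fictional BU School District
# (avginc is recorded in thousands of dollars, so $60,000 enters as 60)
bu.district <- data.frame(calwpct = 40, str = 20, avginc = 60, elpct = 10)
predict(back.lm, bu.district)   # roughly 694 with the rounded coefficients above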
Q10 Accuracy

The model performs well on the training set, with an RMSE of 9.901552 and a MAPE of 1.188187. When we test the model on the validation set, the RMSE is lower at 8.913565 and the MAPE is also lower at 1.004548. This suggests that the model is not overfitting the training data and generalizes well to new data. Compared to the MLR model, the SLR model performs better in terms of accuracy on both the training and validation sets. The accuracy comparison is sketched below.
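A sketch of the accuracy comparison, again using accuracy() from the forecast package and the object names assumed earlier.

# MLR model accuracy on the training and validation sets
library(forecast)
accuracy(predict(back.lm, train.mlr), train.mlr$readscr)   # RMSE 9.90, MAPE 1.19
accuracy(predict(back.lm, valid.mlr), valid.mlr$readscr)   # RMSE 8.91, MAPE 1.00

# Simple linear regression accuracy on the validation set, for comparison
accuracy(predict(slr, valid.df), valid.df$readscr)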