AD699: Data Mining for Business Analytics
Individual Assignment #2
Spring 2023
Due by: Friday, March 3 @ 11:59 p.m.
Simple Linear Regression
Q1 Bring dataset into R environment.
The dataset has been loaded into the R environment correctly.
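A minimal sketch of the loading step, assuming the data is provided as a CSV file named caschool.csv and read into a data frame called caschool (the actual file name and path may differ):

# Read the California school district data into a data frame
caschool <- read.csv("caschool.csv", stringsAsFactors = TRUE)
head(caschool)  # quick check that the data loaded as expected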
Q2 Use the str() function and identify data types
Numerical:
enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc,
elpct, readscr, mathscr
Categorical:
distcod, county, district, grspan
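These types come from inspecting the structure of the data frame:

# str() prints each column's type: num/int columns are numerical,
# chr/Factor columns are categorical
str(caschool)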
Q3 Filter the dataset so that only the 16 most common counties remain
I first create a new data frame called 'county_counts' that shows the number of school districts in each county. Next, I select only the counties with 10 or more school districts and extract their names into a vector called 'common'. Finally, I keep only the rows of Caschool whose county appears in 'common'.
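A sketch of these steps using dplyr; the column name county comes from the dataset, and the object names county_counts and common follow the description above:

library(dplyr)

# Number of school districts in each county
county_counts <- caschool %>%
  count(county, name = "n_districts")

# Names of the counties with 10 or more districts (the 16 most common)
common <- county_counts %>%
  filter(n_districts >= 10) %>%
  pull(county)

# Keep only the rows of caschool that belong to those counties
caschool <- caschool %>%
  filter(county %in% common)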
Q4 Partition
Training set:
Validation set:
Partitioning the data into a training set and a validation set helps us build predictive models that generalize well to new data. The advantages include preventing overfitting, evaluating model performance on unseen data, and tuning model hyperparameters.
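A minimal partitioning sketch; the 60/40 split proportion and the seed value are assumptions, so they should be adjusted to whatever the assignment specifies:

set.seed(699)  # seed chosen for illustration only

# Randomly assign 60% of the rows to training and the rest to validation
train_rows <- sample(nrow(caschool), round(0.6 * nrow(caschool)))
train <- caschool[train_rows, ]
valid <- caschool[-train_rows, ]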
Q5 readscr vs mealpct
The percentage of students in the district who qualify for free and reduced-price lunches is negatively related to the average reading score. This makes intuitive sense to me based on what the scatter plot shows. There appears to be a strong relationship between readscr and mealpct in the training set.
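The scatter plot can be drawn with ggplot2, assuming the training partition is stored in train:

library(ggplot2)

# Reading score vs. percent of students qualifying for free/reduced-price lunch
ggplot(train, aes(x = mealpct, y = readscr)) +
  geom_point() +
  labs(x = "Percent qualifying for free/reduced-price lunch (mealpct)",
       y = "Average reading score (readscr)")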
Q6 Correlation between readscr and mealpct
The correlation coefficient between readscr and mealpct is -0.8925, which indicates a very strong negative correlation between the two variables. The p-value is less than 2.2e-16, which suggests that the correlation is statistically significant at the 5% level.
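The coefficient and p-value above can be obtained with cor.test() on the training set:

# Pearson correlation between reading score and mealpct, with a significance test
cor.test(train$readscr, train$mealpct)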
Q7 Simple linear regression
On average, a one-unit increase in mealpct is associated with a 0.6469-point decrease in readscr. The p-value for mealpct is <2e-16, which is highly significant and suggests that there is a strong linear relationship between mealpct and readscr. The R-squared value of 0.7966 indicates that approximately 79.66% of the variation in readscr can be explained by mealpct.
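A sketch of the model fit, assuming the model object is called slr_model:

# Simple linear regression of reading score on mealpct, fit on the training set
slr_model <- lm(readscr ~ mealpct, data = train)
summary(slr_model)  # coefficients, p-values, and R-squared reported above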
Q8 Residual
The minimum residual is -22.52644 and the maximum residual is 20.83557.
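The extreme residuals can be pulled out of the fitted model like this (residual = actual - predicted, so a negative residual means the model overpredicted):

# Training-set residuals of the simple regression
res <- residuals(slr_model)
min(res)   # most overpredicted district (about -22.53)
max(res)   # most underpredicted district (about 20.84)

# Which districts these are
train[which.min(res), c("district", "readscr")]
train[which.max(res), c("district", "readscr")]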
a.
According to the results above, the model underpredicted the average reading score for this
district by 20.8356 points.
b.
According to the results above, the model overpredicted the average reading score for this
district by 22.52644 points.
c.
This has been my concern since Question 5: I cannot figure out why mealpct has such a strong impact on readscr. While mealpct seems like a significant predictor of reading scores, there are many other factors that contribute to a student's academic success. I would try other factors, such as student-teacher ratio, expenditure per student, or percent of English learners, to explore their relationship with readscr.
Q9 Hypothetical input and outcome
According to Question 7, the equation is:
readscr = 683.76628 - 0.64690 * mealpct
This equation tells us that for every one-unit increase in mealpct, there is an expected decrease of 0.64690 in the average reading score. Let's say we have a hypothetical input value of 50 for mealpct. Plugging this value into the regression equation, we get:
readscr = 683.76628 - 0.64690 * 50
readscr = 683.76628 - 32.345
readscr = 651.42128
This means that our model predicts an average reading score of approximately 651.42 for a district with a mealpct of 50.
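The same prediction can be generated with predict(), using the model object from Question 7:

# Predicted average reading score for a hypothetical district with mealpct = 50
predict(slr_model, newdata = data.frame(mealpct = 50))   # roughly 651.4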
Q10 Comparison
The purpose of comparing the accuracy of the model on both the training and validation sets is to assess its ability to generalize to new data. Based on the accuracy values, we can see that the model has similar performance on both the training and validation sets. The RMSE and MAE values are actually slightly lower on the validation set, so there is no indication that the model is overfitting the training data. Because the differences in the accuracy measures are small, the model appears to generalize well to new data.
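A sketch of how the accuracy measures can be computed, assuming the forecast package is used for its accuracy() helper:

library(forecast)

# RMSE/MAE of the simple regression on each partition
accuracy(predict(slr_model, train), train$readscr)   # training-set accuracy
accuracy(predict(slr_model, valid), valid$readscr)   # validation-set accuracy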
Q11 RMSE and sd
The standard deviation of readscr is roughly the RMSE of a naive model that always predicts the mean, so it is a natural benchmark. If the RMSE is larger than the standard deviation, the model is not capturing the variability in the data any better than that naive benchmark. If the RMSE is smaller than the standard deviation, the model is explaining a meaningful share of the variability and making more accurate predictions. In this case, the RMSE of the model on the training set is 8.986514, while the standard deviation of readscr in the training set is 19.98554. The RMSE is much smaller than the standard deviation, indicating that the model captures much of the variability in readscr and predicts it reasonably well.
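The two numbers being compared can be computed directly:

# Training-set RMSE of the model vs. the standard deviation of readscr
sqrt(mean(residuals(slr_model)^2))   # RMSE, about 8.99
sd(train$readscr)                    # standard deviation, about 19.99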
Multiple Linear Regression
Q1 Remove variables
a.
b.
I decided to remove the district code (distcod) and district (district) variables because they have too many unique values and both are categorical identifiers rather than useful predictors.
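A sketch of the removal step with dplyr, applied to both partitions:

library(dplyr)

# Drop the district identifier columns before building the multiple regression
train <- train %>% select(-distcod, -district)
valid <- valid %>% select(-distcod, -district)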
Q2 Correlation table
In the first correlation table, enrltot, teachers, computer, and mealpct are removed.
enrltot/teachers/computer – each one is highly correlated with the other two.
mealpct – as we discussed in the simple linear regression part, mealpct may not perfectly predict reading scores.
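The correlation table can be produced from the numeric columns of the training set, for example:

# Correlation matrix of the numeric variables, rounded for readability
numeric_cols <- sapply(train, is.numeric)
round(cor(train[, numeric_cols]), 2)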
Q4 Dummy variables
Dummy variables are variables that take on a value of either 0 or 1, representing the absence or
presence of a particular categorical attribute, respectively. Dummy variables are often created in
statistical modeling to allow for the inclusion of categorical variables in a regression analysis,
where they are used to represent the effect of the categorical variable on the outcome variable.
Essentially, they allow the incorporation of categorical information into a numerical model by
representing the categorical variable as a set of binary variables.
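As an illustration, model.matrix() shows the 0/1 dummy columns R would create for a categorical variable such as grspan:

# One dummy column per level of grspan, minus the reference level,
# which is absorbed into the intercept
head(model.matrix(~ grspan, data = train))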
Q4 Building a model with a categorical input.
a.
Orange County
b.
c.
d.
The model predicts the reading score for Orange County to be 654.65 (obtained by adding the Orange County coefficient, -14.500, to the intercept, 669.150). However, it's important to note that the model has a relatively low adjusted R-squared value of 0.3213, which suggests that county alone may not be a strong predictor of reading scores.
e.
They both equal 654.65. This makes sense because the model uses county as its only predictor, so the fitted value for each county is simply the average reading score of that county's districts in the training set; the prediction for Orange County therefore matches the Orange County group mean.
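A sketch of the county-only model and the two quantities compared in part e; the model object name and the level label "Orange" are assumptions about how the county names appear in the data:

# Regression of reading score on county alone; R expands county into dummies
county_model <- lm(readscr ~ county, data = train)
summary(county_model)

# Model prediction for an Orange County district
predict(county_model, newdata = data.frame(county = "Orange"))

# Group mean for Orange County in the training set; with a single categorical
# predictor, the fitted value for each county equals this group mean
mean(train$readscr[train$county == "Orange"])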
Q5 Backward elimination
This model includes the variables calwpct, str, avginc, and elpct, and has the lowest AIC of 762. When calwpct, str, or elpct decreases, the reading score increases; when avginc goes up, readscr increases too.
a.
The summary shows the final multiple linear regression model with four predictors. The p-values
of all the predictors are statistically significant at the 0.05 level or lower, indicating that they have
a significant relationship with the response variable "readscr".
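A sketch of the elimination step with step(); the starting set of predictors shown here is illustrative and should match whatever full model the assignment specifies:

# Start from a larger model and let step() drop predictors to minimize AIC
full_model <- lm(readscr ~ calwpct + str + avginc + elpct + expnstu + compstu,
                 data = train)
backward_model <- step(full_model, direction = "backward")
summary(backward_model)   # final model: calwpct, str, avginc, elpct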
Q6 Model metrics
a.
b.
c.
This value is likely the R-squared value for my multiple linear regression model.
Q7 t-distribution plot
From the output, we can see that the t-value for "str" is -2.558, and the number of degrees of freedom in our model is 159.
About 98.9% of the curve is shaded, which is the area between -2.558 and +2.558. The remaining roughly 1.1% in the two tails represents the probability of getting a t-value equal to or more extreme than the observed t-value under the null hypothesis, i.e., the two-sided p-value.
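These areas can be checked directly from the t-distribution with 159 degrees of freedom:

# Area between -2.558 and +2.558 (the shaded ~98.9%)
pt(2.558, df = 159) - pt(-2.558, df = 159)

# Area in the two tails, i.e., the two-sided p-value for str (~1.1%)
2 * pt(-2.558, df = 159)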
Q8 F-statistic
In the ANOVA table, an F-statistic is reported for each predictor variable, testing the null hypothesis that the corresponding regression coefficient is equal to zero; the residual row summarizes the variation the model leaves unexplained. The overall F-statistic for the model tests the null hypothesis that none of the predictors explains a significant amount of the variation in readscr.
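For reference, the per-predictor F-statistics come from the ANOVA table, while the overall F-statistic is printed at the bottom of summary(); backward_model is the object name used in the sketch above:

anova(backward_model)     # sequential F-test for each predictor
summary(backward_model)   # overall F-statistic for the whole model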
Q9 Fictional school district
“BU School District”
calwpct = 40
str = 20
avginc = 60 (average district income is recorded in thousands of dollars, so $60,000 enters the model as 60)
elpct = 10
readscr = 674.51 - 0.45 * 40 - 1.16 * 20 + 1.09 * 60 - 0.48 * 10
readscr = 674.51 - 18 - 23.2 + 65.4 - 4.8
readscr = 693.91
The model predicts that BU School District will have an average reading score of approximately 693.91.
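The same prediction can be produced with predict(), assuming the final backward-elimination model keeps only calwpct, str, avginc, and elpct as reported above, and remembering that avginc is entered in thousands of dollars:

# Fictional "BU School District"
bu_district <- data.frame(calwpct = 40, str = 20, avginc = 60, elpct = 10)
predict(backward_model, newdata = bu_district)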
Q10 Accuracy
The model performs well on the training set, with an RMSE of 9.901552 and a MAPE of 1.188187. When we test the model on the validation set, the RMSE is lower at 8.913565 and the MAPE is also lower at 1.004548. Since the validation errors are not higher than the training errors, there is no sign of overfitting, and the model appears to generalize well to new data.
Compared to the MLR model, we can see that the SLR model performs somewhat better in terms of accuracy on both the training and validation sets.
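The MLR accuracy values can be computed the same way as for the simple model, so the two models can be compared side by side:

library(forecast)

# RMSE/MAE/MAPE of the backward-elimination model on each partition
accuracy(predict(backward_model, train), train$readscr)
accuracy(predict(backward_model, valid), valid$readscr)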