MAT 303 Project One Summary Report
CHINH DOAN
chinh.doan@snhu.edu
Southern New Hampshire University
1. Introduction
As a data analyst employed by a real estate firm, I am tasked with analyzing a substantial
historical data set pertaining to residential properties. The objective of this analysis is to examine the
relationships among various attributes of homes. The findings from this analysis will be utilized to assist
the real estate company in establishing more accurate pricing for their clients' property listings. The
analytical methods employed in this project will encompass first order and second order regression
models, involving both quantitative and qualitative variables, as well as a nested second order
regression model.
2. Data Preparation
The key variables included in this data set are price, age, square footage of the living area,
number of bathrooms, view, square footage of the upper level, school rating, and crime rate. The data
set consists of 2,692 individual records (rows) and encompasses 23 columns.
3. Model #1 - First Order Regression Model with Quantitative and Qualitative Variables
The scatterplot presented above illustrates a positive correlation between the price of a home
and the square footage of its living area. Specifically, as the living area in square footage increases, there
is a corresponding increase in the price of the home.
The scatterplot of price against the age of the home shows no clear trend, indicating little to no
linear association between the two variables.
The correlation coefficient between the price and the living area is 0.6895, while the correlation
coefficient between the price and the age of the home is -0.0746. These values indicate a fairly strong
positive correlation between price and living area, and a very weak negative correlation between price
and the age of the home.
Reporting Results:
The general form of the multiple regression model is as follows:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$
R script: $$\hat{y} = 7709 + 129.3 x_1 + 19.51 x_2 + 1451 x_3 + 43970 x_4 + 2.49 \times 10^{5} x_5$$
The multiple regression model is as follows:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5$$
R script: $$\hat{y} = 7709 + 129.3 x_1 + 19.51 x_2 + 1451 x_3 + 43970 x_4 + 2.49 \times 10^{5} x_5$$
The multiple regression model yields an R-squared value of 0.6029 and an adjusted R-squared
value of 0.602. This means that about 60.29% of the variation in price is explained by the model, or
60.2% after adjusting for the number of predictors. The beta estimate for living area is 1.293e+02, and
for lake view is 2.490e+05. Holding the other variables constant, each additional square foot of living
area is associated with a price increase of about $129.30, and a lake view is associated with a price
about $249,000 higher than that of a comparable home without a lake view.
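As a quick consistency check, the adjusted R-squared can be recomputed from the reported R-squared. The sketch below is illustrative only: it assumes that all 2,692 records were used to fit the model and that the model has five predictors, and the function name is hypothetical (the analysis itself was run in R, not Python).

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Assumed inputs: n = 2692 records and k = 5 predictors (x1 through x5).
model1_adjusted = adjusted_r_squared(0.6029, n_obs=2692, n_predictors=5)
print(round(model1_adjusted, 3))
```

Under these assumptions the result agrees with the reported adjusted R-squared of 0.602 after rounding. With so many records relative to the number of predictors, the adjustment is very small.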
Upon analysis of the plots, it is observed that there is no specific trend in the residuals against
the fitted values. The points on the plot vary widely and do not form a specific shape. In the Normal
Q-Q plot, the points remain close to or on the reference line, which is consistent with approximately
normal residuals.
Evaluating Significance of Model
In conducting the overall F-test, the null hypothesis posits that none of the predictor variables
is related to the response variable (all slope coefficients equal zero), while the alternative hypothesis
states that at least one of the predictor variables is related to the response variable (at least one
slope coefficient is nonzero). With a P-value of 2.2e-16, the model is deemed significant at a 5% level
of significance, and the null hypothesis is rejected in favor of the alternative hypothesis.
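The overall F statistic behind this test can be recovered from the R-squared value. The following is a minimal sketch, assuming that all 2,692 records were used and that the model has five predictors; the function name is hypothetical and the report's own F statistic comes from the R output, which is not reproduced here.

```python
def overall_f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for H0: all slope coefficients are zero."""
    explained_per_term = r_squared / n_predictors
    unexplained_per_df = (1.0 - r_squared) / (n_obs - n_predictors - 1)
    return explained_per_term / unexplained_per_df


# Assumed inputs: R^2 = 0.6029 from Model #1, n = 2692 records, k = 5 predictors.
f_model1 = overall_f_statistic(0.6029, n_obs=2692, n_predictors=5)
print(f"F on 5 and {2692 - 5 - 1} degrees of freedom: {f_model1:.1f}")
```

An F statistic in the hundreds on these degrees of freedom corresponds to an extremely small P-value, which is in line with the reported 2.2e-16.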
The variables age, view, and sqft of the living area are found to be significant at a 5% level of
significance. For each individual t-test, the null hypothesis asserts that the coefficient of the variable
equals zero, while the alternative hypothesis posits that the coefficient differs from zero. With P-values
below 5% for these variables, the null hypothesis is rejected in favor of the alternative hypothesis.
Making Predictions Using Model
The predicted price for a home with 2150 sqft living area, 1050 sqft upper-level living area, 15
years old, 3 bathrooms, and backing out to the road is $459,828.2.
The 90% prediction interval for the price of this home is (239,563, 680,093.4) and the 90%
confidence interval is (446,087.9, 473,568.5).
The predicted price for a home with 4250 sqft living area, 2100 sqft upper-level living area, 5
years old, 5 bathrooms, and backing out to a lake is $1,074,285.
The 90% prediction interval for the price of the home is (852,522.6, 1,296,048) and the 90%
confidence interval is (1,045,117, 1,103,454).
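The two predicted prices can be approximately reproduced by plugging the home attributes into the fitted equation. The sketch below rests on an assumption that the report does not state explicitly: that x1 is living area, x2 is upper-level area, x3 is age, x4 is the number of bathrooms, and x5 is an indicator equal to 1 for a lake view and 0 otherwise. It uses the lake-view coefficient of 2.490e+05 reported in the text, and all names are hypothetical.

```python
# Rounded coefficients from the fitted Model #1 equation in this report.
INTERCEPT = 7709.0
COEF_LIVING_SQFT = 129.3   # assumed x1: living area (sqft)
COEF_UPPER_SQFT = 19.51    # assumed x2: upper-level area (sqft)
COEF_AGE = 1451.0          # assumed x3: age of the home (years)
COEF_BATHROOMS = 43970.0   # assumed x4: number of bathrooms
COEF_LAKE_VIEW = 2.49e5    # assumed x5: 1 for a lake view, 0 otherwise


def predict_price(living_sqft, upper_sqft, age, bathrooms, lake_view):
    """Evaluate the first order prediction equation for one home."""
    return (INTERCEPT
            + COEF_LIVING_SQFT * living_sqft
            + COEF_UPPER_SQFT * upper_sqft
            + COEF_AGE * age
            + COEF_BATHROOMS * bathrooms
            + COEF_LAKE_VIEW * lake_view)


road_home = predict_price(2150, 1050, 15, 3, lake_view=0)
lake_home = predict_price(4250, 2100, 5, 5, lake_view=1)
print(f"{road_home:,.1f}  {lake_home:,.1f}")
```

Because the coefficients above are rounded, the results differ slightly from the reported $459,828.2 and $1,074,285, which R computes from the unrounded estimates.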
The prediction interval is wider than the confidence interval because the confidence interval
reflects only the sampling uncertainty in the estimated mean price, whereas the prediction interval also
accounts for the random error of an individual home's price around that mean.
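The relationship between the two interval widths can be checked numerically. For a linear model, the squared half-width of a prediction interval equals the squared half-width of the confidence interval plus (t * s)^2, where s is the residual standard error and t is the critical value. The sketch below, with hypothetical function names, backs out the t * s term from the intervals reported above; if the intervals come from the same model, the term should be about the same for both homes.

```python
import math


def half_width(interval):
    """Half of the distance between the lower and upper limits."""
    low, high = interval
    return (high - low) / 2.0


def noise_half_width(prediction_interval, confidence_interval):
    """Return t * s, the part of the prediction half-width due to the error term alone.

    Uses half_PI^2 = half_CI^2 + (t * s)^2.
    """
    return math.sqrt(half_width(prediction_interval) ** 2
                     - half_width(confidence_interval) ** 2)


# 90% intervals reported for the two homes in this section.
road_home = noise_half_width((239563.0, 680093.4), (446087.9, 473568.5))
lake_home = noise_half_width((852522.6, 1296048.0), (1045117.0, 1103454.0))
print(round(road_home), round(lake_home))
```

The two values agree closely, which shows that the extra width of the prediction interval comes from the same error term for both homes, while the confidence interval is wider for the lake-view home because its attributes lie farther from the center of the data.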
4. Model #2 - Complete Second Order Regression Model with Quantitative Variables
Upon analyzing the scatterplots depicting price in relation to crime rate per 100,000 individuals
and price in relation to the average school rating in the area, it was observed that both exhibit a
non-linear trend. Specifically, in the scatterplot of price against crime rate, price trends downward as
crime rate increases. Conversely, in the scatterplot of price against school rating, price trends upward
as school rating increases. Because both trends are curved, it would be appropriate to utilize a second
order model for these relationships.
Presentation of Findings
The general form and prediction equation is as follows:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$
The complete second order model for price utilizing average school rating in the area and crime
rate is as follows:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2 + \hat{\beta}_4 x_1^2 + \hat{\beta}_5 x_2^2$$
R script: $$\hat{y} = 7.339 \times 10^{5} - 7.375 \times 10^{4} x_1 - 3.155 \times 10^{3} x_2 - 52.27 x_1 x_2 + 1.165 \times 10^{4} x_1^2 + 6.377 x_2^2$$
The R-squared value is 0.8088, and the adjusted R-squared value is 0.8084. These values indicate
that 80.88% of the variation in price can be explained by the provided variables, or 80.84% after
adjusting for the number of terms in the model.
The above plots indicate that there is no discernible trend in the residuals against fitted values,
and the points on the normal Q-Q plot closely align with the line, with only a few falling slightly below or
above it. Overall, the plots support the assumptions of homoscedasticity and normality of the residuals.
Evaluating Significance of Model
Upon conducting the overall F-test, the null hypothesis states that all of the coefficients of the
model terms equal zero, while the alternative hypothesis states that at least one of these coefficients is
nonzero. The resulting P-value is 2.2e-16, leading to the rejection of the null hypothesis in favor of the
alternative hypothesis. Hence, the model is significant at a 5% level of significance.
The variables "school rating" and "crime" are found to be significant at a 5% level of significance.
The null hypothesis, which states that the terms are not significant, is rejected in favor of the alternative
hypothesis, which suggests the terms are significant, based on a P-value of 2.2e-16.
Making Predictions Using Model
The predicted price for a home in an area with an average school rating of 9.80 and a crime rate
of 81.02 per 100,000 individuals is $874,497. The 90% prediction interval is (721,606.2, 1,027,388), and
the 90% confidence interval is (863,681.4, 885,312.7). We can be 90% confident that the price of an
individual home with these characteristics falls within the prediction interval, and 90% confident that
the mean price of all such homes falls within the confidence interval.
The predicted price for a home in an area with an average school rating of 4.28 and a crime
rate of 215.50 per 100,000 individuals is $199,706.7. The 90% prediction interval is (46,991.65,
352,421.7), and the 90% confidence interval is (191,753.5, 207,659.9). We can be 90% confident that the
price of an individual home with these characteristics falls within the prediction interval, and 90%
confident that the mean price of all such homes falls within the confidence interval.
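Both predictions can be approximately reproduced from the fitted second order equation. The sketch below assumes that x1 is the average school rating and x2 is the crime rate per 100,000 individuals, which the report does not state explicitly; the function name is hypothetical.

```python
def predict_price_second_order(school_rating: float, crime_rate: float) -> float:
    """Evaluate the complete second order equation with the rounded coefficients.

    Assumes x1 = average school rating and x2 = crime rate per 100,000 individuals.
    """
    x1, x2 = school_rating, crime_rate
    return (7.339e5
            - 7.375e4 * x1
            - 3.155e3 * x2
            - 52.27 * x1 * x2
            + 1.165e4 * x1 ** 2
            + 6.377 * x2 ** 2)


high_rated_area = predict_price_second_order(9.80, 81.02)
low_rated_area = predict_price_second_order(4.28, 215.50)
print(f"{high_rated_area:,.1f}  {low_rated_area:,.1f}")
```

The results land within a fraction of a percent of the reported $874,497 and $199,706.7; the small gaps come from using coefficients rounded to four significant figures.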
5. Nested Models F-Test
The general form of the reduced model is as follows:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
The prediction equation of the reduced model is as follows:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$$
R script: $$\hat{y} = -410233.37 + 155559.97 x_1 + 2230.07 x_2 - 564.85 x_1 x_2$$
Evaluating Significance of Model
Following the overall F-test, the null hypothesis states that none of the terms in the model is
related to price (all coefficients equal zero), while the alternative hypothesis states that at least one
term is related to price. The P-value is 2.2e-16. The variables that are significant at a 5% level of
significance are school rating and crime. In conclusion, the null hypothesis would be rejected in favor of
the alternative hypothesis. The model is significant at a 5% level of significance.
Model Comparison
A reduced model contains only a subset of the terms in the complete model, while a complete
model includes all terms. Here, the reduced model contains the first order terms and the interaction
term, and the complete model adds the squared terms.
The general form and prediction equation of the reduced model are as follows:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$$
The general form and prediction equation of the complete model are as follows:
$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2 + \hat{\beta}_4 x_1^2 + \hat{\beta}_5 x_2^2$$
When conducting the nested model F-test at a 5% level of significance, the null hypothesis
states that the coefficients of both squared terms equal zero ($\beta_4 = \beta_5 = 0$), so that the
reduced model is adequate, and the alternative hypothesis states that at least one of these coefficients
is nonzero. The P-value is 2.2e-16, so we would reject the null hypothesis and conclude that the squared
terms contribute to the model; the complete model is preferred.
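The nested model F statistic compares the error sums of squares (SSE) of the two models. The sketch below shows the calculation in general terms; the function name is hypothetical, and the SSE values and sample size in the example are purely illustrative, since the report does not list the SSE of either model.

```python
def nested_f_statistic(sse_reduced, sse_complete, n_obs, k_reduced, k_complete):
    """F = ((SSE_R - SSE_C) / (k_C - k_R)) / (SSE_C / (n - (k_C + 1))).

    k_reduced and k_complete count the terms in each model, excluding the intercept.
    """
    extra_terms = k_complete - k_reduced
    error_df = n_obs - (k_complete + 1)
    drop_in_sse_per_term = (sse_reduced - sse_complete) / extra_terms
    mean_squared_error = sse_complete / error_df
    return drop_in_sse_per_term / mean_squared_error


# Illustrative numbers only (not taken from the report's data).
f_stat = nested_f_statistic(sse_reduced=120.0, sse_complete=100.0,
                            n_obs=106, k_reduced=3, k_complete=5)
print(f_stat)
```

For the models in this report, k_reduced would be 3 (two first order terms and the interaction) and k_complete would be 5 (adding the two squared terms), so the test has 2 numerator degrees of freedom.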
6. Conclusion
Throughout this project, I conducted statistical analyses using a large set of historical data to
create multiple regression models. These models were analyzed to assist a real estate company in
setting better prices when listing a home for a client. The model that I would choose to predict house
prices is the complete second-order regression model because it explains the highest proportion of the
variation in price (an R-squared of 0.8088, compared with 0.6029 for the first order model). The practical
significance of the analyses performed is that they can be used to predict the value of a house based on
various factors such as layout, outside view, school rating, and crime rates.