MAT 303 Project One Summary Report_ChinhDoan

Date

Jun 23, 2024

MAT 303 Project One Summary Report
CHINH DOAN
chinh.doan@snhu.edu
Southern New Hampshire University

1. Introduction

As a data analyst employed by a real estate firm, I am tasked with analyzing a substantial historical data set pertaining to residential properties. The objective of this analysis is to examine the relationships among various attributes of homes. The findings from this analysis will be utilized to assist the real estate company in establishing more accurate pricing for their clients' property listings. The analytical methods employed in this project will encompass first order and second order regression models, involving both quantitative and qualitative variables, as well as a nested second order regression model.

2. Data Preparation

The key variables included in this data set are price, age, square footage of the living area, number of bathrooms, view, square footage of the upper level, school rating, and crime rate. The data set consists of 2,692 individual records (rows) and encompasses 23 columns.

3. Model #1 - First Order Regression Model with Quantitative and Qualitative Variables

The scatterplot of price against the square footage of the living area illustrates a positive association between the two variables. Specifically, as the living area increases, the price of the home tends to increase as well.

The scatterplot of price against the age of the home shows no clear trend, indicating little to no linear association between the two variables.

The correlation coefficient between price and living area is 0.6895, while the correlation coefficient between price and the age of the home is -0.0746. These values indicate a moderately strong positive correlation between price and living area, and a very weak negative correlation between price and the age of the home.

Reporting Results:

The general form of the multiple regression model is as follows:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5$$

The prediction equation is as follows:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5$$

R script:

$$\hat{y} = 7709 + 129.3 x_1 + 19.51 x_2 + 1451 x_3 + 43970 x_4 + 2.49 \times 10^5 x_5$$

The multiple regression model yields an R-squared value of 0.6029 and an adjusted R-squared value of 0.602. That is, the model explains about 60.29% of the variation in price, or 60.2% after adjusting for the number of predictors. The beta estimate for living area is 1.293e+02, and for lake view is 2.490e+05. Holding the other variables constant, each additional square foot of living area is associated with a price increase of about $129.30, and a lake view is associated with a price that is about $249,000 higher.
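
The relationship between the two R-squared values can be checked by hand. The short sketch below applies the standard adjusted R-squared formula to the figures quoted above. It assumes p = 5 predictor terms, matching the five x terms in the prediction equation; the function name is illustrative and is not taken from the project's R script.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared and sample size as quoted in the report for Model #1.
# n_predictors = 5 is an assumption based on the terms x1..x5.
model1_adjusted = adjusted_r_squared(r_squared=0.6029, n_obs=2692, n_predictors=5)
print(round(model1_adjusted, 3))
```

With 2,692 records the penalty for five predictors is tiny, which is why the adjusted value sits so close to the unadjusted one.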

Upon analysis of the plots, no specific trend is observed in the residuals against the fitted values. The points on the plot vary widely and lack a specific shape. In the Normal Q-Q plot, the points remain close to or on the line.

Evaluating Significance of Model

In the overall F-test, the null hypothesis is that none of the predictor variables is related to the response variable (all slope coefficients are zero), while the alternative hypothesis is that at least one predictor variable is related to it (at least one coefficient is not zero). With a P-value of 2.2e-16, the null hypothesis is rejected in favor of the alternative hypothesis, and the model is deemed significant at a 5% level of significance.

The variables age, view, and sqft of the living area are found to be significant at a 5% level of significance. For each individual test, the null hypothesis is that the variable's coefficient is zero (the variable is not significant to the model), while the alternative hypothesis is that the coefficient is not zero. With a P-value below 5% for each of these variables, the null hypothesis is rejected in favor of the alternative hypothesis.

Making Predictions Using Model

The predicted price for a home with 2150 sqft living area, 1050 sqft upper-level living area, 15 years old, 3 bathrooms, and backing out to the road is $459,828.2. The 90% prediction interval for the price of this home is (239,563, 680,093.4) and the 90% confidence interval is (446,087.9, 473,568.5).
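
As a rough check, the point predictions for the two homes in this section can be reproduced from the rounded coefficients. The sketch below is not the project's R code. It assumes that x1 is living area, x2 is upper-level square footage, x3 is age, x4 is the number of bathrooms, and x5 is a 0/1 lake-view indicator with a road view as the baseline. The report does not state this mapping; it is the assignment that reproduces the reported predictions. All names are illustrative.

```python
# Rounded coefficients from the fitted Model #1 equation in the report.
# The variable-to-coefficient mapping is an assumption (see the text above).
MODEL1_COEFFS = {
    "intercept": 7709.0,
    "sqft_living": 129.3,
    "sqft_above": 19.51,
    "age": 1451.0,
    "bathrooms": 43970.0,
    "lake_view": 2.49e5,
}


def predict_price_model1(sqft_living, sqft_above, age, bathrooms, lake_view):
    """Evaluate the first order prediction equation; lake_view is 0 or 1."""
    c = MODEL1_COEFFS
    return (
        c["intercept"]
        + c["sqft_living"] * sqft_living
        + c["sqft_above"] * sqft_above
        + c["age"] * age
        + c["bathrooms"] * bathrooms
        + c["lake_view"] * lake_view
    )


road_home = predict_price_model1(2150, 1050, 15, 3, lake_view=0)
lake_home = predict_price_model1(4250, 2100, 5, 5, lake_view=1)
print(road_home, lake_home)
```

Both results land within about $40 of the reported predictions ($459,828.2 and $1,074,285). The small gap comes from using coefficients rounded to four significant digits.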

The predicted price for a home with 4250 sqft living area, 2100 sqft upper-level living area, 5 years old, 5 bathrooms, and backing out to a lake is $1,074,285. The 90% prediction interval for the price of the home is (852,522.6, 1,296,048) and the 90% confidence interval is (1,045,117, 1,103,454).

The prediction interval is wider than the confidence interval because the confidence interval reflects only the uncertainty in the estimated mean price of homes with these characteristics, whereas the prediction interval also includes the random error of an individual home's price around that mean.

4. Model #2 - Complete Second Order Regression Model with Quantitative Variables

Upon analyzing the scatterplots of price against crime rate per 100,000 individuals and price against the average school rating in the area, it was observed that both exhibit a non-linear, curved trend. Specifically, the scatterplot of price against crime rate trends downward: price decreases as crime rate increases. Conversely, the scatterplot of price against school rating trends upward: price increases as school rating increases. Therefore, it would be appropriate to utilize a second order model for these relationships.

Presentation of Findings

The general form of the complete second order model is as follows:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$

The prediction equation for price utilizing average school rating in the area and crime rate is as follows:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2 + \hat{\beta}_4 x_1^2 + \hat{\beta}_5 x_2^2$$

R script:

$$\hat{y} = 7.339 \times 10^5 - 7.375 \times 10^4 x_1 - 3.155 \times 10^3 x_2 - 52.27 x_1 x_2 + 1.165 \times 10^4 x_1^2 + 6.377 x_2^2$$

The R-squared value is 0.8088, and the adjusted R-squared value is 0.8084. These values indicate that 80.88% of the variation in price can be explained by the provided variables, or 80.84% after adjusting for the number of terms in the model.

The plots indicate that there is no discernible trend in the residuals against fitted values, and the points on the Normal Q-Q plot closely align with the line, with only a few falling slightly below or above it. Overall, the plots support the assumptions of homoscedasticity and normally distributed residuals.

Evaluating Significance of Model

In the overall F-test, the null hypothesis is that none of the terms in the model is related to price (all coefficients other than the intercept are zero), while the alternative hypothesis is that at least one term is related to price. The resulting P-value is 2.2e-16, leading to the rejection of the null hypothesis in favor of the alternative hypothesis. Hence, the model is significant at a 5% level of significance.

The variables "school rating" and "crime" are found to be significant at a 5% level of significance. The null hypothesis, which states that the terms are not significant, is rejected in favor of the alternative hypothesis, which states that the terms are significant, based on a P-value of 2.2e-16.

Making Predictions Using Model

The predicted price for a home in an area with an average school rating of 9.80 and a crime rate of 81.02 per 100,000 individuals is $874,497. The 90% prediction interval is (721606.2, 1027388), and the 90% confidence interval is (863681.4, 885312.7). We can be 90% confident that the price of an individual home with these characteristics falls within the prediction interval, and 90% confident that the mean price of all such homes falls within the confidence interval.
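
The Model #2 predictions can be checked against the fitted equation in the same way. The sketch below is a hand calculation, not the project's R code. It assumes that x1 is the average school rating and x2 is the crime rate per 100,000 individuals, and that the coefficients on x1, x2, and x1x2 are negative. Those signs are the choice that reproduces the reported predictions. The function name is illustrative.

```python
def predict_price_model2(school_rating: float, crime_rate: float) -> float:
    """Evaluate the complete second order equation with rounded coefficients.

    Assumed mapping: x1 = school rating, x2 = crime rate per 100,000.
    """
    x1, x2 = school_rating, crime_rate
    return (
        7.339e5
        - 7.375e4 * x1
        - 3.155e3 * x2
        - 52.27 * x1 * x2
        + 1.165e4 * x1 ** 2
        + 6.377 * x2 ** 2
    )


high_rating_area = predict_price_model2(school_rating=9.80, crime_rate=81.02)
low_rating_area = predict_price_model2(school_rating=4.28, crime_rate=215.50)
print(high_rating_area, low_rating_area)
```

Both values fall within a few hundred dollars of the reported predictions ($874,497 and $199,706.7). The remaining difference comes from rounding the coefficients to four significant digits.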

The predicted price for a home in an area with an average school rating of 4.28 and a crime rate of 215.50 per 100,000 individuals is $199,706.7. The 90% prediction interval is (46991.65, 352421.7), and the 90% confidence interval is (191753.5, 207659.9). We can be 90% confident that the price of an individual home with these characteristics falls within the prediction interval, and 90% confident that the mean price of all such homes falls within the confidence interval.

5. Nested Models F-Test

The general form of the first order model with interaction is as follows:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

The prediction equation is as follows:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$$

R script:

$$\hat{y} = -410233.37 + 155559.97 x_1 + 2230.07 x_2 - 564.85 x_1 x_2$$

Evaluating Significance of Model

In the overall F-test, the null hypothesis states that none of the terms in the model is related to price (all coefficients other than the intercept are zero), while the alternative hypothesis states that at least one term is related to price. The P-value is 2.2e-16, so the null hypothesis is rejected in favor of the alternative hypothesis. The model is significant at a 5% level of significance. The variables that are significant at a 5% level of significance are school rating and crime.

Model Comparison

A reduced model contains only a subset of the terms in the complete model. Here, the reduced model contains the first order and interaction terms, while the complete model adds the squared terms.

General form and prediction equation of the reduced model:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$$

General form and prediction equation of the complete model:

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2$$

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2 + \hat{\beta}_4 x_1^2 + \hat{\beta}_5 x_2^2$$

When conducting the nested model F-test at a 5% level of significance, the null hypothesis states that the squared terms do not contribute to the model (both squared-term coefficients are zero), and the alternative hypothesis states that at least one squared term contributes (at least one of these coefficients is not zero). The P-value is 2.2e-16, so we reject the null hypothesis in favor of the alternative hypothesis: the squared terms improve the model, and the complete model is preferred over the reduced model.

6. Conclusion

Throughout this project, I conducted statistical analyses using a large set of historical data to create multiple regression models. These models were analyzed to assist a real estate company in setting better prices when listing a home for a client. The model that I would choose to predict house prices is the complete second-order regression model because it includes the first order, interaction, and squared terms, and it explains the highest share of the variation in price (R-squared of 0.8088, compared with 0.6029 for the first order model). The practical significance of the analyses performed is that they can be used to predict the value of a house based on various factors such as layout, outside view, school rating, and crime rates.
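
For reference, the nested model F-test in Section 5 compares the sum of squared errors (SSE) of the reduced and complete models. The sketch below shows the standard test statistic. The report does not list the SSE values, so the numbers passed in are hypothetical placeholders chosen only to illustrate the arithmetic. The function and parameter names are also illustrative.

```python
def nested_f_statistic(sse_reduced, sse_complete, n_obs, k_complete, g_reduced):
    """Nested model F statistic.

    F = [(SSE_R - SSE_C) / (k - g)] / [SSE_C / (n - (k + 1))]
    where k and g count the non-intercept terms in the complete and
    reduced models respectively.
    """
    numerator = (sse_reduced - sse_complete) / (k_complete - g_reduced)
    denominator = sse_complete / (n_obs - (k_complete + 1))
    return numerator / denominator


# n_obs, k_complete and g_reduced follow the report (2,692 rows; 5 terms in the
# complete model; 3 in the reduced model). The SSE values are HYPOTHETICAL.
f_stat = nested_f_statistic(
    sse_reduced=1200.0,
    sse_complete=1000.0,
    n_obs=2692,
    k_complete=5,
    g_reduced=3,
)
print(f_stat)
```

A large F statistic relative to the F distribution with (k - g) and (n - (k + 1)) degrees of freedom gives a small P-value, which is the basis for rejecting the reduced model.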