Module 1 Assignment

College of Professional Studies, Northeastern University
ALY6015, 21626
Harpreet Sharma
January 15th, 2023
Table of Contents

Introduction
Analysis
    Figure 1: Histogram of Sale Price
    Figure 2: Descriptive Statistics of Sale Price
    Figure 3: Correlation Matrix Plot of Subset Data
    Figure 4: Scatterplot for the Variable with the Highest Correlation
    Figure 5: Scatterplot for the Variable with the Lowest Correlation
    Figure 6: Scatterplot for the Variable with Correlation Closest to 0.5
    Figure 7: Diagnostic Plot
    Figure 8: All Subset Regression Plot
Conclusion/Interpretations
References
Appendices
Introduction

This report details an analysis focused on exploring and modeling the Ames housing dataset, which comprises 2,930 records and 82 variables, with the primary goal of predicting sale prices. The central question guiding the exploration is what determines housing prices. To answer it, a methodical strategy is used: importing and preparing the dataset, performing thorough Exploratory Data Analysis (EDA), and applying predictive modeling tools.

Analysis

The analysis began with an exploration of the dataset using visualizations (see Appendix A) and descriptive statistics (see Appendix B). This process involved uncovering patterns, understanding distributions, and identifying correlations, particularly with sale price. Subsequently, the numeric variables were extracted from the original dataset (see Appendix C), and missing values were imputed with the mean of each respective variable (see Appendix D).

Figure 1
Histogram of Sale Price
Figure 1 shows a histogram of sale prices. The distribution has a noticeable positive (right) skew.

Figure 2
Descriptive Statistics of Sale Price

Figure 2 presents descriptive statistics for the sale price of 2,930 properties: the mean sale price is $180,796.10 and the median is $160,000. The standard deviation of $79,886.69 indicates notable variability around the mean. Sale prices range from $12,789 to $755,000, reflecting diverse property values.

A correlation matrix was then computed on the numeric subset of the data and plotted (see Appendix E).
Figure 3
Correlation Matrix Plot of Subset Data

The correlation matrix plot in Figure 3 visually represents the strength and direction of the linear relationships between the numeric variables. Darker colors indicate stronger correlations. The legend on the right side of the plot gives the scale: 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear relationship between the variables (Correlation Analysis Different Types of Plots in R | R-Bloggers, 2021).

Following that, scatterplots were generated for the variables with the highest and lowest correlation with Sale Price, along with the variable whose correlation is closest to 0.5 (see Appendix F).
Figure 4
Scatterplot for the Variable with the Highest Correlation

The scatterplot for the variable with the highest correlation with Sale Price (above-ground living area) reveals a positive linear relationship: as the above-ground living area increases, sale price tends to increase.
Figure 5
Scatterplot for the Variable with the Lowest Correlation

The scatterplot in Figure 5 for the variable with the lowest correlation with Sale Price (Enclosed Porch) shows a weak relationship. There is no clear trend, suggesting that Enclosed Porch area has little linear association with Sale Price.
Figure 6
Scatterplot for the Variable with Correlation Closest to 0.5

The scatterplot in Figure 6 for the variable with the correlation closest to 0.5 reveals a moderate positive linear relationship.

A regression model was fitted with four independent variables to model the dependent variable, sale price (see Appendix G). The chosen variables were Lot Area, Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area. All coefficients except that of Lot Area, which has a p-value of 0.576, are statistically significant, as indicated by their extremely low p-values (less than 2e-16). The Multiple R-squared is 0.6795, suggesting that approximately 67.95% of the variability in Sale Price is explained by the model.
The linear regression equation for the fitted model is:

Sale Price = −29,593.644 + 0.06265 (Lot Area) + 68.862 (Above Ground Living Area) + 105.145 (Garage Area) + 54.586 (Total Square Feet of Basement Area)

The intercept of approximately -$29,593.644 represents the estimated Sale Price when all predictor variables (Lot Area, Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area) are zero. Each coefficient is the estimated change in Sale Price for one additional square foot of that variable, holding the other variables constant:

- Lot Area: 0.06265, an estimated increase of $0.06265 per additional square foot of lot area
- Above Ground Living Area: 68.862, an estimated increase of $68.862 per additional square foot of living area
- Garage Area: 105.145, an estimated increase of $105.145 per additional square foot of garage area
- Total Square Feet of Basement Area: 54.586, an estimated increase of $54.586 per additional square foot of basement area
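As a worked illustration of how the fitted equation is used, the R sketch below plugs a made-up house into the reported coefficients. The helper `predict_price` and the example areas are assumptions introduced for illustration; only the coefficients come from the report.

```r
# Coefficients as reported for the four-predictor model.
coefs <- c(intercept = -29593.644, lot_area = 0.06265, living_area = 68.862,
           garage_area = 105.145, basement_area = 54.586)

# Hypothetical helper: evaluate the fitted equation for areas in square feet.
predict_price <- function(lot_area, living_area, garage_area, basement_area) {
  unname(coefs["intercept"] +
           coefs["lot_area"] * lot_area +
           coefs["living_area"] * living_area +
           coefs["garage_area"] * garage_area +
           coefs["basement_area"] * basement_area)
}

# Made-up house: 10,000 sq ft lot, 1,500 sq ft living area,
# 500 sq ft garage, 1,000 sq ft basement.
example_price <- predict_price(10000, 1500, 500, 1000)

# Holding everything else fixed, 100 extra sq ft of living area
# shifts the estimate by 100 * 68.862.
delta <- predict_price(10000, 1600, 500, 1000) - example_price

print(round(example_price, 2))
print(round(delta, 2))
```

For this made-up house the equation gives an estimate of about $181,484, and the extra 100 square feet of living area add about $6,886.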
Figure 7
Diagnostic Plot

As shown in Figure 7, I produced four diagnostic plots to assess the model (see Appendix H): a residuals vs. fitted values plot, a Q-Q plot, a scale-location plot, and a residuals vs. leverage plot. The residuals vs. fitted values plot assesses the randomness of the residuals. The Q-Q plot checks the residuals for normality. The scale-location plot examines homoscedasticity. The residuals vs. leverage plot identifies influential data points.

To check for multicollinearity in the regression model, I assessed the variance inflation factor (VIF) of each predictor variable (see Appendix I). The results are 1.114879, 1.413121, 1.483363, and 1.414133 for Lot Area, Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area, respectively. There is therefore very little multicollinearity among the variables: the VIF values are all close to 1, well below commonly accepted thresholds for multicollinearity (Multicollinearity Essentials and VIF in R - Articles - STHDA, 2018).

The next step was to check the model for outliers (see Appendix J). Outliers were detected through an outlier test and through the diagnostic plot of standardized residuals against leverage shown in Figure 7. Observations 957, 1499, and 2181 are among the outliers identified. Analysis of the model residuals confirms the presence of outliers, as evidenced by the high absolute values of the studentized residuals and the extremely low unadjusted and Bonferroni-adjusted p-values (see Appendix J). These outliers deviate markedly from the expected pattern and may affect the reliability of the model.

The p-value of Lot Area in the linear regression model is 0.576, indicating that it is not a statistically significant predictor of sale price in the model. It is therefore left out to simplify and potentially improve the model.

The model fitted without Lot Area (see Appendix K) has a similar overall fit to the initial model but a slightly lower residual standard error (45250) and a higher F-statistic (2068 on 3 and 2926 degrees of freedom). The coefficients and significance levels of the remaining variables are consistent across both models. Removing Lot Area does not appear to affect the model's explanatory power meaningfully, so the reduced model is preferred for its simplicity.

To identify the best model, the all-subsets regression method is used (see Appendix L), as it explores all possible combinations of predictor variables.
Figure 8
All Subset Regression Plot

As shown in Figure 8, the model containing the intercept, Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area has an adjusted R-squared of 0.68, the highest value on the plot, indicating that the three-predictor model is the best. The preferred model equation is as follows:

Sale Price = −29,593.644 + 68.862 (Above Ground Living Area) + 105.145 (Garage Area) + 54.586 (Total Square Feet of Basement Area)

The preferred model from the all-subsets regression method is compared below with the previous model, from which Lot Area was removed.
The model without Lot Area includes Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area as predictors. The coefficients are as follows:

- Intercept: -29,593.644
- Above Ground Living Area: 68.862
- Garage Area: 105.145
- Total Square Feet of Basement Area: 54.586

This model yields an adjusted R-squared of 0.6791, indicating a reasonably good fit.

The all-subsets regression method identified the best model with an adjusted R-squared of 0.68, which agrees with 0.6791 to two decimal places. It is selected because it has the highest adjusted R-squared among the candidate subsets, indicating a good balance between model fit and simplicity. Its equation is:

Sale Price = −29,593.644 + 68.862 (Above Ground Living Area) + 105.145 (Garage Area) + 54.586 (Total Square Feet of Basement Area)

The model without Lot Area and the model obtained through the all-subsets regression method therefore converge on the same predictors and the same predictive equation, with consistent coefficients and an adjusted R-squared of about 0.68. Whether derived through manual variable selection or the all-subsets regression method, the chosen model underscores the importance of Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area in predicting Sale Price.

Conclusion/Interpretations

In conclusion, the exploration of predictive modeling for housing prices shows close agreement between the model selected by the all-subsets regression approach and the model built manually, highlighting the pivotal role of Above Ground Living Area, Garage Area, and Total Square Feet of Basement Area in determining Sale Price. Both models balance simplicity and accuracy, with consistent coefficients and comparable adjusted R-squared values of about 0.68. Based on these findings, real estate professionals and stakeholders are advised to emphasize these three features when determining property values. Continued efforts to improve the model, including the investigation of new relevant features, may improve predictive accuracy.
References

9.1 - Distinction Between Outliers and High Leverage Observations | STAT 462. (2018). Psu.edu. https://online.stat.psu.edu/stat462/node/170/

Bondar, S. (2023, September 13). Missing Data Imputation with R. Reintech.io; Reintech. https://reintech.io/blog/missing-data-imputation-with-r

Correlation Analysis Different Types of Plots in R | R-bloggers. (2021, May 13). R-Bloggers. https://www.r-bloggers.com/2021/05/correlation-analysis-different-types-of-plots-in-r/

Getting Started with Multiple Imputation in R | UVA Library. (2024). Virginia.edu. https://library.virginia.edu/data/articles/getting-started-with-multiple-imputation-in-r

Kabacoff, R. (2015). R in action: Data analysis and graphics with R. Manning.

Multicollinearity Essentials and VIF in R - Articles - STHDA. (2018, March 11). Sthda.com. http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/

Thieme, C. (2021, May 15). Identifying Outliers in Linear Regression — Cook’s Distance. Medium; Towards Data Science. https://towardsdatascience.com/identifying-outliers-in-linear-regression-cooks-distance-9e212e9136a
Appendices

Appendix A
R Code for the Histogram of Sale Price

The R code in Appendix A plots a histogram of Sale Price to view its distribution.

Appendix B
R Code for Descriptive Statistics for Sale Price

The R code in Appendix B computes descriptive statistics for Sale Price.

Appendix C
Numerical Variables

The R code in Appendix C extracts the numerical variables from the initial dataset.
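The code for Appendices A to C is not shown in this copy of the report. As a rough illustration only, these three steps might look like the following base-R sketch. It runs on a small simulated data frame; the name `ames` and the column names `SalePrice`, `Gr.Liv.Area`, and `Neighborhood` are assumptions, not taken from the report.

```r
set.seed(1)

# Simulated stand-in for the Ames data (assumed column names).
ames <- data.frame(
  SalePrice    = round(exp(rnorm(200, mean = 12, sd = 0.4))),
  Gr.Liv.Area  = round(rnorm(200, mean = 1500, sd = 400)),
  Neighborhood = sample(c("NAmes", "OldTown", "Edwards"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Appendix A: histogram of sale price.
hist(ames$SalePrice, breaks = 30,
     main = "Histogram of Sale Price", xlab = "Sale Price")

# Appendix B: descriptive statistics for sale price.
price_stats <- c(
  mean   = mean(ames$SalePrice),
  median = median(ames$SalePrice),
  sd     = sd(ames$SalePrice),
  min    = min(ames$SalePrice),
  max    = max(ames$SalePrice)
)
print(price_stats)

# Appendix C: keep only the numeric columns.
numeric_cols <- sapply(ames, is.numeric)
ames_num <- ames[, numeric_cols, drop = FALSE]
print(names(ames_num))
```

With the real dataset, `ames` would instead be read from the data file, and the same three steps would apply unchanged.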
Appendix D
Missing Values

Appendix D details how missing values of the numerical variables are identified and imputed with their mean values.

Appendix E
Correlation Matrix

Appendix E details the code used to compute and plot the correlation matrix of the numeric variables.
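A minimal sketch of mean imputation followed by a correlation matrix, assuming a numeric data frame called `ames_num`. The toy values, the column names, and the helper `impute_mean` are made up for illustration and are not the report's code.

```r
# Toy numeric data with two missing values (made-up numbers).
ames_num <- data.frame(
  SalePrice    = c(210000, 105000, 172000, 244000, 189000, 195000),
  Gr.Liv.Area  = c(1650, 900, 1330, 2100, 1630, 1600),
  Garage.Area  = c(500, 700, NA, 520, 480, 300),
  Lot.Frontage = c(140, 80, 81, 93, 74, NA)
)

# Appendix D: count missing values per column ...
na_counts <- colSums(is.na(ames_num))
print(na_counts)

# ... and replace each NA with the mean of the observed values in its column.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
ames_imputed <- as.data.frame(lapply(ames_num, impute_mean))

# Appendix E: Pearson correlation matrix of the imputed numeric variables.
cor_matrix <- cor(ames_imputed)
print(round(cor_matrix, 2))

# A plot of the matrix could then be drawn with a package such as corrplot,
# e.g. corrplot::corrplot(cor_matrix), if that package is installed.
```

Mean imputation keeps each column's mean unchanged but shrinks its variance, which is worth bearing in mind when reading the resulting correlations.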
Appendix F
Scatterplots Showing Relationships with Sale Price

Appendix F details the R code used to plot scatterplots of the variables with the highest and lowest correlation with sale price, and of the variable whose correlation is closest to 0.5.

Appendix G
Regression Model Using 4 Continuous Variables

Appendix G details the regression model fitted using 4 continuous variables.
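One way the variable selection behind the scatterplots and a first regression fit could be sketched in base R is shown below. The data are simulated, the column names are assumptions, and the toy model has three predictors rather than the report's four; "lowest" is read here as the smallest signed correlation, which is also an assumption.

```r
set.seed(42)
n <- 300

# Simulated stand-in data (assumed column names, made-up effects).
ames_num <- data.frame(
  Gr.Liv.Area    = rnorm(n, mean = 1500, sd = 400),
  Garage.Area    = rnorm(n, mean = 470, sd = 150),
  Enclosed.Porch = rexp(n, rate = 1 / 25)
)
ames_num$SalePrice <- 20000 + 90 * ames_num$Gr.Liv.Area +
  160 * ames_num$Garage.Area + rnorm(n, sd = 20000)

# Correlation of every predictor with SalePrice.
cors <- cor(ames_num)[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

highest      <- names(which.max(cors))
lowest       <- names(which.min(cors))  # use abs(cors) if "weakest" is meant
closest_half <- names(which.min(abs(cors - 0.5)))

# Appendix F: one scatterplot per selected variable.
for (v in c(highest, lowest, closest_half)) {
  plot(ames_num[[v]], ames_num$SalePrice,
       xlab = v, ylab = "SalePrice", main = paste("SalePrice vs", v))
}

# Appendix G: multiple linear regression; the coefficient table holds the
# estimates, standard errors, t values and p-values.
fit <- lm(SalePrice ~ Gr.Liv.Area + Garage.Area + Enclosed.Porch,
          data = ames_num)
print(summary(fit)$coefficients)
print(summary(fit)$r.squared)
```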
Appendix H
Diagnostic Plot

Appendix H details the R code used to produce the diagnostic plots for the model.

Appendix I
Multicollinearity

Appendix I details the R code used to check for multicollinearity among the independent variables.

Appendix J
Outliers

Appendix J details the R code used to identify outliers, together with the result.
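The three checks described in Appendices H to J can be approximated in base R as sketched below. This is not the report's code: the data are simulated with one planted outlier, the column names are assumptions, and the VIF and Bonferroni outlier test are computed by hand rather than with a package (the `car` package provides `vif()` and `outlierTest()` for the same purpose).

```r
set.seed(7)
n <- 200

# Simulated stand-in data (assumed column names, made-up effects).
d <- data.frame(
  Lot.Area      = rnorm(n, mean = 10000, sd = 3000),
  Gr.Liv.Area   = rnorm(n, mean = 1500, sd = 400),
  Garage.Area   = rnorm(n, mean = 470, sd = 150),
  Total.Bsmt.SF = rnorm(n, mean = 1000, sd = 350)
)
d$SalePrice <- -30000 + 70 * d$Gr.Liv.Area + 105 * d$Garage.Area +
  55 * d$Total.Bsmt.SF + rnorm(n, sd = 30000)
d$SalePrice[1] <- d$SalePrice[1] + 300000  # plant one gross outlier

fit <- lm(SalePrice ~ Lot.Area + Gr.Liv.Area + Garage.Area + Total.Bsmt.SF,
          data = d)

# Appendix H: the four standard diagnostic plots in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Appendix I: VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on the remaining predictors.
predictors <- c("Lot.Area", "Gr.Liv.Area", "Garage.Area", "Total.Bsmt.SF")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = d))$r.squared
  1 / (1 - r2)
})
print(round(vif, 3))

# Appendix J: Bonferroni outlier test on the studentized residuals.
rstud   <- rstudent(fit)
p_unadj <- 2 * pt(abs(rstud), df = df.residual(fit) - 1, lower.tail = FALSE)
p_bonf  <- pmin(1, p_unadj * length(rstud))
outliers <- which(p_bonf < 0.05)
print(outliers)
```

Because the simulated predictors are independent, the VIF values come out close to 1, mirroring the low values reported for the real data.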
Appendix K
Regression Model Excluding One Variable

Appendix K details the R code used to fit the model without Lot Area.

Appendix L
All Subset Regression

Appendix L details the R code used to identify the best subset model.
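To illustrate the last two steps, the sketch below drops one predictor with `update()` and then runs a brute-force all-subsets search ranked by adjusted R-squared. It is a base-R stand-in on simulated data with assumed column names; packages such as `leaps` offer `regsubsets()` for the same task, and the report's actual code may differ.

```r
set.seed(11)
n <- 200

# Simulated stand-in data; Lot.Area has no effect on the simulated price.
d <- data.frame(
  Lot.Area      = rnorm(n, mean = 10000, sd = 3000),
  Gr.Liv.Area   = rnorm(n, mean = 1500, sd = 400),
  Garage.Area   = rnorm(n, mean = 470, sd = 150),
  Total.Bsmt.SF = rnorm(n, mean = 1000, sd = 350)
)
d$SalePrice <- -30000 + 70 * d$Gr.Liv.Area + 105 * d$Garage.Area +
  55 * d$Total.Bsmt.SF + rnorm(n, sd = 30000)

# Appendix K: refit the model without Lot.Area and compare the fits.
full    <- lm(SalePrice ~ Lot.Area + Gr.Liv.Area + Garage.Area + Total.Bsmt.SF,
              data = d)
reduced <- update(full, . ~ . - Lot.Area)
print(c(full = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))

# Appendix L: fit every non-empty subset of predictors (2^4 - 1 = 15 models)
# and keep the one with the highest adjusted R-squared.
predictors <- c("Lot.Area", "Gr.Liv.Area", "Garage.Area", "Total.Bsmt.SF")
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)
adj_r2 <- sapply(subsets, function(vars) {
  summary(lm(reformulate(vars, response = "SalePrice"), data = d))$adj.r.squared
})
best <- subsets[[which.max(adj_r2)]]
print(best)
print(round(max(adj_r2), 4))
```

Enumerating every subset is only practical for a handful of predictors, since the number of models doubles with each added variable.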