CQMS_442_-_Project.edited

docx

School

University of Toronto *


Course

143

Subject

Economics

Date

Jan 9, 2024

Type

docx

Pages

16

Case Study 2 – Modeling the Sale Prices of Residential Properties in Four Neighborhoods

Group 8
Nardeen Abdulkareem (Student #500867209)
Jermaine Garrett (Student #501101292)
Dimitri Konstantopoulos (Student #500916405)
Lena Tarzi (Student #501097903)
Chen Yang (Student #501122207)

This group project was prepared for Professor Fu's CQMS442-DJ0 Multiple Regression for Business class and was submitted on April 8th, 2023.
The Problem

The purpose of this case study is to examine the relationship between the mean sale price, E(y), of a property and the following three independent variables:

1. The appraised land value of the property
2. The appraised value of the improvements on the property
3. The neighbourhood in which the property is listed

This case study focuses on the relationship between the appraised value of a property and its sale price. The sale price of a property is subject to multiple factors, such as the seller's asking price, the property's appeal to buyers, and the state of the real estate market within specific neighbourhoods.

The objectives:

1. Determine whether there is sufficient evidence to indicate that these variables contribute information for predicting the sale price from the data supplied.
2. Determine whether appraisers use the same appraisal criteria for different types of neighbourhoods.

The Data – TAMSALES-ALL.xlsx

The data used for this case study are 351 randomly selected observations from a larger data set provided by the textbook. The data were supplied by the property appraiser's office of Hillsborough County, Florida, and consist of the appraised land and improvement values and sale prices for residential properties sold in Tampa, Florida, from May 2008 to June 2009. Four neighbourhoods were selected, each relatively similar but varying sociologically and in property types and values: Town & Country (base), Cheval, Avila, and Northdale. The subset of sales and appraisal data pertinent to these four neighbourhoods was used to develop a prediction equation relating sale price to appraised land and improvement values. The values were recorded in thousands of dollars.

The Theoretical Model

This activity aims to build a model whose predicted value, ŷ, closely tracks the mean sale price, E(y). The problem we face is understanding how closely our independent variables (the appraised land value, the appraised improvements value, and the neighbourhood) reflect the actual selling price of the homes on the market.

Our objective is to study different models and assess which are the most useful in predicting the selling price of the 351 randomly selected homes in Tampa, Florida, from May 2008 to June 2009. The appraised value of a property is the sum of its land value and its improvements value. Ideally, the appraised value would equal the mean sale price of the homes, which a straight line with a slope of 1 would illustrate. A linear model is satisfactory for this problem: the scatterplot of sale price versus appraised value appears to fit a straight-line model (see the figure in the section on the hypothesized regression models). A strong linear relationship exists between the sale price and the appraised value in thousands of dollars. However, the variation about the trend line may be attributed to the following:

  • The age of the appraisal data, which may be affected by inflation
  • Over- or under-appraisal of homes due to biases of the agent or due to the neighbourhood of the home
  • The model used by real estate appraisers to appraise the value of the homes

The research we are conducting will help us answer the following questions:

1. Are the three independent variables (appraised land value, appraised value of improvements on the property, and the neighbourhood in which the home is listed) good predictors of the actual selling price of homes within Tampa, Florida?
2. Which model is the strongest predictor of sale price using our three independent variables?
3. Can this relationship be applied to appraising the value of homes in neighbourhoods outside of those in Tampa, Florida?

The Hypothesized Regression Models

Our objective is to relate sale price, y, to three independent variables:

  • The qualitative factor → neighbourhood (four levels)
  • The quantitative factors → appraised land value and appraised improvement value

We consider the following four models as candidates for this relationship.

Model 1 is a first-order model that traces a response plane for the mean sale price as a function of appraised land value (x1) and appraised improvement value (x2). This model assumes that the relationship between sale price and the appraised land and improvement values is identical across all four neighbourhoods, so a single first-order model is appropriate for relating the mean sale price to x1 and x2.

Model 2 is another first-order model. It assumes that the relationship between E(y) and x1 and x2 is still first-order but that the planes' y-intercepts differ depending on the neighbourhood.
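The equations for Models 1 and 2 are not reproduced in the text above. Based on the descriptions given, and assuming the dummy-variable indexing used in the final prediction equation of this report (x3 for Cheval, x4 for Avila, x5 for Northdale), they can plausibly be written as follows. This is a reconstruction from context, not a copy of the original figures.

```latex
% Model 1: one response plane shared by all four neighbourhoods
E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2

% Model 2: same slopes, but a neighbourhood-specific intercept shift
E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2
       + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5

% Assumed dummy coding:
% x_3 = 1 if Cheval,    0 otherwise
% x_4 = 1 if Avila,     0 otherwise
% x_5 = 1 if Northdale, 0 otherwise
```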
The fourth neighbourhood, Town & Country, is the base level. Consequently, this model predicts E(y) for Town & Country when $x_3 = x_4 = x_5 = 0$. Model 2 allows the mean sale price to shift up or down by neighbourhood, but it contains no interaction terms between the independent variables x1 and x2 and the neighbourhood terms. Therefore, Model 2 assumes that the change in the sale price y for every dollar increase in x1 or x2 does not depend on the neighbourhood. This model would be appropriate if the appraisers' procedure produced a relationship between the mean sale price and x1 and x2 that differed in at least two neighbourhoods, with differences that remained constant for different values of x1 and x2.

Model 3 is also a first-order model. It is similar to Model 2, except that it adds interaction terms between the dummy variables corresponding to each neighbourhood and the quantitative variables x1 and x2. This model allows the change in y for increases in x1 or x2 to vary with the neighbourhood. It takes the following form:

$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \beta_8 x_1 x_5 + \beta_9 x_2 x_3 + \beta_{10} x_2 x_4 + \beta_{11} x_2 x_5$

Model 4 adds an x1x2 interaction term and its corresponding interactions with neighbourhood. It builds on Model 3 so that the effect of x1 on the sale price can depend on x2 (and vice versa), in addition to the interactions between each independent variable and the neighbourhoods.
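The dummy coding and the neighbourhood interaction terms of Model 3 can be sketched in a few lines of Python. The helper name `model3_regressors`, the dictionary of dummy codes, and the ordering of the interaction terms (land value interactions first, then improvements value interactions) are assumptions made for illustration; the report does not say how the columns were built in the spreadsheet.

```python
# Illustrative sketch only: names and term ordering are assumptions.
# Town & Country is the base level, so all three of its dummies are 0.
NEIGHBOURHOOD_DUMMIES = {
    "Town & Country": (0, 0, 0),
    "Cheval": (1, 0, 0),
    "Avila": (0, 1, 0),
    "Northdale": (0, 0, 1),
}


def model3_regressors(land_value, improvements_value, neighbourhood):
    """Return the 11 regressors of Model 3 for one property."""
    x1, x2 = land_value, improvements_value
    dummies = list(NEIGHBOURHOOD_DUMMIES[neighbourhood])  # x3, x4, x5
    land_by_neighbourhood = [x1 * d for d in dummies]         # x1*x3, x1*x4, x1*x5
    improvements_by_neighbourhood = [x2 * d for d in dummies]  # x2*x3, x2*x4, x2*x5
    return [x1, x2] + dummies + land_by_neighbourhood + improvements_by_neighbourhood


# Made-up property values, used only to show the shape of a row.
row = model3_regressors(100.0, 200.0, "Cheval")
print(len(row), row)
```

For a Town & Country property every dummy and every interaction term is zero, which is why the Model 3 plane for the base level reduces to the intercept plus the x1 and x2 terms.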
Model 4 comprises 15 beta coefficients in addition to the intercept:

  • Main effect terms for the land value and the improvements value (β1, β2)
  • Two-way interaction between land value and improvements value (β3)
  • Main effect terms for neighbourhoods (β4, β5, β6)
  • Two-way interactions between land value and neighbourhoods (β7, β8, β9)
  • Two-way interactions between improvements value and neighbourhoods (β10, β11, β12)
  • Three-way interactions between land value, improvements value, and neighbourhoods (β13, β14, β15)

Summary of Outputs

Model 1

Summary Output for Regression Statistics

Multiple R: The multiple correlation coefficient (R) indicates the strength of the linear relationship between the dependent variable and the independent variables in the model. In this case, the multiple correlation coefficient is 0.9691, which indicates a strong positive correlation.

R Square: The coefficient of determination (R-squared) represents the proportion of the variance in the dependent variable that the independent variables in the model can explain. In this case, the R-squared value is 0.9392, which means that the independent variables in the model explain 93.92% of the variance in the dependent variable.

Adjusted R Square: This is a modified version of the R-squared value that accounts for the number of independent variables in the model. It is generally used when there are multiple independent variables in the model. In this case, the adjusted R-squared value is 0.9388.

Standard Error: The standard error measures the variability of the dependent variable around the fitted regression surface. In this case, the standard error is 63630.84.
ANOVA Table

df: The degrees of freedom represent the number of independent pieces of information available for estimating each source of variation. In this case, there are 2 degrees of freedom for the regression and 347 degrees of freedom for the residual, which gives a total of 349 degrees of freedom.

SS: The sum of squares measures the variation in the dependent variable attributed to the regression or left in the residual. In this case, the sum of squares for the regression is 2.17E+13, the sum of squares for the residual is 1.40E+12, and the total sum of squares is 2.31E+13.

MS: The mean square is calculated by dividing the sum of squares by its degrees of freedom. It represents the average amount of variation per degree of freedom for the regression or the residual. In this case, the mean square for the regression is 1.09E+13 (2.17E+13 divided by 2), and the mean square for the residual is 4048883939.

F: The F-statistic is the ratio of the regression's mean square to the residual's mean square. It tests the null hypothesis that all the regression coefficients are equal to zero, that is, that the independent variables do not affect the dependent variable. In this case, the F-statistic is 2680.0252, which suggests that the model has a significant overall fit.

Significance F: The p-value associated with the F-statistic measures the statistical significance of the model. In this case, the p-value is 1.02E-211, which is very small and indicates that the model is highly significant.

Overall, the ANOVA table suggests that the regression model has a good overall fit and that the independent variables, taken together, are highly significant in explaining the variation in the dependent variable.

The regression statistics and the ANOVA results lead us to believe that Model 1 is statistically significant. In addition, both variables have p-values indicating statistical significance.
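The quantities in the ANOVA table are tied together by simple arithmetic, which the following sketch reproduces. The function name `anova_summary` is hypothetical, and the inputs are the rounded sums of squares quoted for Model 1, so the results only approximate the spreadsheet output (for example, the F-statistic comes out within about half a percent of the reported 2680.0252).

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Derive MS, F, R-squared and adjusted R-squared from an ANOVA table."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual
    return {
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f_statistic": ms_regression / ms_residual,
        "r_squared": ss_regression / ss_total,
        # Adjusted R-squared compares the residual mean square with the
        # total mean square instead of the raw sums of squares.
        "adjusted_r_squared": 1 - ms_residual / (ss_total / df_total),
    }


# Rounded Model 1 figures as quoted in the text.
model1 = anova_summary(2.17e13, 1.40e12, 2, 347)
for name, value in model1.items():
    print(f"{name}: {value:.6g}")
```

Because R-squared and adjusted R-squared depend only on ratios of the sums of squares, they are less sensitive to the rounding than the F-statistic is.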
Model 2

Summary Output for Regression Statistics

Each statistic is interpreted as for Model 1.

Multiple R: 0.9703, which indicates a strong positive linear relationship between the dependent variable and the independent variables in the model.

R Square: 0.9415, which means that the independent variables in the model explain 94.15% of the variance in the dependent variable.

Adjusted R Square: 0.9407, after accounting for the number of independent variables in the model.

Standard Error: 62662.43.

ANOVA Table

df: 5 degrees of freedom for the regression and 344 for the residual, for a total of 349.

SS: 2.18E+13 for the regression, 1.35E+12 for the residual, and 2.31E+13 in total.

MS: 4.35E+12 for the regression and 3926580397 for the residual.

F: 1108.1623, which suggests that the model has a significant overall fit.

Significance F: 1.23E-209, which is very small and indicates that the model is highly significant.

Overall, the ANOVA table suggests that the regression model has a good overall fit and that the independent variables, taken together, are highly significant in explaining the variation in the dependent variable.

The regression statistics and ANOVA results lead us to believe that Model 2 is significant. In addition, a few but not all of the beta coefficients are individually statistically significant.

Model 3

Summary Output for Regression Statistics

Multiple R: 0.9710, which indicates a strong positive linear relationship between the dependent variable and the independent variables in the model.
R Square: 0.9428, which means that the independent variables in the model explain 94.28% of the variance in the dependent variable.

Adjusted R Square: 0.9409, after accounting for the number of independent variables in the model.

Standard Error: 62552.35.

ANOVA Table

df: 11 degrees of freedom for the regression and 338 for the residual, for a total of 349.

SS: 2.18E+13 for the regression, 1.32E+12 for the residual, and 2.31E+13 in total.

MS: 1.98E+12 for the regression and 3.91E+09 for the residual.

F: 506.1402, which indicates that the model has a significant overall fit.

Significance F: 1.85E-202, which is very small and indicates that the model is highly significant.

Overall, the ANOVA table suggests that the regression model has a good overall fit and that the independent variables, taken together, are highly significant in explaining the variation in the dependent variable.

The regression statistics and ANOVA results lead us to believe that Model 3 is significant. However, only one beta coefficient seems to be individually significant, as indicated by the green highlight in the output.

Model 4

Summary Output for Regression Statistics

Multiple R: 0.9739, which indicates a strong positive linear relationship between the dependent variable and the independent variables in the model.

R Square: 0.9486, which means that the independent variables in the model explain 94.86% of the variance in the dependent variable.

Adjusted R Square: 0.9463, after accounting for the number of independent variables in the model.

Standard Error: 59650.47.

ANOVA Table

df: 15 degrees of freedom for the regression and 334 for the residual, for a total of 349.

SS: 2.1919E+13 for the regression, 1.1884E+12 for the residual, and 2.3107E+13 in total.

MS: 1.4613E+12 for the regression and 3558178663 for the residual.

F: 410.6736, which suggests that the model has a significant overall fit.

Significance F: 7.3256E-205, which is very small and indicates that the model is highly significant.

Overall, the ANOVA table suggests that the regression model has a good overall fit and that the independent variables, taken together, are highly significant in explaining the variation in the dependent variable.
From the previous output, we know that the overall model seems like a good fit for the data. However, most of the beta coefficients are not individually statistically significant. There are several possible reasons for this:

  • The model might be overfitting
  • There might be multicollinearity between the independent variables

Model Comparisons

With our four models, we use nested-model F tests to compare the models and find the most appropriate one for predicting sale price. We conduct these tests at α = 0.05. We also hypothesize that Models 3 and 4 may build redundancy and multicollinearity into the model through interaction terms that may not be necessary if a simpler model is strong enough. The following table shows that Model 4 has the lowest MSE and s and the highest adjusted R² (R²a).

Model   MSE             R²a     s
1       4,048,883,939   0.939   63,630.8
2       3,926,580,397   0.941   62,662.4
3       3,910,000,000   0.941   62,552.4
4       3,558,178,663   0.946   59,650.5
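The columns of the comparison table are linked: s is the square root of MSE. The short sketch below checks that relationship and ranks the models using the figures from the table. The dictionary layout and function name are assumptions for illustration. Model 3's MSE is quoted to only three significant figures, so its square root matches the tabulated s only roughly.

```python
import math

# Figures copied from the comparison table (MSE, adjusted R-squared, s).
MODELS = {
    1: {"mse": 4_048_883_939, "adj_r2": 0.939, "s": 63_630.8},
    2: {"mse": 3_926_580_397, "adj_r2": 0.941, "s": 62_662.4},
    3: {"mse": 3_910_000_000, "adj_r2": 0.941, "s": 62_552.4},
    4: {"mse": 3_558_178_663, "adj_r2": 0.946, "s": 59_650.5},
}


def root_mse(mse):
    """Estimated standard deviation of the error term, s = sqrt(MSE)."""
    return math.sqrt(mse)


lowest_mse = min(MODELS, key=lambda m: MODELS[m]["mse"])
highest_adj_r2 = max(MODELS, key=lambda m: MODELS[m]["adj_r2"])

for number, stats in MODELS.items():
    print(number, round(root_mse(stats["mse"]), 1), stats["s"])
print("lowest MSE:", lowest_mse, "| highest adjusted R-squared:", highest_adj_r2)
```

A lower MSE alone does not settle the choice of model, since adding terms can only reduce SSE; the nested F tests address whether the reduction is statistically meaningful.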
Test #1: Model 1 versus Model 2 (F-Test for Comparing Nested Models)

$H_0: \beta_3 = \beta_4 = \beta_5 = 0$
$H_a$: At least one of the β coefficients being tested does not equal 0

We wish to test the null hypothesis that the type of neighbourhood does not contribute statistically significant information for predicting the sale price.

$F = \dfrac{(SSE_R - SSE_C) / (\text{number of } \beta \text{ parameters in } H_0)}{MSE_C}$

where $SSE_R$ is the SSE of the reduced model, $SSE_C$ is the SSE of the complete model, and $MSE_C$ is the MSE of the complete model.

$F = \dfrac{(1404960000000 - 1350740000000)/3}{3926580397} \approx 4.60$

Rejection region: the nested-model F test is upper-tailed. With 3 numerator and 344 denominator degrees of freedom at α = 0.05, we reject $H_0$ when F exceeds approximately 2.63.

Conclusion: Reject $H_0$. There is significant evidence to suggest that $\beta_3$, $\beta_4$, and $\beta_5$ contribute to the prediction of y. In context, this result implies that the appraisers are not assigning appraised values in such a way that the first-order relationship between sale price (y) and appraised values (x1 and x2) is the same across all neighbourhoods.

Test #2: Model 2 versus Model 3

$H_0: \beta_6 = \beta_7 = \beta_8 = \beta_9 = \beta_{10} = \beta_{11} = 0$
$H_a$: At least one of the β coefficients being tested does not equal 0

$F = \dfrac{(SSE_R - SSE_C) / (\text{number of } \beta \text{ parameters in } H_0)}{MSE_C}$

where $SSE_R$ is the SSE of the reduced model, $SSE_C$ is the SSE of the complete model, and $MSE_C$ is the MSE of the complete model.

$F = \dfrac{(1350740000000 - 1322530000000)/6}{3910000000} \approx 1.2025$

Rejection region: with 6 numerator and 338 denominator degrees of freedom at α = 0.05, we reject $H_0$ when F exceeds approximately 2.13.

Conclusion: We fail to reject $H_0$. There is insufficient evidence to suggest that the neighbourhood interaction terms of Model 3 contribute information for the prediction of y. In this context, we do not have enough evidence to suggest that the effect of the land value and improvements value on the sale price depends on the type of neighbourhood.
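The nested-model F statistic is a single ratio, so it is easy to recompute from the SSE and MSE figures quoted for each test. The sketch below does this for Tests #1 and #2. The function name `nested_f_statistic` is hypothetical, and the upper-tail critical value at α = 0.05 for the matching degrees of freedom would still need to be looked up in an F table or obtained from statistical software.

```python
def nested_f_statistic(sse_reduced, sse_complete, num_tested, mse_complete):
    """F statistic for comparing a reduced model with a complete model.

    num_tested is the number of beta parameters set to zero under H0,
    i.e. the number of extra terms in the complete model.
    """
    drop_in_sse_per_term = (sse_reduced - sse_complete) / num_tested
    return drop_in_sse_per_term / mse_complete


# Test #1: Model 1 (reduced) versus Model 2 (complete), 3 parameters tested.
f_test1 = nested_f_statistic(1_404_960_000_000, 1_350_740_000_000, 3, 3_926_580_397)

# Test #2: Model 2 (reduced) versus Model 3 (complete), 6 parameters tested.
f_test2 = nested_f_statistic(1_350_740_000_000, 1_322_530_000_000, 6, 3_910_000_000)

print(f"Test #1 F = {f_test1:.4f}")
print(f"Test #2 F = {f_test2:.4f}")
```

Each statistic is compared against the upper tail only, because a large drop in SSE relative to the complete model's MSE is the evidence that the extra terms matter.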
Test #3: Model 3 versus Model 4

$H_0: \beta_3 = \beta_{13} = \beta_{14} = \beta_{15} = 0$
$H_a$: At least one of these parameters does not equal 0

$F = \dfrac{(SSE_R - SSE_C) / (\text{number of } \beta \text{ parameters in } H_0)}{MSE_C}$

where $SSE_R$ is the SSE of the reduced model, $SSE_C$ is the SSE of the complete model, and $MSE_C$ is the MSE of the complete model.

$F = \dfrac{(1322530000000 - 1188430000000)/4}{3558178663} \approx 9.4220$

Rejection region: with 4 numerator and 334 denominator degrees of freedom at α = 0.05, we reject $H_0$ when F exceeds approximately 2.40.

Conclusion: Reject $H_0$. There is significant evidence to suggest that the x1x2 interaction terms in Model 4 contribute to the prediction of y. In context, the effect of the land value on the sale price appears to depend on the improvements value (and vice versa), and this interaction may differ across neighbourhoods.

Conclusions

The results of our model comparisons lead us to believe that Model 2 is the best of the four models for modelling the sale price of a home on the housing market. The results of the global F test show a high level of significance of the independent variables to the dependent variable. In addition, the adjusted R² (R²a) was relatively high: it indicates that ~94.1% of the variability in sale price is explained by Model 2's quantitative variables, land value and improvements value, and its qualitative variable, neighbourhood.

$\hat{y} = 33366.82971 + 1.810093704\,x_1 + 0.81177063\,x_2 + 38726.00396\,x_3 + 40761.17325\,x_4 + 16216.33924\,x_5$

ŷ is the predicted value of the dependent variable.

Quantitative independent variables:

  • x1 is the value of the independent variable "Land_Value"
  • x2 is the value of the independent variable "Improvements_Value"

Qualitative independent variables:
  • x3 is the dummy variable for the neighbourhood "CHEVAL": x3 = 1 if "CHEVAL", x3 = 0 if not
  • x4 is the dummy variable for the neighbourhood "AVILA": x4 = 1 if "AVILA", x4 = 0 if not
  • x5 is the dummy variable for the neighbourhood "NORTHDALE": x5 = 1 if "NORTHDALE", x5 = 0 if not

The predicted sale price for a property in the Town & Country neighbourhood with a land value of $55,160 and an improvements value of $148,453 is ~$253,721. We are 95% confident that the sale price of such a property falls within the range of $130,024 to $377,418.
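The point prediction can be reproduced directly from the fitted Model 2 equation. In the sketch below the coefficients are copied from that equation; the function name `predict_sale_price` and the dictionary layout are assumptions for illustration. The base neighbourhood has no dummy term, so its adjustment is zero. The 95% interval is not recomputed here because it depends on quantities from the regression output that the text does not list.

```python
# Coefficients copied from the fitted Model 2 equation.
INTERCEPT = 33366.82971
LAND_COEFFICIENT = 1.810093704
IMPROVEMENTS_COEFFICIENT = 0.81177063

# Intercept shift for each neighbourhood relative to the base level.
NEIGHBOURHOOD_SHIFT = {
    "TOWN & COUNTRY": 0.0,  # base level: x3 = x4 = x5 = 0
    "CHEVAL": 38726.00396,
    "AVILA": 40761.17325,
    "NORTHDALE": 16216.33924,
}


def predict_sale_price(land_value, improvements_value, neighbourhood):
    """Point prediction of sale price (in dollars) from Model 2."""
    return (
        INTERCEPT
        + LAND_COEFFICIENT * land_value
        + IMPROVEMENTS_COEFFICIENT * improvements_value
        + NEIGHBOURHOOD_SHIFT[neighbourhood.upper()]
    )


prediction = predict_sale_price(55_160, 148_453, "Town & Country")
print(f"Predicted sale price: ${prediction:,.0f}")
```

Changing only the neighbourhood argument shifts the prediction by the corresponding dummy coefficient, which is exactly the constant-difference assumption that Model 2 makes.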