Manish Kumar Yadav
Module 5 R Practice
ALY 6010: Probability Theory and Introductory Statistics
Dr. A. Narayan
March 26, 2024
Abstract

This analysis explores the factors influencing house prices in a real estate dataset. By examining house age, distance to the nearest MRT station, and the number of nearby convenience stores, we aim to understand how each relates to house price per unit area. The study uses correlation analysis to identify relationships between variables and linear regression to model the relationship between the predictors and house price. The results can inform real estate stakeholders about key determinants of property values.
Introduction

The dataset under investigation contains seven real estate attributes: transaction date, house age, distance to the nearest MRT station, number of convenience stores, latitude, longitude, and house price per unit area. These attributes represent key factors that may influence house prices in each area.

Attributes:
Transaction Date: The date of the property transaction.
House Age: The age of the house in years.
Distance to MRT: The distance from the property to the nearest Mass Rapid Transit (MRT) station.
Number of Convenience Stores: The count of convenience stores within a certain radius of the property.
Latitude: The geographic latitude of the property location.
Longitude: The geographic longitude of the property location.
House Price Unit Area: The price of the property per unit area.

Purpose of Analysis: The primary objective of this analysis is to understand the relationships between these attributes and house prices. Correlation analysis identifies which attributes are most strongly correlated with house price per unit area. A linear regression model then quantifies how house price changes with each attribute, providing insights for real estate stakeholders.

A correlation chart is diagnostic and should not be larger than 5 variables for reporting purposes. Why is this?

A correlation chart should not include more than 5 variables for reporting purposes because the number of pairwise relationships grows quickly with each added variable. Beyond 5 variables the chart becomes cluttered, and meaningful patterns or correlations are hard to pick out.
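One way to see why charts get crowded quickly is to count the pairs a chart has to display. As a rough illustration (the counts follow from the standard combination formula, not from the report itself), a chart with k variables shows k(k-1)/2 distinct pairwise correlations:

```latex
\[
\binom{k}{2} = \frac{k(k-1)}{2}, \qquad
\binom{3}{2} = 3, \quad
\binom{5}{2} = 10, \quad
\binom{7}{2} = 21
\]
```

A single chart of all seven attributes in this dataset would therefore carry 21 coefficients, compared with 3 for a three-variable chart.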
# Load required packages
library(readxl)    # used to read the xlsx file
## Warning: package 'readxl' was built under R version 4.3.3
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
##     chisq.test, fisher.test
library(magrittr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
library(stats)

# Read the Excel file into a data frame called df and give the columns readable names
df <- read_excel("Real estate valuation data set.xlsx") %>%
  clean_names() %>%
  rename(
    Transaction_Date = x1_transaction_date,
    House_Age = x2_house_age,
    Distance_to_MRT = x3_distance_to_the_nearest_mrt_station,
    Num_Convenience_Stores = x4_number_of_convenience_stores,
    Latitude = x5_latitude,
    Longitude = x6_longitude,
    House_Price_Unit_Area = y_house_price_of_unit_area
  )

# Count missing values (cells that are NA or contain "?")
missing_values <- sum(df == "?" | is.na(df), na.rm = TRUE)
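The missing-value check above returns a single total. A per-column breakdown can be easier to act on. The following is a minimal sketch on a made-up toy data frame (the `toy` values and the name `missing_per_column` are illustrative assumptions, not part of the report):

```r
# Toy data frame standing in for the real data set (values are made up).
toy <- data.frame(
  House_Age = c(12.5, NA, 30.1),
  Distance_to_MRT = c("84.9", "?", "561.9"),
  stringsAsFactors = FALSE
)

# TRUE wherever a cell is NA or holds the "?" placeholder.
is_missing <- is.na(toy) | toy == "?"

# Count flagged cells per column; na.rm = TRUE is kept as a safeguard,
# mirroring the call used in the report.
missing_per_column <- colSums(is_missing, na.rm = TRUE)
print(missing_per_column)
```

Applied to `df`, the same two lines would show which columns, if any, need cleaning before `cor()` or `lm()` is called.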
#========================= Part 1: Correlation =========================#
# Compute correlation matrices for different subsets of variables

# Chart 1:
correlation_matrix_1 <- cor(df[, c("House_Age", "Distance_to_MRT", "House_Price_Unit_Area")])

# Correlation heatmap
corrplot(correlation_matrix_1, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)

# corrplot() draws with base graphics, which ggsave() does not capture,
# so the chart is redrawn inside a png() device to save it
png("circular_correlation_heatmap_1.png", width = 8, height = 6, units = "in", res = 300)
corrplot(correlation_matrix_1, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)
dev.off()

Variables: House_Age, Distance_to_MRT, House_Price_Unit_Area

Analytical Findings: The correlation heatmap shows a weak negative correlation between House_Age and House_Price_Unit_Area (r ≈ -0.21), indicating that older houses tend to have somewhat lower prices. There is also a strong negative correlation between Distance_to_MRT and House_Price_Unit_Area (r ≈ -0.67), suggesting that houses closer to MRT stations tend to have higher prices.
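Labels such as "weak", "moderate", and "strong" depend on the cutoffs chosen. The sketch below assumes rule-of-thumb cutoffs of 0.3 and 0.6 on the absolute value of r. These thresholds and the helper name `describe_strength` are illustrative assumptions, not something the report or R defines. It is applied to the two coefficients that the report's correlation tests produce.

```r
# Hypothetical helper: turn a correlation coefficient into a verbal label.
# The cutoffs (0.3 and 0.6) are an assumed rule of thumb, not a standard.
describe_strength <- function(r, weak_cutoff = 0.3, strong_cutoff = 0.6) {
  magnitude <- abs(r)
  strength <- if (magnitude < weak_cutoff) {
    "weak"
  } else if (magnitude < strong_cutoff) {
    "moderate"
  } else {
    "strong"
  }
  direction <- if (r < 0) "negative" else "positive"
  paste(strength, direction)
}

age_label <- describe_strength(-0.210567)   # House_Age vs. price
mrt_label <- describe_strength(-0.6736129)  # Distance_to_MRT vs. price
print(c(age_label, mrt_label))
```

Stating the cutoffs next to the labels keeps the descriptions consistent from chart to chart.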
# Chart 2:
correlation_matrix_2 <- cor(df[, c("Num_Convenience_Stores", "Latitude", "House_Price_Unit_Area")])

# Correlation heatmap
corrplot(correlation_matrix_2, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)

# Redraw inside a png() device to save the chart (ggsave() does not capture base graphics)
png("circular_correlation_heatmap_2.png", width = 8, height = 6, units = "in", res = 300)
corrplot(correlation_matrix_2, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)
dev.off()

Variables: Num_Convenience_Stores, Latitude, House_Price_Unit_Area

Analytical Findings: This heatmap reveals a weak positive correlation between Num_Convenience_Stores and House_Price_Unit_Area, implying that areas with more convenience stores may have higher house prices. Latitude shows a weak correlation with House_Price_Unit_Area.

# Chart 3:
correlation_matrix_3 <- cor(df[, c("House_Age", "Longitude", "House_Price_Unit_Area")])

# Correlation heatmap
corrplot(correlation_matrix_3, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)

# Redraw inside a png() device to save the chart (ggsave() does not capture base graphics)
png("circular_correlation_heatmap_3.png", width = 8, height = 6, units = "in", res = 300)
corrplot(correlation_matrix_3, method = "circle", type = "upper", tl.col = "black", tl.srt = 45)
dev.off()

Variables: House_Age, Longitude, House_Price_Unit_Area

Analytical Findings: In this heatmap, House_Age exhibits a negative correlation with House_Price_Unit_Area, as in Chart 1. Longitude shows a weak correlation with House_Price_Unit_Area.

# Perform correlation tests
cor_test_age_price <- cor.test(df$House_Age, df$House_Price_Unit_Area)
cor_test_mrt_price <- cor.test(df$Distance_to_MRT, df$House_Price_Unit_Area)
# Statistical outputs
cat("Correlation Test Results between House Age and House Price Unit Area:\n")
## Correlation Test Results between House Age and House Price Unit Area:
print(cor_test_age_price)
##
##  Pearson's product-moment correlation
##
## data:  df$House_Age and df$House_Price_Unit_Area
## t = -4.3721, df = 412, p-value = 1.56e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3008396 -0.1165546
## sample estimates:
##       cor
## -0.210567

Test Results between House Age and House Price Unit Area:
Pearson's correlation coefficient: -0.210567
t-value: -4.3721
Degrees of freedom (df): 412
p-value: 1.56e-05

Findings: Pearson's correlation coefficient of -0.210567 indicates a weak negative correlation between house age and house price per unit area. As the age of the house increases, the price per unit area tends to decrease.

Interpretation: With a statistically significant p-value (1.56e-05 < 0.05), we reject the null hypothesis of zero correlation and conclude that house age and house price per unit area are significantly correlated in our dataset. The 95 percent confidence interval (-0.30 to -0.12) excludes zero, but the relationship is weak: older houses tend to have somewhat lower prices per unit area than newer ones.
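The reported t-value can be reproduced by hand from the correlation coefficient and the sample size. As a quick check, using the standard test statistic for Pearson's r and taking n = 414 (inferred from df = n - 2 = 412):

```latex
\[
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{-0.210567\,\sqrt{412}}{\sqrt{1-(-0.210567)^{2}}}
  \approx \frac{-0.210567 \times 20.2978}{0.97758}
  \approx -4.372
\]
```

This agrees with the t = -4.3721 printed by cor.test(). It also shows why a weak correlation can still be highly significant: with 412 degrees of freedom, even a small r yields a large t.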
Recommendations: Real estate investors and property developers should consider the age of a property when evaluating its market value. Older houses may require additional maintenance and renovation costs, which should be factored into investment decisions. Marketing strategies may also need to target specific market segments based on preferences for older or newer properties.

cat("Correlation Test Results between Distance to MRT and House Price Unit Area:\n")
## Correlation Test Results between Distance to MRT and House Price Unit Area:
print(cor_test_mrt_price)
##
##  Pearson's product-moment correlation
##
## data:  df$Distance_to_MRT and df$House_Price_Unit_Area
## t = -18.5, df = 412, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7230493 -0.6173117
## sample estimates:
##        cor
## -0.6736129

Correlation Test Results between Distance to MRT and House Price Unit Area:
Correlation coefficient: -0.6736129
t-value: -18.5
Degrees of freedom (df): 412
p-value: < 2.2e-16

Findings: The correlation coefficient of -0.67 indicates a strong negative correlation between distance to the nearest MRT station and house price per unit area. As the distance to the MRT station increases, house prices tend to decrease.

Interpretation: With a highly significant p-value (p < 2.2e-16, well below 0.05), we reject the null hypothesis of zero correlation and conclude that there is a strong negative correlation between distance to MRT and house price per unit area. Proximity to MRT stations is strongly associated with property values, with properties closer to MRT stations commanding higher prices per unit area.

Recommendations: Given the strong correlation between distance to MRT stations and house prices, real estate investors and developers should consider proximity to public transportation when evaluating properties. Properties located near MRT stations may offer greater market demand and the potential for higher returns on investment. Marketing strategies should therefore emphasize the accessibility and convenience of such properties to attract potential buyers or tenants.

#=========================== Part 2: Regression ===========================#

How does regression analysis differ from correlation analysis? Provide several sentences discussing the key results.

Correlation analysis measures the strength and direction of the linear relationship between two variables. Regression analysis models the relationship between a dependent variable and one or more independent variables, so it can be used to predict outcomes. Regression also estimates how much the dependent variable changes per unit change in each predictor while the other predictors are held fixed, whereas correlation only assesses the association between a pair of variables. Neither method establishes causation on its own.

# Fit linear regression model
lm_model <- lm(House_Price_Unit_Area ~ House_Age + Distance_to_MRT + Num_Convenience_Stores, data = df)
lm_model
##
## Call:
## lm(formula = House_Price_Unit_Area ~ House_Age + Distance_to_MRT +
##     Num_Convenience_Stores, data = df)
##
## Coefficients:
##            (Intercept)               House_Age         Distance_to_MRT  Num_Convenience_Stores
##              42.977286               -0.252856               -0.005379                1.297442
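The coefficients printed above define a prediction equation: intercept plus each coefficient times its predictor. The following minimal sketch evaluates that equation by hand. The helper name `predict_price` and the example inputs (a 10-year-old house, a distance of 500 to the MRT station, 5 convenience stores) are hypothetical values chosen for illustration, expressed in whatever units the dataset uses.

```r
# Coefficients copied from the lm() output in the report.
coefs <- c(
  Intercept = 42.977286,
  House_Age = -0.252856,
  Distance_to_MRT = -0.005379,
  Num_Convenience_Stores = 1.297442
)

# Hypothetical helper: evaluate the fitted linear equation for one property.
predict_price <- function(house_age, distance_to_mrt, num_stores) {
  coefs[["Intercept"]] +
    coefs[["House_Age"]] * house_age +
    coefs[["Distance_to_MRT"]] * distance_to_mrt +
    coefs[["Num_Convenience_Stores"]] * num_stores
}

# Made-up example property.
example_price <- predict_price(house_age = 10, distance_to_mrt = 500, num_stores = 5)
print(round(example_price, 2))  # 44.25
```

With the fitted model object available, `predict(lm_model, newdata = ...)` would give the same result without copying coefficients by hand.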
Linear Regression Model Results: The linear regression model fitted to predict house price per unit area from house age, distance to MRT, and number of convenience stores yielded the following coefficients:
Intercept: 42.977286
House Age: -0.252856
Distance to MRT: -0.005379
Number of Convenience Stores: 1.297442

Interpretation: The intercept of 42.977286 is the estimated house price per unit area when all predictor variables are zero. Holding the other predictors constant, each one-unit increase in house age lowers the predicted house price per unit area by approximately 0.252856 units, and each one-unit increase in distance to MRT lowers it by approximately 0.005379 units. Conversely, each additional convenience store nearby raises the predicted house price per unit area by approximately 1.297442 units.

Conclusion: The linear regression model suggests that house age, distance to MRT, and the number of nearby convenience stores are all related to house price per unit area. The coefficients give the direction and magnitude of each relationship, which supports a better understanding and prediction of property values. The printed output shows only the coefficients, so the statistical significance of each predictor would need to be confirmed from summary(lm_model). The plot below helps in understanding this linear regression model.

A new data frame, plot_data, is created with three columns: the observed house price per unit area, the fitted house price per unit area predicted by the linear regression model, and house age. The predict() function generates the fitted values from lm_model.

# Create data frame for plotting
plot_data <- data.frame(
  Observed = df$House_Price_Unit_Area,
  Fitted = predict(lm_model),
  House_Age = df$House_Age
)

# Plot fitted vs. observed values with a regression line
ggplot(plot_data, aes(x = Observed, y = Fitted)) +
  geom_point(color = "blue", size = 3) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(x = "Observed House Price", y = "Fitted House Price", title = "Fitted vs. Observed House Price") +
  theme_minimal() +
  theme(legend.position = "top", legend.title = element_blank()) +
  guides(color = guide_legend(title = "Regression Line"))
## `geom_smooth()` using formula = 'y ~ x'

Insights:
House Age Effect: Older houses tend to have lower prices per unit area, reflecting potential maintenance and depreciation factors.
Proximity to MRT: Properties located farther from MRT stations generally have lower prices per unit area, highlighting the importance of access to public transportation.
Convenience Stores Impact: Properties in areas with more convenience stores tend to have higher prices per unit area, indicating the influence of local amenities on property values.
Overall Model Fit: The fitted-versus-observed plot suggests that the model explains a meaningful proportion of the variability in house price per unit area using the included predictors.
Predictive Power: The model can be used to predict house price per unit area for new properties, or to assess how changes in the predictor variables affect predicted property values, supporting data-driven decisions in the real estate market.

Conclusion

Through correlation analysis and linear regression modeling, this study sheds light on the factors influencing house prices in the real estate market. By examining house age, proximity to transportation hubs, and the availability of nearby amenities, stakeholders can gain insight into property valuation trends. These findings can inform property investment, pricing strategies, and market positioning, contributing to better-informed, data-driven decisions in the real estate industry.
References

UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/477/real+estate+valuation+data+set