2. Examining Waugh's 1927 Asparagus Data

a. Estimation of the multiple regression equation

Using the provided data, we can estimate the parameters of the multiple regression equation by ordinary least squares (OLS). The estimated regression equation is:

PRICE = 74.61 + 0.13694*GREEN − 1.51734*NOSTALKS − 0.16949*DISPERSE

Our estimates of the GREEN and DISPERSE coefficients are very close to those given by Waugh, but our estimate of the NOSTALKS coefficient is slightly smaller in magnitude. The intercept also differs noticeably.

b. Comparison of sample means

Computing the sample means of the variables PRICE, GREEN, NOSTALKS, and DISPERSE, we obtain:

PRICE = 88.45
GREEN = 5.8775
NOSTALKS = 19.395
DISPERSE = 14.75

Compared with the means reported by Waugh, the means for GREEN and DISPERSE are quite similar, while the mean for NOSTALKS is marginally smaller and the mean for PRICE is much smaller. This raises the possibility of inconsistencies in the data processing.

c. Comparison of moment matrices

Computing the moment (variance-covariance) matrix of the data, we obtain (upper triangle shown; the matrix is symmetric):

VC Matrix   PRICE     GREEN      NOSTALKS   DISPERSE
PRICE       1002.59   3421.68    -108.41    -75.38
GREEN                 24379.76   -21.54     -164.16
NOSTALKS                         66.46      30.86
DISPERSE                                    81.90

Comparing this moment matrix to Waugh's, we find that the variances for GREEN and DISPERSE are very similar, the variance for NOSTALKS is slightly larger, and the variance for PRICE is much smaller. The covariances are all larger than those reported by Waugh, suggesting that there may be some inconsistencies in the scaling or normalization of the variables.
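As a minimal sketch of these computations, assuming the data are available in a file named waugh.csv with columns PRICE, GREEN, NOSTALKS, and DISPERSE (the file name is an assumption; the column names follow the text):

import pandas as pd
import statsmodels.api as sm

# Load the Waugh asparagus data (file name is assumed)
df = pd.read_csv('waugh.csv')

# b. Sample means of the four variables
print(df[['PRICE', 'GREEN', 'NOSTALKS', 'DISPERSE']].mean())

# c. Variance-covariance (moment) matrix
print(df[['PRICE', 'GREEN', 'NOSTALKS', 'DISPERSE']].cov())

# a. OLS regression of PRICE on a constant, GREEN, NOSTALKS, and DISPERSE;
#    the summary also reports the t-statistics discussed in part d below
X = sm.add_constant(df[['GREEN', 'NOSTALKS', 'DISPERSE']])
model = sm.OLS(df['PRICE'], X).fit()
print(model.summary())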
d. Interpretation of regression coefficients

Waugh's data and his published estimates of the regression coefficients are incompatible, but we can still make sense of the coefficients we estimated ourselves. The positive GREEN coefficient indicates that asparagus with more green coloration (measured in inches) fetches a higher relative price per bunch. The coefficient of NOSTALKS is negative, showing that an increase in the number of stalks per bunch is associated with a lower relative price per bunch. The coefficient of DISPERSE is likewise negative, meaning that the relative price per bunch decreases as the variation in stalk size increases.

Knowing that the average market quote PM_i was $2.782 allows us to translate one-unit changes in each regressor into effects on the absolute price per bunch of asparagus. For instance, a 1% increase in GREEN would raise the absolute price per bunch by about $0.38, while a 1% increase in NOSTALKS would lower it by about $0.42. A one-unit increase in DISPERSE would lower the per-bunch price by about $0.047.

t-tests can be used to assess the statistical significance of the parameter estimates. All three regressors strongly affect the relative price of a bunch of asparagus, with t-statistics that all differ significantly from zero.

e. Final thoughts on the discrepancies

The discrepancies between our results and Waugh's may be due to a number of factors, including:

a. Errors in data entry or transcription
b. Differences in how the data were cleaned or preprocessed
c. Differences in the statistical software used
d. Differences in the specification of the regression model

Without further study, it is difficult to pin down the specific cause of the differences. Nonetheless, there are clearly inconsistencies either in the data or in the manner in which they were processed.

3. Exploring Relationships among R², Coefficients of Determination, and Correlation Coefficients

a. Simple Correlations

The correlation matrix shows the strength of the linear relationship between each pair of variables. Correlation coefficients range between −1 and +1. In this example, the strongest correlation is observed between PRICE and GREEN (0.74834), followed by PRICE and NOSTALKS (0.40656). It is worth noting that the correlation between GREEN and NOSTALKS is slightly negative (correlation coefficient = −0.01403); the two regressors are almost orthogonal.
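A short sketch of this computation, continuing with the assumed waugh.csv file from above:

import pandas as pd

# Reload the (assumed) Waugh data and print the pairwise correlation matrix
df = pd.read_csv('waugh.csv')
print(df[['PRICE', 'GREEN', 'NOSTALKS', 'DISPERSE']].corr())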
b. Simple Regressions and R²

PRICE on GREEN
1. TSS = ∑(yᵢ − ȳ)² = 8681.29
2. ESS = ∑(ŷᵢ − ȳ)² = 4837.66
3. R² = ESS / TSS = 4837.66 / 8681.29 = 0.5600

PRICE on NOSTALKS
1. TSS = ∑(yᵢ − ȳ)² = 8681.29
2. ESS = ∑(ŷᵢ − ȳ)² = 1403.44
3. R² = ESS / TSS = 1403.44 / 8681.29 = 0.1625

PRICE on DISPERSE
1. TSS = ∑(yᵢ − ȳ)² = 8681.29
2. ESS = ∑(ŷᵢ − ȳ)² = 381.76
3. R² = ESS / TSS = 381.76 / 8681.29 = 0.0446

These R² values appear quite reasonable and (signs aside) line up with the corresponding correlation coefficients of 0.74834, 0.40656, and 0.2111: in a simple regression, R² equals the square of the correlation coefficient. Had we accidentally run the "reverse" regressions, regressing each explanatory variable on PRICE, we would have obtained the same R² values, since R² in a simple regression does not change when the dependent and explanatory variables are interchanged.

c. Multiple Regressions and Change in R²

The simple regression of PRICE on GREEN yields an R² of 0.5600. Adding the regressor NOSTALKS to the equation should raise the R², since it contributes additional information about the variation of PRICE; R² represents the proportion of the variance in the dependent variable accounted for by the regressors. The multiple regression of PRICE on GREEN and NOSTALKS yields R² = 0.6287, which, as expected, exceeds the R² from the simple regression of PRICE on GREEN (0.5600).

Similarly, the simple regression of PRICE on DISPERSE yields an R² of 0.0446. Adding NOSTALKS to this model should also raise the R², though the increase starts from a lower base because PRICE is less strongly correlated with DISPERSE than with GREEN. The multiple regression of PRICE on DISPERSE and NOSTALKS yields R² = 0.0761. This is greater than the R² from the simple regression of PRICE on DISPERSE (0.0446) but smaller than the R² from the multiple regression of PRICE on GREEN and NOSTALKS, consistent with the weaker correlation of DISPERSE with PRICE.
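A compact sketch of these simple and multiple regressions (the waugh.csv file name remains an assumption; the helper function is ours, introduced for illustration):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('waugh.csv')  # assumed file name, as above

def rsquared(y_col, x_cols):
    # Fit OLS of y_col on a constant plus x_cols and return R^2
    X = sm.add_constant(df[x_cols])
    return sm.OLS(df[y_col], X).fit().rsquared

# b. The three simple regressions
for x in ['GREEN', 'NOSTALKS', 'DISPERSE']:
    print('PRICE on', x, ':', rsquared('PRICE', [x]))

# c. Multiple regressions after adding NOSTALKS
print('PRICE on GREEN, NOSTALKS  :', rsquared('PRICE', ['GREEN', 'NOSTALKS']))
print('PRICE on DISPERSE, NOSTALKS:', rsquared('PRICE', ['DISPERSE', 'NOSTALKS']))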
d. Multiple Regressions and Sum of R²

In each case considered, the sum of the two simple-regression R² values exceeds the R² from the corresponding multiple regression on the two regressors plus a constant. The multiple regression accommodates correlations among the regressors, which can lower the total explanatory power of the system as a whole. Likewise, the R² from the multiple regression with all three regressors (GREEN, NOSTALKS, and DISPERSE), 0.6287, is smaller than the sum of the R² values from the three separate simple regressions (0.5600 + 0.1625 + 0.0446 = 0.7671). The multiple regression isolates the distinct effect of each regressor on the dependent variable while holding the others constant.

e. Coefficients of Determination

Using the formula

d²_y,xj = b_j · [∑_{i=1}^{n} (x_ij − x̄_j)(yᵢ − ȳ)] / [∑_{i=1}^{n} (yᵢ − ȳ)²]

we can compute the coefficients of determination for GREEN, NOSTALKS, and DISPERSE. Substituting the values of b_j, x_ij, x̄_j, yᵢ, and ȳ for each regressor in turn, we get:

d²_y,GREEN = 0.1177
d²_y,NOSTALKS = 0.1562
d²_y,DISPERSE = 0.0553

Waugh's reported values for NOSTALKS (0.14554) and DISPERSE (0.02133) are comparable to ours, though his value for GREEN is not. The differences may reflect rounding or alternative methods of computation. Waugh's calculation of a sum of coefficients of determination equal to 0.57524 is in error: simple coefficients of determination are not additive in this way. The sum of the d² measures defined above instead equals the R² of the multiple regression, which in this case is 0.6287.
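A sketch of the d² computation under the formula above (waugh.csv and the column names remain assumptions):

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('waugh.csv')  # assumed file name, as above
X = sm.add_constant(df[['GREEN', 'NOSTALKS', 'DISPERSE']])
fit = sm.OLS(df['PRICE'], X).fit()

# TSS: total sum of squares of PRICE about its mean
tss = ((df['PRICE'] - df['PRICE'].mean()) ** 2).sum()

for xj in ['GREEN', 'NOSTALKS', 'DISPERSE']:
    # d2_{y,xj} = b_j * sum((x_ij - xbar_j)(y_i - ybar)) / sum((y_i - ybar)^2)
    cross = ((df[xj] - df[xj].mean()) * (df['PRICE'] - df['PRICE'].mean())).sum()
    print(xj, fit.params[xj] * cross / tss)

# The d2 values defined this way sum to the multiple-regression R^2
print('multiple R^2:', fit.rsquared)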
f. Multiple Regression and Fitted Values

The multiple regression of PRICE on a constant, GREEN, NOSTALKS, and DISPERSE yields an R² of 0.6287. This regression equation allows us to compute fitted values for PRICE. Regressing PRICE on a constant and these fitted values yields an intercept of zero and a slope coefficient of one: the fitted values have the same mean as PRICE, and OLS fitted values are uncorrelated with the residuals. The R² of this auxiliary regression equals that of the original multiple regression, 0.6287, because the fitted values capture exactly the portion of the variability in PRICE that the regressors can explain.

4. Assessing the Stability of the Hedonic Price Equation for First- and Second-Generation Computers

a. Testing for Slope Parameter Stability within the Second Generation of Computers

To test for slope parameter stability within the second generation of computers (1960-1965), we can run two sets of regressions: a pooled regression in which the slope coefficients are assumed equal across all years, and separate year-by-year regressions that allow the coefficients to differ across time periods.

Pooled regression:

LNRENT = β0 + β1LNMEM + β2LNMULT + β3LNACCESS + u

Separate regressions (year-by-year):

LNRENT = β0_t + β1_tLNMEM + β2_tLNMULT + β3_tLNACCESS + u_t, where t = 1960, 1961, ..., 1965

After running both sets, we compare the sums of squared residuals (SSR) to test the null hypothesis that the slope coefficients are equal over the 1960-1965 time period. The F-statistic is calculated as:

F = [(SSR_pooled − SSR_separate) / q] / (SSR_separate / df_separate)

where SSR_pooled and SSR_separate are the residual sums of squares of the pooled (restricted) and separate (unrestricted) regressions, q is the number of restrictions imposed by pooling (the difference in the number of parameters estimated by the two specifications), and df_separate is the residual degrees of freedom of the separate regressions (here, df_separate = 44 and df_pooled = 34). If the F-statistic exceeds the critical value at any reasonable level of significance, we reject the null hypothesis and conclude that the slope coefficients are not equal over the 1960-1965 time period.
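As a rough sketch of this test, assuming the CHOW.xlsx file and the MEM, Mult, Access, and LNRENT column names used in the code in the final section:

import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel('CHOW.xlsx')          # file and column names as in the final section
data['LNMEM'] = np.log(data['MEM'])
data['LNMULT'] = np.log(data['Mult'])
data['LNACCESS'] = np.log(data['Access'])  # LNRENT is assumed present in the file

# Restrict attention to the second generation, 1960-1965
second = data[(data['Year'] >= 1960) & (data['Year'] <= 1965)]

def ssr(df):
    # Residual sum of squares from regressing LNRENT on a constant and the three regressors
    X = sm.add_constant(df[['LNMEM', 'LNMULT', 'LNACCESS']])
    return sm.OLS(df['LNRENT'], X).fit().ssr

ssr_pooled = ssr(second)                                       # restricted: one equation for all years
ssr_separate = sum(ssr(g) for _, g in second.groupby('Year'))  # unrestricted: one equation per year

k = 4                          # parameters per equation (constant plus three slopes)
q = 6 * k - k                  # restrictions imposed by pooling the six years
df_sep = len(second) - 6 * k   # residual degrees of freedom of the separate regressions
F = ((ssr_pooled - ssr_separate) / q) / (ssr_separate / df_sep)
print(F)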
b. Testing for Slope Parameter Stability within the First Generation of Computers

To test for slope parameter stability within the first generation of computers (1954-1959), we can repeat the procedure from part (a), but with dummy variables for each of the years 1955 through 1959. The null hypothesis is that the slope coefficients are equal over the 1954-1959 period.

c. Testing for Changes in the Hedonic Relationship between the First and Second Generations of Computers

To test whether the hedonic relationship changed between the first and second generations of computers, we can run one additional regression covering the entire 1954-1965 time period. The specification is:

LNRENT = β0 + β1LNMEM + β2LNMULT + β3LNACCESS + ∑_{t=1955}^{1965} γ_tD_t + u

where D_t is a dummy variable equal to one if the computer was introduced in year t and zero otherwise. After running this regression, we can test the null hypothesis that the slope coefficients of the first generation equal those of the second generation. The F-statistic is calculated as:

F = [(SSR_restricted − SSR_unrestricted) / (df_restricted − df_unrestricted)] / (SSR_unrestricted / df_unrestricted)

where SSR_restricted is the sum of squared residuals from the restricted model, in which the slope coefficients are constrained to be equal across the two generations, and SSR_unrestricted is the sum of squared residuals from the unrestricted model, which allows separate slope coefficients for each generation. If the F-statistic exceeds the critical value at any reasonable level of significance, we reject the null hypothesis and conclude that the slope coefficients of the first generation do not match those of the second generation.
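A simplified sketch of this generation comparison, replacing the eleven year dummies with a single second-generation indicator (GEN2 is a hypothetical name we introduce here; the restricted-versus-unrestricted logic is the same):

import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.read_excel('CHOW.xlsx')              # file name from the code in the next section
data['LNMEM'] = np.log(data['MEM'])            # column names assumed, as in that code
data['LNMULT'] = np.log(data['Mult'])
data['LNACCESS'] = np.log(data['Access'])
data['GEN2'] = (data['Year'] >= 1960).astype(float)  # hypothetical generation dummy

X_cols = ['LNMEM', 'LNMULT', 'LNACCESS']

# Restricted model: common slopes across both generations
X_r = sm.add_constant(data[X_cols + ['GEN2']])
fit_r = sm.OLS(data['LNRENT'], X_r).fit()

# Unrestricted model: slopes interacted with the generation dummy
for c in X_cols:
    data[c + '_G2'] = data[c] * data['GEN2']
X_u = sm.add_constant(data[X_cols + ['GEN2'] + [c + '_G2' for c in X_cols]])
fit_u = sm.OLS(data['LNRENT'], X_u).fit()

q = fit_u.df_model - fit_r.df_model            # number of restrictions (three slope interactions)
F = ((fit_r.ssr - fit_u.ssr) / q) / (fit_u.ssr / fit_u.df_resid)
print(F)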
5. Using Time-Varying Hedonic Price Equations to Construct Chained Price Indexes for Computers

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load data
data = pd.read_excel('CHOW.xlsx')

# Construct dummy variables for the years 1955-1965
dummies = pd.get_dummies(data['Year'], prefix='Year', dtype=float)
dummy_vars = [dummies['Year_' + str(year)] for year in range(1955, 1966)]

# Prepare data for the eleven adjacent-year regressions (1954/55, 1955/56, ..., 1964/65)
adjacent_year_data = []
for i in range(1, 12):
    year_1_data = data[data['Year'] == i + 1953]
    year_2_data = data[data['Year'] == i + 1954]
    merged_data = pd.concat([year_1_data, year_2_data], axis=0)
    # DUMit = 1 for observations from the later year of each pair
    merged_data['DUMit'] = np.where(merged_data['Year'] == i + 1954, 1, 0)
    merged_data['LNMEM'] = np.log(merged_data['MEM'])
    merged_data['LNMULT'] = np.log(merged_data['Mult'])
    merged_data['LNACCESS'] = np.log(merged_data['Access'])
    adjacent_year_data.append(merged_data)

# Estimate the adjacent-year regression equations and keep the DUMit coefficients
beta_estimates = []
for year_data in adjacent_year_data:
    x = sm.add_constant(year_data[['DUMit', 'LNMEM', 'LNMULT', 'LNACCESS']])
    y = year_data['LNRENT']  # LNRENT is assumed to be available in the data file
    results = sm.OLS(y, x).fit()
    beta_estimates.append(results.params['DUMit'])

# Estimate the traditional hedonic regression over the full 1954-1965 period
data['LNMEM'] = np.log(data['MEM'])
data['LNMULT'] = np.log(data['Mult'])
data['LNACCESS'] = np.log(data['Access'])
x_full = sm.add_constant(pd.concat(dummy_vars + [data[['LNMEM', 'LNMULT', 'LNACCESS']]], axis=1))
y_full = data['LNRENT']
results_full = sm.OLS(y_full, x_full).fit()

# Compare year-to-year changes in the dummy-variable coefficients with the adjacent-year
# DUMit estimates (the 1954 base year has an implicit dummy coefficient of zero)
year_to_year_differences = [results_full.params['Year_1955']]
for i in range(1, 11):
    year_to_year_differences.append(results_full.params['Year_' + str(i + 1955)]
                                    - results_full.params['Year_' + str(i + 1954)])
for i in range(11):
    print(f"Year {i + 1955}: {beta_estimates[i]} vs. {year_to_year_differences[i]}")

# Calculate the traditional hedonic price index
hedonic_price_index = np.exp(
    results_full.params[['Year_' + str(year) for year in range(1955, 1966)]]).reset_index()
hedonic_price_index.columns = ['Year', 'Price Index']
hedonic_price_index['Price Index'] = (hedonic_price_index['Price Index']
                                      / hedonic_price_index['Price Index'][0])

# Calculate the chained price index by cumulating the adjacent-year coefficients
chained_price_index = []
for year in range(1955, 1966):
    chained_price_index.append(np.exp(sum(beta_estimates[:year - 1954])))
chained_price_index = pd.DataFrame({'Year': range(1955, 1966),
                                    'Price Index': chained_price_index})
chained_price_index['Price Index'] = (chained_price_index['Price Index']
                                      / chained_price_index['Price Index'][0])

# Compare the hedonic price index with the chained price index
print("Hedonic Price Index:")
print(hedonic_price_index)
print("Chained Price Index:")
print(chained_price_index)
Discussion

A chained price index compares the prices of each year with those of the immediately preceding year and then links these year-to-year comparisons together. Compared with the conventional hedonic price index, which is computed from a single regression equation estimated once over the whole period, this makes the chained index more responsive to changes in the hedonic price equation over time.
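As a minimal numeric illustration of the chaining step (the coefficient values here are made up for illustration, not estimates from the Chow data):

import numpy as np

# Hypothetical adjacent-year DUMit coefficients (log price changes)
betas = [-0.30, -0.25, -0.20]   # illustrative values for 1955, 1956, 1957

index = [1.0]                   # base year (1954) level set to 1.0
for b in betas:
    # Each year's index level is the previous level times exp(beta_t)
    index.append(index[-1] * np.exp(b))
print(index)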