Final Exam: Fall 2021 - Housing Prices
Version 1.0
This exam will test your understanding of PCA and Linear Regression, along with manipulation of Pandas, Numpy, and base Python data structures.

There are 8 exercises, numbered 0 to 7. There are 15 available points. However, to earn 100%, the threshold is just 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise 0: 1 point
Exercise 1: 1 point
Exercise 2: 2 points
Exercise 3: 2 points
Exercise 4: 2 points
Exercise 5: 2 points
Exercise 6: 3 points
Exercise 7: 2 points
Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. One demo cell calls a function defined in an earlier demo; the rest are self-contained. If you see a comment
# demo - used in later demo cell
be sure to run that cell before moving on.
For more information see the "Final Exam Release Notes" post on the discussion forum.
Good luck!
In [ ]:
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
# Note to instructors deploying this in the future - Set all instances of `if False:` to `if True:` and run the notebook once.
# This is necessary to build the hash check and test data files. These lines should be reverted to `if False:` once the files are
# in place.
# dill and hashlib are needed both when building the hash check and when loading it below.
import dill
import hashlib
if False:
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, 'The file "run_tests.py" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
### END HIDDEN TESTS
import pandas as pd
import numpy as np
import run_tests
from time import time
from importlib import reload
time_start = time()
Introduction
Today, we will be working with data on home sales. We will try to use information about the individual homes to build a linear regression model to predict the sale prices. There are challenges to this approach, which will need to be addressed.
- There are missing values in several columns of the data. Our linear regression strategy will not be able to handle this, so we will have to assign values to these observations or remove rows which contain missing values.
- There are categorical variables in the data. The data will have to be manipulated to transform it into all numerical values.
- The data has a large number of columns (especially after handling categorical data). This can add instability to the model, so we want to use fewer columns if possible.
- We will have to compare several models and determine which is the "best".
Metrics for evaluating model fit
Before we can go about choosing a model from many options, we must decide on how to compare them. One common metric for evaluating linear regression models is the "coefficient of determination", often referred to as $R^2$.

- $y$ - column vector of the observed responses (for our purposes, home sale price).
- $\hat{y}$ - vector of predicted responses (for our purposes, predicted home sale price).
- $n$ - number of observations.
- $\bar{y}$ - mean of observed responses - $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
- $SS_{res}$ - Sum of Squares of Residuals - $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- $SS_{tot}$ - Sum of Squares Total - $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- $R^2$ - Coefficient of determination - $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

$R^2$ reveals the share of variance in the response which is accounted for by the model. However, this metric is not effective at comparing models with different numbers of parameters. Specifically, adding parameters will always result in an increase in $R^2$. To compare models of different sizes, there is a modified metric, $R^2_{adj}$, which penalizes larger models.

- $p$ - number of model parameters
- $R^2_{adj}$ - adjusted coefficient of determination - $R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
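To make these definitions concrete, here is a minimal sketch (not part of the exam) that evaluates both metrics directly with NumPy, using the same toy arrays that appear in the demo cells below and assuming p = 2 model parameters as in the Exercise 1 demo.

import numpy as np

y = np.array([5., 10., 3., 5., 7., 9.])          # observed responses (toy values from the demo)
y_hat = np.array([5.5, 8.7, 3.2, 4.5, 7., 8.9])  # predicted responses (toy values from the demo)
p = 2                                            # assumed number of model parameters

n = len(y)
ss_res = ((y - y_hat) ** 2).sum()                # sum of squares of residuals
ss_tot = ((y - y.mean()) ** 2).sum()             # sum of squares total
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2)      # ~0.9369014
print(r2_adj)  # ~0.8948357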
Exercise 0: (1 point) - Given Numpy arrays representing the observed responses (y) and predicted responses (y_hat), complete the function signature for r_sq to compute $R^2$. Return your answer as a Python float.
In [ ]:
def r_sq(y, y_hat):
    ### BEGIN SOLUTION
    y_bar = y.mean()
    ssr = ((y - y_hat)**2).sum()
    sst = ((y - y_bar)**2).sum()
    return 1 - ssr/sst
    ### END SOLUTION
The demo cell below should output approximately 0.93690140.
In [ ]:
# demo
demo_y_0 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_0 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
r_sq(demo_y_0, demo_y_hat_0)
In [ ]:
# exercise 0 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex0(r_sq)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `y`, `y_hat`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, r_sq_true, r_sq_output
Exercise 1: (1 point) - Given Numpy arrays representing the observed responses (y), predicted responses (y_hat), and the number of parameters in the model (p), complete the function signature for r_sq_adj to compute $R^2_{adj}$. If you successfully solved Exercise 0 you can use r_sq in your solution to this exercise; however, you can pass by directly implementing the formula.

Here's the formula again:

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
In [ ]:
def r_sq_adj(y, y_hat, p):
    ### BEGIN SOLUTION
    n = len(y)
    R2 = r_sq(y, y_hat)
    R2_adj = 1 - (1 - R2)*(n-1)/(n-p-1)
    return R2_adj
    ### END SOLUTION
The demo cell below should output approximately 0.89483568.
In [ ]:
# demo
demo_y_1 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_1 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
demo_p_1 = 2
r_sq_adj(demo_y_1, demo_y_hat_1, demo_p_1)
In [ ]:
# exercise 1 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex1(r_sq_adj)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `y`, `y_hat`, `p`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, p_check, r_sq_adj_true, r_sq_adj_output
Cleaning the Data
Exercise 2: (2 points) - The data set has some missing values, columns which are the wrong type, and columns which we are not interested in. Complete the function signature for clean to return a new DataFrame that is a copy of df with the following modifications based on the Python dictionary, params. The values of params are all Python lists of column names.

- Drop all columns in params['cols_to_drop'].
- Convert the values in all columns in params['cols_to_string'] to Python strings.
- Convert missing values in all columns in params['cols_to_zero'] to 0.0, while leaving all other values intact.
- Convert missing values in all columns in params['cols_to_none'] to the Python string 'None' - don't forget the quotation marks!
- Drop any rows which still have missing values after all of the above modifications have been made.

Notes:
- You can assume that params will always have all of the keys mentioned above.
- You can assume that any column included in params['cols_to_string'] does not have missing values.
- You can assume that the params lists are mutually exclusive - i.e. a column included in params['cols_to_string'] will not be in any of the other lists.
For example, given the following DataFrame input:
   foo_drop  bar_drop  foo_zero foo_none bar_none  foo_str  foo  bar
0       1.0       0.0       2.0     foo1      bar        1  foo  bar
1       2.0       NaN       4.0     foo2      bax        2  foo  bar
2       3.0       NaN       NaN     foo3      qux        3  foo  bar
3       4.0       NaN       NaN     foo4      bar        4  foo  bar
4       5.0       NaN       3.0      NaN      bax        5  foo  bar
5       6.0       1.0       4.0      NaN      qux        6  foo  bar
6       NaN       2.0       5.0     foo1      bar        7  foo  bar
7       NaN       3.0       6.0     foo2      bax        8  foo  NaN
8       NaN       4.0       7.0     foo3      qux        9  NaN  bar
9       NaN       5.0       1.0     foo4      NaN       10  foo  bar
With these params:
{
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
The function call to clean should return these results.
   foo_zero foo_none bar_none foo_str  foo  bar
0       2.0     foo1      bar       1  foo  bar
1       4.0     foo2      bax       2  foo  bar
2       0.0     foo3      qux       3  foo  bar
3       0.0     foo4      bar       4  foo  bar
4       3.0     None      bax       5  foo  bar
5       4.0     None      qux       6  foo  bar
6       5.0     foo1      bar       7  foo  bar
9       1.0     foo4     None      10  foo  bar
Notice that the columns 'foo_drop' and 'bar_drop' were dropped entirely, 'foo_zero' had missing values replaced with 0.0 in rows 2 and 3, 'foo_none' and 'bar_none' had missing values replaced with 'None' in rows 4, 5, and 9, the numbers in 'foo_str' were converted to strings, and rows 7 and 8 were dropped because of missing values in 'foo' and 'bar'.

The demo cell below should generate the same output.
In [ ]:
def clean(df, params):
    ### BEGIN SOLUTION
    df = df.copy()
    if 'cols_to_drop' in params:
        df = df.drop(columns=params['cols_to_drop'])
    for c in params.get('cols_to_string', []):
        df[c] = df[c].astype(str)
    for c in params.get('cols_to_zero', []):
        df[c] = df[c].fillna(0.0)
    for c in params.get('cols_to_none', []):
        df[c] = df[c].fillna('None')
    return df.dropna()
    ### END SOLUTION
In [ ]:
# demo
demo_params = {
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
demo_df_2 = pd.DataFrame(
{
'foo_drop': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],
'bar_drop': [0, np.nan, np.nan, np.nan, np.nan, 1, 2, 3, 4, 5],
'foo_zero': [2., 4., np.nan, np.nan, 3., 4., 5., 6., 7., 1],
'foo_none': ['foo1', 'foo2', 'foo3', 'foo4', np.nan, np.nan, 'foo1', 'foo2', 'foo3', 'foo4'],
'bar_none': ['bar', 'bax', 'qux', 'bar', 'bax', 'qux', 'bar', 'bax', 'qux', np.nan],
'foo_str': [1,2,3,4,5,6,7,8,9,10],
'foo': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', np.nan, 'foo'],
'bar': ['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', np.nan, 'bar', 'bar' ]
}
)
print(clean(demo_df_2, demo_params))
In [ ]:
# exercise 2 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex2(clean)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `params`, `true_output`, `your_output`
from run_tests import df_check_in, params_check, df_check_out, clean_check
Partitioning the Data
The data has categorical predictors as well as numerical predictors. We have to treat each type separately to get meaningful predictions from our model.
As such, we need to partition the data separating the numerical predictors, the categorical predictors, and the target variable. Additionally, there is
sometimes not a linear relationship between predictors and the target variable. It may be appropriate to transform one or both to obtain the desired
relationship. The most common transformation is the log transformation, where the natural logarithm of a particular variable is used for modeling in lieu of
the variable itself.
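As a quick, hypothetical illustration of the log transformation (the prices below are made up), np.log performs the transform and np.exp undoes it:

import numpy as np

prices = np.array([150000., 320000., 89000.])   # made-up sale prices, for illustration only
log_prices = np.log(prices)                     # values the model would see after a log transform
print(log_prices)                               # ~[11.92, 12.68, 11.40]
print(np.allclose(np.exp(log_prices), prices))  # True -- exp inverts the transform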
Exercise 3: (2 points) - Given a DataFrame (df), the name of the column containing the responses (target), and a boolean value (log), complete the function signature for partition_cols. The function should output the following:

- Y - Numpy array - Vector holding the response values. Y should be log-transformed based on the truth-value of log.
- X_num - Numpy array - Matrix holding the values of all numerical columns of df aside from the target column.
- X_cat_df - DataFrame - A copy of all the categorical columns of df.

Implementation details:
- "Log transformed" means that each individual value, $y_i$, is transformed to $\ln(y_i)$.
- If log is True, Y should be log transformed. Otherwise, no transformation should be performed.
- Numerical columns will have a "numerical" dtype (i.e. some flavor of int or float). All columns which do not have numerical dtypes should be considered categorical.

For example, given the following DataFrame input:
For example, given the following DataFrame input:
   foo  bar  baz          kag  tar
0  abf    1  2.0   148.413159  zyx
1  def    3  4.0    54.598150  wvu
2  ghi    5  6.0    20.085537  tsr
3  jkl    7  8.0  1096.633158  qpo

With 'kag' as the target and log = True, a call to partition_cols should return the following output:

(array([5., 4., 3., 7.]),
 array([[1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.]]),
   foo  tar
 0  abf  zyx
 1  def  wvu
 2  ghi  tsr
 3  jkl  qpo)
Notice that the output is a tuple with three elements. The first is an array of the log-transformed values in the target column, 'kag'. The second is an array of the values in the numerical columns 'bar' and 'baz'. The third is a DataFrame with the categorical columns 'foo' and 'tar'.

The demo cell below should return the same results.
In [ ]:
def partition_cols(df, target, log=True):
    ### BEGIN SOLUTION
    from pandas.api.types import is_numeric_dtype
    Y = df[target].values
    if log:
        Y = np.log(Y)
    num_cols, cat_cols = [], []
    for c in df.columns:
        if c == target:
            continue
        if is_numeric_dtype(df[c].dtype):
            num_cols.append(c)
        else:
            cat_cols.append(c)
    X_num = df[num_cols].values
    X_cat_df = df[cat_cols].copy()
    return Y, X_num, X_cat_df
    ### END SOLUTION
    return Y, X_num, X_cat_df
In [ ]:
# demo
demo_df_3 = pd.DataFrame(
{
'foo': ['abf', 'def', 'ghi', 'jkl'],
'bar': [1,3,5,7],
'baz': [2,4.,6,8],
'kag': [np.exp(5), np.exp(4), np.exp(3), np.exp(7)],
'tar': ['zyx', 'wvu', 'tsr', 'qpo']
}
)
demo_target = 'kag'
demo_log = True
partition_cols(demo_df_3, demo_target, demo_log)
In [ ]:
# exercise 3 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex3(partition_cols)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `target`, `log`, `true_Y_output`, `your_Y_output`, `true_X_num_output`,
# `true_X_cat_df_output`, `your_X_cat_df_output`
from run_tests import df_check_in, target_check, do_log, Y_check, Y_out, X_num_check, X_num_out, X_cat_df_check, X_cat_df_out
One-hot Encoding
To use categorical variables in regression, they must be converted into some numerical form. One strategy for doing this is "one-hot" encoding. To accomplish this, several new variables are created - one for each unique value of the original variable. For each observation, a 1 is assigned to the new variable corresponding to the original value, and the rest are assigned 0. This strategy is preferred over assigning each unique original value a number because the categorical variables don't have any "order" or "value". A minimal sketch of the idea appears below.
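The sketch uses toy data and a deliberately manual construction (it is not the exercise solution and does not name any particular library routine):

import pandas as pd

# Toy categorical column with three unique values.
colors = pd.Series(['red', 'blue', 'red', 'green'], name='color')

# One new 0/1 column per unique value; exactly one of them is 1 in each row.
one_hot = pd.DataFrame(
    {f'color_{v}': (colors == v).astype('int64') for v in sorted(colors.unique())}
)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            0          1
# 3           0            1          0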
Exercise 4: (2 points) - Given a DataFrame (df), complete the function signature for one_hot_all to return a new DataFrame which is a copy of df with all categorical columns replaced with a one-hot encoded version of the column (note that a single column will be replaced by multiple columns). Use the following guidelines for performing the one-hot encoding for each categorical column.

As in Exercise 3 - all columns with a non-numerical dtype should be considered categorical.

For each categorical column:
- Determine the unique values in the column.
- Create a new column for each unique value named <original column name>_<unique value>.
- In each row, if the original value is <unique value>, then <original column name>_<unique value> should have an entry of 1; all other new columns should have entries of 0.
- All of the new columns should have a dtype of 'int64' - the columns which are not one-hot encoded should keep their original dtype.

Hint - There is a Pandas method which will accomplish most of what needs to be done in this exercise. We won't name it, but feel free to search the web and use whatever you find.

For example, given the input DataFrame:
For example, given the input DataFrame:
   foo  bar  baz
0  baz  tar  2.3
1  qux  kag  3.0
2  baz  kag  4.0
3  qux  tuk  1.1
4  baz  tar  6.0
5  qux  kag  8.0
Notice that there are two categorical columns (you can check for yourself using dtypes). 'foo' has unique values 'baz' and 'qux', so the output should have columns 'foo_baz' and 'foo_qux'. 'bar' has unique values 'kag', 'tar', and 'tuk', so the output should have columns 'bar_kag', 'bar_tar', and 'bar_tuk'.

The result has replaced the columns 'foo' and 'bar' with the one-hot encoded versions:
   baz  foo_baz  foo_qux  bar_kag  bar_tar  bar_tuk
0  2.3        1        0        0        1        0
1  3.0        0        1        1        0        0
2  4.0        1        0        1        0        0
3  1.1        0        1        0        0        1
4  6.0        1        0        0        1        0
5  8.0        0        1        1        0        0
The demo cell below should produce the same output.
In [ ]:
def one_hot_all(df):
    ### BEGIN SOLUTION
    from pandas.api.types import is_numeric_dtype
    from pandas import get_dummies, concat
    df = df.copy()
    num_cols = df[[c for c in df.columns if is_numeric_dtype(df[c].dtype)]]
    cat_cols = df[[c for c in df.columns if not is_numeric_dtype(df[c].dtype)]]
    one_hots = get_dummies(cat_cols, dtype='int64')
    return concat([num_cols, one_hots], axis=1)
    ### END SOLUTION
In [ ]:
# demo
demo_df_4 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'],
'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag'],
'baz': [2.3, 3, 4, 1.1, 6, 8]
}
)
print(demo_df_4)
print(one_hot_all(demo_df_4))
In [ ]:
# exercise 4 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex4(one_hot_all)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `true_output`, and `your_output`
from run_tests import df_check, oh_check, oh_out
Train-test split
It is common practice to randomly partition available data into a training and test set to more accurately gauge how well a model will perform on new data. It is possible (and somewhat likely) that models will fit some of the randomness or "noise" in the data which is not actually accounted for in the predictive variables.
The cell below defines a function ind_splitter. It takes two inputs:
- m - the number of rows being split.
- pct - a float indicating the percentage of rows to be assigned to the test set.

The function ind_splitter returns two lists:
- test_inds - a Python list of the indices which will belong to the test set.
- train_inds - a Python list of the indices which will belong to the training set.
In [ ]:
# demo - used in later demo cell
def ind_splitter(m, pct):
    n = int(pct*m)
    rng = np.random.default_rng(6040)
    inds = np.arange(m)
    rng.shuffle(inds)
    return inds[:n], inds[n:]  # test indices, training indices
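For instance, splitting m = 6 rows with pct = 0.4 sends int(0.4 * 6) = 2 shuffled indices to the test set and the remaining 4 to training (which particular indices land where depends on the seeded shuffle above):

test_inds, train_inds = ind_splitter(6, 0.4)    # illustrative call only
print(len(test_inds), len(train_inds))          # 2 4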
Exercise 5: (2 points) - Given a tuple (data) containing some number of either DataFrames or Numpy arrays, the percentage desired in the test partition, and a function (splitter) which will decide the indices belonging to each partition, fill out the function definition for train_test_split. The function should do the following:

- Make a single call to splitter to determine which indices (rows) belong to which partition. splitter will take the same arguments as ind_splitter and return two list-like structures. The first is the indices of the test set, the second is the indices of the training set.
- Use the outputs of splitter to return a tuple that has the test and training partitions for each data structure in data. There will be two entries in the output for each entry in data.
- If the input data is of the form (data0, data1, data2), then the output should be of the form (data0_test, data0_train, data1_test, data1_train, data2_test, data2_train). There can be any number of entries in data, but the output should give the test then the training partition of each entry before moving on to the next entry.
- You can assume that all elements of data will have the same number of rows. This is checked with the assert statement inside the for-loop given in the starter code. Feel free to use the variable m in your solution.

Note: splitter takes the same inputs/outputs as the ind_splitter example above. However, the test code will not use ind_splitter. Your solution must call splitter to pass.

Additionally, the order of the data in the test and training sets (both column and row) must match the ordering determined by splitter. Do not reset any DataFrame indexes or shuffle any columns. This requirement is necessary to ensure that the relation between each piece of data is maintained properly when partitioning.
For example, consider the following tuple:
(array([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]),
   foo  bar
 0  baz  tar
 1  qux  kag
 2  baz  kag
 3  qux  tuk
 4  baz  tar
 5  qux  kag)

It has two entries - the first is an array and the second is a DataFrame. Notice how they both have the same number of rows. Using pct = 0.4 and ind_splitter as our splitter, a call to train_test_split should produce the following output.
(array([[ 9, 10, 11],
        [ 0,  1,  2]]),
 array([[ 3,  4,  5],
        [15, 16, 17],
        [12, 13, 14],
        [ 6,  7,  8]]),
   foo  bar
 3  qux  tuk
 0  baz  tar,
   foo  bar
 1  qux  kag
 5  qux  kag
 4  baz  tar
 2  baz  kag)
Notice that rows 3 and 0 are assigned to the test partition for both entries in data. The ordering of the rows has been changed, but it is consistent between the array and the DataFrame as well.

The demo cell below should produce the same output.
In [ ]:
def train_test_split(data, pct, splitter=ind_splitter):
    m = data[0].shape[0]
    for d in data:
        assert d.shape[0] == m
    ### BEGIN SOLUTION
    test_inds, train_inds = splitter(m, pct)
    outputs = []
    for d in data:
        if isinstance(d, pd.DataFrame):
            outputs.append(d.iloc[test_inds])
            outputs.append(d.iloc[train_inds])
        else:
            # assume numpy array
            outputs.append(d[test_inds])
            outputs.append(d[train_inds])
    return tuple(outputs)
    ### END SOLUTION
In [ ]:
# demo - calls function defined in earlier demo cell
demo_df_5 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'], 'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag']
}
)
demo_arr_5 = np.arange(18).reshape((6,3))
demo_pct_5 = 0.4
demo_data_5 = (demo_arr_5, demo_df_5)
train_test_split(demo_data_5, demo_pct_5, ind_splitter)
In [ ]:
# exercise 5 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex5(train_test_split)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `data`, `pct`, `splitter`, `true_output`, `your_output`
from run_tests import data_check, pct_check, splitter_check, true_output, your_output
Column Standardization, PCA and Linear Regression
To attempt to reduce our data's dimensionality, we will use Principal Component Analysis. However, PCA works best when each data column is "standardized", i.e. has a mean of 0 and a standard deviation of 1.

To standardize a matrix $X$, suppose that $x$ is a column of $X$. Also, suppose that $\mu$ and $\sigma$ are the mean and standard deviation of $x$. To calculate the standardized column $\hat{x}$, use this formula:

$\hat{x} = \frac{x - \mu}{\sigma}$

To refresh your memory on PCA: Suppose our (standardized) data is in a matrix $\hat{X}$ with $m$ rows and $d$ columns. Recall the Singular Value Decomposition (SVD) of $\hat{X}$:

$\hat{X} = U \Sigma V^T$

Each row of $V^T$ (column of $V$) is a principal component of $\hat{X}$. We can use the diagonal of $\Sigma$ to compute the fraction of the variance explained by each principal component as well. Let $\sigma_i$ be the $i$-th entry on the diagonal of $\Sigma$. Then to compute the fraction of the variance of $\hat{X}$ accounted for in the $i$-th principal component, use the formula below.

$\frac{\sigma_i^2}{\sum_j \sigma_j^2}$

To transform our data into the Principal Component space, we simply multiply $\hat{X} V_k$, where $V_k$ is the matrix whose columns are the first $k$ principal components.

To use the same column standardization and PCA transformation on new data, we need the vector of column means, the vector of column standard deviations, and the principal components. With these we can evaluate models based on our transformed data with new observations. A small NumPy sketch of these steps follows.
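The sketch below uses toy random data (not the exam's housing matrix): standardize the columns, take the SVD, compute the fraction of variance explained by each principal component, and project onto the first k components.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # toy data matrix: 100 rows, 4 columns

mu = X.mean(axis=0)                  # column means
sigma = X.std(axis=0, ddof=1)        # column sample standard deviations
X_std = (X - mu) / sigma             # standardized matrix

U, S, VT = np.linalg.svd(X_std, full_matrices=False)
frac_var = S**2 / (S**2).sum()       # fraction of variance per principal component
print(frac_var, frac_var.sum())      # the fractions sum to 1

k = 2
X_pc = X_std @ VT[:k, :].T           # data projected onto the first k principal components
print(X_pc.shape)                    # (100, 2)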
Exercise 6: (3 points) - Complete the function prin_comps to meet the following requirements.

- Use the formulas above to standardize the columns of X and compute its SVD.
- prin_comps should return the column mean vector (Numpy array), the column sample standard deviation vector (Numpy array), and a number of principal components to be determined by the rules below.
  - If pct is not None, use the SVD to determine the minimum value of $k$ such that the first $k$ principal components account for at least pct of the variance. Return the first $k$ principal components.
  - If n is not None, return the first n principal components.
  - If both pct and n are None, return all of the principal components.
  - You can assume the case where both pct and n are not None will not occur - this is checked with an assertion in the starter code.

Notes:
- Return the principal components as a slice of $V^T$ - do not use a transpose to modify the output from numpy.linalg.svd (which you should use for SVD calculation).
- Your solution must compute the sample standard deviations (i.e. 1 degree of freedom) for each column - this is not the default for np.std. Check the documentation (https://numpy.org/doc/stable/reference/generated/numpy.std.html) for details on how to set the degrees of freedom. There are other libraries (like scipy.stats) which can also calculate column sample standard deviations, but you will need to check the documentation to ensure that the degrees of freedom are what's intended. A tiny example of the distinction appears after these notes.
- There is no simplified example or demo for this exercise. However, this exercise incorporates concepts almost directly from Notebook 15.
  - SVD calculation - 15.1.1
  - Calculating the number of components necessary to account for a fraction of variance - 15.1.5
  - Note: 15.1.5 involves finding the number of components necessary to have an error less than a specified amount. Here we want to have the fraction of variance accounted for (1 - error) greater than pct.
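Here is the ddof distinction on a tiny made-up vector:

import numpy as np

x = np.array([2., 4., 6., 8.])
print(np.std(x))           # ~2.236 -- population standard deviation (ddof=0, the default)
print(np.std(x, ddof=1))   # ~2.582 -- sample standard deviation (ddof=1), required here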
In [ ]:
def prin_comps(X, pct=None, n=None):
    assert not (pct is not None and n is not None), 'Can\'t set both of these'
    assert pct is None or (0 < pct <= 1), 'Invalid pct. Must be between 0 and 1 or `None`.'
    assert n is None or (type(n) is int and 1 <= n <= X.shape[1]), 'n must be integer between 1 and number of columns in X'
    ### BEGIN SOLUTION
    def find_rank(pct, Sigma):
        # define helper function
        S2 = Sigma**2
        SS2 = S2.sum()
        F = np.cumsum(S2) / SS2
        k = np.min(np.where(F >= pct))
        return k+1
    m = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    Z = (X - m) / s
    _, Sigma, VT = np.linalg.svd(Z, full_matrices=False)
    if pct is not None:
        k = find_rank(pct, Sigma)  # call helper function
    elif n is not None:
        k = n
    else:
        k = len(VT)
    VTk = VT[:k, :]
    ### END SOLUTION
    return m, s, VTk  # means, standard deviations, and principal components
In [ ]:
# exercise 6 test cell
reload(run_tests)
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
if False:
    df = pd.read_csv('resource/asnlib/publicdata/train.csv')
    df['Age'] = df['YrSold'] - df['YearRemodAdd']
    params = {
        'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
        'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
        'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
                         'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                         'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                         'PoolQC', 'Fence', 'MiscFeature'],
        'cols_to_string': ['MSSubClass']
    }
    df_clean = clean(df, params)
    Y, X_num, X_cat_df = partition_cols(df_clean, 'SalePrice')
    run_tests.generate_test_cases_pca(prin_comps, X_num)
### END HIDDEN TESTS
run_tests.chk_ex6(prin_comps, 50)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables `pca_test`, `your_m_output`, `your_s_output`, `your_VTk_output`
# `pca_test` is a dictionary with keys ['X', 'pct', 'n', 'm', 's', 'VTk'] corresponding to
# inputs and required outputs given in the problem description
from run_tests import pca_test, m_out, s_out, VTk_out
We have defined pca_transformer and calc_theta below. Take the time to look at them, because Exercise 7 will incorporate elements from both.

pca_transformer - You have seen above that functions are able to take other functions as arguments. Well, they can also return functions! This function returns the function transform, which uses the parameters from pca_transformer as constants and takes a single parameter itself. The idea is that pca_transformer can be used to create a function that projects a new matrix (X) into the same Principal Component space that was used to create m, s, and VTk.
In [ ]:
# demo
def pca_transformer(m, s, VTk):
    def transform(X):
        X_std = ((X - m) / s)
        return X_std.dot(VTk.T)
    return transform
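A small usage sketch, assuming Exercise 6 has been solved (the toy matrices here are made up purely for illustration): prin_comps learns the transformation from one matrix, and the returned transform applies it to any matrix with the same columns.

import numpy as np

rng = np.random.default_rng(1)
X_train_toy = rng.normal(size=(20, 5))     # pretend training data
X_new_toy = rng.normal(size=(3, 5))        # pretend new observations

m, s, VTk = prin_comps(X_train_toy, n=2)   # learn means, std devs, and first 2 components
project = pca_transformer(m, s, VTk)       # build the reusable transform
print(project(X_train_toy).shape)          # (20, 2)
print(project(X_new_toy).shape)            # (3, 2) -- same transformation applied to new data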
calc_theta - This function takes a data matrix X and a response vector y and computes the regression coefficient vector theta. Notice how the "column of ones" is added to the left of X when X_p is created.
In [ ]:
# demo
def calc_theta(X, y):
    from scipy.linalg import solve_triangular
    # add intercept
    m = X.shape[0]
    X_p = np.append(np.ones((m, 1)), X, axis=1)
    # compute regression parameter
    Q, R = np.linalg.qr(X_p)
    b = Q.T.dot(y)
    theta = solve_triangular(R, b)
    return theta
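As a quick sanity check on synthetic data (not part of the exam), calc_theta recovers the coefficients of a known, noise-free linear relationship, with the intercept appearing first because the column of ones is leftmost:

import numpy as np

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(50, 2))              # made-up predictors
y_toy = 1 + X_toy @ np.array([2., 3.])        # responses generated from y = 1 + 2*x1 + 3*x2

theta = calc_theta(X_toy, y_toy)
print(np.round(theta, 6))                     # approximately [1. 2. 3.]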
Exercise 7: (2 points) - Given a linear regression parameter (theta), complete the function lr_predictor to meet these requirements.

- Return a function which takes one parameter (a data matrix as a Numpy array). This function will augment the data matrix and then compute regression estimates based on theta.

To compute predictions based on the data matrix $X$ and a coefficient vector $\theta$, complete the following steps:
- Create an augmented data matrix $X_p$ by concatenating a column of ones to the left of $X$. This is not the same orientation as seen in Notebook 12 or the lecture videos.
- Perform the matrix-vector multiplication $X_p \theta$. The prediction vector is the result.
For example, if given the input np.array([1,2,3]) for theta, a call to lr_predictor should produce a function which makes predictions on new data.

demo_predictor = lr_predictor(np.array([1,2,3]))

Here demo_predictor is a function that will make predictions based on theta. We can make predictions on the following data.

X0 = np.array([[112.89440365,  65.62227416],
               [111.84947876,  58.21425899],
               [ 99.58399277,  72.10462519],
               [ 89.75919282,  80.56690485],
               [106.24550229,  79.71511156]])
X1 = np.array([[102.37686976,  73.0361458 ],
               [108.56479013,  70.62355242],
               [ 96.92681804,  56.06983186],
               [101.33747938,  66.66655047],
               [ 93.75484396,  43.88275971]])
X2 = np.array([[106.42275221,  62.33434278],
               [108.38137453,  67.76061615],
               [112.14494972,  64.73565864],
               [110.38170377,  64.6506213 ],
               [113.28106798,  75.74780834]])

The demo below makes the following calls and should produce the same output.

demo_predictor(X0)
[423.65562978 399.34173449 416.48186111 422.21910019 452.63633926]
demo_predictor(X1)
[424.86217692 430.00023752 363.06313166 403.67461017 320.15796705]
demo_predictor(X2)
[400.84853276 421.04459751 419.49687536 415.71527144 454.80556098]
In [ ]:
def lr_predictor(theta):
    def predict(X):
        ### BEGIN SOLUTION
        X_p = np.append(np.ones((len(X), 1)), X, axis=1)
        return X_p.dot(theta)
        ### END SOLUTION
    return predict
In [ ]:
# demo
demo_predictor = lr_predictor(np.array([1,2,3]))
X0 = np.array([[112.89440365,  65.62227416],
               [111.84947876,  58.21425899],
               [ 99.58399277,  72.10462519],
               [ 89.75919282,  80.56690485],
               [106.24550229,  79.71511156]])
X1 = np.array([[102.37686976,  73.0361458 ],
               [108.56479013,  70.62355242],
               [ 96.92681804,  56.06983186],
               [101.33747938,  66.66655047],
               [ 93.75484396,  43.88275971]])
X2 = np.array([[106.42275221,  62.33434278],
               [108.38137453,  67.76061615],
               [112.14494972,  64.73565864],
               [110.38170377,  64.6506213 ],
               [113.28106798,  75.74780834]])
print('demo_predictor(X0) ->', demo_predictor(X0), '\ndemo_predictor(X1) ->', demo_predictor(X1), '\ndemo_predictor(X2) ->', demo_predictor(X2))
In [ ]:
# exercise 7 test cell
reload(run_tests)
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
if False:
    run_tests.generate_test_cases_lr(lr_predictor)
### END HIDDEN TESTS
run_tests.chk_ex7(lr_predictor, 50)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
from run_tests import lr_test, pred_check
Fin.
You have reached the end of the exam. Make sure you have submitted your work and congratulations on finishing the semester!
The additional content below is optional.
Epilogue
We defined 8 useful functions in the exercises above. Here, we will show how to use these functions to perform a simple regression analysis on some real estate sales data. Set the variable run_demo to True in order to run the demonstration code in the cells below.

Note: the demonstration below will not run correctly if you have not solved all of the exercises in the exam.
In [ ]:
# set `run_demo` to `True` if you want to enable the demo epilogue in your work area
run_demo = False
Loading the data
Below we will load the data from a file and take a look at the columns.
In [ ]:
if run_demo:
    df_raw = pd.read_csv('resource/asnlib/publicdata/train.csv')
    print(df_raw.columns)
Determining age
One step in our preprocessing that is not covered in the code we wrote above is the age of the home. In the raw data, we have the year that the home was
last remodeled. It makes sense to compute the age by subtracting the remodel year from the year the home was sold and use this value instead of the
years. We will do this in the cell below.
In [ ]:
if run_demo:
    df_raw['Age'] = df_raw['YrSold'] - df_raw['YearRemodAdd']
Cleaning the data
Based on an examination of the data descriptions we have elected to drop the columns 'GarageYrBlt', 'Id', and 'MoSold' because they won't add useful
information to our model. Additionally, we are dropping 'YearBuilt' and 'YearRemodAdd' because they are incorporated into our age calculation.
The column 'MSSubClass' has all numerical values, but it is actually categorical. We need to change the type of this column so that it will be interpreted as
such.
Now we will check which columns have missing values so that we can deal with them appropriately.
In [ ]:
if run_demo:
    print(df_raw.loc[:, df_raw.isnull().sum() > 0].columns)
The columns 'LotFrontage' and 'MasVnrArea' are numerical and have missing values. The remaining columns with missing values are categorical. We will replace the numerical missing values with 0.0 and the categorical ones with 'None'. We do not want to include homes without electricity in our analysis at all.

Below, we set our cleaning parameters to meet our requirements and run clean. Note that 'Electrical' is not included because we want to drop any rows with missing values in that column.
In [ ]:
if run_demo:
    params = {
        'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
        'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
        'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
                         'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                         'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                         'PoolQC', 'Fence', 'MiscFeature'],
        'cols_to_string': ['MSSubClass']
    }
    df_clean = clean(df_raw, params)
Partitioning the data
Next we need to separate our data into the response, numerical variables, and categorical variables. We will be performing the log transform on the response variable. We should also split each of these into training and test sets. We will hold out roughly 30% for testing.

There are a lot of categorical variables. One-hot encoding them all would give us a very high dimensional model. We used an off-the-shelf recursive feature elimination with cross-validation algorithm to select the most informative categorical variables - 'LotConfig', 'HouseStyle' and 'ExterQual'. Selecting these three categorical variables will give us 17 one-hot variables.
In [ ]:
if run_demo:
    # separate response, numerical, and categorical data
    y, X_num, X_cat = partition_cols(df_clean, 'SalePrice', True)
    # one-hot encode categorical data
    X_cat_oh = one_hot_all(X_cat[['LotConfig', 'HouseStyle', 'ExterQual']])
    # construct `data` tuple and partition into training and test
    data = (y.reshape((-1,1)), X_num, X_cat_oh)
    y_test, y_train, X_num_test, X_num_train, X_cat_test, X_cat_train = train_test_split(data, 0.3, ind_splitter)
Linear regression with categorical variables
Here we will use our categorical training set and the training response to train a linear regression model.
In [ ]:
if run_demo:
    cat_lr_predict = lr_predictor(calc_theta(X_cat_train, y_train))
In the cell below, we use our model to make predictions on the training and test sets.
In [ ]:
if run_demo:
    pred_cat_train = cat_lr_predict(X_cat_train)
    pred_cat_test = cat_lr_predict(X_cat_test)
Now we can evaluate the predictions. We will use the metrics r_sq and r_sq_adj. We will also plot the predictions against the observations. Blue points have observed values as the "x" coordinate and predicted values as the "y". The orange dots are on the line where predicted values are the same as observed values. You should interpret the horizontal distance between the observations and predictions as the prediction error.

Note: due to the log transformation, the (log-scale) purchase prices will be between 11 and 13.
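For a rough sense of scale (this conversion is not in the original notebook), undoing the log transform maps that range back to dollar prices:

import numpy as np
print(np.exp([11., 13.]))   # ~[ 59874., 442413.] -- roughly $60k to $440k sale prices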
In [ ]:
if run_demo:
    def plot_predictions(pred, y, p, title):
        from matplotlib import pyplot as plt
        fig = plt.figure()
        fig.suptitle(f'{title} - R_sq: {r_sq(y, pred)}; R_sq_adj: {r_sq_adj(y, pred, p)}')
        fig.supxlabel('Observations')
        fig.supylabel('Predictions')
        plt.scatter(y, pred)
        plt.scatter(pred, pred)
    plot_predictions(pred_cat_train, y_train, 18, 'Training Set')
    plot_predictions(pred_cat_test, y_test, 18, 'Test Set')
These predictions are not great, but it is interesting that the lot configuration, house type, and overall exterior condition can explain around half of the
variance in the sale price. It's also worth noting that there is not much difference in accuracy between the training and test data. The model actually
performed a little bit better on the test data!
Linear regression with first 3 principal components of numerical variables
Now we will see how the model does with the numerical data. We will
determine the parameters necessary to transform the data into its first 3 principal components then use those to train a linear regression model and make
predictions.
In [ ]:
if run_demo:
    # determine transformation parameters
    pca_params = prin_comps(X_num_train, n=3)
    # instantiate pca transformer
    transformer = pca_transformer(*pca_params)
    # transform training and test data
    X_num_train_pc = transformer(X_num_train)
    X_num_test_pc = transformer(X_num_test)
    # train regression model
    pca_lr_predict = lr_predictor(calc_theta(X_num_train_pc, y_train))
    # make predictions
    pred_num_train = pca_lr_predict(X_num_train_pc)
    pred_num_test = pca_lr_predict(X_num_test_pc)
Below we will calculate the metrics and plot the predictions and observations in the same way we did for the categorical variables.
In [ ]:
if run_demo:
    plot_predictions(pred_num_train, y_train, 18, 'Training Set')
    plot_predictions(pred_num_test, y_test, 18, 'Test Set')
The numerical predictors performed a bit better, even though we only used the first 3 principal components. Again, it is worth mentioning that there does not appear to be any overfitting, as the model made only slightly better predictions on the training data.
Linear regression using both types of predictors
Below we will combine the numerical and categorical predictors into one data set. If the information contained in the categorical variables captures
different qualities than the information in the numerical variables, we may be able to improve model performance even more! However, there is a chance
that there is correlation between the two and not much information will be gained by combining.
In [ ]:
if run_demo:
    # Combine categorical and numerical variables
    X_train = np.append(X_num_train_pc, X_cat_train.values, axis=1)
    X_test = np.append(X_num_test_pc, X_cat_test.values, axis=1)
    # Train regression model
    full_predictor = lr_predictor(calc_theta(X_train, y_train))
    # Make predictions
    pred_train = full_predictor(X_train)
    pred_test = full_predictor(X_test)
    # Plot predictions
    plot_predictions(pred_train, y_train, 18, 'Training Set')
    plot_predictions(pred_test, y_test, 18, 'Test Set')
The model with the combined data performed marginally better, even accounting for the additional variables. There was not much improvement, though, which indicates that the categorical variables did not carry much additional information.
Fin!
- for real this time.