Final Exam: Fall 2021 - Housing Prices
Version 1.0
This exam will test your understanding of PCA and Linear Regression, along with manipulation of Pandas, Numpy, and base Python data structures.

There are 8 exercises, numbered 0 to 7. There are 15 available points. However, to earn 100%, the threshold is just 13 points. (Therefore, once you hit 13 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise 0: 1 point
Exercise 1: 1 point
Exercise 2: 2 points
Exercise 3: 2 points
Exercise 4: 2 points
Exercise 5: 2 points
Exercise 6: 3 points
Exercise 7: 2 points
Each exercise builds logically on the previous one, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. One demo cell calls a function defined in an earlier demo; the rest are self-contained. If you see a comment
# demo - used in later demo cell
be sure to run that cell before moving on.
For more information see the "Final Exam Release Notes" post on the discussion forum.
Good luck!
In [ ]:
### BEGIN HIDDEN TESTS
%load_ext autoreload
%autoreload 2
# Note to instructors deploying this in the future - Set all instances of `if False:` to `if True:` and run the notebook once.
# This is necessary to build the hash check and test data files. These lines should be reverted to `if False:` once the files are
# in place.
# dill and hashlib are needed both when building the hash check and when loading it below.
import dill
import hashlib
if False:
    def hash_check(f1, f2, verbose=True):
        with open(f1, 'rb') as f:
            h1 = hashlib.md5(f.read()).hexdigest()
        with open(f2, 'rb') as f:
            h2 = hashlib.md5(f.read()).hexdigest()
        if verbose:
            print(h1)
            print(h2)
        assert h1 == h2, 'The file "run_tests.py" has been modified'
    with open('resource/asnlib/public/hash_check.pkl', 'wb') as f:
        dill.dump(hash_check, f)
    del hash_check
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
### END HIDDEN TESTS
import pandas as pd
import numpy as np
import run_tests
from time import time
from importlib import reload
time_start = time()
Introduction
Today, we will be working with data on home sales. We will try to use information about the individual homes to build a linear regression model to predict the sale prices. There are challenges to this approach, which will need to be addressed.
- There are missing values in several columns of the data. Our linear regression strategy will not be able to handle this, so we will have to assign values to these observations or remove rows which contain missing values.
- There are categorical variables in the data. The data will have to be manipulated to transform it into all numerical values.
- The data has a large number of columns (especially after handling categorical data). This can add instability to the model, so we want to use fewer columns if possible.
- We will have to compare several models and determine which is the "best".
Metrics for evaluating model fit
Before we can go about choosing a model from many options, we must decide on how to compare them. One common metric for evaluating linear regression models is the "coefficient of determination", often referred to as $R^2$.

- $y$ - column vector of the observed responses (for our purposes, home sale price).
- $\hat{y}$ - vector of predicted responses (for our purposes, predicted home sale price).
- $n$ - number of observations.
- $\bar{y}$ - mean of observed responses - $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
- $SS_{res}$ - Sum of Squares of Residuals - $SS_{res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
- $SS_{tot}$ - Sum of Squares Total - $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- $R^2$ - Coefficient of determination - $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$

$R^2$ reveals the share of variance in the response which is accounted for by the model. However, this metric is not effective at comparing models with different numbers of parameters. Specifically, adding parameters will always result in an increase in $R^2$. To compare models of different sizes, there is a modified metric, $R^2_{adj}$, which penalizes larger models.

- $p$ - number of model parameters
- $R^2_{adj}$ - adjusted coefficient of determination - $R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
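To make these definitions concrete, here is a minimal sketch (not part of the exam) that evaluates both metrics directly with NumPy, using the same toy arrays that appear in the demo cells below and assuming p = 2 model parameters as in the Exercise 1 demo.

import numpy as np

y = np.array([5., 10., 3., 5., 7., 9.])          # observed responses (toy values from the demo)
y_hat = np.array([5.5, 8.7, 3.2, 4.5, 7., 8.9])  # predicted responses (toy values from the demo)
p = 2                                            # assumed number of model parameters

n = len(y)
ss_res = ((y - y_hat) ** 2).sum()                # sum of squares of residuals
ss_tot = ((y - y.mean()) ** 2).sum()             # sum of squares total
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r2)      # ~0.9369014
print(r2_adj)  # ~0.8948357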
Exercise 0: (1 point) - Given Numpy arrays representing the observed responses (y) and predicted responses (y_hat), complete the function signature for r_sq to compute $R^2$. Return your answer as a Python float.
In [ ]:
def r_sq(y, y_hat):
    ### BEGIN SOLUTION
    y_bar = y.mean()
    ssr = ((y - y_hat)**2).sum()
    sst = ((y - y_bar)**2).sum()
    return 1 - ssr/sst
    ### END SOLUTION
The demo cell below should output approximately 0.93690140.
In [ ]:
# demo
demo_y_0 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_0 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
r_sq(demo_y_0, demo_y_hat_0)
In [ ]:
# exercise 0 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex0(r_sq)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `y`, `y_hat`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, r_sq_true, r_sq_output
Exercise 1: (1 point) - Given Numpy arrays representing the observed responses (y), predicted responses (y_hat), and the number of parameters in the model (p), complete the function signature for r_sq_adj to compute $R^2_{adj}$. If you successfully solved Exercise 0 you can use r_sq in your solution to this exercise; however, you can pass by directly implementing the formula.

Here's the formula again:

$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
In [ ]:
def r_sq_adj(y, y_hat, p):
    ### BEGIN SOLUTION
    n = len(y)
    R2 = r_sq(y, y_hat)
    R2_adj = 1 - (1 - R2)*(n-1)/(n-p-1)
    return R2_adj
    ### END SOLUTION
The demo cell below should output approximately 0.89483568.
In [ ]:
# demo
demo_y_1 = np.array([5, 10, 3, 5, 7, 9])
demo_y_hat_1 = np.array([5.5, 8.7, 3.2, 4.5, 7, 8.9])
demo_p_1 = 2
r_sq_adj(demo_y_1, demo_y_hat_1, demo_p_1)
In [ ]:
# exercise 1 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex1(r_sq_adj)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `y`, `y_hat`, `p`, `true_output`, `your_output`
from run_tests import y_check, y_hat_check, p_check, r_sq_adj_true, r_sq_adj_output
Cleaning the Data
Exercise 2: (2 points) - The data set has some missing values, columns which are the wrong type, and columns which we are not interested in. Complete the function signature for clean to return a new DataFrame that is a copy of df with the following modifications based on the Python dictionary, params. The values of params are all Python lists of column names.

- Drop all columns in params['cols_to_drop'].
- Convert the values in all columns in params['cols_to_string'] to Python strings.
- Convert missing values in all columns in params['cols_to_zero'] to 0.0, while leaving all other values intact.
- Convert missing values in all columns in params['cols_to_none'] to the Python string 'None' - don't forget the quotation marks!
- Drop any rows which still have missing values after all of the above modifications have been made.

Notes:
- You can assume that params will always have all of the keys mentioned above.
- You can assume that any column included in params['cols_to_string'] does not have missing values.
- You can assume that the params lists are mutually exclusive - i.e. a column included in params['cols_to_string'] will not be in any of the other lists.
For example, given the following DataFrame input:
   foo_drop  bar_drop  foo_zero foo_none bar_none  foo_str  foo  bar
0       1.0       0.0       2.0     foo1      bar        1  foo  bar
1       2.0       NaN       4.0     foo2      bax        2  foo  bar
2       3.0       NaN       NaN     foo3      qux        3  foo  bar
3       4.0       NaN       NaN     foo4      bar        4  foo  bar
4       5.0       NaN       3.0      NaN      bax        5  foo  bar
5       6.0       1.0       4.0      NaN      qux        6  foo  bar
6       NaN       2.0       5.0     foo1      bar        7  foo  bar
7       NaN       3.0       6.0     foo2      bax        8  foo  NaN
8       NaN       4.0       7.0     foo3      qux        9  NaN  bar
9       NaN       5.0       1.0     foo4      NaN       10  foo  bar
With these params:
{
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
The function call to clean should return these results.
   foo_zero foo_none bar_none foo_str  foo  bar
0       2.0     foo1      bar       1  foo  bar
1       4.0     foo2      bax       2  foo  bar
2       0.0     foo3      qux       3  foo  bar
3       0.0     foo4      bar       4  foo  bar
4       3.0     None      bax       5  foo  bar
5       4.0     None      qux       6  foo  bar
6       5.0     foo1      bar       7  foo  bar
9       1.0     foo4     None      10  foo  bar
Notice that the columns 'foo_drop' and 'bar_drop' were dropped entirely, 'foo_zero' had missing values replaced with 0.0 in rows 2 and 3, 'foo_none' and 'bar_none' had missing values replaced with 'None' in rows 4, 5, and 9, the numbers in 'foo_str' were converted to strings, and rows 7 and 8 were dropped because of missing values in 'foo' and 'bar'.

The demo cell below should generate the same output.
In [ ]:
def clean(df, params):
    ### BEGIN SOLUTION
    df = df.copy()
    if 'cols_to_drop' in params:
        df = df.drop(columns=params['cols_to_drop'])
    for c in params.get('cols_to_string', []):
        df[c] = df[c].astype(str)
    for c in params.get('cols_to_zero', []):
        df[c] = df[c].fillna(0.0)
    for c in params.get('cols_to_none', []):
        df[c] = df[c].fillna('None')
    return df.dropna()
    ### END SOLUTION
In [ ]:
# demo
demo_params = {
'cols_to_drop': ['foo_drop', 'bar_drop'],
'cols_to_zero': ['foo_zero'],
'cols_to_none': ['foo_none', 'bar_none'],
'cols_to_string': ['foo_str']
}
demo_df_2 = pd.DataFrame(
{
'foo_drop': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan, np.nan],
'bar_drop': [0, np.nan, np.nan, np.nan, np.nan, 1, 2, 3, 4, 5],
'foo_zero': [2., 4., np.nan, np.nan, 3., 4., 5., 6., 7., 1],
'foo_none': ['foo1', 'foo2', 'foo3', 'foo4', np.nan, np.nan, 'foo1', 'foo2', 'foo3', 'foo4'],
'bar_none': ['bar', 'bax', 'qux', 'bar', 'bax', 'qux', 'bar', 'bax', 'qux', np.nan],
'foo_str': [1,2,3,4,5,6,7,8,9,10],
'foo': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', 'foo', np.nan, 'foo'],
'bar': ['bar', 'bar', 'bar', 'bar', 'bar', 'bar', 'bar', np.nan, 'bar', 'bar' ]
}
)
print(clean(demo_df_2, demo_params))
In [ ]:
# exercise 2 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex2(clean)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `params`, `true_output`, `your_output`
from run_tests import df_check_in, params_check, df_check_out, clean_check
Partitioning the Data
The data has categorical predictors as well as numerical predictors. We have to treat each type separately to get meaningful predictions from our model.
As such, we need to partition the data separating the numerical predictors, the categorical predictors, and the target variable. Additionally, there is
sometimes not a linear relationship between predictors and the target variable. It may be appropriate to transform one or both to obtain the desired
relationship. The most common transformation is the log transformation, where the natural logarithm of a particular variable is used for modeling in lieu of
the variable itself.
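As a quick, hypothetical illustration of the log transformation (the prices below are made up), np.log performs the transform and np.exp undoes it:

import numpy as np

prices = np.array([150000., 320000., 89000.])   # made-up sale prices, for illustration only
log_prices = np.log(prices)                     # values the model would see after a log transform
print(log_prices)                               # ~[11.92, 12.68, 11.40]
print(np.allclose(np.exp(log_prices), prices))  # True -- exp inverts the transform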
Exercise 3: (2 points) - Given a DataFrame (df), the name of the column containing the responses (target), and a boolean value (log), complete the function signature for partition_cols. The function should output the following:

- Y - Numpy array - Vector holding the response values. Y should be log-transformed based on the truth-value of log.
- X_num - Numpy array - Matrix holding the values of all numerical columns of df aside from the target column.
- X_cat_df - DataFrame - A copy of all the categorical columns of df.

Implementation details:
- "Log transformed" means that each individual value, $y_i$, is transformed to $\ln(y_i)$.
- If log is True, Y should be log transformed. Otherwise, no transformation should be performed.
- Numerical columns will have a "numerical" dtype (i.e. some flavor of int or float). All columns which do not have numerical dtypes should be considered categorical.

For example, given the following DataFrame input:
For example, given the following DataFrame input:
   foo  bar  baz          kag  tar
0  abf    1  2.0   148.413159  zyx
1  def    3  4.0    54.598150  wvu
2  ghi    5  6.0    20.085537  tsr
3  jkl    7  8.0  1096.633158  qpo

With 'kag' as the target and log = True, a call to partition_cols should return the following output:

(array([5., 4., 3., 7.]),
 array([[1., 2.],
        [3., 4.],
        [5., 6.],
        [7., 8.]]),
   foo  tar
 0  abf  zyx
 1  def  wvu
 2  ghi  tsr
 3  jkl  qpo)
Notice that the output is a tuple with three elements. The first is an array of the log-transformed values in the target column, 'kag'. The second is an array of the values in the numerical columns 'bar' and 'baz'. The third is a DataFrame with the categorical columns 'foo' and 'tar'.

The demo cell below should return the same results.
In [ ]:
def partition_cols(df, target, log=True):
    ### BEGIN SOLUTION
    from pandas.api.types import is_numeric_dtype
    Y = df[target].values
    if log:
        Y = np.log(Y)
    num_cols, cat_cols = [], []
    for c in df.columns:
        if c == target:
            continue
        if is_numeric_dtype(df[c].dtype):
            num_cols.append(c)
        else:
            cat_cols.append(c)
    X_num = df[num_cols].values
    X_cat_df = df[cat_cols].copy()
    return Y, X_num, X_cat_df
    ### END SOLUTION
    return Y, X_num, X_cat_df
In [ ]:
# demo
demo_df_3 = pd.DataFrame(
{
'foo': ['abf', 'def', 'ghi', 'jkl'],
'bar': [1,3,5,7],
'baz': [2,4.,6,8],
'kag': [np.exp(5), np.exp(4), np.exp(3), np.exp(7)],
'tar': ['zyx', 'wvu', 'tsr', 'qpo']
}
)
demo_target = 'kag'
demo_log = True
partition_cols(demo_df_3, demo_target, demo_log)
In [ ]:
# exercise 3 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex3(partition_cols)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `target`, `log`, `true_Y_output`, `your_Y_output`, `true_X_num_output`,
# `true_X_cat_df_output`, `your_X_cat_df_output`
from run_tests import df_check_in, target_check, do_log, Y_check, Y_out, X_num_check, X_num_out, X_cat_df_check, X_cat_df_out
One-hot Encoding
To use categorical variables in regression, they must be converted into some numerical form. One strategy for doing this is "one-hot" encoding. To accomplish this, several new variables are created - one for each unique value of the original variable. For each observation, a 1 is assigned to the new variable corresponding to the original value, and the rest are assigned 0. This strategy is preferred over assigning each unique original value a number because the categorical variables don't have any "order" or "value". A minimal sketch of the idea appears below.
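The sketch uses toy data and a deliberately manual construction (it is not the exercise solution and does not name any particular library routine):

import pandas as pd

# Toy categorical column with three unique values.
colors = pd.Series(['red', 'blue', 'red', 'green'], name='color')

# One new 0/1 column per unique value; exactly one of them is 1 in each row.
one_hot = pd.DataFrame(
    {f'color_{v}': (colors == v).astype('int64') for v in sorted(colors.unique())}
)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            0          1
# 3           0            1          0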
Exercise 4: (2 points) - Given a DataFrame (df), complete the function signature for one_hot_all to return a new DataFrame which is a copy of df with all categorical columns replaced with a one-hot encoded version of the column (note that a single column will be replaced by multiple columns). Use the following guidelines for performing the one-hot encoding for each categorical column.

As in Exercise 3 - all columns with a non-numerical dtype should be considered categorical.

For each categorical column:
- Determine the unique values in the column.
- Create a new column for each unique value named <original column name>_<unique value>.
- In each row, if the original value is <unique value>, then <original column name>_<unique value> should have an entry of 1; all other new columns should have entries of 0.
- All of the new columns should have a dtype of 'int64' - the columns which are not one-hot encoded should keep their original dtype.

Hint - There is a Pandas method which will accomplish most of what needs to be done in this exercise. We won't name it, but feel free to search the web and use whatever you find.

For example, given the input DataFrame:
For example, given the input DataFrame:
   foo  bar  baz
0  baz  tar  2.3
1  qux  kag  3.0
2  baz  kag  4.0
3  qux  tuk  1.1
4  baz  tar  6.0
5  qux  kag  8.0
Notice that there are two categorical columns (you can check for yourself using dtypes). 'foo' has unique values 'baz' and 'qux', so the output should have columns 'foo_baz' and 'foo_qux'. 'bar' has unique values 'kag', 'tar', and 'tuk', so the output should have columns 'bar_kag', 'bar_tar', and 'bar_tuk'.

The result has replaced the columns 'foo' and 'bar' with the one-hot encoded versions:
   baz  foo_baz  foo_qux  bar_kag  bar_tar  bar_tuk
0  2.3        1        0        0        1        0
1  3.0        0        1        1        0        0
2  4.0        1        0        1        0        0
3  1.1        0        1        0        0        1
4  6.0        1        0        0        1        0
5  8.0        0        1        1        0        0
The demo cell below should produce the same output.
In [ ]:
def one_hot_all(df):
    ### BEGIN SOLUTION
    from pandas.api.types import is_numeric_dtype
    from pandas import get_dummies, concat
    df = df.copy()
    num_cols = df[[c for c in df.columns if is_numeric_dtype(df[c].dtype)]]
    cat_cols = df[[c for c in df.columns if not is_numeric_dtype(df[c].dtype)]]
    one_hots = get_dummies(cat_cols, dtype='int64')
    return concat([num_cols, one_hots], axis=1)
    ### END SOLUTION
In [ ]:
# demo
demo_df_4 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'],
'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag'],
'baz': [2.3, 3, 4, 1.1, 6, 8]
}
)
print(demo_df_4)
print(one_hot_all(demo_df_4))
In [ ]:
# exercise 4 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex4(one_hot_all)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `df`, `true_output`, and `your_output`
from run_tests import df_check, oh_check, oh_out
Train-test split
It is common practice to randomly partition available data into a training and test set to more accurately gauge how well a model will perform on new data. It is possible (and somewhat likely) that models will fit some of the randomness or "noise" in the data which is not actually accounted for in the predictive variables.
The cell below defines a function ind_splitter. It takes two inputs:
- m - the number of rows being split.
- pct - a float indicating the percentage of rows to be assigned to the test set.

The function ind_splitter returns two lists:
- test_inds - a Python list of the indices which will belong to the test set.
- train_inds - a Python list of the indices which will belong to the training set.
In [ ]:
# demo - used in later demo cell
def ind_splitter(m, pct):
    n = int(pct*m)
    rng = np.random.default_rng(6040)
    inds = np.arange(m)
    rng.shuffle(inds)
    return inds[:n], inds[n:]  # test indices, training indices
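For instance, splitting m = 6 rows with pct = 0.4 sends int(0.4 * 6) = 2 shuffled indices to the test set and the remaining 4 to training (which particular indices land where depends on the seeded shuffle above):

test_inds, train_inds = ind_splitter(6, 0.4)    # illustrative call only
print(len(test_inds), len(train_inds))          # 2 4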
Exercise 5: (2 points) - Given a tuple (data) containing some number of either DataFrames or Numpy arrays, the percentage desired in the test partition, and a function (splitter) which will decide the indices belonging to each partition, fill out the function definition for train_test_split. The function should do the following:

- Make a single call to splitter to determine which indices (rows) belong to which partition. splitter will take the same arguments as ind_splitter and return two list-like structures. The first is the indices of the test set, the second is the indices of the training set.
- Use the outputs of splitter to return a tuple that has the test and training partitions for each data structure in data. There will be two entries in the output for each entry in data.
- If the input data is of the form (data0, data1, data2), then the output should be of the form (data0_test, data0_train, data1_test, data1_train, data2_test, data2_train). There can be any number of entries in data, but the output should give the test then the training partition of each entry before moving on to the next entry.
- You can assume that all elements of data will have the same number of rows. This is checked with the assert statement inside the for-loop given in the starter code. Feel free to use the variable m in your solution.

Note: splitter takes the same inputs/outputs as the ind_splitter example above. However, the test code will not use ind_splitter. Your solution must call splitter to pass.

Additionally, the order of the data in the test and training sets (both column and row) must match the ordering determined by splitter. Do not reset any DataFrame indexes or shuffle any columns. This requirement is necessary to ensure that the relation between each piece of data is maintained properly when partitioning.
For example, consider the following tuple:
(array([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]),
   foo  bar
 0  baz  tar
 1  qux  kag
 2  baz  kag
 3  qux  tuk
 4  baz  tar
 5  qux  kag)

It has two entries - the first is an array and the second is a DataFrame. Notice how they both have the same number of rows. Using pct = 0.4 and ind_splitter as our splitter, a call to train_test_split should produce the following output.
(array([[ 9, 10, 11],
        [ 0,  1,  2]]),
 array([[ 3,  4,  5],
        [15, 16, 17],
        [12, 13, 14],
        [ 6,  7,  8]]),
   foo  bar
 3  qux  tuk
 0  baz  tar,
   foo  bar
 1  qux  kag
 5  qux  kag
 4  baz  tar
 2  baz  kag)
Notice that rows 3 and 0 are assigned to the test partition for both entries in data. The ordering of the rows has been changed, but it is consistent between the array and the DataFrame as well.

The demo cell below should produce the same output.
In [ ]:
def train_test_split(data, pct, splitter=ind_splitter):
    m = data[0].shape[0]
    for d in data:
        assert d.shape[0] == m
    ### BEGIN SOLUTION
    test_inds, train_inds = splitter(m, pct)
    outputs = []
    for d in data:
        if isinstance(d, pd.DataFrame):
            outputs.append(d.iloc[test_inds])
            outputs.append(d.iloc[train_inds])
        else:
            # assume numpy array
            outputs.append(d[test_inds])
            outputs.append(d[train_inds])
    return tuple(outputs)
    ### END SOLUTION
In [ ]:
# demo - calls function defined in earlier demo cell
demo_df_5 = pd.DataFrame(
{
'foo': ['baz', 'qux', 'baz', 'qux', 'baz', 'qux'], 'bar':['tar', 'kag', 'kag', 'tuk', 'tar', 'kag']
}
)
demo_arr_5 = np.arange(18).reshape((6,3))
demo_pct_5 = 0.4
demo_data_5 = (demo_arr_5, demo_df_5)
train_test_split(demo_data_5, demo_pct_5, ind_splitter)
In [ ]:
# exercise 5 test cell
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
reload(run_tests)
for _ in np.arange(10):
    run_tests.chk_ex5(train_test_split)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables for `data`, `pct`, `splitter`, `true_output`, `your_output`
from run_tests import data_check, pct_check, splitter_check, true_output, your_output
Column Standardization, PCA and Linear Regression
To attempt to reduce our data's dimensionality, we will use Principal Component Analysis. However, PCA works best when each data column is "standardized", i.e. has a mean of 0 and a standard deviation of 1.

To standardize a matrix $X$, suppose that $x$ is a column of $X$. Also, suppose that $\mu$ and $\sigma$ are the mean and standard deviation of $x$. To calculate the standardized column $\hat{x}$, use this formula:

$\hat{x} = \frac{x - \mu}{\sigma}$

To refresh your memory on PCA: Suppose our (standardized) data is in a matrix $\hat{X}$ with $m$ rows and $d$ columns. Recall the Singular Value Decomposition (SVD) of $\hat{X}$:

$\hat{X} = U \Sigma V^T$

Each row of $V^T$ (column of $V$) is a principal component of $\hat{X}$. We can use the diagonal of $\Sigma$ to compute the fraction of the variance explained by each principal component as well. Let $\sigma_i$ be the $i$-th entry on the diagonal of $\Sigma$. Then to compute the fraction of the variance of $\hat{X}$ accounted for in the $i$-th principal component, use the formula below.

$\frac{\sigma_i^2}{\sum_j \sigma_j^2}$

To transform our data into the Principal Component space, we simply multiply $\hat{X} V_k$, where $V_k$ is the matrix whose columns are the first $k$ principal components.

To use the same column standardization and PCA transformation on new data, we need the vector of column means, the vector of column standard deviations, and the principal components. With these we can evaluate models based on our transformed data with new observations. A small NumPy sketch of these steps follows.
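The sketch below uses toy random data (not the exam's housing matrix): standardize the columns, take the SVD, compute the fraction of variance explained by each principal component, and project onto the first k components.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # toy data matrix: 100 rows, 4 columns

mu = X.mean(axis=0)                  # column means
sigma = X.std(axis=0, ddof=1)        # column sample standard deviations
X_std = (X - mu) / sigma             # standardized matrix

U, S, VT = np.linalg.svd(X_std, full_matrices=False)
frac_var = S**2 / (S**2).sum()       # fraction of variance per principal component
print(frac_var, frac_var.sum())      # the fractions sum to 1

k = 2
X_pc = X_std @ VT[:k, :].T           # data projected onto the first k principal components
print(X_pc.shape)                    # (100, 2)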
Exercise 6: (3 points) - Complete the function prin_comps to meet the following requirements.

- Use the formulas above to standardize the columns of X and compute its SVD.
- prin_comps should return the column mean vector (Numpy array), the column sample standard deviation vector (Numpy array), and a number of principal components to be determined by the rules below.
  - If pct is not None, use the SVD to determine the minimum value of $k$ such that the first $k$ principal components account for at least pct of the variance. Return the first $k$ principal components.
  - If n is not None, return the first n principal components.
  - If both pct and n are None, return all of the principal components.
  - You can assume the case where both pct and n are not None will not occur - this is checked with an assertion in the starter code.

Notes:
- Return the principal components as a slice of $V^T$ - do not use a transpose to modify the output from numpy.linalg.svd (which you should use for SVD calculation).
- Your solution must compute the sample standard deviations (i.e. 1 degree of freedom) for each column - this is not the default for np.std. Check the documentation (https://numpy.org/doc/stable/reference/generated/numpy.std.html) for details on how to set the degrees of freedom. There are other libraries (like scipy.stats) which can also calculate column sample standard deviations, but you will need to check the documentation to ensure that the degrees of freedom are what's intended. A tiny example of the distinction appears after these notes.
- There is no simplified example or demo for this exercise. However, this exercise incorporates concepts almost directly from Notebook 15.
  - SVD calculation - 15.1.1
  - Calculating the number of components necessary to account for a fraction of variance - 15.1.5
  - Note: 15.1.5 involves finding the number of components necessary to have an error less than a specified amount. Here we want to have the fraction of variance accounted for (1 - error) greater than pct.
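Here is the ddof distinction on a tiny made-up vector:

import numpy as np

x = np.array([2., 4., 6., 8.])
print(np.std(x))           # ~2.236 -- population standard deviation (ddof=0, the default)
print(np.std(x, ddof=1))   # ~2.582 -- sample standard deviation (ddof=1), required here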
In [ ]:
def prin_comps(X, pct=None, n=None):
    assert not (pct is not None and n is not None), 'Can\'t set both of these'
    assert pct is None or (0 < pct <= 1), 'Invalid pct. Must be between 0 and 1 or `None`.'
    assert n is None or (type(n) is int and 1 <= n <= X.shape[1]), 'n must be integer between 1 and number of columns in X'
    ### BEGIN SOLUTION
    def find_rank(pct, Sigma):
        # define helper function
        S2 = Sigma**2
        SS2 = S2.sum()
        F = np.cumsum(S2) / SS2
        k = np.min(np.where(F >= pct))
        return k+1
    m = X.mean(axis=0)
    s = X.std(axis=0, ddof=1)
    Z = (X - m) / s
    _, Sigma, VT = np.linalg.svd(Z, full_matrices=False)
    if pct is not None:
        k = find_rank(pct, Sigma)  # call helper function
    elif n is not None:
        k = n
    else:
        k = len(VT)
    VTk = VT[:k, :]
    ### END SOLUTION
    return m, s, VTk  # means, standard deviations, and principal components
In [ ]:
# exercise 6 test cell
reload(run_tests)
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
if False:
    df = pd.read_csv('resource/asnlib/publicdata/train.csv')
    df['Age'] = df['YrSold'] - df['YearRemodAdd']
    params = {
        'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
        'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
        'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
                         'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                         'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                         'PoolQC', 'Fence', 'MiscFeature'],
        'cols_to_string': ['MSSubClass']
    }
    df_clean = clean(df, params)
    Y, X_num, X_cat_df = partition_cols(df_clean, 'SalePrice')
    run_tests.generate_test_cases_pca(prin_comps, X_num)
### END HIDDEN TESTS
run_tests.chk_ex6(prin_comps, 50)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
# load testing variables `pca_test`, `your_m_output`, `your_s_output`, `your_VTk_output`
# `pca_test` is a dictionary with keys ['X', 'pct', 'n', 'm', 's', 'VTk'] corresponding to
# inputs and required outputs given in the problem description
from run_tests import pca_test, m_out, s_out, VTk_out
We have defined pca_transformer and calc_theta below. Take the time to look at them, because Exercise 7 will incorporate elements from both.

pca_transformer - You have seen above that functions are able to take other functions as arguments. Well, they can also return functions! This function returns the function transform, which uses the parameters from pca_transformer as constants and takes a single parameter itself. The idea is that pca_transformer can be used to create a function that projects a new matrix (X) into the same Principal Component space that was used to create m, s, and VTk.
In [ ]:
# demo
def pca_transformer(m, s, VTk):
    def transform(X):
        X_std = ((X - m) / s)
        return X_std.dot(VTk.T)
    return transform
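A small usage sketch, assuming Exercise 6 has been solved (the toy matrices here are made up purely for illustration): prin_comps learns the transformation from one matrix, and the returned transform applies it to any matrix with the same columns.

import numpy as np

rng = np.random.default_rng(1)
X_train_toy = rng.normal(size=(20, 5))     # pretend training data
X_new_toy = rng.normal(size=(3, 5))        # pretend new observations

m, s, VTk = prin_comps(X_train_toy, n=2)   # learn means, std devs, and first 2 components
project = pca_transformer(m, s, VTk)       # build the reusable transform
print(project(X_train_toy).shape)          # (20, 2)
print(project(X_new_toy).shape)            # (3, 2) -- same transformation applied to new data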
calc_theta - This function takes a data matrix X and a response vector y and computes the regression coefficient vector theta. Notice how the "column of ones" is added to the left of X when X_p is created.
In [ ]:
# demo
def calc_theta(X, y):
    from scipy.linalg import solve_triangular
    # add intercept
    m = X.shape[0]
    X_p = np.append(np.ones((m, 1)), X, axis=1)
    # compute regression parameter
    Q, R = np.linalg.qr(X_p)
    b = Q.T.dot(y)
    theta = solve_triangular(R, b)
    return theta
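As a quick sanity check on synthetic data (not part of the exam), calc_theta recovers the coefficients of a known, noise-free linear relationship, with the intercept appearing first because the column of ones is leftmost:

import numpy as np

rng = np.random.default_rng(2)
X_toy = rng.normal(size=(50, 2))              # made-up predictors
y_toy = 1 + X_toy @ np.array([2., 3.])        # responses generated from y = 1 + 2*x1 + 3*x2

theta = calc_theta(X_toy, y_toy)
print(np.round(theta, 6))                     # approximately [1. 2. 3.]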
Exercise 7: (2 points) - Given a linear regression parameter (theta), complete the function lr_predictor to meet these requirements.

- Return a function which takes one parameter (a data matrix as a Numpy array). This function will augment the data matrix and then compute regression estimates based on theta.

To compute predictions based on the data matrix $X$ and a coefficient vector $\theta$, complete the following steps:
- Create an augmented data matrix $X_p$ by concatenating a column of ones to the left of $X$. This is not the same orientation as seen in Notebook 12 or the lecture videos.
- Perform the matrix-vector multiplication $X_p \theta$. The prediction vector is the result.
For example, if given the input np.array([1,2,3]) for theta, a call to lr_predictor should produce a function which makes predictions on new data.

demo_predictor = lr_predictor(np.array([1,2,3]))

Here demo_predictor is a function that will make predictions based on theta. We can make predictions on the following data.

X0 = np.array([[112.89440365,  65.62227416],
               [111.84947876,  58.21425899],
               [ 99.58399277,  72.10462519],
               [ 89.75919282,  80.56690485],
               [106.24550229,  79.71511156]])
X1 = np.array([[102.37686976,  73.0361458 ],
               [108.56479013,  70.62355242],
               [ 96.92681804,  56.06983186],
               [101.33747938,  66.66655047],
               [ 93.75484396,  43.88275971]])
X2 = np.array([[106.42275221,  62.33434278],
               [108.38137453,  67.76061615],
               [112.14494972,  64.73565864],
               [110.38170377,  64.6506213 ],
               [113.28106798,  75.74780834]])

The demo below makes the following calls and should produce the same output.

demo_predictor(X0)
[423.65562978 399.34173449 416.48186111 422.21910019 452.63633926]
demo_predictor(X1)
[424.86217692 430.00023752 363.06313166 403.67461017 320.15796705]
demo_predictor(X2)
[400.84853276 421.04459751 419.49687536 415.71527144 454.80556098]
In [ ]:
def lr_predictor(theta):
    def predict(X):
        ### BEGIN SOLUTION
        X_p = np.append(np.ones((len(X), 1)), X, axis=1)
        return X_p.dot(theta)
        ### END SOLUTION
    return predict
In [ ]:
# demo
demo_predictor = lr_predictor(np.array([1,2,3]))
X0 = np.array([[112.89440365,  65.62227416],
               [111.84947876,  58.21425899],
               [ 99.58399277,  72.10462519],
               [ 89.75919282,  80.56690485],
               [106.24550229,  79.71511156]])
X1 = np.array([[102.37686976,  73.0361458 ],
               [108.56479013,  70.62355242],
               [ 96.92681804,  56.06983186],
               [101.33747938,  66.66655047],
               [ 93.75484396,  43.88275971]])
X2 = np.array([[106.42275221,  62.33434278],
               [108.38137453,  67.76061615],
               [112.14494972,  64.73565864],
               [110.38170377,  64.6506213 ],
               [113.28106798,  75.74780834]])
print('demo_predictor(X0) ->', demo_predictor(X0), '\ndemo_predictor(X1) ->', demo_predictor(X1), '\ndemo_predictor(X2) ->', demo_predictor(X2))
In [ ]:
# exercise 7 test cell
reload(run_tests)
### BEGIN HIDDEN TESTS
import dill
import hashlib
with open('resource/asnlib/public/hash_check.pkl', 'rb') as f:
    hash_check = dill.load(f)
hash_check('run_tests.py', 'resource/asnlib/public/run_tests.py')
del hash_check
del dill
del hashlib
### END HIDDEN TESTS
### BEGIN HIDDEN TESTS
if False:
    run_tests.generate_test_cases_lr(lr_predictor)
### END HIDDEN TESTS
run_tests.chk_ex7(lr_predictor, 50)
print('Passed. Please submit your work now.')
time() - time_start
In [ ]:
from run_tests import lr_test, pred_check
Fin.
You have reached the end of the exam. Make sure you have submitted your work and congratulations on finishing the semester!
The additional content below is optional.
Epilogue
We defined 8 useful functions in the exercises above. Here, we will show how to use these functions to perform a simple regression analysis on some real estate sales data. Set the variable run_demo to True in order to run the demonstration code in the cells below.

Note: the demonstration below will not run correctly if you have not solved all of the exercises in the exam.
In [ ]:
# set `run_demo` to `True` if you want to enable the demo epilogue in your work area
run_demo = False
Loading the data
Below we will load the data from a file and take a look at the columns.
In [ ]:
if run_demo:
    df_raw = pd.read_csv('resource/asnlib/publicdata/train.csv')
    print(df_raw.columns)
Determining age
One step in our preprocessing that is not covered in the code we wrote above is the age of the home. In the raw data, we have the year that the home was
last remodeled. It makes sense to compute the age by subtracting the remodel year from the year the home was sold and use this value instead of the
years. We will do this in the cell below.
In [ ]:
if run_demo:
    df_raw['Age'] = df_raw['YrSold'] - df_raw['YearRemodAdd']
Cleaning the data
Based on an examination of the data descriptions we have elected to drop the columns 'GarageYrBlt', 'Id', and 'MoSold' because they won't add useful
information to our model. Additionally, we are dropping 'YearBuilt' and 'YearRemodAdd' because they are incorporated into our age calculation.
The column 'MSSubClass' has all numerical values, but it is actually categorical. We need to change the type of this column so that it will be interpreted as
such.
Now we will check which columns have missing values so that we can deal with them appropriately.
In [ ]:
if run_demo:
    print(df_raw.loc[:, df_raw.isnull().sum() > 0].columns)
The columns 'LotFrontage' and 'MasVnrArea' are numerical and have missing values. The remaining columns with missing values are categorical. We will replace the numerical missing values with 0.0 and the categorical ones with 'None'. We do not want to include homes without electricity in our analysis at all.

Below, we set our cleaning parameters to meet our requirements and run clean. Note that 'Electrical' is not included because we want to drop any rows with missing values in that column.
In [ ]:
if run_demo:
    params = {
        'cols_to_drop': ['GarageYrBlt', 'Id', 'MoSold', 'YearBuilt', 'YearRemodAdd'],
        'cols_to_zero': ['LotFrontage', 'MasVnrArea'],
        'cols_to_none': ['Alley', 'MasVnrType', 'BsmtQual', 'BsmtCond',
                         'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'FireplaceQu',
                         'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
                         'PoolQC', 'Fence', 'MiscFeature'],
        'cols_to_string': ['MSSubClass']
    }
    df_clean = clean(df_raw, params)
Partitioning the data
Next we need to separate our data into the response, numerical variables, and categorical variables. We will be performing the log transform on the response variable. We should also split each of these into training and test sets. We will hold out roughly 30% for testing.

There are a lot of categorical variables. One-hot encoding them all would give us a very high dimensional model. We used an off-the-shelf recursive feature elimination with cross-validation algorithm to select the most informative categorical variables - 'LotConfig', 'HouseStyle' and 'ExterQual'. Selecting these three categorical variables will give us 17 one-hot variables.
In [ ]:
if run_demo:
    # separate response, numerical, and categorical data
    y, X_num, X_cat = partition_cols(df_clean, 'SalePrice', True)
    # one-hot encode categorical data
    X_cat_oh = one_hot_all(X_cat[['LotConfig', 'HouseStyle', 'ExterQual']])
    # construct `data` tuple and partition into training and test
    data = (y.reshape((-1,1)), X_num, X_cat_oh)
    y_test, y_train, X_num_test, X_num_train, X_cat_test, X_cat_train = train_test_split(data, 0.3, ind_splitter)
Linear regression with categorical variables
Here we will use our categorical training set and the training response to train a linear regression model.
In [ ]:
if run_demo:
    cat_lr_predict = lr_predictor(calc_theta(X_cat_train, y_train))
In the cell below, we use our model to make predictions on the training and test sets.
In [ ]:
if run_demo:
    pred_cat_train = cat_lr_predict(X_cat_train)
    pred_cat_test = cat_lr_predict(X_cat_test)
Now we can evaluate the predictions. We will use the metrics r_sq and r_sq_adj. We will also plot the predictions against the observations. Blue points have observed values as the "x" coordinate and predicted values as the "y". The orange dots are on the line where predicted values are the same as observed values. You should interpret the horizontal distance between the observations and predictions as the prediction error.

Note: due to the log transformation, the (log-scale) purchase prices will be between 11 and 13.
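For a rough sense of scale (this conversion is not in the original notebook), undoing the log transform maps that range back to dollar prices:

import numpy as np
print(np.exp([11., 13.]))   # ~[ 59874., 442413.] -- roughly $60k to $440k sale prices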
In [ ]:
if run_demo:
    def plot_predictions(pred, y, p, title):
        from matplotlib import pyplot as plt
        fig = plt.figure()
        fig.suptitle(f'{title} - R_sq: {r_sq(y, pred)}; R_sq_adj: {r_sq_adj(y, pred, p)}')
        fig.supxlabel('Observations')
        fig.supylabel('Predictions')
        plt.scatter(y, pred)
        plt.scatter(pred, pred)
    plot_predictions(pred_cat_train, y_train, 18, 'Training Set')
    plot_predictions(pred_cat_test, y_test, 18, 'Test Set')
These predictions are not great, but it is interesting that the lot configuration, house type, and overall exterior condition can explain around half of the
variance in the sale price. It's also worth noting that there is not much difference in accuracy between the training and test data. The model actually
performed a little bit better on the test data!
Linear regression with first 3 principal components of numerical variables
Now we will see how the model does with the numerical data. We will
determine the parameters necessary to transform the data into its first 3 principal components then use those to train a linear regression model and make
predictions.
In [ ]:
if run_demo:
    # determine transformation parameters
    pca_params = prin_comps(X_num_train, n=3)
    # instantiate pca transformer
    transformer = pca_transformer(*pca_params)
    # transform training and test data
    X_num_train_pc = transformer(X_num_train)
    X_num_test_pc = transformer(X_num_test)
    # train regression model
    pca_lr_predict = lr_predictor(calc_theta(X_num_train_pc, y_train))
    # make predictions
    pred_num_train = pca_lr_predict(X_num_train_pc)
    pred_num_test = pca_lr_predict(X_num_test_pc)
Below we will calculate the metrics and plot the predictions and observations in the same way we did for the categorical variables.
In [ ]:
if run_demo:
    plot_predictions(pred_num_train, y_train, 18, 'Training Set')
    plot_predictions(pred_num_test, y_test, 18, 'Test Set')
The numerical predictors performed a bit better, even though we only used the first 3 principal components. Again, it is worth mentioning that there does not appear to be any overfitting, as the model made only slightly better predictions on the training data.
Linear regression using both types of predictors
Below we will combine the numerical and categorical predictors into one data set. If the information contained in the categorical variables captures
different qualities than the information in the numerical variables, we may be able to improve model performance even more! However, there is a chance
that there is correlation between the two and not much information will be gained by combining.
In [ ]:
if run_demo:
    # Combine categorical and numerical variables
    X_train = np.append(X_num_train_pc, X_cat_train.values, axis=1)
    X_test = np.append(X_num_test_pc, X_cat_test.values, axis=1)
    # Train regression model
    full_predictor = lr_predictor(calc_theta(X_train, y_train))
    # Make predictions
    pred_train = full_predictor(X_train)
    pred_test = full_predictor(X_test)
    # Plot predictions
    plot_predictions(pred_train, y_train, 18, 'Training Set')
    plot_predictions(pred_test, y_test, 18, 'Test Set')
The model with the combined data performed marginally better, even accounting for the additional variables. There was not much improvement, though, which indicates that the categorical variables did not carry much additional information.
Fin!
- for real this time.