University of Wisconsin - Madison
Department of Industrial and Systems Engineering
ISyE 521: Machine Learning in Action
Fall 2022
Homework 2
Assigned: November 8, 2022
Due: before 4pm on November 22, 2022
Instructions
• You are encouraged to work in groups of two. Please make sure your (and your
  partner’s) name and student number are clearly visible on the first page.
• Homework is to be submitted online through the course website.
• Please include all code files used to generate your results.
• This course follows a strict lateness policy: late homework will not be accepted.
• A correct answer does not guarantee full credit and a wrong answer does not
  guarantee loss of credit. You should concisely indicate your reasoning and show all
  relevant work. The grade on each problem is based on our judgment of your level
  of understanding as reflected by what you have written.
• Write clearly! Scans of hand-written work are accepted, but must be legible; if we
  can’t read it, we can’t grade it.
Problem 1
Predicting Life Expectancy in the United States during the 1970s
In this problem, you will continue to investigate the data from StateData.csv, which
includes state-level data collected during the 1970s for all fifty US states. The data
dictionary is given in Table 1.
Table 1: Data dictionary for StateData.csv
Variable          Description
Population        Population
Income            Per capita income
Illiteracy        Illiteracy rates (percentage of state population)
LifeExp           Life expectancy (in years)
Murder            Murder and non-negligent manslaughter rate per 100,000 population
HighSchoolGrad    High-school graduation rate
Frost             Average number of days (over the last 30 years) with a minimum
                  temperature below freezing in the state capital or a major city
Area              Land area (square miles)
Longitude         Longitude of the center of the state
Latitude          Latitude of the center of the state
Region            The region that the state belongs to (Northeast, South, North
                  Central, or West)
1. Copy the code from the solution to Homework 1, Problem 2, Question 2, part e.
Run this code with different random seeds. What do you see?
2. Now change the KFold function and all the GridSearchCV functions to use 3 folds.
Does this help? Why?
3. We will attempt to further mitigate the problem by using repeated cross validation.
You will need to manually create a for loop that repeats the code starting at the
KFold line (make sure you move the print functions outside all the loops). Set the
KFold random state to the repetition number. Use 25 repetitions.
(a) Create a boxplot that shows the distribution of R-squared values for each model
(you should have 75 test set R-squared values for each model). Which model
performed best?
(b) Are the results concerning? Why or why not?
4. We will now expand the code to include two additional models: random forest and
AdaBoost. For random forest, we will search over five values for the number of trees:
[10, 100, 250, 500, 1000], and for AdaBoost, we will search over four values for the
learning rate: [0.001, 0.01, 0.1, 1].
(a) Create a boxplot that shows the distribution of R-squared values for all four
models. Which model performed best?
(b) Are the results still concerning? Why or why not? Why is this happening?
What might cause this?
5. How long did it take to run your experiments? Do you remember how many
observations are in this data set?
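
The repeated cross-validation loop described in Questions 3 and 4 can be sketched as
follows. This is a minimal illustration under stated assumptions, not the Homework 1
solution: the data are synthetic (make_regression stands in for StateData.csv), Ridge
is a hypothetical placeholder for the earlier models, the function name
repeated_cv_scores is our own, and the grids and repetition count are cut down so the
sketch runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold


def repeated_cv_scores(X, y, models, n_repeats=25, n_splits=3):
    """Return {model name: list of test-fold R-squared values}."""
    scores = {name: [] for name in models}
    for rep in range(n_repeats):
        # A new shuffle per repetition: random_state is the repetition number.
        outer = KFold(n_splits=n_splits, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(X):
            for name, (estimator, grid) in models.items():
                # The inner grid search only ever sees the training fold.
                search = GridSearchCV(estimator, grid, cv=n_splits, scoring="r2")
                search.fit(X[train_idx], y[train_idx])
                pred = search.predict(X[test_idx])
                scores[name].append(r2_score(y[test_idx], pred))
    return scores


# Synthetic stand-in with 50 rows, mirroring the fifty states (assumption).
X, y = make_regression(n_samples=50, n_features=8, noise=10.0, random_state=0)

# Reduced grids for speed; the assignment lists the full grids to search.
models = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "random_forest": (RandomForestRegressor(random_state=0), {"n_estimators": [10, 100]}),
    "adaboost": (AdaBoostRegressor(random_state=0), {"learning_rate": [0.01, 0.1, 1]}),
}

# Two repetitions keep the demo short; the assignment asks for 25.
scores = repeated_cv_scores(X, y, models, n_repeats=2)
for name, values in scores.items():
    q1, med, q3 = np.percentile(values, [25, 50, 75])
    print(f"{name}: n={len(values)}, Q1={q1:.2f}, median={med:.2f}, Q3={q3:.2f}")
```

With n_repeats=25 and 3 folds, each list holds 75 values, and the lists can be passed
to matplotlib's boxplot function to draw the requested plot.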
Problem 2
Predicting invasive species continued
We will consider more complex models (as compared to logistic regression) to predict
the likelihood that an invasive tree species is present on a particular plot of land in the
forest. The file SpeciesData.csv includes a large data set with 11,684 observations and 54
features. Our target (column name Target) is a binary variable that indicates whether or
not the invasive species is present. The features are shown in Table 2.
Table 2: Data dictionary for SpeciesData.csv
Variable      Description
Elevation     Vertical elevation from sea level (meters)
Aspect        The direction that the plot of land is facing (degrees azimuth)
Slope         The average slope of the plot of land (degrees)
HdistWater    Horizontal distance to nearest surface water (meters)
VdistWater    Vertical distance to nearest surface water (meters)
HdistRoad     Horizontal distance to nearest road (meters)
Shade9        Hill-shade index at 9am during summer solstice (0-255)
Shade12       Hill-shade index at 12pm during summer solstice (0-255)
Shade3        Hill-shade index at 3pm during summer solstice (0-255)
HdistFire     Horizontal distance to nearest wildfire ignition point (meters)
WA*           4 binary columns represent the wilderness area designation (Rawah,
              Neota, Comanche, Cache)
Soil*         40 binary columns represent the soil type (1-40)
We will compare three methods: logistic regression, random forest, and SVMs. Before
training any models, make sure you scale your data to the interval [0, 1] using the
MinMaxScaler function. Then use train_test_split to split the data into a training set
(70%) and testing set (30%). Make sure that you set shuffle=True and random_state=1.
Use the GridSearchCV function with 3-fold cross-validation on the training data to
find the best hyperparameters. Note that these experiments may take time to run.
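
The preprocessing and search steps above can be sketched as follows. This is only an
illustration under stated assumptions: make_classification generates synthetic data in
place of SpeciesData.csv, the logistic regression grid from Question 1 serves as the
example estimator, and scoring="roc_auc" is our choice for the search since the
assignment does not name a scoring metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for SpeciesData.csv (assumption): 54 features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=54, n_informative=10, random_state=0
)

# Scale every feature to [0, 1], then split 70/30 as the assignment specifies.
X_scaled = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, shuffle=True, random_state=1
)

# 3-fold grid search over C for an L1-penalized (lasso) logistic regression.
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10]},
    cv=3,
    scoring="roc_auc",  # assumption: select hyperparameters by AUC
)
search.fit(X_train, y_train)

# AUC is computed from the predicted probability of the positive class.
train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(train_auc, 3), round(test_auc, 3))
```

The same pattern applies to the other two models by swapping the estimator and the
parameter grid.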
1. Logistic regression. Use a lasso penalty with the liblinear solver. Search over five
values of C = [0.001, 0.01, 0.1, 1, 10].
(a) What is the best value for C?
(b) What is the corresponding test set AUC? What is the training set AUC?
(c) Do you think overfitting will be a problem with this model? Explain.
2. Random forest. Search over five different numbers of trees: [10, 100, 1000, 5000, 10000].
(a) What is the best number of trees?
(b) How many trees until you observe diminishing returns?
(c) What is the test set AUC for the best number of trees? What is the training set
AUC?
(d) Do you think overfitting will be a problem with this model? Explain.
3. SVM. Search over four different kernels: [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’].
(a) What is the best kernel?
(b) What is the corresponding test set AUC? What is the training set AUC?
(c) Do you think overfitting will be a problem with this model? Explain.
(d) Do you think this is the best SVM you can create? How can you improve the
model? Do you think (with enough work) the SVM will be able to beat the
random forest? (Note: you do not need to run anything.)
4. Which model performed best?
Which model took the longest to train and test?
Do you think the performance improvement is worth the computational time for
random forest? For the SVM?
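
One way to gather the numbers needed for the comparison in Question 4 is to time each
grid search and record its AUC values, as in the sketch below. It rests on several
assumptions: the data are synthetic, the random forest grid is shortened so the
example finishes quickly, the helper name auc is our own, and timings will vary by
machine.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for SpeciesData.csv (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
X = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=1
)


def auc(search, X, y):
    """AUC from decision scores when available, else from class-1 probabilities."""
    best = search.best_estimator_
    if hasattr(best, "decision_function"):
        return roc_auc_score(y, best.decision_function(X))
    return roc_auc_score(y, best.predict_proba(X)[:, 1])


candidates = {
    # Shortened grid for the demo; the assignment searches up to 10000 trees.
    "random_forest": (RandomForestClassifier(random_state=0), {"n_estimators": [10, 100]}),
    "svm": (SVC(), {"kernel": ["linear", "poly", "rbf", "sigmoid"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    start = time.perf_counter()
    search = GridSearchCV(estimator, grid, cv=3, scoring="roc_auc").fit(X_train, y_train)
    results[name] = {
        "best": search.best_params_,
        "fit_seconds": time.perf_counter() - start,
        "train_auc": auc(search, X_train, y_train),
        "test_auc": auc(search, X_test, y_test),
    }

for name, row in results.items():
    print(name, row)
```

Comparing fit_seconds against the gap in test_auc gives a concrete basis for judging
whether the extra computation is worthwhile.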
Problem 3
This question will focus on ensemble models.
1. List three differences between bagging and boosting.
2. Choose one of your differences and explain how it impacts you as a practitioner.
3. Why is overfitting more of a concern with boosting as compared to bagging?
4. How is stacking different from boosting and bagging?
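
For readers who want to experiment while answering these questions, the sketch below
builds one example of each ensemble type in scikit-learn. The data are synthetic, and
the base learners and settings are illustrative assumptions rather than recommended
choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensembles = {
    # Bagging: decision trees fit on bootstrap samples, predictions combined by voting.
    "bagging": BaggingClassifier(n_estimators=25, random_state=0),
    # Boosting: weak learners fit one after another on reweighted data.
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=0),
    # Stacking: a final estimator is trained on the base models' predictions.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("logit", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

# Mean 3-fold cross-validated accuracy for each ensemble.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=3).mean() for name, model in ensembles.items()
}
for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

Varying n_estimators for the bagging and boosting models is a quick way to observe
how each one responds to added capacity.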
Problem 4
This question will focus on support vector machines.
1. What is the difference between a soft and hard margin SVM? Give an example
where you might prefer a soft margin.
2. What does the kernel trick allow us to do? Draw a conceptual example.
3. Why do we solve the dual formulation of the SVM optimization problem (instead
of the primal formulation)?
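
A small experiment can make the role of the kernel concrete. The sketch below is an
illustration on synthetic data (two concentric rings from make_circles, our choice of
example) that fits a linear-kernel SVM and an RBF-kernel SVM and compares their
training accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in the original 2-D space separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Same data, same default C; only the kernel differs.
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)

print(f"linear kernel training accuracy: {linear_acc:.2f}")
print(f"rbf kernel training accuracy:    {rbf_acc:.2f}")
```

Plotting the two classes alongside each model's decision boundary is one possible
starting point for the conceptual drawing requested in Question 2.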