Homework2

University of Wisconsin - Madison
Department of Industrial and Systems Engineering
ISyE 521: Machine Learning in Action
Fall 2022

Homework 2

Assigned: November 8, 2022
Due: before 4pm on November 22, 2022

Instructions

You are encouraged to work in groups of two. Please make sure your (and your partner’s) name and student number are clearly visible on the first page. Homework is to be submitted online through the course website. Please include all code files used to generate your results.

This course follows a strict lateness policy: late homework will not be accepted.

A correct answer does not guarantee full credit and a wrong answer does not guarantee loss of credit. You should concisely indicate your reasoning and show all relevant work. The grade on each problem is based on our judgment of your level of understanding as reflected by what you have written.

Write clearly! Scans of hand-written work are accepted, but must be legible; if we can’t read it, we can’t grade it.

Problem 1
Predicting Life Expectancy in the United States during the 1970s

In this problem, you will continue to investigate the data from StateData.csv, which includes state-level data collected during the 1970s for all fifty US states. The data dictionary is given in Table 1.

Table 1: Data dictionary for StateData.csv

Variable        Description
Population      Population
Income          Per capita income
Illiteracy      Illiteracy rates (percentage of state population)
LifeExp         Life expectancy (in years)
Murder          Murder and non-negligent manslaughter rate per 100,000 population
HighSchoolGrad  High-school graduation rate
Frost           Average number of days (over the last 30 years) with a minimum temperature below freezing in the state capital or a major city
Area            Land area (square miles)
Longitude       Longitude of the center of the state
Latitude        Latitude of the center of the state
Region          The region that the state belongs to (Northeast, South, North Central, or West)

1. Copy the code from the solution to Homework 1, Problem 2, Question 2, part e. Run this code with different random seeds. What do you see?
2. Now change the KFold function to 3 folds and all the GridSearchCV functions to 3 folds. Does this help? Why?
3. We will attempt to further mitigate the problem by using repeated cross-validation. You will need to manually create a for loop that repeats the code starting at the KFold line (make sure you move the print functions outside all the loops). Set the KFold random state to the repetition number. Use 25 repetitions.
   (a) Create a boxplot that shows the distribution of R-squared values for each model (you should have 75 test set R-squared values for each model). Which model performed best?
   (b) Are the results concerning? Why or why not?
4. We will now expand the code to include two additional models: random forest and AdaBoost.
   For random forest, we will search over five values for the number of trees: [10, 100, 250, 500, 1000], and for AdaBoost, we will search over four values for the learning rate: [0.001, 0.01, 0.1, 1].
   (a) Create a boxplot that shows the distribution of R-squared values for all four models. Which model performed best?
   (b) Are the results still concerning? Why or why not? Why is this happening? What might cause this?
5. How long did it take to run your experiments? Do you remember how many observations are in this data set?

Problem 2
Predicting invasive species continued

We will consider more complex models (as compared to logistic regression) to predict the likelihood that an invasive tree species is present on a particular plot of land in the forest. The file SpeciesData.csv includes a large data set with 11,684 observations and 54 features. Our target (column name Target) is a binary variable that indicates whether or not the invasive species is present. The features are shown in Table 2.

Table 2: Data dictionary for SpeciesData.csv

Variable    Description
Elevation   Vertical elevation from sea level (meters)
Aspect      The direction that the plot of land is facing (degrees azimuth)
Slope       The average slope of the plot of land (degrees)
HdistWater  Horizontal distance to nearest surface water (meters)
VdistWater  Vertical distance to nearest surface water (meters)
HdistRoad   Horizontal distance to nearest road (meters)
Shade9      Hill-shade index at 9am during summer solstice (0-255)
Shade12     Hill-shade index at 12pm during summer solstice (0-255)
Shade3      Hill-shade index at 3pm during summer solstice (0-255)
HdistFire   Horizontal distance to nearest wildfire ignition point (meters)
WA*         4 binary columns represent the wilderness area designation (Rawah, Neota, Comanche, Cache)
Soil*       40 binary columns represent the soil type (1-40)

We will compare three methods: logistic regression, random forest, and SVMs. Before training any models, make sure you scale your data to the interval [0, 1] using the MinMaxScaler function. Then use train_test_split to split the data into a training set (70%) and testing set (30%).
Make sure that you set shuffle=True and random_state=1. Use the GridSearchCV function with 3-fold cross-validation on the training data to find the best hyperparameters. Note that these experiments may take time to run.

1. Logistic regression. Use a lasso penalty with the liblinear solver. Search over five values of C: [0.001, 0.01, 0.1, 1, 10].
   (a) What is the best value for C?
   (b) What is the corresponding test set AUC? What is the training set AUC?
   (c) Do you think overfitting will be a problem with this model? Explain.
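
The preprocessing and tuning steps described above for the logistic regression can be sketched roughly as follows. This is a minimal illustration, not the course solution: the data is a synthetic stand-in generated with make_classification because SpeciesData.csv is not reproduced here, and the choice of "roc_auc" as the grid-search scoring metric is an assumption, since the handout only asks for AUC on the final training and test sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

# ASSUMPTION: synthetic stand-in for SpeciesData.csv (features X, binary Target y).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# Scale every feature to [0, 1] before splitting, in the order the handout gives.
X_scaled = MinMaxScaler().fit_transform(X)

# 70% training / 30% testing, shuffled, with the random state from the handout.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, shuffle=True, random_state=1
)

# Lasso (L1) penalty with the liblinear solver; 3-fold grid search over C.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid,
    cv=3,
    scoring="roc_auc",  # ASSUMPTION: the handout does not name the CV metric
)
search.fit(X_train, y_train)

# AUC is computed from predicted probabilities of the positive class.
train_auc = roc_auc_score(y_train, search.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])

print("best C:", search.best_params_["C"])
print(f"train AUC: {train_auc:.3f}  test AUC: {test_auc:.3f}")
```

The same scaled split and the same GridSearchCV pattern would plausibly carry over to the other two methods by swapping the estimator and the parameter grid.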
2. Random forest. Search over five different numbers of trees: [10, 100, 1000, 5000, 10000].
   (a) What is the best number of trees?
   (b) How many trees until you observe diminishing returns?
   (c) What is the test set AUC for the best number of trees? What is the training set AUC?
   (d) Do you think overfitting will be a problem with this model? Explain.
3. SVM. Search over four different kernels: ['linear', 'poly', 'rbf', 'sigmoid'].
   (a) What is the best kernel?
   (b) What is the corresponding test set AUC? What is the training set AUC?
   (c) Do you think overfitting will be a problem with this model? Explain.
   (d) Do you think this is the best SVM you can create? How can you improve the model? Do you think (with enough work) the SVM will be able to beat the random forest? (Note: you do not need to run anything.)
4. Which model performed best? Which model took the longest to train and test? Do you think the performance improvement is worth the computational time for random forest? For the SVM?

Problem 3

This question will focus on ensemble models.

1. List three differences between bagging and boosting.
2. Choose one of your differences and explain how it impacts you as a practitioner.
3. Why is overfitting more of a concern with boosting as compared to bagging?
4. How is stacking different from boosting and bagging?

Problem 4

This question will focus on support vector machines.

1. What is the difference between a soft and hard margin SVM? Give an example where you might prefer a soft margin.
2. What does the kernel trick allow us to do? Draw a conceptual example.
3. Why do we solve the dual formulation of the SVM optimization problem (instead of the primal formulation)?
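
Problem 1, Question 3 describes repeated cross-validation in words: an outer loop over 25 repetitions, a 3-fold KFold seeded with the repetition number, and a 3-fold GridSearchCV inside each training fold. A minimal sketch of that loop structure might look like the following. The data is a synthetic stand-in for StateData.csv, and the two models (Ridge and Lasso) and their alpha grids are placeholders chosen for illustration, since the Homework 1 solution code is not shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold

# ASSUMPTION: synthetic stand-in for StateData.csv (50 rows, like the fifty states).
X, y = make_regression(n_samples=50, n_features=8, noise=10.0, random_state=0)

# ASSUMPTION: placeholder models and grids; the Homework 1 models are not shown.
models = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "lasso": (Lasso(max_iter=10000), {"alpha": [0.1, 1.0, 10.0]}),
}

n_repeats, n_folds = 25, 3
scores = {name: [] for name in models}

for rep in range(n_repeats):
    # The repetition number seeds the outer split, so each repeat uses new folds.
    outer = KFold(n_splits=n_folds, shuffle=True, random_state=rep)
    for train_idx, test_idx in outer.split(X):
        for name, (estimator, grid) in models.items():
            # Inner 3-fold search tunes hyperparameters on the training fold only.
            search = GridSearchCV(estimator, grid, cv=n_folds)
            search.fit(X[train_idx], y[train_idx])
            predictions = search.predict(X[test_idx])
            scores[name].append(r2_score(y[test_idx], predictions))

# Printing happens once, outside all loops.
for name, values in scores.items():
    print(f"{name}: n={len(values)}, median R^2={np.median(values):.3f}")
```

Each list in scores ends up with 25 x 3 = 75 test set R-squared values, which is the count the question mentions; those lists can then be passed to a boxplot function such as matplotlib's boxplot.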