cs380hw6
- percent of Gleason scores 4 or 5 (pgg45)

The data use a fixed Train / Test split, which we will load below.

    # All finalised needed imports
    import pandas as pd
    import itertools
    from sklearn.model_selection import cross_val_score
    from sklearn import linear_model
    import matplotlib.pyplot as plt
    import numpy as np
    import warnings

    # Suppress warnings
    warnings.filterwarnings('ignore')

    df_train = pd.read_csv('prostate_train.csv')
    df_train.head()

         lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45      lpsa
    0 -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0 -0.430783
    1 -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0 -0.162519
    2 -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20 -0.162519
    3 -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0 -0.162519
    4  0.751416  3.432373   62 -1.386294    0 -1.386294        6      0  0.371564

Problem 1: Your First Regression

We will begin by fitting our first ordinary least squares regression model. But first we need to do a little data management. You will notice that the data exist in a single data frame (one for Train and one for Test). The last column of the data frame ('lpsa') is the quantity that we wish to predict (the Y-value).

(a)

Do the following in the cell below,

- Create X_train and Y_train by separating out the last column ('lpsa') and storing it in Y_train
- Do the same for X_test and Y_test
- Display the DataFrame X_train

    features = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45']
    output = ['lpsa']

    X_train = df_train[features]
    Y_train = df_train[output]

    # Display training inputs
    print(X_train)

           lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45
    0   -0.579818  2.769459   50 -1.386294    0 -1.386294        6      0
    1   -0.994252  3.319626   58 -1.386294    0 -1.386294        6      0
    2   -0.510826  2.691243   74 -1.386294    0 -1.386294        7     20
    3   -1.203973  3.282789   58 -1.386294    0 -1.386294        6      0
    4    0.751416  3.432373   62 -1.386294    0 -1.386294        6      0
    ..        ...       ...  ...       ...  ...       ...      ...    ...
    62   3.246491  4.101817   68 -1.386294    0 -1.386294        6      0
    63   2.532903  3.677566   61  1.348073    1 -1.386294        7     15
    64   2.830268  3.876396   68 -1.386294    1  1.321756        7     60
    65   3.821004  3.896909   44 -1.386294    1  2.169054        7     40
    66   2.882564  3.773910   68  1.558145    1  1.558145        7     80

    [67 rows x 8 columns]
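As an aside, the same split can be done positionally, without hard-coding the column names. This is only an illustrative alternative using the df_train frame loaded above (the _alt variable names are made up here):

    # Alternative split that does not hard-code the column names:
    X_train_alt = df_train.iloc[:, :-1]   # every column except the last
    Y_train_alt = df_train.iloc[:, -1]    # the last column ('lpsa')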
Now we will fit our first model using a single feature ('lcavol'). Do the following in the cell below,

- Train a linear regression model on the 'lcavol' feature
- Compute the R-squared score of the model on the training data
- Scatterplot the training data for the 'lcavol' feature
- Plot the regression line over the scatterplot
- Label the plot axis / title and report the R-squared score

A couple of notes:

- Scikit-learn gets cranky when you pass in single features. In some versions you will need to use X_train['lcavol'].values.reshape(-1, 1)
- To plot the regression line you can create a dense grid of points using numpy.arange, between the min() and max() of the feature values.

Documentation - Scikit-Learn - LinearRegression

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    feature = X_train['lcavol'].values.reshape(-1, 1)

    model = LinearRegression()
    model.fit(feature, Y_train)

    prediction = model.predict(feature)
    r2 = r2_score(Y_train, prediction)

    plt.scatter(feature, Y_train, color='blue', label='Real values')

    x_range = np.arange(feature.min(), feature.max(), 0.1).reshape(-1, 1)
    pred_range = model.predict(x_range)
    plt.plot(x_range, pred_range, color='red', linewidth=2, label='Regression line')

    plt.title('Linear Regression: lcavol vs lpsa')
    plt.xlabel('lcavol')
    plt.ylabel('lpsa')
    plt.legend()
    plt.show()

    print('R-squared score:', r2)

    [Plot: "Linear Regression: lcavol vs lpsa" -- scatter of the training data with the fitted regression line]

    R-squared score: 0.5375164690552882
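As a quick sanity check (illustrative only, reusing the feature, model, and Y_train objects defined above), R-squared is 1 - SS_res / SS_tot, and LinearRegression.score() reports the same quantity as r2_score:

    import numpy as np

    y_true = Y_train.values.ravel()
    y_pred = model.predict(feature).ravel()

    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    print(1 - ss_res / ss_tot)             # manual R-squared
    print(model.score(feature, Y_train))   # should print the same value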
Problem 2: Best Subset Feature Selection

Now we will look at finding the best subset of features out of all possible subsets. To do this, you will implement the Best Feature Subset Selection as presented in lecture (see lecture slides). We will break this into subproblems to walk through it. To help you with this we have provided a function findsubsets(S, k). When passed a set S, this function will return a set of all subsets of size k, which you can iterate through to train models.

    def findsubsets(S, k):
        return set(itertools.combinations(S, k))

(a)

We will start by getting familiar with the findsubsets() function. The variable 'features' was defined previously as a set of all feature names. In the cell do the following:

- Use findsubsets to find all possible subsets of 3 features
- Perform 5-fold cross validation to train a LinearRegression model on each set of 3 features
- Find the model with the highest average R-squared score (scoring='r2')
- Report the best performing set of features and the corresponding R-squared score

Documentation - Scikit-Learn - cross_val_score
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression

    subsets = findsubsets(features, 3)

    best_r2_score = -float('inf')
    best_feature_subset = None

    for subset in subsets:
        subset_features = list(subset)
        X_subset = X_train[subset_features]
        model = LinearRegression()
        r2_scores = cross_val_score(model, X_subset, Y_train, cv=5, scoring='r2')
        avg_r2_score = np.mean(r2_scores)
        if avg_r2_score > best_r2_score:
            best_r2_score = avg_r2_score
            best_feature_subset = subset_features

    print("Best Feature Subset:", best_feature_subset)
    print("Best R-squared Score:", best_r2_score)

    Best Feature Subset: ['lcavol', 'lweight', 'age']
    Best R-squared Score: -6.611191159159722
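A note on the strongly negative cross-validated R-squared values: with cv=5, cross_val_score uses an unshuffled KFold split for regression, and the rows of prostate_train.csv appear to be ordered by lpsa, so each fold is asked to predict a range of the target it never saw during training. An illustrative check with an explicitly shuffled splitter (not required by the assignment):

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.linear_model import LinearRegression
    import numpy as np

    # Shuffled folds avoid training on one end of the lpsa range and
    # testing on the other when the rows are sorted by the target.
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(),
                             X_train[['lcavol', 'lweight', 'age']], Y_train,
                             cv=cv, scoring='r2')
    print(np.mean(scores))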
(b)

Now, repeat the above process for all subsets of all sizes. For each k = 1, ..., 8 find all possible subsets of k features and evaluate a model on each set of features using 5-fold cross validation. Report your findings as follows,

- Produce a scatterplot of R-squared values for every run with subset size on the horizontal axis, and R-squared on the vertical axis (label your plot axes/title)
- Find the best performing model overall and report the R-squared and features for that model

    all_r2_scores = []
    best_r2_score = -float('inf')
    best_feature_subset = None

    for k in range(1, 9):
        subsets = findsubsets(features, k)
        subset_r2_scores = []
        best_k_r2_score = -float('inf')
        best_k_feature_subset = None

        for subset in subsets:
            subset_features = list(subset)
            X_subset = X_train[subset_features]
            model = LinearRegression()
            r2_scores = cross_val_score(model, X_subset, Y_train, cv=5, scoring='r2')
            avg_r2_score = np.mean(r2_scores)
            subset_r2_scores.append(avg_r2_score)
            if avg_r2_score > best_k_r2_score:
                best_k_r2_score = avg_r2_score
                best_k_feature_subset = subset_features

        if best_k_r2_score > best_r2_score:
            best_r2_score = best_k_r2_score
            best_feature_subset = best_k_feature_subset

        all_r2_scores.append(subset_r2_scores)

    subset_sizes = range(1, 9)
    for i, r2_scores in enumerate(all_r2_scores):
        plt.scatter([subset_sizes[i]] * len(r2_scores), r2_scores, color='blue', alpha=0.5)

    plt.scatter(len(features), best_r2_score, color='red', marker='*', s=200, label='Best Overall Model')
    plt.xlabel('Subset')
    plt.ylabel('R-squared Score')
    plt.title('R-squared Score for Different Feature Subsets')
    plt.xticks(subset_sizes)
    plt.legend()
    plt.show()

    print("Best Overall Feature Subset:", best_feature_subset)
    print("Best Overall R-squared Score:", best_r2_score)

    [Plot: "R-squared Score for Different Feature Subsets" -- cross-validated R-squared of every run, grouped by subset size, with the best overall model starred]

    Best Overall Feature Subset: ['lcavol', 'lweight']
    Best Overall R-squared Score: -6.587279708487542

Excellent! You have found the best set of features by brute-force search over all possible features. Good work.
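For scale, the brute-force search above fits one cross-validated model per non-empty subset of the 8 features; a quick count of the work involved:

    import math

    n_subsets = sum(math.comb(8, k) for k in range(1, 9))
    print(n_subsets)   # 255 subsets, i.e. 255 * 5 = 1275 model fits with 5-fold CV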
Problem 3: Ridge Regression

The problem with brute force search over features is that it doesn't scale well. We can do it for 8 features, but we can't do it for larger sets of features. Instead, we will look at a simpler model selection strategy by using L2 regularized linear regression (a.k.a. Ridge Regression).

(a)

Do the following in the cell below,

- Learn a Ridge regression model on training data with alpha=0.5
- Report the learned feature weights using the provided printFeatureWeights function

Documentation - Scikit-Learn - linear_model.Ridge

    def printFeatureWeights(f, w):
        for idx in range(len(f)):
            print('%s : %f' % (f[idx], w[idx]))

    from sklearn.linear_model import Ridge

    ridge_model = Ridge(alpha=0.5)
    ridge_model.fit(X_train, Y_train)

    printFeatureWeights(features, ridge_model.coef_[0])

    lcavol : 0.576706
    lweight : 0.593447
    age : -0.018544
    lbph : 0.145617
    svi : 0.683643
    lcp : -0.193621
    gleason : -0.034175
    pgg45 : 0.009508
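For intuition (not part of the assignment), Ridge regression minimizes ||y - Xw||^2 + alpha * ||w||^2, which has the closed-form solution w = (X^T X + alpha I)^(-1) X^T y when no intercept is fit. A minimal sketch comparing that formula against scikit-learn, assuming fit_intercept=False so the two agree exactly (these weights therefore differ from the intercept-fitted ones reported above):

    import numpy as np
    from sklearn.linear_model import Ridge

    alpha = 0.5
    X = X_train.values
    y = Y_train.values.ravel()

    # Closed-form ridge solution (no intercept, purely for comparison)
    w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

    print(np.allclose(w_closed, w_sklearn))   # expected to print True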
(b)

We chose the regularization coefficient alpha=0.5 somewhat arbitrarily. We now need to perform model selection in order to learn the best value of alpha. We will do that by using cross_val_score over a range of values for alpha. When searching for regularization parameters it is generally good practice to search in log-domain, rather than linear domain. For example, we will search in the range [10^-1, 10^3]. Using Numpy's logspace function this corresponds to the range [-1, 3] in log-domain. In the cell below do the following,

- Create a range of 50 alpha values spaced logarithmically in the range [10^-1, 10^3]
- Perform 5-fold cross-validation of a Ridge regression model for each alpha and record the R-squared score for each run (there will be 5x50 values)
- Report the best R-squared score and the value of alpha that achieves that score
- Use Matplotlib's errorbar() function to plot the average R-squared with 1 standard deviation error bars for each of the 50 alpha values

Documentation - Matplotlib - errorbar
Documentation - Numpy - logspace

    alphas = np.logspace(-1, 3, 50)

    mean_r2_scores = []
    std_r2_scores = []

    for alpha in alphas:
        ridge_model = Ridge(alpha=alpha)
        r2_scores = cross_val_score(ridge_model, X_train, Y_train, cv=5, scoring='r2')
        mean_r2 = np.mean(r2_scores)
        std_r2 = np.std(r2_scores)
        mean_r2_scores.append(mean_r2)
        std_r2_scores.append(std_r2)

    best_r2_index = np.argmax(mean_r2_scores)
    best_r2_score = mean_r2_scores[best_r2_index]
    best_alpha = alphas[best_r2_index]

    print("Best R-squared Score:", best_r2_score)
    print("Best Alpha:", best_alpha)

    plt.errorbar(alphas, mean_r2_scores, yerr=std_r2_scores, fmt='-o', color='r', ecolor='b', capsize=5)
    plt.xscale('log')
    plt.xlabel('Alpha')
    plt.ylabel('Average R-squared Score')
    plt.title('Ridge Regression Model Selection')
    plt.show()

    Best R-squared Score: -7.652146509544281
    Best Alpha: 6.25055192527397

    [Plot: "Ridge Regression Model Selection" -- average R-squared with 1 standard deviation error bars versus alpha (log scale)]
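As an aside, scikit-learn's RidgeCV wraps this kind of alpha search in a single estimator. A minimal cross-check sketch over the same grid (the assignment asks for the explicit loop above, so this is only illustrative):

    from sklearn.linear_model import RidgeCV
    import numpy as np

    alphas = np.logspace(-1, 3, 50)
    ridge_cv = RidgeCV(alphas=alphas, cv=5, scoring='r2')
    ridge_cv.fit(X_train, Y_train)

    print(ridge_cv.alpha_)   # should agree with the best_alpha found above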
Now that we have a good model we will look at what it has learned. Train the Ridge regression model using the selected alpha from the previous cell. Report the learned feature weights using the printFeatureWeights() function previously provided.

    ridge_model = Ridge(alpha=6.25055192527397)
    ridge_model.fit(X_train, Y_train)

    printFeatureWeights(features, ridge_model.coef_[0])

    lcavol : 0.550337
    lweight : 0.430148
    age : -0.014697
    lbph : 0.154290
    svi : 0.387669
    lcp : -0.102993
    gleason : -0.046865
    pgg45 : 0.009399

Problem 4: LASSO

Ridge regression performs shrinkage of the weights using the L2 norm. This will drive some weights close to zero, but not exactly zero. The LASSO method replaces the L2 penalty with an L1 penalty. Due to properties of L1 discussed in lecture, this has the effect of learning exactly zero weights on some features when it is supported by the data. In this problem we will repeat the procedure of learning a Ridge regression model, but we will instead use LASSO. Let's start by fitting a LASSO model with a fixed alpha value.

(a)

In the cell below do the following,

- Fit LASSO with alpha=0.1
- Use printFeatureWeights() to report the learned feature weights

Documentation - Scikit-Learn - linear_model.Lasso

    from sklearn.linear_model import Lasso

    lasso_model = Lasso(alpha=0.1)
    lasso_model.fit(X_train, Y_train)

    printFeatureWeights(features, lasso_model.coef_)

    lcavol : 0.538986
    lweight : 0.184891
    age : -0.006352
    lbph : 0.128433
    svi : 0.000000
    lcp : -0.000000
    gleason : -0.000000
    pgg45 : 0.007727
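The exact zeros above come from the soft-thresholding (L1 proximal) update used by coordinate-descent LASSO; a small illustration of that operator (not part of the assignment):

    import numpy as np

    def soft_threshold(z, t):
        """Proximal operator of the L1 norm: shrinks z toward 0 and
        returns exactly 0 whenever |z| <= t."""
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    print(soft_threshold(np.array([-0.3, 0.05, 0.8]), 0.1))   # -> [-0.2  0.   0.7]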
(b)

Now we will find a good value of alpha using cross-validation. Due to differences in how the LASSO model is optimized, there are dedicated methods for performing cross-validation on LASSO. Scikit-Learn's LassoLarsCV class performs LASSO-specific cross-validation using an optimized Least Angle Regression (LARS) algorithm. In the cell below do the following,

- Using LassoLarsCV perform 20-fold cross validation to solve all solution paths for Lasso
- Plot mean +/- standard error of mean squared error versus regularization coefficient

    from sklearn.linear_model import LassoLarsCV

    lasso_model = LassoLarsCV(cv=20)
    lasso_model.fit(X_train, Y_train)

    best_alpha1 = lasso_model.alpha_
    cv_alphas = lasso_model.cv_alphas_

    mse = np.mean(lasso_model.mse_path_, axis=1)
    std_error = np.std(lasso_model.mse_path_, axis=1)

    plt.errorbar(cv_alphas, mse, yerr=std_error, fmt='-o', color='r', ecolor='b', capsize=4)
    plt.xscale('log')
    plt.xlabel('Alpha')
    plt.ylabel('Mean Squared Error')
    plt.title('LASSO Regression Model')
    plt.show()

    print("Best Alpha:", best_alpha1)
    print("Corresponding Mean Squared Error:", np.min(mse))

    [Plot: "LASSO Regression Model" -- mean squared error with error bars versus alpha (log scale)]

    Best Alpha: 0.011311646934499087
    Corresponding Mean Squared Error: 0.6322147354954939
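Note that the coordinate-descent LassoCV estimator is a common alternative to LassoLarsCV and typically selects a similar alpha; an illustrative cross-check (not required here):

    from sklearn.linear_model import LassoCV

    lasso_cv = LassoCV(cv=20, random_state=0)
    lasso_cv.fit(X_train, Y_train.values.ravel())

    print(lasso_cv.alpha_)   # expected to be close to the LassoLarsCV alpha above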
Problem 5: Evaluate on Test

In this problem we will train all of the best performing models chosen by Best Subsets, Ridge Regression, and LASSO. We will evaluate and compare these models on the test data. This dataset uses a standard train / test split so we begin by loading the test data below.

    df_test = pd.read_csv('prostate_test.csv')
    df_test.head()

         lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45      lpsa
    0  0.737164  3.473518   64  0.615186    0 -1.386294        6      0  0.765468
    1 -0.776529  3.539509   47 -1.386294    0 -1.386294        6      0  1.047319
    2  0.223144  3.244544   63 -1.386294    0 -1.386294        6      0  1.047319
    3  1.205971  3.442019   57 -1.386294    0 -0.430783        7      5  1.398717
    4  2.059239  3.501043   60  1.474763    0  1.348073        7     20  1.658228

(a)

Recall that all of the data are stored in a single table, with the final column being the output 'lpsa'. Before evaluating on test you must first create X_test and Y_test input/outputs, where Y_test is the final column of the DataFrame and X_test contains all other columns.

    df_test = pd.read_csv('prostate_test.csv')

    X_test = df_test[features]
    Y_test = df_test[output]

    print(X_test)
    print(Y_test)
          lcavol   lweight  age      lbph  svi       lcp  gleason  pgg45
    0   0.737164  3.473518   64  0.615186    0 -1.386294        6      0
    1  -0.776529  3.539509   47 -1.386294    0 -1.386294        6      0
    2   0.223144  3.244544   63 -1.386294    0 -1.386294        6      0
    3   1.205971  3.442019   57 -1.386294    0 -0.430783        7      5
    4   2.059239  3.501043   60  1.474763    0  1.348073        7     20
    ...
    [30 rows x 8 columns]

            lpsa
    0   0.765468
    1   1.047319
    2   1.047319
    3   1.398717
    4   1.658228
    ...
    28  5.143124
    29  5.582932
    [30 rows x 1 columns]
(b) Best Subsets

In Problem 2 you found the best subset of features for an ordinary least squares regression model by enumerating all feature subsets. Using the best selected features, train the model below and report mean squared error on the test set.

Documentation - Scikit-Learn - mean_squared_error

    from sklearn.metrics import mean_squared_error

    X_test_subset = X_test[best_feature_subset]

    model = LinearRegression()
    model.fit(X_train[best_feature_subset], Y_train)

    predictions = model.predict(X_test_subset)
    mse = mean_squared_error(Y_test, predictions)

    print("MSE on Test Set:", mse)

    MSE on Test Set: 0.4924823476805036

(c) Ridge Regression

In the cell below, train a Ridge Regression model using the optimal regularization coefficient (alpha) found in Problem 3. Report mean squared error on the test set.

    ridge_model = Ridge(alpha=best_alpha)
    ridge_model.fit(X_train, Y_train)

    predictions1 = ridge_model.predict(X_test)
    mse1 = mean_squared_error(Y_test, predictions1)

    print("MSE on Test Set:", mse1)

    MSE on Test Set: 0.527349459277083

(d) LASSO Regression

Now, train and evaluate your final model. Train a Lasso regression using the optimal alpha found in Problem 4 and report MSE on the test set.

    lasso_model = Lasso(alpha=best_alpha1)
    lasso_model.fit(X_train, Y_train)

    predictions2 = lasso_model.predict(X_test)
    mse2 = mean_squared_error(Y_test, predictions2)

    print("MSE on Test Set:", mse2)

    MSE on Test Set: 0.5077059122563892

(e) Compare feature weights for each model

Now let's compare the feature weights learned by each of the three models. In the cell below, report the regression weights for each feature under the Best Subset, Ridge, and Lasso models evaluated above. To make the output easier to read, please use a Pandas DataFrame to display the data. To do this, create a Pandas DataFrame where each column contains regression weights for one of the previous models, and then display that DataFrame in the standard fashion. You should also provide feature names on each of the rows.

Documentation - Pandas - DataFrame
    max_length = max(len(best_feature_subset), len(ridge_model.coef_), len(lasso_model.coef_))

    best_feature_subset = np.pad(best_feature_subset, (0, max_length - len(best_feature_subset)), mode='constant', constant_values=np.nan)
    ridge_model.coef_ = np.pad(ridge_model.coef_[0], (0, max_length - len(ridge_model.coef_)), mode='constant', constant_values=np.nan)
    lasso_model.coef_ = np.pad(lasso_model.coef_[0], (0, max_length - len(lasso_model.coef_)), mode='constant', constant_values=np.nan)

    weight = {
        'Feature': features,
        'Best Subset': best_feature_subset,
        'Ridge Regression': ridge_model.coef_,
        'LASSO Regression': lasso_model.coef_
    }

    weight_df = pd.DataFrame(weight)

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In[115], line 14
          5 lasso_model.coef_ = np.pad(lasso_model.coef_[0], (0, max_length - len(lasso_model.coef_)), mode='constant', constant_values=np.nan)
          7 weight = {
          8     'Feature': features,
          9     'Best Subset': best_feature_subset,
         10     'Ridge Regression': ridge_model.coef_,
         11     'LASSO Regression': lasso_model.coef_
         12 }
    ---> 14 weight_df = pd.DataFrame(weight)

    File ~\anaconda3\Lib\site-packages\pandas\core\frame.py:664, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    [... traceback continues into the pandas DataFrame constructor; the cell fails with a ValueError ...]
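The ValueError above comes from mixing arrays of different lengths (and from overwriting the models' coef_ attributes and best_feature_subset with padded values). One possible fix, sketched here under the assumption that features, best_feature_subset, best_alpha, and best_alpha1 are as defined earlier, is to refit each model and align the weights by feature name with pandas Series, so that features outside the best subset simply show up as NaN:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    # Recover the plain list of subset names (the failing cell above padded
    # best_feature_subset with 'nan' strings before it raised).
    subset_features = [f for f in best_feature_subset if f in features]

    # Refit each model so the coefficients are not the padded ones from above.
    y = Y_train.values.ravel()
    ols_subset = LinearRegression().fit(X_train[subset_features], y)
    ridge_final = Ridge(alpha=best_alpha).fit(X_train, y)
    lasso_final = Lasso(alpha=best_alpha1).fit(X_train, y)

    # Series indexed by feature name align automatically; missing features become NaN.
    weight_df = pd.DataFrame({
        'Best Subset': pd.Series(ols_subset.coef_, index=subset_features),
        'Ridge Regression': pd.Series(ridge_final.coef_, index=features),
        'LASSO Regression': pd.Series(lasso_final.coef_, index=features),
    }, index=features)

    weight_df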