01 PYTHON notebook -- Sarah Gets a Diamond

Linear Regression in Python, "Sarah Gets a Diamond" case

This notebook provides the code for building several linear models for the "Sarah Gets a Diamond" case. It contains the following steps:
1. Load (and install, if needed) the required packages and libraries.
2. Data engineering: load the data, clean it, transform as needed, visualize it, and select evaluation metrics.
3. Build linear models: OLS, Ridge regression, Lasso regression.
4. Apply a log-transformation, which improves quality.
5. Fine-tune using visualization and feature engineering.
6. Hurray -- we are done!

0. Basic arrangements

0.1. A very brief introduction to Python

The code in this notebook is written in Python 3. We highly recommend installing the newest stable version available; at the time of writing, that is 3.8. If you are already familiar with another programming language, this overview may speed up your start: https://learnxinyminutes.com/docs/python/. The official Python tutorial is available at https://docs.python.org/3/tutorial/ and is also a good place to start if you are new to programming. The complete documentation is at https://docs.python.org/3/.

Practical programming largely comes down to understanding the basic language constructs and using your favorite search engine, of which Google (www.google.com) definitely works. Good questions and answers can usually be found on StackOverflow (www.stackoverflow.com) and StackExchange (www.stackexchange.com), as well as on Quora (www.quora.com). Relevant blogs appear on Medium (www.medium.com) and Habrahabr (www.habr.com), and useful discussions sometimes appear on Reddit (www.reddit.com). Finally, numerous online courses are available on Udemy (www.udemy.com), Coursera (www.coursera.org), and DataCamp (https://www.datacamp.com/), among others.

0.2. Getting the required libraries

In [1]:
# Importing some standard Python libraries:
import copy
import math
This notebook uses several third-party libraries. They are not installed with Python by default; however, if you install the Anaconda distribution (https://www.anaconda.com/), they come bundled with Python.
1. NumPy: the de-facto standard library for linear algebra in Python; documentation at https://numpy.org/doc/
2. Pandas: the most commonly used library for data engineering; documentation at https://pandas.pydata.org
3. Scikit-Learn: contains the majority of standard machine-learning algorithms, ready to be applied out of the box; https://scikit-learn.org/stable/
4. Statsmodels: documentation at https://www.statsmodels.org/stable/index.html
5. Matplotlib: provides basic plotting functionality in Python; https://matplotlib.org

In [2]:
# Anaconda libraries installation:
# 1. Check the conda environment and the installed packages and libraries
# import sys
# !conda env list
# !conda list
# 2. Use these if the required libraries/packages (pandas, numpy, scikit-learn, statsmodels) are not installed
# !conda install pandas
# !conda install scikit-learn
# !conda install statsmodels
# !conda install matplotlib

In [3]:
# Installation with the pip installer:
# !pip3 install numpy pandas scikit-learn statsmodels matplotlib

In [4]:
# Load pandas under the short name pd. This alias is an industry standard.
# Same for numpy, scikit-learn and statsmodels.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import sklearn as sk
import matplotlib as mp

# This allows Jupyter-inlined plots.
import matplotlib.pyplot as plt
%matplotlib inline
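A quick way to confirm that the imports above resolved to working installations is to print the library versions. This is a minimal sketch, not part of the original case code; the exact version numbers will differ from machine to machine.

import statsmodels
print("numpy       ", np.__version__)
print("pandas      ", pd.__version__)
print("statsmodels ", statsmodels.__version__)
print("scikit-learn", sk.__version__)
print("matplotlib  ", mp.__version__)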
1. Data Engineering

In [5]:
# Reading the data from the local machine.
sarah_raw_data = pd.read_csv("01 CSV data -- Sarah Gets a Diamond.csv", sep=',')
# Update the path to point to where your data is, e.g.,
# sarah_raw_data = pd.read_csv("C:\\Users\\A.OVCHINNIKOV\\01 CSV data -- Sarah Gets a Diamond.csv", sep=',')
print("Shape of Data: ", sarah_raw_data.shape)
# Display the first few rows of data.
sarah_raw_data.head()

Shape of Data:  (9142, 9)

Out[5]:
   ID  Carat Weight    Cut  Color Clarity Polish Symmetry Report   Price
0   1          1.10  Ideal      H     SI1     VG       EX    GIA  5169.0
1   2          0.83  Ideal      H     VS1     ID       ID   AGSL  3470.0
2   3          0.85  Ideal      H     SI1     EX       EX    GIA  3183.0
3   4          0.91  Ideal      E     SI1     VG       VG    GIA  4370.0
4   5          0.83  Ideal      G     SI1     EX       EX    GIA  3171.0

In [6]:
# Display the last few rows of data.
sarah_raw_data.tail()

Out[6]:
        ID  Carat Weight        Cut  Color Clarity Polish Symmetry Report  Price
9137  9138          0.96      Ideal      F     SI1     EX       EX    GIA    NaN
9138  9139          1.02  Very Good      E    VVS1     EX        G    GIA    NaN
9139  9140          1.51       Good      I     VS1      G        G    GIA    NaN
9140  9141          1.24      Ideal      H     VS2     VG       VG    GIA    NaN
9141  9142          0.79      Ideal      I     VS1     EX       EX    GIA    NaN
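Beyond head() and tail(), it can help to check the column types and basic summary statistics before any modelling. A minimal sketch using standard pandas methods (not part of the original case code):

sarah_raw_data.info()        # column dtypes and non-null counts (Price is missing in the rows to be predicted)
sarah_raw_data.describe()    # summary statistics for the numeric columns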
Below is a workaround for Google Colab. If you are running Jupyter on a local machine, this step can be skipped. The data first needs to be uploaded to the cloud and then read into a dataframe. More information can be found at https://colab.research.google.com. Manuals: https://colab.research.google.com/notebooks/intro.ipynb and https://colab.research.google.com/notebooks/io.ipynb.

In [7]:
# from google.colab import files
# uploaded = files.upload()
# import io
# sarah_raw_data = pd.read_csv(io.BytesIO(uploaded['01 CSV data -- Sarah Gets a Diamond.csv']), sep=',')

Since some of the data are categorical, a natural initial step is to convert the categorical features with one-hot encoding.

In [8]:
# Introduce dummy (binary) variables -- convert the non-numerical columns into dummies.
sarah_data_full = pd.get_dummies(data=sarah_raw_data,
                                 columns=['Cut', 'Color', 'Clarity', 'Polish', 'Symmetry', 'Report'],
                                 drop_first=True)
sarah_data_full.head()

Out[8]:
   ID  Carat Weight   Price  Cut_Good  Cut_Ideal  Cut_Signature-Ideal  Cut_Very Good  Color_E  Color_F  ...
0   1          1.10  5169.0         0          1                    0              0        0        0
1   2          0.83  3470.0         0          1                    0              0        0        0
2   3          0.85  3183.0         0          1                    0              0        0        0
3   4          0.91  4370.0         0          1                    0              0        1        0
4   5          0.83  3171.0         0          1                    0              0        0        0

5 rows × 25 columns

In [9]:
# From the dataset description, the first 6000 rows contain prices, and the rest do not.
N_train = 6000
sarah_data = sarah_data_full[:N_train]
sarah_predict = sarah_data_full[N_train:]
print("Missing values for sarah_data:", np.any(sarah_data.isnull().values))
print("Missing values for the test:", np.any(sarah_predict.isnull().values))

Missing values for sarah_data: False
Missing values for the test: True
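The hard-coded N_train = 6000 relies on the rows being ordered as described. A more defensive variant, sketched below, derives the same split from the missingness of Price itself; the names sarah_data_alt and sarah_predict_alt are illustrative and not part of the original notebook.

priced_mask = sarah_data_full['Price'].notnull()
sarah_data_alt = sarah_data_full[priced_mask]        # rows with a known price (the modelling data)
sarah_predict_alt = sarah_data_full[~priced_mask]    # rows whose price is to be predicted
print(sarah_data_alt.shape, sarah_predict_alt.shape)  # expected: (6000, 25) and (3142, 25)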
In [10]:
# Define the target (y_sarah) to be predicted and the features (X_sarah) to be used for prediction.
# By the way, do not use this naming in industry code -- it is not very readable. We proceed this way
# for simplicity; in practice the PEP 8 guidelines should be followed to keep code readable
# (https://www.python.org/dev/peps/pep-0008/).
y_sarah = sarah_data[['Price']]
X_sarah = copy.deepcopy(sarah_data).drop(['ID', 'Price'], axis=1)
X_predict_sarah = copy.deepcopy(sarah_predict).drop(['ID', 'Price'], axis=1)
X_sarah.head()

Out[10]:
   Carat Weight  Cut_Good  Cut_Ideal  Cut_Signature-Ideal  Cut_Very Good  Color_E  Color_F  Color_G  Color_H  ...
0          1.10         0          1                    0              0        0        0        0        1
1          0.83         0          1                    0              0        0        0        0        1
2          0.85         0          1                    0              0        0        0        0        1
3          0.91         0          1                    0              0        1        0        0        0
4          0.83         0          1                    0              0        0        0        1        0

5 rows × 23 columns

1.1. Data Visualization

Let us plot carat weight versus price. A reasonable option is to use pandas' built-in plotting functionality. More information is available at https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html
In [11]:
sarah_data.plot.scatter('Carat Weight', 'Price')

Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x1df9b3d41d0>

A histogram of the carat weights is presented below.

In [12]:
sarah_data['Carat Weight'].plot.hist()

Out[12]: <matplotlib.axes._subplots.AxesSubplot at 0x1df9b739cf8>

1.2. Train and Test

To check the quality of our models and perform model selection, let us prepare a split of the data into train and test sets. This can be done in two ways:
a. Take the first 5000 rows for training and the remaining 1000 for testing. This approach works when the data are uniform.
b. Take a random 5000 rows for training and the remaining 1000 for testing. This approach works in general, and we will discuss it later in the course.
In [13]:
# Take the first 5000 of the 6000 rows for training and the remaining 1000 for testing.
train_size = 5000
y_train_sarah, y_test_sarah = y_sarah[:train_size], y_sarah[train_size:]
X_train_sarah, X_test_sarah = X_sarah[:train_size], X_sarah[train_size:]

In [14]:
# The random approach uses the train_test_split function.
from sklearn.model_selection import train_test_split
X_train_sarah_rnd, X_test_sarah_rnd, y_train_sarah_rnd, y_test_sarah_rnd = train_test_split(X_sarah, y_sarah, test_size=1000, random_state=42)

Documentation for train_test_split is available at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

1.3. Score function

Having several models, we need a metric to select the best one. For this case, we will use the mean absolute percentage error (MAPE), computed as below.

In [15]:
def compute_mape_score(y_test_input, y_pred_input):
    y_test_input = np.array(y_test_input).reshape(-1,)
    y_pred_input = np.array(y_pred_input).reshape(-1,)
    percent_errors = np.abs((y_test_input - y_pred_input) / y_test_input) * 100
    return np.mean(np.array(percent_errors))

SCORING_CONFIG = {'mape': compute_mape_score}
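In formula form, MAPE = (100/n) · Σ |y_i − ŷ_i| / |y_i|, averaged over the n test observations. A tiny sanity check of the function above, with values chosen purely for illustration:

compute_mape_score([100, 200], [110, 190])   # (10/100 + 10/200) / 2 * 100 = 7.5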
2. Linear Models

2.1. Ordinary Least Squares (OLS)

In [16]:
# Linear regression via the statsmodels package.
X_sarah_sm = sm.add_constant(X_sarah)  # statsmodels does not add an intercept by default, so we add it explicitly.
ols_sm = sm.OLS(y_sarah, X_sarah_sm).fit()  # Fit the linear regression (ordinary least squares).
ols_sm.summary()  # Summary of the model results.
C:\Users\ao37\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
Out[16]:
OLS Regression Results
Dep. Variable:      Price              R-squared:           0.864
Model:              OLS                Adj. R-squared:      0.863
Method:             Least Squares      F-statistic:         1645.
Date:               Thu, 16 Apr 2020   Prob (F-statistic):  0.00
Time:               14:33:57           Log-Likelihood:      -57908.
No. Observations:   6000               AIC:                 1.159e+05
Df Residuals:       5976               BIC:                 1.160e+05
Df Model:           23
Covariance Type:    nonrobust

                          coef     std err         t     P>|t|      [0.025      0.975]
const                2.642e+04    1954.801    13.517     0.000    2.26e+04    3.03e+04
Carat Weight         1.839e+04     105.110   175.005     0.000    1.82e+04    1.86e+04
Cut_Good             -322.5957     362.520    -0.890     0.374   -1033.265     388.074
Cut_Ideal             274.6927     355.384     0.773     0.440    -421.988     971.373
Cut_Signature-Ideal  1677.1908     439.716     3.814     0.000     815.189    2539.193
Cut_Very Good         -35.6357     345.517    -0.103     0.918    -712.973     641.702
Color_E             -2327.3317     200.473   -11.609     0.000   -2720.331   -1934.332
Color_F             -3078.2407     189.816   -16.217     0.000   -3450.349   -2706.132
Color_G             -4799.5517     178.263   -26.924     0.000   -5149.012   -4450.092
Color_H             -6361.2739     188.072   -33.824     0.000   -6729.963   -5992.584
Color_I             -8039.9918     192.814   -41.698     0.000   -8417.977   -7662.006
Clarity_IF          -2.709e+04    1912.086   -14.170     0.000   -3.08e+04   -2.33e+04
Clarity_SI1          -3.69e+04    1898.452   -19.436     0.000   -4.06e+04   -3.32e+04
Clarity_VS1         -3.397e+04    1899.196   -17.887     0.000   -3.77e+04   -3.02e+04
Clarity_VS2         -3.531e+04    1898.911   -18.596     0.000    -3.9e+04   -3.16e+04
Clarity_VVS1        -3.058e+04    1909.252   -16.014     0.000   -3.43e+04   -2.68e+04
Clarity_VVS2        -3.243e+04    1901.916   -17.053     0.000   -3.62e+04   -2.87e+04
Polish_G                1.4906     201.753     0.007     0.994    -394.018     396.999
Polish_ID            -584.7362     744.971    -0.785     0.433   -2045.148     875.676
Polish_VG            -156.7413     126.621    -1.238     0.216    -404.964      91.482
Symmetry_G           -430.4773     188.917    -2.279     0.023    -800.822     -60.132
Symmetry_ID           148.1049     772.420     0.192     0.848   -1366.118    1662.328
Symmetry_VG          -291.3244     133.555    -2.181     0.029    -553.141     -29.508
Report_GIA            184.3296     347.567     0.530     0.596    -497.027     865.687

Omnibus:        4389.321   Durbin-Watson:      2.040
Prob(Omnibus):  0.000      Jarque-Bera (JB):   149885.754
Skew:           3.118      Prob(JB):           0.00
Kurtosis:       26.678     Cond. No.           227.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
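Individual numbers from this summary can also be pulled out programmatically from the fitted statsmodels results object rather than read off the table. A small sketch, not in the original notebook, using ols_sm from the cell above:

# The Carat Weight coefficient (~1.84e4): each extra carat adds roughly $18,400
# to the predicted price, holding the other features fixed.
print(ols_sm.params['Carat Weight'])
print(ols_sm.conf_int().loc['Carat Weight'])   # its 95% confidence interval
print(ols_sm.pvalues['Carat Weight'])          # and its p-value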
In [17]:
# Linear regression with scikit-learn.
from sklearn.linear_model import LinearRegression

# Fit a linear regression with vector y as the dependent variable and matrix X as the independent variables.
ols_sk = LinearRegression().fit(X_sarah, y_sarah)

# Print the estimated intercept.
print("Intercept = ", ols_sk.intercept_)
# Print the model coefficients (in the order of the variables in X).
print("Model coefficients = ", ols_sk.coef_)
# Print the R-squared score.
print("R^2 =", ols_sk.score(X_sarah, y_sarah))

# Note: there is no easy way to obtain other model outputs (p-values, etc.) from sklearn,
# as these outputs are not present in other, non-regression, machine-learning models.
# From a machine-learning point of view this rarely matters: what counts is whether the model works,
# which is usually judged from cross-validation and/or the score on a test set.

Intercept =  [26423.89481315]
Model coefficients =  [[ 1.83948330e+04 -3.22595688e+02  2.74692681e+02  1.67719076e+03
  -3.56356533e+01 -2.32733175e+03 -3.07824070e+03 -4.79955175e+03
  -6.36127388e+03 -8.03999181e+03 -2.70937885e+04 -3.68981919e+04
  -3.39709797e+04 -3.53125448e+04 -3.05753935e+04 -3.24342346e+04
   1.49058530e+00 -5.84736225e+02 -1.56741252e+02 -4.30477274e+02
   1.48104862e+02 -2.91324371e+02  1.84329597e+02]]
R^2 = 0.8635908847404312
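The raw coefficient array is hard to read on its own; pairing it with the feature names makes it directly comparable to the statsmodels table. A minimal sketch (not in the original notebook):

coef_by_feature = pd.Series(ols_sk.coef_.ravel(), index=X_sarah.columns)
print(coef_by_feature.sort_values())   # e.g. Carat Weight ≈ 1.84e4, matching the statsmodels output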
We have trained our baseline OLS linear model. Let us now test its performance (on the test data, of course).

In [18]:
# Fit the model on the train set.
lr_sk = LinearRegression().fit(X_train_sarah, y_train_sarah)
# Predict prices on the test set.
y_pred_sarah_lr = lr_sk.predict(X_test_sarah)
# Negative prices are irrelevant.
y_pred_sarah_lr[y_pred_sarah_lr < 0] = 0
# Compute and output the score.
print("Linear Regression MAPE = ", compute_mape_score(y_test_sarah, y_pred_sarah_lr))

Linear Regression MAPE =  26.198235972929858

Conclusion: the baseline linear model has an error of approximately 26%.

2.2. Lasso Regression

Let us try an approach with regularization. L1-regularization is implemented in Lasso; description and documentation can be found at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

In [19]:
from sklearn.linear_model import Lasso
# Same as before: first fit, then predict. The numpy conversions are done for compatibility with the method signature.
lasso = Lasso(alpha=1, tol=1e-2, max_iter=1e7).fit(np.array(X_train_sarah), np.array(y_train_sarah).reshape(-1,))
y_pred_lasso = lasso.predict(np.array(X_test_sarah))
# Negative prices are irrelevant.
y_pred_lasso[y_pred_lasso < 0] = 0
print("Lasso MAPE = ", compute_mape_score(y_test_sarah, y_pred_lasso))

Lasso MAPE =  26.243429369438296
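For reference, with regularization strength alpha, scikit-learn's Lasso minimizes (1/(2n))·||y − Xβ||² + alpha·Σ|β_j|, an L1 penalty that can shrink some coefficients exactly to zero, while Ridge (used below) minimizes ||y − Xβ||² + alpha·Σβ_j², an L2 penalty that only shrinks them towards zero. A larger alpha means stronger shrinkage, which is why alpha is the parameter tuned by cross-validation in the next cells.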
The initial result is not better than the baseline. No problem, it sometimes happens: every model is wrong, but some of them are useful. Let us perform parameter tuning. The standard, and rather old, tool for this is cross-validation; a tool being old does not mean it is bad. Documentation for Lasso-specific cross-validation is available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html

The limitation of this method is that it does not let the user supply a custom scoring function. No problem, we will deal with that later.

In [20]:
from sklearn.linear_model import LassoCV
# Set of values to choose from:
ALPHAS = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# Initialize the cross-validation with 5 folds and fit.
lasso_cv = LassoCV(alphas=ALPHAS, max_iter=1e7, tol=1e-2, cv=5)
lasso_cv.fit(np.array(X_train_sarah), np.array(y_train_sarah).reshape(-1,))
# Now get the predictions and compute the score.
y_pred_lasso = lasso_cv.predict(X_test_sarah)
y_pred_lasso[y_pred_lasso < 0] = 0
print("Lasso MAPE = ", compute_mape_score(y_test_sarah, y_pred_lasso))

Lasso MAPE =  26.19823882856578
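LassoCV exposes which regularization strength the 5-fold cross-validation selected, which is useful for understanding why the score barely changed. A quick check (a sketch; lasso_cv is the object fitted above):

print("Selected alpha:", lasso_cv.alpha_)
print("MSE path shape:", lasso_cv.mse_path_.shape)   # (n_alphas, n_folds) grid of cross-validation errors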
2.3. Ridge Regression

Ridge regression uses another form of regularization: a weighted sum of squared coefficients. This functional form allows for faster optimization. The Scikit-Learn documentation for the approach is available at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

In [21]:
from sklearn.linear_model import Ridge
# Fit the model on the train set.
ridge = Ridge(alpha=0.5).fit(X_train_sarah, y_train_sarah)
# Predict prices on the test set.
y_pred_sarah_ridge = ridge.predict(X_test_sarah)
y_pred_sarah_ridge[y_pred_sarah_ridge < 0] = 0
# Compute and output the score.
print("Ridge Regression MAPE = ", compute_mape_score(y_test_sarah, y_pred_sarah_ridge))

Ridge Regression MAPE =  26.35067279831501

For a better parameter search, let us do cross-validation. Built-in cross-validation for Ridge regression is documented at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

In [22]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import make_scorer
# Values to test.
ALPHAS = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# Since we have a custom score function, we need a custom scorer.
mape_scorer = make_scorer(compute_mape_score, greater_is_better=False)
# Run cross-validation with 5 folds. After that, get the predictions.
cv_ridge = RidgeCV(alphas=ALPHAS, cv=5, scoring=mape_scorer).fit(X_train_sarah, y_train_sarah)
y_pred_cv_ridge = cv_ridge.predict(X_test_sarah)
y_pred_cv_ridge[y_pred_cv_ridge < 0] = 0
# And print the metric once the predictions are obtained.
print("Ridge CV MAPE = ", compute_mape_score(y_test_sarah, y_pred_cv_ridge))

Ridge CV MAPE =  20.46076029498597

We have a significant improvement: hyper-parameter tuning sometimes helps.
3. Logarithmic Model

Let us modify the model. Intuitively, by its structure, a linear model can predict a negative price, which is hardly meaningful; therefore, we naturally want to force it to produce positive prices. One option is to predict the logarithm of the price. Since we now predict a logarithm rather than the original value, it is also natural to move the quantitative features to a logarithmic scale.

3.1. Simple Linear Regression

In [23]:
y_train_log, y_test_log = copy.deepcopy(y_train_sarah), copy.deepcopy(y_test_sarah)
X_train_log, X_test_log = copy.deepcopy(X_train_sarah), copy.deepcopy(X_test_sarah)
X_log, y_log = copy.deepcopy(X_sarah), copy.deepcopy(y_sarah)
X_predict_log = copy.deepcopy(X_predict_sarah)

# Apply the log-transform to the quantitative variables, which are price and carat weight.
for item in [y_train_log, y_test_log, X_train_log, X_test_log, X_log, y_log, X_predict_log]:
    if 'Price' in item.columns:
        item['Price'] = np.log(item['Price'])
    if 'Carat Weight' in item.columns:
        item['Carat Weight'] = np.log(item['Carat Weight'])
# Note that the nature of the data changes: it is now log-price instead of price, and log-weight instead of weight.

Note that we fit a double-log model, which means that we assume a relationship between the percentage change in carat weight and the percentage change in price.

In [24]:
lm_log = LinearRegression().fit(X_train_log, y_train_log)
y_pred_lm_log = np.exp(lm_log.predict(X_test_log))
print("Log-Log Model MAPE = ", compute_mape_score(y_pred_lm_log, y_test_sarah))

Log-Log Model MAPE =  7.9379372408370905

The result is a significant improvement compared to the 20% achieved above.
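In the double-log specification, log(Price) = β0 + β1·log(Carat Weight) + (dummy terms), the coefficient β1 is an elasticity: a 1% increase in carat weight is associated with roughly a β1% increase in price. A small sketch for extracting it from lm_log fitted above; the names carat_idx and carat_elasticity are illustrative only, and the indexing assumes the column order of X_train_log:

carat_idx = X_train_log.columns.get_loc('Carat Weight')
carat_elasticity = lm_log.coef_[0][carat_idx]   # coef_ has shape (1, n_features) because y is a one-column DataFrame
print("Estimated elasticity of price w.r.t. carat weight:", carat_elasticity)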
3.2. Ridge Regression

As above, let us try ridge regression, which helped before. Naturally, a cross-validated parameter search is better than none, so let us do it the right way.

In [25]:
from sklearn.linear_model import RidgeCV

# A new loss function is needed, as we now work with logarithms.
def log_loss_mape(y_test_log, y_pred_log):
    return compute_mape_score(np.exp(y_test_log), np.exp(y_pred_log))

# Create a scorer to meet the CV interface.
log_scorer = make_scorer(log_loss_mape, greater_is_better=False)
# Set of parameters to check.
ALPHAS = (0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0)
# Fit and predict.
cv_ridge_log = RidgeCV(alphas=ALPHAS, cv=5, scoring=log_scorer).fit(X_train_log, y_train_log)
y_pred_cv_ridge_log = cv_ridge_log.predict(X_test_log)
print("Log Ridge CV MAPE = ", compute_mape_score(y_test_sarah, np.exp(y_pred_cv_ridge_log)))

Log Ridge CV MAPE =  7.812154026736443

The error is slightly better, but not by much: the logarithmic model now scores about 7.8%.

3.3. Lasso Regression

Let us introduce here another way of doing cross-validation. The general, method-agnostic way is to use the GridSearchCV class. Documentation and description are provided at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
In [26]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
# As before, the values we would like to check.
ALPHAS = [0.0001, 0.0005, 0.001, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 150.0, 200.0, 250.0, 500.0]
# A parameter grid is a way to provide the parameters to iterate over.
param_grid = {
    'alpha': ALPHAS,
    'selection': ['cyclic', 'random']
}
# Initialize and fit.
lasso_cv = GridSearchCV(Lasso(), param_grid, cv=5, refit=True, scoring=log_scorer)
lasso_cv.fit(np.array(X_train_log), np.array(y_train_log))
# Predict and print the score.
y_pred_lasso_log = lasso_cv.predict(X_test_log)
print("Log Lasso MAPE = ", log_loss_mape(y_test_log, y_pred_lasso_log))

Log Lasso MAPE =  7.804833782344718
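GridSearchCV keeps the per-candidate cross-validation results, which can help verify that the chosen alpha does not sit at the edge of the grid. A minimal sketch (lasso_cv is the fitted search object above; mean_test_score is the negated MAPE, because the scorer was built with greater_is_better=False):

cv_results = pd.DataFrame(lasso_cv.cv_results_)
print(cv_results[['param_alpha', 'param_selection', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())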
4. Visual Analysis and Further Improvements

Let us compare the model predictions visually.

In [27]:
import matplotlib.pyplot as plt

# Plot the test prices against the best linear prediction.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.scatter.html
plt.scatter(y_test_sarah, y_pred_sarah_ridge, color='blue')

# Axis labels.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.ylabel.html?highlight=ylabel#matplotlib.pyplot.ylabel
plt.ylabel('Predicted price, Linear model')
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.xlabel.html?highlight=xlabel#matplotlib.pyplot.xlabel
plt.xlabel('Actual price')

# The 45-degree line indicates where there is no error.
# https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html?highlight=pyplot.plot#matplotlib.pyplot.plot
plt.plot([0, 70000], [0, 70000], color='black', lw=2)
plt.show

Out[27]: <function matplotlib.pyplot.show(*args, **kw)>
In [28]:
# The same plot for the log-model.
plt.scatter(np.exp(y_test_log), np.exp(y_pred_lasso_log), color='red')
plt.ylabel('Predicted price, Log-Log model')
plt.xlabel('Actual price')
plt.plot([0, 70000], [0, 70000], color='black', lw=2)
plt.show

Out[28]: <function matplotlib.pyplot.show(*args, **kw)>

Note that the predictions in the second graph lie closer to the 45-degree line. Moreover, the larger deviations at higher prices in the first figure are themselves a motivation for trying logarithmic features.

Further Model Improvements

Let us continue with some additional feature engineering. One way to do this is to add interactions, which are simply products of several features. In our example, we will use interactions of color and weight.
In [29]:
X_train_interact, X_test_interact = copy.deepcopy(X_train_log), copy.deepcopy(X_test_log)

# The loop below adds the interactions "manually".
for X_set in [X_train_interact, X_test_interact]:
    for color_name in ['Color_E', 'Color_F', 'Color_G', 'Color_H', 'Color_I']:
        X_set['Carat Weight:' + color_name] = X_set['Carat Weight'] * X_set[color_name]

lasso_cv.fit(np.array(X_train_interact), np.array(y_train_log).reshape(-1,))
y_pred_interact = lasso_cv.predict(X_test_interact)
print("Log Lasso Interact MAPE = ", log_loss_mape(y_test_log, y_pred_interact))

C:\Users\ao37\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 2.326945633950345, tolerance: 0.19979296050361048
  positive)

Log Lasso Interact MAPE =  7.500687994903503

Great news! The error improves to 7.5%.

In [30]:
print("Best CV Parameters", lasso_cv.best_params_)
print("Best Estimator: ", lasso_cv.best_estimator_)
print("Lasso Coefficients: ", lasso_cv.best_estimator_.coef_)
print("Intercept: ", lasso_cv.best_estimator_.intercept_)

Best CV Parameters {'alpha': 0.0001, 'selection': 'random'}
Best Estimator:  Lasso(alpha=0.0001, copy_X=True, fit_intercept=True, max_iter=1000,
      normalize=False, positive=False, precompute=False, random_state=None,
      selection='random', tol=0.0001, warm_start=False)
Lasso Coefficients:  [ 2.11633277  0.02941359  0.087802    0.23358381  0.05822614 -0.07068121
 -0.11374927 -0.20008998 -0.29922452 -0.43542648  0.11461715 -0.43931926
 -0.19923273 -0.27853613  0.         -0.07199608 -0.03578932  0.00658848
 -0.02242716 -0.02422842  0.         -0.01991049  0.03362982 -0.06966822
 -0.06653052 -0.11936785 -0.22548757 -0.25676175]
Intercept:  9.035093878158339

There is also a built-in way to create interactions, via PolynomialFeatures. The intuition of its usage is provided below; documentation is available at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [31]:
from sklearn.preprocessing import PolynomialFeatures
# All quadratic interactions.
poly = PolynomialFeatures(2)
# Only products of different features.
poly = PolynomialFeatures(2, interaction_only=True)
# This is how an application looks. See the documentation for a fuller understanding.
# poly.fit_transform(XYZ)
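As a sketch of how PolynomialFeatures could reproduce the carat-weight-by-color interactions built manually above: the column list and the names interaction_cols / interacted are illustrative, and on older scikit-learn versions get_feature_names() would be used in place of get_feature_names_out().

interaction_cols = ['Carat Weight', 'Color_E', 'Color_F', 'Color_G', 'Color_H', 'Color_I']
poly = PolynomialFeatures(2, interaction_only=True, include_bias=False)
interacted = poly.fit_transform(X_train_log[interaction_cols])
print(interacted.shape)                               # the original columns plus their pairwise products
print(poly.get_feature_names_out(interaction_cols))   # names of the generated columns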
5. Hurray -- we are done!

To obtain predictions: (1) load the prediction data, (2) apply the same feature engineering to it, (3) feed it into the best (selected) model to obtain the predictions, and (4) write the predicted values out. Note that we have already done steps 1 and 2, since we split the data after feature engineering; this is, in fact, the preferred approach, as otherwise unmatched categories may be present.

Let us save the result produced by the currently best model. Some links to the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_stata.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html

In [32]:
best_reg = lasso_cv.best_estimator_
best_reg.fit(np.array(X_log), np.array(y_log).reshape(-1,))
predictions = np.exp(best_reg.predict(np.array(X_predict_log)))

# Save the result to a pandas DataFrame and export it to CSV.
df_result = pd.DataFrame.from_dict({'predicted_price': predictions})
df_result.to_csv('Predicted Diamond Prices_python LOG INTERACTION LASSO.csv', sep=',')

# Other export formats are also possible:
df_result.to_stata('Predicted Diamond Prices_python LOG INTERACTION LASSO.dta')
# df_result.to_excel('predicted.xlsx')  # (Additional packages needed.)
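One might also prefer to keep the diamond IDs of the prediction rows alongside the predicted prices, so the output file can be matched back to the original data. A small variation on the export above (a sketch, using the sarah_predict frame defined earlier):

df_result = pd.DataFrame({'ID': sarah_predict['ID'].values,
                          'predicted_price': predictions})
df_result.to_csv('Predicted Diamond Prices_python LOG INTERACTION LASSO.csv', sep=',', index=False)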