ML Assignment 2
Drexel University, Course 613 (Statistics), Feb 20, 2024
Meet Sakariya: 14473322

1 Theory

1. Consider the following supervised training dataset:

   X = [-2, -5, -3, 0, -8, -2, 1, 5, -1, 6]
   Y = [ 1, -4,  1, 3, 11,  5, 0, -1, -3, 1]

(a) Compute the coefficients for closed-form linear regression using the least squares estimate (LSE). Show your work and remember to add a bias feature. Since we have only one feature, there is no need to z-score it (6pts).

Solution: Closed-form simple linear regression is given by \hat{y} = b_0 + b_1 x, where

   b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}

   \bar{x} = \frac{1}{n}\sum_i x_i = -0.9, \qquad \bar{y} = \frac{1}{n}\sum_i y_i = 1.4

    x     y    x - x̄   y - ȳ   (x - x̄)²   (x - x̄)(y - ȳ)
   -2     1    -1.1    -0.4      1.21          0.44
   -5    -4    -4.1    -5.4     16.81         22.14
   -3     1    -2.1    -0.4      4.41          0.84
    0     3     0.9     1.6      0.81          1.44
   -8    11    -7.1     9.6     50.41        -68.16
   -2     5    -1.1     3.6      1.21         -3.96
    1     0     1.9    -1.4      3.61         -2.66
    5    -1     5.9    -2.4     34.81        -14.16
   -1    -3    -0.1    -4.4      0.01          0.44
    6     1     6.9    -0.4     47.61         -2.76

   \sum_i (x_i - \bar{x})^2 = 160.9, \qquad \sum_i (x_i - \bar{x})(y_i - \bar{y}) = -66.4

   b_1 = -66.4 / 160.9 = -0.41
   b_0 = \bar{y} - b_1 \bar{x} = 1.4 - (-0.41)(-0.9) = 1.031

   \hat{y} = 1.031 - 0.41x

(b) Using your learned model in the previous part, what are your predictions, \hat{Y}, for the training data (2pts)?

Solution: Now that we have the coefficients, we can make predictions for the training data using \hat{y} = 1.031 - 0.41x:

    x    -2      -5      -3      0      -8      -2      1       5       -1      6
    y     1      -4       1      3      11       5      0      -1       -3      1
    ŷ    1.851   3.081   2.261  1.031   4.311   1.851   0.621  -1.019   1.441  -1.429

(c) What is the RMSE and SMAPE for this training set based on the model you learned in the previous part (2pts)?

Solution:

   RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} = \sqrt{137.00 / 10} = \sqrt{13.70} = 3.70

   SMAPE = \frac{1}{N}\sum_{i=1}^{N}\frac{|y_i - \hat{y}_i|}{(|y_i| + |\hat{y}_i|)/2} = 12.16 / 10 = 1.216

Value of coefficients for linear regression: b_0 = 1.031, b_1 = -0.41
RMSE value = 3.70
SMAPE value = 1.216
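The hand computation above can be checked with a short script. This is a minimal sketch: it keeps full floating-point precision, so the coefficients and error metrics come out slightly different from the rounded hand values (b1 = -66.4/160.9 ≈ -0.4127 rather than -0.41).

```python
import numpy as np

# Training data from Question 1.
x = np.array([-2, -5, -3, 0, -8, -2, 1, 5, -1, 6], dtype=float)
y = np.array([1, -4, 1, 3, 11, 5, 0, -1, -3, 1], dtype=float)

# LSE for simple linear regression:
# b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), b0 = y_bar - b1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar          # b1 ≈ -0.4127, b0 ≈ 1.0286

y_hat = b0 + b1 * x              # predictions for the training data

# RMSE, and SMAPE with the (|y| + |y_hat|)/2 denominator used above.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
smape = np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

print(b0, b1, rmse, smape)
```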
2 Closed Form Linear Regression

Formulas used:

   w = (X^T X)^{-1} X^T Y

   RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}

   SMAPE = \frac{1}{N}\sum_{i=1}^{N}\frac{|Y_i - \hat{Y}_i|}{(|Y_i| + |\hat{Y}_i|)/2}

In this section you'll create simple linear regression models using the dataset mentioned in the Datasets section. Use the first six columns as the features (age, sex, bmi, children, smoker, region), and the final column as the value to predict (charges). Note that the features contain a mixture of continuous-valued information, binary information, and categorical information. It will be up to you to decide how to do any pre-processing of the features!

First randomize (shuffle) the rows of your data and then split it into two subsets: 2/3 for training, 1/3 for validation. Next train your model using the training data, and evaluate it for the training data and for the validation data.

1. Don't forget to add a bias feature!
2. So that you have reproducible results, we suggest that you seed the random number generator prior to using it. In particular, you might want to seed it with a value of zero so that you can compare your numeric results with others.
3. IMPORTANT: If you notice there are issues in computing the inverse of X^T X due to sparsity, you may want to try one of the following:
   • Using the pseudo-inverse instead of the regular inverse. This can be more stable and accurate.
   • Adding some "noise" (i.e. very small values) to the binary features you made out of the enumerated features.

NOTE: Since your target values are relatively large, so too will be your RMSE.

Solution:

RMSE Training set: 5757.8889922488215
RMSE Validation set: 6604.316221778547
SMAPE Training set: 36.10967598469293
SMAPE Validation set: 36.60026219709693

Implementation Details. The preprocessing steps undertaken to train the model are as follows:

1) Initially, a check for null values in the dataset was performed using the command data.isnull().sum(). In our specific case, there were no null values present.
2) Next, attention was directed to three columns, 'sex', 'smoker', and 'region', which contained categorical (string) data.
3) In the 'sex' column, 'male' values were converted to 1 and 'female' values to 0.
4) Similarly, in the 'smoker' column, 'yes' was changed to 1 and 'no' to 0.
5) As for the 'region' column, it encompassed three distinct values: 'region northwest', 'region southeast', and 'region southwest'. This column was subsequently split into three separate columns based on their values. For instance, if the region indicated 'region northwest', that column was assigned 1 while the other two region columns were set to 0.
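The preprocessing and closed-form training described above can be sketched as follows. The rows here are made-up placeholders standing in for the real insurance CSV, and the helper name `encode` is illustrative, not part of the assignment:

```python
import numpy as np

# Hypothetical stand-in rows: (age, sex, bmi, children, smoker, region, charges).
rows = [
    (19, "female", 27.9,   0, "yes", "southwest", 16884.92),
    (18, "male",   33.77,  1, "no",  "southeast",  1725.55),
    (28, "male",   33.0,   3, "no",  "southeast",  4449.46),
    (33, "male",   22.705, 0, "no",  "northwest", 21984.47),
    (32, "male",   28.88,  0, "no",  "northwest",  3866.86),
    (31, "female", 25.74,  0, "no",  "southeast",  3756.62),
]

REGIONS = ("northwest", "southeast", "southwest")  # one binary column per value

def encode(row):
    """Map one raw row to a numeric feature vector with a bias feature."""
    age, sex, bmi, children, smoker, region, _ = row
    onehot = [1.0 if region == r else 0.0 for r in REGIONS]
    return [1.0,                                  # bias feature
            float(age),
            1.0 if sex == "male" else 0.0,        # male -> 1, female -> 0
            float(bmi),
            float(children),
            1.0 if smoker == "yes" else 0.0,      # yes -> 1, no -> 0
            *onehot]

X = np.array([encode(r) for r in rows])
Y = np.array([r[-1] for r in rows])

rng = np.random.default_rng(0)        # seed 0 for reproducible results
order = rng.permutation(len(rows))    # shuffle the rows
split = 2 * len(rows) // 3            # 2/3 training, 1/3 validation
train, val = order[:split], order[split:]

# Closed-form solution; pinv is more stable than inv when X^T X is near-singular.
w = np.linalg.pinv(X[train].T @ X[train]) @ X[train].T @ Y[train]

def rmse(Xs, Ys):
    return float(np.sqrt(np.mean((Ys - Xs @ w) ** 2)))

print("train RMSE:", rmse(X[train], Y[train]))
print("val RMSE:", rmse(X[val], Y[val]))
```

On the full dataset the same steps produce the training and validation errors reported above.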
3 Cross-Validation

Cross-validation is a technique used to use more data for training a system while maintaining a reliable validation score. In this section you will do S-folds cross-validation for a few different values of S. For each run you will divide your data up into S parts (folds), build S different models using S-folds cross-validation, and evaluate via root mean squared error. In addition, to observe the effect of system variance, we will repeat these experiments several times (shuffling the data each time prior to creating the folds). We will again be doing our experiment on the aforementioned dataset.

Write a script that:

1. Reads in the data.
2. 20 times does the following:
   (a) Seeds the random number generator to the current run (out of 20).
   (b) Shuffles the rows of the data.
   (c) Creates S folds.
   (d) For i = 1 to S:
      i. Select fold i as your validation data and the remaining (S − 1) folds as your training data.
      ii. Train a linear regression model using the direct solution.
      iii. Compute the squared error for each sample in the current validation fold.
   (e) You should now have N squared errors. Compute the RMSE for these.
3. You should now have 20 RMSE values. Compute the mean and standard deviation of these. The former should give us a better "overall" mean, whereas the latter should give us a feel for the variance of the models that were created.

Solution:

Mean RMSE for 3-Folds Cross-Validation: 6090.2168454674675
Standard Deviation of RMSE for 3-Folds Cross-Validation: 13.683844715851382
Mean RMSE for 223-Folds Cross-Validation: 5586.057692498905
Standard Deviation of RMSE for 223-Folds Cross-Validation: 25.575125448445824
Mean RMSE for 1338-Folds Cross-Validation: 4202.090245766043
Standard Deviation of RMSE for 1338-Folds Cross-Validation: 1.3830600733622805e-11

The preprocessing steps undertaken to train the model are as follows:

1) Initially, a check for null values in the dataset was performed using the command data.isnull().sum(). In our specific case, there were no null values present.
2) Next, attention was directed to three columns, 'sex', 'smoker', and 'region', which contained categorical (string) data.
3) In the 'sex' column, 'male' values were converted to 1 and 'female' values to 0.
4) Similarly, in the 'smoker' column, 'yes' was changed to 1 and 'no' to 0.
5) As for the 'region' column, it encompassed three distinct values: 'region northwest', 'region southeast', and 'region southwest'. This column was subsequently split into three separate columns based on their values. For instance, if the region indicated 'region northwest', that column was assigned 1 while the other two region columns were set to 0.
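The S-folds procedure above can be sketched as follows. The data here is synthetic stand-in data (the real experiment uses the preprocessed insurance features), and the function name `sfold_rmse` and the synthetic coefficients are illustrative only:

```python
import numpy as np

# Synthetic stand-in data: bias + D features, targets on a charges-like scale.
rng0 = np.random.default_rng(123)
N, D = 60, 4
X = np.hstack([np.ones((N, 1)), rng0.normal(size=(N, D))])
true_w = np.array([5000.0, 300.0, -150.0, 80.0, 40.0])
Y = X @ true_w + rng0.normal(scale=50.0, size=N)

def sfold_rmse(X, Y, S, runs=20):
    """Repeat S-folds cross-validation `runs` times; return mean/std of RMSE."""
    rmses = []
    for run in range(runs):
        rng = np.random.default_rng(run)       # seed with the current run number
        order = rng.permutation(len(Y))        # shuffle the rows
        folds = np.array_split(order, S)       # create S folds
        sq_errors = []
        for i in range(S):
            val = folds[i]                     # fold i is the validation data
            train = np.concatenate([folds[j] for j in range(S) if j != i])
            # direct (closed-form) solution via the pseudo-inverse
            w = np.linalg.pinv(X[train].T @ X[train]) @ X[train].T @ Y[train]
            sq_errors.extend((Y[val] - X[val] @ w) ** 2)
        # one RMSE over all N squared errors for this run
        rmses.append(np.sqrt(np.mean(sq_errors)))
    return float(np.mean(rmses)), float(np.std(rmses))

mean_rmse, std_rmse = sfold_rmse(X, Y, S=3)
print(mean_rmse, std_rmse)
```

Note that when S equals the number of samples (leave-one-out, 1338 folds here), shuffling no longer changes which samples get held out, which is consistent with the near-zero standard deviation reported for 1338-folds above.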