ML Assignment 2
Meet Sakariya: 14473322
1 Theory
1. Consider the following supervised training dataset:

       X     Y
      -2     1
      -5    -4
      -3     1
       0     3
      -8    11
      -2     5
       1     0
       5    -1
      -1    -3
       6     1
(a) Compute the coefficients for closed-form linear regression using least squares estimate
(LSE). Show your work and remember to add a bias feature. Since we have only one
feature, there is no need to zscore it (6pts).
Solution:
Closed-form linear regression is given by ŷ = b0 + b1·x, with

    b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
    b0 = ȳ − b1·x̄

    x̄ = (Σ xi)/n = −0.9
    ȳ = (Σ yi)/n = 1.4

     x     y    x−x̄   y−ȳ   (x−x̄)²   (x−x̄)(y−ȳ)
    -2     1   -1.1  -0.4     1.21        0.44
    -5    -4   -4.1  -5.4    16.81       22.14
    -3     1   -2.1  -0.4     4.41        0.84
     0     3    0.9   1.6     0.81        1.44
    -8    11   -7.1   9.6    50.41      -68.16
    -2     5   -1.1   3.6     1.21       -3.96
     1     0    1.9  -1.4     3.61       -2.66
     5    -1    5.9  -2.4    34.81      -14.16
    -1    -3   -0.1  -4.4     0.01        0.44
     6     1    6.9  -0.4    47.61       -2.76

    Σ(xi − x̄)² = 160.9,   Σ(xi − x̄)(yi − ȳ) = −66.4

    b1 = (−66.4) / 160.9 = −0.41
    b0 = ȳ − b1·x̄ = 1.4 − (−0.41)(−0.9) = 1.031

Value of coefficients for linear regression:
    b0 = 1.031
    b1 = −0.41

so the learned model is ŷ = 1.031 − 0.41x.
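As a quick cross-check of the arithmetic above (my own sketch, not part of the original submission), the same coefficients can be computed with NumPy, either from the mean-centered sums or from the matrix form with an explicit bias column; the variable names are illustrative.

import numpy as np

# Training data from Problem 1.
x = np.array([-2, -5, -3, 0, -8, -2, 1, 5, -1, 6], dtype=float)
y = np.array([ 1, -4,  1, 3, 11,  5, 0, -1, -3, 1], dtype=float)

# Closed-form (least squares) coefficients for y ≈ b0 + b1*x.
x_bar, y_bar = x.mean(), y.mean()                  # -0.9 and 1.4
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Equivalent matrix form with a bias feature: w = (X^T X)^(-1) X^T y.
X = np.column_stack([np.ones_like(x), x])          # column of 1s is the bias feature
w = np.linalg.pinv(X.T @ X) @ X.T @ y              # w[0] = b0, w[1] = b1

print(b0, b1)   # approximately 1.03 and -0.41, matching the values above
print(w)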
(b) Using your learned model in the previous part, what are your predictions, Ŷ, for the
training data (2pts)?

Solution:

Now that we have the coefficients, we can make predictions for the training data using
ŷ = 1.031 − 0.41x:

     x     y        ŷ
    -2     1    1.831
    -5    -4    3.031
    -3     1    2.231
     0     3    1.031
    -8    11   -2.169
    -2     5    1.831
     1     0    0.631
     5    -1   -0.969
    -1    -3    1.431
     6     1   -1.369

(c) What is the RMSE and SMAPE for this training set based on the model you learned in
the previous part (2pts)?

Solution:

    RMSE = √( (1/N) · Σ (Yi − Ŷi)² ) = √26.46 = 5.14
    SMAPE = (1/N) · Σ |Yi − Ŷi| / ((|Yi| + |Ŷi|)/2) = 13.279/10 = 1.3279

RMSE value = 5.14
SMAPE value = 1.3279
2 Closed Form Linear Regression
In this section you’ll create simple linear regression models using the dataset mentioned in the Datasets
section. Use the first six columns as the features (age, sex, bmi, children, smoker, region), and the final
column as the value to predict (charges). Note that the features contain a mixture of continuous valued
information, binary information, and categorical information. It will be up to you to decide how to do any
pre-processing of the features!
First randomize (shuffle) the rows of your data and then split it into two subsets: 2/3 for training and
1/3 for validation. Next, train your model using the training data and evaluate it on both the training
data and the validation data.
NOTE:
1. Don't forget to add a bias feature!
2. So that you have reproducible results, we suggest that you seed the random number generator prior
to using it. In particular, you might want to seed it with a value of zero so that you can
compare your numeric results with others.
3. IMPORTANT: If you notice there are issues in computing the inverse of XᵀX due to sparsity,
you may want to try one of the following (a short sketch appears below):
• Using the pseudo-inverse instead of the regular inverse. This can be more stable and
accurate.
• Adding some "noise" (i.e. very small values) to the binary features you made out of the
enumerated features.
Since your target values are relatively large, so too will your RMSE.
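As a brief illustration of the pseudo-inverse suggestion in the note above (a sketch of mine, not something prescribed by the assignment), NumPy's pinv can stand in for a regular inverse in the normal equation:

import numpy as np

def closed_form_weights(X, y):
    # Least squares weights w = (X^T X)^+ X^T y.
    # np.linalg.pinv (Moore-Penrose pseudo-inverse) is used instead of
    # np.linalg.inv so the solve still works when X^T X is singular or
    # ill-conditioned (e.g. sparse one-hot columns).
    return np.linalg.pinv(X.T @ X) @ X.T @ y

Adding a tiny amount of noise to the one-hot columns, as the note suggests, is an alternative way to make XᵀX invertible.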
Solution:

Formulas used:

    w = (XᵀX)⁻¹ Xᵀ Y
    RMSE = √( (1/N) · Σ (Yi − Ŷi)² )
    SMAPE = (1/N) · Σ |Yi − Ŷi| / ((|Yi| + |Ŷi|)/2)

Implementation Details

The preprocessing steps undertaken to train the model are as follows:

1). Initially, a check for null values in the dataset was performed using the command data.isnull().sum().
In our specific case, there were no null values present.
2). Next, attention was directed to three columns, namely 'sex', 'smoker', and 'region', which contained
categorical or string data.
3). In the 'sex' column, the values 'male' were converted to '1', and 'female' values were transformed
into '0'.
4). Similarly, in the 'smoker' column, 'yes' was changed to '1', and 'no' was altered to '0'.
5). As for the 'region' column, it encompassed three distinct values: 'region northwest', 'region southeast',
and 'region southwest'. This column was subsequently split into three separate columns based on their
values. For instance, if the region indicated 'region northwest', that column was assigned a value of '1',
while the other two region columns were set to '0'.

With this preprocessing, a bias feature, a seeded shuffle, and a 2/3 training / 1/3 validation split, the
closed-form model gave the following errors:

RMSE Training set: 5757.8889922488215
RMSE Validation set: 6604.316221778547
SMAPE Training set: 36.10967598469293
SMAPE Validation set: 36.60026219709693
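One possible end-to-end sketch of the pipeline described above (my reconstruction, not the author's actual code). The column names come from the assignment text; the CSV file name insurance.csv, the use of pandas get_dummies for the region encoding, and the SMAPE scaling (reported here in percent) are assumptions.

import numpy as np
import pandas as pd

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def smape(y, y_hat):
    # Symmetric MAPE in percent, assuming the averaged-denominator form.
    return 100 * np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

# --- Load and preprocess (file name is an assumption) ---
data = pd.read_csv("insurance.csv")
assert data.isnull().sum().sum() == 0                      # step 1: no missing values

data["sex"] = (data["sex"] == "male").astype(float)        # male -> 1, female -> 0
data["smoker"] = (data["smoker"] == "yes").astype(float)   # yes -> 1, no -> 0
# Indicator columns for region (drop_first keeps the three columns described above).
data = pd.get_dummies(data, columns=["region"], drop_first=True, dtype=float)

y = data.pop("charges").to_numpy()
X = data.to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])                  # add the bias feature

# --- Shuffle with a fixed seed and split 2/3 training, 1/3 validation ---
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
n_train = int(np.ceil(2 * len(X) / 3))
train, val = idx[:n_train], idx[n_train:]

# --- Closed-form solution (pseudo-inverse for numerical stability) ---
w = np.linalg.pinv(X[train].T @ X[train]) @ X[train].T @ y[train]

for name, rows in [("Training", train), ("Validation", val)]:
    pred = X[rows] @ w
    print(f"RMSE {name} set:", rmse(y[rows], pred))
    print(f"SMAPE {name} set:", smape(y[rows], pred))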
3 Cross-Validation
Cross-Validation is a technique for using more of the data for training a system while still maintaining a
reliable validation score.

In this section you will do S-Folds Cross-Validation for a few different values of S. For each run you will
divide your data up into S parts (folds), build S different models using S-folds cross-validation, and
evaluate via root mean squared error. In addition, to observe the effect of system variance, we will repeat
these experiments several times (shuffling the data each time prior to creating the folds). We will again be
doing our experiment on the aforementioned dataset.

Write a script that:

1. Reads in the data.
2. 20 times does the following:
   (a) Seeds the random number generator to the current run (out of 20).
   (b) Shuffles the rows of the data.
   (c) Creates S folds.
   (d) For i = 1 to S:
       i. Select fold i as your validation data and the remaining (S − 1) folds as your training data.
       ii. Train a linear regression model using the direct solution.
       iii. Compute the squared error for each sample in the current validation fold.
   (e) You should now have N squared errors. Compute the RMSE for these.
3. You should now have 20 RMSE values. Compute the mean and standard deviation of these.
   The former should give us a better "overall" mean, whereas the latter should give us a feel for
   the variance of the models that were created.
Solution:
The preprocessing steps are the same as in the previous section: a null-value check with
data.isnull().sum() (no nulls were present), binary encoding of the 'sex' and 'smoker' columns, and
splitting the 'region' column into separate indicator columns.

The resulting cross-validation scores were:

Mean RMSE for 3-Folds Cross-Validation: 6090.2168454674675
Standard Deviation of RMSE for 3-Folds Cross-Validation: 13.683844715851382
Mean RMSE for 223-Folds Cross-Validation: 5586.057692498905
Standard Deviation of RMSE for 223-Folds Cross-Validation: 25.575125448445824
Mean RMSE for 1338-Folds Cross-Validation: 4202.090245766043
Standard Deviation of RMSE for 1338-Folds Cross-Validation: 1.3830600733622805e-11
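A sketch of how the 20-run S-fold loop described above could be implemented (my own illustration; it assumes X already contains the bias column and the preprocessed features from the previous section):

import numpy as np

def s_fold_rmse(X, y, S, n_runs=20):
    # Repeat S-fold cross-validation n_runs times; return the per-run RMSEs.
    rmses = []
    for run in range(n_runs):
        rng = np.random.default_rng(run)            # seed with the current run
        idx = rng.permutation(len(X))               # shuffle the rows
        folds = np.array_split(idx, S)              # create S folds
        sq_errors = []
        for i in range(S):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(S) if j != i])
            # Direct (closed-form) solution via the pseudo-inverse.
            w = np.linalg.pinv(X[train].T @ X[train]) @ X[train].T @ y[train]
            sq_errors.extend((y[val] - X[val] @ w) ** 2)
        rmses.append(np.sqrt(np.mean(sq_errors)))   # RMSE over all N squared errors
    return np.array(rmses)

# Example usage (X, y come from the preprocessing sketched earlier):
# for S in (3, 223, 1338):
#     scores = s_fold_rmse(X, y, S)
#     print(S, scores.mean(), scores.std())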