ML_assignment_2.pdf
School: Drexel University
Course: 613
Subject: Statistics
Date: Feb 20, 2024
Pages: 6
Uploaded by: CaptainMantisMaster961
Meet Sakariya: 14473322
1 Theory

1. Consider the following supervised training dataset:

$$X = \begin{bmatrix} -2 \\ -5 \\ -3 \\ 0 \\ -8 \\ -2 \\ 1 \\ 5 \\ -1 \\ 6 \end{bmatrix}, \quad Y = \begin{bmatrix} 1 \\ -4 \\ 1 \\ 3 \\ 11 \\ 5 \\ 0 \\ -1 \\ -3 \\ 1 \end{bmatrix}$$

(a) Compute the coefficients for closed-form linear regression using least squares estimate
(LSE). Show your work and remember to add a bias feature. Since we have only one
feature, there is no need to zscore it (6pts).
Solution:
Closed-form linear regression is given by $\hat{y} = b_0 + b_1 x$, where

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$\bar{x} = \frac{\sum x_i}{n} = -0.9, \qquad \bar{y} = \frac{\sum y_i}{n} = 1.4$$

x      y      x - x̄    y - ȳ    (x - x̄)^2    (x - x̄)(y - ȳ)
-2     1      -1.1     -0.4      1.21          0.44
-5    -4      -4.1     -5.4     16.81         22.14
-3     1      -2.1     -0.4      4.41          0.84
 0     3       0.9      1.6      0.81          1.44
-8    11      -7.1      9.6     50.41        -68.16
-2     5      -1.1      3.6      1.21         -3.96
 1     0       1.9     -1.4      3.61         -2.66
 5    -1       5.9     -2.4     34.81        -14.16
-1    -3      -0.1     -4.4      0.01          0.44
 6     1       6.9     -0.4     47.61         -2.76
Sum                            160.90        -66.40

$$b_1 = \frac{-66.4}{160.9} = -0.41$$
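The hand calculation in part (a) can be cross-checked with a few lines of code. The following is a minimal sketch in plain Python, assuming the ten (x, y) pairs from the table; the function name `fit_simple_ols` is illustrative and not part of the assignment.

```python
# Training data from the problem statement.
x = [-2, -5, -3, 0, -8, -2, 1, 5, -1, 6]
y = [1, -4, 1, 3, 11, 5, 0, -1, -3, 1]


def fit_simple_ols(x, y):
    """Return (b0, b1) minimizing the squared error of y ~ b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_ols(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

Without intermediate rounding the slope is about -0.4127 and the intercept about 1.029. The hand calculation rounds the slope to -0.41 before computing the intercept, which is why it reports 1.031.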
$$b_0 = \bar{y} - b_1 \bar{x} = 1.4 - (-0.41)(-0.9) = 1.031$$

$$\hat{y} = 1.031 - 0.41x$$

Value of coefficients for linear regression:
b0 = 1.031
b1 = -0.41

(b) Using your learned model in the previous part, what are your predictions, Ŷ, for the training
data (2pts)?
Solution:
Now that we have the coefficients, we can make predictions for the training data using:

$$\hat{y} = 1.031 - 0.41x$$

x      y      ŷ
-2     1      1.851
-5    -4      3.081
-3     1      2.261
 0     3      1.031
-8    11      4.311
-2     5      1.851
 1     0      0.621
 5    -1     -1.019
-1    -3      1.441
 6     1     -1.429

(c) What is the RMSE and SMAPE for this training set based on the model you learned in
the previous part (2pts)?
Solution:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2} = \sqrt{13.70} = 3.70$$

$$\mathrm{SMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i - \hat{Y}_i|}{(|Y_i| + |\hat{Y}_i|)/2} = \frac{12.159}{10} = 1.2159$$

RMSE value = 3.70
SMAPE value = 1.2159
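Parts (b) and (c) are mechanical once the coefficients are fixed, so they are easy to check in code. The sketch below is a plain-Python illustration that assumes the rounded coefficients b0 = 1.031 and b1 = -0.41 from part (a) and the SMAPE variant whose denominator is (|Y| + |Ŷ|)/2, as in the Theory solution. The helper names `rmse` and `smape` are illustrative, and no guard is included for the case where a target and its prediction are both zero.

```python
import math

# Training data from the problem statement.
x = [-2, -5, -3, 0, -8, -2, 1, 5, -1, 6]
y = [1, -4, 1, 3, 11, 5, 0, -1, -3, 1]

# Rounded coefficients from part (a); assumed here rather than refitted.
b0, b1 = 1.031, -0.41

# Part (b): predictions for every training sample.
y_hat = [b0 + b1 * xi for xi in x]


def rmse(y_true, y_pred):
    """Root mean squared error over all samples."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def smape(y_true, y_pred):
    """Symmetric MAPE with the (|Y| + |Y_hat|) / 2 denominator."""
    n = len(y_true)
    return sum(abs(t - p) / ((abs(t) + abs(p)) / 2)
               for t, p in zip(y_true, y_pred)) / n


for xi, yi, pi in zip(x, y, y_hat):
    print(f"{xi:3d} {yi:3d} {pi:7.3f}")
print(f"RMSE  = {rmse(y, y_hat):.4f}")
print(f"SMAPE = {smape(y, y_hat):.4f}")
```

Dropping the division by 2 in the denominator halves every term, so the two SMAPE conventions differ by exactly a factor of two.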
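2 Closed Form Linear Regression

Formulas used:

$$w = (X^T X)^{-1} X^T Y$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}$$

$$\mathrm{SMAPE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i - \hat{Y}_i|}{|Y_i| + |\hat{Y}_i|}$$
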
In this section you’ll create simple linear regression models using the dataset mentioned in the Datasets
section. Use the first six columns as the features (age, sex, bmi, children, smoker, region), and the final
column as the value to predict (charges). Note that the features contain a mixture of continuous valued
information, binary information, and categorical information. It will be up to you to decide how to do any
pre-processing of the features!
First randomize (shuffle) the rows of your data and then split it into two subsets: 2/3 for training, 1/3 for
validation. Next train your model using the training data, and evaluate it for the training data, and for the
validation data.
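The shuffle-then-split step described above can be sketched in plain Python as follows. This is an illustration only: the function name `shuffle_split`, the stand-in rows, and the choice to round the training size up when N is not divisible by 3 are assumptions, and a seed of zero is used because the assignment notes suggest it.

```python
import random


def shuffle_split(rows, seed=0):
    """Shuffle a copy of rows with a seeded generator, then split 2/3 : 1/3."""
    rng = random.Random(seed)        # dedicated generator, seeded for reproducibility
    shuffled = list(rows)            # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    # ceil(2N/3) using integer arithmetic; the rounding rule is an assumption.
    n_train = (2 * len(shuffled) + 2) // 3
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(12))               # stand-in for the dataset rows
train, val = shuffle_split(rows)
print(len(train), len(val))
```

Using a dedicated `random.Random` instance instead of the module-level generator keeps the split reproducible even if other code draws random numbers in between.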
NOTE:
1. Don’t forget to add a bias feature!
2. So that you have reproducible results, we suggest that you seed the random number generator prior
to using it. In particular, you might want to seed it with a value of zero so that you can
compare your numeric results with others.
3. IMPORTANT: If you notice there are issues in computing the inverse of X^T X due to sparsity,
you may want to try one of the following:
• Using the pseudo-inverse instead of the regular inverse. This can be more stable and
accurate.
• Adding some “noise” (i.e. very small values) to the binary features you made out of the
enumerated features.
Since your target values are relatively large, so too will your RMSE.

Solution:
Implementation Details
The preprocessing steps undertaken to train the model are as follows:
1). Initially, a check for null values in the dataset was performed using the command data.isnull().sum(). In
our specific case, there were no null values present.
2). Next, attention was directed to three columns, namely 'sex', 'smoker', and 'region', which contained
categorical or string data.
3). In the 'sex' column, the value 'male' was converted to 1 and the value 'female' to 0.
4). Similarly, in the 'smoker' column, 'yes' was changed to 1 and 'no' to 0.
5). As for the 'region' column, it encompassed three distinct values: 'region northwest', 'region southeast',
and 'region southwest'. This column was subsequently split into three separate binary columns, one per
value. For instance, if the region indicated 'region northwest', that column was assigned a value of 1, while the
other two region columns were set to 0.

RMSE Training set: 5757.8889922488215
RMSE Validation set: 6604.316221778547
SMAPE Training set: 36.10967598469293
SMAPE Validation set: 36.60026219709693
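The encoding steps for 'sex', 'smoker', and 'region', together with the bias feature, can be sketched without any libraries. The column names below come from the assignment text; the region labels in `REGION_COLUMNS`, the helper `encode_row`, and the sample record are assumptions made up for illustration and are not taken from the dataset.

```python
# Region labels that each get their own 0/1 column (assumed spellings).
REGION_COLUMNS = ["northwest", "southeast", "southwest"]


def encode_row(row):
    """Turn one raw record (a dict) into a numeric feature vector.

    The first entry is the constant bias feature.
    """
    features = [
        1.0,                                      # bias feature
        float(row["age"]),
        1.0 if row["sex"] == "male" else 0.0,     # male -> 1, female -> 0
        float(row["bmi"]),
        float(row["children"]),
        1.0 if row["smoker"] == "yes" else 0.0,   # yes -> 1, no -> 0
    ]
    # One indicator column per listed region; any other region is all zeros.
    features += [1.0 if row["region"] == r else 0.0 for r in REGION_COLUMNS]
    return features


# Hypothetical record, invented for this example.
sample = {"age": 40, "sex": "male", "bmi": 25.0, "children": 2,
          "smoker": "no", "region": "southeast", "charges": 1234.5}
print(encode_row(sample))
```

Leaving one category without its own column (here, any region not listed) keeps the indicator columns from summing to the bias column, which is one common way to avoid a singular X^T X. When singularity still occurs, the pseudo-inverse mentioned in the notes is the usual fallback.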
3 Cross-Validation

Cross-Validation is a technique used to use more data for training a system while maintaining a reliable
validation score.
In this section you will do S-Folds Cross-Validation for a few different values of S. For each run you will
divide your data up into S parts (folds) and build S different models using S-folds cross-validation and
evaluate via root mean squared error. In addition, to observe the effect of system variance, we will repeat
these experiments several times (shuffling the data each time prior to creating the folds). We will again be
doing our experiment on the aforementioned dataset.

Write a script that:
1. Reads in the data.
2. 20 times does the following:
(a) Seeds the random number generator to the current run (out of 20).
(b) Shuffles the rows of the data.
(c) Creates S folds.
(d) For i = 1 to S
i. Select fold i as your validation data and the remaining (S − 1) folds as your training
data.
ii. Train a linear regression model using the direct solution.
iii. Compute the squared error for each sample in the current validation fold.
(e) You should now have N squared errors. Compute the RMSE for these.
3. You should now have 20 RMSE values. Compute the mean and standard deviation of these.
The former should give us a better “overall” mean, whereas the latter should give us a feel for
the variance of the models that were created.

Solution:
The preprocessing steps undertaken to train the model are as follows:
1). Initially, a check for null values in the dataset was performed using the command data.isnull().sum(). In
our specific case, there were no null values present.
2). Next, attention was directed to three columns, namely 'sex', 'smoker', and 'region', which contained
categorical or string data.
3). In the 'sex' column, the value 'male' was converted to 1 and the value 'female' to 0.
4). Similarly, in the 'smoker' column, 'yes' was changed to 1 and 'no' to 0.
5). As for the 'region' column, it encompassed three distinct values: 'region northwest', 'region southeast',
and 'region southwest'. This column was subsequently split into three separate binary columns, one per
value. For instance, if the region indicated 'region northwest', that column was assigned a value of 1, while the
other two region columns were set to 0.

Mean RMSE for 3-Folds Cross-Validation: 6090.2168454674675
Standard Deviation of RMSE for 3-Folds Cross-Validation: 13.683844715851382
Mean RMSE for 223-Folds Cross-Validation: 5586.057692498905
Standard Deviation of RMSE for 223-Folds Cross-Validation: 25.575125448445824
Mean RMSE for 1338-Folds Cross-Validation: 4202.090245766043
Standard Deviation of RMSE for 1338-Folds Cross-Validation: 1.3830600733622805e-11
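The repeated S-fold procedure can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the submitted script: it runs on the ten-point toy dataset from the Theory section instead of the real data, solves the normal equations by Gauss-Jordan elimination as the "direct solution", numbers the seeds 1 to 20, and reports the sample standard deviation. All function names are illustrative.

```python
import math
import random
import statistics


def fit_direct(X, y):
    """Solve (X^T X) w = X^T y by Gauss-Jordan elimination.

    Raises ZeroDivisionError if X^T X is singular.
    """
    d = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(d)]
         + [sum(row[i] * t for row, t in zip(X, y))] for i in range(d)]
    for col in range(d):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(d):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][d] / A[i][i] for i in range(d)]


def predict(w, row):
    return sum(wi * xi for wi, xi in zip(w, row))


def cv_rmse(X, y, S, seed):
    """One run of S-fold cross-validation: RMSE over all N held-out samples."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)          # seed with the current run number
    folds = [idx[k::S] for k in range(S)]     # S folds of nearly equal size
    squared_errors = []
    for k in range(S):
        held_out = set(folds[k])
        train = [i for i in idx if i not in held_out]
        w = fit_direct([X[i] for i in train], [y[i] for i in train])
        squared_errors += [(y[i] - predict(w, X[i])) ** 2 for i in folds[k]]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def repeated_cv(X, y, S, runs=20):
    """Mean and sample standard deviation of the RMSE over the given runs."""
    scores = [cv_rmse(X, y, S, seed=run) for run in range(1, runs + 1)]
    return statistics.mean(scores), statistics.stdev(scores)


# Toy data from the Theory section, with a leading bias feature.
xs = [-2, -5, -3, 0, -8, -2, 1, 5, -1, 6]
y = [1, -4, 1, 3, 11, 5, 0, -1, -3, 1]
X = [[1.0, float(v)] for v in xs]

for S in (5, len(X)):
    mean, std = repeated_cv(X, y, S)
    print(f"S = {S}: mean RMSE = {mean:.4f}, std = {std:.2e}")
```

When S equals N (leave-one-out), every shuffle produces the same set of single-sample folds, so the 20 RMSE values agree up to floating-point rounding. This is consistent with the near-zero standard deviation reported for 1338 folds.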