Week 9 Sklearn Fish
School: St. John's University
Course: 243
Subject: Statistics
Date: Feb 20, 2024
SciKit-Learn
Linear Regression
Linear Regression
● Linear regression is a supervised machine learning algorithm
● The target variable is modeled as a linear function of one or more independent variables
● Can be univariate (one independent variable) or multivariate (several)
● This lab demonstrates linear regression using Sklearn
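The bullets above can also be written as equations. This is a minimal sketch in standard textbook notation; the symbols (the coefficients β, the error term ε, and the number of variables p) are assumptions and do not appear in the slides.

```latex
% Univariate: one independent variable x
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
% Multivariate: p independent variables x_1, ..., x_p
\[ y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon \]
```

Fitting the model means estimating the β values from the data.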
Import Libraries
We’re going to display our data, not just model it.
Step 1: Import the following libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Load and Explore Data
Use the Fish dataset. Choose the same two columns you used for the previous lab.
Step 2: Load the data using pandas. It’s always a good idea to run df.head() after you load, and at least df.describe(), to make sure the data looks as expected.
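Step 2 might look roughly like the sketch below. The file name in the comment and the handful of rows used as a stand-in are assumptions for illustration; the column names Length1 and Weight are the ones mentioned later in this deck.

```python
import pandas as pd

# Hypothetical stand-in rows so the sketch runs on its own.
# In the lab you would load the file instead, for example:
#   df = pd.read_csv("Fish.csv")   # file name is an assumption
df = pd.DataFrame({
    "Length1": [23.2, 24.0, 23.9, 26.3, 26.5, 26.8],
    "Weight": [242.0, 290.0, 340.0, 363.0, 430.0, 450.0],
})

# First rows: check that column names and values look sensible
print(df.head())

# Count, mean, std, min, quartiles, and max for each numeric column
summary = df.describe()
print(summary)
```

If describe() shows an unexpected count or a minimum of zero for a measurement, that is worth investigating before modeling.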
Step 3: Draw the scatter plot
Seaborn has an lmplot function which displays a scatter plot and draws a regression line. Use the following parameters: ci=None, line_kws={"color": "red"}. This removes the confidence interval and draws a red fit line. Compare it to the Statsmodels line. Is it close?
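A minimal sketch of the lmplot call with those two parameters is shown here. The data frame is a small hypothetical stand-in; swap in your own df and your own two column names.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data; use the df you loaded in Step 2 instead
df = pd.DataFrame({
    "Length1": [23.2, 24.0, 23.9, 26.3, 26.5, 26.8],
    "Weight": [242.0, 290.0, 340.0, 363.0, 430.0, 450.0],
})

# ci=None drops the shaded confidence band;
# line_kws is passed on to the fitted line, here making it red
g = sns.lmplot(x="Length1", y="Weight", data=df,
               ci=None, line_kws={"color": "red"})
```

lmplot returns a FacetGrid, which is why the result is stored in g rather than in an axes variable. In a notebook the figure is displayed automatically.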
Generate the variables and Fit the model
● Step 4: Sklearn works with arrays. You’ll need to convert your X and Y variables into 2D numpy arrays with a single column each (Sklearn expects X to be 2D). There are many ways to do this. Try reshape(-1, 1).
● Step 5: We’ve imported the train_test_split function. This lets us split the data for training purposes.
○ The training dataset is used to fit the model
○ The testing dataset is used to check accuracy
● The code here is a little bit tough. The code is in the notes section of this slide deck. Run this code.
● Note the score: for LinearRegression this is R², the proportion of the variance in Y that your X variable explains.
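Steps 4 and 5 can be sketched as follows. This is not the code from the deck's notes section, which is not reproduced here; it is one plausible version under stated assumptions. The synthetic data, the 25% test size, and the random_state value are all assumptions, while the names regr, X_test, and y_test match the ones used in Step 6.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the real Fish values):
# Weight rises roughly linearly with Length1, plus noise
rng = np.random.default_rng(0)
length = rng.uniform(10, 40, size=100)
weight = 30 * length - 250 + rng.normal(0, 60, size=100)
df = pd.DataFrame({"Length1": length, "Weight": weight})

# Step 4: reshape(-1, 1) turns each column into shape (n_samples, 1)
X = df["Length1"].to_numpy().reshape(-1, 1)
y = df["Weight"].to_numpy().reshape(-1, 1)

# Step 5: hold out 25% of the rows for testing (assumed split size)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training rows only, then score on the held-out rows
regr = LinearRegression()
regr.fit(X_train, y_train)
score = regr.score(X_test, y_test)  # R² on the test set
print(score)
```

Because the split is random, the score changes with random_state; fixing it makes the run repeatable.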
Visualize the model
Step 6: Sklearn uses regr.predict to generate the values for the fitted line. This line should be the best fit. Your line may look very similar to the OLS line you generated above, depending on the variables you chose. I chose Length1 and Weight. My score (R²) was 81%. My line was nearly straight, so it looks almost the same as this model.
# Predict Y for the held-out X values
y_pred = regr.predict(X_test)
# Blue points: actual test data; red line: model predictions
plt.scatter(X_test, y_test, color='b')
plt.plot(X_test, y_pred, color='r')
plt.show()
Metrics
We already know our score; it’s returned by the regr.score method. Similar to the way Statsmodels had a model fit, we can also generate some model statistics with Sklearn. Run the code below to generate them. The RMSE, or root mean squared error, is the most important. Think of it as the typical size of the model’s error, in the same units as Y. The smaller the better. What was the RMSE of your model?
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(y_true=y_test, y_pred=y_pred)
mse = mean_squared_error(y_true=y_test, y_pred=y_pred)
# RMSE is the square root of MSE. Older releases accepted squared=False for this,
# but recent scikit-learn versions no longer do, so take the root with numpy.
rmse = np.sqrt(mse)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
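To see what these three numbers measure, the sketch below computes them by hand with numpy on four made-up values. The arrays are hypothetical and chosen so the arithmetic is easy to check.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_hat - y_true           # [2, -2, 3, -3]

mae = np.mean(np.abs(errors))     # (2 + 2 + 3 + 3) / 4 = 2.5
mse = np.mean(errors ** 2)        # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)               # square root of 6.5, about 2.55

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

RMSE comes out slightly larger than MAE here because squaring gives the bigger errors more weight.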