CPSC4830_Assignment1
pdf
School: Langara College
Course: 4800
Subject: Statistics
Date: Feb 20, 2024
Pages: 3
Uploaded by PresidentSquidPerson872
CPSC 4830 Assignment 1

1. Fit a linear regression model to the sklearn California housing dataset. In this task, you do not need to consider interactions between the variables (i.e. no X1X2, X1X3, etc. terms in the model). You have to consider the following items:
   a) Data preprocessing: correlation of the variables and VIF.
   b) Standardization of variables.
   c) Split the data into training and testing datasets.
   d) Show the model summary (i.e. include R-squared, adjusted R-squared, coefficients, etc.). Plot the residual error.
   e) By backward elimination, select the features using Ridge regularization.
   f) By backward elimination, select the features using LASSO regularization.
   Note: This question only tests whether you know how to perform linear regression. Error rate is not considered in this question. Part of the code will be provided to you.

2. Fit a logistic regression model to the sklearn Breast Cancer dataset. In this task, you do not need to consider interactions between the variables (i.e. no X1X2, X1X3, etc. terms in the model). You have to consider the following items:
   a) Data preprocessing: correlation of the variables and VIF.
   b) Do you need to standardize all variables?
   c) Should all variables be included?
   d) Split the data into training and testing datasets and train the model. Give the ROC curve and the optimal threshold.
   e) Calculate the confusion matrix with the optimal threshold.
   Note: This question only tests whether you know how to perform logistic regression. Error rate is not considered in this question. Part of the code will be provided to you.

3. Fit a decision tree model to the sklearn diabetes dataset. You have to consider the following items:
   a) Split the data into training and testing datasets.
   b) Use information gain (entropy) to build the model and calculate the confusion matrix.
   c) Repeat b) using the Gini index to build the model.
   Note: This question only tests whether you know how to build a decision tree. Error rate is not considered in this question. Part of the code will be provided to you.
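
For Question 1 a) to d), the preprocessing and fitting steps could be sketched roughly as follows. This is not the code provided with the assignment. So that it runs offline, the sketch uses a small synthetic dataset (an assumption); the same functions should apply to the arrays returned by `sklearn.datasets.fetch_california_housing(return_X_y=True)`. The helper names `vif` and `fit_linear` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def vif(X):
    """Variance inflation factors: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def fit_linear(X, y, seed=0):
    """Split, standardize using training statistics only, fit OLS, report fit statistics."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    scaler = StandardScaler().fit(X_tr)
    Z_tr, Z_te = scaler.transform(X_tr), scaler.transform(X_te)
    model = LinearRegression().fit(Z_tr, y_tr)
    n, p = Z_tr.shape
    r2 = model.score(Z_tr, y_tr)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return {
        "intercept": model.intercept_,
        "coef": model.coef_,
        "r2": r2,
        "adj_r2": adj_r2,
        "test_residuals": y_te - model.predict(Z_te),
    }


# Synthetic stand-in data (assumption): x2 is almost a copy of x1,
# so both columns should show a large VIF.
rng = np.random.default_rng(0)
n_samples = 500
x1 = rng.normal(size=n_samples)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n_samples)
x3 = rng.normal(size=n_samples)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 - 1.0 * x3 + rng.normal(scale=0.5, size=n_samples)

vif_values = vif(X)
report = fit_linear(X, y)
print("VIF:", np.round(vif_values, 2))
print("R^2: %.3f, adjusted R^2: %.3f" % (report["r2"], report["adj_r2"]))
print("coefficients:", np.round(report["coef"], 3))
```

The residual plot in d) could then be a scatter of the test predictions against `report["test_residuals"]`. If a full summary table is wanted, `statsmodels.api.OLS(y, X).fit().summary()` is one commonly used alternative to computing the statistics by hand.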
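
For Question 1 e) and f), it may help to recall that LASSO can shrink coefficients exactly to zero, while Ridge only shrinks them toward zero. The sketch below therefore keeps the features with nonzero LASSO coefficients, and for Ridge it merely ranks features by absolute standardized coefficient, which is one possible elimination rule and not necessarily the one intended by the assignment. The data are synthetic and the `alpha` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): only x0 and x1 influence y.
rng = np.random.default_rng(1)
names = ["x0", "x1", "x2", "x3"]
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

# Standardize so the penalty treats every feature on the same scale.
Z = StandardScaler().fit_transform(X)

# LASSO: features whose coefficient is driven exactly to zero are dropped.
lasso = Lasso(alpha=0.1).fit(Z, y)
lasso_selected = [name for name, c in zip(names, lasso.coef_) if c != 0.0]

# Ridge: coefficients stay nonzero, so rank by magnitude
# and eliminate from the bottom of the ranking.
ridge = Ridge(alpha=1.0).fit(Z, y)
ridge_ranking = [names[i] for i in np.argsort(-np.abs(ridge.coef_))]

print("LASSO keeps:", lasso_selected)
print("Ridge ranking (largest |coef| first):", ridge_ranking)
```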
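
For Question 2 d) and e), a minimal sketch is given below. The assignment does not say how the "optimal" threshold is defined; the sketch assumes Youden's J statistic (the point on the ROC curve that maximizes TPR minus FPR), which is only one common choice. It also keeps all variables and standardizes all of them, which the questions in b) and c) ask you to examine rather than assume.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# In this dataset class 0 is malignant and class 1 is benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# The scaler is fitted inside the pipeline, on the training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)

# Assumed criterion: Youden's J = TPR - FPR.
best = np.argmax(tpr - fpr)
threshold = thresholds[best]

y_pred = (proba >= threshold).astype(int)
cm = confusion_matrix(y_test, y_pred)
print("AUC: %.3f, chosen threshold: %.3f" % (auc, threshold))
print(cm)
```

Plotting `fpr` against `tpr` gives the ROC curve itself.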
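
For Question 3, note that `sklearn.datasets.load_diabetes` has a continuous target, so a confusion matrix needs a class label first. The sketch below assumes the target is split into two classes at its median; this is an illustrative assumption, and the code provided with the assignment may define the classes differently or use a different diabetes dataset.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y_continuous = load_diabetes(return_X_y=True)
# Assumption: binarize the continuous target at its median.
y = (y_continuous > np.median(y_continuous)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Build one tree per splitting criterion and record its confusion matrix.
matrices = {}
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X_train, y_train)
    matrices[criterion] = confusion_matrix(y_test, tree.predict(X_test))
    print(criterion)
    print(matrices[criterion])
```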
No computer programming is needed for the following questions. Please complete them with your calculator.

4. Decision Tree: The following dataset will be used to learn a decision tree for predicting whether a person is Happy (H) or Sad (S) based on the colour of their clothes, whether they wear contact lenses, and the number of rings they have on their fingers.

   Colour  Contact Lens  Num. Rings  Emotion
   G       Y             0           S
   G       N             0           S
   G       N             0           S
   B       N             0           S
   B       N             0           H
   R       N             0           H
   R       N             0           H
   R       N             0           H
   R       Y             3           H

   a. What is the entropy of Emotion (total entropy)?
   b. What is the Chi-Square of Number of Rings?
   c. Which attribute would the decision tree building algorithm choose for the root of the tree when using Information Gain? (Note that this is not the same as part b.) Show all your calculations.
   d. Draw the full decision tree based on the result of part c.
   e. How well does this model perform in terms of accuracy?
   f. Is there any potential error in this model? If yes, suggest a way to improve it.

5. Fit a linear regression model to the following dataset:

   x    y
   -1   1
   0    -1
   2    1

   a. Fit $y_i = \beta_0 + \varepsilon_i$. Find $\beta_0$.
   b. i. Fit $y_i = \beta_1 x_i + \varepsilon_i$. Find $\beta_1$.
      ii. Sketch a graph and show what $\lVert \varepsilon_i \rVert^2$ means.
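
Question 4 is meant to be done by calculator, but the entropy and information gain arithmetic in parts a and c can be cross-checked with a short script. The sketch below transcribes the nine rows of the Colour / Contact Lens / Num. Rings / Emotion table; the function names are hypothetical.

```python
from collections import Counter
from math import log2

# Rows of the Question 4 table: (Colour, Contact Lens, Num. Rings, Emotion).
ROWS = [
    ("G", "Y", 0, "S"),
    ("G", "N", 0, "S"),
    ("G", "N", 0, "S"),
    ("B", "N", 0, "S"),
    ("B", "N", 0, "H"),
    ("R", "N", 0, "H"),
    ("R", "N", 0, "H"),
    ("R", "N", 0, "H"),
    ("R", "Y", 3, "H"),
]
ATTRIBUTES = {"Colour": 0, "Contact Lens": 1, "Num. Rings": 2}
LABEL = 3


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column):
    """Total entropy minus the weighted entropy after splitting on one column."""
    base = entropy([r[LABEL] for r in rows])
    remainder = 0.0
    for value in {r[column] for r in rows}:
        subset = [r[LABEL] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


total_entropy = entropy([r[LABEL] for r in ROWS])
gains = {name: information_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)

print("total entropy: %.4f" % total_entropy)
for name, gain in gains.items():
    print("gain(%s) = %.4f" % (name, gain))
print("root attribute:", best)
```

With the table as transcribed, the total entropy comes out near 0.991 bits and Colour has by far the largest gain, since the G and R branches are pure.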
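
For Question 5, the two least-squares estimates have closed forms that can be verified exactly. Minimizing the sum of squared residuals gives the sample mean of y for the intercept-only model, and the ratio of the sum of x times y to the sum of x squared for the no-intercept model. The sketch below assumes the data points are (-1, 1), (0, -1) and (2, 1), read row by row from the table, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# Data points (x, y), read row by row from the Question 5 table.
xs = [Fraction(-1), Fraction(0), Fraction(2)]
ys = [Fraction(1), Fraction(-1), Fraction(1)]

# Model a: y_i = beta0 + eps_i. The least-squares estimate is the mean of y.
beta0 = sum(ys) / len(ys)

# Model b: y_i = beta1 * x_i + eps_i. The least-squares estimate is
# sum(x*y) / sum(x^2) for a line forced through the origin.
beta1 = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Sum of squared residuals for each fitted model.
sse_a = sum((y - beta0) ** 2 for y in ys)
sse_b = sum((y - beta1 * x) ** 2 for x, y in zip(xs, ys))

print("beta0 =", beta0, " SSE =", sse_a)
print("beta1 =", beta1, " SSE =", sse_b)
```

Each squared residual is the squared vertical distance between a data point and the fitted line, which is the quantity the sketch in part b.ii is asking about.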
6. For the table below, calculate:

              Test +ve  Test -ve
   Cancer     30        30
   No Cancer  10        30

   a. Sensitivity
   b. Specificity
   c. False Discovery Rate
   d. Is there any change to the design of this experiment that would give a better result for the above validation measures?

7. Suppose there is a dataset whose output attribute (or target variable) is a binary variable.
   a. What is the maximum training error that a decision tree could have on any dataset?
   b. Construct an example dataset with 2 input variables and 8 samples that achieves this maximum training set error.

   x1  x2  y
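
For Question 6 a to c, the three rates follow directly from the four cells of the Cancer / No Cancer table. The sketch below assumes the rows are the true condition and the columns are the test result, so that TP = 30, FN = 30, FP = 10 and TN = 30.

```python
# Cells of the Question 6 table (assumed layout: rows = truth, columns = test result).
tp, fn = 30, 30   # Cancer:    test positive, test negative
fp, tn = 10, 30   # No Cancer: test positive, test negative

sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
false_discovery_rate = fp / (fp + tp)   # share of positive tests that are wrong

print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("false discovery rate:", false_discovery_rate)
```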