Homework 2

School

Colorado State University, Global Campus

Course

315

Subject

Statistics

Date

Apr 3, 2024

Uploaded by UltraSquid2144

EBGN 381 Predictive Analytics Homework 2

All submissions should be uploaded to Canvas as a Word document. Any code used during analysis should be properly annotated (i.e., use # comments in the code to note what each step is doing) and pasted into the Word document at the bottom of the submission.

1) The following questions involve the use of the carseats.csv dataset, which can be found on the Homework 2 assignment page in Canvas. Consider Sales as the dependent/outcome variable and all others as the independent/predictor variables.

a. Perform any data transformations needed to create a linear regression model with this dataset.

b. Justify why a linear regression model might be useful when predicting sales with this data. Do any independent variables stand out to you for use in this model?

c. Fit a multiple linear regression model to predict Sales using Price, Urban, and US. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

d. For which of the predictors can you reject the null hypothesis $H_0 : \beta_j = 0$? Create a 95% confidence interval for each of the coefficients.

e. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome. How well do the models in (c) and (e) fit the data?

2) The following questions involve the use of the heartDisease.csv dataset, which can be found on the Homework 2 assignment page in Canvas. Consider 'TenYearCHD' as the outcome variable and all others as predictor variables.

a. Import the data, replace missing values (MVs) with either the mean or mode of their respective field, and winsorize the outliers in the fields glucose and cigsPerDay, setting the upper percentile to 98%. (Winsorizing is outlined in the Data Transformation Reference Table document in the 01 Introduction to Predictive Analytics Module.)

b. Use stratified random sampling to split the data into 80/20 train and test datasets. Make sure the test dataset has 20% of the positive outcomes as well as 20% of the negative outcomes. Fit a logistic regression model to predict TenYearCHD using 'male', 'age', 'education', 'cigsPerDay', 'prevalentStroke', 'sysBP', and 'glucose' (I suggest using the statsmodels Logit model).

c. Interpret the coefficients of this model.

d. Create a confusion matrix for the performance of this model on the test set and calculate the model accuracy, precision, and recall. How well is this model performing?

e. Calculate the VIF for each of the predictor variables and plot the log-odds from the model against each of the predictors. Do you think the assumptions for logistic regression modeling were met?
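Note on 2a. The imputation and winsorizing steps can be sketched as follows. This is a minimal illustration, not the required solution: the small DataFrame is a made-up stand-in for heartDisease.csv, the helper names `impute` and `cap_upper` are hypothetical, and it assumes that "winsorizing with an upper percentile of 98%" means capping values above the 98th percentile at that percentile while leaving the lower tail untouched. Check the Data Transformation Reference Table for the course's exact definition.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for heartDisease.csv (assumption: the real file is not shown here).
df = pd.DataFrame({
    "glucose": [70.0, 85.0, np.nan, 90.0, 400.0, 78.0, 82.0, 95.0, 88.0, 76.0],
    "cigsPerDay": [0.0, 20.0, 0.0, np.nan, 60.0, 0.0, 10.0, 0.0, 5.0, 0.0],
    "education": [1.0, 2.0, 2.0, np.nan, 4.0, 2.0, 1.0, 2.0, 3.0, 2.0],
})


def impute(frame, mean_cols, mode_cols):
    """Fill missing values with the column mean (numeric) or mode (categorical)."""
    out = frame.copy()
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    for col in mode_cols:
        # mode() can return several values; take the first one.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


def cap_upper(series, upper=0.98):
    """Cap values above the given upper quantile at that quantile."""
    cap = series.quantile(upper)
    return series.clip(upper=cap)


# Which columns get the mean and which get the mode is a judgment call (assumption).
clean = impute(df, mean_cols=["glucose", "cigsPerDay"], mode_cols=["education"])
for col in ["glucose", "cigsPerDay"]:
    clean[col] = cap_upper(clean[col], upper=0.98)

print(clean.describe())
```

Imputing before capping means the filled-in values take part in the percentile calculation; doing the steps in the other order is also defensible and gives slightly different caps.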
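Note on 2b. One way to get a test set that holds 20% of the positive outcomes and 20% of the negative outcomes is to sample 20% within each outcome class. The sketch below uses pandas only; the data are randomly generated placeholders and `stratified_split` is a hypothetical helper name. scikit-learn's `train_test_split` with its `stratify` argument is a common alternative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder data (assumption): 1000 rows, roughly 15% positive outcomes.
df = pd.DataFrame({
    "age": rng.integers(30, 70, size=1000),
    "TenYearCHD": (rng.random(1000) < 0.15).astype(int),
})


def stratified_split(frame, outcome, test_frac=0.2, seed=42):
    """Draw test_frac of the rows from each outcome class; the rest is training data."""
    test = frame.groupby(outcome, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    train = frame.drop(test.index)  # relies on a unique index
    return train, test


train, test = stratified_split(df, "TenYearCHD")

# Compare class counts in the full data and in the test set.
print(df["TenYearCHD"].value_counts())
print(test["TenYearCHD"].value_counts())
```

Checking the two `value_counts()` outputs against each other is a quick way to confirm that each class contributed about one fifth of its rows to the test set.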
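Note on 2d. Accuracy, precision, and recall all follow from the four cells of the confusion matrix. The sketch below computes them in plain Python from made-up labels; `y_pred` stands for class predictions obtained by thresholding the model's predicted probabilities, and the threshold itself (0.5 is common) is an assumption you should state in your write-up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Guard against division by zero when a class is never predicted or never present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up example labels (assumption), not results from the homework data.
y_true = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts)
print(metrics)
```

When positive outcomes are rare, accuracy can look good even if recall is poor, so it is worth reading the three numbers together.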
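Note on 2e. The variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors plus an intercept. The sketch below implements that definition directly with NumPy on simulated data; the function name `vif` and the simulated columns are illustrative assumptions. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np


def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(1)

# Simulated predictors (assumption): sysBP is built to correlate with age, glucose is independent.
age = rng.normal(50, 10, 500)
sysBP = 90 + 0.8 * age + rng.normal(0, 5, 500)
glucose = rng.normal(80, 15, 500)

vifs = vif(np.column_stack([age, sysBP, glucose]))
print(dict(zip(["age", "sysBP", "glucose"], vifs.round(2))))
```

In this simulation the two correlated columns show clearly larger VIFs than the independent one, which is the pattern to look for in the real predictors.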
