CPSC 4830 Assignment 1

1. Fit a linear regression model to the sklearn California housing dataset. In this task you do not need to consider interactions between variables (i.e. no X1X2, X1X3, etc. in the model). Address the following items (a minimal code sketch appears after Question 3):
   a) Data preprocessing: correlation of the variables and VIF.
   b) Standardization of the variables.
   c) Split the data into training and testing datasets.
   d) Show the model summary (R-squared, adjusted R-squared, coefficients, etc.) and plot the residual errors.
   e) Select features by backward elimination using Ridge regularization.
   f) Select features by backward elimination using LASSO regularization.
   Note: This question only tests whether you know how to perform linear regression; the error rate is not considered. Part of the code will be provided to you.

2. Fit a logistic regression model to the sklearn breast cancer dataset. In this task you do not need to consider interactions between variables (i.e. no X1X2, X1X3, etc. in the model). Address the following items (a sketch appears after Question 3):
   a) Data preprocessing: correlation of the variables and VIF.
   b) Do you need to standardize all variables?
   c) Do you need to include all variables?
   d) Split the data into training and testing datasets and train the model. Plot the ROC curve and find the optimal threshold.
   e) Calculate the confusion matrix at the optimal threshold.
   Note: This question only tests whether you know how to perform logistic regression; the error rate is not considered. Part of the code will be provided to you.

3. Fit a decision tree model to the sklearn diabetes dataset. Address the following items (a sketch appears below):
   a) Split the data into training and testing datasets.
   b) Use information gain (entropy) to build the model and calculate the confusion matrix.
   c) Repeat b) using the Gini index to build the model.
   Note: This question only tests whether you know how to perform a decision tree; the error rate is not considered. Part of the code will be provided to you.
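The following is a minimal sketch of the workflow asked for in Question 1, assuming statsmodels is available for the OLS summary and VIF; the variable names, the 80/20 split, and the alpha grids are illustrative choices, not part of the assignment.

```python
# Question 1 (sketch): linear regression on the California housing data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# a) Correlation of the predictors and their VIFs (computed with a constant added)
print(X.corr())
X_c = sm.add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_c.values, i) for i in range(1, X_c.shape[1])],
    index=X.columns,
)
print(vif)

# b) Standardize the predictors
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# c) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.2, random_state=0
)

# d) OLS summary (R-squared, adjusted R-squared, coefficients) and residual plot
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()
print(ols.summary())
plt.scatter(ols.fittedvalues, ols.resid, s=5)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# e), f) Ridge and LASSO fits whose coefficients can guide backward elimination
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
lasso = LassoCV(cv=5, random_state=0).fit(X_train, y_train)
print(pd.DataFrame({"ridge": ridge.coef_, "lasso": lasso.coef_}, index=X.columns))
```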
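For Question 2, a sketch of parts d) and e) under similar assumptions; taking the threshold that maximizes Youden's J (TPR − FPR) is one common reading of "optimal threshold", not necessarily the one intended by the assignment.

```python
# Question 2 (sketch): logistic regression on the breast-cancer data.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# d) Split, standardize (scaler fitted on the training set only), and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=5000).fit(scaler.transform(X_train), y_train)

# ROC curve on the test set and the threshold maximizing Youden's J (assumption)
probs = model.predict_proba(scaler.transform(X_test))[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
best = thresholds[np.argmax(tpr - fpr)]

# e) Confusion matrix at that threshold
print(confusion_matrix(y_test, (probs >= best).astype(int)))
```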
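For Question 3, note that sklearn's load_diabetes target is continuous; the sketch below binarizes it at the median, which is purely an assumption made so that entropy/Gini classification trees and a confusion matrix apply, and may differ from what the provided code does.

```python
# Question 3 (sketch): decision trees on the diabetes data.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_diabetes(return_X_y=True)
y_bin = (y > np.median(y)).astype(int)  # assumption: binarize the continuous target

# a) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.2, random_state=0
)

# b) Information-gain (entropy) tree, then c) the same with the Gini index
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X_train, y_train)
    print(criterion)
    print(confusion_matrix(y_test, tree.predict(X_test)))
```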
No computer programming for the following questions; please finish them with your calculator.

4. Decision tree: the following dataset will be used to learn a decision tree for predicting whether a person is Happy (H) or Sad (S) based on the colour of their clothes, whether they wear contact lenses, and the number of rings on their fingers. (The relevant formulas are recalled after Question 5.)

   Colour   Contact Lens   Num. Rings   Emotion
   G        Y              0            S
   G        N              0            S
   G        N              0            S
   B        N              0            S
   B        N              0            H
   R        N              0            H
   R        N              0            H
   R        N              0            H
   R        Y              3            H

   a. What is the entropy of Emotion (the total entropy)?
   b. What is the chi-square of Number of Rings?
   c. Which attribute would the decision-tree-building algorithm choose for the root of the tree using information gain? (Note that this is not the same as part b.) Show all your calculations.
   d. Draw the full decision tree from the result of part c.
   e. How well does this model behave in terms of accuracy?
   f. Is there any potential error in this model? If yes, suggest a way to improve it.

5. Fit a linear regression model to the following dataset:

   x    y
   -1   1
   0    -1
   2    1

   a. Fit $y_i = \beta_0 + \varepsilon_i$. Find $\beta_0$.
   b. i. Fit $y_i = \beta_1 x_i + \varepsilon_i$. Find $\beta_1$.
      ii. Sketch a graph and show what $\sum_i \varepsilon_i^2$ means.
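The quantities asked for in Question 4 follow the standard definitions recalled below (a reminder only, not the worked answers); here $S$ denotes the full set of records and $S_v$ the subset whose attribute $A$ takes the value $v$.

```latex
% Standard definitions used in Question 4 (not the worked answers).
\begin{align*}
H(\text{Emotion}) &= -p_H \log_2 p_H - p_S \log_2 p_S,\\
\mathit{IG}(A)    &= H(\text{Emotion}) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v),\\
\chi^2            &= \sum_{\text{cells}} \frac{(O - E)^2}{E},
\qquad E = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.
\end{align*}
```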
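For Question 5, both fits follow from minimizing the residual sum of squares; the general least-squares formulas are recalled below, with the numeric answers left to the calculator as the assignment asks.

```latex
% Least-squares estimators for the two models in Question 5 (general formulas only).
\begin{align*}
\text{(a) minimize } \sum_i (y_i - \beta_0)^2
  &\;\Longrightarrow\; \hat{\beta}_0 = \bar{y} = \tfrac{1}{n}\sum_{i=1}^{n} y_i,\\
\text{(b) minimize } \sum_i (y_i - \beta_1 x_i)^2
  &\;\Longrightarrow\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.
\end{align*}
```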
6. For the table below, calculate the following (the standard definitions are recalled after Question 7):

                Test +ve   Test -ve
   Cancer       30         30
   No Cancer    10         30

   a. Sensitivity
   b. Specificity
   c. False discovery rate
   d. Is there anything in the design of this experiment that could be changed to obtain better results for the above validation measures?

7. Suppose there is a dataset whose output attribute (target variable) is a binary variable.
   a. What is the maximum training error for a decision tree that any dataset could possibly have?
   b. Construct an example dataset that achieves this maximum training-set error with 2 input variables and 8 samples (use a table with columns x1, x2, and y).
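The validation measures in Question 6 follow the usual definitions, with TP, FN, FP, TN read from the table (Cancer vs. No Cancer rows, Test +ve vs. Test -ve columns):

```latex
% Standard definitions for Question 6.
\begin{align*}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, &
\text{FDR} &= \frac{FP}{FP + TP}.
\end{align*}
```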