BANA7046_003_Review Practice-3

BANA7046-003 Exam Review Practice, Spring 2024

Instruction: This exam review practice serves as part of your preparation for the final exam. It contains more questions than the actual exam to provide a comprehensive review. The format of the actual exam questions will be similar to this practice exam, and the questions themselves may be similar but NOT identical. Please make sure that you read the exam questions CAREFULLY. Below is the exact cover page you will have for the exam.

BANA7046-003 Final Exam – 90 Minutes, Total 60%
5-6:30pm, 02/25/2024, Spring 2023-2024
Prof. Yan Yu

This final exam is designed to test your mastery of important knowledge in Data Mining I. Many of the problems are similar to your weekly practices. You have 90 minutes to complete this exam.

The final exam should be the SOLE work of each individual student. Any email, file transfer, or chat, including ChatGPT, is strictly prohibited. You are only allowed to discuss with the instructor and TA if you have any questions. Honorlock will record your screen, which is subject to review if needed. Anyone cheating or assisting another during an exam will be given a 0 for that exam and possibly a grade of F for the class. College procedures will be followed, and the graduate dean will be notified.

Note: The final exam is timed 5-6:30pm, Sunday, February 25, 2024, and will automatically be submitted once the timer runs out. This exam is open book and open notes. You can use all the course notes and lab notes. Please time yourself. If you have questions, you can open and log on to Zoom AFTER Honorlock is functioning.

Prof. Yan Yu
Email: Yan.Yu@uc.edu
Zoom: https://ucincinnati.zoom.us/j/5977465058
Name: Jiantong Wang, Ph.D. Candidate (Exam Proctoring Volunteer)
Email: wang5jt@mail.uc.edu
Zoom: https://ucincinnati.zoom.us/j/7148538706

The exam is in three parts.

Part       Question Type                  %
Part I     Multiple-Choice Questions      30
Part II    Boston Housing Case Study      15
Part III   Credit Card Case Study         15

You will answer the multiple-choice questions on Canvas. For computational Parts II and III, please save all your answers in a WORD file with the file name "Exam_BANA7046_003_Last Name_M#.docx", where M# is your actual M# used as the seed. For all computational parts with an R symbol, we will grade your R code directly. You do NOT need to copy your code lines to the question. Please add a comment such as ###II, a### to help us grade.

Academic Integrity: As with all Lindner College of Business efforts, in this course you will be held to the highest ethical standards, critical to building character. Ensuring your integrity is vital and ultimately your responsibility. To help ensure the alignment of incentives, the Lindner College of Business has implemented a "Two Strikes Policy" regarding academic integrity that supplements the UC Student Code of Conduct. All academic programs at the Lindner College of Business use this "Two Strikes Policy":
- Any student who has been found responsible for two cases of academic misconduct may be dismissed from the College.
- All cases of academic misconduct (e.g., copying other students' assignments, failure to adequately cite or reference, cheating, plagiarism, falsification, etc.) will be formally reported by faculty.
- Students will be afforded due process for allegations as outlined in the policy.
Part I: Multiple Choice

1. Which of the following is an example of unsupervised learning? (Lecture 1)
   a. In the Target example, predicting whether a teen girl is pregnant based on the purchasing record.
   b. In the OKCupid love story example, clustering women that best matched Chris.
   c. Predicting whether to grant a loan to an applicant based on demographic and financial data.
   d. Predicting housing price based on various household demographics and area information.

2. The line of code iris[,1:4] will produce (Lecture 1)
   a. Columns 1, 2, 3, 4 of the iris data
   b. Rows 1, 2, 3, 4 of the iris data
   c. Columns 1 and 4 of the iris data
   d. Rows 1 and 4 of the iris data

3. Which line of code randomly samples 80% of observations from the Boston Housing dataset?
   a. Boston[sample(nrow(Boston), nrow(Boston)*0.8),]
   b. Boston[1:nrow(Boston)*0.8,]
   c. dim(Boston)[2]
   d. subset(x = Boston, subset = medv > 0.8)

4. Which of the following is FALSE for K-nearest neighbor (KNN)? (Lecture 1)
   a. K stands for the number of predictors. # FALSE: K stands for the number of nearest neighbors
   b. Nearest neighbors are observations that are similar.
   c. KNN can be used for both regression and classification.
   d. Small K may result in overfitting.

5. Which of the following is FALSE about the LASSO method?
   a. The shrinkage parameter λ controls the model complexity.
   b. The optimal λ can be determined by cross-validation.
   c. For linear regression, LASSO is essentially ordinary least squares (OLS).
   d. LASSO can be used for both p < n and p >> n.

6. Suppose β̂ is a parameter vector estimated from a logistic regression model built from training data. When a new observation x is available, the result of x′β̂ will give you: (Lecture 4)
   a. A prediction of whether y = 1
   b. The predicted probability of y = 1
   c. The predicted log odds
   d. The residual

7. Which of the following statements about CART is TRUE?
   a. CART models offer the same computational simplicity as linear regression models.
   b. CART models are known for their ease of interpretation.
   c. CART is a type of linear model.
   d. CART provides an analytical solution to model estimation.
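A quick way to check the R facts behind questions 2 and 3; a minimal sketch, assuming the Boston data from the MASS package and an illustrative seed in place of your M#:

library(MASS)                       # provides the Boston housing data

iris[, 1:4]                         # question 2: columns 1-4 of the iris data (choice a)

set.seed(123)                       # illustrative seed; the exam asks for your M#
train <- Boston[sample(nrow(Boston), nrow(Boston) * 0.8), ]   # question 3: choice a
nrow(train)                         # 404 rows, i.e., 80% of 506 (truncated)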
8. A logistic regression output on the (Taiwan) credit card default data shows "number of Fisher Scoring iterations: 4". Which of the following is FALSE? (Lecture 4)
   a. Fisher scoring is an algorithm used for logistic regression.
   b. The Fisher scoring algorithm optimizes a nonlinear objective function.
   c. The Fisher scoring algorithm converged after 4 iterations.
   d. Logistic regression has a closed-form analytical solution, as in linear regression using OLS.

9. For a linear regression WITH INTERCEPT, suppose you have data with 1000 observations, 5 continuous predictor variables, and 1 categorical variable with 3 categories A, B, and C. What is the total number of parameters? (Lecture 2)
   a. 1000
   b. 8
   c. 5
   d. 2

10. Suppose we have a categorical variable with 4 levels. How many binary dummy variables should we use to represent this variable? (Lecture 4)
   a. 4
   b. 3 # levels - 1 = 4 - 1 = 3
   c. 5
   d. 2

11. When using leave-one-out cross-validation to evaluate a linear regression model with least squares estimation on a data set with 1000 observations and 10 predictor variables, how many regression models are estimated? (Lecture 4)
   a. 10
   b. 1
   c. 5
   d. 1000

12. Recall the definitions of model AIC and BIC:
    AIC = −2 ln(L) + 2p
    BIC = −2 ln(L) + ln(n)·p
    where L is the likelihood, p is the number of parameters, n is the number of observations, and smaller AIC/BIC is better. Which one of the following statements is FALSE? (Lecture 3)
   a. For the same model, if the AIC is less than the BIC, then it is an indication that you have found a good model.
   b. When AIC is used as the model selection criterion, usually more independent variables are included in the model compared with the model selected using BIC.
   c. BIC penalizes complex models more than AIC.
   d. One can only use AIC as a model selection criterion in stepwise regression.
   ## Note: both (a) and (d) were marked as false statements in the practice quiz.
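As a refresher for question 12, R reports both criteria directly and step() can search under either penalty; a minimal sketch on the built-in mtcars data (the data and model are illustrative only):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # any small linear model
AIC(fit)                                  # -2 ln(L) + 2p
BIC(fit)                                  # -2 ln(L) + ln(n)*p; the larger penalty whenever n >= 8
step(fit, k = 2)                          # stepwise selection under AIC (the default penalty)
step(fit, k = log(nrow(mtcars)))          # the same search under the BIC penalty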
For questions 13-14, suppose we fit a tree model on the Boston housing data.

13. Is this tree a:
   a. Regression tree.
   b. Classification tree.

14. Suppose we want to prune the tree model. The plotcp() function gives the plot below. Which cp value should we choose?
   [plotcp plot omitted]
   a. A
   b. B
   c. C
   d. D

15. The ROC plot below is from a logistic regression model built to predict bankruptcy; the response variable is 1 if a firm went bankrupt. Cut-off probability values of 0.2, 0.4, and 0.9 are marked on the curve as points A, B, and C. (Lecture 5)
   [ROC plot omitted]
   Which one of A, B, C corresponds to cut-off probability 0.9?
   a. A
   b. B
   c. C
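For question 15, remember that each point on an ROC curve corresponds to one cut-off, and that a high cut-off flags few cases as positive. A minimal sketch with simulated data, using the pROC package (the course labs may use a different ROC package):

library(pROC)                          # one common ROC package

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))     # simulated bankruptcy indicator
fit <- glm(y ~ x, family = binomial)
p.hat <- predict(fit, type = "response")

roc.obj <- roc(y, p.hat)               # each point on the curve is one cut-off
plot(roc.obj)
auc(roc.obj)

# A high cut-off such as 0.9 classifies almost no firm as bankrupt, so both the
# true-positive and false-positive rates are near 0: that point sits at the
# bottom-left end of the curve. Lower cut-offs move up and to the right.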
II. Computation Part: Boston Housing Case Study

For computational Parts II and III, please save all your answers in a WORD file with the file name "Exam_BANA7046_003_Last Name_M#.docx", where M# is your actual M# used as the seed. For all computational parts with an R symbol, we will grade your R code directly. You do NOT need to copy your code lines to the question. Please add a comment such as ###II, a### to help us grade.

Please read in your full Boston Housing data as in the Lecture, Lab, and Case Practices, with sample size n = 506. A sketch of parts (a)-(g) follows the list below.

a. (R) Conduct a random sampling using your M# to set the seed. Randomly sample a training data set that contains 80% of the original data and save it as Boston.train; set the remaining 20% aside as testing data and save it as Boston.test. Use the 80% Boston.train data to build the model and use Boston.test to test the model.

b. Plot the boxplot of the "medv" response variable of the training data. Copy and paste your boxplot into your WORD file.

c. Build a linear model on Boston.train with all the variables and call it the Full Model. Report the corresponding in-sample model mean squared error (MSE) and save it as "MSE.full". Report the corresponding in-sample model average sum of squared errors (ASE) and save it as "ASE.full".
MSE.full =
ASE.full =

d. Obtain and report the out-of-sample MSPE on the testing data for the full model and save it as "MSPE.full".
MSPE.full =

e. Conduct a stepwise variable selection with BIC on Boston.train and report your selected variables.
Selected variables:

f. Report the corresponding in-sample model MSE and in-sample model ASE and save them as "MSE.stepBIC" and "ASE.stepBIC".
MSE.stepBIC =
ASE.stepBIC =

g. Obtain and report the out-of-sample MSPE on the testing data for the stepwise-BIC-selected model.
MSPE.stepBIC =
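A minimal sketch of one way to code parts (a)-(g), assuming the MASS copy of the Boston data and the SSE/(n - p) convention for the model MSE; the seed is a placeholder for your M#, and details should be checked against your own lab notes:

library(MASS)                                   # Boston housing data, n = 506

set.seed(12345678)                              # placeholder; use your own M#
index <- sample(nrow(Boston), nrow(Boston) * 0.8)
Boston.train <- Boston[index, ]                 # (a) 80% training sample
Boston.test  <- Boston[-index, ]                # remaining 20% for testing

boxplot(Boston.train$medv, main = "medv, training data")   # (b)

model.full <- lm(medv ~ ., data = Boston.train)            # (c) full model
n <- nrow(Boston.train)
p <- length(coef(model.full))                   # number of parameters incl. intercept
MSE.full <- sum(residuals(model.full)^2) / (n - p)   # in-sample model MSE
ASE.full <- mean(residuals(model.full)^2)            # average sum of squared errors

MSPE.full <- mean((Boston.test$medv -                # (d) out-of-sample MSPE
                   predict(model.full, Boston.test))^2)

model.stepBIC <- step(model.full, k = log(n), trace = 0)   # (e) stepwise with BIC penalty
names(coef(model.stepBIC))                                 # selected variables

MSE.stepBIC  <- sum(residuals(model.stepBIC)^2) /          # (f)
                (n - length(coef(model.stepBIC)))
ASE.stepBIC  <- mean(residuals(model.stepBIC)^2)
MSPE.stepBIC <- mean((Boston.test$medv -                   # (g)
                      predict(model.stepBIC, Boston.test))^2)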
h. Compare your out-of-sample MSPE for the full model and the stepwise-BIC-selected model on Boston.test. What do you conclude?

i. Use cv.glm() to conduct a 5-fold cross-validation on the full data. What is the cross-validation score for the full model? What is the cross-validation score for the stepwise-BIC-selected model?

j. Compare your cross-validation scores for the full model and the stepwise-BIC-selected model on the full data. What do you conclude?

k. Build a LASSO variable selection model on Boston.train using lambda.1se and report the selected variables.

l. (OPTIONAL) Build a regression tree on the training data and use plotcp() to determine the "best" cp value. Plot your final tree here. Hint: Use printcp() and choose the cp value that corresponds to your choice of "size of tree - 1", i.e., "nsplit". The x-axis in the plotcp() plot does not show the exact cp values; they are geometric means, used to make the plot easier to view.

m. Please copy all your R code into the WORD file here.
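Continuing the objects from the sketch above (Boston.train, Boston.test, model.stepBIC), one way to code parts (i), (k), and (l); which delta component to report and the illustrative cp value are assumptions to verify against your own output:

library(boot)                     # cv.glm()
library(glmnet)                   # LASSO
library(rpart)                    # regression trees

full.glm <- glm(medv ~ ., data = Boston)                   # (i) refit on the FULL data
cv.full  <- cv.glm(Boston, full.glm, K = 5)$delta[1]       # raw 5-fold CV score
stepBIC.glm <- glm(formula(model.stepBIC), data = Boston)
cv.stepBIC  <- cv.glm(Boston, stepBIC.glm, K = 5)$delta[1]

X <- model.matrix(medv ~ ., data = Boston.train)[, -1]     # (k) predictor matrix
cv.lasso <- cv.glmnet(X, Boston.train$medv, alpha = 1)     # alpha = 1 is the LASSO
coef(cv.lasso, s = "lambda.1se")                           # nonzero rows = selected variables

tree.fit <- rpart(medv ~ ., data = Boston.train)           # (l) regression tree
printcp(tree.fit)                                          # exact cp table
plotcp(tree.fit)                                           # x-axis shows geometric means of cp
tree.final <- prune(tree.fit, cp = 0.02)                   # illustrative cp; read yours off printcp()
plot(tree.final); text(tree.final)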
III. Computation Part: Credit Card Case Study (15 pts)

For computational Parts II and III, please save all your answers in a WORD file with the file name "Exam_BANA7046_003_Last Name_M#.docx", where M# is your actual M# used as the seed. For all computational parts with an R symbol, we will grade your R code directly. You do NOT need to copy your code lines to the question. Please add a comment such as ###III, a### to help us grade.

Please read in your full (Taiwan) credit card default data as in the Lecture, Lab, and Case Practices, with sample size n = 30,000. Suppose that you have two candidate logistic regression models:

Model 1: Two-variable model: EDUCATION + PAY_0
Model 2: Full model with all the predictors

We will conduct a model comparison based on cross-validation. A sketch follows the list below.

a. (R) Conduct a random sampling using your M# to set the seed. Randomly sample a training data set that contains 80% of the original data and save it as Credit.train; set the remaining 20% aside as testing data and save it as Credit.test. Use the 80% Credit.train data to build the model and use Credit.test to test the model.

b. (R) Build a glm() model using the Model 1 formula on the Credit.train data and save it as "Model1". Build a glm() model using the Model 2 formula on the Credit.train data and save it as "Model2".

c. Please report the misclassification rate using cut-off probability 1/6 for Model 1 and Model 2 on the Credit.train data.

d. Please report the misclassification rate using cut-off probability 1/6 for Model 1 and Model 2 on the Credit.test data.

e. (OPTIONAL) Build a classification tree on the Credit.train data and use plotcp() to determine the "best" cp value. Plot your final tree here. Hint: Use printcp() and choose the cp value that corresponds to your choice of "size of tree - 1", i.e., "nsplit". The x-axis in the plotcp() plot does not show the exact cp values; they are geometric means, used to make the plot easier to view.

f. What is the misclassification rate of your final tree above using cut-off probability 1/6?

g. (R) We often use AUC as the cost function for a classification problem. Define the cost function "costAUC".

h. K-fold cross-validation is a method for model selection according to the predictive ability of the models. Now consider the case where K = 5. Using the cost function "costAUC" that you defined above, refit your Model 1 and Model 2 on the full data, and perform cv.glm() on the FULL data "fulldata" with K = 5 folds. Save and report the results as "cv.result1.AUC" for Model 1 and "cv.result2.AUC" for Model 2.
cv.result1.AUC =
cv.result2.AUC =
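A minimal sketch of parts (a)-(d) and (g)-(h), assuming the data frame is named credit with a 0/1 response column default (substitute the object and variable names from your own lab files); the pROC-based costAUC is one workable definition, not necessarily the one from lecture, and the seed is a placeholder for your M#:

library(boot)                                   # cv.glm()
library(pROC)                                   # auc()

# Assumed names: data frame 'credit', 0/1 response 'default'
set.seed(12345678)                              # placeholder; use your own M#
index <- sample(nrow(credit), nrow(credit) * 0.8)
Credit.train <- credit[index, ]                 # (a)
Credit.test  <- credit[-index, ]

Model1 <- glm(default ~ EDUCATION + PAY_0, family = binomial, data = Credit.train)  # (b)
Model2 <- glm(default ~ .,                 family = binomial, data = Credit.train)

misclass <- function(model, data, y, cutoff = 1/6) {       # (c)-(d) helper
  pred <- as.numeric(predict(model, newdata = data, type = "response") > cutoff)
  mean(pred != y)
}
misclass(Model1, Credit.train, Credit.train$default)
misclass(Model2, Credit.test,  Credit.test$default)

costAUC <- function(obs, pred.p) as.numeric(auc(obs, pred.p))   # (g) larger is better

fulldata <- credit                                              # (h) CV on the FULL data
Model1.full <- glm(default ~ EDUCATION + PAY_0, family = binomial, data = fulldata)
Model2.full <- glm(default ~ .,                 family = binomial, data = fulldata)
cv.result1.AUC <- cv.glm(fulldata, Model1.full, cost = costAUC, K = 5)$delta[1]
cv.result2.AUC <- cv.glm(fulldata, Model2.full, cost = costAUC, K = 5)$delta[1]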
i. Compare cv.result1.AUC for Model 1 with cv.result2.AUC for Model 2. Which model would you choose? Why?

j. In cv.glm(), the full data is divided into 5 folds: four folds serve as training data and the remaining fold as testing data, and this is repeated. Do you expect the cross-validated AUC above to be comparable with the AUC on the training data or on the testing data? Why?

k. Please copy all your R code into the WORD file named "Exam_BANA7046_003_Last Name_M#.docx". Please submit your WORD file on Canvas.