NCC 5010 - 2023 Practice Final 1 Solution

School: Cornell University
Course: 5010
Subject: Statistics
Date: Jan 9, 2024

NCC 5010: Data Analytics and Modeling
Practice Final 1

Read Carefully:

1. Write your student ID and net ID below. Do not put your name on this exam.
2. You may have two 8 ½ by 11 sheets of paper with notes on both sides. Other than that, the exam is closed book and closed notes. Laptops and communication devices are not allowed. You are not allowed to share any materials or equipment.
3. You have 3.0 hours to complete this exam. The exam has 6 problems. Points for each problem are indicated. Some problems will take longer than others, so plan your time accordingly.
4. Write your solutions in the space provided in this document using the front and back of sheets as necessary.
5. Show all of your calculations clearly. Answers like "true" or just a single number are not satisfactory and will not be given partial credit. If we cannot locate where your solution is written, you will not receive credit. State your assumptions if you must make any assumptions not given in a question.
6. Taking this exam indicates that you understand and will abide by the Cornell University Code of Academic Integrity.
7. Some common z-statistics:
   P(z > 2.576) = 0.005
   P(z > 2.326) = 0.01
   P(z > 1.960) = 0.025
   P(z > 1.645) = 0.05

________________________________________________________________________
Student ID# ______________________________ Net ID# ______________________________
Do not write below this line.
________________________________________________________________________
Q1: /25   Q2: /16   Q3: /18   Q4: /15   Q5: /13   Q6: /14   Total:
Question 1: (25 points)

You are an executive at a large book publisher, Bean Publishing. Bean Publishing uses a regression model to examine several factors that influence how much a customer spends on leisure books every year. The regression output is below. The dependent variable is book sales per customer (in $). The independent variables are:

  • Income (in $) - the customer's monthly income
  • College (0 or 1) - a dummy variable that is set to "1" if the customer has a college education
  • DigitalReader (0 or 1) - a dummy variable that is set to "1" if the customer owns an e-reader
  • IncomeXDigitalReader (in $) - Income multiplied by DigitalReader
  • Age (in years) - the customer's age
  • AgeSQ (in years squared) - Age squared
  • Children - a categorical variable with one of three possible options (there is no missing data):
      1. AdultChildren (0 or 1) - a dummy variable that is set to "1" if the customer has adult children
      2. NoChildren (0 or 1) - a dummy variable that is set to "1" if the customer has no children
      3. Children (0 or 1) - a dummy variable that is set to "1" if the customer has non-adult children

Use this information to answer the following questions.

a. Why is the value for R-squared so close to the value for adjusted R-squared?
Regression Statistics
Multiple R           0.409
R Square             0.167
Adjusted R Square    0.167
Standard Error       33.788
Observations         20000

ANOVA
              df        SS            MS         F      Significance F
Regression    8         4,591,233     573,904    503    0
Residual      19,991    22,822,847    1,142
Total         19,999    27,414,080

                       Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept              85.536        0.716           119.518  0.000    84.133     86.939
Income                 0.003         0.001           2.934    0.003    0.001      0.004
College                2.129         0.782           2.724    0.006    0.597      3.661
DigitalReader          -7.280        2.205           -3.302   0.001    -11.601    -2.959
IncomeXDigitalReader   0.012         0.005           2.215    0.027    0.001      0.022
Age                    0.867         0.092           9.433    0.000    0.687      1.047
AgeSQ                  -0.016        0.003           -5.662   0.000    -0.022     -0.011
AdultChildren          0.113         0.828           0.136    0.892    -1.511     1.736
NoChildren             -31.786       0.556           -57.155  0.000    -32.876    -30.696

Solution to a (3 points): The number of independent variables is small relative to the sample size, so the penalty factor (n - 1)/(n - k - 1) in adjusted R² is close to 1 and the adjustment is small.
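The answer to part a can be checked numerically. The sketch below assumes the standard textbook definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), and takes n, k, and the sums of squares from the regression output; the function name `adjusted_r2` is illustrative only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    penalty = (n - 1) / (n - k - 1)
    return 1 - (1 - r2) * penalty

n, k = 20000, 8                     # Observations and Regression df from the output
r2 = 4_591_233 / 27_414_080         # SS Regression / SS Total
penalty = (n - 1) / (n - k - 1)     # 19999 / 19991, barely above 1
adj = adjusted_r2(r2, n, k)

print(f"R2 = {r2:.3f}, penalty = {penalty:.5f}, adjusted R2 = {adj:.3f}")
```

With 8 variables and 20,000 observations the penalty factor differs from 1 only in the fourth decimal place, which is why both statistics are reported as 0.167.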
b. Estimate the average amount spent on leisure books for a 35 year old customer with an annual income of $60,000, no college education, no children, and who owns an e-reader.
c. Provide an economic interpretation of the impact of NoChildren on the amount spent on leisure books.
d. Provide an economic interpretation of the coefficient on the Intercept.
e. Provide an economic interpretation of the impact of College on the amount spent on leisure books.
f. If a customer has an e-reader, what is the impact of a $1,000 increase in the customer's monthly income on the amount spent on leisure books?
g. If a customer is 50 years old, what is the impact of a 1 year increase in the customer's age on the amount spent on leisure books?

Points (parts b-g, in the order printed): 3, 3, 4, 3, 4, 5

Solutions:

b. With Income in $ per month, Age in years, AgeSQ in years squared, and DR = DigitalReader:
   y = 85.536 + 0.003(Income) + 2.129(College) - 7.280(DR) + 0.012(Income x DR) + 0.867(Age) - 0.016(AgeSQ) + 0.113(AdultChildren) - 31.786(NoChildren)
   The annual income of $60,000 is $60,000 / 12 = $5,000 per month, so
   y = 85.536 + 0.003(5000) + 2.129(0) - 7.280(1) + 0.012(5000 x 1) + 0.867(35) - 0.016(35²) + 0.113(0) - 31.786(1) = $132.22

c. Relative to a customer that has non-adult children, a customer with no children spends $31.786 less on books every year.

d. A customer that has a zero value for every variable is expected to spend $85.536 on books every year.

e. Relative to a customer without a college education, a customer with a college education is expected to pay $2.129 more on books every year.

f. Δy = 0.003(ΔIncome) + 0.012(ΔIncome x DR) = 0.003(1000) + 0.012(1000 x 1) = $15

g. y(51) = stuff + 0.867(51) - 0.016(51²)
   y(50) = stuff + 0.867(50) - 0.016(50²)
   Δy = y(51) - y(50) = -$0.75
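The arithmetic in parts b, f, and g can be reproduced with a short script. This is a sketch only: the coefficients are copied from the regression output, while the helper `predicted_spend` and its argument names are hypothetical and not part of the exam.

```python
# Coefficients copied from the regression output.
COEF = {
    "Intercept": 85.536, "Income": 0.003, "College": 2.129,
    "DigitalReader": -7.280, "IncomeXDigitalReader": 0.012,
    "Age": 0.867, "AgeSQ": -0.016,
    "AdultChildren": 0.113, "NoChildren": -31.786,
}

def predicted_spend(income, college, reader, age, adult_children, no_children):
    """Predicted annual book spending ($); income is monthly, in $."""
    features = {
        "Intercept": 1,
        "Income": income,
        "College": college,
        "DigitalReader": reader,
        "IncomeXDigitalReader": income * reader,
        "Age": age,
        "AgeSQ": age ** 2,
        "AdultChildren": adult_children,
        "NoChildren": no_children,
    }
    return sum(COEF[name] * value for name, value in features.items())

# Part b: $60,000 per year is $5,000 per month.
base = dict(income=60_000 / 12, college=0, reader=1, age=35,
            adult_children=0, no_children=1)
part_b = predicted_spend(**base)

# Part f: effect of +$1,000 monthly income for an e-reader owner.
part_f = predicted_spend(**{**base, "income": base["income"] + 1_000}) - part_b

# Part g: effect of ageing from 50 to 51, everything else held fixed.
part_g = (predicted_spend(**{**base, "age": 51})
          - predicted_spend(**{**base, "age": 50}))

print(f"b: {part_b:.3f}  f: {part_f:.3f}  g: {part_g:.3f}")
```

Parts f and g are computed as differences of two predictions, so every term that does not change (the "stuff" in the handwritten solution) cancels.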
Question 2: (16 points)

Bean is constantly searching for promising unpublished authors. You know that 60% of all unpublished authors work hard and the rest slack off. An author's first publication can either receive critical acclaim or not. Given that an author works hard, his / her first publication has a 25% probability of being critically acclaimed. There is a 5% probability that an author is a slacker and receives critical acclaim on his / her first publication.

NOTE: There is no partial credit awarded for any portion of this problem.

a) What is the probability that an author's first publication receives critical acclaim given that he / she is a slacker?
b) What is the probability that an author is a hard worker and receives critical acclaim on his / her first publication?
c) What is the probability that an author is a slacker given that his / her first publication receives critical acclaim?
d) What is the probability that an author's first publication does not receive critical acclaim given that he / she is a slacker?

Points (parts a-d): 4, 4, 4, 4

Solutions:

Notation: H: hard worker, S: slacker, C: critical acclaim, N: no acclaim

Probability tree:
  P(H) = 0.60:  P(C|H) = 0.25,   P(H∩C) = 0.15
                P(N|H) = 0.75,   P(H∩N) = 0.45
  P(S) = 0.40:  P(C|S) = 0.125,  P(S∩C) = 0.05
                P(N|S) = 0.875,  P(S∩N) = 0.35

a) P(C|S) = 0.125 (right off the tree)
b) P(H∩C) = 0.15 (right off the tree)
c) P(S|C) = P(S∩C) / P(C) = 0.05 / (0.15 + 0.05) = 0.25
d) P(N|S) = 0.875 (right off the tree)
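The tree values follow from the three probabilities given in the question. A minimal sketch of that derivation, with variable names chosen here for illustration:

```python
# Givens from the question.
p_h = 0.60            # P(H): author works hard
p_c_given_h = 0.25    # P(C|H): acclaim given hard worker
p_s_and_c = 0.05      # P(S and C): slacker AND acclaim (a joint probability)

# Derived quantities.
p_s = 1 - p_h                          # P(S)
p_c_given_s = p_s_and_c / p_s          # a) P(C|S)
p_h_and_c = p_h * p_c_given_h          # b) P(H and C)
p_c = p_h_and_c + p_s_and_c            # total probability of acclaim
p_s_given_c = p_s_and_c / p_c          # c) P(S|C), Bayes' rule
p_n_given_s = 1 - p_c_given_s          # d) P(N|S)

print(p_c_given_s, p_h_and_c, p_s_given_c, p_n_given_s)
```

The key reading step is that the 5% figure is a joint probability, not a conditional one, so part a requires dividing by P(S).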
Question 3: (17 points)

Bean has developed a machine learning model to predict which new book releases selected by the organization's buying team will actually be flops. Historically, the company has carried all of the buying team's selections. There are two relevant costs for this analysis. Each book that is carried, but flops, has a $50,000 inventory write down cost. If a book is not carried and it is not a flop, there is a $25,000 opportunity cost in the form of lost profits that could have been made on the book. The confusion matrix is provided below based on proportions. A positive indicates a flop. Bean must evaluate 200 new releases each quarter.

With Machine Learning (only negative predicted values are carried by Bean):

                      Actual Positive   Actual Negative
Predicted Positive    0.15              0.15
Predicted Negative    0.05              0.65

Without Machine Learning (all books are assumed to not be flops and carried by Bean):

                      Actual Positive   Actual Negative
Predicted Positive    0.00              0.00
Predicted Negative    0.20              0.80

a) What is the accuracy with and without the machine learning algorithm?
b) What is the cost each quarter with the machine learning algorithm?
c) What is the cost each quarter without the machine learning algorithm?
d) How much value is the machine learning algorithm expected to create each quarter?

Points (parts a-d, in the order printed): 5, 5, 3, 4

Solutions:

a) With: Accuracy = 0.15 + 0.65 = 0.80
   Without: Accuracy = 0 + 0.80 = 0.80
b) Cost = (FN x 50,000 + FP x 25,000) x 200 = (0.05 x 50,000 + 0.15 x 25,000) x 200 = $1,250,000
c) Cost = (0.20 x 50,000 + 0 x 25,000) x 200 = $2,000,000
d) Value = cost without ML - cost with ML = $2,000,000 - $1,250,000 = $750,000
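The cost comparison can be sketched in a few lines. The cell proportions and dollar costs come from the question; the function `quarterly_cost` and the dictionary layout are assumptions made for this illustration.

```python
FLOP_COST = 50_000      # carried but flops (false negative)
MISSED_COST = 25_000    # not carried but would have sold (false positive)
BOOKS_PER_QUARTER = 200

# Proportions from the two confusion matrices (positive = flop).
with_ml = {"TP": 0.15, "FP": 0.15, "FN": 0.05, "TN": 0.65}
without_ml = {"TP": 0.00, "FP": 0.00, "FN": 0.20, "TN": 0.80}

def accuracy(cells):
    """Share of books classified correctly."""
    return cells["TP"] + cells["TN"]

def quarterly_cost(cells):
    """Expected cost per quarter from the two kinds of error."""
    per_book = cells["FN"] * FLOP_COST + cells["FP"] * MISSED_COST
    return per_book * BOOKS_PER_QUARTER

cost_with = quarterly_cost(with_ml)
cost_without = quarterly_cost(without_ml)
value_created = cost_without - cost_with

print(accuracy(with_ml), accuracy(without_ml), cost_with, cost_without, value_created)
```

Both approaches have the same accuracy, yet the costs differ because the model trades expensive false negatives for cheaper false positives.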
Question 4: (15 points)

You are evaluating acquiring a small boutique publishing business. You run a simulation model to estimate the earnings of this business next year. There are 500 simulations in your model. The model yields a sample mean of $11.7M in earnings with a sample standard deviation of $8.3M.

a. Construct a 99% confidence interval for the mean earnings of this business next year.
b. How many simulation trials do you need to run in order to predict the mean earnings with an accuracy of plus or minus $200K, with a confidence level of 99%? Assume the population standard deviation is equal to the sample standard deviation.
c. Your simulation model assumed that the number of books sold for a particular title obeys a Normal distribution with a mean of 3,000 books and a standard deviation of 1,000 books. You notice that the mean number of books in the simulation results is 3,092 books. You wonder if 3,092 is sufficiently different from 3,000 to suggest that there might be an error in your model. What is the likelihood that the observed sample mean is greater than the true mean by at least 92 books?
d. You run another simulation using 800 trials and calculate a 95% confidence interval for the mean monthly earnings as [$11M, $12M]. Explain whether the following remarks are TRUE or FALSE.
   i. If you run a second simulation with 800 trials, you would also obtain a 95% confidence interval with a range of $1M.
   ii. Given the results of your simulation, a 99% confidence interval for the mean monthly earnings would have a range larger than $1M.

Points (parts a, b, c, d.i, d.ii): 3, 3, 3, 3, 3

Solutions:

a. x̄ ± t(.005) x s/√n, with t(.005) ≈ z = 2.576 at 499 dof
   11.7 ± 2.576 x 8.3/√500 → [$10.74M, $12.66M]

b. Step 1: assume t = z.
   n = (z x σ / E)² = (2.576 x 8.3 / 0.2)² = 11,428.47, round up to 11,429
   Step 2: unnecessary since dof > 100.

c. z = (x̄ - μ) / (σ/√n) = (3092 - 3000) / (1000/√500) = 2.06
   P(Z > 2.06) = 1 - P(Z ≤ 2.06) = 1 - 0.9803 = 0.0197

d. i. False. Due to random draws in simulation, your results will differ from one run to the next.
   ii. True. You can be more confident in a wider range. To see this, note that t(.005) ≈ 2.576 > t(.025) ≈ 1.960.
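The three calculations in parts a to c can be reproduced with the Python standard library. This sketch uses the z value 2.576 from the exam's cover-page table rather than an exact quantile, and works in $M; variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z_99 = 2.576                    # two-sided 99% critical value from the exam's table
mean, sd, n = 11.7, 8.3, 500    # $M, $M, number of simulations

# Part a: 99% confidence interval for the mean.
half_width = Z_99 * sd / sqrt(n)
ci = (mean - half_width, mean + half_width)

# Part b: trials needed for a half-width of $0.2M; always round up.
n_required = ceil((Z_99 * sd / 0.2) ** 2)

# Part c: chance the sample mean exceeds the true mean by at least 92 books.
z_c = (3092 - 3000) / (1000 / sqrt(n))
p_c = 1 - NormalDist().cdf(z_c)

print(f"CI = [{ci[0]:.2f}, {ci[1]:.2f}]  n = {n_required}  z = {z_c:.3f}  p = {p_c:.4f}")
```

The probability printed for part c differs slightly from 0.0197 in the third decimal place because the handwritten solution rounds z to 2.06 before looking it up in a table.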
Question 5: (13 points)

You are evaluating whether to adopt a digital submission and screening system to evaluate unsolicited manuscripts. You will purchase the system if you believe that it will generate average savings of more than $500 per manuscript. You pilot the system and use it on a random sample of 64 manuscripts. The average savings is $476 and the standard deviation is $80.

a. Clearly state the Null and Alternative Hypotheses.
b. Compute the appropriate test statistic.
c. Compute the p-value.
d. Use your results to justify what action you should take, assuming alpha = 0.05.

Points (parts a-d, in the order printed): 4, 6, 2, 1

Solutions:

a. H0: μ ≤ 500
   HA: μ > 500

b. t = (x̄ - μ0) / (s/√n) = (476 - 500) / (80/√64) = -2.4, with dof = 63

c. P(t > -2.4) = 1 - P(t > 2.4)
   0.005 < P(t > 2.4) < 0.01, so 0.995 > P(t > -2.4) > 0.99

d. Since p > α = 0.05, do not purchase the system.
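The test statistic and decision can be sketched as follows. The standard library has no t distribution, so this example approximates the right-tail p-value with the standard normal; that substitution is an assumption of the sketch, not the exam's method, which brackets the p-value from a t table with 63 dof.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu0, s, n = 476, 500, 80, 64
alpha = 0.05

# Part b: one-sample t statistic and its degrees of freedom.
t_stat = (x_bar - mu0) / (s / sqrt(n))
dof = n - 1

# Part c (approximation): right-tail probability using the normal
# distribution in place of the t distribution with 63 dof.
p_approx = 1 - NormalDist().cdf(t_stat)

# Part d: purchase only if H0 (mean savings <= 500) is rejected.
purchase = p_approx < alpha

print(f"t = {t_stat:.2f}, dof = {dof}, p ~ {p_approx:.4f}, purchase = {purchase}")
```

Because the sample mean falls below $500 while the alternative hypothesis is right-sided, the p-value is necessarily above 0.5 and the null cannot be rejected.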
Question 6: (14 points - 2 points each)

Answer "True" or "False" and explain why. To receive credit you must be correct both with describing the statement as "true" or "false" and with your explanation why. Merely answering "true" or "false" correctly will receive 0 points.

a. Two events with non-zero probabilities can be independent and mutually exclusive at the same time.
b. If the p-value in a hypothesis test is close to 1.0 then the Null Hypothesis must be true.
c. From the required reading for Day 20, a distinguishing characteristic of machine learning models is that they are free from biases that can taint human decisions.
d. If we are testing the difference between two sample means, we can use the z-distribution to complete the hypothesis test based on sample sizes of n1 = 40 and n2 = 25.
e. For any given data set and choice of dependent and independent variables, performing least-squares regression maximizes R².
f. The positive predicted value tells you the probability that your machine learning model is correct given that it makes a positive prediction.
g. The prevalence measure for a machine learning model captures how often the prediction is right.

Solutions:

a. False. If 2 events are mutually exclusive, P(A|B) = 0. If 2 events are independent, P(A|B) = P(A). Since these are non-zero probability events, P(A) ≠ 0, so both cannot hold at once.
b. False. The p-value is the probability of getting something as extreme as the sample result assuming the null is true. This implies nothing about whether the null is actually true.
c. False. Algorithms can perpetuate biases that are established in the underlying data.
d. False. Since dof = 40 + 25 - 2 = 63 < 100, use the t distribution.
e. True. Since R² = 1 - SSE/SST and least squares regression minimizes SSE, it maximizes R².
f. True. PPV = TP / (TP + FP), i.e. true positives divided by all positive predictions.
g. False. Prevalence is the proportion of time the positive outcome occurs in the data.
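Parts f and g contrast three measures that are easy to confuse. As an illustration only, the sketch below reuses the "With Machine Learning" proportions from Question 3; the exam does not ask for these numbers.

```python
# Proportions from the Question 3 confusion matrix (positive = flop).
cells = {"TP": 0.15, "FP": 0.15, "FN": 0.05, "TN": 0.65}

# PPV: how often a positive prediction is correct.
ppv = cells["TP"] / (cells["TP"] + cells["FP"])

# Prevalence: how often the positive outcome occurs in the data,
# regardless of what the model predicts.
prevalence = cells["TP"] + cells["FN"]

# Accuracy: how often the prediction is right overall.
accuracy = cells["TP"] + cells["TN"]

print(f"PPV = {ppv:.2f}, prevalence = {prevalence:.2f}, accuracy = {accuracy:.2f}")
```

Prevalence depends only on the actual outcomes, so it would be the same for any model applied to this data; the measure described in statement g is accuracy.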