NCC 5010 - 2023 Practice Final 1 Solution
School: Cornell University
Course: 5010
Subject: Statistics
Date: Jan 9, 2024
NCC 5010: Data Analytics and Modeling
Practice Final 1
Read Carefully:
1. Write your student ID and net ID below. Do not put your name on this exam.
2. You may have two 8 ½ by 11 sheets of paper with notes on both sides. Other than that, the exam is closed book and closed notes. Laptops and communication devices are not allowed. You are not allowed to share any materials or equipment.
3. You have 3.0 hours to complete this exam. The exam has 6 problems. Points for each problem are indicated. Some problems will take longer than others, so plan your time accordingly.
4. Write your solutions in the space provided in this document using the front and back of sheets as necessary.
5. Show all of your calculations clearly. Answers like "true" or just a single number are not satisfactory and will not be given partial credit. If we cannot locate where your solution is written, you will not receive credit. State your assumptions if you must make any assumptions not given in a question.
6. Taking this exam indicates that you understand and will abide by the Cornell University Code of Academic Integrity.
7. Some common z-statistics:
   P(z > 2.576) = 0.005
   P(z > 2.326) = 0.01
   P(z > 1.960) = 0.025
   P(z > 1.645) = 0.05
________________________________________________________________________
Student ID# ______________________________
Net ID# ______________________________
Do not write below this line.
________________________________________________________________________
Q1:    /25
Q2:    /16
Q3:    /18
Q4:    /15
Q5:    /13
Q6:    /14
Total:
Question 1: (25 points)
You are an executive at a large book publisher, Bean Publishing. Bean Publishing uses a regression
model to examine several factors that influence how much a customer spends on leisure books every
year. The regression output is below. The dependent variable is book sales per customer (in $). The
independent variables are:
- Income (in $): the customer's monthly income
- College (0 or 1): a dummy variable that is set to "1" if the customer has a college education
- DigitalReader (0 or 1): a dummy variable that is set to "1" if the customer owns an e-reader
- IncomeXDigitalReader (in $): Income multiplied by DigitalReader
- Age (in years): the customer's age
- AgeSQ (in years squared): Age squared
- Children: a categorical variable with one of three possible options (there is no missing data):
  1. AdultChildren (0 or 1): a dummy variable that is set to "1" if the customer has adult children
  2. NoChildren (0 or 1): a dummy variable that is set to "1" if the customer has no children
  3. Children (0 or 1): a dummy variable that is set to "1" if the customer has non-adult children
Use this information to answer the following questions.
a. Why is the value for R-squared so close to the value for adjusted R-squared?
Regression Statistics
Multiple R           0.409
R Square             0.167
Adjusted R Square    0.167
Standard Error       33.788
Observations         20000

ANOVA
              df        SS            MS         F      Significance F
Regression    8         4,591,233     573,904    503    0
Residual      19,991    22,822,847    1,142
Total         19,999    27,414,080

                        Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept               85.536         0.716            119.518    0.000     84.133      86.939
Income                  0.003          0.001            2.934      0.003     0.001       0.004
College                 2.129          0.782            2.724      0.006     0.597       3.661
DigitalReader           -7.280         2.205            -3.302     0.001     -11.601     -2.959
IncomeXDigitalReader    0.012          0.005            2.215      0.027     0.001       0.022
Age                     0.867          0.092            9.433      0.000     0.687       1.047
AgeSQ                   -0.016         0.003            -5.662     0.000     -0.022      -0.011
AdultChildren           0.113          0.828            0.136      0.892     -1.511      1.736
NoChildren              -31.786        0.556            -57.155    0.000     -32.876     -30.696
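Solution to part a (3 points): The number of independent variables (k = 8) is small relative to the sample size (n = 20,000), so the penalty factor (n - 1)/(n - k - 1) for adjusted R² is very close to 1 and the adjustment is negligible.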
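The size of that penalty can be checked numerically. The sketch below assumes the standard textbook definition, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), and takes the sums of squares from the ANOVA table; the function and variable names are illustrative only.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    penalty = (n - 1) / (n - k - 1)
    return 1.0 - (1.0 - r2) * penalty


# Values read from the regression output (ANOVA table).
ss_regression = 4_591_233
ss_total = 27_414_080
n_obs, n_predictors = 20_000, 8

r2 = ss_regression / ss_total
adj_r2 = adjusted_r_squared(r2, n_obs, n_predictors)
penalty = (n_obs - 1) / (n_obs - n_predictors - 1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}, penalty = {penalty:.5f}")
```

With 19,999 in the numerator and 19,991 in the denominator, the penalty is nearly 1, which is why both statistics round to 0.167.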
b. Estimate the average amount spent on leisure books for a 35 year old customer with an annual income of $60,000, no college education, no children, and who owns an e-reader. (3 points)
c. Provide an economic interpretation of the impact of NoChildren on the amount spent on leisure books. (3 points)
d. Provide an economic interpretation of the coefficient on the Intercept. (4 points)
e. Provide an economic interpretation of the impact of College on the amount spent on leisure books. (3 points)
f. If a customer has an e-reader, what is the impact of a $1,000 increase in the customer's monthly income on the amount spent on leisure books? (4 points)
g. If a customer is 50 years old, what is the impact of a 1 year increase in the customer's age on the amount spent on leisure books? (5 points)
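Solution to part b: Income is measured per month, so an annual income of $60,000 enters the model as 60,000/12 = $5,000.

\[
\begin{aligned}
\hat{y} &= 85.536 + 0.003(\text{Income}) + 2.129(\text{College}) - 7.280(\text{DigitalReader}) + 0.012(\text{Income} \times \text{DigitalReader}) \\
&\quad + 0.867(\text{Age}) - 0.016(\text{AgeSQ}) + 0.113(\text{AdultChildren}) - 31.786(\text{NoChildren}) \\
&= 85.536 + 0.003(5000) + 2.129(0) - 7.280(1) + 0.012(5000 \times 1) \\
&\quad + 0.867(35) - 0.016(35^2) + 0.113(0) - 31.786(1) \\
&= \$132.22
\end{aligned}
\]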
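The same plug-in calculation can be sketched in code. The coefficients are copied from the regression output; the function `predict_spend` and its argument names are hypothetical helpers introduced only for this illustration.

```python
# Coefficients copied from the regression output.
COEFFICIENTS = {
    "Intercept": 85.536,
    "Income": 0.003,
    "College": 2.129,
    "DigitalReader": -7.280,
    "IncomeXDigitalReader": 0.012,
    "Age": 0.867,
    "AgeSQ": -0.016,
    "AdultChildren": 0.113,
    "NoChildren": -31.786,
}


def predict_spend(monthly_income, college, digital_reader, age,
                  adult_children, no_children):
    """Predicted annual leisure-book spending in dollars."""
    features = {
        "Intercept": 1.0,
        "Income": monthly_income,
        "College": college,
        "DigitalReader": digital_reader,
        "IncomeXDigitalReader": monthly_income * digital_reader,
        "Age": age,
        "AgeSQ": age ** 2,
        "AdultChildren": adult_children,
        "NoChildren": no_children,
    }
    return sum(COEFFICIENTS[name] * value for name, value in features.items())


# Part b: annual income of $60,000 is $5,000 per month.
estimate = predict_spend(
    monthly_income=60_000 / 12,
    college=0,
    digital_reader=1,
    age=35,
    adult_children=0,
    no_children=1,
)
print(f"Predicted spending: ${estimate:.3f}")
```

Note that the interaction and squared terms are derived inside the function, so the caller cannot supply an inconsistent IncomeXDigitalReader or AgeSQ value.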
Solution to part c: Relative to a customer that has non-adult children, a customer with no children spends $31.786 less on books every year.
Solution to part d: A customer that has a zero value for every variable is expected to spend $85.536 on books every year.
Solution to part e: Relative to a customer without a college education, a customer with a college education is expected to spend $2.129 more on books every year.
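Solution to part f: With DigitalReader = 1, a $1,000 increase in monthly income changes both the Income term and the interaction term.

\[
\Delta\hat{y} = 0.003(\Delta\text{Income}) + 0.012(\Delta\text{Income} \times \text{DigitalReader}) = 0.003(1000) + 0.012(1000 \times 1) = \$15
\]

Solution to part g: Every term that does not involve age ("stuff") is the same at both ages and cancels in the difference.

\[
\begin{aligned}
\hat{y}_{51} &= \text{stuff} + 0.867(51) - 0.016(51^2) \\
\hat{y}_{50} &= \text{stuff} + 0.867(50) - 0.016(50^2) \\
\Delta\hat{y} &= \hat{y}_{51} - \hat{y}_{50} = -\$0.75
\end{aligned}
\]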
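Both marginal effects can be verified with a few lines of code. This is a minimal sketch using the coefficients from the regression output; the function names `income_effect` and `age_effect` are assumptions made for the example.

```python
# Coefficients copied from the regression output.
B_INCOME = 0.003
B_INCOME_X_READER = 0.012
B_AGE = 0.867
B_AGE_SQ = -0.016


def income_effect(delta_income, digital_reader):
    """Change in predicted spending for a change in monthly income."""
    return (B_INCOME + B_INCOME_X_READER * digital_reader) * delta_income


def age_effect(age, delta_years=1):
    """Change in predicted spending when age rises by delta_years."""
    new_age = age + delta_years
    return B_AGE * (new_age - age) + B_AGE_SQ * (new_age ** 2 - age ** 2)


effect_f = income_effect(1000, digital_reader=1)   # part f
effect_g = age_effect(50)                          # part g
print(f"Part f: {effect_f:.2f}, Part g: {effect_g:.3f}")
```

Because of the AgeSQ term, the age effect depends on the starting age: `age_effect(20)` is positive while `age_effect(50)` is negative.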
Question 2: (16 points)
Bean is constantly searching for promising unpublished authors. You know that 60% of all unpublished
authors work hard and the rest slack off.
An author's first publication can either receive critical acclaim
or not. Given that an author works hard, his / her first publication has a 25% probability of being
critically acclaimed. There is a 5% probability that an author is a slacker and receives critical acclaim on
his / her first publication. NOTE: There is no partial credit awarded for any portion of this problem.
a) What is the probability that an author's first publication receives critical acclaim given that he / she is a slacker? (4 points)
b) What is the probability that an author is a hard worker and receives critical acclaim on his / her first publication? (4 points)
c) What is the probability that an author is a slacker given that his / her first publication receives critical acclaim? (4 points)
d) What is the probability that an author's first publication does not receive critical acclaim given that he / she is a slacker? (4 points)
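Notation: H: hard worker, S: slacker, C: critical acclaim, N: no acclaim.

Probability tree:
- Hard-worker branch: \( P(H) = 0.60 \), \( P(C \mid H) = 0.25 \), so \( P(H \cap C) = 0.15 \) and \( P(H \cap N) = 0.45 \).
- Slacker branch: \( P(S) = 0.40 \), \( P(S \cap C) = 0.05 \) (given), so \( P(C \mid S) = 0.125 \), \( P(N \mid S) = 0.875 \), and \( P(S \cap N) = 0.35 \).

Solution to part a: \( P(C \mid S) = 0.125 \) (right off the tree)
Solution to part b: \( P(H \cap C) = 0.15 \) (right off the tree)
Solution to part c: \( P(S \mid C) = \dfrac{P(S \cap C)}{P(S \cap C) + P(H \cap C)} = \dfrac{0.05}{0.05 + 0.15} = 0.25 \)
Solution to part d: \( P(N \mid S) = 0.875 \) (right off the tree)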
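The tree values can be reproduced from the three numbers given in the problem statement. The following is a minimal sketch; the variable names are illustrative and only the first three assignments come from the question.

```python
# Given in the problem statement.
p_hard = 0.60                 # P(H)
p_acclaim_given_hard = 0.25   # P(C | H)
p_slacker_and_acclaim = 0.05  # P(S and C)

# Derived quantities.
p_slacker = 1 - p_hard                                        # P(S)
p_acclaim_given_slacker = p_slacker_and_acclaim / p_slacker   # part a
p_hard_and_acclaim = p_hard * p_acclaim_given_hard            # part b
p_acclaim = p_hard_and_acclaim + p_slacker_and_acclaim        # P(C), total probability
p_slacker_given_acclaim = p_slacker_and_acclaim / p_acclaim   # part c, Bayes' rule
p_no_acclaim_given_slacker = 1 - p_acclaim_given_slacker      # part d

for label, value in [
    ("a) P(C|S)", p_acclaim_given_slacker),
    ("b) P(H and C)", p_hard_and_acclaim),
    ("c) P(S|C)", p_slacker_given_acclaim),
    ("d) P(N|S)", p_no_acclaim_given_slacker),
]:
    print(f"{label} = {value:.3f}")
```

The key detail is that the 5% figure is a joint probability, not a conditional one, so part a requires dividing by P(S).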
Question 3: (17 points)
Bean has developed a machine learning model to predict which new book releases selected by the
organization's buying team will actually be flops. Historically, the company has carried all of the buying
team's selections. There are two relevant costs for this analysis. Each book that is carried, but flops, has
a $50,000 inventory write down cost. If a book is not carried and it is not a flop, there is a $25,000
opportunity cost in the form of lost profits that could have been made on the book. The confusion matrix
is provided below based on proportions. A positive indicates a flop. Bean must evaluate 200 new
releases each quarter.
With Machine Learning (only negative predicted values are carried by Bean):
                      Actual Positive    Actual Negative
Predicted Positive    0.15               0.15
Predicted Negative    0.05               0.65

Without Machine Learning (all books are assumed to not be flops and carried by Bean):
                      Actual Positive    Actual Negative
Predicted Positive    0.00               0.00
Predicted Negative    0.20               0.80
a) What is the accuracy with and without the machine learning algorithm? (5 points)
b) What is the cost each quarter with the machine learning algorithm? (5 points)
c) What is the cost each quarter without the machine learning algorithm? (3 points)
d) How much value is the machine learning algorithm expected to create each quarter? (4 points)
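Solution to part a:
With: Accuracy = TP + TN = 0.15 + 0.65 = 0.80
Without: Accuracy = TP + TN = 0.00 + 0.80 = 0.80

Solution to part b:
\[
\text{Cost} = (FN \times 50{,}000 + FP \times 25{,}000) \times 200 = (0.05 \times 50{,}000 + 0.15 \times 25{,}000) \times 200 = \$1{,}250{,}000
\]

Solution to part c:
\[
\text{Cost} = (0.20 \times 50{,}000 + 0 \times 25{,}000) \times 200 = \$2{,}000{,}000
\]

Solution to part d:
\[
\text{Value} = \text{Cost without ML} - \text{Cost with ML} = \$2{,}000{,}000 - \$1{,}250{,}000 = \$750{,}000
\]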
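The accuracy and cost calculations follow the same pattern for both confusion matrices, which a short sketch makes explicit. The dictionary layout and function names here are assumptions for illustration; the proportions and dollar costs come from the problem.

```python
BOOKS_PER_QUARTER = 200
WRITE_DOWN_COST = 50_000    # carried book that flops (false negative)
OPPORTUNITY_COST = 25_000   # rejected book that was not a flop (false positive)

# Proportions from the confusion matrices; "positive" means flop.
with_ml = {"TP": 0.15, "FP": 0.15, "FN": 0.05, "TN": 0.65}
without_ml = {"TP": 0.00, "FP": 0.00, "FN": 0.20, "TN": 0.80}


def accuracy(matrix):
    """Share of predictions that match the actual outcome."""
    return matrix["TP"] + matrix["TN"]


def quarterly_cost(matrix):
    """Expected misclassification cost per quarter in dollars."""
    per_book = matrix["FN"] * WRITE_DOWN_COST + matrix["FP"] * OPPORTUNITY_COST
    return per_book * BOOKS_PER_QUARTER


cost_with = quarterly_cost(with_ml)
cost_without = quarterly_cost(without_ml)
value_created = cost_without - cost_with

print(f"Accuracy with ML: {accuracy(with_ml):.2f}, without ML: {accuracy(without_ml):.2f}")
print(f"Cost with ML: ${cost_with:,.0f}, without ML: ${cost_without:,.0f}")
print(f"Value created per quarter: ${value_created:,.0f}")
```

The two approaches have the same accuracy, yet the model still creates value because it trades expensive false negatives for cheaper false positives.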
Question 4: (15 points)
You are evaluating acquiring a small boutique publishing business. You run a simulation model to estimate
the earnings of this business next year. There are 500 simulations in your model. The model yields a
sample mean of $11.7M in earnings with a sample standard deviation of $8.3M.
a. Construct a 99% confidence interval for the mean earnings of this business next year. (3 points)
b. How many simulation trials do you need to run in order to predict the mean earnings with an accuracy of plus or minus $200K, with a confidence level of 99%? Assume the population standard deviation is equal to the sample standard deviation. (3 points)
c. Your simulation model assumed that the number of books sold for a particular title obeys a Normal distribution with a mean of 3,000 books and a standard deviation of 1,000 books. You notice that the mean number of books in the simulation results is 3,092 books. You wonder if 3,092 is sufficiently different from 3,000 to suggest that there might be an error in your model. What is the likelihood that the observed sample mean is greater than the true mean by at least 92 books? (3 points)
d. You run another simulation using 800 trials and calculate a 95% confidence interval for the mean monthly earnings as [$11M, $12M]. Explain whether the following remarks are TRUE or FALSE.
   i. If you run a second simulation with 800 trials, you would also obtain a 95% confidence interval with a range of $1M. (3 points)
   ii. Given the results of your simulation, a 99% confidence interval for the mean monthly earnings would have a range larger than $1M. (3 points)
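Solution to part a: The interval is \( \bar{x} \pm t_{.005}\, s/\sqrt{n} \), with \( t_{.005} \approx z_{.005} = 2.576 \) at 499 dof.

\[
11.7 \pm 2.576 \cdot \frac{8.3}{\sqrt{500}} \;\Rightarrow\; [\$10.74\text{M},\ \$12.66\text{M}]
\]

Solution to part b:
Step 1: assume t = z.

\[
n = \left( \frac{z\,\sigma}{E} \right)^2 = \left( \frac{2.576 \times 8.3}{0.2} \right)^2 = 11{,}428.47, \text{ round up to } 11{,}429
\]

Step 2: unnecessary since dof > 100.

Solution to part c:

\[
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{3092 - 3000}{1000 / \sqrt{500}} = 2.06
\]

\[
P(Z > 2.06) = 1 - P(Z \le 2.06) = 1 - 0.9803 = 0.0197
\]

Solution to part d.i: False. Due to random draws in simulation, your results will differ from one run to the next.

Solution to part d.ii: True. You can be more confident in a wider range. To see this, note that \( t_{.005} \approx 2.576 > t_{.025} \approx 1.960 \).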
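Parts a through c can be checked with the Python standard library. This sketch assumes the normal approximation used in the solution (t is replaced by z because the degrees of freedom are large) and uses the table value 2.576 from the cover page, so the numbers match a hand calculation; variable names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z_99 = 2.576  # table value: P(z > 2.576) = 0.005
n_sims, mean_m, sd_m = 500, 11.7, 8.3  # earnings in $M

# Part a: 99% confidence interval for mean earnings.
half_width = Z_99 * sd_m / sqrt(n_sims)
ci_99 = (mean_m - half_width, mean_m + half_width)

# Part b: trials needed for a margin of error of $0.2M at 99% confidence.
margin_m = 0.2
trials_needed = ceil((Z_99 * sd_m / margin_m) ** 2)

# Part c: chance the sample mean exceeds the true mean by at least 92 books.
# z is rounded to two decimals to mimic a z-table lookup.
z_stat = round((3092 - 3000) / (1000 / sqrt(n_sims)), 2)
p_upper = 1 - NormalDist().cdf(z_stat)

print(f"99% CI: [{ci_99[0]:.2f}, {ci_99[1]:.2f}] ($M)")
print(f"Trials needed: {trials_needed}")
print(f"z = {z_stat}, P(Z > z) = {p_upper:.4f}")
```

Using the exact quantile `NormalDist().inv_cdf(0.995)` instead of the rounded table value would shift the required number of trials slightly, which is one reason to state which value was used.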
Question 5: (13 points)
You are evaluating whether to adopt a digital submission and screening system to evaluate unsolicited
manuscripts. You will purchase the system if you believe that it will generate average savings of more
than $500 per manuscript. You pilot the system and use it on a random sample of 64 manuscripts. The
average savings is $476 and the standard deviation is $80.
a. Clearly state the Null and Alternative Hypotheses. (4 points)
b. Compute the appropriate test statistic. (6 points)
c. Compute the p-value. (2 points)
d. Use your results to justify what action you should take, assuming alpha = 0.05. (1 point)
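Solution to part a: \( H_0: \mu \le 500 \), \( H_A: \mu > 500 \)

Solution to part b:

\[
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{476 - 500}{80 / \sqrt{64}} = -2.4, \qquad \text{dof} = 63
\]

Solution to part c: The p-value is \( P(t > -2.4) = 1 - P(t > 2.4) \). From the t table, \( 0.005 < P(t > 2.4) < 0.01 \), so \( 0.99 < P(t > -2.4) < 0.995 \).

Solution to part d: Since \( p > \alpha = 0.05 \), do not purchase the system.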
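The table bracket for the p-value can be confirmed numerically without a statistics package. The sketch below integrates the Student t density with Simpson's rule; `t_pdf` and `t_cdf` are helper functions written for this example, not library calls, and a package such as SciPy would normally be used instead.

```python
from math import gamma, pi, sqrt


def t_pdf(x, dof):
    """Density of the Student t distribution with dof degrees of freedom."""
    coef = gamma((dof + 1) / 2) / (sqrt(dof * pi) * gamma(dof / 2))
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)


def t_cdf(x, dof, steps=2000):
    """P(T <= x) via composite Simpson's rule on [0, |x|], using symmetry."""
    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


# Pilot data from the problem statement.
n, sample_mean, sample_sd, mu_0, alpha = 64, 476, 80, 500, 0.05

t_stat = (sample_mean - mu_0) / (sample_sd / sqrt(n))
dof = n - 1
p_value = 1 - t_cdf(t_stat, dof)  # upper-tail test: P(T > t_stat)
purchase = p_value < alpha

print(f"t = {t_stat:.2f}, dof = {dof}, p-value = {p_value:.4f}, purchase = {purchase}")
```

Because the sample mean falls below $500, the upper-tail p-value is close to 1 and the null hypothesis cannot be rejected.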
Question 6: (14 points, 2 points each)
Answer "True" or "False" and explain why. To receive credit you must be correct both with describing the statement as "true" or "false" and with your explanation why. Merely answering "true" or "false" correctly will receive 0 points.
a. Two events with non-zero probabilities can be independent and mutually exclusive at the same time.
b. If the p-value in a hypothesis test is close to 1.0 then the Null Hypothesis must be true.
c. From the required reading for Day 20, a distinguishing characteristic of machine learning models is that they are free from biases that can taint human decisions.
d. If we are testing the difference between two sample means, we can use the z-distribution to complete the hypothesis test based on sample sizes of n₁ = 40 and n₂ = 25.
e. For any given data set and choice of dependent and independent variables, performing least-squares regression maximizes R².
f. The positive predicted value tells you the probability that your machine learning model is correct given that it makes a positive prediction.
g. The prevalence measure for a machine learning model captures how often the prediction is right.
Solution to part a: False. If two events are mutually exclusive, \( P(A \mid B) = 0 \). If two events are independent, \( P(A \mid B) = P(A) \). Since these are non-zero probability events, \( P(A) \ne 0 \), so both conditions cannot hold at once.

Solution to part b: False. The p-value is the probability of getting something as extreme as the sample result assuming the null is true. This implies nothing about whether the null is actually true.

Solution to part c: False. Algorithms can perpetuate biases that are established in the underlying data.

Solution to part d: False. Since dof = 40 + 25 - 2 = 63 < 100, use the t distribution.

Solution to part e: True. Since \( R^2 = 1 - \dfrac{SSE}{SST} \) and least squares regression minimizes SSE, it maximizes \( R^2 \).

Solution to part f: True. \( PPV = \dfrac{TP}{TP + FP} \), i.e., true positives divided by all positive predictions.

Solution to part g: False. Prevalence is the proportion of time the positive outcome occurs in the data.
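The difference between positive predictive value, prevalence, and accuracy in parts f and g can be illustrated with the "With Machine Learning" confusion matrix from Question 3. This is only a sketch; the function names are chosen for the example.

```python
def positive_predictive_value(tp, fp):
    """P(actual positive | predicted positive)."""
    return tp / (tp + fp)


def prevalence(tp, fp, fn, tn):
    """Share of cases that are actually positive, regardless of the prediction."""
    return (tp + fn) / (tp + fp + fn + tn)


def accuracy(tp, fp, fn, tn):
    """Share of predictions that are right."""
    return (tp + tn) / (tp + fp + fn + tn)


# Proportions from the Question 3 matrix with machine learning.
tp, fp, fn, tn = 0.15, 0.15, 0.05, 0.65

ppv = positive_predictive_value(tp, fp)
prev = prevalence(tp, fp, fn, tn)
acc = accuracy(tp, fp, fn, tn)

print(f"PPV = {ppv:.2f}, prevalence = {prev:.2f}, accuracy = {acc:.2f}")
```

Here the three measures take different values, which shows that prevalence describes the data while accuracy and PPV describe the model.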