Final Exam A Semester 1 2023
The University of Sydney
School of Mathematics and Statistics
DATA1001/1901
Foundations of Data Science
June 2023
Lecturers:
Di Warren
Time Allowed:
Reading time — 10 minutes; Writing time — 1.5 hours
Exam Conditions:
This is a closed-book examination — no material permitted. Writing
is not permitted at all during reading time.
Family Name:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
SID:
. . . . . . . . . . . . . . . . . . . . . . . . . . .
Other Names:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Seat Number:
. . . . . . . . . . . . . . . . .
Please check that your examination paper is complete (23 pages) and indicate by signing below.
I have checked the examination paper and affirm it is complete.
Signature:
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Date:
. . . . . . . . . . . . . . . . . . . . . . . . .
This examination has two sections: Multiple Choice and Extended Answer.
The Multiple Choice Section is worth 50% of the total examination.
There are 20 questions. The questions are of equal value.
All questions may be attempted.
Answers to the Multiple Choice questions must be entered on
the Multiple Choice Answer Sheet before the end of the examination.
The Extended Answer Section is worth 50% of the total examination.
There are 3 questions. The questions are of equal value.
All questions may be attempted. Working must be shown.
Concept Sheet & Calculators: There is a concept sheet after the last
question in this booklet. Calculators may NOT be used.
THE QUESTION PAPER MUST NOT BE REMOVED FROM THE
EXAMINATION ROOM.
Marker’s use
only
Multiple Choice Section
In each question, choose at most one option.
Your answers must be entered on the Multiple Choice Answer Sheet.
1. What is a complexity that is commonly associated with data linkage of human subjects?
(a) Ensuring the privacy of participants
(b) Data wrangling
(c) Getting ethics approval
(d) All of the other answers
2. Which of the following scenarios would most likely be conducted as a randomised
controlled trial?
(a) An Australian clinical trial for a new drug
(b) Interviews for all new workers at Woolworths
(c) Feedback on a new teaching method
(d) A study of Sydney’s air pollution over 5 years
3. What graphical summary could represent 1 qualitative variable and 1 quantitative
variable?
(a) Q-Q plot
(b) Scatter plot
(c) Clustered bar chart
(d) Comparative boxplot
4. A company decreases all their food prices by 2%. By how much will the mean and
standard deviation of food prices change, respectively?
(a) 2% and 4%
(b) 2% and 2%
(c) 0% and 2%
(d) 2% and 0%
5. Given univariate, quantitative data, which of the following is impossible?
(a) Mean = -1
(b) Median = -1
(c) Standard deviation = -1
(d) Lower threshold = -1
6. Which R command works out this area under the curve for $X \sim N(1, 2^2)$?
(a) pnorm(2,1,2)-pnorm(0,1,2)
(b) pnorm(2,1,2)-pnorm(-2,1,2)
(c) pnorm(2,1,4)-pnorm(0,1,4)
(d) pnorm(2)-pnorm(0)
7. Measurement error is defined as follows: Individual measurement = exact value + chance
error + bias. How could we estimate the chance error?
(a) Remove any outliers and calculate the RMS.
(b) Find the systematic error (related to the bias).
(c) Replicate the measurements under the same conditions, and calculate the standard
deviation.
(d) Find the exact value and bias, and subtract them from the individual measurements.
8. Using just the following R output, which statement is correct?
lm(y~x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
      1.467        1.315
cor(x,y)
[1] 0.7645729
(a) $\hat{x} = 1.467 + 1.315y$
(b) The scatter plot could have many shapes.
(c) The data fits well along a line of positive slope.
(d) As $x$ increases by 1 unit, $y$ increases by 1.467 units.
9. Two variables $X$ and $Y$ have correlation 0.7. If we swap the data values of $X$ and $Y$,
and then subtract 0.1 from each value of $Y$, what is the correlation of the new variables?
(a) 0.5
(b) 0.6
(c) 0.7
(d) 0.8
10. In linear regression, what is the mean of the gaps between the data points and the
regression line?
(a) Always zero
(b) The residuals
(c) The RMS Error
(d) The standard deviation
11. When does the Prosecutor’s Fallacy occur?
(a) When it is assumed that the chance of evidence given innocence is the same as
innocence given evidence.
(b) When it is assumed that the chance of evidence given guilt is the same as evidence
given innocence.
(c) When it is assumed that the chance of evidence given innocence is the same as
evidence given guilt.
(d) When it is assumed that the chance of guilt given innocence is the same as evidence
given guilt.
12. Suppose we toss a biased coin 10 times with P(head)=0.3 at every toss. The results of
each toss are independent of each other. What is the chance we get exactly 3 heads?
(a) $\binom{3}{10} 0.7^{3} \, 0.3^{7}$
(b) $\binom{10}{3} 0.3^{3}$
(c) $\binom{10}{3} 0.3^{10}$
(d) $\binom{10}{7} 0.3^{3} \, 0.7^{7}$
13. Suppose we randomly draw 100 times from a box with replacement, and sum the results.
We then repeat this process many times and plot a simulation histogram of the sums.
For which box would we expect to see an approximately normal-shaped histogram?
(a) Box = 0,1
(b) Box = 1,2,3
(c) Box = 0,0,0,0,0,0,0,0,0,1
(d) All of the boxes
14. In a survey to determine Sydney students’ opinions on student fees, what is NOT a
possible source of bias?
(a) A poorly worded question
(b) Surveying one statistics lecture
(c) A wealthy student in the survey group
(d) Conducting an online survey
15. In a market research study, 100 people were given a sample of brand-name chips and
home-brand chips (in random order) and asked which they preferred in taste. 70 people
preferred the brand-name chips. Let $p$ = P(preference for brand-name chips).
To test for no difference in preference between the two types of chips, what is the
appropriate null hypothesis?
(a) $H_0: p = 0$
(b) $H_0: p = 0.5$
(c) $H_0: p = 0.7$
(d) $H_0: p > 0.7$
16. What does a p-value of 0.85 mean?
(a) The data is consistent with the null hypothesis.
(b) There is an 85% chance that the null hypothesis is true.
(c) There is a 15% chance that the alternative hypothesis is true.
(d) We should accept the null hypothesis with probability 0.15.
17. The data in Milk.csv consists of the milk yield of 100 cows.
t.test(Milk,mu=11)
        One Sample t-test
data:  Milk
t = 4.9291, df = 99, p-value = 3.323e-06
alternative hypothesis: true mean is not equal to 11
95 percent confidence interval:
 12.53485 14.60315
sample estimates:
mean of x
   13.569
What would be the conclusion of the hypothesis $H_0: \mu = 13$ vs $H_1: \mu \neq 13$ when
$\alpha = 0.05$.
(a) We should reject $H_0$.
(b) The data is consistent with $H_0$.
(c) The p-value is 0.000003 (6dp).
(d) Not enough information given.
18. A gambler is accused of using a loaded die, although she pleads innocence. Her last 60
throws are in the R vector throws1.
table(throws1)
throws1
 1  2  3  4  5  6
 7 12  6  9 13 13
What is the alternative hypothesis for a chi-squared test?
(a) The gambler is innocent.
(b) The 6 faces occur with proportions $\frac{1}{6} : \frac{1}{6} : \frac{1}{6} : \frac{1}{6} : \frac{1}{6} : \frac{1}{6}$
(c) At least 1 of the faces occurs with a proportion different to $\frac{1}{6}$.
(d) The 6 faces occur with proportions $\frac{7}{60} : \frac{7}{60} : \frac{9}{60} : \frac{14}{60} : \frac{10}{60} : \frac{13}{60}$
19. Consider the following output for the mtcars dataset, for the miles per gallon (mpg)
and weight (wt) variables.
summary(lm(mpg ~ wt, data=mtcars))
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
What can be concluded from just this information?
(a) The linear model for predicting weight from mpg has an intercept around 37.
(b) The linear correlation coefficient is very small.
(c) A linear model for predicting mpg from weight is appropriate.
(d) None of the other answers.
20. The following R code produces the following plot.
library(tidyverse)
ggplot(iris, aes(x=A, y = Sepal.Width)) + geom_point(aes(B = Species))
What are A and B?
(a) A = Sepal.Length, B = fill
(b) A = Sepal.Length, B = shape
(c) A = Sepal.Width, B = colour
(d) A = Sepal.Length, B = Species
This page is left blank for your working.
Working for the Multiple Choice section will not be marked.
End of Multiple Choice Section
Make sure that your answers are entered on the Multiple Choice Answer Sheet
Extended Answer Section
There are three questions in this section, each with a number of parts.
Write your answers in the space provided below each part. There is extra space at the end
of the paper.
1. Australian Super - domain knowledge and data
‘Superannuation’ or ‘super’ is money put aside for when you retire from work.
Source 1: A Sydney Morning Herald (SMH) article (25/2/23) states: “Figures from
the Tax Office show ... about two-thirds of Australians or 11.3 million people holding
less than $100,000 in superannuation.”, with the following data visualisation.
Source 2: The Australian Super website (30/3/23) has the tagline “If you’re wondering
what your super balance should look like, it could help to compare with others your
age”, with the following data.
Age             Men ($)    Women ($)
less than 20       4486         4671
20 - 24           15620        14955
25 - 29           40017        30033
...
55 - 59          330720       205787
60 - 64          322184       246885
(a) Carefully read the information from the SMH article (Source 1).
(i) Could there be any issues with ethics or privacy with using the Tax Office
data? Explain.
(ii) What is a possible confounding variable? Explain how it might affect the
data.
(iii) Explain a weakness in the data visualisation, and suggest how you would
improve it.
(iv) You have access to the Tax Office super data from 2019. Could you use this
data, with the 2023 data, to form an RCT? Explain your reasoning.
(b) Carefully read the information from the Australian Super website (Source 2).
(i) Explain one interesting feature of this data, in context.
(ii) Would a linear model be appropriate? Justify your answer.
2. Airbnb - domain knowledge and data
Established in 2008, Airbnb is a very popular global, online market place for renting
accommodation. Each listing features photos and details of the property, reviews from
previous guests, and the approximate position on a map. Owners can rent out their
whole property or spare rooms to guests.
Inside Airbnb is “a mission driven project that provides data and advocacy about
Airbnb’s impact on residential communities”, with data that can be downloaded from
the site.
Source: insideairbnb.com/get-the-data
Meera is considering renting out her property which is in Manly, and wants to research
the likely rental returns. She downloads the data set listings from the Inside Airbnb
site for Sydney December 2022.
dim(listings)
[1] 22100    18
names(listings)
 [1] "id"                             "name"
 [3] "host_id"                        "host_name"
 [5] "neighbourhood_group"            "neighbourhood"
 [7] "latitude"                       "longitude"
 [9] "room_type"                      "price"
[11] "minimum_nights"                 "number_of_reviews"
[13] "last_review"                    "reviews_per_month"
[15] "calculated_host_listings_count" "availability_365"
[17] "number_of_reviews_ltm"          "license"
(a) Carefully read the information about Airbnb and Meera.
(i) Meera is pleased to find that the data is “tidy”. What does this mean?
(ii) How many properties (“listings”) are in the overall data set?
(iii) Suggest a short data dictionary entry for one of the variables.
(b) Meera focuses on the properties in Manly, and produces a boxplot of the price
variable.
listings1 = listings %>% filter(neighbourhood == "Manly")
dim(listings1)[1]/dim(listings)[1]
[1] 0.04873303
(i) Approximately what percentage of the Airbnb properties are in Manly?
(ii) Using the boxplot, how much could Meera rent out her property for? Justify
your answer, and include any assumptions and limitations.
(iii) Write the R code to produce the boxplot in ggplot.
3. (a) Meera is considering buying another investment property in Sydney, which she will
rent out on Airbnb. Suggest one concrete way that linear modelling could help her
decision-making. What data would she need?
(b) The following scatterplot shows the different Airbnb listings, with the Sydney
Harbour Bridge highlighted as a circle. Why does a map of the Sydney coastline emerge?
plot(listings$latitude ~ listings$longitude, pch = ".")
points(y = -33.85222, x = 151.210556, col = "red", pch = 19, cex = 2)
(c) Someone claims that 70% of properties on Airbnb are the entire home or apartment.
Test this claim using HATPC, with $\alpha = 0.07$.
table(listings$room_type)
Entire home/apt      Hotel room    Private room     Shared room
          15235             100            6478             287
Type = c(15235, 22100-15235)
chisq.test(Type,p=c(0.7,0.3))
        Chi-squared test for given probabilities
data:  Type
X-squared = 11.899, df = 1, p-value = 0.0005615
(d) Explain what a ‘test-statistic’ is, in terms of the context here. Explain how the
‘p-value’ is calculated here.
Extra Space
This page is left blank, in case you need extra space for your answers to Extended Answer
Questions. If so, you must note this at the relevant part of the Extended Answers.
DATA1001/1901 Exam Concept Sheet
This unit is focused on words, not formulae. The following sheet is given for your reference.
Numerical Summaries
SD = RMS of gaps from the mean = $\sqrt{\text{mean of (gaps from the mean)}^2}$
IQR = 75% percentile - 25% percentile = $Q_3 - Q_1$
Identifying outliers: $LT = Q_1 - 1.5 \times IQR$; $UT = Q_3 + 1.5 \times IQR$
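As a rough illustration of the numerical summaries above, the following R sketch computes the SD as the RMS of gaps from the mean, along with the IQR and the outlier thresholds. The vector x is hypothetical data, not taken from this exam. Note that base R's sd() divides by n - 1, so it differs slightly from the RMS version, and quantile() is used here with its default method.

```r
# Hypothetical data, for illustration only
x <- c(2, 4, 4, 5, 7, 9, 25)

# SD as the RMS of gaps from the mean (divides by n, unlike sd())
gaps <- x - mean(x)
pop_sd <- sqrt(mean(gaps^2))

# IQR = Q3 - Q1, using R's default quantile method
q <- quantile(x, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Lower and upper thresholds for identifying outliers
lt <- unname(q[1] - 1.5 * iqr)
ut <- unname(q[2] + 1.5 * iqr)

# Values outside the thresholds are flagged as outliers
outliers <- x[x < lt | x > ut]
```

With this particular vector, only the value 25 falls outside the thresholds.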
Models
Normal: $X \sim N(\text{mean}, SD^2)$; thresholds ($\pm 1/2/3$ SD : 68%/95%/99.7%)
Linear: $\hat{y} = a + bx$, where $b = r \frac{SD_y}{SD_x}$ and $a = \bar{y} - b\bar{x}$.
Linear strip at $x^*$: $y^* \sim N(\bar{y} + r z_{x^*} SD_y, \text{RMS Error})$, where RMS Error = $\sqrt{1 - r^2} \, SD_y$.
Binomial: $X \sim Bin(n, p)$, then $P(X = x \text{ successes}) = \binom{n}{x} p^x (1 - p)^{n - x}$, for $0 \le x \le n$.
Box Model: Given a population with mean M and standard deviation SD, and a sample
taken with replacement of size $n$, the Sample Sum has EV = $n$M and SE = $\sqrt{n}$SD, and the
Sample Mean has EV = M and SE = SD/$\sqrt{n}$.
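The Binomial and Box Model formulas above can be checked numerically. The R sketch below is only an illustration: the values of n, p and x, the box contents and the number of draws are all hypothetical, chosen so the arithmetic is easy to follow.

```r
# Binomial: P(X = x) for X ~ Bin(n, p), by the formula and via dbinom()
n <- 10; p <- 0.5; x <- 4                  # hypothetical values
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)
by_dbinom  <- dbinom(x, size = n, prob = p)

# Box model: hypothetical box of tickets, 25 draws with replacement
box   <- c(0, 1, 1, 4)
draws <- 25
M  <- mean(box)                            # population mean
SD <- sqrt(mean((box - M)^2))              # population SD (RMS of gaps)

# Sample Sum: EV = n * M, SE = sqrt(n) * SD
ev_sum <- draws * M
se_sum <- sqrt(draws) * SD

# Sample Mean: EV = M, SE = SD / sqrt(n)
ev_mean <- M
se_mean <- SD / sqrt(draws)
```

Here the formula and dbinom() agree, and the box has M = 1.5 and SD = 1.5, so the Sample Sum has EV 37.5 and SE 7.5 while the Sample Mean has EV 1.5 and SE 0.3.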
Hypothesis Testing (HATPC)
Test                         Null Hypothesis                        Assumptions
1 Sample Proportion          Ho: proportion = constant              independent; constant P(success)
1 Sample T                   Ho: mean = constant                    independent; population Normal (if small n)
2 Sample T                   Ho: difference in 2 means = constant   independent, Normal populations
Chi-squared (model)          Ho: model holds                        Cochran’s Rule
Chi-squared (independence)   Ho: 2 variables are independent        Cochran’s Rule; independence
Regression                   Ho: slope = 0                          looks linear; homoscedastic residuals
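To make the "Chi-squared (model)" row more concrete, here is a minimal R sketch of how the test statistic and p-value could be computed by hand and cross-checked against chisq.test(). The observed counts and model proportions are hypothetical, and Cochran's Rule is taken in its simplified form (all expected counts at least 5), which is an assumption rather than something stated on this sheet.

```r
# Hypothetical observed counts for 3 categories (illustration only)
observed <- c(50, 30, 20)
model_p  <- c(0.5, 0.25, 0.25)      # hypothesised proportions under Ho

# Expected counts if the model holds
expected <- sum(observed) * model_p

# Simplified Cochran's Rule (assumption): all expected counts >= 5
cochran_ok <- all(expected >= 5)

# Test statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected)^2 / expected)

# p-value: upper tail area of the chi-squared curve with (categories - 1) df
df <- length(observed) - 1
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check against the built-in test
check <- chisq.test(observed, p = model_p)
```

For these counts the expected values are 50, 25 and 25, giving a test statistic of 2 on 2 degrees of freedom.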
R Code
# IDA
str(iris)
library(tidyverse)
ggplot(iris, aes(x=Sepal.Length)) + geom_histogram()
# Modelling
pnorm(5,4,3) # Given X ~ N(4,9), find the lower tail area from 5 down.
qnorm(0.4,4,3) # Given X ~ N(4,9), find the 40th percentile
pnorm(r*qnorm(x)) # Estimate y percentile from x percentile, in linear model
sample(c(1:6),3,replace = T) # 3 rolls of a fair die
End of Extended Answer Section