Final Exam A, Semester 1 2023

The University of Sydney
School of Mathematics and Statistics

DATA1001/1901 Foundations of Data Science

June 2023
Lecturers: Di Warren

Time Allowed: Reading time: 10 minutes; Writing time: 1.5 hours

Exam Conditions: This is a closed-book examination: no material permitted. Writing is not permitted at all during reading time.

Family Name: ....................    SID: ....................
Other Names: ....................    Seat Number: ....................

Please check that your examination paper is complete (23 pages) and indicate by signing below.

I have checked the examination paper and affirm it is complete.

Signature: ....................    Date: ....................

This examination has two sections: Multiple Choice and Extended Answer.

The Multiple Choice Section is worth 50% of the total examination. There are 20 questions. The questions are of equal value. All questions may be attempted. Answers to the Multiple Choice questions must be entered on the Multiple Choice Answer Sheet before the end of the examination.

The Extended Answer Section is worth 50% of the total examination. There are 3 questions. The questions are of equal value. All questions may be attempted. Working must be shown.

Concept Sheet & Calculators: There is a concept sheet after the last question in this booklet. Calculators may NOT be used.

THE QUESTION PAPER MUST NOT BE REMOVED FROM THE EXAMINATION ROOM.

Marker’s use only
Multiple Choice Section

In each question, choose at most one option. Your answers must be entered on the Multiple Choice Answer Sheet.

1. What is a complexity that is commonly associated with data linkage of human subjects?
   (a) Ensuring the privacy of participants
   (b) Data wrangling
   (c) Getting ethics approval
   (d) All of the other answers

2. Which of the following scenarios would most likely be conducted as a randomised controlled trial?
   (a) An Australian clinical trial for a new drug
   (b) Interviews for all new workers at Woolworths
   (c) Feedback on a new teaching method
   (d) A study of Sydney’s air pollution over 5 years

3. What graphical summary could represent 1 qualitative variable and 1 quantitative variable?
   (a) Q-Q plot
   (b) Scatter plot
   (c) Clustered bar chart
   (d) Comparative boxplot

4. A company decreases all their food prices by 2%. By how much will the mean and standard deviation of food prices change, respectively?
   (a) 2% and 4%
   (b) 2% and 2%
   (c) 0% and 2%
   (d) 2% and 0%
5. Given univariate, quantitative data, which of the following is impossible?
   (a) Mean = -1
   (b) Median = -1
   (c) Standard deviation = -1
   (d) Lower threshold = -1

6. Which R command works out this area under the curve for $X \sim N(1, 2^2)$?
   (a) pnorm(2,1,2)-pnorm(0,1,2)
   (b) pnorm(2,1,2)-pnorm(-2,1,2)
   (c) pnorm(2,1,4)-pnorm(0,1,4)
   (d) pnorm(2)-pnorm(0)

7. Measurement error is defined as follows:
   Individual measurement = exact value + chance error + bias.
   How could we estimate the chance error?
   (a) Remove any outliers and calculate the RMS.
   (b) Find the systematic error (related to the bias).
   (c) Replicate the measurements under the same conditions, and calculate the standard deviation.
   (d) Find the exact value and bias, and subtract them from the individual measurements.
8. Using just the following R output, which statement is correct?

    lm(y~x)

    Call:
    lm(formula = y ~ x)

    Coefficients:
    (Intercept)            x
          1.467        1.315

    cor(x,y)
    [1] 0.7645729

   (a) $\hat{x} = 1.467 + 1.315y$
   (b) The scatter plot could have many shapes.
   (c) The data fits well along a line of positive slope.
   (d) As x increases by 1 unit, y increases by 1.467 units.

9. Two variables X and Y have correlation 0.7. If we swap the data values of X and Y, and then subtract 0.1 from each value of Y, what is the correlation of the new variables?
   (a) 0.5
   (b) 0.6
   (c) 0.7
   (d) 0.8

10. In linear regression, what is the mean of the gaps between the data points and the regression line?
   (a) Always zero
   (b) The residuals
   (c) The RMS Error
   (d) The standard deviation
11. When does the Prosecutor’s Fallacy occur?
   (a) When it is assumed that the chance of evidence given innocence is the same as innocence given evidence.
   (b) When it is assumed that the chance of evidence given guilt is the same as evidence given innocence.
   (c) When it is assumed that the chance of evidence given innocence is the same as evidence given guilt.
   (d) When it is assumed that the chance of guilt given innocence is the same as evidence given guilt.

12. Suppose we toss a biased coin 10 times with P(head) = 0.3 at every toss. The results of each toss are independent of each other. What is the chance we get exactly 3 heads?
   (a) $\binom{3}{10} 0.7^{3} \, 0.3^{7}$
   (b) $\binom{10}{3} 0.3^{3}$
   (c) $\binom{10}{3} 0.3^{10}$
   (d) $\binom{10}{7} 0.3^{3} \, 0.7^{7}$

13. Suppose we randomly draw 100 times from a box with replacement, and sum the results. We then repeat this process many times and plot a simulation histogram of the sums. For which box would we expect to see an approximately normal-shaped histogram?
   (a) Box = 0,1
   (b) Box = 1,2,3
   (c) Box = 0,0,0,0,0,0,0,0,0,1
   (d) All of the boxes

14. In a survey to determine Sydney students’ opinions on student fees, what is NOT a possible source of bias?
   (a) A poorly worded question
   (b) Surveying one statistics lecture
   (c) A wealthy student in the survey group
   (d) Conducting an online survey
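The following is a minimal R sketch, not part of the original paper, of how the binomial formula on the concept sheet lines up with R's built-in `dbinom` for the coin described in question 12. The variable names are illustrative assumptions.

```r
# Illustrative only: binomial probability for the setting in question 12
n <- 10   # number of independent tosses
p <- 0.3  # P(head) at every toss
k <- 3    # number of heads of interest

# Concept-sheet formula: choose(n, k) * p^k * (1 - p)^(n - k)
by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)

# The same quantity from R's binomial probability function
by_dbinom <- dbinom(k, size = n, prob = p)

print(c(by_formula = by_formula, by_dbinom = by_dbinom))
```

Because `choose(10, 3)` equals `choose(10, 7)`, the binomial coefficient can be written with either 3 or 7 in the lower position.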
15. In a market research study, 100 people were given a sample of brand-name chips and home-brand chips (in random order) and asked which they preferred in taste. 70 people preferred the brand-name chips. Let p = P(preference for brand-name chips). To test for no difference in preference between the two types of chips, what is the appropriate null hypothesis?
   (a) $H_0: p = 0$
   (b) $H_0: p = 0.5$
   (c) $H_0: p = 0.7$
   (d) $H_0: p > 0.7$

16. What does a p-value of 0.85 mean?
   (a) The data is consistent with the null hypothesis.
   (b) There is an 85% chance that the null hypothesis is true.
   (c) There is a 15% chance that the alternative hypothesis is true.
   (d) We should accept the null hypothesis with probability 0.15.

17. The data in Milk.csv consists of the milk yield of 100 cows.

    t.test(Milk,mu=11)

            One Sample t-test

    data:  Milk
    t = 4.9291, df = 99, p-value = 3.323e-06
    alternative hypothesis: true mean is not equal to 11
    95 percent confidence interval:
     12.53485 14.60315
    sample estimates:
    mean of x
       13.569

   What would be the conclusion of the hypothesis test $H_0: \mu = 13$ vs $H_1: \mu \neq 13$ when $\alpha = 0.05$?
   (a) We should reject $H_0$.
   (b) The data is consistent with $H_0$.
   (c) The p-value is 0.000003 (6dp).
   (d) Not enough information given.
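As a hedged illustration (not part of the original paper), the R sketch below shows how the numbers printed in the `t.test` output of question 17 relate to one another. It assumes only the values shown in that output; the raw `Milk` data is not available, so the standard error is backed out from the reported t statistic, and all variable names are hypothetical.

```r
# Values read directly from the t.test output in question 17
xbar   <- 13.569   # sample mean
mu0    <- 11       # null value used in the printed test
t_stat <- 4.9291   # reported t statistic
df     <- 99       # degrees of freedom (n - 1, with n = 100)

# Standard error implied by t = (xbar - mu0) / SE
se <- (xbar - mu0) / t_stat

# Rebuild the 95% confidence interval: xbar +/- t-quantile * SE
ci <- xbar + c(-1, 1) * qt(0.975, df) * se
print(round(ci, 3))

# Test statistic and two-sided p-value for a different null value, mu = 13
t_13 <- (xbar - 13) / se
p_13 <- 2 * pt(-abs(t_13), df)
print(c(t = t_13, p_value = p_13))
```

A two-sided test at level 0.05 and the 95% confidence interval agree by construction: a null value is retained exactly when it lies inside the interval.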
18. A gambler is accused of using a loaded die, although she pleads innocence. Her last 60 throws are in the R vector throws1.

    table(throws1)
    throws1
     1  2  3  4  5  6
     7 12  6  9 13 13

   What is the alternative hypothesis for a chi-squared test?
   (a) The gambler is innocent.
   (b) The 6 faces occur with proportions 1/6 : 1/6 : 1/6 : 1/6 : 1/6 : 1/6
   (c) At least 1 of the faces occurs with a proportion different to 1/6.
   (d) The 6 faces occur with proportions 7/60 : 7/60 : 9/60 : 14/60 : 10/60 : 13/60
19. Consider the following output for the mtcars dataset, for the miles per gallon (mpg) and weight (wt) variables.

    summary(lm(mpg ~ wt, data=mtcars))

    Call:
    lm(formula = mpg ~ wt, data = mtcars)

    Residuals:
        Min      1Q  Median      3Q     Max
    -4.5432 -2.3647 -0.1252  1.4096  6.8727

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
    wt           -5.3445     0.5591  -9.559 1.29e-10 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.046 on 30 degrees of freedom
    Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
    F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

   What can be concluded from just this information?
   (a) The linear model for predicting weight from mpg has an intercept around 37.
   (b) The linear correlation coefficient is very small.
   (c) A linear model for predicting mpg from weight is appropriate.
   (d) None of the other answers.
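The sketch below is an illustration added for reference, not part of the original paper. It uses R's built-in `mtcars` data to show how the slope, intercept and R-squared in the output of question 19 connect to the concept-sheet formulas b = r SD_y / SD_x and a = mean(y) - b mean(x). The object names `fit`, `r`, `b` and `a` are assumptions chosen for this example.

```r
# Fit the same model as in question 19, using the built-in mtcars data
fit <- lm(mpg ~ wt, data = mtcars)

# Correlation between weight (x) and miles per gallon (y)
r <- cor(mtcars$wt, mtcars$mpg)

# Slope and intercept from the concept-sheet formulas.
# sd() divides by n - 1, but that factor cancels in the ratio SD_y / SD_x.
b <- r * sd(mtcars$mpg) / sd(mtcars$wt)
a <- mean(mtcars$mpg) - b * mean(mtcars$wt)

print(c(intercept = a, slope = b))
print(coef(fit))

# In simple linear regression, R-squared is the squared correlation
print(c(r = r, r_squared = r^2))
```

The summary output alone gives the size of the correlation through R-squared and its sign through the slope; it does not show the scatter plot or the residual plot.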
20. The following R code produces the following plot.

    library(tidyverse)
    ggplot(iris, aes(x=A, y = Sepal.Width)) +
      geom_point(aes(B = Species))

   What are A and B?
   (a) A = Sepal.Length, B = fill
   (b) A = Sepal.Length, B = shape
   (c) A = Sepal.Width, B = colour
   (d) A = Sepal.Length, B = Species
This page is left blank for your working. Working for the Multiple Choice section will not be marked.

End of Multiple Choice Section

Make sure that your answers are entered on the Multiple Choice Answer Sheet.
Extended Answer Section

There are three questions in this section, each with a number of parts. Write your answers in the space provided below each part. There is extra space at the end of the paper.

1. Australian Super - domain knowledge and data

‘Superannuation’ or ‘super’ is money put aside for when you retire from work.

Source 1: A Sydney Morning Herald (SMH) article (25/2/23) states: “Figures from the Tax Office show ... about two-thirds of Australians or 11.3 million people holding less than $100,000 in superannuation.”, with the following data visualisation.

Source 2: The Australian Super website (30/3/23) has the tagline “If you’re wondering what your super balance should look like, it could help to compare with others your age”, with the following data.

    Age             Men ($)    Women ($)
    less than 20       4486         4671
    20 - 24           15620        14955
    25 - 29           40017        30033
    ...
    55 - 59          330720       205787
    60 - 64          322184       246885
(a) Carefully read the information from the SMH article (Source 1).

   (i) Could there be any issues with ethics or privacy with using the Tax Office data? Explain.

   (ii) What is a possible confounding variable? Explain how it might affect the data.

   (iii) Explain a weakness in the data visualisation, and suggest how you would improve it.

   (iv) You have access to the Tax Office super data from 2019. Could you use this data, with the 2023 data, to form a RCT? Explain your reasoning.
(b) Carefully read the information from the Australian Super website (Source 2).

   (i) Explain one interesting feature of this data, in context.

   (ii) Would a linear model be appropriate? Justify your answer.
2. Airbnb - domain knowledge and data

Established in 2008, Airbnb is a very popular global, online market place for renting accommodation. Each listing features photos and details of the property, reviews from previous guests, and the approximate position on a map. Owners can rent out their whole property or spare rooms to guests.

Inside Airbnb is “a mission driven project that provides data and advocacy about Airbnb’s impact on residential communities”, with data that can be downloaded from the site.
Source: insideairbnb.com/get-the-data

Meera is considering renting out her property which is in Manly, and wants to research the likely rental returns. She downloads the data set listings from the Inside Airbnb site for Sydney December 2022.

    dim(listings)
    [1] 22100    18

    names(listings)
     [1] "id"                             "name"
     [3] "host_id"                        "host_name"
     [5] "neighbourhood_group"            "neighbourhood"
     [7] "latitude"                       "longitude"
     [9] "room_type"                      "price"
    [11] "minimum_nights"                 "number_of_reviews"
    [13] "last_review"                    "reviews_per_month"
    [15] "calculated_host_listings_count" "availability_365"
    [17] "number_of_reviews_ltm"          "license"
(a) Carefully read the information about Airbnb and Meera.

   (i) Meera is pleased to find that the data is “tidy”. What does this mean?

   (ii) How many properties (“listings”) are in the overall data set?

   (iii) Suggest a short data dictionary entry for one of the variables.
(b) Meera focuses on the properties in Manly, and produces a boxplot of the price variable.

    listings1 = listings %>%
      filter(neighbourhood == "Manly")
    dim(listings1)[1]/dim(listings)[1]
    [1] 0.04873303

   (i) Approximately what percentage of the Airbnb properties are in Manly?

   (ii) Using the boxplot, how much could Meera rent out her property for? Justify your answer, and include any assumptions and limitations.

   (iii) Write the R code to produce the boxplot in ggplot.
3. (a) Meera is considering buying another investment property in Sydney, which she will rent out on Airbnb. Suggest one concrete way that linear modelling could help her decision-making. What data would she need?

(b) The following scatterplot shows the different Airbnb listings, with the Sydney Harbour Bridge highlighted as a circle. Why does a map of the Sydney coastline emerge?

    plot(listings$latitude ~ listings$longitude, pch = ".")
    points(y = -33.85222, x = 151.210556, col = "red", pch = 19, cex = 2)
(c) Someone claims that 70% of properties on Airbnb are the entire home or apartment. Test this claim using HATPC, with $\alpha = 0.07$.

    table(listings$room_type)

    Entire home/apt      Hotel room    Private room     Shared room
              15235             100            6478             287

    Type = c(15235, 22100-15235)
    chisq.test(Type,p=c(0.7,0.3))

            Chi-squared test for given probabilities

    data:  Type
    X-squared = 11.899, df = 1, p-value = 0.0005615
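For reference, the following R sketch (an added illustration, not part of the original paper) recomputes the `X-squared` value and p-value in the output above by hand from the two observed counts. The names `observed`, `expected`, `x2` and `p_value` are assumptions made for this example; only the counts and the claimed proportions come from the question.

```r
# Observed counts: entire home/apt vs every other room type
observed <- c(15235, 22100 - 15235)

# Expected counts if the claimed proportions 0.7 and 0.3 hold
claimed  <- c(0.7, 0.3)
expected <- sum(observed) * claimed

# Chi-squared test statistic: sum of (observed - expected)^2 / expected
x2 <- sum((observed - expected)^2 / expected)

# p-value: upper-tail area of a chi-squared curve with 2 - 1 = 1 df
p_value <- pchisq(x2, df = length(observed) - 1, lower.tail = FALSE)

print(c(x_squared = x2, p_value = p_value))

# Cross-check against R's built-in test
builtin <- chisq.test(observed, p = claimed)
print(builtin$statistic)
```

The statistic measures how far the observed counts sit from the counts expected under the claim, scaled by those expected counts.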
(d) Explain what a ‘test-statistic’ is, in terms of the context here. Explain how the ‘p-value’ is calculated here.
Extra Space

This page is left blank, in case you need extra space for your answers to Extended Answer Questions. If so, you must note this at the relevant part of the Extended Answers.
DATA1001/1901 Exam Concept Sheet

This unit is focused on words, not formulae. The following sheet is given for your reference.

Numerical Summaries

SD = RMS of gaps from the mean = $\sqrt{\text{mean of (gaps from the mean)}^2}$
IQR = 75% percentile - 25% percentile = $Q_3 - Q_1$
Identifying outliers: LT = $Q_1 - 1.5 \times \text{IQR}$; UT = $Q_3 + 1.5 \times \text{IQR}$

Models

Normal: $X \sim N(\text{mean}, \text{SD}^2)$; thresholds ($\pm 1/2/3$ SD: 68% / 95% / 99.7%)
Linear: $\hat{y} = a + bx$, where $b = r \frac{\text{SD}_y}{\text{SD}_x}$ and $a = \bar{y} - b\bar{x}$.
Linear strip at $x^*$: $y^* \sim N(\bar{y} + r z_{x^*} \text{SD}_y, \text{RMS Error})$, where RMS Error = $\sqrt{1 - r^2}\, \text{SD}_y$.
Binomial: $X \sim \text{Bin}(n, p)$, then $P(X = x \text{ successes}) = \binom{n}{x} p^x (1 - p)^{n - x}$, for $0 \le x \le n$.
Box Model: Given a population with mean M and standard deviation SD, and a sample taken with replacement of size n, the Sample Sum has EV = $nM$ and SE = $\sqrt{n}\, \text{SD}$, and the Sample Mean has EV = $M$ and SE = $\text{SD}/\sqrt{n}$.

Hypothesis Testing (HATPC)

    Test                          Null Hypothesis                         Assumptions
    1 Sample Proportion           Ho: proportion = constant               independent; constant P(success)
    1 Sample T                    Ho: mean = constant                     independent; population Normal (if small n)
    2 Sample T                    Ho: difference in 2 means = constant    independent, Normal populations
    Chi-squared (model)           Ho: model holds                         Cochran’s Rule
    Chi-squared (independence)    Ho: 2 variables are independent         Cochran’s Rule; independence
    Regression                    Ho: slope = 0                           looks linear; homoscedastic residuals

R Code

    # IDA
    str(iris)
    library(tidyverse)
    ggplot(iris, aes(x=Sepal.Length)) + geom_histogram()

    # Modelling
    pnorm(5,4,3)                  # Given X ~ N(4,9), find the lower tail area from 5 down.
    qnorm(0.4,4,3)                # Given X ~ N(4,9), find the 40th percentile
    pnorm(r*qnorm(x))             # Estimate y percentile from x percentile, in linear model
    sample(c(1:6),3,replace = T)  # 3 rolls of a fair die

End of Extended Answer Section
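As an added illustration (not part of the original concept sheet), the R sketch below works through the Box Model formulas for one assumed box: the six faces of a fair die, drawn 100 times with replacement. The helper `pop_sd` is a hypothetical function defined here because R's built-in `sd()` divides by n - 1, whereas the concept sheet defines SD as the RMS of gaps from the mean.

```r
# Assumed box for this example: the faces of a fair die
box <- 1:6
n   <- 100  # number of draws, with replacement

# SD as defined on the concept sheet: RMS of the gaps from the mean
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

M  <- mean(box)    # mean of the box
SD <- pop_sd(box)  # SD of the box

# Sample Sum: EV = n * M, SE = sqrt(n) * SD
ev_sum <- n * M
se_sum <- sqrt(n) * SD

# Sample Mean: EV = M, SE = SD / sqrt(n)
ev_mean <- M
se_mean <- SD / sqrt(n)

print(c(M = M, SD = SD))
print(c(ev_sum = ev_sum, se_sum = se_sum))
print(c(ev_mean = ev_mean, se_mean = se_mean))

# One simulated sample sum, for comparison with ev_sum
one_sum <- sum(sample(box, n, replace = TRUE))
print(one_sum)
```

The SE of the sum grows with the square root of n while the SE of the mean shrinks with it, so the two are related by a factor of n.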