COMM1190 T1 2024 Workshop W10 Solutions
QUESTION 1 (10 MARKS)

Suppose you are given a dataset with information about a university class of 100 students, where for each student we observe the marks (from 1 to 10) in Mathematics, Information Technology, Physics, Literature, and Arts, and the average number of hours each spends studying at home. You want to investigate the relationship between the mark in Physics and all other variables in the dataset by means of a linear regression model. The following output is obtained.

Part A [Max 100 words] (5 marks)
Write down the regression equation for the relationship between the grade in Physics and all other variables. Provide an interpretation of the regression coefficients, including the intercept.

The regression equation can be written in either of the following two ways (either is fine): for the i-th student we have:
$$\text{Physics}_i = \beta_0 + \beta_{\text{Maths}} X_{\text{Maths},i} + \beta_{\text{Arts}} X_{\text{Arts},i} + \beta_{\text{Literature}} X_{\text{Literature},i} + \beta_{\text{IT}} X_{\text{IT},i} + \beta_{\text{hours}} X_{\text{hours},i} + \epsilon_i$$

or, equivalently,

$$\text{Physics}_i = \beta_0 + \beta_{\text{Maths}} \text{Maths}_i + \beta_{\text{Arts}} \text{Arts}_i + \beta_{\text{Literature}} \text{Literature}_i + \beta_{\text{IT}} \text{IT}_i + \beta_{\text{hours}} \text{Hours}_i + \epsilon_i$$
where $\epsilon_i$ is the error term with mean zero and constant variance. The intercept $\beta_0$ is the average mark in Physics when the marks in all other subjects and the study hours are equal to zero. The regression coefficients $\beta_{\text{subject}}$ measure the average change in the Physics mark for a one-mark increase in the corresponding subject, holding the other variables constant. Similarly, $\beta_{\text{hours}}$ measures the average change in the Physics mark for one additional study hour, holding the other variables constant.

Part B [Max 100 words] (5 marks)
Use the output from the regression to discuss whether the marks in IT and number of hours are significant predictors of the mark in Physics. You can enrich the discussion by describing the decision-making process around the statistical significance of each variable.

From the results of the regression, $\beta_{\text{IT}}$ turns out to be statistically significant in our analysis. This can be seen from its p-value (last column), which is smaller than the typically chosen significance level of 5%. This is also confirmed by the value of the t-statistic (third column), which is larger than 2 in absolute value. Another indicator (the three are interconnected) is the standard error: the range of values $(\beta_{\text{IT}} - 2 \times \text{s.e.},\ \beta_{\text{IT}} + 2 \times \text{s.e.})$ does not include zero. For analogous reasons, the regression coefficient $\beta_{\text{hours}}$ is not statistically significant.
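To make this decision process concrete, here is a minimal Python sketch (not part of the original workshop materials) of how such a regression could be fitted and how the three interconnected criteria could be read off. The data below are simulated for illustration only, so the numbers will not match the workshop output; statsmodels is assumed to be available.

```python
# Sketch: fit the Physics regression on synthetic data and inspect the three
# significance criteria (p-value, t-statistic, confidence interval).
# NOTE: the data are simulated, not the workshop dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 100  # class of 100 students, as in the question

# Hypothetical marks (1-10) and weekly study hours.
df = pd.DataFrame({
    "Maths": rng.integers(1, 11, n),
    "Arts": rng.integers(1, 11, n),
    "Literature": rng.integers(1, 11, n),
    "IT": rng.integers(1, 11, n),
    "hours": rng.uniform(0, 20, n),
})
# Simulate a Physics mark that depends on IT (and Maths) but not on hours.
df["Physics"] = 1.0 + 0.6 * df["IT"] + 0.2 * df["Maths"] + rng.normal(0, 1.5, n)

X = sm.add_constant(df[["Maths", "Arts", "Literature", "IT", "hours"]])
fit = sm.OLS(df["Physics"], X).fit()

print(fit.summary())              # full table: coefficients, s.e., t, p-values
print(fit.pvalues["IT"])          # p-value criterion: < 0.05 => significant
print(fit.tvalues["hours"])       # |t| < 2 suggests hours is not significant
print(fit.conf_int().loc["IT"])   # interval excluding 0 => significant
```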
Comment: it is sufficient to mention any of these three criteria when deciding on the statistical significance of a regression coefficient.

QUESTION 2 (25 MARKS)

In the United States, risk scores generated by automated machine learning are used to determine which patients should be automatically enrolled in high-risk care management programs. Risk scores above the 97th percentile are automatically identified for enrolment in the program, while risk scores above the 55th percentile are referred and considered for enrolment. As true risk is unknown, these models use patient-level data to predict variation in health costs to understand "expected" costs. So, the risk percentile is actually the percentile of total expected costs.

Part A [max 100 words] (5 Marks)
Focus on the model's ability to predict risk scores above the 97th percentile for these 10,000 people.

                             Predicted over 97th %ile   Predicted under 97th %ile
Actually over 97th %ile               180                         120
Actually under 97th %ile                0                        9700

Using the confusion matrix, how would you characterise the performance of this model?

The model is good at predicting people who are not over the 97th percentile. Since most people are under the 97th percentile, it has a high accuracy of 98.8%. However, it predicts poorly for people who are actually over the 97th percentile: conditional on being over the 97th percentile, only 60% are identified (both figures are verified in the sketch below).
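As a quick check of the figures quoted above, a small Python sketch (not in the original solutions) doing the arithmetic directly from the counts in the confusion matrix:

```python
# Verify the performance figures from the confusion matrix above.
tp = 180   # actually over 97th percentile, predicted over
fn = 120   # actually over, predicted under
fp = 0     # actually under, predicted over
tn = 9700  # actually under, predicted under

total = tp + fn + fp + tn                 # 10,000 people
accuracy = (tp + tn) / total              # (180 + 9700) / 10000 = 0.988
sensitivity = tp / (tp + fn)              # 180 / 300 = 0.60
print(f"accuracy = {accuracy:.1%}, sensitivity = {sensitivity:.0%}")
```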
It is arguably better to err on the side of caution, i.e., it is worse to miss people who are actually over the 97th percentile (predicting them as under) than to wrongly flag people who are under it. Thus, this algorithm does not do a good job.

Part B [max 100 words] (5 Marks)
Unfortunately, these algorithms can have racial bias, creating concern about automated machine learning methods. Identify key issues with the machine learning approach that could lead to issues with fairness regarding race.

The underlying data driving the algorithm have inherent biases that skew outcomes unfairly. Since the models use costs to proxy illness severity, and given that White Americans typically have better access to and spend more on healthcare than Black Americans, the cost-based outcomes do not accurately reflect the true level of illness across different races. For the same illness severity, White individuals often have higher healthcare costs, meaning that these algorithms may perpetuate racial discrimination by underestimating the healthcare needs of racially marginalised groups. Thus, the algorithms are discriminating between races in determining eligibility for high-risk care management programs.

Part C [max 100 words] (5 Marks)
Given this case study, explain whether the solution to this problem is a better algorithm (given the objective of the algorithm to predict healthcare costs). Should the solution change the data or the algorithm's objective to reduce biased outcomes?
The algorithm was accurate in predicting costs, so improving the algorithm, given the objective of predicting costs, would not reduce bias. Expanding the data to include other measures that better capture social determinants of health would not be effective, since the true costs will still be biased. While a more balanced representation of all groups could help, this may lead to some groups not being factored into the model at all. Thus, it is better to change the algorithm's objective, either to capture the appropriate cost (e.g., the cost that would be spent given sufficient access and resources) or to adjust the objective towards predicting illness rather than cost.

Part D [max 200 words] (10 Marks)
While fairness is an issue with this algorithm, what other issues could emerge that would make deploying it irresponsible?

• Reliability and safety: these AI systems are not reliably operating in accordance with their intended purpose.
• Transparency: these algorithms are proprietary, which means that they are not transparent.
• Contestability: as the algorithms are operated by other parties, this creates complications regarding who is ultimately responsible for contesting outcomes.
• Accountability: as the algorithms are operated by other parties, this also creates complications regarding who is accountable for the outcomes of the AI systems.
• Privacy protection: the algorithms require collecting and aggregating very sensitive data; any security breach could reveal sensitive data for millions of people.
QUESTION 3 (30 MARKS)

Cookie Cats is a popular puzzle game app. As players progress through the levels of the game, they occasionally encounter gates that force them to wait a non-trivial amount of time or make an in-app purchase to progress. In addition to driving in-app purchases, these gates serve the important purpose of giving players an enforced break from playing the game, hopefully resulting in the player's enjoyment of the game being increased and prolonged. The question to be explored is whether the placement of the entry-level gate matters. The hypothesis being investigated is whether moving the entry-level gate from level 30 to level 40 will increase player activity (and hopefully in-game purchases). Data were collected from an A/B test in which a total of 90,198 players were randomly assigned to a treatment group (entered at gate level 40, g_40 = 1) or a control group (entered at gate level 30, g_40 = 0). The output variables of interest included in the data are:

ret_1 = 1 if the player plays within one day of installing the app, and = 0 otherwise;
games = total number of games played in the first 14 days of installing the app.

Note: for this question, the 97.5th percentile of a standard normal distribution is 1.96.

Part A. [max 100 words] (5 Marks)
Before formally investigating the hypothesis being tested within the experiment, good practice dictates that an analyst determines the key features of the data. See below for selected summary statistics and visual representations. What are your main conclusions about the key variables based on this initial analysis?
• There is approximately an even split between the two treatment groups.
• On average players play over 50 games, but the distribution is heavily skewed with a long right-hand tail, indicating some players play the game a lot (the maximum of nearly 50,000 is over 250 standard deviations above the mean!).
• This skewness is clear from the histogram, which trims the top 5% of the distribution, and from the standard deviation being much larger than the mean.
• At the other extreme, the histogram has a large initial spike, indicating many players do not play much, or even at all (the minimum is zero), in the first 14 days.
• Another measure of this inactivity amongst many players is reflected in the mean of ret_1, which indicates only 44.5% play the game within a day of downloading the app, i.e., the majority did not play. (A sketch reproducing this kind of initial check appears after Table 1 below.)
Table 1: Summary statistics

Variable   Mean    Std. Deviation   Minimum   Maximum
g_40       0.504   0.500            0         1
games      51.9    195.1            0         49854
ret_1      0.445   0.497            0         1
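As a sketch of how this initial analysis could be reproduced in Python: the raw Cookie Cats file is not included here, so the data below are simulated to roughly match the described features (column names follow the question; the simulation parameters are hypothetical, so the output will only approximate Table 1).

```python
# Sketch of the initial data check: summary statistics like Table 1 and a
# skewness check for `games`. Data are simulated, NOT the real A/B test file.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 90_198  # players in the A/B test

df = pd.DataFrame({
    "g_40": rng.integers(0, 2, n),                 # ~even treatment split
    "ret_1": (rng.random(n) < 0.445).astype(int),  # ~44.5% one-day retention
    # Heavily right-skewed play counts: many near zero, a long right tail.
    "games": np.round(rng.lognormal(mean=2.5, sigma=1.8, size=n)).astype(int),
})

print(df[["g_40", "games", "ret_1"]].describe().T)  # mean, std, min, max, ...
print("skewness of games:", df["games"].skew())     # large positive => long tail
print("share playing zero games:", (df["games"] == 0).mean())
```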