Midterm exam Fall 2020 solutions

School

University of Nebraska Medical Center *

Course

211

Subject

Economics

Date

Jun 4, 2024

Pages

6

MIDTERM EXAM: SOLUTIONS
Econ 409
Due 8:00am CDT (Lincoln Time), October 24, 2020

NOTE ABOUT THESE SOLUTIONS: More than one version of the exam was distributed. In addition, the datasets were not identical. Therefore, there is no single set of correct answers. However, all versions contained similar questions, so these answers should be informative about your performance.

READ THESE DIRECTIONS CAREFULLY: This is an open-book, take-home exam. You may consult your textbook, the slides, the RStudio notes, etc. You may not work with your classmates or anyone else. The exam is to be completed on your own. As outlined in the syllabus, evidence of cheating or other academic dishonesty will result in a grade of 0 at a minimum.

You will submit your answers via Canvas. Your submission must include: 1.) A typed sheet of your answers; 2.) Your R code. The R code may be copied and pasted into a Word document or other text document. If you did this correctly, I should be able to load the data and then run your code on my own machine to replicate your answers. Producing replicable code (meaning I can run it) is worth 10 out of the 100 total points.

The exam will be posted to Canvas at 8:00am CDT, Thursday, October 22. It is due by 8:00am CDT on Saturday, October 24. Late submissions will automatically lose 25% of available points at 8:01am CDT on October 24, 50% of available points at 9:01am, 75% of available points at 10:01am, and all available points at 11:01am or after.

The exam is written to take no more than one class period (75 minutes). However, you have 48 hours to complete it. Therefore, please keep in mind that entirely avoidable situations (for example, Internet outages that occur at 10:00pm on October 23) are not considered valid excuses for late submissions. Good luck!

Part A: Data Analysis

You have been provided a dataset. A description of the variables is below. (This data has been generated using statistical software; it does not represent real measurements.)
Import it into R. Then briefly answer each question below. Be sure to save a script file with any R code you need to answer these questions.

Variables:
ability = score on an IQ test
race = reported race (white or nonwhite)
earnings = annual earnings
birthplace = state of birth

1.) How many observations are there? How many variables? Be sure to specify which is the number of observations and which is the number of variables.

    nrow(data)   # number of observations (rows)
    ncol(data)   # number of variables (columns)

2.) What kind of dataset is this: time series, cross-sectional, or panel? Briefly defend your answer in one or two sentences.
Cross-sectional. A cross-sectional dataset shows multiple individuals (or countries, firms, etc.) at a single point in time.

3.) Produce a five-number summary of each quantitative variable.

    fivenum(data$variable)   # repeat for each quantitative variable, e.g. data$ability, data$earnings

4.) For each quantitative variable, tell me whether it is symmetric, skewed left, or skewed right. Write a sentence or two to defend your answer for each variable.

There is no single right way to answer this question, so points were given for defensible answers. You could use a histogram, calculate the skewness, or use some other sample statistics.

5.) The proper choice of data visualization technique depends in part on whether a variable is categorical or quantitative. Using an appropriate visualization technique, produce a graph showing the distribution of EARNINGS. Write a sentence or two explaining why this is a proper choice of visualization technique.

    hist(data$earnings)

(Other answers are possible, such as a dotplot or a boxplot.)

6.) The proper choice of data visualization technique depends in part on whether a variable is categorical or quantitative. Using an appropriate visualization technique, produce a graph showing the distribution of BIRTHPLACE. Write a sentence or two explaining why this is a proper choice of visualization technique.

    barplot(table(data$birthplace))   # barplot() needs counts, so tabulate the categories first

(Other answers are possible, such as a pie chart.)

7.) Assume your dataset is representative of a broader population. What is the 95% confidence interval for the population mean of ABILITY?

Some simple code to evaluate this:

    Lower.bound <- mean(data$ability) - 1.96*sd(data$ability)/sqrt(length(data$ability))
    Upper.bound <- mean(data$ability) + 1.96*sd(data$ability)/sqrt(length(data$ability))

8.) What is the correlation coefficient for EARNINGS and ABILITY? Then, in words, provide an intuitive explanation of what this value of the correlation coefficient tells us about the relationship between these two variables.
    cor(data$earnings, data$ability)

The correlation coefficient gives us the strength and direction of a linear relationship between two quantitative variables.

9.) Estimate the simple regression with EARNINGS as the dependent variable and ABILITY as the independent variable. Report the slope coefficient (b1) from this regression. Then, in words, provide an intuitive explanation of what this value of the slope coefficient tells us about the relationship between these two variables.

    lm(earnings ~ ability, data = data)

The slope coefficient tells us how much predicted earnings change, on average, when ability increases by 1 unit. It tells us about the statistical association between these variables, but not what causes what.

10.) What is the predicted value of EARNINGS for an observation with ABILITY = 150? Provide at least one reason you might want to be cautious about trusting this estimate of the EARNINGS of an individual with this level of ABILITY.

There are several ways to do this, but you must calculate b0 + b1*150, where b0 is the estimated intercept from your regression and b1 is the estimated slope coefficient. The best reason not to trust this predicted value is that it is an extrapolation: 150 is far outside the range of ability in your dataset.

11.) Construct a scatterplot of EARNINGS and ABILITY. Then describe the relationship between these variables in a sentence or two. Does the relationship you see in the scatterplot match the relationship implied by the regression and correlation coefficient?

    plot(data$ability, data$earnings)   # ability on the horizontal axis, matching the regression in question 9

There was a curvilinear relationship. This suggests that the correlation coefficient and regression coefficient do not fully characterize the relationship between the two variables. They are still valid estimates of the linear relationship between these variables, however.

12.) In the population, for every person born in South Dakota, there are 4 people born in Kansas, 4 born in Iowa, 2 born in Nebraska, and 5 born in Colorado. Is this dataset representative of the population? How, if at all, does the answer affect your analysis of this data? Explain briefly.

This population distribution of birthplace is very different from the distribution you observed in your sample. This suggests you should be cautious about extrapolating the results to the broader population of these states.
The reason is that we can only use data to make inferences about a broader population if it is representative of that population.

13.) What is the interquartile range of ABILITY? If two people's ABILITY differs by an amount equal to the interquartile range, what is the expected difference in their EARNINGS? Use your regression results to find an answer to the second question.

There are a few ways to calculate this, but here is one (b1 is the estimated slope coefficient from question 9):

    iqr <- quantile(data$ability, 0.75) - quantile(data$ability, 0.25)
    b1 * iqr

14.) Conduct a formal hypothesis test of whether EARNINGS is different for the two groups defined by RACE. Be sure to lay out each of the 5 steps in our hypothesis testing procedure.

Step 1: The null hypothesis is μ1 = μ0. The alternative is μ1 ≠ μ0.
Step 2: Construct the test statistic:

    library(dplyr)                           # provides filter() for data frames
    # In this version of the data, v2 holds race and v3 holds earnings.
    r1 <- filter(data, v2 == "nonwhite")
    r0 <- filter(data, v2 != "nonwhite")
    mu1 <- mean(r1$v3)
    mu0 <- mean(r0$v3)
    sd1 <- sd(r1$v3)
    sd0 <- sd(r0$v3)
    n1 <- length(r1$v3)
    n0 <- length(r0$v3)
    t <- (mu1 - mu0 - 0)/sqrt(sd1^2/n1 + sd0^2/n0)

Step 3: Calculate the p-value:

    df <- min(n0, n1) - 1
    p <- 2*(1 - pt(abs(t), df))

Step 4: Reject or fail to reject the null by comparing the p-value to your significance level.

Step 5: State your conclusion in the context of the question.

15.) Suppose your data is a representative sample of the population of interest. Let p be the population proportion. It is defined as the share of individuals for whom RACE = WHITE. Suppose you wanted to use your data to test the null hypothesis that p = 0.3. You plan to use a significance level of 0.05. What values of p̂ will fall into the rejection region for this test?

    p0 <- 0.3
    n <- length(data$v2)
    lb <- p0 - 1.96*sqrt(p0*(1-p0)/n)
    ub <- p0 + 1.96*sqrt(p0*(1-p0)/n)

We will reject the null for values of p̂ that fall below lb or above ub.

16.) Consider the hypothesis test laid out in the question immediately above. Suppose the "true" value of the population parameter is p = 0.25. Calculate the statistical power of this test. Then describe in a sentence or two what the statistical power tells us.

Building on the calculations in question 15:

    p <- 0.25
    pnorm(lb, mean = p, sd = sqrt(p*(1-p)/n)) + pnorm(ub, mean = p, sd = sqrt(p*(1-p)/n), lower.tail = FALSE)

The statistical power is the probability of rejecting the null hypothesis, given that the null hypothesis is false (and specifically, given some "true" value of p, which is 0.25 in this case).
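The calculations in questions 14 through 16 can be cross-checked with a short, self-contained sketch. Everything in it is an assumption made for illustration: the data frame `sim`, its column names, the sample size, and the simulated values are stand-ins, not the exam data. It compares the hand-built test statistic with R's built-in `t.test()`, which uses the same unpooled standard error but Welch-Satterthwaite degrees of freedom instead of the conservative min(n0, n1) - 1, and then repeats the rejection-region and power calculation with `qnorm(0.975)` in place of the rounded 1.96.

```r
# Simulated stand-in for the exam data; names and values are assumptions.
set.seed(409)
n <- 200
sim <- data.frame(
  race     = sample(c("white", "nonwhite"), n, replace = TRUE, prob = c(0.3, 0.7)),
  earnings = rnorm(n, mean = 40000, sd = 8000)
)

# Question 14: two-sample t statistic with unpooled variances.
e1 <- sim$earnings[sim$race == "nonwhite"]
e0 <- sim$earnings[sim$race != "nonwhite"]
t_stat <- (mean(e1) - mean(e0)) / sqrt(var(e1) / length(e1) + var(e0) / length(e0))

# Conservative degrees of freedom, as in the solution above.
df_cons <- min(length(e1), length(e0)) - 1
p_cons  <- 2 * pt(abs(t_stat), df_cons, lower.tail = FALSE)

# Built-in Welch test: same statistic, Welch-Satterthwaite degrees of freedom.
welch <- t.test(e1, e0)  # var.equal = FALSE is the default

# Questions 15 and 16: rejection region under p0 = 0.3, power at p = 0.25.
p0  <- 0.3
se0 <- sqrt(p0 * (1 - p0) / n)
lb  <- p0 - qnorm(0.975) * se0
ub  <- p0 + qnorm(0.975) * se0

p_true <- 0.25
se1    <- sqrt(p_true * (1 - p_true) / n)
power  <- pnorm(lb, mean = p_true, sd = se1) +
  pnorm(ub, mean = p_true, sd = se1, lower.tail = FALSE)

print(c(t_manual = t_stat, t_builtin = unname(welch$statistic)))
print(c(p_conservative = p_cons, p_welch = welch$p.value))
print(c(lower = lb, upper = ub, power = power))
```

The two t statistics should agree. The conservative p-value is never smaller than the Welch p-value, because fewer degrees of freedom put more probability in the tails of the t distribution.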
Part B: Short Answer

For each question below, provide a brief answer. No answer should be longer than 250 words.

1.) Suppose you are interested in whether preschool is helpful to children. You obtain observational data on a population of children who attended preschool and a population of children who did not attend preschool. The data is a random, representative sample of the population. All children were given an identical standardized test. The result of a hypothesis test shows statistically significant evidence that children who attended preschool earned better scores on the test.

a.) Does this data prove that preschool improves children's test scores? If so, explain why. If not, provide at least one alternative explanation for the result.

No. This is observational data and there are likely to be confounding and/or lurking variables that could explain the pattern. One alternative explanation is that wealthier families are more likely to send their kids to preschool *and* have other advantages that lead to better test scores. (2 points)

b.) Briefly describe how you might design an experiment to determine whether preschool improves children's test scores.

Answers may vary, but a correct answer will include randomization of children into treatment (i.e., you get preschool) and control (i.e., you don't get preschool) groups. (2 points)

2.) A survey of students found a negative correlation between the weekly hours of T.V. watched and the weekly hours spent exercising. One student explained that reducing the hours of T.V. watched (cause) would result in students sleeping longer and having more energy to exercise (effect). Give another explanation with hours of exercise as the cause and hours of T.V. watched as the effect.

Answers may vary. Points given for plausible explanations. (1 point)

3.) A college department wants to learn about the jobs that its alumni are working in. An online survey is set up on the department website; alumni are invited to complete it, and it includes demographic questions and questions about their jobs (current and past). Describe a problem that could arise with this survey.

A number of problems could arise; here are a few possible answers. First, it is an online survey, so only those alumni who check this website would potentially be able to respond. Second, it is based on volunteer response, and alumni who are not currently working, or who are working in a "less prestigious" job, would be less inclined to respond. (2 points total)
4.) In your own words, explain the difference between a parameter and a statistic. Give an example of each.

A parameter is a characteristic of a population; a statistic is a characteristic of a sample. An example of a parameter would be the proportion of all students who will vote for a particular candidate for student body president. An example of a statistic is the proportion of students in a sample who will vote for the candidate. (4 points total)
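The distinction in question 4 can be made concrete with a small simulation. This is only a sketch under stated assumptions: the population of 10,000 students, the 0.55 support rate, and the sample size of 100 are all hypothetical values chosen for illustration.

```r
# Hypothetical population of 10,000 students; 1 = will vote for the candidate.
set.seed(2020)
population <- rbinom(10000, size = 1, prob = 0.55)

# Parameter: the proportion in the whole population (fixed, usually unknown).
p_param <- mean(population)

# Statistic: the proportion in one random sample of 100 students.
smp   <- sample(population, size = 100)
p_hat <- mean(smp)

# A second sample gives its own statistic; the parameter does not change.
p_hat_2 <- mean(sample(population, size = 100))

print(c(parameter = p_param, statistic_1 = p_hat, statistic_2 = p_hat_2))
```

Rerunning the sampling lines without resetting the seed changes the statistics from draw to draw, while `p_param` stays fixed for the given population.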