tutorial_inference1

School

University of British Columbia

Course

DSCI100

Subject

Statistics

Date

Feb 20, 2024

Tutorial 11 - Introduction to Statistical Inference

Lecture and Tutorial Learning Goals:

After completing this week's lecture and tutorial work, you will be able to:

- Describe real world examples of questions that can be answered with the statistical inference methods.
- Name common population parameters (e.g., mean, proportion, median, variance, standard deviation) that are often estimated using sample data, and use computation to estimate these.
- Define the following statistical sampling terms (population, sample, population parameter, point estimate, sampling distribution).
- Explain the difference between a population parameter and sample point estimate.
- Use computation to draw random samples from a finite population.
- Use computation to create a sampling distribution from a finite population.
- Describe how sample size influences the sampling distribution.

This worksheet covers parts of the Inference chapter of the online textbook. You should read this chapter before attempting the worksheet.

### Run this cell before continuing.
library(tidyverse)
library(repr)
library(infer)
options(repr.matrix.max.rows = 6)
source('tests.R')
source('cleanup.R')

Virtual sampling simulation

In this tutorial you will study samples and sample means generated from different distributions. In real life, we rarely, if ever, have measurements for our entire population. Here, however, we will make simulated datasets so we can understand the behaviour of sample means.

Suppose we had the data science final grades for a large population of students.

# run this cell to simulate a finite population
set.seed(20201) # DO NOT CHANGE
students_pop <- tibble(grade = (rnorm(mean = 70, sd = 8, n = 10000)))
students_pop

Question 1.0 {points: 1}
Visualize the distribution of the population (students_pop) that was just created by plotting a histogram using binwidth = 1 in the geom_histogram argument. Name the plot pop_dist and give the x-axis a descriptive label.

options(repr.plot.width = 8, repr.plot.height = 6)
# ... <- ggplot(..., ...) +
#    geom_...(...) +
#    ... +
#    ggtitle("Population distribution")
### BEGIN SOLUTION
pop_dist <- ggplot(students_pop, aes(grade)) +
    geom_histogram(binwidth = 1) +
    xlab("Grades") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20))
### END SOLUTION
pop_dist

test_1.0()

Question 1.1 {points: 3}

Describe in words the distribution above. Comment on its shape, its center, and how spread out it is.

BEGIN SOLUTION
The distribution is bell-shaped and symmetric, with one large peak in the middle centered at about 70 percent. Students' scores ranged from just over 40% to just under 100%, but most students got between about 60 and 80%.
END SOLUTION

Question 1.2 {points: 1}

Use summarise to calculate the following population parameters from the students_pop population:

- mean (use the mean function)
- median (use the median function)
- standard deviation (use the sd function)

Name this data frame pop_parameters, which has the column names pop_mean, pop_med and pop_sd.

### BEGIN SOLUTION
pop_parameters <- students_pop |>
    summarise(pop_mean = mean(grade),
              pop_med = median(grade),
              pop_sd = sd(grade))
### END SOLUTION
pop_parameters

test_1.2()

Question 1.2.1 {points: 1}

Draw one random sample of 5 students from our population of students (students_pop). Use summarize to calculate the mean, median, and standard deviation for these 5 students. Name this data frame ests_5, which should have the column names mean_5, med_5 and sd_5. Use the seed 4321.

set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
ests_5 <- students_pop |>
    rep_sample_n(5) |>
    summarize(mean_5 = mean(grade),
              med_5 = median(grade),
              sd_5 = sd(grade))
### END SOLUTION
ests_5

test_1.2.1()

Question 1.2.2 Multiple Choice: {points: 1}

Which of the following is the point estimate for the average final grade for the population of data science students (rounded to two decimal places)?

A. 70.03
B. 69.76
C. 73.52
D. 8.05

Assign your answer to an object called answer1.2.2. Your answer should be a single character surrounded by quotes.

### BEGIN SOLUTION
answer1.2.2 <- "B"
### END SOLUTION

test_1.2.2()

Question 1.2.3 {points: 1}
Draw one random sample of 100 students from our population of students (students_pop). Use summarize to calculate the mean, median and standard deviation for these 100 students. Name this data frame ests_100, which has the column names mean_100, med_100 and sd_100. Use the seed 4321.

set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
ests_100 <- students_pop |>
    rep_sample_n(100) |>
    summarize(mean_100 = mean(grade),
              med_100 = median(grade),
              sd_100 = sd(grade))
### END SOLUTION
ests_100

test_1.2.3()

Exploring the sampling distribution of the sample mean for different populations

We will create the sampling distribution of the sample mean by taking 1500 random samples of size 5 from this population and visualize the distribution of the sample means.

Question 1.3 {points: 1}

Draw 1500 random samples from our population of students (students_pop). Each sample should have 5 observations. Name the data frame samples and use the seed 4321.

# ... <- rep_sample_n(..., size = ..., reps = ...)
set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
samples <- rep_sample_n(students_pop, size = 5, reps = 1500)
### END SOLUTION
head(samples)
tail(samples)
dim(samples)

test_1.3()

Question 1.4 {points: 1}

Group by the sample replicate number, and then for each sample, calculate the mean. Name the data frame sample_estimates. The data frame should have the column names replicate and mean_grade.
### BEGIN SOLUTION
sample_estimates <- samples |>
    group_by(replicate) |>
    summarise(mean_grade = mean(grade))
### END SOLUTION
head(sample_estimates)
tail(sample_estimates)

test_1.4()

Question 1.5 {points: 1}

Visualize the distribution of the sample estimates (sample_estimates) you just calculated by plotting a histogram using binwidth = 1 in the geom_histogram argument. Name the plot sampling_distribution_5 and give the plot a title (using ggtitle) and the x-axis a descriptive label.

options(repr.plot.width = 8, repr.plot.height = 6)
### BEGIN SOLUTION
sampling_distribution_5 <- ggplot(sample_estimates, aes(x = mean_grade)) +
    geom_histogram(binwidth = 1) +
    xlab("Sample means \n(mean grade)") +
    ggtitle("Sampling distribution of the sample means \n for n = 5") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_5

test_1.5()

Question 1.6 {points: 3}

Describe in words the distribution above. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution to the population distribution of students' grades above.

BEGIN SOLUTION
The distribution is bell-shaped, with one large peak in the middle centered at the population mean. The sample means range from 60 to 80%, but most samples had a mean between about 65 and 75%. The shape of the sampling distribution is the same as that of the population distribution (bell-shaped, one peak, symmetric), but its spread is smaller.
END SOLUTION

Question 1.6.1 {points: 3}

Repeat Q1.3 - 1.5, but now for 100 observations:
1. Draw 1500 random samples from our population of students (students_pop). Each sample should have 100 observations. Use the seed 4321.
2. Group by the sample replicate number, and then for each sample, calculate the mean (call this column mean_grade_100).
3. Visualize the distribution of the sample estimates you calculated by plotting a histogram using binwidth = 0.5 in the geom_histogram argument. Name the plot sampling_distribution_100 and give the plot a title (using ggtitle) and the x-axis a descriptive label.

set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
sample_means <- rep_sample_n(students_pop, size = 100, reps = 1500) |>
    group_by(replicate) |>
    summarise(mean_grade_100 = mean(grade))
sampling_distribution_100 <- ggplot(sample_means, aes(x = mean_grade_100)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean grade)") +
    ggtitle("Sampling distribution of the sample means \n for n = 100") +
    theme(text = element_text(size = 20))
### END SOLUTION
sampling_distribution_100

set.seed(4321) # DO NOT CHANGE!
# We check that you've created objects with the right names below
# But all other tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
test_that('Did not create objects named sampling_distribution_100', {
    expect_true(exists("sampling_distribution_100"))
})
### BEGIN HIDDEN TESTS
properties <- c(sampling_distribution_100$layers[[1]]$mapping, sampling_distribu
labels <- sampling_distribution_100$labels
test_that('mean_grade_100 should be on the x-axis.', {
    expect_true("mean_grade_100" == rlang::get_expr(properties$x))
})
test_that('sampling_distribution_100 should be a histogram.', {
    expect_true("GeomBar" %in% class(sampling_distribution_100$layers[[1]]$geom)
})
test_that('sampling_distribution data should be used to create the histogram', {
    expect_equal(int_round(nrow(sampling_distribution_100$data), 0), 1500)
    expect_equal(digest(int_round(sum(sampling_distribution_100$data), 2)), '487
})
test_that('Labels on the x axis should be descriptive. The plot should have a de
    expect_false((labels$x) == 'mean_grade_100')
    expect_false(is.null(labels$title))
})
print("Success!")
### END HIDDEN TESTS

Question 1.6.2 {points: 3}
Suppose we do not know the parameter value for the population of data science students (as is usually the case in real life). Compare your point estimates for the population mean from Q1.2.1 and Q1.2.3 above. Which of the two point estimates is more likely to be closer to the actual value of the average final grade of the population of data science students? Briefly explain. (Hint: look at the sampling distributions for your samples of size 5 and size 100 to help you answer this question.)

BEGIN SOLUTION
The point estimate from the sample of size 100. We can see from the sampling distributions above that the one for samples of size 100 has less variability/spread. So the larger the sample an estimate is based on, the more likely the estimate is to be close to the parameter it estimates.
END SOLUTION

Question 1.7 {points: 1}

Let's create a simulated dataset of the number of cups of coffee drunk per week for our population of students. Describe in words the distribution. Comment on its shape, its center, and how spread out it is.

# run this cell to simulate a finite population
set.seed(2020) # DO NOT REMOVE
coffee_data = tibble(cups = rexp(n = 2000, rate = 0.34))
coffee_dist <- ggplot(coffee_data, aes(cups)) +
    geom_histogram(binwidth = 0.5) +
    xlab("Cups of coffee per week") +
    ggtitle("Population distribution") +
    theme(text = element_text(size = 20))
coffee_dist

BEGIN SOLUTION
The distribution is not symmetric (specifically, it is right-skewed), with one large peak on the left side of the distribution. Most students drink a small number of cups of coffee per week (0, 1, or 2). However, there is a long tail on the right side, with some students drinking as many as 20 cups per week. The range of the distribution is between 0 and 20.
END SOLUTION

Question 1.8 {points: 1}
Draw 1500 random samples from coffee_data. Each sample should have 5 observations. Assign this data frame to an object called coffee_samples_5. Group by the sample replicate number, and then for each sample, calculate the mean. Name the data frame coffee_sample_estimates_5. The data frame should have the column names replicate and coffee_mean_cups_5. Finally, create a plot of the sampling distribution called coffee_sampling_distribution_5.

Hint: a binwidth of 1 is a little too big for this data; try a binwidth of 0.5 instead.

set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
coffee_samples_5 <- rep_sample_n(coffee_data, size = 5, reps = 1500)
coffee_sample_estimates_5 <- coffee_samples_5 |>
    group_by(replicate) |>
    summarise(coffee_mean_cups_5 = mean(cups))
coffee_sampling_distribution_5 <- ggplot(coffee_sample_estimates_5, aes(x = cof
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean cups of coffee per week)") +
    ggtitle("Sampling distribution of the \n mean cups of coffee per week") +
    theme(text = element_text(size = 20)) +
    xlim(c(0, 8))
### END SOLUTION
coffee_sampling_distribution_5

test_1.8()

Question 1.9 {points: 3}

Describe in words the distribution above. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution to the population distribution above.

BEGIN SOLUTION
The distribution is not symmetric. It is right-skewed, but not as highly skewed as the population, with one large peak around 2.5 cups. The range of the distribution is between 0 and 10.
END SOLUTION

Question 2.0 {points: 1}
Draw 1500 random samples from coffee_data. Each sample should have 30 observations. Assign this data frame to an object called coffee_samples_30. Group by the sample replicate number, and then for each sample, calculate the mean. Name the data frame coffee_sample_estimates_30. The data frame should have the column names replicate and coffee_mean_cups_30. Finally, create a plot of the sampling distribution called coffee_sampling_distribution_30.

Hint: use xlim to control the x-axis limits so that they are similar to those in the histogram above. This will make it easier to compare this histogram with that one.

set.seed(4321) # DO NOT CHANGE!
### BEGIN SOLUTION
coffee_samples_30 <- rep_sample_n(coffee_data, size = 30, reps = 1500)
coffee_sample_estimates_30 <- coffee_samples_30 |>
    group_by(replicate) |>
    summarise(coffee_mean_cups_30 = mean(cups))
coffee_sampling_distribution_30 <- ggplot(coffee_sample_estimates_30, aes(x = c
    geom_histogram(binwidth = 0.5) +
    xlab("Sample means \n(mean cups of coffee per week)") +
    ggtitle("Sampling distribution of the \n mean cups of coffee per week") +
    theme(text = element_text(size = 20)) +
    xlim(c(0, 8))
### END SOLUTION
coffee_sampling_distribution_30

test_2.0()

Question 2.1 {points: 3}

Describe in words the distribution above. Comment on its shape, its center, and how spread out it is. Compare this sampling distribution with samples of size 30 to the sampling distribution with samples of size 5.

BEGIN SOLUTION
The distribution is bell-shaped, with one large peak in the middle centered around 2.5 cups per week. The spread is smaller than that of the sampling distribution for size 5.
END SOLUTION

source('cleanup.R')
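The comparison in Questions 1.9 and 2.1 (larger samples give a narrower sampling distribution) can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original worksheet. It rebuilds the coffee population with the same recipe as above under a new name (coffee_pop), and the helper sd_of_sample_means and the columns observed_spread and approx_spread are hypothetical names introduced only for illustration. It also assumes the standard rule of thumb, which the worksheet does not state, that the standard deviation of the sample mean is roughly the population standard deviation divided by the square root of the sample size.

```r
library(tidyverse)
library(infer)

# Rebuild the simulated coffee population (same recipe as coffee_data above).
set.seed(2020)
coffee_pop <- tibble(cups = rexp(n = 2000, rate = 0.34))

# Hypothetical helper: draw `reps` samples of size `n`, compute each sample's
# mean, and return the standard deviation of those sample means.
sd_of_sample_means <- function(pop, n, reps = 1500) {
    pop |>
        rep_sample_n(size = n, reps = reps) |>
        group_by(replicate) |>
        summarise(sample_mean = mean(cups), .groups = "drop") |>
        summarise(spread = sd(sample_mean)) |>
        pull(spread)
}

set.seed(4321)
spread_comparison <- tibble(n = c(5, 30)) |>
    mutate(
        # spread actually observed in the simulated sampling distribution
        observed_spread = map_dbl(n, \(size) sd_of_sample_means(coffee_pop, size)),
        # rule-of-thumb approximation: population sd / sqrt(sample size)
        approx_spread = sd(coffee_pop$cups) / sqrt(n)
    )
spread_comparison
```

If the rule of thumb holds, the two spread columns should be close to each other in each row, and both should be noticeably smaller for n = 30 than for n = 5, which matches the narrower histogram in Question 2.0.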