Understanding LLN, CLT, and Confidence Intervals in Statistics

GPHP 2000: Assignment #3: LLN/CLT/CI
Sonam Christopher
Due: November 15, 2022

1. What does the Law of Large Numbers tell us?
##As the sample size increases, the sample mean converges to the true population mean.##

2. What does the Central Limit Theorem tell us?
##If you repeat a study over and over again with a sufficiently large sample size, the means of the studies are approximately normally distributed, even when the underlying data are not normal.##

3. How do the law of large numbers and the Central Limit Theorem work together?
##The LLN tells us that a large enough sample gives a reasonable estimate of the true mean. The CLT tells us that if we repeated the study over and over, the means would be approximately normally distributed. We therefore want a sufficiently large sample so that we can apply the CLT to compute probabilities about the sample mean.##

4. What are 3 reasons we would want to create a confidence interval?
##To test a hypothesis, to give a range of plausible values for the population mean rather than a single point estimate, and to attach a confidence level to that range so we can judge how much "confidence" to place in the sample mean when making inferences about the population.##

5. How do we know whether or not we can get our critical value in a confidence interval from the Z distribution or a t distribution?
##We choose between the Z distribution and the t distribution based on whether the population variance is known. To use the Z distribution, the population variance or standard deviation must be known; when it has to be estimated from the sample, we use the t distribution.##

6. Generate 1000 random values from a N(2, 4) distribution.
a. Create and interpret a 90% confidence interval using the Z distribution.
b. Create and interpret a 90% confidence interval using the t distribution.

library(BSDA)
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
##     Orange
# rnorm() takes the standard deviation as its third argument, so this draws with sd = 4
data <- rnorm(1000, 2, 4)
# sigma.x is the population standard deviation assumed by the Z interval
dist_z <- z.test(data, mu = 2, sigma.x = 2, conf.level = 0.9)
dist_z$conf.int

## [1] 1.704461 1.912520
## attr(,"conf.level")
## [1] 0.9
dist_t <- t.test(data, mu = 2, conf.level = 0.9)
dist_t$conf.int
## [1] 1.601060 2.015921
## attr(,"conf.level")
## [1] 0.9

c. How do these compare?
##The confidence interval with the z distribution is (1.704, 1.913), and the confidence interval with the t distribution is (1.601, 2.016). The z interval is about half as wide because sigma.x = 2 is half the standard deviation of 4 used to generate the data, whereas the t interval estimates the standard deviation from the sample. When the population variance is truly known, it is better to use the z distribution, because it can make more precise estimates without having to approximate the variance.##

Data Problems

We will work with the BRFSS data which was used in modules.

load("~/Desktop/R Studio/course_data (13).RData")

7. What is the mean number of poor mental health days (menthlth)?
##The mean number of poor mental health days is 5.508428 days.##

library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(magrittr)
brfss2 %>% summarise(mean(menthlth, na.rm = TRUE))
##   mean(menthlth, na.rm = TRUE)
## 1                     5.508428

8. Graph and describe the distribution of poor mental health days.
##The number of poor mental health days in a 30-day period is not normally distributed. The highest density of individuals reported between 1 and 5 days of poor mental health in the past 30 days. The histogram resembles the New York City skyline: the tallest bar is the 800 individuals reporting 2 poor mental health days, with spikes at 5, 7, 10, 15, 20, 25, and 30. These spikes reflect time frames and terms that might have been easier for people to respond to: nice round numbers (5, 10, 15, 20, 25), a week (7), and the whole month (30).##

library(ggplot2)
# Drop missing values, values above 30, and respondents reporting 0 days
dmh <- brfss2 %>% filter(!(is.na(menthlth) | menthlth > 30 | menthlth == 0))
ggplot(data = dmh, aes(x = menthlth)) +
  geom_histogram(binwidth = 1) +
  xlab("Poor Mental Health Days") +
  ylab("Number of Individuals")

9. Create and interpret a confidence interval for poor mental health days
a. Using the t distribution.

mn <- mean(brfss2$menthlth, na.rm = TRUE)
std.dev <- sd(brfss2$menthlth, na.rm = TRUE)
# Note: length() also counts missing values
n <- length(brfss2$menthlth)
# Note: 2.26 is roughly the 97.5th percentile of t with 9 degrees of freedom;
# for a sample this large, qt(0.975, n - 1) is close to 1.96
low <- mn - 2.26 * std.dev / sqrt(n)
high <- mn + 2.26 * std.dev / sqrt(n)
low
## [1] 5.269694
high
## [1] 5.747162
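The critical value of 2.26 used in part a matches a t distribution with about 9 degrees of freedom, while the sample here has thousands of observations. As a minimal sketch, assuming a 95% confidence level and using simulated stand-in data (the vector `x` and the helper `t_interval` are hypothetical and not part of the assignment), the critical value can be taken from qt() and the sample size counted from non-missing values only:

```r
# Simulated stand-in for a numeric column with missing values (not the BRFSS data)
set.seed(1)
x <- c(rpois(500, lambda = 5), rep(NA, 20))

# t-based confidence interval for the mean, ignoring missing values
t_interval <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                                  # keep observed values only
  n <- length(x)                                     # number of non-missing observations
  se <- sd(x) / sqrt(n)                              # standard error of the mean
  crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)   # two-sided t critical value
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

ci <- t_interval(x)
ci
```

The same interval should come out of t.test(x)$conf.int, which is a convenient way to check a hand computation like this one.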

b. Using the bootstrap approach with 1000 bootstraps.

library(purrr)
##
## Attaching package: 'purrr'
## The following object is masked from 'package:magrittr':
##
##     set_names
library(rsample)
library(boot)
##
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
##
##     melanoma
library(bootstrap)
library(dplyr)
brfss4 <- brfss2 %>% select(menthlth)
set.seed(123)
bt_data <- bootstraps(brfss4, times = 1000)
bt_data
## # Bootstrap sampling
## # A tibble: 1,000 × 2
##    splits              id
##    <list>              <chr>
##  1 <split [6706/2464]> Bootstrap0001
##  2 <split [6706/2468]> Bootstrap0002
##  3 <split [6706/2425]> Bootstrap0003
##  4 <split [6706/2468]> Bootstrap0004
##  5 <split [6706/2460]> Bootstrap0005
##  6 <split [6706/2451]> Bootstrap0006
##  7 <split [6706/2419]> Bootstrap0007
##  8 <split [6706/2470]> Bootstrap0008
##  9 <split [6706/2501]> Bootstrap0009
## 10 <split [6706/2543]> Bootstrap0010
## # … with 990 more rows
bt_data$splits[[1]]
## <Analysis/Assess/Total>
## <6706/2464/6706>

# Mean of menthlth within one bootstrap resample
the_mean <- function(split) {
  split_data <- analysis(split)
  mean(split_data$menthlth, na.rm = TRUE)
}
bt_data$bt_means <- map_dbl(bt_data$splits, the_mean)
bt_data
## # Bootstrap sampling
## # A tibble: 1,000 × 3
##    splits              id            bt_means
##    <list>              <chr>            <dbl>
##  1 <split [6706/2464]> Bootstrap0001     5.54
##  2 <split [6706/2468]> Bootstrap0002     5.57
##  3 <split [6706/2425]> Bootstrap0003     5.36
##  4 <split [6706/2468]> Bootstrap0004     5.54
##  5 <split [6706/2460]> Bootstrap0005     5.59
##  6 <split [6706/2451]> Bootstrap0006     5.51
##  7 <split [6706/2419]> Bootstrap0007     5.54
##  8 <split [6706/2470]> Bootstrap0008     5.44
##  9 <split [6706/2501]> Bootstrap0009     5.51
## 10 <split [6706/2543]> Bootstrap0010     5.56
## # … with 990 more rows
# The 2.5th and 97.5th percentiles of the bootstrap means give a 95% interval
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci
##  2.5% 97.5%
## 5.310 5.735

10. Consider the variable of Binary General Health, genhlth_bin, and the relationship it has with poor mental health days, menthlth.
a. Bootstrap a confidence interval for the mean number of poor mental health days for those who are generally healthy.

set.seed(123)
brfss6 <- brfss2 %>%
  filter(genhlth_bin == "Excell/VG/G") %>%
  select(menthlth)
brfss6 %>% summarise(mean(menthlth, na.rm = TRUE))
##   mean(menthlth, na.rm = TRUE)
## 1                     4.682564
bt_data <- bootstraps(brfss6, times = 1000)
get_mean <- function(split) {
  # access to the sample data
  split_data <- analysis(split)
  # calculate the sample mean value
  split_mean <- mean(split_data$menthlth, na.rm = TRUE)
  return(split_mean)
}

bt_data$bt_means <- map_dbl(bt_data$splits, get_mean)
bt_ci <- round(quantile(bt_data$bt_means, c(0.025, 0.975)), 3)
bt_ci
##  2.5% 97.5%
## 4.471 4.919
# Refer to the column by name inside aes() rather than with bt_data$bt_means
ggplot(bt_data, aes(x = bt_means)) +
  geom_line(stat = "density") +
  xlab("Mean of Poor Mental Health Days for Those Who are Generally Healthy")

##By our confidence interval, we are 95% confident that the mean number of poor mental health days for people who are generally healthy falls between 4.471 and 4.919. The sample mean of poor mental health days for people who are generally healthy was 4.68, which falls within our confidence interval.##

b. Bootstrap a confidence interval for the mean number of poor mental health days for those who are not generally healthy.

set.seed(123)
brfss7 <- brfss2 %>%
  filter(genhlth_bin == "Fair/Poor") %>%

  select(menthlth)
brfss7 %>% summarise(mean(menthlth, na.rm = TRUE))
##   mean(menthlth, na.rm = TRUE)
## 1                     8.731943
bt_data1 <- bootstraps(brfss7, times = 1000)
get_mean <- function(split) {
  # access to the sample data
  split_data <- analysis(split)
  # calculate the sample mean value
  split_mean <- mean(split_data$menthlth, na.rm = TRUE)
  return(split_mean)
}
bt_data1$bt_means <- map_dbl(bt_data1$splits, get_mean)
bt1_ci <- round(quantile(bt_data1$bt_means, c(0.025, 0.975)), 3)
bt1_ci
##  2.5% 97.5%
## 8.141 9.326
# Refer to the column by name inside aes() rather than with bt_data1$bt_means
ggplot(bt_data1, aes(x = bt_means)) +
  geom_line(stat = "density") +
  xlab("Mean of Poor Mental Health Days for Those Who are Not Generally Healthy")

##By our confidence interval of (8.141, 9.326), we are 95% confident that the mean number of poor mental health days for people who are not generally healthy falls between 8.141 and 9.326. The sample mean, 8.731943, falls within this interval.##

c. Interpret and discuss what connections you see between these confidence intervals.
##The confidence interval for the mean number of poor mental health days among those who are generally healthy (4.471 to 4.919) sits at roughly half the level of the interval for those who are not generally healthy (8.141 to 9.326), and the two intervals do not overlap. This supports the connection between mind and body: if we are usually in good health, we tend to feel better. Good health often means less chronic pain, which is a significant indicator of depression, as well as the ability to carry out activities of daily living without much difficulty.##

d. Bootstrap and interpret a confidence interval for the difference in mean poor mental health days between those who are generally healthy and those who are not.
##The 95% confidence interval for the difference is (-4.664, -3.426). Since 0 is not included in the interval, the data are not consistent with there being no difference between the generally healthy and not generally healthy groups in their number of poor mental health days in a 30-day period. Those who are generally healthy report roughly 3.4 to 4.7 fewer poor mental health days on average.##

brfss8 <- brfss2 %>%
  select(menthlth, genhlth_bin) %>%
  filter(!is.na(menthlth) & !is.na(genhlth_bin))

set.seed(123)
bt_data2 <- bootstraps(brfss8, times = 1000)
bt_data2
## # Bootstrap sampling
## # A tibble: 1,000 × 2
##    splits              id
##    <list>              <chr>
##  1 <split [6585/2415]> Bootstrap0001
##  2 <split [6585/2424]> Bootstrap0002
##  3 <split [6585/2381]> Bootstrap0003
##  4 <split [6585/2418]> Bootstrap0004
##  5 <split [6585/2418]> Bootstrap0005
##  6 <split [6585/2405]> Bootstrap0006
##  7 <split [6585/2376]> Bootstrap0007
##  8 <split [6585/2426]> Bootstrap0008
##  9 <split [6585/2452]> Bootstrap0009
## 10 <split [6585/2498]> Bootstrap0010
## # … with 990 more rows
analysis(bt_data2$splits[[1]]) %>% as_tibble()
## # A tibble: 6,585 × 2
##    menthlth genhlth_bin
##       <dbl> <fct>
##  1        7 Excell/VG/G
##  2       30 Excell/VG/G
##  3       15 Fair/Poor
##  4        0 Excell/VG/G
##  5        5 Excell/VG/G
##  6        1 Fair/Poor
##  7        0 Excell/VG/G
##  8        0 Excell/VG/G
##  9        2 Excell/VG/G
## 10        1 Excell/VG/G
## # … with 6,575 more rows
# Difference in mean poor mental health days (healthy minus not healthy) for one resample
get_diff <- function(splits) {
  d <- analysis(splits)
  mean_yes <- mean(d$menthlth[d$genhlth_bin == "Excell/VG/G"], na.rm = TRUE)
  mean_no <- mean(d$menthlth[d$genhlth_bin == "Fair/Poor"], na.rm = TRUE)
  mean_yes - mean_no
}
bt_data2$bt_diff <- map_dbl(bt_data2$splits, get_diff)
ggplot(bt_data2, aes(x = bt_diff)) +
  geom_line(stat = "density") +
  xlab("Difference in Average Poor Mental Health Days Between Those Who are and are Not Generally Healthy")

bt_ci <- round(quantile(bt_data2$bt_diff, c(0.025, 0.975)), 3)
bt_ci
##   2.5%  97.5%
## -4.664 -3.426
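
The percentile bootstrap for a difference in means can also be written without rsample. The following is a rough base R sketch of the same idea on simulated stand-in data; the data frame `d`, its columns `days` and `group`, and the helper `diff_means` are hypothetical names, and the simulated values are not the BRFSS sample. Rows are resampled with replacement, the difference in group means is recomputed for each resample, and the middle 95% of those differences is taken as the interval.

```r
set.seed(123)

# Simulated stand-in data (hypothetical; not the BRFSS sample)
d <- data.frame(
  days  = c(rpois(400, 4), rpois(150, 9)),
  group = rep(c("healthy", "not_healthy"), times = c(400, 150))
)

# Difference in group means (healthy minus not healthy) for one set of row indices
diff_means <- function(data, idx) {
  b <- data[idx, ]
  mean(b$days[b$group == "healthy"]) - mean(b$days[b$group == "not_healthy"])
}

# 1000 bootstrap resamples of the rows, drawn with replacement
boot_diffs <- replicate(1000, diff_means(d, sample(nrow(d), replace = TRUE)))

# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrap differences
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))
boot_ci
```

As in the analysis above, an interval that excludes 0 suggests the two group means differ; resampling whole rows keeps each observation paired with its group label.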
