Homework 2
Stat 158, Fall 2023
Due Sep 14, 2023 by 11:59 PM PDT on Gradescope

1. Scientists are interested in whether the energy costs involved in reproduction affect longevity.¹ In this experiment, 125 male fruit flies were divided at random into five sets of 25. In group 1, the males were kept by themselves. In groups 3 and 5, the males were supplied with one or eight receptive virgin female fruit flies per day, respectively. In groups 2 and 4, the males were supplied with one or eight unreceptive (pregnant) female fruit flies per day, respectively. Other than the number and type of companions, the males were treated identically. The longevity of the flies was observed. Using the dataset fruit_flies.txt on bCourses, which records the results of this experiment, do the following:

(a) (6 points) Visualize the data and interpret what you see.

I made a column chart (source code at the bottom). Treatment 3 appears to have the greatest longevity of the five groups. Groups 1 and 2 have very similar longevity, while groups 4 and 5 are the lowest, with group 5 showing the worst longevity in this experiment.

(b) (20 points) Create a one-way ANOVA table minus the p-values (see below for hypothesis testing). You can omit the grand average/benchmark row of the ANOVA table.

Source code at the bottom.

(c) State your conclusion about the scientific hypothesis of the ANOVA table.

Null hypothesis: there is no difference in mean longevity among the groups.

i. (10 points) Model-based inference

My F-statistic came out to be 119.3928. Under the null hypothesis the F-statistic should be close to 1, and this value far exceeds the 5% critical value of the F distribution with 4 and 120 degrees of freedom (about 2.45), so we reject the null hypothesis.

ii. (12 points) Randomization inference

After permuting the data 1000 times and comparing each permuted F-statistic with the original F-statistic, I obtained a p-value of 0: none of the 1000 permuted statistics was at least as large as the observed one. Since 0 < 0.05, we reject the null hypothesis.
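The model-based conclusion in part (c) i rests on comparing the observed F-statistic with the F distribution rather than with 1 alone. A minimal sketch of that distribution look-up follows; the value of `f_obs` is a made-up placeholder, not the fruit fly result, and the degrees of freedom assume 5 groups of 25 flies.

```r
# Hypothetical value for illustration only; NOT the fruit fly result
f_obs  <- 4.2   # assumed observed F-statistic
df_num <- 4     # g - 1 for g = 5 groups
df_den <- 120   # N - g for N = 125 flies

# Upper-tail probability P(F >= f_obs) under the null hypothesis
p_model <- pf(f_obs, df1 = df_num, df2 = df_den, lower.tail = FALSE)

# Equivalent decision rule: compare f_obs with the 5% critical value
f_crit <- qf(0.95, df1 = df_num, df2 = df_den)

p_model
f_obs > f_crit
```

Both `pf` and `qf` are distribution look-ups in base R, so this stays within the rule against using a built-in ANOVA routine.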
Do not use the anova command in R or any built-in functions that do an entire ANOVA analysis, but you may use R to do any component calculations, simulations, or distribution look-ups you would like. You may also use the ri.test package to do randomization inference.

2. In the experiment regarding the fruit flies in the previous problem, the ANOVA analysis only tests whether there were any differences between the groups. Suppose there was further interest in whether the 1 pregnant group was different from the 1 virgin group.

Null hypothesis: there is no difference between the 1 pregnant group and the 1 virgin group.

(a) (10 points) Perform a randomization test to compare these two groups using the t-statistic (assume same variances in the two groups).

I compared groups 2 and 3 and got a p-value of 0.776. Since 0.776 > 0.05, we fail to reject the null hypothesis.

(b) (10 points) Create a one-way ANOVA table, but use only the two groups 1 pregnant and 1 virgin, and use a randomization test to compute a p-value. How do the results of this analysis compare to the results in part (a)? What is the estimate of standard deviation (σ) for the t-statistic and what is it for the ANOVA table?

My p-value for part (b) is 0.416, a difference of 0.36 from the p-value in part (a). However, since both p-values are greater than 0.05, we still fail to reject the null hypothesis. The estimate of sd for the t-statistic is 16.05729, and for the ANOVA table the estimate of sd is 1.

(c) (15 points) If we assume the variances are the same in all of the 5 groups, then the ANOVA results in the previous question using all 5 groups also provided an estimate of σ. Construct a t-statistic and perform a randomization test comparing 1 pregnant and 1 virgin groups, but in estimating the denominator of the t-statistic, use the estimate of σ given by your ANOVA analysis when you used all the groups. The appropriate degrees of freedom for the t-distribution when the t-statistic is calculated in this way is given by the degrees of freedom of your estimate of σ from the ANOVA table.

I got a p-value of 0.795, so we fail to reject the null hypothesis.

(d) (10 points) Why could the method that uses all of the groups be ‘better’ at detecting differences between groups? Why could this method be worse?
I think including all the groups' data can be better because the estimate of σ then draws on the variability within all five groups, which provides more information than using just 2 groups. In particular, the residual variance is estimated with more degrees of freedom, giving a more stable denominator and making it easier to detect systematic differences. I think this method could be worse because the ANOVA assumes that all the variances are equal; if they are not, the pooled estimate of σ can misrepresent the variability of the two groups being compared, making it harder to find the true differences between them.

(e) (7 points) How do the randomization tests we have discussed so far compare to a test using the 2-group t-statistic that doesn’t assume the same variances between the groups?

A randomization test is non-parametric: it does not assume anything about the distribution of the data and relies only on the random assignment of treatments. A test using the 2-group t-statistic without the equal-variance assumption is a parametric test: it assumes approximate normality but allows the two groups to have different variances.
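To make the contrast in part (e) concrete, here is a small sketch of the unequal-variance (Welch) t-statistic and its approximate degrees of freedom. The data are simulated purely for illustration, and the names `x`, `y`, `t_welch`, and `df_welch` are assumptions, not part of the assignment.

```r
set.seed(1)
# Simulated data for illustration (hypothetical, not the fruit fly data)
x <- rnorm(25, mean = 60, sd = 10)
y <- rnorm(25, mean = 55, sd = 20)

# Welch t-statistic: each group keeps its own variance estimate
a <- var(x) / length(x)
b <- var(y) / length(y)
t_welch <- (mean(x) - mean(y)) / sqrt(a + b)

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (a + b)^2 / (a^2 / (length(x) - 1) + b^2 / (length(y) - 1))

# Two-sided p-value from the t distribution (this is the parametric step)
p_welch <- 2 * pt(abs(t_welch), df = df_welch, lower.tail = FALSE)

# Cross-check against the built-in Welch test
check <- t.test(x, y, var.equal = FALSE)
c(t_welch, df_welch, p_welch)
```

A randomization test would replace only the `pt` look-up with a reference distribution built by reshuffling group labels; the statistic itself could stay the same.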
¹ Question from Problem 3.2 in Oehlert textbook; original data from Hanley and Shapiro (1994), “Sexual activity and the lifespan of male fruitflies: a dataset that gets attention,” Journal of Statistics Education 2.

Source Code

---
title: "HW 2"
output: pdf_document
date: "2023-09-12"
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
# Install once from the console rather than on every knit:
# install.packages("permute")
# install.packages("ri2")
library(ggplot2)
library(dplyr)
library(readr)
library(permute)
library(ri2)
```

```{r}
# Loading in the text file
df = read.table("Stat 138/fruit_flies.txt", header = TRUE)
```

```{r}
# Q1A: column chart of the mean longevity in each treatment group
ggplot(df, aes(x = factor(trt), y = days)) +
  stat_summary(fun = mean, geom = "col") +
  ggtitle('Fruit Flies')
```

```{r}
# Q1B
# Defining groups
group1 <- df[df$trt == 1, "days"]
group2 <- df[df$trt == 2, "days"]
group3 <- df[df$trt == 3, "days"]
group4 <- df[df$trt == 4, "days"]
group5 <- df[df$trt == 5, "days"]

# Calculating their means
m1 <- mean(group1)
m2 <- mean(group2)
m3 <- mean(group3)
m4 <- mean(group4)
m5 <- mean(group5)

# Overall mean
overall_mean <- mean(c(group1, group2, group3, group4, group5))

# SS_Condition: squared deviations of the group means, weighted by group size
ss_condition <- sum(length(group1) * (m1 - overall_mean)^2,
                    length(group2) * (m2 - overall_mean)^2,
                    length(group3) * (m3 - overall_mean)^2,
                    length(group4) * (m4 - overall_mean)^2,
                    length(group5) * (m5 - overall_mean)^2)

# Degrees of freedom for condition (g - 1)
df_condition = 4

# SS_Residual
ss_residual <- sum((group1 - m1)^2, (group2 - m2)^2, (group3 - m3)^2,
                   (group4 - m4)^2, (group5 - m5)^2)

# Degrees of freedom for residual (N - g)
df_residual <- (length(group1) + length(group2) + length(group3) +
                length(group4) + length(group5) - 5)

# Mean squares for both (each SS divided by its own degrees of freedom)
ms_condition <- ss_condition / df_condition
ms_residual <- ss_residual / df_residual

# F-statistic
F_stat <- ms_condition / ms_residual
```
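One way to sanity-check a hand-built one-way ANOVA table is the identity SS_total = SS_condition + SS_residual. The sketch below verifies it on simulated data; the data frame `sim` and the `*_sim` names are hypothetical stand-ins chosen so they do not overwrite the variables in the chunks of this file.

```r
set.seed(2)
# Simulated stand-in for the data: 5 groups of 25 (hypothetical values)
sim <- data.frame(trt  = rep(1:5, each = 25),
                  days = rnorm(125, mean = 55, sd = 15))

grand_mean_sim  <- mean(sim$days)
group_means_sim <- tapply(sim$days, sim$trt, mean)
group_sizes_sim <- tapply(sim$days, sim$trt, length)

# Total, between-group (condition), and within-group (residual) sums of squares
ss_total_sim     <- sum((sim$days - grand_mean_sim)^2)
ss_condition_sim <- sum(group_sizes_sim * (group_means_sim - grand_mean_sim)^2)
ss_residual_sim  <- sum((sim$days - group_means_sim[as.character(sim$trt)])^2)

# The two components should add up to the total
all.equal(ss_total_sim, ss_condition_sim + ss_residual_sim)
```

If the identity fails, the usual culprits are a missing group-size weight in SS_condition or a wrong divisor in a mean square.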
```{r}
# Q1Cii
# Wasn't sure how to use ri.test, so I will use a permutation loop
num_permutations <- 1000
all_days <- c(group1, group2, group3, group4, group5)
labels <- rep(1:5, times = c(length(group1), length(group2), length(group3),
                             length(group4), length(group5)))

# F-statistic for a given assignment of observations y to groups g;
# both the condition and the residual mean squares are recomputed each time
compute_F <- function(y, g) {
  means <- tapply(y, g, mean)
  sizes <- tapply(y, g, length)
  ss_cond <- sum(sizes * (means - mean(y))^2)
  ss_res <- sum((y - means[as.character(g)])^2)
  (ss_cond / (length(means) - 1)) / (ss_res / (length(y) - length(means)))
}

observed_F <- compute_F(all_days, labels)
permuted_F_stat <- numeric(num_permutations)
for (i in 1:num_permutations) {
  # Shuffle the observations across the fixed group labels and recompute F
  permuted_F_stat[i] <- compute_F(sample(all_days), labels)
}
```

```{r}
# Calculating p-value
p_value <- sum(permuted_F_stat >= observed_F) / num_permutations
p_value
```

Question 2

```{r}
# Q2a
# I'm going to compare group 2 and 3
# Already have the means so need the variances
v2 <- var(group2)
v3 <- var(group3)
ngroup2 <- length(group2)
ngroup3 <- length(group3)

# Pooled-variance (equal-variance) t-statistic for two samples
pooled_t <- function(x, y) {
  sp2 <- ((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
    (length(x) + length(y) - 2)
  (mean(x) - mean(y)) / sqrt(sp2 * (1 / length(x) + 1 / length(y)))
}

# Calculate a t-stat for the original data
t1 <- pooled_t(group2, group3)

# num_permutations is defined already in Q1
permuted_t <- numeric(num_permutations)
idx2 <- 1:ngroup2
idx3 <- (ngroup2 + 1):(ngroup2 + ngroup3)
for (i in 1:num_permutations) {
  shuffled_groups <- sample(c(group2, group3))
  permuted_t[i] <- pooled_t(shuffled_groups[idx2], shuffled_groups[idx3])
}
```

```{r}
# Compute the p value
p_value <- sum(abs(permuted_t) >= abs(t1)) / num_permutations
p_value
```

```{r}
# Q2B
# Going to use the same groups from 2a
# One-way ANOVA F-statistic for two groups: MS_condition / MS_residual
two_group_F <- function(x, y) {
  grand <- mean(c(x, y))
  ss_cond <- length(x) * (mean(x) - grand)^2 + length(y) * (mean(y) - grand)^2
  ss_res <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
  (ss_cond / 1) / (ss_res / (length(x) + length(y) - 2))
}
F_statistic <- two_group_F(group2, group3)
permuted_F_stat <- numeric(num_permutations)

# Using permutation
for (i in 1:num_permutations) {
  shuffled_groups <- sample(c(group2, group3))
  permuted_F_stat[i] <- two_group_F(shuffled_groups[idx2], shuffled_groups[idx3])
}
```

```{r}
# Calculate p-value
p_value <- sum(permuted_F_stat >= F_statistic) / num_permutations
p_value
```

```{r}
# Estimate of the sd for the t-statistic = sqrt of the pooled variance of the two groups
pooled_sd <- sqrt(((ngroup2 - 1) * v2 + (ngroup3 - 1) * v3) / (ngroup2 + ngroup3 - 2))
pooled_sd

# Estimate of the sd from the ANOVA table = sqrt of ms_residual
# (ms_residual comes from the Q1B table with all five groups; reused in Q2c)
anova_sd <- sqrt(ms_residual)
anova_sd
```

```{r}
# Q2c
# t-statistic using the ANOVA sd estimate
observed_t <- (m2 - m3) / (anova_sd * sqrt(1/ngroup2 + 1/ngroup3))
permuted_t <- numeric(num_permutations)
for (i in 1:num_permutations) {
  shuffled_groups <- sample(c(group2, group3))
  # Calculate t-statistic with the ANOVA sd estimate
  shuffled_m2 <- mean(shuffled_groups[1:ngroup2])
  shuffled_m3 <- mean(shuffled_groups[(ngroup2 + 1):(ngroup2 + ngroup3)])
  shuffled_t <- (shuffled_m2 - shuffled_m3) / (anova_sd * sqrt(1/ngroup2 + 1/ngroup3))
  permuted_t[i] <- shuffled_t
}
```

```{r}
# Calculating the p-value
p_value <- sum(abs(permuted_t) >= abs(observed_t)) / num_permutations
p_value
```
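For Question 2(b), a useful reference point is that with exactly two groups the one-way ANOVA F-statistic equals the square of the equal-variance t-statistic, and the ANOVA residual sd equals the pooled sd. The sketch below checks this on simulated data; `g_a`, `g_b`, and the other names are hypothetical and the values are not from the fruit fly dataset.

```r
set.seed(3)
# Simulated two-group data (hypothetical, not the fruit fly data)
g_a <- rnorm(25, mean = 60, sd = 15)
g_b <- rnorm(25, mean = 55, sd = 15)
n_a <- length(g_a)
n_b <- length(g_b)

# Pooled-variance t-statistic
sp2   <- ((n_a - 1) * var(g_a) + (n_b - 1) * var(g_b)) / (n_a + n_b - 2)
t_two <- (mean(g_a) - mean(g_b)) / sqrt(sp2 * (1 / n_a + 1 / n_b))

# One-way ANOVA F-statistic for the same two groups
grand  <- mean(c(g_a, g_b))
ss_btw <- n_a * (mean(g_a) - grand)^2 + n_b * (mean(g_b) - grand)^2
ss_res <- sum((g_a - mean(g_a))^2) + sum((g_b - mean(g_b))^2)
f_two  <- (ss_btw / 1) / (ss_res / (n_a + n_b - 2))

# F should match t^2, and the residual mean square should match sp2
all.equal(f_two, t_two^2)
all.equal(ss_res / (n_a + n_b - 2), sp2)
```

Because `f_two >= c` exactly when `abs(t_two) >= sqrt(c)`, randomization tests based on the two statistics should agree up to Monte Carlo error when both use the same shuffles.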