Entity Academy Lesson 7 t-Tests

Sindy Saintclair
Thursday, December 24, 2021

Lesson 7 – t-Tests

Learning Objectives and Questions | Notes and Answers

Hypothesis Tests on the Mean

In this lesson, I will use a built-in function, t.test(), that automates all three types of t-tests. There is a bewildering array of hypothesis tests, each one applying to slightly different circumstances. All are useful and important, but it can be difficult to figure out which test applies to your particular data set.

1. Single sample t-test: testing the mean of a single population. Based on a sample, is the mean of a population equal to a value or not equal to a value?

2. Independent t-test: testing the means of two populations. Based on two samples, are the means of two populations equal to each other?

3. Dependent t-test: testing the means of two populations using paired data. Based on two samples, where each observation in one sample is related to a specific observation in the other sample, are the means of the two populations equal to each other?

Single Sample t-Tests

t-Test for One Sample

You can use the t.test() function to perform a hypothesis test on the mean. First I have to create a null hypothesis and then determine if my data gives me evidence to reject the null hypothesis. For this example, I will use the frostedflakes data set. Import this data, and you should then be able to access the frostedflakes dataset:

head(frostedflakes)

   Lab IA400
1 36.3  35.1
2 33.2  35.9
3 39.0  40.1
4 37.3  35.5
5 40.7  37.9
6 38.4  39.5

This is a data frame with 2 variables: Lab, which contains the percentage of sugar measured in a 25 gram sample of Frosted Flakes using a laboratory high-performance liquid chromatography technique, and IA400, which contains the percentage of sugar in the same sample measured by a machine (the Infra-Analyzer 400).

According to the nutritional information supplied with Frosted Flakes, the sugar percentage by weight is 37%. I will create a hypothesis test to see if the data set provides evidence to the contrary. To set up the hypothesis test, define the null and alternate hypotheses as follows:

H0: mu = 37 (the true mean sugar percentage is 37)
Ha: mu ≠ 37 (the true mean sugar percentage is not 37)

I will now see if my data provides evidence that I should reject the null hypothesis by using a t-test. In R, I will use the following commands:

t_obj <- t.test(frostedflakes$Lab, mu = 37)
print(t_obj)

I will save the object returned by t.test() in the variable t_obj, then print this object. The arguments for t.test() are the dataset name, frostedflakes, followed by the variable I am testing in the dataset, Lab, followed by the argument mu=. Remember that mu is the population mean, so it is the number I want to test against. In this case, I want to see if the sample sugar percentage is 37% or not, so that will be my mu= value
here. This code yields the following output:

One Sample t-test

data: frostedflakes$Lab
t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal to 37
95 percent confidence interval:
 37.10642 38.08558
sample estimates:
mean of x
   37.596

The object includes plenty of information. For my current purposes of doing a hypothesis test, I will focus on the following two lines:

t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal to 37

The second line confirms that I have set up the test correctly: the alternate hypothesis is that mu, the true mean, is not equal to 37. The first line gives plenty of information. The first part tells me that the t-score computed using the hypothesized mean of 37 is 2.4155. The last part gives me the p-value for this test: 0.01755.
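The object returned by t.test() is a list, so its pieces can also be pulled out individually, which is handy when you need a value programmatically instead of reading it off the printed output. A small sketch of my own (not part of the original lesson), using component names that t.test() actually returns:

t_obj$statistic   # the t-score, 2.4155
t_obj$p.value     # the p-value, 0.01755
t_obj$conf.int    # the 95% confidence interval, 37.10642 to 38.08558
t_obj$estimate    # the sample mean, 37.596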
The p-value is an indication of whether the data provide evidence that I should reject the null hypothesis. The smaller the p-value, the stronger the evidence that I should reject the null hypothesis. Generally, data scientists compare the p-value to a threshold of 0.05 to determine whether to reject the null hypothesis; in this case, the p-value is 0.01755, which is smaller than 0.05, so I should reject the null hypothesis. I have decided that the percentage of sugar in Frosted Flakes is not equal to 37%. The fact that the mean of the measured values is 37.596 gives me some evidence that the percentage of sugar is somewhat higher than 37%.

I can also see a graphical interpretation of this test. In the plot below, there is a histogram of the data with the 95% confidence interval computed by t.test() in red; I can also show the value of mu for the null hypothesis in green. I have seen the first three lines of the code below before; they are standard for histograms. But the last three lines are new. The function geom_vline() plots vertical lines on the graph, and they are plotted at the values of the xintercept= argument. You can pull these directly from the t.test() object by providing the object name followed by the name of the component in the t.test() output: conf.int[1] is the lower confidence limit, conf.int[2] is the upper confidence limit, and null.value is the hypothesized mean.

d <- ggplot(frostedflakes, aes(x = Lab))
d + geom_histogram(binwidth = 1) +
  geom_vline(xintercept = t_obj$conf.int[1], color = "red") +
  geom_vline(xintercept = t_obj$conf.int[2], color = "red") +
  geom_vline(xintercept = t_obj$null.value, color = "green")

This code will yield this image:
Note that the value of mu for the null hypothesis (which is 37) is not in the 95% confidence interval. Roughly speaking, this indicates that 37 is not a plausible value for the true mean at the 5% significance level. This tells you that you have strong evidence to reject the null hypothesis, which states that the true value of the mean is 37.

When using the t-test, it is always a good idea to check whether your data have an approximately normal distribution. You can check whether the Lab variable in the frostedflakes data set is normal by creating a normal probability plot for it:

ggplot(frostedflakes, aes(sample = Lab)) +
  geom_qq()
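As a complementary numerical check (my addition, not part of the original lesson), R's built-in shapiro.test() performs the Shapiro-Wilk test of normality; a small p-value would suggest the data are not normal:

# Shapiro-Wilk normality test on the lab measurements
shapiro.test(frostedflakes$Lab)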
In the normal probability plot, the data fall pretty much on a straight line, so I can conclude that they come from a normal distribution and the hypothesis test is built on solid assumptions. This can also be seen in the histogram above; it looks approximately bell shaped.

Independent t-Tests

I will use the frostedflakes data again. Suppose I now want to determine if the measurements made by the IA-400 give us the same average values as the lab measurements. To do this, I will think of the measurements in the data set as coming from two populations: one population is the lab measurements, and the other is the IA-400 measurements. Define the null and alternative hypotheses as follows: the null hypothesis is that the two population means are equal, and the alternative hypothesis is that the two population means differ. This is a two-sided t-test.

I will use t.test() to create a test on the two sample means; in this case, each sample is an argument to t.test(). As before, you save the object returned by t.test() into a variable (this one is named t_ind), then print that object. There are two other arguments included in t.test() as well. The first is alternative=. The options are "two.sided" for a two-tailed hypothesis test, or "greater" and "less" for a one-tailed test, each for the particular direction I am hypothesizing for the one-tailed
test. The last argument is var.equal=, and the options for this are only TRUE or FALSE. This is very similar to the different independent t-test options I receive in MS Excel: one type is for when I have homogeneity, or equal variance, and the other is for when I have heterogeneous (unequal) variance. If I don't know, or don't want to bother to find out, I just make sure to use the FALSE option. If I were to leave off the var.equal= argument altogether, FALSE would be the default.

t_ind <- t.test(frostedflakes$Lab, frostedflakes$IA400,
                alternative = "two.sided", var.equal = FALSE)
print(t_ind)

The code above yields the following output. I get a reminder that I have chosen unequal variance, because this is a Welch Two Sample t-test:

Welch Two Sample t-test

data: frostedflakes$Lab and frostedflakes$IA400
t = -1.6699, df = 195.08, p-value = 0.09654
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.3565905 0.1125905
sample estimates:
mean of x mean of y
   37.596    38.218

The output here reminds me what data I ran (handy if I am running lots of tests in quick succession!) and then provides information about the t-test itself, including the t and p values associated with the test and the degrees of freedom. Remember that if the p-value is less than 0.05, you can reject the null hypothesis, and there is a significant difference between the groups. Since the p-value here is 0.09654, which is greater than 0.05, there is no statistically significant difference between the sugar content measured in the lab and with the machine. The next line helps you determine that you have set up the test correctly: your alternate hypothesis is that the two population means are not equal; if they are not equal, the difference of one subtracted from the other will not be zero.

Graphing Data for an Independent t-Test

You can see graphically what is going on in this test by creating a box plot for the Lab values and comparing it to a box plot for the IA400 values. To create this plot with ggplot(), you need the data values to be in one column of the data frame, and a label of whether the measurement was taken in the lab or by machine in another column of the data frame. The simplest way to create such a data frame is the melt() function from the reshape2 package. First install and then load the reshape2 package:

library(reshape2)
ff <- melt(frostedflakes, id = "X")

The result from this command is:
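A sketch of what the reshaped data should look like, based on the head(frostedflakes) values above (and assuming the row-identifier column is named X, as in the melt() call):

head(ff)

  X variable value
1 1      Lab  36.3
2 2      Lab  33.2
3 3      Lab  39.0
4 4      Lab  37.3
5 5      Lab  40.7
6 6      Lab  38.4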
See how the columns are no longer labeled Lab and IA400? Now the column each number came from is denoted in the variable column, and the actual number that used to be contained within those columns is under a new column labeled value. With your data happily reformatted, you can then proceed to the box plot:

ggplot(ff) +
  geom_boxplot(aes(x = variable, y = value)) +
  xlab("Test Method") +
  ylab("Percentage of Sugar")
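As an aside (my note, not part of the lesson): reshape2 is no longer under active development, and the tidyr package provides pivot_longer() as a modern equivalent. A sketch of the same reshaping:

library(tidyr)
# names_to/values_to reproduce the variable/value columns created by melt()
ff2 <- pivot_longer(frostedflakes, cols = c(Lab, IA400),
                    names_to = "variable", values_to = "value")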
As you can see from the resulting box plot, the IA400 group has a somewhat different median value than the Lab group, and the IA400 group has a larger variation, as evidenced by the longer whiskers and the larger box. However, there is not a large difference in the median values, so it makes sense that the t-test would not indicate that there is strong evidence that the two means are different.

Dependent t-Tests

Paired t-Test for Two Samples

Example: An instructor is required to find out how much students learn in her course. At the beginning of the course, she randomly selects a group of students in her course and gives each student in this group a pre-test over the class material and records their scores. At the end of the course, she gives each student in the group the same test as a post-test. She wants to see if the difference in scores is statistically significant; in other words, she wants to see if, on average, each student's scores improved.

In this case, you have two populations: her class at the beginning of the course and her class at the end of the course. The populations are made up of the same people, but hopefully their level of expertise is different at the end of the course than at the beginning. When there is a one-to-one relationship between elements of two populations, a dependent, or paired, t-test is appropriate to determine whether the means of the two populations are different. In this case, you will call the
students at the beginning of the course Population 1, and the students at the end of the course Population 2. Your hypothesis test looks the same as in the previous section: the null hypothesis is that the two population means are equal, and the alternative hypothesis is that they are not. The difference, however, is that the two samples are paired.

The scores for the tests are in this file. Read it into a data frame using code or the wizard, and then store this data frame in the variable scores as follows:

scores <- read.csv("scores.csv")
head(scores)

Here's the head of the data:

  Student.Number pretest postest
1              1      13      24
2              2      10      17
3              3      11      22
4              4      14      21
5              5      16      25
6              6      10      23

This shows you that the data frame has three variables: Student.Number, pretest, and postest. pretest and postest contain the beginning and ending test scores. Use t.test() to perform the test on the two samples. The crucial part of this code, which sets it apart from its single sample and independent fellows, is the paired= argument. It must be set to TRUE or an independent t-test
will be run instead. Here is the code:

t_dep <- t.test(scores$postest, scores$pretest, paired = TRUE)
t_dep

And the output that is provided by R:

Paired t-test

data: scores$postest and scores$pretest
t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7.411558 9.235501
sample estimates:
mean of the differences
               8.323529

The meat and potatoes of this output is really these two lines:

t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0

The second line verifies that the alternative hypothesis is that the difference in the population means is not equal to
zero, or equivalently that the two population means are not equal to each other. The first line gives you the p-value; you reject the null hypothesis if the p-value is sufficiently small. The p-value for this test is so small it is given in scientific notation; written as a decimal, it is 0.00000000000000022. This is a very small number, and in particular, it is much smaller than 0.05, so you can say that the data give you strong evidence that you can reject the null hypothesis and that the two population means are not equal.

Graphing Data for a Dependent t-Test

You will graph the dependent t-test data the same way you did the independent t-test data. First you'll need to reshape the data using melt(). You'll use an extra argument to melt() because there are more than 2 variables present in your dataset. When you did this for the frostedflakes dataset, there were only two columns, so R had no confusion about which two to move around. But here, with the added variable Student.Number, you need to let R know that you want to fill the variable column with pretest and postest. This is done using the measure.vars= argument.

library(reshape2)
ss <- melt(scores, measure.vars = c("pretest", "postest"))

And here is how the data ends up being formatted:
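A sketch of what the reshaped scores should look like, based on the head(scores) values above:

head(ss)

  Student.Number variable value
1              1  pretest    13
2              2  pretest    10
3              3  pretest    11
4              4  pretest    14
5              5  pretest    16
6              6  pretest    10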
See how Student.Number is untouched, but you now have those same predetermined columns of variable and value? Now you're ready to plot your data. You can make box plots of the pretest and postest data values as follows:

ggplot(ss) +
  geom_boxplot(aes(x = variable, y = value)) +
  xlab("Test") +
  ylab("Score")

And here is the result! You can see that rejecting the null hypothesis was clearly the right decision: the median of the postest scores is much higher than the median of the pretest scores.

Checking for the t-Test Assumption of Normality Using a Histogram and/or QQ Plot

Graphing the Difference Scores

Another way to graph these data is to compute the difference between the postest score and the pretest score
for each student, and create a histogram for this difference. You can do this as follows:

dd <- scores$postest - scores$pretest
df <- data.frame(dd)
ggplot(df, aes(x = dd)) +
  geom_histogram(binwidth = 1) +
  xlab("Difference between postest and pretest")

The first part of the code subtracts the pretest scores from the postest scores, and then the result is turned into a data frame with the command data.frame(). Then it's business as usual for the histogram, with this result:

From this histogram of differences, you can see that the post-test score for most students was between 5 and 13 points higher than the pretest score.
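The objective above also mentions a QQ plot; as a sketch of my own (not in the original code), the same normal probability plot idea can be applied to the difference scores to check the paired t-test's normality assumption:

# normal probability plot of the difference scores;
# geom_qq_line() (available in recent ggplot2 versions) adds a reference line
ggplot(df, aes(sample = dd)) +
  geom_qq() +
  geom_qq_line()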
Summary

t-tests are quite handy when your sample size is relatively small and you are looking to determine a difference between means. The type of t-test depends on the means you want to compare. If you are comparing a sample to a population mean, then choose a single sample t-test. If you are comparing two unrelated samples, then choose an independent t-test. And if you are comparing two related samples, then you'll want to choose a dependent t-test. R makes t-testing so very easy with the function t.test(), which can handle all three types. It also provides you with means and confidence intervals for your t-tests, which is a big help.
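As a recap, here are the three call patterns from this lesson side by side, using the same datasets as above:

# single sample: is the population mean equal to 37?
t.test(frostedflakes$Lab, mu = 37)

# independent: are two unrelated population means equal? (Welch test when var.equal = FALSE)
t.test(frostedflakes$Lab, frostedflakes$IA400,
       alternative = "two.sided", var.equal = FALSE)

# dependent (paired): are two related population means equal?
t.test(scores$postest, scores$pretest, paired = TRUE)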