Entity Academy Lesson 7 – t-Tests
University of South Florida
Sindy Saintclair
Thursday, December 24, 2021
Lesson 7 – t-Tests
Hypothesis Tests on the Mean
In this lesson, I will use a built-in function, t.test(), that automates all three types of t-tests. There is a bewildering array of hypothesis tests, each one applying to slightly different circumstances. All are useful and important, but it can be difficult to figure out which test applies to your particular data set.
1. Single sample t-test: testing the mean of a single population. Based on a sample, is the mean of a population equal to a value or not equal to a value?
2. Independent t-test: testing the means of two populations. Based on two samples, are the means of two populations equal to each other?
3. Dependent t-test: testing the means of two populations using paired data. Based on two samples, where each observation in one sample is related to a specific observation in the other sample, are the means of the two populations equal to each other?
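As a preview, the three tests above map onto three call patterns of t.test(). The vectors x and y below are simulated stand-ins (hypothetical, not the lesson's data sets), just to show the shape of each call:

```r
# Simulated stand-in data (hypothetical; not the lesson's data sets)
set.seed(1)
x <- rnorm(30, mean = 37, sd = 2)   # one sample
y <- rnorm(30, mean = 38, sd = 2)   # a second sample of the same size

t.test(x, mu = 37)           # 1. single sample: is the mean of x equal to 37?
t.test(x, y)                 # 2. independent: are the means of x and y equal?
t.test(x, y, paired = TRUE)  # 3. dependent: same question, but observations are paired
```

Each call returns an object you can print or inspect, as the rest of this lesson demonstrates.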
Single sample t-tests
T-test for One Sample
You can use the t.test() function to perform a hypothesis test on the mean. First you create a null hypothesis, and then determine whether the data give you evidence to reject it. For this example, you will use the frostedflakes data set. Import this data, and then you should be able to access the frostedflakes dataset:

head(frostedflakes)
Lab IA400
1 36.3 35.1
2 33.2 35.9
3 39.0 40.1
4 37.3 35.5
5 40.7 37.9
6 38.4 39.5
This is a data frame with 2 variables: Lab, which contains the percentage of sugar measured in a 25 gram sample of Frosted Flakes using a laboratory high-performance liquid chromatography technique, and IA400, which contains the percentage of sugar in the same sample measured by a machine (the Infra-Analyzer 400).
According to the nutritional information supplied with Frosted Flakes, the sugar percentage by weight is 37%. I will create a hypothesis test to see if the data set provides evidence to the contrary. To set up the hypothesis test, define the null and alternate hypotheses as follows:

H0: mu = 37
Ha: mu ≠ 37
I will now see if my data provides evidence that I should reject the null hypothesis by using a t-test. In R, I will use the following commands:

t_obj <- t.test(frostedflakes$Lab, mu = 37)
print(t_obj)
I will save the object returned by t.test() in the variable t_obj, then print this object. The arguments for t.test() are the name of the dataset, frostedflakes, followed by the variable I am testing in the dataset, Lab, followed by the argument mu=. Remember that mu is the hypothesized population mean, so it is the number I want to test against. In this case, I want to see if the sample sugar percentage is 37% or not, so 37 will be my mu= value here. This code yields the following output:
One Sample t-test
data: frostedflakes$Lab
t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal
to 37
95 percent confidence interval:
37.10642 38.08558
sample estimates:
mean of x
37.596
The object includes plenty of information. For my current purposes of doing a hypothesis test, I will focus on the following two lines:

t = 2.4155, df = 99, p-value = 0.01755
alternative hypothesis: true mean is not equal to 37

The second line confirms that I have set up the test correctly. The alternate hypothesis is that mu, the true mean, is not equal to 37. The first line gives plenty of information. The first part tells me that the t-score computed using the hypothesized mean of 37 is 2.4155. The last part gives me the p-value for this test: 0.01755.
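That t-score can also be reproduced by hand from the usual formula t = (x̄ − μ) / (s / √n). A quick sketch, assuming the frostedflakes data set is already loaded:

```r
# Recompute the t statistic by hand: t = (sample mean - mu) / (sd / sqrt(n))
x <- frostedflakes$Lab
n <- length(x)
t_manual <- (mean(x) - 37) / (sd(x) / sqrt(n))
t_manual  # should match the t = 2.4155 reported by t.test()
```

Seeing the formula spelled out makes it clear why a sample mean far from 37 (relative to the standard error) produces a large t-score.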
The p-value is an indication of whether the data provide evidence that I should reject the null hypothesis. The smaller the p-value, the stronger the evidence that I should reject the null hypothesis.
Generally, data scientists compare the p-value to a
threshold of 0.05 to determine whether to reject the null
hypothesis; in this case, the p-value is 0.01755, which is
smaller than 0.05, so I should reject the null hypothesis. I
have decided that the percentage of sugar in frosted
flakes is not equal to 37%.
The fact that the mean of the measured values is 37.596 gives me some evidence that the percentage of sugar is somewhat higher than 37%.
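The comparison against the 0.05 threshold can also be done in code rather than by eye, since the p-value is stored in the returned object. A small sketch, assuming the t_obj object from above:

```r
# The htest object stores the p-value; compare it to the usual 0.05 threshold
alpha <- 0.05
if (t_obj$p.value < alpha) {
  message("Reject the null hypothesis: mean sugar percentage differs from 37%")
} else {
  message("Fail to reject the null hypothesis")
}
```

This is handy when running many tests in a script, where reading each printout by hand is impractical.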
I can also see a graphical interpretation of this test. In the plot below, there is a histogram of the data with the 95% confidence interval computed by t.test() in red; I can also show the value of mu for the null hypothesis in green. I have seen the first three lines of the code below before; it's standard for histograms. But the last three lines are new. The function geom_vline() plots vertical lines on the graph, and they are plotted at the values of the xintercept= argument. You can pull these directly from the t.test() result by providing the object name followed by the name of the information in the t.test() output: conf.int[1] is the lower confidence limit, conf.int[2] is the upper confidence limit, and null.value is the hypothesized mean.
d <- ggplot(frostedflakes, aes(x = Lab))
d + geom_histogram(binwidth = 1) +
  geom_vline(xintercept = t_obj$conf.int[1], color = "red") +
  geom_vline(xintercept = t_obj$conf.int[2], color = "red") +
  geom_vline(xintercept = t_obj$null.value, color = "green")
This code will yield this image:
Note that the value of mu for the null hypothesis (which is 37) is not in the 95% confidence interval. Roughly speaking, this indicates that the data are inconsistent with a true mean of 37 at the 5% significance level. This tells you that you have strong evidence to reject the null hypothesis, which states that the true value of the mean is 37.
When using the t-test, it is always a good idea to check whether your data have an approximately normal distribution. You can check whether the Lab variable in the frostedflakes data set is normal by creating the normal probability plot for it:

ggplot(frostedflakes, aes(sample = Lab)) + geom_qq()
The data fall pretty much on a straight line, so I can
conclude that they come from a normal distribution and
the hypothesis test is built on solid assumptions. This can
also be seen in the histogram above; it looks
approximately bell shaped.
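If you want a numerical check to go with the visual one, base R also provides the Shapiro–Wilk normality test (not used in this lesson); its null hypothesis is that the data are normally distributed, so a large p-value is consistent with normality. A sketch, assuming frostedflakes is loaded:

```r
# Shapiro-Wilk test: the null hypothesis is that the data are normally
# distributed, so a p-value above 0.05 is consistent with normality
shapiro.test(frostedflakes$Lab)
```

Like the QQ plot, this is a check on the t-test's assumptions rather than part of the t-test itself.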
Independent t-tests
I will use the frosted flakes data again. Suppose I now want to determine if the measurements made by the IA-400 give us the same average values as the lab measurements. To do this, I will think of the measurements in the data set as coming from two populations: one population is the lab measurements, and the other is the IA-400 measurements.
Define the null and alternative hypotheses as follows:

H0: mu1 = mu2
Ha: mu1 ≠ mu2

The null hypothesis is that the two population means are equal, and the alternative hypothesis is that the two population means differ. This is a two-sided t-test.
I will use t.test() to create a test on the two sample means; in this case, each sample is an argument to t.test(). As before, you save the object returned by t.test() into a variable (this one is named t_ind), then print that object. There are two other arguments included in t.test() as well. The first is alternative=. My options are "two.sided" for a two-tailed hypothesis test, or "greater" and "less" for a one-tailed test, each for the particular direction I am hypothesizing for the one-tailed test. The last argument is var.equal=, and the options for this are only TRUE or FALSE. This is very similar to the different independent t-test options I receive in MS Excel; one type is for when I have homogeneity, or equal variance, and the other is for when I have heterogeneous (unequal) variance. If I don't know, or don't want to bother to find out, I just make sure to use the FALSE option. If I were to leave off the var.equal= argument altogether, the FALSE option would be the default.
t_ind <- t.test(frostedflakes$Lab, frostedflakes$IA400,
                alternative = "two.sided", var.equal = FALSE)
print(t_ind)
The code above yields the following output. I get a
reminder that I have chosen unequal variance,
because this is a
Welch Two Sample t-Test
:
Welch Two Sample t-test
data: frostedflakes$Lab and frostedflakes$IA400
t = -1.6699, df = 195.08, p-value = 0.09654
alternative hypothesis: true difference in means
is not equal to 0
95 percent confidence interval:
-1.3565905 0.1125905
sample estimates:
mean of x mean of y
37.596 38.218
The output here reminds me what data I ran (handy if I am running lots of tests in quick succession!) and then provides information about the t-test itself, including the t and p values associated with the test and the degrees of freedom. Remember that if the p-value is less than 0.05, you can reject the null hypothesis, and there is a significant difference between the groups. Since your p-value is 0.09654, which is greater than 0.05, there is no statistically significant difference between the sugar content measured in the lab and with the machine.
The next line helps you determine that you have set up the test correctly. Your alternate hypothesis is that the two population means are not equal; if they are not equal, the difference of one subtracted from the other will not be zero.
Graphing Data for an Independent t Test
You can see graphically what is going on in this test by creating a box plot for the Lab values and comparing it to a box plot for the IA400 values. To create this plot with ggplot(), you need the data values to be in one column of the data frame, and a label of whether it was a measurement taken in the lab or by machine in another column of the data frame. The simplest way to create such a data frame is the melt() function from the reshape2 package.
First install and then load the reshape2 package:

library(reshape2)
ff <- melt(frostedflakes, id = "X")
The result from this command is:

See how the columns are no longer labeled Lab and IA400? Now which column a number came from is denoted in the variable column, and the actual number that used to be contained within those columns is under a new column labeled value.
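If it helps to see melt() in isolation, here is a tiny toy data frame (with values borrowed from the head of frostedflakes shown earlier) being reshaped the same way:

```r
library(reshape2)

# A tiny toy version of the frostedflakes layout:
# one id column (X) and two measure columns (Lab, IA400)
toy <- data.frame(X = 1:2, Lab = c(36.3, 33.2), IA400 = c(35.1, 35.9))
melt(toy, id = "X")
#   X variable value
# 1 1      Lab  36.3
# 2 2      Lab  33.2
# 3 1    IA400  35.1
# 4 2    IA400  35.9
```

The id column is repeated for each measure variable, and the measure columns are stacked on top of each other in order.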
With your data happily reformatted, you can then proceed onto the box plot:

ggplot(ff) + geom_boxplot(aes(x = variable, y = value)) +
  xlab("Test Method") + ylab("Percentage of Sugar")
As you can see from this plot, the IA400 group has a slightly different median value than the Lab group, and the IA400 group has a larger variation, as evidenced by the longer whiskers and the larger box. Because there is not a large difference in the median values, it makes sense that the t-test would not indicate that there is strong evidence that the two means are different.
Dependent t-tests
Paired t-Test for Two Samples
Example: An instructor is required to find out how much
students learn in her course. At the beginning of the
course, she randomly selects a group of students in her
course, and gives each student in this group a pre-test
over the class material and records their scores. At the
end of the course, she gives each student in the group the
same test as a post-test.
She wants to see if the difference in scores is statistically
significant; in other words, she wants to see if, on
average, each student’s scores improved. In this case,
you have two populations: her class at the beginning of
the course and her class at the end of the course. The
populations are made up of the same people, but
hopefully their level of expertise is different at the end of
the course than at the beginning.
When there is a one-to-one relationship between
elements of two populations, a dependent, or paired, t-
test is appropriate to determine whether the means of the
two populations are different. In this case, you will call the
students at the beginning of the course Population 1, and
the students at the end of the course Population 2. Your
hypothesis test looks the same as in the previous section:
The difference, however, is that the two samples are
paired.
The scores for the tests are in
this file
.
Read it into a data frame using code or the wizard and then store this data frame in the variable scores as follows:

scores <- read.csv("scores.csv")
head(scores)
Here’s the head of the data:

  Student.Number pretest postest
1              1      13      24
2              2      10      17
3              3      11      22
4              4      14      21
5              5      16      25
6              6      10      23
This shows you that the data frame has three variables:
Student.Number
,
pretest
, and
postest
.
pretest
and
postest
contain the beginning and ending test scores.
Use t.test() to perform the test on the two samples. The crucial part of this code, which sets it apart from its single sample and independent fellows, is the paired= argument. It must be set to TRUE, or an independent t-test will be run instead. Here is the code:

t_dep <- t.test(scores$postest, scores$pretest, paired = TRUE)
t_dep
And the output that is provided by R:

Paired t-test

data: scores$postest and scores$pretest
t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 7.411558 9.235501
sample estimates:
mean of the differences
               8.323529
The meat and potatoes of this output is really these two lines:

t = 18.569, df = 33, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
The second line verifies that the alternative hypothesis is
that the difference in the population means is not equal to
zero, or equivalently, that the two population means are not equal to each other. The first line gives you the p-value; you reject the null hypothesis if the p-value is sufficiently small. The p-value for this test is so small that it is given in scientific notation; written as a decimal, it is 0.00000000000000022. This is a very small number, and in particular, it is much smaller than 0.05, so you can say that the data give you strong evidence that you can reject the null hypothesis and that the two population means are not equal.
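One way to see why the paired= argument matters: a paired t-test is mathematically the same as a single sample t-test on the per-student differences. A self-contained sketch with simulated scores (hypothetical, not the lesson's scores.csv):

```r
# Simulated pre/post scores (hypothetical; not the lesson's scores.csv)
set.seed(42)
pre  <- round(rnorm(34, mean = 13, sd = 2))
post <- pre + round(rnorm(34, mean = 8, sd = 2))

paired_test <- t.test(post, pre, paired = TRUE)  # dependent t-test
single_test <- t.test(post - pre, mu = 0)        # one-sample test on differences

# The two tests give identical statistics and p-values
all.equal(paired_test$p.value, single_test$p.value)
```

This is why the paired test is so much more powerful here than an independent test would be: it works with each student's own improvement rather than with the two pooled groups.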
Graphing Data for a Dependent t Test
You will graph the dependent t-test data the same way you did the independent t-test data. First you'll need to reshape the data using melt(). You'll use an extra argument to melt() because there are more than 2 variables present in your dataset. When you did this for the frostedflakes dataset, there were only two measurement columns, so R had no confusion about which ones to move around. But here, with the added variable of Student.Number, you need to let R know that you want to fill the variable column with pretest and postest. This is done using the measure.vars= argument.
library(reshape2)
ss <- melt(scores, measure.vars = c("pretest", "postest"))
And here is how the data ends up being formatted:

See how Student.Number is untouched, but you now have those same pre-determined columns of variable and value? Now you're ready to plot your data. You can make box plots of the pretest and postest data values as follows:
ggplot(ss) + geom_boxplot(aes(x = variable, y = value)) +
  xlab("Test") + ylab("Score")
And here is the result! You can see that rejecting the null
hypothesis was clearly the right decision. The median of
the postest scores is much higher than the median of the
pretest scores.
Checking for t-test assumption of normality
Graphing the Difference Scores
Another way to graph these data is to compute the difference between the postest score and the pretest score for each student, and create a histogram and/or QQ plot of these differences. You can do this as follows:
dd <- scores$postest - scores$pretest
df <- data.frame(dd)
ggplot(df, aes(x = dd)) +
  geom_histogram(binwidth = 1) +
  xlab("Difference between postest and pretest")
The first part of the code subtracts the pretest scores from the postest scores, and then the result is turned into a data frame with the command data.frame(). Then it's business as usual for a histogram, with this result:
From this histogram of differences, you can see that the
post-test score for most students was between 5 and 13
points higher than the pretest score.
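The heading above also mentions a QQ plot; the same difference scores can be checked for approximate normality that way. A sketch, assuming the df and dd objects from the code above:

```r
# Normal QQ plot of the difference scores; points falling near the reference
# line suggest the differences are approximately normally distributed
ggplot(df, aes(sample = dd)) +
  geom_qq() +
  geom_qq_line()
```

For a paired t-test, it is the differences (not the raw pretest and postest scores) whose normality matters, which is why the check is done on dd.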
Summary
t-tests are quite handy for when your sample size is relatively small and you are looking to determine a difference between means. The type of t-test depends on the means you want to compare. If you are comparing a sample to a population mean, then choose a single sample t-test. If you are comparing two unrelated samples, then choose an independent t-test. And if you are comparing two related samples, then you'll want to choose a dependent t-test.
R makes t-testing very easy with the function t.test(),
which can handle all three types. It also provides you with
means and confidence intervals for your t-tests, which is a
big help.