Entity Academy Lesson 6 Statistical Inference
Sindy Saintclair
Monday, November 28 2021
Lesson 6 – Statistical Inference
Learning Objectives and Questions
Notes and Answers
Sampling Methods
Balance Accuracy with Practicality
Sampling – when you take a subset of the population and make assertions about the entire population by just observing a small subset of that population.
- population: everyone (not as practical)
- sample: some (more practical)
The risk involved with this is that, since you are purposely excluding most of the households in the city, you are obtaining incomplete information. You can take multiple samples and they will all be slightly different, with no way to tell which one is the most accurate.
The advantage is that my workload is dramatically decreased if I decide to sample.
Sample Size
Sample size is referred to as n. For instance, if you
talk to 30 people out of a larger group, this is a
sample size of 30, or n=30.
Simple Random Sample
Everyone has an equal chance of being selected
- drawing names out of a hat
- scientists will assign people a number and will have Excel or Python select someone
Cluster Sampling
Randomly select a group, not an individual
Example
Stratified Sampling – usually demographic in nature
Population: 10,000 (Females: 20% - 2,000; Males: 80% - 8,000)
Sample: 100 (Females: 20% - 20; Males: 80% - 80)
Systematic Sampling
Convenience Sampling
Just sampling those that are easy to sample, for
instance only the people closest to you, the
smallest file, or the beginning of the alphabet—
usually what’s convenient for the scientist.
Sample Size: Number of People in the Sample
- How many people do you need in your sample?
Enough to represent the population accurately
Not so many as to be impractical
Simple Random Sampling
Often referred to as SRS, every potential candidate for data collection has the same probability of being chosen as every other candidate for data collection. Whoever gets selected is random.
The best way to do this is to assign all candidates a
unique number, and then have a random number
generator select which of those candidates should
be a part of the sample. This method provides the
best samples but may not always be done because
it can be logistically difficult or expensive.
Sampling Method Examples
I work for a healthcare provider. My team is tasked by a federal agency to add to the knowledge base of 10th grade students by collecting medical information. The population is all 10th grade students in the state of Ohio. You will collect height, weight, and hearing test data for each child sampled.
Simple Random Sampling Example
Each 10th grader in Ohio is assigned a number from 1 to 38,559 (that is the number of 10th graders in the state). A random number generator is used to select 3000 of these numbers, with no repeats. Your team will spread out across the state to meet with each of these 3000 students and collect the health data you need.
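The selection step in this example can be sketched in a few lines of Python. This is only an illustration: the function name and the seed are assumptions, and the only figures taken from the notes are the 38,559 students and the sample of 3000.

```python
import random

N_STUDENTS = 38_559   # number of 10th graders in Ohio, from the example
SAMPLE_SIZE = 3_000   # desired sample size, from the example

def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted student numbers drawn from 1..population_size with no repeats."""
    rng = random.Random(seed)  # seed is optional; it only makes the draw repeatable
    # random.sample draws without replacement, so every number appears at most once
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

chosen = simple_random_sample(N_STUDENTS, SAMPLE_SIZE, seed=42)
print(len(chosen), chosen[:5])
```

Every student number has the same chance of landing in `chosen`, which is what makes this a simple random sample.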
Cluster Sampling
Occurs when the population is broken down into
groups based on some sort of information such as
location, age, or gender. Usually demographic in
nature, but not always. Then, a few of the groups,
called clusters, are randomly selected, and within
each of the chosen clusters, you sample at 100%, which means that everyone in a chosen cluster becomes part of the sample, and everyone in a non-chosen cluster is not part of the sample.
Example – Rather than looking at individual students as sampling candidates, you decide to look at each school as a sampling candidate. There are 270 high schools in Ohio, so you randomly select 18 of these high schools and then sample every 10th grader at each of these 18 high schools. There will be no data collected from the 10th graders at the high schools that weren’t selected.
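A rough Python sketch of this cluster example follows. The roster is invented purely for illustration (school ids, student ids, and enrollment sizes are all assumptions); only the 270 schools and the 18 selected clusters come from the notes.

```python
import random

N_SCHOOLS = 270   # high schools in Ohio, from the example
N_CLUSTERS = 18   # schools to select, from the example

def cluster_sample(students_by_school, n_clusters, seed=None):
    """Randomly pick whole schools, then take every student in each picked school."""
    rng = random.Random(seed)
    chosen_schools = rng.sample(sorted(students_by_school), n_clusters)
    sample = [s for school in chosen_schools for s in students_by_school[school]]
    return chosen_schools, sample

# Hypothetical roster: school id -> list of 10th grader ids (made-up sizes)
roster_rng = random.Random(0)
students_by_school = {
    school: [f"school{school}-student{i}" for i in range(roster_rng.randint(50, 300))]
    for school in range(1, N_SCHOOLS + 1)
}

chosen_schools, sample = cluster_sample(students_by_school, N_CLUSTERS, seed=1)
print(len(chosen_schools), len(sample))
```

Note that the randomness is applied to schools, not to students, so the final sample size depends on which schools happen to be drawn.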
Stratified Sampling
Occurs when the population is divided into strata, usually based on demographic characteristics, such as gender, age, or education level, and then within each stratum candidates are randomly chosen. The strata are sampled according to their relative size within the population. For example, if the largest stratum has 5x as many candidates as the smallest stratum, then the final sample should have five times as many sampled candidates from the large stratum as from the small stratum.
In this image, the population has been stratified or
broken into groups based on a particular
characteristic (color). However, unlike cluster
sampling, you don’t randomly select the strata and sample everyone in them. Instead, you will randomly sample within the strata at a rate that is proportionate to how large the strata are. In this image, since the purple stratum is the largest, you will randomly select more people from that stratum than from either the red or the green strata, which are smaller.
Example – There are some huge high schools in Ohio (3500+ students), and there are some smaller schools and charter schools with as few as 50 students. I break the schools into three different strata: schools with 1500 or more students, schools with 400 to 1499 students, and schools with 399 or fewer students. I will label these schools as big, medium, and small schools.
The big schools are where 45% of Ohio 10th grade students are going to school, with 38% going to medium-sized schools, and 17% going to the small schools. I want my total sample size to be 3000 students, so I randomly select 3000 x 45% = 1350 students from the big schools, 3000 x 38% = 1140 students from the medium schools, and 3000 x 17% = 510 students from the small schools.
This way, I have broken down the population into strata and sampled from each stratum proportionally so that each is fairly represented in the final sample of 3000 students.
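The proportional allocation in this example can be checked with a short Python sketch. The stratum shares and the total of 3000 come from the notes; the student roster is a made-up placeholder (its sizes are simply the 38,559 students from the earlier example split by the same shares), and the function names are assumptions.

```python
import random

TOTAL_SAMPLE = 3000
N_STUDENTS = 38_559
# Share of Ohio 10th graders in each stratum, from the example
STRATUM_SHARES = {"big": 0.45, "medium": 0.38, "small": 0.17}

def proportional_allocation(total, shares):
    """Students to sample from each stratum, proportional to the stratum's share."""
    return {name: round(total * share) for name, share in shares.items()}

def stratified_sample(students_by_stratum, allocation, seed=None):
    """Draw a simple random sample of the allocated size inside each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(students_by_stratum[name], size)
        for name, size in allocation.items()
    }

allocation = proportional_allocation(TOTAL_SAMPLE, STRATUM_SHARES)
print(allocation)  # {'big': 1350, 'medium': 1140, 'small': 510}

# Hypothetical roster: made-up student ids, sized by the same shares
students_by_stratum = {
    name: [f"{name}-{i}" for i in range(round(N_STUDENTS * share))]
    for name, share in STRATUM_SHARES.items()
}
sample = stratified_sample(students_by_stratum, allocation, seed=7)
print({name: len(ids) for name, ids in sample.items()})
```

With other shares, rounding each stratum separately may leave the total one or two off the target, so the allocation is worth checking before sampling.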
Systematic Sampling
Systematic sampling happens when a start point is chosen at random, and then every nth item is selected after that. This ultimately makes the sampling as a whole no longer random.
Example
Each 10th grader is assigned a number again, and then a random number between 1 and 13 is generated by a random number generator. Suppose you generate the number 10. Then, every 13th student after that is selected. In this case, that would be student number 10, 23, 36, 49, 62, etc. Sampling in this manner will get you pretty close to a sample size of 3000, because 38,559 / 13 is roughly 2,966.
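A minimal Python sketch of this systematic example is below. It assumes the 38,559 numbered students from the simple random sampling example; the function name and its parameters are illustrative only.

```python
import random

N_STUDENTS = 38_559   # numbered 10th graders, from the earlier example
STEP = 13             # select every 13th student

def systematic_sample(population_size, step, start=None, seed=None):
    """Pick a start in 1..step (random unless given), then take every step-th number."""
    if start is None:
        start = random.Random(seed).randint(1, step)
    return list(range(start, population_size + 1, step))

sample = systematic_sample(N_STUDENTS, STEP, start=10)  # start of 10, as in the example
print(sample[:5])   # [10, 23, 36, 49, 62]
print(len(sample))  # 2966
```

Only the start point is random here. Once it is fixed, every other selection is determined, which is why the sample as a whole is not a random one.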
Convenience Sampling
Convenience sampling occurs when someone makes little or no effort to randomly sample, and they collect the information that is easiest to get. Convenience sampling is usually the most biased type of sampling, but unfortunately, the most common.
Example
Right across the street from your facility is a high school. You send a team member over there to measure all of the 10th graders at that school. One of your team members has a spouse that teaches high school, so tomorrow she goes with her husband to the high school and measures all of the 10th grade students there.
Another team member has a cousin who is the school nurse at a high school in another city, so you ask the cousin to collect data from the 10th grade students at that high school, and so forth. It turns out that for some of the high schools, there was a volleyball tournament, so some of the 10th graders were missing because they were at the tournament, and another high school had a bunch of students missing because of band camp, but you just don’t care.
Choosing a Sampling Method
Dependent on one’s sampling goals and what is the most practical. Most academics believe that convenience sampling should be avoided whenever possible. When it cannot be avoided, the trick is to approach the convenience sampling in a thoughtful rather than a ‘lazy’ manner and to incorporate some aspects of randomization within the constraints of the convenience sample.
For example, a manufacturer is interested in how
much of their output gets returned for being
defective. Their goal is to sample all returned
materials and then read through the description of
reasons for the return, because some of it gets
returned for defects, but there are many other
reasons for returns.
Most of the returns data comes through vendors. Some vendors won’t share the information, some don’t collect usable data, and some collect plenty of data but in languages the manufacturer does not know. The manufacturer therefore carefully selects the vendors that provide useful data, and then uses some randomization to look at returns for different products, from different vendors, and at different times of the year. They purposely ignore a lot of data because it just isn’t useful to them.
Systematic sampling doesn’t make a bunch of sense in the 10th graders example, because you haven’t made the sampling any easier than simple random sampling would be. All you have done is systematically selected students instead of randomly picking them. However, if you were auditing quality on a manufactured item such as a wireless keyboard, and you are standing in the warehouse to select the keyboards you will test, doing systematic sampling makes perfect sense because you have shelves and shelves of finished product waiting to be shipped. Pulling out every 40th keyboard is no problem at all.
Why Sampling and Analysis?
Why Should You Sample?
- takes a lot of time and effort to reach an entire population
- not every person will consent in a
population anyway
- because you want to know something about a bigger set, usually called a population
- helps you develop estimates of what is going on with the whole population, and usually can provide you with a confidence interval around that estimate
- balance rigor and feasibility
Why Analyze Data?
From a practicality perspective:
- is 1.1 really bigger than 1?
- is 2.5 really bigger than 2.3?
Hypothesis Testing
Statistical inference has been called the science of
comparison. Mathematics has a lot of branches,
and most of them are academic and theoretical.
Null: H0, equal
Alternative: Ha or H1, not equal, greater than, or less than
Steps of Hypothesis Testing
- create a hypothesis
- calculate a test statistic
- calculate the probability – p value
- determine alpha
- decision rule
Determine Alpha (α)
α Level | Accuracy (confidence level) | Error (chance of rejecting a true null)
α = 0.01 | 99% | 1%
α = 0.05 | 95% | 5%
α = 0.10 | 90% | 10%
Alpha is a pre-determined number that represents how often you are willing to draw a conclusion from the data but be wrong. I will compare my p value with the alpha value to determine whether something is significant or not.
Rigor
The most typical value for alpha is 0.05. This means that I am willing to accept a 5% chance of rejecting the null hypothesis when it is actually true. It does not mean there is a 95% chance that any particular result is correct. On the more rigorous side, I may also see alpha = 0.01, meaning that I am willing to accept only a 1% chance of that error. And on the less rigorous side, from time to time I may see an alpha of 0.10, meaning that I am willing to accept a 10% chance of that error.
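One way to see what alpha controls is to simulate many tests in which the null hypothesis really is true and count how often it gets rejected anyway. The sketch below is an illustration under assumptions that are not in the notes: it uses a two-sided z test on made-up normal data with a known standard deviation of 1, and the function name, trial count, and sample size are arbitrary.

```python
import random
from statistics import NormalDist, mean

def false_rejection_rate(alpha, n_trials=2000, n=30, seed=0):
    """Simulate tests where the null is true; return the share that reject it."""
    rng = random.Random(seed)
    std_normal = NormalDist()  # standard normal distribution
    rejections = 0
    for _ in range(n_trials):
        # The null (mean = 0) is true by construction: data are normal(0, 1)
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = mean(sample) / (1 / n ** 0.5)            # z statistic, known sd = 1
        p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided p value
        if p_value < alpha:
            rejections += 1
    return rejections / n_trials

for alpha in (0.01, 0.05, 0.10):
    print(alpha, false_rejection_rate(alpha))
```

The printed rejection rates should land close to the chosen alpha in each case, since alpha is the long-run rate of rejecting a null that is actually true.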
Creating a Hypothesis
Statistical inference is used to test hypotheses. But to test a hypothesis, it has to be created. The most important part of creating a hypothesis is to remember that the null hypothesis must contain the ‘=’ sign, either explicitly or implicitly.
Example #1) You have invented a new drug to treat hypertension (HTN). You are sure your drug is better than the most commonly used current drug, but as you develop a hypothesis, you hypothesize that the two drugs are exactly the same in effectiveness. What you really want to do is prove yours is better, but you will start by assuming they are both equally good.
H0: new drug = baseline drug
Ha: new drug > baseline drug
Example #2) You are taking a survey and you believe that the crab cakes sold at “The Shack” are better than the crab cakes sold at “Seafood Supreme.” Your hypothesis should be that both crab cakes are equally desirable to customers, and you hope to use data to disprove the hypothesis.
H0: The Shack = Seafood Supreme
Ha: The Shack > Seafood Supreme
Example #3) You want to compare baseball players from 2 different eras, and you think one is better than the other. Your hypothesis is that they are equally skilled, and then you hope to use data to disprove that hypothesis.
H0: Player A = Player B
Ha: Player A ≠ Player B
Example #4) You want to collect some data to see if movie preference is dependent on gender. Your hypothesis is that movie preference is the same for both genders, and then you collect data to see if the hypothesis should be rejected or not.
H0: movie preference is independent of gender
Ha: movie preference depends on gender
Calculating a Test Statistic
Assuming that the null hypothesis is true (which is the starting assumption in every case of hypothesis testing), you calculate a test statistic. You need to know the distribution of the test statistic under the assumption that the null hypothesis is true. Then you can calculate the probability.
Calculating the Probability
You have already done this a bunch, but now it is formalized. If you know what a distribution looks like, you can calculate the probability of getting a value at least as extreme as (greater than or less than, depending on the alternative hypothesis) the test statistic value. The statistical term for this probability is the p value.
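As a worked illustration of a test statistic and its p value, here is a short Python sketch of a one-sample z test. The notes do not name a specific test, so the z test, the function name, and every number in the example call are assumptions chosen only to show the mechanics.

```python
from statistics import NormalDist

def z_test(sample_mean, null_mean, sd, n, alternative="two-sided"):
    """Return (z, p) for a one-sample z test with known standard deviation sd."""
    # Test statistic: how many standard errors the sample mean is from the null mean
    z = (sample_mean - null_mean) / (sd / n ** 0.5)
    std_normal = NormalDist()  # distribution of z when the null is true
    if alternative == "greater":
        p = 1 - std_normal.cdf(z)             # P(Z >= z)
    elif alternative == "less":
        p = std_normal.cdf(z)                 # P(Z <= z)
    else:
        p = 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    return z, p

# Made-up numbers: null mean 100, sample mean 103, sd 15, n = 100
z, p = z_test(sample_mean=103.0, null_mean=100.0, sd=15.0, n=100, alternative="greater")
print(round(z, 2), round(p, 4))  # 2.0 0.0228
```

The `alternative` argument mirrors the three forms of the alternative hypothesis: greater than, less than, or not equal.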
Decision Rule
If the p value is less than alpha, I should reject the null hypothesis in favor of the alternative hypothesis. However, if the p value is greater than alpha, I fail to reject the null hypothesis. These notes also call this accepting the null, although strictly it only means the evidence was not strong enough to reject it.
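The decision rule is simple enough to write as a small Python function. This is a sketch with a made-up p value; the function name is hypothetical, and treating a p value exactly equal to alpha as a failure to reject is an assumption, since the notes only cover the less-than and greater-than cases.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject the null only when p is below alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    # Assumption: p exactly equal to alpha counts as failing to reject
    return "reject the null" if p_value < alpha else "fail to reject the null"

p_value = 0.03  # made-up p value for illustration
for alpha in (0.01, 0.05, 0.10):
    print(alpha, decide(p_value, alpha))
```

The same p value of 0.03 leads to rejection at alpha = 0.05 and 0.10 but not at the more rigorous alpha = 0.01, which is why alpha has to be chosen before looking at the result.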
Type I and Type II Error
When trying to make inferences on a population from a sample, there will always be an amount of error no matter what you do.
Type I (α) – alpha, the probability that I reject the null hypothesis when it is actually true
- reject the null when you should have accepted it
Type II (β) – beta, the probability that I accept the null hypothesis when it is actually false
- accept the null when you should have rejected it
 | Reject Null | Accept Null
Null True | Type I | Correct
Null False | Correct | Type II
ART-BAF stands for:
A: Alpha
R: Reject
T: True
B: Beta
A: Accept
F: False
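The ART-BAF grid (Alpha: Reject a True null; Beta: Accept a False null) can be written as a small lookup in Python. The function name and return strings are hypothetical; the four outcomes are the ones in the notes.

```python
def classify_outcome(null_is_true, rejected_null):
    """Map the truth of the null and the decision made to the ART-BAF outcome."""
    if null_is_true and rejected_null:
        return "Type I error (alpha)"   # A-R-T: Alpha, Reject, True
    if not null_is_true and not rejected_null:
        return "Type II error (beta)"   # B-A-F: Beta, Accept, False
    return "Correct"

for null_is_true in (True, False):
    for rejected_null in (True, False):
        print(null_is_true, rejected_null, classify_outcome(null_is_true, rejected_null))
```

In practice the first argument is never known for real data, which is why the error rates are discussed as probabilities rather than observed directly.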
EXAMPLES
COURTROOM ANALOGY
The null hypothesis is not guilty because everyone is innocent until proven guilty.
If the null is false (the person is guilty), then the person will get convicted if I reject the null.
If I choose to accept the null, that would be a Type II error.
If the null is true (the person is innocent), then the person goes free if I accept the null.
If I choose to reject the null, that would be a Type I error.
Type II Error Example – if the defendant is guilty,
but the jury finds him “not guilty”, they have
committed a Type II error.
Type I Error Example – if the defendant did not
commit the crime, but the jury finds him guilty
anyway, the jury has committed a Type I Error.
A Real-Life Example of Type II Error
OJ Simpson murder trial – in short, the test doesn’t determine whether or not A = B. It determines whether there is enough convincing evidence that A and B are different. In the same way, trials don’t determine innocence or guilt, but whether the prosecution has enough convincing evidence to find the defendant guilty. Insufficient evidence leads to a verdict of not guilty.