School: Western University | Course: 1000A | Subject: Statistics | Date: Jan 9, 2024
11/2/22, 5:35 PM
Lab 8
localhost:8888/nbconvert/html/Documents/Python_examples/Lab_8/Lab 8.ipynb?download=false
Lab 8
In this lab we discuss simple random sampling, systematic sampling, and stratified random sampling.
Simple Random Sampling
random.sample: https://www.w3schools.com/python/ref_random_sample.asp
In [1]:
import numpy as np
import pandas as pd
import random
In [2]:
# Random sampling using random.sample()
# random.sample returns k elements drawn from distinct positions of a sequence. This is sampling without replacement.
# Note: from Python 3.11 on, a set is no longer accepted as the population; convert it to a list first.
random.seed(52)  # setting the seed so that we all get the same sample
names = ["Roger", "Jack", "John", "Jason", "Laura", "Mariya", "Martina", "Lauren"]
sampled_list1 = random.sample(names, 3)
print(sampled_list1)
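To make the "without replacement" point concrete, here is a minimal sketch contrasting random.sample with random.choices, which draws with replacement. The variable names without_replacement, with_replacement, and too_large_error are illustrative and not part of the lab.

```python
import random

names = ["Roger", "Jack", "John", "Jason", "Laura", "Mariya", "Martina", "Lauren"]

random.seed(52)
# Without replacement: three distinct positions, so no name can repeat here.
without_replacement = random.sample(names, 3)
# With replacement: each draw is independent, so a name may appear more than once.
with_replacement = random.choices(names, k=3)

print(without_replacement)
print(with_replacement)

# Asking for more items than the population holds is only an error without replacement.
try:
    random.sample(names, 9)
    too_large_error = False
except ValueError:
    too_large_error = True
print(too_large_error)
```

Because random.choices can repeat elements, it would also accept k larger than the length of the list, whereas random.sample raises a ValueError in that case.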
In [3]:
# if we change the seed, a different sample is obtained
random.seed(17)
sampled_list2 = random.sample(names, 3)
print(sampled_list2)
In [4]:
# an alternative way to sample is to assign a number to each name (or ID) in the list and then sample the numbers
# generating 8 consecutive numbers corresponding to the positions of each name in the list:
sequence_numbers = list(range(8))
print(sequence_numbers)  # Python indexing starts at zero
random.seed(17)  # same seed as before to obtain the same sample of names as in the code cell above
sample_list3 = random.sample(sequence_numbers, 3)  # randomly sampling 3 number positions
print(sample_list3)
In [5]:
[names[i] for i in sample_list3]  # getting the names corresponding to the sampled number positions
In [6]:
# Getting a sample array from a multidimensional array
array = np.array([[2, 5, 7], [5, 11, 16], [6, 13, 19], [7, 15, 22], [8, 17, 25]])
print("2D array\n", array)
In [7]:
random.seed(48)
random_rows = random.sample(range(5), 2)  # randomly selecting two row indices, without replacement
print(random_rows)
In [8]:
array[random_rows, :]
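The same row selection can also be sketched with NumPy's own random generator instead of the random module. This is an alternative, not the lab's method; the names rng, row_indices, and sampled_rows are illustrative, and because NumPy uses a different random stream, seed 48 is not expected to reproduce the rows chosen above.

```python
import numpy as np

array = np.array([[2, 5, 7], [5, 11, 16], [6, 13, 19], [7, 15, 22], [8, 17, 25]])

# Seeded generator so the result is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(48)

# Draw two distinct row indices from 0..4 (replace=False means without replacement).
row_indices = rng.choice(array.shape[0], size=2, replace=False)

# Use the sampled indices to pull out the corresponding rows.
sampled_rows = array[row_indices, :]
print(row_indices)
print(sampled_rows)
```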
Output of In [2]:
['Laura', 'Roger', 'Mariya']
Output of In [3]:
['Martina', 'Lauren', 'John']
Output of In [4]:
[0, 1, 2, 3, 4, 5, 6, 7]
[6, 7, 2]
Out[5]:
['Martina', 'Lauren', 'John']
Output of In [6]:
2D array
 [[ 2  5  7]
 [ 5 11 16]
 [ 6 13 19]
 [ 7 15 22]
 [ 8 17 25]]
Output of In [7]:
[4, 2]
Out[8]:
array([[ 8, 17, 25],
       [ 6, 13, 19]])
Systematic Sampling
Systematic sampling is a type of sampling where we obtain a sample by going through a list of the population at fixed intervals from a randomly chosen starting point.
np.arange: https://numpy.org/doc/stable/reference/generated/numpy.arange.html
In [9]:
# Let's assume we are interested in sampling from a population of 15 students with the following ID list:
df_students = pd.DataFrame({'ID': np.arange(1, 16).tolist()})
df_students
In [10]:
# Defining the function for systematic sampling
def systematic_sampling(df, starting_index, step):
    indices = np.arange(starting_index, len(df), step=step)
    systematic_sample = df.iloc[indices]
    return systematic_sample
In [11]:
# Obtaining a systematic sample of size 5
# The sampling interval is 15/5 = 3, so choose one of the first 3 IDs on the list at random and then every 3rd ID after that.
random.seed(68)
random_start = random.randint(0, 2)  # randint includes both endpoints: 0, 1, or 2
print(random_start)
# another way
# random.seed(68)
# random_start = random.sample(range(3), 1)[0]  # [0] extracts the integer from the one-element list
# print(random_start)
In [12]:
systematic_sample = systematic_sampling(df=df_students, starting_index=random_start, step=3)
systematic_sample
# recall that Python starts at position 0, so position 2 corresponds to ID = 3
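The interval and the random start can also be derived from the desired sample size in one place. The following is a minimal sketch under the assumption that the population size is an exact multiple of the sample size; systematic_sample_of_size is a hypothetical helper, not a function from the lab or from any library.

```python
import random

import numpy as np
import pandas as pd


def systematic_sample_of_size(df, n, seed=None):
    """Hypothetical helper: systematic sample of n rows, assuming len(df) is a multiple of n."""
    if len(df) % n != 0:
        raise ValueError("this sketch assumes the population size is a multiple of n")
    step = len(df) // n              # sampling interval, e.g. 15 // 5 = 3
    rng = random.Random(seed)        # private generator, leaves the global seed untouched
    start = rng.randrange(step)      # random start among positions 0 .. step-1
    return df.iloc[np.arange(start, len(df), step)]


df_students = pd.DataFrame({'ID': np.arange(1, 16).tolist()})
sample = systematic_sample_of_size(df_students, n=5, seed=68)
print(sample)
```

Whatever start is drawn, the sample contains exactly 5 students whose IDs are 3 apart.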
Out[9]:
    ID
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
9   10
10  11
11  12
12  13
13  14
14  15
Output of In [11]:
2
Out[12]:
    ID
2    3
5    6
8    9
11  12
14  15
Stratified Random Sampling
Another type of sampling is stratified random sampling, in which a population is split into groups and a certain number of members from each group are randomly selected to be included in the sample.
Stratified Random Sampling Using Counts
In [13]:
# Suppose we have the following dataframe containing the IDs of 8 students from 2 different undergrad programs.
# This is our population list.
df = pd.DataFrame({'ID': np.arange(1, 9).tolist(), 'program': ['Stats']*4 + ['Math']*4})  # 4 students in Stats, 4 students in Math
df
In [14]:
# random sample of 2 Stats students
df_Stats = df[df['program'] == 'Stats']
df_Stats
random_rows = random.sample(range(4), 2)  # randomly selecting 2 students from the 4 Stats students in the population
print(df_Stats.iloc[random_rows])
In [15]:
# random sample of 2 Math students
df_Math = df[df['program'] == 'Math']
df_Math
random_rows = random.sample(range(4), 2)  # randomly selecting 2 students from the 4 Math students in the population
print(df_Math.iloc[random_rows])
In [16]:
# Alternative way using one line of code and additional Python functions
DataFrame.groupby: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
DataFrame.apply: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
DataFrame.sample: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
In [17]:
# Stratified random sampling by randomly selecting 2 students from each program to be included in the sample
df.groupby('program', group_keys=False).apply(lambda x: x.sample(2))
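The stratified draws above do not fix a seed, so repeated runs may return different students. As a sketch of one reproducible alternative, assuming pandas 1.1 or later, the grouped object has its own sample method that accepts random_state. The seed value 1 and the names stratified and stratified_again are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Same population as in the lab: 4 Stats students and 4 Math students.
df = pd.DataFrame({'ID': np.arange(1, 9).tolist(), 'program': ['Stats']*4 + ['Math']*4})

# Draw 2 rows from each program; random_state makes the draw repeatable.
stratified = df.groupby('program').sample(n=2, random_state=1)
print(stratified)

# Running the same call again with the same random_state returns the same rows.
stratified_again = df.groupby('program').sample(n=2, random_state=1)
```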
Out[13]:
   ID program
0   1   Stats
1   2   Stats
2   3   Stats
3   4   Stats
4   5    Math
5   6    Math
6   7    Math
7   8    Math
Output of In [14]:
   ID program
3   4   Stats
2   3   Stats
Output of In [15]:
   ID program
4   5    Math
6   7    Math
Out[17]:
   ID program
4   5    Math
6   7    Math
0   1   Stats
1   2   Stats
Stratified Random Sampling Using Proportions
In [18]:
# Now suppose we have the following population dataframe with 25% Stats and 75% Math students.
df = pd.DataFrame({'ID': np.arange(1, 9).tolist(), 'program': ['Stats']*2 + ['Math']*6})  # 2 students in Stats, 6 students in Math
df
np.rint: https://numpy.org/doc/stable/reference/generated/numpy.rint.html
In [19]:
# Stratified random sampling such that the proportion of students in each program sample
# matches the proportion of students from each program in the population dataframe
N = 4  # sample size
# Stats: 4 * 2/8 = 1 and Math: 4 * 6/8 = 3
# So, the sample must contain 1 random student from Stats and 3 from Math to maintain the population proportions
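The allocation of 1 Stats student and 3 Math students can be computed rather than worked out by hand. The sketch below multiplies each stratum's share of the population by the sample size and rounds with np.rint; stratum_sizes and allocation are illustrative names that do not appear in the lab.

```python
import numpy as np
import pandas as pd

# Population from the lab: 2 Stats students (25%) and 6 Math students (75%).
df = pd.DataFrame({'ID': np.arange(1, 9).tolist(), 'program': ['Stats']*2 + ['Math']*6})
N = 4  # total sample size

# Number of students in each program (stratum).
stratum_sizes = df['program'].value_counts()

# Proportional allocation: sample size times the stratum's share, rounded to an integer.
allocation = np.rint(N * stratum_sizes / len(df)).astype(int)
print(allocation)
```

Here the products are whole numbers, so rounding changes nothing. With other proportions the rounded counts are not guaranteed to add up to N (np.rint rounds halves to the nearest even integer), so it is worth checking the total before sampling.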
In [20]:
# random sample of Stats students
df_Stats = df[df['program'] == 'Stats']
df_Stats
random_rows = random.sample(range(2), 1)  # sampling 1 student from the 2 Stats students in the population
print(df_Stats.iloc[random_rows])
In [21]:
# random sample of Math students
df_Math = df[df['program'] == 'Math']
df_Math
random_rows = random.sample(range(6), 3)  # sampling 3 students from the 6 Math students in the population
print(df_Math.iloc[random_rows])
Alternative way:
np.rint: https://numpy.org/doc/stable/reference/generated/numpy.rint.html
In [22]:
df.groupby('program', group_keys=False).apply(lambda x: x.sample(int(np.rint(N * len(x) / len(df)))))
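Since the sample size is a fixed fraction of the population (4 of 8), the same proportional draw can be sketched by sampling that fraction from every program. This assumes pandas 1.1 or later, where the grouped object has a sample method with a frac argument; the seed value and the name proportional_sample are arbitrary.

```python
import numpy as np
import pandas as pd

# Population from the lab: 2 Stats students and 6 Math students.
df = pd.DataFrame({'ID': np.arange(1, 9).tolist(), 'program': ['Stats']*2 + ['Math']*6})
N = 4  # total sample size

# Sampling the fraction N / len(df) = 0.5 within each program keeps the population proportions:
# 0.5 * 2 = 1 Stats student and 0.5 * 6 = 3 Math students.
proportional_sample = df.groupby('program').sample(frac=N / len(df), random_state=1)
print(proportional_sample)
```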
In [ ]:
Out[18]:
   ID program
0   1   Stats
1   2   Stats
2   3    Math
3   4    Math
4   5    Math
5   6    Math
6   7    Math
7   8    Math
Output of In [20]:
   ID program
0   1   Stats
Output of In [21]:
   ID program
5   6    Math
7   8    Math
4   5    Math
Out[22]:
   ID program
4   5    Math
2   3    Math
3   4    Math
0   1   Stats