Lab08.ipynb - Colaboratory
School: University of North Texas
Course: 5502
Subject: Statistics
Date: Feb 20, 2024
Uploaded by venkatasai1999
Lab 8: Sampling and Distributions - Sampling Basketball Data
Welcome to Lab 8! In this lab we will go over the topic of sampling and distributions.
The data used in this lab will contain salary data and other statistics for basketball players from the 2014-2015 NBA season. This data was collected from the following sports analytic sites: Basketball Reference and Spotrac.
First, set up the tests and imports by running the cell below.
# Run this cell, but please don't change it.
# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *
# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
Run the cell below to load player and salary data that we will use for our sampling.
player_data = Table.read_table("player_data.csv")
salary_data = Table.read_table("salary_data.csv")
full_data = salary_data.join("PlayerName", player_data, "Name")
# The show method immediately displays the contents of a table.
# This way, we can display the top of two tables using a single cell.
player_data.show(3)
salary_data.show(3)
full_data.show(3)
Name | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
James Harden | 25 | HOU | 81 | 459 | 565 | 154 | 60 | 321 | 2217
Chris Paul | 29 | LAC | 82 | 376 | 838 | 156 | 15 | 190 | 1564
Stephen Curry | 26 | GSW | 80 | 341 | 619 | 163 | 16 | 249 | 1900
... (489 rows omitted)
PlayerName | Salary
Kobe Bryant | 23500000
Amar'e Stoudemire | 23410988
Joe Johnson | 23180790
... (489 rows omitted)
PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
A.J. Price | 62552 | 28 | TOT | 26 | 32 | 46 | 7 | 0 | 14 | 133
Aaron Brooks | 1145685 | 30 | CHI | 82 | 166 | 261 | 54 | 15 | 157 | 954
Aaron Gordon | 3992040 | 19 | ORL | 47 | 169 | 33 | 21 | 22 | 38 | 243
... (489 rows omitted)
Rather than getting data on every player (as in the tables loaded above), imagine that we had gotten data on only a smaller subset of the
players. For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.
If we want to make estimates about a certain numerical property of the population (known as a parameter, e.g. the mean or median), we may have to come up with these estimates based only on a smaller sample; the corresponding quantity computed from the sample is called a statistic. Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.
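To make the distinction between a population value and a sample estimate concrete, here is a minimal sketch using only the Python standard library rather than the datascience Table API. The ten salaries are invented for illustration and are not taken from the lab's data.

```python
import random
import statistics

# Hypothetical population of 10 salaries (millions of dollars); not the lab's data.
population = [0.5, 0.8, 1.0, 1.2, 2.0, 3.5, 4.0, 6.0, 12.0, 20.0]

# The population mean is a fixed property of the whole population.
population_mean = statistics.mean(population)

# A sample mean is an estimate: it depends on which rows happen to be drawn.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(population, 4)
sample_mean = statistics.mean(sample)

print("population mean:", population_mean)
print("sample:", sample, "sample mean:", sample_mean)
```

Changing the seed changes the sample mean but never the population mean, which is the behavior the rest of the lab explores.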
To save typing and increase the clarity of your code, we will package the analysis code into a few functions. This will be useful in the rest of the
lab as we will repeatedly need to create histograms and collect summary statistics from that data.
We've defined the histograms function below, which takes a table with columns Age and Salary and draws a histogram for each one. It uses bin widths of 1 year for Age and $1,000,000 for Salary.
def histograms(t):
    ages = t.column('Age')
    salaries = t.column('Salary')/1000000
    t1 = t.drop('Salary').with_column('Salary', salaries)
    age_bins = np.arange(min(ages), max(ages) + 2, 1)
    salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
    t1.hist('Age', bins=age_bins, unit='year')
    plt.title('Age distribution')
    t1.hist('Salary', bins=salary_bins, unit='million dollars')
    plt.title('Salary distribution')

histograms(full_data)
print('Two histograms should be displayed below')
Two histograms should be displayed below
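The age bins in histograms run to max(ages) + 2 rather than max(ages) + 1. A small sketch, assuming integer ages and using plain range in place of np.arange, shows the edges this produces; the helper name unit_width_bins and the ages are hypothetical.

```python
def unit_width_bins(values):
    """Return integer bin edges of width 1 covering every value.

    Mirrors np.arange(min(values), max(values) + 2, 1) for integer data:
    the final edge is max + 1, so the largest value gets a bin of its own.
    """
    return list(range(min(values), max(values) + 2))

ages = [19, 21, 21, 25]  # hypothetical ages
edges = unit_width_bins(ages)
print(edges)  # [19, 20, 21, 22, 23, 24, 25, 26]
```

With a stop of max + 1 the last edge would be 25 itself, and the oldest players would share a bin with the 24-year-olds.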
Question 1. Create a function called compute_statistics that takes a table containing ages and salaries and:
Draws a histogram of ages
Draws a histogram of salaries
Returns a two-element array containing the average age and average salary (in that order)
You can call the histograms function to draw the histograms!
Note: More charts will be displayed when running the test cell. Please feel free to ignore the charts.
def compute_statistics(age_and_salary_data):
    # Draw both histograms with the helper defined above, which already
    # converts salaries to millions of dollars and uses unit-width bins.
    histograms(age_and_salary_data)
    age = age_and_salary_data.column('Age')
    salary = age_and_salary_data.column('Salary')
    # Average age (years) and average salary (dollars), in that order.
    return [sum(age)/len(age), sum(salary)/len(salary)]

full_stats = compute_statistics(full_data)
full_stats
[26.536585365853657, 4269775.7662601592]
# TEST
stats = compute_statistics(full_data)
plt.close()
plt.close()
round(float(stats[0]), 2) == 26.54
True
# TEST
stats = compute_statistics(full_data)
plt.close()
plt.close()
round(float(stats[1]), 2) == 4269775.77
True
Convenience sampling
One sampling methodology, which is generally a bad idea, is to choose players who are somehow convenient to sample. For example, you might choose players from one team who are near your house, since it's easier to survey them. This is called, somewhat pejoratively, convenience sampling.
Suppose you survey only relatively new players with ages less than 22. (The more experienced players didn't bother to answer your surveys about their salaries.)
Question 2. Assign convenience_sample to a subset of full_data that contains only the rows for players under the age of 22.
convenience_sample = full_data.where(full_data.column('Age') < 22)
convenience_sample
PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
Aaron Gordon | 3992040 | 19 | ORL | 47 | 169 | 33 | 21 | 22 | 38 | 243
Alex Len | 3649920 | 21 | PHO | 69 | 454 | 32 | 34 | 105 | 74 | 432
Andre Drummond | 2568360 | 21 | DET | 82 | 1104 | 55 | 73 | 153 | 120 | 1130
Andrew Wiggins | 5510640 | 19 | MIN | 82 | 374 | 170 | 86 | 50 | 177 | 1387
Anthony Bennett | 5563920 | 21 | MIN | 57 | 216 | 48 | 27 | 16 | 36 | 298
Anthony Davis | 5607240 | 21 | NOP | 68 | 696 | 149 | 100 | 200 | 95 | 1656
Archie Goodwin | 1112280 | 20 | PHO | 41 | 74 | 44 | 18 | 9 | 48 | 231
Ben McLemore | 3026280 | 21 | SAC | 82 | 241 | 140 | 77 | 19 | 138 | 996
Bradley Beal | 4505280 | 21 | WAS | 63 | 241 | 194 | 76 | 18 | 123 | 962
Bruno Caboclo | 1458360 | 19 | TOR | 8 | 2 | 0 | 0 | 1 | 4 | 10
... (34 rows omitted)
# TEST
convenience_sample.num_columns == 11
True
# TEST
convenience_sample.num_rows == 44
True
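The effect of filtering on age can be seen on a toy dataset. The sketch below uses plain Python tuples instead of a datascience Table, and the five rows are invented for illustration; the only thing it shares with the lab is the age < 22 rule.

```python
import statistics

# Hypothetical (name, age, salary) rows; invented for illustration only.
players = [
    ("A", 19, 1_000_000),
    ("B", 21, 2_000_000),
    ("C", 24, 3_000_000),
    ("D", 28, 8_000_000),
    ("E", 33, 6_000_000),
]

# Convenience sample: keep only the rows with age under 22.
under_22 = [row for row in players if row[1] < 22]

full_mean_salary = statistics.mean(row[2] for row in players)
sample_mean_salary = statistics.mean(row[2] for row in under_22)

print(len(under_22), full_mean_salary, sample_mean_salary)
```

Because the filter is tied to age, and salary in this toy data rises with age, the sample average lands well below the full average no matter how the rows are ordered.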
Question 3. Assign convenience_stats to an array of the average age and average salary of your convenience sample, using the compute_statistics function. Since they're computed on a sample, these are called sample averages.
# Reuse compute_statistics, as the question asks, instead of recomputing the averages by hand.
convenience_stats = compute_statistics(convenience_sample)
convenience_stats
[20.363636363636363, 2383533.8181818184]
# TEST
len(convenience_stats) == 2
True
# TEST
round(float(convenience_stats[0]), 2) == 20.36
True
# TEST
round(float(convenience_stats[1]), 2) == 2383533.82
True
Next, we'll compare the convenience sample salaries with the full data salaries in a single histogram. To do that, we'll need to use the bin_column option of the hist method, which indicates that all columns are counts of the bins in a particular column. The following cell does not require any changes; just run it.
def compare_salaries(first, second, first_title, second_title):
    """Compare the salaries in two tables."""
    first_salary_in_millions = first.column('Salary')/1000000
    second_salary_in_millions = second.column('Salary')/1000000
    first_tbl_millions = first.drop('Salary').with_column('Salary', first_salary_in_millions)
    second_tbl_millions = second.drop('Salary').with_column('Salary', second_salary_in_millions)
    max_salary = max(np.append(first_tbl_millions.column('Salary'), second_tbl_millions.column('Salary')))
    bins = np.arange(0, max_salary+1, 1)
    first_binned = first_tbl_millions.bin('Salary', bins=bins).relabeled(1, first_title)
    second_binned = second_tbl_millions.bin('Salary', bins=bins).relabeled(1, second_title)
    first_binned.join('bin', second_binned).hist(bin_column='bin', unit='million dollars')
    plt.title('Salaries for all players and convenience sample')

compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')
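The key step in compare_salaries is that both tables are counted against the same bin edges, so their bars line up. The following is a simplified sketch of that idea in plain Python; count_in_bins is a hypothetical helper, the salaries are invented, and the half-open bin rule used here may differ in edge cases from what the Table bin method does.

```python
def count_in_bins(values, edges):
    """Count how many values fall in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical salaries in millions of dollars.
all_salaries = [0.5, 1.2, 1.8, 2.4, 3.9, 7.5]
sample_salaries = [0.5, 1.2, 1.8]

# One set of edges, built from the larger maximum, is reused for both lists.
top = max(all_salaries + sample_salaries)
edges = list(range(0, int(top) + 2))

all_counts = count_in_bins(all_salaries, edges)
sample_counts = count_in_bins(sample_salaries, edges)
print(edges)
print(all_counts)
print(sample_counts)
```

Since both count lists have one entry per shared bin, they can be joined on the bin and drawn as overlaid histograms.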
Question 4.
Does the convenience sample give us an accurate picture of the salary of the full population? Would you expect it to, in general?
Before you move on, write a short answer in English below. You can refer to the statistics calculated above or perform your own analysis.
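No. The convenience sample does not give an accurate picture of salaries in the full population: its average salary is about $2.38 million, compared with about $4.27 million for all players, and its average age is 20.36 compared with 26.54. In general we would not expect a convenience sample to be accurate. Keeping only players under 22 leaves out the older, more experienced players, so if salary is related to age or experience, the sample is biased no matter how many young players it contains.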
Simple random sampling
A more justifiable approach is to sample uniformly at random from the players. In a simple random sample (SRS) without replacement, we ensure that each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box. Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.
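The card-drawing procedure for a simple random sample without replacement can be sketched directly with the standard library. The function name, the seed parameter, and the player names below are hypothetical and exist only for this illustration.

```python
import random

def draw_without_replacement(names, sample_size, seed=None):
    """Shuffle a copy of the names, then take cards off the top."""
    if sample_size > len(names):
        raise ValueError("sample_size cannot exceed the population size")
    rng = random.Random(seed)
    cards = list(names)          # copy so the caller's list is left untouched
    rng.shuffle(cards)           # shuffle the box
    return cards[:sample_size]   # pull cards until the sample size is reached

names = ["Ann", "Bo", "Cy", "Di", "Ed", "Flo"]  # hypothetical player names
picked = draw_without_replacement(names, 3, seed=1)
print(picked)
```

Every card is in the box exactly once, so no name can be drawn twice, which is what "without replacement" means.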
Producing simple random samples
Sometimes, it's useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.
sample
The table method sample produces a random sample from the table. By default, it draws at random with replacement from the rows of a table. It takes in the sample size as its argument and returns a table with only the rows that were selected.
Run the cell below to see an example call to sample() with a sample size of 5, with replacement.
# Just run this cell
salary_data.sample(5)
PlayerName | Salary
Bruno Caboclo | 1458360
Alec Burks | 3034356
Tony Wroten | 1210080
Andre Roberson | 1160880
Kendrick Perkins | 9154342
The optional argument with_replacement=False can be passed to sample() to specify that the sample should be drawn without replacement.
Run the cell below to see an example call to sample() with a sample size of 5, without replacement.
# Just run this cell
salary_data.sample(5, with_replacement=False)
PlayerName | Salary
Noah Vonleh | 2524200
Aron Baynes | 2077000
Mike Dunleavy | 3326235
Ben McLemore | 3026280
Eric Bledsoe | 13000000
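The difference between the two modes is easiest to see when the sample is as large as the population. This sketch uses random.choices and random.sample from the standard library as stand-ins for the two modes of the Table sample method; the ten row indices are hypothetical.

```python
import random

rng = random.Random(42)        # fixed seed for a repeatable run
population = list(range(10))   # hypothetical row indices 0..9

# With replacement: each draw is independent, so a row can appear more than once.
with_repl = rng.choices(population, k=10)

# Without replacement: every selected row is distinct.
without_repl = rng.sample(population, k=10)

print(sorted(with_repl))
print(sorted(without_repl))
```

Drawing 10 of 10 rows without replacement always returns each row exactly once, whereas drawing with replacement will usually repeat some rows and miss others.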
Question 5. Produce a simple random sample of size 44 from full_data. Run your analysis on it again. Run the cell a few times to see how the histograms and statistics change across different samples.
# A simple random sample is drawn without replacement (sample() draws with replacement by default).
my_small_srswor_data = full_data.sample(44, with_replacement=False)
print(my_small_srswor_data)
# compute_statistics draws both histograms and returns [average age, average salary in dollars].
my_small_stats = compute_statistics(my_small_srswor_data)
my_small_stats
PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
Jabari Brown | 44765 | 22 | LAL | 19 | 36 | 40 | 12 | 2 | 32 | 227
Dahntay Jones | 613478 | 34 | LAC | 33 | 11 | 2 | 3 | 0 | 1 | 21
Steve Blake | 2077000 | 34 | POR | 81 | 137 | 288 | 41 | 5 | 104 | 350
David Wear | 29843 | 24 | SAC | 2 | 2 | 1 | 0 | 0 | 0 | 0
Kobe Bryant | 23500000 | 36 | LAL | 35 | 199 | 197 | 47 | 7 | 128 | 782
James McAdoo | 167122 | 22 | GSW | 15 | 37 | 2 | 5 | 9 | 6 | 62
Roy Hibbert | 14898938 | 28 | IND | 76 | 540 | 84 | 18 | 125 | 107 | 802
Stephen Curry | 10629213 | 26 | GSW | 80 | 341 | 619 | 163 | 16 | 249 | 1900
Johnny O'Bryant | 600000 | 21 | MIL | 34 | 64 | 17 | 5 | 4 | 25 | 100
Andre Iguodala | 12289544 | 31 | GSW | 77 | 257 | 228 | 89 | 25 | 88 | 604
... (34 rows omitted)
[26.90909090909091, 4761880.75]
Before you move on, write a short answer for the following questions in English:
How much does the average age change across samples?
What about average salary?
1. The average age changes only a little across samples. For the sample printed above, the average works out to 26.91 years, against 26.54 years for all players, a difference of about 1.4%. Ages fall in a fairly narrow range, so no single player can move the average far.
2. The average salary changes much more. For the sample printed above, it works out to about $4.76 million, against about $4.27 million for all players, about 11.5% higher. The salary distribution is strongly right-skewed, so whether a sample happens to include a few of the highest-paid players (Kobe Bryant, at $23.5 million, is in this one) has a large effect on the average.
Question 6. As in the previous question, analyze several simple random samples of size 100 from full_data.
# A simple random sample is drawn without replacement (sample() draws with replacement by default).
my_large_srswor_data = full_data.sample(100, with_replacement=False)
# compute_statistics draws both histograms and returns [average age, average salary in dollars].
my_large_stats = compute_statistics(my_large_srswor_data)
my_large_stats
[25.780000000000001, 4206290.65]
Answer the following questions in English:
Do the histogram shapes seem to change more or less across samples of 100 than across samples of size 44?
Are the sample averages and histograms closer to their true values/shape for age or for salary? What did you expect to see?
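1. Less. Histogram shapes change less across samples of size 100 than across samples of size 44. A larger sample contains more of the population, so each histogram looks more like the full-data histogram and varies less from run to run.
2. We expected age to be matched more closely than salary, because ages fall in a fairly narrow range while salaries are strongly right-skewed, so a handful of very large salaries can shift a sample's average and the right tail of its histogram. A single run can still land close on both: for the sample of 100 above, the average age was 25.78 years against 26.54 for all players, and the average salary was about $4.21 million against about $4.27 million.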
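The effect of sample size on the stability of the sample average can be checked with a small simulation. This is a sketch under stated assumptions: the population is 492 invented values drawn from an exponential distribution to mimic a right-skewed salary column, not the lab's actual salaries, and the function name and repetition count are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed "salaries" (millions of dollars); 492 values to match
# the number of players in the lab, but the values themselves are simulated.
population = [rng.expovariate(1 / 4.0) for _ in range(492)]

def spread_of_sample_means(sample_size, repetitions=2000):
    """Standard deviation of the sample mean over many samples drawn without replacement."""
    means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)

spread_44 = spread_of_sample_means(44)
spread_100 = spread_of_sample_means(100)
print(round(spread_44, 3), round(spread_100, 3))
```

The spread for samples of 100 comes out smaller than for samples of 44, which is the numerical counterpart of the histograms changing less from run to run.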
Congratulations, you're done with Lab 8! Be sure to...
run all the tests,
print the notebook as a PDF,
and submit both the notebook and the PDF to Canvas.