Lab08.ipynb - Colaboratory

School: University of North Texas
Course: 5502
Subject: Statistics
Date: Feb 20, 2024
Uploaded by: venkatasai1999

Lab 8: Sampling and Distributions - Sampling Basketball Data

Welcome to Lab 8! In this lab we will go over the topic of sampling and distributions. The data used in this lab will contain salary data and other statistics for basketball players from the 2014-2015 NBA season. This data was collected from the following sports analytics sites: Basketball Reference and Spotrac.

First, set up the tests and imports by running the cell below.

    # Run this cell, but please don't change it.

    # These lines import the Numpy and Datascience modules.
    import numpy as np
    from datascience import *

    # These lines do some fancy plotting magic
    import matplotlib
    %matplotlib inline
    import matplotlib.pyplot as plt
    plt.style.use('fivethirtyeight')

Run the cell below to load player and salary data that we will use for our sampling.

    player_data = Table.read_table("player_data.csv")
    salary_data = Table.read_table("salary_data.csv")
    full_data = salary_data.join("PlayerName", player_data, "Name")

    # The show method immediately displays the contents of a table.
    # This way, we can display the top of two tables using a single cell.
    player_data.show(3)
    salary_data.show(3)
    full_data.show(3)

Name | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
James Harden | 25 | HOU | 81 | 459 | 565 | 154 | 60 | 321 | 2217
Chris Paul | 29 | LAC | 82 | 376 | 838 | 156 | 15 | 190 | 1564
Stephen Curry | 26 | GSW | 80 | 341 | 619 | 163 | 16 | 249 | 1900
... (489 rows omitted)

PlayerName | Salary
Kobe Bryant | 23500000
Amar'e Stoudemire | 23410988
Joe Johnson | 23180790
... (489 rows omitted)

PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
A.J. Price | 62552 | 28 | TOT | 26 | 32 | 46 | 7 | 0 | 14 | 133
Aaron Brooks | 1145685 | 30 | CHI | 82 | 166 | 261 | 54 | 15 | 157 | 954
Aaron Gordon | 3992040 | 19 | ORL | 47 | 169 | 33 | 21 | 22 | 38 | 243
... (489 rows omitted)

Rather than getting data on every player (as in the tables loaded above), imagine that we had gotten data on only a smaller subset of the players. For 492 players, it's not so unreasonable to expect to see all the data, but usually we aren't so lucky.

If we want to make estimates about a certain numerical property of the population (known as a parameter, e.g. the population mean or median), we may have to come up with these estimates based only on a smaller sample, using a statistic such as the sample mean. Whether these estimates are useful or not often depends on how the sample was gathered. We have prepared some example sample datasets to see how they compare to the full NBA dataset. Later we'll ask you to create your own samples to see how they behave.

To save typing and increase the clarity of your code, we will package the analysis code into a few functions. This will be useful in the rest of the lab as we will repeatedly need to create histograms and collect summary statistics from that data.

We've defined the histograms function below, which takes a table with columns Age and Salary and draws a histogram for each one. It uses bin widths of 1 year for Age and $1,000,000 for Salary.

    def histograms(t):
        ages = t.column('Age')
        salaries = t.column('Salary')/1000000
        t1 = t.drop('Salary').with_column('Salary', salaries)
        age_bins = np.arange(min(ages), max(ages) + 2, 1)
        salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
        t1.hist('Age', bins=age_bins, unit='year')
        plt.title('Age distribution')
        t1.hist('Salary', bins=salary_bins, unit='million dollars')
        plt.title('Salary distribution')

    histograms(full_data)
    print('Two histograms should be displayed below')

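The `+ 2` in the age bins of the histograms function can look odd at first. The following is a minimal sketch of why it is there, using made-up ages rather than the lab data, and using `np.histogram` as a stand-in for the table's hist method (an assumption made so the example runs with NumPy alone). `np.arange` excludes its stop value, so the last one-year bin needs an edge one year past the oldest age.

```python
import numpy as np

# Hypothetical ages, for illustration only (not from the lab data).
ages = np.array([19, 20, 20, 21])

# np.arange stops *before* its second argument, so max + 2 is needed
# to produce a right-hand edge at max + 1.
age_bins = np.arange(min(ages), max(ages) + 2, 1)

# One-year-wide bins: [19, 20), [20, 21), [21, 22]
counts, edges = np.histogram(ages, bins=age_bins)

# With max + 1 instead, the edges stop at 21 and the oldest age
# is merged into the previous bin.
short_bins = np.arange(min(ages), max(ages) + 1, 1)
short_counts, _ = np.histogram(ages, bins=short_bins)

print(edges, counts)
print(short_bins, short_counts)
```

With the shorter set of edges, the 21-year-olds share a bin with the 20-year-olds, which would misstate the age distribution at its upper end.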
Two histograms should be displayed below

Question 1. Create a function called compute_statistics that takes a table containing ages and salaries and:

- Draws a histogram of ages
- Draws a histogram of salaries
- Returns a two-element array containing the average age and average salary (in that order)

You can call the histograms function to draw the histograms!

Note: More charts will be displayed when running the test cell. Please feel free to ignore the charts.

    def compute_statistics(age_and_salary_data):
        # Draw the age and salary histograms with the helper defined above.
        histograms(age_and_salary_data)
        age = age_and_salary_data.column('Age')
        salary = age_and_salary_data.column('Salary')
        # Average age (years) and average salary (dollars), in that order.
        return [sum(age)/len(age), sum(salary)/len(salary)]

    full_stats = compute_statistics(full_data)
    full_stats

[26.536585365853657, 4269775.7662601592]

    # TEST
    stats = compute_statistics(full_data)
    plt.close()
    plt.close()
    round(float(stats[0]), 2) == 26.54

True

    # TEST
    stats = compute_statistics(full_data)
    plt.close()
    plt.close()
    round(float(stats[1]), 2) == 4269775.77

True

Convenience sampling

One sampling methodology, which is generally a bad idea, is to choose players who are somehow convenient to sample. For example, you might choose players from one team who are near your house, since it's easier to survey them. This is called, somewhat pejoratively, convenience sampling.

Suppose you survey only relatively new players with ages less than 22. (The more experienced players didn't bother to answer your surveys about their salaries.)

Question 2. Assign convenience_sample to a subset of full_data that contains only the rows for players under the age of 22.

    convenience_sample = full_data.where(full_data.column('Age') < 22)
    convenience_sample

PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
Aaron Gordon | 3992040 | 19 | ORL | 47 | 169 | 33 | 21 | 22 | 38 | 243
Alex Len | 3649920 | 21 | PHO | 69 | 454 | 32 | 34 | 105 | 74 | 432
Andre Drummond | 2568360 | 21 | DET | 82 | 1104 | 55 | 73 | 153 | 120 | 1130
Andrew Wiggins | 5510640 | 19 | MIN | 82 | 374 | 170 | 86 | 50 | 177 | 1387
Anthony Bennett | 5563920 | 21 | MIN | 57 | 216 | 48 | 27 | 16 | 36 | 298
Anthony Davis | 5607240 | 21 | NOP | 68 | 696 | 149 | 100 | 200 | 95 | 1656
Archie Goodwin | 1112280 | 20 | PHO | 41 | 74 | 44 | 18 | 9 | 48 | 231
Ben McLemore | 3026280 | 21 | SAC | 82 | 241 | 140 | 77 | 19 | 138 | 996
Bradley Beal | 4505280 | 21 | WAS | 63 | 241 | 194 | 76 | 18 | 123 | 962
Bruno Caboclo | 1458360 | 19 | TOR | 8 | 2 | 0 | 0 | 1 | 4 | 10
... (34 rows omitted)

    # TEST
    convenience_sample.num_columns == 11

True

    # TEST
    convenience_sample.num_rows == 44

True

Question 3. Assign convenience_stats to an array of the average age and average salary of your convenience sample, using the compute_statistics function. Since they're computed on a sample, these are called sample averages.

    age = convenience_sample.column('Age')
    salary = convenience_sample.column('Salary')
    convenience_stats = [sum(age)/len(age), sum(salary)/len(salary)]
    convenience_stats

[20.363636363636363, 2383533.8181818184]

    # TEST
    len(convenience_stats) == 2

True

    # TEST
    round(float(convenience_stats[0]), 2) == 20.36

True

    # TEST
    round(float(convenience_stats[1]), 2) == 2383533.82

True

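To see why a convenience sample can mislead, here is a small sketch with a made-up population in which salary tends to rise with age. The numbers are invented for illustration and are not the lab's NBA figures; the only point is that filtering on age also filters on salary.

```python
import numpy as np

# Made-up population for illustration; these are NOT the lab's NBA figures.
ages = np.array([19, 20, 21, 25, 28, 30, 33])
salaries_millions = np.array([1.0, 1.5, 2.0, 4.0, 8.0, 12.0, 6.0])

# Convenience sample: keep only players younger than 22.
is_young = ages < 22
sample_mean = salaries_millions[is_young].mean()
population_mean = salaries_millions.mean()

print("convenience sample mean:", sample_mean)
print("population mean:", population_mean)
```

Because the selection rule (age under 22) is related to the quantity being estimated (salary), the sample average sits well below the population average in this toy example.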
Next, we'll compare the convenience sample salaries with the full data salaries in a single histogram. To do that, we'll need to use the bin_column option of the hist method, which indicates that all columns are counts of the bins in a particular column. The following cell does not require any changes; just run it.

    def compare_salaries(first, second, first_title, second_title):
        """Compare the salaries in two tables."""
        first_salary_in_millions = first.column('Salary')/1000000
        second_salary_in_millions = second.column('Salary')/1000000
        first_tbl_millions = first.drop('Salary').with_column('Salary', first_salary_in_millions)
        second_tbl_millions = second.drop('Salary').with_column('Salary', second_salary_in_millions)
        max_salary = max(np.append(first_tbl_millions.column('Salary'), second_tbl_millions.column('Salary')))
        bins = np.arange(0, max_salary+1, 1)
        first_binned = first_tbl_millions.bin('Salary', bins=bins).relabeled(1, first_title)
        second_binned = second_tbl_millions.bin('Salary', bins=bins).relabeled(1, second_title)
        first_binned.join('bin', second_binned).hist(bin_column='bin', unit='million dollars')
        plt.title('Salaries for all players and convenience sample')

    compare_salaries(full_data, convenience_sample, 'All Players', 'Convenience Sample')

Question 4. Does the convenience sample give us an accurate picture of the salary of the full population? Would you expect it to, in general? Before you move on, write a short answer in English below. You can refer to the statistics calculated above or perform your own analysis.

No. The convenience sample does not represent the salaries of the full population. Its average salary is about $2.38 million, compared with about $4.27 million for all players, and the ten sample rows shown above are all below $6 million while the full data includes salaries above $23 million. In general we would not expect a convenience sample to be accurate: here the players were chosen because they are under 22, and age is related to salary, so the way the sample was gathered is tied to the quantity being estimated.

A more justifiable approach is to sample uniformly at random from the players.

Simple random sampling

In a simple random sample (SRS) without replacement, we ensure that each player is selected at most once. Imagine writing down each player's name on a card, putting the cards in a box, and shuffling the box. Then, pull out cards one by one and set them aside, stopping when the specified sample size is reached.

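The card-drawing procedure can be sketched in plain NumPy. This is an illustrative sketch, not the lab's own method: the player labels are placeholders, and `rng.choice` with `replace=False` stands in for pulling cards out of the box and setting them aside.

```python
import numpy as np

# Hypothetical "cards in a box": player names are placeholders.
players = np.array(["A", "B", "C", "D", "E", "F", "G", "H"])

rng = np.random.default_rng(seed=0)  # seed fixed only for repeatability

# replace=False means each card is set aside once drawn,
# so no player can appear twice in the sample.
srs = rng.choice(players, size=5, replace=False)

# replace=True puts each card back before the next draw,
# so repeats are possible (though not guaranteed).
with_repl = rng.choice(players, size=5, replace=True)

print(srs)
print(with_repl)
```

Sampling without replacement can never return more rows than the population holds, whereas sampling with replacement can.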
Producing simple random samples

Sometimes, it's useful to take random samples even when we have the data for the whole population. It helps us understand sampling accuracy.

sample

The table method sample produces a random sample from the table. By default, it draws at random with replacement from the rows of a table. It takes in the sample size as its argument and returns a table with only the rows that were selected.

Run the cell below to see an example call to sample() with a sample size of 5, with replacement.

    # Just run this cell
    salary_data.sample(5)

PlayerName | Salary
Bruno Caboclo | 1458360
Alec Burks | 3034356
Tony Wroten | 1210080
Andre Roberson | 1160880
Kendrick Perkins | 9154342

The optional argument with_replacement=False can be passed to sample() to specify that the sample should be drawn without replacement.

Run the cell below to see an example call to sample() with a sample size of 5, without replacement.

    # Just run this cell
    salary_data.sample(5, with_replacement=False)

PlayerName | Salary
Noah Vonleh | 2524200
Aron Baynes | 2077000
Mike Dunleavy | 3326235
Ben McLemore | 3026280
Eric Bledsoe | 13000000

Question 5. Produce a simple random sample of size 44 from full_data. Run your analysis on it again. Run the cell a few times to see how the histograms and statistics change across different samples.

    # with_replacement=False makes this a simple random sample without replacement.
    my_small_srswor_data = full_data.sample(44, with_replacement=False)
    print(my_small_srswor_data)

    def histograms(s):
        ages = s.column('Age')
        salaries = s.column('Salary')/1000000
        s1 = s.drop('Salary').with_column('Salary', salaries)
        age_bins = np.arange(min(ages), max(ages) + 2, 1)
        salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
        s1.hist('Age', bins=age_bins, unit='year')
        plt.title('Age distribution')
        s1.hist('Salary', bins=salary_bins, unit='million dollars')
        plt.title('Salary distribution')
        plt.show()

    histograms(my_small_srswor_data)
    ages = my_small_srswor_data.column('Age')
    salaries = my_small_srswor_data.column('Salary')
    # Average age (years) and average salary (dollars); the Salary column
    # is already in dollars, so no rescaling is needed.
    my_small_stats = [sum(ages)/len(ages), sum(salaries)/len(salaries)]
    my_small_stats

PlayerName | Salary | Age | Team | Games | Rebounds | Assists | Steals | Blocks | Turnovers | Points
Jabari Brown | 44765 | 22 | LAL | 19 | 36 | 40 | 12 | 2 | 32 | 227
Dahntay Jones | 613478 | 34 | LAC | 33 | 11 | 2 | 3 | 0 | 1 | 21
Steve Blake | 2077000 | 34 | POR | 81 | 137 | 288 | 41 | 5 | 104 | 350
David Wear | 29843 | 24 | SAC | 2 | 2 | 1 | 0 | 0 | 0 | 0
Kobe Bryant | 23500000 | 36 | LAL | 35 | 199 | 197 | 47 | 7 | 128 | 782
James McAdoo | 167122 | 22 | GSW | 15 | 37 | 2 | 5 | 9 | 6 | 62
Roy Hibbert | 14898938 | 28 | IND | 76 | 540 | 84 | 18 | 125 | 107 | 802
Stephen Curry | 10629213 | 26 | GSW | 80 | 341 | 619 | 163 | 16 | 249 | 1900
Johnny O'Bryant | 600000 | 21 | MIL | 34 | 64 | 17 | 5 | 4 | 25 | 100
Andre Iguodala | 12289544 | 31 | GSW | 77 | 257 | 228 | 89 | 25 | 88 | 604
... (34 rows omitted)

[26.90909090909091, 4761880.75]

Before you move on, write a short answer for the following questions in English:

How much does the average age change across samples? What about average salary?

1. The average age changes only a little from sample to sample. This sample's average was about 26.9, close to the population average of about 26.5, and rerunning the cell usually gives values near that figure.

2. The average salary changes much more. This sample's average was about $4.76 million, compared with about $4.27 million for all players. Salaries are strongly right-skewed, so whether a few very highly paid players are drawn (such as Kobe Bryant at $23.5 million, who appears in this sample) can move the average substantially.

Question 6. As in the previous question, analyze several simple random samples of size 100 from full_data.

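Rerunning a cell by hand shows only a few samples at a time. As a rough sketch of how much a sample average varies across many repeated samples of sizes 44 and 100, the simulation below uses a made-up right-skewed population of 492 values (an exponential distribution is assumed purely for illustration; it is not the lab's salary data) and plain NumPy instead of the table's sample method.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for repeatability

# Made-up right-skewed "salaries" (in millions); NOT the lab's data.
population = rng.exponential(scale=4.0, size=492)

def sample_means(size, repetitions=2000):
    """Means of repeated simple random samples (without replacement)."""
    return np.array([
        rng.choice(population, size=size, replace=False).mean()
        for _ in range(repetitions)
    ])

# Standard deviation of the sample means measures how much the
# sample average moves around from one sample to the next.
spread_44 = sample_means(44).std()
spread_100 = sample_means(100).std()

print("SD of sample means, n=44: ", spread_44)
print("SD of sample means, n=100:", spread_100)
```

The helper name `sample_means` and the repetition count are choices made for this sketch. The same comparison could be run on the lab's tables by collecting the averages from repeated calls to sample with with_replacement=False.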
    # with_replacement=False makes this a simple random sample without replacement.
    my_large_srswor_data = full_data.sample(100, with_replacement=False)

    def histograms(t):
        ages = t.column('Age')
        salaries = t.column('Salary')/1000000
        t1 = t.drop('Salary').with_column('Salary', salaries)
        age_bins = np.arange(min(ages), max(ages) + 2, 1)
        salary_bins = np.arange(min(salaries), max(salaries) + 1, 1)
        t1.hist('Age', bins=age_bins, unit='year')
        plt.title('Age distribution')
        t1.hist('Salary', bins=salary_bins, unit='million dollars')
        plt.title('Salary distribution')
        plt.show()

    # Plot the size-100 sample (not the size-44 sample from Question 5).
    histograms(my_large_srswor_data)
    age = my_large_srswor_data.column('Age')
    salary = my_large_srswor_data.column('Salary')
    # Average age (years) and average salary (dollars); the Salary column
    # is already in dollars, so no rescaling is needed.
    my_large_stats = [sum(age)/len(age), sum(salary)/len(salary)]
    my_large_stats

[25.780000000000001, 4206290.65]

Answer the following questions in English:

Do the histogram shapes seem to change more or less across samples of 100 than across samples of size 44? Are the sample averages and histograms closer to their true values/shape for age or for salary? What did you expect to see?

1. The histogram shapes change less across samples of size 100 than across samples of size 44. A larger sample contains more of the population, so each sample's histogram tends to resemble the population histogram more closely.

2. The sample averages and histograms tend to be more stable for age than for salary. Ages fall in a narrow range, while salaries are strongly right-skewed, so a handful of very large salaries can shift the salary average and the right tail of its histogram from one sample to the next. In this run the averages were about 25.8 years and $4.21 million, against population values of about 26.5 years and $4.27 million; a single run like this one can land close on both, which is what we would expect more often at the larger sample size.

Congratulations, you're done with Lab 8! Be sure to run all the tests, print the notebook as a PDF, and submit both the notebook and the PDF to Canvas.