Final Exam, Spring 2022: March (through May) Madness
Version 1.0

All of the header information is important. Please read it.

Topics, number of exercises: This problem builds on your knowledge of Pandas, SQL, and numeric computation. It has 9 exercises, numbered 0 to 8. There are 19 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)

Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.

Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.

Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).

Exercise point breakdown:
- Exercise 0: 1 point
- Exercise 1: 3 points
- Exercise 2: 1 point
- Exercise 3: 3 points
- Exercise 4: 2 points
- Exercise 5: 3 points
- Exercise 6: 3 points
- Exercise 7: 1 point
- Exercise 8: 2 points

Final reminders:
- Submit after every exercise.
- Review the generated grade report after you submit to see what errors were returned.
- Stay calm, skip problems as needed, and take short breaks at your leisure.

Basketball basics

In this notebook we want to predict which team will win a basketball game based on their past performance in a given season. You do not have to know anything about basketball aside from the background below to complete this notebook.

Games. A basketball game is played between two teams. In most games, there is a "home" team and an "away" team. However, sometimes games are played at a neutral site, in which case neither team is "home" nor "away."

Scoring and winning. In a game, the team that scores more points wins. There are no ties.

Possessions. A possession is an event where one team continuously controls the ball.

In [1]:
import sqlite3
import numpy as np
import pandas as pd
import re

conn = sqlite3.connect('resource/asnlib/publicdata/basketball_db.sqlite')

Exercise 0 - (1 Points):
We have a big collection of real data from several seasons of men's NCAA basketball. We are most interested in one table of this database, named MRegularSeasonDetailedResults.

Complete the function get_cols(conn) to return a list of the columns in the MRegularSeasonDetailedResults table in the db connection conn. The order of the columns in the list should be the same as the order they appear in the database table.

In [2]:
### Define get_cols
def get_cols(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''select * from MRegularSeasonDetailedResults limit 1''', conn)\
        .columns\
        .tolist()

The demo cell below should display the following output:

['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']

In [3]:
### call demo function
print(get_cols(conn))

The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [4]:
### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(get_cols)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
initializing tester_fw.tester_6040
Passed! Please submit.
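For reference, the same column list can also be read from SQLite's table metadata rather than by selecting a row. This is only an illustrative sketch (the helper name is made up, and PRAGMA table_info is SQLite-specific), not part of the exam solution:

```python
import pandas as pd

def get_cols_pragma(conn):
    # PRAGMA table_info returns one row per column, in table order,
    # with the column name in the 'name' field.
    info = pd.read_sql("PRAGMA table_info(MRegularSeasonDetailedResults)", conn)
    return info['name'].tolist()
```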
Abridged data dictionary

In the MRegularSeasonDetailedResults table, each record is one game. All columns prefixed by 'W' refer to the winning team, and columns prefixed by 'L' refer to the losing team. We are primarily interested in the following columns:

| Column | Type | Has Prefix | Description |
|--------|------|------------|-------------|
| Season | Integer | No | Identifies the calendar year when a season ends. |
| DayNum | Integer | No | Identifies the day in a season when a game occurred. |
| TeamID | Integer | W or L | Identifies the teams participating in a game. |
| Score | Integer | W or L | Number of points a team scored in a game. |
| Loc | Char | W only | Identifies whether the winning team was home ('H'), away ('A'), or the game was played at a neutral site ('N'). |
| FGA | Integer | W or L | Number of field goal attempts a team had in a game. |
| FTA | Integer | W or L | Number of free throw attempts a team had in a game. |
| TO | Integer | W or L | Number of turnovers a team had in a game. |
| OR | Integer | W or L | Number of offensive rebounds a team had in a game. |

For example, the column with the winning team's ID will be WTeamID, and the column with the losing team's free-throw attempts will be LFTA.

Exercise 1 - (3 Points):
We want to extract some information about each team involved in a game from that team's perspective. We will use the term "primary team" to refer to this team and "primary team's opponent" to refer to the other team in the game. Do not infer any other meaning from this term. Since there are two participants in each game, there will be a record from the perspective of both teams after the extraction (i.e. one record where the winning team is the primary team and one record where the losing team is the primary team will be in the result).

Complete the function get_pace_adjusted(conn) to query the table MRegularSeasonDetailedResults using the db connection conn and return a Pandas DataFrame with the columns outlined below. The "Winning Team" and "Losing Team" headers in the chart below indicate which columns in MRegularSeasonDetailedResults to use in your computations.

| Column Name | dtype | description | Special Considerations | Winning Team | Losing Team |
|-------------|-------|-------------|------------------------|--------------|-------------|
| Won | int64 | 1 if the primary team won the game, 0 if the primary team lost | | 1 | 0 |
| Season | int64 | Current Season | | | |
| DayNum | int64 | Current DayNum | | | |
| TeamID | int64 | Team id for primary team | | WTeamID | LTeamID |
| OppID | int64 | Team id for primary team's opponent | | LTeamID | WTeamID |
| Home | int64 | 1 if the primary team is at home for the game, 0 otherwise | | WLoc is 'H' | WLoc is 'A' |
| Pos | float64 | Estimated number of possessions for primary team | Round to 5 decimal places | W prefix | L prefix |
| OppPos | float64 | Estimated number of possessions for primary team's opponent | Round to 5 decimal places | L prefix | W prefix |
| PtsForPer100 | float64 | 100 * (primary team's score) / (estimate of primary team's possessions) | Round to 5 decimal places (only round after division) | W prefix | L prefix |
| PtsAgstPer100 | float64 | 100 * (primary team's opponent's score) / (estimate of primary team's opponent's possessions) | Round to 5 decimal places (only round after division) | L prefix | W prefix |

There is no column for possessions in our source data. We will need to estimate it! The formula below is widely used in sports analytics to estimate the number of possessions a team had in a basketball game (Pos is the # of possessions):

Pos = FGA + 0.44*(FTA) + TO - OR

You can derive all required results from these columns: Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO.

Notes:
- Each record in the database table MRegularSeasonDetailedResults will correspond to two records in the result: one record where the winning team is the primary team and one record where the losing team is the primary team.
- For neutral site games, neither team will be home or away. The WLoc column will have an 'N' entry.
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas. The UNION ALL phrase may be helpful if you are using SQL. The pd.concat() method may be helpful if you are using Pandas. One strategy that will work is making a table/df for all the winning teams and another for all the losing teams, then vertically "stacking" them.
In [5]:
### Define get_pace_adjusted
def get_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        -- winning team
        select
            1 as Won, season, DayNum,
            wteamid as TeamID, lteamid as OppID,
            wloc='H' as Home,
            wFGA + 0.44*(wFTA) + wTO - wOR as Pos,
            lFGA + 0.44*(lFTA) + lTO - lOR as OppPos,
            100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsForPer100,
            100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsAgstPer100
        from mregularseasondetailedresults
        union all
        -- losing team
        select
            0 as Won, season, DayNum,
            lteamid as TeamID, wteamid as OppID,
            wloc='A' as Home,
            lFGA + 0.44*(lFTA) + lTO - lOR as Pos,
            wFGA + 0.44*(wFTA) + wTO - wOR as OppPos,
            100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsForPer100,
            100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsAgstPer100
        from mregularseasondetailedresults
    ''', conn).apply(lambda f: round(f, 5))

This is the demo input:

   Season  DayNum  WTeamID  WScore  LTeamID  LScore WLoc  WFGA  WFTA  WOR  WTO  LFGA  LFTA  LOR  LTO
0    2003      10     1104      68     1328      62    N    58    18   14   23    53    22   10   18

The demo cell below should display the following output:

   Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100  PtsAgstPer100
0    1    2003      10    1104   1328     0  74.92   70.68      90.76348       87.71930
1    0    2003      10    1328   1104     0  70.68   74.92      87.71930       90.76348

Note that there are two rows, but they come from a single input row.

In [6]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_1 = pd.read_sql('select Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO from MRegularSeasonDetailedResults limit 1', conn)
demo_conn_1 = dfs_to_conn({'MRegularSeasonDetailedResults': demo_df_1})
# demo_df_1

In [7]:
### call demo function
print(get_pace_adjusted(demo_conn_1))

   Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100  \
0    1    2003      10    1104   1328     0  74.92   70.68      90.76348
1    0    2003      10    1328   1104     0  70.68   74.92      87.71930

   PtsAgstPer100
0       87.71930
1       90.76348
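For comparison, here is a rough sketch of the Pandas "stacking" strategy mentioned in the notes (pd.concat of a winners frame and a losers frame). The helper names are made up for illustration and this has not been run against the autograder, so the final dtypes and rounding may need adjustment:

```python
import pandas as pd

def get_pace_adjusted_pandas(conn):
    games = pd.read_sql('select * from MRegularSeasonDetailedResults', conn)

    def one_side(df, us, them, won):
        # Possession estimate: Pos = FGA + 0.44*FTA + TO - OR
        pos = df[f'{us}FGA'] + 0.44*df[f'{us}FTA'] + df[f'{us}TO'] - df[f'{us}OR']
        opp_pos = df[f'{them}FGA'] + 0.44*df[f'{them}FTA'] + df[f'{them}TO'] - df[f'{them}OR']
        return pd.DataFrame({
            'Won': won,
            'Season': df['Season'],
            'DayNum': df['DayNum'],
            'TeamID': df[f'{us}TeamID'],
            'OppID': df[f'{them}TeamID'],
            # Winners are home when WLoc is 'H'; losers are home when WLoc is 'A'
            'Home': (df['WLoc'] == ('H' if won else 'A')).astype('int64'),
            'Pos': pos.round(5),
            'OppPos': opp_pos.round(5),
            'PtsForPer100': (100 * df[f'{us}Score'] / pos).round(5),
            'PtsAgstPer100': (100 * df[f'{them}Score'] / opp_pos).round(5),
        })

    winners = one_side(games, 'W', 'L', 1)   # winning team as the primary team
    losers = one_side(games, 'L', 'W', 0)    # losing team as the primary team
    return pd.concat([winners, losers], ignore_index=True)
```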
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [8]:
### test_cell_ex1
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
    try:
        tester.run_test(get_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 2 - (1 Points):

Our next task is to compute daily totals of Pos, OppPos, PtsForPer100, and PtsAgstPer100, as well as a count of the number of teams participating in games that day.

Complete the function daily_totals(conn) to query the table PaceAdjusted (structured the same as the output of the previous exercise) in the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name | dtype | description |
|-------------|-------|-------------|
| Season | int64 | |
| DayNum | int64 | |
| Pos | float64 | Sum of values in Pos column of PaceAdjusted occurring on a unique combination of Season and DayNum |
| Pph | float64 | Sum of values in PtsForPer100 column of PaceAdjusted occurring on a unique combination of Season and DayNum |
| Count | int64 | Count of rows in PaceAdjusted with a unique combination of Season and DayNum |

Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.

In [9]:
### Define daily_totals
def daily_totals(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        select
            Season, DayNum,
            sum(pos) as Pos,
            sum(ptsforper100) as Pph,
            count(*) as Count
        from paceadjusted
        group by season, daynum
    ''', conn)
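If you prefer Pandas over SQL for this aggregation, a groupby sketch along the following lines should produce the same totals (illustrative only; it assumes PaceAdjusted fits comfortably in memory and that the resulting dtypes match the requirements):

```python
import pandas as pd

def daily_totals_pandas(conn):
    pa = pd.read_sql('select * from PaceAdjusted', conn)
    out = (pa.groupby(['Season', 'DayNum'], as_index=False)
             .agg(Pos=('Pos', 'sum'),
                  Pph=('PtsForPer100', 'sum'),
                  Count=('Pos', 'size')))   # rows per (Season, DayNum)
    # Keep the required column order
    return out[['Season', 'DayNum', 'Pos', 'Pph', 'Count']]
```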
The demo cell below should display the following output:

   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3

In [10]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_2 = pd.read_sql('select * from PaceAdjustedSample', conn)
demo_conn_2 = dfs_to_conn({'PaceAdjusted': demo_df_2})
# demo_df_2.sort_values(['Season', 'DayNum']).reset_index(drop=True)

In [11]:
### call demo function
print(daily_totals(demo_conn_2))

   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3

The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [12]:
### test_cell_ex2
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(20):
    try:
        tester.run_test(daily_totals)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.

On Window Functions... This is useful information

Window functions define a slice of data relative to each row in a dataset and then perform an aggregate or order-based calculation on the slice. For example, say you want to perform the following calculation for each row in DailyTotals:

SumPos - Sum of the Pos column for all rows where the Season equals the current row's Season and the DayNum is less than or equal to the current row's DayNum.

Window functions can do this. Below are examples of this computation in Pandas and SQL. First, we will load some data to demonstrate.
In [13]:
# define demo data
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
daily_totals_df = demo_conn_dfs_ex3['DailyTotals']

The code below will return the sample data with a new column "SumPos".

In [14]:
### Pandas Example
def pandas_window_helper(group):
    group = group.sort_values('DayNum')
    group['SumPos'] = group['Pos'].expanding(1).sum()
    return group

pandas_ex = daily_totals_df.groupby('Season', as_index=False).apply(pandas_window_helper)

The code below will return the sample data with a new column "SumPos".

In [15]:
### SQL Example
query = '''
SELECT *,
    SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as SumPos
FROM DailyTotals
ORDER BY Season, DayNum
'''
sql_ex = pd.read_sql(query, demo_conn_ex3)

The code below will print the result and verify that the two approaches give equivalent results.

In [16]:
display(sql_ex)
print(f'''SQL and Pandas results match? {((sql_ex.reset_index(drop=True) - pandas_ex.reset_index(drop=True)).abs() <= 0.00001).all().all()}''')

   Season  DayNum       Pos          Pph  Count    SumPos
0    2014      12   8725.84  13482.43561    128   8725.84
1    2014      15   6146.96   9299.04948     88  14872.80
2    2014      18   7301.64  11764.86247    108  22174.44
3    2014      41   4452.52   6702.86892     66  26626.96
4    2014      44   5381.96   8625.92495     82  32008.92
5    2014      59   8673.96  13265.97248    128  40682.88
6    2014     102   1400.72   2062.01746     20  42083.60
7    2014     127   1696.20   2684.90264     26  43779.80
8    2021      29   4150.56   5600.57650     58   4150.56
9    2021      96  13485.76  19887.61275    194  17636.32

SQL and Pandas results match? True

Exercise 3 - (3 Points):
Our next task is to compute the weighted running average for Pos and Pph for all games played in a Season up to and including each DayNum. We want a snapshot of these averages as if it were that particular day and we had no knowledge of events occurring after that day.

Complete the function get_running_avg_pace(conn) to query the table DailyTotals (structured the same as the output of the previous exercise) using the db connection conn and return a Pandas DataFrame with the columns mentioned below. You should calculate the intermediate values, then use division to get the end result.

Intermediate values:
- SumPos - Running sum of the "Pos" column for the current "Season" up to and including the current "DayNum".
- SumPph - Running sum of the "Pph" column for the current "Season" up to and including the current "DayNum".
- SumCount - Running sum of the "Count" column for the current "Season" up to and including the current "DayNum".

| Column Name | dtype | description | Special Considerations |
|-------------|-------|-------------|------------------------|
| Season | int64 | Current Season | |
| DayNum | int64 | Current DayNum | |
| AvgPace | float64 | SumPos / SumCount | Round to 5 decimal places |
| AvgPPH | float64 | SumPph / SumCount | Round to 5 decimal places |

Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
- See "On Window Functions..." above this exercise for more info on computing a running sum.
- For Pandas, pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful in calculating intermediate values.
- For SQL, the "window function" syntax may be useful in calculating intermediate values: SUM(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

In [17]:
### Define get_running_avg_pace
def get_running_avg_pace(conn):
    def window_helper(agg_col, partition_col, sort_col):
        return f'SUM( {agg_col} ) OVER(PARTITION BY {partition_col} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'
    # return window_helper('Pos', 'Season', 'DayNum')
    intermediate_result = pd.read_sql(f'''
        select
            Season, DayNum,
            {window_helper('Pos', 'Season', 'DayNum')} SumPos,
            {window_helper('Pph', 'Season', 'DayNum')} SumPph,
            {window_helper('Count', 'Season', 'DayNum')} SumCount
        from DailyTotals
    ''', conn)
    df = intermediate_result
    df['AvgPace'] = df['SumPos'] / df['SumCount']
    df['AvgPPH'] = df['SumPph'] / df['SumCount']
    return df.drop(columns=['SumPos', 'SumPph', 'SumCount'])

The demo cell below should display the following output:

   Season  DayNum   AvgPace     AvgPPH
0    2014      12  68.17063  105.33153
1    2014      15  68.85556  105.46984
2    2014      18  68.43963  106.62453
3    2014      41  68.27426  105.76722
4    2014      44  67.81551  105.66767
5    2014      59  67.80480  105.23519
6    2014     102  67.87677  105.16634
7    2014     127  67.77059  105.08984
8    2021      29  71.56138   96.56166
9    2021      96  69.98540  101.14361
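A Pandas version of the same computation, using the groupby().apply() and expanding() route suggested in the notes, might look roughly like this (an untested sketch; the helper name is made up and the rounding follows the stated spec):

```python
import pandas as pd

def get_running_avg_pace_pandas(conn):
    dt = pd.read_sql('select * from DailyTotals', conn)

    def per_season(g):
        # Running sums within one season, ordered by day
        g = g.sort_values('DayNum')
        sums = g[['Pos', 'Pph', 'Count']].expanding(1).sum()
        g['AvgPace'] = (sums['Pos'] / sums['Count']).round(5)
        g['AvgPPH'] = (sums['Pph'] / sums['Count']).round(5)
        return g

    out = dt.groupby('Season', group_keys=False).apply(per_season)
    return out[['Season', 'DayNum', 'AvgPace', 'AvgPPH']].reset_index(drop=True)
```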
In [18]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
# demo_conn_dfs_ex3['DailyTotals'].sort_values(['Season', 'DayNum']).reset_index(drop=True)

In [19]:
### call demo function
print(get_running_avg_pace(demo_conn_ex3))

   Season  DayNum    AvgPace      AvgPPH
0    2014      12  68.170625  105.331528
1    2014      15  68.855556  105.469838
2    2014      18  68.439630  106.624530
3    2014      41  68.274256  105.767222
4    2014      44  67.815508  105.667673
5    2014      59  67.804800  105.235190
6    2014     102  67.876774  105.166341
7    2014     127  67.770588  105.089836
8    2021      29  71.561379   96.561664
9    2021      96  69.985397  101.143608

The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [20]:
### test_cell_ex3
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
    try:
        tester.run_test(get_running_avg_pace)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 4 - (2 Points):
Suppose you are given two SQL tables:
1. PaceAdjusted, structured the same as the output of Exercise 1.
2. RunningAvgPace, structured the same as the output of Exercise 3.

Complete the function get_center_pace_adjusted(conn) to query tables RunningAvgPace and PaceAdjusted and return a Pandas DataFrame with the following columns:

| Column Name | dtype | description |
|-------------|-------|-------------|
| Won | int64 | Unchanged from PaceAdjusted |
| Season | int64 | Unchanged from PaceAdjusted |
| DayNum | int64 | Unchanged from PaceAdjusted |
| TeamID | int64 | Unchanged from PaceAdjusted |
| OppID | int64 | Unchanged from PaceAdjusted |
| Home | int64 | Unchanged from PaceAdjusted |
| Pos | float64 | PaceAdjusted.Pos - RunningAvgPace.AvgPace for corresponding Season and DayNum |
| OppPos | float64 | PaceAdjusted.OppPos - RunningAvgPace.AvgPace for corresponding Season and DayNum |
| PtsAgstPer100 | float64 | PaceAdjusted.PtsAgstPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum |
| PtsForPer100 | float64 | PaceAdjusted.PtsForPer100 - RunningAvgPace.AvgPPH for corresponding Season and DayNum |

Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.

In [21]:
### Define get_center_pace_adjusted
def get_center_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        select
            pa.Won, pa.Season, pa.DayNum, pa.TeamID, pa.OppID, pa.Home,
            pa.pos - rap.avgpace as Pos,
            pa.opppos - rap.avgpace as OppPos,
            pa.ptsagstper100 - rap.avgpph as PtsAgstPer100,
            pa.ptsforper100 - rap.avgpph PtsForPer100
        from PaceAdjusted pa
        join RunningAvgPace rap
            on pa.season = rap.season and pa.daynum = rap.daynum
    ''', conn)
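An equivalent Pandas sketch, merging the two tables on Season and DayNum and subtracting the running averages (illustrative only, not the graded solution):

```python
import pandas as pd

def get_center_pace_adjusted_pandas(conn):
    pa = pd.read_sql('select * from PaceAdjusted', conn)
    rap = pd.read_sql('select * from RunningAvgPace', conn)
    m = pa.merge(rap, on=['Season', 'DayNum'])
    # Center the pace and scoring columns against the league-wide running averages
    m['Pos'] = m['Pos'] - m['AvgPace']
    m['OppPos'] = m['OppPos'] - m['AvgPace']
    m['PtsAgstPer100'] = m['PtsAgstPer100'] - m['AvgPPH']
    m['PtsForPer100'] = m['PtsForPer100'] - m['AvgPPH']
    return m[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home',
              'Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']]
```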
The demo cell below should display the following output:

   Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0    1    2014      94    1374   1396     1  -3.86907  -2.18907
1    0    2016     119    1224   1313     0   2.65129   5.57129
2    0    2016      90    1353   1277     0   4.05371   2.13371
3    1    2016     119    1300   1366     1  -8.78871  -9.30871
4    1    2014      94    1383   1384     1  -2.30907  -4.66907
5    1    2014      94    1190   1297     0  -9.42907  -9.62907
6    0    2014      94    1223   1101     0   0.29093   0.69093
7    0    2014      94    1218   1364     1  -1.98907  -3.58907
8    0    2014      92    1159   1221     1 -13.28500 -12.96500
9    0    2014      81    1297   1252     1   1.52248   1.76248

   PtsAgstPer100  PtsForPer100
0      -25.32271      13.22592
1        7.78985     -12.71682
2       29.99420     -19.84044
3      -31.11146      -3.56274
4      -22.13333      14.99904
5      -18.41211       5.59099
6       22.95005      14.80074
7       12.70182      -7.11914
8       22.07960      20.97894
9        3.17351       2.09897

In [22]:
### define demo inputs
### use naming convention demo_varname_ex_* to name demo variables
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex4 = {}
demo_conn_dfs_ex4['PaceAdjusted'] = pd.read_sql('select * from PaceAdjustedSampleTwo', conn)
demo_conn_dfs_ex4['RunningAvgPace'] = pd.read_sql('select * from RunningAvgPaceSample', conn)
demo_conn_ex4 = dfs_to_conn(demo_conn_dfs_ex4)
# print('PaceAdjusted')
# display(demo_conn_dfs_ex4['PaceAdjusted'])
# print('RunningAvgPace')
# display(demo_conn_dfs_ex4['RunningAvgPace'])

In [23]:
### call demo function
print(get_center_pace_adjusted(demo_conn_ex4))

   Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0    1    2014      94    1374   1396     1  -3.86907  -2.18907
1    0    2016     119    1224   1313     0   2.65129   5.57129
2    0    2016      90    1353   1277     0   4.05371   2.13371
3    1    2016     119    1300   1366     1  -8.78871  -9.30871
4    1    2014      94    1383   1384     1  -2.30907  -4.66907
5    1    2014      94    1190   1297     0  -9.42907  -9.62907
6    0    2014      94    1223   1101     0   0.29093   0.69093
7    0    2014      94    1218   1364     1  -1.98907  -3.58907
8    0    2014      92    1159   1221     1 -13.28500 -12.96500
9    0    2014      81    1297   1252     1   1.52248   1.76248

   PtsAgstPer100  PtsForPer100
0      -25.32271      13.22592
1        7.78985     -12.71682
2       29.99420     -19.84044
3      -31.11146      -3.56274
4      -22.13333      14.99904
5      -18.41211       5.59099
6       22.95005      14.80074
7       12.70182      -7.11914
8       22.07960      20.97894
9        3.17351       2.09897
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [24]:
### test_cell_ex4
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
    try:
        tester.run_test(get_center_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 5 - (3 Points):
Now that we have centered and pace-adjusted stats for each Season, DayNum, and TeamID, it's time to compute running averages to use as inputs to our model. Since we are using these stats as inputs to a predictive model, we need to introduce a lag of 1 game so that the results of the game we are trying to predict are not part of these inputs.

Complete the function get_team_stats(conn) to query the PaceAdjustedCentered (structured the same as the output of the previous exercise) table with the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name | dtype | description | Special Considerations |
|-------------|-------|-------------|------------------------|
| Won | int64 | Unchanged from PaceAdjustedCentered | |
| Season | int64 | Current Season | Unchanged from PaceAdjustedCentered |
| DayNum | int64 | Current Day | Unchanged from PaceAdjustedCentered |
| TeamID | int64 | Current TeamID | Unchanged from PaceAdjustedCentered |
| OppID | int64 | Unchanged from PaceAdjustedCentered | |
| Home | int64 | Unchanged from PaceAdjustedCentered | |
| Pos | float64 | Running average of PaceAdjustedCentered.Pos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | Round to 5 decimal places |
| OppPos | float64 | Running average of PaceAdjustedCentered.OppPos including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | Round to 5 decimal places |
| PtsAgstPer100 | float64 | Running average of PaceAdjustedCentered.PtsAgstPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | Round to 5 decimal places |
| PtsForPer100 | float64 | Running average of PaceAdjustedCentered.PtsForPer100 including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | Round to 5 decimal places |
| WinPct | float64 | Running average of PaceAdjustedCentered.Won including all rows with the current Season and TeamID with PaceAdjustedCentered.DayNum < DayNum | Round to 5 decimal places |

Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
- See "On Window Functions..." above Exercise 3 for more info on computing a running average.
- For Pandas, pd.DataFrame.groupby().apply() and pd.Series.expanding() may be useful.
- For SQL, the "window function" syntax may be useful: AVG(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)

In [25]:
### Define get_team_stats
def get_team_stats(conn):
    def window_helper(agg_col, partition_cols, sort_col):
        return f'AVG( {agg_col} ) OVER(PARTITION BY {partition_cols} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND 1 preceding)'
    return pd.read_sql(f'''
        select
            Won, Season, DayNum, TeamID, OppID, Home,
            {window_helper('pos', 'season, teamid', 'daynum')} Pos,
            {window_helper('OppPos', 'season, teamid', 'daynum')} OppPos,
            {window_helper('PtsAgstPer100', 'season, teamid', 'daynum')} PtsAgstPer100,
            {window_helper('PtsForPer100', 'season, teamid', 'daynum')} PtsForPer100,
            {window_helper('won', 'season, teamid', 'daynum')} WinPct
        from PaceAdjustedCentered
    ''', conn)
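A Pandas sketch of the same lagged running average, using shift(1) inside each (Season, TeamID) group so the current game is excluded before taking an expanding mean (illustrative only; the helper name is made up, and column order and dtypes may need a final touch-up):

```python
import pandas as pd

def get_team_stats_pandas(conn):
    pac = pd.read_sql('select * from PaceAdjustedCentered', conn)
    pac = pac.sort_values(['Season', 'TeamID', 'DayNum'])
    stat_cols = ['Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100', 'Won']

    # Within each (Season, TeamID) group, shift by one game so the current game
    # is excluded, then take an expanding mean over the remaining prior games.
    lagged = (pac.groupby(['Season', 'TeamID'])[stat_cols]
                 .transform(lambda s: s.shift(1).expanding().mean())
                 .round(5))

    out = pac[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home']].copy()
    out[['Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']] = \
        lagged[['Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']]
    out['WinPct'] = lagged['Won']   # running average of Won = prior win percentage
    return out.reset_index(drop=True)
```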
The demo cell below should display the following output:

    Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  \
0     0    2012      17    1108   1452     0       NaN       NaN
1     0    2012      25    1108   1408     0   8.25697   8.53697
2     0    2012      26    1108   1360     0   5.65314   5.93314
3     1    2012     105    1108   1115     1   2.76368   2.09702
4     0    2012      16    1355   1208     0       NaN       NaN
5     1    2012      31    1355   1237     0  -5.81233  -5.33233
6     0    2012      45    1355   1315     0   5.07403   4.81403
7     1    2012     124    1355   1237     0   4.12049   4.84049
8     0    2012      29    1431   1332     0       NaN       NaN
9     1    2012      52    1431   1155     0  -4.58353  -5.26353
10    0    2012      72    1431   1409     0  -5.65785  -7.11785
11    1    2012      79    1431   1187     1  -6.81332  -8.17332
12    1    2012     103    1431   1408     1  -6.71588  -7.96588
13    1    2017      47    1252   1399     0       NaN       NaN
14    1    2017      50    1252   1122     0   9.78901  10.22901
15    1    2017      68    1252   1244     1   7.09400   8.63400
16    0    2017      82    1252   1316     0   6.62264   8.36930
17    1    2017      86    1252   1367     1   7.55313   8.32313
18    0    2017     108    1252   1316     1   6.32893   6.93693
19    0    2017     122    1252   1316     1   6.91365   7.39365

    PtsAgstPer100  PtsForPer100   WinPct
0             NaN           NaN      NaN
1        26.44115     -18.48766  0.00000
2        21.51181     -30.28407  0.00000
3        16.34005     -20.09929  0.00000
4             NaN           NaN      NaN
5        14.27057      -2.29566  0.00000
6        13.42293       6.86026  0.50000
7        16.60255       4.68752  0.33333
8             NaN           NaN      NaN
9         2.30485      -6.73367  0.00000
10       -7.95942      -3.23226  0.50000
11       -4.29258      -8.20323  0.33333
12       -5.03189      -2.51522  0.50000
13            NaN           NaN      NaN
14       -5.62638      -2.59159  1.00000
15        0.33659      13.16336  1.00000
16       -0.57178      10.26290  1.00000
17        2.37803       7.85893  0.75000
18        2.68629       9.02187  0.80000
19        4.37327       7.30692  0.66667

In [26]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex5 = {}
demo_conn_dfs_ex5['PaceAdjustedCentered'] = pd.read_sql('select * from PaceAdjustedCenteredSample', conn).sort_values(['Season', 'TeamID', 'DayNum'])
demo_conn_ex5 = dfs_to_conn(demo_conn_dfs_ex5)
# display(demo_conn_dfs_ex5['PaceAdjustedCentered'])
In [27]:
### call demo function
print(get_team_stats(demo_conn_ex5))

    Won  Season  DayNum  TeamID  OppID  Home        Pos     OppPos  \
0     0    2012      17    1108   1452     0        NaN        NaN
1     0    2012      25    1108   1408     0   8.256970   8.536970
2     0    2012      26    1108   1360     0   5.653140   5.933140
3     1    2012     105    1108   1115     1   2.763683   2.097017
4     0    2012      16    1355   1208     0        NaN        NaN
5     1    2012      31    1355   1237     0  -5.812330  -5.332330
6     0    2012      45    1355   1315     0   5.074030   4.814030
7     1    2012     124    1355   1237     0   4.120487   4.840487
8     0    2012      29    1431   1332     0        NaN        NaN
9     1    2012      52    1431   1155     0  -4.583530  -5.263530
10    0    2012      72    1431   1409     0  -5.657850  -7.117850
11    1    2012      79    1431   1187     1  -6.813323  -8.173323
12    1    2012     103    1431   1408     1  -6.715882  -7.965883
13    1    2017      47    1252   1399     0        NaN        NaN
14    1    2017      50    1252   1122     0   9.789010  10.229010
15    1    2017      68    1252   1244     1   7.093995   8.633995
16    0    2017      82    1252   1316     0   6.622637   8.369303
17    1    2017      86    1252   1367     1   7.553130   8.323130
18    0    2017     108    1252   1316     1   6.328930   6.936930
19    0    2017     122    1252   1316     1   6.913647   7.393647

    PtsAgstPer100  PtsForPer100    WinPct
0             NaN           NaN       NaN
1       26.441150    -18.487660  0.000000
2       21.511805    -30.284065  0.000000
3       16.340047    -20.099290  0.000000
4             NaN           NaN       NaN
5       14.270570     -2.295660  0.000000
6       13.422930      6.860260  0.500000
7       16.602553      4.687520  0.333333
8             NaN           NaN       NaN
9        2.304850     -6.733670  0.000000
10      -7.959420     -3.232260  0.500000
11      -4.292583     -8.203230  0.333333
12      -5.031885     -2.515215  0.500000
13            NaN           NaN       NaN
14      -5.626380     -2.591590  1.000000
15       0.336590     13.163360  1.000000
16      -0.571780     10.262900  1.000000
17       2.378032      7.858930  0.750000
18       2.686288      9.021872  0.800000
19       4.373267      7.306922  0.666667

The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [28]:
### test_cell_ex5
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
    try:
        tester.run_test(get_team_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.
In [29]:
returned_output_vars['df'].dtypes

Out[29]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object

In [30]:
true_output_vars['df'].dtypes

Out[30]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object

Exercise 6 - (3 Points):
Up to this point we have calculated our metrics for each team before they play each game. We're almost ready to start modeling! There is still one last task.

Remember that there were two records in PaceAdjusted for each game. This is also true of LaggedTeamStats, which is a running average of that table. If "Team A" and "Team B" played a game in some Season on some DayNum, then there will be one record where "Team A" is the primary team and "Team B" is the opponent team, and one record where "Team B" is the primary team and "Team A" is the opponent team. We need to join these records together so that we will have one record identifying each game.

Complete the function get_matchup_stats(conn) to query the LaggedTeamStats table using the db connection conn and return a Pandas DataFrame with the following columns:

| Column Name | dtype | description |
|-------------|-------|-------------|
| Won | int64 | 1 if primary team (_x suffix) won the matchup, 0 otherwise |
| Season | int64 | |
| DayNum | int64 | |
| TeamID | int64 | Primary team - associated with _x suffixes |
| OppID | int64 | Opponent team - associated with _y suffixes |
| Home_x | int64 | Stats associated with primary team |
| Pos_x | float64 | Stats associated with primary team |
| OppPos_x | float64 | Stats associated with primary team |
| PtsAgstPer100_x | float64 | Stats associated with primary team |
| PtsForPer100_x | float64 | Stats associated with primary team |
| WinPct_x | float64 | Stats associated with primary team |
| Home_y | int64 | Stats associated with opponent team |
| Pos_y | float64 | Stats associated with opponent team |
| OppPos_y | float64 | Stats associated with opponent team |
| PtsAgstPer100_y | float64 | Stats associated with opponent team |
| PtsForPer100_y | float64 | Stats associated with opponent team |
| WinPct_y | float64 | Stats associated with opponent team |

Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The dtypes attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas. Pandas will give the required suffixes by default, though.

Implementation Strategy

You should start by joining/merging LaggedTeamStats to itself where the OppID in the left table is the same as the TeamID in the right table. The Season and DayNum should be the same for both tables. After joining/merging:
- Remove any rows where the "left" TeamID is larger than the "left" OppID. This will avoid duplicating the data.
- Rows occurring before DayNum 56 should not be included in the result, to make sure there is a reasonable sample of games so that the stats are meaningful.
- Drop any unnecessary columns.
- Rename the columns as necessary to meet the requirements above.
- Any rows containing NULL or NaN values should be dropped.

In [31]:
### Define get_matchup_stats
def get_matchup_stats(conn):
    df = pd.read_sql('select * from laggedteamstats', conn)
    return df.merge(df, left_on=['Season', 'DayNum', 'OppID'], right_on=['Season', 'DayNum', 'TeamID'])\
        .query('TeamID_x < OppID_x')\
        .query('DayNum >= 56')\
        .drop(columns=['Won_y', 'TeamID_y', 'OppID_y'])\
        .rename(columns={'Won_x': 'Won', 'TeamID_x': 'TeamID', 'OppID_x': 'OppID'})\
        .dropna()
The demo cell below should display the following output:

   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  \
2    1    2019      78    1210   1323       1 -0.90244  -1.03355
4    1    2011      72    1253   1413       1  2.31956   1.98356

   PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  \
2        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178
4          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252

   PtsAgstPer100_y  PtsForPer100_y  WinPct_y
2         -2.28041         6.55518   0.61111
4          3.37227        -0.80974   0.28571

In [32]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex6 = {}
demo_conn_dfs_ex6['LaggedTeamStats'] = pd.read_sql('select * from LaggedTeamStatsSample', conn)
demo_conn_ex6 = dfs_to_conn(demo_conn_dfs_ex6)
# display(pd.read_sql('select * from LaggedTeamStats', demo_conn_ex6))

In [33]:
### call demo function
print(get_matchup_stats(demo_conn_ex6))

   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  \
2    1    2019      78    1210   1323       1 -0.90244  -1.03355
4    1    2011      72    1253   1413       1  2.31956   1.98356

   PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  \
2        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178
4          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252

   PtsAgstPer100_y  PtsForPer100_y  WinPct_y
2         -2.28041         6.55518   0.61111
4          3.37227        -0.80974   0.28571

The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [34]:
### test_cell_ex6
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
    try:
        tester.run_test(get_matchup_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.
Aside - Random Forest classification model

Feel free to skip to the next exercise.

Logistic regression relies on a few assumptions about the predictive variables which are not met here (namely normality and each one individually having a significant effect on the response variable). Instead, we will use the off-the-shelf Scikit-Learn modules to implement the random forest modeling technique. We're building the model for you, but you will have to do some analysis of the predictions it makes.

In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
import pickle
import os

### Example code showing how we got X and Y from the output of `get_matchup_stats`
# stats_df = get_matchup_stats(conn)
# # Horizontally partition our stats into predictive (we hope!) variables and the response variable
# X = stats_df.drop(columns=['Won', 'Season', 'DayNum', 'TeamID', 'OppID'])  # Remove the response and identification data
# y = stats_df['Won']  # response variable. There are two classes, 0 and 1, which correspond to a loss and a win for the primary team.

# load pre-computed X and y
with open('resource/asnlib/publicdata/X.pkl', 'rb') as f:
    X = pickle.load(f)
with open('resource/asnlib/publicdata/y.pkl', 'rb') as f:
    y = pickle.load(f)

# Vertically partition data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6040)

# Check to see if we already saved the model
if not os.path.exists('resource/asnlib/publicdata/clf.pkl'):
    # if not - train and save the model
    clf = GridSearchCV(RandomForestClassifier(),
                       scoring='roc_auc',
                       param_grid={'max_depth': [2, 4, 6, 8, 9, 10]},
                       n_jobs=-1,
                       verbose=2).fit(X_train, y_train)
    with open('resource/asnlib/publicdata/clf.pkl', 'wb') as f:
        pickle.dump(clf, f)

# open the saved model from a pickle file
with open('resource/asnlib/publicdata/clf.pkl', 'rb') as f:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        clf = pickle.load(f)

# this is how you predict the probability that the response variable belongs to each class
# probs[:, 0] is the estimated probabilities of losing and probs[:, 1] is the estimated probabilities of winning.
probs = clf.predict_proba(X_test)

# this is how you predict which class each record belongs to
preds = clf.predict(X_test)

# printing some metrics about how well the model fits the test set
print(f'''
Log-Loss: \t\t {round(log_loss(y_test, probs), 4)} - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve: \t {round(roc_auc_score(y_test, probs[:, 1]), 4)} - 1 is perfect and 0.5 is considered "non-informative".
''')

Log-Loss:               0.5674 - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:   0.7775 - 1 is perfect and 0.5 is considered "non-informative".

Exercise 7 - (1 Points):
Without advanced statistical training, these fit metrics would probably not make sense. We see that they are both somewhere in between "perfect" and "non-informative", but what is that telling us? One understandable metric we can calculate is the prediction accuracy.

Complete the function pred_accuracy(obs, preds) to calculate the relative frequency with which the predicted value preds is the same as the observed value obs. You can assume that both will be array-like, 1-dimensional, and the same length, and that they only contain the integers 0 and 1. Round your result to 5 decimal places.

Note: There will be no type check. If your number matches our number you will pass.

In [36]:
### Define pred_accuracy
def pred_accuracy(obs, preds):
    return round(np.mean(obs == preds), 5)

The demo cell below should display the following output:

0.53333

In [37]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex7 = rng.integers(0, 2, 15)
demo_preds_ex7 = rng.integers(0, 2, 15)
# print('obs:  ', demo_obs_ex7)
# print('preds:', demo_preds_ex7)

In [38]:
### call demo function
pred_accuracy(demo_obs_ex7, demo_preds_ex7)

Out[38]: 0.53333

The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [39]:
### test_cell_ex7
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
    try:
        tester.run_test(pred_accuracy)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 8 - (2 Points):
Another way to evaluate our model is to check the accuracy of the probabilities it generates. Suppose you have a list of the observations obs (0's and 1's indicating whether the primary team wins (1) or not (0)) and a list of the predicted probabilities probs that the primary team won for each observation. We can bucket the predictions based on the probabilities and then calculate the accuracy of our model within each bucket by taking the mean of the observations. If the observed win % for each bucket is between or close to its bounds, we have a decent model.

Complete the function bucket_evaluation(obs, prob, n_buckets) to partition the interval [0, 1] into n_buckets equal intervals and compute the observed relative win percentage for the events in each bucket. You should return your result as a Pandas DataFrame with the following columns:

| Column Name | dtype | description | Special Considerations |
|-------------|-------|-------------|------------------------|
| lower | float64 | lower bound of bucket | round to 2 decimal places |
| upper | float64 | upper bound of bucket | round to 2 decimal places |
| obs_win_pct | float64 | win % for observations where lower <= estimated win probability < upper | round to 2 decimal places (as a percent) |
| n | int64 | number of observations where lower <= estimated win probability < upper | |

Notes:
- The columns should be in the exact order given above, and the records should be sorted with the 'lower' column in ascending order and a reset index.
- The dtypes attribute of your result must match exactly.
- The bounds should be rounded before deciding which bucket each observation belongs in.
- The "obs_win_pct" column should have NaN entries for any "empty" buckets which have no observations.
- Your solution raising a warning will not cause it to fail the test cell.

Implementation Strategy
- np.linspace() can be helpful for calculating the bounds for each bucket. Don't forget to round!
- The expression zip(bounds[:-1], bounds[1:]) can be iterated over to sequentially examine each pair of bounds (bounds would be the output of linspace after rounding).
- Use each pair of bounds to filter the obs to only include observations where the prob is between the lower and upper bounds.
- You can calculate the observed relative frequency and count from the filtered results.
- Keep track of the endpoints, relative frequency, and count for each interval.
- After evaluating all intervals, construct a DataFrame using the stored results for each interval.

In [49]:
### Define bucket_evaluation
def bucket_evaluation(obs, prob, n_buckets):
    bounds = np.round(np.linspace(0, 1, n_buckets+1), 2)
    lod = []
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        inds = np.argwhere((prob >= lower) & (prob < upper)).reshape((-1,))
        n = inds.shape[0]
        win_pct = 100*np.mean(obs[inds])
        d = {'lower': lower, 'upper': upper, 'obs_win_pct': win_pct, 'n': n}
        lod.append(d)
    return pd.DataFrame(lod).apply(lambda f: round(f, 2))

The demo cell below should display the following output:

   lower  upper  obs_win_pct   n
0   0.00   0.25        25.00   4
1   0.25   0.50        23.08  13
2   0.50   0.75        66.67   3
3   0.75   1.00        40.00   5

Unlike previous exercises, the order and index must match (in addition to the usual requirements) to produce an easily understood result.

In [50]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex8 = rng.integers(0, 2, 25)
demo_prob_ex8 = rng.random(25)
demo_n_buckets_ex8 = 4
In [51]:
### call demo function
print(bucket_evaluation(demo_obs_ex8, demo_prob_ex8, demo_n_buckets_ex8))

[ 1  2  3 16]
[ 0  5  8 11 12 13 17 18 19 20 21 22 23]
[ 4  9 14]
[ 6  7 10 15 24]
None

The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
- input_vars - Input variables for your solution.
- original_input_vars - Copy of input variables from prior to running your solution. These should be the same as input_vars - otherwise the inputs were modified by your solution.
- returned_output_vars - Outputs returned by your solution.
- true_output_vars - The expected output. This should "match" returned_output_vars based on the question requirements - otherwise, your solution is not returning the correct output.

In [43]:
### test_cell_ex8
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
    try:
        tester.run_test(bucket_evaluation)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.
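As a cross-check of the loop-based bucketing above, roughly the same table can be built with pd.cut using right-open intervals. This is an illustrative sketch only; edge handling for probabilities exactly equal to 1.0 and the exact dtypes may differ slightly from the graded requirements:

```python
import numpy as np
import pandas as pd

def bucket_evaluation_cut(obs, prob, n_buckets):
    # Same bucket edges as the loop version: n_buckets equal intervals over [0, 1]
    bounds = np.round(np.linspace(0, 1, n_buckets + 1), 2)
    df = pd.DataFrame({'obs': obs, 'prob': prob})
    # right=False gives [lower, upper) buckets, matching the loop's filter
    df['bucket'] = pd.cut(df['prob'], bins=bounds, right=False)
    grouped = df.groupby('bucket', observed=False)['obs'].agg(['mean', 'size'])
    return pd.DataFrame({
        'lower': bounds[:-1].astype('float64'),
        'upper': bounds[1:].astype('float64'),
        'obs_win_pct': (100 * grouped['mean'].to_numpy()).round(2),
        'n': grouped['size'].to_numpy().astype('int64'),
    })
```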
In [44]:
from pandas.testing import assert_frame_equal
assert_frame_equal(returned_output_vars["df"], true_output_vars["df"])

/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
In [138]:
returned_output_vars["df"]

Out[138]:
    lower  upper  obs_win_pct  n
0    0.00   0.08        33.33  3
1    0.08   0.17        75.00  4
2    0.17   0.25        55.56  9
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        57.14  7
11   0.92   1.00        33.33  3

In [139]:
true_output_vars["df"]

Out[139]:
    lower  upper  obs_win_pct  n
0    0.00   0.08         0.00  2
1    0.08   0.17        66.67  6
2    0.17   0.25        62.50  8
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        62.50  8
11   0.92   1.00         0.00  2

Fin. If you have made it this far, congratulations on completing the semester. Don't forget to submit!