Final Exam, Spring 2022: March (through May) Madness
Version 1.0
All of the header information is important. Please read it.
Topics, number of exercises:
This problem builds on your knowledge of Pandas, SQL, and numeric computation. It has 9 exercises, numbered 0 to 8.
There are 19 available points. However, to earn 100% the threshold is 12 points. (Therefore, once you hit 12 points, you can stop. There is no extra credit for exceeding this threshold.)
Exercise ordering: Each exercise builds logically on previous exercises, but you may solve them in any order. That is, if you can't solve an exercise, you can still move on and try the next one. Use this to your advantage, as the exercises are not necessarily ordered in terms of difficulty. Higher point values generally indicate more difficult exercises.
Demo cells: Code cells starting with the comment ### define demo inputs load results from prior exercises applied to the entire data set and use those to build demo inputs. These must be run for subsequent demos to work properly, but they do not affect the test cells. The data loaded in these cells may be rather large (at least in terms of human readability). You are free to print or otherwise use Python to explore them, but we did not print them in the starter code.
Debugging your code: Right before each exercise test cell, there is a block of text explaining the variables available to you for debugging. You may use these to test your code and can print/display them as needed (be careful when printing large objects; you may want to print the head or chunks of rows at a time).
Exercise point breakdown:
- Exercise 0: 1 point
- Exercise 1: 3 points
- Exercise 2: 1 point
- Exercise 3: 3 points
- Exercise 4: 2 points
- Exercise 5: 3 points
- Exercise 6: 3 points
- Exercise 7: 1 point
- Exercise 8: 2 points
Final reminders:
- Submit after every exercise.
- Review the generated grade report after you submit to see what errors were returned.
- Stay calm, skip problems as needed, and take short breaks at your leisure.
Basketball basics

In this notebook we want to predict which team will win a basketball game based on their past performance in a given season. You do not have to know anything about basketball aside from the background below to complete this notebook.

Games. A basketball game is played between two teams. In most games, there is a "home" team and an "away" team. However, sometimes games are played at a neutral site, in which case neither team is "home" nor "away."

Scoring and winning. In a game, the team that scores more points wins. There are no ties.

Possessions. A possession is an event where one team continuously controls the ball.
In [1]:
import sqlite3
import numpy as np
import pandas as pd
import re
conn = sqlite3.connect('resource/asnlib/publicdata/basketball_db.sqlite')
Exercise 0 - (1 Point):

We have a big collection of real data from several seasons of men's NCAA basketball. We are most interested in one table of this database, named `MRegularSeasonDetailedResults`.

Complete the function `get_cols(conn)` to return a `list` of the columns in the `MRegularSeasonDetailedResults` table in the db connection `conn`. The order of the columns in the list should be the same as the order they appear in the database table.
In [2]:
### Define get_cols
def get_cols(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''select * from MRegularSeasonDetailedResults limit 1''', conn)\
        .columns\
        .tolist()
The demo cell below should display the following output:

['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
In [3]:
### call demo function
print(get_cols(conn))
The cell below will test your solution for Exercise 0. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [4]:
### test_cell_ex0
from tester_fw.testers import Tester_ex0
tester = Tester_ex0()
for _ in range(20):
    try:
        tester.run_test(get_cols)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
['index', 'Season', 'DayNum', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'WLoc', 'NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl', 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl', 'LBlk', 'LPF']
initializing tester_fw.tester_6040
Passed! Please submit.
Abridged data dictionary

In the `MRegularSeasonDetailedResults` table, each record is one game. All columns prefixed by 'W' refer to the winning team, and columns prefixed by 'L' refer to the losing team. We are primarily interested in the following columns:
| Column | Type | Has Prefix | Description |
|---|---|---|---|
| Season | Integer | No | Identifies the calendar year when a season ends. |
| DayNum | Integer | No | Identifies the day in a season when a game occurred. |
| TeamID | Integer | W or L | Identifies the teams participating in a game. |
| Score | Integer | W or L | Number of points a team scored in a game. |
| Loc | Char | W only | Identifies whether the winning team was home ('H'), away ('A'), or the game was played at a neutral site ('N'). |
| FGA | Integer | W or L | Number of field goal attempts a team had in a game. |
| FTA | Integer | W or L | Number of free throw attempts a team had in a game. |
| TO | Integer | W or L | Number of turnovers a team had in a game. |
| OR | Integer | W or L | Number of offensive rebounds a team had in a game. |
For example, the column with the winning team's ID will be WTeamID, and the column with the losing team's free-throw attempts will be LFTA.
Exercise 1 - (3 Points):
We want to extract some information about each team involved in a game from that team's perspective. We will use the term "primary team" to refer to this team and "primary team's opponent" to refer to the other team in the game. Do not infer any other meaning from this term. Since there are two participants in each game, there will be a record from the perspective of both teams after the extraction (i.e., one record where the winning team is the primary team and one record where the losing team is the primary team will be in the result).

Complete the function `get_pace_adjusted(conn)` to query the table `MRegularSeasonDetailedResults` using the db connection `conn` and return a Pandas DataFrame with the columns outlined below. The "Winning Team" and "Losing Team" headers in the chart below indicate which columns in `MRegularSeasonDetailedResults` to use in your computations.
| Column Name | dtype | Description | Special Considerations | Winning Team | Losing Team |
|---|---|---|---|---|---|
| Won | int64 | 1 if the primary team won the game, 0 if the primary team lost | | 1 | 0 |
| Season | int64 | Current Season | | | |
| DayNum | int64 | Current DayNum | | | |
| TeamID | int64 | Team id for primary team | | WTeamID | LTeamID |
| OppID | int64 | Team id for primary team's opponent | | LTeamID | WTeamID |
| Home | int64 | 1 if the primary team is at home for the game, 0 otherwise. | | WLoc is 'H' | WLoc is 'A' |
| Pos | float64 | Estimated number of possessions for primary team | Round to 5 decimal places | W prefix | L prefix |
| OppPos | float64 | Estimated number of possessions for primary team's opponent | Round to 5 decimal places | L prefix | W prefix |
| PtsForPer100 | float64 | 100 * (primary team's score) / (estimate of primary team's possessions) | Round to 5 decimal places (only round after division) | W prefix | L prefix |
| PtsAgstPer100 | float64 | 100 * (primary team's opponent's score) / (estimate of primary team's opponent's possessions) | Round to 5 decimal places (only round after division) | L prefix | W prefix |
There is no column for possessions in our source data. We will need to estimate it! The formula below is widely used in sports analytics to estimate the number of possessions a team had in a basketball game (Pos is # of possessions):

Pos = FGA + 0.44 * FTA + TO - OR
You can derive all required results from these columns: `Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO`.
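To make the formula concrete, here is a minimal sketch of the possession estimate applied to one game's box-score numbers (the values come from the demo row used later in this exercise; the function name is illustrative, not part of the exercise API):

```python
# Illustrative only: the possession estimate Pos = FGA + 0.44*FTA + TO - OR.
def estimate_possessions(fga, fta, to, oreb):
    return fga + 0.44 * fta + to - oreb

# Demo row (Season 2003, DayNum 10): winning team, then losing team.
print(estimate_possessions(fga=58, fta=18, to=23, oreb=14))  # 74.92
print(estimate_possessions(fga=53, fta=22, to=18, oreb=10))  # 70.68
```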
Notes:
- Each record in the database table `MRegularSeasonDetailedResults` will correspond to two records in the result: one record where the winning team is the primary team and one record where the losing team is the primary team.
- For neutral site games, neither team will be home or away. The `WLoc` column will have an 'N' entry.
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
- The `UNION ALL` phrase may be helpful if you are using SQL.
- The `pd.concat()` method may be helpful if you are using Pandas.
- One strategy that will work is making a table/df for all the winning teams and another for all the losing teams, then vertically "stacking" them (see the Pandas sketch below).
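If you prefer the Pandas route, here is a minimal sketch of that stacking strategy, assuming `df` holds the raw `MRegularSeasonDetailedResults` rows; it mirrors the SQL solution below, and the helper names are illustrative:

```python
# Illustrative Pandas sketch: build one frame per perspective, then stack them.
def pace_adjusted_pandas(df):
    def one_side(won, p, o, home_loc):
        # p/o are the column prefixes ('W'/'L') for the primary team and its opponent.
        pos = df[f'{p}FGA'] + 0.44 * df[f'{p}FTA'] + df[f'{p}TO'] - df[f'{p}OR']
        opp_pos = df[f'{o}FGA'] + 0.44 * df[f'{o}FTA'] + df[f'{o}TO'] - df[f'{o}OR']
        return pd.DataFrame({
            'Won': won,
            'Season': df['Season'], 'DayNum': df['DayNum'],
            'TeamID': df[f'{p}TeamID'], 'OppID': df[f'{o}TeamID'],
            'Home': (df['WLoc'] == home_loc).astype('int64'),
            'Pos': pos.round(5), 'OppPos': opp_pos.round(5),
            'PtsForPer100': (100 * df[f'{p}Score'] / pos).round(5),
            'PtsAgstPer100': (100 * df[f'{o}Score'] / opp_pos).round(5),
        })
    winners = one_side(1, 'W', 'L', 'H')  # primary team won; home iff WLoc == 'H'
    losers = one_side(0, 'L', 'W', 'A')   # primary team lost; home iff WLoc == 'A'
    return pd.concat([winners, losers], ignore_index=True)
```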
In [5]:
### Define get_pace_adjusted
def get_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        -- winning team
        select
            1 as Won,
            season, DayNum,
            wteamid as TeamID,
            lteamid as OppID,
            wloc='H' as Home,
            wFGA + 0.44*(wFTA) + wTO - wOR as Pos,
            lFGA + 0.44*(lFTA) + lTO - lOR as OppPos,
            100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsForPer100,
            100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsAgstPer100
        from mregularseasondetailedresults
        union all
        -- losing team
        select
            0 as Won,
            season, DayNum,
            lteamid as TeamID,
            wteamid as OppID,
            wloc='A' as Home,
            lFGA + 0.44*(lFTA) + lTO - lOR as Pos,
            wFGA + 0.44*(wFTA) + wTO - wOR as OppPos,
            100 * lscore / (lFGA + 0.44*(lFTA) + lTO - lOR) as PtsForPer100,
            100 * wscore / (wFGA + 0.44*(wFTA) + wTO - wOR) as PtsAgstPer100
        from mregularseasondetailedresults
    ''', conn).apply(lambda f: round(f, 5))
This is the demo input:

   Season  DayNum  WTeamID  WScore  LTeamID  LScore WLoc  WFGA  WFTA  WOR  WTO  LFGA  LFTA  LOR  LTO
0    2003      10     1104      68     1328      62    N    58    18   14   23    53    22   10   18
The demo cell below should display the following output:
   Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100  PtsAgstPer100
0    1    2003      10    1104   1328     0  74.92   70.68      90.76348       87.71930
1    0    2003      10    1328   1104     0  70.68   74.92      87.71930       90.76348
Note that there are two rows, but they come from a single input row.
In [6]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_1 = pd.read_sql('select Season, DayNum, WTeamID, WScore, LTeamID, LScore, WLoc, WFGA, WFTA, WOR, WTO, LFGA, LFTA, LOR, LTO from MRegularSeasonDetailedResults limit 1', conn)
demo_conn_1 = dfs_to_conn({'MRegularSeasonDetailedResults': demo_df_1})
# demo_df_1
In [7]:
### call demo function
print(get_pace_adjusted(demo_conn_1))
   Won  Season  DayNum  TeamID  OppID  Home    Pos  OppPos  PtsForPer100  PtsAgstPer100
0    1    2003      10    1104   1328     0  74.92   70.68      90.76348       87.71930
1    0    2003      10    1328   1104     0  70.68   74.92      87.71930       90.76348
The cell below will test your solution for Exercise 1. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [8]:
### test_cell_ex1
from tester_fw.testers import Tester_ex1
tester = Tester_ex1()
for _ in range(20):
    try:
        tester.run_test(get_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.
Exercise 2 - (1 Point):
Our next task is to compute daily totals of `Pos`, `OppPos`, `PtsForPer100`, and `PtsAgstPer100`, as well as a count of the number of teams participating in games that day.

Complete the function `daily_totals(conn)` to query the table `PaceAdjusted` (structured the same as the output of the previous exercise) in the db connection `conn` and return a Pandas DataFrame with the following columns:
| Column Name | dtype | Description |
|---|---|---|
| Season | int64 | |
| DayNum | int64 | |
| Pos | float64 | Sum of values in `Pos` column of `PaceAdjusted` occurring on a unique combination of `Season` and `DayNum` |
| Pph | float64 | Sum of values in `PtsForPer100` column of `PaceAdjusted` occurring on a unique combination of `Season` and `DayNum` |
| Count | int64 | Count of rows in `PaceAdjusted` with a unique combination of `Season` and `DayNum` |
Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
In [9]:
### Define daily_totals
def daily_totals(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        select
            Season, DayNum,
            sum(pos) as Pos,
            sum(ptsforper100) as Pph,
            count(*) as Count
        from
            paceadjusted
        group by
            season, daynum
    ''', conn)
The demo cell below should display the following output:
   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3
In [10]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_df_2 = pd.read_sql('select * from PaceAdjustedSample', conn)
demo_conn_2 = dfs_to_conn({'PaceAdjusted': demo_df_2})
# demo_df_2.sort_values(['Season', 'DayNum']).reset_index(drop=True)
In [11]:
### call demo function
print(daily_totals(demo_conn_2))
The cell below will test your solution for Exercise 2. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [12]:
### test_cell_ex2
from tester_fw.testers import Tester_ex2
tester = Tester_ex2()
for _ in range(20):
    try:
        tester.run_test(daily_totals)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')

initializing tester_fw.tester_6040
Passed! Please submit.
On Window Functions...
This is useful information.

Window functions define a slice of data relative to each row in a dataset and then perform an aggregate or order-based calculation on the slice. For example, say you want to perform the following calculation for each row in `DailyTotals`:

- `SumPos` - Sum of the `Pos` column for all rows where that row's `Season` equals the current row's `Season` and that row's `DayNum` is at most the current row's `DayNum`.

Window functions can do this. Below are examples of this computation in Pandas and SQL. First, we will load some data to demonstrate.
In [13]:
# define demo data
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
daily_totals_df = demo_conn_dfs_ex3['DailyTotals']

The sample data:

   Season  DayNum     Pos        Pph  Count
0    2010      94  277.28  433.33581      4
1    2010     108  281.64  415.62023      4
2    2010     122  265.88  395.26418      4
3    2016      49  365.04  467.93858      5
4    2016     115  204.08  309.66380      3
The code below will return the sample data with a new column "SumPos".
In [14]:
### Pandas Example
def pandas_window_helper(group):
    group = group.sort_values('DayNum')
    group['SumPos'] = group['Pos'].expanding(1).sum()
    return group

pandas_ex = daily_totals_df.groupby('Season', as_index=False).apply(pandas_window_helper)
The code below will return the sample data with a new column "SumPos".
In [15]:
### SQL Example
query = '''
SELECT
    *,
    SUM(pos) OVER(PARTITION BY Season ORDER BY DayNum ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as SumPos
FROM
    DailyTotals
ORDER BY
    Season, DayNum
'''
sql_ex = pd.read_sql(query, demo_conn_ex3)
The code below will print the result and verify that the two approaches give equivalent results.
In [16]:
display(sql_ex)
print(f'''SQL and Pandas results match? {((sql_ex.reset_index(drop=True) - pandas_ex.reset_index(drop=True)).abs() <= 0.00001).all().all()}''')
   Season  DayNum       Pos          Pph  Count    SumPos
0    2014      12   8725.84  13482.43561    128   8725.84
1    2014      15   6146.96   9299.04948     88  14872.80
2    2014      18   7301.64  11764.86247    108  22174.44
3    2014      41   4452.52   6702.86892     66  26626.96
4    2014      44   5381.96   8625.92495     82  32008.92
5    2014      59   8673.96  13265.97248    128  40682.88
6    2014     102   1400.72   2062.01746     20  42083.60
7    2014     127   1696.20   2684.90264     26  43779.80
8    2021      29   4150.56   5600.57650     58   4150.56
9    2021      96  13485.76  19887.61275    194  17636.32
SQL and Pandas results match? True

Exercise 3 - (3 Points):
Our next task is to compute the weighted running average for `Pos` and `Pph` for all games played in a `Season` up to and including each `DayNum`. We want a snapshot of these averages as if it were that particular day and we had no knowledge of events occurring after that day.

Complete the function `get_running_avg_pace(conn)` to query the table `DailyTotals` (structured the same as the output of the previous exercise) using the db connection `conn` and return a Pandas DataFrame with the columns mentioned below. You should calculate the intermediate values, then use division to get the end result.
Intermediate values:
- `SumPos` - Running sum of the "Pos" column for the current "Season" up to and including the current "DayNum".
- `SumPph` - Running sum of the "Pph" column for the current "Season" up to and including the current "DayNum".
- `SumCount` - Running sum of the "Count" column for the current "Season" up to and including the current "DayNum".
| Column Name | dtype | Description | Special Considerations |
|---|---|---|---|
| Season | int64 | Current Season | |
| DayNum | int64 | Current DayNum | |
| AvgPace | float64 | SumPos / SumCount | Round to 5 decimal places |
| AvgPPH | float64 | SumPph / SumCount | Round to 5 decimal places |
Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
- See "On Window Functions..." above this exercise for more info on computing a running sum.
- For Pandas, `pd.DataFrame.groupby().apply()` and `pd.Series.expanding()` may be useful in calculating intermediate values (see the Pandas sketch below).
- For SQL, the "window function" syntax may be useful in calculating intermediate values: `SUM(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)`
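If you'd rather stay in Pandas for this exercise, here is a minimal sketch of the running-average computation built from `groupby` and `expanding`, assuming `df` is the `DailyTotals` table; it is one way to realize the hints above, not the only correct approach:

```python
# Illustrative Pandas sketch: per-Season running averages, ordered by DayNum.
def running_avg_pace_pandas(df):
    df = df.sort_values(['Season', 'DayNum']).reset_index(drop=True)
    g = df.groupby('Season')
    # Running sums up to and including the current DayNum.
    sum_pos = g['Pos'].expanding().sum().reset_index(drop=True)
    sum_pph = g['Pph'].expanding().sum().reset_index(drop=True)
    sum_count = g['Count'].expanding().sum().reset_index(drop=True)
    out = df[['Season', 'DayNum']].copy()
    out['AvgPace'] = (sum_pos / sum_count).round(5)
    out['AvgPPH'] = (sum_pph / sum_count).round(5)
    return out
```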
In [17]:
### Define def get_running_avg_pace
def get_running_avg_pace(conn):
    def window_helper(agg_col, partition_col, sort_col):
        return f'SUM({agg_col}) OVER(PARTITION BY {partition_col} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)'
    # return window_helper('Pos', 'Season', 'DayNum')
    intermediate_result = pd.read_sql(f'''
        select
            Season, DayNum,
            {window_helper('Pos', 'Season', 'DayNum')} SumPos,
            {window_helper('Pph', 'Season', 'DayNum')} SumPph,
            {window_helper('Count', 'Season', 'DayNum')} SumCount
        from
            DailyTotals
    ''', conn)
    df = intermediate_result
    df['AvgPace'] = df['SumPos'] / df['SumCount']
    df['AvgPPH'] = df['SumPph'] / df['SumCount']
    return df.drop(columns=['SumPos', 'SumPph', 'SumCount'])
The demo cell below should display the following output:
   Season  DayNum   AvgPace     AvgPPH
0    2014      12  68.17063  105.33153
1    2014      15  68.85556  105.46984
2    2014      18  68.43963  106.62453
3    2014      41  68.27426  105.76722
4    2014      44  67.81551  105.66767
5    2014      59  67.80480  105.23519
6    2014     102  67.87677  105.16634
7    2014     127  67.77059  105.08984
8    2021      29  71.56138   96.56166
9    2021      96  69.98540  101.14361
In [18]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex3 = {}
demo_conn_dfs_ex3['DailyTotals'] = pd.read_sql('select * from DailyTotalsSample', conn)
demo_conn_ex3 = dfs_to_conn(demo_conn_dfs_ex3)
# demo_conn_dfs_ex3['DailyTotals'].sort_values(['Season', 'DayNum']).reset_index(drop=True)
In [19]:
### call demo function
print(get_running_avg_pace(demo_conn_ex3))
The cell below will test your solution for Exercise 3. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [20]:
### test_cell_ex3
from tester_fw.testers import Tester_ex3
tester = Tester_ex3()
for _ in range(20):
    try:
        tester.run_test(get_running_avg_pace)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
   Season  DayNum    AvgPace      AvgPPH
0    2014      12  68.170625  105.331528
1    2014      15  68.855556  105.469838
2    2014      18  68.439630  106.624530
3    2014      41  68.274256  105.767222
4    2014      44  67.815508  105.667673
5    2014      59  67.804800  105.235190
6    2014     102  67.876774  105.166341
7    2014     127  67.770588  105.089836
8    2021      29  71.561379   96.561664
9    2021      96  69.985397  101.143608
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 4 - (2 Points):
Suppose you are given two SQL tables:
1. `PaceAdjusted`, structured the same as the output of Exercise 1.
2. `RunningAvgPace`, structured the same as the output of Exercise 3.

Complete the function `get_center_pace_adjusted(conn)` to query tables `RunningAvgPace` and `PaceAdjusted` and return a Pandas DataFrame with the following columns:
| Column Name | dtype | Description |
|---|---|---|
| Won | int64 | Unchanged from `PaceAdjusted` |
| Season | int64 | Unchanged from `PaceAdjusted` |
| DayNum | int64 | Unchanged from `PaceAdjusted` |
| TeamID | int64 | Unchanged from `PaceAdjusted` |
| OppID | int64 | Unchanged from `PaceAdjusted` |
| Home | int64 | Unchanged from `PaceAdjusted` |
| Pos | float64 | `PaceAdjusted.Pos` - `RunningAvgPace.AvgPace` for corresponding `Season` and `DayNum` |
| OppPos | float64 | `PaceAdjusted.OppPos` - `RunningAvgPace.AvgPace` for corresponding `Season` and `DayNum` |
| PtsAgstPer100 | float64 | `PaceAdjusted.PtsAgstPer100` - `RunningAvgPace.AvgPPH` for corresponding `Season` and `DayNum` |
| PtsForPer100 | float64 | `PaceAdjusted.PtsForPer100` - `RunningAvgPace.AvgPPH` for corresponding `Season` and `DayNum` |
Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas (see the Pandas sketch below).
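For the Pandas route, a minimal sketch of the merge-and-subtract approach could look like the following (assuming the two tables are first read into DataFrames; the function name is illustrative):

```python
# Illustrative Pandas sketch: subtract each day's running averages from PaceAdjusted.
def center_pace_adjusted_pandas(conn):
    pa = pd.read_sql('select * from PaceAdjusted', conn)
    rap = pd.read_sql('select * from RunningAvgPace', conn)
    m = pa.merge(rap, on=['Season', 'DayNum'])  # align on Season and DayNum
    m['Pos'] = m['Pos'] - m['AvgPace']
    m['OppPos'] = m['OppPos'] - m['AvgPace']
    m['PtsAgstPer100'] = m['PtsAgstPer100'] - m['AvgPPH']
    m['PtsForPer100'] = m['PtsForPer100'] - m['AvgPPH']
    return m[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home',
              'Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']]
```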
In [21]:
### Define get_center_pace_adjusted
def get_center_pace_adjusted(conn):
    ###
    ### YOUR CODE HERE
    ###
    return pd.read_sql('''
        select
            pa.Won, pa.Season, pa.DayNum, pa.TeamID, pa.OppID, pa.Home,
            pa.pos - rap.avgpace as Pos,
            pa.opppos - rap.avgpace as OppPos,
            pa.ptsagstper100 - rap.avgpph as PtsAgstPer100,
            pa.ptsforper100 - rap.avgpph as PtsForPer100
        from
            PaceAdjusted pa
        join
            RunningAvgPace rap
        on
            pa.season = rap.season and pa.daynum = rap.daynum
    ''', conn)
The demo cell below should display the following output:
   Won  Season  DayNum  TeamID  OppID  Home        Pos     OppPos  PtsAgstPer100  PtsForPer100
0    1    2014      94    1374   1396     1   -3.86907   -2.18907      -25.32271      13.22592
1    0    2016     119    1224   1313     0    2.65129    5.57129        7.78985     -12.71682
2    0    2016      90    1353   1277     0    4.05371    2.13371       29.99420     -19.84044
3    1    2016     119    1300   1366     1   -8.78871   -9.30871      -31.11146      -3.56274
4    1    2014      94    1383   1384     1   -2.30907   -4.66907      -22.13333      14.99904
5    1    2014      94    1190   1297     0   -9.42907   -9.62907      -18.41211       5.59099
6    0    2014      94    1223   1101     0    0.29093    0.69093       22.95005      14.80074
7    0    2014      94    1218   1364     1   -1.98907   -3.58907       12.70182      -7.11914
8    0    2014      92    1159   1221     1  -13.28500  -12.96500       22.07960      20.97894
9    0    2014      81    1297   1252     1    1.52248    1.76248        3.17351       2.09897
In [22]:
### define demo inputs
### use naming convention demo_varname_ex_* to name demo variables
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex4 = {}
demo_conn_dfs_ex4['PaceAdjusted'] = pd.read_sql('select * from PaceAdjustedSampleTwo', conn)
demo_conn_dfs_ex4['RunningAvgPace'] = pd.read_sql('select * from RunningAvgPaceSample', conn)
demo_conn_ex4 = dfs_to_conn(demo_conn_dfs_ex4)
# print('PaceAdjusted')
# display(demo_conn_dfs_ex4['PaceAdjusted'])
# print('RunningAvgPace')
# display(demo_conn_dfs_ex4['RunningAvgPace'])
In [23]:
### call demo function
print(get_center_pace_adjusted(demo_conn_ex4))
   Won  Season  DayNum  TeamID  OppID  Home        Pos     OppPos  PtsAgstPer100  PtsForPer100
0    1    2014      94    1374   1396     1   -3.86907   -2.18907      -25.32271      13.22592
1    0    2016     119    1224   1313     0    2.65129    5.57129        7.78985     -12.71682
2    0    2016      90    1353   1277     0    4.05371    2.13371       29.99420     -19.84044
3    1    2016     119    1300   1366     1   -8.78871   -9.30871      -31.11146      -3.56274
4    1    2014      94    1383   1384     1   -2.30907   -4.66907      -22.13333      14.99904
5    1    2014      94    1190   1297     0   -9.42907   -9.62907      -18.41211       5.59099
6    0    2014      94    1223   1101     0    0.29093    0.69093       22.95005      14.80074
7    0    2014      94    1218   1364     1   -1.98907   -3.58907       12.70182      -7.11914
8    0    2014      92    1159   1221     1  -13.28500  -12.96500       22.07960      20.97894
9    0    2014      81    1297   1252     1    1.52248    1.76248        3.17351       2.09897
The cell below will test your solution for Exercise 4. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [24]:
### test_cell_ex4
from tester_fw.testers import Tester_ex4
tester = Tester_ex4()
for _ in range(20):
    try:
        tester.run_test(get_center_pace_adjusted)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 5 - (3 Points):
Now that we have centered, pace-adjusted stats for each Season, DayNum, and TeamID, it's time to compute running averages to use as inputs to our model. Since we are using these stats as inputs to a predictive model, we need to introduce a lag of 1 game so that the results of the game we are trying to predict are not part of these inputs.

Complete the function `get_team_stats(conn)` to query the `PaceAdjustedCentered` table (structured the same as the output of the previous exercise) with the db connection `conn` and return a Pandas DataFrame with the following columns:
| Column Name | dtype | Description | Special Considerations |
|---|---|---|---|
| Won | int64 | Unchanged from `PaceAdjustedCentered` | |
| Season | int64 | Current Season. Unchanged from `PaceAdjustedCentered` | |
| DayNum | int64 | Current Day. Unchanged from `PaceAdjustedCentered` | |
| TeamID | int64 | Current TeamID. Unchanged from `PaceAdjustedCentered` | |
| OppID | int64 | Unchanged from `PaceAdjustedCentered` | |
| Home | int64 | Unchanged from `PaceAdjustedCentered` | |
| Pos | float64 | Running average of `PaceAdjustedCentered.Pos` over all rows with the current `Season` and `TeamID` and with `PaceAdjustedCentered.DayNum` < the current `DayNum` | Round to 5 decimal places |
| OppPos | float64 | Running average of `PaceAdjustedCentered.OppPos` over all rows with the current `Season` and `TeamID` and with `PaceAdjustedCentered.DayNum` < the current `DayNum` | Round to 5 decimal places |
| PtsAgstPer100 | float64 | Running average of `PaceAdjustedCentered.PtsAgstPer100` over all rows with the current `Season` and `TeamID` and with `PaceAdjustedCentered.DayNum` < the current `DayNum` | Round to 5 decimal places |
| PtsForPer100 | float64 | Running average of `PaceAdjustedCentered.PtsForPer100` over all rows with the current `Season` and `TeamID` and with `PaceAdjustedCentered.DayNum` < the current `DayNum` | Round to 5 decimal places |
| WinPct | float64 | Running average of `PaceAdjustedCentered.Won` over all rows with the current `Season` and `TeamID` and with `PaceAdjustedCentered.DayNum` < the current `DayNum` | Round to 5 decimal places |
Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas.
- See "On Window Functions..." above Exercise 3 for more info on computing a running average.
- For Pandas, `pd.DataFrame.groupby().apply()` and `pd.Series.expanding()` may be useful (see the Pandas sketch below).
- For SQL, the "window function" syntax may be useful: `AVG(...) OVER(PARTITION BY ... ORDER BY ... ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)`
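For the Pandas route, the key wrinkle is the one-game lag: a `shift(1)` before `expanding().mean()` excludes the current game, playing the role of "... AND 1 PRECEDING" in the SQL window. A minimal sketch, assuming `df` is the `PaceAdjustedCentered` table (function names are illustrative):

```python
# Illustrative Pandas sketch: lagged (exclusive) running averages per Season/TeamID.
def team_stats_pandas(df):
    df = df.sort_values(['Season', 'TeamID', 'DayNum']).reset_index(drop=True)

    def lagged_mean(s):
        # shift(1) drops the current game so it never leaks into its own inputs.
        return s.shift(1).expanding().mean().round(5)

    g = df.groupby(['Season', 'TeamID'])
    out = df[['Won', 'Season', 'DayNum', 'TeamID', 'OppID', 'Home']].copy()
    for col in ['Pos', 'OppPos', 'PtsAgstPer100', 'PtsForPer100']:
        out[col] = g[col].transform(lagged_mean)
    out['WinPct'] = g['Won'].transform(lagged_mean)
    return out
```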
In [25]:
### Define get_team_stats
def get_team_stats(conn):
    def window_helper(agg_col, partition_cols, sort_col):
        return f'AVG({agg_col}) OVER(PARTITION BY {partition_cols} ORDER BY {sort_col} ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)'
    return pd.read_sql(f'''
        select
            Won, Season, DayNum, TeamID, OppID, Home,
            {window_helper('pos', 'season, teamid', 'daynum')} Pos,
            {window_helper('OppPos', 'season, teamid', 'daynum')} OppPos,
            {window_helper('PtsAgstPer100', 'season, teamid', 'daynum')} PtsAgstPer100,
            {window_helper('PtsForPer100', 'season, teamid', 'daynum')} PtsForPer100,
            {window_helper('won', 'season, teamid', 'daynum')} WinPct
        from
            PaceAdjustedCentered
    ''', conn)
The demo cell below should display the following output:
    Won  Season  DayNum  TeamID  OppID  Home       Pos    OppPos  PtsAgstPer100  PtsForPer100   WinPct
0     0    2012      17    1108   1452     0       NaN       NaN            NaN           NaN      NaN
1     0    2012      25    1108   1408     0   8.25697   8.53697       26.44115     -18.48766  0.00000
2     0    2012      26    1108   1360     0   5.65314   5.93314       21.51181     -30.28407  0.00000
3     1    2012     105    1108   1115     1   2.76368   2.09702       16.34005     -20.09929  0.00000
4     0    2012      16    1355   1208     0       NaN       NaN            NaN           NaN      NaN
5     1    2012      31    1355   1237     0  -5.81233  -5.33233       14.27057      -2.29566  0.00000
6     0    2012      45    1355   1315     0   5.07403   4.81403       13.42293       6.86026  0.50000
7     1    2012     124    1355   1237     0   4.12049   4.84049       16.60255       4.68752  0.33333
8     0    2012      29    1431   1332     0       NaN       NaN            NaN           NaN      NaN
9     1    2012      52    1431   1155     0  -4.58353  -5.26353        2.30485      -6.73367  0.00000
10    0    2012      72    1431   1409     0  -5.65785  -7.11785       -7.95942      -3.23226  0.50000
11    1    2012      79    1431   1187     1  -6.81332  -8.17332       -4.29258      -8.20323  0.33333
12    1    2012     103    1431   1408     1  -6.71588  -7.96588       -5.03189      -2.51522  0.50000
13    1    2017      47    1252   1399     0       NaN       NaN            NaN           NaN      NaN
14    1    2017      50    1252   1122     0   9.78901  10.22901       -5.62638      -2.59159  1.00000
15    1    2017      68    1252   1244     1   7.09400   8.63400        0.33659      13.16336  1.00000
16    0    2017      82    1252   1316     0   6.62264   8.36930       -0.57178      10.26290  1.00000
17    1    2017      86    1252   1367     1   7.55313   8.32313        2.37803       7.85893  0.75000
18    0    2017     108    1252   1316     1   6.32893   6.93693        2.68629       9.02187  0.80000
19    0    2017     122    1252   1316     1   6.91365   7.39365        4.37327       7.30692  0.66667
In [26]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex5 = {}
demo_conn_dfs_ex5['PaceAdjustedCentered'] = pd.read_sql('select * from PaceAdjustedCenteredSample', conn).sort_values(['Season', 'TeamID', 'DayNum'])
demo_conn_ex5 = dfs_to_conn(demo_conn_dfs_ex5)
# display(demo_conn_dfs_ex5['PaceAdjustedCentered'])
In [27]:
### call demo function
print(get_team_stats(demo_conn_ex5))
The cell below will test your solution for Exercise 5. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [28]:
### test_cell_ex5
from tester_fw.testers import Tester_ex5
tester = Tester_ex5()
for _ in range(20):
    try:
        tester.run_test(get_team_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
    Won  Season  DayNum  TeamID  OppID  Home       Pos     OppPos  PtsAgstPer100  PtsForPer100    WinPct
0     0    2012      17    1108   1452     0       NaN        NaN            NaN           NaN       NaN
1     0    2012      25    1108   1408     0  8.256970   8.536970      26.441150    -18.487660  0.000000
2     0    2012      26    1108   1360     0  5.653140   5.933140      21.511805    -30.284065  0.000000
3     1    2012     105    1108   1115     1  2.763683   2.097017      16.340047    -20.099290  0.000000
4     0    2012      16    1355   1208     0       NaN        NaN            NaN           NaN       NaN
5     1    2012      31    1355   1237     0 -5.812330  -5.332330      14.270570     -2.295660  0.000000
6     0    2012      45    1355   1315     0  5.074030   4.814030      13.422930      6.860260  0.500000
7     1    2012     124    1355   1237     0  4.120487   4.840487      16.602553      4.687520  0.333333
8     0    2012      29    1431   1332     0       NaN        NaN            NaN           NaN       NaN
9     1    2012      52    1431   1155     0 -4.583530  -5.263530       2.304850     -6.733670  0.000000
10    0    2012      72    1431   1409     0 -5.657850  -7.117850      -7.959420     -3.232260  0.500000
11    1    2012      79    1431   1187     1 -6.813323  -8.173323      -4.292583     -8.203230  0.333333
12    1    2012     103    1431   1408     1 -6.715882  -7.965883      -5.031885     -2.515215  0.500000
13    1    2017      47    1252   1399     0       NaN        NaN            NaN           NaN       NaN
14    1    2017      50    1252   1122     0  9.789010  10.229010      -5.626380     -2.591590  1.000000
15    1    2017      68    1252   1244     1  7.093995   8.633995       0.336590     13.163360  1.000000
16    0    2017      82    1252   1316     0  6.622637   8.369303      -0.571780     10.262900  1.000000
17    1    2017      86    1252   1367     1  7.553130   8.323130       2.378032      7.858930  0.750000
18    0    2017     108    1252   1316     1  6.328930   6.936930       2.686288      9.021872  0.800000
19    0    2017     122    1252   1316     1  6.913647   7.393647       4.373267      7.306922  0.666667
initializing tester_fw.tester_6040
Passed! Please submit.
In [29]:
returned_output_vars['df'].dtypes

Out[29]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object

In [30]:
true_output_vars['df'].dtypes

Out[30]:
Won                int64
Season             int64
DayNum             int64
TeamID             int64
OppID              int64
Home               int64
Pos              float64
OppPos           float64
PtsAgstPer100    float64
PtsForPer100     float64
WinPct           float64
dtype: object

Exercise 6 - (3 Points):
Up to this point we have calculated our metrics for each team before they play each game. We're almost ready to start modeling! There is still one last task. Remember that there were two records in `PaceAdjusted` for each game. This is also true of `LaggedTeamStats`, which is a running average of that table. If "Team A" and "Team B" played a game in some `Season` on some `DayNum`, then there will be one record where "Team A" is the primary team and "Team B" is the opponent team, and one record where "Team B" is the primary team and "Team A" is the opponent team. We need to join these records together so that we will have one record identifying each game.

Complete the function `get_matchup_stats(conn)` to query the `LaggedTeamStats` table using the db connection `conn` and return a Pandas DataFrame with the following columns:
| Column Name | dtype | Description |
|---|---|---|
| Won | int64 | 1 if primary team (`_x` suffix) won the matchup, 0 otherwise |
| Season | int64 | |
| DayNum | int64 | |
| TeamID | int64 | Primary team - associated with `_x` suffixes |
| OppID | int64 | Opponent team - associated with `_y` suffixes |
| Home_x | int64 | Stats associated with primary team |
| Pos_x | float64 | Stats associated with primary team |
| OppPos_x | float64 | Stats associated with primary team |
| PtsAgstPer100_x | float64 | Stats associated with primary team |
| PtsForPer100_x | float64 | Stats associated with primary team |
| WinPct_x | float64 | Stats associated with primary team |
| Home_y | int64 | Stats associated with opponent team |
| Pos_y | float64 | Stats associated with opponent team |
| OppPos_y | float64 | Stats associated with opponent team |
| PtsAgstPer100_y | float64 | Stats associated with opponent team |
| PtsForPer100_y | float64 | Stats associated with opponent team |
| WinPct_y | float64 | Stats associated with opponent team |
Notes:
- The columns should be in the exact order given above, but the records can be sorted any way you like.
- The `dtypes` attribute of your result must match exactly.
- This question can be answered using either SQL or Pandas. Pandas will give the required suffixes by default, though.
Implementation Strategy
- You should start by joining/merging `LaggedTeamStats` to itself where the `OppID` in the left table is the same as the `TeamID` in the right table. The `Season` and `DayNum` should be the same for both tables.
- After joining/merging:
  - Remove any rows where the "left" `TeamID` is larger than the "left" `OppID`. This will avoid duplicating the data.
  - Rows occurring before `DayNum` 56 should not be included in the result, to make sure there is a reasonable sample of games so that the stats are meaningful.
  - Drop any unnecessary columns.
  - Rename the columns as necessary to meet the requirements above.
  - Any rows containing `NULL` or `NaN` values should be dropped.
In [31]:
### Define get_matchup_stats
def get_matchup_stats(conn):
    df = pd.read_sql('select * from laggedteamstats', conn)
    return df.merge(df,
                    left_on=['Season', 'DayNum', 'OppID'],
                    right_on=['Season', 'DayNum', 'TeamID'])\
             .query('TeamID_x < OppID_x')\
             .query('DayNum >= 56')\
             .drop(columns=['Won_y', 'TeamID_y', 'OppID_y'])\
             .rename(columns={'Won_x': 'Won', 'TeamID_x': 'TeamID', 'OppID_x': 'OppID'})\
             .dropna()
The demo cell below should display the following output:
   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  PtsAgstPer100_y  PtsForPer100_y  WinPct_y
2    1    2019      78    1210   1323       1 -0.90244  -1.03355        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178         -2.28041         6.55518   0.61111
4    1    2011      72    1253   1413       1  2.31956   1.98356          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252          3.37227        -0.80974   0.28571
In [32]:
### define demo inputs
from tester_fw.test_utils import dfs_to_conn
demo_conn_dfs_ex6 = {}
demo_conn_dfs_ex6['LaggedTeamStats'] = pd.read_sql('select * from LaggedTeamStatsSample', conn)
demo_conn_ex6 = dfs_to_conn(demo_conn_dfs_ex6)
# display(pd.read_sql('select * from LaggedTeamStats', demo_conn_ex6))
In [33]:
### call demo function
print(get_matchup_stats(demo_conn_ex6))
The cell below will test your solution for Exercise 6. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [34]:
### test_cell_ex6
from tester_fw.testers import Tester_ex6
tester = Tester_ex6()
for _ in range(20):
    try:
        tester.run_test(get_matchup_stats)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
   Won  Season  DayNum  TeamID  OppID  Home_x    Pos_x  OppPos_x  PtsAgstPer100_x  PtsForPer100_x  WinPct_x  Home_y    Pos_y  OppPos_y  PtsAgstPer100_y  PtsForPer100_y  WinPct_y
2    1    2019      78    1210   1323       1 -0.90244  -1.03355        -11.59260        -3.70151   0.55556       0 -2.65178  -2.09178         -2.28041         6.55518   0.61111
4    1    2011      72    1253   1413       1  2.31956   1.98356          4.31931        -0.18928   0.46667       0 -2.84824  -2.20252          3.37227        -0.80974   0.28571
initializing tester_fw.tester_6040
Passed! Please submit.
Aside - Random Forest classification model
Feel free to skip to the next exercise.

Logistic regression relies on a few assumptions about the predictive variables which are not met here (namely, normality and each one individually having a significant effect on the response variable). Instead, we will use off-the-shelf Scikit-Learn modules to implement the random forest modeling technique. We're building the model for you, but you will have to do some analysis of the predictions it makes.
In [35]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings
import pickle
import os

### Example code showing how we got X and Y from the output of `get_matchup_stats`
# stats_df = get_matchup_stats(conn)
# # Horizontally partition our stats into predictive (we hope!) variables and the response variable
# X = stats_df.drop(columns=['Won', 'Season', 'DayNum', 'TeamID', 'OppID'])  # Remove the response and identification data
# y = stats_df['Won']  # response variable. There are two classes, 0 and 1, which correspond to a loss and a win for the primary team.

# load pre-computed X and y
with open('resource/asnlib/publicdata/X.pkl', 'rb') as f:
    X = pickle.load(f)
with open('resource/asnlib/publicdata/y.pkl', 'rb') as f:
    y = pickle.load(f)

# Vertically partition data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6040)

# Check to see if we already saved the model
if not os.path.exists('resource/asnlib/publicdata/clf.pkl'):
    # if not - train and save the model
    clf = GridSearchCV(RandomForestClassifier(),
                       scoring='roc_auc',
                       param_grid={'max_depth': [2, 4, 6, 8, 9, 10]},
                       n_jobs=-1,
                       verbose=2).fit(X_train, y_train)
    with open('resource/asnlib/publicdata/clf.pkl', 'wb') as f:
        pickle.dump(clf, f)

# open the saved model from a pickle file
with open('resource/asnlib/publicdata/clf.pkl', 'rb') as f:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        clf = pickle.load(f)

# this is how you predict the probability that the response variable belongs to each class
# probs[:, 0] is the estimated probabilities of losing and probs[:, 1] is the estimated probabilities of winning.
probs = clf.predict_proba(X_test)

# this is how you predict which class each record belongs to
preds = clf.predict(X_test)

# printing some metrics about how well the model fits the test set
print(f'''
Log-Loss:\t\t{round(log_loss(y_test, probs), 4)} - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:\t{round(roc_auc_score(y_test, probs[:, 1]), 4)} - 1 is perfect and 0.5 is considered "non-informative".
''')
Log-Loss:		0.5674 - 0 is perfect and 0.693 is considered "non-informative".
Area under ROC curve:	0.7775 - 1 is perfect and 0.5 is considered "non-informative".

Exercise 7 - (1 Point):
Without advanced statistical training, these fit metrics would probably not make sense. We see that they are both somewhere in between "perfect" and "non-informative", but what is that telling us? One understandable metric we can calculate is the prediction accuracy.

Complete the function `pred_accuracy(obs, preds)` to calculate the relative frequency with which the predicted value `preds` is the same as the observed value `obs`. You can assume that both will be array-like, 1-dimensional, the same length, and that they only contain the integers 0 and 1. Round your result to 5 decimal places.

Note: There will be no type check. If your number matches our number, you will pass.
In [36]:
### Define pred_accuracy
def pred_accuracy(obs, preds):
    return round(np.mean(obs == preds), 5)
The demo cell below should display the following output:
0.53333
In [37]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex7 = rng.integers(0, 2, 15)
demo_preds_ex7 = rng.integers(0, 2, 15)
# print('obs:  ', demo_obs_ex7)
# print('preds:', demo_preds_ex7)
In [38]:
### call demo function
pred_accuracy(demo_obs_ex7, demo_preds_ex7)
The cell below will test your solution for Exercise 7. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [39]:
### test_cell_ex7
from tester_fw.testers import Tester_ex7
tester = Tester_ex7()
for _ in range(20):
    try:
        tester.run_test(pred_accuracy)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
Out[38]:
0.53333

initializing tester_fw.tester_6040
Passed! Please submit.

Exercise 8 - (2 Points):
Another way to evaluate our model is to check the accuracy of the probabilities it generates. Suppose you have a list of the observations `obs` (0's and 1's indicating whether the primary team wins (1) or not (0)) and a list of the predicted probabilities `probs` that the primary team won for each observation. We can bucket the predictions based on the probabilities and then calculate the accuracy of our model within each bucket by taking the mean of the observations. If the observed win % for each bucket is between or close to its bounds, we have a decent model.

Complete the function `bucket_evaluation(obs, prob, n_buckets)` to partition the interval [0, 1] into `n_buckets` equal intervals and compute the observed relative win percentage for the events in each bucket. You should return your result as a Pandas DataFrame with the following columns:
| Column Name | dtype | Description | Special Considerations |
|---|---|---|---|
| lower | float64 | lower bound of bucket | round to 2 decimal places |
| upper | float64 | upper bound of bucket | round to 2 decimal places |
| obs_win_pct | float64 | win % for observations where lower <= estimated win probability < upper | round to 2 decimal places (as a percent) |
| n | int64 | number of observations where lower <= estimated win probability < upper | |
Notes:
- The columns should be in the exact order given above, and the records should be sorted with the 'lower' column in ascending order and a reset index.
- The `dtypes` attribute of your result must match exactly.
- The bounds should be rounded before deciding which bucket each observation belongs in.
- The "obs_win_pct" column should have `NaN` entries for any "empty" buckets which have no observations.
- Your solution raising a warning will not cause it to fail the test cell.
Implementation Strategy
- `np.linspace()` can be helpful for calculating the bounds for each bucket. Don't forget to round!
- The function `zip(bounds[:-1], bounds[1:])` can be iterated over to sequentially examine each pair of bounds (`bounds` would be the output of `linspace` after rounding).
- Use each pair of bounds to filter the `obs` to only include observations where the `prob` is between the lower and upper bounds. You can calculate the observed relative frequency and count from the filtered results.
- Keep track of the endpoints, relative frequency, and count for each interval.
- After evaluating for all intervals, construct a DataFrame using the stored results for each interval.
In [49]:
### Define bucket_evaluation
def bucket_evaluation(obs, prob, n_buckets):
    bounds = np.round(np.linspace(0, 1, n_buckets + 1), 2)
    lod = []
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        inds = np.argwhere((prob >= lower) & (prob < upper)).reshape((-1,))
        n = inds.shape[0]
        win_pct = 100 * np.mean(obs[inds])
        d = {'lower': lower, 'upper': upper, 'obs_win_pct': win_pct, 'n': n}
        lod.append(d)
    return pd.DataFrame(lod).apply(lambda f: round(f, 2))
The demo cell below should display the following output:
   lower  upper  obs_win_pct   n
0   0.00   0.25        25.00   4
1   0.25   0.50        23.08  13
2   0.50   0.75        66.67   3
3   0.75   1.00        40.00   5
Unlike previous exercises, the order and index must match (in addition to the usual requirements) to produce an easily understood result.
In [50]:
### define demo inputs
rng = np.random.default_rng(6040)
demo_obs_ex8 = rng.integers(0, 2, 25)
demo_prob_ex8 = rng.random(25)
demo_n_buckets_ex8 = 4
In [51]:
### call demo function
print(bucket_evaluation(demo_obs_ex8, demo_prob_ex8, demo_n_buckets_ex8))
The cell below will test your solution for Exercise 8. The testing variables will be available for debugging under the following names in a dictionary format.
- `input_vars` - Input variables for your solution.
- `original_input_vars` - Copy of input variables from prior to running your solution. These should be the same as `input_vars` - otherwise the inputs were modified by your solution.
- `returned_output_vars` - Outputs returned by your solution.
- `true_output_vars` - The expected output. This should "match" `returned_output_vars` based on the question requirements - otherwise, your solution is not returning the correct output.
In [43]:
### test_cell_ex8
from tester_fw.testers import Tester_ex8
tester = Tester_ex8()
for _ in range(20):
    try:
        tester.run_test(bucket_evaluation)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
###
### AUTOGRADER TEST - DO NOT REMOVE
###
print('Passed! Please submit.')
[ 1  2  3 16]
[ 0  5  8 11 12 13 17 18 19 20 21 22 23]
[ 4  9 14]
[ 6  7 10 15 24]
None
initializing tester_fw.tester_6040
Passed! Please submit.
In [44]:
from pandas.testing import assert_frame_equal
assert_frame_equal(returned_output_vars["df"], true_output_vars["df"])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3474: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:189: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
(the same pair of RuntimeWarnings repeats for each empty bucket)
In [138]:
returned_output_vars["df"]

Out[138]:
    lower  upper  obs_win_pct  n
0    0.00   0.08        33.33  3
1    0.08   0.17        75.00  4
2    0.17   0.25        55.56  9
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        57.14  7
11   0.92   1.00        33.33  3

In [139]:
true_output_vars["df"]

Out[139]:
    lower  upper  obs_win_pct  n
0    0.00   0.08         0.00  2
1    0.08   0.17        66.67  6
2    0.17   0.25        62.50  8
3    0.25   0.33       100.00  2
4    0.33   0.42        37.50  8
5    0.42   0.50        50.00  2
6    0.50   0.58        66.67  3
7    0.58   0.67         0.00  3
8    0.67   0.75       100.00  2
9    0.75   0.83        25.00  4
10   0.83   0.92        62.50  8
11   0.92   1.00         0.00  2

Fin.
If you have made it this far, congratulations on completing the semester. Don't forget to submit!