UGBA88 Lab DX assignment - Jupyter Notebook
pdf
keyboard_arrow_up
School
University of California, Berkeley *
*We aren’t endorsed by this school
Course
113
Subject
Industrial Engineering
Date
Feb 20, 2024
Type
Pages
13
Uploaded by MegaSeahorseMaster651
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
1/13
Lab DX: Data Visualization
1 Setup
In [8]:
2
Fundraising for a Political Campaign continued
2.1
Business Decision
The Director of Fundraising in California for a US presidential candidate is considering where best
to focus the campaign's effort during the early period January to September of the year prior to the
election. To inform her decision, she would like insight into how fundraising has gone for other
candidates in the past.
# Import some useful functions
from
numpy import
*
from
numpy.random import
*
from
datascience import
*
from
statsmodels.formula.api import
*
# Define some useful functions
def
correlation
(array_1, array_2):
return
corrcoef(array_1, array_2).item(
1
)
# Customize look of graphics
import
matplotlib.pyplot as
plt
plt.style.use(
'fivethirtyeight'
)
plt.rcParams[
'figure.dpi'
] =
60
%
matplotlib inline
# Force display of all values from
IPython.core.interactiveshell import
InteractiveShell
InteractiveShell.ast_node_interactivity =
"all"
# Handle some obnoxious warning messages
import
warnings
warnings.filterwarnings(
"ignore"
)
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
2/13
2.2 Data
The donor dataset here is from the US Federal Election Commission (FEC) and itemizes individual
contributions from California to federal campaign committees for US presidential candidates
received January 2019 to September 2019 (refer: US Federal Election Commission
https://www.fec.gov/data/browse-data/?tab=bulk-data
(https://www.fec.gov/data/browse-data/?
tab=bulk-data)
).
Retrieve the donor dataset, filter to include only transactions where amount is between $0 and
$2,200, show the number of transactions and the first few transactions.
In [9]:
2.3 Analysis
Construct a table to show $ donated to Democrats for each city that donated to Democrats.
Out[9]:
264119
Out[9]:
CAND_NAME
PARTY
NAME
CITY
OCCUPATION
DATE
AMOUNT
SANFORD,
MARSHALL HON
REP
CARATAN, PATRICK
BAKERSFIELD
blank
2019-
09-08
400
SANFORD,
MARSHALL HON
REP
PARROTT, IDA
FRESNO
RETIRED
2019-
09-09
500
SANFORD,
MARSHALL HON
REP
HUEBSCHER,
FRED
HERMOSA
BEACH
CONSULTANT
2019-
09-08
1000
SANFORD,
MARSHALL HON
REP
MURPHY, MIKE
LOS ANGELES
CONSULTANT
2019-
09-08
1000
DELANEY, JOHN K.
DEM
ELVERUM, JAMIE
CARMICHAEL
SALES
MANAGER
2019-
07-14
25
DELANEY, JOHN K.
DEM
FLYNN, GREG
SAN
FRANCISCO
blank
2019-
03-31
1000
DELANEY, JOHN K.
DEM
SAUNDERS,
MARSHALL L
CORONADO
RETIRED
2019-
07-08
1000
DELANEY, JOHN K.
DEM
HOLMSTROM,
RICK
WOODSIDE
blank
2019-
06-20
200
DELANEY, JOHN K.
DEM
COWAN,
GEOFFREY
LOS ANGELES
PROFESSOR
2019-
04-10
20
DELANEY, JOHN K.
DEM
ELVERUM, JAMIE
CARMICHAEL
SALES
MANAGER
2019-
07-28
25
... (264109 rows omitted)
data =
Table.read_table(
'donations.csv'
, low_memory
=
False
)
data =
data.where(
'AMOUNT'
, are.between_or_equal_to(
0
,
2200
))
data.num_rows
data
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
3/13
In [11]:
Construct a table to show $ donated to Republicans for each city that donated to Republicans.
In [12]:
Construct a table to compare $ donated to Democrats vs. $ donated to Republicans for each city
(from among cities that donated to both Democrats and Republicans).
Out[11]:
CITY
DEM sum
ACTON
1602
ADELANTO
159
AGOURA
131
AGOURA HILLS
23251
AGUA DULCE
849
AHWAHNEE
54
ALAMEDA
105673
ALAMO
31812
ALBANY
54959
ALBION
2746
... (1026 rows omitted)
Out[12]:
CITY
REP sum
ACAMPO
250
ACTON
80
ADIN
250
AGOURA HILLS
2055
AHWAHNEE
275
ALAMEDA
1688
ALAMO
4505
ALHAMBRA
748
ALISO VIEJO
2295
ALPINE
1035
... (680 rows omitted)
agg_dem =
data.where(
'PARTY'
,
'DEM'
).select(
'CITY'
,
'AMOUNT'
).group(
'CITY'
,
su
agg_dem =
agg_dem.relabel(
'AMOUNT sum'
, 'DEM sum'
)
agg_dem
agg_rep =
data.where(
'PARTY'
,
'REP'
).select(
'CITY'
,
'AMOUNT'
).group(
'CITY'
, s
agg_rep =
agg_rep.relabel(
'AMOUNT sum'
, 'REP sum'
)
agg_rep
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
4/13
In [13]:
Find the 10 cities with the highest $ donated to Democrats (from among cities that donated to both
Democrats and Republicans). These are the top-10 Democrat donor cities.
In [15]:
Use a horizontal bar chart to visualize $ donated to Democrats and $ donated to Republicans for
each of the top-10 Democrat donor cities, sorted by $ donated to Democrats.
Out[13]:
CITY
DEM sum
REP sum
ACTON
1602
80
AGOURA HILLS
23251
2055
AHWAHNEE
54
275
ALAMEDA
105673
1688
ALAMO
31812
4505
ALHAMBRA
13508
748
ALISO VIEJO
24879
2295
ALPINE
1310
1035
ALTA LOMA
1505
285
ALTADENA
56061
1900
... (612 rows omitted)
Out[15]:
CITY
DEM sum
REP sum
SAN FRANCISCO
3569632
23528
LOS ANGELES
3178304
41417
OAKLAND
767837
5600
BERKELEY
619856
841
SAN DIEGO
557147
32812
SANTA MONICA
540150
4365
PALO ALTO
532022
3620
SAN JOSE
529012
29433
BEVERLY HILLS
408892
9167
SACRAMENTO
403744
11932
agg =
agg_dem.join(
'CITY'
,agg_rep)
agg
agg_top10dem =
agg.sort(
'DEM sum'
, descending =
True
).take(arange(
10
))
agg_top10dem
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
5/13
In [16]:
Find the 10 cities with the highest $ donated to Republicans (from among cities that donated to
both Democrats and Republicans). These are the top-10 Republican donor cities.
In [17]:
Use a horizontal bar chart to visualize $ donated to Democrats and $ donated to Republicans for
each of the top-10 Republican donor cities, sorted by $ donated to Republicans.
Out[17]:
CITY
DEM sum
REP sum
LOS ANGELES
3178304
41417
SAN DIEGO
557147
32812
SAN JOSE
529012
29433
SAN FRANCISCO
3569632
23528
SANTA ANA
51036
21938
HUNTINGTON BEACH
80170
15833
NEWPORT BEACH
78004
15640
IRVINE
151146
15405
BAKERSFIELD
32740
15076
FRESNO
72900
14105
agg_top10dem.barh(
'CITY'
)
agg_top10rep =
agg.sort(
'REP sum'
, descending =
True
).take(arange(
10
))
agg_top10rep
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
6/13
In [18]:
Calculate the ratio of $ donated to Democrats vs. $ donated to Republicans from the top-
4
Republican donor cities combined.
In [19]:
Construct a table to show the cumulative $ donated in February 2019, by date.
Out[19]:
CITY
DEM sum
REP sum
LOS ANGELES
3178304
41417
SAN DIEGO
557147
32812
SAN JOSE
529012
29433
SAN FRANCISCO
3569632
23528
Out[19]:
61.593639437062663
agg_top10rep.barh(
'CITY'
)
agg_top4rep =
agg_top10rep.take(arange(
4
))
agg_top4rep
sum
(agg_top4rep.column(
'DEM sum'
)) /
sum
(agg_top4rep.column(
'REP sum'
))
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
7/13
In [20]:
Visualize the cumulative $ donated in February 2019, by date, as a line plot.
Out[20]:
CAND_NAME
PARTY
NAME
CITY
OCCUPATION
DATE
AMOUNT
CUMSUM
TRUMP,
DONALD J.
REP
MEZHEBITSKIY,
SERGEY
CITRUS
HEIGHTS
SINGLE
2019-
02-01
25
25
TRUMP,
DONALD J.
REP
SCHNEIDER,
GARY
MOUNTAIN
VIEW
LYFT DRIVER
2019-
02-01
50
75
TRUMP,
DONALD J.
REP
WAGNER, RUEL
E. MR.
LA MIRADA
RETIRED
2019-
02-01
50
125
TRUMP,
DONALD J.
REP
WALKOV,
CINDRA
VAN NUYS
REAL EATATE
2019-
02-01
35
160
TRUMP,
DONALD J.
REP
DIEHL, LINDA
EXETER
RETIRED
2019-
02-01
25
185
TRUMP,
DONALD J.
REP
HIGHTOWER,
TIMOTHY
SAN JOSE
RETIRED
2019-
02-01
250
435
TRUMP,
DONALD J.
REP
LUONG,
WILLIAM T. MR.
SAN JOSE
RETIRED
2019-
02-01
50
485
TRUMP,
DONALD J.
REP
WALKOV,
CINDRA
VAN NUYS
REAL EATATE
2019-
02-01
35
520
TRUMP,
DONALD J.
REP
NEIGHBOR,
JOSEPH
RANCHO
CUCAMONGA
MANAGER
2019-
02-01
100
620
TRUMP,
DONALD J.
REP
WARD, DELMA
PASO ROBLES
RETIRED
2019-
02-01
225
845
... (5304 rows omitted)
=
data.where(
'DATE'
, are.between_or_equal_to(
"2019-02-01"
,
"2019-02-28"
))
=
data_02.sort(
'DATE'
).with_column(
'CUMSUM'
, cumsum(data_02.sort(
'DATE'
).col
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
8/13
In [21]:
2.4 Quiz
The dataset, filtered to include only transactions where amount is between $0 and $2,200,
comprises ____
observations.
City of Acton donated $
____
to Democrat candidates.
City of Alpine donated $
____
to Republican candidates.
City of Alta Loma donated $
____
to Democrat candidates.
City of Alta Loma donated $
____
to Republican candidates.
City of ____
donated more to Democrats than any other city did.
BERKELEY
BEVERLY HILLS
LOS ANGELES
SAN DIEGO
SAN FRANCISCO
SAN JOSE
City of Fresno donated more to Republicans than city of ____
did.
BAKERSFIELD
IRVINE
SACRAMENTO
SANTA ANA
SAN FRANCISCO
SAN JOSE
The top-
4
Republican donor cities donated ____
times as much to Democrats as they did to
Republicans.
During February 2019, the largest one-day sum of $ donated was on February ____
, 2019.
The line plot of cumulative $ donated in February 2019, by date, is shaped like ____
.
discrete stair steps because there are typically several donations made each day
discrete stair steps because the donation amounts in the dataset are rounded to the
nearest dollar
data_02.select(
'DATE'
,
'CUMSUM'
).plot(
'DATE'
)
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
9/13
discrete stair steps because the python .plot() function displays line segments at low
resolution
a continuous ramp because there are typically several donations made each day
a continuous ramp because the donation amounts in the dataset are rounded to the
nearest dollar
a continuous ramp because the python .plot() function displays line segments at low
resolution
3
Airline Flight On-Time Performance
3.1
Business Decision
The practice manager of a management consulting firm has been engaged by a consortium of
airport operators to investiugate and recommend process changes to improve airline flight on-time
performance. His first task is to identify characteristics surrounding very delayed flights. Based on
insights gained there, he will decide where to focus his efforts next.
3.2 Data
The dataset describes flight performance of a sample of 1,049 flights during August 2015 (refer: US
Bureau of Transportation Statistics). Each observation describes one flight.
Retrieve the flight performance dataset, show the number of observations and the first few
observations.
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
10/13
In [22]:
3.3 Analysis
Calculate the mean and standard deviation of arrival delays.
In [23]:
Visualize the distribution of arrival delays as a 10-bin histogram, normed, range -75 to 75.
Out[22]:
1049
Out[22]:
DayofMonth
DayOfWeek
Carrier
Origin
Origin
City
Dest
Dest City
Departure
Delay
(mins)
Arrival
Delay
(min)
Schd
Elapsed
Time
7
5
American
JFK
New York,
NY
LAX
Los
Angeles,
CA
-3
-19
364
25
2
American
SFO
San
Francisco,
CA
JFK
New York,
NY
-9
-31
343
23
7
American
PDX
Portland,
OR
ORD
Chicago,
IL
-2
-26
243
21
5
American
DEN
Denver,
CO
MIA
Miami, FL
4
5
236
18
2
American
MIA
Miami, FL
IAH
Houston,
TX
-6
-19
163
15
6
American
JFK
New York,
NY
MIA
Miami, FL
73
58
213
8
6
American
SEA
Seattle,
WA
DFW
Dallas/Fort
Worth, TX
0
-5
235
27
4
American
LAX
Los
Angeles,
CA
BOS
Boston,
MA
-2
-20
340
7
5
American
ELP
El Paso,
TX
DFW
Dallas/Fort
Worth, TX
-7
-10
108
15
6
American
LAS
Las
Vegas, NV
DFW
Dallas/Fort
Worth, TX
-2
-6
165
... (1039 rows omitted)
Out[23]:
-1.7988560533841753
Out[23]:
18.738900333975252
data =
Table.read_table(
'flight_performance.csv'
)
data.num_rows
data
mean_delay =
mean(data.column(
'Arrival Delay (min)'
))
mean_delay
std_delay =
std(data.column(
'Arrival Delay (min)'
))
std_delay
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
11/13
In [ ]:
Calculate the 90th percentile of arrival delays. 10% of the flights were delayed longer than this
value.
In [ ]:
Calculate the mean arrival delay for each origin airport.
In [ ]:
Visualize the distribution of mean arrival delays as a 10-bin histogram, normed, range -75 to 75.
(Each mean arrival delay corresponds to a unique origin airport.)
In [ ]:
Calculate the 90th percentile of mean arrival delays. 10% of the origin airports had flights delayed
on average longer than this value.
In [ ]:
Find the 10% of origin airports with the highest mean arrival delays. These are the origin airports
that had flights delayed on average longer than the 90th percentile. These are the problem origin
airports.
In [ ]:
Find the flights originating from problem origin airports.
Hint: Join the table of problem origin airports with the table of all flights.
In [ ]:
Calculate the number of flights originating from problem origin airports for each carrier.
In [ ]:
Calculate the shortest arrival delay originating from problem origin airports for each carrier. (Treat a
negative arrival delay as shorter than a positive arrival delay.)
data.hist(
'Arrival Delay (min)'
), normed =
True
, range =
make_array(
-
75
,
75
)
p =
percentile(
90
, data.column(
'Arrival Delay (min)'
))
p
data_group =
data.select(
'Origin'
, 'Arrival Delay (min)'
).group(
'Origin'
, m
data_group
data_group.where(
'Arrival Delay (min) mean'
, are.between(
-
75
/
5
,
0
)).num_rows
data_group.hist(
'Arrival Delay (min) mean'
, normed =
True
, range =
make_ar
p_group =
percentile(
90
, data_group.column(
'Arrival Delay (min) mean'
))
p_group
data_group_worst =
data_group.where(
'Arrival Delay (min) mean'
, are.above(p
date_group_worst.show()
data_worst =
data.join(
'Origin'
, data_group_worst)
data_worst.show()
data_worst.select(
'Carrier'
).group(
'Carrier'
)
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
12/13
In [ ]:
Calculate the mean arrival delay originating from problem origin airports for each carrier. (Treat a
negative arrival delay as shorter than a positive arrival delay.)
In [ ]:
Calculate the longest arrival delay originating from problem origin airports for each carrier. (Treat a
negative arrival delay as shorter than a positive arrival delay.)
In [ ]:
Use a horizontal bar chart to visualize shortest arrival delay, mean arrival delay, and longest arrival
delay originating from problem origin airports for each carrier.
In [ ]:
3.4 Quiz
The dataset comprises ____
observations.
The mean arrival delay is ____
minutes. (could be negative)
The standard deviation of arrival delay is ____
minutes.
The histogram of arrival delays includes ____
bars with heights greater than zero.
10% of flights arrived at their destinations delayed by more than ____
minutes.
Flights originating from ASE arrived at their destinations delayed by ____
minutes on
average.
The area of all bars in the histogram of mean arrival delays is ____
.
The area of the highest bar in the histogram of mean arrival delays is ____
.
10% of origin airports had flights that arrived at their destinations delayed by more than
____
minutes on average.
____
origin airports had flights that arrived at their destinations delayed by more than 12
minutes on average.
The mean arrival delay of flights originating from MLI is ____
minutes.
The mean arrival delay of flights originating from FAI is ____
minutes.
____
flights originated from IAD.
____
SkyWest flights originated from problem origin airports.
The shortest arrival delay of Alaska flights originating from problem origin airports is ____
minutes.
The mean arrival delay of Atlantic SE flights originating from problem origin airports is ____
minutes.
The longest arrival delay of UA flights originating from problem origin airports is ____
minutes.
The bar chart of shortest, mean, and longest arrival delays includes ____
bars.
The length of the longest bar in the bar chart of shortest, mean, and longest arrival delays is
____
.
____
contributes to arrival delays associated with problem origin airports due to having
many flights and high mean arrival delay, even though at least one flight arrived ahead of
d1 =
data_worst.select(
'Carrier'
, 'Arrival Delay (min)'
).group(
'Carrier'
,
mi
d1
d2 =
data_worst.select(
'Carrier'
,
'Arrival Delay (min)'
)
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
2/12/24, 5:50 PM
UGBA88 Lab DX assignment - Jupyter Notebook
https://ugba88.haas.berkeley.edu/user/darrinfan/notebooks/my-work/Lab D Data Visualization/UGBA88 Lab DX assignment.ipynb
13/13
schedule.
American
Atlantic SE
SkyWest
Southwest
UA
Southwest ____
contribute to arrival delays associated with problem origin airports.
does
does not
Document revised September 16, 2023
Copyright (c) Huntsinger Associates, LLC