hw04 (1)
pdf
School
Santa Clara University *
Course
19
Subject
Statistics
Date
Apr 3, 2024
Type
Pages
32
Uploaded by CountRaccoon297
0.0.1
Question 1a
What is the granularity of the data (i.e., what does each row represent)?
Hint: Examine all variables present in the dataset carefully before answering this question! Pay special attention to the time-based columns.
The granularity of the data is hourly: each row corresponds to a single hour of a single day (the date together with the hr column) and contains the rider counts and conditions for that hour.
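Granularity can be checked by testing which time-based columns uniquely identify a row. The sketch below uses a small made-up table that is hourly by construction; the column name date and all values are assumptions for illustration, not the assignment's data.

```python
import pandas as pd

# Toy stand-in for a bike table: two days, three hours each (hypothetical values).
toy = pd.DataFrame({
    "date": ["2011-01-01"] * 3 + ["2011-01-02"] * 3,
    "hr": [0, 1, 2, 0, 1, 2],
    "casual": [3, 8, 5, 4, 1, 0],
    "registered": [13, 32, 27, 13, 16, 9],
})

# If each row were one day, the date alone would identify a row.
date_is_unique = not toy.duplicated(subset=["date"]).any()

# If each row is one hour of one day, the (date, hour) pair identifies a row.
date_hour_is_unique = not toy.duplicated(subset=["date", "hr"]).any()

print(date_is_unique, date_hour_is_unique)
```

The smallest set of time columns with no duplicates tells you what one row represents.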
0.0.2
Question 1b
For this assignment, we’ll be using this data to study bike usage in Washington, DC. Based on the granularity and the variables present in the data, what might some limitations of using this data be? What are two additional data categories/variables that one could collect to address some of these limitations?
One limitation is the lack of detailed weather information, especially precipitation amounts, which can strongly affect how many bikes are rented. The dataset also lacks geographic detail, such as the stations where rides start and end, so it cannot show regional usage patterns or pinpoint high-demand areas within the city.
Two additions would help. Precipitation data would show how weather influences bike rentals. Start and end locations for each rental would reveal popular routes, support a more thorough analysis of rider behavior and citywide mobility patterns, and inform where bikes should be placed and stocked across the city.
0.0.3
Question 3a
Use the sns.histplot (documentation) function to create a plot that overlays the distribution of the daily counts of bike users, using blue to represent casual riders, and green to represent registered riders. The temporal granularity of the records should be daily counts, which you should have after completing question 2.c. In other words, you should be using daily_counts to answer this question.
Hints:
- You will need to set the stat parameter appropriately to match the desired plot.
- The label parameter of sns.histplot allows you to specify, as a string, how the plot should be labeled in the legend. For example, passing in label="My data" would give your plot the label “My data” in the legend.
- You will need to make two calls to sns.histplot.
Include a legend, xlabel, ylabel, and title. Read the seaborn plotting tutorial if you’re not sure how to add these. After creating the plot, look at it and make sure you understand what the plot is actually telling us, e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000.
For all visualizations in Data 100, our grading team will evaluate your plot based on its similarity to the provided example. While your plot does not need to be identical to the example shown, we do expect it to capture its main features, such as the general shape of the distribution, the axis labels, the legend, and the title. It is okay if your plot contains small stylistic differences, such as differences in color, line weight, font, or size/scale.
In [36]:
sns.histplot(data=daily_counts, x='casual', color='blue', stat='density', label='Casual Riders')
# The label below is cut off in the source ('Registere); 'Registered Riders' is assumed.
sns.histplot(data=daily_counts, x='registered', color='green', stat='density', label='Registered Riders')
plt.xlabel('Rider Count')
plt.ylabel('Density')
plt.title('Distribution Comparison of Casual vs Registered Riders')
plt.legend()
plt.show()
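Setting stat='density' rescales the bars so that their total area is 1, which is what lets two groups of very different overall size share one axis. A minimal numpy sketch of the same normalization, using synthetic counts rather than the assignment data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic daily counts, for illustration only (not the assignment data).
counts = rng.normal(loc=4000, scale=1200, size=500).clip(min=0)

# density=True divides each bin count by (number of samples * bin width).
heights, edges = np.histogram(counts, bins=20, density=True)
widths = np.diff(edges)

# Total bar area: sum of height * width over all bins.
total_area = float(np.sum(heights * widths))
print(total_area)
```

Because the bar heights are densities, the y-axis values depend on the bin width and are not probabilities on their own.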
0.0.4
Question 3b
In the cell below, describe the differences you notice between the density curves for casual and registered
riders. Consider concepts such as modes, symmetry, skewness, tails, gaps, and outliers. Include a comment
on the spread of the distributions.
The two distributions differ in shape, center, and spread. The distribution of casual riders is right-skewed: it has a sharp peak at low counts and a long right tail, so days with few casual riders are common and days with very high casual ridership are rare. The distribution of registered riders is roughly symmetric, with a single, less sharply peaked mode near 4,000 riders and a much wider spread, ranging from nearly 0 to about 7,000. Small dips in the registered distribution may reflect external influences on ridership, such as varying weather conditions or holidays.
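Skewness and spread can also be summarized numerically alongside the histogram. The sketch below uses synthetic stand-ins (an exponential sample for a right-skewed shape, a normal sample for a symmetric one); the names and parameters are assumptions chosen for illustration, not values from daily_counts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-ins: a right-skewed series and a roughly symmetric one.
casual_like = pd.Series(rng.exponential(scale=800, size=2000))
registered_like = pd.Series(rng.normal(loc=4000, scale=1200, size=2000))


def iqr(series: pd.Series) -> float:
    """Interquartile range: a spread measure that is robust to long tails."""
    return float(series.quantile(0.75) - series.quantile(0.25))


summary = pd.DataFrame(
    {
        "skew": [casual_like.skew(), registered_like.skew()],
        "std": [casual_like.std(), registered_like.std()],
        "iqr": [iqr(casual_like), iqr(registered_like)],
    },
    index=["casual_like", "registered_like"],
)
print(summary)
```

A skewness well above zero indicates a long right tail, while a value near zero is consistent with a symmetric shape.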
0.0.5
Question 3c
The density plots do not show us how the counts for registered and casual riders vary together. Use sns.lmplot (documentation) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let’s use the bike DataFrame to plot hourly counts instead of daily counts.
The lmplot function will also try and draw a linear regression line (just as you saw in Data 8). Color the points in the scatterplot according to whether or not the day is a working day (your colors do not have to match ours exactly, but they should be different based on whether the day is a working day).
Hints:
* Check out this helpful tutorial on lmplot.
* There are many points in the scatter plot, so make them small to help reduce overplotting. Check out the scatter_kws parameter of lmplot.
* You can set the height parameter if you want to adjust the size of the lmplot.
* Add a descriptive title and axis labels for your plot.
* You should be using the bike DataFrame to create your plot.
* It is okay if the scales of your x and y axis (i.e., the numbers labeled on the two axes) are different from those used in the provided example.
In [35]:
sns.set(font_scale=1)  # This line automatically makes the font size a bit bigger on the plot.
sns.lmplot(x='casual', y='registered', data=bike, hue='workingday', scatter_kws={'s': 0.5})
plt.title('Comparison of Casual vs Registered Riders on Working and Non-working Days')
plt.xlabel('Casual Riders')
plt.ylabel('Registered Riders')
plt.show()
0.0.6
Question 3d
What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders
and whether or not the day is on the weekend? What effect does overplotting have on your ability to describe
this relationship?
The numbers of casual and registered riders appear to be positively correlated, and the relationship is stronger on non-working days. On working days, registered ridership rises largely independently of the number of casual riders, which suggests that registered users ride for their daily commute regardless of casual ridership.
The scatterplot is overplotted, especially among the casual riders on non-working days, which may mask more subtle trends in the densest regions of the plot. Because of this overplotting, it is hard to judge how many points lie in the busiest regions or to spot possible outliers.
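Because overplotting hides how many points sit in each region, a per-group numeric summary can complement the scatter plot. The sketch below computes one Pearson correlation per workingday group on synthetic data; the generated relationship (tight on non-working days, noisy on working days) is a made-up assumption for illustration, not a result from the bike DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
workingday = rng.integers(0, 2, size=n)
casual = rng.poisson(lam=30, size=n)
# Hypothetical relationship: little noise on non-working days, more on working days.
noise_scale = np.where(workingday == 1, 80.0, 10.0)
registered = 3 * casual + rng.normal(0, noise_scale)

toy = pd.DataFrame(
    {"workingday": workingday, "casual": casual, "registered": registered}
)

# One Pearson correlation between casual and registered per group.
corr_by_group = {
    int(key): float(group["casual"].corr(group["registered"]))
    for key, group in toy.groupby("workingday")
}
print(corr_by_group)
```

Unlike the plot, this summary is unaffected by how many points overlap, although it only captures linear association.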
0.0.7
Question 4a (Bivariate Kernel Density Plot)
Generate a bivariate kernel density plot with workday and non-workday separated using the daily_counts DataFrame.
Hints: You only need to call sns.kdeplot once. Take a look at the hue parameter and adjust other inputs as needed.
After you get your plot working, experiment by setting fill=True in kdeplot to see the difference between the shaded and unshaded versions. Please submit your work with fill=False.
In [31]:
# Set the figure size for the plot
plt.figure(figsize=(12, 8))
sns.kdeplot(data=daily_counts, x='casual', y='registered', hue='workingday', fill=False)
plt.title('Bivariate KDE Plot Comparison of Registered vs Casual Riders')
plt.show()
0.0.8
Question 4b
With some modification to your 4a code (this modification is not in scope), we can generate the plot above.
In your own words, describe what the lines and the color shades of the lines signify about the data. What
does each line and color represent?
Hint: You may find it helpful to compare it to a contour or topographical map as shown here.
Each contour line connects points of equal estimated density, much like an elevation line on a topographical map, so nested lines enclose regions where daily counts are increasingly concentrated. The plot displays the concentration of daily counts for casual and registered riders, with clear differences between workdays and non-workdays. For example, workdays (red tones) may have a higher concentration of registered rides than non-workdays (blue tones), suggesting that workdays are when people use the service more frequently for routine or commute-related purposes. On the other hand, the distribution may be more dispersed or have a different center on non-workdays, indicating a more flexible or recreational use of the service.
0.0.9
Question 4c
What additional details about the riders can you identify from this contour plot that were difficult to
determine from the scatter plot?
The contour plot can reveal information about the riders that a scatter plot might not be able to, such
as the density and the relationship between the counts of casual and registered riders. The contour plot,
in particular, aids in locating the densest areas by displaying the typical numbers of casual and registered
riders. Additionally, it can display the distribution and trend of the data, indicating the type of relationship
that exists between casual and registered riders.
For instance, the contour plot may reveal that on certain days, there is a high density of registered riders
with a low count of casual riders, or vice versa. It might also show if there is a consistent pattern where the
increase in registered riders corresponds to an increase or decrease in casual riders. Such patterns might be
less discernible in a scatter plot, especially if many points overlap, because the scatter plot does not directly
show the density of points.
0.1
5: Joint Plot
As an alternative approach to visualizing the data, construct the following set of three plots where the main plot shows the contours of the kernel density estimate of daily counts for registered and casual riders plotted together, and the two “margin” plots (at the top and right of the figure) provide the univariate kernel density estimate of each of these variables. Note that this plot makes it harder to see the linear relationships between casual and registered for the two different conditions (weekday vs. weekend). You should be making use of daily_counts.
Hints:
* The seaborn plotting tutorial has examples that may be helpful.
* Take a look at sns.jointplot and its kind parameter.
* set_axis_labels can be used to rename axes on a seaborn plot. For example, if we wanted to plot a scatterplot with ‘Height’ on the x-axis and ‘Weight’ on the y-axis from some dataset stats_df, we could write the following:
graph = sns.relplot(data=stats_df, x='Height', y='Weight')
graph.set_axis_labels("Height (cm)", "Weight (kg)")
Note:
* At the end of the cell, we called plt.suptitle to set a custom location for the title.
* We also called plt.subplots_adjust(top=0.9) in case your title overlaps with your plot.
In [20]:
# jointplot creates its own figure, so set the size with height instead of plt.figure
sns.jointplot(data=daily_counts, x="casual", y="registered", kind="kde", height=8)
plt.suptitle("KDE Contours of Casual vs Registered Rider Count")
plt.subplots_adjust(top=0.9);
0.2
6: Understanding Daily Patterns
0.2.1
Question 6a
Let’s examine the behavior of riders by plotting the average number of riders for each hour of the day over the entire dataset (that is, bike DataFrame), stratified by rider type.
Your plot should look like the plot below. While we don’t expect your plot’s colors to match ours exactly,
your plot should have a legend in the plot and different colored lines for different kinds of riders, in addition
to the title and axis labels.
In [21]:
average_counts = bike.groupby('hr')[['casual', 'registered']].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.lineplot(x='hr', y='casual', data=average_counts, label='casual')
sns.lineplot(x='hr', y='registered', data=average_counts, label='registered')
plt.title('Average Count of Casual vs. Registered by Hour')
plt.xlabel('Hour of the Day')
plt.ylabel('Average Count')
plt.legend()
plt.show()
0.2.2
Question 6b
What can you observe from the plot? Discuss your observations for both types of riders, and hypothesize
about the meaning of the peaks in the registered riders’ distribution.
We can infer different use patterns for casual and registered riders during the day by looking at the plot for
question 6a. It appears that casual ridership rises gradually in the morning, peaks in the early afternoon,
and then declines as dusk draws near.
This implies that non-commuters or tourists who use the service
occasionally prefer to use it for errands or leisure during the day.
In contrast, there are two clear peaks in the registered ridership, one in the early morning and another in
the late afternoon, which correspond to regular work commute times. This suggests that commuters who
use the service to get to work in the morning and go home in the evening are probably among the registered
riders. The midday and late night troughs indicate that demand is lower during working hours and when
most people are at home.
The work commute could be the cause of the peaks in the registered riders’ distribution; the morning peak
would be associated with the rush to get to work, and the evening peak would be associated with the mass
return home. This pattern highlights potential opportunities for targeted promotions or increased service
availability during peak hours, underscoring the importance of the service for regular commuters.
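The peaks described above can also be located programmatically from a table of hourly averages. The sketch below uses a handful of hypothetical hourly values (not the results from Question 6a) and flags hours whose average exceeds both neighbouring hours in the table.

```python
import pandas as pd

# Hypothetical hourly averages for a few hours of the day (not real results).
hourly = pd.DataFrame(
    {
        "hr": [6, 8, 10, 12, 14, 17, 20],
        "casual": [4, 20, 45, 65, 75, 70, 35],
        "registered": [70, 360, 130, 150, 160, 390, 160],
    }
).set_index("hr")

# Hour with the highest average for each rider type.
peak_hour = hourly.idxmax()

# Local peaks: hours whose value exceeds both neighbours in the table.
reg = hourly["registered"]
is_local_peak = (reg > reg.shift(1)) & (reg > reg.shift(-1))
registered_peaks = reg.index[is_local_peak].tolist()

print(peak_hour.to_dict(), registered_peaks)
```

The first and last rows are never flagged, because comparisons against the missing neighbour evaluate to False.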
0.2.3
Question 7b
In our case, with the bike ridership data, we want 7 curves, one for each day of the week. The x-axis will be the temperature (as given in the 'temp' column), and the y-axis will be a smoothed version of the proportion of casual riders.
You should use statsmodels.nonparametric.smoothers_lowess.lowess just like the example above. Unlike the example above, plot ONLY the lowess curve. Do not plot the actual data, which would result in overplotting. For this problem, the simplest way is to use a loop.
You do not need to match the colors on our sample plot as long as the colors in your plot make it easy to distinguish which day they represent.
Hints:
* Start by plotting only one day of the week to make sure you can do that first. Then, consider using a for loop to repeat this plotting operation for all days of the week.
* The lowess function expects the y coordinate first, then the x coordinate. You should also set the return_sorted field to False.
* You will need to rescale the normalized temperatures stored in this dataset to Fahrenheit values. Look at the section of this notebook titled ‘Loading Bike Sharing Data’ for a description of the (normalized) temperature field to know how to convert back to Celsius first. After doing so, convert it to Fahrenheit. By default, the temperature field ranges from 0.0 to 1.0. In case you need it, $\text{Fahrenheit} = \text{Celsius} \times \frac{9}{5} + 32$.
Note: If you prefer plotting temperatures in Celsius, that’s fine as well! Just remember to convert accordingly so the graph is still interpretable.
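The two-step rescaling in the last hint can be sketched as a pair of small functions. The scale factor of 41 is an assumption taken from the submitted code for this question, which multiplies temp by 41; the ‘Loading Bike Sharing Data’ section is the authoritative description of the normalization. The function names are hypothetical.

```python
def normalized_to_celsius(temp_norm: float, scale: float = 41.0) -> float:
    """Undo the normalization, assuming the field is Celsius divided by `scale`."""
    return temp_norm * scale


def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9 / 5 + 32


def normalized_to_fahrenheit(temp_norm: float, scale: float = 41.0) -> float:
    """Chain the two steps: normalized value -> Celsius -> Fahrenheit."""
    return celsius_to_fahrenheit(normalized_to_celsius(temp_norm, scale))


for t in (0.0, 0.5, 1.0):
    print(t, normalized_to_fahrenheit(t))
```

Keeping the two steps separate makes it easy to plot in Celsius instead by calling only the first function.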
In [34]:
from statsmodels.nonparametric.smoothers_lowess import lowess

plt.figure(figsize=(10, 8))
bike['temp_F'] = bike['temp'].apply(lambda x: x * 41 * (9 / 5) + 32)
for day in bike['weekday'].unique():
    data = bike[bike['weekday'] == day]
    line = lowess(data['prop_casual'], data['temp_F'])
    plt.plot(line[:, 0], line[:, 1], label=day)
plt.xlabel('Temperature (Fahrenheit)')
plt.ylabel('Casual Rider Proportion')
plt.title('Temperature vs Casual Rider Proportion by Weekday')
plt.legend()
plt.show()
0.2.4
Question 7c
What do you observe in the above plot? How is prop_casual changing as a function of temperature? Do you notice anything else interesting?
The proportion of casual riders (prop_casual) generally rises with temperature. Because this pattern holds on every day of the week, warmer weather will probably lead to more casual use of the bike-sharing service.
In addition, the increase appears to be steeper on weekends (Saturday and Sunday) than on weekdays. This may suggest that temperature has a greater effect on casual riders on weekends, when they have more time for leisure activities such as bike sharing.
Furthermore, the curves start at different levels, with weekdays starting lower than weekends. This may suggest that there is a base level of casual ridership that is independent of temperature, perhaps for essential travel, while leisure travel, which is more temperature-dependent, particularly on weekends, contributes to higher levels of casual ridership.
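The weekday versus weekend comparison can be checked with a grouped mean of the proportion. The sketch below assumes prop_casual is casual / (casual + registered), since its definition is not shown in this excerpt; the rows and the day_type column are made up for illustration.

```python
import pandas as pd

# Made-up rows; 'day_type' is a hypothetical helper column.
toy = pd.DataFrame(
    {
        "casual": [10, 30, 120, 200],
        "registered": [190, 270, 180, 200],
        "day_type": ["weekday", "weekday", "weekend", "weekend"],
    }
)

# Assumed definition: share of all riders in the row who are casual.
total = toy["casual"] + toy["registered"]
toy["prop_casual"] = toy["casual"] / total

# Average proportion per day type.
mean_prop = toy.groupby("day_type")["prop_casual"].mean()
print(mean_prop)
```

Averaging row-level proportions weights every row equally; dividing summed casual counts by summed totals would weight busy rows more heavily.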
0.2.5
Question 8a
Imagine you are working for a bike-sharing company that collaborates with city planners, transportation agencies, and policymakers in order to implement bike-sharing in a city. These stakeholders would like to reduce congestion and lower transportation costs. They also want to ensure the bike-sharing program is implemented equitably. In this sense, equity is a social value that informs the deployment and assessment of your bike-sharing technology.
Equity in transportation includes: Improving the ability of people of different socio-economic classes, genders, races, and neighborhoods to access and afford transportation services and assessing how inclusive transportation systems are over time.
Do you think the bike data as it is can help you assess equity? If so, please explain. If not, how would you change the dataset? You may discuss how you would change the granularity, what other kinds of variables you’d introduce to it, or anything else that might help you answer this question.
Note: There is no single “right” answer to this question – we are looking for thoughtful reflection and commentary on whether or not this dataset, in its current form, encodes information about equity.
Although the current bike dataset sheds light on usage trends, it is not enough for a thorough assessment of equity. To address equity, the dataset would need geographic data showing where the service is used and who has access to it, as well as demographic data about riders, such as age, gender, income level, and race. Rider satisfaction surveys, accommodations for people with disabilities, and multilingual accessibility are possible additional variables. Collecting the data longitudinally would show shifts over time and illustrate the effects of policy changes or service expansions. To assess equity accurately, the dataset would need to measure accessibility and usage across demographic groups and reflect the diversity of the community.
0.2.6
Question 8b
Bike sharing is growing in popularity, and new cities and regions are making efforts to implement bike-sharing systems that complement their other transportation offerings. The goals of these efforts are to have bike sharing serve as an alternate form of transportation in order to alleviate congestion, provide geographic connectivity, reduce carbon emissions, and promote inclusion among communities.
Bike-sharing systems have spread to many cities across the country. The company you work for asks you to determine the feasibility of expanding bike sharing to additional cities in the US.
Based on your plots in this assignment, would you recommend expanding bike sharing to additional cities in the US? If so, what cities (or types of cities) would you suggest? Please list at least two reasons why, and mention which plot(s) you drew your analysis from.
Note: There isn’t a set right or wrong answer for this question. Feel free to come up with your own conclusions based on evidence from your plots!
The analysis suggests that expanding bike sharing to more US cities is a promising idea. Two plots support this. The hourly averages plot (Question 6a) shows heavy registered use during commute hours, which suggests that bike sharing can reduce peak traffic. The temperature plot (Question 7b) shows that casual usage rises with pleasant weather, which points to bike sharing becoming popular in cities with temperate climates. Expansion, however, should account for the particular needs of each city, including the local transit system, the density of the city, and cultural attitudes toward cycling. Bike-sharing schemes would probably be most beneficial to cities with strong public transportation networks and a commitment to environmentally friendly mobility. Additionally, smaller or mid-sized cities where traffic is starting to become an issue might adopt bike sharing as a preventative measure.