hw04

pdf

School

University of California, Berkeley

Course

61B

Subject

Statistics

Date

Apr 3, 2024

Type

pdf

Pages

32

Uploaded by CaptainMonkey3984

0.0.1 Question 1a

What is the granularity of the data (i.e., what does each row represent)?

Hint: Examine all variables present in the dataset carefully before answering this question! Pay special attention to the time-based columns.

Each row represents one hour of one day. Along with the date and hour, a row records the weather conditions, the season, and other details about that hour and day. It also gives the number of rental bikes used during the hour, broken down into casual and registered users.
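One way to check a claimed granularity is to confirm that the proposed key identifies each row exactly once. The sketch below is dependency-free and uses made-up rows; the column name `dteday` for the date is an assumption, and `duplicated_keys` is a hypothetical helper, not part of the assignment.

```python
from collections import Counter

def duplicated_keys(rows, key_fields):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical sample rows; 'dteday' is an assumed name for the date column.
sample_rows = [
    {"dteday": "2011-01-01", "hr": 0, "casual": 3, "registered": 13},
    {"dteday": "2011-01-01", "hr": 1, "casual": 8, "registered": 32},
    {"dteday": "2011-01-02", "hr": 0, "casual": 4, "registered": 13},
]

# An empty list means each (date, hour) pair identifies exactly one row.
print(duplicated_keys(sample_rows, ["dteday", "hr"]))

# The date alone is not a key, because one day spans several hourly rows.
print(duplicated_keys(sample_rows, ["dteday"]))
```

If the first call returned any keys, the data would be finer-grained than one row per hour.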
0.0.2 Question 1b

For this assignment, we’ll be using this data to study bike usage in Washington, DC. Based on the granularity and the variables present in the data, what might some limitations of using this data be? What are two additional data categories/variables that one could collect to address some of these limitations?

The data is meant to represent all of DC, but DC is a large area and weather conditions may vary across it, even within a single hour. Furthermore, although each row gives rider counts for one hour, a rider who uses bike sharing across several consecutive hours appears in several rows, and the dataset does not tell us how many riders are “double-counted” this way. Two helpful additional variables would be the location of bike usage in each hour and the number of riders who are new to the service in each hour.
0.0.3 Question 3a

Use the sns.histplot (documentation) function to create a plot that overlays the distribution of the daily counts of bike users, using blue to represent casual riders, and green to represent registered riders. The temporal granularity of the records should be daily counts, which you should have after completing question 2.c. In other words, you should be using daily_counts to answer this question.

Hints:
- You will need to set the stat parameter appropriately to match the desired plot.
- The label parameter of sns.histplot allows you to specify, as a string, how the plot should be labeled in the legend. For example, passing in label="My data" would give your plot the label “My data” in the legend.
- You will need to make two calls to sns.histplot.

Include a legend, xlabel, ylabel, and title. Read the seaborn plotting tutorial if you’re not sure how to add these. After creating the plot, look at it and make sure you understand what the plot is actually telling us, e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000.

For all visualizations in Data 100, our grading team will evaluate your plot based on its similarity to the provided example. While your plot does not need to be identical to the example shown, we do expect it to capture its main features, such as the general shape of the distribution, the axis labels, the legend, and the title. It is okay if your plot contains small stylistic differences, such as differences in color, line weight, font, or size/scale.

In [24]:
# Both histplot calls are cut off at the right margin in the source.
# The closing parenthesis of the first call and the end of the second call
# (kde=True, label='registered') are assumed completions.
sns.histplot(data=daily_counts, x='casual', stat='density',
             alpha=0.5, kde=True, label='casual')
sns.histplot(data=daily_counts, x='registered', stat='density',
             color='green', alpha=0.5, kde=True, label='registered')
plt.title('Distribution Comparison of Casual vs Registered Riders');
plt.xlabel('Rider Count');
plt.legend();
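To make the role of the stat parameter concrete, here is a minimal, dependency-free sketch of what a density-normalized histogram computes: bar heights are scaled so that the total bar area equals 1. The helper `density_histogram` and the toy counts are hypothetical, and equal-width bins are an assumption; seaborn chooses its own bins.

```python
def density_histogram(values, n_bins):
    """Return (bin_edges, heights) with heights scaled so total bar area is 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Place the maximum value in the last bin rather than past it.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    # Dividing by (number of values * bin width) turns counts into densities.
    heights = [c / (len(values) * width) for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, heights

# Hypothetical daily counts, for illustration only.
toy_counts = [100, 150, 200, 250, 300, 900, 1200, 2500]
edges, heights = density_histogram(toy_counts, n_bins=4)

# Summing height * width over all bars recovers an area of 1.
total_area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(total_area)
```

Because both groups are normalized to unit area, their shapes can be compared even when one group has far more riders than the other.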
0.0.4 Question 3b

In the cell below, describe the differences you notice between the density curves for casual and registered riders. Consider concepts such as modes, symmetry, skewness, tails, gaps, and outliers. Include a comment on the spread of the distributions.

The registered riders curve is more symmetric than the casual riders curve, which is heavily skewed to the right. The registered curve also has a broad spread, reaching to around 6950, while the casual curve is more condensed, ranging from approximately 0 to 3500. The registered curve has a mode between 3500 and 4000, while the casual curve has a low mode of around 290.
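A quick numerical companion to reading skewness off a curve is to compare the mean and the median: in a right-skewed sample, the long right tail pulls the mean above the median. The sketch below uses made-up numbers, not the assignment's data.

```python
import statistics

# Hypothetical daily casual-rider counts with a long right tail.
toy_casual = [100, 200, 250, 300, 350, 400, 3000]

mean_count = statistics.mean(toy_casual)
median_count = statistics.median(toy_casual)

# In a right-skewed sample the few large values pull the mean above the median.
print(f"mean={mean_count:.1f}, median={median_count}")
```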
0.0.5 Question 3c

The density plots do not show us how the counts for registered and casual riders vary together. Use sns.lmplot (documentation) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let’s use the bike DataFrame to plot hourly counts instead of daily counts.

The lmplot function will also try and draw a linear regression line (just as you saw in Data 8). Color the points in the scatterplot according to whether or not the day is a working day (your colors do not have to match ours exactly, but they should be different based on whether the day is a working day).

Hints:
* Check out this helpful tutorial on lmplot.
* There are many points in the scatter plot, so make them small to help reduce overplotting. Check out the scatter_kws parameter of lmplot.
* You can set the height parameter if you want to adjust the size of the lmplot.
* Add a descriptive title and axis labels for your plot.
* You should be using the bike DataFrame to create your plot.
* It is okay if the scales of your x and y axis (i.e., the numbers labeled on the two axes) are different from those used in the provided example.

In [25]:
sns.set(font_scale=1)  # This line automatically makes the font size a bit bigger on the plot.
sns.lmplot(data=bike, x='casual', y='registered', hue='workingday', scatter_kws={'s': 10});
plt.title('Comparison of Casual vs Registered Riders on Working and Non-working Days');
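When hue is set, lmplot fits a separate regression line for each hue level. As a rough illustration of what each line represents, the sketch below fits an ordinary least-squares line per group in plain Python. The rows, the "yes"/"no" labels, and the `fit_line` helper are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical hourly rows: (casual, registered, workingday label).
toy_hours = [
    (1, 12, "yes"), (2, 22, "yes"), (3, 32, "yes"),
    (10, 12, "no"), (20, 22, "no"), (30, 32, "no"),
]

# Fit one line per group, mirroring one regression line per hue level.
fits = {}
for group in ("yes", "no"):
    xs = [c for c, r, w in toy_hours if w == group]
    ys = [r for c, r, w in toy_hours if w == group]
    fits[group] = fit_line(xs, ys)

print(fits)
```

In this toy data the working-day group has the steeper slope, meaning registered counts rise faster per additional casual rider.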
0.0.6 Question 3d

What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders and whether or not the day is on the weekend? What effect does overplotting have on your ability to describe this relationship?

On working days there are more registered riders, while on non-working days there are more casual riders, which makes intuitive sense. However, I cannot fully describe this relationship, because overplotting limits the conclusions I can draw: it is possible that there are also many casual riders on working days.
0.0.7 Question 4a (Bivariate Kernel Density Plot)

Generate a bivariate kernel density plot with workday and non-workday separated using the daily_counts DataFrame.

Hints: You only need to call sns.kdeplot once. Take a look at the hue parameter and adjust other inputs as needed.

After you get your plot working, experiment by setting fill=True in kdeplot to see the difference between the shaded and unshaded versions. Please submit your work with fill=False.

In [27]:
# Set the figure size for the plot
plt.figure(figsize=(12, 8))
sns.kdeplot(data=daily_counts, x='casual', y='registered', hue='workingday');
plt.title('Bivariate KDE Plot Comparison of Registered vs Casual Riders');
0.0.8 Question 4b

With some modification to your 4a code (this modification is not in scope), we can generate the plot above. In your own words, describe what the lines and the color shades of the lines signify about the data. What does each line and color represent?

Hint: You may find it helpful to compare it to a contour or topographical map as shown here.

In this graph, the data is more concentrated where the colors are darker, for both working and non-working days. For example, non-working days are shown in blue, and the darkest, most concentrated area is in the middle of the plot, indicating that there are usually between 1000 and 2000 casual (non-registered) riders and between 2000 and 4000 registered riders. The edge of the plot and the density of lines there also suggest that the number of riders fluctuates drastically on both types of days. The orange plot, which represents workdays, shows through its darkest segments that there are usually fewer than 1000 casual riders and around 4000 or more registered riders.
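Each contour line connects points where the estimated density has the same value, much like elevation lines on a topographical map. The sketch below evaluates a simplified two-dimensional kernel density estimate at a few points. It assumes an isotropic Gaussian kernel with a fixed, hand-picked bandwidth, which is a simplification rather than what seaborn does internally; the points and the `kde_2d` helper are hypothetical.

```python
import math

def kde_2d(points, x, y, bandwidth):
    """Average of isotropic Gaussian kernels centred on each data point."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        sq_dist = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total / len(points)

# Hypothetical (casual, registered) daily counts: a tight cluster plus one outlier.
toy_days = [(900, 2900), (1000, 3000), (1100, 3100), (3000, 6000)]

near_cluster = kde_2d(toy_days, 1000, 3000, bandwidth=500)
near_outlier = kde_2d(toy_days, 3000, 6000, bandwidth=500)
far_away = kde_2d(toy_days, 5000, 500, bandwidth=500)

# Density is highest inside the cluster and falls off away from the data.
print(near_cluster, near_outlier, far_away)
```

The innermost contours of a KDE plot enclose regions like the cluster here, where the estimate is highest.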
0.0.9 Question 4c

What additional details about the riders can you identify from this contour plot that were difficult to determine from the scatter plot?

We can see that there are registered riders on non-working days, which we could not see before. These data points were obscured on the scatterplot. The contour plot reveals that both registered and casual (non-registered) riders are widely distributed across working and non-working days, rather than being tied solely to one or the other.
0.1 5: Joint Plot

As an alternative approach to visualizing the data, construct the following set of three plots where the main plot shows the contours of the kernel density estimate of daily counts for registered and casual riders plotted together, and the two “margin” plots (at the top and right of the figure) provide the univariate kernel density estimate of each of these variables. Note that this plot makes it harder to see the linear relationships between casual and registered for the two different conditions (weekday vs. weekend). You should be making use of daily_counts.

Hints:
* The seaborn plotting tutorial has examples that may be helpful.
* Take a look at sns.jointplot and its kind parameter.
* set_axis_labels can be used to rename axes on a seaborn plot. For example, if we wanted to plot a scatterplot with ‘Height’ on the x-axis and ‘Weight’ on the y-axis from some dataset stats_df, we could write the following (sns.relplot is used here because it returns a grid object that has set_axis_labels, whereas sns.scatterplot returns a plain matplotlib Axes):

    graph = sns.relplot(data=stats_df, x='Height', y='Weight')
    graph.set_axis_labels("Height (cm)", "Weight (kg)")

Note:
* At the end of the cell, we called plt.suptitle to set a custom location for the title.
* We also called plt.subplots_adjust(top=0.9) in case your title overlaps with your plot.

In [29]:
plot = sns.jointplot(data=daily_counts, x='casual', y='registered', kind='kde')
plot.set_axis_labels('Daily Count Casual Riders', 'Daily Count Registered Riders')
plt.suptitle("KDE Contours of Casual vs Registered Rider Count")
plt.subplots_adjust(top=0.9);
0.2 6: Understanding Daily Patterns

0.2.1 Question 6a

Let’s examine the behavior of riders by plotting the average number of riders for each hour of the day over the entire dataset (that is, bike DataFrame), stratified by rider type.

Your plot should look like the plot below. While we don’t expect your plot’s colors to match ours exactly, your plot should have a legend in the plot and different colored lines for different kinds of riders, in addition to the title and axis labels.

In [31]:
# Average every numeric column within each hour of the day
bike_by_hour = bike.groupby('hr').mean(numeric_only=True)
sns.lineplot(data=bike_by_hour, x='hr', y='casual', label='casual')
sns.lineplot(data=bike_by_hour, x='hr', y='registered', label='registered')
plt.legend()
plt.title('Average Count of Casual vs. Registered by Hour')
plt.xlabel('Hour of the Day');
plt.ylabel('Average Count');
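The group-then-average step behind this plot can be sketched without any libraries: collect the counts for each hour across all days, then divide by the number of rows for that hour. The rows and the `mean_by_hour` helper below are hypothetical and only illustrate the computation.

```python
from collections import defaultdict

def mean_by_hour(rows, column):
    """Average `column` over all rows that share the same 'hr' value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["hr"]] += row[column]
        counts[row["hr"]] += 1
    return {hr: totals[hr] / counts[hr] for hr in sorted(totals)}

# Hypothetical hourly rows taken from two different days.
toy_rows = [
    {"hr": 8, "casual": 20, "registered": 300},
    {"hr": 8, "casual": 40, "registered": 500},
    {"hr": 15, "casual": 80, "registered": 200},
    {"hr": 15, "casual": 100, "registered": 220},
]

avg_casual = mean_by_hour(toy_rows, "casual")
avg_registered = mean_by_hour(toy_rows, "registered")
print(avg_casual, avg_registered)
```

Each dictionary corresponds to one line in the plot, with the hour on the x-axis and the average count on the y-axis.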
0.2.2 Question 6b

What can you observe from the plot? Discuss your observations for both types of riders, and hypothesize about the meaning of the peaks in the registered riders’ distribution.

From the plot, we see that there are more registered riders than casual riders across all hours. The number of casual riders stays fairly constant across the hours, apart from a peak around 3 PM, after which the curve drops off. The registered rider distribution has two spikes, at 8 AM and 5 PM, each followed by a sharp decline. This is likely because these are prime commuting times, and people who use the bikes to commute are usually registered users because they ride consistently.
0.2.3 Question 7b

In our case, with the bike ridership data, we want 7 curves, one for each day of the week. The x-axis will be the temperature (as given in the 'temp' column), and the y-axis will be a smoothed version of the proportion of casual riders.

You should use statsmodels.nonparametric.smoothers_lowess.lowess just like the example above. Unlike the example above, plot ONLY the lowess curve. Do not plot the actual data, which would result in overplotting. For this problem, the simplest way is to use a loop.

You do not need to match the colors on our sample plot as long as the colors in your plot make it easy to distinguish which day they represent.

Hints:
• Start by plotting only one day of the week to make sure you can do that first. Then, consider using a for loop to repeat this plotting operation for all days of the week.
• The lowess function expects the y coordinate first, then the x coordinate. You should also set the return_sorted field to False.
• You will need to rescale the normalized temperatures stored in this dataset to Fahrenheit values. Look at the section of this notebook titled ‘Loading Bike Sharing Data’ for a description of the (normalized) temperature field to know how to convert back to Celsius first. After doing so, convert it to Fahrenheit. By default, the temperature field ranges from 0.0 to 1.0. In case you need it, $\text{Fahrenheit} = \text{Celsius} \times \frac{9}{5} + 32$.

Note: If you prefer plotting temperatures in Celsius, that’s fine as well! Just remember to convert accordingly so the graph is still interpretable.

In [40]:
from statsmodels.nonparametric.smoothers_lowess import lowess

# Rescale the normalized temperature to Celsius (x 41), then convert to Fahrenheit
bike['temp_f'] = bike['temp'] * 41 * (9 / 5) + 32
days = ['Sat', 'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri']
plt.figure(figsize=(10, 8))
for day in days:
    filtered = bike[bike['weekday'] == day]
    # lowess takes y first, then x; return_sorted=False keeps the input row order
    graph = lowess(filtered['prop_casual'], filtered['temp_f'], return_sorted=False)
    sns.lineplot(data=filtered, x='temp_f', y=graph, label=day)
plt.title('Temperature vs Casual Rider Proportion by Weekday')
plt.xlabel('Temperature (Fahrenheit)')
plt.ylabel('Casual Rider Proportion');
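The temperature rescaling can be checked in isolation. The sketch below assumes, as the answer's code does, that multiplying the normalized value by 41 recovers degrees Celsius; the function name and default argument are hypothetical.

```python
def normalized_temp_to_fahrenheit(temp, max_celsius=41.0):
    """Convert a normalized temperature in [0, 1] to degrees Fahrenheit.

    Assumes Celsius = temp * max_celsius, with max_celsius = 41 taken from
    the scaling factor used in the answer's code.
    """
    celsius = temp * max_celsius
    return celsius * 9 / 5 + 32

# Endpoints and midpoint of the normalized range.
for t in (0.0, 0.5, 1.0):
    print(t, normalized_temp_to_fahrenheit(t))
```

Under this assumption the normalized range 0.0 to 1.0 maps to roughly 32 to 105.8 degrees Fahrenheit, which is the span the x-axis of the plot should cover.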
0.2.4 Question 7c

What do you observe in the above plot? How is prop_casual changing as a function of temperature? Do you notice anything else interesting?

As temperature increases, the proportion of casual (non-registered) riders increases. The rate of increase, however, changes after a certain temperature. On Fridays, Saturdays, and Sundays, the proportion of casual riders rises along with the temperature, but the increase slows once the temperature reaches around 60F. The other days show a different effect: their rate of increase becomes faster once the temperature reaches around 70F.
0.2.5 Question 8a

Imagine you are working for a bike-sharing company that collaborates with city planners, transportation agencies, and policymakers in order to implement bike-sharing in a city. These stakeholders would like to reduce congestion and lower transportation costs. They also want to ensure the bike-sharing program is implemented equitably. In this sense, equity is a social value that informs the deployment and assessment of your bike-sharing technology.

Equity in transportation includes: Improving the ability of people of different socio-economic classes, genders, races, and neighborhoods to access and afford transportation services and assessing how inclusive transportation systems are over time.

Do you think the bike data as it is can help you assess equity? If so, please explain. If not, how would you change the dataset? You may discuss how you would change the granularity, what other kinds of variables you’d introduce to it, or anything else that might help you answer this question.

Note: There is no single “right” answer to this question; we are looking for thoughtful reflection and commentary on whether or not this dataset, in its current form, encodes information about equity.

I do not believe the bike data can help us assess equity. The data describes bike-sharing in DC as a whole, but it provides no information on specific neighborhoods, income levels, socioeconomic standing, race, or costs. Many new variables would need to be added before this dataset could help: the cost to different riders, ride durations, the locations of bike-sharing stations in relation to income levels and demographics, and where bikes are found. These variables would refine the granularity of the dataset and could also tie each ride to the specific location where the rider picked up the bike.
0.2.6 Question 8b

Bike sharing is growing in popularity, and new cities and regions are making efforts to implement bike-sharing systems that complement their other transportation offerings. The goals of these efforts are to have bike sharing serve as an alternate form of transportation in order to alleviate congestion, provide geographic connectivity, reduce carbon emissions, and promote inclusion among communities.

Bike-sharing systems have spread to many cities across the country. The company you work for asks you to determine the feasibility of expanding bike sharing to additional cities in the US.

Based on your plots in this assignment, would you recommend expanding bike sharing to additional cities in the US? If so, what cities (or types of cities) would you suggest? Please list at least two reasons why, and mention which plot(s) you drew your analysis from.

Note: There isn’t a set right or wrong answer for this question. Feel free to come up with your own conclusions based on evidence from your plots!

I would recommend expanding bike sharing to additional cities. I would suggest cities that are similar to Washington, DC in population density and work patterns, such as SF, LA, Boston, Seattle, NJ, etc. These cities have many commuters who use various forms of transportation to get to work, and because the cities are dense, many people need some form of mobility to reach their workplace. Furthermore, the warm temperatures in these places during spring and summer would encourage more travel by bike. Expanding our service into these cities would reduce strain on public transportation. A specific plot that informs my analysis is plot 4a, which indicates that a large number of riders, specifically registered riders, use bike sharing. Furthermore, plot 6a shows that a majority of the riders using bike-sharing services were registered. A