hw04 (1)

School

Santa Clara University *

Course

19

Subject

Statistics

Date

Apr 3, 2024

Type

pdf

Pages

32

Uploaded by CountRaccoon297

0.0.1 Question 1a

What is the granularity of the data (i.e., what does each row represent)?

Hint: Examine all variables present in the dataset carefully before answering this question! Pay special attention to the time-based columns.

The granularity of the data is hourly: each row corresponds to a single hour of a single day, as the hr column used in later questions shows, and contains the data recorded for that hour.
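One way to check a granularity claim is to test which combination of columns identifies each row uniquely. The following is a minimal sketch on made-up toy rows, not the real bike data; the `date` field name and the `is_unique_key` helper are assumptions for illustration, while `hr` is the hour column that appears later in the assignment.

```python
from collections import Counter

# Toy records standing in for rows of the data (values are made up).
# The `date` field name is hypothetical; `hr` is the hour-of-day column.
rows = [
    {"date": "2011-01-01", "hr": 0, "casual": 3, "registered": 13},
    {"date": "2011-01-01", "hr": 1, "casual": 8, "registered": 32},
    {"date": "2011-01-02", "hr": 0, "casual": 4, "registered": 13},
]


def is_unique_key(records, fields):
    """Return True if the given fields identify each record uniquely."""
    keys = Counter(tuple(r[f] for f in fields) for r in records)
    return all(count == 1 for count in keys.values())


daily_key = is_unique_key(rows, ["date"])         # several rows share a date
hourly_key = is_unique_key(rows, ["date", "hr"])  # one row per (date, hour)
print(daily_key, hourly_key)
```

If the date alone does not identify a row but the date and hour together do, each row represents one hour rather than one day.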
0.0.2 Question 1b

For this assignment, we’ll be using this data to study bike usage in Washington, DC. Based on the granularity and the variables present in the data, what might some limitations of using this data be? What are two additional data categories/variables that one could collect to address some of these limitations?

One drawback is the absence of comprehensive meteorological information, especially on precipitation amounts, which can greatly affect the quantity of bikes rented on a daily basis. Furthermore, the dataset lacks detailed geographic information, such as the precise locations of rentals or bike stations inside the city, that would be necessary to comprehend regional trends of bike utilization and pinpoint high-demand regions.

Precipitation data would improve the dataset and offer important insights into how weather influences bike rentals. Incorporating location data with precise start and finish locations for bike rentals may also help with the placement and supply of bikes across the city, uncover popular routes, and allow for a more thorough examination of user behavior and citywide mobility patterns.
0.0.3 Question 3a

Use the sns.histplot (documentation) function to create a plot that overlays the distribution of the daily counts of bike users, using blue to represent casual riders, and green to represent registered riders. The temporal granularity of the records should be daily counts, which you should have after completing question 2.c. In other words, you should be using daily_counts to answer this question.

Hints:
- You will need to set the stat parameter appropriately to match the desired plot.
- The label parameter of sns.histplot allows you to specify, as a string, how the plot should be labeled in the legend. For example, passing in label="My data" would give your plot the label “My data” in the legend.
- You will need to make two calls to sns.histplot.

Include a legend, xlabel, ylabel, and title. Read the seaborn plotting tutorial if you’re not sure how to add these. After creating the plot, look at it and make sure you understand what the plot is actually telling us, e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000.

For all visualizations in Data 100, our grading team will evaluate your plot based on its similarity to the provided example. While your plot does not need to be identical to the example shown, we do expect it to capture its main features, such as the general shape of the distribution, the axis labels, the legend, and the title. It is okay if your plot contains small stylistic differences, such as differences in color, line weight, font, or size/scale.

In [36]:
    # Both histplot calls are cut off at the page edge in the source; the second
    # label and the closing parentheses are a minimal completion by assumption.
    sns.histplot(data=daily_counts, x='casual', color='blue', stat='density', label='Casual Riders')
    sns.histplot(data=daily_counts, x='registered', color='green', stat='density', label='Registered Riders')
    plt.xlabel('Rider Count')
    plt.ylabel('Density')
    plt.title('Distribution Comparison of Casual vs Registered Riders')
    plt.legend()
    plt.show()
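To make the meaning of stat='density' concrete, here is a small sketch of how density-scaled bar heights can be computed by hand: each height is the bin's share of the observations divided by the bin width, so the bar areas sum to 1. The `density_histogram` helper, the toy counts, and the bin edges are all assumptions for illustration; this is not how seaborn is implemented internally.

```python
def density_histogram(values, bin_edges):
    """Return bar heights scaled so that sum(height * width) == 1."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lo, hi), except the last, which includes hi.
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                counts[i] += 1
                break
    n = sum(counts)
    widths = [bin_edges[i + 1] - bin_edges[i] for i in range(len(counts))]
    return [c / (n * w) for c, w in zip(counts, widths)]


toy_counts = [100, 250, 300, 900, 1200, 1500, 2100, 2900]  # made-up daily counts
edges = [0, 1000, 2000, 3000]
heights = density_histogram(toy_counts, edges)
total_area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights, total_area)
```

Because the heights are densities rather than raw counts, two groups with very different totals can be overlaid on the same axes and compared by shape.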
0.0.4 Question 3b

In the cell below, describe the differences you notice between the density curves for casual and registered riders. Consider concepts such as modes, symmetry, skewness, tails, gaps, and outliers. Include a comment on the spread of the distributions.

The bike ridership histogram shows clear differences in the usage habits of registered and casual riders. The distribution of casual riders is right-skewed: it has a single pronounced peak at low counts and a long right tail, so days with few casual riders are common and days with very high casual ridership are rare. Density falls off sharply as the counts move away from that common low range.

The distribution of registered riders, on the other hand, is less peaked and broader, which indicates a larger spread and a more consistent use pattern among these riders. It has a single mode near the middle of its range and is, for the most part, symmetric. Occasional gaps or dips in the registered distribution may reflect external influences on ridership, such as varying weather conditions or holidays.
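Skewness can also be checked numerically rather than by eye. The sketch below computes the population skewness (the mean of the cubed z-scores) on two made-up samples; the `skewness` helper and both toy lists are assumptions for illustration and are not taken from daily_counts.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Made-up samples: one with a long right tail, one symmetric about its mean.
toy_casual = [200, 300, 350, 400, 500, 700, 1200, 3000]
toy_registered = [1000, 2500, 3500, 4000, 4000, 4500, 5500, 7000]

print(skewness(toy_casual))      # positive: right-skewed
print(skewness(toy_registered))  # close to zero: symmetric
```

A clearly positive value corresponds to a long right tail, while a value near zero corresponds to a roughly symmetric distribution.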
0.0.5 Question 3c

The density plots do not show us how the counts for registered and casual riders vary together. Use sns.lmplot (documentation) to make a scatter plot to investigate the relationship between casual and registered counts. This time, let’s use the bike DataFrame to plot hourly counts instead of daily counts.

The lmplot function will also try and draw a linear regression line (just as you saw in Data 8). Color the points in the scatterplot according to whether or not the day is a working day (your colors do not have to match ours exactly, but they should be different based on whether the day is a working day).

Hints:
* Check out this helpful tutorial on lmplot.
* There are many points in the scatter plot, so make them small to help reduce overplotting. Check out the scatter_kws parameter of lmplot.
* You can set the height parameter if you want to adjust the size of the lmplot.
* Add a descriptive title and axis labels for your plot.
* You should be using the bike DataFrame to create your plot.
* It is okay if the scales of your x and y axis (i.e., the numbers labeled on the two axes) are different from those used in the provided example.

In [35]:
    sns.set(font_scale=1)  # This line automatically makes the font size a bit bigger on the plot.
    sns.lmplot(x='casual', y='registered', data=bike, hue='workingday', scatter_kws={'s': 0.5})
    plt.title('Comparison of Casual vs Registered Riders on Working and Non-working Days')
    plt.xlabel('Casual Riders')
    plt.ylabel('Registered Riders')
    plt.show()
0.0.6 Question 3d

What does this scatterplot seem to reveal about the relationship (if any) between casual and registered riders and whether or not the day is on the weekend? What effect does overplotting have on your ability to describe this relationship?

The number of casual and registered riders appears to be positively correlated; however, this relationship is stronger on non-working days. The rise in registered ridership during working days appears to be independent of the number of casual riders, indicating that registered users may be using the service for their daily commute, irrespective of the casual ridership.

The scatterplot shows overplotting, especially among the casual riders on non-working days, which may mask more subtle trends in the densest regions of the plot. It may be challenging to determine the precise number of points in the busiest regions or to spot possible outliers due to this overplotting.
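One common way to see through overplotting is to count how many points fall into each cell of a grid instead of drawing every point. The sketch below does this with the standard library on made-up (casual, registered) pairs; the `bin_counts` helper, the cell size, and the toy points are assumptions for illustration.

```python
from collections import Counter


def bin_counts(points, cell_size):
    """Count how many (x, y) points fall in each square grid cell."""
    return Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)


# Made-up (casual, registered) pairs; four of them pile up near the origin.
toy_points = [(3, 10), (4, 12), (5, 14), (6, 11), (40, 200), (45, 210)]
cells = bin_counts(toy_points, cell_size=50)
densest_cell, n_points = cells.most_common(1)[0]
print(densest_cell, n_points)
```

The per-cell counts recover exactly the information that overlapping markers hide: how many observations sit in the busiest regions.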
0.0.7 Question 4a (Bivariate Kernel Density Plot)

Generate a bivariate kernel density plot with workday and non-workday separated using the daily_counts DataFrame.

Hints: You only need to call sns.kdeplot once. Take a look at the hue parameter and adjust other inputs as needed.

After you get your plot working, experiment by setting fill=True in kdeplot to see the difference between the shaded and unshaded versions. Please submit your work with fill=False.

In [31]:
    # Set the figure size for the plot
    plt.figure(figsize=(12, 8))
    sns.kdeplot(data=daily_counts, x='casual', y='registered', hue='workingday', fill=False)
    plt.title('Bivariate KDE Plot Comparison of Registered vs Casual Riders')
    plt.show()
0.0.8 Question 4b

With some modification to your 4a code (this modification is not in scope), we can generate the plot above. In your own words, describe what the lines and the color shades of the lines signify about the data. What does each line and color represent?

Hint: You may find it helpful to compare it to a contour or topographical map as shown here.

The plot displays the concentration of rides made by both casual and registered users, with clear differences between workdays and non-workdays. Each contour line connects combinations of casual and registered counts that have the same estimated density, much like a line of equal elevation on a topographical map, and the color of a line identifies whether it belongs to workdays or non-workdays.

For example, workdays (red tones) may have a higher concentration of registered rides than non-workdays (blue tones), suggesting that workdays are when people use the service more frequently for routine or commute-related purposes. On the other hand, the distribution may be more dispersed or have a different center on non-workdays, indicating a more flexible or recreational use of the service.
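The topographical-map analogy can be made concrete by evaluating a kernel density estimate directly: the contour lines in a bivariate KDE plot connect locations where this estimate takes the same value. The sketch below is a simplified Gaussian KDE with one shared, hand-picked bandwidth; the `kde2d` helper, the bandwidth, and the toy points are assumptions for illustration, and seaborn's own bandwidth selection differs.

```python
import math


def kde2d(x, y, data, bandwidth):
    """Gaussian kernel density estimate at (x, y) with a shared bandwidth."""
    norm = 1.0 / (len(data) * 2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for xi, yi in data:
        sq_dist = (x - xi) ** 2 + (y - yi) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


# Made-up (casual, registered) daily totals: three similar days and one unusual day.
toy_days = [(600, 3000), (700, 3200), (650, 3100), (2000, 2500)]
near_cluster = kde2d(650, 3100, toy_days, bandwidth=200)
near_single = kde2d(2000, 2500, toy_days, bandwidth=200)
far_away = kde2d(3000, 6000, toy_days, bandwidth=200)
print(near_cluster, near_single, far_away)
```

The estimate is highest inside the cluster of similar days, lower around the isolated day, and essentially zero far from all observations, which is the "elevation" that the contour lines trace.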
0.0.9 Question 4c

What additional details about the riders can you identify from this contour plot that were difficult to determine from the scatter plot?

The contour plot can reveal information about the riders that a scatter plot might not be able to, such as the density and the relationship between the counts of casual and registered riders. The contour plot, in particular, aids in locating the densest areas by displaying the typical numbers of casual and registered riders. Additionally, it can display the distribution and trend of the data, indicating the type of relationship that exists between casual and registered riders.

For instance, the contour plot may reveal that on certain days, there is a high density of registered riders with a low count of casual riders, or vice versa. It might also show if there is a consistent pattern where the increase in registered riders corresponds to an increase or decrease in casual riders. Such patterns might be less discernible in a scatter plot, especially if many points overlap, because the scatter plot does not directly show the density of points.
0.1 5: Joint Plot

As an alternative approach to visualizing the data, construct the following set of three plots where the main plot shows the contours of the kernel density estimate of daily counts for registered and casual riders plotted together, and the two “margin” plots (at the top and right of the figure) provide the univariate kernel density estimate of each of these variables. Note that this plot makes it harder to see the linear relationships between casual and registered for the two different conditions (weekday vs. weekend). You should be making use of daily_counts.

Hints:
* The seaborn plotting tutorial has examples that may be helpful.
* Take a look at sns.jointplot and its kind parameter.
* set_axis_labels can be used to rename axes on a seaborn plot. For example, if we wanted to plot a scatterplot with ‘Height’ on the x-axis and ‘Weight’ on the y-axis from some dataset stats_df, we could write the following:

    graph = sns.relplot(data=stats_df, x='Height', y='Weight')
    graph.set_axis_labels("Height (cm)", "Weight (kg)")

Note:
* At the end of the cell, we called plt.suptitle to set a custom location for the title.
* We also called plt.subplots_adjust(top=0.9) in case your title overlaps with your plot.

In [20]:
    # jointplot creates its own figure, so its size is set with height
    # rather than with a separate plt.figure call.
    sns.jointplot(data=daily_counts, x="casual", y="registered", kind="kde", height=8)
    plt.suptitle("KDE Contours of Casual vs Registered Rider Count")
    plt.subplots_adjust(top=0.9);
0.2 6: Understanding Daily Patterns

0.2.1 Question 6a

Let’s examine the behavior of riders by plotting the average number of riders for each hour of the day over the entire dataset (that is, bike DataFrame), stratified by rider type.

Your plot should look like the plot below. While we don’t expect your plot’s colors to match ours exactly, your plot should have a legend in the plot and different colored lines for different kinds of riders, in addition to the title and axis labels.

In [21]:
    # Average the hourly counts over all days, separately for each rider type
    average_counts = bike.groupby('hr')[['casual', 'registered']].mean().reset_index()
    plt.figure(figsize=(12, 6))
    sns.lineplot(x='hr', y='casual', data=average_counts, label='casual')
    sns.lineplot(x='hr', y='registered', data=average_counts, label='registered')
    plt.title('Average Count of Casual vs. Registered by Hour')
    plt.xlabel('Hour of the Day')
    plt.ylabel('Average Count')
    plt.legend()
    plt.show()
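The group-and-average step in the cell above can be sketched without pandas to show what it computes: for every hour, sum the counts across all rows for that hour and divide by the number of rows. The `mean_by_hour` helper and the toy records are assumptions for illustration; only the `hr`, `casual`, and `registered` field names come from the assignment.

```python
from collections import defaultdict


def mean_by_hour(records, field):
    """Average `field` over all records that share the same `hr` value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        totals[r["hr"]] += r[field]
        counts[r["hr"]] += 1
    return {hr: totals[hr] / counts[hr] for hr in sorted(totals)}


# Made-up hourly records from different days.
toy_records = [
    {"hr": 8, "casual": 10, "registered": 300},
    {"hr": 8, "casual": 20, "registered": 500},
    {"hr": 13, "casual": 60, "registered": 150},
]
avg_casual = mean_by_hour(toy_records, "casual")
avg_registered = mean_by_hour(toy_records, "registered")
print(avg_casual, avg_registered)
```

Each resulting value is one point on the corresponding line in the hourly plot.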
0.2.2 Question 6b

What can you observe from the plot? Discuss your observations for both types of riders, and hypothesize about the meaning of the peaks in the registered riders’ distribution.

We can infer different use patterns for casual and registered riders during the day by looking at the plot for question 6a. It appears that casual ridership rises gradually in the morning, peaks in the early afternoon, and then declines as dusk draws near. This implies that non-commuters or tourists who use the service occasionally prefer to use it for errands or leisure during the day.

In contrast, there are two clear peaks in the registered ridership, one in the early morning and another in the late afternoon, which correspond to regular work commute times. This suggests that commuters who use the service to get to work in the morning and go home in the evening are probably among the registered riders. The midday and late night troughs indicate that demand is lower during working hours and when most people are at home.

The work commute could be the cause of the peaks in the registered riders’ distribution; the morning peak would be associated with the rush to get to work, and the evening peak would be associated with the mass return home. This pattern highlights potential opportunities for targeted promotions or increased service availability during peak hours, underscoring the importance of the service for regular commuters.
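The two commute peaks described here can also be located programmatically by looking for hours whose average is higher than both neighbouring hours. The sketch below uses a made-up 24-value series shaped like a two-peak commute pattern; the `local_peaks` helper and the numbers are assumptions for illustration and are not read from the actual plot.

```python
def local_peaks(series):
    """Return indices whose value is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    ]


# Made-up average registered counts for hours 0 through 23.
toy_hourly = [
    20, 10, 5, 5, 10, 40, 150, 350,
    480, 300, 200, 210, 220, 230, 240, 260,
    330, 500, 450, 320, 230, 170, 120, 70,
]
peak_hours = local_peaks(toy_hourly)
print(peak_hours)
```

On real data, small bumps would also register as peaks, so a threshold or some smoothing would likely be needed before interpreting the result.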
0.2.3 Question 7b

In our case, with the bike ridership data, we want 7 curves, one for each day of the week. The x-axis will be the temperature (as given in the 'temp' column), and the y-axis will be a smoothed version of the proportion of casual riders.

You should use statsmodels.nonparametric.smoothers_lowess.lowess just like the example above. Unlike the example above, plot ONLY the lowess curve. Do not plot the actual data, which would result in overplotting. For this problem, the simplest way is to use a loop.

You do not need to match the colors on our sample plot as long as the colors in your plot make it easy to distinguish which day they represent.

Hints:
* Start by plotting only one day of the week to make sure you can do that first. Then, consider using a for loop to repeat this plotting operation for all days of the week.
* The lowess function expects the y coordinate first, then the x coordinate. You should also set the return_sorted field to False.
* You will need to rescale the normalized temperatures stored in this dataset to Fahrenheit values. Look at the section of this notebook titled ‘Loading Bike Sharing Data’ for a description of the (normalized) temperature field to know how to convert back to Celsius first. After doing so, convert it to Fahrenheit. By default, the temperature field ranges from 0.0 to 1.0. In case you need it, Fahrenheit = Celsius × 9/5 + 32.

Note: If you prefer plotting temperatures in Celsius, that’s fine as well! Just remember to convert accordingly so the graph is still interpretable.

In [34]:
    from statsmodels.nonparametric.smoothers_lowess import lowess

    plt.figure(figsize=(10, 8))
    # Rescale the normalized temperature to Celsius (x 41), then convert to Fahrenheit
    bike['temp_F'] = bike['temp'] * 41 * (9 / 5) + 32
    for day in bike['weekday'].unique():
        data = bike[bike['weekday'] == day]
        # lowess takes y first, then x; with the default return_sorted=True it
        # returns (x, smoothed y) pairs sorted by x, which is what is plotted here
        line = lowess(data['prop_casual'], data['temp_F'])
        plt.plot(line[:, 0], line[:, 1], label=day)
    plt.xlabel('Temperature (Fahrenheit)')
    plt.ylabel('Casual Rider Proportion')
    plt.title('Temperature vs Casual Rider Proportion by Weekday')
    plt.legend()
    plt.show()
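The temperature rescaling used in the cell above can be isolated into a small function and spot-checked on known values. This is a minimal sketch: the `normalized_to_fahrenheit` name is hypothetical, and the scale factor of 41 is taken from the code above rather than verified against the dataset description.

```python
import math


def normalized_to_fahrenheit(temp_norm, max_celsius=41.0):
    """Convert a normalized temperature in [0, 1] to degrees Fahrenheit.

    Assumes the normalized value is Celsius divided by `max_celsius`
    (41 in the cell above).
    """
    celsius = temp_norm * max_celsius
    return celsius * 9.0 / 5.0 + 32.0


print(normalized_to_fahrenheit(0.0))
print(normalized_to_fahrenheit(0.5))
print(normalized_to_fahrenheit(1.0))
```

Under this assumption, a normalized value of 0 maps to the freezing point (32 °F) and a value of 1 maps to 41 °C, which is a quick sanity check on the x-axis range of the plot.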
0.2.4 Question 7c

What do you observe in the above plot? How is prop_casual changing as a function of temperature? Do you notice anything else interesting?

The percentage of casual riders (prop_casual) in the plot generally rises with temperature. The fact that this pattern holds true on all days suggests that warmer weather will probably lead to an increase in casual usage of the bike-sharing service.

In addition, the plot indicates that the slope of the increase appears to be steeper on weekends (Saturday and Sunday) than on weekdays. This may suggest that temperature has a greater impact on casual riders at weekends, when they may have more time for leisure activities such as bike sharing.

Furthermore, the starting points of the curves differ by day, with weekdays starting lower than weekends, even though the increase in prop_casual is clearly visible on every day as temperatures rise. This may suggest that there is a base level of casual ridership that is independent of temperature, perhaps for essential travel, but that leisure travel, which is more temperature-dependent, particularly on weekends, contributes to higher levels of casual ridership.
0.2.5 Question 8a

Imagine you are working for a bike-sharing company that collaborates with city planners, transportation agencies, and policymakers in order to implement bike-sharing in a city. These stakeholders would like to reduce congestion and lower transportation costs. They also want to ensure the bike-sharing program is implemented equitably. In this sense, equity is a social value that informs the deployment and assessment of your bike-sharing technology.

Equity in transportation includes: Improving the ability of people of different socio-economic classes, genders, races, and neighborhoods to access and afford transportation services and assessing how inclusive transportation systems are over time.

Do you think the bike data as it is can help you assess equity? If so, please explain. If not, how would you change the dataset? You may discuss how you would change the granularity, what other kinds of variables you’d introduce to it, or anything else that might help you answer this question.

Note: There is no single “right” answer to this question – we are looking for thoughtful reflection and commentary on whether or not this dataset, in its current form, encodes information about equity.

Even though the current bike dataset might shed light on usage trends, it might not be enough to conduct a thorough assessment of equity. The dataset would need to contain geographic data to determine where the service is being used and who has access to it, as well as demographic data about riders, such as age, gender, income level, and race, in order to address equity. Surveys of riders’ satisfaction, accommodations for people with disabilities, and multilingual accessibility are possible additional factors. Adjustments to collect longitudinal data may demonstrate shifts over time, illustrating the effects of policy adjustments or service expansions.

The dataset would need to measure accessibility and usage across various demographics and reflect the diversity of the community in order to accurately assess equity.
0.2.6 Question 8b

Bike sharing is growing in popularity, and new cities and regions are making efforts to implement bike-sharing systems that complement their other transportation offerings. The goals of these efforts are to have bike sharing serve as an alternate form of transportation in order to alleviate congestion, provide geographic connectivity, reduce carbon emissions, and promote inclusion among communities.

Bike-sharing systems have spread to many cities across the country. The company you work for asks you to determine the feasibility of expanding bike sharing to additional cities in the US.

Based on your plots in this assignment, would you recommend expanding bike sharing to additional cities in the US? If so, what cities (or types of cities) would you suggest? Please list at least two reasons why, and mention which plot(s) you drew your analysis from.

Note: There isn’t a set right or wrong answer for this question. Feel free to come up with your own conclusions based on evidence from your plots!

The analysis suggests that expanding bike sharing to more US cities is a promising idea. Two plots support this decision most directly. The hourly plot from question 6a shows heavy registered use during commute hours, which suggests that bike sharing can reduce peak traffic. The temperature plot from question 7b shows a positive relationship between warmer weather and the proportion of casual riders, which points to the possibility of bike sharing becoming popular in cities with temperate climates.

Expansion, however, ought to take into account the particular requirements of every city, including the local transit system, the density of the city, and the cultural perspectives on cycling. Bike-sharing schemes would probably be most beneficial to cities with strong public transportation networks and a dedication to environmentally friendly mobility. Additionally, smaller or mid-sized cities where traffic is starting to become an issue might adopt bike sharing as a preventative measure.