kavyaPratapSingh_Module2_Report
Module 2 Assignment: EDA
College of Professional Studies, Northeastern University
Professor: Justin Grosz
Mar 1, 2024
Introduction
The dataset we're working with is a treasure trove of information about how people watch shows on Netflix. It records who is watching, what they are watching, and when they are watching it. In this paper, we clean the data, select the most informative variables, and explore whether we can predict if a viewer will finish an episode. In doing so, we hope to show how useful this data can be for understanding viewer behavior and for deciding which shows to produce or recommend.
Data Cleaning
In the initial stage of data cleaning, I first conducted an overview of the dataset to understand its dimensions and structure. This involved obtaining the total number of records and features, which revealed that there were 3004 records and 10 features in the dataset. Following this preliminary assessment, I focused on identifying and handling missing values within the dataset.
Several columns contained null values: 'Season', 'Episode', 'Time Watched', 'Gender', 'Completed', and 'Time of Day'. The percentage of missing values per column ranged from approximately 0.5% to 3.3%.
Figure 1: Percentage of Missing Data
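The per-column missing-value percentages described above can be computed along the following lines. This is a minimal sketch using only the Python standard library; the sample rows, the use of `None` to mark a missing value, and the helper name `missing_percentages` are assumptions made for illustration and are not taken from the report's dataset or code.

```python
# Hypothetical sample rows; None marks a missing value (an assumption).
rows = [
    {"Show": "Friends", "Season": 10, "Time Watched": 22, "Gender": "F"},
    {"Show": "Friends", "Season": None, "Time Watched": 18, "Gender": None},
    {"Show": "Stranger Things", "Season": 2, "Time Watched": None, "Gender": "M"},
    {"Show": "Stranger Things", "Season": 1, "Time Watched": 50, "Gender": "M"},
]


def missing_percentages(records):
    """Return {column: percentage of records whose value is None}."""
    columns = records[0].keys()
    total = len(records)
    return {
        col: 100.0 * sum(1 for r in records if r.get(col) is None) / total
        for col in columns
    }


pct = missing_percentages(rows)
for col, value in pct.items():
    print(f"{col}: {value:.1f}% missing")
```

Reporting the share per column, rather than a single overall count, makes it easier to judge column by column whether the missing values are few enough to leave unimputed.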
For the 'Season' column, each show has a different number of seasons, and a missing value could stem from incomplete data or from a new show without a specified season. Imputing with the mean, median, or mode could therefore misrepresent the data. For example, filling a missing value with the mean season number would assign the episode to a season it may not belong to. The same applies to the 'Episode' column: episodes within a season might not follow a sequential pattern or could have unique identifiers, so imputing episode numbers without considering the context of the show could introduce inconsistencies and misrepresent viewing patterns.
In the 'Time Watched' column, missing values may result from factors such as interruptions during viewing or technical issues. Imputing them with the mean, median, or mode could distort the distribution of viewing times and misrepresent actual viewing habits. For instance, assigning the average viewing time to every missing value would not reflect the diverse viewing durations among users.
Regarding the 'Gender' column, a mean or median is not defined for a categorical variable, and filling with the mode would lack a meaningful basis, since viewing habits on Netflix are unlikely to indicate a viewer's gender. Imputing missing gender data could introduce biases and inaccuracies in subsequent analyses and conclusions. Lastly, in the 'Time of Day' column, missing values could result from technical limitations or inconsistencies in data recording. Imputing them with a single typical value could distort the distribution of viewing times of day and lead to inaccurate interpretations of viewing patterns.
To assess outliers, I examined 'Episode', 'Season', and 'Time Watched', the only numerical variables in the dataset. For the 'Episode' column, no outliers were detected, as all values fell within the expected range of 1 to 10 episodes.
Figure 2: Outliers for Episode
In the 'Season' column, 55 values were initially flagged as outliers because they fell outside the bounds derived from the interquartile range (IQR). However, further investigation revealed that the maximum value of 10 seasons appeared in 32 records of the show 'Friends'. Because this maximum is a valid data point, I decided not to treat these instances as outliers.
Figure 3: Outliers for Season
Similarly, in the 'Time Watched' column, 6 values were initially flagged as outliers, with a maximum of 90 minutes. However, given the variability in episode lengths and viewing habits, categorizing these values as outliers may not be appropriate. Therefore, I decided against removing them from the dataset. In both cases, the decision not to remove outliers was made to preserve the integrity and representativeness of the dataset, acknowledging the context of each variable and its relevance to the analysis of Netflix viewing behaviors.
Figure 4: Outliers for Time Watched
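The IQR-based flagging discussed above can be sketched as follows. The report does not state which multiplier or quantile method was used, so the conventional 1.5 × IQR fences (Tukey's rule), the "inclusive" quantile method, the sample watch times, and the function name `iqr_outliers` are all assumptions for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, flagged) using k * IQR fences.

    k=1.5 is the conventional choice; it is assumed here, not taken
    from the report.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


# Hypothetical watch times in minutes.
time_watched = [20, 22, 25, 21, 23, 24, 22, 26, 90]
lower, upper, flagged = iqr_outliers(time_watched)
print(f"fences: [{lower}, {upper}], flagged: {flagged}")
```

The function only flags candidates. Whether a flagged value such as a 90-minute watch time or a tenth season is an error or a valid observation still has to be judged from context, as done above.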
I also performed one-hot encoding for the categorical variables in the dataset: 'Date', 'Day', 'Show', and 'Gender'. Although 'Date' and 'Day' may have no direct correlation with episode completion, I encoded all four columns to prepare the data for analysis. Encoding puts these variables in a numeric form that machine learning algorithms can use and facilitates further exploration of viewer behaviors and episode completion patterns.
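To illustrate what one-hot encoding does to a categorical column, here is a minimal standard-library sketch. The sample records, the `Column_Category` naming scheme, and the helper `one_hot` are hypothetical; the report does not show the code or library it used.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        # Copy every field except the one being encoded.
        row = {k: v for k, v in r.items() if k != column}
        for cat in categories:
            row[f"{column}_{cat}"] = 1 if r[column] == cat else 0
        encoded.append(row)
    return encoded


# Hypothetical viewing records.
views = [
    {"Show": "Friends", "Day": "Monday", "Completed": 1},
    {"Show": "Stranger Things", "Day": "Friday", "Completed": 0},
    {"Show": "Friends", "Day": "Friday", "Completed": 1},
]

encoded = views
for col in ("Show", "Day"):
    encoded = one_hot(encoded, col)

print(encoded[0])
```

Each distinct category becomes its own column, so a column with many distinct values, such as 'Date', produces one indicator column per date.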
Figure 5: Categorical Encoded Variables
Variable Selection
In selecting predictor variables to effectively predict whether someone would complete an episode, I conducted exploratory data analysis to identify meaningful correlations. Initially, a heatmap was generated to visualize the relationships between the target variable ('Completed') and other variables in the dataset. It was observed that there was no significant correlation between episode completion and variables such as 'Day', 'Date', 'User ID', and 'Gender'. However, strong positive correlations were identified between 'Completed' and 'Time Watched', and negative correlations between 'Completed' and both 'Season' and 'Show'.
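The values behind such a heatmap are pairwise correlation coefficients. The sketch below computes a Pearson coefficient between a 0/1 'Completed' flag and two numeric columns. The report does not say which coefficient its heatmap used, so Pearson is an assumption, as are the sample values and the function name `pearson`.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: 1 = episode completed, 0 = not completed.
completed = [1, 1, 1, 0, 0, 0]
time_watched = [45, 50, 40, 10, 15, 5]  # minutes
season = [1, 2, 1, 8, 10, 9]

r_time = pearson(time_watched, completed)
r_season = pearson(season, completed)
print(f"Time Watched vs Completed: {r_time:.2f}")
print(f"Season vs Completed: {r_season:.2f}")
```

For a nominal column such as 'Show', a coefficient of this kind depends on how the categories are mapped to numbers, so its sign is only meaningful relative to that mapping.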
Figure 6: Heatmap
Based on these findings, I selected the predictor variables that I believe will most effectively predict episode completion. Before discussing them, I would like to explain a new variable that I created to understand the trends better: completion rate. It is the proportion of viewers who watch an entire episode, calculated by dividing the number of viewers who completed an episode by the total number of viewers who started watching it. It is a metric commonly used to measure viewer engagement and satisfaction with content on streaming platforms.
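The completion rate defined above (completed views divided by started views) can be computed per group as in the sketch below. It assumes each record is one started viewing and that 'Completed' is stored as 0 or 1; the sample data and the helper name `completion_rate_by` are hypothetical.

```python
from collections import defaultdict


def completion_rate_by(records, key):
    """Return {group: completed views / started views} for each value of `key`."""
    started = defaultdict(int)
    completed = defaultdict(int)
    for r in records:
        started[r[key]] += 1
        completed[r[key]] += r["Completed"]  # assumed to be 0 or 1
    return {group: completed[group] / started[group] for group in started}


# Hypothetical viewing records; each row is one started viewing.
views = [
    {"Show": "Friends", "Episode": 1, "Completed": 1},
    {"Show": "Friends", "Episode": 1, "Completed": 0},
    {"Show": "Friends", "Episode": 2, "Completed": 0},
    {"Show": "Friends", "Episode": 2, "Completed": 1},
    {"Show": "Stranger Things", "Episode": 1, "Completed": 1},
    {"Show": "Stranger Things", "Episode": 1, "Completed": 1},
    {"Show": "Stranger Things", "Episode": 2, "Completed": 1},
    {"Show": "Stranger Things", "Episode": 2, "Completed": 0},
]

rates = completion_rate_by(views, "Show")
print(rates)
```

Passing a different key, for example `completion_rate_by(views, "Episode")`, gives the same metric grouped by episode number.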
1. Show: A bar plot of completion rates by show revealed variation across shows, suggesting a relationship between the specific show being watched and the likelihood of episode completion. Completion rates were highest for 'Stranger Things' and lowest for 'Friends', indicating potential differences in viewer engagement between shows. This finding underscores the importance of content preferences in viewer behavior: certain shows may inherently captivate audiences more than others. Understanding these differences can help streaming platforms tailor content recommendations and optimize viewer satisfaction.
Figure 7: Show vs Completion Rate
2. Season: Although I initially considered 'Season' as a potential predictor, further investigation showed that the highest season numbers come from 'Friends', which has 10 seasons, suggesting that season number alone may not accurately predict completion. Nevertheless, the negative correlation observed between 'Season' and 'Completed' suggests that the season of a show could still influence completion behavior, albeit indirectly through show selection. Understanding how season number affects viewer engagement requires considering factors such as storyline progression, character development, and viewer preferences over time.
Figure 8: Episode vs Completion Rate
3. Time Watched: Exploratory analysis using boxplots and bar plots revealed a clear positive relationship between time watched and the likelihood of episode completion. Longer watch times were associated with higher completion rates, indicating that viewers who invest more time in an episode are more likely to complete it. This makes 'Time Watched' a valuable predictor of episode completion. The relationship also offers insight into viewer engagement patterns, highlighting the importance of captivating content and viewer retention strategies.
Figure 9: Time Watched vs Completion Rate
Key Insights
1. Show Completion Rate: Completion rates vary considerably among shows, with 'Stranger Things' having the highest completion rate and 'Friends' the lowest. This suggests that the likelihood of episode completion depends heavily on the specific show being watched. Factors such as genre, storyline, episode length, and other show-specific attributes may contribute to these differences, underscoring the importance of understanding viewer preferences for tailored content recommendations.
2. Episode Number Correlation: Completion rates across episode numbers show a distinct pattern. Episodes 3 and 5 through 9 exhibit higher completion rates than the initial episodes and the finale. This trend implies that engagement evolves as viewers progress through a series. Lower completion rates for the initial episodes might indicate a screening phase by viewers, while the upward trend from episode 3 suggests increased engagement as viewers become more invested in the content. The decline in completion rate at episode 10 could indicate that viewer interest peaks before the finale or that the finale is a natural drop-off point.
3. Time Watched Correlation: The dataset shows a clear relationship between time watched and completion. Both the boxplot and bar plot analyses indicate that longer watch times go with higher completion rates. The median time watched for completed episodes exceeds that of incomplete ones, indicating that viewers who invest more time in an episode are more likely to complete it. The consistency in watch times among completed episodes further supports this, suggesting a strong relationship between viewer engagement and episode completion.
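The comparison of median watch times for completed and incomplete episodes, which is what the boxplot summarizes, can be sketched as below. The sample values and the helper name `median_by_completion` are hypothetical, and 'Completed' is assumed to be stored as 0 or 1.

```python
import statistics


def median_by_completion(records):
    """Return {0: median minutes for incomplete views, 1: median for completed views}."""
    groups = {0: [], 1: []}
    for r in records:
        groups[r["Completed"]].append(r["Time Watched"])
    return {status: statistics.median(times) for status, times in groups.items()}


# Hypothetical viewing records; Time Watched is in minutes.
views = [
    {"Time Watched": 42, "Completed": 1},
    {"Time Watched": 45, "Completed": 1},
    {"Time Watched": 50, "Completed": 1},
    {"Time Watched": 5, "Completed": 0},
    {"Time Watched": 12, "Completed": 0},
    {"Time Watched": 20, "Completed": 0},
]

medians = median_by_completion(views)
print(f"incomplete: {medians[0]} min, completed: {medians[1]} min")
```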
These findings help stakeholders understand what viewers like and how they watch shows. First, some shows are finished by more viewers than others, which suggests that some shows hold attention better. Second, completion varies by episode, which tells us that viewers' interest changes as they progress through a series. Third, the longer people watch, the more likely they are to finish. Together, these patterns help stakeholders anticipate viewer behavior and make better decisions about which shows to offer and how to keep viewers engaged.
Appendix
1. Figure 1: Percentage of Missing Data
2. Figure 2: Outliers for Episode
3. Figure 3: Outliers for Season
4. Figure 4: Outliers for Time Watched
5. Figure 5: Categorical Encoding of Variables
6. Figure 6: Heatmap
7. Figure 7: Completion Rate vs Show
8. Figure 8: Completion Rate vs Episode
9. Figure 9: Completion Rate vs Time Watched