Dataset Exploration Project Part 1

BDAT1005-23F Mathematics for Data Analytics
Submitted By: Darpan Aryal
Student ID: 200569576
Submitted To: Prof. Eshan Pourjavad
About Dataset:
The dataset used for this project was taken from Kaggle. It is a comprehensive list of the most well-known songs of 2023 as listed on Spotify. The dataset provides many features that are not generally present in similar datasets. It offers information about the qualities, appeal, and visibility of each song across different music outlets. Track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and numerous audio attributes are among the details included in the dataset (Kaggle, 2023). In line with the project requirements, it includes 24 variables and more than 1,000 records.

Assumptions of the Dataset:
It is critical to be aware of the assumptions that must be made about a dataset when working with it. The following assumptions apply to this music dataset.
1) Data Collection: The dataset assumes that the data were gathered correctly and fairly. It is important to understand the data's origin and any possible mistakes in the collection process. Knowing how and when the data were collected helps in evaluating their reliability and quality.
2) Sample vs. Population: The dataset appears to contain only a sample of all the music tracks currently in existence. Because it is frequently impossible to collect data on every member of a population, this assumption is common in data analysis.
3) Units of Measurement: It is critical to define and fully understand each variable's units of measurement. For instance, "streams" can refer to the overall number of streams on a music streaming site, "bpm" stands for beats per minute, and percentages such as "valence_%" and "danceability_%" should be measured on a scale of 0 to 100. Understanding the units ensures that the variables are interpreted accurately.
4) Data Integrity: The dataset assumes that the data are accurate and complete. However, data quality problems, including missing values or outliers, can occur. Data cleansing or imputation may be necessary to resolve them.
5) Categorical Variables: It is assumed that the categories of categorical variables, such as "key" and "mode," are clearly specified and follow accepted practice in music theory. It is critical to confirm that the categories reliably represent the recordings.
6) Time Assumptions: The dataset assumes that the release dates accurately reflect when the tracks became publicly accessible. However, there may be inconsistencies, such as pre-release promotion or re-releases of earlier songs.
7) Accuracy of Popularity Analytics: Popularity variables (such as "in_spotify_charts" and "streams") rely on the idea that these metrics accurately measure how well liked a track is. These measurements may nevertheless be influenced by marketing initiatives and outside circumstances, so they may not always reflect the caliber of the music.
8) Creators: For songs with multiple artists ("artist_count" > 1), it is assumed that the dataset adequately depicts the dynamics of the collaboration, including the role each artist played in the track. The type of collaboration may differ, which may affect the analysis.
9) Genre and Listener Groups: The dataset does not appear to contain explicit data on music genres or listener groups. Additional information may be needed for any analysis involving audience characteristics or genre preferences.
10) Sampling Errors: There may be sampling errors if the dataset was not created using random sampling. For instance, if it mostly consists of hit songs, it may not adequately reflect unknown or up-and-coming musicians.
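Assumptions 3 and 4 above can be checked mechanically before any analysis. The following is a minimal sketch, assuming the data are read as CSV text with the column names used in this report. The three sample rows and the helper name find_problems are hypothetical and only illustrate the idea; they are not values or code from the Kaggle file.

```python
import csv
import io

# Hypothetical three-row sample; the values are made up for illustration
# and are not taken from the Kaggle file.
SAMPLE_CSV = """track_name,streams,bpm,danceability_%,valence_%
Song A,1000,120,80,55
Song B,,95,130,40
Song C,2500,110,60,
"""

# Columns that should hold percentages on a 0 to 100 scale.
PERCENT_COLUMNS = ["danceability_%", "valence_%"]


def find_problems(rows):
    """Return (row_number, column, issue) tuples for missing or out-of-range values."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for column, value in row.items():
            if value is None or value.strip() == "":
                problems.append((number, column, "missing"))
            elif column in PERCENT_COLUMNS and not 0 <= int(value) <= 100:
                problems.append((number, column, "outside 0-100"))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = find_problems(rows)
for problem in problems:
    print(problem)
```

Rows flagged this way would then be candidates for the cleansing or imputation step mentioned under Data Integrity.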
Dataset Description:
The music dataset under review consists of a number of attributes connected to specific music tracks. This information was probably put together from a variety of sources, including music services such as Shazam, Apple Music, Deezer, and Spotify. The major variables present in the dataset are listed below:
track_name: The name or title of each music track.
artist(s)_name: The name(s) of the artist(s) who performed or contributed to each track.
artist_count: The number of artists involved in each track.
released_year: The year in which each track was released.
released_month: The month of release.
released_day: The day of the month of release.
in_spotify_playlists: The number of Spotify playlists in which each track is featured.
in_spotify_charts: The number of times each track appeared in Spotify charts.
streams: The total number of streams for each track.
in_apple_playlists: The number of Apple Music playlists in which each track is featured.
in_apple_charts: The number of times each track appeared in Apple Music charts.
in_deezer_playlists: The number of Deezer playlists in which each track is featured.
in_deezer_charts: The number of times each track appeared in Deezer charts.
in_shazam_charts: The number of times each track appeared in Shazam charts.
bpm: The beats per minute (tempo) of each track.
key: The musical key of each track.
mode: The musical mode (e.g., Major, Minor) of each track.
danceability_%: A percentage score describing how danceable each track is.
valence_%: A percentage score describing the valence (positivity) of each track.
energy_%: A percentage score describing how energetic each track is.
acousticness_%: A percentage score describing how acoustic each track is.
instrumentalness_%: A percentage score describing how instrumental each track is.
liveness_%: A percentage score describing how live each track sounds.
speechiness_%: A percentage score describing how much spoken content each track has.

Source:
The music streaming services that gather and distribute data about music tracks, such as information on the artists, track specifics, and streaming statistics, are most likely the source of this dataset. The data may have been filtered and aggregated for analytical purposes. The dataset was taken from Kaggle for exploration and data analysis.

Data Dictionary
A data dictionary gives names, definitions, and properties to the data elements used or recorded in a database, an information system, or a research study. In addition to offering guidance on interpretation, accepted meanings, and representation, it specifies the meanings and purposes of data elements
within the framework of a project. A data dictionary also provides information about the data elements themselves (UC Merced Library, 2023).
The dataset I chose to work on has 24 variables. The data dictionary below gives the statistical variable type, description, and range or limitations, as appropriate, for every variable:

| Variable Name | Variable Type | Description | Value Range/Limitation | Data Type | Units of Measurement | Data Source |
|---|---|---|---|---|---|---|
| track_name | Categorical | The name of each music track | Text | String | N/A | Spotify / Apple Music / Deezer / Shazam |
| artist(s)_name | Categorical | The names of the artists for the track | Text | String | N/A | Spotify / Apple Music / Deezer / Shazam |
| artist_count | Numeric (integer) | The artist count for the track | Positive integers | Integer | N/A | Spotify / Apple Music / Deezer / Shazam |
| released_year | Numeric (integer) | Release year of the track | Four-digit years (e.g., 2000-2023) | Integer | N/A | Spotify / Apple Music / Deezer / Shazam |
| released_month | Numeric (integer) | Release month of the track | Jan to Dec | Integer | N/A | Spotify / Apple Music / Deezer / Shazam |
| released_day | Numeric (integer) | Release day of the track | 1 to 31 | Integer | N/A | Spotify / Apple Music / Deezer / Shazam |
| in_spotify_playlists | Numeric (integer) | No. of Spotify playlists the track is in | Non-negative integers | Integer | N/A | Spotify |
| in_spotify_charts | Numeric (integer) | No. of times the track appeared in Spotify charts | Non-negative integers | Integer | N/A | Spotify |
| streams | Numeric (integer) | Total no. of streams of the track | Non-negative integers | Integer | N/A | Spotify / Apple Music / Deezer / Shazam |
| in_apple_playlists | Numeric (integer) | No. of Apple Music playlists the track is in | Non-negative integers | Integer | N/A | Apple |
| in_apple_charts | Numeric (integer) | No. of times the track appeared in Apple Music charts | Non-negative integers | Integer | N/A | Apple |
| in_deezer_playlists | Numeric (integer) | No. of Deezer playlists the track is in | Non-negative integers | Integer | N/A | Deezer |
| in_deezer_charts | Numeric (integer) | No. of times the track appeared in Deezer charts | Non-negative integers | Integer | N/A | Deezer |
| in_shazam_charts | Numeric (integer) | No. of times the track appeared in Shazam charts | Non-negative integers | Integer | N/A | Shazam |
| bpm | Numeric (integer) | Beats per minute of the track | Positive integers | Integer | bpm | Spotify / Apple Music / Deezer / Shazam |
| key | Categorical | Musical key of the track | Text (e.g., A, F#) | String | N/A | Spotify / Apple Music / Deezer / Shazam |
| mode | Categorical | Musical mode of the track | Text (e.g., Major, Minor) | String | N/A | Spotify / Apple Music / Deezer / Shazam |
| danceability_% | Numeric (integer) | Danceability score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| valence_% | Numeric (integer) | Valence score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| energy_% | Numeric (integer) | Energy score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| acousticness_% | Numeric (integer) | Acousticness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| instrumentalness_% | Numeric (integer) | Instrumentalness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| liveness_% | Numeric (integer) | Liveness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |
| speechiness_% | Numeric (integer) | Speechiness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam |

Some Data Analysis Excel charts:
The data are analyzed with different methods, and different variables are shown in the charts below:
Analyzing data by the % of total bpm by 'key'
[Chart: Percentage of 'bpm' by 'key'. Axes: key (C#, (blank), G, G#, F, D, A, B, F#, E, A#, D#) and percentage of total bpm (0% to 14%).]
Stream Insights
[Chart: 'streams' by 'released_year'. Axes: released_year (1930 to 2021) and streams (0 to 140,000,000,000). Chart note: 'streams' has outliers at 'released_year': 2022 and 2023.]
Streams by Key and Mode
[Chart: Streams by Key and Mode]
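The percentage charts in this section come from a simple aggregation: sum a numeric column within each category, then divide each subtotal by the grand total. The sketch below shows that calculation for streams by mode. The rows are made-up illustration values, not dataset values, and percent_of_total is a hypothetical helper name; the original charts were produced in Excel, not with this code.

```python
from collections import defaultdict

# Made-up rows for illustration only; not values from the dataset.
tracks = [
    {"mode": "Major", "streams": 600},
    {"mode": "Minor", "streams": 300},
    {"mode": "Major", "streams": 100},
]


def percent_of_total(rows, group_column, value_column):
    """Sum value_column per group and express each sum as a percentage of the grand total."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_column]] += row[value_column]
    grand_total = sum(totals.values())
    return {group: 100 * subtotal / grand_total for group, subtotal in totals.items()}


shares = percent_of_total(tracks, "mode", "streams")
print(shares)
```

Calling percent_of_total(tracks, "key", "bpm") on rows that carry those columns would give the same kind of breakdown as the bpm-by-key chart.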
In Apple playlists by released year and mode
[Chart: 'in_apple_playlists' by 'released_year' and 'mode'. Series: Major, Minor. Axes: released_year (1930 to 2021) and in_apple_playlists (0 to 12,000).]
Percentage of total streams of each mode
[Chart: Percentage of 'streams' by 'mode'. Axes: mode (Major, Minor) and percentage of streams (0% to 70%).]
Frequency of Apple charts
[Chart: Frequency of Apple charts]
In Spotify playlists and streams
[Chart: in_spotify_playlists (0 to 60,000) against streams (0 to 4,000,000,000). Chart note: in_spotify_playlists and streams appear highly correlated.]

Some examples of Manipulating Data:
Sorting by release year, from ascending to descending order.
Sorting by artist_count.
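The "highly correlated" note on the last chart and the two sorts can both be reproduced with a few lines of code. The sketch below computes a Pearson correlation coefficient by hand and sorts rows by release year. It assumes the rows are already loaded as dictionaries with numeric values. The three rows are made up for illustration, and pearson_r is a hypothetical helper, not a function used in the original Excel work.

```python
from math import sqrt

# Made-up values for illustration only; not taken from the dataset.
tracks = [
    {"track_name": "Song A", "released_year": 2021, "in_spotify_playlists": 10, "streams": 100},
    {"track_name": "Song B", "released_year": 2019, "in_spotify_playlists": 20, "streams": 210},
    {"track_name": "Song C", "released_year": 2023, "in_spotify_playlists": 30, "streams": 290},
]


def pearson_r(xs, ys):
    """Pearson correlation: summed co-deviation divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


r = pearson_r(
    [t["in_spotify_playlists"] for t in tracks],
    [t["streams"] for t in tracks],
)
print(round(r, 3))

# Sort by release year, ascending and then descending.
oldest_first = sorted(tracks, key=lambda t: t["released_year"])
newest_first = sorted(tracks, key=lambda t: t["released_year"], reverse=True)
print([t["track_name"] for t in oldest_first])
```

A value of r close to 1 indicates a strong positive linear relationship. It does not show that playlist placement causes streams, or the reverse.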
Motivation for Studying the Dataset:
There are several reasons for choosing this dataset:
1) Analysis of the music industry: The dataset can offer insight into trends and patterns in the music industry. It enables analysis of the elements that affect a song's popularity, including its musical qualities, the involvement of several performers, and its release date.
2) Listener Preferences: Analyzing the dataset helps in understanding listener preferences and how they relate to the musical qualities of tracks. Musicians, music companies, and streaming services can use this information as a reference when trying to appeal to their audiences.
3) Platform Comparison: Because the dataset includes data from several streaming services, it is possible to compare how tracks perform on each service and to spot potential differences in user behavior and popularity trends.
4) Music Production: By understanding the elements that contribute to a track's success, musicians and producers can adjust their creative processes as necessary.
In conclusion, this dataset gives both business experts and academics an opportunity to investigate the dynamics of the music industry, from artist partnerships and popularity patterns to the influence of musical qualities on listeners' preferences.

Research Questions:
Below are three research questions to be analyzed using the Spotify dataset. I designed these questions to explore relationships among the variables in the dataset and to form a logical analysis:
Research question 1: How do different musical qualities such as danceability, valence, and energy affect the popularity of a song as measured by the number of streams?
Research question 2: Does the popularity of music tracks change over time, and how does this tendency vary among different music streaming services (Spotify, Apple Music, Deezer, Shazam)?
Research question 3: Do songs with more artists (a greater artist_count) typically have different musical qualities (such as key, mode, or danceability) than songs with only one artist?

Considering the Data Needed:
It is crucial to consider whether additional or alternative data may be required to answer the research questions successfully. Here is my evaluation of each research question and of whether further information could be needed:
Research Question 1:
1) How do different musical qualities such as danceability, valence, and energy affect the popularity of a song as measured by the number of streams?
Additional Data Needed: Genre information for each track would be helpful. Because different musical genres may exhibit different patterns, this would allow a more extensive analysis of the relationship between musical qualities and popularity. User reviews or listener demographic data may help to explain why certain musical characteristics are more often linked to greater popularity.
Research Question 2:
2) Does the popularity of music tracks change over time, and how does this tendency vary among different music streaming services (Spotify, Apple Music, Deezer, Shazam)?
Additional Data Needed: To capture time-related trends more thoroughly, a longer historical dataset with more years of data would be advantageous. Information about how each platform's user base has changed over time may partly explain different popularity trends among streaming platforms.
Research Question 3:
3) Do songs with more artists (a greater artist_count) typically have different musical qualities (such as key, mode, or danceability) than songs with only one artist?
Additional Data Needed: More specific information on the roles and contributions of each artist in collaborative tracks would give a better understanding of how several artists affect musical characteristics. The analysis may also benefit from knowing the extent of the collaboration (e.g., how frequently the artists work together).
In general, additional data on genre, listener demographics, user reviews, historical patterns, and collaborative dynamics could improve the analysis and offer a deeper understanding of the research questions. Depending on the specific aims of the analysis, it may be necessary to gather or access such data in order to carry out a more extensive study.
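As a first pass at research question 3, solo and collaborative tracks can be compared on a single musical quality using only the columns already in the dataset. This is a minimal sketch, assuming that "collaboration" means artist_count greater than 1. The rows are made-up illustration values and mean_by_collaboration is a hypothetical helper name.

```python
from statistics import mean

# Made-up rows for illustration only; not values from the dataset.
tracks = [
    {"artist_count": 1, "danceability_%": 60},
    {"artist_count": 1, "danceability_%": 70},
    {"artist_count": 2, "danceability_%": 80},
    {"artist_count": 3, "danceability_%": 90},
]


def mean_by_collaboration(rows, column):
    """Average a numeric column separately for solo tracks and for collaborations."""
    solo = [row[column] for row in rows if row["artist_count"] == 1]
    collaboration = [row[column] for row in rows if row["artist_count"] > 1]
    return {"solo": mean(solo), "collaboration": mean(collaboration)}


comparison = mean_by_collaboration(tracks, "danceability_%")
print(comparison)
```

A gap between the two means is only descriptive. Whether it reflects a real difference would need a significance test and, as noted above, more detail on each artist's role.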
Tracking the Analysis and Findings of the Study:
This section begins tracking the analysis for the research questions stated earlier. For each question I record what I want to study, the data available, the question asked, and the assumptions made together with their reasons.
Research Question 1:
1) How do different musical qualities such as danceability, valence, and energy affect the popularity of a song as measured by the number of streams?
What I want to study: I want to investigate the connection between the danceability, valence, and energy of a song and its popularity as measured by streams.
Data available: For each track, I have information on its danceability, valence, energy, and number of streams.
Question asked: I want to know whether there is a link between musical qualities and popularity.
Assumptions: I hypothesize that the popularity of a track is influenced by these particular musical characteristics. I also assume that the number of streams is a trustworthy indicator of appeal. It is important to recognize that other elements, such as marketing or outside circumstances, may also affect popularity.
Research Question 2:
2) Does the popularity of music tracks change over time, and how does this tendency vary among different music streaming services (Spotify, Apple Music, Deezer, Shazam)?
What I want to study: I want to investigate trends in track popularity over time and whether these trends differ across streaming services.
Data available: I have information on each track's year of release, number of streams, and chart appearances on the various platforms.
Question asked: I want to see whether popularity shows trends over time and whether these trends differ between platforms.
Assumptions: I assume that the dataset contains a representative sample of songs from different years and that popularity is accurately reflected by the number of streams and chart appearances. It is important to remember that outside variables, such as marketing campaigns, can affect these patterns.
Research Question 3:
3) Do songs with more artists (a greater artist_count) typically have different musical qualities (such as key, mode, or danceability) than songs with only one artist?
What I want to study: I am interested in examining the relationship between a track's musical qualities, such as key, mode, and danceability, and the number of artists who contributed to it (artist_count).
Data available: For each track, I have information on the number of artists and on numerous musical qualities.
Question asked: I want to know whether collaborative recordings differ musically from solo tracks.
Assumptions: I assume that each artist's role and contribution to a collaborative track are appropriately reflected in the dataset. I also assume that the musical attributes are representative of the song's overall style. These qualities might be influenced by the dynamics and duration of the collaboration.
These initial ideas, data sources, research questions, and assumptions form the foundation for the subsequent analysis. To obtain useful insights, it is crucial to stay open to revising the assumptions as the data are explored. As the study develops, more information may be needed or the approach may need to be modified.
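For research question 2, one possible starting point is to average the chart-appearance columns per release year, so that the platforms can be compared side by side. This is a sketch under stated assumptions: the rows are made up for illustration, yearly_platform_means is a hypothetical helper, and averaging is only one of several reasonable summaries (medians would be less sensitive to outliers).

```python
from collections import defaultdict

# Chart-appearance columns named in the data dictionary.
PLATFORM_COLUMNS = [
    "in_spotify_charts",
    "in_apple_charts",
    "in_deezer_charts",
    "in_shazam_charts",
]

# Made-up rows for illustration only; not values from the dataset.
tracks = [
    {"released_year": 2022, "in_spotify_charts": 4, "in_apple_charts": 10,
     "in_deezer_charts": 0, "in_shazam_charts": 2},
    {"released_year": 2022, "in_spotify_charts": 6, "in_apple_charts": 20,
     "in_deezer_charts": 2, "in_shazam_charts": 0},
    {"released_year": 2023, "in_spotify_charts": 8, "in_apple_charts": 30,
     "in_deezer_charts": 1, "in_shazam_charts": 5},
]


def yearly_platform_means(rows):
    """Group rows by release year and average each platform's chart column."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["released_year"]].append(row)
    summary = {}
    for year in sorted(grouped):
        members = grouped[year]
        summary[year] = {
            column: sum(member[column] for member in members) / len(members)
            for column in PLATFORM_COLUMNS
        }
    return summary


yearly = yearly_platform_means(tracks)
for year, means in yearly.items():
    print(year, means)
```

Because the dataset is a snapshot of 2023 popularity, older release years have few tracks, so their averages should be read with caution.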
Conclusion
This is the first part of the project analyzing the dataset that I chose. In this part I selected a dataset that meets the minimum requirements and explained my motivation for choosing it. I developed a data dictionary and described each variable. I also developed research questions, stated the assumptions behind them, and identified what additional data could be needed for further analysis. Finally, I began tracking the analysis by recording, for each question, what I want to study, the data available, the question asked, and the assumptions made, so that the analysis is clear and effective. These are the basic tasks completed so far in the data analysis process.

References
Kaggle, 2023. Kaggle. [Online] Available at: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
UC Merced Library, 2023. UC Merced Library. [Online] Available at: https://library.ucmerced.edu/data-dictionaries [Accessed 28 09 2023].