Dataset Exploration Project Part 1
BDAT1005-23F
Mathematics for Data Analytics
Submitted By: Darpan Aryal
Student ID: 200569576
Submitted To: Prof. Eshan Pourjavad
About Dataset:
The dataset used for this project was taken from Kaggle. The most well-known songs of
2023 are fully listed in this dataset, according to Spotify. The dataset provides a
multitude of features that are not generally present in datasets of a similar nature. It
offers information about the qualities, appeal, and visibility of each song across different
music outlets. Track name, artist(s) name, release date, Spotify playlists and charts,
streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and
numerous audio attributes are among the details included in the dataset (Kaggle, 2023).
In line with the project requirements, the dataset includes 24 different variables and
more than 1,000 records.
Assumptions of the Dataset:
It's critical to be aware of any possible assumptions that must be made about a dataset
when dealing with it. The following presumptions are applied to the provided music
dataset.
1) Data collection: The dataset is assumed to have been gathered correctly and fairly. Understanding the data's origin and any possible mistakes in the collection process is important, because knowing how and when the data were collected makes it possible to evaluate their reliability and quality.
2) Sample vs. Population: The dataset appears to contain only a sample of all the music tracks currently in existence. Because it is frequently impossible to collect data on every member of a population, this assumption is common in data analysis.
3) Units of Measurement: It is critical to define and fully understand each variable's units of measurement. For instance, "streams" can refer to the overall number of streams on a streaming music service, "bpm" stands for beats per minute, and percentages like "valence_%" and "danceability_%" should be measured on a scale of 0 to 100. Understanding the units ensures that the variables are interpreted accurately.
4) Data Integrity: The data are assumed to be accurate and complete. However, data quality problems, including missing values or outliers, can occur. Data cleansing or imputation may be necessary to resolve them.
5) Categorical Variables: The categories of categorical variables, such as "key" and "mode," are assumed to be clearly specified and to follow accepted practices in music theory. It is critical to confirm that the categories reliably represent the musical recordings.
6) Time Assumptions: The release dates are assumed to reflect accurately when the tracks became publicly accessible. However, there might be inconsistencies, such as pre-release advertising or re-releases of earlier songs.
7) Accuracy of Popularity Analytics: Variables linked to popularity (such as "in_spotify_charts" and "streams") rely on the idea that these metrics accurately measure how well-liked a track is. These measurements, however, might be influenced by factors such as marketing initiatives and outside circumstances, so they might not always reflect the quality of the music.
8) Creators: For songs with multiple artists ("artist_count" > 1), the dataset is assumed to depict the dynamics of the collaboration adequately, including the role each artist played in the track. The type of collaboration may differ, which may affect the analysis.
9) Genre and Listener Groups: The dataset does not appear to contain explicit data on music genres or listener groups. Additional information might be needed for any analysis involving audience characteristics or genre preferences.
10) Sampling Errors: There may be sampling errors if the dataset was not created using random sampling. For instance, if it mostly consists of hit songs, it might not adequately represent unknown or up-and-coming musicians.
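Some of these assumptions, notably units of measurement and data integrity, can be checked directly once the data are loaded. The following is a minimal Python sketch and not part of the original Excel-based analysis. The sample rows and the helper names `count_missing` and `out_of_range` are hypothetical; only the column names come from the dataset.

```python
# Hypothetical sample rows; a real check would read them from the Kaggle CSV.
sample_rows = [
    {"track_name": "Song A", "key": "C#", "bpm": "120", "danceability_%": "80"},
    {"track_name": "Song B", "key": "", "bpm": "95", "danceability_%": "55"},
    {"track_name": "Song C", "key": "G", "bpm": "140", "danceability_%": "130"},
]


def count_missing(rows, column):
    """Count rows where the column is absent or blank (data integrity check)."""
    return sum(1 for row in rows if not str(row.get(column, "")).strip())


def out_of_range(rows, column, low=0, high=100):
    """Return track names whose value lies outside [low, high] (units check)."""
    flagged = []
    for row in rows:
        value = float(row[column])
        if value < low or value > high:
            flagged.append(row["track_name"])
    return flagged


missing_keys = count_missing(sample_rows, "key")
bad_percentages = out_of_range(sample_rows, "danceability_%")
print(missing_keys, bad_percentages)
```

In this made-up sample, one track has no key and one track has a danceability value above 100, so both would need cleaning before analysis.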
Dataset Description:
The music-related data set under review consists of a number of different attributes
connected to specific music tracks. This information was probably put together from a
variety of sources, including music streaming services like Shazam, Apple Music,
Deezer, and Spotify. A list of the major characteristics present in the dataset is provided
below:
track_name: The name or title of each music track.
artist(s)_name: The name(s) of the artist(s) who performed or contributed to each track.
artist_count: The number of artists involved in each track.
released_year: The year in which each track was released.
released_month: The month of release.
released_day: The day of the month of release.
in_spotify_playlists: The number of Spotify playlists in which each track is featured.
in_spotify_charts: The number of times each track appeared in Spotify charts.
streams: The total number of streams for each track.
in_apple_playlists: The number of Apple Music playlists in which each track is featured.
in_apple_charts: The number of times each track appeared in Apple Music charts.
in_deezer_playlists: The number of Deezer playlists in which each track is featured.
in_deezer_charts: The number of times each track appeared in Deezer charts.
in_shazam_charts: The number of times each track appeared in Shazam charts.
bpm: The beats per minute (tempo) of each track.
key: The musical key of each track.
mode: The musical mode (e.g., Major, Minor) of each track.
danceability_%: A percentage score describing how danceable each track is.
valence_%: A percentage score describing the valence (positivity) of each track.
energy_%: A percentage score describing how energetic each track is.
acousticness_%: A percentage score describing how acoustic each track is.
instrumentalness_%: A percentage score describing how instrumental each track is.
liveness_%: A percentage score describing how live each track sounds.
speechiness_%: A percentage score describing how much spoken content each track contains.
Source:
The various music streaming services that gather and distribute data about
music tracks, such as information on the artists, track specifics, and streaming statistics,
are most likely the source of this dataset. It's possible that the data was filtered and
aggregated for analytical purposes. The dataset was downloaded from Kaggle for
exploration and data analysis.
Data Dictionary
A data dictionary gives names, definitions, and attributes to the data elements used or
recorded in a database, information system, or research study. It specifies the meanings
and purposes of data elements within the context of a project and offers guidance on
their interpretation, accepted meanings, and representation (UC Merced Library, 2023).
The dataset I chose to work on has 24 different variables. The data dictionary below
lists the statistical variable type, description, range, and limitations, as appropriate,
for every variable:
Variable Name | Variable Type | Description | Value Range/Limitation | Data Type | Units of Measurement | Data Source
--- | --- | --- | --- | --- | --- | ---
track_name | Categorical | The name of each music track | Text | String | N/A | Spotify / Apple Music / Deezer / Shazam
artist(s)_name | Categorical | The names of the artists for the track | Text | String | N/A | Spotify / Apple Music / Deezer / Shazam
artist_count | Numeric (integer) | The artist count for the track | Positive integers | Integer | N/A | Spotify / Apple Music / Deezer / Shazam
released_year | Numeric (integer) | Release year of the track | Four-digit years (e.g., 2000-2023) | Integer | N/A | Spotify / Apple Music / Deezer / Shazam
released_month | Numeric (integer) | Release month of the track | Jan to Dec | Integer | N/A | Spotify / Apple Music / Deezer / Shazam
released_day | Numeric (integer) | Release day of the track | 1 to 31 | Integer | N/A | Spotify / Apple Music / Deezer / Shazam
in_spotify_playlists | Numeric (integer) | No. of Spotify playlists the track is in | Non-negative integers | Integer | N/A | Spotify
in_spotify_charts | Numeric (integer) | No. of times the track appeared in Spotify charts | Non-negative integers | Integer | N/A | Spotify
streams | Numeric (integer) | Total no. of streams of the track | Non-negative integers | Integer | N/A | Spotify / Apple Music / Deezer / Shazam
in_apple_playlists | Numeric (integer) | No. of Apple playlists the track is in | Non-negative integers | Integer | N/A | Apple
in_apple_charts | Numeric (integer) | No. of times the track appeared in Apple charts | Non-negative integers | Integer | N/A | Apple
in_deezer_playlists | Numeric (integer) | No. of Deezer playlists the track is in | Non-negative integers | Integer | N/A | Deezer
in_deezer_charts | Numeric (integer) | No. of times the track appeared in Deezer charts | Non-negative integers | Integer | N/A | Deezer
in_shazam_charts | Numeric (integer) | No. of times the track appeared in Shazam charts | Non-negative integers | Integer | N/A | Shazam
bpm | Numeric (integer) | Beats per minute of the track | Positive integers | Integer | bpm | Spotify / Apple Music / Deezer / Shazam
key | Categorical | Musical key of the track | Text (e.g., A, F#) | String | N/A | Spotify / Apple Music / Deezer / Shazam
mode | Categorical | Musical mode of the track | Text (e.g., Major, Minor) | String | N/A | Spotify / Apple Music / Deezer / Shazam
danceability_% | Numeric (integer) | Danceability score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
valence_% | Numeric (integer) | Valence score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
energy_% | Numeric (integer) | Energy score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
acousticness_% | Numeric (integer) | Acousticness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
instrumentalness_% | Numeric (integer) | Instrumentalness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
liveness_% | Numeric (integer) | Liveness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
speechiness_% | Numeric (integer) | Speechiness score of the track (%) | Percent (0-100) | Integer | % | Spotify / Apple Music / Deezer / Shazam
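The variable type and value range columns of a data dictionary can be drafted automatically from the raw values. The sketch below is an illustration in Python, not the method used to build the dictionary in this report. The helper `summarize_column` and the sample rows are hypothetical, and it assumes every value arrives as text, as it would from a CSV file.

```python
# Hypothetical rows with string values, as a CSV reader would produce them.
sample_rows = [
    {"key": "C#", "mode": "Major", "bpm": "120"},
    {"key": "", "mode": "Minor", "bpm": "95"},
    {"key": "G", "mode": "Major", "bpm": "140"},
]


def summarize_column(rows, column):
    """Return (variable_type, value_range) for one column of text values."""
    values = [row[column] for row in rows if row[column] != ""]
    if not values:
        return ("unknown", "no data")
    try:
        numbers = [int(value) for value in values]
    except ValueError:
        # Any non-numeric text makes the column categorical.
        return ("categorical", f"{len(set(values))} distinct values")
    return ("numeric (integer)", f"{min(numbers)} to {max(numbers)}")


for name in ("key", "mode", "bpm"):
    print(name, summarize_column(sample_rows, name))
```

Blank cells are skipped rather than counted as a category, so a column with missing keys is still summarized from the values that are present.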
Some Data Analysis Excel Charts:
The data were analyzed in several ways. The charts below show the results for different
variables:
Analyzing data by the % of total bpm by ‘key’
[Chart: Percentage of 'bpm' by 'key'. Categories: C#, (blank), G, G#, F, D, A, B, F#, E, A#, D#. Value axis: share of total bpm, 0.00% to 14.00%.]
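The share-of-total calculation behind this chart can be sketched in Python. This is a hedged illustration rather than the Excel pivot used in the report. The helper `share_of_total` and the sample rows are hypothetical, and labelling missing keys as "(blank)" is an assumption made to mirror the category shown in the chart.

```python
from collections import defaultdict

# Hypothetical sample rows; only the column names come from the dataset.
sample_rows = [
    {"key": "C#", "bpm": 100},
    {"key": "C#", "bpm": 100},
    {"key": "G", "bpm": 150},
    {"key": "", "bpm": 50},
]


def share_of_total(rows, group_field, value_field):
    """Return each group's percentage of the summed value field."""
    totals = defaultdict(float)
    for row in rows:
        group = row[group_field] or "(blank)"  # missing keys get their own group
        totals[group] += float(row[value_field])
    grand_total = sum(totals.values())
    return {group: 100 * total / grand_total for group, total in totals.items()}


bpm_share_by_key = share_of_total(sample_rows, "key", "bpm")
print(bpm_share_by_key)
```

The same helper could be pointed at 'mode' and 'streams' to compute the percentage of streams by mode.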
Stream Insights
[Chart: 'streams' by 'released_year'. Horizontal axis: released_year, labelled from 1930 to 2021. Vertical axis: streams, 0 to 140,000,000,000. Chart note: 'streams' has outliers at 'released_year': 2022 and 2023.]
Streams by Key and Mode
In Apple playlists by released year and mode
[Chart: 'in_apple_playlists' by 'released_year' and 'mode'. Series: Major, Minor. Horizontal axis: released_year, labelled from 1930 to 2021. Vertical axis: in_apple_playlists, 0 to 12,000.]
Percentage of total streams of each mode
[Chart: Percentage of 'streams' by 'mode'. Categories: Major, Minor. Value axis: share of total streams, 0.00% to 70.00%.]
Frequency of apple charts
In Spotify playlists and streams
[Chart: 'in_spotify_playlists' against 'streams'. in_spotify_playlists axis: 0 to 60,000. streams axis: 0 to 4,000,000,000. Chart note: in_spotify_playlists and streams appear highly correlated.]
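The strength of the relationship noted in this chart can be quantified with the Pearson correlation coefficient. The sketch below uses only the Python standard library. The helper `pearson` and the four data points are hypothetical and are not values from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up values for in_spotify_playlists and streams.
playlists = [1000, 5000, 20000, 40000]
streams = [150_000_000, 600_000_000, 1_900_000_000, 3_500_000_000]

r = pearson(playlists, streams)
print(round(r, 3))
```

A coefficient close to 1 indicates a strong positive linear relationship. It does not by itself show that playlist placement causes streams.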
Some examples of Manipulating Data:
Sorting by released_year in ascending or descending order.
Sorting by artist_count.
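The same sorts can be reproduced outside Excel. The following Python sketch uses the built-in `sorted` function on a few hypothetical tracks; the track names and values are made up for illustration.

```python
# Hypothetical tracks; only the column names come from the dataset.
tracks = [
    {"track_name": "Song A", "released_year": 2019, "artist_count": 2},
    {"track_name": "Song B", "released_year": 2023, "artist_count": 1},
    {"track_name": "Song C", "released_year": 1995, "artist_count": 3},
]

# Ascending sort by release year (oldest track first).
oldest_first = sorted(tracks, key=lambda track: track["released_year"])

# Descending sort by release year (newest track first).
newest_first = sorted(tracks, key=lambda track: track["released_year"], reverse=True)

# Descending sort by number of artists (largest collaborations first).
most_artists_first = sorted(tracks, key=lambda track: track["artist_count"], reverse=True)

print([track["track_name"] for track in oldest_first])
print([track["track_name"] for track in newest_first])
print([track["track_name"] for track in most_artists_first])
```

`sorted` returns a new list and leaves `tracks` unchanged, which is similar to working on a copy of the worksheet.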
Motivation for Studying the dataset:
There are several reasons for choosing this dataset:
1) Analysis of the music industry: The data I chose might offer insight into trends and patterns in the music industry. It enables the analysis of factors that affect how popular a song is, including a song's qualities, the involvement of several performers, and the date of release.
2) Listener Preferences: Analyzing the dataset can reveal listener preferences and how they relate to the musical qualities of tracks. Musicians, music companies, and streaming services can use this information as a reference when trying to appeal to their fans.
3) Platform Comparison: By taking into account the data from various streaming services, it is possible to compare how tracks perform on each service and to spot potential differences in user behavior and popularity trends.
4) Music Production: By understanding the elements that contribute to a track's success, musicians and producers can modify their creative processes as necessary.
In conclusion, this dataset presents potential for both business experts and
academics to investigate the dynamics of the music industry, from artist partnerships
and popularity patterns to the influence of musical qualities on listeners' preferences.
Research Questions:
The following three research questions are analyzed using the Spotify dataset. I have
designed these questions to explore relationships among the variables in the dataset and
to support a logical analysis:
Research question 1:
How do different musical qualities like danceability, valence, and energy affect
the popularity of the song as measured by the number of streams?
Research question 2:
Does the popularity of music tracks change over time, and how does this
tendency vary amongst different music streaming services (Spotify, Apple Music,
Deezer, Shazam)?
Research question 3:
Do songs with more artists (a greater artist_count) typically have different
musical qualities (like key, mode, or danceability) than songs with only one artist?
Considering the Data Needed
It is crucial to consider whether additional or alternative data may be required in order
to answer the research questions successfully. Here is my evaluation of each research
question and whether further information could be needed:
Research Question 1:
How do different musical qualities like danceability, valence, and energy affect
the popularity of the song as measured by the number of streams?
Additional Data Needed:
Genre information for each track would be helpful. As different musical genres may
exhibit different patterns, this would allow for a more extensive analysis of the
relationship between musical qualities and popularity. User reviews or listener
demographic data may help to explain why certain musical characteristics are more often
linked to greater popularity.
Research Question 2:
Does the popularity of music tracks change over time, and how does this
tendency vary amongst different music streaming services (Spotify, Apple Music,
Deezer, Shazam)?
Additional Data Needed:
For capturing more thorough time-related trends, a longer historical dataset with more
years of data would be advantageous. Different popularity trends among streaming
platforms may be explained in part by information about how each platform's user base
has changed over time.
Research Question 3:
Do songs with more artists (a greater artist_count) typically have different
musical qualities (like key, mode, or danceability) than songs with only one artist?
Additional Data Needed:
A greater understanding of how several artists affect musical characteristics might be
possible with more specific information on the roles and contributions of each artist in
collaborative tracks. Analysis of this relationship may also benefit from knowing the
history of the collaboration (e.g., how frequently the artists work together).
In general, adding more data on genre, listener demographics, user reviews,
historical patterns, and collaborative dynamics could improve the analysis and offer a
deeper understanding of the research questions. To carry out a more extensive study, it
may be necessary to gather or access such data, depending on the specific aims of the
analysis.
Tracking the analysis and findings of the study:
This section begins tracking the analysis for the research questions listed above. For
each question, I note what I want to study, the data available, the question I am asking,
and the assumptions made along with their reasons.
Research Question 1:
How do different musical qualities like danceability, valence, and energy affect
the popularity of the song as measured by the number of streams?
What I want to study: I want to investigate the connection between the danceability,
valence, and energy of a song and its popularity as measured by streams.
Data available: For each musical track, I have information on its danceability, valence,
energy, and number of streams.
Questions made: I want to know if there is a link between musical qualities and
popularity.
Assumptions: I hypothesize that the popularity of a track is influenced by these
particular musical characteristics. I also assume that the number of streams is a
trustworthy indicator of appeal. It is crucial to understand that other elements, such as
marketing or outside circumstances, may also affect popularity.
Research Question 2:
Does the popularity of music tracks change over time, and how does this
tendency vary amongst different music streaming services (Spotify, Apple Music,
Deezer, Shazam)?
What I want to study: I wish to investigate trends in music track popularity over time
and whether these trends differ across streaming services.
Data available: I have information on the tracks' year of release, number of streams,
and appearances in the charts of the various platforms.
Questions made: I want to see whether popularity shows trends over time and whether
platform differences affect these trends.
Assumptions: The assumption I made is that the dataset contains a representative
sample of songs from different years and that popularity is accurately reflected by the
number of streams and chart appearances. It is crucial to remember that outside
variables, such as marketing campaigns, can affect these patterns.
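One way to start on this question is to average the chart appearances per release year for each platform and compare the resulting series. The sketch below is a Python illustration under stated assumptions, not the analysis carried out in this report. The helper `mean_chart_counts_by_year` and the sample rows are hypothetical; the four column names are taken from the data dictionary.

```python
from collections import defaultdict

PLATFORM_COLUMNS = [
    "in_spotify_charts",
    "in_apple_charts",
    "in_deezer_charts",
    "in_shazam_charts",
]

# Hypothetical rows; the values are made up for illustration.
sample_rows = [
    {"released_year": 2022, "in_spotify_charts": 10, "in_apple_charts": 40,
     "in_deezer_charts": 2, "in_shazam_charts": 0},
    {"released_year": 2022, "in_spotify_charts": 20, "in_apple_charts": 60,
     "in_deezer_charts": 4, "in_shazam_charts": 6},
    {"released_year": 2023, "in_spotify_charts": 50, "in_apple_charts": 100,
     "in_deezer_charts": 8, "in_shazam_charts": 12},
]


def mean_chart_counts_by_year(rows):
    """Average each platform's chart column over the tracks of each release year."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        year = row["released_year"]
        counts[year] += 1
        for column in PLATFORM_COLUMNS:
            sums[year][column] += row[column]
    return {
        year: {column: sums[year][column] / counts[year] for column in PLATFORM_COLUMNS}
        for year in sorted(counts)
    }


yearly_means = mean_chart_counts_by_year(sample_rows)
for year, means in yearly_means.items():
    print(year, means)
```

Averages are used instead of totals so that years with many tracks in the dataset do not dominate the comparison.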
Research Question 3:
Do songs with more artists (a greater artist_count) typically have different
musical qualities (like key, mode, or danceability) than songs with only one artist?
What I want to study: I am interested in examining the relationship between a track's
musical qualities, such as key, mode, and danceability, and the number of artists that
contributed to it (artist_count).
Data available: For each track, I have information on the number of artists and
numerous musical qualities.
Questions made: I want to know whether collaborative recordings differ musically from
solo tracks.
Assumptions: I assume that each artist's roles and contributions to collaborative tracks
are appropriately reflected in the dataset. Additionally, I presume that the musical
attributes are typical of the song's overall style. These qualities might be influenced by
collaborative dynamics and the duration of the collaboration.
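A simple first look at this question is to split the tracks into solo and collaborative groups and compare the mean of a numeric attribute. The following Python sketch assumes both groups are non-empty. The helper `compare_solo_and_collab` and the sample rows are hypothetical.

```python
from statistics import mean

# Hypothetical rows; only the column names come from the dataset.
sample_rows = [
    {"artist_count": 1, "danceability_%": 60},
    {"artist_count": 1, "danceability_%": 70},
    {"artist_count": 2, "danceability_%": 80},
    {"artist_count": 3, "danceability_%": 90},
]


def compare_solo_and_collab(rows, feature):
    """Return the mean of a numeric feature for solo and collaborative tracks."""
    solo = [row[feature] for row in rows if row["artist_count"] == 1]
    collab = [row[feature] for row in rows if row["artist_count"] > 1]
    # mean() raises StatisticsError on an empty list, so both groups must exist.
    return {"solo": mean(solo), "collaboration": mean(collab)}


result = compare_solo_and_collab(sample_rows, "danceability_%")
print(result)
```

Categorical attributes such as key and mode would need a different comparison, for example the share of each category within each group.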
These initial ideas, information sources, research questions, and assumptions form the
foundation for subsequent analysis. To get useful insights, it is crucial to keep an open
mind about revising these assumptions and investigating the facts. Additionally, as the
study develops, more information may be needed, or the strategy may need to be
modified.
Conclusion
This is the first part of the project on analyzing the dataset that I chose. In this part,
I collected a dataset that meets the minimum requirements and explained the motivation
for choosing it. I also developed a data dictionary describing each variable, formulated
research questions, stated the assumptions behind them, and identified what additional
data could be needed for further analysis. Finally, I started tracking the analysis by
noting what I want to study, the data available, the questions I want answered, and the
assumptions that keep the analysis clear and effective. These are the basic tasks
completed so far in the data analysis process.
References
Kaggle, 2023. Kaggle. [Online]
Available at: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
UC Merced Library, 2023. UC Merced Library. [Online]
Available at: https://library.ucmerced.edu/data-dictionaries
[Accessed 28 09 2023].