By Cheng-Chun Lee, Sanadhi Sutandi, Skander Hajri.
Background
Music has become a fundamental part of our daily activities. Often without noticing, we listen to music anytime and anywhere, e.g. while cooking, sitting in Rolex (in either the silent area or the cafeteria), coding a project, cycling to Vevey, and so on.
- According to Spin, the average American listens to 4 hours of music each day.
A famous violinist once said: “Music transcends words. By exchanging notes, you get to know one another, to understand one another. As if your souls were connected and your hearts were overlapping. It's a conversation through instruments. A miracle that creates harmony. In that moment, music transcends words.” -K.
It is widely known that music with lyrics, better known as a “song”, is a favorite way to express human emotions. The song is one of the greatest creations of humankind, and it has grown into a whole music industry.
It is exciting for us to explore which factors influence a song's popularity the most. Thus we present the analysis of songs’ popularity as our final project for the ADA course!
Datasets
Throughout this project, we mainly use the Million Song Dataset, a collection of audio features and metadata for popular and unpopular songs.
We also use two additional datasets:
- The musiXmatch Dataset: contains lyrics.
- The Echo Nest Taste Profile Subset: contains profiles of real users with their play counts.
Important fields of the Million Song Dataset:
- track_id: the primary identifier for all songs in the dataset.
- song_hotttnesss: the popularity of a song, measured as a value between 0 and 1.
Observing Songs' Popularity
Important Features of Popular Songs
Using a correlation matrix, we can quickly see which features influence a song's popularity. Compared with the other features, artist_familiarity, artist_hotttnesss, and year are more strongly correlated with song_hotttnesss.
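As a rough illustration of the correlation step, the sketch below computes the Pearson correlation of each feature with song_hotttnesss using only the Python standard library. The list-of-dicts layout, the helper names, and the toy values are assumptions made for this example. They are not the project's actual code or data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column (zero variance) would raise ZeroDivisionError.
    return cov / sqrt(var_x * var_y)


def correlations_with_target(rows, target):
    """Correlate every column with the target column.

    `rows` is a list of dicts sharing the same keys (hypothetical layout).
    """
    target_values = [row[target] for row in rows]
    result = {}
    for column in rows[0]:
        if column == target:
            continue
        values = [row[column] for row in rows]
        result[column] = pearson(values, target_values)
    return result


# Made-up toy rows, NOT real Million Song Dataset values.
rows = [
    {"artist_familiarity": 0.90, "artist_hotttnesss": 0.80, "year": 2009, "song_hotttnesss": 0.85},
    {"artist_familiarity": 0.50, "artist_hotttnesss": 0.40, "year": 2001, "song_hotttnesss": 0.40},
    {"artist_familiarity": 0.70, "artist_hotttnesss": 0.60, "year": 2005, "song_hotttnesss": 0.55},
    {"artist_familiarity": 0.30, "artist_hotttnesss": 0.35, "year": 1998, "song_hotttnesss": 0.30},
]

# Rank features by the absolute strength of their correlation.
ranked = sorted(
    correlations_with_target(rows, "song_hotttnesss").items(),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Ranking by absolute value keeps strongly negative correlations visible alongside strongly positive ones.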
To obtain more accurate results, we use a random forest classifier to predict whether a song is popular or unpopular. The random forest reaches a high accuracy of 97.52%, and we then inspect its feature_importances attribute to see which features matter the most. The two most important features are:
- artist_hotttnesss: the popularity of an artist (usually short-term).
- artist_familiarity: an indication of how well known an artist is (usually longer-term).
More into Exploratory Data Analysis
Artist and Release Album
Let’s compare how often popular and unpopular songs come from the same artist and the same release album, in order to check the previous results.
We infer that, on average, at least 4 popular songs come from the same artist, and there is a small but clear distinction between popular and unpopular songs. On average, at least two popular songs in the list come from the same release album.
This strengthens our previous analysis that the artist (artist_hotttnesss and artist_familiarity) correlates significantly with song_hotttnesss.
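One simple way to obtain such per-artist and per-album averages is to count how many songs share each artist or release and then average the counts. The sketch below assumes the songs have already been split into popular and unpopular lists. The field names `artist` and `release` and the helper name are hypothetical.

```python
from collections import Counter
from statistics import mean


def mean_songs_per_key(songs, key):
    """Average number of songs sharing the same value of `key`."""
    counts = Counter(song[key] for song in songs)
    return mean(counts.values())


# Made-up toy list of popular songs (illustrative only).
popular = [
    {"artist": "A", "release": "A1"},
    {"artist": "A", "release": "A1"},
    {"artist": "A", "release": "A2"},
    {"artist": "B", "release": "B1"},
]

print(mean_songs_per_key(popular, "artist"))
print(mean_songs_per_key(popular, "release"))
```

Running the same function on the unpopular list gives the comparison between the two groups.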
Location of Popular and Unpopular Songs Across the World
Popular = Blue, Unpopular = Green
Both popular and unpopular songs mostly come from either the United States (mostly the eastern part) or the European Union (mostly England). In general, songs from non-English-speaking countries tend to be unpopular. It is quite possible that audiences around the world prefer to listen to songs in English.
Popular Genres over Time
Rock was the favorite genre for audiences from 2001 to 2009.
However, in 2010, pop became the most popular genre. This indicates that music popularity is not stable and can change as time goes by.
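A per-year ranking like this can be sketched by counting popular songs per genre within each year and keeping the most frequent one. The text does not state how genre popularity was measured, so counting songs is an assumption here, as are the `year` and `genre` field names.

```python
from collections import Counter, defaultdict


def top_genre_per_year(songs):
    """Return {year: most frequent genre among that year's songs}."""
    by_year = defaultdict(Counter)
    for song in songs:
        by_year[song["year"]][song["genre"]] += 1
    return {
        year: counts.most_common(1)[0][0]
        for year, counts in sorted(by_year.items())
    }


# Made-up toy data (illustrative only).
songs = [
    {"year": 2009, "genre": "rock"},
    {"year": 2009, "genre": "rock"},
    {"year": 2009, "genre": "pop"},
    {"year": 2010, "genre": "pop"},
]

print(top_genre_per_year(songs))
```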
Herding Bias in Songs
Have you taken a close look at your playlist? Do you notice that several songs from your playlist are actually from certain artists?
We define this phenomenon as herding bias. We suspect it exists because once an artist makes a positive impression on users, they are more willing to listen to, and more likely to love, that artist's songs. To measure the degree of herding bias, we use the following formula:
Consider a user with M play records, where p_m is the play_count of the m-th record and s_m is the singer of the m-th record. I_m is equal to 1 when s_m appears more than once among the M records, and 0 otherwise (appears only once).
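The formula itself is not reproduced in the text. One reading that is consistent with the definitions above is the play-count-weighted share of records whose singer is repeated: herding bias = sum(p_m * I_m) / sum(p_m), taken over the M records. The sketch below implements that reading. Both the formula and the `(singer, play_count)` record layout are assumptions.

```python
from collections import Counter


def herding_bias(records):
    """Play-count-weighted share of records whose singer appears more than once.

    `records` is a list of (singer, play_count) tuples for one user,
    one tuple per song (assumed layout).
    """
    total_plays = sum(play_count for _, play_count in records)
    if total_plays == 0:
        return 0.0
    singer_counts = Counter(singer for singer, _ in records)
    # I_m = 1 exactly when the singer of record m occurs more than once.
    repeated_plays = sum(
        play_count
        for singer, play_count in records
        if singer_counts[singer] > 1
    )
    return repeated_plays / total_plays


# Made-up toy playlist: singer "A" appears twice, "B" and "C" once each.
playlist = [("A", 3), ("A", 1), ("B", 4), ("C", 2)]
print(herding_bias(playlist))
```

Under this reading, a value of 0 means every singer appears exactly once in the user's records, and a value of 1 means every play goes to a singer who appears more than once.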
We analyze the playlists of 1022 users and get the following distribution. (To keep the histogram from being misleading, we set bins = 50 for a higher resolution.)
We find that 160 of the 1022 users (only 16%) show no herding bias, and the median herding bias is 0.38. This means users commonly devote at least 38% of their listening to songs from certain artists. Is that a good thing or a bad thing? That is subjective, but if you love to try new stuff, don’t let herding bias constrain you!
Tendency of Hearing Singers’ Voice, not the Songs
Are songs from popular artists usually popular? We collect 25 popular and 20 unpopular artists from 2010 and analyze the hotness of their songs. Surprisingly, it differs a lot!
This phenomenon resembles “the rich get richer”: the more connections (popularity) you gain, the more likely your songs are to be popular. Now, let’s observe the clickthrough rate of 2 popular artists in 2017:
The Importance of First Performance
Do the songs released in an artist's first year matter a lot? Are they the key to an artist's success? Nowadays people can become popular or famous because of a single event (you can always find viral videos to watch when you are bored, right?), so we want to see whether this also leads to career success for a singer. To do so, we choose several popular and unpopular artists from 1995-2000, 2000-2005 and 2005-2010, and observe the hotness of the songs from their first year:
The scatter plot suggests that artists may need to seize the opportunity in their first year, because several recently popular singers succeeded in their first year! Let us give 2 classic examples, Psy and Taylor Swift:
Lyrics of Songs
Do people tend to listen to songs that contain certain terms or themes?
Here we only display the figures for the popular songs.
Explanation:
- Top word count: we take the most recurrent word in each top/worst song and look at which of these top words are the most common.
- Top word weight: the same as the previous category, but weighted by the duration of the songs.
- Full count: we consider the full lyrics dataset regardless of hotttnesss, counting all the words of every track and summing them.
- Top songs count: we repeat the previous operation, but only on the top/worst songs.
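The counting schemes listed above can be sketched with `collections.Counter`. The example assumes each song is a dict with a bag of words (`words`, mapping word to count) and a `duration` in seconds. These field names, the helper names, and the toy values are hypothetical.

```python
from collections import Counter


def top_word(bag):
    """Most frequent word in one song's bag of words (ties: first seen wins)."""
    return max(bag, key=bag.get)


def top_word_count(songs):
    """How many songs have each word as their single most frequent word."""
    return Counter(top_word(song["words"]) for song in songs)


def top_word_weight(songs):
    """Same as top_word_count, but each song contributes its duration."""
    weights = Counter()
    for song in songs:
        weights[top_word(song["words"])] += song["duration"]
    return weights


def full_count(songs):
    """Total occurrences of every word summed over all given songs."""
    total = Counter()
    for song in songs:
        total.update(song["words"])
    return total


# Made-up toy songs (illustrative only).
songs = [
    {"words": {"yeah": 5, "love": 3}, "duration": 200.0},
    {"words": {"yeah": 4, "baby": 1}, "duration": 180.0},
    {"words": {"love": 10, "yeah": 2}, "duration": 240.0},
]

print(top_word_count(songs).most_common(1))
print(top_word_weight(songs).most_common(1))
print(full_count(songs).most_common(1))
```

Calling `full_count` on the whole dataset gives the "full count" category, while calling it only on the top or worst songs gives the "top songs count" category. The toy data also shows how the per-song top word and the overall most frequent word can differ.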
The results are very similar for popular and unpopular songs: for the first two categories, the top word is by far ‘yeah’. As a first conclusion, we might say people do not really care about lyrics, since ‘yeah’ is not related to any specific topic. Apart from ‘yeah’, many top words concern themes such as youth and the world, along with verbs that express desire (wish, want).
For the two remaining categories the result is different. Over all the songs, the most recurrent word is ‘love’, and many other high-ranked words evoke feelings (feel, like, want, baby, heart, girl). So emotional, isn’t it?
Sentiment Analysis of Songs
From a list of positively/negatively connoted words, let’s determine whether a popular song is usually positive (happy) or negative (sad).
We find about 43.6% positive and 56.4% negative songs among the tracks with high hotttnesss, and about 40% positive and 60% negative songs among the tracks with low hotttnesss. There is no significant difference between popular and unpopular songs.
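A word-list approach like this can be sketched as follows: sum the occurrences of positive and negative words in each song and label the song by whichever total is larger. The tiny word lists, the tie handling (ties are labeled "neutral" and left out of the shares), and the function names are assumptions for illustration, not the lists or rules used in the project.

```python
# Tiny illustrative word lists (hypothetical, not the project's lists).
POSITIVE = {"love", "happy", "smile"}
NEGATIVE = {"cry", "pain", "lonely"}


def song_polarity(bag, positive=POSITIVE, negative=NEGATIVE):
    """Label one song from its bag of words (word -> count)."""
    pos = sum(count for word, count in bag.items() if word in positive)
    neg = sum(count for word, count in bag.items() if word in negative)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def polarity_shares(bags):
    """Share of positive and negative songs among the non-neutral ones."""
    labels = [song_polarity(bag) for bag in bags]
    decided = [label for label in labels if label != "neutral"]
    if not decided:
        return {"positive": 0.0, "negative": 0.0}
    return {
        "positive": decided.count("positive") / len(decided),
        "negative": decided.count("negative") / len(decided),
    }


# Made-up toy bags of words.
bags = [{"love": 3, "cry": 1}, {"pain": 2, "smile": 1}, {"yeah": 5}]
print(polarity_shares(bags))
```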
Presence of “Slang Words” in Songs
We count how often “slang words”, such as insults or controversial subjects, appear in popular and unpopular songs, which gives an estimate of lyrics quality.
30.6% of top songs contain bad words, compared with a lower ratio of 22.1% for unpopular songs. So people might be more interested in borderline songs?
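A ratio of this kind can be computed as the share of songs whose lyrics contain at least one word from a flagged list. In the sketch below, the placeholder flagged words, the bag-of-words layout, and the function name are assumptions.

```python
def share_with_flagged_words(bags, flagged):
    """Fraction of songs containing at least one flagged word.

    `bags` is a list of bag-of-words dicts (word -> count), one per song.
    """
    if not bags:
        return 0.0
    hits = sum(1 for bag in bags if any(word in flagged for word in bag))
    return hits / len(bags)


# Placeholder flagged words and made-up toy songs (illustrative only).
flagged = {"damn", "hell"}
top_songs = [
    {"yeah": 4, "damn": 1},
    {"love": 6},
    {"hell": 2, "baby": 3},
    {"heart": 5},
]
print(share_with_flagged_words(top_songs, flagged))
```

Applying the same function to the popular and the unpopular songs gives the two ratios to compare.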
Users Behaviours Analysis
Now we give an example of analyzing one user's listening behavior in terms of play count distribution, favorite singer, herding bias, and genres.
User analysis:
Based on the playlist record, 32.1% of the user's play counts go to the following singers, each with at least 2 songs in the playlist:
Benabar:
- Les Mots D'Amour
- L'Itinéraire
- Y'a Une Fille Qu'Habite Chez Moi
The genre this user loves the most:
1. French
Conclusions
Using the Million Song Dataset, we explore how people react to songs, especially popular and unpopular ones. We come to the following conclusions:
- People listen to their favorite artists. This is supported by the important features of the random forest classifier and by the analysis of the herding bias phenomenon. This is widely known as “the rich get richer”, meaning popular artists tend to become more popular.
- People care more about their favorite artist than about the songs themselves. Popular and unpopular songs alike have “yeah” as the most common word in their lyrics, contain more negative words than positive ones, and include slang words.