This project was a joint effort by Lucas De Oliveira, Chandrish Ambati, and Anish Mukherjee to create song and playlist embeddings for recommendations in a distributed fashion using a 1M playlist dataset by Spotify.

lbdeoliveira/song-playlist-recommendation

Recommending Songs and Playlists

Lucas De Oliveira, Chandrish Ambati, Anish Mukherjee

Introduction and motivation

It all started with a dataset. In 2018, Spotify organized an Association for Computing Machinery (ACM) RecSys Challenge where they posted a dataset of one million playlists, challenging participants to recommend a list of 500 songs given a user-created playlist.

As both music lovers and data scientists, we were naturally drawn to this challenge. Right away, we agreed that combining song embeddings with some nearest-neighbors method for recommendation would likely produce very good results with not much effort. Importantly, we were curious about how a company like Spotify might do this recommendation task at scale – not with 1 million playlists but with the over 4 billion user-curated playlists on their platform. This realization raised serious questions about how to train a decent model since all that data would likely not fit in memory.

In this article we discuss how we built a scalable ETL pipeline using Spark, MongoDB, Amazon S3, and Databricks to train a Word2Vec model that produces song and playlist embeddings for recommendation. We’ll also look at some visualizations we created with TensorFlow’s Embedding Projector.

Workflow

Collecting lyrics

The most tedious task of this project was collecting lyrics for as many of the songs in the playlists as possible. We began by isolating the unique songs in the playlist files by their track URI; in total we had over 2 million unique songs. Then, we used the track name and artist name to look up the lyrics on the web. Initially, we used plain Python requests to fetch the lyrics, one song at a time, but this proved too slow for our purposes. We then switched to asyncio, which allowed us to make requests concurrently. This sped up the process significantly, reducing the download time for the lyrics of 10k songs from 15 minutes to under a minute. Ultimately, we were only able to collect lyrics for 138,000 songs.
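The concurrency pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's scraper: `fetch_lyrics` is a hypothetical stand-in that sleeps instead of calling a real lyrics site, and the `max_concurrent` limit is an assumed parameter.

```python
import asyncio


async def fetch_lyrics(track_name, artist_name):
    """Hypothetical stand-in for one HTTP lookup; sleeps instead of hitting the web."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"lyrics for {track_name} by {artist_name}"


async def fetch_all(songs, max_concurrent=100):
    """Look up lyrics for many (track_name, artist_name) pairs concurrently."""
    semaphore = asyncio.Semaphore(max_concurrent)  # cap the number of in-flight requests

    async def bounded_fetch(song):
        async with semaphore:
            try:
                return await fetch_lyrics(*song)
            except Exception:
                return None  # one failed lookup should not abort the whole batch

    # gather preserves input order, so results line up with `songs`
    return await asyncio.gather(*(bounded_fetch(song) for song in songs))


songs = [("A Day In The Life", "The Beatles"), ("New Romantics", "Taylor Swift")]
lyrics = asyncio.run(fetch_all(songs))
```

Because the lookups spend most of their time waiting on the network, running them concurrently lets the waits overlap instead of adding up.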

Preprocessing

The original dataset contains 1 million playlists spread across 1 thousand JSON files totaling about 33 GB of data. We used PySpark in Databricks to preprocess these separate JSON files into a single SparkSQL DataFrame, joined this DataFrame with the lyrics we had collected, and saved the result to MongoDB. From there, it was easy to read the data back from MongoDB into Databricks for our later analyses.
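To make the shape of this step concrete, here is a small plain-Python sketch of flattening one playlist file into rows and left-joining the lyrics. It is an illustration only: the real pipeline ran in PySpark, and the field names (`playlists`, `pid`, `tracks`, `track_uri`, `track_name`, `artist_name`) are assumptions about the JSON layout rather than something stated in this README.

```python
import json


def flatten_playlists(raw_json):
    """Turn one playlist file into one row per (playlist, track) pair."""
    data = json.loads(raw_json)
    rows = []
    for playlist in data["playlists"]:
        for track in playlist["tracks"]:
            rows.append({
                "pid": playlist["pid"],
                "track_uri": track["track_uri"],
                "track_name": track["track_name"],
                "artist_name": track["artist_name"],
            })
    return rows


def join_lyrics(rows, lyrics_by_uri):
    """Left join: keep every row and attach lyrics where we have them (else None)."""
    return [{**row, "lyrics": lyrics_by_uri.get(row["track_uri"])} for row in rows]


# Tiny made-up file with two playlists and two distinct tracks.
raw = json.dumps({"playlists": [
    {"pid": 0, "tracks": [
        {"track_uri": "spotify:track:1", "track_name": "Song A", "artist_name": "Artist X"},
        {"track_uri": "spotify:track:2", "track_name": "Song B", "artist_name": "Artist Y"},
    ]},
    {"pid": 1, "tracks": [
        {"track_uri": "spotify:track:1", "track_name": "Song A", "artist_name": "Artist X"},
    ]},
]})

rows = flatten_playlists(raw)
joined = join_lyrics(rows, {"spotify:track:1": "la la la"})
```

A left join fits here because lyrics were found for only a fraction of the songs, and the remaining tracks still need to stay in the playlists.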

Check out the Preprocessing.ipynb notebook to see how we preprocessed the data.

Training song embeddings

For our analyses, we read our preprocessed SparkSQL DataFrame from MongoDB and grouped the records by playlist id, aggregating all of the songs in a playlist into a list under the column song_list. Below is a snapshot of the first five rows:

[Screenshot: first five rows of the grouped DataFrame with the song_list column]
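The grouping itself is simple to picture. The sketch below does it in plain Python on a few made-up rows; the `pid` and `track_uri` keys are assumed names. In PySpark, the same shape would typically come from a `groupBy` followed by `collect_list`, though the notebook may do it differently.

```python
from collections import defaultdict


def group_songs_by_playlist(rows):
    """Collect each playlist's track IDs into a list, keeping row order."""
    song_lists = defaultdict(list)
    for row in rows:
        song_lists[row["pid"]].append(row["track_uri"])
    return dict(song_lists)


rows = [
    {"pid": 0, "track_uri": "spotify:track:1"},
    {"pid": 0, "track_uri": "spotify:track:2"},
    {"pid": 1, "track_uri": "spotify:track:1"},
]
song_lists = group_songs_by_playlist(rows)
```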

Using the Word2Vec model in Spark MLlib, we trained song embeddings by feeding lists of track IDs from a playlist into the model, much like you would feed a list of words from a sentence to train word embeddings. As shown below, we trained song embeddings in only 3 lines of PySpark code:

[Screenshot: the three lines of PySpark code used to train the Word2Vec model]
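The original three lines survive only as a screenshot, so the following is a hedged sketch of what such a call might look like with `pyspark.ml.feature.Word2Vec`, not a transcription of the notebook. The function name, the `min_count` default, and the `playlist_vector` output column are assumptions; the vector size of 32 matches the embedding dimension mentioned elsewhere in this README.

```python
def train_song_embeddings(playlists_df, vector_size=32, min_count=1):
    """Fit Spark MLlib's Word2Vec on a DataFrame with an array<string> column `song_list`."""
    # Imported here so the sketch can be loaded without a Spark installation.
    from pyspark.ml.feature import Word2Vec

    word2vec = Word2Vec(
        vectorSize=vector_size,       # dimensionality of each song embedding
        minCount=min_count,           # assumption: keep songs seen at least this often
        inputCol="song_list",         # each row is one playlist's list of track IDs
        outputCol="playlist_vector",  # hypothetical name for the transformed column
    )
    model = word2vec.fit(playlists_df)
    # getVectors() returns a DataFrame with columns `word` (track ID) and `vector`.
    return model.getVectors()
```

Calling `train_song_embeddings(df)` on the grouped DataFrame would return one row per track ID, which is the shape needed for saving to a database and for nearest-neighbor lookups.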

We then saved the song embeddings down to MongoDB for later use. Below is a snapshot of the song embeddings DataFrame that we saved:

[Screenshot: sample rows of the saved song embeddings DataFrame]

Check out the Song_Embeddings.ipynb notebook to see how we train song embeddings.

Training playlist embeddings

Finally, we extended our recommendation task beyond simple song recommendation to recommending entire playlists. Given an input playlist, we would return the k closest or most similar playlists. We took a “continuous bag of songs” approach to this problem by calculating playlist embeddings as the average of all song embeddings in that playlist.

This workflow started by reading back the song embeddings from MongoDB into a SparkSQL DataFrame. Then, we calculated a playlist embedding by taking the average of all song embeddings in that playlist and saved a playlist_id --> vector DataFrame in MongoDB.
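The averaging step can be sketched in a few lines of plain Python. This is an illustration on toy vectors rather than the Spark code from the notebook; skipping tracks that have no embedding is an assumption about how missing songs are handled.

```python
def average_embedding(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def playlist_embeddings(song_lists, song_vectors):
    """Map each playlist ID to the mean of its songs' embeddings."""
    result = {}
    for pid, tracks in song_lists.items():
        # Assumption: tracks without a trained embedding are ignored.
        known = [song_vectors[t] for t in tracks if t in song_vectors]
        if known:  # a playlist with no known songs gets no embedding
            result[pid] = average_embedding(known)
    return result


song_vectors = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
song_lists = {0: ["a", "b"], 1: ["a", "unknown"], 2: ["unknown"]}
embeddings = playlist_embeddings(song_lists, song_vectors)
```

Because the mean ignores order, two playlists with the same songs in a different sequence get identical embeddings, which is the "bag" in "continuous bag of songs".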

Check out the Playlist_Embeddings.ipynb notebook to see how we did this.

Training lyrics embeddings

We trained lyrics embeddings by loading in a song's lyrics, separating the words into lists, and feeding those words to a Word2Vec model to produce 32-dimensional vectors for each word. We then took the average embedding across all words as that song's lyrical embedding. Ultimately, our analytical goal here was to determine whether users create playlists based on common lyrical themes by seeing if the pairwise song embedding distance and the pairwise lyrical embedding distance between two songs were correlated. Unsurprisingly, it appears they are not.
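One way to run this kind of check is sketched below: compute the distance between every pair of songs in both embedding spaces, then correlate the two lists of distances. The choice of cosine distance and Pearson correlation is an assumption made for illustration; the notebook may use different measures.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlate_pairwise_distances(song_vecs, lyric_vecs):
    """Correlate song-space and lyric-space distances over all pairs of shared songs."""
    shared = sorted(set(song_vecs) & set(lyric_vecs))
    pairs = list(combinations(shared, 2))
    song_d = [cosine_distance(song_vecs[a], song_vecs[b]) for a, b in pairs]
    lyric_d = [cosine_distance(lyric_vecs[a], lyric_vecs[b]) for a, b in pairs]
    return pearson(song_d, lyric_d)


# Sanity check on toy data: identical spaces should correlate perfectly.
toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
r = correlate_pairwise_distances(toy, toy)
```

With n songs there are n(n-1)/2 pairs, so on the full set of songs with lyrics a random sample of pairs would likely be needed to keep this tractable.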

Check out the Lyrical_Embeddings.ipynb notebook to see our analysis.

Notes on embedding training approach

You may be wondering why we used a language model (Word2Vec) to train these embeddings. Why not use a Pin2Vec or custom neural network model to predict implicit ratings? For practical reasons, we wanted to work exclusively in the Spark ecosystem and deal with the data in a distributed fashion. This was a constraint set on the project ahead of time and challenged us to think creatively.

However, we found Word2Vec an attractive candidate model for theoretical reasons as well. The Word2Vec model uses a word’s context to train static embeddings by training the input word’s embeddings to predict its surrounding words. In essence, the embedding of any word is determined by how it co-occurs with other words. This had a clear mapping to our own problem: by using a Word2Vec model the distance between song embeddings would reflect the songs’ co-occurrence throughout 1M playlists, making it a useful measure for a distance-based recommendation (nearest neighbors). It would effectively model how people grouped songs together, using user behavior as the determinant factor in similarity.

Additionally, the Word2Vec model accepts input in the form of a list of words. For each playlist we had a list of track IDs, which made working with the Word2Vec model not only conceptually but also practically appealing.
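To see how co-occurrence enters the training signal, the sketch below generates the (input song, context song) pairs that a skip-gram style model would learn from when a playlist is treated as a sentence. It illustrates the idea only and is not Spark's implementation; the window size of 2 is an arbitrary assumption.

```python
def skipgram_pairs(playlist, window=2):
    """Return (center, context) pairs for songs within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(playlist):
        start = max(0, i - window)
        stop = min(len(playlist), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a song is not its own context
                pairs.append((center, playlist[j]))
    return pairs


pairs = skipgram_pairs(["t1", "t2", "t3"], window=1)
```

Songs that repeatedly land in each other's windows across many playlists are trained to predict one another, which pulls their embeddings closer together.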

Visualization and recommendation

After all of that, we were finally ready to visualize our results and make some interactive recommendations. We decided to represent our embedding results visually using TensorFlow’s Embedding Projector, which maps the 32-dimensional song and playlist embeddings into an interactive visualization of a 3D embedding space. You have the choice of using PCA or t-SNE for dimensionality reduction and cosine similarity or Euclidean distance for measuring distances between vectors.

Click here for the song embeddings projector for the full 2 million songs, or here for a less crowded version with a random sample of 100k songs (shown below):

[Figure: embedding projector view of the 100k-song sample]

Click here for the playlist embeddings projector (shown below):

[Figure: embedding projector view of the playlist embeddings]

The neat thing about using TensorFlow’s projector is that it gives us a beautiful visualization tool and distance calculator all in one. Try searching in the right panel for a song; if the song is part of the original dataset, you will see the “most similar” songs appear under it.
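Outside the projector, the same "most similar" lookup can be approximated with a brute-force nearest-neighbor search. The sketch below uses cosine similarity on toy vectors; the function names are hypothetical, and a full scan like this would be slow on 2 million songs without an index or sampling.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def most_similar(query_id, vectors, k=5):
    """Return the k (id, similarity) pairs closest to `query_id`, best first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id  # never recommend the query itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
neighbors = most_similar("a", vectors, k=2)
```

The same function works for playlists if the playlist embeddings are passed in place of the song embeddings.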

Conclusions

We were shocked by how well this method of training embeddings actually worked. While the 2 million song embedding projector is visually crowded, the recommendations it produces are quite good at grouping similar songs together.

Consider the embedding recommendation for The Beatles’ “A Day In The Life”:

[Figures: embedding projector recommendations for “A Day In The Life”]

Or the recommendation for Jay Z’s “Heart of the City (Ain’t No Love)”:

Fan of Taylor Swift? Here are the recommendations for “New Romantics”:

Secondly, we were delighted to find naturally occurring clusters in the playlist embeddings. Most notably, we see a cluster containing mostly Christian rock, one with Christmas music, one for reggaeton, and one large cluster where genres span its length rather continuously and intuitively.

Note also that when we select a playlist, many of the recommended playlists have the same name. This in essence validates our song embeddings. Recall that playlist embeddings were created by taking the average embedding of all their songs; the names of the playlists did not factor in at all. That is, similar playlists are similar because they have similar, if not the same, songs. The similar names only conceptually reinforce this fact.

Further Scope:

  1. We could use these trained song embeddings in other downstream tasks and see how effective they are. The song embeddings are also available for download here: Embeddings | Meta Info

  2. We could explore other methods of training these embeddings, such as recurrent neural networks or an enhanced implementation of this Word2Vec model.
