Algo-Rhythms: DL-Playlist-Recommendation System
Background:
Creating cohesive music playlists is a core part of the modern listening experience. Thus, providing automated recommendations that extend users' playlists is a key feature for music-listening platforms like Spotify and Apple Music. To make this feature effective, the system must understand the "vibe" of a playlist. This is challenging because some playlists have clear themes (e.g. "rap music playlist"), whereas others have more complex themes that cannot be fully captured by surface features such as genre or time period (e.g. "beach trip playlist"). As music platform users who have felt dissatisfied with existing playlist recommendation algorithms, we wanted to dive more deeply into how we could build a model that learns the "vibe" of a playlist to make our own recommendation system.
Abstract:
In this project, we use deep learning to address the problem of playlist recommendation. Specifically, we build a system that, given a titled playlist of songs, produces five recommended songs. We use a title-conditioned Set Transformer architecture with pooling by multi-head attention, trained on user Spotify playlists (with titles) and tabular audio features for the songs they contain. Furthermore, we create a custom weighted approximate-rank pairwise (WARP) loss function that is part reconstruction loss and part a more traditional WARP loss. While our model trains and improves on the metrics we use, such as R-precision, our qualitative results are underwhelming, and we see indications of mode collapse in our system's recommendations. We hypothesize that the title conditioning in our model is too weak, and that the training data we use may be too noisy or may not even represent a learnable function.
Codebase/Github
You can access our codebase using this link here:
- From there, follow the README instructions to train our model, access our current weights, and learn more about the code structure!
Architecture
Our playlist recommendation system is built around a title-conditioned Set Transformer architecture. The main components are:
- Song Encoder: A feed-forward neural network that embeds tabular audio features for each song.
- Title Encoder: Uses the Universal Sentence Encoder (USE) to embed playlist titles, which are projected to match the song embedding dimension.
- Set Transformer: Stacks multiple Set Attention Blocks (SAB) to model relationships between songs, followed by Pooling by Multihead Attention (PMA) to produce a fixed-size playlist representation.
- Output Layer: A dense layer predicts logits over the entire song vocabulary for playlist completion.
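To make the pooling step concrete, here is a minimal NumPy sketch of PMA with a single attention head; the weight matrices, seed vector, and playlist sizes are illustrative placeholders rather than the project's actual parameters (our model uses 4 heads per block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pma(songs, seed, Wq, Wk, Wv):
    """Pooling by Multihead Attention (single head for brevity):
    a learned seed vector attends over the set of song embeddings,
    producing one fixed-size playlist representation regardless of
    how many songs the playlist contains."""
    q = seed @ Wq                                    # (1, d) query from seed
    k = songs @ Wk                                   # (n, d) keys from songs
    v = songs @ Wv                                   # (n, d) values from songs
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (1, n) attention weights
    return attn @ v                                  # (1, d) fixed-size summary

rng = np.random.default_rng(0)
d = 64                                  # matches our embedding dimension
songs = rng.normal(size=(12, d))        # a 12-song playlist, embedded
seed = rng.normal(size=(1, d))          # learned seed (random here)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
playlist_repr = pma(songs, seed, Wq, Wk, Wv)   # shape (1, 64) for any n
```

The key property is that the output shape depends only on the embedding dimension, not the playlist length, which is what lets a dense output layer score the full song vocabulary afterwards.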
Custom Loss:
We use a hybrid Weighted Approximate-Rank Pairwise (WARP) loss. This loss combines two objectives:
- Reconstruction loss: Encourages the model to reconstruct songs that were visible in the input (with a lower weight).
- Recommendation loss: Focuses on predicting songs that were masked/hidden from the input (with a higher weight). The loss is computed using sampled negatives and a margin ranking approach, with different weights for seen and hidden items to balance reconstruction and recommendation.
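The hybrid loss can be sketched roughly as follows. This is a simplified single-playlist version with illustrative names, and it uses a log-of-violations factor as a stand-in for the exact WARP rank weighting, so it may differ from our actual implementation in detail:

```python
import numpy as np

def hybrid_warp_loss(scores, seen, hidden, n_neg=50, margin=1.0,
                     w_seen=0.5, w_hidden=1.0, rng=None):
    """Sketch of a hybrid WARP loss for one playlist.
    scores: 1-D array of logits over the song vocabulary.
    seen:   indices of songs visible in the input (reconstruction targets).
    hidden: indices of songs masked from the input (recommendation targets).
    For each positive, sample negatives, apply a margin-ranking hinge,
    and weight the term by whether the positive was seen or hidden."""
    rng = rng or np.random.default_rng()
    vocab = scores.shape[0]
    positives = [(i, w_seen) for i in seen] + [(i, w_hidden) for i in hidden]
    total = 0.0
    for idx, weight in positives:
        negs = rng.integers(0, vocab, size=n_neg)
        negs = negs[~np.isin(negs, list(seen) + list(hidden))]  # drop positives
        hinge = np.maximum(0.0, margin - scores[idx] + scores[negs])
        # WARP-style weight: more margin violations -> larger penalty
        violations = int((hinge > 0).sum())
        total += weight * np.log1p(violations) * hinge.mean() if len(hinge) else 0.0
    return total / len(positives)
```

With `w_seen=0.5` and `w_hidden=1.0` (the weights listed under Hyperparameters), errors on hidden songs are penalized twice as heavily as errors on visible ones, which biases the model toward recommendation over pure reconstruction.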
The model is trained as a denoising autoencoder: random songs are masked from the input, and the model is tasked with reconstructing the full playlist.
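A minimal sketch of that masking step (the function name and signature are ours, not from the codebase), hiding a random 10–50% of songs per playlist as during training:

```python
import random

def denoise_mask(playlist, lo=0.1, hi=0.5, rng=random):
    """Split a playlist into a visible portion (fed to the model) and a
    hidden portion (what the model must reconstruct). The mask fraction
    is drawn uniformly from [lo, hi], matching the 10-50% range listed
    under Hyperparameters."""
    n_hide = max(1, round(rng.uniform(lo, hi) * len(playlist)))
    hidden_idx = set(rng.sample(range(len(playlist)), n_hide))
    visible = [s for i, s in enumerate(playlist) if i not in hidden_idx]
    masked = [s for i, s in enumerate(playlist) if i in hidden_idx]
    return visible, masked
```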
Hyperparameters
- Embedding dimension (song & title): 64
- Playlist representation size: 64
- Set Transformer layers: 3 SAB layers
- Attention heads: 4 per SAB/PMA
- Dropout rate: 0.1
- Optimizer: Adam, learning rate 3e-4
- WARP loss: 50 negative samples per positive, margin = 1.0
- WARP loss weights:
- Seen (reconstruction) weight: 0.5
- Hidden (prediction) weight: 1.0
- Batch size: Variable, bucketed by playlist length (up to 32)
- Denoising mask: Randomly hides 10–50% of songs per playlist during training
- Vocab Size: 30,000 most frequent tracks seen in full playlist dataset
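For reference, the hyperparameters above can be collected into a single config dict like this (the key names are illustrative and not taken from our code):

```python
# Hypothetical config mirroring the hyperparameter list above.
CONFIG = {
    "embed_dim": 64,           # song & title embedding dimension
    "playlist_repr_dim": 64,   # PMA output size
    "num_sab_layers": 3,       # Set Attention Blocks
    "num_heads": 4,            # per SAB/PMA
    "dropout": 0.1,
    "optimizer": "adam",
    "learning_rate": 3e-4,
    "warp_negatives": 50,      # negative samples per positive
    "warp_margin": 1.0,
    "seen_weight": 0.5,        # reconstruction term
    "hidden_weight": 1.0,      # recommendation term
    "max_batch_size": 32,      # bucketed by playlist length
    "mask_range": (0.1, 0.5),  # fraction of songs hidden per playlist
    "vocab_size": 30_000,      # most frequent tracks
}
```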
Authors
Authors listed in alphabetical order:
- Brendan Rathier - Brown University - brendan_rathier@brown.edu
- Camilo Tamayo-Rousseau - Brown University - camilo_tamayo-rousseau@brown.edu
- Daniel Schiffman - Brown University - daniel_schiffman@brown.edu
- Matias Bronner - Brown University - matias_bronner@brown.edu
Acknowledgments
We thank the CSCI 1470 teaching staff at Brown University for their guidance and support throughout this project!
- Cover image generated by ChatGPT-5.2