Algo-Rhythms: DL-Playlist-Recommendation System
Background:
Creating cohesive music playlists is a core part of the modern listening experience. Thus, providing automated recommendations that extend users' playlists is a key feature for music-listening platforms like Spotify and Apple Music. To make this feature effective, the system must understand the "vibe" of a playlist. This is challenging because some playlists have clear themes (e.g. "rap music playlist"), whereas others have more complex themes that cannot be fully captured by surface features such as genre or time period (e.g. "beach trip playlist"). As music platform users who have felt dissatisfied with existing playlist recommendation algorithms, we wanted to dive more deeply into how we could build a model that learns the "vibe" of a playlist to make our own recommendation system.
Abstract:
In this project, we use deep learning to address the problem of playlist recommendation. Specifically, we build a system that, given a titled playlist of songs, produces five recommended songs. We use a title-conditioned Set Transformer architecture with pooling by multi-head attention, trained on user Spotify playlists (with titles) and tabular audio features for the songs they contain. Furthermore, we create a custom weighted approximate-rank pairwise (WARP) loss function that is part reconstruction loss and part a more traditional WARP loss. While our model trains and improves on the metrics we use, such as R-precision, our qualitative results are underwhelming, and we see indications of mode collapse in our system's recommendations. We hypothesize that the title conditioning in our model is too weak, and that the training data we use may be too noisy or may not even represent a learnable function.
Codebase/Github
You can access our codebase using this link here:
- From there, follow the README instructions to train our model, access our current weights, and learn more about the code structure!
Architecture
Our playlist recommendation system is built around a title-conditioned Set Transformer architecture. The main components are:
- Song Encoder: A feed-forward neural network that embeds tabular audio features for each song.
- Title Encoder: Uses the Universal Sentence Encoder (USE) to embed playlist titles, which are projected to match the song embedding dimension.
- Set Transformer: Stacks multiple Set Attention Blocks (SAB) to model relationships between songs, followed by Pooling by Multihead Attention (PMA) to produce a fixed-size playlist representation.
- Output Layer: A dense layer predicts logits over the entire song vocabulary for playlist completion.
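To make the pooling step concrete, here is a minimal NumPy sketch of PMA with a single attention head; the weight matrices, seed vector, and playlist sizes are illustrative placeholders rather than the project's actual parameters (our model uses 4 heads per block):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pma(songs, seed, Wq, Wk, Wv):
    """Pooling by Multihead Attention (single head for brevity):
    a learned seed vector attends over the set of song embeddings,
    producing one fixed-size playlist representation regardless of
    how many songs the playlist contains."""
    q = seed @ Wq                                    # (1, d) query from seed
    k = songs @ Wk                                   # (n, d) keys from songs
    v = songs @ Wv                                   # (n, d) values from songs
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (1, n) attention weights
    return attn @ v                                  # (1, d) fixed-size summary

rng = np.random.default_rng(0)
d = 64                                  # matches our embedding dimension
songs = rng.normal(size=(12, d))        # a 12-song playlist, embedded
seed = rng.normal(size=(1, d))          # learned seed (random here)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
playlist_repr = pma(songs, seed, Wq, Wk, Wv)   # shape (1, 64) for any n
```

The key property is that the output shape depends only on the embedding dimension, not the playlist length, which is what lets a dense output layer score the full song vocabulary afterwards.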
Custom Loss:
We use a hybrid Weighted Approximate-Rank Pairwise (WARP) loss. This loss combines two objectives:
- Reconstruction loss: Encourages the model to reconstruct songs that were visible in the input (with a lower weight).
- Recommendation loss: Focuses on predicting songs that were masked/hidden from the input (with a higher weight). The loss is computed using sampled negatives and a margin ranking approach, with different weights for seen and hidden items to balance reconstruction and recommendation.
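The hybrid loss can be sketched roughly as follows. This is a simplified single-playlist version with illustrative names, and it uses a log-of-violations factor as a stand-in for the exact WARP rank weighting, so it may differ from our actual implementation in detail:

```python
import numpy as np

def hybrid_warp_loss(scores, seen, hidden, n_neg=50, margin=1.0,
                     w_seen=0.5, w_hidden=1.0, rng=None):
    """Sketch of a hybrid WARP loss for one playlist.
    scores: 1-D array of logits over the song vocabulary.
    seen:   indices of songs visible in the input (reconstruction targets).
    hidden: indices of songs masked from the input (recommendation targets).
    For each positive, sample negatives, apply a margin-ranking hinge,
    and weight the term by whether the positive was seen or hidden."""
    rng = rng or np.random.default_rng()
    vocab = scores.shape[0]
    positives = [(i, w_seen) for i in seen] + [(i, w_hidden) for i in hidden]
    total = 0.0
    for idx, weight in positives:
        negs = rng.integers(0, vocab, size=n_neg)
        negs = negs[~np.isin(negs, list(seen) + list(hidden))]  # drop positives
        hinge = np.maximum(0.0, margin - scores[idx] + scores[negs])
        # WARP-style weight: more margin violations -> larger penalty
        violations = int((hinge > 0).sum())
        total += weight * np.log1p(violations) * hinge.mean() if len(hinge) else 0.0
    return total / len(positives)
```

With `w_seen=0.5` and `w_hidden=1.0` (the weights listed under Hyperparameters), errors on hidden songs are penalized twice as heavily as errors on visible ones, which biases the model toward recommendation over pure reconstruction.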
The model is trained as a denoising autoencoder: random songs are masked from the input, and the model is tasked with reconstructing the full playlist.
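A minimal sketch of that masking step (the function name and signature are ours, not from the codebase), hiding a random 10–50% of songs per playlist as during training:

```python
import random

def denoise_mask(playlist, lo=0.1, hi=0.5, rng=random):
    """Split a playlist into a visible portion (fed to the model) and a
    hidden portion (what the model must reconstruct). The mask fraction
    is drawn uniformly from [lo, hi], matching the 10-50% range listed
    under Hyperparameters."""
    n_hide = max(1, round(rng.uniform(lo, hi) * len(playlist)))
    hidden_idx = set(rng.sample(range(len(playlist)), n_hide))
    visible = [s for i, s in enumerate(playlist) if i not in hidden_idx]
    masked = [s for i, s in enumerate(playlist) if i in hidden_idx]
    return visible, masked
```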
Hyperparameters
- Embedding dimension (song & title): 64
- Playlist representation size: 64
- Set Transformer layers: 3 SAB layers
- Attention heads: 4 per SAB/PMA
- Dropout rate: 0.1
- Optimizer: Adam, learning rate 3e-4
- WARP loss: 50 negative samples per positive, margin = 1.0
- WARP loss weights:
- Seen (reconstruction) weight: 0.5
- Hidden (prediction) weight: 1.0
- Batch size: Variable, bucketed by playlist length (up to 32)
- Denoising mask: Randomly hides 10–50% of songs per playlist during training
- Vocab Size: 30,000 most frequent tracks seen in full playlist dataset
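For reference, the hyperparameters above can be collected into a single config dict like this (the key names are illustrative and not taken from our code):

```python
# Hypothetical config mirroring the hyperparameter list above.
CONFIG = {
    "embed_dim": 64,           # song & title embedding dimension
    "playlist_repr_dim": 64,   # PMA output size
    "num_sab_layers": 3,       # Set Attention Blocks
    "num_heads": 4,            # per SAB/PMA
    "dropout": 0.1,
    "optimizer": "adam",
    "learning_rate": 3e-4,
    "warp_negatives": 50,      # negative samples per positive
    "warp_margin": 1.0,
    "seen_weight": 0.5,        # reconstruction term
    "hidden_weight": 1.0,      # recommendation term
    "max_batch_size": 32,      # bucketed by playlist length
    "mask_range": (0.1, 0.5),  # fraction of songs hidden per playlist
    "vocab_size": 30_000,      # most frequent tracks
}
```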
Authors
Authors listed in alphabetical order:
- Brendan Rathier - Brown University - brendan_rathier@brown.edu
- Camilo Tamayo-Rousseau - Brown University - camilo_tamayo-rousseau@brown.edu
- Daniel Schiffman - Brown University - daniel_schiffman@brown.edu
- Matias Bronner - Brown University - matias_bronner@brown.edu
Acknowledgments
We thank the CSCI 1470 teaching staff at Brown University for their guidance and support throughout this project!
- Cover image generated by ChatGPT-5.2