Vibe Engine 🎵

Your vibe, your soundtrack.


Inspiration

Music recommendation has always felt impersonal. Spotify knows what you've listened to. It doesn't know how you feel. You can't type "I just closed a massive deal and I want to feel like a god" into a search bar and get something back that actually fits.

We wanted to build something that understands the vibe — not the genre, not the artist, not the playlist tag, but the raw emotional state behind the request. The insight was that lyrics are the most human part of a song, and if you could match the semantic meaning of a mood description to the semantic meaning of lyrics, you'd have a recommendation engine that actually gets it.


What It Does

Vibe Engine takes a freeform description of a mood, feeling, or moment — anything from "grindset mindset" to "late night drive, windows down, thinking about everything" — and returns 10 songs whose lyrics semantically match that vibe.

Users can also upload an image — album art, a photo, a mood board — and get recommendations based on the emotion the image conveys. Every result comes with:

  • Album art and a direct Spotify link
  • A one-sentence AI-generated explanation of why the song was recommended
  • Mood tags, themes, and energy level extracted from the query

How We Built It

The Data

We worked with a dataset of 500,000 songs ranked by view count, each with cleaned lyrics, stored in a single consolidated parquet file. The lyrics were embedded offline using sentence-transformers/all-MiniLM-L6-v2, a BERT-based model that maps text into a 384-dimensional semantic space:

$$\vec{e}_i = \text{Encoder}(\text{lyrics}_i), \quad \vec{e}_i \in \mathbb{R}^{384}$$

The full embedding matrix was precomputed and stored as a NumPy binary file:

$$E \in \mathbb{R}^{500000 \times 384}$$
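A minimal sketch of that offline step, assuming the parquet file exposes a lyrics column; the file names here are illustrative rather than the actual ones:

# offline_embed.py: precompute the lyric embedding matrix (illustrative sketch)
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

songs = pd.read_parquet("songs.parquet")  # assumed path; ~500k rows with a "lyrics" column
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Normalising now lets cosine similarity reduce to a plain dot product at query time.
E = model.encode(
    songs["lyrics"].tolist(),
    batch_size=256,
    normalize_embeddings=True,
    show_progress_bar=True,
)
np.save("lyric_embeddings.npy", E.astype(np.float32))  # shape (500000, 384)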

Query Enrichment

Raw queries like "grindset mindset" are too sparse for reliable embedding. We used Gemini 2.5 Flash to expand them into structured retrieval signals:

{
  "rewritten_prompt": "relentless ambition, hustle, self-discipline, winning mindset",
  "moods": ["motivated", "focused", "intense"],
  "keywords": ["grind", "ambition", "success"],
  "energy": "high"
}
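One way to implement the enrichment call, sketched with the google-generativeai SDK; the SDK choice, model identifier string, and prompt wording are assumptions here, and only the output shape shown above comes from the actual pipeline:

# enrich_query.py: expand a raw mood query into structured retrieval signals (sketch)
import json
import google.generativeai as genai

genai.configure(api_key="...")  # API key assumed to come from the environment
llm = genai.GenerativeModel("gemini-2.5-flash")

def enrich(query: str) -> dict:
    prompt = (
        "Expand this music mood query into JSON with keys "
        "rewritten_prompt, moods, keywords and energy.\n"
        f"Query: {query}"
    )
    # Requesting application/json keeps the response directly parseable.
    resp = llm.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(resp.text)

# enrich("grindset mindset") -> a dict shaped like the example above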

Multi-Angle Query Encoding

Rather than encoding just the rewritten prompt, we encode multiple semantic angles — the prompt, keywords, and moods — and average them into a single richer query vector:

$$\vec{q} = \frac{1}{n} \sum_{i=1}^{n} \text{Encoder}(p_i)$$
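A sketch of that averaging step, assuming the enriched query is the dict shown earlier; encode_query is a hypothetical helper name:

# Average several semantic "angles" of the enriched query into one vector (sketch)
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def encode_query(enriched: dict) -> np.ndarray:
    angles = [
        enriched["rewritten_prompt"],
        " ".join(enriched["keywords"]),
        " ".join(enriched["moods"]),
    ]
    vecs = encoder.encode(angles, normalize_embeddings=True)
    q = vecs.mean(axis=0)
    return q / np.linalg.norm(q)  # re-normalise so dot products stay cosine similarities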

Blended Ranking

Pure semantic similarity isn't enough — an obscure song with a perfect lyric match isn't useful if nobody knows it. We blend three signals into a final score:

$$\text{score}_i = \alpha \cdot \text{sem}_i + \beta \cdot \text{pop}_i + \gamma \cdot \text{rec}_i$$

where sem is the cosine similarity between the query and lyric embedding, pop is the log-normalised view count, rec is the normalised release year, and weights α = 0.7, β = 0.2, γ = 0.1 by default. Songs below a semantic similarity floor are zeroed out regardless of popularity.
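A sketch of the blend, assuming the embedding matrix and query vector are unit-normalised; the similarity floor of 0.2 is illustrative, since the write-up only fixes the default weights:

# Blend semantic similarity, popularity and recency into one score per song (sketch)
import numpy as np

def blended_scores(q, E, views, years, alpha=0.7, beta=0.2, gamma=0.1, sim_floor=0.2):
    sem = E @ q                                    # cosine similarity: rows of E and q are unit-norm
    pop = np.log1p(views) / np.log1p(views).max()  # log-normalised view count in [0, 1]
    rec = (years - years.min()) / max(years.max() - years.min(), 1)  # normalised release year
    score = alpha * sem + beta * pop + gamma * rec
    score[sem < sim_floor] = 0.0                   # drop weak lyric matches no matter how popular
    return score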

Diversity Filtering

To avoid returning multiple songs from the same artist, we enforce a per-artist cap and deduplicate titles across results. Candidates are drawn from a pool much larger than the final ten before filtering, so the result set stays full and varied even after pruning.
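A sketch of the filter, assuming candidates arrive pre-sorted by blended score; the default cap of one song per artist is illustrative:

# Keep at most max_per_artist songs per artist and deduplicate titles (sketch)
def diversify(candidates, k=10, max_per_artist=1):
    picked, per_artist, seen_titles = [], {}, set()
    for song in candidates:  # assumed sorted by blended score, descending
        artist, title = song["artist"], song["title"].lower()
        if per_artist.get(artist, 0) >= max_per_artist or title in seen_titles:
            continue
        picked.append(song)
        per_artist[artist] = per_artist.get(artist, 0) + 1
        seen_titles.add(title)
        if len(picked) == k:
            break
    return picked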

Image-Based Recommendations

Rather than re-embedding the full corpus with a multimodal model like CLIP — which would have required days of compute — we used an LLM-as-bridge approach: image → Gemini Vision → text description → existing text embedding pipeline. A pragmatic tradeoff that gave us multimodal recommendations without touching the precomputed index.
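A sketch of the bridge, again assuming the google-generativeai SDK; the vision model identifier and prompt are placeholders:

# Image-to-music: describe the image with Gemini, then reuse the text pipeline (sketch)
import google.generativeai as genai
from PIL import Image

vision = genai.GenerativeModel("gemini-2.5-flash")  # placeholder for whichever vision-capable model is used

def describe_image(path: str) -> str:
    img = Image.open(path)
    resp = vision.generate_content(
        ["Describe the mood, emotion and atmosphere of this image in one or two sentences.", img]
    )
    return resp.text

# The description is then passed through enrich() and encode_query() exactly like a typed query.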

Explainability

All explanations are generated in a single Gemini API call, returning a structured JSON array of one-sentence explanations for all 10 results simultaneously. This keeps latency low while making every recommendation interpretable.
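A sketch of that single batched call; the prompt wording and function name are illustrative:

# Explain all 10 recommendations in one LLM call (sketch)
import json

def explain_results(llm, query: str, songs: list) -> list:
    listing = "\n".join(f'{i + 1}. "{s["title"]}" by {s["artist"]}' for i, s in enumerate(songs))
    prompt = (
        f'The user asked for: "{query}".\n'
        "For each song below, write one sentence on why it fits that vibe.\n"
        f"{listing}\n"
        "Return a JSON array of strings, one per song, in order."
    )
    resp = llm.generate_content(
        prompt,
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(resp.text)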

The Stack

  • Embeddings: sentence-transformers/all-MiniLM-L6-v2
  • Similarity search: scikit-learn cosine similarity
  • Query enrichment: Gemini 2.5 Flash
  • Image understanding: Gemini Vision
  • Song metadata: Spotify Web API
  • Backend: FastAPI + Uvicorn
  • Frontend: React + Vite

Challenges We Ran Into

Spotify mismatch. The Spotify search API returns something even when the exact song isn't on the platform — leading to confidently wrong links. We implemented a fuzzy match validator that compares the returned title and artist against the query before accepting the result, returning null rather than a wrong link.
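A minimal sketch of the validator using difflib; the 0.7 threshold and function names are illustrative, and the result shape follows the Spotify Web API track object:

# Reject Spotify search hits that don't actually match the requested song (sketch)
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def validate_track(track: dict, title: str, artist: str, threshold: float = 0.7):
    if track is None:
        return None
    title_ok = similar(track["name"], title) >= threshold
    artist_ok = similar(track["artists"][0]["name"], artist) >= threshold
    return track if (title_ok and artist_ok) else None  # better no link than a confidently wrong one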

Compute constraints. Embedding 500,000 lyric documents is expensive. We precomputed the entire matrix offline, reducing inference-time work to a single cosine similarity pass against the query vector, fast enough to run in real time on CPU.

Time constraints. With 36 hours on the clock, every architectural decision was a tradeoff between correctness and speed. We made deliberate choices to keep the stack simple — a single Python module for the pipeline, FastAPI for the backend, precomputed embeddings on disk — rather than reaching for more sophisticated infrastructure that would have consumed the clock.

Multimodal tradeoff. Rather than re-embedding the full corpus with CLIP, we used an LLM-as-bridge (image → text → embedding), which was a pragmatic hackathon tradeoff that let us ship multimodal support without rebuilding the index.


Accomplishments That We're Proud Of

  • Building a semantic search pipeline over half a million songs that runs on CPU in real time
  • Shipping a fully working image-to-music recommendation feature in under 36 hours
  • Designing an explainability layer that gives every recommendation a human-readable reason without adding significant latency
  • Keeping the entire pipeline clean enough that the same recommend() function powers both text and image queries

What We Learned

  • Semantic search over lyrics is surprisingly effective at capturing feeling, not just topic
  • LLM enrichment as a preprocessing step dramatically improves retrieval quality for short, abstract queries
  • Batching LLM calls and deferring external API lookups to the end of a pipeline makes a measurable difference in perceived responsiveness
  • Shipping something real in 36 hours requires ruthless prioritisation — good enough architecture that works beats perfect architecture that doesn't ship

What's Next for Vibe Engine

  • Faster search — swap cosine similarity over the full matrix for FAISS approximate nearest neighbour search to cut query latency on larger corpora (a sketch of this follows the list)
  • True multimodal embeddings — replace the LLM bridge with a CLIP-based index so image queries match directly in embedding space without the text intermediate
  • Personalisation — let users thumbs-up or thumbs-down results to shift the query vector in real time
  • Playlist generation — extend from 10 songs to a full coherent playlist with arc and energy curve
  • Expanded corpus — move beyond lyrics to incorporate audio features, tempo, and key for a richer match signal
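A minimal sketch of the FAISS swap, assuming the precomputed matrix stays L2-normalised so inner product equals cosine similarity; the index choice and file name are illustrative:

# FAISS index over the existing lyric embeddings (sketch)
import faiss
import numpy as np

E = np.load("lyric_embeddings.npy")    # (500000, 384) float32, L2-normalised
index = faiss.IndexFlatIP(E.shape[1])  # exact inner-product search; an IVF index would make it approximate
index.add(E)

def search(q: np.ndarray, k: int = 200):
    scores, ids = index.search(q.reshape(1, -1).astype(np.float32), k)
    return ids[0], scores[0]           # candidate row indices and their cosine similarities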

Built at Hackalytics 2026.
