Inspiration
Apps like Shazam are great at one thing: identifying a song you're currently hearing. But that's not how most "what's that song" moments actually happen. More often, you're left with a fragment — a half-remembered lyric, a mood, a genre, the way a melody felt — and no audio to feed into a fingerprinting algorithm. There was no good tool for that gap: searching for a song by description rather than by recording. SongSense was built to fill it.
What it does
SongSense lets users search for a song using any combination of four loose, fragmentary inputs — a hummed/recorded melody, a genre, a lyric snippet, or free-text mood/description — and returns the top 5 closest matches. None of the inputs need to be exact or complete; the whole point is supporting the "I don't really remember, but it felt like..." search.
How we built it
- Frontend: React Native (Expo SDK 54), with three tabs (Search, History, Saved) built using React Navigation, running cross-platform on iOS, Android, and web.
- Backend: FastAPI serving a local REST API on port 8000, backed by SQLite (a songs catalog, search_history, and saved_songs tables).
- AI matching: every song is flattened into one descriptive string (title, artist, genre, mood keywords, lyric snippet, description) and embedded with all-MiniLM-L6-v2 (sentence-transformers) at startup. A user's query is embedded the same way, and ranked against all song vectors by cosine similarity:
$$\text{sim}(q, s) = \frac{q \cdot s}{|q| |s|}$$ Since every embedding is pre-normalized to unit length, this reduces to a single dot product, so the whole catalog can be ranked with one matrix-vector multiply (song_matrix @ q) instead of a per-song loop.
Audio: 30-second previews are pulled from Wikimedia Commons (public domain, for classical pieces) and the iTunes Search API (for modern songs), played inline via expo-av.
Challenges we ran into
The hardest part wasn't the plumbing — it was getting the matching itself to feel right. A few specific issues:
- Field weighting. Concatenating title, artist, genre, mood, lyric, and description into one string means a strong match on, say, mood keywords can get diluted by irrelevant noise elsewhere in the string. Tuning what goes into the descriptive text (and how it's phrased) mattered more than expected.
- Score interpretability. Cosine similarity scores don't have an obvious "good match" threshold — a 0.45 might be a strong match for one query and a weak one for another, so deciding how to rank/display confidence took iteration.
- Designing for a second signal in advance. Knowing that hum-based melody matching was coming in Phase 2, the matching engine had to be architected to blend two independent similarity scores (text + audio) via a weighted sum, without knowing yet what scale or distribution the audio scores would have. That meant building the normalization and blending logic defensively, ahead of having real data to test it against.
What we learned
- *How to turn a fuzzy, human "vibe" query into something a vector space can actually rank.
- That embedding-based search needs content design, not just model selection — what you embed matters as much as which model embeds it.
- How to architect a multi-signal ranking system (text now, audio later) so a new scoring signal can be dropped in without reworking the pipeline.
What's next
Implementing real melody/hum matching to replace the score_hum() stub, tuning the text/hum blend weights once both signals are live, migrating the databases to more formal database instead of sqlite, and expanding the song database.
Built With
- expo-av
- expo.io
- fastapi
- itunes-api
- numpy
- python
- react-native
- react-navigation
- sentence-transformers
- sqlite
- uvicorn
Log in or sign up for Devpost to join the conversation.