Inspiration

We've all been there, sitting with friends, trying to describe a video we saw: "you know that one video of IShowSpeed jumping over a car" or "the one audio that goes like when I'm with youuu, I don't want to be with youuu" or "the video of the cat doing the Scuba dance?" You know exactly what you're looking for, but you can't find it. TikTok and Instagram search only work with exact keywords, and Google doesn't index short-form video content well. We wanted to build something that lets you search the way you actually remember, by vibe, by sound, by what you saw, or by what someone said.

What it does

Vibely is a multimodal search engine for short-form content (Instagram Reels, TikToks, YouTube Shorts). Instead of relying on keywords, users can search by:

  • Text — describe what you saw in your own words ("that video where the chicken is in a flood")
  • Audio — sing the audio, say a quote, or play a clip and we'll match it
  • Image — upload a screenshot or photo and find visually similar posts
  • Video — upload a clip to find the same moment elsewhere

An AI intent layer automatically detects whether you're describing, reenacting, quoting, or vibing, and adjusts the search weights across modalities accordingly. Users can also pause on a frame to identify products and items on screen.
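The weight-adjustment idea can be sketched roughly like this. In the real system an LLM does the classification; here a toy keyword heuristic stands in, and the intent names and weight values are illustrative, not the production configuration:

```python
# Sketch of the intent -> modality-weight mapping (values are illustrative).
# The real classifier is an LLM; this keyword heuristic is a stand-in.

INTENT_WEIGHTS = {
    "describing": {"text": 0.2, "visual": 0.2, "audio": 0.1, "description": 0.5},
    "quoting":    {"text": 0.3, "visual": 0.1, "audio": 0.5, "description": 0.1},
    "vibing":     {"text": 0.25, "visual": 0.25, "audio": 0.25, "description": 0.25},
}

def classify_intent(query: str) -> str:
    """Toy stand-in for the LLM intent classifier."""
    q = query.lower().strip()
    if q.startswith(('"', "'")) or "goes like" in q:
        return "quoting"          # user is reciting the audio/dialogue
    if "video where" in q or "the one" in q:
        return "describing"       # user is describing what happens on screen
    return "vibing"

def weights_for(query: str) -> dict:
    """Return the per-modality search weights for a query."""
    return INTENT_WEIGHTS[classify_intent(query)]
```

A "describing" query like "that video where the chicken is in a flood" gets most of its weight on the LLM-generated action description, while a quoted lyric leans on the audio embedding.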

How we built it

Backend (Python/FastAPI):

  • Apify scrapes content metadata (captions, likes, views, thumbnails, video URLs) from Instagram
  • Each piece of content gets 4 separate embeddings via Google Gemini Embeddings 2: text (caption + hashtags + transcript), visual (video frames), audio (extracted audio track), and description (an LLM-generated summary of the action in the video)
  • All embeddings are stored in MongoDB Atlas Vector Search with a consolidated index across all 4 vector fields
  • Search runs 4 parallel vectorSearch queries (one per modality) and fuses results using weighted score fusion, where weights are controlled by an intent classifier
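The fusion step can be sketched as follows. This assumes each of the four `$vectorSearch` queries has already returned a list of `(doc_id, score)` pairs with scores normalized to [0, 1]; the function and field names are ours for illustration:

```python
from collections import defaultdict

def fuse(results_by_modality: dict[str, list[tuple[str, float]]],
         weights: dict[str, float]) -> list[tuple[str, float]]:
    """Combine per-modality (doc_id, score) hit lists into one ranking.

    Each document's fused score is the weight-scaled sum of its
    per-modality scores; documents missing from a modality simply
    contribute nothing for that modality.
    """
    fused = defaultdict(float)
    for modality, hits in results_by_modality.items():
        w = weights.get(modality, 0.0)
        for doc_id, score in hits:
            fused[doc_id] += w * score
    # Highest fused score first
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

With intent-controlled weights, a document that scores moderately on two relevant modalities can outrank one that scores highly on a single irrelevant one.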

Frontend (React + Swift):

  • Web app with a warm, editorial design built in React
  • Floating search pill with mode chips (Text, Image, Video, Audio)
  • Real microphone recording via the MediaRecorder API for audio search
  • Video autoplay on hover with masonry result grid
  • Mobile app built in parallel for on-the-go search using Swift

Infrastructure:

  • FastAPI serves both the API and frontend from a single server
  • ngrok provides public URL tunneling for the mobile app to reach the backend
  • MongoDB Atlas for vector storage

Challenges we ran into

  • Audio in video is invisible to embeddings. We discovered that Gemini's video embedding model completely ignores the audio track in video files. This meant a video's sound (music, dialogue, effects) was lost during embedding. We had to build a pipeline that extracts audio from video using ffmpeg and embeds it separately, effectively splitting each video into three modalities.
  • Visual embeddings are heavily influenced by appearance. Two videos of the same action in different lighting produce distant embeddings because color, brightness, and framing dominate over motion. We solved this by adding an LLM-generated action description as a 4th embedding, creating a layer that captures what happens regardless of how it looks.
  • Difficulty getting a large content pool. Downloading videos and computing embeddings across four modalities takes significant compute, so our demo pool is small. Scaling ingestion to hundreds of videos and beyond is a next step toward making this multimodal search as useful as possible.
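The audio-extraction pipeline from the first challenge can be sketched with ffmpeg driven from Python. The flags shown are standard ffmpeg options; the mono/16 kHz output format is our assumption, not a stated detail of the project:

```python
import subprocess

def extract_audio_cmd(video_path: str, audio_path: str) -> list[str]:
    """Build the ffmpeg command that strips the audio track out of a
    video file so it can be embedded separately from the frames."""
    return [
        "ffmpeg", "-y",       # overwrite the output file if it exists
        "-i", video_path,     # input video
        "-vn",                # drop the video stream entirely
        "-ac", "1",           # downmix to mono (assumption)
        "-ar", "16000",       # resample to 16 kHz (assumption)
        audio_path,           # e.g. "clip.wav"
    ]

def extract_audio(video_path: str, audio_path: str) -> None:
    """Run the extraction, raising if ffmpeg exits non-zero."""
    subprocess.run(extract_audio_cmd(video_path, audio_path), check=True)
```

The extracted file is then embedded on its own, turning each scraped video into separate visual and audio signals.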

Accomplishments that we're proud of

  • Built a complete end-to-end multimodal embedding pipeline from scraping to embedding to search in under 24 hours
  • Successfully implemented 4-modality search (text, visual, audio, description) with weighted fusion that adapts to query intent
  • Got a web app, mobile app, and backend all functioning together seamlessly through a single API
  • Solved the audio-in-video embedding gap that would have made audio search impossible

What we learned

  • Modality separation is critical for multimodal search: embedding a video as a single blob loses too much signal. Splitting into text, visual, audio, and description gives the search engine the ability to match on the right dimension for each query type.
  • Visual embeddings are not as robust as we expected: they're heavily skewed by color, lighting, angles, and framing. A semantic text description of the action ("cat jumps into pool") is far more reliable for matching across different recordings of similar events.
  • The intent behind a query matters as much as the query itself — "that video where the cat jumps" and "the one that goes meow splash" describe the same content but need completely different search strategies.

What's next for Vibely

  • Scale the content pool — expand from ~20 pieces of content to thousands, pulling from TikTok, Instagram, YouTube Shorts, Reddit, and Twitter/X
  • Real-time ingestion — automatically scrape and embed trending content as it's posted, keeping the search pool fresh
  • User accounts and history — save searches, bookmark posts, and build personalized recommendations based on your vibe
  • Go to market — put the product in front of real users for testing and validation
