Inspiration

We were inspired by Shazam’s ability to recognize a song from just a short fragment of audio. That sparked a question:

Why doesn’t something like this exist for movies and TV shows?

People often remember scenes, emotions, or partial plots but struggle to recall the title. We built CineRecallAI to bring that same “aha!” moment to movies and TV shows.


What it does

CineRecallAI helps users rediscover movies and TV shows from imperfect memories.

Users can:

  • Describe a vague plot, scene, quote, or “vibe”
  • Submit a YouTube clip link
  • Receive the most semantically relevant matches

Each result includes:

  • Title
  • Confidence score
  • Plot/overview snippet
  • AI-generated explanation of why it matched

We integrated the Gemini API, which:

  • Expands vague queries into richer descriptions
  • Generates natural-language explanations
  • Summarizes top matches
  • Describes the scenes in a YouTube clip so that the description can be used as the search query

The app also logs searches and provides an analytics dashboard to visualize usage patterns.
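
As a rough illustration of the logging side, here is a minimal sketch that appends each search to a JSON Lines file and counts searches per input type. The file layout, the field names, and both helper functions are assumptions made for this example, not the app's actual schema.

```python
import json
import time
from collections import Counter
from pathlib import Path


def log_search(log_path, query, input_type, top_title, confidence):
    """Append one search event as a JSON line (hypothetical schema)."""
    event = {
        "ts": time.time(),
        "query": query,
        "input_type": input_type,  # assumed values: "text" or "youtube"
        "top_title": top_title,
        "confidence": confidence,
    }
    with Path(log_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def count_by_input_type(log_path):
    """Count logged searches per input type, ready to feed a chart."""
    counts = Counter()
    with Path(log_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[json.loads(line)["input_type"]] += 1
    return dict(counts)
```

The resulting counts could be turned into a small table and passed to an Altair bar chart in the dashboard.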


How we built it

CineRecallAI uses a scalable semantic retrieval pipeline:

  • Cleaned and processed a large-scale dataset (~1M movies & TV shows)
  • Created a unified search_text representation
  • Generated dense embeddings using SentenceTransformers
  • Stored embeddings in Actian VectorAI DB
  • Performed high-performance K‑NN vector similarity search
  • Integrated the Gemini API for:
    • Query expansion
    • Result explanations
    • Clip-to-scene understanding
  • Built an interactive Streamlit UI
  • Implemented logging & analytics with Altair visualizations
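
The unified search_text step above could look something like the sketch below. The field names (title, tagline, overview, genres) and the " | " separator are assumptions for illustration; the real dataset columns may differ.

```python
def build_search_text(record):
    """Join the non-empty text fields of one record into a single string.

    The field names are assumed for illustration, not taken from the dataset.
    """
    parts = []
    for field in ("title", "tagline", "overview", "genres"):
        value = record.get(field)
        if isinstance(value, (list, tuple)):
            # Flatten list-like fields such as genres, skipping empty entries.
            value = ", ".join(str(v) for v in value if v)
        if value is None:
            continue
        text = " ".join(str(value).split())  # collapse stray whitespace
        if text:
            parts.append(text)
    return " | ".join(parts)
```

Each search_text string would then be passed to the embedding model, so that one vector per title captures all of its descriptive fields.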

Actian VectorAI DB enabled fast similarity search across the ~1M-record dataset, something we found impractical with brute-force in-memory cosine similarity alone.
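
For reference, this is roughly what the brute-force alternative looks like: a pure-Python sketch of cosine-similarity top-k search over toy vectors. It does not call SentenceTransformers or Actian VectorAI DB; the function names and the (title, vector) index layout are assumptions for this example. Every query scans every stored vector, which is the cost a vector database is meant to avoid at the ~1M scale.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """Return the k best (title, score) pairs, highest score first.

    index is a list of (title, vector) pairs; every vector is scanned.
    """
    scored = ((cosine_similarity(query_vec, vec), title) for title, vec in index)
    best = heapq.nlargest(k, scored)
    return [(title, score) for score, title in best]
```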


Challenges we ran into

  • Cleaning and standardizing noisy dataset fields
  • Scaling from thousands of records to ~1M
  • Managing embedding generation time & memory
  • Integrating Actian VectorAI DB
  • Designing meaningful confidence scoring
  • Implementing clip-to-search via Gemini
  • Debugging Streamlit state/rerun behavior
  • Resolving Git merge conflicts
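
On confidence scoring, one simple option is to clamp the cosine similarity between a floor and a ceiling and rescale it to 0-100. The sketch below shows that idea only; the function name and the 0.2 / 0.8 thresholds are made-up values for illustration, not the formula the app uses, and the result is a display score rather than a calibrated probability.

```python
def confidence_from_similarity(sim, floor=0.2, ceiling=0.8):
    """Map a cosine similarity to a 0-100 score by clamped linear rescaling.

    floor and ceiling are illustrative guesses, not tuned values.
    """
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    scaled = (sim - floor) / (ceiling - floor)
    return round(100 * min(1.0, max(0.0, scaled)), 1)
```

Similarities at or below the floor show as 0, those at or above the ceiling show as 100, and everything in between is spread linearly.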

Accomplishments that we're proud of

  • Building a semantic retrieval engine from scratch
  • Scaling search to ~1M movies & TV shows
  • Integrating Actian VectorAI DB for vector search
  • Adding Gemini-powered query understanding
  • Implementing YouTube clip-based discovery
  • Designing a full-stack prototype (ML + UI + analytics)
  • Creating explainable AI-driven search results

What we learned

  • Differences between TF‑IDF and embedding-based retrieval
  • How cosine similarity operates in vector search
  • Challenges of large-scale semantic indexing
  • How vector databases enable scalable AI systems
  • How LLMs enhance retrieval via reasoning & expansion
  • Tradeoffs between accuracy, latency, and complexity
  • Importance of modular design & version control
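
The TF‑IDF versus embeddings point can be seen in a tiny example. The sketch below is a pure-Python TF‑IDF scorer using one common smoothed IDF variant; the function names are assumptions for this example. A query that paraphrases a plot without sharing any words with it scores exactly zero, which is the gap dense embeddings are meant to close.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectors, idf): one {term: weight} dict per doc, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf


def sparse_cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def tfidf_query_score(query, doc_vec, idf):
    """Score a query against one document; unseen query terms are dropped."""
    tf = Counter(query.lower().split())
    q_vec = {t: c * idf[t] for t, c in tf.items() if t in idf}
    return sparse_cosine(q_vec, doc_vec)
```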

What's next for CineRecallAI

  • Multimodal search (image/audio embeddings)
  • Quote & dialogue-based retrieval
  • Hybrid retrieval (keywords + vectors + LLM)
  • Improved ranking & confidence calibration
  • Personalization & recommendations
  • Production deployment & scaling
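
For the hybrid retrieval item, one widely used way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is only one possible approach, not a committed design; the function name is an assumption, and k=60 is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Items that appear near the top of both lists rise above items that rank highly in only one, without needing the two scoring scales to be comparable.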
