Inspiration
Video is one of the richest formats for storing knowledge, yet it remains almost entirely unsearchable. Lectures, product demos, meeting recordings, and training videos represent enormous institutional value — but the only way to find anything inside them is to watch them in full. We wanted to change that. The question that drove us was simple: what if you could talk to a video the same way you talk to a colleague who watched it for you?
What it does
Video-MCP-RAG lets users upload any video and interact with it through natural language. Once a video is processed, users can:
- Ask factual questions — "What did the speaker say about the budget?" and receive a grounded answer drawn directly from the transcript and scene captions
- Find specific moments — "Show me when the product is demonstrated" and receive a trimmed clip of exactly that scene
- Search by image — upload a reference image and find the most visually similar moment in the video
- Extract by time — "Give me the first 30 seconds" returns that clip immediately
How we built it
The system runs as two services connected via the Model Context Protocol (MCP):
MCP Tool Server (FastMCP) owns all video intelligence — frame extraction, audio transcription via AWS Transcribe, batched scene captioning with Nova Pro, and similarity search powered by Amazon Titan Multimodal Embeddings. It exposes four typed tools: get_video_clip_from_query, get_video_clip_from_image, get_video_clip_by_time, and ask_question_about_video.
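The tool surface can be sketched as a plain name-to-handler registry. The tool names below come from the list above, but the handlers are illustrative stubs, not the real FastMCP implementations:

```python
from typing import Any, Callable, Dict

# Stub handlers standing in for the real video-intelligence code;
# each one just echoes its request so the dispatch shape is visible.
def get_video_clip_from_query(video_id: str, query: str) -> dict:
    return {"tool": "get_video_clip_from_query", "video_id": video_id, "query": query}

def get_video_clip_from_image(video_id: str, image_path: str) -> dict:
    return {"tool": "get_video_clip_from_image", "video_id": video_id, "image": image_path}

def get_video_clip_by_time(video_id: str, start_s: float, end_s: float) -> dict:
    return {"tool": "get_video_clip_by_time", "video_id": video_id,
            "start_s": start_s, "end_s": end_s}

def ask_question_about_video(video_id: str, question: str) -> dict:
    return {"tool": "ask_question_about_video", "video_id": video_id,
            "question": question}

# Name -> handler registry mirroring the MCP tool listing.
TOOLS: Dict[str, Callable[..., dict]] = {
    f.__name__: f
    for f in (get_video_clip_from_query, get_video_clip_from_image,
              get_video_clip_by_time, ask_question_about_video)
}

def call_tool(name: str, **kwargs: Any) -> dict:
    return TOOLS[name](**kwargs)
```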
FastAPI Server hosts the NovaAgent, which runs a four-stage pipeline on every user message: routing (Nova Lite classifies intent), tool selection (Nova Pro picks the right tool via structured JSON output), tool execution (MCP call), and response generation (Nova Pro wraps the result in natural language).
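The four stages can be sketched as a pure pipeline with each model call injected as a callable. The names and signatures here are stand-ins for illustration, not the real NovaAgent code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    intent: str
    tool: str
    answer: str

def run_pipeline(
    message: str,
    route: Callable[[str], str],         # stage 1: intent routing (Nova Lite)
    select_tool: Callable[[str], str],   # stage 2: tool selection (Nova Pro)
    execute: Callable[[str, str], str],  # stage 3: tool execution (MCP call)
    respond: Callable[[str], str],       # stage 4: response generation (Nova Pro)
) -> AgentResult:
    intent = route(message)
    tool = select_tool(message)
    raw_result = execute(tool, message)
    return AgentResult(intent=intent, tool=tool, answer=respond(raw_result))
```

Injecting the stages keeps the control flow testable without touching Bedrock.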
The key architectural decision was using Amazon Titan Multimodal Embeddings V1 for every modality — text, images, and video frames all live in a single unified vector space. This eliminated the need for separate CLIP and text-embedding models and made cross-modal search work naturally.
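A Titan Multimodal Embeddings request can carry text, an image, or both in one body, which is what makes the single vector space work. A minimal request builder, assuming the documented field names (`inputText`, `inputImage`, `embeddingConfig`); verify them against the Bedrock docs for your model version:

```python
import base64
import json
from typing import Optional

def titan_multimodal_body(text: Optional[str] = None,
                          image_bytes: Optional[bytes] = None,
                          dim: int = 1024) -> str:
    # Builds the JSON request body for Titan Multimodal Embeddings.
    # Text and image inputs embed into the same vector space, so the
    # same call serves transcript chunks, captions, and query images.
    body: dict = {"embeddingConfig": {"outputEmbeddingLength": dim}}
    if text is not None:
        body["inputText"] = text
    if image_bytes is not None:
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(body)

# Usage with boto3 (not run here):
# bedrock.invoke_model(modelId="amazon.titan-embed-image-v1",
#                      body=titan_multimodal_body(text="a product demo"))
```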
All structured LLM outputs are Pydantic-validated using Bedrock's forced tool-use mode, so there is no string parsing anywhere in the agent pipeline. Pixeltable serves as both the vector store and the conversation memory backend.
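The forced tool-use trick works like this: the Converse API's `toolChoice` forces the model to call a named tool whose input schema is your desired output schema, so the reply arrives as structured arguments rather than free text. A sketch with a dataclass standing in for Pydantic to stay dependency-free; the `emit_route` schema is hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouteDecision:
    needs_tool: bool
    reason: str

def forced_tool_config(name: str, schema: dict) -> dict:
    # Bedrock Converse toolConfig that *forces* the named tool, so the
    # model must answer with arguments matching `schema`.
    return {
        "tools": [{"toolSpec": {"name": name,
                                "inputSchema": {"json": schema}}}],
        "toolChoice": {"tool": {"name": name}},
    }

def parse_route(tool_use_input: dict) -> RouteDecision:
    # Validation step: a missing field raises KeyError here, so a bad
    # model response fails loudly instead of being string-parsed.
    return RouteDecision(needs_tool=bool(tool_use_input["needs_tool"]),
                         reason=str(tool_use_input["reason"]))
```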
Challenges we ran into
Unified embedding space — early versions used separate models for image and text embeddings, which made cross-modal retrieval unreliable. Switching everything to Titan Multimodal Embeddings solved this cleanly.
MCP health checking — the FastMCP server returns HTTP 406 to plain GET requests (it expects SSE headers), which broke our startup health check. We had to detect 406 as a valid "server is up" signal rather than treating it as a failure.
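The fix boils down to treating 406 as a liveness signal. A minimal sketch of the startup probe (the probe details are illustrative, not our exact code):

```python
import urllib.error
import urllib.request

def status_means_up(status: int) -> bool:
    # 2xx is a normal healthy response; 406 means FastMCP rejected the
    # plain GET's Accept header, which still proves the server is listening.
    return 200 <= status < 300 or status == 406

def probe(url: str, timeout: float = 2.0) -> bool:
    # One-shot startup health check against the MCP endpoint.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return status_means_up(resp.status)
    except urllib.error.HTTPError as e:
        return status_means_up(e.code)
    except (urllib.error.URLError, OSError):
        return False
```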
Gunicorn worker count — the default formula (2 × CPU + 1) spawned too many workers on constrained cloud environments, exhausting available memory. We added an explicit worker cap for deployment.
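The capped formula is a one-liner (the cap value below is illustrative):

```python
import multiprocessing

def worker_count(cap: int = 4) -> int:
    # Gunicorn's rule of thumb is (2 * CPU + 1), which overshoots on
    # small cloud instances and exhausts memory, so we cap it.
    return min(2 * multiprocessing.cpu_count() + 1, cap)

# Usage: in gunicorn.conf.py, set `workers = worker_count()`.
```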
Video processing latency — per-frame Nova Pro captioning was too slow and expensive. Batching consecutive frames (5 per API call) reduced both latency and cost by roughly 3x while producing better captions because the model has temporal context.
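The batching itself is simple chunking of consecutive frames (batch size 5 as described above); the captioning call that consumes each batch is omitted:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def batch_frames(frames: Sequence[T], size: int = 5) -> List[Sequence[T]]:
    # Group consecutive frames so one Nova Pro call captions a whole
    # run of frames with temporal context, cutting API calls by ~size.
    return [frames[i:i + size] for i in range(0, len(frames), size)]
```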
Accomplishments that we're proud of
- A fully working multimodal RAG pipeline that searches across transcript chunks and scene captions in a single unified vector space
- Structured JSON output at every LLM step with zero string parsing — every routing, tool-selection, and generation call is Pydantic-validated
- A clean two-service architecture using MCP that keeps video intelligence fully decoupled from the agent layer
- Full observability with Opik tracing every LLM call end-to-end
What we learned
- Amazon Titan Multimodal Embeddings is a genuinely powerful simplification for multimodal systems — one model, one vector space, no glue code
- MCP is an elegant pattern for separating tool logic from agent logic, and FastMCP makes it very fast to build production-ready tool servers
- Nova Lite is surprisingly capable for binary routing decisions and adds almost no latency; the two-model split (Lite for routing, Pro for reasoning) is a pattern we will reuse
- Batching multimodal inputs to Nova Pro significantly reduces both cost and latency without sacrificing quality
What's next for Video-MCP-RAG
- Multi-video search — query across an entire library of videos at once
- Automatic chapter generation — use Nova Pro to segment and label videos into named chapters automatically
- Speaker diarization — identify and label individual speakers in the transcript so users can search by speaker name
- S3 integration — connect directly to existing video libraries in S3 without manual upload
- Nova Act integration — use Nova Act to automate video upload and processing workflows across web-based video platforms
Built With
- amazon-bedrock
- amazon-nova-lite
- amazon-nova-pro
- amazon-titan-multimodal-embeddings
- aws-transcribe
- fastapi
- fastmcp
- ffmpeg
- gunicorn
- moviepy
- pixeltable
- pyav
- pydantic
- python
- react
- typescript
- uvicorn