Inspiration

Everyone uses LLMs these days. If you want web data, you turn on web search. If you want documents, you upload them. But what about video data? It's the most information-dense source, yet it's one we can't access, or can only access at great expense. Our key insight: every modality of a video, be it audio, visual frames, or on-screen text via OCR, can be represented as time-indexed textual information. So we built FrameChat, letting non-technical folks talk to videos as just another RAG source and letting multinational companies unlock new workflows for their video data.

What it does

FrameChat makes video data as accessible as text documents. It transforms videos into time-indexed, structured data that can be queried through natural language. Upload any video and FrameChat:

  1. Extracts visual information from intelligently sampled frames with precise timestamps
  2. Transcribes audio with speaker labels and timing
  3. Creates a unified timeline merging all modalities into searchable text
  4. Enables conversational queries, like asking a document chatbot, but for video

The result? Video becomes just another data source in your RAG pipeline. Ask "What happened at 12:45?" or "Summarize the second half" and get instant, contextualized answers. All data is exportable for analytics, search indexing, or custom integrations.
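The unified timeline behind those four steps can be pictured as a list of time-stamped text records. A minimal sketch (the field names here are our own illustration, not FrameChat's actual schema):

```python
# Illustrative time-indexed output: one record per visual or audio event.
timeline = [
    {"t": 765.0, "source": "frame", "text": "Slide titled 'Q3 results' appears"},
    {"t": 765.4, "source": "audio", "speaker": "Host", "text": "Let's look at Q3."},
]

def query_at(timeline, seconds, window=5.0):
    """Return timeline entries within `window` seconds of the requested time."""
    return [e for e in timeline if abs(e["t"] - seconds) <= window]

# "What happened at 12:45?" -> look around second 765.
print(len(query_at(timeline, 12 * 60 + 45)))  # 2
```

Because every record is plain text with a timestamp, the same list can feed a chat prompt, a search index, or an export file.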

How we built it

We architected FrameChat as a parallel processing pipeline that treats video as multi-modal, time-indexed data.

Tech stack:

  - AWS Bedrock (Nova models) for visual frame analysis
  - AWS Transcribe for audio-to-text with timestamps
  - OpenCV & FFmpeg for intelligent frame extraction
  - S3 for scalable frame storage
  - Python ThreadPoolExecutor for parallel processing
  - A clean, minimalistic user interface
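A rough sketch of the parallel fan-out, assuming the simplest shape: frames analyzed concurrently while transcription runs alongside. The worker bodies below are stand-ins; the real pipeline calls Bedrock and Transcribe.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_frame(frame_id):
    # Stand-in for a Bedrock Nova vision call on one extracted frame.
    return {"frame": frame_id, "caption": f"caption for frame {frame_id}"}

def transcribe(audio_path):
    # Stand-in for running an AWS Transcribe job to completion.
    return [{"t": 0.0, "text": "transcript segment"}]

# Kick off transcription, then fan frame analysis out across the pool.
with ThreadPoolExecutor(max_workers=8) as pool:
    transcript_future = pool.submit(transcribe, "audio.wav")
    captions = list(pool.map(analyze_frame, range(4)))
    transcript = transcript_future.result()

print(len(captions), len(transcript))  # 4 1
```

Because the per-frame calls are I/O-bound API requests, threads (rather than processes) are enough to overlap them.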

Challenges we ran into

  1. The Version Mismatch Mystery: We spent hours debugging why S3 URIs weren't working with Bedrock. The culprit? Mismatched boto3 (1.34.113) and botocore (1.34.162) versions were causing silent API validation failures. Always check your dependencies!
  2. Quality vs. Cost Tradeoff: Processing every frame would be prohibitively expensive and slow. We developed an intelligent sampling algorithm that analyzes the visual difference between consecutive frames and only processes those with meaningful changes. This cut costs by 85% while still capturing all key moments.
  3. Multi-Modal Time Synchronization: Visual events and audio transcripts operate at different temporal resolutions. Creating a unified timeline required careful timestamp alignment and a second AI pass to cross-reference modalities (e.g., "goal scored" in the visuals matches "crowd erupts" in the audio).
  4. Chat State Management: Our initial implementation created infinite loops in which the AI repeated question-answer pairs. The issue was managing when messages get added to chat history versus when they're sent to the model, and it took careful state management to fix.
  5. Making Video "RAG-Ready": Traditional RAG works with chunks of text, but video data is inherently temporal and multi-modal. We had to design a representation that preserves time relationships while remaining semantically searchable, essentially creating "video embeddings" that maintain temporal context.
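The intelligent sampling in challenge 2 boils down to keeping a frame only when it differs enough from the last kept one. A pure-Python sketch with an illustrative threshold (the real pipeline compares OpenCV frames, e.g. via `cv2.absdiff`, and the threshold is tuned per video):

```python
def mean_abs_diff(a, b):
    """Mean absolute pixel difference between two flattened grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def select_keyframes(frames, threshold=12.0):
    """Keep frame i only if it differs enough from the last *kept* frame.
    The threshold value here is illustrative, not a tuned constant."""
    kept = [0]  # always keep the first frame
    last = frames[0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], last) > threshold:
            kept.append(i)
            last = frames[i]
    return kept

# Synthetic clip: frame 1 repeats frame 0; frame 2 is a scene change.
frames = [[0] * 16, [0] * 16, [200] * 16]
print(select_keyframes(frames))  # [0, 2]
```

Comparing against the last kept frame (rather than the immediately preceding one) prevents slow pans from slipping under the threshold forever.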
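Challenge 3's unified timeline is, at its core, a tagged merge-and-sort over the per-modality event lists. A sketch assuming a simple record shape (the actual alignment also reconciles differing temporal resolutions):

```python
def merge_timelines(visual_events, transcript_segments):
    """Tag each entry with its modality and sort by timestamp so visually
    detected events and speech land next to each other on one timeline."""
    merged = ([dict(e, source="frame") for e in visual_events] +
              [dict(e, source="audio") for e in transcript_segments])
    return sorted(merged, key=lambda e: e["t"])

visual = [{"t": 2710.0, "text": "Ball crosses the goal line"}]
audio = [{"t": 2709.2, "text": "Crowd erupts"},
         {"t": 2712.5, "text": "That has to be the winner!"}]

for event in merge_timelines(visual, audio):
    print(f'{event["t"]:.1f} [{event["source"]}] {event["text"]}')
```

The second AI pass then reads this interleaved view, which is what lets it connect "goal scored" in the visuals with "crowd erupts" in the audio.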
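For challenge 4, the fix amounts to a strict turn discipline: append the user message exactly once, call the model on the full history, then append the reply. A sketch with a stub model (our real code calls Bedrock):

```python
def chat_turn(history, user_message, model):
    """One chat turn with strict ordering. Our bug was appending messages
    both before and after the model call, which duplicated each
    question-answer pair and fed the model its own repeated output."""
    history.append({"role": "user", "content": user_message})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

fake_model = lambda msgs: f"reply to message {len(msgs)}"  # stub for illustration
history = []
chat_turn(history, "What happened at 12:45?", fake_model)
chat_turn(history, "Summarize the second half", fake_model)
print(len(history))  # 4
```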
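One way to make the timeline "RAG-ready" (challenge 5) is to chunk it by time window so every retrievable chunk carries its temporal bounds. A sketch under that assumption; a production version might split on scene or topic boundaries instead:

```python
def chunk_timeline(timeline, window=30.0):
    """Group entries into fixed time windows; each chunk keeps start/end
    timestamps next to the merged text so retrieval preserves temporal
    context (window size and field names are illustrative)."""
    buckets = {}
    for entry in timeline:
        buckets.setdefault(int(entry["t"] // window), []).append(entry)
    return [{"start": k * window, "end": (k + 1) * window,
             "text": " ".join(e["text"] for e in group)}
            for k, group in sorted(buckets.items())]

timeline = [{"t": 3.0, "text": "Opening title card"},
            {"t": 12.0, "text": "Speaker: welcome everyone"},
            {"t": 41.0, "text": "Slide: Q3 results"}]
chunks = chunk_timeline(timeline)
print(len(chunks))  # 2
```

Each chunk's text can then be embedded like any document passage, while the start/end fields let answers cite exact timestamps.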

Accomplishments that we're proud of

  1. Made video queryable like documents, treating video as just another RAG source
  2. 60% faster processing through parallel analysis
  3. Time-indexed everything: every insight has precise timestamps
  4. Reasonably fast: can process a 7-minute video in under two minutes
  5. Cost-optimized: intelligent sampling cut API costs by more than 50% compared with processing the whole video
  6. Multi-modal synthesis: Successfully merged visual + audio into coherent timelines
  7. Accessible to non-technical users while powerful enough for enterprise workflows
  8. Most importantly, we proved that video can be treated as structured, queryable data, not just binary blobs to be stored and played back.

What we learned

Market:

  - Video data is the last major unstructured data frontier
  - Current solutions either don't exist or cost $10k+/month for enterprises
  - Non-technical users are desperate for this: they have video data but can't access it
  - The RAG framing makes video feel familiar instead of intimidating

Technical:

  - Video is just time-indexed, multi-modal data; treat it that way
  - Intelligent sampling > brute-force processing (faster, cheaper, nearly as accurate)
  - AWS Bedrock's Nova models are powerful but require careful version management
  - Parallel processing complexity pays off massively at scale
  - Time synchronization between modalities is harder than it looks

Product:

  - Timestamps are non-negotiable; every insight needs temporal context
  - The synthesis step (merging modalities) creates 10x more value than separate analyses
  - "Chat with video" sounds cool, but "video RAG" sells better to enterprises

What's next for FrameChat

Short-term (Making Video RAG Production-Ready):

  - Batch processing: analyze entire video libraries and create searchable indices
  - Vector embeddings: store video segments in vector DBs for semantic search
  - Custom schemas: let users define what data to extract (sports stats, meeting action items, security incidents)
  - API endpoints: RESTful API for programmatic access
  - Export formats: JSON, CSV, Parquet for data pipelines

Long-term (Video Data Infrastructure):

  - Video database layer: SQL-like queries for video ("SELECT moments WHERE action='goal' AND time > 45:00")
  - Embedding marketplace: pre-trained models for specific industries
  - Integration ecosystem: plug into Slack, Notion, Salesforce, and analytics tools
  - Video data warehouse: centralized storage + intelligence layer for all organizational video
  - Developer platform: let others build on our video intelligence API

Built With

Python, AWS Bedrock (Nova), AWS Transcribe, OpenCV, FFmpeg, Amazon S3
