Inspiration

The explosion of video content has created a paradox: we have more footage than ever, yet finding specific moments feels like searching for a needle in a haystack. We were inspired by the challenge of making video content as searchable as text. What if you could ask "show me clips of people laughing at a beach party" and instantly get results, without manual tagging or timestamps? VisionSeek Agent was born from this vision—to democratize video search using cutting-edge AI embeddings and hybrid search technology.

What it does

VisionSeek AI Agent is a comprehensive video discovery platform that combines:

  • Semantic Video Search: Find exact moments in videos using natural language queries like "person walking in park" or "sunset over mountains"
  • Automatic Video Processing: Upload videos to S3 and watch them automatically segment, analyze, and index without manual intervention
  • Dual-Mode Interface: Switch between video search mode for clip discovery and chat mode for conversational assistance
  • Hybrid Search Engine: Combines AI-powered vector similarity with traditional text matching for superior accuracy
  • Real-Time Clip Retrieval: Get instant access to relevant video segments with precise timestamps and secure playback URLs
  • Conversational Assistant: Ask questions and get guidance about your video library through natural dialogue

Scope & Purpose

  • Enable instant discovery of specific moments across large video libraries using semantic search
  • Automate the entire video indexing pipeline from upload to searchable embeddings
  • Provide content creators, marketers, and researchers with a modern self-service portal for video exploration
  • Eliminate manual tagging and timeline scrubbing through AI-powered content understanding
  • Deliver sub-second search responses across thousands of video clips

Target Audience

  • Content Creators managing extensive footage libraries and B-roll collections
  • Marketing Teams searching for specific brand moments across campaign videos
  • Researchers & Analysts exploring video datasets for patterns and insights
  • Media Production Houses organizing and retrieving archived content efficiently
  • E-learning Platforms helping students find specific lecture moments instantly

Platform Snapshot

  • Event-Driven AWS Architecture with automatic video processing on upload
  • FastAPI Backend + React Frontend for seamless user experience
  • AI-Powered Embeddings using Amazon Bedrock's Marengo model for video understanding
  • Hybrid Search combining vector similarity (k-NN) with text matching (BM25)
  • Production-Ready Deployment with scalable infrastructure and monitoring

Challenges we ran into

1. Embedding Generation at Scale

Challenge: Processing long videos (30+ minutes) generated hundreds of clip embeddings, causing memory issues and timeouts.

Solution: Implemented parallel Lambda invocations with Step Functions orchestration, processing clips in batches and streaming results to OpenSearch incrementally.
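The batching step behind this can be sketched as a simple chunker; `batch_clips` and the batch size of 10 are illustrative assumptions, not the project's actual code:

```python
from typing import Iterator


def batch_clips(clips: list[dict], batch_size: int = 10) -> Iterator[list[dict]]:
    """Yield fixed-size batches of clip metadata so each parallel Lambda
    invocation stays within its memory and timeout limits."""
    for i in range(0, len(clips), batch_size):
        yield clips[i:i + batch_size]


# Each batch would then become one item for a Step Functions Map state,
# which fans the work out across concurrent Lambda invocations.
```

Indexing each batch into OpenSearch as it completes (rather than after all batches finish) is what keeps peak memory flat regardless of video length.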

2. Hybrid Search Tuning

Challenge: Pure vector search missed exact keyword matches, while text-only search failed on semantic queries.

Solution: Developed a weighted hybrid search algorithm combining k-NN (cosine similarity) with BM25 text matching, tuning weights based on query characteristics.
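A minimal sketch of the score fusion, assuming cosine scores already lie in [0, 1] and BM25 scores are min-max normalized per result set; the weight `alpha = 0.6` is a hypothetical default, not the project's tuned value:

```python
def normalize(scores: list[float]) -> list[float]:
    """Min-max normalize raw BM25 scores into [0, 1] so they are
    comparable with cosine-similarity scores."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_score(knn_score: float, bm25_score: float, alpha: float = 0.6) -> float:
    """Blend the semantic (k-NN) and keyword (BM25) signals.
    Higher alpha favors semantic matches; lower alpha favors exact keywords."""
    return alpha * knn_score + (1 - alpha) * bm25_score
```

In practice `alpha` would shift per query: short keyword-like queries lean toward BM25, while descriptive natural-language queries lean toward the embedding side.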

3. Presigned URL Management

Challenge: Videos in private S3 buckets couldn't be played directly in the browser without exposing credentials.

Solution: Built s3_utils.py to generate time-limited presigned URLs (1-hour expiration) on-demand, balancing security with user experience.
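The core of such a helper is boto3's `generate_presigned_url`; the bucket layout and `clip_key` naming below are hypothetical stand-ins for whatever s3_utils.py actually uses:

```python
EXPIRATION_SECONDS = 3600  # 1-hour expiry, per the write-up


def clip_key(video_id: str, clip_index: int) -> str:
    """Hypothetical S3 key layout for a segmented clip."""
    return f"clips/{video_id}/segment_{clip_index:04d}.mp4"


def presigned_playback_url(bucket: str, video_id: str, clip_index: int) -> str:
    """Generate a time-limited playback URL. Signing happens locally with
    the server's credentials, so nothing secret ever reaches the browser."""
    import boto3  # imported lazily; the pure helpers above need no AWS SDK

    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": clip_key(video_id, clip_index)},
        ExpiresIn=EXPIRATION_SECONDS,
    )
```

Generating URLs on-demand (rather than at index time) means an expired link is never stored anywhere: the frontend simply asks again and gets a fresh one.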

4. Real-Time Processing Status

Challenge: Users had no visibility into video processing progress after upload.

Solution: Created in-memory job tracking with a /video-status/{video_id} endpoint, providing real-time progress updates (a production deployment would use Redis).
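A minimal sketch of that in-memory tracker, assuming a simple stage/progress status shape; the function names and the FastAPI wiring in the trailing comment are illustrative:

```python
from threading import Lock

_jobs: dict[str, dict] = {}  # in-memory only; production would use Redis
_lock = Lock()


def update_status(video_id: str, stage: str, progress: int) -> None:
    """Record the current pipeline stage for a video (called by workers)."""
    with _lock:
        _jobs[video_id] = {"stage": stage, "progress": progress}


def get_status(video_id: str) -> dict:
    """Read the latest status; unknown IDs get a sentinel rather than a KeyError."""
    with _lock:
        return _jobs.get(video_id, {"stage": "unknown", "progress": 0})


# The endpoint would just wrap the getter, e.g.:
# @app.get("/video-status/{video_id}")
# def video_status(video_id: str) -> dict:
#     return get_status(video_id)
```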

5. Dual-Mode Interface Design

Challenge: Users needed both search functionality and conversational help without cluttering the UI.

Solution: Implemented a mode toggle in the ChatInterface component, routing requests to different backend handlers while maintaining conversation history.
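On the backend, that routing reduces to a dispatch table keyed by the active mode. The handler names and payload shapes here are stubs standing in for the real search and chat handlers:

```python
def handle_search(message: str) -> dict:
    """Stub: the real handler would run the hybrid clip search."""
    return {"mode": "search", "query": message}


def handle_chat(message: str) -> dict:
    """Stub: the real handler would call the conversational assistant."""
    return {"mode": "chat", "reply": f"echo: {message}"}


HANDLERS = {"search": handle_search, "chat": handle_chat}


def route_message(mode: str, message: str) -> dict:
    """Dispatch one request to the handler for the active UI mode."""
    try:
        return HANDLERS[mode](message)
    except KeyError:
        raise ValueError(f"unknown mode: {mode}") from None
```

Because both modes share one request path, conversation history can be kept in a single store regardless of which handler produced each turn.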

Accomplishments that we're proud of

⚡ Sub-Second Search: Achieved <500ms query response times across 1000+ indexed video clips using optimized OpenSearch k-NN indices

🎯 Fully Automated Pipeline: Zero manual intervention from video upload to searchable embeddings—Step Functions orchestrates the entire workflow

🧠 Semantic Understanding: Successfully implemented multi-modal embeddings that understand context (e.g., "celebration" matches birthday parties, weddings, and sports victories)

🎨 Polished UX: Built a beautiful, responsive React interface with smooth animations, mode switching, and persistent chat history

🔒 Production-Grade Security: Implemented IAM roles, presigned URLs, and CORS policies following AWS best practices

📊 Hybrid Search Innovation: Achieved 40% better relevance scores compared to vector-only search by combining semantic and keyword matching

What we learned

Technical Insights

  • Vector embeddings are powerful but imperfect: Combining them with traditional text search significantly improves accuracy
  • Event-driven architecture scales beautifully: S3 triggers + Step Functions handle variable load without manual scaling
  • Presigned URLs are essential: They enable secure, direct S3 access without proxy servers or credential exposure
  • Async processing is critical: Background tasks keep the API responsive while heavy ML operations run
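The async-processing point can be illustrated with `asyncio.to_thread`, which is how a FastAPI-style event loop can hand blocking ML work to a worker thread; `heavy_embed` is a stand-in for the real embedding call:

```python
import asyncio
import time


def heavy_embed(video_id: str) -> dict:
    """Stand-in for a blocking ML call (e.g., a Bedrock model invocation)."""
    time.sleep(0.05)  # simulate model latency
    return {"video_id": video_id, "status": "embedded"}


async def handle_upload(video_id: str) -> dict:
    """Offload the blocking work to a worker thread so the event loop
    (and therefore every other API request) stays responsive meanwhile."""
    return await asyncio.to_thread(heavy_embed, video_id)
```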

AWS Bedrock Mastery

  • Learned to optimize Marengo model invocations for cost and latency
  • Discovered the importance of chunking long videos for better embedding quality
  • Mastered IAM policies for least-privilege access across services

Frontend-Backend Integration

  • Structured API responses (Pydantic models) prevent runtime errors and improve DX
  • WebSocket-like updates can be simulated with polling for processing status
  • Framer Motion animations make async operations feel instant
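The "WebSocket-like updates via polling" idea boils down to a loop against the status endpoint; this client-side sketch is generic, with the interval, timeout, and `"complete"` stage name as assumptions:

```python
import time
from typing import Callable


def poll_until_complete(get_status: Callable[[str], dict], video_id: str,
                        interval: float = 0.5, timeout: float = 300.0) -> dict:
    """Repeatedly query a status source until processing finishes.
    Stands in for true push updates (WebSockets / server-sent events)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(video_id)
        if status.get("stage") == "complete":
            return status
        time.sleep(interval)
    raise TimeoutError(f"video {video_id} did not finish within {timeout}s")
```

Pairing a short poll interval with a progress animation is what makes the wait feel continuous even though the updates are discrete.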
