Inspiration
The explosion of video content has created a paradox: we have more footage than ever, yet finding a specific moment feels like searching for a needle in a haystack. We were inspired by the challenge of making video content as searchable as text. What if you could ask "show me clips of people laughing at a beach party" and instantly get results, without manual tagging or timestamps? VisionSeek AI Agent was born from this vision: to democratize video search using cutting-edge AI embeddings and hybrid search technology.
What it does
VisionSeek AI Agent is a comprehensive video discovery platform that combines:
- Semantic Video Search: Find exact moments in videos using natural language queries like "person walking in park" or "sunset over mountains"
- Automatic Video Processing: Upload videos to S3 and they are automatically segmented, analyzed, and indexed with no manual intervention
- Dual-Mode Interface: Switch between video search mode for clip discovery and chat mode for conversational assistance
- Hybrid Search Engine: Combines AI-powered vector similarity with traditional text matching for superior accuracy
- Real-Time Clip Retrieval: Get instant access to relevant video segments with precise timestamps and secure playback URLs
- Conversational Assistant: Ask questions and get guidance about your video library through natural dialogue
Scope & Purpose
- Enable instant discovery of specific moments across large video libraries using semantic search
- Automate the entire video indexing pipeline from upload to searchable embeddings
- Provide content creators, marketers, and researchers with a modern self-service portal for video exploration
- Eliminate manual tagging and timeline scrubbing through AI-powered content understanding
- Deliver sub-second search responses across thousands of video clips
Target Audience
- Content Creators managing extensive footage libraries and B-roll collections
- Marketing Teams searching for specific brand moments across campaign videos
- Researchers & Analysts exploring video datasets for patterns and insights
- Media Production Houses organizing and retrieving archived content efficiently
- E-learning Platforms helping students find specific lecture moments instantly
Platform Snapshot
- Event-Driven AWS Architecture with automatic video processing on upload
- FastAPI Backend + React Frontend for seamless user experience
- AI-Powered Embeddings using Amazon Bedrock's Marengo model for video understanding
- Hybrid Search combining vector similarity (k-NN) with text matching (BM25)
- Production-Ready Deployment with scalable infrastructure and monitoring
Challenges we ran into
1. Embedding Generation at Scale
Challenge: Processing long videos (30+ minutes) generated hundreds of clip embeddings, causing memory issues and timeouts.
Solution: Implemented parallel Lambda invocations with Step Functions orchestration, processing clips in batches and streaming results to OpenSearch incrementally.
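As a rough illustration of the incremental-indexing half of that solution, here is a minimal Python sketch using the standard opensearch-py bulk helper; the `embed_clip` stub, the `video-clips` index name, and the document fields are assumptions for this sketch, not the project's actual code:

```python
from opensearchpy import OpenSearch, helpers

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def embed_clip(clip: dict) -> list[float]:
    """Stand-in for the Bedrock Marengo embedding call (assumption)."""
    raise NotImplementedError

def index_clips(video_id: str, clips: list[dict], batch_size: int = 50) -> None:
    """Embed and index clips batch by batch so memory stays bounded."""
    for start in range(0, len(clips), batch_size):
        actions = [
            {
                "_index": "video-clips",
                "_id": f"{video_id}:{clip['start_sec']}",
                "_source": {
                    "video_id": video_id,
                    "start_sec": clip["start_sec"],
                    "end_sec": clip["end_sec"],
                    "embedding": embed_clip(clip),
                    "transcript": clip.get("transcript", ""),
                },
            }
            for clip in clips[start:start + batch_size]
        ]
        # Stream each batch to OpenSearch instead of holding every
        # embedding for a 30+ minute video in memory at once.
        helpers.bulk(client, actions)
```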
2. Hybrid Search Tuning
Challenge: Pure vector search missed exact keyword matches, while text-only search failed on semantic queries.
Solution: Developed a weighted hybrid search algorithm combining k-NN (cosine similarity) with BM25 text matching, tuning weights based on query characteristics.
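One way to realize that weighting is client-side score fusion over two OpenSearch queries; in this sketch the field names, the word-count heuristic, and the normalization scheme are illustrative assumptions, not the exact production tuning:

```python
def hybrid_search(client, index: str, query_text: str,
                  query_vector: list[float], size: int = 10) -> list[tuple[str, float]]:
    # Simple heuristic: longer natural-language queries lean on vector
    # similarity; short keyword-ish queries lean on BM25 (assumption).
    vector_weight = 0.7 if len(query_text.split()) > 3 else 0.4
    text_weight = 1.0 - vector_weight

    knn_hits = client.search(index=index, body={
        "size": size,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": size}}},
    })["hits"]["hits"]

    bm25_hits = client.search(index=index, body={
        "size": size,
        "query": {"match": {"transcript": query_text}},
    })["hits"]["hits"]

    # Normalize each result list's scores to [0, 1], then blend.
    def normalize(hits):
        top = max((h["_score"] for h in hits), default=1.0)
        return {h["_id"]: h["_score"] / top for h in hits}

    knn_scores, bm25_scores = normalize(knn_hits), normalize(bm25_hits)
    merged = {
        doc_id: vector_weight * knn_scores.get(doc_id, 0.0)
                + text_weight * bm25_scores.get(doc_id, 0.0)
        for doc_id in set(knn_scores) | set(bm25_scores)
    }
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:size]
```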
3. Presigned URL Management
Challenge: Videos in private S3 buckets couldn't be played directly in the browser without exposing credentials.
Solution: Built s3_utils.py to generate time-limited presigned URLs (1-hour expiration) on demand, balancing security with user experience.
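The core of such a helper is a single boto3 call; this sketch mirrors the 1-hour expiration described above, with generic bucket/key parameters:

```python
import boto3

s3 = boto3.client("s3")

def get_playback_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a time-limited URL so the browser can stream a private clip
    directly from S3 without exposing AWS credentials."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```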
4. Real-Time Processing Status
Challenge: Users had no visibility into video processing progress after upload.
Solution: Created in-memory job tracking behind a /video-status/{video_id} endpoint, providing real-time progress updates (production would use Redis).
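A minimal FastAPI sketch of that endpoint, with an in-memory dict standing in for the job store (the status fields are illustrative):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()
jobs: dict[str, dict] = {}  # video_id -> {"stage": ..., "progress": ...}

@app.get("/video-status/{video_id}")
def video_status(video_id: str) -> dict:
    job = jobs.get(video_id)
    if job is None:
        raise HTTPException(status_code=404, detail="Unknown video_id")
    return {"video_id": video_id, **job}
```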
5. Dual-Mode Interface Design
Challenge: Users needed both search functionality and conversational help without cluttering the UI.
Solution: Implemented a mode toggle in the ChatInterface component, routing requests to different backend handlers while maintaining conversation history.
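On the backend, the routing can be as simple as branching on a mode field; the request shape and handler names below are assumptions about the API, not the project's actual code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    mode: str                 # "search" or "chat", set by the UI toggle
    message: str
    history: list[dict] = []  # persisted conversation turns

def handle_clip_search(message: str) -> dict:
    return {"mode": "search", "clips": []}   # placeholder for hybrid search

def handle_conversation(message: str, history: list[dict]) -> dict:
    return {"mode": "chat", "reply": "..."}  # placeholder for the assistant

@app.post("/chat")
def chat(req: ChatRequest) -> dict:
    # One endpoint, two handlers: branching on mode keeps history shared.
    if req.mode == "search":
        return handle_clip_search(req.message)
    return handle_conversation(req.message, req.history)
```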
Accomplishments that we're proud of
✨ Sub-Second Search: Achieved <500ms query response times across 1000+ indexed video clips using optimized OpenSearch k-NN indices (an illustrative index mapping follows this list)
🎯 Fully Automated Pipeline: Zero manual intervention from video upload to searchable embeddings—Step Functions orchestrates the entire workflow
🧠 Semantic Understanding: Successfully implemented multi-modal embeddings that understand context (e.g., "celebration" matches birthday parties, weddings, and sports victories)
🎨 Polished UX: Built a beautiful, responsive React interface with smooth animations, mode switching, and persistent chat history
🔒 Production-Grade Security: Implemented IAM roles, presigned URLs, and CORS policies following AWS best practices
📊 Hybrid Search Innovation: Achieved 40% better relevance scores compared to vector-only search by combining semantic and keyword matching
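For context on the k-NN setup behind the sub-second numbers above, here is an illustrative OpenSearch index mapping; the embedding dimension, HNSW parameters, and engine choice are assumptions, not the production configuration:

```python
clip_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1024,                # Marengo embedding size (assumed)
                "method": {
                    "name": "hnsw",               # approximate nearest-neighbor graph
                    "space_type": "cosinesimil",  # cosine similarity
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
            "transcript": {"type": "text"},       # BM25 side of hybrid search
            "video_id": {"type": "keyword"},
            "start_sec": {"type": "float"},
        }
    },
}
# client.indices.create(index="video-clips", body=clip_index_body)
```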
What we learned
Technical Insights
- Vector embeddings are powerful but imperfect: Combining them with traditional text search significantly improves accuracy
- Event-driven architecture scales beautifully: S3 triggers + Step Functions handle variable load without manual scaling
- Presigned URLs are essential: They enable secure, direct S3 access without proxy servers or credential exposure
- Async processing is critical: Background tasks keep the API responsive while heavy ML operations run (a minimal sketch of the pattern follows this list)
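A minimal sketch of that background-task pattern using FastAPI's built-in BackgroundTasks; the process_video helper and route path are hypothetical placeholders:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def process_video(video_id: str) -> None:
    """Heavy ML work: segment, embed, and index the video."""
    ...

@app.post("/videos/{video_id}/process")
def start_processing(video_id: str, background_tasks: BackgroundTasks) -> dict:
    background_tasks.add_task(process_video, video_id)  # returns immediately
    return {"video_id": video_id, "status": "processing"}
```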
AWS Bedrock Mastery
- Learned to optimize Marengo model invocations for cost and latency
- Discovered the importance of chunking long videos into short clips for better embedding quality (a simple windowing sketch follows this list)
- Mastered IAM policies for least-privilege access across services
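The chunking lesson boils down to fixed-length windows over the video timeline; a trivial sketch, with the 6-second clip length as an assumed default:

```python
from typing import Iterator

def clip_windows(duration_sec: float, clip_len: float = 6.0) -> Iterator[tuple[float, float]]:
    """Yield (start, end) windows for chunking a long video (illustrative)."""
    t = 0.0
    while t < duration_sec:
        yield (t, min(t + clip_len, duration_sec))
        t += clip_len
```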
Frontend-Backend Integration
- Structured API responses (Pydantic models) prevent runtime errors and improve developer experience (DX); a small example follows this list
- WebSocket-like updates can be simulated with polling for processing status
- Framer Motion animations make async operations feel instant
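A small example of the structured-response idea, with hypothetical field names chosen to match the clip results described earlier:

```python
from pydantic import BaseModel

class ClipResult(BaseModel):
    video_id: str
    start_sec: float
    end_sec: float
    score: float
    playback_url: str  # presigned S3 URL

class SearchResponse(BaseModel):
    query: str
    results: list[ClipResult]
```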
Built With
- agentcore
- amazon-web-services
- bedrock
- opensearch
- strands