Memora Clip - Project Story
Inspiration
Have you ever tried to find a specific moment in a video you watched weeks ago? Maybe it was a cooking technique, a tutorial step, or a beautiful scene you wanted to revisit. You remember seeing it, but finding it means scrubbing through hours of footage—or worse, giving up entirely.
This frustration inspired Memora Clip. We live in a world where text is instantly searchable thanks to Google, but video—the fastest-growing content format—remains a black box. You can't search for "sunset on the beach" and find that exact moment in your 50-video library. You can't type "how to make carbonara" and jump to the timestamp where someone explains it.
We were particularly inspired by the explosive growth of short-form video content on platforms like Instagram. People bookmark hundreds of videos intending to revisit them, but those saves get lost in an ever-growing collection. The content is there, the value is there, but it's functionally inaccessible.
We asked ourselves: What if videos were as searchable as text? What if AI could understand not just what's said, but what's shown? That's how Memora Clip was born.
What it does
Memora Clip is an AI-powered video search engine that makes your entire video library searchable like text.
Core Capabilities:
1. AI-Powered Search
- Visual Search - Search for visual content using natural language. Type "person wearing red shirt" or "mountain sunset" and find every matching moment across all your videos using CLIP vision AI.
- Transcript Search - Search through everything said in your videos. Find exact quotes, topics, or discussions instantly.
- Hybrid Search - Comprehensive search across visual content, transcripts, AI-generated summaries, keywords, and categories for the most relevant results.
2. Instagram Bookmark Integration
- Chrome extension adds a bookmark button to Instagram posts and reels
- One click automatically downloads, processes, and indexes the video
- Turn your saved Instagram content into a searchable personal knowledge base
3. Automated AI Processing
When you upload or bookmark a video, Memora Clip automatically (see the sketch after this list):
- Extracts frames at configurable intervals (default: every 2 seconds)
- Generates CLIP embeddings (768-dimensional vectors) for visual similarity search
- Transcribes audio to text using Whisper AI
- Creates AI-generated summaries, keywords, and categories
- Generates OpenAI embeddings (1536 dimensions) for semantic text search
- Compresses videos for faster preview and playback
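For illustration, the fan-out could look roughly like the sketch below. Every function here is a hypothetical stand-in for a real pipeline step, not our exact code:

```typescript
// A hypothetical fan-out for the steps above; every function is a stand-in.
type Frame = { timestamp: number; imagePath: string };
declare function extractFrames(id: string, opts: { intervalSeconds: number }): Promise<Frame[]>;
declare function transcribeAudio(id: string): Promise<string>;
declare function compressForPreview(id: string): Promise<void>;
declare function summarize(text: string): Promise<{ text: string; keywords: string[]; categories: string[] }>;
declare function embedFramesWithClip(frames: Frame[]): Promise<void>;
declare function embedTextWithOpenAI(text: string): Promise<void>;

async function processVideo(videoId: string): Promise<void> {
  // Stage 1: independent heavy tasks run in parallel.
  const [frames, transcript] = await Promise.all([
    extractFrames(videoId, { intervalSeconds: 2 }), // FFmpeg, on the processing server
    transcribeAudio(videoId),                       // Whisper via Groq
  ]);

  // Stage 2: steps that depend on stage-1 outputs, plus independent compression.
  const summary = await summarize(transcript); // summary, keywords, categories
  await Promise.all([
    embedFramesWithClip(frames),      // 768-dimensional CLIP vectors per frame
    embedTextWithOpenAI(transcript),  // 1536-dimensional text vectors
    embedTextWithOpenAI(summary.text),
    compressForPreview(videoId),      // faster preview and playback
  ]);
}
```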
4. Smart Video Library
- Hover-to-play video previews (sketched below)
- Metadata tracking (file size, format, upload date, Instagram source)
- Download original videos
- Real-time processing status updates
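The hover-to-play preview is a small piece of React. A minimal sketch, with hypothetical props:

```tsx
// A minimal sketch of a hover-to-play preview card; the props are hypothetical.
import { useRef } from "react";

export function VideoPreview({ src }: { src: string }) {
  const videoRef = useRef<HTMLVideoElement>(null);

  return (
    <video
      ref={videoRef}
      src={src}
      muted
      loop
      playsInline
      preload="metadata"
      // Start playback on hover; pause and rewind when the pointer leaves.
      onMouseEnter={() => videoRef.current?.play().catch(() => {})}
      onMouseLeave={() => {
        videoRef.current?.pause();
        if (videoRef.current) videoRef.current.currentTime = 0;
      }}
    />
  );
}
```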
The Result:
Instead of watching hours of footage to find one moment, you type a query and get instant, precise results—complete with clickable timestamps and preview clips.
How we built it
Architecture
We built Memora Clip as a full-stack TypeScript monorepo using modern AI and web technologies:
Frontend:
- Next.js 16 with React 19 for the web application
- Tailwind CSS for responsive, modern UI
- Real-time updates using Convex React hooks
- Plasmo framework for the Chrome extension
Backend:
- Convex for real-time database, vector search, and workflow orchestration
- Hono API server for heavy video processing tasks
- FFmpeg for video manipulation and frame extraction
- Workflow-based processing for parallel task execution
AI/ML Pipeline:
- CLIP (OpenAI) - Vision embeddings for visual similarity search (768 dimensions)
- OpenAI Embeddings - Text embeddings for semantic search (1536 dimensions)
- Whisper (via Groq) - Audio transcription with 95%+ accuracy
- Vector Search - Multiple vector indexes for different search modalities (a hybrid-search sketch follows)
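To make this concrete, here is a minimal sketch of a hybrid query as a Convex action. The table names, index names, and embedding helpers are hypothetical stand-ins, not our exact code:

```typescript
// convex/search.ts: a minimal sketch of a hybrid query.
// Table names, index names, and the embedding helpers are hypothetical.
import { action } from "./_generated/server";
import { v } from "convex/values";
import { embedTextWithClip, embedTextWithOpenAI } from "./embeddings"; // hypothetical helpers

export const hybridSearch = action({
  args: { query: v.string() },
  handler: async (ctx, { query }) => {
    // Embed the query once per modality: CLIP text vectors live in the same
    // space as CLIP image vectors, which is what makes text-to-frame search work.
    const [clipVec, textVec] = await Promise.all([
      embedTextWithClip(query),   // 768-dim, matches the frame index
      embedTextWithOpenAI(query), // 1536-dim, matches the transcript index
    ]);

    // Run both vector searches in parallel against their own indexes.
    const [visualHits, transcriptHits] = await Promise.all([
      ctx.vectorSearch("frames", "by_clip_embedding", { vector: clipVec, limit: 16 }),
      ctx.vectorSearch("transcripts", "by_text_embedding", { vector: textVec, limit: 16 }),
    ]);

    // Naive merge by score; a real ranker should normalize per modality first.
    return [...visualHits, ...transcriptHits].sort((a, b) => b._score - a._score);
  },
});
```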
Key Technical Decisions:
1. Hybrid Processing Architecture
We initially tried running FFmpeg processing in Convex actions, but hit timeout limits with large videos. We migrated to a hybrid approach where Convex orchestrates workflows and handles database operations, while a separate Hono API server handles heavy FFmpeg processing. This gave us the best of both worlds: Convex's real-time capabilities plus the power to process videos of any size.
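A stripped-down sketch of that processing server follows; the route name, payload shape, and output path are illustrative assumptions:

```typescript
// A sketch of the Hono processing server; route, payload, and paths are assumptions.
import { Hono } from "hono";
import { serve } from "@hono/node-server";
import { spawn } from "node:child_process";

const app = new Hono();

app.post("/extract-frames", async (c) => {
  const { inputPath, intervalSeconds } = await c.req.json<{
    inputPath: string;
    intervalSeconds: number;
  }>();

  // fps=1/N emits one frame every N seconds. Unlike a Convex action, this
  // process can run for as long as the video requires.
  await new Promise<void>((resolve, reject) => {
    const ff = spawn("ffmpeg", [
      "-i", inputPath,
      "-vf", `fps=1/${intervalSeconds}`,
      "/tmp/frames/frame_%04d.jpg", // assumes the output directory already exists
    ]);
    ff.on("close", (code) => (code === 0 ? resolve() : reject(new Error(`ffmpeg exited ${code}`))));
    ff.on("error", reject);
  });

  return c.json({ ok: true });
});

serve({ fetch: app.fetch, port: 3001 });
```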
2. Multi-Modal Vector Search
We implemented separate vector indexes for different search modes:
- Image embeddings for visual search (CLIP)
- Transcript embeddings for semantic text search (OpenAI)
- Summary embeddings for topic-level search
- Keywords/categories embeddings for categorical search
This lets users search the same way they think: sometimes visually, sometimes by content, sometimes by topic. The schema sketch below shows how such indexes are declared.
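In Convex, each index is declared on its table in the schema. A minimal sketch with hypothetical table and field names, showing two of the indexes (summaries and keywords/categories follow the same pattern):

```typescript
// convex/schema.ts: a sketch with hypothetical table and field names.
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  videos: defineTable({
    title: v.string(),
    status: v.string(),
  }),

  frames: defineTable({
    videoId: v.id("videos"),
    timestamp: v.number(),
    clipEmbedding: v.array(v.float64()),
  }).vectorIndex("by_clip_embedding", {
    vectorField: "clipEmbedding",
    dimensions: 768,           // CLIP vision embeddings
    filterFields: ["videoId"], // lets a search narrow to one video
  }),

  transcripts: defineTable({
    videoId: v.id("videos"),
    text: v.string(),
    embedding: v.array(v.float64()),
  }).vectorIndex("by_text_embedding", {
    vectorField: "embedding",
    dimensions: 1536,          // OpenAI text embeddings
    filterFields: ["videoId"],
  }),
});
```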
3. Embedding Cache
To avoid redundant API calls and reduce costs, we implemented caching for both CLIP and OpenAI embeddings. The cache tracks usage with accessCount and lastAccessedAt for LRU-style cleanup.
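The cache-or-compute path is simple in outline. This sketch uses hypothetical storage helpers in place of our actual Convex queries and mutations:

```typescript
// A sketch of the cache-or-compute path; the storage helpers are hypothetical
// stand-ins for queries/mutations against an embedding-cache table.
import { createHash } from "node:crypto";

declare const db: {
  findCached(key: string): Promise<{ embedding: number[] } | null>;
  saveCached(key: string, embedding: number[]): Promise<void>;
  touch(key: string): Promise<void>; // bumps accessCount and lastAccessedAt
};
declare function callOpenAIEmbedding(text: string): Promise<number[]>;

export async function getEmbedding(text: string): Promise<number[]> {
  // Key the cache on a hash of the input so identical text never re-embeds.
  const key = createHash("sha256").update(text).digest("hex");

  const hit = await db.findCached(key);
  if (hit) {
    await db.touch(key); // record usage for LRU-style cleanup
    return hit.embedding;
  }

  const embedding = await callOpenAIEmbedding(text); // the one paid API call
  await db.saveCached(key, embedding);
  return embedding;
}
```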
4. Browser Extension Integration
The Instagram integration required a deep understanding of Instagram's DOM structure and separate handling for feed posts and individual post pages. We built a robust content script that works across Instagram's SPA navigation and dynamically injected content.
Challenges we ran into
1. Video Processing at Scale
Our biggest challenge was processing large videos efficiently. Initially, running FFmpeg in Convex actions caused timeouts with videos longer than a few minutes. We solved this by moving heavy processing to a dedicated Hono API server while keeping Convex for orchestration and data management. This architectural decision was critical to making the system work.
2. Instagram's Dynamic DOM
Instagram's heavily obfuscated class names and SPA architecture made building a reliable browser extension extremely challenging. The DOM structure changes frequently, and finding reliable selectors for injecting bookmark buttons required multiple fallback strategies. We implemented four different detection strategies to ensure the bookmark button appears consistently (sketched below).
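In outline, the fallback chain works like this sketch. The selectors shown are illustrative only; the real strategies target Instagram's live, obfuscated markup:

```typescript
// A sketch of the fallback chain; these selectors are illustrative only.
const STRATEGIES: Array<(post: Element) => Element | null> = [
  (post) => post.querySelector('section div[role="button"]'),                   // action row
  (post) => post.querySelector("svg[aria-label]")?.closest("section") ?? null,  // icon anchor
  (post) => post.querySelector("section"),                                      // coarse fallback
];

function findInjectionTarget(post: Element): Element | null {
  for (const strategy of STRATEGIES) {
    const target = strategy(post);
    if (target) return target;
  }
  return null;
}

// Instagram is an SPA: posts appear without page loads, so watch the DOM.
new MutationObserver(() => {
  document.querySelectorAll("article").forEach((post) => {
    if (post.querySelector(".memora-bookmark")) return; // already injected
    const target = findInjectionTarget(post);
    if (!target) return;
    const button = document.createElement("button");
    button.className = "memora-bookmark";
    button.textContent = "★";
    target.appendChild(button);
  });
}).observe(document.body, { childList: true, subtree: true });
```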
3. Vector Search Performance
With thousands of frames across multiple videos, vector search performance became critical. We optimized in four ways (the clustering step is sketched after this list):
- Implementing multiple specialized vector indexes instead of one generic index
- Adding filter fields to narrow search scope
- Caching embeddings to avoid redundant API calls
- Clustering frames into clips for more relevant results
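The clustering step is plain TypeScript: merge matching frames whose timestamps sit close together into one clip, so results point at moments rather than individual frames. A minimal sketch:

```typescript
// Merge frame hits into clips: consecutive hits within maxGapSeconds of each
// other become one clip, scored by their best-matching frame.
type FrameHit = { timestamp: number; score: number };
type Clip = { start: number; end: number; score: number };

function clusterIntoClips(hits: FrameHit[], maxGapSeconds = 4): Clip[] {
  const sorted = [...hits].sort((a, b) => a.timestamp - b.timestamp);
  const clips: Clip[] = [];
  for (const hit of sorted) {
    const last = clips[clips.length - 1];
    if (last && hit.timestamp - last.end <= maxGapSeconds) {
      // Extend the current clip and keep its best score.
      last.end = hit.timestamp;
      last.score = Math.max(last.score, hit.score);
    } else {
      clips.push({ start: hit.timestamp, end: hit.timestamp, score: hit.score });
    }
  }
  // Return the strongest moments first.
  return clips.sort((a, b) => b.score - a.score);
}
```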
4. Balancing Accuracy vs. Cost
AI embeddings aren't free. We had to balance search accuracy with API costs:
- CLIP embeddings: one 768-dimensional vector per frame (potentially thousands of frames per video)
- OpenAI embeddings: 1536-dimensional vectors for transcripts, summaries, and keywords
- Transcription costs via Groq
We implemented intelligent caching, batch processing, and configurable frame extraction rates to keep costs manageable while maintaining search quality.
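Batching was the easiest win: the OpenAI embeddings endpoint accepts an array of inputs, so many chunks share one request. A sketch (the model choice here is an assumption; any 1536-dimension embedding model fits the schema above):

```typescript
// Batched embedding calls: one request for many chunks means far fewer round trips.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function embedBatch(chunks: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small", // 1536 dimensions
    input: chunks,                   // many inputs per request
  });
  // Each item carries its input index; sort to be safe before returning.
  return res.data
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
}
```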
5. Real-Time Processing Feedback
Users needed to see what was happening during processing. We implemented a workflow-based system with granular status updates:
- Overall pipeline status: uploading → processing → ready/failed
- Individual step tracking: framesExtracted, transcribed, embedded
- Real-time progress updates in the UI (see the mutation sketch below)
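A sketch of the per-step status mutation; the field names mirror the list above, but the exact shape is illustrative. Convex pushes each patch to subscribed clients automatically, which is what drives the live UI:

```typescript
// A sketch of the status mutation; the document shape is an assumption.
import { mutation } from "./_generated/server";
import { v } from "convex/values";

export const updateStatus = mutation({
  args: {
    videoId: v.id("videos"),
    status: v.union(
      v.literal("uploading"),
      v.literal("processing"),
      v.literal("ready"),
      v.literal("failed"),
    ),
    framesExtracted: v.optional(v.boolean()),
    transcribed: v.optional(v.boolean()),
    embedded: v.optional(v.boolean()),
  },
  handler: async (ctx, { videoId, ...fields }) => {
    // Patch only the fields that changed; subscribed queries update live.
    await ctx.db.patch(videoId, fields);
  },
});
```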
Accomplishments that we're proud of
1. It Actually Works
The most satisfying accomplishment is that Memora Clip delivers on its promise. You can genuinely search for "person cooking pasta" or "sunset on beach" and find matching moments across your entire video library. The AI isn't perfect, but it's remarkably good: good enough to be genuinely useful.
2. The Instagram Integration is Magical
The one-click Instagram bookmark flow feels like magic. See a video, click the star, and within minutes it's searchable in your library. No copying URLs, no manual downloads, no tagging. It just works.
3. End-to-End Type Safety
Building in TypeScript across the entire stack (Next.js frontend, Convex backend, Hono API, browser extension) gave us confidence in our code. Convex's automatic type generation meant changes to the database schema immediately surfaced in the frontend. We caught so many bugs at compile time instead of runtime.
4. The Hybrid Search
Implementing hybrid search that combines visual similarity, transcript matching, and semantic understanding was technically complex but incredibly powerful. Users can search however they think: by what they saw, what they heard, or what the video was about.
5. Processing Efficiency
Getting from "upload video" to "fully searchable" in under 2 minutes for a typical video required significant optimization. Parallel workflows for frame extraction, transcription, and compression, plus smart caching, made this possible.
6. Beautiful, Responsive UI
Despite the complex AI backend, we built a clean, intuitive interface. Hover-to-play previews, smooth search results, real-time status updates: it feels polished and professional.
What we learned
1. AI is Ready for Production
CLIP, Whisper, and OpenAI embeddings are genuinely production-ready. The accuracy is impressive, the APIs are reliable, and the costs are manageable. We're at an inflection point where AI-powered features that seemed impossible a few years ago are now table stakes.
2. Architecture Matters More Than Code
Our biggest breakthroughs came from architectural decisions, not clever code:
- Moving FFmpeg to a separate service
- Using workflows for parallel processing
- Implementing multiple vector indexes
- Caching embeddings aggressively
3. Vector Search is a Game Changer
Combining vector embeddings with similarity search unlocks entirely new UX patterns. Searching by "vibe" or visual similarity instead of exact keywords is powerful and intuitive. This technology will reshape how we interact with content.
4. Real-Time is Expected
Users expect immediate feedback. Real-time processing status, instant search results, live updates: anything that feels slow or stale breaks the experience. Convex's real-time capabilities were essential for this.
5. Instagram's Walled Garden
Building on top of Instagram is fragile. Their DOM structure can change at any moment, breaking our extension. We learned to build defensively with multiple fallback strategies and graceful degradation.
6. The Demo is Everything
Technical excellence means nothing if users don't understand the value. We spent significant time on the demo flow and elevator pitch because showing the value is harder than building the value.
What's next for Memora Clip
Short-term (Next 3 Months)
1. YouTube Integration
Instagram is just the beginning. YouTube is the obvious next platform: playlists, subscriptions, and watch history represent thousands of hours of content that users can't effectively search through.
2. Mobile App
Video consumption happens on mobile. We need iOS and Android apps with the same search capabilities, plus mobile-native features like sharing clips directly to social media.
3. Collaborative Libraries
Enable teams to build shared video libraries. Content creators, researchers, and educators all need to collaborate around video content.
4. Advanced Clip Editing
Once users find the perfect moment, they want to export it. Add trimming, stitching, and export features to create shareable clips from search results.
Medium-term (6-12 Months)
5. Multi-Language Support
Expand beyond English. Whisper supports 50+ languages; we should too.
6. Custom AI Models
Allow power users to fine-tune search for their specific use cases: medical videos, sports footage, security cameras, and more.
7. API Access
Enable developers to build on top of Memora Clip. Video search as a service.
8. Smart Collections
Auto-generate collections based on AI understanding: "All my pasta recipes," "Morning routine videos," "Travel content from 2024."
Long-term Vision
9. The Google for Video
Our ultimate goal is to be the definitive way people search and organize video content. Whether it's personal videos, social media saves, or professional footage, Memora Clip should be the answer.
10. Enterprise Solutions
News organizations, production companies, and enterprises with massive video archives need this technology. Build enterprise-grade features: SSO, permissions, team management, advanced analytics.
11. Real-Time Video Understanding
Process video streams in real time. Search through Zoom recordings, live streams, or security footage as they happen.
12. AI Video Assistant
Move beyond search to answers. "Show me how to make carbonara" should return step-by-step instructions compiled from your video library, not just search results.
The future is visual, and visual content should be as searchable as text. That's what we're building.