🚀 Inspiration

Over the past year, Mani and Bhanu have been diving deep into one of AI's toughest challenges: video understanding. Despite progress in foundation models, reliably interpreting videos, especially in real-world, dynamic settings, remains unsolved. Our direction was shaped by:

  • The Twelve Labs blog on Context Engineering, which argued that the next leap won’t come from bigger models, but richer, adaptive context and self-healing memory.
  • A discussion at the All-In Summit 2025 where Mark Cuban and Tucker Carlson debated the future of video AI, reinforcing the need for systems that are both context-aware and user-aware.

These ideas led us to build mem[v]: the context and memory layer for multimodal agents.

🧠 What It Does

mem[v] creates a persistent memory graph from video content, extracting:

  • Episodic context (what happened)
  • Temporal context (when and in what order)
  • Semantic context (relationships and meaning)

Instead of re-processing videos repeatedly, AI agents query this memory layer instantly, enabling real-time insights 40x faster and at 1% of the cost. Process once. Remember everything. Query instantly.

mem[v] also integrates external business documents, such as brand guidelines, product specs, and campaign briefs, into a unified graph, turning raw video data into actionable business intelligence.
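The three context types above can be sketched as a minimal graph schema. This is a hypothetical illustration (the node kinds, relation names, and `MemoryGraph` API are ours, not mem[v]'s actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    node_id: str
    kind: str           # "episodic" for events, "semantic" for concepts
    label: str          # e.g. "Product X appears on screen"
    start_s: float = 0.0  # temporal context: when the event happened
    end_s: float = 0.0

@dataclass
class MemoryEdge:
    src: str
    dst: str
    relation: str       # e.g. "happens_before", "mentions", "aligns_with"

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: MemoryNode) -> None:
        self.nodes[node.node_id] = node

    def link(self, src: str, dst: str, relation: str) -> None:
        self.edges.append(MemoryEdge(src, dst, relation))

    def neighbors(self, node_id: str, relation: str) -> list:
        # Follow outgoing edges of one relation type.
        return [self.nodes[e.dst] for e in self.edges
                if e.src == node_id and e.relation == relation]

# Episodic nodes with temporal ordering: Product X after a competitor mention.
g = MemoryGraph()
g.add_node(MemoryNode("m1", "episodic", "competitor mention", 12.0, 15.5))
g.add_node(MemoryNode("m2", "episodic", "Product X appears", 18.0, 22.0))
g.link("m1", "m2", "happens_before")
print([n.label for n in g.neighbors("m1", "happens_before")])
```

Once events live in a graph like this, "what happened after X" becomes an edge traversal instead of another pass over the video.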

🛠️ How We Built It

Tech Stack:

  • Video Understanding: Twelve Labs (Pegasus + Marengo)
  • Reasoning: OpenAI GPT-4
  • Context Graph: Neon Postgres (Graph schema)
  • Query Layer: Redis cache + GPT-powered logic
  • Frontend: Next.js
  • Auth: Clerk

We built intelligent chunking, stateful context tracking, and custom prompt pipelines to overcome API context-length limits and the lack of multi-turn capabilities.
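The chunking idea is simple: split each video into fixed windows that overlap, so an event straddling a boundary is seen by both neighboring chunks. A minimal sketch (the window and overlap sizes are illustrative, not our production values):

```python
def chunk_video(duration_s: float, chunk_s: float = 300.0,
                overlap_s: float = 30.0) -> list:
    """Split a video into (start, end) windows with overlap so events
    on a chunk boundary are captured by both neighboring chunks."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk")
    chunks, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back to create the overlap
    return chunks

print(chunk_video(700.0))  # three overlapping 5-minute windows
```

Each window is then sent through the video-understanding API independently, and the resulting events are merged into the shared graph, with the overlap used to deduplicate boundary events.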

⚔️ Challenges We Faced

  • No multi-turn chat support in Twelve Labs → Built our own context manager
  • Rate limiting & unclear errors → Upgraded mid-hackathon to pay-as-you-go
  • Limited video context length → Engineered smart chunking strategies
  • No fine-tuning options → Relied on prompt engineering for domain-specific graphs
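The first workaround above, our own context manager, amounts to folding prior question-answer turns into each new single-turn prompt. A minimal sketch of that idea (class and method names are ours; the real manager also tracks video and chunk state):

```python
class VideoContextManager:
    """Simulate multi-turn chat on top of a single-turn video Q&A API
    by folding recent turns into each new prompt."""

    def __init__(self, max_turns: int = 6):
        self.history = []          # list of (question, answer) pairs
        self.max_turns = max_turns

    def build_prompt(self, question: str) -> str:
        # Keep only the most recent turns to respect context limits.
        recent = self.history[-self.max_turns:]
        lines = [f"Q: {q}\nA: {a}" for q, a in recent]
        lines.append(f"Q: {question}\nA:")
        return "\n\n".join(lines)

    def record(self, question: str, answer: str) -> None:
        self.history.append((question, answer))

mgr = VideoContextManager()
mgr.record("What product appears first?", "Product X at 0:18.")
prompt = mgr.build_prompt("Does it follow a competitor mention?")
print(prompt)
```

The single-turn API then receives `prompt` as its query, so follow-up questions can refer back to earlier answers even though the API itself is stateless.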

✅ Accomplishments

🔧 Technical Wins

  1. Built a memory layer on top of Twelve Labs
    • From one-time API calls → persistent, queryable memory
  2. Integrated external business context
    • PDFs, decks, catalogs, and performance data into a multimodal graph
  3. 40x speed improvement
    • From 30s+ video queries → <100ms with Redis + Graph
  4. Graph-based video reasoning

"Find moments where Product X appears after a competitor mention and aligns with brand guidelines (section 3.2)"

  5. First working prototype in 24 hours
    • Processed 20+ ad videos
    • Ingested 5+ docs
    • Created 500+ graph nodes and 2K+ relationships
  6. Tackled $80B ad waste problem
    • Reuses video memory across campaigns, teams, and platforms
  7. Built a “single source of truth” for video intelligence
    • Unifies video content with business knowledge
  8. Context as infrastructure
    • Democratizing memory + context for all video AI applications
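The 40x speedup above comes from a cache-aside pattern: answer repeated questions from the cache, and only fall through to the slow graph-plus-LLM path on a miss. A runnable sketch, with an in-memory dict standing in for Redis so no server is needed (the TTL and key scheme are illustrative):

```python
import hashlib
import time

class QueryCache:
    """Cache-aside layer mapping a hashed query to (answer, expiry)."""

    def __init__(self, ttl_s: float = 3600.0):
        self.ttl_s = ttl_s
        self.store = {}

    def _key(self, query: str) -> str:
        # Stable key for the exact query text.
        return hashlib.sha256(query.encode()).hexdigest()

    def get_or_compute(self, query: str, slow_fn) -> str:
        key = self._key(query)
        hit = self.store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                       # fast path: cache hit
        answer = slow_fn(query)                 # slow path: graph + LLM
        self.store[key] = (answer, time.monotonic() + self.ttl_s)
        return answer

calls = 0
def graph_query(q: str) -> str:
    global calls
    calls += 1                 # counts how often the slow path runs
    return f"answer({q})"

cache = QueryCache()
cache.get_or_compute("Where does Product X appear?", graph_query)
cache.get_or_compute("Where does Product X appear?", graph_query)
print(calls)  # → 1  (second call served from cache)
```

In production the dict is replaced by Redis (e.g. `SET key value EX ttl` / `GET key`), which gives the same hit path in sub-millisecond time across processes.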

🔍 Why It Matters

  1. We amplify, not compete with, Twelve Labs
Just as Pinecone powers OpenAI, we power Twelve Labs outputs
  2. Closed the context gap
Bridge between raw video understanding and institutional knowledge
  3. Unlocked real-world scalability
40x faster and 100x cheaper = deployable at scale
  4. Built what the industry theorized
First working prototype of context-engineered video memory
  5. Immediate revenue path
Ad industry needs this now: massive ROI, immediate need
  6. Multimodal data lake
Videos, documents, and structured data, all queryable via natural language

📚 What We Learned

  • Context beats model size
  • Memory compounds
  • Graphs + vectors = 🔥
  • Most AI failures = context failures, not model limitations
  • Foundation models need infrastructure to become usable

🚧 What’s Next

  • Launching SDKs: memvai on pip + npm (already registered)
  • Collaborating with Twelve Labs to become a selected customer for fine-tuning
  • Onboarding 5–10 design partners in advertising
  • Proving 40%+ CPM improvements in real-world campaigns
  • Building privacy-preserving federated memory for cross-customer learning
  • Expanding into fashion, e-learning, and media AI

Long-term: mem[v] becomes the universal memory layer for multimodal AI.

🌐 The Bigger Picture

Twelve Labs democratized video understanding models. We're making video memory + business context usable. Together, we're building the infrastructure for next-gen AI agents, where video understanding meets institutional memory and insights become truly actionable.
