Inspiration

The world is drowning in video, images, audio, documents, spreadsheets, and other unstructured data, yet finding a specific moment in a 2-hour recording, or bridging knowledge across a PDF and a presentation, remains a manual chore. We were inspired by Twelve Labs' vision of "Video Understanding" and Dynbox.app's agentic File Explorer, and wanted to bring that level of professional visual file search and multi-modal reasoning to the Gemini ecosystem. We set out to build a "Marathon Agent" that doesn't just answer questions, but autonomously researches technical documentation while analyzing your project files.

What it does

Gemini Files is a unified workspace for media and knowledge:

  • Temporal Video Search: Find exact moments using natural language queries like "Find where the red-nosed reindeer appears."
  • Interactive Timeline: A color-coded visualization of scenes (Action, Dialogue, Transitions) with clickable timestamps.
  • Unified Chat with Thought Signatures: A conversational interface that maintains deep reasoning continuity across multiple turns, accessing ALL your uploaded files simultaneously.
  • Automatic MCP Agent: Autonomously connects to the DeepWiki MCP server whenever a GitHub repository is mentioned in chat, combining live SDK documentation with your local files.
  • Real-Time AI Teacher: A voice-enabled interactive tutor powered by the Gemini Live API that can discuss and explain your files in real-time.

How we built it

The project is built on a modern stack optimized for speed and AI interaction:

  • Framework: Next.js 16 with App Router and Edge Runtime for low-latency streaming.
  • Design: Tailwind CSS layered over custom vanilla CSS, with Framer Motion, for a premium aesthetic in the spirit of Twelve Labs and Dynbox.
  • AI Engine: Gemini 3 Flash Preview for core reasoning and temporal analysis.
  • File Management: Gemini File API for multi-modal processing (Video, Audio, PDF, Images, Spreadsheets).
  • Communication: WebSockets for Gemini Live API (Voice) and Server-Sent Events (SSE) for streaming analysis.
  • Protocols: Model Context Protocol (MCP) for autonomous tool calling and external knowledge retrieval.

Challenges we ran into
  • Streaming Parity: Implementing real-time streaming analysis with SSE while maintaining state across the Edge Runtime was a complex juggle of React transitions and buffer management.
  • MCP Orchestration: Building a resilient "Marathon Agent" that can autonomously decide when to call the DeepWiki MCP server, extract repository names from natural language, and merge that external data with internal PDF context without overwhelming the model's context window.
  • Thought Signature Persistence: Ensuring that Gemini 3's Thought Signatures were correctly passed and stored across multi-turn conversations to maintain "thinking levels" and logical continuity.
  • Rate Limit Resilience: Managing the high-frequency demands of the Gemini File API and implementing robust batching strategies to handle 429 errors gracefully during heavy analysis.
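To make the "autonomously decide when to call DeepWiki" step concrete, here is a minimal sketch of how a repository mention might be detected in a chat message before handing off to the MCP server. The function name and regexes are our own illustrative assumptions, not the project's actual code, and the bare `owner/repo` pattern is deliberately naive (it can over-match).

```typescript
// Hypothetical sketch: detect a GitHub "owner/repo" mention in a chat message
// so the agent knows when to consult the DeepWiki MCP server.
function extractRepo(message: string): string | null {
  // Prefer an explicit GitHub URL, e.g. https://github.com/owner/repo
  const url = message.match(/github\.com\/([\w.-]+\/[\w.-]+)/i);
  if (url) return url[1].replace(/\.git$/, "");
  // Fall back to a bare owner/repo token. Naive: may match phrases like
  // "and/or", so a real agent would add context checks or ask the model.
  const bare = message.match(/\b([\w-]+\/[\w.-]+)\b/);
  return bare ? bare[1] : null;
}
```

The extracted slug can then be passed as the repository argument of a DeepWiki MCP tool call, with the returned documentation merged into the prompt alongside the user's local files.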
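The 429-handling described above can be sketched as an exponential-backoff wrapper around any request function. This is a minimal illustration under our own assumptions (the `Fetcher` shape, function name, and delay constants are hypothetical), not the project's actual batching code.

```typescript
// Hypothetical sketch: retry a request with exponential backoff when the
// API responds with HTTP 429 (rate limited).
type Fetcher = () => Promise<{ status: number; body?: string }>;

async function withBackoff(
  fetcher: Fetcher,
  maxRetries = 4,
  baseDelayMs = 500,
): Promise<{ status: number; body?: string }> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetcher();
    // Success, a non-rate-limit error, or retries exhausted: return as-is.
    if (res.status !== 429 || attempt >= maxRetries) return res;
    // Exponential backoff with a little jitter: ~500ms, 1s, 2s, 4s...
    const delay = baseDelayMs * 2 ** attempt + Math.random() * 100;
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
}
```

Batching heavy analysis jobs behind a wrapper like this keeps the UI responsive while the Gemini File API catches up.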

Accomplishments that we're proud of

  • UI Excellence: Transforming the project from a basic uploader into a professional-grade portal that rivals industry-leading visual search tools (like twelvelabs.io) and agentic file explorers (like dynbox.app).
  • Autonomous Research: Successfully implementing the MCP flow where the AI can "go out and read the docs" to better answer questions about your code.
  • Real-Time Voice: Creating a seamless, low-latency voice experience with the Real-Time AI Teacher.
  • Temporal Reasoning: Achieving highly accurate scene detection and search results that feel like magic.

What we learned

  • Temporal >> Spatial: We learned that for video & audio, understanding when something happens is often more valuable than just what is in a single frame.
  • The Power of MCP: Standardizing how AI agents talk to documentation MCP servers (like DeepWiki) is a game-changer for technical workflows.
  • Continuity Matters: Using Gemini 3's Thought Signatures showed that preserving the AI's "internal state" leads to significantly higher-quality responses in complex, multi-modal tasks.

What's next for Gemini Files

  • Production Persistence: Moving beyond localStorage to a fully integrated Vercel KV/Blob storage for collaborative workspaces.
  • Advanced Video Timestamping: AI-generated timestamps for key moments in a video, surfaced both in the analysis view and inline in chat responses.
  • Voice-to-Insight: Allowing users to gain deep insights into their files via voice chats through the Gemini Live API.

Built With

  • gemini
  • gemini-3-api
  • gemini-3-flash-preview
  • gemini-api
  • gemini-file-api
  • gemini-live-api
  • mcp
  • model-context-protocol
  • nextjs
  • react
  • tailwind
  • typescript