Inspiration

We've all been there. You open a 47-page document, scroll through it, and think "I really don't want to read this." What if instead of reading your documents, you could just... talk to them? And what if, to make things more interesting, two AI personalities debated the content while you listened? We were inspired by ChatGPT's voice mode and thought: what if we could bring that experience to any Google Doc, but make it a conversation between multiple AI agents? Because reading is so 2023.

What it does

DocTalk is a Chrome extension that turns any Google Doc into a voice conversation. Open a document, click the extension, and two AI agents (Critic and Creative) are ready to discuss it with you.

Ask a question about your document and watch as two glowing orbs come alive. The Critic analyzes with precision. The Creative finds the interesting angles. They respond with synthesized voices while their orbs pulse and rise, ChatGPT-style. You can interrupt them anytime by speaking. You can even drag the orbs around because why not.

It's like having two very opinionated colleagues who actually read the document so you don't have to.

How we built it

We went a little overboard (hence our category).

Frontend:

  • Chrome Extension with OAuth2 for Google Docs API access
  • Custom animated orbs with CSS animations (breathing, pulsing, sound wave ripples)
  • Draggable elements with touch support
  • Real-time transcript display

Backend:

  • Node.js/Express server with WebSocket support
  • Supabase for storing documents, sessions, and conversation history
  • Speech-to-text pipeline for processing user audio
  • GPT-4o-audio-preview for direct voice synthesis (no separate TTS needed)

AI Architecture:

  • LangGraph for multi-agent orchestration
  • Two distinct agent personalities with different system prompts
  • Streaming responses with interrupt handling
  • Audio queuing system so agents don't talk over each other
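To make the "agents don't talk over each other" idea concrete, here is a minimal sketch of a turn-taking queue. It is an illustration, not DocTalk's actual code: the names Clip, Player, and SpeakerQueue are hypothetical, and it assumes playback is exposed as a function that calls back when a clip finishes.

```typescript
// Hypothetical sketch: all names here are illustrative assumptions.
type AgentId = "critic" | "creative";

interface Clip {
  agent: AgentId;
  audio: Uint8Array; // encoded audio bytes for one stretch of speech
}

// Plays one clip and calls onDone when playback has finished.
type Player = (clip: Clip, onDone: () => void) => void;

class SpeakerQueue {
  private pending: Clip[] = [];
  private playing = false;
  private play: Player;

  constructor(play: Player) {
    this.play = play;
  }

  // Add a clip; it starts right away only if nothing else is playing.
  enqueue(clip: Clip): void {
    this.pending.push(clip);
    this.pump();
  }

  get isPlaying(): boolean {
    return this.playing;
  }

  get size(): number {
    return this.pending.length;
  }

  // Start the next clip if the speaker is idle and something is waiting.
  private pump(): void {
    if (this.playing) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.playing = true;
    this.play(next, () => {
      this.playing = false;
      this.pump();
    });
  }
}
```

Because only one clip is ever in flight, the Creative's audio waits until the Critic's clip reports completion, whatever order the chunks arrive in.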

The flow:

  • User opens Google Doc → Extension auto-extracts content
  • User speaks → Audio chunks stream via WebSocket → Speech-to-text
  • Transcript + document context → LangGraph agents
  • Agents respond with audio → Streamed back to extension → Orbs animate
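The flow above implies some message format travelling over the WebSocket from the extension to the server. The write-up doesn't describe one, so the shapes below are purely hypothetical: a rough sketch of how client messages could be typed and validated on the server before they reach the speech-to-text step.

```typescript
// Hypothetical wire format: message names and fields are assumptions.
type ClientMessage =
  | { type: "audio_chunk"; data: string } // base64-encoded audio
  | { type: "interrupt"; responseId: number };

// Parse and validate a raw WebSocket text frame.
// Returns null for anything that is not a well-formed ClientMessage.
function parseClientMessage(raw: string): ClientMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof value !== "object" || value === null) return null;

  const msg = value as Record<string, unknown>;
  if (msg.type === "audio_chunk" && typeof msg.data === "string") {
    return { type: "audio_chunk", data: msg.data };
  }
  if (msg.type === "interrupt" && typeof msg.responseId === "number") {
    return { type: "interrupt", responseId: msg.responseId };
  }
  return null; // unknown type or missing fields
}
```

Validating at the boundary means the rest of the pipeline can switch on `type` and rely on the compiler to check that every message kind is handled.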

Challenges we ran into

Audio synchronization was brutal. When the user interrupts mid-response, you need to immediately stop playback, clear the audio queue, signal the server to halt generation, and resume listening. Getting this to feel snappy took many iterations.
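The four steps in that interrupt sequence can be sketched as a small controller. This is a minimal illustration under assumptions, not the project's real implementation: PlaybackController, InterruptDeps, and the responseId field are hypothetical names, and tagging chunks with a response id is one common way to drop audio that arrives after the interrupt.

```typescript
// Hypothetical sketch: names and the responseId scheme are assumptions.
interface InterruptDeps {
  stopPlayback(): void;               // halt whatever is audible right now
  sendToServer(message: string): void; // e.g. a WebSocket send
  startListening(): void;             // resume microphone capture
}

class PlaybackController {
  private queue: Uint8Array[] = [];
  private currentResponseId: number | null = null;
  private deps: InterruptDeps;

  constructor(deps: InterruptDeps) {
    this.deps = deps;
  }

  beginResponse(responseId: number): void {
    this.currentResponseId = responseId;
  }

  // Accept a chunk only if it belongs to the response still in progress.
  receiveChunk(responseId: number, chunk: Uint8Array): boolean {
    if (responseId !== this.currentResponseId) return false; // stale chunk
    this.queue.push(chunk);
    return true;
  }

  interrupt(): void {
    this.deps.stopPlayback();            // 1. stop playback immediately
    this.queue.length = 0;               // 2. clear the audio queue
    if (this.currentResponseId !== null) {
      this.deps.sendToServer(            // 3. tell the server to halt generation
        JSON.stringify({ type: "interrupt", responseId: this.currentResponseId })
      );
    }
    this.currentResponseId = null;       // late chunks are now rejected
    this.deps.startListening();          // 4. resume listening
  }

  get queued(): number {
    return this.queue.length;
  }
}
```

Clearing `currentResponseId` matters because the server may already have chunks in flight when the interrupt is sent; without the check they would play after the user started speaking.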

Chrome extension permissions for microphone access are tricky. Extensions run in a weird sandboxed context, and getting real-time audio capture working reliably required diving deep into the Web Audio API.
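One small piece of that Web Audio work is format conversion: the Web Audio API hands out samples as 32-bit floats in the range -1 to 1, while many speech-to-text services expect 16-bit PCM. The write-up doesn't say which format DocTalk streams, so treat this as a sketch under that assumption; the function name floatTo16BitPCM is hypothetical.

```typescript
// Hypothetical helper: assumes the server wants signed 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first: values can stray outside [-1, 1] and would otherwise wrap.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative and positive halves of the int16 range differ by one step.
    out[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return out;
}
```

The resulting Int16Array's underlying buffer can be sent as a binary WebSocket frame, or base64-encoded if the channel carries JSON.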

Making two agents actually feel different instead of just "the same AI with different adjectives" required careful prompt engineering. They needed distinct personalities without becoming caricatures.

The orb animations needed to feel organic. Early versions looked robotic. We spent way too long tweaking cubic-bezier curves and shadow spreads to get that "breathing" feeling right.

Accomplishments that we're proud of

The UI genuinely looks and feels polished. The orbs are satisfying to watch and interact with.

Interrupt handling works smoothly. Speak and the agents stop immediately, with no awkward overlap.

The whole flow is seamless: open a doc, click the extension, start talking. No manual extraction buttons, no setup.

We built a full real-time voice pipeline from scratch: capture → stream → transcribe → orchestrate → synthesize → play.

It actually works. Like, you can have a real conversation about your document.

What we learned

  • WebSocket state management is harder than it looks, especially with audio streaming
  • Chrome extension development has a lot of gotchas around permissions and context isolation
  • LangGraph is powerful for multi-agent systems but has a learning curve
  • GPT-4o-audio-preview is genuinely impressive for direct audio generation
  • CSS animations can achieve a lot without JavaScript if you're patient
  • "Just add another AI agent" sounds simple until you're debugging race conditions at 3am

What's next for DocTalk

  • More agent personalities: Add a Skeptic, an Optimist, a Devil's Advocate. Let users choose their debate panel, and maybe even describe the agent they want.
  • Document highlighting: When agents reference specific parts, highlight them in the actual doc.
  • Memory across sessions: Remember previous conversations about the same document.
  • Support for other formats: PDFs, Notion pages, Confluence docs, Slides, etc.
  • Mobile app: Because sometimes you want AI to read your documents while you're on the bus.
