DocTalk

Inspiration

We've all been there. You open a 47-page document, scroll through it, and think "I really don't want to read this." What if instead of reading your documents, you could just... talk to them? And what if, to make things more interesting, two AI personalities debated the content while you listened? We were inspired by ChatGPT's voice mode and thought: what if we could bring that experience to any Google Doc, but make it a conversation between multiple AI agents? Because reading is so 2023.

What it does

DocTalk is a Chrome extension that turns any Google Doc into a voice conversation. Open a document, click the extension, and two AI agents (Critic and Creative) are ready to discuss it with you.

Ask a question about your document and watch as two glowing orbs come alive. The Critic analyzes with precision. The Creative finds the interesting angles. They respond with synthesized voices while their orbs pulse and rise, ChatGPT-style. You can interrupt them anytime by speaking. You can even drag the orbs around because why not.

It's like having two very opinionated colleagues who actually read the document so you don't have to.

How we built it

We went a little overboard (hence our category).

Frontend:

Chrome Extension with OAuth2 for Google Docs API access
Custom animated orbs with CSS animations (breathing, pulsing, sound wave ripples)
Draggable elements with touch support
Real-time transcript display

Backend:

Node.js/Express server with WebSocket support
Supabase for storing documents, sessions, and conversation history
Speech-to-text pipeline for processing user audio
GPT-4o-audio-preview for direct voice synthesis (no separate TTS needed)

AI Architecture:

LangGraph for multi-agent orchestration
Two distinct agent personalities with different system prompts
Streaming responses with interrupt handling
Audio queuing system so agents don't talk over each other
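
A minimal sketch of the audio queuing idea, assuming each agent response arrives as a discrete clip and the player reports when a clip has finished. `AgentAudioQueue`, `AudioClip`, and the `startPlayback` callback are hypothetical names for illustration, not the project's actual code.

```typescript
type AgentId = "critic" | "creative";

interface AudioClip {
  agent: AgentId;
  // Assumed payload: base64-encoded audio as it might arrive from the server.
  audioBase64: string;
}

class AgentAudioQueue {
  private pending: AudioClip[] = [];
  private current: AudioClip | null = null;
  private startPlayback: (clip: AudioClip) => void;

  constructor(startPlayback: (clip: AudioClip) => void) {
    this.startPlayback = startPlayback;
  }

  // Add a clip; it starts right away only if nothing else is playing.
  enqueue(clip: AudioClip): void {
    this.pending.push(clip);
    this.playNextIfIdle();
  }

  // Call this when the player reports that the current clip has ended.
  onPlaybackEnded(): void {
    this.current = null;
    this.playNextIfIdle();
  }

  // Drop everything, for example when the user interrupts.
  clear(): void {
    this.pending = [];
    this.current = null;
  }

  // Which agent is currently speaking, if any.
  currentSpeaker(): AgentId | null {
    return this.current === null ? null : this.current.agent;
  }

  private playNextIfIdle(): void {
    if (this.current !== null) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.current = next;
    this.startPlayback(next);
  }
}
```

Because only one clip is ever handed to `startPlayback` at a time, the Creative's reply waits until the Critic's clip has ended, and `currentSpeaker()` can drive which orb pulses.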

The flow:

User opens Google Doc → Extension auto-extracts content
User speaks → Audio chunks stream via WebSocket → Speech-to-text
Transcript + document context → LangGraph agents
Agents respond with audio → Streamed back to extension → Orbs animate
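
The flow above could be typed roughly like this. It is a sketch under assumptions: the message names, fields, and the `parseServerMessage` / `reduceOrbs` helpers are invented for illustration and are not the project's actual wire format.

```typescript
type AgentId = "critic" | "creative";

// Messages the extension might send to the server (assumed shapes).
type ClientMessage =
  | { type: "audio_chunk"; seq: number; audioBase64: string }
  | { type: "interrupt" };

// Messages the server might send back (assumed shapes).
type ServerMessage =
  | { type: "transcript"; text: string }
  | { type: "agent_audio"; agent: AgentId; audioBase64: string }
  | { type: "agent_done"; agent: AgentId };

type OrbState = "idle" | "speaking";
type Orbs = { critic: OrbState; creative: OrbState };

function encodeClientMessage(msg: ClientMessage): string {
  return JSON.stringify(msg);
}

function isAgentId(value: unknown): value is AgentId {
  return value === "critic" || value === "creative";
}

// Parse a raw WebSocket text frame; returns null for anything unrecognized.
function parseServerMessage(raw: string): ServerMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const fields = data as Record<string, unknown>;
  const kind = fields["type"];
  const text = fields["text"];
  const agent = fields["agent"];
  const audioBase64 = fields["audioBase64"];
  if (kind === "transcript" && typeof text === "string") {
    return { type: "transcript", text };
  }
  if (kind === "agent_audio" && isAgentId(agent) && typeof audioBase64 === "string") {
    return { type: "agent_audio", agent, audioBase64 };
  }
  if (kind === "agent_done" && isAgentId(agent)) {
    return { type: "agent_done", agent };
  }
  return null;
}

// Decide which orb animates based on the latest server message.
function reduceOrbs(orbs: Orbs, msg: ServerMessage): Orbs {
  const next: Orbs = { critic: orbs.critic, creative: orbs.creative };
  if (msg.type === "agent_audio") next[msg.agent] = "speaking";
  if (msg.type === "agent_done") next[msg.agent] = "idle";
  return next;
}
```

Keeping the orb state a pure function of incoming messages means the animation layer never has to guess who is talking.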

Challenges we ran into

Audio synchronization was brutal. When the user interrupts mid-response, you need to immediately stop playback, clear the audio queue, signal the server to halt generation, and resume listening. Getting this to feel snappy took many iterations.
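
The four interrupt steps can be sketched as below. `TurnController` and the `InterruptDeps` callbacks are hypothetical names. The turn counter is an extra assumption, not something described above: it is one common way to drop audio that was already in flight when the user started speaking.

```typescript
// Hooks into the rest of the app (all assumed for this sketch).
interface InterruptDeps {
  stopPlayback(): void;
  clearQueue(): void;
  sendInterrupt(interruptedTurn: number): void;
  resumeListening(): void;
}

class TurnController {
  private turn = 0;
  private deps: InterruptDeps;

  constructor(deps: InterruptDeps) {
    this.deps = deps;
  }

  currentTurn(): number {
    return this.turn;
  }

  // Call when the user starts speaking while the agents are responding.
  interrupt(): void {
    const interrupted = this.turn;
    this.turn += 1; // anything tagged with the old turn is now stale
    this.deps.stopPlayback(); // 1. stop what is playing right now
    this.deps.clearQueue(); // 2. drop clips that have not started yet
    this.deps.sendInterrupt(interrupted); // 3. tell the server to halt generation
    this.deps.resumeListening(); // 4. go back to capturing the user
  }

  // Audio chunks carry the turn they were generated for; stale ones are dropped.
  shouldPlay(chunkTurn: number): boolean {
    return chunkTurn === this.turn;
  }
}
```

Bumping the turn before anything else means a chunk that arrives between the interrupt and the server's acknowledgement fails `shouldPlay` instead of sneaking into the queue.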

Chrome extension permissions for microphone access are tricky. Extensions run in a weird sandboxed context, and getting real-time audio capture working reliably required diving deep into the Web Audio API.
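
One small piece of a capture path like this is converting the Web Audio API's float samples into something compact to stream. The sketch below assumes the server expects 16-bit PCM, which is not stated above; `floatTo16BitPCM` is a hypothetical helper name.

```typescript
// Web Audio delivers samples as floats in [-1, 1]; convert them to signed 16-bit PCM.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const sample = samples[i] ?? 0;
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, sample));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
  }
  return out;
}
```

The resulting `Int16Array` is half the size of the float buffer, which helps when chunks are sent over a WebSocket many times per second.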

Making two agents actually feel different instead of just "the same AI with different adjectives" required careful prompt engineering. They needed distinct personalities without becoming caricatures.

The orb animations needed to feel organic. Early versions looked robotic. We spent way too long tweaking cubic-bezier curves and shadow spreads to get that "breathing" feeling right.
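
For a feel of what a cubic-bezier curve does to the "breathing" motion, here is a small sketch that evaluates a CSS-style `cubic-bezier(x1, y1, x2, y2)` easing outside the browser. The `breathingScale` helper and its 8% growth are made-up values for illustration, not the curves actually shipped.

```typescript
// One coordinate of a cubic Bezier whose endpoints are fixed at 0 and 1.
function bezierCoord(t: number, p1: number, p2: number): number {
  const u = 1 - t;
  return 3 * u * u * t * p1 + 3 * u * t * t * p2 + t * t * t;
}

// Build an easing function shaped like CSS cubic-bezier(x1, y1, x2, y2).
// CSS requires x1 and x2 in [0, 1], so x(t) is monotonic and bisection is safe.
function cubicBezier(x1: number, y1: number, x2: number, y2: number): (progress: number) => number {
  return (progress: number): number => {
    if (progress <= 0) return 0;
    if (progress >= 1) return 1;
    let lo = 0;
    let hi = 1;
    // Find the curve parameter t whose x coordinate equals the progress.
    for (let i = 0; i < 40; i++) {
      const mid = (lo + hi) / 2;
      if (bezierCoord(mid, x1, x2) < progress) {
        lo = mid;
      } else {
        hi = mid;
      }
    }
    return bezierCoord((lo + hi) / 2, y1, y2);
  };
}

// Hypothetical breathing cycle: grow during the first half, shrink during the second.
function breathingScale(phase: number, ease: (progress: number) => number): number {
  const half = phase < 0.5 ? phase * 2 : (1 - phase) * 2;
  return 1 + 0.08 * ease(half);
}
```

With an ease-in-out curve such as `cubicBezier(0.42, 0, 0.58, 1)`, the orb lingers at its smallest and largest sizes and moves fastest in between, which tends to read as breathing rather than as a mechanical pulse.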

Accomplishments that we're proud of

The UI genuinely looks and feels polished. The orbs are satisfying to watch and interact with.

Interrupt handling works smoothly. Speak and the agents stop immediately. No awkward overlap.

The whole flow is seamless. Open a doc, click the extension, start talking. No manual extraction buttons, no setup.

We built a full real-time voice pipeline from scratch: capture → stream → transcribe → orchestrate → synthesize → play.

It actually works. Like, you can have a real conversation about your document.

What we learned

WebSocket state management is harder than it looks, especially with audio streaming
Chrome extension development has a lot of gotchas around permissions and context isolation
LangGraph is powerful for multi-agent systems but has a learning curve
GPT-4o-audio-preview is genuinely impressive for direct audio generation
CSS animations can achieve a lot without JavaScript if you're patient
"Just add another AI agent" sounds simple until you're debugging race conditions at 3am

What's next for DocTalk

More agent personalities: Add a Skeptic, an Optimist, a Devil's Advocate. Let users choose their debate panel. Maybe even let users describe the agent they want.
Document highlighting: When agents reference specific parts, highlight them in the actual doc.
Memory across sessions: Remember previous conversations about the same document.
Support for other formats: PDFs, Notion pages, Confluence docs, Slides, etc.
Mobile app: Because sometimes you want AI to read your documents while you're on the bus.

Built With

agentic
langchain
langgraph
typescript

Updates

Karthik Sankar started this project — Jan 17, 2026 08:18 PM EST
