Inspiration

The inspiration for Say Edit came from a simple frustration: editing tools are invisible walls between intent and result.

Whether you're a researcher buried in a 200-page legal brief trying to find one clause, or a designer who wants to change the color of a jacket in a photo — you already know exactly what you want. You just have to fight the UI to get it.

We were guided by a single question: what if the interface was just your voice?

Not push-to-talk. Not a chatbot you type at. A persistent, low-latency voice session that listens continuously, understands context, and acts the moment it understands your intent — just like a skilled human assistant would.

Say Edit is built on the belief that the best interface is no interface at all. Click where you want the change. Say what you want. Watch it happen.


What it does

Say Edit is a voice-first AI workspace that routes each request to one of two purpose-built environments: a Document Navigator and an Image Editor (whose Compose Studio mode handles multi-image compositing):

📄 Document Navigator (demo video)

Upload any PDF — a legal contract, research paper, textbook. Say Edit indexes it at sentence level with vector embeddings. Then you just ask questions:

"What are the termination clauses?" "Where does it mention liability?" "Summarize section 4.2."

The AI doesn't just answer — it navigates to the page and applies yellow highlights over the exact sentences it's referencing, in real time. You hear the answer and see it on the page simultaneously.

🖼️ Image Editor (demo video)

Upload any photo. Click anywhere on it to drop a crosshair hotspot. Then describe your change:

"Make the jacket leather." "Blur the background." "Add sunglasses."

The edit is applied, localized to the region you pointed at, and pushed onto a non-destructive history stack with full undo/redo and before/after compare.

⬡ Compose Studio

Load a second image and merge two photos by voice:

"Dress the person from A in the outfit from B." "Put the product on the background from B."

The composite is generated and added to your edit history, ready for further voice refinement.


How we built it

The Voice Loop (Gemini Live API)

The core of Say Edit is a persistent bi-directional WebSocket with the Gemini Live API. This is not push-to-talk — the model listens continuously and calls tools the moment it understands your intent. Audio is streamed at 16kHz PCM from the microphone in real time. If you speak while the AI is responding, it interrupts instantly and listens again.

Microphone → 16kHz PCM → Gemini Live WebSocket
       ↓
Tool call (search_document / edit_image_region / compose_images)
       ↓
Result returned → AI responds by voice + workspace updates
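In code, the capture side of that loop looks roughly like the sketch below. The ScriptProcessorNode capture and the 16kHz Float32 → Int16 → base64 conversion are as described above; the session object and the exact sendRealtimeInput payload shape are assumptions that may differ across Live API client versions.

// Sketch: mic → Float32 → Int16 PCM → base64 → Live session.
// startAudioStreaming is an illustrative name; the sendRealtimeInput payload
// shape is an assumption and may vary by SDK version.
async function startAudioStreaming(session: { sendRealtimeInput: (input: unknown) => void }) {
  const ctx = new AudioContext({ sampleRate: 16000 });
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);
  const proc = ctx.createScriptProcessor(4096, 1, 1);

  proc.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i])); // clamp to [-1, 1]
      int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // Float32 → 16-bit PCM
    }
    const base64 = btoa(String.fromCharCode(...new Uint8Array(int16.buffer)));
    session.sendRealtimeInput({ media: { data: base64, mimeType: "audio/pcm;rate=16000" } });
  };

  source.connect(proc);
  proc.connect(ctx.destination); // keeps the node processing; nothing is written to the output buffer
}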

Document Intelligence

PDFs are uploaded to Supabase Storage and processed by a NestJS backend. The ingestion pipeline uses pdfjs-dist to extract text at word level, groups words into lines by Y-coordinate proximity, and merges lines into sentence-level chunks with tight bounding boxes [x, y, width, height]. Each chunk is embedded with gemini-embedding-001 and stored in a pgvector index.
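A condensed sketch of that grouping step is below. The Word shape, the Y tolerance value, and the helper names are illustrative; only the line-by-Y-proximity grouping and the [x, y, width, height] bbox format come from the pipeline described above.

// Sketch: group word-level items into lines by Y proximity, then compute a tight bbox.
type Word = { text: string; x: number; y: number; width: number; height: number };

const LINE_Y_TOLERANCE = 2; // illustrative tolerance, in PDF units

function groupIntoLines(words: Word[]): Word[][] {
  const sorted = [...words].sort((a, b) => a.y - b.y || a.x - b.x);
  const lines: Word[][] = [];
  for (const w of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last[0].y - w.y) <= LINE_Y_TOLERANCE) last.push(w);
    else lines.push([w]);
  }
  return lines;
}

function bboxOf(words: Word[]): [number, number, number, number] {
  const x = Math.min(...words.map((w) => w.x));
  const y = Math.min(...words.map((w) => w.y));
  const right = Math.max(...words.map((w) => w.x + w.width));
  const bottom = Math.max(...words.map((w) => w.y + w.height));
  return [x, y, right - x, bottom - y]; // [x, y, width, height], top-left origin
}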

When the Live session calls search_document, the query is embedded and a cosine similarity search (match_chunks_by_document RPC) returns the top-N sentence chunks with their page numbers and bounding boxes. The frontend then calls focus_document_section using those exact bboxes to render highlights on the PDF canvas.
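The backend side of that search is roughly the following sketch. Only the match_chunks_by_document RPC name and the gemini-embedding-001 model come from the write-up; embedQuery and the RPC parameter names are illustrative assumptions.

// Sketch of the search_document backend path: embed the query, then run the
// pgvector cosine-similarity RPC. embedQuery is an illustrative helper wrapping
// gemini-embedding-001; parameter names are assumptions.
const queryEmbedding: number[] = await embedQuery(query);

const { data: chunks, error } = await supabase.rpc("match_chunks_by_document", {
  document_id: documentId,
  query_embedding: queryEmbedding,
  match_count: 5, // top-N sentence chunks
});
if (error) throw error;
// Each chunk carries its page number and [x, y, width, height] bbox,
// which the frontend feeds into focus_document_section to render highlights.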

Image Pipeline

Image editing happens entirely client-side. The current image is encoded as base64 and sent alongside the edit prompt and hotspot coordinates to gemini-2.0-flash-exp-image-generation. The model returns a localized edit. The new image is added to a stale-closure-safe history stack (using refs to prevent stale state in async tool callbacks).
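In outline, the edit call looks like the sketch below. The model name, hotspot inputs, and ref-based history read come from the description above; the generateContent plumbing, the ai client, and the pushToHistory helper are hedged assumptions rather than the exact project code.

// Sketch of edit_image_region's client-side work: current frame + prompt + hotspot
// go to the image-generation model; the returned image becomes the next history frame.
async function editImageRegion(editPrompt: string, hotspotX: number, hotspotY: number) {
  const current = historyRef.current[historyIndexRef.current]; // stale-closure-safe read
  const response = await ai.models.generateContent({
    model: "gemini-2.0-flash-exp-image-generation",
    contents: [{
      role: "user",
      parts: [
        { inlineData: { mimeType: "image/png", data: current } }, // base64-encoded current image
        { text: `Apply this edit around pixel (${hotspotX}, ${hotspotY}): ${editPrompt}` },
      ],
    }],
  });

  const imagePart = response.candidates?.[0]?.content?.parts?.find((p: any) => p.inlineData);
  if (imagePart?.inlineData) pushToHistory(imagePart.inlineData.data); // new non-destructive history frame
}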

Tech Stack

  • Frontend: React + Vite + TypeScript, @react-pdf-viewer with custom highlight plugin
  • Backend: NestJS, pdfjs-dist (Node, with canvas mock), Supabase pgvector
  • AI: Gemini Live API (voice), Gemini image generation (edits/composites), gemini-embedding-001 (semantic search)
  • Infra: Google Cloud Run (frontend + backend), Supabase (storage + vector DB)

Gemini Live API Usage Map

| Feature | Live API Capability | Tool Name | What It Does | Code Reference |
| --- | --- | --- | --- | --- |
| Document Search | Function Calling | search_document | The AI calls this when the user asks a question. Hits the pgvector backend and returns the top-N sentence chunks with bboxes | GlyphWorkspace.tsx (onmessage tool handler) |
| Spatial Navigation | Function Calling | focus_document_section | Receives page + rects[][] from the search results. The frontend jumps the PDF viewer to the page and renders yellow highlights at exact coordinates | GlyphWorkspace.tsx (focus_document_section handler) |
| Image Region Edit | Function Calling | edit_image_region | The AI calls this after the user clicks a hotspot and speaks an edit. Receives edit_prompt, hotspot_x, hotspot_y. The frontend sends the image + coords to the Gemini image model | Glyphimageworkspace.tsx (handleEditImageRegion) |
| Hotspot Query | Function Calling | get_current_hotspot | The AI can query the currently selected pixel coordinates before triggering an edit, enabling natural speech like "edit this area" without restating coordinates | Glyphimageworkspace.tsx (get_current_hotspot handler) |
| Image Compositing | Function Calling | compose_images | The AI calls this when the user asks to merge two photos by voice. Passes the composition prompt; the frontend combines Image A + B via Gemini image generation | Glyphimageworkspace.tsx (handleComposeVoice) |
| Interruption | serverContent.interrupted | — | When the user speaks mid-response, the Live API sends an interrupt signal. All queued audio is cleared instantly and the model re-listens | Both workspaces (clearAudioQueue) |
| Live Transcription | outputAudioTranscription | — | A real-time transcript of everything the AI says streams into the Neural Transcript panel with a blinking cursor | Both workspaces (outputTranscription handler) |
| Audio Streaming | sendRealtimeInput | — | 16kHz PCM audio is captured via ScriptProcessorNode, converted Float32 → Int16 → base64, and streamed continuously to the Live session | Both workspaces (proc.onaudioprocess) |
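All of the Function Calling rows above funnel through a single dispatch point inside the onmessage handler, roughly as sketched below. The message field names and sendToolResponse format follow the Gemini Live JS client and are assumptions; handler names match the table, except handleSearchDocument, focusDocumentSection, and getCurrentHotspot, which are illustrative. Each workspace registers only its own subset of handlers.

// Sketch of the onmessage tool-call dispatch shared by both workspaces.
session.onmessage = async (message: any) => {
  for (const call of message.toolCall?.functionCalls ?? []) {
    let result: unknown;
    switch (call.name) {
      case "search_document":        result = await handleSearchDocument(call.args); break;
      case "focus_document_section": result = focusDocumentSection(call.args); break;
      case "edit_image_region":      result = await handleEditImageRegion(call.args); break;
      case "get_current_hotspot":    result = getCurrentHotspot(); break;
      case "compose_images":         result = await handleComposeVoice(call.args); break;
    }
    // Return the result so the model can speak about what just happened in the workspace.
    session.sendToolResponse({
      functionResponses: [{ id: call.id, name: call.name, response: { result } }],
    });
  }
};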

Model used: gemini-2.5-flash-native-audio-preview-12-2025 — chosen for native audio modality (no separate STT pipeline) and sub-300ms response latency.


Challenges we ran into

1. Stale Closures in Async Tool Callbacks

The Live API's onmessage callback is set up once at session start. React state inside that closure becomes stale immediately — meaning history[historyIndex] would always point to the original image, regardless of how many edits had been applied.

The fix: We maintain historyRef and historyIndexRef that are updated synchronously alongside state. All async tool handlers read from refs, not state, guaranteeing they always operate on the current image.

// Stale-closure-safe: always reads the live value
const imageFile = historyRef.current[historyIndexRef.current];
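The write side of the same pattern, sketched below. pushToHistory is an illustrative name; the key point is that the ref is updated in the same tick as the state setter, inside the workspace component.

// Sketch: refs mirror state so async tool callbacks never see a stale history.
const [history, setHistory] = useState<string[]>([]);
const [historyIndex, setHistoryIndex] = useState(0);
const historyRef = useRef<string[]>([]);
const historyIndexRef = useRef(0);

const pushToHistory = (image: string) => {
  const next = [...historyRef.current.slice(0, historyIndexRef.current + 1), image];
  historyRef.current = next;                 // updated synchronously: async callbacks see it immediately
  historyIndexRef.current = next.length - 1;
  setHistory(next);                          // state still drives rendering, undo/redo UI
  setHistoryIndex(next.length - 1);
};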

2. PDF Bounding Box Coordinate Mismatch

PDF coordinate systems use bottom-left origin. The @react-pdf-viewer canvas uses top-left origin. Our initial highlights were appearing at mirrored positions — a sentence on line 3 would highlight at the bottom of the page.

The fix: During ingestion, we convert every word's Y coordinate: y = pageHeight - transform[5] - height. The stored bboxes are in top-left origin from day one, so the frontend highlight math is clean.
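As a sketch, where PdfTextItem mirrors the pdfjs-dist text item shape:

// Sketch: flip a pdfjs-dist text item from PDF bottom-left origin to canvas top-left origin.
type PdfTextItem = { str: string; transform: number[]; width: number; height: number };

function toTopLeftBbox(item: PdfTextItem, pageHeight: number) {
  const x = item.transform[4];       // translateX
  const yBottom = item.transform[5]; // translateY, measured from the page bottom
  return {
    x,
    y: pageHeight - yBottom - item.height, // top-left origin, matching the frontend canvas
    width: item.width,
    height: item.height,
  };
}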

3. The Canvas Module Crash in Node

pdfjs-dist optionally requires the canvas npm package for rendering. In our NestJS Cloud Run environment, canvas isn't installed and the require throws, crashing the server before it even starts.

The fix: We intercept require('canvas') at module load time using Node's Module._load hook and return an empty stub. The hook is applied before pdfjs loads and immediately removed after, leaving the rest of the module system untouched.
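A minimal sketch of that hook. Module._load is a Node-internal API, and the pdfjs import path shown is illustrative; it depends on the pdfjs-dist version.

// Sketch of the canvas stub applied around the pdfjs-dist load.
import Module from "module";

const originalLoad = (Module as any)._load;
(Module as any)._load = function (request: string, ...rest: unknown[]) {
  if (request === "canvas") return {}; // empty stub: the optional require no longer throws
  return originalLoad.apply(this, [request, ...rest]);
};

// Load pdfjs while the hook is active...
const pdfjs = require("pdfjs-dist/legacy/build/pdf.js");

// ...then restore the loader immediately, leaving the rest of the module system untouched.
(Module as any)._load = originalLoad;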

4. Live API Connection Timing

Sending tool messages or audio immediately on onopen causes a race condition — the WebSocket is open but the model isn't ready to receive. This resulted in dropped greetings and occasional 400 errors.

The fix: A 600ms stabilization delay after onopen before enabling audio streaming and sending the initial greeting. Crude but reliable across network conditions.
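As a sketch, with startAudioStreaming and sendInitialGreeting as illustrative names for the steps above:

// Sketch of the stabilization delay after the Live WebSocket opens.
const STABILIZATION_DELAY_MS = 600;

const onopen = () => {
  setTimeout(() => {
    startAudioStreaming();  // begin 16kHz PCM streaming only once the model is ready
    sendInitialGreeting();  // no more dropped greetings or 400s from racing the handshake
  }, STABILIZATION_DELAY_MS);
};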


Accomplishments that we're proud of

Spatial grounding at sentence level. Most RAG systems return text. Ours returns text and tells you exactly where on the page it lives, with pixel-accurate highlights. The combination of hearing the answer and seeing it highlighted in real time creates a genuinely new kind of document interaction.

Zero-latency interruption. When you speak while the AI is answering, it stops instantly. No wait, no overlap. This was only possible because of the Gemini Live API's native interruption signal — we'd never be able to replicate it with a request/response architecture.

Non-destructive image editing by voice. Every edit is pushed onto a history stack. Undo, redo, and before/after compare all work seamlessly. The session always operates on whatever the current history frame is, even after many sequential edits.

A single voice session that spans two modes. The image workspace handles both localized edits and full compositing through the same Live session, switching modes fluidly as the user changes intent.


What we learned

Tool design is UX design. The get_current_hotspot tool exists entirely to let users say "edit here" without re-stating coordinates. Getting the tool schema right — what the model calls, when, and with what arguments — is as important as the UI itself.
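For example, the hotspot pair can be expressed as function declarations roughly like the sketch below. The declaration format follows Gemini function calling; the descriptions are paraphrased for illustration rather than copied from our schema.

// Sketch of the tool schema that makes "edit this area" work.
const toolDeclarations = [
  {
    name: "get_current_hotspot",
    description: "Returns the pixel coordinates the user last clicked on the image.",
    parameters: { type: "object", properties: {} },
  },
  {
    name: "edit_image_region",
    description: "Applies a localized edit to the image around the given hotspot.",
    parameters: {
      type: "object",
      properties: {
        edit_prompt: { type: "string", description: "What to change, e.g. 'make the jacket leather'" },
        hotspot_x: { type: "number" },
        hotspot_y: { type: "number" },
      },
      required: ["edit_prompt", "hotspot_x", "hotspot_y"],
    },
  },
];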

Refs are the solution to stale closures in Live API callbacks. Any async callback registered at session start will capture stale React state. The pattern of maintaining a ref that mirrors every piece of state you need in a callback is now a fixture of our architecture.

PDF text extraction is harder than it looks. pdfjs-dist gives you individual text items with transform matrices, not sentences or lines. Grouping words into lines by Y tolerance, then lines into sentences by punctuation, then computing tight bounding boxes from word-level coordinates — each step has edge cases (rotated text, multi-column layouts, headers).

The Live API's interruption model changes everything. Building on a WebSocket that sends interrupted signals rather than a request/response model shifts how you think about the entire interaction design. The user is always in control. That's the right default.


What's next for Say Edit

Collaborative sessions — Share a document workspace with a second user; both hear the same AI and see the same highlights in real time.


Built With

  • geminiliveapi
  • nest
  • vite