Inspiration
I am a firm believer that recent advancements in AI will make voice a more dominant interface for interacting with browsers, applications, and computers in general.
I am also a bit lazy when it comes to typing. It’s common for me to write a rough draft, ask ChatGPT or Gemini to refine it, and then paste it back into Gmail. That round trip feels very inefficient, so I wanted an in-browser solution that would fix it for me.
What it does
The extension offers two ways to write with voice, Transcribe and Compose, and both are available from the side panel or as a floating widget.
Transcribe mode uses the browser’s Web Speech API to capture audio, show a live transcript, and let you edit it directly. Clicking Refine sends the text to the Prompt API with our refinement prompt; choosing a preset and pressing Polish calls the Rewriter API.
Compose mode streams your audio straight to the Prompt API to generate the draft. The live transcript is for your reference, but if you edit it and press Compose again we rerun the Prompt API in text mode with your changes.
In both modes you can apply the style presets or extra context to fine-tune the AI response.
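As a rough illustration of the live transcript in Transcribe mode, the sketch below folds speech results into a committed part and an interim part. `ResultLike`, `LiveTranscript`, and `foldTranscript` are hypothetical names, not taken from the extension. The input shape mirrors the Web Speech API's `SpeechRecognitionEvent.results`, where each result has an `isFinal` flag and its first alternative carries the `transcript`.

```typescript
// Minimal structural stand-in for one SpeechRecognitionResult.
// Assumption: only the top alternative of each result is used.
interface ResultLike {
  isFinal: boolean;
  transcript: string;
}

interface LiveTranscript {
  committed: string; // text from final results
  interim: string; // text the recognizer may still revise
}

// Fold the current result list into committed and interim text.
function foldTranscript(results: readonly ResultLike[]): LiveTranscript {
  const committed: string[] = [];
  const interim: string[] = [];
  for (const result of results) {
    const text = result.transcript.trim();
    if (text.length === 0) continue; // skip empty or whitespace-only results
    (result.isFinal ? committed : interim).push(text);
  }
  return { committed: committed.join(" "), interim: interim.join(" ") };
}

// In a browser, the input could be built inside a recognition.onresult handler:
//   const items = Array.from(event.results, (r) => ({
//     isFinal: r.isFinal,
//     transcript: r[0].transcript,
//   }));
```

Keeping the two parts separate is one way to let the UI render interim text differently (for example, greyed out) until the recognizer finalizes it.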
How we built it
- Structured Prompt API responses. Every compose/refine call goes through runComposePrompt with a JSON response schema. The API returns subject, paragraphs, and content, and we normalize it with coerceComposeDraft. That’s why we can insert the subject line straight into Gmail’s subject box and the body into the editor instead of dumping raw text.
- Direct‑insert DOM helper. We ship a content script (src/content/directInsert.ts) that watches focus/selection inside Gmail. It remembers the caret, figures out whether the user is in the subject or body, and exposes handlers (applyDraft, applyTranscript) so the background worker can say “focus, paste, set caret.” Caret restoration is an explicit message (ekko/direct-insert/restore) so even the side panel—which lives outside the page—can hand text back to the exact spot the user clicked.
- Side panel / page messaging. The side panel talks to the background script via chrome.runtime.sendMessage, and the background fans that out to the right frame (tracked in directInsertFrameMap). That’s how we can stream transcripts, insert drafts, and keep the DOM helper in sync whether the UI comes from the widget or the panel.
- Audio vs. text compose. Compose mode streams audio directly to the Prompt API (composeFromAudio), but we also run text-only compose (composeFromText) when the user edits the live transcript or types instructions without recording. The code applies the same set of style presets to both the audio and text paths.
- Web Speech + Prompt API layering. Transcribe mode uses the browser’s Web Speech API for live text, and only calls the Prompt API when you click Refine/Polish. Compose mode flips that: audio goes straight to the Prompt API, but the live transcript is editable and can drive another compose pass if you hit Compose again.
- Codex. We used Codex for the majority of the implementation.
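To make the structured-response step more concrete, here is a minimal sketch of what normalizing a response with subject, paragraphs, and content fields might look like. The function name `coerceComposeDraftSketch` and its fallback rules are assumptions for illustration; the extension's actual coerceComposeDraft may behave differently.

```typescript
interface ComposeDraft {
  subject: string;
  paragraphs: string[];
  content: string;
}

// Hypothetical normalizer: accepts the raw model output and always returns
// a well-formed draft, even when fields are missing or the output is not JSON.
function coerceComposeDraftSketch(raw: string): ComposeDraft {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    parsed = raw; // not JSON: treat the whole response as body text
  }
  if (typeof parsed === "string") {
    parsed = { content: parsed };
  }
  const obj: Record<string, unknown> =
    typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)
      ? (parsed as Record<string, unknown>)
      : {};

  const rawSubject: unknown = obj["subject"];
  const rawContent: unknown = obj["content"];
  const rawParagraphs: unknown = obj["paragraphs"];

  const subject = typeof rawSubject === "string" ? rawSubject.trim() : "";
  const content = typeof rawContent === "string" ? rawContent.trim() : "";

  // Keep only non-empty string paragraphs.
  let paragraphs: string[] = Array.isArray(rawParagraphs)
    ? rawParagraphs
        .filter((p): p is string => typeof p === "string")
        .map((p) => p.trim())
        .filter((p) => p.length > 0)
    : [];

  // If the model sent only content, derive paragraphs from blank lines.
  if (paragraphs.length === 0 && content.length > 0) {
    paragraphs = content
      .split(/\n{2,}/)
      .map((p) => p.trim())
      .filter((p) => p.length > 0);
  }

  return {
    subject,
    paragraphs,
    content: content.length > 0 ? content : paragraphs.join("\n\n"),
  };
}
```

With a shape like this, the subject can go to the subject field and the paragraphs to the editor without any string parsing at insert time.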
Challenges we ran into
Keeping the field in sync across two surfaces: the side panel and the floating widget both stream transcripts, so we built a shared “direct insert” bridge that toggles Chrome’s direct-insert mode, mirrors text via runtime messaging, and knows when to back off to clipboard copy. Getting that handshake right (and avoiding loops when direct insert auto-prefills) took a few iterations.
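The handshake described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the extension's code: `InsertBridgeSketch`, `BridgeDeps`, `sendToFrame`, and `copyToClipboard` are hypothetical names, and the transport is injected so the logic can run outside a browser. It shows three ideas: routing to a tracked frame, falling back to the clipboard when no frame accepts the text, and remembering the last inserted text so an echoed change does not trigger another insert.

```typescript
type Delivery = "direct" | "clipboard" | "skipped";

interface BridgeDeps {
  // Hypothetical transport; in an extension this might wrap chrome.tabs.sendMessage.
  // Returns true when the frame accepted the text.
  sendToFrame: (tabId: number, frameId: number, text: string) => boolean;
  // Hypothetical clipboard fallback.
  copyToClipboard: (text: string) => void;
}

class InsertBridgeSketch {
  private readonly deps: BridgeDeps;
  private readonly frameByTab = new Map<number, number>();
  private readonly lastSent = new Map<number, string>();

  constructor(deps: BridgeDeps) {
    this.deps = deps;
  }

  // Called when a content script reports which frame holds the focused field.
  registerFrame(tabId: number, frameId: number): void {
    this.frameByTab.set(tabId, frameId);
  }

  deliver(tabId: number, text: string): Delivery {
    // Loop guard: do not re-send text we already inserted in this tab.
    if (this.lastSent.get(tabId) === text) return "skipped";

    const frameId = this.frameByTab.get(tabId);
    if (frameId !== undefined && this.deps.sendToFrame(tabId, frameId, text)) {
      this.lastSent.set(tabId, text);
      return "direct";
    }

    // No known frame, or the frame refused: back off to clipboard copy.
    this.deps.copyToClipboard(text);
    return "clipboard";
  }

  // True when a change reported by the page is just our own insert echoed back.
  isEcho(tabId: number, text: string): boolean {
    return this.lastSent.get(tabId) === text;
  }
}
```

Both the side panel and the floating widget could call the same `deliver` method, which keeps the fallback and loop-avoidance rules in one place.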
Accomplishments that we're proud of
I asked a friend to test the Chrome extension, and he said he’d use it: he found it useful and thought the design was solid. The feature I think takes the extension to the next level is Gmail auto-fill, which fills in not only the body but also the subject field. This saves a lot of time because you don’t need to come up with a subject line for your email!
What we learned
From a UI/UX perspective, I learned that displaying the live transcript makes the experience much better: it gives users immediate feedback while they talk, so they are not left waiting and feel that they are speaking into something. From an implementation perspective, prompting is also important. For example, in Compose mode I make sure that AI responses are written in the user’s voice, not from an assistant’s perspective, so that users can insert them directly into their text field.
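As an illustration of the prompting point, the sketch below assembles compose instructions that ask for a first-person draft and then appends an optional style preset and extra context. The function name, the option names, and the exact instruction wording are all made up for this example; they are not the prompts the extension ships.

```typescript
interface ComposeOptions {
  preset?: string; // e.g. "formal"; preset names here are hypothetical
  extraContext?: string; // free-form notes supplied by the user
}

// Build the instruction text that would accompany the audio or transcript.
function buildComposeInstructions(options: ComposeOptions = {}): string {
  const lines: string[] = [
    "Write the message as the user, in the first person.",
    "Do not address the user or comment on the draft.",
  ];
  const preset = options.preset?.trim();
  if (preset) lines.push(`Style: ${preset}.`);
  const context = options.extraContext?.trim();
  if (context) lines.push(`Extra context: ${context}`);
  return lines.join("\n");
}
```

Because the same builder can serve both the audio and the text compose paths, a preset chosen in the UI would shape the draft the same way regardless of how the input arrived.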
What's next for Echo: Write with Voice
I hope to eventually publish the extension on the Chrome Web Store so users can download it and I can get real user feedback.
Built With
- codex
- javascript
- react
