MouthPiece

Where the idea came from

We started thinking about a screen-aware audio engine that could narrate anything on your screen with cinematic flair. After about ten minutes of arguing about Electron versus Chrome extensions, we cut it down to something we could actually ship in three hours: a browser extension that doesn't just read text aloud, but casts the voice based on what the text is.

Every TTS extension out there reads a news article and a horror story in the exact same flat voice. That bothered us. A news article should sound like a broadcast. A Poe story should sound like something is whispering behind you. Why doesn't anything do that?

So we built Mouthpiece.

How it works

You hit Cmd+Shift+P on any webpage and click on the first word you want read. From there:

We pull the article text out of the page using Mozilla's Readability library, the same one Firefox Reader Mode uses.
We send the first chunk to Claude Sonnet along with a curated catalog of ElevenLabs voices, and Sonnet casts it. Picks a narrator. Picks a distinct voice for each quoted speaker in the article. Like a director assembling a cast.
As each chunk of text plays, Claude Haiku tags which segments are narration and which are quoted dialogue from named characters. The player swaps voices on the fly.
If you click a language flag, Sonnet translates the page into Hindi or Spanish, swaps the on-page text live, and the audio continues in the new language using ElevenLabs' multilingual model.
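
The on-the-fly voice swap described above can be sketched as a lookup from tagged segments to cast voices. This is a minimal sketch, assuming each tagged segment looks like `{ type, speaker, text }`; the field names, the cast shape, and the voice IDs are illustrative assumptions, not the extension's real data format.

```javascript
// Hypothetical cast, shaped the way a casting step might return it.
const cast = {
  narrator: "voice_narrator_01", // placeholder voice IDs, not real ones
  speakers: { Ada: "voice_ada_02", Grace: "voice_grace_03" },
};

// Pick the voice for one tagged segment. Dialogue from a speaker who was
// never cast falls back to the narrator voice.
function voiceForSegment(segment, cast) {
  const isCastSpeaker =
    segment.type === "dialogue" &&
    Object.prototype.hasOwnProperty.call(cast.speakers, segment.speaker);
  return isCastSpeaker ? cast.speakers[segment.speaker] : cast.narrator;
}

// Turn a list of tagged segments into (text, voiceId) pairs for the player.
function planPlayback(segments, cast) {
  return segments.map((segment) => ({
    text: segment.text,
    voiceId: voiceForSegment(segment, cast),
  }));
}
```

The narrator fallback in `voiceForSegment` means an uncast speaker still gets read, just not in a voice of their own.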

The whole thing streams. You click a word and you hear voice within about a second, while the rest of the document is still being processed in the background.

What we learned

The biggest lesson was about pipelining versus throttling. Our first working version fired every TTS request as fast as it could. That blew past ElevenLabs' concurrent request limit, triggered 429 errors, and left partial audio playing as garbage. It sounded like screaming. We had to add a semaphore that caps parallel TTS calls at five, plus an AbortController so that pausing actually cancels in-flight requests instead of just silencing the speakers while credits keep burning.
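
A minimal sketch of that throttle, assuming a small hand-rolled counting semaphore. The names `Semaphore`, `ttsSlots`, and `synthesize` are hypothetical, and `requestAudio(text, signal)` stands in for whatever function actually calls the TTS API (for example a `fetch` that is passed `{ signal }`).

```javascript
// Counting semaphore: at most `limit` tasks hold a slot; the rest wait in FIFO order.
class Semaphore {
  constructor(limit) {
    this.limit = limit;
    this.active = 0;
    this.waiters = [];
  }

  async acquire() {
    if (this.active < this.limit) {
      this.active += 1;
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // pass the slot straight to the next waiter
    } else {
      this.active -= 1;
    }
  }
}

const ttsSlots = new Semaphore(5); // cap parallel TTS calls at five

// Run one TTS request under the cap. `requestAudio` is a hypothetical stand-in.
async function synthesize(text, requestAudio, signal) {
  await ttsSlots.acquire();
  try {
    // Skip requests that were cancelled while they waited for a slot.
    if (signal.aborted) throw new Error("TTS request aborted");
    return await requestAudio(text, signal);
  } finally {
    ttsSlots.release();
  }
}
```

With this shape, every chunk can be queued at once using a single `AbortController`'s signal, and calling `abort()` on pause both cancels in-flight requests (if `requestAudio` honors the signal) and stops queued ones from starting.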

We also learned how to design a clean handoff between two people working in parallel. We wrote a "contract" file at minute five with five function signatures each direction, agreed never to change them, then built mocks for both sides so we could work without blocking each other. We didn't integrate until we each had something working against the mock. That single decision saved the build.

What we built it with

Everything is plain JavaScript. No build step, no bundler, no framework. The Chrome extension is Manifest V3, content scripts plus a background service worker. All the API calls relay through the worker because content scripts hit CORS walls when calling Anthropic or ElevenLabs directly. We used Sonnet for the higher-judgment work (casting, translation) and Haiku for the volume work (per-chunk dialogue parsing), which kept the cost profile reasonable. Audio playback is straight Web Audio API with an AnalyserNode feeding the live waveform meter in the floating widget.
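
The worker relay can be sketched roughly as follows. This is an illustration under assumptions, not the extension's actual code: the `relay-fetch` message type, the message fields, and the `handleRelay` name are all hypothetical. It returns the response body as text; audio bytes would need their own encoding, since extension messages have to be JSON-serializable.

```javascript
// Runs in the background service worker. `fetchImpl` is injectable so the
// handler can be exercised outside the browser.
async function handleRelay(message, fetchImpl = globalThis.fetch) {
  if (message.type !== "relay-fetch") {
    return { ok: false, error: "unknown message type" };
  }
  try {
    const response = await fetchImpl(message.url, {
      method: message.method || "POST",
      headers: message.headers || {},
      body: message.body,
    });
    return { ok: response.ok, status: response.status, body: await response.text() };
  } catch (err) {
    // Network failures come back as data so the content script can react.
    return { ok: false, error: String(err) };
  }
}

// Register the listener only where the extension APIs exist.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
    handleRelay(message).then(sendResponse);
    return true; // keep the message channel open for the async response
  });
}
```

On the content-script side, the matching call would be something like `await chrome.runtime.sendMessage({ type: "relay-fetch", url, headers, body })`.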

What was hard

Three things really fought us:

Translation sync. The translation feature looks easy on paper but the on-page text swap has to preserve sentence indices across translation, otherwise the karaoke highlighting goes out of sync with the audio. We solved it by sending Sonnet numbered sentences and asking for numbered translations back.
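
A minimal sketch of the numbering round trip, assuming the model replies with one `N. text` line per sentence. The function names and the exact reply format are assumptions for illustration; a real reply may need more forgiving parsing.

```javascript
// Build the numbered block that goes into the translation prompt.
function numberSentences(sentences) {
  return sentences.map((sentence, i) => `${i + 1}. ${sentence}`).join("\n");
}

// Parse a numbered reply back into an array aligned with the original indices.
// Returns null if any index is missing, so the caller can retry instead of
// letting the highlighting drift out of sync.
function parseNumbered(reply, expectedCount) {
  const out = new Array(expectedCount).fill(null);
  for (const line of reply.split("\n")) {
    const match = line.match(/^\s*(\d+)\.\s+(.*\S)\s*$/);
    if (!match) continue; // ignore blank lines and stray commentary
    const index = Number(match[1]) - 1;
    if (index >= 0 && index < expectedCount) out[index] = match[2];
  }
  return out.every((sentence) => sentence !== null) ? out : null;
}
```

Because each translated sentence lands at the index it was sent with, sentence `i` in the highlight logic still lines up with sentence `i` in the audio, even if the reply arrives out of order.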

Pause and resume. The audio queue and pause/resume took two passes to get right. Our first version paused the audio context but didn't stop the upstream pipeline, so paused playback was still burning credits in the background. The fix was three pieces of state kept in sync: a playing boolean, a pipelineActive flag, and the AbortController.
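
One way those three pieces could fit together, as a sketch rather than the extension's actual code. The `PlaybackController` class, `runPipeline`, and `processChunk(chunk, signal)` are hypothetical names; only `playing`, `pipelineActive`, and the use of an AbortController come from the description above.

```javascript
class PlaybackController {
  constructor() {
    this.playing = false;        // is audio supposed to be audible?
    this.pipelineActive = false; // may the upstream fetch/TTS loop keep going?
    this.abortController = null;
  }

  // Begin (or restart) playback and return a fresh signal for this run.
  start() {
    this.abortController = new AbortController();
    this.playing = true;
    this.pipelineActive = true;
    return this.abortController.signal;
  }

  // Pause silences playback AND stops upstream work.
  pause() {
    this.playing = false;
    this.pipelineActive = false;
    if (this.abortController) this.abortController.abort();
  }

  resume() {
    return this.start();
  }
}

// Process chunks in order until paused. Returns the index to resume from.
async function runPipeline(chunks, controller, signal, processChunk, startIndex = 0) {
  let i = startIndex;
  for (; i < chunks.length; i++) {
    // The signal check also stops a stale loop after a later resume().
    if (!controller.pipelineActive || signal.aborted) break;
    try {
      await processChunk(chunks[i], signal);
    } catch (err) {
      if (signal.aborted) break; // cancelled mid-chunk: redo it on resume
      throw err;
    }
  }
  return i;
}
```

Resuming hands out a new signal instead of reusing the aborted one, so requests cancelled by the pause stay cancelled and the pipeline picks up from the returned index.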

The widget styling. We wanted something that looked like a real product, not a hackathon prototype. The opal-chrome border is a CSS conic gradient that slowly rotates through purple, blue, green, gold, and pink, masked into a one-pixel ring. No JavaScript animation, just one keyframe. It makes the whole thing feel alive.

What's next

The cast is fixed once at the start of a document, so a character introduced halfway through gets read in the narrator's voice. We'd want to do dynamic mid-document recasting. We'd also want more languages, voice cloning so you can hear articles in your own voice, and an ambient soundscape layer that adds context-appropriate background audio under the narration.