Tamewave

Inspiration

I wanted to watch a movie with my mom last month. She'd enjoy it, but it had a lot of profanity and I'm not comfortable sitting through that next to her, so we picked something else. This kept happening. I started wishing Netflix had a small button that just cleaned this up — not a bleep, not a mute, not a stranger's voice doing the line. Just the same actor saying a softer word.

The existing options all leak. A bleep tells you something was there — kids notice that, and curious kids look it up. A mute leaves a hole that pulls you out of the scene. A generic TTS replacement breaks character entirely.

Tamewave is that button. Drop a video in, get the same video back with every swear replaced in the original speaker's own voice and emotional register. Same intensity, same delivery, different word.

What it does

You hand it a video. It separates speech from background, transcribes with per-speaker diarization, picks a natural replacement for every profanity in context, clones each speaker from the clip itself, resynthesizes only the replaced word with the surrounding sentence as prosody context, and splices it back at the quietest point of the inter-word gap. Output is cleaned_<original name> next to the input. The video itself never leaves your machine.

This is an early version with rough edges. Most cuts are clean, but speaker cloning is hit-or-miss when the clip has only 20–30 s of a given voice, and tight rapid-fire deliveries don't leave enough silence for the cut to reach safely past the consonant burst. The second case should be solvable with phrase-level regeneration that cuts at the quietest nearby points instead. Both are tractable; see "What's next."

How we built it

The interesting work isn't the bag of cloud services you wire together — that's commodity. The interesting work is what sits between them so the output doesn't sound spliced.

The seam. Word-level ASR boundaries are loose by tens of milliseconds, so cutting on raw timestamps leaves a fricative onset surviving into the replacement. The cut is a chain: search the inter-word gap for the quietest RMS point, snap to a zero crossing, pad margins with extracted room tone, and splice with a smoothstep crossfade (3t² − 2t³) whose zero-derivative endpoints attenuate residual energy ~5× more than equal-power at the same fade length.
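The cut chain described above can be sketched roughly as follows. This is a minimal illustration using NumPy, not the project's actual code: the function names, the 5 ms RMS window, and the 64-sample snap radius are all assumptions, and the room-tone padding step is left out.

```python
import numpy as np

def quietest_point(audio, sr, gap_start, gap_end, win_ms=5.0):
    """Sample index at the centre of the lowest-energy window in the gap (seconds)."""
    lo, hi = int(gap_start * sr), int(gap_end * sr)
    win = max(1, int(sr * win_ms / 1000))  # window length is an assumed value
    if hi - lo <= win:
        return (lo + hi) // 2
    seg = audio[lo:hi].astype(np.float64)
    # Sliding mean of squared samples, computed with a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(seg ** 2)))
    energy = (csum[win:] - csum[:-win]) / win
    return lo + int(np.argmin(energy)) + win // 2

def snap_to_zero_crossing(audio, idx, max_shift=64):
    """Move idx to the nearest sign change within max_shift samples, if there is one."""
    lo = max(1, idx - max_shift)
    hi = min(len(audio), idx + max_shift)
    signs = np.signbit(audio[lo - 1:hi])
    # Index of the first sample after each sign change.
    crossings = np.nonzero(signs[1:] != signs[:-1])[0] + lo
    if crossings.size == 0:
        return idx
    return int(crossings[np.argmin(np.abs(crossings - idx))])

def smoothstep_crossfade(a, b, fade_len):
    """Join a and b, overlapping fade_len (> 0) samples with a 3t^2 - 2t^3 weight."""
    t = np.linspace(0.0, 1.0, fade_len)
    w = 3 * t ** 2 - 2 * t ** 3  # slope is zero at both ends of the fade
    mixed = a[-fade_len:] * (1.0 - w) + b[:fade_len] * w
    return np.concatenate((a[:-fade_len], mixed, b[fade_len:]))
```

The ASR timestamps only bound the search: `quietest_point` picks the cut inside the inter-word gap, `snap_to_zero_crossing` nudges it so the splice starts near zero amplitude, and the crossfade hides what is left.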

Detection chain. A wordlist pre-filter bounds cost (it scales with swear count, not clip length). Candidates go to an LLM with the full containing sentence. A validator enforces severity / inflection / syllable constraints. A retry pass re-asks for a tighter pick if the first one was longer than the original. A curated lexicon stands by as a same-length safety fallback.
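A simplified sketch of that chain is below. Everything named here is hypothetical: the two-entry wordlist and lexicon are placeholders, the syllable counter is a crude vowel-group heuristic, `ask_llm` stands in for whatever LLM call the pipeline makes, and the severity check is reduced to "the pick must not itself be on the wordlist".

```python
import re

WORDLIST = {"damn", "hell"}                  # placeholder entries, not the real list
FALLBACK = {"damn": "darn", "hell": "heck"}  # placeholder same-length lexicon

def count_syllables(word):
    """Rough estimate: count groups of consecutive vowels (an assumed heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def find_candidates(words):
    """Pre-filter: indices of words on the wordlist, so LLM cost tracks swear count."""
    return [i for i, w in enumerate(words)
            if w.lower().strip(".,!?\"'") in WORDLIST]

def is_valid(original, replacement):
    """Reject empty, still-profane, wrongly inflected, or longer replacements."""
    if not replacement or not replacement.isalpha():
        return False
    if replacement.lower() in WORDLIST:
        return False
    if original.lower().endswith("ing") != replacement.lower().endswith("ing"):
        return False
    return count_syllables(replacement) <= count_syllables(original)

def choose_replacement(original, sentence, ask_llm, retries=1):
    """Ask the LLM, retry if the pick fails validation, then fall back to the lexicon."""
    for attempt in range(retries + 1):
        pick = ask_llm(original, sentence, attempt)
        if is_valid(original, pick):
            return pick
    return FALLBACK.get(original.lower(), "")
```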

Voice-aware loudness. The LUFS reference for matching the synthesized word to its surroundings is built from the two same-speaker word spans nearest the edit — not the whole clip — so loud/quiet contrasts or other speakers can't drag the target wrong.
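One way to picture the reference selection, as a sketch under assumptions: the `Word` record, its pre-measured `loudness_db` field, and the energy-domain averaging are illustrative choices, not the project's data model. Measuring true LUFS would need a proper loudness meter, which is out of scope here.

```python
import math
from dataclasses import dataclass

@dataclass
class Word:
    speaker: str
    start: float        # seconds
    end: float          # seconds
    loudness_db: float  # assumed to be measured per word elsewhere

def reference_loudness(words, edit_index, neighbours=2):
    """Energy-mean loudness of the nearest same-speaker words around the edit."""
    target = words[edit_index]
    mid = (target.start + target.end) / 2
    same = [w for i, w in enumerate(words)
            if i != edit_index and w.speaker == target.speaker]
    if not same:
        return None  # caller needs another reference for a one-word speaker
    same.sort(key=lambda w: abs((w.start + w.end) / 2 - mid))
    nearest = same[:neighbours]
    # Average in the power domain, then convert back to dB.
    power = sum(10 ** (w.loudness_db / 10) for w in nearest) / len(nearest)
    return 10 * math.log10(power)

def gain_db(synth_loudness_db, reference_db):
    """Gain to apply to the synthesized word so it matches the reference."""
    return reference_db - synth_loudness_db
```

Because other speakers are filtered out before the distance sort, a loud interjection from someone else right next to the edit has no effect on the target.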

Length fit, never crop. The first time-stretcher I tried silently under-ran the requested length and zero-padded the remainder. I replaced it with a librosa phase vocoder under a policy that never crops the synthesized audio: an over-length word gets sped up slightly instead, with the clean-cut margins reserved to keep the speed-up small.
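The policy can be sketched as a small rate calculation followed by a stretch. The helper names and the idea of a single `margin_dur` budget are assumptions made for illustration; `librosa.effects.time_stretch` is the library's phase-vocoder stretch, where a rate above 1 speeds the audio up.

```python
def plan_length_fit(synth_dur, slot_dur, margin_dur):
    """Return the speed-up rate needed to fit, spending the reserved margins first."""
    budget = slot_dur + margin_dur
    if synth_dur <= budget:
        return 1.0             # fits as-is, no stretch and no crop
    return synth_dur / budget  # > 1.0, i.e. a slight speed-up

def fit_without_cropping(y, sr, slot_dur, margin_dur):
    """Stretch y only when it overruns the slot plus margins; never truncate it."""
    import librosa  # imported lazily so the planning helper has no dependencies

    rate = plan_length_fit(len(y) / sr, slot_dur, margin_dur)
    if rate == 1.0:
        return y
    return librosa.effects.time_stretch(y, rate=rate)
```

For example, a 0.54 s replacement aimed at a 0.40 s slot with 0.05 s of margin needs a rate of about 1.2, whereas without the margin it would need 1.35.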

Clone lifecycle. Voice clones cost an account slot. The pipeline reuses an existing clone if a sparse speaker can't be cloned (rather than aborting the run), and every clone is deleted in a finally block — an end-user clip never leaks a permanent voice anywhere.
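A sketch of that lifecycle as a context manager follows. The `client` object with `create_clone` and `delete_clone` methods and the `CloneError` exception are hypothetical stand-ins for the voice provider's SDK, and picking the first successful clone as the fallback is an assumption about which existing clone gets reused.

```python
from contextlib import contextmanager

class CloneError(Exception):
    """Raised by the (hypothetical) client when a voice cannot be cloned."""

@contextmanager
def speaker_clones(client, samples):
    """Yield {speaker: voice_id}; delete every clone created, even on failure."""
    created = []
    voices = {}
    try:
        for speaker, audio in samples.items():
            try:
                voice_id = client.create_clone(audio)
            except CloneError:
                voices[speaker] = None  # too little audio; resolved below
                continue
            created.append(voice_id)
            voices[speaker] = voice_id
        # Sparse speakers borrow an existing clone instead of aborting the run.
        fallback = created[0] if created else None
        for speaker, voice_id in voices.items():
            if voice_id is None:
                voices[speaker] = fallback
        yield voices
    finally:
        for voice_id in created:
            client.delete_clone(voice_id)
```

Wrapping the whole synthesis stage in `with speaker_clones(...) as voices:` means a crash midway through still frees the account slots.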

Offline-capable install. PyInstaller bundles ffmpeg, ffprobe, and the 80 MB Demucs weights. A runtime hook drops the model into ~/.cache/torch on first launch so the app works from the very first run, not after a download.
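A runtime hook along these lines might look like the sketch below. The `models` folder name inside the bundle is hypothetical, and the exact subdirectory the model loader reads from under ~/.cache/torch is not shown here; `sys.frozen` and `sys._MEIPASS` are the attributes PyInstaller sets in a bundled app.

```python
import shutil
import sys
from pathlib import Path

def seed_model_cache(bundle_dir, cache_dir):
    """Copy bundled files into cache_dir unless already present; return copied names."""
    copied = []
    cache_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(p for p in bundle_dir.iterdir() if p.is_file()):
        dst = cache_dir / src.name
        if not dst.exists():  # never overwrite a cache the user already has
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied

if getattr(sys, "frozen", False):  # true only inside a PyInstaller bundle
    seed_model_cache(
        Path(sys._MEIPASS) / "models",         # hypothetical bundle folder
        Path.home() / ".cache" / "torch",
    )
```

Skipping files that already exist keeps later launches cheap and leaves a user-updated cache alone.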

Challenges we ran into

Most of the work was at the seam, not the model.

ASR boundaries are looser than the timestamps suggest. A naïve cut leaves a fricative onset surviving into the replacement. The cleanup chain landed only after multiple iterations — RMS search alone wasn't enough on fricatives that have no internal silence; the zero-crossing snap was needed so the splice didn't click; the smoothstep curve specifically (not equal-power) was what killed residual energy at the seam.

The detection LLM kept satisfying length constraints with nonsense — "let's fucking go" became "let's going go". Rewriting the prompt around grammaticality, adding a same-syllable retry pass, and dropping back to a curated lexicon as the final fallback fixed most of it.

Voice character drift was the hardest one. The synthesized word came out flat and conversational on a screaming source. The fix wasn't tuning a voice setting; it was conditioning the synthesis on the full containing sentence as previous/next text. That single change matched the source intensity in a way no per-call setting could.
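To make the conditioning concrete, here is a minimal sketch of how the request might be assembled. The dictionary shape and the `previous_text` / `next_text` field names are assumptions based on the description above, not a specific provider's API.

```python
def build_tts_request(sentence_words, index, replacement):
    """Synthesize only the replacement, with the rest of the sentence as context."""
    return {
        "text": replacement,
        "previous_text": " ".join(sentence_words[:index]),
        "next_text": " ".join(sentence_words[index + 1:]),
    }
```

Only `text` is rendered to audio; the two context fields exist so the model can infer how the line is being delivered.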

Accomplishments that we're proud of

The seam is genuinely inaudible on most cuts.
One button, one drop, one output file. No keys, no setup.

What we learned

Replacing a word in real audio is mostly a seam problem, not a TTS problem. Where you cut and how you fade matter more than which TTS model you reach for.

The conditioning context you pass to a TTS matters more than the model's own knobs. Sending the surrounding sentence got the prosody right; everything else was small adjustments.

ASR word timestamps are suggestions. Treat them as anchors and search around them with RMS and zero-crossing analysis.

What's next for Tamewave

The end goal is one button that cleans up entire movies with 100% accuracy and a completely natural feel, and hopefully to get it integrated into major OTT platforms. This is an early version. The rough edges I plan to work on in the near future are:

Better edit/replace. Most cuts are clean, but tight rapid-fire deliveries don't leave enough silence for the wider cut to reach safely past the consonant burst. Plan: forced alignment for sub-phoneme boundaries on those specific gaps, and a phrase-level regeneration fallback when a single-word swap can't land. Another item here is tuning the synthesis settings so that every replacement comes out at the highest quality, with no residual artifacts.

Flexible timeline. Eased, formant-preserving stretch for small length deltas; for genuinely longer replacements, grow the video slightly via frame interpolation.

Closed-loop QA. Re-transcribe the edited region after splicing and assert the profanity is actually gone — catches residual cases automatically.
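Since this check is still planned, the following is only a sketch of what it could look like. `transcribe` is a hypothetical callable that returns the words heard between two timestamps, and the 0.25 s padding is an assumed value.

```python
def residual_profanity(transcribe, audio_path, start, end, wordlist, pad=0.25):
    """Re-transcribe around the edit and return any wordlist hits still present."""
    words = transcribe(audio_path, max(0.0, start - pad), end + pad)
    return [w for w in words if w.lower().strip(".,!?\"'") in wordlist]

def assert_clean(transcribe, audio_path, edits, wordlist):
    """Raise if any edited region still contains a word from the wordlist."""
    for start, end in edits:
        leftovers = residual_profanity(transcribe, audio_path, start, end, wordlist)
        if leftovers:
            raise AssertionError(
                "profanity survived at %.2f-%.2f s: %s" % (start, end, leftovers)
            )
```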

Windows + Linux. The Mac build is what shipped today; Windows and Linux builds will follow soon.