Inspiration

It's 2035. The last software engineer was laid off on a Tuesday. The severance package included a complimentary subscription to MasterClass. So a coder, grieving, starts thinking about a rap career. He buys a microphone. He writes sixteen bars about his Phantom Pain 😔 in his favorite game, Roblox. The problem reveals itself immediately: coders have no sense of rhythm. Every take lands a little ahead or behind the beat, never quite locked in.

This is, of course, also a real problem for real producers. In a professional studio, an engineer fixes timing by hand, slicing and nudging syllables in a DAW for hours per song. We wanted to automate that work without touching the voice itself. No autotune, no AI vocal cloning, no black box that hands back a stranger's performance. Just precise, syllable-level rhythm correction you can open in Audacity and inspect line by line.

What it does

AutoRapper takes three inputs (a backing track, a dry rap vocal, and lyrics) and produces a rhythmically corrected version of the vocal where every syllable lands exactly on beat. It works in two modes:

  • Guide mode: an AI-generated rap vocal serves as a timing reference. The human vocal's syllable anchors are mapped to the AI guide's timing with zero-sample error: not "close enough," but mathematically exact.
  • Beat-only mode: detects BPM from the backing track and snaps syllables directly to the beat grid. No guide vocal needed.
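
To make beat-only mode concrete, here is a minimal sketch built on librosa's beat tracker. The function name, the snap-to-nearest-beat policy, and the default sample rate are our illustrations here, not AutoRapper's actual API:

```python
import librosa
import numpy as np

def snap_to_beat_grid(backing_track_path, syllable_onsets_s, sr=44100):
    """Illustrative: estimate the beat grid of the backing track and snap
    each syllable onset to the nearest grid point, in integer samples."""
    y, sr = librosa.load(backing_track_path, sr=sr)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_samples = librosa.frames_to_samples(beat_frames)

    onsets = np.round(np.asarray(syllable_onsets_s) * sr).astype(np.int64)
    if len(beat_samples) < 2:          # not enough beats to form a grid
        return tempo, onsets
    # Nearest-beat snap: compare each onset to its two surrounding beats.
    idx = np.clip(np.searchsorted(beat_samples, onsets), 1, len(beat_samples) - 1)
    left, right = beat_samples[idx - 1], beat_samples[idx]
    snapped = np.where(onsets - left <= right - onsets, left, right)
    return tempo, snapped
```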

The output isn't just a corrected WAV file. It's a full Audacity session with visible clips, label tracks, and waveforms that you can inspect, tweak, and override. Every edit is transparent and reversible.

How we built it

The pipeline has nine phases, split by a hard boundary between "AI allowed" and "deterministic only":

  • Phases 0–3 (AI allowed): We use ACE-Step 1.5 to generate an AI guide vocal from lyrics, Demucs for vocal isolation, CMUdict + G2P for automated syllable detection, and Montreal Forced Aligner for phone-level forced alignment on both the guide and human vocals.
  • Phases 4–8 (fully deterministic): Syllable anchor mapping, safe-boundary clip grouping, piecewise time-warp planning, Rubber Band pitch-preserving time-stretch rendering, and Audacity session assembly via mod-script-pipe. No neural model touches the human voice past Phase 3.
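
As a sketch of what the deterministic half looks like, here is an illustrative piecewise warp plan built from integer sample anchors. build_warp_map and the segment fields are hypothetical names; the property worth noticing is that anchors map exactly by construction:

```python
def build_warp_map(human_anchors, guide_anchors):
    """Illustrative piecewise-linear warp plan. Anchors are strictly
    increasing integer sample indices; segment interiors are stretched,
    but the anchors themselves map exactly by construction."""
    assert len(human_anchors) == len(guide_anchors) >= 2
    plan = []
    for (s0, s1), (d0, d1) in zip(
        zip(human_anchors, human_anchors[1:]),
        zip(guide_anchors, guide_anchors[1:]),
    ):
        plan.append({
            "src_start": int(s0), "src_end": int(s1),   # human vocal, samples
            "dst_start": int(d0), "dst_end": int(d1),   # guide timing, samples
            "ratio": (d1 - d0) / (s1 - s0),             # time-stretch factor
        })
    return plan

# e.g. build_warp_map([0, 22050, 44100], [0, 24000, 44100]) stretches the
# first segment by ~1.088 and compresses the second by ~0.912; both anchors
# land exactly on samples 24000 and 44100, with no rounding tolerance.
```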

We built it in Python with uv for package management, a Flask + pywebview interactive syllable editor with wavesurfer.js waveforms, and a studio launcher that opens Audacity and the editor side-by-side. The codebase is around 4,600 lines of source with 168 passing tests.
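
For readers who haven't scripted Audacity before, mod-script-pipe boils down to writing command strings into a named pipe and reading the reply back, roughly as below. This sketch assumes Audacity is already running with the mod-script-pipe module enabled; the specific commands AutoRapper sends are not reproduced here:

```python
import os

# Linux/macOS pipe paths; Windows uses \\.\pipe\ToSrvPipe and \\.\pipe\FromSrvPipe.
to_pipe = open(f"/tmp/audacity_script_pipe.to.{os.getuid()}", "w")
from_pipe = open(f"/tmp/audacity_script_pipe.from.{os.getuid()}")

def do_command(command: str) -> str:
    """Send one scripting command and collect Audacity's reply."""
    to_pipe.write(command + "\n")
    to_pipe.flush()
    reply = []
    while True:
        line = from_pipe.readline()
        if not line or (line == "\n" and reply):  # blank line ends the reply
            break
        reply.append(line)
    return "".join(reply)

# For example, importing a rendered clip into the open project:
# print(do_command('Import2: Filename="/tmp/clip_001.wav"'))
```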

Challenges we ran into

  • MFA on rap vocals: Montreal Forced Aligner is designed for read speech, not rapid-fire rap delivery. We added multi-pronunciation support, energy-based fallback for missed vowels, and phoneme smoothing to get reliable syllable timestamps from stylized performances.
  • AI guide lyrics fidelity: ACE-Step doesn't always sing every word. It sometimes skips or merges lyrics. We iterated on BPM settings, duration, and captions to improve coverage, and built a manual guide fallback for when generation misses words.
  • Extreme stretch ratios: when human and guide timing diverge significantly, some syllables need 15x stretch or 0.1x compression. We added configurable ratio bounds and safe-boundary grouping to minimize audible artifacts at these extremes (sketched after this list).
  • 8 GB VRAM constraint: running ACE-Step + Demucs on an RTX 3070 required CPU offloading for model components and tiled VAE decoding to stay within memory limits.
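
The ratio bounds mentioned above can be pictured as a pass over the warp plan from the earlier sketch. The bound values and the clamped flag are illustrative defaults, not the project's actual configuration:

```python
def bound_ratios(plan, lo=0.25, hi=4.0):
    """Illustrative pass over the warp plan from the earlier sketch:
    clamp each segment's stretch ratio to configurable bounds and flag
    offenders so they can be regrouped at the next safe boundary
    instead of being rendered with an extreme ratio."""
    for seg in plan:
        r = seg["ratio"]
        seg["clamped"] = not (lo <= r <= hi)
        seg["ratio"] = min(max(r, lo), hi)
    return plan
```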

Accomplishments that we're proud of

  • Zero-sample anchor error guarantee: every rendered syllable anchor lands exactly on the guide anchor, not within a tolerance, but at the exact integer sample index. The pipeline fails rather than silently producing wrong output (see the check sketched after this list).
  • Full transparency: the output is an editable Audacity session, not a black box. You can see every cut, stretch, and crossfade, and override any decision the pipeline made.
  • Two rendering modes: warp mode produces a single contiguous time-stretched vocal; clip mode produces individual clips for Audacity visual editing. Both are always computed from the same anchor map.
  • 168 tests passing: covering syllabification, anchor mapping, edit plan exactness, warp maps, safe-boundary grouping, and render clip lengths.
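
The fail-loudly guarantee in the first bullet boils down to an invariant check along these lines (hypothetical function name; the real pipeline verifies the rendered output):

```python
def check_anchor_exactness(rendered_anchors, guide_anchors):
    """Illustrative invariant: every rendered anchor must equal its guide
    anchor at the integer sample level, or the run aborts loudly."""
    for i, (got, want) in enumerate(zip(rendered_anchors, guide_anchors)):
        if int(got) != int(want):
            raise RuntimeError(
                f"anchor {i}: rendered at sample {got}, expected {want} "
                f"(off by {int(got) - int(want)} samples)"
            )
```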

What we learned

  • Forced alignment on rap vocals is surprisingly viable once you handle pronunciation variants and low-energy regions. The key insight is that rap is more rhythmically consistent than conversational speech, which actually helps the aligner.
  • Deterministic audio editing with integer sample arithmetic eliminates an entire class of floating-point drift bugs that plague audio pipelines.
  • The model-adapter pattern for guide generation paid off immediately. We started with manual guides, added ACE-Step, and could swap backends without touching the rest of the pipeline.
  • Safe-boundary clip grouping (splitting at silence, breaths, and zero crossings rather than arbitrary syllable edges) makes a dramatic difference in perceived audio quality versus naive syllable-level cutting.
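
The zero-crossing half of that idea is simple enough to sketch. This hypothetical helper (window size arbitrary) nudges a cut point to the nearest sign change so clip edges start and end near zero amplitude instead of clicking:

```python
import numpy as np

def nearest_zero_crossing(y: np.ndarray, cut: int, window: int = 512) -> int:
    """Illustrative safe-boundary search: move a cut point to the nearest
    sign change of the waveform within +/- `window` samples."""
    lo, hi = max(cut - window, 0), min(cut + window, len(y))
    if hi - lo < 2:
        return cut
    signs = np.signbit(y[lo:hi])
    crossings = np.nonzero(signs[1:] != signs[:-1])[0] + lo + 1
    if crossings.size == 0:
        return cut  # no sign change nearby; keep the original edge
    return int(crossings[np.argmin(np.abs(crossings - cut))])
```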

What's next for AutoRapper

Setting aside the career-pivot use case, there's a serious production workflow here: a tool that collapses hours of manual timing edits into a single deterministic pass while leaving the producer in full control of the final result. From there:

  • Vowel-nucleus anchoring: map vowel centers instead of syllable onsets for smoother, more natural-sounding corrections on melodic delivery styles.
  • Confidence heatmap labels: visual overlay in Audacity showing which syllables had low alignment confidence so producers know where to double-check.
  • Multi-rapper support: separate and independently align multiple voices on the same track.
  • Native DAW plugins: move beyond mod-script-pipe to native Audacity, Reaper, and Ableton plugins with real-time preview.
  • Formant-aware stretching: adaptive Rubber Band settings that prevent vocal artifacts at extreme stretch ratios by shifting formants intelligently.

Built With

  • ace-step 1.5 (music generation)
  • audacity (mod-script-pipe)
  • click
  • cmudict + g2p-en (pronunciation)
  • demucs (source separation)
  • flask
  • librosa (beat detection)
  • montreal-forced-aligner (forced alignment)
  • nltk
  • numpy
  • python
  • pytorch
  • pywebview
  • rubber-band (time-stretch)
  • scipy
  • soundfile
  • torchaudio
  • uv
  • wavesurfer.js (waveform display)