Inspiration
Music is universal, but language isn't. We wanted to hear songs in different languages without losing the original singer's voice or feel, and to automate the hardest part: making translated lyrics actually fit the music.
What it does
Takes any song by URL, separates the vocals from the instrumental, transcribes the lyrics, translates them to match the original syllable count and stress patterns, clones the singer's voice, and outputs a finished track in the target language.
How we built it
Three stages: (1) yt-dlp + demucs + Whisper for audio separation and transcription. (2) A multi-agent loop built on uAgents in which Claude Haiku translates and syllable and stress critic agents score each line, rejecting and revising it until it meets a quality threshold. (3) ElevenLabs voice cloning plus time-stretching via pyrubberband to fit the translated audio back to the original timing.
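A minimal sketch of the stage 2 translate-and-critique loop, under loud assumptions: the `translate` callable is a hypothetical stand-in for the Claude Haiku call, the syllable counter is a rough vowel-group heuristic, and the stress critic and the uAgents messaging are left out entirely. The names `count_syllables`, `syllable_score`, and `translate_with_critic` are illustrative and not taken from the project.

```python
import re
from typing import Callable, Optional, Tuple

# Rough heuristic (an assumption): one syllable per run of vowels in a word.
VOWEL_GROUP = re.compile(r"[aeiouyáéíóúüàèìòùâêîôûäëïö]+", re.IGNORECASE)
WORD = re.compile(r"[^\W\d_]+")


def count_syllables(line: str) -> int:
    """Estimate syllables by counting vowel runs, with at least one per word."""
    return sum(max(1, len(VOWEL_GROUP.findall(w))) for w in WORD.findall(line))


def syllable_score(source: str, candidate: str) -> float:
    """Score in [0, 1]; 1.0 means the syllable counts match exactly."""
    target = count_syllables(source)
    if target == 0:
        return 1.0 if count_syllables(candidate) == 0 else 0.0
    return max(0.0, 1.0 - abs(count_syllables(candidate) - target) / target)


def translate_with_critic(
    source: str,
    translate: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    threshold: float = 0.9,
    max_rounds: int = 4,
) -> Tuple[str, float]:
    """Ask for a translation, score it, and retry with feedback until it passes."""
    best, best_score = "", -1.0
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = translate(source, feedback)
        score = syllable_score(source, candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
        # Turn the critic's measurement into a concrete revision request.
        diff = count_syllables(candidate) - count_syllables(source)
        action = "Remove" if diff > 0 else "Add"
        feedback = f"{action} {abs(diff)} syllable(s)."
    return best, best_score
```

Keeping the best-scoring candidate means the loop still returns something usable if no attempt reaches the threshold within `max_rounds`.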
Challenges we ran into
- Claude refusing to work with lyrics it flagged as copyrighted, fixed with prompt reframing and assistant prefill
- Syllable matching across languages with different natural densities
- Time-stretching artifacts, solved with pause-aware stretching that only warps speech, not silence
- Whisper cache reuse across songs, fixed by hashing audio file contents as the cache key
Accomplishments we're proud of
- Fully automated pipeline: URL in, translated MP3 out
- Multi-agent critique loop with quantitative metrics producing genuinely singable translations
- The output actually sounds like the original singer performing in another language
What we learned
Prompt framing matters enormously. Multi-agent critique loops with measurable constraints outperform single-model self-evaluation. Audio time-stretching requires preserving silence structure, not just phoneme quality.
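Two of the fixes listed under the challenges can be sketched in a few lines. This is an illustration under assumptions, not the project's code: `audio_cache_key` and `speech_stretch_rate` are hypothetical names, SHA-256 is an assumed choice of hash, and the segment list of `(duration_seconds, is_speech)` pairs is assumed to come from some silence detector that is not shown. The returned rate follows the convention where values above 1 shorten the audio, which we believe matches how pyrubberband's `time_stretch` interprets `rate`.

```python
import hashlib
from typing import Iterable, Tuple


def audio_cache_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Key a transcription cache on file contents rather than on the file name."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large audio files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def speech_stretch_rate(
    segments: Iterable[Tuple[float, bool]], target_duration: float
) -> float:
    """Rate to apply to speech segments only, leaving silences at their length.

    segments: (duration_seconds, is_speech) pairs in playback order.
    """
    segments = list(segments)
    speech = sum(d for d, is_speech in segments if is_speech)
    silence = sum(d for d, is_speech in segments if not is_speech)
    room = target_duration - silence  # time left for speech once pauses are kept
    if speech <= 0 or room <= 0:
        raise ValueError("no speech to stretch, or pauses alone exceed the target")
    return speech / room
```

Because the key depends only on the bytes of the file, two different songs saved under the same file name get separate cache entries, while the same audio downloaded twice reuses one.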
What's next
Better translation search (beam/sampling over candidates), prosody-aware TTS, broader language support, and a web UI.
Built With
- elevenlabs
- fetch.ai
- python
- whisper