Inspiration
What if language was no longer a barrier?
Basketball is a global game. Hundreds of millions (if not billions) of kids, teens, and adults shoot hoops + watch games / interviews / commentary from every corner of the globe.
Yet the full experience of the game is locked down to English speakers only. Even though games / interviews are translated + broadcast through local networks around the world, the emotion of the game is lost (anyone who's watched a game outside of the US can attest: there's something magical about the voices our favorite legendary anchors have and the words they use).
What if we could give these anchors the ability to speak to their fans in any language of their choice, regardless of where they live / who they are?
What it does
We've built a universal translator that lets you convert a video from one language to another while keeping the original speakers' voices + syncing their lips to the translated audio (making it feel like they're actually speaking like a native).
How we built it
We built a suite of tools to:
1) download the videos from YouTube + break them into scenes w/ WhisperX speaker diarization
2) run WhisperX on each scene to transcribe it into English text
3) translate each scene using GPT-4 + prompt engineering
4) clone the voices of the speakers in the video using ElevenLabs
5) feed the translated text to ElevenLabs for each identified speaker + generate the audio
6) run our state-of-the-art lipsyncing model as a post-processing step to sync the speakers' lips to the new audio
7) stitch all the clips back together
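To make the scene-splitting step concrete, here is a minimal sketch of how diarized segments could be merged into per-speaker scenes. It assumes each segment is a dict with `start`, `end`, `speaker`, and `text` keys; the `Scene` class, `group_into_scenes` function, and `max_gap` threshold are hypothetical names chosen for illustration, not the project's actual code.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


def group_into_scenes(segments, max_gap=1.0):
    """Merge consecutive segments from the same speaker into scenes.

    A new scene starts when the speaker changes or when the pause
    between two segments is longer than max_gap seconds.
    """
    scenes = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        last = scenes[-1] if scenes else None
        same_speaker = last is not None and last.speaker == seg["speaker"]
        if same_speaker and seg["start"] - last.end <= max_gap:
            # Extend the current scene with this segment.
            last.end = max(last.end, seg["end"])
            last.text = f"{last.text} {seg['text'].strip()}"
        else:
            scenes.append(
                Scene(seg["speaker"], seg["start"], seg["end"], seg["text"].strip())
            )
    return scenes
```

Each resulting scene carries one speaker, which is what lets the later steps translate the text and pick the matching cloned voice per scene.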
Challenges we ran into
The process can be difficult to automate fully e2e (for example, when you translate from one language to another, the translated audio can run shorter / longer than the original video).
This is why we took the approach of building tools that help video editors / translators speed up the process drastically + give them the ability to use their judgement for the best outcome.
It's an approach that combines the best of both worlds: speed w/ AI + quality w/ human intuition + judgement.
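As an illustration of the timing problem, a tool could compare each translated clip's duration with the original scene and flag the ones an editor should look at. This is only a sketch under assumptions: the `timing_report` name and the 15% default tolerance are made up for the example and are not taken from the project.

```python
def timing_report(original_s, translated_s, tolerance=0.15):
    """Compare a translated clip's duration with the original scene's.

    Returns the duration ratio (translated / original) and whether the
    mismatch exceeds the tolerance and should go to a human editor.
    """
    if original_s <= 0:
        raise ValueError("original duration must be positive")
    ratio = translated_s / original_s
    return {
        "ratio": ratio,
        "needs_review": abs(ratio - 1.0) > tolerance,
    }
```

For example, `timing_report(4.0, 5.0)` gives a ratio of 1.25 and flags the scene for review, while `timing_report(4.0, 4.2)` stays within tolerance. What to do with a flagged scene (rephrase the translation, adjust the speech rate, or trim the video) is left to the editor's judgement.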
Accomplishments that we're proud of
Honestly, we're proud of the outputs of our lipsyncing model.
Translation w/ voice preservation is fairly easy, but our brains expect lips to align w/ the words we hear. This synchronization fosters trust + connection.
W/o seamless lip-sync, translated content feels jarring + uncanny and fails to translate the emotion behind the game.
What we learned
This feels inevitable: we can't imagine a world where this is not the future.
What's next for universal sportscast
Getting this to work in real-time.
Imagine a world where we could broadcast in every language everywhere all at once, w/o losing the personality behind the voices + people who make the game as special as it is.
Built With
- elevenlabs
- openai
- python
- synclabs
- whisper