Inspiration
Just like Prince Zuko, I had lost my honor. Over the summer, I was very excited to share “Avatar: The Last Airbender” with my grandma. I had found a Hindi dub, and we were enjoying ourselves — until we got to season 3 and couldn’t find a Hindi dub anywhere! My grandma really enjoyed the show (even though she never admitted it), and I'm sure she felt just as frustrated about never finding out what would happen next. I had lost my honor, and the only way to regain it would be to create AI-based dubbing software. I am glad to say that my honor has been restored!
What it does
We built a website where users can upload a video file along with its captions. The site sends these files to the server, which communicates with the Google Cloud APIs to translate the captions and convert them into speech. The captions are parsed by a conversion algorithm that creates text chunks to be translated. The caption parser also uses timing calculations to add pauses to the speech using SSML. The SSML built from the Google Translate API output is sent to the Google Text-to-Speech API, which converts the text into an audio file. The audio file's sampling rate is then adjusted to match the length of the original video, and the adjusted audio is sent back to the browser, where it can be played alongside the video.
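The actual parser isn't shown in the write-up, but the chunking step it describes can be sketched like this, assuming standard SRT-style captions (`srt_time_to_seconds` and `parse_srt` are illustrative names, not the project's real functions):

```python
import re

# Hypothetical helper: parse an SRT timestamp ("HH:MM:SS,mmm") into seconds.
def srt_time_to_seconds(ts: str) -> float:
    h, m, rest = ts.split(":")
    s, ms = rest.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

# Split an SRT file into (start, end, text) chunks ready for translation.
def parse_srt(srt_text: str):
    chunks = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (t.strip() for t in lines[1].split("-->"))
        text = " ".join(lines[2:])
        chunks.append((srt_time_to_seconds(start), srt_time_to_seconds(end), text))
    return chunks
```

Each resulting `(start, end, text)` tuple carries the timing information the SSML step needs later.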
How we built it
Our program works in three phases. First, we translate the caption file into the desired language using Google Cloud’s Translation API. Then we add pauses to the translated speech text using SSML notation; the conversion algorithm that accomplishes this was written from scratch. Finally, the SSML text is sent to Google Cloud’s Text-to-Speech API to create the dubbing. The audio file's sampling rate is then adjusted to match the length of the original video, and the adjusted audio file is sent back to the browser, where it can be watched.
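The pause-insertion phase can be sketched as follows — a simplified version under the assumption that `chunks` are `(start, end, translated_text)` tuples from the caption parser, with a `<break>` tag covering the silent gap between consecutive captions:

```python
from html import escape

# Build one SSML document from timed, translated caption chunks.
# The <break> between captions is what keeps the dub aligned with
# the silent stretches of the original video.
def chunks_to_ssml(chunks):
    parts = ["<speak>"]
    prev_end = 0.0
    for start, end, text in chunks:
        gap_ms = int((start - prev_end) * 1000)
        if gap_ms > 0:
            parts.append(f'<break time="{gap_ms}ms"/>')
        parts.append(escape(text))  # escape &, <, > so the SSML stays valid XML
        prev_end = end
    parts.append("</speak>")
    return "".join(parts)
```

The resulting `<speak>…</speak>` string is what would be handed to the Text-to-Speech API in the third phase.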
Challenges we ran into
While creating an automatic dub, the first problem that presented itself was timing. Specifically, how do we take statements whose lengths vary from language to language and make them match? And how do we use the Google Cloud Platform and markup such as SSML to accomplish this? By combining SSML with our timing algorithms, we were able to speed up and slow down the dubbing without major changes to pitch or control. This, combined with well-timed breaks and pauses, creates a high-quality dub of any subtitled video presented.
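One way to picture the speed-up/slow-down step: compare how long a synthesized line runs against the caption slot it must fit, and express the ratio as an SSML `<prosody rate>` percentage. The function and its clamping bounds here are illustrative, not the project's exact algorithm:

```python
# Estimate a prosody rate so a synthesized line fits its caption slot.
# synthesized_s: how long the TTS audio for this line runs (or an estimate);
# slot_s: the caption's (end - start) window in the original video.
def fit_rate(synthesized_s: float, slot_s: float, lo: int = 50, hi: int = 200) -> str:
    pct = round(100 * synthesized_s / slot_s)
    # Clamp to a sane range so extreme ratios don't wreck intelligibility.
    pct = max(lo, min(hi, pct))
    return f'<prosody rate="{pct}%">'
```

For example, a 3-second synthesized line that must fit a 2.4-second slot gets `rate="125%"`, i.e. spoken 25% faster, which SSML engines apply without the pitch shift a naive resample would cause.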
Accomplishments that we're proud of
Tackling automatic speech dubbing was a daunting task. It was the most exciting idea we had, but it came with a steep learning curve: learning Google Cloud API calls, making a Flask backend interact with a React frontend, and, above all, making the audio sync with the video. We are all very proud that we not only learned about these topics but also got a working product.
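The final audio/video sync boils down to the sample-rate adjustment mentioned earlier, which the standard-library `wave` module (listed under Built With) can do by rewriting the WAV header. A minimal sketch — the function name is ours, and note that this coarse trick shifts pitch proportionally, so it only makes sense after SSML pacing has done most of the work:

```python
import io
import wave

# Stretch or compress a WAV to a target duration by rewriting its
# sample rate: duration = nframes / framerate, so picking
# framerate = nframes / target_s makes playback last exactly target_s.
def fit_wav_to_duration(wav_bytes: bytes, target_s: float) -> bytes:
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
        new_rate = round(src.getnframes() / target_s)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(new_rate)
        dst.writeframes(frames)
    return out.getvalue()
```

A 2-second clip at 8 kHz passed through with `target_s=4.0` comes back at 4 kHz and plays for 4 seconds.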
What we learned
While creating our automatic dubbing software, we learned how to use Google Cloud's APIs to translate text and convert it to speech. We also learned about SSML and how to use it to create precisely timed text-to-speech. Finally, we learned how to integrate all of these technologies in a Flask and React application.
What's next for EasyDubz
We want to find ways to quantify the emotional qualities of the original voice acting. By finding and tuning different variables, we can then apply the same characteristics to the translated speech waveforms. Furthermore, we want to be able to distinguish between different voices. Once we have the basics completed, we want to turn our sights toward creating and training machine learning models. We would train our models on native-language movies to find out how certain words are generally spoken, and how tone and emphasis are affected by context. We can also use computer vision techniques and background-score analysis to further understand the atmosphere of a scene.
Built With
- bootstrap
- flask
- gcp
- google-translate
- python
- react
- text-to-speech
- wave
