We are both interested in linguistics and language learning, but one of the biggest roadblocks to regular practice was that we couldn't find enough entertaining media to watch in a given language. We wanted to build software that could automatically translate the audio of any video into another language, so that we could watch whatever content we wanted in whatever language we wanted to learn. Along the way, we realized it could make a huge difference for vision-impaired people who need to get information from a video but are unable to use subtitles. It can also be valuable to anyone who does not understand the language of a video that offers no subtitles.
What it does
Given the URL of any YouTube video, a destination language to translate the video into, and an output filename, our software writes a translated video to the local disk of whatever machine the code is running on. This proof-of-concept backend could easily be served through a website or mobile app for distribution in the future.
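The interface described above could be exposed as a simple command-line entry point. Here is a minimal sketch using Python's argparse; the argument names are hypothetical, not our actual CLI:

```python
import argparse

def build_parser():
    # Hypothetical CLI mirroring the three inputs described above:
    # a YouTube URL, a target language, and an output filename.
    parser = argparse.ArgumentParser(
        description="Dub a YouTube video's audio into another language.")
    parser.add_argument("url", help="URL of the YouTube video")
    parser.add_argument("language", help="destination language code, e.g. 'es'")
    parser.add_argument("output", help="filename for the translated video")
    return parser

# Example invocation with an explicit argv (URL is a placeholder).
args = build_parser().parse_args(
    ["https://youtu.be/example", "es", "dubbed.mp4"])
```

The same three parameters would map directly onto a web form or mobile UI if the backend were hosted as a service.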
How we built it
Our software takes in the URL of any YouTube video, a destination language, and an output filename for the final translated video. It downloads the video along with any subtitles YouTube provides; if subtitles are not available, we use Google's Web Speech API (via Autosub) to auto-generate them. We then programmatically build a new audio file, parsing the subtitle files for timing information so the translated speech stays synced with the video. Finally, we strip the audio from the video, use sklearn and scipy to tease the voice apart from the rest of the audio, combine the non-voice track with the translated speech, and recombine the result with the video.
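The steps above can be sketched as a pipeline of stages. Every helper below is a hypothetical stand-in that just records its step; the real stages wrap the YouTube downloader, Autosub, a translation service, and the audio/video muxing:

```python
# Each stub logs its stage name so the orchestration order is visible.
steps = []

def download_video(url):          steps.append("download");  return {"url": url}
def fetch_subtitles(url):         steps.append("subtitles"); return None  # assume none available
def auto_generate_subtitles(v):   steps.append("autosub");   return ["cue"]
def translate_subtitles(s, lang): steps.append("translate"); return s
def synthesize_speech(subs):      steps.append("tts");       return "speech"
def separate_voice(video):        steps.append("ica");       return "voice", "background"
def mix(background, speech):      steps.append("mix");       return "audio"
def write_video(video, audio, p): steps.append("mux")

def translate_video(url, language, output_path):
    video = download_video(url)
    subs = fetch_subtitles(url)
    if subs is None:
        # Fall back to speech recognition when YouTube has no subtitles.
        subs = auto_generate_subtitles(video)
    translated = translate_subtitles(subs, language)
    speech = synthesize_speech(translated)
    _voice, background = separate_voice(video)
    write_video(video, mix(background, speech), output_path)

translate_video("https://youtu.be/example", "es", "dubbed.mp4")
```

The value of structuring it this way is that each stage can be swapped out independently, e.g. replacing the subtitle source without touching the audio work.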
Challenges we ran into
One of the biggest challenges was keeping the audio in sync with the video. We wrote our own parser for the .vtt (WebVTT) subtitle format, then wrote an algorithm that detects pauses in speech and segments the subtitle text into chunks accordingly. These chunks are then used to insert silence into the audio file wherever a pause exceeds a certain threshold.
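The timing logic can be illustrated with a small sketch: parse the cue timestamps out of a .vtt file, then flag gaps between consecutive cues that exceed a pause threshold. This is a simplification; real .vtt files also carry headers, cue identifiers, and styling that a full parser has to handle:

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:01.000 --> 00:00:04.500"
CUE_RE = re.compile(
    r"(\d+):(\d\d):(\d\d)\.(\d{3}) --> (\d+):(\d\d):(\d\d)\.(\d{3})")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_cues(vtt_text):
    """Return (start, end) pairs in seconds for each cue in a .vtt file."""
    cues = []
    for match in CUE_RE.finditer(vtt_text):
        g = match.groups()
        cues.append((to_seconds(*g[:4]), to_seconds(*g[4:])))
    return cues

def pauses(cues, threshold=0.5):
    """Gaps between consecutive cues longer than `threshold` seconds --
    the places where silence would be inserted into the generated audio."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
        if next_start - prev_end > threshold:
            gaps.append((prev_end, next_start - prev_end))
    return gaps

vtt = """WEBVTT

00:00:00.000 --> 00:00:02.000
Hello.

00:00:04.000 --> 00:00:05.500
World."""

cues = parse_cues(vtt)      # [(0.0, 2.0), (4.0, 5.5)]
gaps = pauses(cues)         # one 2-second pause starting at t=2.0
```

The threshold value is a tuning knob: too low and the speech becomes choppy, too high and the dub drifts out of sync with the picture.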
Another challenge was separating the voice from the rest of the audio in the original video. We wanted all other sounds, e.g. theme songs, sound effects, and general noise, to carry over into the final video to maximize similarity to dubbing by a human being. We used Independent Component Analysis, then adjusted the volume level of the noise track to accomplish this.
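The separation step can be sketched with sklearn's FastICA on synthetic data: mix two sources into two observed channels, then recover statistically independent components. This is a toy illustration, not our production code; in practice the inputs are the video's audio channels, and ICA does not tell you which recovered component is the voice, so that still has to be identified before rescaling:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 4000)

# Two synthetic sources standing in for "voice" and "background".
voice = np.sin(2 * np.pi * 7 * t)                 # smooth sinusoid
background = np.sign(np.sin(2 * np.pi * 3 * t))   # square wave

sources = np.c_[voice, background]                # shape (4000, 2)
mixing = np.array([[1.0, 0.5],
                   [0.4, 1.0]])                   # assumed mixing matrix
observed = sources @ mixing.T                     # two "microphone" channels

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)           # independent components

# Once the non-voice component is identified, its gain can be adjusted
# before remixing it with the synthesized translated speech.
background_track = 0.8 * recovered[:, 1]
```

ICA recovers components only up to permutation, sign, and scale, which is why the volume adjustment mentioned above is needed before remixing.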
Accomplishments that we're proud of
We're very proud of the overall impact this project could have for disabled people and foreign language learners. We're also happy with the quality of the final translated video we were able to achieve, and have many plans for future optimizations and services.
What we learned
We learned a lot about software engineering best practices and are proud of the overall organization we were able to achieve in the repo. We also learned a lot about audio/video formats, as well as various machine learning topics.
What's next for a.tv
We plan to host the service as a website and/or mobile app in the coming weeks, further improve synchronization, automatically match the synthesized voice's gender to the original speaker, and possibly even use sentiment analysis to create even more realistic dubs.