Primarily YouTube tutorials and Coursera videos. Some of my relatives wanted to learn about the latest tools and subjects, such as programming, machine learning, and psychology. However, the language barrier prevented them from using the many freely available video lectures on the internet. We were surprised to learn that even the major learning platforms do not support video translation, so we decided to explore this area.
What it does
tongueSpeak translates any given video into a video in another language in a scalable manner. It combines machine learning, speech recognition, text translation, speech generation, signal processing (e.g., chromagrams and FFTs), and audio normalization into a single video translation service.
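The flow described above can be sketched as a chain of stages applied to each audio segment. This is only an illustrative outline, not the project's actual code: `Segment`, `translate_segments`, and the stub stages are hypothetical names, and the real recognition, translation, and synthesis steps would call external services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One slice of the video's audio track (hypothetical structure)."""
    start_ms: int
    end_ms: int
    text: str = ""          # recognized speech in the source language
    translated: str = ""    # text in the target language
    audio: bytes = b""      # synthesized speech for the translated text


def translate_segments(
    segments: List[Segment],
    recognize: Callable[[Segment], str],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> List[Segment]:
    """Run recognition, translation, and synthesis on each segment in order."""
    out = []
    for seg in segments:
        text = recognize(seg)
        translated = translate(text)
        audio = synthesize(translated)
        out.append(Segment(seg.start_ms, seg.end_ms, text, translated, audio))
    return out


# Stub stages standing in for real speech and translation services.
def fake_recognize(seg: Segment) -> str:
    return f"speech from {seg.start_ms} to {seg.end_ms}"


def fake_translate(text: str) -> str:
    return text.upper()


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = translate_segments(
    [Segment(0, 1000), Segment(1000, 2500)],
    fake_recognize,
    fake_translate,
    fake_synthesize,
)
```

Because each segment is processed independently, segments could be handed to separate workers, which is one plausible way to scale such a service.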
How we built it
We used NumPy and pandas for numerical computation, scikit-learn for the machine learning algorithms, Librosa for signal processing, pydub for splitting and stitching audio, and gTTS (Google Text-to-Speech) for speech generation, paired with a speech recognition step.
Challenges we ran into
One of the biggest challenges was identifying the gender of each speaker, which we needed in order to match the tone of the original voice and preserve the charisma of the original video. Since there is no definitive mechanism for this, we trained a random forest classifier (an ensemble method) on 5,000 audio files. It identified the speaker's gender with about 75% accuracy, and we used its prediction to adjust the pitch of the output audio to mimic the input audio.
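A minimal sketch of this kind of classifier is shown below. It is not our trained model: the features (mean pitch and pitch spread), the assumed pitch ranges of roughly 120 Hz and 210 Hz, and the synthetic data are all stand-ins for features extracted from real audio, so the accuracy it reaches says nothing about accuracy on real recordings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def synthetic_pitch_features(n, mean_f0, rng):
    """Return rows of [mean pitch in Hz, pitch spread in Hz] (synthetic)."""
    mean = rng.normal(mean_f0, 25.0, size=n)
    spread = rng.normal(20.0, 5.0, size=n)
    return np.column_stack([mean, spread])


# Assumption: class 0 clusters near 120 Hz, class 1 near 210 Hz.
X = np.vstack([
    synthetic_pitch_features(500, 120.0, rng),
    synthetic_pitch_features(500, 210.0, rng),
])
y = np.array([0] * 500 + [1] * 500)

# Hold out 20% of the samples to estimate accuracy on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# Classify a new clip summarized by its (hypothetical) pitch features.
predicted = clf.predict([[110.0, 20.0]])[0]
```

The predicted class can then select a target pitch range for the generated speech.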
Apart from gender recognition, we also faced challenges in noise filtering, background music detection, and pitch resolution. Together, these problems gave us an opportunity to explore current machine learning techniques and to apply sophisticated algorithms to hard problems.
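To make the pitch problem concrete, here is a rough sketch of estimating a dominant pitch from a noisy signal with an FFT, using only NumPy. The function name `estimate_pitch_hz`, the 60 to 400 Hz search band, and the synthetic test tone are assumptions for illustration; real speech would need framing and a more robust estimator.

```python
import numpy as np


def estimate_pitch_hz(signal, sample_rate, fmin=60.0, fmax=400.0):
    """Return the frequency of the strongest spectral peak in [fmin, fmax]."""
    # A Hann window reduces spectral leakage from the signal's edges.
    windowed = signal * np.hanning(len(signal))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Restrict the search to an assumed range for speech pitch.
    band = (freqs >= fmin) & (freqs <= fmax)
    return float(freqs[band][np.argmax(spectrum[band])])


sample_rate = 16000
t = np.arange(sample_rate) / sample_rate  # one second of samples

# Synthetic "voice": a 200 Hz fundamental, a weaker harmonic, and noise.
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 200.0 * t) + 0.3 * np.sin(2 * np.pi * 400.0 * t)
noisy = tone + 0.1 * rng.standard_normal(t.shape)

pitch = estimate_pitch_hz(noisy, sample_rate)
```

The frequency resolution of this approach is the sample rate divided by the number of samples, so shorter analysis windows give coarser pitch estimates.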
Accomplishments that we're proud of
We understood the overall mechanism of the algorithms, put together a working web service, and processed extensive signal inputs, all in less than 36 hours. Since none of us had prior experience in these domains, it was a fantastic learning experience.
What we learned
Apart from the obvious technical gains, especially in signal processing and machine learning, we also learned essential interpersonal skills: task distribution, project planning, collaboration, and effective time management.
What's next for tongueSpeak
- Improved background noise filtering
- Wider range of languages
- Handle multiple overlapping speakers
- Deploy as a Chrome extension for real-time translation