tongueSpeak

Inspiration

Our inspiration came primarily from YouTube tutorials and Coursera videos. Some of my relatives wanted to learn the latest tools and subjects, such as programming, machine learning, and psychology. However, the language barrier kept them from accessing the wealth of freely available video lectures on the internet. We were surprised to learn that even the major learning platforms do not support video translation, and decided to explore this area.

What it does

tongueSpeak translates any given video into a video spoken in another language, in a highly scalable manner. It combines machine learning, speech recognition, speech generation, text translation, signal processing (e.g., chromagram and FFT algorithms), and audio normalization into a single video translation service.
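
The stages listed above can be pictured as a small pipeline. The following is a minimal sketch, not the project's actual code: the `Segment` type, the `translate_audio` function, and the three injected callables are hypothetical names chosen for illustration, and real recognition, translation, and synthesis services would be plugged in where the stand-ins are.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    start_ms: int  # where the spoken segment begins in the source audio
    end_ms: int    # where it ends
    text: str      # recognized text in the source language


def translate_audio(
    audio: bytes,
    recognize: Callable[[bytes], List[Segment]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> List[Tuple[int, bytes]]:
    """Return (start_ms, translated_clip) pairs ready to be stitched back."""
    timeline = []
    for seg in recognize(audio):
        translated_text = translate(seg.text)   # text-to-text translation
        clip = synthesize(translated_text)      # speech generation
        timeline.append((seg.start_ms, clip))   # keep the original timestamp
    return timeline
```

Keeping each stage behind a plain callable means a stage can be swapped or run in parallel per segment without touching the rest of the pipeline.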

How we built it

We used NumPy and pandas for the numerical work, scikit-learn for the machine learning algorithms, Librosa for signal processing, pydub for splitting and stitching audio, and gTTS for speech generation (text-to-speech).

Challenges we ran into

One of the biggest challenges was identifying the gender of each speaker, which we needed in order to match the tone of the original voice and preserve the charisma of the original video. Since there is no definitive mechanism for this, we used a random forest classifier (an ensemble machine learning method) trained on 5000 input audio files. This gave us an appreciable 75% accuracy in identifying the speaker's gender, and we used the prediction to adjust the pitch of the output audio to mimic the input audio.
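
To make the pitch adjustment step concrete, here is a minimal sketch of how a predicted label could be turned into a pitch shift. It is an illustration under stated assumptions, not the project's method: the target pitches in `TARGET_PITCH_HZ` are assumed typical speaking pitches, and the function names are hypothetical.

```python
import math

# Assumed typical speaking pitches in Hz; illustrative values, not from the project.
TARGET_PITCH_HZ = {"male": 120.0, "female": 210.0}


def semitone_shift(source_hz: float, target_hz: float) -> float:
    """Semitones needed to move a voice from source_hz to target_hz."""
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("pitches must be positive")
    # One octave (a doubling of frequency) is 12 semitones.
    return 12.0 * math.log2(target_hz / source_hz)


def shift_for_speaker(predicted_gender: str, tts_pitch_hz: float) -> float:
    """Shift to apply to the synthesized voice for the predicted label."""
    return semitone_shift(tts_pitch_hz, TARGET_PITCH_HZ[predicted_gender])
```

The resulting number of semitones is the kind of value a pitch-shifting routine accepts, for example the `n_steps` argument of `librosa.effects.pitch_shift`.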

Apart from gender recognition, we also faced challenges in noise filtering, background music detection, and pitch resolution. Together, these problems gave us an opportunity to explore current machine learning techniques and apply sophisticated algorithms to them.
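
Pitch resolution can be read as estimating a speaker's fundamental frequency from a short frame of audio. The following is a minimal sketch of one textbook approach, autocorrelation peak picking, in plain Python. It is not the routine the project used (the writeup names Librosa for signal processing), and the function name, the 60 to 400 Hz search range, and the frame length are assumptions for illustration.

```python
import math
from typing import List


def estimate_pitch_hz(samples: List[float], sample_rate: int,
                      fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Estimate the fundamental frequency of one voiced frame."""
    min_lag = max(1, int(sample_rate / fmax))  # shortest period considered
    max_lag = int(sample_rate / fmin)          # longest period considered
    if max_lag >= len(samples):
        raise ValueError("frame too short for the requested fmin")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation: how well the frame matches
        # a copy of itself delayed by `lag` samples.
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Usage: a synthetic 200 Hz tone sampled at 8 kHz.
sr = 8000
tone = [math.sin(2 * math.pi * 200 * n / sr) for n in range(1024)]
print(round(estimate_pitch_hz(tone, sr)))
```

Real speech is noisier than a pure tone, so a practical version would first filter out noise and skip unvoiced or music-only frames, which is where the other challenges mentioned above come in.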

Accomplishments that we're proud of

We came to understand how the algorithms work, put together a working web service, and processed extensive signal inputs, all in less than 36 hours. Since none of us had prior experience in these domains, this was a fantastic learning experience.

What we learned

Apart from the obvious gain in technical skill, especially in signal processing and machine learning, we also learnt essential interpersonal skills: task distribution, project planning, collaboration, and effective time management.

What's next for tongueSpeak

Improved background noise filtering
Wider range of languages
Handle multiple overlapping speakers
Deploy as a Chrome extension for real-time translation

Submitted to

PennApps XIV

Created by

I worked on audio separation and re-stitching, timestamp generation, and hosting the full site on AWS.

Madhur Singal
I worked on text-to-speech conversion with gTTS, and on pitch and amplitude modulation of the translated audio using a random forest machine learning algorithm.

Kashish Garg
I worked on text-to-text translation with Yandex, and also on the front end with Polymer.

Michael Ryan
Sampath Chanda