Inspiration

Many language learning tools exist. Most focus on gamifying language skills inside an app environment. Others translate text to text using your mobile camera. However, no tools provide real-time language training using your actual environment, and audio language samples are usually recorded in generic voices that sound nothing like your own.

What it does

Babelfish enables language learning using the environment around you, in the sound of your own voice. Whatever your mobile device's camera sees is classified into a text summary. That summary is passed to a translation service for the language of your choice, and the translated text is then sent to ElevenLabs, which speaks it through an instant clone of your own voice. The instant clone can generate any language in the multilingual_v2 set, so you hear your own voice speak your target language as if you were a native speaker. Hearing your own voice in a different language dramatically sharpens your ear for accent and pronunciation differences, accelerating your spoken proficiency.

How we built it

The architecture consists of three services:

  • Environment summarization using LLaVA that converts an image to text
  • Text-to-text language translation using ChatGPT
  • Text-to-speech using ElevenLabs instant clones
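The three services above compose into one linear pipeline. A minimal sketch, assuming hypothetical wrapper functions for each stage (the real calls would go to a LLaVA endpoint, the ChatGPT API, and the ElevenLabs text-to-speech API; none of these names come from the actual codebase):

```python
from typing import Callable

def babelfish_pipeline(
    summarize: Callable[[bytes], str],     # stage 1: image -> text summary (LLaVA)
    translate: Callable[[str, str], str],  # stage 2: (text, target_lang) -> translated text (ChatGPT)
    speak: Callable[[str], bytes],         # stage 3: text -> audio in the cloned voice (ElevenLabs)
    image: bytes,
    target_lang: str,
) -> bytes:
    """Run one camera frame through the three Babelfish stages in order."""
    summary = summarize(image)
    translated = translate(summary, target_lang)
    return speak(translated)
```

Keeping the stages as injected callables makes each service swappable, which matters later when chasing latency.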

Challenges we ran into

Real-time video classification is still inconsistent: video classifiers struggle to understand the context of time-based events. To compensate, you have to define a window of time over which you want the environment summarized.
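One way to implement that window is to sample frames at a fixed interval and summarize only the batch collected inside it. A sketch under that assumption (`window_frames` is a hypothetical helper; frame capture and the summarizer itself are out of scope):

```python
def window_frames(stamped_frames, window_start, window_len, every=1.0):
    """Thin a stream of (timestamp_seconds, frame) pairs down to the frames
    inside [window_start, window_start + window_len), keeping at most one
    frame per `every` seconds. The surviving frames are what gets summarized."""
    picked = []
    last_t = None
    for t, frame in sorted(stamped_frames, key=lambda pair: pair[0]):
        if not (window_start <= t < window_start + window_len):
            continue
        if last_t is None or t - last_t >= every:
            picked.append(frame)
            last_t = t
    return picked
```

The interval trades cost against fidelity: a wider `every` means fewer summarization calls but a coarser picture of the scene.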

The ElevenLabs instant clone feature is both incredibly powerful and very sensitive: the audio quality of your voice sample matters enormously. For instance, since the hackathon space is quite noisy, we went to the bathroom showers to record an audio sample. That recording had no background noise, but it had a tremendous amount of echo, and the echo was then omnipresent in the clone's generated audio.

To fix this, a second set of audio samples was recorded last night at home. The first was ~2 minutes of reading from a book. While the audio quality was much better, this clone sounded like - well - someone reading from a book. We then added another sample: a minute of random talking with intentionally varied speaking tones, inflections, and pacing. This second sample radically transformed the clone's responses, though it's difficult to say which is the "best".

The clarity parameter in ElevenLabs seems to be the most influential on the pacing and overall tone of the generated audio. After adding the second sample, turning the clarity up much higher seemed to yield more natural results.
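In the ElevenLabs text-to-speech API these knobs live in the `voice_settings` object; to our understanding the clarity slider corresponds to the `similarity_boost` field, alongside `stability`. A sketch of the request body we experimented with (the values shown are placeholders, not tuned recommendations):

```python
def tts_request_body(text: str, stability: float = 0.4, clarity: float = 0.85) -> dict:
    """Build an ElevenLabs text-to-speech request body.

    `clarity` maps to the API's `similarity_boost` field (assumed naming);
    the default values are illustrative placeholders only.
    """
    return {
        "text": text,
        "model_id": "eleven_multilingual_v2",  # the multilingual model set Babelfish targets
        "voice_settings": {
            "stability": stability,
            "similarity_boost": clarity,
        },
    }
```

The body would be POSTed to the text-to-speech endpoint for the instant clone's voice ID, which is omitted here.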

Accomplishments that we're proud of

The quality of the instant clone was surprisingly good. Being able to hear your own voice describe your environment in a different language with a perfect accent is a bit surreal. This approach creates an instant feedback loop where you can correct your pronunciation to greatly accelerate language proficiency.

What we learned

Hearing your own voice speak a different language with a perfect accent is a bit surreal. It accentuates differences in pronunciation far more than hearing the language spoken in someone else's voice. Interestingly, opinions varied wildly on how "good" the clones were at replicating a voice: the person being cloned often heard the clone's quality differently than everyone else did, which highlights how differently we hear our own voices compared to how others hear them.

What's next for Babelfish

The biggest barrier to Babelfish being really powerful is latency. Currently, there is too much latency to use it in a real-world situation where you're trying to communicate with someone in a foreign language. The current architecture has to capture an image, convert that image to text, translate the text into the target language, and only then hit ElevenLabs to generate the audio file. An image-to-text model that could emit text directly in the target language would dramatically improve performance.
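A first step toward cutting that latency is measuring where it goes. A minimal sketch that times each stage of the current pipeline (the stage functions here are stand-ins for the real service calls):

```python
import time

def timed_stages(stages, value):
    """Run `stages`, a list of (name, fn) pairs, in sequence, threading `value`
    through them and recording wall-clock seconds per stage.

    Returns (final_value, {stage_name: seconds}) so the slowest hop is obvious.
    """
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        value = fn(value)
        timings[name] = time.perf_counter() - start
    return value, timings
```

Wrapping the image-to-text, translation, and text-to-speech calls this way would show whether the bottleneck is a single stage or the accumulation across all three.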

Built With

  • LLaVA
  • ChatGPT
  • ElevenLabs
