Inspiration

While brainstorming, one of our group members, Alex, complained about how difficult it is to communicate with speakers of a different language in online voice calls. In particular, he shared how he and his Japanese friend (totally real) had trouble understanding each other due to the language barrier. His story sparked our curiosity and made us ask: what happens when someone doesn't understand the common language in an online meeting or call? Since we didn't have any other interesting/fun ideas to implement, we began digging into our live translation application, TalkBox.

What it does

TalkBox is a PC application that acts as a virtual man-in-the-middle to enable smooth, real-time voice-to-text translation in calls and meetings. (Similar to Zoom's live captions, as I've been told many times.)

How we built it

The TalkBox system leverages OpenAI's Whisper to transcribe audio into text, uses DeepL to translate that text from any given language into a desired one, and displays the translations as captions in real time. The system is divided into three main components: Backend, User Experience (UX), and User Interface (UI), each responsible for a distinct part of the captioning pipeline.

Backend

The Backend is responsible for live captioning and translation for the User Experience (UX) to utilize.
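As a rough sketch of how this backend pipeline fits together, here is a minimal version of the transcribe-then-translate loop. The `transcribe` and `translate` functions below are placeholder stubs standing in for Whisper and DeepL (the real backend would call something like `whisper.load_model(...).transcribe(...)` and `deepl.Translator(...).translate_text(...)`); the function names and queue-based hand-off are our illustration, not the actual TalkBox code.

```python
import queue
import threading

def transcribe(audio_chunk: bytes) -> str:
    """Placeholder for Whisper speech-to-text."""
    return audio_chunk.decode("utf-8")  # stand-in: pretend the audio is already text

def translate(text: str, target_lang: str) -> str:
    """Placeholder for DeepL translation."""
    return f"[{target_lang}] {text}"  # stand-in: tag the text instead of translating

def run_pipeline(audio_chunks, target_lang="EN-US"):
    """Feed audio chunks through transcription and translation, yielding captions."""
    captions = queue.Queue()

    def worker():
        for chunk in audio_chunks:
            text = transcribe(chunk)
            captions.put(translate(text, target_lang))
        captions.put(None)  # sentinel: stream finished

    # The worker runs in the background so captions can be consumed as they arrive.
    threading.Thread(target=worker, daemon=True).start()
    while (caption := captions.get()) is not None:
        yield caption
```

In the real application the chunks would come from a live microphone or audio-device stream rather than a fixed list, but the shape of the loop is the same: each chunk is transcribed, translated, and handed to the UX layer as a caption.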

User Experience (UX)

The UX layer manages the translation strings coming from the backend: displaying the captions, running text-to-speech (TTS) for the captions, and handling input and output language selection.
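One way the caption-management part of this layer might look is a small rolling buffer that keeps only the most recent lines on screen. This class and its window size are illustrative assumptions, not the actual TalkBox code:

```python
from collections import deque

class CaptionBuffer:
    """Holds the most recent captions for on-screen display (rolling window)."""

    def __init__(self, max_lines: int = 3):
        # deque with maxlen automatically discards the oldest caption
        # once the window is full.
        self.lines = deque(maxlen=max_lines)

    def push(self, caption: str) -> None:
        """Append a new caption from the backend."""
        self.lines.append(caption)

    def render(self) -> str:
        """Render the window, newest caption at the bottom."""
        return "\n".join(self.lines)
```

The same buffer contents could also be fed to the TTS engine, so speech and on-screen captions stay in sync.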

User Interface (UI)

The UI serves as the interactive layer through which users control the captioning system.

Challenges we ran into

The hardest wall our group ran into was the implementation of live captioning and translation. At the time, none of us knew how it worked or where to begin.

Accomplishments that we're proud of

Everyone agreed it was the live captioning. In the beginning, none of us knew how captioning even worked, let alone how to implement it, but after several iterations we got it working.

What we learned

We spent most of our time learning how to use public APIs: searching for the right API for a specific use case, and even how to comb through documentation to debug an issue.

What's next for TalkBox

Although we put a ton of elbow grease into this project, possible improvements and optimizations can be found all over the place: the processing speed for captioning and translating, the design of the UI, the logic behind the translations, the logic for live captioning, and many more. The possible improvements do not stop there. As we were implementing TalkBox, we came up with several ideas to make the application even more user-friendly. Instead of listening to the user's audio input and translating it, why not listen to the device's audio output and do the same? TalkBox was originally meant to listen to the audio output of the client's device in order to capture the other caller's voice. As we dug further, we realized it would take too much time to implement, so we had to set that idea aside for a future version.

Built With
