We've all been in situations where we can't understand what someone is saying, whether because the room is too loud or because we're hard of hearing. We set out to make an easier way to follow an in-person conversation: real-time subtitles using augmented reality.

What it does

With this app, you can see what people are saying as face-tracked text overlaid on the speaker. This makes communication easier for everyone. It can also do a lot of social good by breaking down barriers for the hard of hearing, making it easier for them to join conversations.

How we built it

We built it in Python, combining OpenCV for face tracking, PyAudio for capturing and analyzing microphone input, and a speech-to-text API for transcription.
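The wiring between these pieces can be sketched as a producer/consumer pattern: an audio thread transcribes speech and hands text to the video loop, which draws it near the tracked face. This is a minimal sketch, not our exact code; `audio_worker`, `latest_caption`, and the `transcribe` callback are illustrative names, and the real version plugs in PyAudio plus the speech-to-text API where the fake transcriber sits.

```python
import queue
import threading

# Transcripts flow from the audio thread to the video loop through this queue
# (a thread-safe alternative to the global variables we actually used).
transcripts = queue.Queue()

def audio_worker(transcribe, stop):
    """Continuously transcribe mic audio and hand results to the video loop.

    `transcribe` stands in for the real PyAudio capture + speech-to-text call;
    it blocks until a phrase is recognized and returns it as a string.
    """
    while not stop.is_set():
        text = transcribe()
        if text:
            transcripts.put(text)

def latest_caption(current=""):
    """Drain the queue, keeping only the newest transcript for display."""
    while True:
        try:
            current = transcripts.get_nowait()
        except queue.Empty:
            return current

# Demo with a fake transcriber instead of a live microphone:
fake = iter(["hello", "world"])
stop = threading.Event()

def fake_transcribe():
    try:
        return next(fake)
    except StopIteration:
        stop.set()  # no more audio; tell the worker to exit
        return ""

worker = threading.Thread(target=audio_worker, args=(fake_transcribe, stop))
worker.start()
worker.join()
print(latest_caption())  # the newest phrase, "world"
```

In the real app, the main loop calls `latest_caption()` once per video frame and uses `cv2.putText` to render the result at the face's bounding-box coordinates.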

Challenges we ran into

Parsing the webcam and microphone streams simultaneously, and passing data between them, was difficult; we solved this with multiple threads and global variables. Detecting multiple speakers was also challenging: the more speakers in the frame, the more microphones we need to determine which text belongs to which speaker. We used two microphones and compared their volumes to determine which part of the frame the sound came from.
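The two-mic comparison can be sketched as follows. This is an illustrative version, not our exact code: it assumes the volume balance between the mics is mapped to a horizontal position in the frame, and the detected face nearest that position is tagged as the speaker. `assign_speaker` and its parameters are hypothetical names.

```python
def assign_speaker(face_centers_x, left_rms, right_rms, frame_width):
    """Guess which detected face is speaking from two mic volumes.

    face_centers_x: x-coordinates (pixels) of face centers from OpenCV
    left_rms, right_rms: RMS volume of the left and right microphones
    frame_width: width of the video frame in pixels

    Maps the volume balance to an x position (all-left mic -> x = 0,
    all-right mic -> x = frame_width) and returns the index of the
    nearest face, or None if there are no faces or no sound.
    """
    total = left_rms + right_rms
    if not face_centers_x or total == 0:
        return None
    sound_x = frame_width * right_rms / total
    return min(range(len(face_centers_x)),
               key=lambda i: abs(face_centers_x[i] - sound_x))

# Two faces; the left mic is much louder, so the left face is the speaker.
print(assign_speaker([100, 500], left_rms=0.8, right_rms=0.2, frame_width=640))  # 0
```

With only two mics this breaks down once two speakers stand on the same side, which is one reason diarization remains on our roadmap.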

Accomplishments that we're proud of

We combined the latest visual and audio machine learning technology into a single real-time app.

What's next for Subtitles IRL

We'd like to implement translation, on both the display and input sides. Further, speaker diarization is a huge area of research, with lots of potential improvements.

Built With

Python, OpenCV, PyAudio