We've all been in situations where we can't understand what someone is saying. What if you want to talk to someone but your ears are blocked or you're hard of hearing? We set out to make it easier to understand someone in person: real-time subtitles using augmented reality.
What it does
The app shows what people are saying as text that tracks the speaker's face. This makes communication easier for everyone! It can also do a lot of social good by breaking down barriers for the hard of hearing, making it easier for them to have conversations.
How we built it
It was built in Python, combining OpenCV for face tracking, PyAudio for voice analysis, and the Rev.ai speech-to-text API.
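As a rough illustration of how face-tracking text can be positioned, here is a minimal sketch that picks where to draw a subtitle relative to a detected face. It assumes the face detector returns a box as `(x, y, w, h)`, the way OpenCV's cascade detectors do. The function name `subtitle_anchor` and the `text_height` and `padding` values are hypothetical and not taken from the project.

```python
def subtitle_anchor(face_box, frame_size, text_height=24, padding=8):
    """Return the (x, y) bottom-left corner at which to draw a subtitle.

    face_box   -- (x, y, w, h) of the detected face, in pixels
    frame_size -- (width, height) of the video frame, in pixels
    The text_height and padding defaults are illustrative assumptions.
    """
    x, y, w, h = face_box
    frame_w, frame_h = frame_size

    # Prefer placing the text just below the face.
    baseline_y = y + h + padding + text_height

    # If that would fall off the bottom of the frame, place it above instead.
    if baseline_y > frame_h:
        baseline_y = max(text_height, y - padding)

    # Keep the left edge of the text inside the frame.
    anchor_x = min(max(x, 0), max(frame_w - 1, 0))
    return anchor_x, baseline_y
```

The returned point could be passed as the `org` argument of `cv2.putText`, which treats it as the bottom-left corner of the text.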
Challenges we ran into
Processing the webcam and microphone data simultaneously and passing results between the two was difficult. We solved this by running them on separate threads that communicate through global variables. Detecting multiple speakers was also challenging: the more speakers in the frame, the more microphones we need to determine which text belongs to which speaker. We used two microphones and compared their volumes to estimate which part of the frame the sound came from.
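The two ideas in this section can be sketched roughly as follows: compare the loudness of two microphone blocks to guess which side of the frame the speaker is on, and share that result between an audio thread and a video thread. This is only a sketch under assumptions. The names `rms`, `louder_side`, and `SharedCaption` and the `margin` threshold are hypothetical, and it uses a lock-protected object in place of the bare global variables used in the project.

```python
import math
import threading


def rms(samples):
    """Root-mean-square level of a block of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def louder_side(left_block, right_block, margin=1.2):
    """Guess which side of the frame the sound came from.

    margin is an assumed threshold: one mic must be this many times
    louder than the other before a side is chosen.
    """
    left, right = rms(left_block), rms(right_block)
    if left > right * margin:
        return "left"
    if right > left * margin:
        return "right"
    return "unclear"


class SharedCaption:
    """State written by the audio thread and read by the video thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._side = "unclear"
        self._text = ""

    def update(self, side, text):
        # Called from the audio thread after each transcribed block.
        with self._lock:
            self._side, self._text = side, text

    def snapshot(self):
        # Called from the video thread once per frame.
        with self._lock:
            return self._side, self._text
```

In use, the audio thread would call `update(louder_side(left, right), transcript)` for each block, and the video loop would call `snapshot()` to decide which face to caption.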
Accomplishments that we're proud of
Combining the latest computer vision and speech recognition technology in a single real-time app.
What's next for Subtitles IRL
We'd like to implement translation on both the input and display sides. Speaker diarization (working out who said what) is also a huge area of research, with lots of potential improvements.