We knew we wanted to do something with speech-to-text for accessibility purposes. As soon as we saw that Rev had a live speech-to-text API, we saw the opportunity to do something new and fun! This soon turned into live speech bubbles as a more immersive alternative to traditional captions.
What it does
Real-time, sticky captions in augmented reality: bubl detects faces and uses Rev's real-time speech-to-text API to place text captions next to an orator in 3D space.
How we built it
We used the Rev.ai speech-to-text API to stream and transcribe audio from a mobile device in real time, then used Apple's Vision framework to detect faces and attach captions to the orator. The face detection only works in two dimensions, so we extrapolated the z dimension (depth) from the size of the face: the larger a face appears in the frame, the closer it is to the camera.
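One common way to get depth from face size is the pinhole camera model, sketched below in Python for readability. This is an illustration under stated assumptions, not necessarily the exact formula bubl uses: the function name, the 0.15 m average face width, and the focal length value are all hypothetical.

```python
# Sketch of depth-from-face-size using the pinhole camera model.
# Assumption: an average adult face is about 0.15 m wide (illustrative value).
ASSUMED_FACE_WIDTH_M = 0.15


def estimate_depth_m(face_width_px: float,
                     focal_length_px: float,
                     real_face_width_m: float = ASSUMED_FACE_WIDTH_M) -> float:
    """Estimate the distance to a face from its apparent width in pixels.

    Similar triangles give: depth = focal_length * real_width / pixel_width.
    """
    if face_width_px <= 0:
        raise ValueError("face_width_px must be positive")
    return focal_length_px * real_face_width_m / face_width_px


# A face that looks twice as wide is estimated to be half as far away.
near = estimate_depth_m(face_width_px=300, focal_length_px=1000)
far = estimate_depth_m(face_width_px=150, focal_length_px=1000)
```

The estimate is only as good as the assumed face width, so it is best treated as a rough placement hint rather than a measurement.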
Challenges we ran into
- Getting an audio stream from the microphone is a lot less trivial than it initially sounds. iOS makes it easy to record audio and then process it, but live processing of audio takes a lot more setup.
- Face detection returns results as pixel locations on a two-dimensional image, so we had to extrapolate from this information to place objects in a three-dimensional world.
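To illustrate the 2D-to-3D step, here is a minimal Python sketch of back-projecting a pixel plus an estimated depth into a camera-space point. It assumes a pinhole camera whose principal point is the image center, and a camera space with x to the right, y up, and depth measured forward. The function name and all parameter values are hypothetical; a real AR framework has its own axis conventions and supplies its own camera intrinsics.

```python
from typing import Tuple


def unproject(u_px: float, v_px: float, depth_m: float,
              focal_length_px: float,
              image_size_px: Tuple[int, int]) -> Tuple[float, float, float]:
    """Convert a pixel (u, v) and a depth into a 3D point in camera space.

    Assumption: the principal point is the image center, and pixel v grows
    downward while camera-space y grows upward.
    """
    width, height = image_size_px
    cx, cy = width / 2.0, height / 2.0
    x = (u_px - cx) * depth_m / focal_length_px
    y = (cy - v_px) * depth_m / focal_length_px  # flip the vertical axis
    return (x, y, depth_m)


# A face centered in a 1920x1080 frame lands straight ahead of the camera.
center_point = unproject(960, 540, 2.0, 1000.0, (1920, 1080))
# A face 500 px right of center at 2 m depth lands 1 m to the right.
right_point = unproject(1460, 540, 2.0, 1000.0, (1920, 1080))
```

The resulting point is relative to the camera, so it still has to be transformed by the camera's pose to become a fixed position in the world.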
Accomplishments that we're proud of
- Getting face detection and tracking to work with ARKit in a very unconventional way.
- Connecting two unrelated APIs in a way that hasn't been done before.
What we learned
A whole lot about AR and Rev's APIs.
What's next for bubl.
Right now, speech bubbles only attach to a person when exactly one person is in view, because only then is it obvious who is speaking. The next step for bubl is to detect when there are multiple people in view and attach a speech bubble to the person who is currently talking. We'd also like to add the ability to view the full history of a session and to inspect speech bubbles that are far in the distance.