We drew inspiration from the theme of the weekend and decided to focus our project on two things we are very passionate about: improving accessibility on the web and helping people explore.
What it does
YeetView is a fully accessible interface for Google Street View.
Users load the website on their phone and can then "look around". You can think of it as standard Google Street View VR, but instead of showing the visuals we describe the part of the scene the user is facing and read that description out to them. This lets them explore without the need for vision. YeetView also has built-in voice commands for navigation, so the user can move through and experience the world using only their voice.
How we built it
The tech stack is quite large for the project, but it can be broken down into the few components described below:
- Getting a text description based on a photo of what the user is looking at: a custom CNN with an attention-based LSTM, trained on COCO 2014 (on some GPUs the guys over at Spell were kind enough to lend us).
- Getting the voice input: the Speech Recognition API was used to obtain voice input commands and extract keywords pertaining to navigational control.
- Reading out the description: the Speech Synthesis API was used to take a description string and read it back to the user.
- Interface: the Google Street View API was used to provide motion tracking and imagery information for our project. Using the API, we are able to navigate city streets with the aforementioned voice commands and read back to the user what they are currently looking at.
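To give a feel for the keyword extraction step, here is a minimal sketch in Python of turning a recognised transcript into a navigation command. It only covers the matching logic, not the browser speech APIs themselves. The keyword table, the command names and the `extract_command` function are all assumptions made for illustration; the write-up does not list the vocabulary YeetView actually uses.

```python
from typing import Optional

# Hypothetical keyword-to-command table (assumed for illustration only).
COMMAND_KEYWORDS = {
    "forward": "MOVE_FORWARD",
    "ahead": "MOVE_FORWARD",
    "back": "MOVE_BACK",
    "left": "TURN_LEFT",
    "right": "TURN_RIGHT",
    "describe": "DESCRIBE",
}


def extract_command(transcript: str) -> Optional[str]:
    """Return the command for the first known keyword in the transcript, or None."""
    for word in transcript.lower().split():
        token = word.strip(".,!?")  # drop punctuation the recogniser may attach
        if token in COMMAND_KEYWORDS:
            return COMMAND_KEYWORDS[token]
    return None
```

Matching on single keywords rather than full phrases keeps the system tolerant of recognition noise: "please go forward" and "forward!" would both map to the same command.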
Challenges we ran into
Since the machine learning model takes a long time to perform inference, we had to come up with a workaround: a tree search that branches out into neighbouring street views to select which images are most likely to appear next. We pre-process those images ahead of time and cache the results for fast access. This enabled real-time exploration and feedback from the environment.
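The pre-processing idea can be sketched roughly as follows. This is a simplified illustration, not our exact code: it assumes a plain breadth-first search with a fixed depth limit as the search strategy, and `neighbours`, `caption`, `prefetch_captions` and the panorama ids are hypothetical stand-ins for the Street View link lookup and the captioning model.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def prefetch_captions(
    start: str,
    neighbours: Callable[[str], Iterable[str]],
    caption: Callable[[str], str],
    cache: Dict[str, str],
    max_depth: int = 2,
) -> None:
    """Breadth-first walk from `start`, captioning every uncached view within `max_depth` hops."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        pano, depth = queue.popleft()
        if pano not in cache:
            cache[pano] = caption(pano)  # the slow model inference happens here
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for nxt in neighbours(pano):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))


# Toy usage: a street of four linked views, with a fake captioner (both assumed).
street = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
cache: Dict[str, str] = {}
prefetch_captions("a", lambda p: street[p], lambda p: f"caption for {p}", cache)
```

Because results are stored in `cache`, moving to a neighbouring view and running the search again only pays for views that have not been captioned yet.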
We also used Spell to train our machine learning models. Since we hadn't used this platform before, and since it was significantly different to standard training pipelines, it took us considerable time to debug and construct a working pipeline. It's worth noting that by the end of the weekend we had gained back significant time by using Spell over standard methods.
Accomplishments that we're proud of
We were able to successfully incorporate speech recognition, voice navigation, visceral scene description and speech synthesis in the space of 24 hours. We are particularly proud of our progress since we hadn't previously touched many of these areas.
What we learned
- Spell API
- Web Speech API
- Street View API
What's next for YeetView
YeetMaps, bringing you visceral, high-level map descriptions in real-time! That and a lot of product polish :)