On the bus ride to another hackathon, one of our teammates was trying to get some sleep but couldn't, because the chatter of the other passengers was so loud and layered. This led to the idea that, in a sufficiently noisy environment, hearing can be just as descriptive and rich as seeing. To help people with visual impairments navigate and understand their surroundings, we created software that describes a person's environment and builds an auditory map of it.
What it does
In a sentence, it uses machine vision to give people a kind of echolocation. More specifically, the user simply holds up their phone, and the software guides them using a 3D auditory map. The video feed is streamed to a server, where our modified version of the YOLO9000 object detection convolutional neural network identifies and localizes the objects of interest in each frame. The server then returns the name and position of each object to the user's phone. It also uses the IBM Watson API to further check these readings, validating which objects are actually in the scene and whether any of them have been misclassified.
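The server-side step can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Detection` record, the 60-degree camera field of view, the 0.3 confidence cutoff, and the idea of modeling the Watson validation as "keep a detection only if the scene-level tags also contain its label" are all assumptions. The object's horizontal position in the frame is converted into an angle with a simple pinhole-camera model.

```python
import math
from dataclasses import dataclass

# Assumed horizontal field of view of the phone camera (hypothetical value).
HORIZONTAL_FOV_DEG = 60.0


@dataclass
class Detection:
    label: str
    confidence: float
    x_center: float  # bounding-box center, normalized: 0.0 = left edge, 1.0 = right edge


def azimuth_deg(x_center, fov_deg=HORIZONTAL_FOV_DEG):
    """Angle from the camera axis to the object (pinhole model); negative = left."""
    half_width = math.tan(math.radians(fov_deg / 2))
    return math.degrees(math.atan((2 * x_center - 1) * half_width))


def validate(detections, scene_tags, min_conf=0.3):
    """Keep confident detections whose label the scene-level classifier also reports."""
    tags = {t.lower() for t in scene_tags}
    return [d for d in detections if d.confidence >= min_conf and d.label.lower() in tags]


def to_payload(detections, scene_tags):
    """Build the name/position records sent back to the phone."""
    return [
        {"name": d.label, "azimuth_deg": round(azimuth_deg(d.x_center), 1)}
        for d in validate(detections, scene_tags)
    ]
```

With a chair detected left of center and scene tags that include "Chair", `to_payload` returns a single record with an azimuth of about -16 degrees; a detection the tags do not confirm, or one below the confidence cutoff, is dropped.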
From there, we make it sound as though each object is saying its own name, so that the user can build a spatial map of their environment from audio cues alone. Sounds get quieter the farther away an object is, and the balance between the left and right channels shifts as the object moves around the user. The phone also records its own orientation and remembers where past objects were for a few seconds afterward, even once it no longer sees them.
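One way to implement the distance-based volume, left/right balance, and short-term memory is sketched below under stated assumptions: each object's distance is already known (the write-up does not say how it is estimated), loudness falls off inversely with distance, left/right balance uses constant-power panning, the phone supplies a compass heading, and "a few seconds" is taken as 3. The names `stereo_gains`, `ObjectMemory`, and `REMEMBER_SECONDS` are hypothetical.

```python
import math

# "A few seconds" from the write-up, taken as 3 s here (assumption).
REMEMBER_SECONDS = 3.0


def stereo_gains(relative_azimuth_deg, distance_m, ref_distance_m=1.0):
    """Return (left, right) gains: inverse-distance attenuation times constant-power pan.

    relative_azimuth_deg: angle from where the phone points; negative = left, positive = right.
    """
    # Clamp so sources beside or behind the listener are panned fully to one side.
    az = max(-90.0, min(90.0, relative_azimuth_deg))
    # Map -90..90 degrees onto a pan angle of 0..pi/2.
    theta = (az + 90.0) / 180.0 * (math.pi / 2)
    # Full volume inside the reference distance, then 1/d fall-off.
    attenuation = ref_distance_m / max(distance_m, ref_distance_m)
    return attenuation * math.cos(theta), attenuation * math.sin(theta)


class ObjectMemory:
    """Remembers objects in a world-fixed frame (compass bearing) for a short time."""

    def __init__(self, ttl_s=REMEMBER_SECONDS):
        self.ttl_s = ttl_s
        self._objects = {}  # name -> (bearing_deg, distance_m, last_seen_s)

    def observe(self, name, relative_azimuth_deg, distance_m, phone_heading_deg, now_s):
        # Convert the camera-relative angle into a compass bearing (clockwise from north).
        bearing = (phone_heading_deg + relative_azimuth_deg) % 360.0
        self._objects[name] = (bearing, distance_m, now_s)

    def audible(self, phone_heading_deg, now_s):
        """Return (name, left_gain, right_gain) for every object seen within the TTL."""
        result = []
        for name, (bearing, distance_m, seen_s) in list(self._objects.items()):
            if now_s - seen_s > self.ttl_s:
                del self._objects[name]  # forget stale objects
                continue
            # Re-express the stored bearing relative to the current heading, in [-180, 180).
            rel = (bearing - phone_heading_deg + 180.0) % 360.0 - 180.0
            left, right = stereo_gains(rel, distance_m)
            result.append((name, left, right))
        return result
```

Storing each object's compass bearing rather than its on-screen angle is what keeps a remembered object fixed in the soundscape as the user turns: a door seen straight ahead while facing east is heard on the right after the user turns to face north. Keying the memory by name is a simplification; two chairs in view would overwrite each other.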
We also thought about where in everyday life a user would want extra detail, and one thing that stood out to us was faces. People generally recognize one another by specific details of an individual's face, so using Microsoft's face recognition API, we added a feature that lets our system identify and follow friends and family by name. All someone has to do is register their face as a recognizable face, and they become their own identifiable feature in the user's personal system.
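The project relies on Microsoft's hosted service for recognition, so the following is only a local stand-in illustrating the enroll-then-identify idea, not that API: each enrolled person maps to one or more face embedding vectors, and a new face is matched to the closest enrolled vector by cosine similarity, or to nobody if the best score falls below a threshold. The `FaceGallery` class, the 0.8 threshold, and the source of the embeddings (some face-embedding model) are hypothetical.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("embedding must be non-zero")
    return [x / norm for x in vector]


class FaceGallery:
    """Hypothetical local gallery: enrolled names mapped to face embeddings."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # minimum cosine similarity to accept a match
        self._people = {}  # name -> list of unit-length embeddings

    def enroll(self, name, embedding):
        """Register one more example embedding for a person."""
        self._people.setdefault(name, []).append(_normalize(embedding))

    def identify(self, embedding):
        """Return the best-matching enrolled name, or None if nobody is close enough."""
        query = _normalize(embedding)
        best_name, best_score = None, -1.0
        for name, examples in self._people.items():
            for example in examples:
                score = sum(a * b for a, b in zip(query, example))
                if score > best_score:
                    best_name, best_score = name, score
        return best_name if best_score >= self.threshold else None
```

Raising `threshold` trades more missed recognitions for fewer faces announced under the wrong name.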
What's next for SoundSight
This system could be further augmented with voice recognition and processing software to support spoken feedback, making for a much more natural experience. It could also be paired with a simple infrared camera for navigating at night, making it usable at any time of day. Finally, the system's machine vision could be improved further, which would increase its overall effectiveness.