After 8 hours of trying to figure out how to implement hardware and software with 5 different APIs, our good friend and teammate mentioned that we should do something to help blind people. Right at that moment, my other teammate and I just stared at each other in awe, and the lightbulbs above our heads glowed brighter than Rudolph's nose. It was like a special mental connection or something. We both had the same idea and knew exactly how we wanted to implement it.
What it does
Overview Project InSight is a mobile application that audibly maps the world around you as you take a video or photo. Our target audience is visually impaired individuals. A surprisingly large number of (legally) blind people have phones, and we realized we could build a constantly running application that continuously describes the world around them.
Ideal Implementation Project InSight's app would open and speak instructions for using the program. The program itself has two major settings: photo and video. To activate photo, you double-tap the screen; to activate video, you put three fingers on the screen. If you took a photo, the image would be sent to Clarifai's image recognition API, which returns JSON (text) tags describing the objects found in the image. The JSON tags would then be passed through a text-to-speech API, and your phone would speak the objects in the image to you. For example, if you were walking down a sidewalk and wanted to know what was in front of you, you would just double-tap your screen (while the app is running) and hear: "trees, people, restaurants, dogs, bright, sunny, etc...," giving you a mental image of what the world in front of you looks like.

If you enabled video mode, by default the app would capture a frame every 5 seconds, send it to the Clarifai API, and follow the same process described above. However, if you were in a more congested area (say, NYC), you could slide your finger down the screen and the app would count downward from 5 to 1; if you listened to the instructions when you opened the app, you would know that each number (5, 4, 3, 2, 1) represents how many seconds pass before each frame/photo is sent for processing. So if you were in NYC and wanted constant feedback on your environment, you could slide all the way down, and every second a frame would be captured and the process described earlier would occur. For every second that passes, you get audible feedback on what's happening around you, allowing you to mentally map out your surroundings. Conversely, if you slid your finger all the way up the screen (max = 10), you'd get feedback on your surroundings every 10 seconds. Interestingly, the program could tell when your surroundings haven't changed.
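The slide gesture described above boils down to mapping a vertical touch position onto a 1-to-10 second capture interval. A minimal sketch in Java of what that mapping could look like (the class and method names here are illustrative, not from the actual app):

```java
// Hypothetical sketch of the slide-to-interval mapping described above.
// Assumes the touch position is normalized so 0.0 is the bottom of the
// screen and 1.0 is the top; these names are not from the real app.
public class CaptureInterval {
    static final int MIN_SECONDS = 1;      // finger slid all the way down
    static final int MAX_SECONDS = 10;     // finger slid all the way up
    static final int DEFAULT_SECONDS = 5;  // default video-mode interval

    /** position: 0.0 = bottom of screen, 1.0 = top of screen. */
    static int intervalForPosition(double position) {
        double clamped = Math.max(0.0, Math.min(1.0, position));
        return (int) Math.round(MIN_SECONDS + clamped * (MAX_SECONDS - MIN_SECONDS));
    }

    public static void main(String[] args) {
        System.out.println(intervalForPosition(0.0)); // congested area: every second
        System.out.println(intervalForPosition(1.0)); // quiet area: every 10 seconds
    }
}
```

A timer (for example, Android's `Handler.postDelayed`) would then fire every `intervalForPosition(...)` seconds to grab the next frame.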
For example, say you were traveling in a pastoral area with video mode enabled. If you were standing still and nothing was changing, the app would know this by comparing a previous frame with the current one; rather than annoy you by repeating the same tags for your stationary environment, it would stay silent or play a notification sound indicating that nothing has changed. And say you start moving forward and pass the tree you were apparently standing by: the app would compare previous and current frames and, based on a percentage weight of how significantly the environment has changed, either describe the new details to you or say nothing at all.
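One cheap way to get that "percentage weight" is to compare the tag sets from consecutive frames rather than the raw pixels: if only a small fraction of the current frame's tags are new, stay silent; otherwise, speak just the new tags. A minimal sketch under that assumption (the threshold value and all names are invented for illustration):

```java
import java.util.*;

// Hypothetical sketch of the frame-change logic described above, comparing
// Clarifai tag sets from consecutive frames. The 0.3 threshold is an
// assumed tuning value, not taken from the actual app.
public class FrameChangeDetector {
    static final double CHANGE_THRESHOLD = 0.3;

    /** Fraction of the current frame's tags that were absent last frame. */
    static double changeRatio(Set<String> previous, Set<String> current) {
        if (current.isEmpty()) return 0.0;
        int newTags = 0;
        for (String tag : current) {
            if (!previous.contains(tag)) newTags++;
        }
        return (double) newTags / current.size();
    }

    /** Only the tags that are new this frame; empty list means stay silent. */
    static List<String> tagsToSpeak(Set<String> previous, Set<String> current) {
        if (changeRatio(previous, current) < CHANGE_THRESHOLD) {
            return Collections.emptyList();
        }
        List<String> fresh = new ArrayList<>();
        for (String tag : current) {
            if (!previous.contains(tag)) fresh.add(tag);
        }
        Collections.sort(fresh); // deterministic speaking order
        return fresh;
    }
}
```

Standing by the same tree yields identical tag sets, so `tagsToSpeak` returns an empty list and the app says nothing; walking past it swaps in enough new tags to cross the threshold, and only those new tags get spoken.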
Actual Implementation Due to the tight time constraints of a 24-hour hackathon, we were only able to get the photo feature of the app functioning, and we displayed the image tags on the screen instead of a fullscreen camera view. The photo feature was also buggy, so we added a fallback feature that processes stored photos instead.
How we built it
We used Clarifai's incredible image recognition and machine learning API to identify images and continually learn smaller details about how the environment changed. We also used Android's built-in text-to-speech API to interpret the JSON strings of object data, and Android Studio for developing the interface and the server-side pieces.
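The glue between the two APIs is mostly string work: flatten the tag names out of Clarifai's response into one utterance, then hand that string to Android's `TextToSpeech`. A minimal sketch of that step (the helper name is ours; on Android the final line would be something like `tts.speak(utterance, TextToSpeech.QUEUE_FLUSH, null, utteranceId)`):

```java
import java.util.*;

// Hypothetical helper turning a list of tag names (already pulled out of
// Clarifai's JSON response) into the sentence handed to text-to-speech.
public class TagSpeaker {
    static String utteranceFor(List<String> tags) {
        if (tags.isEmpty()) return "Nothing recognized.";
        return String.join(", ", tags) + ".";
    }

    public static void main(String[] args) {
        // On Android, this string would be passed to TextToSpeech.speak(...).
        System.out.println(utteranceFor(Arrays.asList("trees", "people", "dogs")));
    }
}
```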
Challenges we ran into
We had trouble sending images directly to the Clarifai API without storing them locally, and the text-to-speech implementation was extremely buggy, at one point not even functional. We eventually fixed both issues.
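One way to avoid storing the image locally is to base64-encode the camera's byte buffer and embed it straight into the request body. A minimal sketch, assuming a JSON shape along the lines of Clarifai's predict endpoint (the exact field layout here is an assumption and should be checked against Clarifai's current API docs):

```java
import java.util.Base64;

// Hypothetical sketch: build a request body with the image bytes inlined
// as base64, so nothing needs to touch local storage first. The JSON
// field names are assumed, not verified against Clarifai's current API.
public class InlineImagePayload {
    static String base64Payload(byte[] imageBytes) {
        String b64 = Base64.getEncoder().encodeToString(imageBytes);
        return "{\"inputs\":[{\"data\":{\"image\":{\"base64\":\"" + b64 + "\"}}}]}";
    }
}
```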
Accomplishments that we're proud of
Since Java and Android development were new to 3/4 of the team, there was a lot of on-the-spot learning and persistence. We are proud that this program even got off the ground in less than 16 hours, especially considering the amount of sleep we didn't get.
What we learned
We learned how to effectively use Clarifai's API and have it interact cleanly with a non-native text-to-speech converter. We also learned (as a group) a ton about Android development and how to effectively screen for bugs.
What's next for Project InSight
Implementing the other features described above: video mode, adjustable capture intervals, and frame-change detection.