What it does
PointVision captures an image from a phone's camera and, guided by a spoken command, narrates a description of the scene and the objects in view. It can also report the distance to each of those objects.
How we built it
The main processing pipeline runs on a local server, due to the heavy processing required by the computer vision neural networks. The user interface is an Android app that communicates with the local server via a custom REST API. To start and stop recording the voice command, the user presses a single button that fills the entire screen. We made this UI decision because the user will be visually impaired, and a full-screen target is easy to hit without looking. While the audio records, we also capture an image; both the image and the audio file are then sent to the server for processing.
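A minimal sketch of the server side of this upload step, assuming a Flask server; the `/query` endpoint name and the form field names are illustrative, not necessarily the actual API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/query", methods=["POST"])
def query():
    # The app uploads both files in a single multipart/form-data request.
    image = request.files["image"]   # photo captured while recording
    audio = request.files["audio"]   # the spoken command
    image.save("latest.jpg")
    audio.save("latest.wav")
    # ... speech-to-text, intent detection, and the vision pipeline
    # would run here before a response is returned ...
    return jsonify({"status": "processing"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```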
Given the image, we needed to figure out what exactly the user wants to do with it. The audio clip of the user's voice is sent to the Azure Speech API for speech-to-text conversion, and the transcribed text is then sent to the Azure Language Understanding API to determine intent. Currently we support four intents: describe the scene qualitatively, list the objects in view, report the distances to those objects, and answer "what's right in front of me?"
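The routing from recognized intent to behavior can be sketched as a simple dispatch table. The intent names and placeholder responses below are assumptions for illustration, not the actual Language Understanding model labels:

```python
# Hypothetical handlers for each of the four supported intents;
# in the real pipeline these would query the vision results.
def describe_scene(ctx):   return "scene description"
def list_objects(ctx):     return "objects in view"
def object_distances(ctx): return "distances to objects"
def whats_in_front(ctx):   return "nearest object ahead"

INTENT_HANDLERS = {
    "DescribeScene":   describe_scene,
    "ListObjects":     list_objects,
    "ObjectDistances": object_distances,
    "WhatsInFront":    whats_in_front,
}

def handle(intent_name, ctx=None):
    # Fall back to a general scene description for unrecognized intents.
    return INTENT_HANDLERS.get(intent_name, describe_scene)(ctx)
```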
Now that we know what the user wants us to do, we send the image to two separate pipelines: the Azure Computer Vision API and a modified version of Niantic's monodepth2 API. The former detects objects in the image and generates a general one-line description of the scene. The monodepth2 API uses a neural network model trained on the KITTI dataset to estimate depth from a single image, rather than requiring a stereo/depth camera. Modifications were necessary because the API only returns relative distances within each image; we used some geometry to estimate absolute distance to within 7% accuracy. With this generated depth map and the bounding-box data from the Azure CV API, we mapped the detected objects into 3D space. From there, we could narrate the bounding-box labels, fused with custom code to make the responses sound more natural.
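The fusion step can be sketched as follows: for each detected object, take the median depth inside its bounding box as that object's distance. This is a sketch under the assumptions that the depth map has already been scaled to metres and matches the image resolution; the function names are illustrative.

```python
import numpy as np

def object_distances(depth_m, detections):
    """depth_m: HxW array of absolute depth in metres.
    detections: list of (label, (x, y, w, h)) boxes in pixel coords."""
    results = []
    for label, (x, y, w, h) in detections:
        patch = depth_m[y:y + h, x:x + w]
        # The median is robust to background pixels inside the box.
        results.append((label, float(np.median(patch))))
    return results

def narrate(results):
    # Simple natural-language rendering of the fused detections.
    return ", ".join(f"a {label} about {d:.1f} meters away"
                     for label, d in results)
```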
User Interface Part 2
The narrations from the image pipeline were converted into a speech audio file with the Azure Text-To-Speech API, and the file was sent back over the REST API to the mobile device for playback to the user.
Challenges we ran into
The biggest problem we had was getting the Android app to communicate with our REST API. While the API works when called directly from the command line, the in-app implementation is still unreliable.
Accomplishments that we're proud of
Figuring out the equations and algorithm to convert the depth map from relative to absolute coordinates, and fusing that information with the bounding-box labels to produce a semantically labeled 3D space.
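One way such a conversion can work (a sketch of a common approach, not necessarily the exact geometry used here): monodepth2 predicts depth only up to an unknown scale factor, so a single reference point with a known true distance pins down that factor for the whole map.

```python
import numpy as np

def scale_depth_map(rel_depth, ref_pixel, ref_distance_m):
    """rel_depth: HxW relative depth map.
    ref_pixel: (row, col) of a point whose true distance is known,
    e.g. a ground pixel whose distance follows from the phone's
    height and tilt (a hypothetical reference, for illustration)."""
    scale = ref_distance_m / rel_depth[ref_pixel]
    return rel_depth * scale
```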
What's next for Point Vision
There are a few improvements that could be made. First, the system could be deployed to an online server for scale, or even run locally on the phone; the latter would require reworking the neural networks to run efficiently on a mobile processor. More functionality could be added, such as a hands-free mode in which the phone hangs on a lanyard around the user's neck and their hand is detected with machine vision for a point-and-identify scheme. Another possible improvement is real-time obstacle notification, where the 3D depth map is combined with visual odometry and phone IMU data to compute the user's trajectory and give an audible warning if they are about to collide with an object.