Inspiration
Globally, vision impairment affects 40 million people, yet most current solutions offer only basic obstacle detection. We built our project to evolve assistive tech from "notifying" to "navigating." By combining AI with real-time spatial reasoning, we aim to give visually impaired people the ability, and the confidence, to walk anywhere!
What it does
ReSee is a real-time navigation assistive device for visually impaired people. It uses stereoscopic depth maps and Gemini to analyse the 3D scene, detect stationary and moving obstacles, and provide audio navigation. The system runs three local detection modules continuously:
- Depth mapping to understand the distances to objects
- YOLO + SLAM to identify what those objects are and track them as the wearer moves
- Proximity detection for objects that are close to the wearer, and whether they lie in the wearer's path

When obstacles are detected, or when the wearer asks the device to get to a location, Gemini maps out a route and provides the wearer with directions via voice commands. An example instruction: "Person approaching on your left. Veer slightly right and slow down."
How we built it
We built a multi-tier architecture combining efficient local processing with on-demand AI:
- Depth mapping from calibrated stereo image pairs, using StereoSGBM for disparity matching and filtering (a disparity sketch follows this list)
- Object detection using YOLOv8n (CoreML) with IoU tracking and ReID embeddings for persistent identification (a greedy IoU matcher is sketched below)
- Object-based SLAM that uses stationary objects as anchors to estimate camera motion while the wearer is on the move, with RANSAC refinement over the tracked anchor objects (see the ego-motion sketch below)
- A real-time bird's-eye view of object positions in polar coordinates, updated as the camera moves
- Spatial reasoning by Gemini, which finds the shortest safe path to the desired goal while avoiding the detected obstacles and returns movement instructions to the wearer as audio (a sketch of the call follows this list)
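For reference, here is a minimal sketch of the StereoSGBM disparity stage and the depth-from-disparity step. The parameter values are illustrative, and `f_px` and `baseline_m` are hypothetical stand-ins for our real calibration:

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # rectified left view
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # rectified right view

# numDisparities must be a multiple of 16; blockSize must be odd.
stereo = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,
    blockSize=5,
    P1=8 * 5 ** 2,          # penalty for small disparity changes
    P2=32 * 5 ** 2,         # larger penalty for big jumps (smoother surfaces)
    uniquenessRatio=10,
    speckleWindowSize=100,  # drop connected speckle regions smaller than this
    speckleRange=2,
)

# StereoSGBM outputs fixed-point disparity scaled by 16.
disparity = stereo.compute(left, right).astype(np.float32) / 16.0

# Depth from disparity: Z = f * B / d, with focal length f (pixels) and
# baseline B (metres). These values are hypothetical; real ones come from
# stereo calibration.
f_px, baseline_m = 700.0, 0.06
depth_m = np.zeros_like(disparity)
valid = disparity > 0
depth_m[valid] = f_px * baseline_m / disparity[valid]
```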
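And a sketch of the greedy IoU association step used for tracking. The box format, threshold and fall-through behaviour (unmatched detections go on to ReID matching or spawn new tracks) are our assumptions for illustration:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_detections(tracks, detections, iou_thresh=0.3):
    """Greedily associate new detection boxes to existing track boxes.

    Returns (matches, unmatched_detection_indices); unmatched detections
    would go on to ReID matching or spawn new tracks.
    """
    matches, used = [], set()
    for t_id, t_box in tracks.items():
        scores = [(iou(t_box, d), i)
                  for i, d in enumerate(detections) if i not in used]
        if not scores:
            continue
        best_iou, best_i = max(scores)
        if best_iou >= iou_thresh:
            matches.append((t_id, best_i))
            used.add(best_i)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched
```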
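A sketch of how anchor-based ego-motion estimation and the polar bird's-eye projection fit together. Here OpenCV's `estimateAffinePartial2D` with RANSAC stands in for our refinement step, and all coordinates are made up:

```python
import cv2
import numpy as np

# Ground-plane (x, z) positions of stationary anchor objects in metres,
# as observed in the previous frame and the current frame (hypothetical data).
prev_pts = np.array([[1.0, 3.0], [-0.5, 2.0], [2.0, 5.0], [0.2, 4.1]], np.float32)
curr_pts = np.array([[0.9, 2.8], [-0.6, 1.8], [1.9, 4.8], [0.1, 3.9]], np.float32)

# Stationary anchors should all shift consistently with the inverse of the
# camera motion; RANSAC rejects anchors that actually moved on their own.
M, inliers = cv2.estimateAffinePartial2D(
    prev_pts, curr_pts, method=cv2.RANSAC, ransacReprojThreshold=0.2)
dx, dz = M[0, 2], M[1, 2]               # apparent translation since last frame
heading = np.arctan2(M[1, 0], M[0, 0])  # apparent rotation since last frame

def to_polar(points_xz):
    """Bird's-eye view in (range, bearing) around the wearer; 0 rad = ahead."""
    r = np.hypot(points_xz[:, 0], points_xz[:, 1])
    theta = np.arctan2(points_xz[:, 0], points_xz[:, 1])
    return np.stack([r, theta], axis=1)

print(to_polar(curr_pts))
```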
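Finally, a minimal sketch of the Gemini step using the google-generativeai Python SDK. The model name, scene schema and prompt wording are illustrative, not our production prompt; the point is that a compact, text-only scene description replaces whole frames:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")

# Labelled, distance-annotated centre-points instead of full images keeps
# the payload (and latency) small.
scene = [
    {"label": "person", "bearing_deg": -20, "distance_m": 2.1, "moving": True},
    {"label": "bench",  "bearing_deg": 35,  "distance_m": 1.4, "moving": False},
]
prompt = (
    "You guide a blind pedestrian. Objects (bearing in degrees, negative = "
    f"left): {scene}. Goal: store entrance, 12 m ahead. "
    "Reply with one short spoken instruction."
)
print(model.generate_content(prompt).text)
```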
Challenges we ran into
- Dealing with noise when generating the depth maps. Many of the depth-map configurations we ran produced heavy noise and distortion, which made the maps hard for Gemini to process. We trial-and-errored through various configurations and settled on StereoSGBM's speckle filtering to remove small regions of speckle noise, a WLS filter for edge-preserving smoothing, and a median filter with a 5x5 kernel to remove the speckle noise remaining after the WLS pass (see the sketch below).
- Latency in preprocessing and API calls. Because StereoSGBM and high-resolution object detection are computationally expensive, we had to run them concurrently and use a pre-built binary for YOLO. On top of that, slow upload rates to Gemini meant we couldn't simply upload the whole image: with the prompts and spatial information included, that led to unacceptable latencies above 1.5 seconds. We resorted to sending only the labelled, distance-annotated centre-points of the detected objects, and determined the best action with latencies of around 300 ms this way.
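A minimal sketch of that filter chain, assuming opencv-contrib-python provides `cv2.ximgproc` (the lambda and sigma values here are illustrative, not our tuned settings):

```python
import cv2
import numpy as np

# Requires opencv-contrib-python for cv2.ximgproc.
left_matcher = cv2.StereoSGBM_create(
    minDisparity=0, numDisparities=128, blockSize=5,
    speckleWindowSize=100, speckleRange=2)  # built-in speckle removal
right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)
disp_left = left_matcher.compute(left, right)
disp_right = right_matcher.compute(right, left)

# WLS filter: edge-preserving smoothing guided by the left image.
wls = cv2.ximgproc.createDisparityWLSFilter(left_matcher)
wls.setLambda(8000.0)   # smoothness strength (illustrative)
wls.setSigmaColor(1.5)  # edge sensitivity (illustrative)
filtered = wls.filter(disp_left, left, disparity_map_right=disp_right)

# Final 5x5 median pass knocks out residual speckle noise.
filtered = cv2.medianBlur(filtered.astype(np.float32), 5)
```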
Accomplishments that we're proud of
We achieved a working model of a device that can reason and navigate a user from A to Z, wherever the wearer wishes to go. It comes complete with map routing and voice communication with the device. This technology can be integrated into many form factors: drones, robots and, more excitingly, mobility assistive devices.
What we learned
- We learned some hard truths about why we haven't seen commercial products built on stereoscopic depth mapping: the power draw is demanding, the code is complex, and the depth maps are unreliable. We also learned to adapt to these challenges and come up with creative solutions, from the power draw to the depth-map noise and latency issues.
- We also learned the limitations of stereoscopic depth mapping itself: it fails on transparent or reflective objects, it is light-sensitive (darker scenes worsened the quality of the depth maps), and it misses objects moving too fast for the cameras to pick up.
- Further, we learned how to keep track of our own orientation and movement using only a stereo camera and no additional sensors, by re-identifying objects that had left the field of view based on cosine similarity of DINOv2 embeddings, and re-calibrating our orientation when they are re-discovered (sketched below).
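A minimal sketch of that re-identification step, assuming the DINOv2 backbone from torch.hub; the similarity threshold is hypothetical (we tuned ours empirically):

```python
import torch
import torch.nn.functional as F

# DINOv2 small backbone from torch.hub (assumes internet access on first run).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

def embed(crop):
    """Embed a preprocessed (3, H, W) object crop.

    Assumes ImageNet-normalized input with H and W multiples of 14.
    Returns an L2-normalized embedding so dot product = cosine similarity.
    """
    with torch.no_grad():
        return F.normalize(model(crop.unsqueeze(0)), dim=-1).squeeze(0)

def reidentify(new_emb, lost_tracks, threshold=0.7):
    """Match a new embedding against tracks that left the field of view.

    lost_tracks maps track_id -> stored embedding. Returns the best-matching
    lost track id above the threshold, or None for a brand-new object.
    """
    best_id, best_sim = None, threshold
    for track_id, old_emb in lost_tracks.items():
        sim = float(torch.dot(new_emb, old_emb))  # cosine sim (normalized)
        if sim > best_sim:
            best_id, best_sim = track_id, sim
    return best_id
```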
What's next for ReSee
Next up for ReSee is to go bigger: improving our navigation system to incorporate more detail, widen its field of view, perceive distances at greater range and detect quickly moving objects more reliably. Once that is achieved, with agentic triggers this technology can be applied to drones, robots and mobility assistive devices such as smart glasses. The applications are broad and the potential is vast!
[FINAL REMARKS] Speaking from personal experience, as someone who has been night blind all my life, bringing our idea to life gave me hope: it meant AI was finally good enough to bring a qualitative improvement to the assistive device solutions that already exist, and to improve the lives of those who struggle to travel abroad, or even to their nearby convenience stores at night. That we completed this project in the last 24 hours only hints at how much more could be done with more time! -Nids C.