Inspiration

According to the IAPB (International Agency for the Prevention of Blindness), 1.1 billion people globally were living with vision loss in 2020. Among those 1.1 billion, 350 million people had at least moderate-to-severe vision impairment. That is more than 12% of the world! In addition, according to this research study (https://pmc.ncbi.nlm.nih.gov/articles/PMC7721280/), blindness is one of the most feared health problems, feared by a higher proportion of respondents than cancer or paralysis. Two of our teammates have only minor vision issues like astigmatism, yet even that made us recognize how life-altering a full visual impairment can be. Inspired by this realization, we created LUNA, a device that scans and narrates the world in real time for people who are blind.

What it does

By combining advanced multimodal Optical Character Recognition (OCR) machine-learning models, transformer-based text-to-speech technology, and minimal, intuitive hardware, LUNA provides enriched sensory context. This instant source of information empowers users to confidently understand their surroundings. LUNA focuses on providing an accurate, rich description of the environment rather than attempting navigation or safety advice. While we want to fully capitalize on the immense potential of machine-learning models, we also acknowledge their limitations. By not relying on the models for spatial reasoning or critical navigation, we reduce the potential harm caused by their inevitable errors. With that said, the model performs excellently at scene description and object recognition, which we take advantage of to the fullest extent.

How we built it

We used the XIAO ESP32S3 Sense microcontroller development board, housed in a casing that clips onto eyewear. This is the main LUNA device. The microcontroller is connected to a camera and a speaker so it can capture images and provide auditory narration. We used PlatformIO with C++ to program the microcontroller and send HTTP POST and GET requests between it and our backend Express.js server. These requests transfer images from the microcontroller to the backend and synthesized audio back to the LUNA device. We use Anthropic's Claude 3.5 Sonnet, given its superior OCR performance, to generate text descriptions of our images. The text response is then piped to Google Cloud Text-to-Speech, and the resulting audio is played back locally on LUNA.
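As a sketch of the backend side of this pipeline: the Express server essentially builds two request payloads, one for Claude 3.5 Sonnet carrying the base64-encoded JPEG from the camera, and one for Google Cloud Text-to-Speech carrying Claude's text response. The payload shapes below follow the public Anthropic Messages API and the Google Cloud Text-to-Speech v1 REST API; the function names, prompt text, and voice choice are our own illustrative assumptions, not taken from the LUNA code.

```javascript
// Hypothetical payload builders for the image -> description -> speech pipeline.
// Shapes follow the public Anthropic Messages API and Google Cloud TTS v1 API;
// names, prompt, and voice are assumptions for illustration.

// Request body for POST https://api.anthropic.com/v1/messages
function buildClaudePayload(base64Jpeg) {
  return {
    model: "claude-3-5-sonnet-20240620",
    max_tokens: 300,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64Jpeg },
          },
          { type: "text", text: "Briefly describe this scene for a blind listener." },
        ],
      },
    ],
  };
}

// Request body for POST https://texttospeech.googleapis.com/v1/text:synthesize
function buildTtsPayload(description) {
  return {
    input: { text: description },
    voice: { languageCode: "en-US", name: "en-US-Neural2-C" }, // voice is an assumption
    audioConfig: { audioEncoding: "MP3" },
  };
}
```

On the server, these bodies would be sent with the appropriate API keys in the request headers, and the resulting audio bytes returned to the microcontroller for playback.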

Challenges we ran into

Our microcontroller uses Arduino libraries with Arduino C++. A recurring problem throughout the weekend was finding libraries that fit the functions we needed; those available to us were often out of date or poorly documented, which made debugging difficult and development slow. We switched from the Arduino IDE to PlatformIO because it saved time during code compilation and upload to the microcontroller. We still had to contend with the same library issues, however, such as finding a reliable way to make HTTPS rather than plain HTTP requests. At one point we had our backend hosted on a dedicated server, but long API call times made demos slow, so we decided to scrap the hosted server.

Accomplishments that we're proud of

Integrating the ESP32S3 with complex peripherals such as a high-definition camera, an I2S DAC, and networking proved much more challenging than expected. All the challenges we outlined, mixed with the seemingly random error messages we would constantly receive, made the process grueling, which made the final result all the more satisfying. We are also really proud of the actual hardware of LUNA: a highly iterative prototyping process involving CAD and a 3D printer resulted in a compact, polished package that can be comfortably worn by the user. Finally, we are proud of the seamless integration of the two separate machine-learning models (the vision model and the text-to-speech model) in our final project, which work so well together.

What we learned

Overall, we were impressed by the current state of OCR models and their capabilities. They were excellent at object recognition and showed promise in spatial reasoning under limited information.

What's next for LUNA

Although the current LUNA is sleek and streamlined, its footprint could be further reduced by replacing the loudspeaker with a smaller model. It would also be a great feature to add Bluetooth support so that LUNA can connect to earbuds rather than relying exclusively on the loudspeaker. Finally, we would like to add voice activation for certain commands to make the user experience more fluid.
