Inspiration

A study from UC Santa Cruz found that 88% of visually impaired people experience accidents in their lifetime, and 13% run into head-level obstacles at least once a month. It’s clear that navigating the world without sight proves challenging and dangerous. By utilizing computer vision and emerging augmented reality technologies, bino.dev aims to lend eyes to the visually impaired, effectively serving as a pair of digital binoculars.

What it does

Bino.dev combines virtual reality and machine learning to help visually impaired individuals navigate their surroundings more easily. Using the Meta Quest 3S and a YOLOv8n inference model, our system detects surrounding objects in real time and provides audio feedback through headphones. When the user says the keyword "detect," Bino relays information about the objects around them through their headphones, offering a simple, hands-free way to understand their environment.

How we built it

Our system uses real-time object detection to enhance awareness and independence. We developed a Flask backend to handle AI detections, while the frontend, built with HTML, JavaScript, and CSS, displays the camera feed. We also integrated Open Broadcaster Software (OBS) to stream the headset's casting feed into our processing pipeline.
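To make the backend concrete, here is a minimal sketch of what a Flask detection endpoint could look like. The route name, the stubbed detection results, and the helper functions are our illustrative assumptions, not the project's actual code; the real system would run YOLOv8n on the latest OBS frame instead of returning canned results.

```python
def run_yolo_on_latest_frame():
    # Placeholder: the real system would run a YOLOv8n model on the
    # current frame from the OBS stream. Returns (label, confidence) pairs.
    return [("person", 0.91), ("chair", 0.34), ("door", 0.77)]

def filter_detections(raw, min_conf=0.5):
    """Keep only confident detections and shape them for the frontend."""
    return [
        {"label": label, "confidence": round(conf, 2)}
        for label, conf in raw
        if conf >= min_conf
    ]

def create_app():
    # App-factory pattern keeps the web layer separate from the pure logic.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/detect")
    def detect():
        return jsonify(filter_detections(run_yolo_on_latest_frame()))

    return app
```

The confidence threshold filters out low-quality boxes before anything is read aloud, so the user only hears about objects the model is reasonably sure of.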

To enhance interaction, we incorporated the Google Web Speech API, which performs speech recognition in the browser, enabling users to ask for details about their surroundings. When prompted, the system relays this information directly to the user's headphones, creating a hands-free and intuitive experience.
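On the backend side, handling the recognized transcript comes down to checking for the wake word. A minimal sketch (the function name and punctuation handling are our assumptions; the browser would send the transcript it got from the Web Speech API):

```python
import string

def is_detect_command(transcript: str) -> bool:
    """Return True when the recognized speech contains the wake word 'detect'."""
    # Lowercase and strip punctuation so "Detect!" still matches,
    # while "detector" does not.
    cleaned = transcript.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return "detect" in cleaned.split()
```

Matching on whole words rather than substrings avoids false triggers from words that merely contain "detect".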

Challenges we ran into

We faced a major roadblock accessing the headset's live camera feed. Meta only allows the camera feed to be cast directly to the Meta Casting site, so we were unable to extract the video data for processing. The Passthrough API is an option, but it offers very limited access to the video feed and explicitly prevents image analysis or any kind of inference. Its lag also proved considerable, running approximately 30 frames behind real time. We eventually found a workaround: using Open Broadcaster Software (OBS) to capture the Casting feed and turn it into a real-time video stream that we could interact with and process.
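Once OBS exposes the captured feed (for example through its virtual camera), OpenCV can read it like any webcam. A hedged sketch of that capture loop; the device index and the frame-skip interval are assumptions that vary per machine:

```python
def nth_frame(frame_index: int, every: int = 5) -> bool:
    """Throttle inference: only every Nth frame is worth running YOLO on."""
    return frame_index % every == 0

if __name__ == "__main__":
    import cv2  # OpenCV sees OBS's virtual camera as an ordinary capture device

    cap = cv2.VideoCapture(0)  # device index 0 is an assumption
    frame_index = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        if nth_frame(frame_index):
            pass  # hand `frame` to the YOLOv8n detector here
        frame_index += 1
    cap.release()
```

Skipping frames keeps inference from falling behind the live stream, which matters given the latency issues we hit with the Passthrough API.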

Another problem was interacting with the Meta Quest 3S sensors through Unity. We initially tried an open-source tool that let the host device receive data when the Quest user pressed a controller button. While this tool worked well with the Meta Quest 2 and 3, it had compatibility issues with the sensors on our Meta Quest 3S. After countless hours of troubleshooting, we made the hard decision to pivot to an external headphone/earbud setup: instead of relying on controller input to signal a request, we used the Google Web Speech API to detect a spoken request from the user.

Accomplishments that we're proud of

We successfully projected the Quest Environment Passthrough to OBS, enabling video processing through our web app, a crucial step in integrating real-world visuals with our system. Using YOLOv8 for computer vision object detection, we developed Python scripts to capture screen grabs of the video feed and generate structured lists of detected objects. Additionally, we implemented a microphone-based input system to capture user speech, which was then processed with an LLM to analyze the detections and dynamically generate scene descriptions based on the user's input. These accomplishments showcase our ability to merge cutting-edge technologies in computer vision, real-time processing, and AI-driven interaction.
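The final step, turning a structured list of detections into something speakable, can be sketched as a small pure function. The wording and pluralization rule below are our illustrative choices, not the project's exact output format:

```python
from collections import Counter

def describe_scene(labels):
    """Turn a list of detected object labels into a short spoken sentence."""
    if not labels:
        return "No objects detected."
    counts = Counter(labels)  # preserves first-seen order (Python 3.7+)
    parts = [
        f"{n} {label}{'s' if n > 1 else ''}"
        for label, n in counts.items()
    ]
    return "I can see " + ", ".join(parts) + "."
```

A string like this is what would be handed to text-to-speech and relayed to the user's headphones.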

What we learned

Throughout our time working on Bino.dev, we faced numerous challenges that pushed us to think critically and collaborate effectively. Along the way, we gained valuable insights into both the engineering process and teamwork. One of the biggest takeaways was learning how to leverage our tools for debugging: whether it was Stack Overflow, generative AI, software documentation, or other resources, we became adept at finding solutions to niche problems. Since this was our first time using the Meta Developer Platform, the learning curve proved steep, as the documentation was sparse. We also recognized the importance of clear communication within our team, ensuring tasks were well delegated and came together seamlessly to form a cohesive product. Lastly, we learned to think on our feet, constantly adapting and troubleshooting when faced with seemingly unsolvable issues. Through perseverance and teamwork, we built a product we're truly proud of.

What's next for bino.dev

Looking ahead, we're committed to expanding Bino.dev to enhance both its capabilities and user experience. One of our key priorities is deeper integration of an LLM agent that can generate richer, more descriptive audio cues. We also aim to improve detection accuracy, enabling Bino to recognize multiple objects simultaneously while increasing processing speed. Additionally, we plan to explore deeper integration with the Meta Quest, with the goal of running computation and audio processing directly on the headset for a more seamless experience. Adding haptic feedback to the controllers would further improve the experience, bringing Bino closer to truly replacing the walking stick. Through these advancements, we hope to make Bino even more accessible and reliable for visually impaired users, empowering them to navigate the world with greater confidence.
