Inspiration

According to the World Health Organization, at least 2.2 billion people globally have a vision impairment, approximately 43 million of whom are completely blind and nearly 300 million of whom live with moderate-to-severe visual impairment. A significant share of this population lives in developing countries with limited access to expensive accessibility tools, often leaving them confined to their homes. For decades, the standard for navigation has been the white cane. However, the cane has severe limitations: it fails to detect obstacles above waist height, occupies a user's hands, depends heavily on the user's technique to avoid error, and can frequently become a tripping hazard for surrounding pedestrians. Therefore, we aimed to build a hands-free, affordable, and intelligent alternative that not only keeps the user safe, but also helps them fully understand their surroundings.

What it does

HawkEye is an AI navigation wearable that acts as a real-time, interactive guide for the visually impaired. In its default mode, it continuously scans the user's environment and provides concise, low-latency audio instructions, alerting them to obstacles and guiding them to move left, move right, or proceed forward safely. Beyond navigation, HawkEye features an interactive mode that can be activated at the press of a button. In this mode, the user can speak directly to the system and ask specific questions about their environment, such as asking it to read any text that appears (e.g., signs) or to find an empty place to sit. HawkEye analyzes the visual data and provides a spoken, context-aware answer, directing the user accordingly.

Prior solutions, such as ultrasonic-sensor devices or CNN-based systems, rely either on a single measurement (distance) or on narrow models trained to recognize only certain objects. Using a multimodal LLM combines broad functionality with strong image understanding, along with the ability to gather information from the surroundings and reason about it.

How we built it

Hardware: We designed a custom, 3D-printed PLA headset with a low infill of 15% for flexibility. It houses an ESP32-CAM, a battery pack, and a tactile push-button, secured with an adjustable Velcro strap to conform to the user's specific head size.

Firmware: To minimize latency, the ESP32 streams binary JPEG frames via WebSockets. We implemented dynamic resolution: low-resolution streaming for constant obstacle scanning, jumping to high resolution only when the user presses the button to read text. To prevent heap exhaustion, we freed each frame buffer immediately after it was processed, before capturing the next one. We also captured images at one-second intervals, which gives the processing pipeline ample time to handle each request.
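The firmware's frame lifecycle can be sketched as follows. This is an illustrative model in Python rather than the actual C++ firmware: the resolution names, the `Camera`/`Socket` interfaces, and the function names are all assumptions, but the logic (low-res by default, high-res on button press, free each frame before the next capture, one frame per second) follows the description above.

```python
import time

LOW_RES, HIGH_RES = "QVGA", "UXGA"  # assumed ESP32-CAM frame-size names

def next_resolution(button_pressed: bool) -> str:
    """Low-bandwidth scanning by default; high resolution only for text reading."""
    return HIGH_RES if button_pressed else LOW_RES

def frame_loop(camera, socket, should_stop):
    """Capture, stream, and free one frame per second (hypothetical interfaces)."""
    while not should_stop():
        frame = camera.capture(next_resolution(camera.button_pressed()))
        socket.send_binary(frame)  # stream raw JPEG bytes over the WebSocket
        del frame                  # free the buffer before the next capture
        time.sleep(1)              # ~1 frame/second is enough for guidance
```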

Backend: Our Python backend utilizes asynchronous task handling to ensure the server never blocks during AI processing, maintaining a stable connection with the ESP32.
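The non-blocking pattern can be sketched with plain `asyncio`. This is a simplified stand-in, not our actual server code: the queue-based handoff, the bounded queue that drops the oldest frame, and the `asyncio.sleep` standing in for the model call are all illustrative, but they show how slow AI processing is kept off the receive path so the ESP32 connection never stalls.

```python
import asyncio

async def ai_worker(queue: asyncio.Queue, results: list):
    # Consume frames and run the (slow) AI step without blocking the socket.
    while True:
        frame = await queue.get()
        if frame is None:          # sentinel: shut down
            break
        await asyncio.sleep(0.01)  # stand-in for the vision-model call
        results.append(f"processed:{frame}")

async def receive_loop(frames, queue: asyncio.Queue):
    # Simulates the WebSocket receive path: never waits on AI processing.
    for frame in frames:
        if queue.full():
            queue.get_nowait()     # drop the oldest pending frame instead of blocking
        queue.put_nowait(frame)
    await queue.put(None)          # signal the worker to finish

async def main(frames):
    queue, results = asyncio.Queue(maxsize=2), []
    await asyncio.gather(receive_loop(frames, queue), ai_worker(queue, results))
    return results
```

Bounding the queue and dropping stale frames matters here: for navigation, the newest frame is always the most relevant one.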

AI & Audio Pipeline: We feed frames to Gemma 4, using prompt engineering to constrain its output to concise navigation commands. That output is then converted to audio with the ElevenLabs text-to-speech (TTS) API. For interactive mode, we capture the user's voice with the ElevenLabs speech-to-text (STT) API, send the context to Gemma, and convert the model's response into instant spoken answers. To enable communication between the iPhone client and the locally hosted FastAPI server, we use ngrok to expose the backend via a secure public URL. This allows real-time WebSocket streaming, voice queries, and audio playback to work seamlessly across devices.

Challenges we ran into

Headset Conformity: Designing a rigid 3D-printed structure that could comfortably and securely conform to different-sized heads using only a basic velcro strap was difficult, especially with the number of components of various sizes and weights we had to integrate onto it.

Prompt Engineering: Iterating the system prompt for Gemma 4 countless times, so the model stayed concise, directional, and focused purely on immediate safety hazards rather than its natural conversational tendencies, took a great deal of optimization.

Software Pipeline Decisions: We faced a major architectural decision in choosing the optimal software pipeline; specifically, how to efficiently route images from the constrained ESP32 microcontroller through the Gemma 4 API and ElevenLabs text-to-speech while minimizing critical network latency.

Power Management: Optimizing the power consumption of the wearable device, particularly with the ESP32-CAM's high current draw during Wi-Fi transmission and Edge AI functionality, to ensure a long enough battery life.

Packaging: Integrating the camera, large battery pack, button, and microcontroller into a sleek yet durable package that was not overly heavy for the user.

Accomplishments that we're proud of

We are incredibly proud of successfully bridging the gap between affordable embedded hardware (the ESP32) and state-of-the-art vision-language models. Building a pipeline that quickly routes images from a microcontroller to Gemma 4, and translates that into real-time audio with the ElevenLabs API, was a massive achievement. We're especially proud of the user interaction feature, which allows the users to actively explore their environment.

What we learned

Our main takeaways include:

- How to effectively interface microcontrollers and vision-language models with voice AI.
- Prompt engineering techniques for vision-language models, specifically spatial awareness and constraint-setting for safety-critical applications.
- Rapid CAD design and 3D printing to create ergonomic, well-toleranced wearable hardware.

What's next for HawkEye

Form Factor: Miniaturizing the hardware to fit onto a standard pair of glasses, reducing overall bulk for a more discreet, comfortable device suited to daily wear.

Haptic Integration: Adding vibration motors to the left and right sides of the wearable to provide intuitive spatial cues. This would allow users to perceive direction (e.g., turn left/right) without relying solely on audio, reducing cognitive load and improving safety in noisy environments.

GPS Integration: By combining global positioning data with HawkEye's real-time vision system, the device could provide layered guidance, using GPS for long-range navigation and computer vision for precise, near-field obstacle avoidance. This would enable turn-by-turn outdoor guidance, destination-based routing, and contextual awareness (e.g., intersections, sidewalks, entrances), allowing users to safely navigate complex environments from point A to point B.
