Inspiration
Coming from large cities where walking is common, we noticed how much subconscious effort goes into navigating a busy sidewalk: stepping around a sign, noting the curb, watching for cyclists. We realized that for our friends, family, and the millions of individuals who are visually impaired, this "subconscious" effort becomes a demanding, high-stakes task.
In an age where our phones can identify songs, translate languages, and guide our cars with pinpoint accuracy, we asked ourselves: "Why can't we leverage this powerful technology to make something as fundamental as walking down the street safer and more accessible for everyone?" We were inspired to build a tool that could act as a second pair of eyes, using the device most people already carry in their pocket to foster a greater sense of freedom and independence.
What it does
The Guiding Eye is a mobile application that transforms a smartphone into a real-time, AI-powered navigation assistant for the visually impaired. The user simply holds their phone up, and the app uses the camera to analyze the path ahead.
Through clear, verbal commands delivered via the phone's speaker or an earpiece, the app provides critical, real-time alerts. It specifically warns the user when they are:

- Approaching an obstacle like a pole, a person, or street furniture.
- Veering off the sidewalk and into a potentially unsafe area like a lawn or the street.
- Walking on a road instead of a designated walkway.
- Nearing a crosswalk, preparing them for an intersection.

Our goal is to provide an extra layer of awareness that complements traditional tools like a white cane, giving users the confidence to navigate their world more freely.
How we built it
We built The Guiding Eye in a fast-paced, integrated sprint over the course of the hackathon. Our tech stack was chosen for speed, performance, and its ability to run on mobile devices.
Architecture: An iPhone app written in SwiftUI captures camera frames, streams them over a WebSocket, and speaks any text it receives. A lightweight Python proxy listens on port 8000, receives each JPEG frame, forwards it to Gemini 2-Flash on Vertex AI, and sends the model’s one-sentence sidewalk guidance back to the phone. Gemini 2-Flash is prompted to focus on sidewalk adherence, obstacles, intersections, and crosswalks, and to answer in one calm sentence.
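To make that prompting concrete, here is a plausible version of the instruction as a Python constant; the exact wording below is our illustration, not the production prompt.

```python
# Illustrative guidance prompt (wording is an assumption, not the exact one used).
# It asks the model for a single calm sentence covering sidewalk adherence,
# obstacles, intersections, and crosswalks.
GUIDANCE_PROMPT = (
    "You are a walking assistant for a visually impaired pedestrian. "
    "Look at this camera frame and answer in one calm sentence: "
    "is the sidewalk clear, is the user drifting off the walkway or onto a road, "
    "and is there an obstacle, intersection, or crosswalk just ahead?"
)
```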
Backend work: We used the google-genai Python SDK (version 1.21) together with the websockets library. The server accepts JSON messages that contain a base-64 JPEG, converts it into a Gemini Part, calls generate_content, and relays the returned text. Ping timeouts were disabled so a long camera pause would not close the connection.
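A minimal sketch of that proxy follows, assuming a JSON payload of the form {"image": "<base-64 JPEG>"}, placeholder project/location values, and the "gemini-2.0-flash" model id on Vertex AI (the field name and model id are our assumptions):

```python
import asyncio, base64, json

import websockets
from google import genai
from google.genai import types

# Assumed placeholders: GCP project/location and the Vertex AI model id.
client = genai.Client(vertexai=True, project="your-gcp-project", location="us-central1")

PROMPT = ("In one calm sentence, tell a visually impaired pedestrian about sidewalk "
          "adherence, obstacles, intersections, and crosswalks in this frame.")

async def handle(ws):
    async for message in ws:
        payload = json.loads(message)
        jpeg_bytes = base64.b64decode(payload["image"])
        # generate_content is blocking, so run it off the event loop.
        response = await asyncio.to_thread(
            client.models.generate_content,
            model="gemini-2.0-flash",
            contents=[types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"), PROMPT],
        )
        await ws.send(response.text)

async def main():
    # ping_interval=None disables keep-alive pings so long camera pauses
    # don't drop the connection (the fix described under "Build hurdles").
    async with websockets.serve(handle, "0.0.0.0", 8000, ping_interval=None):
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```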
iOS work: We created a fresh iOS-only Xcode project named SidewalkGuide, added Starscream with Swift Package Manager for WebSockets, and inserted the required camera and microphone privacy descriptions in the Info.plist. A single service class starts an audio session, opens the WebSocket, starts an AVCaptureSession, and for every frame encodes a JPEG at moderate quality, wraps it in JSON, and writes it to the socket. When a text message arrives, the class creates an AVSpeechUtterance and speaks it at a comfortable rate.
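For exercising the proxy without the phone, a small test client can mimic what the app sends; the "image" field name mirrors the assumption in the proxy sketch above, and the server address is a placeholder.

```python
# Hypothetical test client that plays the role of the phone: it sends one
# base-64 JPEG frame over the WebSocket and prints the guidance text that
# the app would speak.
import asyncio, base64, json

import websockets

async def send_one_frame(url="ws://192.168.1.20:8000", frame_path="frame.jpg"):
    async with websockets.connect(url, ping_interval=None) as ws:
        with open(frame_path, "rb") as f:
            jpeg = f.read()
        await ws.send(json.dumps({"image": base64.b64encode(jpeg).decode("ascii")}))
        guidance = await ws.recv()
        print("Guidance:", guidance)

if __name__ == "__main__":
    asyncio.run(send_one_frame())
```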
Build hurdles: Xcode at first gave us a Mac Catalyst target, which broke UIKit imports; we deleted it and kept only the iOS target. Upgrading to google-genai 1.x required switching to keyword arguments such as Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg") and removing the old role-parts wrapper. We hit “keep-alive ping timeout” disconnects, which we fixed by setting ping_interval=None when starting the websockets server. Finally, we verified audio by ensuring the iPhone was not in silent mode and enabling the spokenAudio category in AVAudioSession.
What works now: Phone and laptop on the same Wi-Fi connect instantly. Gemini 2-Flash replies in under 600 ms, so the user hears guidance almost live. AVSpeechSynthesizer queues sentences so they never overlap. With the backend moved to Cloud Run, the only change needed is to point the phone to a wss URL.

Outcome: When the user raises the phone, the app says things like:

- “Sidewalk is clear, continue straight.”
- “Curb ahead in three metres, slow down.”
- “Crosswalk detected, wait for the walk signal.”
Challenges we ran into
Environmental Variability: Our initial model worked perfectly in well-lit, clear conditions. However, testing outside revealed challenges with strong shadows, which the model sometimes confused for curbs or obstacles. We had to adjust our data augmentation and model thresholds to become more resilient to varied lighting.

Creating Useful, Not Annoying, Alerts: Early on, our system was too "chatty." It would announce every single thing it saw, overwhelming the user. The biggest challenge was designing a logic system that only provides necessary and timely warnings. We implemented a "cooldown" period for alerts and prioritized warnings based on immediate danger (e.g., "Road ahead!" is more important than "Bench on your right"); a simplified sketch of this gating logic appears below.

Performance vs. Accuracy: Running a sophisticated segmentation model in real-time on a phone was a major hurdle. We spent a significant amount of time optimizing the model—quantizing its weights and resizing the input stream—to strike the right balance between processing speed and detection accuracy, ensuring the alerts were timely enough to be useful.
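The cooldown and prioritization idea can be captured in a few lines; the categories, ranks, and timings below are illustrative assumptions, not our exact production values.

```python
import time

# Illustrative alert gate: lower rank = more urgent, and each urgency level has
# its own cooldown so urgent warnings repeat sooner than informational chatter.
PRIORITY = {"road": 0, "obstacle": 1, "veering": 1, "crosswalk": 2, "info": 3}
COOLDOWN_SECONDS = {0: 2.0, 1: 4.0, 2: 6.0, 3: 10.0}

class AlertGate:
    def __init__(self):
        self.last_spoken = {}  # category -> monotonic timestamp of last alert

    def should_speak(self, category: str) -> bool:
        rank = PRIORITY.get(category, 3)
        now = time.monotonic()
        if now - self.last_spoken.get(category, float("-inf")) < COOLDOWN_SECONDS[rank]:
            return False  # still cooling down; stay quiet
        self.last_spoken[category] = now
        return True

# Example: a "road" warning gets through immediately, but an identical warning
# a moment later is suppressed by its 2-second cooldown.
gate = AlertGate()
assert gate.should_speak("road") is True
assert gate.should_speak("road") is False
```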
Accomplishments that we're proud of

We are incredibly proud of creating a functional, end-to-end prototype within the tight timeframe of this hackathon. Seeing the entire pipeline—from the live camera feed to a correctly spoken audio alert—work for the first time was a huge moment for us.

Specifically, we are proud of the accuracy of our sidewalk vs. road detection. This is a non-trivial computer vision task and is the most critical safety feature of our app. We are also proud of the intuitive feel of the final demo; the alerts are timely and clear, which validates our focus on user experience. Most of all, we're proud to have built a project with a clear social purpose that could one day make a real difference in people's lives.
What we learned
This hackathon was a tremendous learning experience. First and foremost, we gained a much deeper empathy for the daily navigational challenges faced by the visually impaired community. From a technical standpoint, we learned how to optimize and deploy complex AI models on resource-constrained mobile devices, a skill set that is invaluable. We also learned the importance of rapid iteration; our initial idea for the alert system was completely overhauled after just a few minutes of self-testing, proving that building, testing, and refining in quick cycles is key to a successful project.
What's next for The Guiding Eye
This hackathon is just the beginning. We have a clear and ambitious roadmap for The Guiding Eye:

Depth Perception: Our next major step is to integrate a depth-sensing model like MiDaS v2.1. This will allow us to not only detect obstacles but to tell the user how far away they are ("Obstacle 10 feet ahead"); a rough sketch of how such a model could slot in appears below.

GPS & Contextual Navigation: We plan to fuse our visual detection system with GPS data to provide a richer, more descriptive navigation experience, guiding users turn-by-turn to their destination.

Indoor Navigation: We will expand our model's capabilities to recognize indoor environments, helping users navigate complex spaces like shopping malls, airports, and university campuses.

Community-Driven Development: Our most important next step is to connect with organizations for the visually impaired to get our prototype into the hands of users for real-world feedback. Their insights will be crucial as we refine the app and work towards a public release on the App Store.
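As a starting point for the depth work, MiDaS can be loaded from PyTorch Hub; the sketch below runs relative depth estimation on a single frame. Note that MiDaS produces relative (inverse) depth, so turning its output into "10 feet ahead" would still require calibration, and the model variant and region-of-interest choice here are our assumptions.

```python
import cv2
import torch

# Load the MiDaS small model (a v2.1-era variant) and its matching transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

def relative_depth(frame_bgr):
    """Return a per-pixel relative inverse-depth map for one camera frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        prediction = midas(transform(rgb))
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1), size=rgb.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze()
    return prediction.numpy()

# Example: larger values mean "closer" in MiDaS's inverse-depth output, so a
# spike in the bottom quarter of the map suggests an obstacle directly ahead.
depth = relative_depth(cv2.imread("frame.jpg"))
print("bottom-of-frame relative depth:", depth[-depth.shape[0] // 4 :, :].mean())
```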
Built With
- claude
- fastapi
- gemini
- google-cloud
- google-genai
- java
- javascript
- python
- starscream
- swift