SafeSight: My Project Story
What It Is
SafeSight is a super-fast, accessible visual alert system designed to be a digital second set of eyes for the visually impaired. It gives immediate safety warnings and translates essential text into audio. I built it as a quick, functional prototype, and the big win is its speed: it delivers life-saving audio feedback in under two seconds, which is the critical window where a warning can actually prevent an accident.
The Moment That Inspired Me
The idea for SafeSight hit me on the NJ Transit train ride to the hackathon. I watched a visually impaired person at the train door. They were pausing, trying to sense the perfect moment to step onto the train. They wanted to make sure no one was rushing out first. Just as they decided to move, someone bolted past them, creating a very close, uncomfortable moment.
That's when I saw the gap. Existing assistive tech is great at describing a scene, but it often fails to deliver immediate, time-critical warnings about the unexpected hazard that appears out of nowhere. My mission shifted right then: I had to close that latency gap and build something that genuinely contributes to safety and independent mobility.
The "How": Building for Speed
My non-negotiable goal was sub-two-second latency. To hit it, I had to be clever with the architecture and deliberately trade a high-quality video feed for fast, predictable responses:
- The Pipeline Trick: I built a disciplined, synchronous Python loop using `opencv`. It doesn't try to capture smooth video; instead, it grabs a single clean webcam frame, compresses it into an optimized JPEG, and sends it out. Stability and control mattered more than a high frame rate.
- The "Ah-Ha" Moment (Taming the AI): This was the game-changer. I realized the AI was wasting time thinking. So I used an aggressive system prompt to force the multimodal LLM to return its analysis only in a strict JSON format containing exactly two things: a prioritized `hazard_warning` and `extracted_text`. I cut out all the verbose reasoning.
- The Instant Audio: The moment the JSON came back, my script immediately checked for the safety warning, parsed it, and generated the audio. Crucially, I skipped the slow step of writing the audio to disk. Instead, I used `gTTS` to pump the audio directly into an in-memory buffer (`io.BytesIO`) that `pygame` plays back instantly. This ensured the life-saving warning was spoken first, before the general scene description or text translation even began.
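The JSON step is where the speed and the safety ordering come from, so here is a minimal sketch of what handling the model's reply could look like. It assumes the reply is a JSON object with exactly the two keys described above, `hazard_warning` and `extracted_text`; the helper names `parse_analysis` and `utterances` and the 200-character cap are hypothetical, since the prototype's actual validation rules aren't spelled out. Anything that fails validation is dropped instead of spoken, and the hazard is always queued before the text. Frame capture and the `gTTS`/`pygame` playback are left out so the sketch runs without a camera or speakers.

```python
import json
from typing import List, Optional, TypedDict


class Analysis(TypedDict):
    hazard_warning: Optional[str]
    extracted_text: Optional[str]


# Hypothetical schema check: the reply must contain exactly these two keys.
REQUIRED_KEYS = {"hazard_warning", "extracted_text"}
MAX_CHARS = 200  # hypothetical cap so a runaway reply is never read aloud


def _clean(value: object) -> Optional[str]:
    """Return a stripped, non-empty, reasonably short string, else None."""
    if not isinstance(value, str):
        return None
    value = value.strip()
    if not value or len(value) > MAX_CHARS:
        return None
    return value


def parse_analysis(raw: str) -> Optional[Analysis]:
    """Parse the model's reply, rejecting anything outside the strict schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed output is dropped, never spoken
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    return Analysis(
        hazard_warning=_clean(data["hazard_warning"]),
        extracted_text=_clean(data["extracted_text"]),
    )


def utterances(analysis: Analysis) -> List[str]:
    """Build the speech queue with the hazard warning always first."""
    queue: List[str] = []
    if analysis["hazard_warning"]:
        queue.append("Warning: " + analysis["hazard_warning"])
    if analysis["extracted_text"]:
        queue.append("Text reads: " + analysis["extracted_text"])
    return queue


if __name__ == "__main__":
    reply = '{"hazard_warning": "person exiting on your left", "extracted_text": "Track 3"}'
    parsed = parse_analysis(reply)
    if parsed is not None:
        for line in utterances(parsed):
            print(line)  # each line would be handed to the TTS step
```

Rejecting extra keys is deliberately strict: if the model drifts away from the prompt's format, staying silent for one frame seems safer than reading out something unexpected.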
The Real-World Struggles
Building this rapid prototype taught me some hard lessons:
- The Speed Limit: Waiting for the massive cloud AI model to respond was the biggest bottleneck. It meant my system could only process about 1 to 2 frames per second (FPS). That works well for static hazards, but it's not yet fast enough to track a quickly moving person or car.
- The Broken Record: Because every frame is analyzed as a brand-new situation, the system lacks memory. It repeatedly announces the same stationary object ("There is a wall. There is a wall. There is a wall."). The immediate fix is adding a simple memory cache to stop the annoying, redundant warnings.
- Trusting the AI: In a safety application, hallucinations (false warnings or missed warnings) are a serious problem. I reduced the risk as much as I could with strict prompts and output validation, but with large generative models the risk of an unreliable safety alert never fully goes away.
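The memory cache proposed for the broken-record problem could be as simple as remembering when each warning was last spoken and suppressing repeats inside a cooldown window. This is a minimal sketch, assuming warnings can be compared as normalized strings; the `WarningDeduper` class and its 10-second default are hypothetical choices, not part of the prototype.

```python
import time
from typing import Callable, Dict


class WarningDeduper:
    """Suppress a warning if the same one was spoken within `cooldown_s` seconds."""

    def __init__(self, cooldown_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests don't have to sleep
        self._last_spoken: Dict[str, float] = {}

    @staticmethod
    def _key(warning: str) -> str:
        # Normalize case and whitespace so spacing or capitalization
        # differences don't defeat the cache.
        return " ".join(warning.lower().split())

    def should_speak(self, warning: str) -> bool:
        now = self.clock()
        key = self._key(warning)
        last = self._last_spoken.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False  # same warning, still inside its cooldown window
        self._last_spoken[key] = now  # record only warnings actually spoken
        return True
```

Because a suppressed repeat does not refresh its timestamp, a wall that stays in view is re-announced once per cooldown window instead of being silenced forever, which seems the safer failure mode here. Plain string matching won't catch the model describing the same object in different words; that would need something smarter, such as comparing detected object labels.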
My Biggest Takeaway
I learned that when you're designing something for real-time safety, the intelligence of your architecture matters more than raw computing power. By designing the data exchange to prioritize time-critical information, I essentially told the massive AI what to look for first, which is how I got a functional, low-latency system working without any specialized, expensive hardware.
My next step is a clear one: I'm transitioning to a Hybrid Edge Architecture. I'll put a small, fast local model (like YOLO) on the device to get near-instant, sub-500 ms detection of simple things (stairs, doors). I'll only use the expensive cloud AI for the complex tasks, like a detailed scene description or high-fidelity text reading. That's the path that will turn SafeSight from a successful prototype into a reliable, high-speed product.
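The routing decision at the heart of a hybrid design can be sketched in a few lines: run the local detector on every frame and speak its confident hits first, and pay for the cloud round trip only when a richer answer is actually requested. Everything here is hypothetical: the `Detection` type, the class list, the 0.6 confidence threshold, and the `local_detect`/`cloud_describe` callables, which stand in for a YOLO-style model and the existing LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical values: the classes the on-device model is trusted with,
# and the confidence it must reach before a warning is spoken.
SIMPLE_CLASSES = {"stairs", "door"}
MIN_CONFIDENCE = 0.6


@dataclass
class Detection:
    label: str
    confidence: float


def route(
    frame: bytes,
    needs_detail: bool,
    local_detect: Callable[[bytes], List[Detection]],
    cloud_describe: Callable[[bytes], str],
) -> List[str]:
    """Return messages to speak, local hazard warnings first."""
    # Fast path: runs on every frame, entirely on-device.
    hits = [
        d for d in local_detect(frame)
        if d.label in SIMPLE_CLASSES and d.confidence >= MIN_CONFIDENCE
    ]
    hits.sort(key=lambda d: d.confidence, reverse=True)
    messages = [f"Warning: {d.label} ahead" for d in hits]
    # Slow path: only complex requests pay for the cloud round trip.
    if needs_detail:
        messages.append(cloud_describe(frame))
    return messages


if __name__ == "__main__":
    def fake_local(frame: bytes) -> List[Detection]:
        return [Detection("door", 0.9), Detection("chair", 0.95)]

    def fake_cloud(frame: bytes) -> str:
        return "A train platform with an open door ahead."

    print(route(b"jpeg-bytes", needs_detail=False,
                local_detect=fake_local, cloud_describe=fake_cloud))
```

Keeping the local warnings ahead of the cloud result preserves the prototype's hazard-first ordering, while the slow model is only called when the user needs a description or text read aloud.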

