Smart Agentic Assistive Device

Hardware
Software

Inspiration

My grandmother once spent 45 minutes searching for her glasses. They were on her head the whole time. She laughed, but I saw the frustration.

Then a visually impaired friend told me he avoids new restaurants. He can't read menus. He can't identify food on his plate. He can't eat alone.

About 285 million people worldwide are visually impaired. Most assistive tech costs between ₹25,000 and ₹3,00,000, which is unaffordable for most Indian families.

I built VisionAssist AI to change that. One voice. Zero dependence. Complete independence.

What it does

Say "My Eye" followed by any command. The AI responds by voice within seconds.

AI Agent with Smart Multi-Step Reasoning — The core innovation. Say "help me find my medicine." The agent thinks before acting. It checks saved memory first — if the object is already saved, it speaks the location instantly and stops there. No camera opened, no time wasted. If not in memory, it opens the camera and scans the current scene. Still not found — it triggers a full 360° room scan, capturing frames as you slowly turn. Each step result is stored in session memory so the agent never repeats completed work. This layered reasoning means a 3-second answer when data exists, and a complete room search only when truly needed.
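The layered search described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `find_object`, the two scanner callbacks, and the plain-dict `session` are hypothetical names, and one `session` dict is assumed to cover a single search.

```python
def find_object(name, saved_locations, scan_scene, scan_room, session):
    """Try the cheapest source first and stop at the first hit."""
    steps = [
        ("memory", lambda: saved_locations.get(name)),   # no camera needed
        ("scene", lambda: scan_scene(name)),             # single camera frame
        ("room_360", lambda: scan_room(name)),           # full rotation scan
    ]
    for step_name, run in steps:
        # Reuse a stored result so a completed step is never repeated.
        if step_name not in session:
            session[step_name] = run()
        if session[step_name] is not None:
            return step_name, session[step_name]
    return "not_found", None


# Usage with stub scanners that record whether the camera was used.
camera_calls = []

def fake_scene_scan(name):
    camera_calls.append(("scene", name))
    return None  # pretend the object is not in the current view

def fake_room_scan(name):
    camera_calls.append(("room_360", name))
    return "on the table, to your right"

saved = {"medicine": "top shelf, left of the sink"}
print(find_object("medicine", saved, fake_scene_scan, fake_room_scan, {}))
print(find_object("keys", saved, fake_scene_scan, fake_room_scan, {}))
```

The first call is answered from memory and never touches the stub camera. The second falls through to the room scan.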

Traffic Light Detection — Two checks run in parallel: an AI vision model identifies the light color while HSV color analysis confirms it independently. Both must agree before the app speaks. A red light means cars are stopped, so it announces "Safe to cross now." A green light means cars are moving, so it immediately warns "Stop, do not cross." A single check takes about 2 seconds. Designed for Indian roads, where pedestrian signals are often absent. The feature is currently in beta and is meant as a secondary awareness aid while accuracy on real-world edge cases is refined.
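A rough sketch of the agreement rule, under stated assumptions: the hue, saturation, and brightness thresholds are illustrative guesses rather than the app's tuned values, `classify_hue` and `announce` are hypothetical names, and the standard-library `colorsys` module stands in for OpenCV so the example runs anywhere. Note that OpenCV stores 8-bit hue as 0 to 179, so real thresholds would be halved.

```python
import colorsys


def classify_hue(rgb):
    """Classify one RGB color (0-255 per channel) as 'red', 'green', or None."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    if s < 0.5 or v < 0.5:
        return None  # too dull or too dark to be a lit lamp (assumed cutoff)
    hue_deg = h * 360
    if hue_deg < 20 or hue_deg > 340:
        return "red"
    if 80 <= hue_deg <= 160:
        return "green"
    return None


def announce(ai_color, hsv_color):
    """Speak only when both checks report the same color."""
    if ai_color is None or ai_color != hsv_color:
        return None  # disagreement or no detection: stay silent
    if ai_color == "red":
        return "Safe to cross now."      # vehicle light is red, cars stopped
    if ai_color == "green":
        return "Stop, do not cross."     # vehicle light is green, cars moving
    return None


# The AI model's answer is passed in as a plain string here.
print(announce("red", classify_hue((220, 30, 30))))
print(announce("red", classify_hue((30, 200, 60))))
```

In this sketch a disagreement produces silence, which matches the rule that both checks must agree before anything is spoken.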

Food Identifier — Names every item on the plate, estimates quantity, counts plates and glasses, and states meal type. Live Mode runs continuously and only speaks when something on the plate changes.

Page and Medicine Reader — Auto-detects when a document appears in frame. No button press needed. Reads books word for word. For medicine labels, extracts name, usage, and dosage clearly — critical for blind users managing medications alone.

Object Finder 360° — Captures multiple frames during a slow rotation. Sends all frames together for full-room spatial analysis. Returns the direction and approximate distance of the object. Checks memory before scanning and skips the camera entirely if the location is already saved.

Dual Mode across all features — Every detection feature has Normal Mode for single accurate analysis and Live Mode for continuous awareness. Live Mode only speaks when something changes — prevents voice fatigue and keeps the user informed without overwhelming them.
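The "speak only on change" behavior of Live Mode could look something like the sketch below. `LiveNarrator` and its message format are hypothetical. The assumption is that each analyzed frame yields a list of detected labels.

```python
class LiveNarrator:
    """Remembers the last announced detections and stays quiet if nothing changed."""

    def __init__(self):
        self._last = None

    def update(self, detected):
        # Order and duplicates do not matter, only the set of labels.
        current = frozenset(detected)
        if current == self._last:
            return None  # nothing new: no speech
        self._last = current
        if not current:
            return "Nothing detected."
        return "I see: " + ", ".join(sorted(current))


narrator = LiveNarrator()
print(narrator.update(["rice", "dal"]))   # first frame is announced
print(narrator.update(["dal", "rice"]))   # same items: returns None
print(narrator.update(["dal"]))           # change is announced
```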

These are the core features, the ones developed in the most depth. The app also includes face recognition, currency detection, scene description, stair safety, navigation, and emergency SOS, all added around the needs of blind users. Some of these are at an early stage and are being actively improved. The priority has always been depth over breadth: better one feature that works reliably in the real world than ten that work halfway. Everything is hands-free and works in Hindi, English, and Hinglish.

How we built it

Built entirely from scratch using Python and Flask for the backend and HTML5, CSS3, and Vanilla JavaScript for the frontend. No frontend frameworks, no paid platforms, no shortcuts.

AI Vision — Groq API (Llama 4 Scout 17B) as primary model, Google Gemini 1.5 Flash as automatic fallback for object finding and complex scenes.

Computer Vision — OpenCV for frame processing and HSV color analysis, YOLOv8 for real-time object detection, DeepFace with Facenet for face recognition, Tesseract OCR for text extraction backup.

Agent Brain — Custom multi-step reasoning loop built in Python. Each session holds an AgentSessionMemory object that stores the result of every completed step. The agent reads this before each decision so it never re-runs completed work. Sessions are thread-safe, with lock-protected reads and writes.
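A minimal sketch of what a lock-protected session memory might look like. Only the class name `AgentSessionMemory` comes from the text. The method names, and the choice of a reentrant lock, are assumptions made for illustration.

```python
import threading


class AgentSessionMemory:
    """Stores the result of each completed agent step behind a lock."""

    def __init__(self):
        # RLock lets an action call back into this object without deadlocking.
        self._lock = threading.RLock()
        self._results = {}

    def get(self, step, default=None):
        with self._lock:
            return self._results.get(step, default)

    def record(self, step, result):
        with self._lock:
            self._results[step] = result

    def run_once(self, step, action):
        """Run `action` only if `step` has no stored result yet."""
        # The lock is held while the action runs, so concurrent callers
        # wait for the first result instead of repeating the work.
        with self._lock:
            if step not in self._results:
                self._results[step] = action()
            return self._results[step]


memory = AgentSessionMemory()
print(memory.run_once("scene_scan", lambda: "medicine not in view"))
print(memory.run_once("scene_scan", lambda: "this action is never run"))
```

Holding the lock during the action keeps the example simple at the cost of serializing steps within one session.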

Speech — The Web Speech API handles wake word detection, continuous voice recognition, and text-to-speech in the browser, so speech adds no load or extra round trip to the Flask backend.

API Reliability — 3-key rotation system with exponential backoff. If one Groq key hits rate limits, the next key takes over instantly. No request fails silently.
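One way to sketch key rotation with exponential backoff is shown below. `KeyRotator`, `RateLimited`, `call_with_rotation`, and the one-second base cooldown are all hypothetical. Resetting a key's failure count on success is also an assumption. The `send` callback stands in for the real API call.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's send function when a key hits its rate limit."""


class KeyRotator:
    def __init__(self, keys, base_cooldown=1.0, clock=time.monotonic):
        self._keys = list(keys)
        self._failures = {key: 0 for key in self._keys}
        self._ready_at = {key: 0.0 for key in self._keys}
        self._base = base_cooldown
        self._clock = clock

    def next_key(self):
        """Return the first key that is not cooling down, or None."""
        now = self._clock()
        for key in self._keys:
            if self._ready_at[key] <= now:
                return key
        return None

    def report_failure(self, key):
        # Cooldown doubles with each consecutive failure: base, 2x, 4x, ...
        self._failures[key] += 1
        cooldown = self._base * 2 ** (self._failures[key] - 1)
        self._ready_at[key] = self._clock() + cooldown

    def report_success(self, key):
        self._failures[key] = 0


def call_with_rotation(rotator, send, max_attempts=6):
    """Try keys in turn; raise loudly instead of failing silently."""
    for _ in range(max_attempts):
        key = rotator.next_key()
        if key is None:
            raise RuntimeError("all API keys are cooling down")
        try:
            result = send(key)
        except RateLimited:
            rotator.report_failure(key)
            continue
        rotator.report_success(key)
        return result
    raise RuntimeError("request failed after %d attempts" % max_attempts)
```

Passing the clock in as a parameter makes the cooldown logic easy to check without real waiting.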

Maps — Leaflet.js with OpenStreetMap for outdoor routing. Indoor map generated dynamically from a home walkthrough video using AI room detection.

Alerts — Twilio for SMS, SMTP for email, both triggered simultaneously on SOS with live GPS coordinates.

Hardware — Arduino clone with 2 ultrasonic sensors and a buzzer for real-time obstacle distance detection. Total prototype cost ₹540 / $6.50.

Challenges we ran into

Latency for traffic lights — Each AI call takes 1.5–2.5 seconds, which is too slow for someone waiting to cross a road. Solved with Live Mode: it runs continuously in the background and speaks only when the light color changes.

Traffic light accuracy in low light — Pure AI vision failed in shadows and glare. Added HSV color analysis as parallel verification. Both methods must agree before the app speaks, so a single misread never becomes a "Safe to cross" announcement. Safety always overrides confidence.

Agent state across HTTP requests — Flask is stateless by default. Built a full session store with thread-safe locks so the agent remembers what it already checked across multiple API calls without ever repeating a step.

Indoor map broke on unusual home layouts — Added structured fallback extraction. If AI returns unstructured text, a secondary parser extracts room names and builds a valid navigation graph from keywords.
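A keyword fallback of this kind might be sketched as follows. The room keyword list, the function names, and the rule that two rooms mentioned back to back are connected are all assumptions for illustration, not the app's actual parser.

```python
import re

# Hypothetical keyword list; longer names come first so that
# "living room" is matched before any shorter overlapping keyword.
ROOM_KEYWORDS = (
    "living room", "dining room", "bedroom",
    "kitchen", "bathroom", "balcony", "hall",
)
ROOM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in ROOM_KEYWORDS) + r")\b"
)


def extract_mentions(text):
    """Return every room keyword in the order it appears, repeats included."""
    return ROOM_PATTERN.findall(text.lower())


def build_graph(mentions):
    """Connect rooms that are mentioned one after the other."""
    graph = {room: set() for room in mentions}
    for first, second in zip(mentions, mentions[1:]):
        if first != second:
            graph[first].add(second)
            graph[second].add(first)
    # Sorted lists make the result stable and easy to serialize.
    return {room: sorted(neighbors) for room, neighbors in graph.items()}


text = ("You enter through the hall. The kitchen is on the left, "
        "then a bedroom. Back in the hall there is a bathroom.")
print(build_graph(extract_mentions(text)))
```

Treating consecutive mentions as adjacent rooms is crude, but it always yields a connected graph that navigation can start from.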

API rate limits under heavy use — Built exponential backoff with 3-key rotation. Each key gets a cooldown that grows with its failure count, and the key becomes available again automatically once the cooldown expires.

Voice collision — TTS speaking while mic is listening caused feedback loops. Solved by pausing the microphone exactly when the agent speaks and restarting it automatically after speech ends.

Accomplishments we're proud of

Smart agent reasoning that stops early. If the answer is in memory, no camera is ever opened. Saves time, saves API calls, feels genuinely intelligent.

Dual Mode across every feature. Normal for accuracy, Live for continuous awareness. I have not found another accessibility app that offers this combination.

Dual verification for traffic lights. AI vision plus color analysis must agree. Safety-first architecture.

More than eight features in one voice app, built solo. Most accessibility apps have 3–4 features and a full team.

Zero software cost. Free tiers only. No credit card. Works anywhere on a smartphone browser that supports the Web Speech API.

Real user validation. A tester said: "I don't need someone with me anymore. I can walk alone, read my prescription, know what I'm eating. That's freedom."

What we learned

Safety must override speed — always. Smart reasoning that stops early is better than always doing more. Voice-first design is deeply underrated. Free tiers are enough to build production-grade tools. Never assume AI is correct — always verify with a second method.

What's next

Medication expiry alerts. Offline mode for rural areas. More Indian languages. Crowdsourced hazard alerts from the blind community.

The goal: VisionAssist AI pre-installed on every budget smartphone in India. Zero cost. Blind people lead independent lives.

Production Economics and Global Scalability

1. Low-Cost Hardware Layer

Prototype cost is ₹540 / $6.50: Arduino clone (₹250), 2 ultrasonic sensors (₹140), buzzer and wires (₹150). Plug-and-play circuit, simple to assemble. At mass production scale using custom PCBs and 3D-printed enclosures, hardware cost drops to ₹250–₹300 ($3.00–$3.60) per unit.

2. Zero-Cost Cloud Infrastructure

Each user gets their own Groq API key. The free-tier quota resets automatically and covers standard daily usage indefinitely. Even heavy users doing 500 live vision scans daily cost less than ₹10 ($0.13) per day at Groq's rate of $0.18 per million tokens.
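The three figures quoted above imply a token budget per scan, which can be backed out with simple arithmetic. The sketch below uses only those quoted numbers. The function name is hypothetical, and the actual tokens consumed per scan are not stated in this writeup.

```python
def max_tokens_per_scan(daily_budget_usd, price_per_million_usd, scans_per_day):
    """Largest average token count per scan that stays within the daily budget."""
    tokens_per_day = daily_budget_usd / price_per_million_usd * 1_000_000
    return tokens_per_day / scans_per_day


# Figures quoted in the text: $0.13 per day, $0.18 per million tokens, 500 scans.
budget = max_tokens_per_scan(0.13, 0.18, 500)
print(round(budget))  # about 1,444 tokens per scan
```

So the cost claim holds as long as an average scan, counting both the image input and the spoken reply, stays under roughly 1,400 tokens.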

3. The Market Disruption

Existing alternatives like OrCam cost ₹3,00,000+ ($3,600+) and require dedicated hardware. VisionAssist AI runs entirely on the user's existing smartphone, so the software features need no new hardware. Combined with the ₹540 Arduino obstacle sensor, the complete system retails at ₹1,300 ($15.50). That price puts it within reach of far more of the world's 285 million visually impaired people.

Built With

Python · Flask · Groq API (Llama 4 Scout 17B) · Google Gemini 1.5 Flash · OpenCV · YOLOv8 · DeepFace · Tesseract OCR · Web Speech API · Leaflet.js · OpenStreetMap · Twilio · SMTP · Arduino · HTML5 · CSS3 · Vanilla JavaScript