Inspiration

In Mexico, more than 2.69 million people live with visual disabilities (INEGI, 2020). Many assistive devices that help visually impaired individuals understand their surroundings are expensive or hard to obtain. We wanted to build something affordable, portable, and powered by modern AI, so anyone could point a camera, take a picture, and instantly hear what is around them. That idea became insAIght — an intelligent, low-cost assistant designed to make visual information accessible through audio.

What it does

insAIght captures an image using an ESP32-CAM or a smartphone and sends it to our cloud server. The server processes the photo with Gemini Vision, generates a detailed description, and then converts that text into natural speech using ElevenLabs. The user receives an audio message that explains what the camera sees — objects, people, signs, obstacles, text, and general context of the scene. The system works both with the ESP32-CAM device and a simple mobile interface, making it accessible to everyone.
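The device-to-server exchange described above can be sketched from the client side as a single HTTP POST that sends raw JPEG bytes and receives audio back. This is a minimal illustration, not the project's actual code: the `describe_image` helper and the endpoint path are hypothetical, and the real server may use a different content type or a multipart upload.

```python
"""Hypothetical client-side flow: POST a JPEG, get back synthesized audio."""
import urllib.request


def describe_image(jpeg_bytes: bytes, endpoint: str) -> bytes:
    """Upload one photo and return the audio payload the backend replies with."""
    req = urllib.request.Request(
        endpoint,                              # e.g. "https://<server>/describe" (illustrative)
        data=jpeg_bytes,                       # raw compressed JPEG from the camera
        headers={"Content-Type": "image/jpeg"},
        method="POST",
    )
    # The backend is expected to answer with the spoken description,
    # e.g. an audio/mpeg body ready to play on the phone or device.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()
```

The same call works for both sources: the ESP32-CAM and the mobile interface only differ in how they obtain the JPEG bytes before uploading.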

How we built it

We developed a three-part system:

  1. ESP32-CAM device
     - Captures photos on command
     - Sends compressed JPEG images over WiFi
     - Communicates with our backend through HTTP
  2. Cloud backend
     - Receives images from the ESP32-CAM or phone
     - Sends the image to the Gemini Vision API
     - Processes and formats the model's response
     - Forwards the description text to ElevenLabs
  3. Audio delivery
     - ElevenLabs generates a clear, natural voice
     - The audio is returned to the user's device
     - Designed for quick response and low friction
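The backend stage (receive image → describe → synthesize) can be outlined as one small pipeline function. This is a sketch under stated assumptions: `run_pipeline`, `describe`, and `synthesize` are hypothetical names standing in for the real Gemini Vision and ElevenLabs calls, which are passed in as callables here so the flow itself stays self-contained.

```python
"""Backend pipeline sketch: image bytes -> description text -> speech audio."""
from typing import Callable


def run_pipeline(
    jpeg_bytes: bytes,
    describe: Callable[[bytes], str],    # stand-in for the Gemini Vision call
    synthesize: Callable[[str], bytes],  # stand-in for the ElevenLabs TTS call
    max_chars: int = 400,
) -> bytes:
    """Describe the image, trim the text for speech, and return audio bytes."""
    text = describe(jpeg_bytes).strip()
    if len(text) > max_chars:
        # Cut at the last sentence boundary under the limit so the
        # spoken output still ends naturally instead of mid-sentence.
        cut = text.rfind(".", 0, max_chars)
        text = text[: cut + 1] if cut > 0 else text[:max_chars]
    return synthesize(text)
```

Injecting the two API calls as parameters keeps the orchestration testable without network access, and the same function serves uploads from either the ESP32-CAM or the phone.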

Challenges we ran into

  - ESP32-CAM instability (WiFi drops, memory limits)
  - Ensuring images uploaded correctly without corruption
  - Reducing latency between picture → analysis → audio
  - Handling low-light or noisy images
  - Balancing clarity and length in AI descriptions
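One way to catch the upload-corruption problem is a server-side integrity check before spending an API call on a broken image. The sketch below is illustrative rather than the project's actual fix: the helper name is hypothetical, and it assumes the device sends an MD5 of the JPEG alongside the upload.

```python
"""Sketch: reject truncated or corrupted JPEG uploads before analysis."""
import hashlib


def jpeg_is_intact(jpeg_bytes: bytes, claimed_md5: str) -> bool:
    """Check JPEG framing markers and a checksum supplied by the uploader."""
    # A JPEG starts with the SOI marker (FF D8) and ends with EOI (FF D9);
    # an upload cut off by a WiFi drop usually fails one of these checks.
    well_formed = (
        jpeg_bytes.startswith(b"\xff\xd8") and jpeg_bytes.endswith(b"\xff\xd9")
    )
    return well_formed and hashlib.md5(jpeg_bytes).hexdigest() == claimed_md5
```

Failing fast here also helps the latency goal: a corrupt frame triggers an immediate retry on the device instead of a wasted round trip through the vision model.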

Accomplishments that we're proud of

  - Building a complete end-to-end assistive tool in limited hackathon time
  - Successfully integrating Gemini Vision + ElevenLabs + IoT
  - Making the system work both on hardware and mobile
  - Creating a prototype that is actually useful for visually impaired users

What we learned

  - How to optimize microcontrollers for image capture
  - How to design AI pipelines combining vision + audio
  - Practical IoT communication and error handling

What's next for insAIght

  - Add real-time object tracking with continuous audio feedback
  - Integrate OCR + text reading for signs, menus, labels
  - Improve night-mode performance with better image preprocessing
  - Build a wearable version (e.g., a small clip-on camera)
