Category: Live Agents
Inspiration
Blind and visually impaired users often need immediate spoken guidance about their surroundings, not just static image captions. We wanted to build an assistant that could act more like a calm pair of guiding eyes: warning about hazards, helping locate objects, and understanding a room in real time through voice and camera input.
What it does
AEyes is a live multimodal accessibility agent for blind and visually impaired users. It uses Gemini Live to listen through the microphone, see through the camera, and respond with spoken guidance.
AEyes can:
- guide the user through a startup room scan
- describe surroundings in real time
- prioritize hazards such as obstacles on the floor, stairs, low-hanging objects, and collision risks
- help the user find objects with directional spoken guidance
- maintain short-term scene context during a session
Unlike a generic multimodal assistant, AEyes is tuned specifically for accessibility-first, safety-first behavior.
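To make that tuning concrete, here is a minimal sketch of how a safety-first system instruction can be attached to a Gemini Live session using the google-genai Python SDK. The prompt wording and model ID below are illustrative placeholders, not our exact production values.

```python
# Minimal sketch: opening a Gemini Live session with an
# accessibility-first system instruction (google-genai SDK).
# The prompt text and model ID are illustrative placeholders.
import asyncio
from google import genai

SAFETY_FIRST_PROMPT = (
    "You are AEyes, a spoken guide for a blind user. "
    "Always report hazards first: obstacles on the floor, stairs, "
    "low-hanging objects, collision risks. Give directions relative "
    "to the user ('chair two steps ahead, slightly left'). Keep "
    "responses short and calm."
)

client = genai.Client()  # reads the API key from the environment

config = {
    "response_modalities": ["AUDIO"],           # respond with speech
    "system_instruction": SAFETY_FIRST_PROMPT,  # safety-first behavior
}

async def main():
    # Audio and camera frames are streamed into the session and spoken
    # audio is streamed back; see the relay sketch further down.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        ...

asyncio.run(main())
```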
How we built it
The mobile client is built with Flutter for iPhone; it streams live camera frames and microphone input to the backend and plays spoken responses back as streaming audio.
On the backend, we built a Google Cloud-hosted agent architecture (a simplified relay sketch follows the list):
- Cloud Run hosts the relay/backend service
- Firestore stores session state and scene context
- Cloud Logging provides backend visibility and debugging
- Gemini Live powers the real-time multimodal reasoning and voice responses
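As a rough illustration of how these pieces connect, below is a heavily simplified sketch of the kind of relay Cloud Run hosts: a WebSocket endpoint that forwards the phone's microphone audio to a Gemini Live session and streams the spoken reply back. The message framing, port, and model ID are assumptions for illustration; the real service also forwards camera frames and persists context to Firestore.

```python
# Simplified relay sketch: phone <-> Cloud Run <-> Gemini Live.
# Framing assumption: the client sends raw 16 kHz PCM audio chunks
# over a WebSocket and plays back whatever PCM bytes it receives.
import asyncio
import websockets
from google import genai
from google.genai import types

client = genai.Client()
CONFIG = {"response_modalities": ["AUDIO"]}

async def handle_phone(ws):
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=CONFIG
    ) as session:

        async def uplink():
            # Forward microphone audio from the phone to Gemini.
            async for chunk in ws:
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def downlink():
            # Stream Gemini's spoken audio back to the phone.
            async for message in session.receive():
                if message.data:  # raw audio bytes from the model
                    await ws.send(message.data)

        await asyncio.gather(uplink(), downlink())

async def main():
    async with websockets.serve(handle_phone, "0.0.0.0", 8080):
        await asyncio.Future()  # serve until the container is stopped

asyncio.run(main())
```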
We also added the foundation for ARKit-based spatial enhancement on iOS as a future path toward better world understanding.
Tech Stack:
- Flutter
- Dart
- Python
- Gemini Live API
- Google Cloud Run
- Google Cloud Firestore
- Google Cloud Logging
- iOS camera/audio stack
- ARKit groundwork on iOS
Challenges we ran into
One of the biggest challenges was making the agent feel truly live instead of like a sequence of disconnected camera queries. We had to reduce client-side latency, improve audio streaming, and tune interruption behavior so the assistant would not cut itself off unexpectedly.
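Concretely, the Live API flags barge-in on the server message stream, and the receiving side has to discard any audio it has queued but not yet played; otherwise the assistant keeps talking over the user. Here is a minimal sketch of that logic, replacing the downlink loop from the relay sketch above (`playback_queue` is a hypothetical asyncio.Queue of audio chunks awaiting delivery):

```python
# Sketch: barge-in handling. When the user starts speaking, the
# server marks the turn as interrupted and we flush unplayed audio
# so the assistant stops cleanly instead of finishing a stale reply.
async def downlink(session, playback_queue):
    async for message in session.receive():
        content = message.server_content
        if content and content.interrupted:
            while not playback_queue.empty():
                playback_queue.get_nowait()  # drop stale audio chunks
            continue
        if message.data:
            await playback_queue.put(message.data)
```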
Another challenge was turning a frame-based multimodal model into something more useful for navigation. Gemini is powerful, but it is not automatically a persistent 3D world model. To address that, we added a startup scan flow and short-term scene memory, and we began laying the foundation for future ARKit-backed spatial reasoning.
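The scene memory maps naturally onto Firestore. Below is a simplified sketch of how startup-scan observations could be stored per session and pulled back as context for later turns; the collection layout and field names are illustrative, not our exact schema.

```python
# Sketch: Firestore-backed short-term scene memory for one session.
# Collection layout and field names are illustrative.
from google.cloud import firestore

db = firestore.Client()

def record_observation(session_id: str, label: str, direction: str) -> None:
    # Append one startup-scan observation, e.g.
    # record_observation("abc123", "doorway", "ahead-left").
    db.collection("sessions").document(session_id).collection("scene").add({
        "label": label,
        "direction": direction,
        "observed_at": firestore.SERVER_TIMESTAMP,
    })

def recent_scene_context(session_id: str, limit: int = 20) -> list[dict]:
    # Fetch the latest observations to fold into the next model turn.
    docs = (
        db.collection("sessions").document(session_id).collection("scene")
        .order_by("observed_at", direction=firestore.Query.DESCENDING)
        .limit(limit)
        .stream()
    )
    return [d.to_dict() for d in docs]
```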
We also moved the live session architecture onto Google Cloud so the project would be more than just a phone talking directly to Gemini.
Accomplishments that we're proud of
- Built a real-time accessibility-focused live agent
- Created a safety-first interaction style for blind users
- Added startup scan behavior and session memory
- Moved live orchestration behind Google Cloud Run
- Integrated Firestore-backed session context
- Demonstrated a clear real-world use case for Gemini Live beyond generic multimodal chat
What we learned
We learned that building a useful live agent is not just about model capability. It also depends on latency, interruption behavior, prompt design, UX framing, and system architecture. We also learned that a strong hackathon project needs a real product story and a cloud-hosted architecture, not just an impressive model demo.
What's next for AEyes
- ARKit-based spatial grounding and better room mapping
- stronger scene memory across longer interactions
- more reliable navigation guidance
- richer object-finding workflows
- broader accessibility testing and UX refinement