Category: Live Agents
Inspiration
Blind and visually impaired users often need immediate spoken guidance about their surroundings, not just static image captions. We wanted to build an assistant that could act more like a calm pair of guiding eyes: warning about hazards, helping locate objects, and understanding a room in real time through voice and camera input.
What it does
AEyes is a live multimodal accessibility agent for blind and visually impaired users. It uses Gemini Live to listen through the microphone, see through the camera, and respond with spoken guidance.
AEyes can:
- guide the user through a startup room scan
- describe surroundings in real time
- prioritize hazards such as obstacles on the floor, stairs, low-hanging objects, and collision risks
- help the user find objects with directional spoken guidance
- maintain short-term scene context during a session
Unlike a generic multimodal assistant, AEyes is tuned specifically for accessibility-first, safety-first behavior.
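To make that tuning concrete, here is a minimal sketch of how a safety-first system instruction can be attached to a Gemini Live session using the google-genai Python SDK. The prompt wording and model ID below are illustrative placeholders, not our exact production values.

```python
# Minimal sketch: opening a Gemini Live session with an
# accessibility-first system instruction (google-genai SDK).
# The prompt text and model ID are illustrative placeholders.
import asyncio
from google import genai

SAFETY_FIRST_PROMPT = (
    "You are AEyes, a spoken guide for a blind user. "
    "Always report hazards first: obstacles on the floor, stairs, "
    "low-hanging objects, collision risks. Give directions relative "
    "to the user ('chair two steps ahead, slightly left'). Keep "
    "responses short and calm."
)

client = genai.Client()  # reads the API key from the environment

config = {
    "response_modalities": ["AUDIO"],           # respond with speech
    "system_instruction": SAFETY_FIRST_PROMPT,  # safety-first behavior
}

async def main():
    # Audio and camera frames are streamed into the session and spoken
    # audio is streamed back; see the relay sketch further down.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        ...

asyncio.run(main())
```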
How we built it
The mobile client is built with Flutter for iPhone; it streams live camera frames and microphone input to the backend and plays spoken responses back as streaming audio.
On the backend, we built a Google Cloud-hosted agent architecture (a simplified relay sketch follows the list):
- Cloud Run hosts the relay/backend service
- Firestore stores session state and scene context
- Cloud Logging provides backend visibility and debugging
- Gemini Live powers the real-time multimodal reasoning and voice responses
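As a rough illustration of how these pieces connect, below is a heavily simplified sketch of the kind of relay Cloud Run hosts: a WebSocket endpoint that forwards the phone's microphone audio to a Gemini Live session and streams the spoken reply back. The message framing, port, and model ID are assumptions for illustration; the real service also forwards camera frames and persists context to Firestore.

```python
# Simplified relay sketch: phone <-> Cloud Run <-> Gemini Live.
# Framing assumption: the client sends raw 16 kHz PCM audio chunks
# over a WebSocket and plays back whatever PCM bytes it receives.
import asyncio
import websockets
from google import genai
from google.genai import types

client = genai.Client()
CONFIG = {"response_modalities": ["AUDIO"]}

async def handle_phone(ws):
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=CONFIG
    ) as session:

        async def uplink():
            # Forward microphone audio from the phone to Gemini.
            async for chunk in ws:
                await session.send_realtime_input(
                    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
                )

        async def downlink():
            # Stream Gemini's spoken audio back to the phone.
            async for message in session.receive():
                if message.data:  # raw audio bytes from the model
                    await ws.send(message.data)

        await asyncio.gather(uplink(), downlink())

async def main():
    async with websockets.serve(handle_phone, "0.0.0.0", 8080):
        await asyncio.Future()  # serve until the container is stopped

asyncio.run(main())
```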
We also added the foundation for ARKit-based spatial enhancement on iOS as a future path toward better world understanding.
Tech Stack:
- Flutter
- Dart
- Python
- Gemini Live API
- Google Cloud Run
- Google Cloud Firestore
- Google Cloud Logging
- iOS camera/audio stack
- ARKit groundwork on iOS
Challenges we ran into
One of the biggest challenges was making the agent feel truly live instead of like a sequence of disconnected camera queries. We had to reduce client-side latency, improve audio streaming, and tune interruption behavior so the assistant would not cut itself off unexpectedly.
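Concretely, the Live API flags barge-in on the server message stream, and the receiving side has to discard any audio it has queued but not yet played; otherwise the assistant keeps talking over the user. Here is a minimal sketch of that logic, replacing the downlink loop from the relay sketch above (`playback_queue` is a hypothetical asyncio.Queue of audio chunks awaiting delivery):

```python
# Sketch: barge-in handling. When the user starts speaking, the
# server marks the turn as interrupted and we flush unplayed audio
# so the assistant stops cleanly instead of finishing a stale reply.
async def downlink(session, playback_queue):
    async for message in session.receive():
        content = message.server_content
        if content and content.interrupted:
            while not playback_queue.empty():
                playback_queue.get_nowait()  # drop stale audio chunks
            continue
        if message.data:
            await playback_queue.put(message.data)
```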
Another challenge was turning a frame-based multimodal model into something more useful for navigation. Gemini is powerful, but it is not automatically a persistent 3D world model. To address that, we added a startup scan flow and short-term scene memory, and we began laying the foundation for future ARKit-backed spatial reasoning.
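The scene memory maps naturally onto Firestore. Below is a simplified sketch of how startup-scan observations could be stored per session and pulled back as context for later turns; the collection layout and field names are illustrative, not our exact schema.

```python
# Sketch: Firestore-backed short-term scene memory for one session.
# Collection layout and field names are illustrative.
from google.cloud import firestore

db = firestore.Client()

def record_observation(session_id: str, label: str, direction: str) -> None:
    # Append one startup-scan observation, e.g.
    # record_observation("abc123", "doorway", "ahead-left").
    db.collection("sessions").document(session_id).collection("scene").add({
        "label": label,
        "direction": direction,
        "observed_at": firestore.SERVER_TIMESTAMP,
    })

def recent_scene_context(session_id: str, limit: int = 20) -> list[dict]:
    # Fetch the latest observations to fold into the next model turn.
    docs = (
        db.collection("sessions").document(session_id).collection("scene")
        .order_by("observed_at", direction=firestore.Query.DESCENDING)
        .limit(limit)
        .stream()
    )
    return [d.to_dict() for d in docs]
```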
We also moved the live session architecture onto Google Cloud so the project would be more than just a phone talking directly to Gemini.
Accomplishments that we're proud of
- Built a real-time accessibility-focused live agent
- Created a safety-first interaction style for blind users
- Added startup scan behavior and session memory
- Moved live orchestration behind Google Cloud Run
- Integrated Firestore-backed session context
- Demonstrated a clear real-world use case for Gemini Live beyond generic multimodal chat
What we learned
We learned that building a useful live agent is not just about model capability. It also depends on latency, interruption behavior, prompt design, UX framing, and system architecture. We also learned that a strong hackathon project needs a real product story and a cloud-hosted architecture, not just an impressive model demo.
What's next for AEyes
- ARKit-based spatial grounding and better room mapping
- stronger scene memory across longer interactions
- more reliable navigation guidance
- richer object-finding workflows
- broader accessibility testing and UX refinement