Inspiration (The problem)
- Screens and software are built almost exclusively for keyboard-and-mouse users
- Existing screen readers are slow and inefficient for the real tasks visually impaired people need to complete
- Most systems force users to learn complex shortcuts instead of using natural interaction
- Current voice assistants are fragmented and can’t complete full workflows
- Goal: reduce the gap between human intent → computer action
What it does
Multi-Agent Architecture (What we built)
- Built a system of 5 working AI agents that handle different tasks:
- Shopping Agent - searches and compares products
- Research Agent - pulls web info and summarizes it
- Calendar Agent - reads and manages Google Calendar events
- General Agent - handles normal conversation
- Router Agent - decides which agent should respond
- Implemented a routing system (the brain)
- Takes user input → classifies intent → sends task to respective agent
- Built the system so no single model has to handle every task
- Designed specialized components that work together, which is exactly where existing solutions fall short
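The routing step can be sketched as a thin classifier sitting in front of the agent pool. In the real system Gemini performs the intent classification; the keyword table below is a hypothetical stand-in so the control flow is visible:

```python
# Hypothetical sketch: a keyword table stands in for the Gemini call
# that classifies intent in the real system.
KEYWORDS = {
    "shopping": ("buy", "price", "compare"),
    "research": ("look up", "summarize", "find"),
    "calendar": ("meeting", "schedule", "event"),
}

def route(utterance: str) -> str:
    """Classify the user's intent and return the agent that should respond."""
    text = utterance.lower()
    for agent, words in KEYWORDS.items():
        if any(word in text for word in words):
            return agent
    return "general"  # fallback: normal conversation
```

Because the fallback is the General Agent, any utterance the classifier cannot place still gets a conversational response instead of an error.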
Seamless Design (What users actually see)
- Users interact entirely through natural speech → no UI learning curve
- Built a live visual feedback system:
- Shows what the agent is doing in real time
- Displays navigation, cursor movement, and actions taken
- Shows system reasoning/decision flow
- Tested with real accessibility context:
- Worked with TLOS (Technology-Enhanced Learning and Online Strategies)
- Connected with Disability Alliance and Caucus
- Tentatively working with DisCoTec, the Disability Community Technology Center
- Tentatively working with Andrew Begel's lab VariAbility at Carnegie Mellon
- Tentatively working with disability studies professor Ashley Shew
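One simple way to drive the live feedback panel is to have each agent emit a small JSON event for every step it takes (navigation, cursor movement, reasoning notes). The schema below is a hypothetical sketch, not the system's actual wire format:

```python
import json

def action_event(agent: str, action: str, detail: str) -> str:
    """Serialize one agent step so the desktop UI can render it in
    real time. Field names here are illustrative placeholders."""
    return json.dumps({"agent": agent, "action": action, "detail": detail})
```

Streaming these events over the WebSocket channel lets the UI show what the agent is doing as it happens rather than only showing the final answer.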
Designed for Scale (Decisions made for maximum growth)
- Added context compression to handle long conversations efficiently
- Designed a modular architecture so new agents can be added easily
- Packaged as a desktop application for easy distribution
- Designed to be able to integrate external tools and APIs in the future
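The context-compression idea above can be sketched as collapsing older turns into a summary while keeping recent turns verbatim. In practice the summary would come from a model call; the placeholder below just records how many turns it dropped:

```python
def compress_context(turns: list[str], keep_last: int = 4) -> list[str]:
    """Hypothetical sketch of context compression: older turns collapse
    into a single summary line so long conversations stay within the
    model's context budget. A real summarizer would be a model call."""
    if len(turns) <= keep_last:
        return turns
    older = turns[:-keep_last]
    summary = f"[summary of {len(older)} earlier turns]"
    return [summary] + turns[-keep_last:]
```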
How we built it
- FastAPI backend for agent orchestration
- WebSocket system for real-time updates
- Deepgram for speech-to-text
- Gemini for routing + decision making
- ElevenLabs for voice output
- Desktop client for live interaction UI
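Glued together, the stack above forms one turn of the voice loop: Deepgram transcribes, Gemini decides, the chosen agent acts, and ElevenLabs speaks the reply. The functions below are hypothetical stubs for those services, so only the control flow is shown, not the real API calls:

```python
# Hypothetical end-to-end pipeline glue. In the real app, transcribe()
# wraps Deepgram STT, decide() wraps Gemini routing, and speak() wraps
# ElevenLabs TTS; here each is stubbed to make the flow visible.
def transcribe(audio: bytes) -> str:
    return audio.decode()  # stub for Deepgram speech-to-text

def decide(utterance: str) -> str:
    return "calendar" if "meeting" in utterance else "general"  # stub for Gemini

def speak(text: str) -> bytes:
    return text.encode()  # stub for ElevenLabs text-to-speech

def handle_turn(audio: bytes) -> tuple[str, bytes]:
    """Run one voice turn: audio in, (agent name, reply audio) out."""
    utterance = transcribe(audio)
    agent = decide(utterance)
    reply = f"[{agent}] handled: {utterance}"
    return agent, speak(reply)
```

In the actual app this loop runs behind the FastAPI backend, with the WebSocket channel pushing intermediate status to the desktop client.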
Challenges we ran into
- Keeping multiple agents coordinated without conflicts
- Ensuring tasks were delegated to the correct agent
- Maintaining context across long conversations
- Designing a system where each agent holds its own context and stores the information relevant to it
- Making routing decisions fast enough to be usable
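The per-agent context challenge above can be sketched as a store keyed by agent name, so a calendar follow-up never pollutes the shopping agent's history. Names here are illustrative, not the system's actual classes:

```python
from collections import defaultdict

class ContextStore:
    """Hypothetical sketch of per-agent context: each agent keeps its
    own conversation history, isolated from the other agents'."""

    def __init__(self) -> None:
        self._history: dict[str, list[str]] = defaultdict(list)

    def add(self, agent: str, turn: str) -> None:
        self._history[agent].append(turn)

    def get(self, agent: str) -> list[str]:
        return list(self._history[agent])
```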
Accomplishments we’re proud of
- Built a fully working multi-agent voice system, something many existing solutions fail to achieve
- Achieved real-time action visualization (not just chat output)
- Created a system that pairs advanced technical capability with coherent decisions and responses
- Designed the system for maximum growth and scalability
What we learned
- Multi-agent systems are powerful but require strong orchestration and edge case testing
- Routing is just as important as model capability
- Real-time feedback dramatically shifts the development direction
- Accessibility-first design changes how you think about UX
What’s next for OpenSight
- Add more specialized agents (email, travel, coding, etc.)
- Expand the app to cross-platform deployment
- Allow for plug-ins for third-party tools
- Move towards a fully autonomous task execution flow
Built With
- deepgram
- elevenlabs
- fastapi
- google-ai-studio
- google-calendar-api
- google-cloud
- google-gemini-api
- google-workspace
- python
- serpapi
- websockets