Inspiration (The problem)

  • Screens and software are built almost exclusively for keyboard-and-mouse users
  • Existing screen readers are slow and inefficient for the real tasks visually impaired people need to complete
  • Most systems force users to memorize complex shortcuts instead of supporting natural interaction
  • Current voice assistants are fragmented and can't complete full workflows
  • Goal: close the gap between human intent → computer action

What it does

Multi-Agent Architecture (What we built)

  • Built a system of 5 working AI agents, each handling a different kind of task:
    • Shopping Agent - searches for and compares products
    • Research Agent - pulls information from the web and summarizes it
    • Calendar Agent - reads and manages Google Calendar events
    • General Agent - handles everyday conversation
    • Router Agent - decides which agent should respond
  • Implemented a routing system (the brain)
    • Takes user input → classifies intent → dispatches the task to the appropriate agent
  • Built the system so no single model handles every task
    • Specialized components work together, which is exactly where existing solutions fail
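The routing flow above can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code: in the real system Gemini performs the intent classification, so a simple keyword heuristic stands in here just to make the input → classify → dispatch loop runnable.

```python
# Hypothetical sketch of the Router Agent: classify the user's intent,
# then dispatch the utterance to the matching specialized agent.
# (The real system uses Gemini for classification; keywords stand in here.)

def classify_intent(utterance: str) -> str:
    """Return the name of the agent that should handle this utterance."""
    text = utterance.lower()
    if any(w in text for w in ("buy", "price", "compare", "shop")):
        return "shopping"
    if any(w in text for w in ("meeting", "calendar", "schedule", "event")):
        return "calendar"
    if any(w in text for w in ("research", "look up", "summarize")):
        return "research"
    return "general"

# Each agent is just a callable in this sketch; real agents wrap
# models, tools, and APIs (Google Calendar, web search, etc.).
AGENTS = {
    "shopping": lambda q: f"[shopping agent] comparing products for: {q}",
    "calendar": lambda q: f"[calendar agent] checking events for: {q}",
    "research": lambda q: f"[research agent] summarizing web results for: {q}",
    "general":  lambda q: f"[general agent] chatting about: {q}",
}

def route(utterance: str) -> str:
    """One pass of the brain: intent → agent → response."""
    return AGENTS[classify_intent(utterance)](utterance)
```

The key design point is that the router only decides *who* answers; each agent owns *how* to answer, which is what keeps the system modular.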

Seamless Design (What users actually see)

  • Users interact entirely through natural speech → no UI learning curve
  • Built a live visual feedback system:
    • Shows what the agent is doing in real time
    • Displays navigation, cursor movement, and actions taken
    • Shows system reasoning/decision flow
  • Tested in a real accessibility context:
    • Worked with TLOS (Technology-Enhanced Learning and Online Strategies)
    • Connected with the Disability Alliance and Caucus
      • Tentatively working with DisCoTec, the Disability Community Technology Center
      • Tentatively working with Andrew Begel's VariAbility lab at Carnegie Mellon
      • Tentatively working with disability studies professor Ashley Shew

Designed for Scale (Decisions made for maximum growth)

  • Added context compression to handle long conversations efficiently
  • Designed a modular architecture so new agents can be added easily
  • Packaged as a desktop application for easy distribution
  • Designed to be able to integrate external tools and APIs in the future
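The context-compression idea mentioned above can be sketched as "keep recent turns verbatim, fold older turns into a rolling summary," so the prompt stays bounded no matter how long the conversation runs. This is a minimal illustration with assumed names; the real system would use an LLM to summarize, whereas this placeholder just truncates.

```python
# Sketch of context compression for long conversations (illustrative only):
# recent turns are kept verbatim, older turns are compressed into a
# short rolling summary instead of growing the prompt without bound.

from collections import deque

class CompressedContext:
    def __init__(self, max_recent: int = 4):
        self.recent = deque(maxlen=max_recent)  # verbatim recent turns
        self.summary = ""                       # rolling digest of older turns

    def add_turn(self, speaker: str, text: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]
            # Placeholder summarization: keep a short snippet of the
            # evicted turn. A real system would ask an LLM to summarize.
            self.summary = (self.summary + " " + oldest[:40]).strip()
        self.recent.append(f"{speaker}: {text}")

    def prompt_context(self) -> str:
        """Bounded context string to prepend to the next model call."""
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        parts.extend(self.recent)
        return "\n".join(parts)
```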

How we built it

  • FastAPI backend for agent orchestration
  • WebSocket system for real-time updates
  • Deepgram for speech-to-text
  • Gemini for routing + decision making
  • ElevenLabs for voice output
  • Desktop client for live interaction UI
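The WebSocket layer above carries small status events from the backend to the desktop client for the live visual feedback. The event shape below is an assumption for illustration, not the project's actual schema; it shows the kind of JSON message that would flow over the socket.

```python
# Hypothetical sketch of the real-time update protocol: the FastAPI
# backend pushes small JSON events over a WebSocket, and the desktop
# client decodes them to render agent activity live. All field names
# here are assumptions, not the actual OpenSight schema.

import json
import time

def make_event(agent: str, action: str, detail: str) -> str:
    """Backend side: serialize one status update for the socket."""
    return json.dumps({
        "ts": time.time(),   # when the event happened
        "agent": agent,      # which agent is acting (e.g. "calendar")
        "action": action,    # coarse step: "routing", "clicking", "reading"
        "detail": detail,    # human-readable text for the overlay
    })

def parse_event(raw: str) -> dict:
    """Client side: decode an event before rendering it in the UI."""
    return json.loads(raw)
```

Keeping events this small is what makes it feasible to show navigation, cursor movement, and reasoning steps as they happen rather than after the fact.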

Challenges we ran into

  • Keeping multiple agents coordinated without conflicts
    • Ensuring each task was delegated to the right agent
  • Maintaining context across long conversations
    • Designed a system where each agent holds its own context and stores the information relevant to it
  • Making routing decisions fast enough to feel responsive
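The per-agent context approach can be sketched as one isolated store per agent, so a shopping query never pollutes the calendar agent's state. The class and method names below are illustrative assumptions, not the project's actual code.

```python
# Illustrative sketch of per-agent context isolation: each agent gets
# its own history store, created lazily on first use, so agents never
# read or overwrite each other's state.

class AgentContext:
    def __init__(self, name: str):
        self.name = name
        self.history: list[str] = []  # facts this agent has stored

    def remember(self, fact: str) -> None:
        self.history.append(fact)

class ContextRegistry:
    """One isolated context per agent, keyed by agent name."""
    def __init__(self):
        self._contexts: dict[str, AgentContext] = {}

    def get(self, agent: str) -> AgentContext:
        return self._contexts.setdefault(agent, AgentContext(agent))
```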

Accomplishments we’re proud of

  • Built a fully working multi-agent voice system, something many existing solutions fail to accomplish
  • Achieved real-time action visualization (not just chat output)
  • Created a system whose decisions and responses reflect real technical depth
  • Designed the system for maximum growth and scalability

What we learned

  • Multi-agent systems are powerful but require strong orchestration and edge case testing
  • Routing is just as important as model capability
  • Real-time feedback dramatically shaped our development direction
  • Accessibility-first design changes how you think about UX

What’s next for OpenSight

  • Add more specialized agents (email, travel, coding, etc.)
  • Expand app into cross-platform deployment
  • Allow for plug-ins for third-party tools
  • Move towards a fully autonomous task execution flow
