Inspiration
The internet was built for the sighted and the able-bodied. For the at least 2.2 billion people worldwide with a vision impairment, and for the many others with motor disabilities, the web remains a place full of friction. Traditional screen readers are passive: they read content at you, but they don't help you act on it.
We wanted to shift the paradigm from "Screen Reading" to "Agentic Browsing." We asked ourselves: What if the browser could see what you see and hear what you say, then do the clicking and scrolling for you? This inspired ECHO-NAV, a system designed to give true digital agency back to users by letting them navigate the web using only their voice or hand gestures.
What it does
ECHO-NAV is a smart agentic system with two accessible interfaces:
- Gestura (Hand Signal Agent): Uses a camera to track hand gestures. A pretrained ML model classifies each gesture as a letter of the alphabet, and each letter triggers a specific automated browsing task (e.g., 'S' gesture -> Search).
- Vox (Voice Agent): Takes natural language voice commands, performs the web action autonomously, and speaks back a summary.
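The letter-to-task mapping that Gestura relies on can be sketched as a small lookup table. This is a minimal illustration, not the project's actual code: only the 'S' -> Search pairing comes from the description above, and the other letters, the action names, and the `action_for_letter` helper are hypothetical.

```python
from typing import Optional

# Hypothetical letter-to-action table. Only 'S' -> search is taken from the
# project description; every other entry is an illustrative placeholder.
GESTURE_ACTIONS = {
    "S": "search",
    "B": "go_back",      # assumed
    "D": "scroll_down",  # assumed
    "U": "scroll_up",    # assumed
}


def action_for_letter(letter: str) -> Optional[str]:
    """Return the browsing action name for a predicted letter, or None."""
    return GESTURE_ACTIONS.get(letter.strip().upper())


if __name__ == "__main__":
    print(action_for_letter("s"))
    print(action_for_letter("Z"))
```

Returning None for unmapped letters lets the caller ignore stray predictions instead of triggering an unintended browser action.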
How we built it
- Frontend: HTML, CSS, JavaScript for the user interface.
- Backend: Python and FastAPI for high-performance request handling.
- AI & ML: MediaPipe for hand tracking, with its landmarks fed into a pretrained ML model that predicts letters/commands.
- Agentic Core: browser_use library to control the headless browser (clicking, typing, scrolling).
- Dev Tools: Jupyter for prototyping and model training.
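The step between MediaPipe hand tracking and the pretrained model usually involves turning landmarks into a fixed-length feature vector. The sketch below shows one common way to do that, assuming the 21 (x, y, z) landmarks per hand that MediaPipe Hands reports, with the wrist at index 0. The `landmarks_to_features` function and its normalization scheme are assumptions for illustration; the project's actual preprocessing is not described.

```python
import math
from typing import List, Sequence, Tuple

Landmark = Tuple[float, float, float]

NUM_LANDMARKS = 21  # MediaPipe Hands reports 21 landmarks per hand


def landmarks_to_features(landmarks: Sequence[Landmark]) -> List[float]:
    """Flatten hand landmarks into a 63-value vector.

    Coordinates are shifted so the wrist (index 0) is the origin and scaled
    by the largest wrist-to-landmark distance, so the features do not depend
    on where the hand is in the frame or how large it appears.
    """
    if len(landmarks) != NUM_LANDMARKS:
        raise ValueError(f"expected {NUM_LANDMARKS} landmarks, got {len(landmarks)}")

    wx, wy, wz = landmarks[0]
    shifted = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]

    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in shifted)
    if scale == 0:
        scale = 1.0  # degenerate input: all points coincide with the wrist

    return [coord / scale for point in shifted for coord in point]
```

A classifier trained on vectors like these would see the same input whether the hand is near the camera or far from it, which tends to make letter prediction more stable.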
Challenges we ran into
- Model Latency: Reducing the lag between gesture prediction and browser action.
- Gesture Confusion: Fine-tuning the ML model to distinguish between similar hand signs (like 'A' vs 'E').
- Web Clutter: Preventing the browsing agent from getting stuck on pop-ups and cookie banners.
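One common way to trade a little latency for fewer misfires between similar signs is to vote over the last few per-frame predictions before acting. The sketch below is an assumption about how that could look, not the approach the team used; the `GestureDebouncer` class, its window size, and its vote threshold are all hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Optional


class GestureDebouncer:
    """Emit a letter only when it dominates the recent prediction window."""

    def __init__(self, window: int = 5, min_votes: int = 4) -> None:
        self.window: Deque[str] = deque(maxlen=window)
        self.min_votes = min_votes
        self.last_emitted: Optional[str] = None

    def update(self, letter: str) -> Optional[str]:
        """Record one per-frame prediction; return a letter when it is stable."""
        self.window.append(letter)
        top, votes = Counter(self.window).most_common(1)[0]
        # Fire once per stable gesture, not once per frame.
        if votes >= self.min_votes and top != self.last_emitted:
            self.last_emitted = top
            return top
        return None


if __name__ == "__main__":
    debouncer = GestureDebouncer()
    for frame_prediction in ["A", "E", "A", "A", "A", "A"]:
        stable = debouncer.update(frame_prediction)
        if stable is not None:
            print("trigger action for", stable)
```

A single stray 'E' among 'A' frames is outvoted, so no action fires for it. The cost is a delay of a few frames, and in this simple version the same letter cannot fire twice in a row without a different stable letter in between.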
Accomplishments that we're proud of
- Real-Time Translation: Mapping hand-sign letters to complex web actions in real time.
- True Agency: Giving users the ability to complete tasks (not just consume content) without a keyboard or mouse.
- Multimodal Design: Seamlessly integrating voice and vision inputs into one cohesive system.
What we learned
- Agents > Chatbots: LLMs combined with tools like browser_use can actively navigate the world, not just talk about it.
- ML Integration: Managing the pipeline from MediaPipe landmarks to ML inference to backend execution requires rigorous optimization.
What's next for ECHO-NAV
- Custom Gestures: Allowing users to train the system on their own unique hand signs.
- Mobile App: Porting the technology to smartphones for on-the-go accessibility.
- Complex Workflows: Expanding Vox to handle multi-step tasks (e.g., "Find a recipe and order the ingredients").
Built With
- browser-use
- css3
- fastapi
- html5
- javascript
- jupyter
- mediapipe
- python
