Note: in the demo video I sometimes forgot to say "test" or used the wrong prompt. I cut out the worst of those mistakes, but the video is still rough around the edges.
Inspiration
While building a workflow automation tool for repetitive browser tasks, I had a realization: if this system were connected to voice, it could fundamentally change how people with visual impairments or motor disabilities interact with the web.
Most current accessibility tools are rigid, requiring precise commands or complex setups. I wanted to build something that understood intent, not just keywords. Voxium was born from the idea that accessibility shouldn't be a niche feature—it should make technology more fluid for everyone.
What it does
Voxium is an AI-powered browser control system that translates natural speech into intelligent web actions. Instead of memorizing commands, users interact with their browser naturally.
Key capabilities include:
- Natural Language Navigation: "Open YouTube" or "Scroll to comments."
- Content Manipulation: "Replace 'disabled' with 'differently abled'."
- AI Insights: "Summarize this page" to get instant TL;DRs.
- Safety First: Confirmation prompts for potentially destructive actions.
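To make the idea concrete, here is a minimal sketch of what commands like those above could resolve to once parsed. The field names (`action`, `target`) and values are illustrative assumptions, not Voxium's actual schema.

```javascript
// Hypothetical utterance → intent pairs; the schema shown here is an
// assumption for illustration, not the project's real format.
const examples = [
  { utterance: "Open YouTube",
    intent: { action: "navigate", target: "https://www.youtube.com" } },
  { utterance: "Scroll to comments",
    intent: { action: "scrollTo", target: "comments" } },
  { utterance: "Summarize this page",
    intent: { action: "summarize", target: "page" } },
];
```

A structured shape like this is what lets a content script dispatch on `action` instead of pattern-matching raw speech.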
How we built it
Voxium is a Chrome Extension built with a focus on real-time interaction and robust architecture:
- Logic & UI: JavaScript, HTML, and CSS.
- Voice Engine: Web Speech API coupled with an Offscreen Document for continuous, background recognition.
- Brain: AI APIs for intent parsing, summarization, and cleaning misinterpreted speech.
- Automation: Custom Content Scripts for direct DOM interaction and element targeting.
- Development Workflow: AI-assisted ("vibe coded") with Claude and GitHub Copilot.
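The voice engine piece above can be sketched roughly as follows. This is a hedged approximation, assuming the standard Chrome Offscreen API and the Web Speech API's auto-restart pattern; the file name `offscreen.html` and the function names are hypothetical.

```javascript
// background.js (sketch): ensure an offscreen document exists to host
// speech recognition while the popup is closed. Returns false outside
// an extension context.
async function ensureOffscreen() {
  if (typeof chrome === "undefined" || !chrome.offscreen) return false;
  const has = await chrome.offscreen.hasDocument();
  if (!has) {
    await chrome.offscreen.createDocument({
      url: "offscreen.html", // hypothetical file name
      reasons: ["USER_MEDIA"],
      justification: "Continuous speech recognition in the background",
    });
  }
  return true;
}

// offscreen.js (sketch): the Web Speech API stops itself periodically,
// so restart on "end" to keep listening continuously.
function startListening(RecognitionCtor, onTranscript) {
  const rec = new RecognitionCtor();
  rec.continuous = true;
  rec.interimResults = false;
  rec.onresult = (e) => {
    const last = e.results[e.results.length - 1];
    onTranscript(last[0].transcript.trim());
  };
  rec.onend = () => rec.start(); // auto-restart keeps the "ears" open
  rec.start();
  return rec;
}
```

The restart-on-`end` trick is the key to persistent listening, since browsers end recognition sessions after silence.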
Challenges we ran into
- Speech Misinterpretation: Recognition engines often trip over accents or background noise. I implemented a preprocessing layer to "sanitize" text before it hits the AI.
- Background Execution: Keeping the "ears" open while the popup was closed required navigating the complexities of Chrome’s extension lifecycle and offscreen documents.
- Dynamic DOM Targeting: Converting an abstract thought like "click the first result" into a reliable CSS selector across millions of different site structures required building adaptive querying logic.
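The adaptive querying idea can be sketched as a fallback chain: try progressively looser selectors until one matches. The selector lists and the `findTarget` name here are assumptions for illustration, not Voxium's actual logic.

```javascript
// Hedged sketch: map a natural-language target to a ranked list of
// candidate selectors, and return the first element that matches.
function findTarget(root, description) {
  const strategies = {
    "first result": [
      "a[href] h3",          // search-result-style title link
      "[role='listitem'] a", // generic list item link
      "main a[href]",        // loosest: any link in main content
    ],
  };
  for (const sel of strategies[description] ?? []) {
    const el = root.querySelector(sel);
    if (el) return el; // first strategy that matches wins
  }
  return null; // nothing matched; caller can ask the user to rephrase
}
```

Ordering selectors from most to least specific keeps behavior predictable on well-structured sites while still degrading gracefully on unusual ones.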
Accomplishments that we're proud of
- Built a fully functional AI automation engine in under 12 hours.
- Successfully moved beyond "keyword matching" to true intent parsing.
- Implemented persistent listening, allowing for a hands-free experience.
- Created a real, demo-ready system—not just a mockup.
What we learned
- AI Prompt Engineering: How to extract structured JSON intent from messy human speech.
- Extension Architecture: A deep dive into background scripts, permissions, and secure API key management.
- UX for Accessibility: The critical importance of input preprocessing; even a 1% error rate in speech can break the user's trust in automation.
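The "structured JSON from messy speech" lesson can be sketched as a two-part pattern: a prompt that constrains the model's output, plus defensive parsing of the reply. The prompt wording, field names, and `parseIntent` function are illustrative assumptions.

```javascript
// Sketch of constraining a model to a JSON intent (schema assumed).
const SYSTEM_PROMPT = `You control a web browser. Reply ONLY with JSON:
{"action": "navigate|scrollTo|replaceText|summarize", "target": "...", "value": "..."}`;

// Models often wrap JSON in markdown fences or add chatter, so
// extract the first {...} span defensively before parsing.
function parseIntent(modelReply) {
  const match = modelReply.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null; // malformed JSON: treat as "no intent recognized"
  }
}
```

Returning `null` instead of throwing lets the caller fall back to re-prompting or asking the user to repeat, which matters when even a small error rate can break trust.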
What's next for Voxium
- Precision Tuning: Improving intent accuracy through advanced prompt engineering.
- Custom Training: Allowing users to "teach" Voxium specific routines or nicknames for sites.
- Latency Reduction: Optimizing the speech-to-action pipeline for near-instant response times.
- Accessibility Presets: Creating profiles tailored to specific disability needs.
Built With
- cometapi
- gemini
- html
- javascript
- json
- minimax-m2.5