Sensus
Inspiration
Sensus was inspired by a simple question: what would a truly voice-first computer experience look like for someone who cannot rely on a screen? Most assistants can answer questions, but they struggle with real desktop tasks. We wanted to build something that could actually operate a Linux machine, navigate the web, and keep context across sessions in a way that feels practical for visually impaired users.
What it does
Sensus is a voice-first Ubuntu assistant that can:
- Understand spoken commands and route them to the right capability (browser, desktop actions, shell, shortcuts, or coding); a toy routing sketch follows this list.
- Control Firefox through automation for real web tasks (navigation, clicking, downloads, and accessibility checks).
- Show a lightweight top-right overlay UI for live interaction, status, and session history.
- Speak responses naturally with tuned TTS buffering for smoother playback.
- Optionally persist sessions/messages in IBM Db2 and generate session summaries for quick recall.
- Use multimodal vision models for screenshot-based understanding when needed.
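To make the first item concrete, here is a toy routing sketch. This is not Sensus's actual logic, which is model-driven tool calling (see "How we built it"); the handler names and keywords are hypothetical placeholders.

```python
# Toy capability router: a stand-in for Sensus's model-driven tool selection.
# Handler names are hypothetical; the real handlers live in the agent modules.
from typing import Callable

def browser_agent(cmd: str) -> str:
    return f"[browser] {cmd}"

def desktop_agent(cmd: str) -> str:
    return f"[desktop] {cmd}"

def shell_agent(cmd: str) -> str:
    return f"[shell] {cmd}"

KEYWORDS: dict[Callable[[str], str], tuple[str, ...]] = {
    browser_agent: ("open", "search", "click", "download"),
    shell_agent: ("run", "install", "delete"),
}

def route(transcript: str) -> str:
    """Send a transcribed command to the first matching capability."""
    text = transcript.lower()
    for handler, words in KEYWORDS.items():
        if any(w in text for w in words):
            return handler(transcript)
    return desktop_agent(transcript)  # fallback capability

print(route("open firefox and read the first headline"))
```

In the real system the model picks a tool; this keyword dispatch only illustrates the shape of the fan-out.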
How we built it
We built Sensus as a modular Python system; short sketches of several of these pieces follow the list:
- Orchestrator: OpenAI-compatible model routing and tool-calling logic for deciding what action to take.
- Voice stack: STT + TTS pipeline with real-time streaming and voice-activity-detection (VAD) style handling.
- Agents: Separate modules for browser automation, coding tasks, desktop actions, and shortcuts.
- Overlay: GTK/WebKit overlay window pinned in the corner for a persistent, non-intrusive UI.
- Storage: Optional IBM Db2-backed session/message persistence.
- Infra: Environment-driven config (.env) for model selection, timeouts, browser behavior, and audio tuning.
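A hedged sketch of the orchestrator's tool-calling round trip against an OpenAI-compatible endpoint; the base URL, model name, and `open_url` tool schema are placeholders, not Sensus's real configuration.

```python
# Sketch of an OpenAI-compatible tool-calling round trip.
# base_url, model name, and the open_url tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "open_url",
        "description": "Open a URL in the controlled Firefox instance.",
        "parameters": {
            "type": "object",
            "properties": {"url": {"type": "string"}},
            "required": ["url"],
        },
    },
}]

resp = client.chat.completions.create(
    model="example-model",  # chosen via .env in Sensus
    messages=[{"role": "user", "content": "Open the Devpost homepage"}],
    tools=tools,
)

for call in resp.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(f"dispatching {call.function.name} with {args}")  # -> browser agent
```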
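The overlay idea in miniature with GTK 3 via PyGObject; the real overlay embeds WebKit, and the geometry here is illustrative. On Wayland the keep-above hint may be ignored, which is one of the challenges below.

```python
# Sketch of a frameless, pinned top-right overlay window with GTK 3.
# Geometry and label are illustrative; Wayland may ignore keep-above hints.
import gi
gi.require_version("Gtk", "3.0")
from gi.repository import Gtk, Gdk

win = Gtk.Window(title="Sensus")
win.set_decorated(False)          # frameless overlay
win.set_keep_above(True)          # always-on-top hint (X11)
win.set_default_size(320, 180)
win.add(Gtk.Label(label="Sensus: listening…"))
win.connect("destroy", Gtk.main_quit)
win.show_all()

# Pin to the top-right corner of the primary monitor.
geo = Gdk.Display.get_default().get_primary_monitor().get_geometry()
win.move(geo.x + geo.width - 330, geo.y + 10)

Gtk.main()
```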
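For the optional Db2 persistence, a minimal sketch with the `ibm_db` driver; the DSN fields and table layout are assumptions, not Sensus's actual schema.

```python
# Sketch of message persistence with the ibm_db driver.
# DSN fields and the table layout are assumptions, not Sensus's schema.
import ibm_db

dsn = ("DATABASE=SENSUS;HOSTNAME=localhost;PORT=50000;"
       "PROTOCOL=TCPIP;UID=db2user;PWD=secret;")
conn = ibm_db.connect(dsn, "", "")

stmt = ibm_db.prepare(
    conn, "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)"
)
ibm_db.execute(stmt, ("session-001", "user", "open firefox"))

ibm_db.close(conn)
```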
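And a sketch of the environment-driven config pattern; the variable names are illustrative, not the ones Sensus actually reads.

```python
# Sketch of environment-driven configuration; variable names are illustrative.
import os
from dotenv import load_dotenv

load_dotenv()  # pull settings from a .env file in the project root

MODEL_NAME = os.getenv("SENSUS_MODEL", "example-model")
TTS_BUFFER_MS = int(os.getenv("SENSUS_TTS_BUFFER_MS", "200"))
BROWSER_HEADLESS = os.getenv("SENSUS_BROWSER_HEADLESS", "false").lower() == "true"
```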
Challenges we ran into
- Making voice interactions reliable under real-world latency and varied model response times.
- Handling Linux display/server differences (X11 vs Wayland), especially for pinned always-on-top overlays; a detection sketch follows this list.
- Keeping browser automation robust when websites throw modal overlays, dynamic DOM changes, and download edge cases.
- Avoiding audio glitches (underruns/static) in TTS streaming; a buffering sketch also follows this list.
- Balancing a powerful tool-calling assistant with safe, deterministic behavior.
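For the display-server challenge, a best-effort detection sketch using standard environment variables; Sensus's actual handling may differ.

```python
# Sketch: detect the session type so overlay hints can be chosen accordingly.
import os

def display_server() -> str:
    """Best-effort guess at X11 vs Wayland from standard env vars."""
    if os.environ.get("WAYLAND_DISPLAY"):
        return "wayland"
    if os.environ.get("XDG_SESSION_TYPE", "").lower() == "wayland":
        return "wayland"
    if os.environ.get("DISPLAY"):
        return "x11"
    return "unknown"

print(display_server())  # e.g. pick keep-above vs alternative pinning strategies
```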
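And a toy illustration of the pre-buffering idea behind avoiding TTS underruns; the chunk counts and sizes here are made up, not Sensus's tuned values.

```python
# Toy pre-buffering loop against TTS underruns; sizes are illustrative.
import queue

PREBUFFER_CHUNKS = 4  # hold ~4 chunks before playback starts

def play_stream(chunks, audio_out):
    """Fill a small buffer first, then drain steadily as chunks arrive."""
    buf: "queue.Queue[bytes]" = queue.Queue()
    started = False
    for chunk in chunks:          # chunks arrive from the TTS engine
        buf.put(chunk)
        if not started and buf.qsize() >= PREBUFFER_CHUNKS:
            started = True        # enough headroom to survive jitter
        while started and not buf.empty():
            audio_out(buf.get())
    while not buf.empty():        # flush the tail after the stream ends
        audio_out(buf.get())

play_stream([b"\x00" * 640] * 10, lambda b: print(f"write {len(b)} bytes"))
```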
Accomplishments that we're proud of
- A working end-to-end voice-first assistant that can execute meaningful computer tasks.
- A clean tool-routing architecture that makes the assistant extensible.
- A practical accessibility-first overlay experience with session history and chat continuity.
- Integration of multimodal + browser + system actions in one cohesive UX.
- Real persistence support (Db2) for session memory beyond a single process.
What we learned
- Accessibility is not a "feature"; it has to shape every architecture decision.
- Reliability beats novelty in voice UX: buffering, retries, and fallbacks matter more than flashy demos.
- Tool-using agents need strong prompting constraints and clear execution boundaries.
- Cross-environment Linux behavior can be the hardest engineering problem in UI/system automation projects.
- Iterating with real usage scenarios exposes edge cases faster than synthetic tests.
What's next for Sensus
- Improve conversational memory and personalization across longer time horizons.
- Expand desktop integrations (more apps, richer system controls, and tighter shortcut workflows).
- Add stronger safety/confirmation layers for high-impact actions (one possible shape is sketched after this list).
- Improve onboarding and deployment so non-technical users can install and run Sensus quickly.
- Continue hardening browser + vision reliability for real-world websites.
- Run user testing with visually impaired users and prioritize roadmap items from direct feedback.
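As one possible shape for the planned confirmation layer; the action names and the `confirm` callback are hypothetical, not an implemented API.

```python
# One possible confirmation gate; action names and the confirm hook are hypothetical.
HIGH_IMPACT = {"delete_file", "send_email", "run_shell"}

def execute(action: str, args: dict, confirm) -> str:
    """Require an explicit 'yes' (e.g. spoken) before destructive actions run."""
    if action in HIGH_IMPACT and not confirm(f"Really {action} with {args}?"):
        return "cancelled"
    return f"executed {action}"

print(execute("run_shell", {"cmd": "rm -rf build"},
              confirm=lambda q: input(q + " [yes/no] ").strip() == "yes"))
```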
Built With
- featherlessai
- html
- ibm
- python
- react
- typescript

