Sensus

Inspiration

Sensus was inspired by a simple question: what would a truly voice-first computer experience look like for someone who cannot rely on a screen? Most assistants can answer questions, but they struggle with real desktop tasks. We wanted to build something that could actually operate a Linux machine, navigate the web, and keep context across sessions in a way that feels practical for visually impaired users.

What it does

Sensus is a voice-first Ubuntu assistant that can:

  • Understand spoken commands and route them to the right capability (browser, desktop actions, shell, shortcuts, or coding).
  • Control Firefox through automation for real web tasks (navigation, clicking, downloads, and accessibility checks).
  • Show a lightweight top-right overlay UI for live interaction, status, and session history.
  • Speak responses naturally with tuned TTS buffering for smoother playback.
  • Optionally persist sessions/messages in IBM Db2 and generate session summaries for quick recall.
  • Use multimodal vision models for screenshot-based understanding when needed.

How we built it

We built Sensus as a modular Python system:

  • Orchestrator: OpenAI-compatible model routing and tool-calling logic for deciding which action to take (sketch below).
  • Voice stack: STT + TTS pipeline with real-time streaming and voice-activity-detection (VAD)-style handling.
  • Agents: Separate modules for browser automation, coding tasks, desktop actions, and shortcuts.
  • Overlay: GTK/WebKit overlay window pinned in the corner for a persistent, non-intrusive UI (sketch below).
  • Storage: Optional IBM Db2-backed session/message persistence (sketch below).
  • Infra: Environment-driven config (.env) for model selection, timeouts, browser behavior, and audio tuning (sketch below).
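
As a rough illustration of the routing idea (not Sensus's actual interface), here is a minimal OpenAI-compatible tool-calling loop. The tool schema, model name, and dispatch string are placeholders:

    # Hedged sketch: route a transcribed command via OpenAI-compatible tool calling.
    import json
    from openai import OpenAI

    client = OpenAI()  # honors OPENAI_BASE_URL/OPENAI_API_KEY for compatible servers

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "browser_action",  # placeholder capability
            "description": "Navigate, click, or download in Firefox.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }]

    def route(transcript: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; Sensus reads the model from .env
            messages=[{"role": "user", "content": transcript}],
            tools=TOOLS,
        )
        msg = resp.choices[0].message
        if msg.tool_calls:  # the model picked a capability
            call = msg.tool_calls[0]
            args = json.loads(call.function.arguments)
            return f"dispatch {call.function.name} with {args}"
        return msg.content or ""  # plain answer: hand it to TTS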
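
The overlay itself is essentially a borderless GTK window hosting a WebKit view, kept above other windows and moved to the top-right corner. A sketch under assumptions: GTK 3, WebKit2GTK (whether you require "4.0" or "4.1" depends on the distro), and an X11 session; on Wayland, keep-above and move() hints are often ignored, which is the portability issue noted under Challenges:

    # Hedged sketch of a pinned top-right overlay with GTK3 + WebKit2GTK.
    import gi
    gi.require_version("Gtk", "3.0")
    gi.require_version("WebKit2", "4.1")  # may need "4.0" on older distros
    from gi.repository import Gtk, WebKit2

    win = Gtk.Window()
    win.set_decorated(False)         # no title bar
    win.set_keep_above(True)         # stay on top (X11; Wayland may ignore this)
    win.set_skip_taskbar_hint(True)
    win.set_default_size(360, 480)

    view = WebKit2.WebView()
    view.load_uri("file:///opt/sensus/overlay.html")  # placeholder UI page
    win.add(view)

    # Pin to the top-right corner of the primary monitor.
    screen = win.get_screen()
    geom = screen.get_monitor_geometry(screen.get_primary_monitor())
    win.move(geom.x + geom.width - 360 - 16, geom.y + 16)

    win.connect("destroy", Gtk.main_quit)
    win.show_all()
    Gtk.main()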
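
Persistence can be as small as one table and a parameterized insert through the ibm_db driver. The connection string and schema below are placeholders, not Sensus's real layout:

    # Hedged sketch of Db2-backed message persistence via the ibm_db driver.
    import ibm_db

    conn = ibm_db.connect(
        "DATABASE=SENSUS;HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;"
        "UID=db2inst1;PWD=changeme",  # placeholder credentials
        "", "",
    )

    def save_message(session_id: str, role: str, content: str) -> None:
        stmt = ibm_db.prepare(
            conn,
            "INSERT INTO MESSAGES (SESSION_ID, ROLE, CONTENT) VALUES (?, ?, ?)",
        )
        ibm_db.execute(stmt, (session_id, role, content))

    save_message("demo-session", "user", "open firefox and check my email")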
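
For the environment-driven config, a .env file is loaded once at startup and every knob gets a default. The key names here are illustrative, not the actual variables Sensus reads:

    # Hedged sketch of .env-driven configuration (python-dotenv assumed).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads .env from the working directory

    MODEL_NAME    = os.getenv("SENSUS_MODEL", "gpt-4o-mini")
    TOOL_TIMEOUT  = float(os.getenv("SENSUS_TOOL_TIMEOUT_S", "30"))
    HEADLESS      = os.getenv("SENSUS_BROWSER_HEADLESS", "false").lower() == "true"
    TTS_PREBUFFER = int(os.getenv("SENSUS_TTS_PREBUFFER_CHUNKS", "4"))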

Challenges we ran into

  • Making voice interactions reliable under real-world latency and varied model response times.
  • Handling Linux display/server differences (X11 vs Wayland), especially for pinned always-on-top overlays.
  • Keeping browser automation robust when websites throw up modal overlays, dynamic DOM changes, and download edge cases (see the retry sketch below).
  • Avoiding audio glitches (underruns/static) in TTS streaming (see the buffering sketch below).
  • Balancing a powerful tool-calling assistant with safe, deterministic behavior.
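
For the browser-robustness problem, the pattern that paid off was "try, dismiss whatever is covering the page, retry." A hedged sketch assuming a Playwright-style driver (Playwright can drive Firefox); the dismiss selectors are illustrative:

    # Hedged sketch of retry-with-dismissal for flaky pages.
    from playwright.sync_api import TimeoutError as PWTimeout

    def robust_click(page, selector: str, attempts: int = 3) -> bool:
        for _ in range(attempts):
            try:
                page.click(selector, timeout=5_000)
                return True
            except PWTimeout:
                # A cookie banner or modal may be intercepting the click.
                for dismiss in ("[aria-label='Close']", "button:has-text('Accept')"):
                    try:
                        page.click(dismiss, timeout=1_000)
                    except PWTimeout:
                        pass
        return False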
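
The audio-glitch fix came down to never letting the output stream starve: accumulate a few chunks before playback begins. A minimal sketch, assuming 16-bit mono PCM at 24 kHz, the sounddevice library, and at least PREBUFFER_CHUNKS chunks per utterance:

    # Hedged sketch: pre-buffer TTS chunks so playback never starts starved.
    import queue
    import sounddevice as sd

    SAMPLE_RATE = 24_000   # match your TTS engine's output rate
    PREBUFFER_CHUNKS = 4   # chunks to accumulate before opening the stream

    def play_stream(chunks: "queue.Queue[bytes]") -> None:
        # An empty buffer at stream start is what produces the underruns
        # and static heard in naive chunk-by-chunk playback.
        backlog = [chunks.get() for _ in range(PREBUFFER_CHUNKS)]
        with sd.RawOutputStream(samplerate=SAMPLE_RATE, channels=1,
                                dtype="int16") as out:
            for chunk in backlog:
                out.write(chunk)
            while (chunk := chunks.get()) is not None:  # None = end of utterance
                out.write(chunk)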

Accomplishments that we're proud of

  • A working end-to-end voice-first assistant that can execute meaningful computer tasks.
  • A clean tool-routing architecture that makes the assistant extensible.
  • A practical accessibility-first overlay experience with session history and chat continuity.
  • Integration of multimodal + browser + system actions in one cohesive UX.
  • Real persistence support (Db2) for session memory beyond a single process.

What we learned

  • Accessibility is not a "feature"; it has to shape every architecture decision.
  • Reliability beats novelty in voice UX: buffering, retries, and fallbacks matter more than flashy demos.
  • Tool-using agents need strong prompting constraints and clear execution boundaries.
  • Cross-environment Linux behavior can be the hardest engineering problem in UI/system automation projects.
  • Iterating with real usage scenarios exposes edge cases faster than synthetic tests.

What's next for Sensus

  • Improve conversational memory and personalization across longer time horizons.
  • Expand desktop integrations (more apps, richer system controls, and tighter shortcut workflows).
  • Add stronger safety/confirmation layers for high-impact actions.
  • Improve onboarding and deployment so non-technical users can install and run Sensus quickly.
  • Continue hardening browser + vision reliability for real-world websites.
  • Run user testing with visually impaired users and prioritize roadmap items from direct feedback.
