Nova | Devpost

Nova is a modern, accessibility-focused desktop overlay application that uses AI to help people with disabilities operate their devices.
A flow diagram showing how our AI pipeline enables smart and efficient tool calling.

Video

https://drive.google.com/drive/folders/1AfV3oLZJrQtkFLiQNcOpjn8-mKV0oWNd?usp=drive_link

Inspiration

In an increasingly digital world, access to a computer is access to opportunity. Yet, for millions of people, this access is severely limited. An estimated 25 million adults in the U.S. alone have a mobility disability that makes using a traditional keyboard and mouse difficult or impossible. This "digital divide" creates significant barriers to employment, education, and social connection, a problem highlighted by the fact that the unemployment rate for persons with a disability is consistently double that of those without.

Existing assistive technologies, while well-intentioned, are often clunky, unreliable, and lack the deep integration needed for modern, complex workflows. We envisioned a better way: a tool that doesn't just offer basic commands, but provides a fluid, secure, and intelligent way to control an entire digital environment with the one tool many can still use: their voice. We built Nova to be that solution.

What it does

Nova is a voice-native desktop agent that acts as a seamless, hands-free bridge between a user and their computer. It lives in an unobtrusive floating bar, always ready to translate natural language commands into complex actions.

Complete Hands-Free Control: Nova can control the entire operating system. Users can open applications, manage files, control the mouse, and type text, all with voice commands. It also features deep browser automation for navigating the web, clicking elements, and filling out forms.
Secure Voice Authentication: Nova isn't just a voice-controlled tool; it's a secure one. Using an innovative "VoiceGate" system, it biometrically verifies the user's voice before executing sensitive commands. This ensures only the enrolled user can access personal files or control the OS, preventing unauthorized use.
LLM-Powered Planning: At its core, Nova is powered by a large language model that acts as a reasoning engine. It understands complex, multi-step natural language commands (e.g., "Open the browser, search for today's weather, and then type the result into a new text file") and generates an executable plan.
Dynamic Tool Creation: When faced with a novel task it doesn't have a tool for, Nova's AI planner can write, register, and execute new Python tools on the fly, allowing it to constantly adapt and learn new skills.
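
To make the dynamic tool idea concrete, here is a minimal sketch of how generated Python source could be registered as a callable tool at runtime. The ToolRegistry class, its method names, and the word_count example are hypothetical and chosen only for illustration; they are not taken from Nova's codebase. Note that a bare exec() is not a sandbox, so a real system would need to run generated code in an isolated environment.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Hypothetical registry mapping tool names to Python callables."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def has(self, name: str) -> bool:
        return name in self._tools

    def register_from_source(self, name: str, source: str) -> None:
        # Run the generated source in a fresh namespace, then look up
        # the function that is expected to share the tool's name.
        # WARNING: exec() provides no isolation; this is illustration only.
        namespace: Dict[str, Any] = {}
        exec(source, namespace)
        fn = namespace.get(name)
        if not callable(fn):
            raise ValueError(f"source does not define a callable named {name!r}")
        self.register(name, fn)

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)


# Stand-in for source text that an LLM planner might generate.
GENERATED_SOURCE = '''
def word_count(text):
    return len(text.split())
'''

registry = ToolRegistry()
registry.register_from_source("word_count", GENERATED_SOURCE)
result = registry.call("word_count", text="open the browser")
print(result)
```

Looking tools up by name keeps the planner's output (a tool name plus arguments) decoupled from how each tool is implemented, which is what lets new tools be added without changing the execution loop.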

How we built it

Nova is built on a robust, real-time architecture that integrates multiple AI systems into a cohesive agent.

Core Backend: The backend is built in Python, using WebSockets to maintain a persistent, low-latency connection with the frontend for audio streaming and event messaging.
Agentic Loop: The core logic resides in a SessionWorker that orchestrates the entire process:
1. Voice Activity Detection (VAD): Detects when the user is speaking.
2. Wakeword Detection: Listens for the "Hey Nova" wakeword.
3. ASR: Transcribes the user's speech to text using Whisper.
4. Planning: Sends the transcribed command to Gemini, which returns a structured plan of tool calls.
5. Execution: Executes the plan, passing each step through the VoiceGate for speaker verification before running the corresponding tool.
Tooling: The agent's capabilities are provided by a modular toolset, including Playwright for browser automation and PyAutoGUI for native OS control.
Frontend: The floating UI is a translucent Electron application built with React and TypeScript, providing a modern, cross-platform user experience.
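
The planning and execution steps of the loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual SessionWorker: the Step dataclass, the execute_plan function, the sensitive flag, and the gate callback are all hypothetical names, and the plan is hard-coded where a real system would parse it from the LLM's structured response.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One tool call in a plan (hypothetical structure)."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)
    sensitive: bool = False


def execute_plan(
    plan: List[Step],
    tools: Dict[str, Callable[..., Any]],
    gate: Callable[[Step], bool],
) -> List[Dict[str, Any]]:
    """Run steps in order; stop at the first step the gate rejects."""
    results: List[Dict[str, Any]] = []
    for step in plan:
        # Every step passes through the gate before its tool runs.
        if not gate(step):
            results.append({"tool": step.tool, "status": "blocked"})
            break
        output = tools[step.tool](**step.args)
        results.append({"tool": step.tool, "status": "ok", "output": output})
    return results


# Stand-in tools; real ones would wrap browser or OS automation.
tools = {
    "open_app": lambda name: f"opened {name}",
    "type_text": lambda text: f"typed {text}",
}

# Stand-in for a plan returned by the planner.
plan = [
    Step("open_app", {"name": "notepad"}),
    Step("type_text", {"text": "hello"}, sensitive=True),
]


def gate_unverified(step: Step) -> bool:
    # Simulates a failed speaker check: only non-sensitive steps pass.
    return not step.sensitive


results = execute_plan(plan, tools, gate_unverified)
for r in results:
    print(r)
```

Stopping at the first blocked step is one possible policy; it avoids running later steps that may depend on the rejected one.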

Challenges we ran into

End-to-End Latency: Creating a responsive feel required minimizing latency across the entire pipeline: audio capture -> WebSocket transit -> VAD -> ASR -> LLM inference -> execution. Every millisecond counted, and we spent significant time optimizing the audio processing and communication loop.
Reliable Voice Authentication: Building the VoiceGate was a delicate balance. It needed to be secure enough to prevent false acceptances but not so strict that it would frequently reject the legitimate user (false rejections). This required careful tuning of the speaker verification model's sensitivity thresholds.
Agent Reliability and Safety: Giving an AI agent the ability to write and execute its own code is incredibly powerful but introduces significant safety challenges. We had to design a sandboxed environment for dynamic tool execution and craft precise system prompts for the LLM to ensure it generated safe, functional, and correct code.
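
The trade-off between false acceptances and false rejections can be illustrated with a small threshold sweep. This is a sketch assuming the speaker verification model outputs a similarity score where higher means "more likely the enrolled user"; the function names are hypothetical and the scores below are made-up illustrative values, not measurements from Nova.

```python
from typing import List, Tuple


def error_rates(
    genuine: List[float], impostor: List[float], threshold: float
) -> Tuple[float, float]:
    """Return (false acceptance rate, false rejection rate).

    A sample is accepted when its score is >= threshold.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def pick_threshold(
    genuine: List[float], impostor: List[float], candidates: List[float]
) -> float:
    """Pick the candidate where FAR and FRR are closest to each other."""
    def gap(t: float) -> float:
        far, frr = error_rates(genuine, impostor, t)
        return abs(far - frr)

    return min(candidates, key=gap)


# Illustrative scores only (not real data).
genuine_scores = [0.82, 0.77, 0.91, 0.68, 0.88]   # enrolled user
impostor_scores = [0.31, 0.45, 0.52, 0.71, 0.28]  # other speakers
candidates = [0.4, 0.5, 0.6, 0.7, 0.8]

for t in candidates:
    far, frr = error_rates(genuine_scores, impostor_scores, t)
    print(f"threshold={t:.1f}  FAR={far:.2f}  FRR={frr:.2f}")

best = pick_threshold(genuine_scores, impostor_scores, candidates)
print("balanced threshold:", best)
```

Balancing the two rates is only one criterion. A security-focused gate might instead choose the lowest threshold that keeps the false acceptance rate under a fixed limit and accept more false rejections in exchange.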

Accomplishments that we're proud of

The VoiceGate Security Model: We successfully implemented a system that gates sensitive actions behind biometric voice authentication. This is a critical innovation that builds the trust necessary for an agent with this level of system access.
A Truly Agentic, Learning System: Nova is more than a command-and-control program. By enabling the LLM to dynamically create its own tools, we've built a system that can learn and adapt to user needs, solving problems it wasn't explicitly programmed to handle.
A Fluid, Hands-Free Workflow: We are proud to have integrated multiple complex AI systems (ASR, LLM, Speaker ID) into a single, cohesive experience that allows users to perform complex digital tasks with nothing but their voice.

What we learned

Agentic Architecture is the Future: A simple intent-to-action model is insufficient for a true assistant. The "Plan-then-Execute" loop, combined with a flexible toolset, is a far more powerful and scalable paradigm.
Security Must Be Foundational: For any agent that interacts with the real world or personal data, security cannot be an afterthought. The VoiceGate concept shaped our entire execution architecture from the ground up.
The Frontend is the Agent: A user's experience is defined by the interface. Making the floating bar intuitive, responsive, and informative was just as crucial as perfecting the backend AI.

What's next for Nova

Multimodal Understanding: The next frontier is giving Nova eyes. By integrating a vision model (like Gemini Pro Vision), we will enable commands like "Click the blue button on the right" or "Read the error message in that pop-up." This will make the agent far more intuitive and capable, because users will be able to refer to what is on screen instead of naming elements precisely.
Expanded Tool Library: We plan to pre-build a rich library of integrations for popular applications like Slack, VS Code, Figma, and more, making Nova an indispensable productivity tool out of the box.
On-Device Processing: We are exploring the use of smaller, optimized on-device models for tasks like VAD, ASR, and even simple intent recognition. This would further reduce latency, improve privacy, and allow for functionality even when offline.

Built With

ai
electron
gemini
python
pytorch
react
tailwindcss
typescript
whisper

Updates

Aman Meherally started this project — Oct 04, 2025 11:38 PM EDT
