Inspiration
We wanted to rethink how people interact with their computers. Between endless tabs, files, and tasks, the way we use technology hasn’t really caught up with how we think. NAVI came from the idea of building something that feels more like a digital partner — one that listens, sees, and acts on intent. We also wanted to make computing more accessible for users with limited mobility.
What it does
NAVI (Neural Agent for Visual Interaction) is a hands-free AI assistant that lets you control your desktop with voice and gestures. You can say things like “Hey NAVI, summarize this PDF and email it to Dr. Smith,” and it will handle every step — reading the file, writing the summary, and sending the message. It combines computer vision, speech recognition, and reasoning to plan and complete multi-step workflows naturally.
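To make the "multi-step workflow" idea concrete, here is a minimal sketch of how a single utterance can be decomposed into an ordered plan of tool calls. This is a toy rule-based planner for illustration only; in NAVI the reasoning model (Nemotron) does the planning, and the tool names (`read_pdf`, `summarize`, `send_email`) are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str   # name of the tool/API to invoke
    args: dict  # arguments resolved from the utterance or earlier steps


def plan(utterance: str) -> list[Step]:
    """Toy rule-based planner; a real system would use an LLM for this."""
    steps: list[Step] = []
    text = utterance.lower()
    if "summarize" in text and "pdf" in text:
        steps.append(Step("read_pdf", {"target": "active_window"}))
        steps.append(Step("summarize", {"source": "previous_output"}))
    if "email it to " in text:
        recipient = text.split("email it to ")[-1].rstrip(".?! ")
        steps.append(Step("send_email", {"to": recipient, "body": "previous_output"}))
    return steps


if __name__ == "__main__":
    for step in plan("Hey NAVI, summarize this PDF and email it to Dr. Smith"):
        print(step.tool, step.args)
```

Each step's output feeds the next via the `previous_output` placeholder, which is the essence of chaining actions rather than executing one command at a time.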
How we built it
We built NAVI with Electron and React for the desktop interface, integrating NVIDIA’s Nemotron for reasoning, Riva for low-latency speech recognition, and TensorRT for gesture detection and GPU-accelerated pose estimation. The system connects to APIs like Gmail, Google Drive, and Notion, allowing NAVI to automate real desktop workflows. We used ElevenLabs for the voice system to make responses sound natural and immediate.
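One pattern that keeps the reasoning layer independent of any single integration (Gmail, Drive, Notion) is a tool registry: each integration registers itself under a name, and the planner dispatches by name. The sketch below is an assumed design, not NAVI's actual code; `send_email` is stubbed rather than calling the real Gmail API.

```python
from typing import Callable

# Registry mapping tool names to callables; integrations add themselves here.
TOOLS: dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function as a callable tool."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("send_email")
def send_email(to: str, body: str) -> str:
    # A real implementation would call the Gmail API; stubbed for illustration.
    return f"sent to {to}: {body[:40]}"


def dispatch(name: str, **kwargs) -> str:
    """Invoke a registered tool by name, as a planner would."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```

Adding a new integration then means registering one more function, with no changes to the planner itself.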
Challenges we ran into
One of the biggest challenges was synchronizing multiple AI pipelines — making sure voice, vision, and reasoning worked together in real time without lag. We also had to manage OS-level permissions for automation while keeping the system secure. Getting gesture tracking stable across different lighting conditions was another unexpected hurdle.
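The synchronization problem above is essentially a staged pipeline: each modality runs in its own worker and hands results downstream without blocking the others. Here is a minimal queue-and-thread sketch of that shape, with stand-ins for the ASR and reasoning stages (the real stages would wrap Riva and Nemotron; all names here are illustrative).

```python
import queue
import threading


def asr_stage(audio_in: queue.Queue, text_out: queue.Queue) -> None:
    """Stand-in for the speech stage: turns audio chunks into transcripts."""
    while (chunk := audio_in.get()) is not None:
        text_out.put(f"transcript:{chunk}")
    text_out.put(None)  # propagate shutdown downstream


def reasoning_stage(text_in: queue.Queue, action_out: queue.Queue) -> None:
    """Stand-in for the reasoning stage: turns transcripts into actions."""
    while (text := text_in.get()) is not None:
        action_out.put(text.replace("transcript:", "action:"))
    action_out.put(None)


def run_pipeline(audio_chunks: list[str]) -> list[str]:
    """Feed audio chunks through both stages and collect resulting actions."""
    audio_q, text_q, action_q = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=asr_stage, args=(audio_q, text_q)),
        threading.Thread(target=reasoning_stage, args=(text_q, action_q)),
    ]
    for t in threads:
        t.start()
    for chunk in audio_chunks:
        audio_q.put(chunk)
    audio_q.put(None)  # signal end of input
    actions = []
    while (action := action_q.get()) is not None:
        actions.append(action)
    for t in threads:
        t.join()
    return actions
```

Bounded queues between stages are what keep one slow stage (e.g. vision under bad lighting) from stalling the whole loop — a faster stage simply backs off instead of blocking everything.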
Accomplishments that we're proud of
We’re proud that NAVI doesn’t just execute commands — it actually understands intent and chains actions together. Seeing it complete an entire workflow hands-free was a huge moment. We also got real-time gesture control running at a smooth frame rate with TensorRT acceleration.
What we learned
We learned a lot about multimodal systems — how complex it is to merge language, vision, and reasoning in one pipeline. We also gained a deeper appreciation for efficient GPU inference and how critical latency management is when building real-time AI experiences.
What's next for NAVI
We want to integrate deeper OS automation, expand to mobile platforms, and refine gesture recognition for more precise control. Long term, we see NAVI evolving into a full personal computing agent — one that not only executes tasks but understands context, goals, and routines.
