Inspiration
We were originally inspired by Tony Stark's AI, J.A.R.V.I.S., which seemed like a huge productivity booster that could help everyone in their daily lives. However, we realized that Claude for Chrome can already behave essentially like J.A.R.V.I.S., so we decided to focus our efforts on maximizing focus rather than task completion. Even though everyone has access to state-of-the-art AI models nowadays, the distractions of the internet still lead many astray; this was an issue affecting both of us, so we set out to solve it.
What it does
The agent runs in the background on a user's computer and is activated with the wake word "Soren". Upon activation, the user can tell the agent to open a tab, navigate to a website, or activate focus mode. In focus mode, the agent automatically blocks apps that are not in line with the task the user is working on, and continues to do so until the user exits focus mode. Upon exiting, the agent can also reopen previously closed tabs.
How we built it
This project, JarvisAi, is a voice-controlled productivity assistant built in Python that combines local system automation with cloud-based AI intelligence. The architecture is an always-on background agent that listens for a wake word ("Jarvis") using the Porcupine engine (pvporcupine), which is highly efficient for continuous listening. Upon activation, it switches to ElevenLabs to capture a natural-language command.

That command is sent to Google's Gemini 2.0 Flash model, which acts as the system's "brain." Instead of generating free-form text, the model is prompted to parse the user's intent into structured JSON (e.g., [{"action": "focus", "target": "coding"}]), so the application can programmatically understand complex requests like "Close YouTube and open GitHub." The application's "hands" are a custom AppleScript bridge (AppleScriptBrowserControl), which lets Python directly manipulate Chrome or Safari tabs (opening, closing, switching) via the subprocess module, avoiding heavy browser drivers like Selenium for these tasks.

A standout feature is the Focus Manager, which uses the LLM to semantically evaluate browser tabs. When the user sets a goal (e.g., "Learning React"), the agent sends the titles of all open tabs to Gemini, which scores each one for relevance to that goal. Tabs identified as distractions (like social media) are automatically closed or hidden, a practical application of AI reasoning to productivity.
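The intent-parsing step described above could be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the prompt wording, the action schema, and the helper names (INTENT_PROMPT, parse_intent) are assumptions, and the live model reply is replaced with a canned example.

```python
import json

# Hypothetical prompt asking the model to return ONLY a JSON action list.
INTENT_PROMPT = (
    "You are a desktop assistant. Convert the user's command into a JSON list "
    'of actions, e.g. [{"action": "close_tab", "target": "YouTube"}]. '
    "Allowed actions: open_tab, close_tab, switch_tab, focus. "
    "Return ONLY the JSON.\n\nCommand: {command}"
)

def parse_intent(model_reply: str) -> list[dict]:
    """Extract the JSON action list from the model's reply.

    Models sometimes wrap JSON in markdown fences, so strip those first.
    """
    text = model_reply.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)

# Canned example reply for "Close YouTube and open GitHub":
reply = (
    '[{"action": "close_tab", "target": "YouTube"}, '
    '{"action": "open_tab", "target": "https://github.com"}]'
)
actions = parse_intent(reply)
```

Keeping the model's output machine-parseable like this is what lets the agent dispatch each action to the AppleScript bridge instead of guessing from free-form text.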
Challenges we ran into
One big issue we had was the speed of the agent: we wanted the agent to respond to user commands quickly, but we also needed to make sure the model being called knew exactly what its task was, so we designed an efficient prompt structure for the API call and leveraged Google's Gemini 2.0 Flash model to reduce inference latency. However, simply calling the model faster wasn't enough; we initially faced severe UI freezing while the agent was "thinking." To fix this, we had to completely refactor our codebase to use qasync, allowing the PyQt6 graphical interface and the asyncio backend (handling the API calls and audio processing) to share the same event loop without blocking each other.
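The core of that fix is keeping slow, blocking work off the event loop that drives the UI. The sketch below shows the pattern with plain asyncio (in the real app, qasync makes the PyQt6 loop and the asyncio loop the same object); the function names and timings are illustrative stand-ins, not the project's code.

```python
import asyncio
import time

def blocking_llm_call(command: str) -> str:
    """Stand-in for a slow, blocking Gemini API call."""
    time.sleep(0.1)
    return f"parsed:{command}"

async def ui_heartbeat(ticks: list) -> None:
    """Simulates the GUI staying responsive while the model 'thinks'."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)

async def main() -> tuple[str, int]:
    ticks: list = []
    # asyncio.to_thread moves the blocking call off the event loop, so the
    # heartbeat keeps running concurrently instead of freezing.
    result, _ = await asyncio.gather(
        asyncio.to_thread(blocking_llm_call, "focus mode"),
        ui_heartbeat(ticks),
    )
    return result, len(ticks)

result, tick_count = asyncio.run(main())
```

Without `to_thread`, the sleep inside the API call would stall the loop and the "UI" would miss every tick, which is exactly the freezing behavior we saw before the refactor.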
Another major hurdle was the "AppleScript Bridge." While it's easy for an LLM to look at a list of tabs and say "Close YouTube," translating that intent into a precise AppleScript command that identifies and closes the correct tab index in a live Chrome window was incredibly finicky. We had to write robust error handling to manage race conditions where tabs might be moved or closed by the user manually while the AI was still processing.
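A minimal sketch of that bridge pattern, with the error handling collapsed to one guard: building the AppleScript string is kept separate from running it, so stale tab indices (the race condition above) fail gracefully. The helper names are hypothetical; the project's AppleScriptBrowserControl wraps similar commands.

```python
import subprocess
from typing import Optional

def close_tab_script(window: int, tab: int) -> str:
    """Build the AppleScript for closing one Chrome tab by index
    (hypothetical helper; indices are 1-based in AppleScript)."""
    return (
        f'tell application "Google Chrome" to '
        f'close tab {tab} of window {window}'
    )

def run_applescript(script: str) -> Optional[str]:
    """Run a script via osascript; return stdout, or None when the command
    fails, e.g. because the user already closed or moved the tab."""
    try:
        out = subprocess.run(
            ["osascript", "-e", script],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Stale tab/window index, AppleScript error, or not running on macOS.
        return None
```

Returning None instead of raising lets the agent re-fetch the live tab list and retry, rather than crashing mid-command when the browser state has shifted underneath it.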
Accomplishments that we're proud of
We are extremely proud that the agent takes voice commands while actively and dynamically monitoring every tab the user opens. This core functionality is the part we think will have the most impact on people's work, since it concretely helps users stay focused on their stated task.
What we learned
We learned to architect a "computer use" agent that functions as an OS-level assistant by integrating Google's Gemini 2.0 Flash model to reason about visual inputs (screenshots via PIL) and natural-language commands. This involves a feedback loop in which the AI analyzes the screen, decides on an action (clicking coordinates via pyautogui, typing text, or managing browser tabs via Playwright and CDP), and receives updated visual context, which requires robust prompt engineering and state management.

We also learned how to build a responsive, voice-activated asynchronous GUI by bridging modern async patterns with traditional desktop frameworks. We combined PyQt6 for custom, transparent overlays (drawing math-based animations) with asyncio and qasync to handle non-blocking I/O. This includes integrating pvporcupine for wake-word detection and faster_whisper for speech-to-text within a single event loop, so the application can listen, process complex AI logic, and update the UI simultaneously without freezing.

Finally, by implementing the "Focus Sentinel" logic that monitors scrolling and tab-switching patterns to detect distractions (doomscrolling) and proactively offer assistance, we learned how to actively tailor the agent's behavior to individual users.
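The tab-scoring side of the focus logic can be sketched as plumbing around the model call: build a prompt from the goal and tab titles, then threshold the scores Gemini returns. This is an illustrative sketch under assumed names (scoring_prompt, pick_distractions) and an assumed score scale; the live model reply is replaced with a canned example.

```python
import json

def scoring_prompt(goal: str, titles: list) -> str:
    """Hypothetical prompt asking the model to score each tab 0-10."""
    return (
        f"Goal: {goal}\n"
        "Rate each tab title 0-10 for relevance to the goal and return JSON "
        'like {"Title": score}.\n'
        f"Tabs: {json.dumps(titles)}"
    )

def pick_distractions(scores: dict, threshold: int = 4) -> list:
    """Tabs scoring below the threshold are treated as distractions."""
    return [title for title, score in scores.items() if score < threshold]

# Canned example reply for the goal "Learning React":
reply = (
    '{"React docs - Hooks": 9, '
    '"YouTube - cat videos": 1, '
    '"GitHub - my-app": 8}'
)
to_close = pick_distractions(json.loads(reply))
```

Delegating the semantic judgment to the LLM while keeping the close/keep decision as a simple local threshold makes the behavior easy to tune without re-prompting.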
What's next for JarvisAi
We want to store and train on the data Soren acquires in order to tune the agent to adapt to different types of work. We also want to improve decision speed and let users select the voice Soren speaks to them in.