Inspiration

For every developer and sci-fi fan, Iron Man's JARVIS has always been the ultimate dream. However, current AI tools like ChatGPT are limited: they are trapped inside a browser tab. They can't "see" what I am working on, they can't "control" my Spotify, and they certainly can't write code directly into my IDE.

I wanted to break this barrier. I wanted an assistant that lives on my desktop, not in a browser. With the release of Google's Gemini 3 Flash Preview, I realized that the latency is finally low enough to build a real-time, multimodal agent that can actually See, Hear, and Act.

What it does

JARVIS is a fully functional desktop agent that bridges the gap between Generative AI and Operating System control.

  • Multimodal Vision: I can show it a circuit board, a handwritten note, or a code error on my screen, and it analyzes the image using Gemini Vision.
  • AI Code Automation: I dictate logic (e.g., "Write a calculator"), and Jarvis generates the Python code, saves the file, and opens it in VS Code automatically.
  • Smart Screen Analysis: The "Explain my screen" feature takes a screenshot and debugs errors or summarizes articles visible on the desktop.
  • Gesture Control: Using computer vision, I can control media (Volume/Play/Pause) with hand gestures, Minority Report style.
  • Proactive Reasoning: Unlike passive bots, Jarvis checks my CPU usage and battery health to offer proactive advice.
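
The "AI Code Automation" flow above (generate code, save the file, open it in VS Code) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `save_generated_code` and `open_in_vscode` and the default filename are hypothetical, and it assumes VS Code's `code` command-line launcher is installed and on PATH.

```python
import shutil
import subprocess
from pathlib import Path


def save_generated_code(code: str, directory: Path, filename: str = "generated.py") -> Path:
    """Write model-generated source to disk and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / filename
    path.write_text(code, encoding="utf-8")
    return path


def open_in_vscode(path: Path) -> bool:
    """Open the file with the `code` launcher, if one is found on PATH.

    Returns False instead of raising when the launcher is not installed
    (assumption: VS Code was set up with its shell command enabled).
    """
    launcher = shutil.which("code")
    if launcher is None:
        return False
    # Popen returns immediately, so the assistant is not blocked by the editor.
    subprocess.Popen([launcher, str(path)])
    return True
```

A call such as `open_in_vscode(save_generated_code(model_reply, Path("jarvis_output")))` would then chain the two steps, where `model_reply` stands for the code returned by the model.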

How we built it

The core of JARVIS is written in Python.

  • The Brain: We utilized the google-genai SDK to tap into Gemini 3 Flash Preview and Gemini 1.5 Flash. We chose these models for their incredible speed and multimodal capabilities.
  • The Eyes: We integrated OpenCV and PyAutoGUI to capture webcam feeds and screenshots, which are then processed by Gemini's Vision capabilities.
  • The Ears & Voice: We used SpeechRecognition for input and pyttsx3 for offline text-to-speech to ensure low latency.
  • The Hands: We used Python's os and subprocess modules to allow Gemini to execute shell commands, open applications, and manage files.
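
Letting a model launch programs through subprocess is powerful, so one cautious way to wire up "The Hands" is to check every command against an allow-list before running it. The sketch below is an assumption about how this could be done, not a description of the project's implementation; `run_allowed`, `CommandNotAllowed`, and the allow-list itself are hypothetical names.

```python
import subprocess
from pathlib import Path


class CommandNotAllowed(Exception):
    """Raised when the requested program is not on the allow-list."""


def run_allowed(argv: list[str], allowed: set[str], timeout: float = 10.0) -> str:
    """Run argv without a shell, but only if its program name is allow-listed.

    Returns the captured standard output of the command.
    """
    if not argv:
        raise CommandNotAllowed("empty command")
    program = Path(argv[0]).name
    if program not in allowed:
        raise CommandNotAllowed(f"{program!r} is not on the allow-list")
    # Passing a list (shell=False by default) avoids shell injection
    # through model-generated strings.
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=False
    )
    return completed.stdout
```

Passing the command as an argument list rather than a single shell string keeps model output from being interpreted by a shell, at the cost of not supporting pipes or redirection.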

Challenges we ran into

  • Quota Limits (Error 429): Since Gemini 3 Flash is in preview, the rate limits are strict. We faced "Resource Exhausted" errors frequently during testing. We solved this by implementing a fallback mechanism to Gemini 1.5 Flash for stability while keeping Gemini 3 for complex reasoning tasks.
  • Hallucinations in Automation: Initially, the model would output conversational text along with code (e.g., "Here is your code..."). This broke the file-saving feature. We had to refine our system prompts to enforce strict output formats so the code could be executed directly.
  • Real-time Vision Latency: Processing video frames in real time was too resource-intensive. We optimized this by taking "snapshots" on command rather than analyzing a continuous stream, so the PC doesn't lag.
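
The model fallback described under "Quota Limits" can be sketched as a loop over models in order of preference. This is a simplified illustration under stated assumptions: `QuotaExceeded` is a stand-in for whatever error the SDK raises on HTTP 429, and `call_model` is a hypothetical callable that wraps the real API request.

```python
from typing import Callable, Sequence


class QuotaExceeded(Exception):
    """Stand-in for the SDK error raised on HTTP 429 (Resource Exhausted)."""


def generate_with_fallback(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in order; move on only when the quota is exhausted.

    Returns (model_name, reply) for the first model that answers.
    Any other exception propagates unchanged.
    """
    last_error = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except QuotaExceeded as exc:
            last_error = exc  # remember the failure and try the next model
    raise RuntimeError("all models are rate limited") from last_error
```

In this sketch the model list might be `["gemini-3-flash-preview", "gemini-1.5-flash"]`, with `call_model` translating the SDK's rate-limit error into `QuotaExceeded`; the exact model identifiers and error type are assumptions.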

Accomplishments that we're proud of

  • Seamless "Text-to-Action": Seeing Jarvis actually write a Python script and open it in VS Code just by listening to my voice was a magical moment.
  • Integration of Gemini 3: Successfully implementing the latest Preview model to make the assistant feel significantly faster than traditional chatbots.
  • Complex Gesture Control: Building a reliable hand-tracking system that works alongside the AI without crashing the application.

What we learned

  • Multimodal is the Future: Text is not enough. Giving AI "eyes" (Vision) changes the way we interact with computers entirely.
  • Prompt Engineering for Function Calling: We learned how to structure prompts so that the LLM behaves like a tool (producing structured data) rather than just a chatbot.
  • Error Handling: Building robust error handling for API timeouts and network issues is crucial for a desktop app.
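
To make the "LLM as a tool" idea concrete, here is a minimal sketch of validating a structured reply before acting on it. The JSON shape (`{"action": ..., "args": {...}}`) and the action names are hypothetical and chosen only for illustration; the project's real output format is not specified.

```python
import json
from dataclasses import dataclass, field

# Hypothetical action names; a real assistant would define its own set.
KNOWN_ACTIONS = {"open_app", "write_file", "set_volume"}


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


def parse_action(reply: str) -> Action:
    """Parse a model reply into an Action, rejecting anything off-format.

    Raises ValueError for conversational text, unknown actions,
    or malformed arguments.
    """
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("reply must be a JSON object")
    name = payload.get("action")
    args = payload.get("args", {})
    if not isinstance(name, str) or name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'args' must be a JSON object")
    return Action(name=name, args=args)
```

Rejecting a reply such as "Here is your code..." at this stage means a formatting slip from the model surfaces as a recoverable error instead of a broken file or command.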

What's next for JARVIS: Next-Gen Multimodal Assistant (Powered by Gemini 3)

  • Gemini Live API: We plan to upgrade from snapshot-based vision to real-time video streaming analysis using the new Live API.
  • IoT Integration: Connecting Jarvis to smart home devices (Lights, Fans) via Raspberry Pi.
  • Long-Term Memory: Implementing a vector database so Jarvis remembers conversations and project details from weeks ago.

Built With

  • gemini-3-flash-preview
  • google-gemini
  • google-genai-sdk
  • opencv
  • pyautogui
  • python
  • tkinter