JARVIS: Next-Gen Multimodal Assistant (Powered by Gemini 3)


Inspiration

For every developer and sci-fi fan, Iron Man's JARVIS has always been the ultimate dream. However, current AI tools like ChatGPT are limited because they are trapped inside a browser tab. They can't "see" what I am working on, they can't "control" my Spotify, and they certainly can't write code directly into my IDE.

I wanted to break this barrier. I wanted an assistant that lives on my desktop, not in a browser. With the release of Google's Gemini 3 Flash Preview, I realized that the latency is finally low enough to build a real-time, multimodal agent that can actually See, Hear, and Act.

What it does

JARVIS is a fully functional desktop agent that bridges the gap between Generative AI and Operating System control.

Multimodal Vision: I can show it a circuit board, a handwritten note, or a code error on my screen, and it analyzes the image using Gemini Vision.
AI Code Automation: I dictate logic (e.g., "Write a calculator"), and JARVIS generates the Python code, saves the file, and opens it in VS Code automatically.
Smart Screen Analysis: The "Explain my screen" feature takes a screenshot and debugs errors or summarizes articles visible on the desktop.
Gesture Control: Using computer vision, I can control media (Volume/Play/Pause) with hand gestures, Minority Report style.
Proactive Reasoning: Unlike passive bots, JARVIS checks my CPU usage and battery health and offers advice without being asked.
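
As a rough illustration of the Proactive Reasoning idea, here is a minimal sketch of how system readings could be turned into spoken suggestions. The function name, thresholds, and wording are assumptions made for this example, not the project's actual code. The readings are passed in as plain numbers; in practice they could come from a library such as psutil, which this write-up does not name.

```python
from typing import List, Optional

# Illustrative thresholds (assumptions, not values taken from the project).
CPU_HIGH_PERCENT = 85.0
BATTERY_LOW_PERCENT = 20.0


def proactive_advice(cpu_percent: float,
                     battery_percent: Optional[float],
                     plugged_in: bool) -> List[str]:
    """Turn raw system readings into short suggestions the assistant can speak."""
    advice = []
    if cpu_percent >= CPU_HIGH_PERCENT:
        advice.append(
            f"CPU usage is at {cpu_percent:.0f} percent. "
            "Consider closing unused applications."
        )
    # battery_percent is None on machines without a battery (e.g. desktops).
    low_battery = (
        battery_percent is not None
        and battery_percent <= BATTERY_LOW_PERCENT
    )
    if low_battery and not plugged_in:
        advice.append(
            f"Battery is at {battery_percent:.0f} percent. "
            "Please plug in the charger."
        )
    return advice
```

Keeping the decision logic separate from the code that reads the sensors makes it easy to test without touching real hardware.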

How we built it

The core of JARVIS is written in Python.

The Brain: We used the google-genai SDK to call Gemini 3 Flash Preview and Gemini 1.5 Flash. We chose these models for their speed and multimodal capabilities.
The Eyes: We integrated OpenCV and PyAutoGUI to capture webcam feeds and screenshots, which are then processed by Gemini's Vision capabilities.
The Ears & Voice: We used SpeechRecognition for input and pyttsx3 for offline text-to-speech to ensure low latency.
The Hands: We used Python's os and subprocess modules to let Gemini execute shell commands, open applications, and manage files.
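
To make "The Hands" more concrete, here is a minimal sketch of the save-and-open step, assuming the VS Code command-line launcher `code` is on the PATH. The function name `save_and_open` and its parameters are hypothetical and only illustrate the idea; the real project may do this differently.

```python
import shutil
import subprocess
from pathlib import Path


def save_and_open(code: str, path: Path, editor: str = "code",
                  launch: bool = True) -> bool:
    """Write generated code to disk, then try to open it in the editor.

    Returns True if the editor was launched, False otherwise.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(code, encoding="utf-8")

    if not launch:
        return False
    # shutil.which resolves the launcher on PATH (assumed to be VS Code's `code`).
    editor_path = shutil.which(editor)
    if editor_path is None:
        return False  # the file is saved, but no editor CLI was found
    # Arguments are passed as a list, so the file path is never parsed by a shell.
    subprocess.run([editor_path, str(path)], check=False)
    return True
```

For example, `save_and_open(generated_code, Path("calculator.py"))` would save the script and open it if the launcher is available. Passing arguments as a list rather than a shell string is a deliberate choice here, since the file name may come from a voice command.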

Challenges we ran into

Quota Limits (Error 429): Since Gemini 3 Flash is in preview, the rate limits are strict. We faced "Resource Exhausted" errors frequently during testing. We solved this by implementing a fallback mechanism to Gemini 1.5 Flash for stability while keeping Gemini 3 for complex reasoning tasks.
Hallucinations in Automation: Initially, the model would output conversational text along with code (e.g., "Here is your code..."). This broke the file-saving feature. We had to refine our system prompts to enforce strict output formats so the code could be executed directly.
Real-time Vision Latency: Processing video frames in real time was heavy. We optimized this by taking "snapshots" on command rather than analyzing a continuous stream, so the PC doesn't lag.
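
The quota fallback described above can be sketched roughly as follows. This is not the project's actual code: `call_model` is a hypothetical callable standing in for the real SDK request, the model ID for Gemini 1.5 Flash is assumed, and the check for a `code` attribute equal to 429 is an assumption about how the SDK reports quota errors.

```python
from typing import Callable, Optional, Sequence

# Preferred model first, fallback second (the second ID is an assumption).
MODEL_CHAIN = ("gemini-3-flash-preview", "gemini-1.5-flash")


def is_quota_error(exc: Exception) -> bool:
    """Heuristic check for an HTTP 429 / "Resource Exhausted" failure."""
    return getattr(exc, "code", None) == 429 or "RESOURCE_EXHAUSTED" in str(exc)


def generate_with_fallback(prompt: str,
                           call_model: Callable[[str, str], str],
                           models: Sequence[str] = MODEL_CHAIN) -> str:
    """Try each model in order, moving on only when the quota is exhausted."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return call_model(model, prompt)
        except Exception as exc:
            if not is_quota_error(exc):
                raise  # unrelated errors should surface, not be hidden
            last_error = exc
    raise RuntimeError("All models are rate limited") from last_error
```

Injecting `call_model` keeps the fallback logic independent of any one SDK version and lets it be tested with a fake that raises a 429 on the first model.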

Accomplishments that we're proud of

Seamless "Text-to-Action": Seeing JARVIS actually write a Python script and open it in VS Code just by listening to my voice was a magical moment.
Integration of Gemini 3: Successfully implementing the latest Preview model to make the assistant feel significantly faster than traditional chatbots.
Complex Gesture Control: Building a reliable hand-tracking system that works alongside the AI without crashing the application.

What we learned

Multimodal is the Future: Text is not enough. Giving AI "eyes" (Vision) changes the way we interact with computers entirely.
Prompt Engineering for Function Calling: We learned how to structure prompts so that the LLM behaves like a tool (producing structured data) rather than just a chatbot.
Error Handling: Building robust error handling for API timeouts and network issues is crucial for a desktop app.
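
Strict prompts help, but a defensive parser is a useful second line of defense against replies that wrap code in conversational text. The following is a minimal sketch under stated assumptions: the function name `extract_python` is hypothetical, and it assumes the model either returns bare code or wraps it in a Markdown-style fence. It uses `ast.parse` to reject replies that are not valid Python before anything is saved or run.

```python
import ast
import re

# Built from single characters so this example contains no literal fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_python(reply: str) -> str:
    """Return only the Python source from a model reply, or raise ValueError."""
    match = CODE_BLOCK.search(reply)
    # Prefer the first fenced block; otherwise treat the whole reply as code.
    code = (match.group(1) if match else reply).strip()
    if not code:
        raise ValueError("model reply contains no code")
    try:
        ast.parse(code)  # syntax check only; the code is not executed
    except SyntaxError as exc:
        raise ValueError("model reply is not valid Python") from exc
    return code + "\n"
```

A reply such as "Here is your code..." followed by a fenced block yields just the block, while a purely conversational reply raises `ValueError`, which the caller can handle by asking the model again.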

What's next for JARVIS: Next-Gen Multimodal Assistant (Powered by Gemini 3)

Gemini Live API: We plan to upgrade from snapshot-based vision to real-time video streaming analysis using the new Live API.
IoT Integration: Connecting JARVIS to smart home devices (Lights, Fans) via Raspberry Pi.
Long-Term Memory: Implementing a vector database so JARVIS remembers conversations and project details from weeks ago.

Built With

gemini-3-flash-preview
google-gemini
google-genai-sdk
opencv
pyautogui
python
tkinter

Updates

Muhammad Bin Nadeem started this project on Feb 05, 2026, 02:33 PM EST
