Inspiration
I wanted to build an AI assistant that breaks out of the browser window. Chatting with an AI is great, but a true "agent" should be able to see what I see and take physical action on my behalf. Working under my startup branch, Deep Intel, I set out to create "Gemini Omni" (or Deep Intel Omni): a fully multimodal desktop assistant that streams your screen, listens to your voice, and actually clicks on your screen to get work done, leveraging the speed of Google's Gemini models.
What it does
Gemini Omni is a real-time, voice-controlled desktop agent. It sits on your Windows desktop as a sleek, glassmorphic UI overlay. When you start a live session, it streams your microphone audio and desktop screen frames to a custom cloud backend. Gemini processes your screen context and voice commands, replies with synthesized speech in real time, and sends back JSON commands that the client executes to move your mouse and click on your local machine.
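As an illustration of the JSON command step, here is a minimal sketch of how a client could validate one command before acting on it. The schema is hypothetical: the field names `action`, `x`, and `y`, the action names, and the use of fractional screen coordinates are assumptions made for this example, not the project's actual wire format.

```python
import json
from dataclasses import dataclass

# Hypothetical action names; the real command set is not documented here.
ALLOWED_ACTIONS = {"move", "click"}


@dataclass(frozen=True)
class MouseCommand:
    action: str
    x: float  # assumed: fraction of screen width, 0.0 to 1.0
    y: float  # assumed: fraction of screen height, 0.0 to 1.0


def parse_command(raw: str) -> MouseCommand:
    """Parse and validate one JSON command; raise ValueError on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("command must be a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    try:
        x = float(data["x"])
        y = float(data["y"])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError("x and y must be numbers") from exc
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("x and y must be within 0.0..1.0")

    return MouseCommand(action, x, y)
```

Rejecting anything outside a small allow-list matters here because the command ends up driving real input on the user's machine.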
How I built it
The architecture is split between a local desktop client and a high-performance cloud brain:
Frontend (The Body): Built with Flutter for Windows desktop. It manages the UI, records PCM 16-bit audio, captures screen frames every 5 seconds, and handles the WebSocket connection.
Backend (The Brain): A Python FastAPI server containerized with Docker. It uses the bleeding-edge google-genai SDK to process the multimodal streams.
Infrastructure: The backend is deployed serverless on Google Cloud Run, with continuous deployment set up via Google Cloud Build straight from GitHub.
Hardware Control: To execute the AI's physical commands, I used Dart FFI and the win32 package to call the Windows SendInput API directly, passing it natively allocated C-style input structures to drive the mouse.
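One detail of driving the mouse with SendInput is worth a small worked example. When the MOUSEEVENTF_ABSOLUTE flag is set, SendInput expects normalized coordinates from 0 to 65535 rather than pixels. The sketch below shows one common way to do that conversion. It is written in Python rather than Dart to keep the arithmetic easy to read, and it assumes the client uses absolute positioning on the primary display, which the write-up does not state. The function name `to_absolute` is hypothetical.

```python
from typing import Tuple

NORMALIZED_MAX = 65535  # upper bound of SendInput's absolute coordinate range


def to_absolute(px: int, py: int, width: int, height: int) -> Tuple[int, int]:
    """Map a pixel position to SendInput-style normalized coordinates.

    The top-left pixel maps to (0, 0) and the bottom-right pixel maps to
    (65535, 65535).
    """
    if width < 2 or height < 2:
        raise ValueError("screen must be at least 2x2 pixels")
    if not (0 <= px < width and 0 <= py < height):
        raise ValueError("pixel position is outside the screen")
    nx = round(px * NORMALIZED_MAX / (width - 1))
    ny = round(py * NORMALIZED_MAX / (height - 1))
    return nx, ny
```

For a 1920x1080 display, `to_absolute(0, 0, 1920, 1080)` gives `(0, 0)` and `to_absolute(1919, 1079, 1920, 1080)` gives `(65535, 65535)`.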
Challenges I ran into
Building a bridge between a serverless cloud environment and local OS hardware was incredibly difficult:
Bleeding-Edge Dependency Conflicts: The new google-genai library updated its architecture (specifically how HttpOptions is passed), which caused my Docker builds to fail upon deployment. I had to carefully map out pip dependency conflicts between FastAPI, Uvicorn, and GenAI to get a stable, green CI/CD pipeline in Cloud Build.
Cloud vs. Local Execution: Code that worked perfectly on my local development machine crashed in the Linux cloud environment. Reading the tracebacks in Google Cloud Logs Explorer was crucial to tracking down missing environment variables and path issues.
Low-Level Windows APIs: Attempting to click the mouse using deprecated Dart methods failed silently. I had to learn how to allocate C-style memory structures using calloc() and pass them to the modern SendInput API.
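To make the SendInput structures more concrete, here is a rough sketch of the same idea using Python's ctypes instead of Dart FFI. It is a stand-in for illustration, not the project's client code. It lays out the Win32 INPUT and MOUSEINPUT structures, fills in a left-button press and release, and only calls SendInput when running on Windows. The helper names `build_left_click` and `send` are hypothetical.

```python
import ctypes
import sys

# Constants from the Win32 headers.
INPUT_MOUSE = 0
MOUSEEVENTF_LEFTDOWN = 0x0002
MOUSEEVENTF_LEFTUP = 0x0004


class MOUSEINPUT(ctypes.Structure):
    _fields_ = [
        ("dx", ctypes.c_int32),           # LONG
        ("dy", ctypes.c_int32),           # LONG
        ("mouseData", ctypes.c_uint32),   # DWORD
        ("dwFlags", ctypes.c_uint32),     # DWORD
        ("time", ctypes.c_uint32),        # DWORD
        ("dwExtraInfo", ctypes.c_size_t), # ULONG_PTR (pointer-sized)
    ]


class _InputUnion(ctypes.Union):
    # The real union also holds KEYBDINPUT and HARDWAREINPUT; MOUSEINPUT is
    # the largest member, so the overall size is unchanged by omitting them.
    _fields_ = [("mi", MOUSEINPUT)]


class INPUT(ctypes.Structure):
    _anonymous_ = ("u",)
    _fields_ = [("type", ctypes.c_uint32), ("u", _InputUnion)]


def build_left_click():
    """Return a zero-initialized array of two INPUT events: press, release."""
    events = (INPUT * 2)()
    for i, flag in enumerate((MOUSEEVENTF_LEFTDOWN, MOUSEEVENTF_LEFTUP)):
        events[i].type = INPUT_MOUSE
        events[i].mi.dwFlags = flag
    return events


def send(events) -> int:
    """Inject the events; returns how many were accepted. Windows only."""
    if sys.platform != "win32":
        raise OSError("SendInput is only available on Windows")
    user32 = ctypes.WinDLL("user32", use_last_error=True)
    user32.SendInput.argtypes = (ctypes.c_uint, ctypes.POINTER(INPUT), ctypes.c_int)
    user32.SendInput.restype = ctypes.c_uint
    return user32.SendInput(len(events), events, ctypes.sizeof(INPUT))
```

The third argument to SendInput is the size of one INPUT structure, which is why getting the field layout and padding right matters: a size mismatch makes the call fail.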
Accomplishments that I'm proud of
I am incredibly proud of successfully completing the entire loop: from a voice command hitting a physical microphone, traveling via WebSocket to a Google Cloud Run container, being processed by Gemini, and traveling back down to execute a physical mouse click on a desktop monitor. Seeing all the Cloud Build steps (Build, Push, Deploy) turn green after hours of debugging was an unforgettable "It's Alive!" moment.
What I learned
This project fundamentally leveled up my engineering skills. I learned how to build and debug containerized applications in Google Cloud Run, manage WebSocket state in a reactive Flutter frontend, and execute low-level operating system tasks using Dart FFI. I also learned the harsh realities of keeping dependencies strictly versioned in a requirements.txt file!
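The point about strict versioning can be checked automatically. The sketch below is one possible approach, not something the project is known to use: it scans the text of a requirements.txt file and reports every entry that is not pinned with `==`. The function name `unpinned` and the version numbers in the usage note are made up for illustration.

```python
import re
from typing import List

# A pinned entry looks like "name==version" or "name[extra]==version".
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;]+")


def unpinned(requirements_text: str) -> List[str]:
    """Return the requirement entries that lack an exact '==' pin."""
    loose = []
    for line in requirements_text.splitlines():
        spec = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not spec or spec.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(spec):
            loose.append(spec)
    return loose
```

Given the text `"fastapi==0.110.0\nuvicorn>=0.20\ngoogle-genai\n"`, the function returns `["uvicorn>=0.20", "google-genai"]`. Running a check like this as an early build step would surface loose pins before a container build picks up a breaking release.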
What's next for Gemini Omni
Right now, the agent has mastered the mouse. The immediate next step is mapping the Windows Keyboard APIs so the agent can type out code, draft emails, and use keyboard shortcuts based on its visual understanding of the screen. I also plan to package the client for macOS and Linux to make Deep Intel Omni truly platform-agnostic.