Inspiration
The inspiration for G-NOV SENTINEL came from the "Visual Blindness" of modern DevOps. Most AI agents are limited to text, but developers live in a visual world of UIs, dashboards, and code. We wanted to build a Sovereign Agent that doesn't just talk, but actually "sees" the workspace and acts as the user's autonomous hands.
What it does
G-NOV SENTINEL is a next-generation Multimodal AI-Ops Engine. It performs real-time Visual Telemetry by capturing screen data and audio, identifies UI bugs visually, maps "Ghost-Fix" coordinates (the exact on-screen position where a fix should be applied), and handles natural-language commands to orchestrate the browser. It bridges the gap between identifying a problem and executing a solution without the user ever leaving their 3D tactical dashboard.
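The write-up does not name the automation layer that acts on those coordinates; as one illustrative possibility, a Ghost-Fix click could be executed with Playwright (the library choice and function name here are our assumptions, not the team's code):

```python
# Illustrative sketch only: the browser-automation library is not named in the
# write-up; Playwright is one way to act on Ghost-Fix coordinates.
from playwright.sync_api import sync_playwright

def ghost_fix_click(url: str, x: int, y: int) -> None:
    """Click the pixel position that the visual model flagged as the fix point."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url)
        page.mouse.click(x, y)  # coordinates mapped from the captured screenshot
        browser.close()
```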
How we built it
We engineered G-NOV SENTINEL around a high-concurrency, event-driven architecture (a minimal sketch of the streaming core follows this list):
- Brain: Google Gemini 3 Flash for ultra-fast multimodal intent parsing and visual analysis.
- Backend: A Python engine hosted on Google Cloud Platform (GCP) using Cloud Run and Vertex AI endpoints.
- Frontend: A futuristic 3D kinetic interface built with HTML5/CSS3, using Socket.io for low-latency, full-duplex communication.
- Communication: Real-time streaming of video and audio packets for near-instant reaction times.
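Here is a minimal sketch of that streaming core, assuming Flask-SocketIO for the duplex channel and the google-genai SDK for the Gemini call; the `frame` event name and payload shape are illustrative assumptions, not the team's actual code:

```python
# Minimal sketch of the event-driven backend. Assumptions: Flask-SocketIO for
# the Socket.io channel and the google-genai SDK for Gemini; the event name
# and payload shape are ours, not the team's.
from flask import Flask
from flask_socketio import SocketIO, emit
from google import genai
from google.genai import types

app = Flask(__name__)
socketio = SocketIO(app, cors_allowed_origins="*")
client = genai.Client()  # reads API key / Vertex AI settings from the environment

@socketio.on("frame")
def handle_frame(payload):
    """Receive one JPEG screen frame plus the user's spoken command."""
    response = client.models.generate_content(
        model="gemini-3-flash",  # model name as stated by the team
        contents=[
            types.Part.from_bytes(data=payload["jpeg"], mime_type="image/jpeg"),
            payload["command"],  # e.g. "Fix the UI"
        ],
    )
    emit("analysis", {"text": response.text})  # push the result to the dashboard

if __name__ == "__main__":
    socketio.run(app, port=8080)
```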
Challenges we ran into
The biggest challenge was managing the latency of multimodal data sync: sending video frames and audio simultaneously while expecting a real-time response required deep optimization of our WebSocket pipeline. Getting the agent to interpret UI elements without any DOM access, relying purely on Gemini's visual intelligence, also demanded precision prompting and careful coordinate-mapping logic (sketched below).
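To make the coordinate mapping concrete, here is a hedged sketch of the DOM-free localization step: the model is prompted to return a bounding box as JSON, which is then scaled back to the captured frame's pixel grid. The prompt wording and the 0-1000 normalization convention are our assumptions about the approach, not the team's exact code:

```python
# Sketch of DOM-free coordinate mapping via precision prompting. The prompt
# text and 0-1000 normalization are assumptions, not the team's exact logic.
import json
from google import genai
from google.genai import types

client = genai.Client()

PROMPT = (
    "Find the broken UI element in this screenshot. "
    'Reply with JSON only: {"box_2d": [ymin, xmin, ymax, xmax]}, '
    "coordinates normalized to the range 0-1000."
)

def locate_bug(jpeg_bytes: bytes, width: int, height: int) -> tuple[int, int]:
    """Return the pixel center of the element the model flags."""
    response = client.models.generate_content(
        model="gemini-3-flash",  # model name as stated by the team
        contents=[types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"), PROMPT],
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    ymin, xmin, ymax, xmax = json.loads(response.text)["box_2d"]
    # Rescale normalized box coordinates to pixels and take the center point.
    return (int((xmin + xmax) / 2 * width / 1000),
            int((ymin + ymax) / 2 * height / 1000))
```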
Accomplishments that we're proud of
We are incredibly proud of achieving a flawless Multimodal Handshake. Seeing the agent successfully interpret a verbal command to "Fix the UI" by visually scanning the screen and identifying the exact coordinates was a massive win. We also successfully deployed a stable, scalable backend on Google Cloud within a high-pressure timeframe.
What we learned
We learned just how much Gemini 3 Flash's speed matters: in a UI-navigator context, latency is everything. We also discovered that multimodal context (voice plus vision) drastically reduces the ambiguity of human commands, making AI agents far more reliable for complex system tasks.
What's next for G-NOV SENTINEL: A Multimodal AI-Ops Engine
The next phase for Sentinel is Deep OS-level Integration. We aim to move beyond the browser and allow the Sentinel to orchestrate entire operating systems and cloud infrastructures. We are also exploring "Collaborative Swarms," where multiple Sentinels work together to monitor global cloud health in real-time.
Built With
- archives
- css3
- flask
- gemini-3-flash
- google-cloud
- html5
- iam
- javascript
- json
- multimodal-ai
- neural
- python
- ui-navigator
- websockets

