The Problem
Robotic Process Automation (RPA) is a $15 billion industry. Yet modern RPA relies on fragile API integrations or heavy desktop software that breaks every time a website updates its CSS. Furthermore, 80% of legacy B2B software (ERPs, logistics portals) doesn't even have open APIs. We realized that to truly automate the web, an AI doesn't need an API. It just needs eyes and hands.
What it does
Stratos Ghost shatters the traditional "text-box" chatbot paradigm. It combines a lightweight Chrome Extension with a serverless Google Cloud backend to act as a "Human API." A user clicks the microphone, speaks a natural-language B2B command (e.g., "Create a new bug report and extract the ID"), and Stratos Ghost takes over. It visually analyzes the DOM, reasons about its next step, speaks its intent out loud, and drives the user's browser to click and type across complex Single Page Applications (SPAs), looping autonomously until the objective is complete. Finally, it extracts the data and sends it to a corporate webhook (Make.com/Telegram).
How we built it
To make this enterprise-grade, we separated the sensory layer from the decision engine:
The Brain (Backend): A Node.js server using the official @google/genai SDK, deployed serverlessly to Google Cloud Run using Cloud Native Buildpacks. It leverages Gemini 2.5 Flash for high-speed, multimodal reasoning, strictly enforcing a JSON responseSchema to dictate the agent's next action (CLICK, TYPE, WAIT, or COMPLETE).
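A minimal sketch of what such a schema-constrained action format could look like. The field names (`thought`, `action`, `targetId`, `text`) and the `parseAction` helper are assumptions for illustration, not the project's actual schema. The object uses the OpenAPI-style shape that the @google/genai SDK accepts as `config.responseSchema` together with `responseMimeType: "application/json"`; the server-side check is an extra guard before any action reaches the browser.

```javascript
// Hypothetical action schema: field names are illustrative, not from the project.
const ACTIONS = ["CLICK", "TYPE", "WAIT", "COMPLETE"];

const actionSchema = {
  type: "OBJECT",
  properties: {
    thought: { type: "STRING" },                  // short reasoning, can be spoken aloud
    action: { type: "STRING", enum: ACTIONS },    // one of the four allowed actions
    targetId: { type: "INTEGER" },                // number of the marked element
    text: { type: "STRING" },                     // text to enter for TYPE
  },
  required: ["thought", "action"],
};

// Re-validate the model's JSON before forwarding it to the extension.
function parseAction(jsonText) {
  const step = JSON.parse(jsonText);
  if (!ACTIONS.includes(step.action)) {
    throw new Error(`Unknown action: ${step.action}`);
  }
  const needsTarget = step.action === "CLICK" || step.action === "TYPE";
  if (needsTarget && !Number.isInteger(step.targetId)) {
    throw new Error(`${step.action} requires an integer targetId`);
  }
  if (step.action === "TYPE" && typeof step.text !== "string") {
    throw new Error("TYPE requires a text string");
  }
  return step;
}
```

Validating again on the server means a malformed or unexpected response is rejected in one place instead of producing a stray click in the user's browser.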
The Eyes & Hands (Frontend): A Manifest V3 Chrome Extension. We engineered a dynamic "Set-of-Mark" (SoM) visual layer. Before sending a compressed screenshot to Gemini, the extension injects numbered bounding boxes over all interactive DOM elements. The model then picks an element by its number instead of guessing pixel coordinates, which removes the main source of spatial hallucinations.
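The Set-of-Mark idea can be sketched as follows. This is an illustrative sketch, not the extension's code: the selector list, function names, and styling are assumptions. `buildMarks` is pure and assigns the numbers; `collectCandidates` and `drawMarks` touch the DOM and would run in a content script.

```javascript
// Assumed selector for "interactive" elements; a real extension may use a broader list.
const INTERACTIVE_SELECTOR =
  'a[href], button, input, select, textarea, [role="button"], [contenteditable="true"]';

// Pair every candidate element with its viewport rectangle.
function collectCandidates(doc) {
  return Array.from(doc.querySelectorAll(INTERACTIVE_SELECTOR)).map((element) => ({
    element,
    rect: element.getBoundingClientRect(),
  }));
}

// Drop zero-size elements and number the rest starting at 1.
function buildMarks(candidates) {
  return candidates
    .filter(({ rect }) => rect.width > 0 && rect.height > 0)
    .map(({ element, rect }, index) => ({ id: index + 1, element, rect }));
}

// Draw a numbered outline over each mark; the layer ignores pointer events
// so it never intercepts clicks meant for the page.
function drawMarks(doc, marks) {
  const layer = doc.createElement("div");
  layer.style.cssText =
    "position:fixed;inset:0;pointer-events:none;z-index:2147483647";
  for (const { id, rect } of marks) {
    const box = doc.createElement("div");
    box.textContent = String(id);
    box.style.cssText =
      `position:absolute;left:${rect.x}px;top:${rect.y}px;` +
      `width:${rect.width}px;height:${rect.height}px;` +
      "outline:2px solid red;color:red;font:bold 12px sans-serif";
    layer.appendChild(box);
  }
  doc.documentElement.appendChild(layer);
  return layer;
}
```

Keeping the `marks` array around after the screenshot is taken lets the extension resolve a returned number back to the exact element to act on.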
Challenges we ran into
React 19 & Modern DOM Security:
Modern frameworks such as React often ignore synthetic JavaScript clicks and direct value assignments, because they track input state internally. We engineered a bypass by hooking directly into React Fiber nodes and building a brute-force keystroke emulator that dispatches native Pointer and Keyboard events, allowing the AI to type into hostile rich-text markdown editors.
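For plain controlled inputs, a simpler and widely used workaround exists, sketched below. Note that this is a different technique from the Fiber hooking and keystroke emulation described above: it calls the `value` setter on the element's prototype, which sidesteps the per-element tracking React installs, and then dispatches a bubbling `input` event so the framework notices the change. The function name `setNativeValue` is an assumption, and this approach generally does not work for rich-text editors.

```javascript
// Set an input's value through the prototype setter, then notify the framework.
function setNativeValue(el, value) {
  const proto = Object.getPrototypeOf(el); // e.g. HTMLInputElement.prototype
  const descriptor = Object.getOwnPropertyDescriptor(proto, "value");
  if (descriptor && typeof descriptor.set === "function") {
    // Bypasses any `value` property defined directly on the element instance.
    descriptor.set.call(el, value);
  } else {
    el.value = value; // fallback for elements without a prototype setter
  }
  // React listens for bubbling input events at the root of the tree.
  el.dispatchEvent(new Event("input", { bubbles: true }));
}
```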
API Rate Limiting:
High-speed ReAct loops trigger HTTP 429 RESOURCE_EXHAUSTED errors. We built an "Enterprise Shock Absorber" that catches these rate-limit errors, pauses the state machine, and resumes gracefully without crashing the extension.
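One common way to implement this kind of pause-and-resume is exponential backoff, sketched below. This is an assumption about the approach, not the project's actual code: the helper names, the delay values, and the way the error is recognized (a numeric `status` of 429 or `RESOURCE_EXHAUSTED` in the message) are all illustrative.

```javascript
// Heuristic check; the exact error shape depends on the client library.
function isRateLimitError(err) {
  return Boolean(err) &&
    (err.status === 429 || /RESOURCE_EXHAUSTED/.test(String(err.message)));
}

// Delay doubles with each attempt and is capped: 1s, 2s, 4s, ... up to capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fn, retrying only on rate-limit errors; other errors propagate immediately.
async function withBackoff(fn, { maxRetries = 5, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (!isRateLimitError(err) || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Wrapping each model call as `withBackoff(() => callModel(prompt))`, where `callModel` is a hypothetical function that performs the request, keeps the retry policy out of the agent loop itself.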
SPA Navigation Wipes:
When the AI clicks a link that triggers a page load, the browser tears down the content script along with the old page. We implemented an async background bridge that repeatedly polls the new tab until the DOM is ready, ensuring the HUD and Voice persist across page loads.
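A minimal sketch of such a polling bridge, under stated assumptions: the `PING` message type, the `{ ready: true }` reply, and the retry numbers are hypothetical, and the `send` function is injected so the logic can be exercised outside a browser.

```javascript
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Ping the tab until its content script answers, or give up after `retries` tries.
async function waitForContentScript(
  tabId,
  send,
  { retries = 20, delayMs = 250, sleep = pause } = {}
) {
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      const reply = await send(tabId, { type: "PING" });
      if (reply && reply.ready) return true;
    } catch (_err) {
      // No listener in the tab yet: the new page is still loading.
    }
    await sleep(delayMs);
  }
  return false;
}
```

In a Manifest V3 service worker, `send` could be `(id, msg) => chrome.tabs.sendMessage(id, msg)`, which returns a promise and rejects while no content script is listening in that tab. Once the function resolves to `true`, the background script can re-send the agent state so the HUD is redrawn.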
Accomplishments that we're proud of
We successfully built a distributed, self-healing multimodal agent that doesn't just guess pixels, but deterministically interacts with the DOM. Seeing Stratos Ghost autonomously navigate the GitHub Issues SPA, type a dynamic bug report, and fire a payload to Telegram was a massive breakthrough.
What's next for Stratos Ghost
We plan to scale the agent's memory (using Vertex AI Vector Search to remember past workflow paths) and implement cross-tab navigation to allow the Ghost to operate across multiple SaaS platforms simultaneously.
Built With
- chrome
- css
- gemini-2.5-flash
- google-cloud-run
- google-genai-sdk
- html
- javascript
- make
- node.js
- prompt-engineering