🛡️ The Mission: Auditable UI Navigation
ComplyAct is a category-defining Agentic Process Automation (APA) platform built for the Gemini UI Navigator Track. Our mission is to solve the "Safety Gap" in enterprise AI: the moment an autonomous agent encounters ambiguity in a high-stakes workflow.
By leveraging Gemini 1.5 Pro's long-context multimodal capabilities, we’ve built a secure translation layer that connects messy, real-world unstructured data (handwritten scans) to rigid legacy ERP systems—governed by a deterministic Graceful Halt engine.
⛰️ The Problem: The Hallucination Abyss
In industries like Finance and Healthcare, "hallucinations" in automation aren't just bugs—they are regulatory disasters. Standard agents often operate as "black boxes," guessing when they should be asking. This leads to a total lack of trust, causing 80% of enterprise AI pilots to stall before reaching production.
⚙️ The Solution: The Gemini-Native Workflow
ComplyAct turns Gemini from a reasoning engine into a Governed Digital Worker.
▸ Gemini Multimodal Ingest: Using Gemini 1.5 Pro, the system interprets complex, scanned, and handwritten documents, extracting structured data with confidence scores.
▸ The Graceful Halt: When Gemini detects ambiguity (confidence score below 80%), the system intentionally halts. It doesn't guess. It triggers a real-time Human-in-the-Loop (HITL) escalation via a Slack-integrated modal.
▸ Gemini UI Navigator (Ghost Cursor): A coordinate-based agent that "sees" the legacy ERP interface through visual reasoning, executing actions with deterministic precision.
▸ Hyper-Sync Engine: A low-latency layer that synchronizes Gemini's reasoning logs with UI movements at 2x speed for a transparent audit experience.
▸ Cloud Ledgering: Every action is cryptographically hashed and persisted to Google Cloud Firestore, creating a permanent, tamper-evident record of the human-AI collaboration.
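The Graceful Halt rule can be illustrated with a minimal sketch. It is not ComplyAct's actual code: the `ExtractedField` type, the `review_extraction` helper, and the 0-to-1 confidence scale are assumptions made for illustration. Only the 80% threshold comes from the description above.

```python
from dataclasses import dataclass

# From the write-up: a confidence score below 80% triggers a halt.
HALT_THRESHOLD = 0.80


@dataclass
class ExtractedField:
    """Hypothetical shape of one extracted value (not the real schema)."""
    name: str
    value: str
    confidence: float  # assumed to be normalized to [0.0, 1.0]


def review_extraction(fields):
    """Return ("halt", ambiguous_fields) or ("proceed", [])."""
    ambiguous = [f for f in fields if f.confidence < HALT_THRESHOLD]
    if ambiguous:
        # In the real system this is where the HITL escalation would start.
        return "halt", ambiguous
    return "proceed", []


fields = [
    ExtractedField("invoice_number", "INV-1042", 0.97),
    ExtractedField("total_amount", "1,8?0.00", 0.62),
]
decision, flagged = review_extraction(fields)
print(decision, [f.name for f in flagged])
```

The point of the design is that the halt decision is a plain comparison outside the model, so the same inputs always produce the same decision.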
🏗️ Technical Architecture (Google Cloud Stack)
We built ComplyAct to be a showcase of the Google AI Ecosystem:
- Brain: Gemini 1.5 Pro for high-reasoning multimodal document extraction.
- Navigator: Gemini-powered UI reasoning implemented via a deterministic Playwright executor for robust interface control.
- Database: Google Cloud Firestore to store the immutable audit trail and session states.
- Backend: High-performance FastAPI (Python) orchestration.
- Frontend: Next.js 14 HUD with a CRT-inspired "Transparency Terminal."
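One common way to make an audit trail tamper-evident is a hash chain, where each record's hash covers the previous record's hash. The sketch below shows that idea under stated assumptions: the entry layout and the helper names (`hash_entry`, `append_entry`, `verify_ledger`) are hypothetical, and an in-memory list stands in for the Firestore collection. The write-up does not say whether ComplyAct chains its hashes this way.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def hash_entry(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger, payload):
    """Append a record whose hash depends on every record before it."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": hash_entry(prev_hash, payload),
    }
    ledger.append(entry)  # a real backend would persist this document
    return entry


def verify_ledger(ledger):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != hash_entry(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append_entry(ledger, {"actor": "agent", "action": "fill", "field": "invoice_number"})
append_entry(ledger, {"actor": "human", "action": "approve", "field": "total_amount"})
print(verify_ledger(ledger))
```

Hashing makes changes detectable rather than impossible, so a verifier like `verify_ledger` still has to be run over the stored records during an audit.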
🧠 Technical Hurdles: "Visual Grounding"
Our greatest challenge was mapping Gemini's visual intent to the high-precision requirements of legacy web forms. Unlike modern APIs, legacy ERPs require exact coordinate clicks.
Our Solution: We developed a Spatial Reasoning Wrapper that passes DOM snapshots to Gemini. Gemini identifies the "Target Intent," and our backend translates that into specific Playwright interactions. We then implemented a WebSocket Logic Buffer to keep the reasoning logs and the "Ghost Cursor" frame-locked in real time.
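The translation step from "Target Intent" to a Playwright interaction might look like the sketch below. The JSON shape of the intent, the `ALLOWED_ACTIONS` table, and `parse_intent` are all assumptions for illustration; the write-up does not specify the wrapper's format. The sketch only validates the model output and returns a method name plus arguments, so it runs without a browser.

```python
import json

# Assumed allowlist: action name -> extra required keys besides "selector".
# Both names match real Playwright Page methods: click(selector) and
# fill(selector, value).
ALLOWED_ACTIONS = {"click": (), "fill": ("value",)}


def parse_intent(raw_json):
    """Validate a model-produced intent and return (method_name, args)."""
    intent = json.loads(raw_json)
    action = intent.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    selector = intent.get("selector")
    if not isinstance(selector, str) or not selector:
        raise ValueError("intent needs a non-empty selector")
    args = [selector]
    for key in ALLOWED_ACTIONS[action]:
        if key not in intent:
            raise ValueError(f"action {action!r} requires {key!r}")
        args.append(str(intent[key]))
    return action, tuple(args)


# Hypothetical intent as the model might return it.
raw = '{"action": "fill", "selector": "#vendor-name", "value": "Acme Corp"}'
method, args = parse_intent(raw)
print(method, args)
# An executor holding a Playwright page could then run:
#     getattr(page, method)(*args)
```

Rejecting anything outside the allowlist keeps the executor deterministic: the model proposes, but only a fixed set of interactions can ever reach the ERP interface.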
🏆 Prize Category Alignment
UI Navigator Track: ComplyAct is built specifically to solve the "Interface Gap." It uses Gemini 1.5 Pro to interpret visual input and navigate complex, multi-step web workflows autonomously.
Multimodal Excellence: The system processes a "Triple-Threat" of data: scanned PDF visuals, handwritten text reasoning, and real-time UI frame analysis.
Accountability & Governance: By integrating Google Cloud Firestore and our Graceful Halt logic, we address the primary security and compliance concerns of Fortune 500 organizations.
🔭 Future Vision: Gemini Live Audio Integration
The next evolution of ComplyAct is the integration of the Gemini Multimodal Live API. This will allow auditors to talk the agent through a halt by voice, asking "Why did you stop here?" and receiving a real-time spoken explanation drawn from the agent's reasoning chain.
Built With
- cloud
- fastapi
- firestore
- gemini
- next.js
- playwright
- pro
- python
- tailwindcss
- websockets