Inspiration

AML compliance is a $60 billion/year industry where analysts spend 90–95% of their time clearing false positives. Each alert takes ~3 hours of manual evidence assembly (pulling KYC records, tracing entity ownership graphs, checking sanctions lists, scanning adverse media) before an analyst can even begin writing the Suspicious Activity Report. The reasoning is straightforward; the retrieval is the bottleneck.

We asked: what if an AI agent could do the evidence assembly autonomously, but with a structural guarantee against hallucination? We did not want a prompt-based "please don't make things up" but an architectural constraint under which fabricated claims physically cannot reach the final report.

The Investigator–Skeptic dual-agent pattern was inspired by adversarial review processes in academia (peer review) and law (cross-examination). If one agent gathers evidence and a separate agent, with restricted tools and no shared context, must independently verify every citation, hallucinations become structurally impossible rather than merely discouraged.

What it does

Argus takes a suspicious transaction alert and autonomously produces a citation-backed SAR (Suspicious Activity Report) draft in under 60 seconds. Specifically:

  1. Investigator agent explores a synthetic AML dataset (entities, transactions, adverse media, sanctions, prior alerts) using 8 MCP-compatible tools
  2. Skeptic agent re-fetches every cited document in a separate context and verifies claims match source data
  3. Orchestrator runs a deterministic loop (max 4 iterations); on rejection, the Investigator must revise or retract
  4. Output: structured findings with severity levels, typology tags, cited evidence, and a full SAR narrative
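
The four steps above can be sketched as a small control loop. This is a minimal illustration, not the project's actual orchestrator: the Finding and Verdict shapes, the run_investigation signature, and the two demo agents are hypothetical stand-ins for the Gemini-backed Investigator and Skeptic, and the behavior when the cap is reached (dropping still-rejected findings) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_ITERATIONS = 4  # hard cap named in the writeup


@dataclass
class Finding:
    claim: str
    evidence_ids: list[str]


@dataclass
class Verdict:
    accepted: bool
    rejected_claims: list[str] = field(default_factory=list)


def run_investigation(
    alert_id: str,
    investigate: Callable[[str, list[str]], list[Finding]],
    verify: Callable[[list[Finding]], Verdict],
) -> tuple[list[Finding], list[dict]]:
    """Loop until the Skeptic accepts or the iteration cap is reached."""
    trace: list[dict] = []
    feedback: list[str] = []
    for iteration in range(1, MAX_ITERATIONS + 1):
        findings = investigate(alert_id, feedback)
        verdict = verify(findings)
        trace.append({
            "iteration": iteration,
            "accepted": verdict.accepted,
            "rejected": list(verdict.rejected_claims),
        })
        if verdict.accepted:
            return findings, trace
        # Rejected claims go back to the Investigator to revise or retract.
        feedback = verdict.rejected_claims
    # Assumption: at the cap, only findings the Skeptic did not reject survive.
    kept = [f for f in findings if f.claim not in verdict.rejected_claims]
    return kept, trace


# Hypothetical stand-ins for the two agents, for demonstration only.
def demo_investigate(alert_id: str, feedback: list[str]) -> list[Finding]:
    drafts = [
        Finding("round-trip via shell company", ["txn-1"]),
        Finding("sanctions hit", ["doc-missing"]),
    ]
    # Retract any claim the Skeptic rejected in the previous round.
    return [f for f in drafts if f.claim not in feedback]


def demo_verify(findings: list[Finding]) -> Verdict:
    known_docs = {"txn-1"}  # documents the Skeptic can actually re-fetch
    bad = [f.claim for f in findings if not set(f.evidence_ids) <= known_docs]
    return Verdict(accepted=not bad, rejected_claims=bad)


if __name__ == "__main__":
    findings, trace = run_investigation("GC-001", demo_investigate, demo_verify)
    print([f.claim for f in findings], len(trace))
```

In this demo the unsupported claim is rejected in the first round and retracted in the second, so the loop ends after two iterations with a single surviving finding.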

The system correctly escalates suspicious shell-company round-tripping (GC-001) and correctly clears legitimate cross-border trade finance (GC-004), which indicates that it is not biased toward escalation.

How we built it

Backend: Python 3.12 + FastAPI, deployed on Google Cloud Run. The orchestrator coordinates two Gemini 2.5 Flash-Lite sessions (via Vertex AI), one for the Investigator and one for the Skeptic. Both query a local Elastic-compatible simulator serving synthetic AML data across 6 indices (~3,900 documents).

Frontend: Next.js 15 with a single-screen, three-pane layout: alert queue (left), investigation workspace (center), SAR draft + reasoning trace (right). Deployed on Cloud Run.

Architecture enforcement:

  • Read-only data access (no write methods exist in the tool surface)
  • Asymmetric tools: Investigator gets 8 tools, Skeptic gets only 3 (get_document, search with size≤5, count)
  • Separate Gemini sessions with independent system prompts, no shared conversation history
  • Pydantic schema requires evidence[] with min_length=1 on every Finding
  • Hard iteration cap (MAX_ITERATIONS=4) prevents infinite revision loops
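
The schema constraint in the list above can be illustrated with a short Pydantic model. This is a sketch assuming Pydantic v2; only the Finding name, the evidence field, and min_length=1 come from the writeup. The Evidence fields, the severity type, and the is_valid_finding helper are hypothetical.

```python
from pydantic import BaseModel, Field, ValidationError


class Evidence(BaseModel):
    index: str   # hypothetical: which index the cited document lives in
    doc_id: str  # hypothetical: the ID the Skeptic would re-fetch
    quote: str   # hypothetical: the source text supporting the claim


class Finding(BaseModel):
    claim: str
    severity: str
    # A Finding with an empty evidence list fails validation outright.
    evidence: list[Evidence] = Field(min_length=1)


def is_valid_finding(payload: dict) -> bool:
    """Return True only if the payload parses as a Finding with evidence."""
    try:
        Finding.model_validate(payload)
        return True
    except ValidationError:
        return False


if __name__ == "__main__":
    uncited = {"claim": "structuring", "severity": "high", "evidence": []}
    cited = {
        "claim": "structuring",
        "severity": "high",
        "evidence": [
            {"index": "transactions", "doc_id": "txn-1", "quote": "9,500 USD"}
        ],
    }
    print(is_valid_finding(uncited), is_valid_finding(cited))
```

Because the check lives in the schema rather than in a prompt, an uncited finding is rejected at parse time regardless of what the model produced.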

Deployment: gcloud run deploy --source with Cloud Build → Artifact Registry → Cloud Run. Vertex AI ADC for authentication (no API keys in code).

Challenges we ran into

  • Gemini rate limits: The free-tier quota for gemini-2.5-flash was 5 RPM, too low for a dual-agent system making 4+ calls per investigation. We solved this by switching to gemini-2.5-flash-lite, which had higher throughput, and adding retry logic with exponential backoff.

  • Model availability on Vertex AI: Several models listed in the SDK (gemini-3.5-flash, gemini-2.0-flash) returned 404 on Vertex AI. Finding working models required systematic testing.

  • Reasoning trace visibility: Traces stored on Cloud Run's ephemeral filesystem were lost between requests. Fixed by returning traces inline with the investigation response rather than relying on separate fetch calls.

  • Cold-start latency: Cloud Run's scale-to-zero meant the first investigation took 15+ seconds. Mitigated with warm-up calls before demos and 300s timeouts.

  • Skeptic calibration: Getting the Skeptic to reject hallucinated findings without being so strict that it rejects everything required careful prompt engineering around what counts as a valid citation match.
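
The retry logic mentioned under the rate-limit challenge could look roughly like the following. It is a generic sketch, not the project's code: RateLimitError is a hypothetical stand-in for whatever quota error the SDK raises, and the attempt count, delays, and jitter are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the SDK's HTTP 429 / quota error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise AssertionError("loop always returns or raises")
```

The sleep parameter is injectable so the function can be tested without waiting; in normal use it defaults to time.sleep.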

Accomplishments that we're proud of

  • Zero prompt-based safety: Every guardrail is architectural. The system cannot hallucinate findings into the final report not because we asked it nicely, but because the code physically prevents it.

  • Correct negative control: The agent clears false positives with a cited rationale; it is not just an escalation machine.

  • Full reasoning trace: Every Investigator draft, Skeptic verdict, and revision is logged and visible in the UI. A compliance auditor can inspect exactly how the agent reached its conclusion.

  • Domain-portable architecture: The Investigator–Skeptic pattern works for any analyst-triage workflow. We documented SOC (security) and SRE (incident) ports: same loop, same schemas, different indices.

  • Sub-60-second investigations: From alert to full SAR draft with 5 cited findings in under a minute, versus 3 hours of manual analyst work.

What we learned

  • Architectural constraints > prompt constraints for production AI safety. Prompts can be jailbroken; code that doesn't expose a write method cannot be talked into writing.

  • Asymmetric tool surfaces are a powerful design pattern. Giving the verification agent fewer capabilities than the investigation agent forces the system toward ground truth.

  • Vertex AI + Cloud Run is a clean deployment story: ADC handles auth, Cloud Build handles containers, and the whole stack stays in one project/region with no cross-service complexity.

  • Synthetic data is sufficient for demonstrating architectural patterns. Real bank data would add compliance risk without improving the demo's core story.

  • Rate limits on new models are the #1 practical blocker for agentic systems. Design for retry from day one.

What's next for Argus

  • Elastic Cloud integration: Replace the local simulator with a real Elastic Cloud Serverless deployment using the official elastic/mcp-server-elasticsearch MCP server
  • Gemini 3 Pro upgrade: Use the larger model for deeper reasoning once quota increases are approved
  • SOC alert triage port: Same architecture with different indices: SIEM events, threat intel, MITRE ATT&CK playbooks
  • Streaming traces: WebSocket-based real-time trace streaming instead of post-hoc rendering
  • PDF SAR export: Generate filing-ready PDF documents from the Markdown SAR drafts
  • Multi-case memory: Let the agent reference patterns from prior investigations when analyzing new alerts

Built With

  • artifact-registry
  • cloud-build
  • docker
  • elastic-mcp
  • fastapi
  • gemini-2.5-flash-lite
  • google-cloud-run
  • google-cloud-vertex-ai
  • next.js
  • node.js
  • pydantic
  • python
  • tailwind-css
  • typescript
  • uvicorn