
RegWatch: When AI Agents Argue, Compliance Gets Better
The "Oh Crap" Moment That Started It All
Picture this: A fintech company gets slapped with a $2M fine because someone missed a single paragraph in a 47-page Basel III update. Their compliance officer spent 4 hours every week manually checking regulatory sites, cross-referencing internal systems, and still... they missed it.
That's when we thought: What if we had AI agents that literally argue with each other about compliance?
The Problem Nobody Talks About
Compliance monitoring is boring, manual, and terrifying. Every week, companies need to:
- Check 50+ regulatory sources (GDPR, PCI-DSS, Basel III, SOX, you name it)
- Figure out which internal systems are affected
- Assign work to engineering teams
- Document everything for auditors
Time spent: 4 hours per framework. Margin for error: Zero. Consequences of missing something: Career-ending.
And here's the kicker—traditional automation doesn't work because regulations are written in legalese, and mapping them to specific infrastructure components requires human judgment. Or does it?
Our Idea: Make Disagreement a Feature
Most AI systems give you one answer. We built a system where two AI agents deliberately challenge each other.
Here's how it works:
Detection Agent scans new regulations and says: "Hey, this Basel III update about capital requirements affects 7 of our components."
Reviewer Agent goes full skeptic mode: "Hold up. Let me re-evaluate each one. You said Trade Surveillance is affected with 68% confidence, but that system monitors market abuse, not capital ratios. That's a false positive. I'm only 25% confident. REJECTED."
The disagreement gets logged. When the agents' scores differ by 15 percentage points or more, humans step in. Otherwise, approved findings go straight to the notification step.
The magic? That 43% disagreement on Trade Surveillance prevented an engineering team from wasting 16 hours on unnecessary work.
What We Built
We used the Elastic Agent Builder Hackathon (Jan 22 - Feb 27, 2026) as our deadline and built:
The Data Foundation
- 17 real regulations indexed in Elasticsearch (GDPR articles, PCI-DSS requirements, Basel III circulars)
- 32 product components with metadata (owners, tech stack, compliance tags)
- Semantic embeddings (384 dimensions) for smart matching
- All running on Elastic Cloud Serverless (zero cost!)
The Agents (Built in Agent Builder)
Detection Agent:
- Queries the regulatory_circulars index using ES|QL
- Finds regulations published in the last 7 days
- Searches product_configs for semantic matches
- Calculates confidence scores: ES relevance (40%) + framework tags (35%) + category overlap (25%)
- Returns findings with confidence ≥ 0.50
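Here's a minimal sketch of how that weighted score could be computed. The three weights and the 0.50 cutoff come from the list above; the MatchSignals class, its field names, and the assumption that each signal is already normalised to 0..1 are ours for illustration, not the agent's actual internals.

```python
from dataclasses import dataclass

# Weights and cutoff as described in the write-up.
WEIGHTS = {"es_relevance": 0.40, "framework_tags": 0.35, "category_overlap": 0.25}
MIN_CONFIDENCE = 0.50


@dataclass
class MatchSignals:
    """Hypothetical per-component signals, each assumed to lie in 0..1."""
    es_relevance: float
    framework_tags: float
    category_overlap: float


def detection_confidence(signals: MatchSignals) -> float:
    """Weighted sum of the three signals, rounded to two decimals."""
    score = (
        WEIGHTS["es_relevance"] * signals.es_relevance
        + WEIGHTS["framework_tags"] * signals.framework_tags
        + WEIGHTS["category_overlap"] * signals.category_overlap
    )
    return round(score, 2)


def is_reported(signals: MatchSignals) -> bool:
    """A finding is returned only when it reaches the minimum confidence."""
    return detection_confidence(signals) >= MIN_CONFIDENCE
```

Because the weights sum to 1.0, a component that maxes out every signal scores exactly 1.0, and a strong tag match alone (0.35) can never clear the 0.50 bar without help from the other two signals.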
Reviewer Agent:
- Takes Detection Agent's findings
- Re-evaluates INDEPENDENTLY (doesn't trust the first answer)
- Uses stricter scoring rules:
  - Component only logs but doesn't process? -0.20
  - Framework tag present but doesn't handle specific requirement? -0.30
  - Regulation mentions action component doesn't perform? -0.40
- Calculates delta: |Reviewer Score - Detection Score|
- Decision logic:
  - Delta < 15% AND reviewer ≥ 0.70 → APPROVED
  - Delta ≥ 15% → ESCALATED (human review)
  - Reviewer < 0.50 → REJECTED (false positive)
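The Reviewer's rules can be sketched in a few lines of Python. The thresholds and penalties are the ones listed above; the rest is assumption. In particular, the rules overlap (a finding can have a big delta and a low reviewer score), so this sketch checks rejection first, which matches the Trade Surveillance outcome described in this write-up. The rules also say nothing about a small delta with a reviewer score between 0.50 and 0.70, so the sketch escalates that case as a guess. The penalty key names are hypothetical.

```python
APPROVE_MIN = 0.70     # reviewer score needed for approval
REJECT_BELOW = 0.50    # reviewer scores under this count as false positives
ESCALATE_DELTA = 0.15  # disagreement that triggers human review

# Penalties from the stricter scoring rules; the key names are hypothetical.
PENALTIES = {
    "logs_only": 0.20,
    "tag_without_requirement": 0.30,
    "action_not_performed": 0.40,
}


def reviewer_score(base_score: float, flags: list[str]) -> float:
    """Subtract every applicable penalty from a base score, floored at 0."""
    penalty = sum(PENALTIES[flag] for flag in flags)
    return round(max(0.0, base_score - penalty), 2)


def decide(detection: float, reviewer: float) -> tuple[str, float]:
    """Return (decision, delta) for one finding."""
    delta = round(abs(reviewer - detection), 2)
    if reviewer < REJECT_BELOW:
        # Assumption: rejection wins over escalation when both rules match.
        return "REJECTED", delta
    if delta >= ESCALATE_DELTA:
        return "ESCALATED", delta
    if reviewer >= APPROVE_MIN:
        return "APPROVED", delta
    # Assumption: the unspecified 0.50-0.70 band goes to a human.
    return "ESCALATED", delta
```

With the Trade Surveillance numbers, decide(0.68, 0.25) returns ("REJECTED", 0.43).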
The Workflow (Elastic Workflows - Tech Preview)
We orchestrated both agents using the brand-new Elastic Workflows feature (literally released during the hackathon!):
steps:
  - name: run_detection_agent
    type: ai.agent
    with:
      agent_id: detection-agent
      message: "Find regulations from last 7 days..."
  - name: run_reviewer_agent
    type: ai.agent
    with:
      agent_id: reviewer-agent
      message: "Review these findings: {{ steps.run_detection_agent.output }}"
Agents automatically pass data to each other. No manual intervention.
The Automation Scripts
- Data ingestion: Generates realistic regulations every 4 hours
- Notifications: Sends email/Slack alerts to compliance officers every 5 minutes
- Webhook server: FastAPI endpoint receives compliance alerts
Total workflow time: Detection → Review → Notification = 8 minutes
The "Holy Sh*t" Moment
We were testing the system with a Basel III regulation about minimum capital requirements. Detection Agent flagged "Trade Surveillance and Market Abuse Monitoring" with 68% confidence.
Reviewer Agent came back with: 0.25 confidence
Delta: 43% (way above our 15% threshold)
Reasoning: "Basel III PILLAR1-2025-01 is about minimum capital requirements. It mandates banks maintain specific capital ratios. Trade Surveillance monitors trading activity for market abuse (insider dealing, manipulation). The disconnect: Capital requirements ≠ Market abuse monitoring."
We just prevented a false positive. No manual review. No wasted engineering effort. Just two AI agents doing what they're built to do—argue until they find the truth.
What We Learned (The Hard Way)
1. Elastic Workflows Literally Dropped Mid-Hackathon
Workflows became available as a tech preview on January 22, 2026, day 1 of the hackathon. Documentation was sparse. We had to reverse-engineer the YAML syntax from example workflows on GitHub. The step type agent-builder.chat didn't work. We tried ai-assistant, then agent.chat, and finally found ai.agent in a blog post. Worth it.
2. Agents Hallucinate When They're Too Confident
Our first Detection Agent had 0.30 as the minimum confidence threshold. It found "matches" everywhere. We raised it to 0.50, and false positives dropped 60%. Lesson: Make your agents earn their confidence.
3. Disagreement Resolution Is the Killer Feature
Initially, we thought the innovation was "automated compliance monitoring." Nope. Every demo we showed, people lit up when they saw the 43% disagreement. That's when we realized: autonomous validation is more valuable than autonomous detection.
4. ES|QL Is Ridiculously Powerful
Being able to query Elasticsearch with natural SQL-like syntax inside agents? Game-changer. Detection Agent uses this to find recent regulations:
FROM regulatory_circulars
| WHERE published_date > NOW() - 7 days
| KEEP regulation_id, title, framework, severity
No complex query DSL. Just clean, readable queries.
Challenges We Faced
Challenge 1: Agent Builder in Serverless vs. Hosted
We initially used a hosted deployment. Agent Builder wasn't visible. Spent 2 hours debugging. Turns out, Agent Builder is only in Serverless projects during tech preview. Switched deployment. Problem solved.
Challenge 2: Passing Data Between Agents in Workflows
Agents return complex JSON objects. Getting Reviewer Agent to read Detection Agent's output was tricky. The template syntax {{ steps.detection.output }} returned [object Object]. We tried {{ steps.detection.output.message | json }} with mixed results. Final solution: simplified the Detection Agent's output format.
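To give a feel for what "simplified" can mean here, this is one way to flatten findings into a compact string that survives template interpolation. It's a sketch under assumptions, not our exact output format: the function name and the component and confidence fields are hypothetical, and only regulation_id is a field named elsewhere in this write-up.

```python
import json


def flatten_findings(findings: list[dict]) -> str:
    """Keep only flat, scalar fields and serialise them as compact JSON."""
    flat = [
        {
            "regulation_id": finding["regulation_id"],
            "component": finding["component"],          # hypothetical field
            "confidence": round(float(finding["confidence"]), 2),
        }
        for finding in findings
    ]
    # A plain string avoids the [object Object] rendering of nested objects.
    return json.dumps(flat, separators=(",", ":"))
```

Anything nested (metadata, raw search hits) is dropped on purpose, so the next agent receives a short string it can read directly from its prompt.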
Challenge 3: Realistic Test Data
We couldn't use real company data (privacy issues). So we built a synthetic data generator that creates:
- Realistic regulation text with frameworks, severity, effective dates
- Product components with tech stacks, owners, compliance tags
- Semantic embeddings that actually make sense
Took a full day. Worth it for the demo.
Challenge 4: Making Disagreement Visible
Users need to SEE the disagreement. We built:
- A visual diff showing both confidence scores side-by-side
- Delta percentage calculation
- Color coding (green = approved, yellow = escalated, red = rejected)
- Reasoning explanations in plain English
This turned abstract AI behavior into something compliance officers could trust.
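A text-only sketch of that view, assuming a simple one-line layout: the color mapping is the one listed above, while the render_finding function and its output format are hypothetical stand-ins for the real UI.

```python
# Color coding as listed above.
STATUS_COLORS = {"APPROVED": "green", "ESCALATED": "yellow", "REJECTED": "red"}


def render_finding(component: str, detection: float, reviewer: float,
                   status: str, reasoning: str) -> str:
    """Show both scores side by side, plus the delta and the reasoning."""
    delta_pct = round(abs(reviewer - detection) * 100)
    color = STATUS_COLORS[status]
    return (
        f"[{color}] {status}: {component} | "
        f"detection {detection:.2f} | reviewer {reviewer:.2f} | "
        f"delta {delta_pct}% | {reasoning}"
    )
```

For the Trade Surveillance finding this yields a red REJECTED line showing 0.68 next to 0.25 with a 43% delta, which is the comparison people reacted to in demos.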
The Tech Stack
- Elasticsearch 9.3.0 (Serverless) - Data storage, semantic search
- Elastic Agent Builder (GA) - Agent creation and management
- Elastic Workflows (Tech Preview) - Agent orchestration
- ES|QL - Query language for agents
- Python - Automation scripts (data ingestion, notifications)
- FastAPI - Webhook server
- SentenceTransformers - Generating 384-dim embeddings
- Cost: $0 (Elastic Cloud trial + serverless)
The Numbers That Matter
| Metric | Before | After | Improvement |
|---|---|---|---|
| Time per framework | 4 hours | 8 minutes | 97% faster |
| False positives caught | Manual QA | Automatic | 43% delta example |
| Engineering hours saved | N/A | 16 hrs | Per rejected finding |
| Regulatory sources checked | 3-5 | 50+ | 10x coverage |
| Human involvement | 100% | ~15% (escalations only) | 85% reduction |
What's Next
If we had more time (or funding), here's what we'd build:
AWS Resource Mapping: Turn "update authentication component" into "enable MFA on arn:aws:iam::123456789:role/auth-service"
Jira Integration: Approved findings automatically create tickets with:
- Component owner assigned
- Severity-based priority
- Link to regulation source
- Estimated effort based on historical data
Feedback Loop: When humans override agent decisions, feed that back as training data to improve confidence scoring
Multi-Framework Correlation: "This GDPR update + that PCI-DSS change = you need to update TWO systems, not one"
Regulatory Change Prediction: "Basel IV is in draft. Here's what's likely to be affected when it passes."
Why This Matters
Compliance isn't sexy. But it's the difference between a thriving fintech and a bankrupt one. Every week, companies pay millions in fines for missing regulatory updates that were publicly available.
RegWatch doesn't just automate compliance monitoring—it makes disagreement systematic, trackable, and valuable.
When two AI agents argue about whether a component is affected, and one catches what the other missed, that's not a bug. That's the whole point.
Try It Yourself
- GitHub:
- Demo Video:
- Elastic Cloud: You can replicate this with a free trial
Built with coffee, determination, and the Elastic Agent Builder Hackathon deadline looming. 🚀
Team: Xyphor
Hackathon: Elastic Agent Builder (Jan 22 - Feb 27, 2026)
Built with: Agent Builder (GA), Elastic Workflows (Tech Preview), ES|QL, Python