UI Fraud Investigative Agent

Inspiration

The DOL OIG found roughly $28.9 billion in fraudulent unemployment payments during pandemic-era programs, with 991,793 SSNs used to file for benefits in multiple states simultaneously. The pattern was visible. The technical capability to detect it existed. But states could not share enough data to detect the patterns without sharing more than 20 CFR 603 and modern privacy expectations allow. I wanted to see whether a governed agent system could thread that needle: detect cross-state fraud patterns through a transformed-identifier protocol that exposes matches without exposing claimants. I also wanted to see whether it could do so against the operational reality of state systems, which are heterogeneous (legacy mainframes alongside modern cloud stores) and subject to strict due-process and audit requirements. This project is the prototype that answers that question.

What it does

The agent investigates a specific unemployment insurance claim and detects whether it shares identifiers with claims in another state, without either state ever exposing raw personally identifiable information.

When an investigator submits a claim ID, the agent reads the claim from a simulated legacy mainframe source (EBCDIC-encoded, COBOL copybook layout) and a modern cloud data store (MongoDB Atlas) through a unified adapter pattern.
It computes federation-safe HMAC-SHA256 hashes for the claim's sticky identifiers (SSN, device fingerprint, bank routing, bank account), queries the federation for matches on those hashes, and classifies the outcome as cross-state, within-state-only, or no-match.
When a cross-state match is found, the agent invokes an audited per-claim lookup tool to retrieve quasi-identifying context (claimant name, DOB, address) for the matched claims. Every release is logged with a timestamp, justification, and content hash.
The agent then produces both a machine-readable structured findings object and a human-readable due-process explanation suitable for an investigator's review.
The investigator surface is a Streamlit web app deployed on Google Cloud Run, with the agent's reasoning fully traced in Arize AX and its distributed execution traced in Dynatrace.
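
The hashing and classification steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the field names, the `federation_hashes` and `classify` helpers, and the choice to prefix each value with its field name are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical field names; the project's real schema is not shown in this write-up.
STICKY_FIELDS = ("ssn", "device_fingerprint", "bank_routing", "bank_account")


def federation_hashes(claim: dict, salt: bytes) -> dict:
    """Return an HMAC-SHA256 hex digest for each sticky identifier present."""
    hashes = {}
    for field in STICKY_FIELDS:
        value = claim.get(field)
        if value:
            # Assumption: prefix the field name so the same string appearing
            # in two different fields does not produce the same digest.
            message = f"{field}:{value}".encode("utf-8")
            hashes[field] = hmac.new(salt, message, hashlib.sha256).hexdigest()
    # Quasi-identifiers (name, DOB, address) are never copied into the result.
    return hashes


def classify(home_state: str, matched_states: set) -> str:
    """Map the states of the matching claims to one of the three outcomes."""
    if any(state != home_state for state in matched_states):
        return "cross-state"
    if matched_states:
        return "within-state-only"
    return "no-match"
```

Two states that hold the same salt produce the same digest for the same SSN, so a match can be detected by comparing digests only. The raw value never has to leave either state.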

How I built it

The system is a single-agent orchestrator pattern: one Gemini 2.5 Flash agent on Google Cloud Agent Platform coordinates five tools that abstract over the data heterogeneity and enforce privacy at the tool boundary.
1- Data layer. State AA is a simulated legacy mainframe: synthetic claim records written as EBCDIC-encoded fixed-width binary following a COBOL copybook layout, parsed by a custom reader. State BB is MongoDB Atlas, accessed as a JSON document store. The legacy adapter normalizes both into a uniform internal schema.
2- Privacy primitives. Three primitives, mechanically enforced: HMAC-SHA256 transformation of sticky identifiers under a shared federation salt; structural exclusion of quasi-identifiers from any federation-exchange record; audited request-and-release for quasi-identifiers via a separate tool that requires justification and logs every event.
3- Agent layer. Gemini 2.5 Flash with structured tool calling. The orchestrator emits both a StructuredFindings JSON object (machine-readable contract) and a markdown narrative (human-readable explanation).
4- Observability. Arize AX captures every Gemini call via OpenInference auto-instrumentation; Dynatrace captures distributed execution traces (MongoDB queries, custom tool spans) via OpenTelemetry. Complementary backends, complementary views.
5- Evaluation. A 19-case labeled benchmark spanning five fraud typology categories, plus an LLM-as-judge narrative quality scoring pass.
6- Deployment. Single container on Cloud Run, secrets in Google Cloud Secret Manager, image in Artifact Registry.
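
To make the data layer concrete, here is a minimal sketch of reading one EBCDIC fixed-width record. The three-field layout, the field names, and the use of code page 037 are assumptions for illustration; the project's actual copybook and reader are not shown here.

```python
from decimal import Decimal

# Hypothetical layout standing in for a copybook such as:
#   05 CLAIM-ID    PIC X(10).
#   05 STATE-CODE  PIC X(2).
#   05 WEEKLY-AMT  PIC 9(5)V99.
LAYOUT = (("claim_id", 10), ("state", 2), ("weekly_amount", 7))
RECORD_LENGTH = sum(width for _, width in LAYOUT)


def parse_record(raw: bytes) -> dict:
    """Decode one fixed-width EBCDIC record into a plain dict."""
    if len(raw) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} bytes, got {len(raw)}")
    fields = {}
    offset = 0
    for name, width in LAYOUT:
        # cp037 is Python's built-in codec for EBCDIC (US/Canada).
        fields[name] = raw[offset:offset + width].decode("cp037").rstrip()
        offset += width
    # V99 means two implied decimal places: stored digits 0041250 mean 412.50.
    fields["weekly_amount"] = Decimal(int(fields["weekly_amount"])) / 100
    return fields
```

Unsigned display-format digits decode directly to the characters 0 through 9 under cp037, which is why a plain `int()` conversion works here. Signed or packed (COMP-3) fields would need extra handling that this sketch leaves out.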

Challenges I ran into

Production reliability surfaced under evaluation load. Vertex AI rate limits and MongoDB Atlas connection churn never appeared during interactive testing. Running 19 evaluation cases back-to-back broke things: transient 429 quota errors mid-loop, SSL handshake failures from rapid connection creation. Fixing them required a retry helper with exponential backoff for transient cloud errors, plus connection pooling across all four MongoDB-using tool files. Both patterns now sit at every external dependency boundary.
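
A retry helper of the kind described above might look like the following sketch. `TransientError` and `with_retry` are hypothetical names; in practice the except clause would list whichever exception types the cloud client raises for 429 responses and dropped connections.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for retryable failures such as an HTTP 429 or a failed handshake."""


def with_retry(func, *, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries so parallel callers do not sync up.
            sleep(delay + random.uniform(0, delay / 2))
```

For the connection churn, the usual remedy with PyMongo is to create one `MongoClient` per process and reuse it, since the client maintains its own connection pool.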
Heuristic parsing of agent narratives was a dead end. My first eval pass tried to extract structured findings from the agent's markdown narrative via regex. It failed on the first edge case, and the fix required a refactor: have the agent emit structured findings as JSON alongside the narrative, constrained by a schema in the system prompt. The model handles structured output far better than I expected.
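
Once the agent emits JSON, the harness can validate it instead of pattern-matching prose. The sketch below shows one way to do that with the standard library. The field names (`claim_id`, `outcome`, `matched_claim_ids`) are assumptions, not the project's actual StructuredFindings schema.

```python
import json

ALLOWED_OUTCOMES = {"cross-state", "within-state-only", "no-match"}

# Hypothetical required fields and their expected JSON types.
REQUIRED_FIELDS = {"claim_id": str, "outcome": str, "matched_claim_ids": list}


def parse_findings(raw_json: str) -> dict:
    """Parse and validate the agent's structured findings; raise ValueError if bad."""
    findings = json.loads(raw_json)  # JSONDecodeError is a ValueError subclass
    if not isinstance(findings, dict):
        raise ValueError("findings must be a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(findings.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    if findings["outcome"] not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown outcome: {findings['outcome']}")
    return findings
```

A failed validation is then a clear signal to score the case as a contract violation, rather than a silent regex miss.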
Apple Silicon to Cloud Run. Cloud Run runs amd64. My Mac is arm64. Standard docker build produced arm64 images Cloud Run silently rejected. The fix is one flag — docker buildx build --platform linux/amd64 — but the symptom was a deployment that "succeeded" then refused to serve traffic.
LLM-as-judge surfaced bugs in the eval harness, not just the agent. When I sent the judge an incomplete view of the agent's structured findings, it correctly flagged "narrative references claim IDs not in structured findings." The bug was in my code, not the agent's output.

Accomplishments that I am proud of

Live URL on Google Cloud Run, two complementary observability backends capturing every reasoning step and every database query, secrets managed in Google Cloud Secret Manager, and a reproducible deployment.
The agent structurally cannot return raw SSNs, device fingerprints, or bank info because no tool returns them. Three property-based tests verify this.
A 19-case labeled benchmark across five fraud typology categories, scored both for detection accuracy (100% primary outcome correct, 100% exact match-set agreement) and narrative quality (mean 5.0 / 5 across faithfulness, coverage, and clarity via LLM-as-judge). These results carry limitations: the judge is from the same model family, the sample is small, and the benchmark is controlled.

What we learned

Tool design is the agent's real interface. Gemini reads tool descriptions, not source code. Most "model failures" turned out to be missing tool capabilities.
Evaluation infrastructure exposes system bugs, not just model bugs. Building the eval suite surfaced rate limits, connection churn, and harness-side scoring bugs that interactive testing never showed.
Privacy enforcement belongs at the tool layer, not the storage layer. Anchoring it in tools is robust: raw SSNs are never returned because no tool returns them, regardless of where the data lives.
Every external dependency needs its own resilience pattern, whether it is a model API, a database, or a trace exporter.

What's next for UI Fraud Investigative Agent

Broader fraud pattern catalog. Expand beyond the four seeded patterns to coordinated rings, identity laddering, and employer-side fraud drawn from DOL OIG and GAO reports.
Production-grade federation. Move from single-process simulation to per-state federation nodes with authenticated exchange and rotating salts.
Hardened deployment. Migrate agent runtime to Google Cloud Agent Engine. Replace 0.0.0.0/0 Atlas access with VPC peering. Add prompt-injection defenses.
Tamper-evident audit. Chain audit log events with cryptographic hashes; build a reviewer dashboard with structured drill-down.
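
The tamper-evident audit idea can be sketched as a simple hash chain, where each entry's hash covers the previous entry's hash. This is one possible design under stated assumptions, not something the prototype implements today; `append_event`, `verify_chain`, and the entry fields are hypothetical names.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, event: dict) -> str:
    # sort_keys makes the serialization stable so the hash is reproducible.
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> dict:
    """Append an audit event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(prev_hash, event)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any past event changes its hash, which no longer matches what the next entry recorded, so a reviewer can detect the change by re-running `verify_chain`. Detecting removal of the most recent entries would additionally require anchoring the latest hash somewhere outside the log.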

Built With

arize
artifact-registry
atlas
cloud-run
cobol
docker
dynatrace
ebcdic
faker
gemini
mongodb
openinference
opentelemetry
pytest
python
secret-manager
streamlit
uv
vertex-ai

Updates

TaheraAhmed Ahmed started this project — Jun 11, 2026 04:49 PM EDT
