FraudCourt: The AI Judicial System for Fraud Detection
Inspiration
Fraud detection today is a black-box binary classifier. It says "fraud" or "not fraud," then sends the cardholder a robotic SMS after the money is gone. We wanted a system that does three things banks don't:
- Shows its reasoning. Every blocked charge should come with an argument, not a score.
- Meets the user where they are. A clear human voice call is infinitely more useful than a cryptic alert at 3 AM.
- Handles the aftermath. If you block fraud, draft the dispute — don't leave the cardholder with another form to fill out.
So we built a court instead of a classifier.
Zero External LLM APIs
Every model in this system runs on our own hardware. No OpenAI, no Anthropic, no hosted inference, no managed vector DB, no external ranking service. The only third-party API we call is ElevenLabs for text-to-speech — because we're not in the business of training our own voice synthesizer.
Everything else — the XGBoost classifier, the sequence transformer, and the five 72B judges — is self-hosted on a single AMD MI300X. The point isn't purity. The point is that a fraud-detection product that leaks every transaction to a third-party LLM is a non-starter for any bank. FraudCourt is deliberately built so a financial institution can run it entirely inside their own perimeter.
What It Does
FraudCourt is a three-tier pipeline. Cheap models clear the easy cases; the court takes the edge cases:
- Tier 1 — XGBoost: Trained on the Kaggle "creditcardfraud" dataset (284,807 transactions). Charges with fraud probability below 0.30 are auto-approved; above 0.95, auto-blocked. Everything in between escalates.
- Tier 2 — Sequence Transformer: A 6.5 million parameter model that reads the cardholder's prior transactions as a length-50 sequence and attends over them to flag behavioral anomalies a point-in-time model would miss.
- Tier 3 — AI Court: Five specialized judges — Pattern, Geography, Temporal, Merchant, and Chief — deliberate in turn. Each judge builds on the previous one's reasoning. The Chief Judge synthesizes and rules.
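The routing above can be sketched as a simple threshold gate. This is a minimal sketch using the stated Tier 1 thresholds; the function name and return labels are illustrative:

```python
def route(xgb_prob: float) -> str:
    """Tier 1 gate: auto-approve below 0.30, auto-block above 0.95.

    Everything in the ambiguous band goes on to Tier 2 and, if still
    uncertain, to the five-judge court.
    """
    if xgb_prob < 0.30:
        return "APPROVE"   # cheap XGBoost clears the easy majority
    if xgb_prob > 0.95:
        return "BLOCK"     # obvious fraud blocked without the court
    return "ESCALATE"      # edge case: Tier 2 / Tier 3 take it from here
```

The asymmetric band (0.30–0.95) keeps recall high: a false approve is far costlier than an extra trip through the court.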
When the court says BLOCK, the system:
- Places a voice call in the Chief Judge's voice (per-judge ElevenLabs TTS on every turn).
- Auto-drafts a dispute letter referencing the court's findings.
- Exposes an "Ask a Judge" follow-up where the cardholder can interrogate any specialist in plain English.
How We Built It
Architecture
- Transaction Ingestion
- XGBoost (CPU, ~ms): Self-hosted.
- Sequence Transformer (GPU): Self-hosted.
- 5 x Qwen-2.5-72B (GPU): Self-hosted on MI300X.
- BLOCK + Voice Call + Dispute Letter: Driven by ElevenLabs TTS.
Tier 2 in Detail
The Kaggle dataset has no user IDs, so we clustered transactions into 2,000 pseudo-users via MiniBatch KMeans on the first ten PCA components (V1 through V10). We then streamed each user's transactions as a length-50 sequence into a Transformer encoder.
The prediction head is a single linear layer on the encoder's last hidden state: `p = sigmoid(W · h_last)`.
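A minimal PyTorch sketch of this architecture. The hyperparameters here (d_model, heads, layers) are illustrative stand-ins, not the real 6.5M-parameter configuration; the head matches the description, sigmoid of a linear map of the last hidden state:

```python
import torch
import torch.nn as nn

class SequenceFraudModel(nn.Module):
    """Transformer encoder over a length-50 transaction sequence."""
    def __init__(self, n_features: int = 12, d_model: int = 64,
                 n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)   # W in p = sigmoid(W @ h_last)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len=50, n_features)
        h = self.encoder(self.proj(x))      # (batch, 50, d_model)
        h_last = h[:, -1, :]                # last hidden state
        return torch.sigmoid(self.head(h_last)).squeeze(-1)
```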
We normalized features to match the scale of the existing PCA columns:
- Amount: `(log(1 + Amount) - log_mean) / log_std`
- Time: `(Time - time_mean) / time_std`
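In code, the z-scoring above looks like this (a sketch; function and variable names are illustrative, and the statistics must come from the training split only):

```python
import numpy as np

def fit_norm_stats(train_amount: np.ndarray, train_time: np.ndarray) -> dict:
    """Compute normalization statistics on the training set only."""
    log_amount = np.log1p(train_amount)   # log(1 + Amount)
    return {
        "log_mean": log_amount.mean(), "log_std": log_amount.std(),
        "time_mean": train_time.mean(), "time_std": train_time.std(),
    }

def normalize(amount: np.ndarray, time: np.ndarray, s: dict):
    """Scale Amount and Time to match the existing PCA columns."""
    amount_z = (np.log1p(amount) - s["log_mean"]) / s["log_std"]
    time_z = (time - s["time_mean"]) / s["time_std"]
    return amount_z, time_z
```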
Results: Final model achieved ROC-AUC 0.96 and Average Precision 0.83.
Tier 3: The Judges
Five Qwen-2.5-72B judges share 135 GB of VRAM on the AMD MI300X — no quantization, no offloading. Each deliberation is six sequential vLLM calls, so every judge sees the arguments made before its turn.
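The sequential loop can be sketched as below, assuming vLLM's OpenAI-compatible chat endpoint on localhost. The prompt wording, the endpoint URL, and the structure of the sixth call (the Chief synthesizing a final ruling) are our assumptions:

```python
JUDGES = ["Pattern", "Geography", "Temporal", "Merchant", "Chief"]
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint

def ask_vllm(prompt: str) -> str:
    """One blocking call to the self-hosted vLLM OpenAI-compatible server."""
    import httpx  # project HTTP client; imported lazily so stubs need no deps
    resp = httpx.post(VLLM_URL, timeout=120.0, json={
        "model": "Qwen/Qwen2.5-72B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

def deliberate(case: str, ask=ask_vllm):
    """Six sequential calls: five judge opinions, then the Chief's ruling."""
    transcript = []
    for judge in JUDGES:
        prior = "\n".join(f"{t['judge']}: {t['opinion']}" for t in transcript)
        prompt = (f"You are the {judge} Judge in a fraud court.\n"
                  f"Case: {case}\n"
                  f"Opinions so far:\n{prior or '(none yet)'}\n"
                  f"Give your opinion on whether this charge is fraud.")
        transcript.append({"judge": judge, "opinion": ask(prompt)})
    full = "\n".join(f"{t['judge']}: {t['opinion']}" for t in transcript)
    ruling = ask("As Chief Judge, synthesize these opinions into a final "
                 f"APPROVE or BLOCK ruling:\n{full}")
    return transcript, ruling
```

Because each prompt embeds the running transcript, later judges genuinely respond to earlier arguments rather than voting independently.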
Challenges We Ran Into
- VRAM Leakage: The MI300X occasionally leaked VRAM after vLLM crashes, requiring full pod recreations. We implemented a silent-fallback scripted layer so the UI never broke during hardware issues.
- Transformer Cold Starts: Early on, Tier 2 returned 0.0 for everything due to a normalization mismatch (raw vs log-z-scored values). We fixed this by recomputing the mean and standard deviation from the training set at load time.
- Connection Stability: Standard tunnels dropped connections during long 72B inference runs. We swapped to Cloudflare Tunnels, which handled the 15-plus second streams without issue.
What We Learned
- Cheap models should do most of the work. Routing on ML scores cut expensive court invocations roughly tenfold.
- Self-hosting at scale is possible. 135 GB of model weights resident in VRAM is what allows for true "deliberation" rather than just simple voting.
- Mock-mode is a feature. Having scripted judges as a fallback ensured a demo-ready UI that was robust against GPU availability.
What's Next
- Harden the real-LLM path to eliminate cold-starts.
- Extend "Ask-a-Judge" into a full conversational assistant.
- Enrich Tier 1 with features like device fingerprinting and merchant embeddings.
Built With
- 2.5
- 72b
- amd
- cloud
- cloudflare
- css
- developer
- elevenlabs
- fastapi
- framer
- git
- github
- httpx
- huggingface
- jupyter
- kaggle
- mi300x
- motion
- next.js
- pydantic
- python
- pytorch
- qwen
- react
- rocm
- scikit-learn
- serveo
- tailwind
- tmux
- tunnel
- typescript
- uvicorn
- vllm
- xgboost