Inspiration

Most fraud detection hackathon projects follow the same pattern: download a Kaggle dataset, train a classifier, submit. I wanted to ask a harder question — how do you test a fraud detector without real customer data, and how do you prepare it for attacks that haven't been invented yet? The answer hit me during a conversation about flight simulators. A pilot doesn't learn emergency landings by waiting for a real emergency. They practice in a safe, controlled simulation that throws every possible failure at them before they ever touch a real plane. Banks have no equivalent. They test fraud systems on static spreadsheets of historical data — the equivalent of training a pilot using old newspaper accounts of crashes. I built Ghost Protocol to be that flight simulator. Not another fraud detector. The infrastructure that makes fraud detectors better.

What it does

Ghost Protocol is a three-body adversarial simulation platform.

The Criminal Agent (Red Team): a large language model prompt-engineered to think like a sophisticated fraudster. It generates realistic attack sequences tailored to a target user's spending profile and adapts its strategy in real time when it gets caught. If Round 1's large transfers get blocked, Round 2 switches to hundreds of micro-transactions under $15. The attacker doesn't follow a fixed script; it reasons.

The Defender (Blue Team): either the user's own fraud detection API, submitted as a webhook endpoint, or Ghost Protocol's built-in Police AI. The Defender receives transactions with ground truth hidden; it only sees what a real fraud system would see.

The Referee (Ghost Protocol itself): the platform spins up the synthetic Ghost World, feeds transactions to both agents, scores every decision against ground truth, streams results live to the War Room dashboard via WebSocket, and generates a Post-Game Security Report identifying exactly which attack patterns the Defender missed.

The War Room dashboard shows a live transaction feed, a real-time Risk Meter (a 0–100 threat level), and a scoreboard tracking True Positives, False Negatives, money lost, and F1 score, all updating live. When the attacker adapts between rounds, a full-width banner fires with the attacker's reasoning. The Post-Game Report includes an executive summary, a Security Gap analysis listing every blind spot the Defender consistently missed, and five prioritized recommendations with code-level implementation hints.
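Conceptually, the three-body loop can be sketched in a few lines. This is an illustrative reduction, not the actual Ghost Protocol code: the agent interfaces and field names (`id`, `amount`, `is_fraud`) are assumptions for the sake of the example.

```python
def strip_ground_truth(wave):
    # The Defender only sees what a real fraud system would see:
    # ground-truth fields are removed before dispatch.
    return [{k: v for k, v in txn.items() if k not in ("is_fraud", "fraud_type")}
            for txn in wave]

def referee_score(wave, verdicts, scoreboard):
    """Score one round against ground truth; return caught attack IDs,
    which become the Criminal Agent's adaptation context."""
    caught = []
    for txn, flagged in zip(wave, verdicts):
        if txn["is_fraud"] and flagged:
            scoreboard["tp"] += 1
            caught.append(txn["id"])
        elif txn["is_fraud"]:
            scoreboard["fn"] += 1
            scoreboard["money_lost"] += txn["amount"]
        elif flagged:
            scoreboard["fp"] += 1
    return caught

def run_match(criminal, defender, rounds=3):
    scoreboard = {"tp": 0, "fn": 0, "fp": 0, "money_lost": 0.0}
    caught = []
    for _ in range(rounds):
        wave = criminal(caught)                        # Red Team adapts to what was caught
        verdicts = defender(strip_ground_truth(wave))  # Blue Team judges blind
        caught = referee_score(wave, verdicts, scoreboard)
    return scoreboard
```

The key structural point is that only the Referee's scoring function ever touches `is_fraud`; the Defender's input is stripped before it crosses the boundary.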

How we built it

Backend: Python 3.13 + FastAPI with async WebSocket support. Match state persists to Redis with a local JSON fallback. The simulation loop runs as a FastAPI BackgroundTask, so the HTTP response returns instantly and the match streams live.

AI Layer: Groq API with llama-3.1-8b-instant for real-time transaction evaluation (sub-200 ms per batch) and llama-3.3-70b-versatile for the Criminal Agent's adaptation logic and Post-Game Report generation. I batch all transactions in a round into a single API call, reducing LLM calls from 20+ per match down to 3–4 total.

Frontend: Next.js 14 with Tailwind CSS, Recharts for the scoreboard metrics, and React Simple Maps + D3.js for the live transaction world map. The WebSocket client reconnects with exponential backoff, so the dashboard stays live even if the connection drops mid-match.

The Adaptation Loop: the core mechanic is not a trained model; it is a growing conversation. After each round, the Criminal Agent receives a summary of which attacks were caught and which slipped through, then generates a new wave specifically designed to exploit the patterns the Defender missed. No fine-tuning. No LangChain. Just a prompt that gets smarter because it has more context:

Adaptation = f(previous_attacks, caught_ids, inferred_defender_rules)

The Scoring:

F1 = 2 · (Precision · Recall) / (Precision + Recall)

Threat Level = (False Negatives / Total Fraud Transactions) × 100
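The scoring formulas translate directly into code. A minimal sketch, with function names of my own choosing:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from a round's confusion counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def threat_level(false_negatives: int, total_fraud: int) -> float:
    """0-100 share of fraudulent transactions that slipped past the Defender."""
    return 100.0 * false_negatives / total_fraud if total_fraud else 0.0
```

A Defender with 8 true positives, 2 false positives, and 2 false negatives scores precision 0.8, recall 0.8, and therefore F1 = 0.8, with a threat level of 20.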

Challenges we ran into

The orchestration gap. I built all the individual pieces (Criminal Agent, Defender interface, Referee engine, WebSocket layer), but wiring them into a single running simulation loop was harder than expected. The match orchestrator had to handle round transitions, state persistence, concurrent WebSocket emission, and graceful error recovery all at once.

Rate limits. Using one LLM call per transaction evaluation hit free-tier limits almost immediately during testing. I solved this by migrating to Groq and redesigning the Police AI to evaluate an entire round's transactions in a single batched API call.

Ground truth isolation. The Defender must never see is_fraud or fraud_type; those fields are the ground truth that Ghost Protocol holds as referee. I built an explicit allowlist payload builder in the dispatcher rather than using an exclude filter, which is safer and more auditable. Getting this right was critical: a leak would make the entire simulation meaningless.

Making the adaptation visually obvious. The mock adaptation needed to produce noticeably different transactions between rounds so the demo story was clear even without a live LLM. I engineered the mock botnet persona to switch from large transfers to micro-purchases when adaptation triggers, creating a dramatic, visible shift in the transaction feed.

Building the entire stack solo. Designing the architecture, writing the backend, building the frontend, wiring up the AI agents, and preparing the demo, alone and under a 36-hour deadline, meant every technical decision had to be deliberate. There was no time to go back.
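The allowlist-over-exclude distinction is worth making concrete. In a sketch like the one below (field names other than is_fraud and fraud_type are illustrative assumptions), any new internal field added later is hidden by default, whereas an exclude filter silently leaks anything you forget to list:

```python
# Fields the Defender is explicitly permitted to see.
DEFENDER_VISIBLE_FIELDS = ("id", "amount", "merchant", "timestamp", "country")

def build_defender_payload(txn: dict) -> dict:
    """Allowlist payload builder: copies only approved fields.
    Ground truth (is_fraud, fraud_type) can never leak because it is
    never in the allowlist, even if new internal fields are added later."""
    return {k: txn[k] for k in DEFENDER_VISIBLE_FIELDS if k in txn}
```

Auditing this is a one-line check: read the allowlist tuple. Auditing an exclude filter means proving a negative over every field the transaction object might ever carry.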

Accomplishments that we're proud of

The adaptation moment works. Watching the attacker switch from large wire transfers to floods of $9 Steam Wallet purchases in real time, and seeing the Defender's recall drop in the same session, is exactly the demo I set out to build. It tells the story with no explanation needed.

Zero ground truth leakage. The Defender never sees is_fraud. The Referee is the only component that knows which transactions are fraudulent. This is a real security property, not just good practice.

Full offline operation. The entire platform (simulation, adaptation, War Room dashboard, Post-Game Report) runs completely without any API key. The mock mode is realistic enough that, from a user's perspective, the experience is identical to the live LLM version.

50 passing tests. Every core component has unit test coverage. I didn't skip tests under hackathon pressure.

Production-grade architecture built solo in 36 hours. Redis primary with a JSON fallback. A thread-safe WebSocket connection manager. An orchestrator that pauses gracefully on errors rather than corrupting state. A frontend that reconnects automatically if the WebSocket drops.

What we learned

The hardest part of building an AI product is not the AI; it is the infrastructure around it. The LLM prompt for the Criminal Agent took about 20 minutes to write. The orchestrator that reliably runs it, persists state, handles errors, streams events to a frontend, and generates a post-match report took significantly longer.

I also learned that "adversarial" is a better frame than "generative" for this problem. The interesting question is not whether an AI can generate fake transactions; it obviously can. The interesting question is whether an AI can find the blind spots in a system designed to stop it. That framing changed how I designed every component.

Batching LLM calls is not optional at hackathon scale. One call per transaction is fine for a demo with 10 transactions. It breaks immediately when you run 3 rounds of 10 transactions back to back. Designing for batch evaluation from the start would have saved hours.
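The arithmetic behind the batching lesson is stark. A back-of-envelope helper (my own framing; it deliberately omits the handful of fixed calls for adaptation and report generation):

```python
def llm_calls(rounds: int, txns_per_round: int, batched: bool) -> int:
    """Evaluation calls needed for a match: one per transaction when
    unbatched, one per round when the whole round goes in a single prompt."""
    return rounds if batched else rounds * txns_per_round
```

For the match shape described above, 3 rounds of 10 transactions means 30 unbatched calls, well past a typical free-tier rate limit, versus 3 batched calls.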

What's next for Ghost Protocol

Multi-round persistent learning. Right now the Criminal Agent resets between matches. The next version maintains a cross-match memory: attacks that worked in previous sessions are tried first in new ones, creating a continuously evolving adversary.

Bring-your-own-defender SDK. A Python package that lets developers point Ghost Protocol at their fraud detection function with a single decorator and run a full red-team simulation in CI.

Scenario marketplace. Named, reusable attack scenarios ("The Midnight Sweep," "The Ghost Army," "The Long Con") that teams can share, version, and run against their systems on a schedule. Every completed match becomes a regression test.

Enterprise integration. Banks running real fraud detection systems need a sandboxed environment with production-parity data volumes. Ghost Protocol's architecture maps directly onto what compliance teams need for pre-deployment red-teaming.

Real bank API integration. Partner with Plaid or a similar provider to generate synthetic users with realistic transaction histories, making the Ghost World indistinguishable from real data for testing purposes.
