RingFence — the AML triage screen that knows when not to guess
Track 02 · Fraud Watch — for "the analyst with three minutes per case." Four agents. One shared Cognee memory. A real PyMC Bayesian brain. Every decision shows its math; the one genuinely ambiguous case is refused, not guessed.
⚡ TL;DR for judges
RingFence ingests a bank's raw transaction file, runs a four-agent Find → Rank → Act → Explain pipeline over Cognee, and in under a second turns ~300 accounts into a triage queue of 14 — 9 escalate, 1 to human review, 4 cleared — with ring exposure reconciled to the cent at $161,750.90 and a one-click, regulator-ready SAR memo.
What makes it different from every "fraud score" tool:
- 🧠 A genuine probabilistic brain (PyMC). Agent 2 is an unsupervised two-component Bayesian mixture sampled with NUTS — not a logistic regression with magic weights. It returns a full posterior with credible intervals, which is what lets the system express honest uncertainty.
- 🤝 Provable agent collaboration (Cognee). Each agent accretes fields onto one shared Case node; downstream agents raise an error if the upstream fields are missing. The handoff isn't a vibe; it's a hard dependency.
- 🙅 Calibrated abstention. When the posterior credible interval straddles the decision threshold, RingFence routes the case to a human because the expected value of information says it pays for itself: the one move a bare score cannot make.
- 🪤 Decoy resistance by construction. The planted "shared-device" trap that fools threshold tools is mathematically prevented from flagging, via a deliberately skeptical prior.
- 🔁 A self-learning loop. Every run compiles a "closing rule" from the signals that actually drove escalations — a deployable query that catches the next ring automatically.
🎯 The Product Brief we built against (Step 0)
User: an AML analyst at a community bank with ~3 minutes per case. Job: find the coordinated ring hiding below every alert threshold, without drowning in false positives. Success condition (our own bar):
- Recover all 9 ring accounts (escalate).
- Clear the planted decoy (don't fall for shared-device co-occurrence).
- Abstain on the one genuinely ambiguous account instead of guessing.
- Reconcile the dollars to the cent.
- Every decision carries a human-readable reason — no bare scores.
We hit all five, and we wrote a test suite that enforces them on every commit (more below). That makes "matches the brief" a verifiable claim.
🕵️ What it does (the story)
Crestline Community Bank hands you a 90-day file: ~5,000 transactions across ~300 accounts. Somewhere inside is a layering ring built specifically to never cross a monitoring threshold — your rules caught none of it.
You drop the CSV into RingFence. Four agents run over one shared Cognee memory and collapse the noise to a focused queue. Then the three product screens tell the whole story:
Screen 1 — The Queue
The 14 surfaced accounts, each with its mule probability and uncertainty band, color-coded ESCALATE / REVIEW / CLEAR. Everything else was auto-cleared. Headline: 9 escalate · 1 review · 4 cleared · $161,750.90 ring exposure, reconciled.
Screen 2 — Case detail
Open AC-0009 (a relay): a posterior plot at p = 0.999 with a tight credible interval far above the threshold τ = 0.05; the signals that fired; the expected-loss arithmetic behind the ESCALATE; the typology; and the source→relay→sink chains it sits in. No bare score anywhere.
- The decoy beat: AC-0045 shares a device with three accounts — the obvious flag. RingFence holds it at p = 0.005 and CLEARS it. It didn't fall for the trap.
- The abstention beat: AC-0012 is fresh-cohort like the ring but has no transfers — genuinely ambiguous. p = 0.057, CI [0.004, 0.458] straddles τ. RingFence computes that a review pays for itself and routes to REVIEW instead of guessing.
Screen 3 — Pipeline view (the magic)
The same Case object shown after each agent touches it — you literally watch the fields accrete: the Detector wrote signals; the Estimator added p_mule + credible interval; the Adjudicator added the action + the expected-loss math; the Reporter added the memo + the learned rule. Then click Build Cognee Graph and ask it in plain English — "which accounts are relays in the ring?" — and it answers from the knowledge graph.
One more click downloads the SAR memo: FinCEN who/what/when/where/why/how, reconciled edge-by-edge to $161,750.90, with the closing rule appended.
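Reconciling "to the cent" is easiest to reason about with exact decimal arithmetic. The sketch below assumes each ring edge is a (source, destination, amount) triple. The account ids and amounts are made-up placeholders, not the Crestline figures, and `reconcile` is a hypothetical helper, not the Reporter's actual code.

```python
from decimal import Decimal

# Hypothetical ring edges: (from_account, to_account, amount as a string).
# Placeholder values for illustration only.
EDGES = [
    ("AC-A", "AC-B", "1200.10"),
    ("AC-B", "AC-C", "0.20"),
    ("AC-A", "AC-C", "349.95"),
]

def reconcile(edges, reported_total):
    """Sum edge amounts exactly and compare with the reported exposure."""
    total = sum((Decimal(amount) for _, _, amount in edges), Decimal("0"))
    return total, total == Decimal(reported_total)

total, ok = reconcile(EDGES, "1550.25")
print(total, ok)
```

Parsing amounts as `Decimal` from strings avoids binary floating-point rounding, which matters when the claim is an exact match: in Python, `0.1 + 0.2 == 0.3` is `False`.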
🧠 PyMC: a real probabilistic brain, not a scorecard
This is the part we're proudest of, and it's why the system can do things a classifier can't.
Agent 2 (the Estimator) is an unsupervised two-component Bayesian mixture. There are no labels in the data, so we don't "train on known mules." Instead we posit two latent classes — legit and mule — each with its own vector of signal fire-rates φ, and we let PyMC's NUTS sampler (via nutpie, 4 chains) infer everything: the rates, the mixing weight, and every account's posterior probability of membership.
The generative model, in plain terms:
- pi ~ Beta(1, 9): mules are rare (a strong, honest prior).
- phi_mule ~ Beta(4, 1), phi_legit ~ Beta(1, 4): starting beliefs only; the real fire-rates are learned from the data.
- Each fired signal is a Bernoulli draw at its class's rate (product-of-Bernoullis likelihood).
Three pieces of real Bayesian engineering make it faithful — and they are exactly what produce the demo's best moments:
1. logsumexp marginalizes the hidden class analytically
NUTS is gradient-based and cannot sample discrete latent variables. So instead of sampling each account's class, we integrate it out in log-space:
log_mix = [ log(pi) + ll_mule , log(1-pi) + ll_legit ]
Potential = sum( logsumexp(log_mix) ) # exact marginal log-likelihood
Fed to pm.Potential, this gives the sampler a smooth, fully-differentiable objective. This is the textbook-correct way to fit a mixture under HMC — and we verify it converged by reporting R-hat and divergence counts on every run.
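To make the marginalization concrete, here is a pure-Python sketch for a single account. It mirrors the two-line pseudocode above but is not the PyMC model itself. The helper names and the numeric inputs are illustrative assumptions.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def marginal_loglik(pi, ll_mule, ll_legit):
    """Log-likelihood of one account with the hidden class summed out."""
    return logsumexp([math.log(pi) + ll_mule, math.log(1.0 - pi) + ll_legit])

def posterior_mule(pi, ll_mule, ll_legit):
    """P(class = mule | signals), recovered from the same two terms."""
    log_joint_mule = math.log(pi) + ll_mule
    return math.exp(log_joint_mule - marginal_loglik(pi, ll_mule, ll_legit))

# Hypothetical per-class log-likelihoods for a single account.
print(posterior_mule(pi=0.1, ll_mule=-2.0, ll_legit=-6.0))
```

Subtracting the largest term before exponentiating keeps the sum finite even when both log-likelihoods are very negative, which is why the log-space form is preferred over multiplying raw probabilities.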
2. A masked likelihood: missing ≠ innocent
A pure sink account physically cannot fire signals like automation or fresh_cohort (it never originates, it may have no open date). A naive model would read those absent signals as evidence of innocence. We don't:
ll_k = sum( M * [ X*log(phi_k) + (1-X)*log(1-phi_k) ] )
The applicability mask M zeroes out structurally-inapplicable signals per account — they're treated as missing data, not negative evidence. This single element-wise multiply is why sinks stay confident and why AC-0012 (only one applicable mule-signal) comes back honestly uncertain with a wide, τ-straddling interval.
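The effect of the mask can be seen with a tiny pure-Python version of the formula above. The three fire rates and the masks are hypothetical values chosen for illustration, and `masked_loglik` is a sketch, not the project's function.

```python
import math

def masked_loglik(x, mask, phi):
    """Sum Bernoulli log-probabilities over applicable signals only."""
    total = 0.0
    for fired, applicable, rate in zip(x, mask, phi):
        if not applicable:
            continue  # structurally missing: contributes nothing either way
        total += math.log(rate) if fired else math.log(1.0 - rate)
    return total

# Hypothetical fire rates for three signals under the mule class.
phi_mule = [0.8, 0.7, 0.9]
x = [1, 0, 0]               # which signals fired
all_applicable = [1, 1, 1]  # naive reading: every silent signal counts
sink_mask = [1, 0, 0]       # last two signals cannot fire for this account

print(masked_loglik(x, all_applicable, phi_mule))
print(masked_loglik(x, sink_mask, phi_mule))
```

With the mask applied, the two silent signals no longer pull the mule log-likelihood down, so the account is judged only on evidence it could actually have produced.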
3. A skeptical decoy prior
device_shared is given the same weak prior in both classes, making identity co-occurrence non-discriminating by construction. The planted decoy cannot be driven to a flag by the math — decoy resistance is a property of the model, not a special-case if statement.
What PyMC hands downstream
For each account: posterior-mean p_mule, a 94% highest-density credible interval (_hdi94), and a set of learned log-Bayes-factors (log φ_mule − log φ_legit) per fired signal — the per-decision "why" the Adjudicator and Reporter consume. Uncertainty is a first-class output, not an afterthought. (This is the work we'd put forward for the PyMC Special Prize.)
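As a rough illustration of the per-signal "why", the sketch below computes log φ_mule − log φ_legit for each fired signal. The fire rates are hypothetical stand-ins rather than values from an actual run. `device_shared` is deliberately given equal rates in both classes to show what a non-discriminating signal looks like.

```python
import math

# Hypothetical posterior-mean fire rates; not values from the actual run.
PHI_MULE = {"fresh_cohort": 0.80, "automation": 0.60, "device_shared": 0.20}
PHI_LEGIT = {"fresh_cohort": 0.10, "automation": 0.05, "device_shared": 0.20}

def signal_contributions(fired):
    """Log-Bayes-factor log(phi_mule) - log(phi_legit) for each fired signal."""
    return {s: math.log(PHI_MULE[s]) - math.log(PHI_LEGIT[s]) for s in fired}

contrib = signal_contributions(["fresh_cohort", "device_shared"])
for name, value in sorted(contrib.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:+.3f}")
```

A signal whose rates are equal in both classes contributes exactly zero, so on its own it cannot move an account toward a flag.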
🕸️ Cognee: the memory that makes the collaboration real
Cognee isn't a logging sink we bolted on — it's the substrate the entire pipeline runs over, in two complementary layers.
Layer 1 — the operational handoff store
Every account is one Case::AC-#### node. As each agent runs, it accretes new fields onto that same node, and every read/write is logged with its entity id. The handoff is a provable data dependency: the Estimator raises ValueError if the Detector's signals/dist_stats are absent; the Adjudicator raises if the Estimator's posterior is absent. Agent N+1 demonstrably uses Agent N's output — or it refuses to run. (This is the rubric's "Real collaboration" criterion, made literal.)
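The hard-dependency idea can be sketched with a plain dictionary standing in for the shared store. Everything here is an assumption made for illustration: the `STORE` dict, the function names, and the example field values are hypothetical, and the real pipeline reads and writes through Cognee.

```python
# In-memory stand-in for the shared store; the real pipeline uses Cognee.
STORE = {}

def detector_write(account_id, signals, dist_stats):
    """Agent 1 creates the Case node and writes its fields onto it."""
    STORE.setdefault(f"Case::{account_id}", {}).update(
        signals=signals, dist_stats=dist_stats
    )

def estimator_run(account_id):
    """Agent 2 refuses to run unless Agent 1's fields are present."""
    case = STORE.get(f"Case::{account_id}", {})
    missing = [k for k in ("signals", "dist_stats") if k not in case]
    if missing:
        raise ValueError(f"{account_id}: missing upstream fields {missing}")
    return case["signals"]  # a real Estimator would fit the model from these

try:
    estimator_run("AC-0001")
except ValueError as err:
    print("refused:", err)

detector_write("AC-0001", {"fresh_cohort": True}, {"cutoff_days": 1})
print(estimator_run("AC-0001"))
```

Because the check runs before any computation, a skipped or failed upstream agent surfaces as an immediate error, not as a silently defaulted value.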
Layer 2 — the semantic knowledge graph
At the end of a run, each of the four agents contributes its own tagged layer to Cognee via add(node_set=["quorum", "agent:detector" …]), and a single cognify() builds a Gemini-backed knowledge graph carrying full multi-agent provenance — not just the final report, but who-found-what at every stage. The graph is then queryable in natural language: "Which accounts are relays?" "Which were cleared as decoys and why?" "What typology and total exposure did the agents find?" — answered from the graph, persisted to cognee_graph.json so the UI renders it instantly.
Why the knowledge graph matters:
- Cross-agent provenance & auditability — a regulator can trace any conclusion back through the exact agent layers that produced it. AML is a compliance domain; a queryable, provenance-carrying memory is a genuine product advantage, not a demo trick.
- Natural-language investigation — the analyst interrogates the whole investigation in English instead of writing SQL against four disjoint outputs.
- Relationship surfacing — source→relay→sink ring structure lives in the graph, so the topology is discoverable, not buried in tables.
- Graceful degradation — the SDK layer activates with a BYO key and degrades gracefully to the fast local store if absent, so the pipeline is never blocked. (Keys stay out of the repo.)
🏗️ Agent architecture & the exact handoffs
A real Find → Rank → Act → Explain pipeline — four specialists, not one LLM in a loop. Data flows only through Cognee; nothing is passed agent-to-agent in memory.
Crestline CSV → DuckDB (in-process) → COGNEE (shared Case nodes)
│
▼
1. DETECTOR ──signals, dist_stats──▶ 2. ESTIMATOR ──p_mule, credible_interval──▶
3. ADJUDICATOR ──action, EVPI, decisive_signals──▶ 4. REPORTER ──memo, closing_rule
| Agent | Reads from Cognee | What it computes | Writes to Cognee (the handoff) |
|---|---|---|---|
| 1 · Detector | raw txns (DuckDB) | Isolates the AC→AC transfer graph structurally; derives source/relay/sink roles; fires 7 behavioral signals; learns the "fresh-cohort" cutoff from the data (no magic 30 days); tags shared-device decoys | signals, empirical dist_stats |
| 2 · Estimator | signals, dist_stats | PyMC mixture, NUTS (above) | p_mule, credible_interval, signal_contributions |
| 3 · Adjudicator | p_mule, credible_interval, signal_contributions | Bayesian decision theory: τ = C_FP/(C_FP+C_FN); expected-loss argmin; abstains when the interval straddles τ and EVPI > review cost | action, E_loss_escalate, E_loss_clear, EVPI, decisive_signals |
| 4 · Reporter | the fully enriched Case set + transfer graph | edge-by-edge dollar reconciliation; FinCEN SAR memo; compiles the closing rule | typology, dollar_contribution, closing_rule, memo_ref |
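The Detector row above mentions source/relay/sink roles. One simple way to derive them, assumed here and not taken from the project's code, is from transfer direction alone: accounts that only send are sources, accounts that only receive are sinks, and accounts that do both are relays. The account ids are placeholders.

```python
def derive_roles(edges):
    """Classify accounts by whether they send, receive, or do both."""
    senders = {src for src, _ in edges}
    receivers = {dst for _, dst in edges}
    roles = {}
    for account in senders | receivers:
        if account in senders and account in receivers:
            roles[account] = "relay"
        elif account in senders:
            roles[account] = "source"
        else:
            roles[account] = "sink"
    return roles

# Hypothetical two-edge chain; account ids are placeholders.
print(derive_roles([("AC-A", "AC-B"), ("AC-B", "AC-C")]))
```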
Decisions are deterministic (a fixed seed + argmin), so the system is reproducible and never says "the model said so."
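The Adjudicator's arithmetic can be worked through for AC-0012. This sketch rests on three assumptions: a cost ratio C_FN = 19 × C_FP (chosen only because it reproduces τ = 0.05), an illustrative review cost, and EVPI defined as the value of learning the account's true class. The actual cost matrix and EVPI computation may differ.

```python
def adjudicate(p_mule, interval, c_fp, c_fn, review_cost):
    """Return (action, details) under a two-action cost matrix."""
    tau = c_fp / (c_fp + c_fn)             # escalating is cheaper when p_mule > tau
    loss_escalate = (1.0 - p_mule) * c_fp  # expected cost of flagging a legit account
    loss_clear = p_mule * c_fn             # expected cost of clearing a mule
    # Assumed EVPI: a perfect oracle makes the right call at zero loss, so the
    # value of information equals the best expected loss achievable today.
    evpi = min(loss_escalate, loss_clear)
    low, high = interval
    if low < tau < high and evpi > review_cost:
        action = "REVIEW"
    elif loss_escalate < loss_clear:
        action = "ESCALATE"
    else:
        action = "CLEAR"
    return action, {"tau": tau, "E_loss_escalate": loss_escalate,
                    "E_loss_clear": loss_clear, "EVPI": evpi}

# Costs and review_cost are illustrative assumptions; p_mule and the interval
# are the AC-0012 figures quoted earlier.
print(adjudicate(0.057, (0.004, 0.458), c_fp=1.0, c_fn=19.0, review_cost=0.25))
```

Under these assumed costs the point estimate sits just above τ, so a bare argmin would escalate. The interval straddling τ, combined with an EVPI above the review cost, is what routes the case to a human instead.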
🔁 The self-learning loop
RingFence doesn't just decide — it gets smarter about this institution every run:
- The Detector learns its thresholds from the data distribution (e.g., the fresh-cohort cutoff is the largest age-gap in the youngest decile — not a hard-coded constant).
- The Estimator learns the class-conditional signal fire-rates from the population via the posterior — the "what does a mule look like here" model updates with the data.
- The Adjudicator's behavior is governed by an explicit, institution-tunable cost matrix (C_FN ≫ C_FP, the AML reality) — change the costs, the threshold τ and every action update automatically.
- The Reporter compiles a "closing rule" from the decisive_signals that actually drove the escalations: a deployable FLAG AC→AC transfers WHERE … query, learned from this ring to catch the next one. Every run hardens the institution's detection net.
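The data-driven fresh-cohort cutoff in the first bullet can be read in more than one way. The sketch below takes one plausible reading, which is an assumption: sort accounts by age, look only at the youngest 10%, and place the cutoff in the middle of the widest gap between consecutive ages. The function name and the sample ages are hypothetical.

```python
def fresh_cohort_cutoff(ages_days):
    """Place the cutoff in the widest gap among the youngest 10% of accounts."""
    ordered = sorted(ages_days)
    k = max(2, len(ordered) // 10)  # size of the youngest decile (at least 2)
    youngest = ordered[:k]
    gaps = [(b - a, a, b) for a, b in zip(youngest, youngest[1:])]
    _, below, above = max(gaps)     # widest gap; ties go to the later pair
    return (below + above) / 2.0

# Hypothetical account ages in days: a tight young cluster, then everyone else.
ages = [3, 4, 5, 60, 62] + list(range(100, 135))
cutoff = fresh_cohort_cutoff(ages)
print(cutoff, [a for a in ages if a < cutoff])
```

The point of the design is that the boundary follows the shape of the institution's own age distribution, so a different bank with a different onboarding pattern would get a different cutoff without any code change.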
Roadmap: close the loop fully — feed resolved analyst dispositions back as labels to update the priors and the cost matrix online, turning each reviewed case into training signal.
✅ How we map to the 25-point rubric
| Criterion (5 pts each) | How RingFence nails it |
|---|---|
| Agents that work — "ran on real data, outputs not hardcoded" | Runs on the real Crestline CSV via DuckDB; a ground-truth oracle is kept strictly separate from the detection logic and tests prove the pipeline rediscovers the answers from data, not from the key. |
| Real collaboration — "Agent N+1 used Agent N's findings via Cognee" | Field-accretion on shared Case nodes; downstream agents raise if upstream fields are missing. Provable, not narrated. |
| Matches brief — "judged against your own Step 0" | All five success conditions met and enforced by a green test suite on every commit. |
| End-user usability — "a judge operates it cold" | Three-screen product: Queue → Case detail → Pipeline view, plus a one-click downloadable SAR memo and English-language graph search. Drop a CSV, get a queue. |
| Explainability — "every decision has a visible reason" | Posterior + credible interval + expected-loss arithmetic + decisive signals on every case; the SAR memo phrases (never invents) the computed facts. |
🛠️ How we built it
- Python 3.14, uv-managed. Run the pipeline: uv run main.py data/track02_fraud_watch.csv.
- DuckDB: in-process analytical SQL over the transactions; zero server.
- PyMC + nutpie + ArviZ — the Bayesian mixture, NUTS sampling, and convergence diagnostics.
- Cognee (Gemini-backed) — shared memory + the cognified knowledge graph and NL search.
- Streamlit — the three-screen analyst product.
- Geodo — Domain-Expert web research grounding the SAR memo's real-world precedents (Liberty Reserve layering, FinCEN structuring advisory FIN-2014-A005, FATF account-cluster typologies) and the 31 CFR § 1020.320 filing thresholds.
- Trupeer — the demo video.
- BYO API key — keys never touch the repo; every external layer degrades gracefully without one.
🧪 Proof it works
uv run pytest -q runs the full pipeline against the data and goes green:
- all 9 ring accounts escalated,
- the 4 decoys cleared,
- AC-0012 routed to review,
- dollars reconciled to the cent,
- and dist_stats/posteriors validated against the separate ground-truth oracle the agents are forbidden to read.
🚧 Challenges we solved
- Fitting a mixture under NUTS: discrete latent classes break HMC; the logsumexp marginalization + pm.Potential formulation made it sample cleanly (R-hat checked).
- Structurally-missing signals: the masked likelihood was the unlock for both confident sinks and honest abstention.
- Keeping agents honest — separating the ground-truth oracle from the detection code so the pipeline earns its answers, and the tests can prove it.
- One async event loop for Cognee: buffering each agent's layer and flushing in a single loop with one cognify(), so async DB connections never bind to dead loops.
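The buffer-then-flush pattern from the last bullet can be sketched with the standard library alone. `fake_add` and `fake_cognify` are hypothetical stand-ins, not the Cognee API. They exist only to show that every async call runs inside a single event loop started once.

```python
import asyncio

BUFFER = []  # (agent_tag, text) pairs collected while the pipeline runs

def buffer_layer(agent_tag, text):
    """Synchronous agents only append; no event loop is touched here."""
    BUFFER.append((agent_tag, text))

async def fake_add(text, node_set):
    """Stand-in for an async memory write (hypothetical, not the Cognee API)."""
    await asyncio.sleep(0)
    return (tuple(node_set), text)

async def fake_cognify(written):
    """Stand-in for the single graph-building call (hypothetical)."""
    await asyncio.sleep(0)
    return len(written)

async def flush():
    """Write every buffered layer, then build the graph once, in one loop."""
    written = [await fake_add(text, [f"agent:{tag}"]) for tag, text in BUFFER]
    return await fake_cognify(written)

for tag in ("detector", "estimator", "adjudicator", "reporter"):
    buffer_layer(tag, f"{tag} findings")
print(asyncio.run(flush()))
```

Calling `asyncio.run` once at the end, instead of once per agent, means any connection opened during the flush lives and dies with the same loop.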
🔮 What's next
Streaming ingestion for live monitoring · the closed analyst-feedback learning loop · graph-native typology detection that surfaces emerging rings before they complete · multi-institution federated typology sharing via Cognee.
🏁 Closing line
Every other tool ranks risk — and most just flag the decoy. RingFence surfaces the nine, refuses to guess on the tenth, ignores the trap, and shows you the math behind every call.
Built With
- cognee
- geodo
- pymc
- python