RingFence — the AML triage screen that knows when not to guess

Track 02 · Fraud Watch — for "the analyst with three minutes per case." Four agents. One shared Cognee memory. A real PyMC Bayesian brain. Every decision shows its math; the one genuinely ambiguous case is refused, not guessed.

⚡ TL;DR for judges

RingFence ingests a bank's raw transaction file, runs a four-agent Find → Rank → Act → Explain pipeline over Cognee, and in under a second turns ~300 accounts into a triage queue of 14 — 9 escalate, 1 to human review, 4 cleared — with ring exposure reconciled to the cent at $161,750.90 and a one-click, regulator-ready SAR memo.

What makes it different from every "fraud score" tool:

🧠 A genuine probabilistic brain (PyMC). Agent 2 is an unsupervised two-component Bayesian mixture sampled with NUTS — not a logistic regression with magic weights. It returns a full posterior with credible intervals, which is what lets the system express honest uncertainty.
🤝 Provable agent collaboration (Cognee). Each agent accretes fields onto one shared Case node; downstream agents raise an error if the upstream fields are missing. The handoff isn't a vibe — it's a hard dependency.
🙅 Calibrated abstention. When the posterior credible interval straddles the decision threshold, RingFence routes the case to a human because the expected value of information says it pays for itself — the one move a bare score cannot make.
🪤 Decoy resistance by construction. The planted "shared-device" trap that fools threshold tools is mathematically prevented from flagging, via a deliberately skeptical prior.
🔁 A self-learning loop. Every run compiles a "closing rule" from the signals that actually drove escalations — a deployable query that catches the next ring automatically.

🎯 The Product Brief we built against (Step 0)

User: an AML analyst at a community bank with ~3 minutes per case.
Job: find the coordinated ring hiding below every alert threshold, without drowning in false positives.
Success condition (our own bar):

Recover all 9 ring accounts (escalate).
Clear the planted decoy (don't fall for shared-device co-occurrence).
Abstain on the one genuinely ambiguous account instead of guessing.
Reconcile the dollars to the cent.
Every decision carries a human-readable reason — no bare scores.

We hit all five, and a test suite enforces them on every commit (more below), so "matches the brief" is a verifiable claim rather than an assertion.

🕵️ What it does (the story)

Crestline Community Bank hands you a 90-day file: ~5,000 transactions across ~300 accounts. Somewhere inside is a layering ring built specifically to never cross a monitoring threshold — your rules caught none of it.

You drop the CSV into RingFence. Four agents run over one shared Cognee memory and collapse the noise to a focused queue. Then the three product screens tell the whole story:

Screen 1 — The Queue

The 14 surfaced accounts, each with its mule probability and uncertainty band, color-coded ESCALATE / REVIEW / CLEAR. Everything else was auto-cleared. Headline: 9 escalate · 1 review · 4 cleared · $161,750.90 ring exposure, reconciled.

Screen 2 — Case detail

Open AC-0009 (a relay): a posterior plot at p = 0.999 with a tight credible interval far above the threshold τ = 0.05; the signals that fired; the expected-loss arithmetic behind the ESCALATE; the typology; and the source→relay→sink chains it sits in. No bare score anywhere.

The decoy beat: AC-0045 shares a device with three accounts — the obvious flag. RingFence holds it at p = 0.005 and CLEARS it. It didn't fall for the trap.
The abstention beat: AC-0012 is fresh-cohort like the ring but has no transfers — genuinely ambiguous. p = 0.057, CI [0.004, 0.458] straddles τ. RingFence computes that a review pays for itself and routes to REVIEW instead of guessing.

Screen 3 — Pipeline view (the magic)

The same Case object shown after each agent touches it — you literally watch the fields accrete: the Detector wrote signals; the Estimator added p_mule + credible interval; the Adjudicator added the action + the expected-loss math; the Reporter added the memo + the learned rule. Then click Build Cognee Graph and ask it in plain English — "which accounts are relays in the ring?" — and it answers from the knowledge graph.

One more click downloads the SAR memo: FinCEN who/what/when/where/why/how, reconciled edge-by-edge to $161,750.90, with the closing rule appended.

🧠 PyMC: a real probabilistic brain, not a scorecard

This is the part we're proudest of, and it's why the system can do things a classifier can't.

Agent 2 (the Estimator) is an unsupervised two-component Bayesian mixture. There are no labels in the data, so we don't "train on known mules." Instead we posit two latent classes — legit and mule — each with its own vector of signal fire-rates φ, and we let PyMC's NUTS sampler (via nutpie, 4 chains) infer everything: the rates, the mixing weight, and every account's posterior probability of membership.

The generative model, in plain terms:

pi ~ Beta(1, 9) — mules are rare (a strong, honest prior).
phi_mule ~ Beta(4, 1), phi_legit ~ Beta(1, 4) — starting beliefs only; the real fire-rates are learned from the data.
Each fired signal is a Bernoulli draw at its class's rate (product-of-Bernoullis likelihood).

Three pieces of real Bayesian engineering make it faithful — and they are exactly what produce the demo's best moments:

1. `logsumexp` marginalizes the hidden class analytically

NUTS is gradient-based and cannot sample discrete latent variables. So instead of sampling each account's class, we integrate it out in log-space:

log_mix    = [ log(pi) + ll_mule ,  log(1-pi) + ll_legit ]   # one pair per account
Potential  = sum( logsumexp(log_mix) )      # logsumexp over the two classes, sum over accounts: exact marginal log-likelihood

Fed to pm.Potential, this gives the sampler a smooth, fully-differentiable objective. This is the textbook-correct way to fit a mixture under HMC — and we verify it converged by reporting R-hat and divergence counts on every run.
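To make the marginalization concrete, here is a minimal pure-Python sketch of the same arithmetic for a single account at fixed parameter values. It is not the project's PyMC code: in the real model pi and the fire-rates are sampled by NUTS, while here they are illustrative constants, and the helper names (logsumexp, bernoulli_loglik, marginal_loglik) are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def bernoulli_loglik(x, phi):
    """Log-likelihood of binary signals x under per-signal fire-rates phi."""
    return sum(math.log(p) if xi else math.log(1.0 - p) for xi, p in zip(x, phi))

def marginal_loglik(x, pi, phi_mule, phi_legit):
    """Return (log p(x) with the class summed out, P(mule | x))."""
    log_mix = [
        math.log(pi) + bernoulli_loglik(x, phi_mule),
        math.log(1.0 - pi) + bernoulli_loglik(x, phi_legit),
    ]
    total = logsumexp(log_mix)
    # Class membership falls out of the same two terms by normalizing.
    p_mule = math.exp(log_mix[0] - total)
    return total, p_mule

# Illustrative values only (assumptions, not fitted numbers).
pi = 0.1
phi_mule = [0.8, 0.8, 0.8]
phi_legit = [0.2, 0.2, 0.2]
total, p_mule = marginal_loglik([1, 1, 1], pi, phi_mule, phi_legit)
print(round(math.exp(total), 4), round(p_mule, 4))
```

Summing this per-account quantity over all accounts gives the value passed to the sampler, and the normalized first term is the per-account membership probability that a discrete latent variable would otherwise have carried.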

2. A masked likelihood: missing ≠ innocent

A pure sink account physically cannot fire signals like automation or fresh_cohort (it never originates, it may have no open date). A naive model would read those absent signals as evidence of innocence. We don't:

ll_k = sum( M * [ X*log(phi_k) + (1-X)*log(1-phi_k) ] )

The applicability mask M zeroes out structurally-inapplicable signals per account — they're treated as missing data, not negative evidence. This single element-wise multiply is why sinks stay confident and why AC-0012 (only one applicable mule-signal) comes back honestly uncertain with a wide, τ-straddling interval.
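The effect of the mask is easiest to see side by side. The sketch below is a pure-Python illustration at fixed, made-up parameter values (the fire-rates, the prior weight, and the three-signal layout are assumptions), comparing one account's mule probability with and without masking its two inapplicable signals.

```python
import math

def masked_loglik(x, mask, phi):
    """Sum Bernoulli log-terms only where mask == 1 (signal is applicable)."""
    total = 0.0
    for xi, mi, p in zip(x, mask, phi):
        if mi:
            total += math.log(p) if xi else math.log(1.0 - p)
    return total

def posterior_mule(x, mask, pi, phi_mule, phi_legit):
    """P(mule | x) for one account, using the masked likelihood."""
    a = math.log(pi) + masked_loglik(x, mask, phi_mule)
    b = math.log(1.0 - pi) + masked_loglik(x, mask, phi_legit)
    m = max(a, b)
    return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))

# Illustrative values only: one fired signal, two that cannot apply.
pi = 0.1
phi_mule = [0.8, 0.8, 0.8]
phi_legit = [0.2, 0.2, 0.2]
x = [1, 0, 0]

p_masked = posterior_mule(x, [1, 0, 0], pi, phi_mule, phi_legit)
p_naive = posterior_mule(x, [1, 1, 1], pi, phi_mule, phi_legit)
print(round(p_masked, 3), round(p_naive, 3))
```

With these numbers the naive version reads the two absent signals as exculpatory and drives the probability far lower than the masked version, which is the failure mode the mask is there to prevent.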

3. A skeptical decoy prior

device_shared is given the same weak prior in both classes, making identity co-occurrence non-discriminating by construction. The planted decoy cannot be driven to a flag by the math — decoy resistance is a property of the model, not a special-case if statement.

What PyMC hands downstream

For each account: posterior-mean p_mule, a 94% highest-density credible interval (_hdi94), and a set of learned log-Bayes-factors (log φ_mule − log φ_legit) per fired signal — the per-decision "why" the Adjudicator and Reporter consume. Uncertainty is a first-class output, not an afterthought. (This is the work we'd put forward for the PyMC Special Prize.)
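As a rough illustration of the per-signal "why", the sketch below computes log φ_mule − log φ_legit for each fired signal. The fire-rates are made-up stand-ins for posterior means, and the function name signal_contributions is hypothetical. It also assumes the two classes end up with the same rate for device_shared, which is the situation the shared prior is meant to encourage.

```python
import math

def signal_contributions(fired, phi_mule, phi_legit):
    """Log-Bayes-factor per fired signal: log(phi_mule) - log(phi_legit)."""
    return {
        name: math.log(phi_mule[name]) - math.log(phi_legit[name])
        for name in fired
    }

# Illustrative rates only (assumptions, not learned values).
phi_mule = {"fresh_cohort": 0.9, "automation": 0.7, "device_shared": 0.2}
phi_legit = {"fresh_cohort": 0.1, "automation": 0.1, "device_shared": 0.2}

contributions = signal_contributions(
    ["fresh_cohort", "device_shared"], phi_mule, phi_legit
)
for name, value in contributions.items():
    print(f"{name}: {value:+.3f}")
```

A signal whose rate is the same in both classes contributes exactly zero, so it cannot move an account toward either class no matter how often it fires.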

🕸️ Cognee: the memory that makes the collaboration real

Cognee isn't a logging sink we bolted on — it's the substrate the entire pipeline runs over, in two complementary layers.

Layer 1 — the operational handoff store

Every account is one Case::AC-#### node. As each agent runs, it accretes new fields onto that same node, and every read/write is logged with its entity id. The handoff is a provable data dependency: the Estimator raises ValueError if the Detector's signals/dist_stats are absent; the Adjudicator raises if the Estimator's posterior is absent. Agent N+1 demonstrably uses Agent N's output — or it refuses to run. (This is the rubric's "Real collaboration" criterion, made literal.)
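The hard-dependency idea can be sketched without Cognee at all. The class below is a hypothetical in-memory stand-in for the shared Case nodes (CaseStore, its methods, and the example field values are all assumptions for illustration, not the project's API); it shows a downstream read failing until the upstream fields exist.

```python
class CaseStore:
    """Hypothetical in-memory stand-in for the shared Case nodes."""

    def __init__(self):
        self._nodes = {}
        self.log = []  # (agent, operation, case_id, fields)

    def write(self, agent, case_id, **fields):
        """Accrete new fields onto the existing node for case_id."""
        self._nodes.setdefault(case_id, {}).update(fields)
        self.log.append((agent, "write", case_id, sorted(fields)))

    def read(self, agent, case_id, required):
        """Return the required fields, or refuse if any are missing."""
        node = self._nodes.get(case_id, {})
        missing = [name for name in required if name not in node]
        if missing:
            raise ValueError(f"{agent}: {case_id} is missing upstream fields {missing}")
        self.log.append((agent, "read", case_id, sorted(required)))
        return {name: node[name] for name in required}

store = CaseStore()
case_id = "Case::AC-0009"

# The Estimator cannot run before the Detector has written its fields.
try:
    store.read("estimator", case_id, ["signals", "dist_stats"])
    blocked = False
except ValueError:
    blocked = True

# Placeholder field values, for illustration only.
store.write("detector", case_id, signals=["fresh_cohort"], dist_stats={})
upstream = store.read("estimator", case_id, ["signals", "dist_stats"])
print(blocked, upstream["signals"])
```

Because the check lives in the read path, the ordering of agents is enforced by the data itself rather than by the orchestration code.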

Layer 2 — the semantic knowledge graph

At the end of a run, each of the four agents contributes its own tagged layer to Cognee via add(node_set=["quorum", "agent:detector" …]), and a single cognify() builds a Gemini-backed knowledge graph carrying full multi-agent provenance — not just the final report, but who-found-what at every stage. The graph is then queryable in natural language: "Which accounts are relays?" "Which were cleared as decoys and why?" "What typology and total exposure did the agents find?" — answered from the graph, persisted to cognee_graph.json so the UI renders it instantly.

Why the knowledge graph matters:

Cross-agent provenance & auditability — a regulator can trace any conclusion back through the exact agent layers that produced it. AML is a compliance domain; a queryable, provenance-carrying memory is a genuine product advantage, not a demo trick.
Natural-language investigation — the analyst interrogates the whole investigation in English instead of writing SQL against four disjoint outputs.
Relationship surfacing — source→relay→sink ring structure lives in the graph, so the topology is discoverable, not buried in tables.
Graceful degradation — the SDK layer activates with a BYO key and degrades gracefully to the fast local store if absent, so the pipeline is never blocked. (Keys stay out of the repo.)

🏗️ Agent architecture & the exact handoffs

A real Find → Rank → Act → Explain pipeline — four specialists, not one LLM in a loop. Data flows only through Cognee; nothing is passed agent-to-agent in memory.

Crestline CSV → DuckDB (in-process) → COGNEE (shared Case nodes)
        │
        ▼
1. DETECTOR ──signals, dist_stats──▶ 2. ESTIMATOR ──p_mule, credible_interval──▶
3. ADJUDICATOR ──action, EVPI, decisive_signals──▶ 4. REPORTER ──memo, closing_rule

Agent	Reads from Cognee	What it computes	Writes to Cognee (the handoff)
1 · Detector	raw txns (DuckDB)	Isolates the AC→AC transfer graph structurally; derives source/relay/sink roles; fires 7 behavioral signals; learns the "fresh-cohort" cutoff from the data (no magic 30 days); tags shared-device decoys	`signals`, empirical `dist_stats`
2 · Estimator	`signals`, `dist_stats`	PyMC mixture, NUTS (above)	`p_mule`, `credible_interval`, `signal_contributions`
3 · Adjudicator	`p_mule`, `credible_interval`, `signal_contributions`	Bayesian decision theory: τ = C_FP/(C_FP+C_FN); expected-loss argmin; abstains when the interval straddles τ and EVPI > review cost	`action`, `E_loss_escalate`, `E_loss_clear`, `EVPI`, `decisive_signals`
4 · Reporter	the fully enriched Case set + transfer graph	edge-by-edge dollar reconciliation; FinCEN SAR memo; compiles the closing rule	`typology`, `dollar_contribution`, `closing_rule`, `memo_ref`

Decisions are deterministic (a fixed seed + argmin), so the system is reproducible and never says "the model said so."
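The Adjudicator's rule can be written out in a few lines. This is a sketch under stated assumptions: the costs C_FP = 1 and C_FN = 19 are chosen only because they reproduce τ = 0.05, the review cost of 0.5 is hypothetical, correct decisions are assumed to cost nothing (so EVPI reduces to the smaller of the two expected losses), and the tight intervals used for the second and third cases are illustrative. Only the AC-0012 posterior and interval come from the text above.

```python
def adjudicate(p_mule, interval, c_fp, c_fn, review_cost):
    """Expected-loss decision with abstention when review pays for itself."""
    tau = c_fp / (c_fp + c_fn)
    e_loss_escalate = (1.0 - p_mule) * c_fp  # cost of a false positive
    e_loss_clear = p_mule * c_fn             # cost of a false negative
    # With perfect information the loss is zero, so EVPI is the best
    # expected loss achievable without it.
    evpi = min(e_loss_escalate, e_loss_clear)
    low, high = interval
    if low < tau < high and evpi > review_cost:
        action = "REVIEW"
    elif e_loss_escalate < e_loss_clear:
        action = "ESCALATE"
    else:
        action = "CLEAR"
    return {
        "action": action,
        "tau": tau,
        "E_loss_escalate": e_loss_escalate,
        "E_loss_clear": e_loss_clear,
        "EVPI": evpi,
    }

costs = {"c_fp": 1.0, "c_fn": 19.0, "review_cost": 0.5}  # assumed values

ambiguous = adjudicate(0.057, (0.004, 0.458), **costs)  # AC-0012 figures
confident = adjudicate(0.999, (0.99, 1.0), **costs)     # illustrative interval
decoy = adjudicate(0.005, (0.0, 0.02), **costs)         # illustrative interval
print(ambiguous["action"], confident["action"], decoy["action"])
```

Changing the cost values moves τ and every downstream action together, which is what makes the cost matrix the single tuning surface for an institution.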

🔁 The self-learning loop

RingFence doesn't just decide — it gets smarter about this institution every run:

The Detector learns its thresholds from the data distribution (e.g., the fresh-cohort cutoff is the largest age-gap in the youngest decile — not a hard-coded constant).
The Estimator learns the class-conditional signal fire-rates from the population via the posterior — the "what does a mule look like here" model updates with the data.
The Adjudicator's behavior is governed by an explicit, institution-tunable cost matrix (C_FN ≫ C_FP, the AML reality) — change the costs, the threshold τ and every action update automatically.
The Reporter compiles a "closing rule" from the decisive_signals that actually drove the escalations — a deployable FLAG AC→AC transfers WHERE … query, learned from this ring to catch the next one. Every run hardens the institution's detection net.
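One plausible reading of the Detector's data-driven cutoff, as a sketch: sort account ages, take the youngest decile, and place the cutoff in the middle of the largest gap between consecutive ages. The function name, the midpoint choice, and the sample ages are all assumptions; the project may define the gap or the decile differently.

```python
def fresh_cohort_cutoff(ages_days):
    """Midpoint of the largest gap between consecutive ages in the youngest decile."""
    if len(ages_days) < 2:
        raise ValueError("need at least two account ages to find a gap")
    ages = sorted(ages_days)
    # Youngest decile, but never fewer than two points (a gap needs a pair).
    n = max(2, len(ages) // 10)
    youngest = ages[:n]
    best_gap = -1.0
    cutoff = float(youngest[-1])
    for younger, older in zip(youngest, youngest[1:]):
        gap = older - younger
        if gap > best_gap:
            best_gap = gap
            cutoff = (younger + older) / 2.0
    return cutoff

# Made-up ages in days: three very new accounts, then an established book.
ages = [3, 5, 6] + list(range(60, 97))
cutoff = fresh_cohort_cutoff(ages)
fresh = [age for age in ages if age < cutoff]
print(cutoff, fresh)
```

The point of deriving the cutoff this way is that it follows the institution's own account-age distribution instead of a fixed number of days.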

Roadmap: close the loop fully — feed resolved analyst dispositions back as labels to update the priors and the cost matrix online, turning each reviewed case into training signal.

✅ How we map to the 25-point rubric

Criterion (5 pts each)	How RingFence nails it
Agents that work — "ran on real data, outputs not hardcoded"	Runs on the real Crestline CSV via DuckDB; a ground-truth oracle is kept strictly separate from the detection logic and tests prove the pipeline rediscovers the answers from data, not from the key.
Real collaboration — "Agent N+1 used Agent N's findings via Cognee"	Field-accretion on shared `Case` nodes; downstream agents raise if upstream fields are missing. Provable, not narrated.
Matches brief — "judged against your own Step 0"	All five success conditions met and enforced by a green test suite on every commit.
End-user usability — "a judge operates it cold"	Three-screen product: Queue → Case detail → Pipeline view, plus a one-click downloadable SAR memo and English-language graph search. Drop a CSV, get a queue.
Explainability — "every decision has a visible reason"	Posterior + credible interval + expected-loss arithmetic + decisive signals on every case; the SAR memo phrases (never invents) the computed facts.

🛠️ How we built it

Python 3.14, uv-managed. Run the pipeline: uv run main.py data/track02_fraud_watch.csv.
DuckDB — in-process analytical SQL over the transactions; zero server.
PyMC + nutpie + ArviZ — the Bayesian mixture, NUTS sampling, and convergence diagnostics.
Cognee (Gemini-backed) — shared memory + the cognified knowledge graph and NL search.
Streamlit — the three-screen analyst product.
Geodo — Domain-Expert web research grounding the SAR memo's real-world precedents (Liberty Reserve layering, FinCEN structuring advisory FIN-2014-A005, FATF account-cluster typologies) and the 31 CFR § 1020.320 filing thresholds.
Trupeer — the demo video.
BYO API key — keys never touch the repo; every external layer degrades gracefully without one.

🧪 Proof it works

uv run pytest -q runs the full pipeline against the data and goes green:

all 9 ring accounts escalated,
the 4 decoys cleared,
AC-0012 routed to review,
dollars reconciled to the cent,
and dist_stats/posteriors validated against the separate ground-truth oracle the agents are forbidden to read.

🚧 Challenges we solved

Fitting a mixture under NUTS — discrete latent classes break HMC; the logsumexp marginalization + pm.Potential formulation made it sample cleanly (R-hat checked).
Structurally-missing signals — the masked likelihood was the unlock for both confident sinks and honest abstention.
Keeping agents honest — separating the ground-truth oracle from the detection code so the pipeline earns its answers, and the tests can prove it.
One async event loop for Cognee — buffering each agent's layer and flushing in a single loop with one cognify(), so async DB connections never bind to dead loops.
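The buffering pattern in the last item can be sketched with plain asyncio. The code below does not call the Cognee SDK: fake_add and fake_cognify are hypothetical stand-ins that only record calls, LayerBuffer is an illustrative name, and the node_set tags follow the pattern quoted earlier in this write-up.

```python
import asyncio

calls = []  # records what the stand-in functions were asked to do

async def fake_add(text, node_set):
    """Stand-in for an async 'add' call; not the real SDK."""
    await asyncio.sleep(0)
    calls.append(("add", tuple(node_set)))

async def fake_cognify():
    """Stand-in for an async graph-build call; not the real SDK."""
    await asyncio.sleep(0)
    calls.append(("cognify",))

class LayerBuffer:
    """Collect each agent's layer synchronously, flush once in one loop."""

    def __init__(self):
        self._layers = []

    def stage(self, agent, text):
        self._layers.append((agent, text))

    async def flush(self, add, cognify):
        for agent, text in self._layers:
            await add(text, node_set=["quorum", f"agent:{agent}"])
        await cognify()  # a single graph build after every layer is added
        count = len(self._layers)
        self._layers.clear()
        return count

buffer = LayerBuffer()
for agent in ("detector", "estimator", "adjudicator", "reporter"):
    buffer.stage(agent, f"{agent} findings")

# One asyncio.run call means one event loop for every awaited operation.
flushed = asyncio.run(buffer.flush(fake_add, fake_cognify))
print(flushed, len(calls))
```

Keeping the agents synchronous and deferring all awaits to a single asyncio.run avoids creating connections on a loop that has already been closed by an earlier call.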

🔮 What's next

Streaming ingestion for live monitoring · the closed analyst-feedback learning loop · graph-native typology detection that surfaces emerging rings before they complete · multi-institution federated typology sharing via Cognee.

🏁 Closing line

Every other tool ranks risk — and most just flag the decoy. RingFence surfaces the nine, refuses to guess on the tenth, ignores the trap, and shows you the math behind every call.

Built With

cognee
geodo
pymc
python