Inspiration

I spent years in Treasury and Wealth Management before switching to tech. I know what it feels like to open a 60-page weekly pack on a Friday afternoon: most “breaches” are really data quality noise or definition disputes — but you still can’t wave them through, because the one you ignore is always the real one.

When I started experimenting with Gemini 3, something clicked: extracting structured signals from messy PDFs, scans, emails, and statements suddenly became practical. So I picked a hard problem I actually understand from the inside.

What it does

Decision Kernel turns messy evidence into a defensible, replayable decision workflow for Treasury and Wealth.

Gemini reads documents and extracts candidate signals with provenance (evidence spans). Then a deterministic kernel validates each candidate before a human sees it. Only confirmed breaches can FAIL policy; everything else becomes an observation/review item.

Key insight: in our test packs Gemini proposed 14 potential breaches → Governance OS confirmed only 4 (28%). The rest were downgraded due to missing evidence or merged as duplicates.

Metric Treasury (Orion) Wealth (Stonebridge) Total
Candidate breaches (Gemini proposed) 5 9 14
Confirmed breaches 2 2 4
Downgraded to observations 3 7 10
Dropped (invalid) 0 0 0
Merged (deduped) 0 0 0

What remains includes: which policy triggered, which evidence supports it, and symmetric options with no recommendation baked in. The human decides, and an audit pack is generated automatically.

Judge Quickstart (90 sec)

How I built it (Gemini 3 integration)

  1. Gemini 3 Flash extracts candidate signals + evidence spans into JSON.
  2. Gemini thinking mode improves extraction on ambiguous docs.
  3. Deterministic kernel canonicalizes/gates; only confirmed breaches can FAIL.
  4. Gemini drafts narratives only; humans make decisions.
  5. make evals-gemini verifies grounding/schema, but never changes outcomes.

Reproducibility:

  • make evals proves determinism + safety boundaries on every run.
  • Expected: "OVERALL: PASS" with all suites green
  • Results: evals/outputs/ (JSON files with timestamps)
  • To verify determinism: run 3x, compare hashes in output - must be identical
  • Or the full version just ran:
    • python -W ignore::RuntimeWarning -m evals.runner --suite all --pack all --verbose

Challenges I ran into

The biggest fight was with myself. Gemini 3 makes multi-agent orchestration tempting, but policy evaluation cannot be probabilistic if you want auditability. Agents would have looked impressive and made the system impossible to trust.

I also had to tighten semantics that are easy to get wrong:

  • “Event” signals should never become confirmed breaches.
  • Observations should never cause FAIL. They should generate review items outside the evaluator.
  • “Target bands” and ambiguous definitions must be gated until verified.

And yes: keeping Docker, multiple deploy targets, and a growing test suite in sync while working solo is the unglamorous part.

Accomplishments I’m proud of

  • The breach/observation boundary is reproducible from a fresh clone: “make evals” replays the same outcomes consistently.
  • The system is designed to be non-manipulative: no “recommended option” UI, symmetric choices, and the LLM is never the source of truth for severity or escalation.
  • A Gemini-based CI verifier (“make evals-gemini”) can check grounding and schema compliance for coprocessor outputs, but it cannot change policy outcomes.

What I learned

I'd read about Firebase and GCP but didn't expect to ship anything on them anytime soon. With AI-assisted development I went from reading docs to a deployed system on Cloud Run, Cloud SQL, and Firebase App Hosting in days instead of months. I also learned that the cleanest boundary is architectural, not prompt-based: letting AI propose, but forcing the kernel to verify deterministically.

What's next for Decision Kernel

The system makes the cost of bad policy visible and immediate. Weak policies generate noise. Precise policies stay quiet on what doesn't matter. I think once executives see that, they'll realize managing through better policies has way more ROI than firefighting every day. Next up: policy authoring assistance where Gemini drafts and humans approve, a replay harness for tuning policies against historical data, and expanding beyond treasury and wealth into any domain where high-stakes decisions need audit-grade evidence.

Built With

Share this project:

Updates