## 💡 Inspiration

The challenge gave us a manufacturer whose data couldn't be trusted four days before a regulatory audit: duplicates, impossible values, orphaned references, unit errors. But the person accountable for that audit isn't an engineer. The brief named them perfectly: "the compliance officer who has never opened a database."

Almost every data tool we'd seen speaks to engineers: SQL, dashboards, trace logs. The person who actually has to sign off can't read any of it. So we built Attest: it finds, fixes, and explains broken data in plain English, and every decision is traceable to a concrete reason, never "the model said so."

## 🔍 What it does

Attest runs five agents over the raw records and turns them into a worklist a non-technical officer can act on cold:

  1. Issue Detector: flags every problem, each with a concrete reason
  2. Risk Prioritizer: sorts worst-first, and says why each ranks where it does
  3. Remediation Planner: decides fix / flag / escalate
  4. Audit Reporter: writes a one-page, signable, downloadable summary
  5. Remediation Executor: auto-applies the safe fixes, and lets a human describe a fix in plain language that it turns into a previewable rule

On the benchmark (5,000 records) it surfaced 285 issues, 99 of them critical, and safely auto-removed 130 duplicates (5,000 → 4,870 rows).

## 🛠️ How we built it

Attest is a Python pipeline in which agents never call each other directly: each reads the previous agent's output from a shared memory layer (Cognee) and writes its own back, so the handoffs are real and inspectable. The detection agents are deterministic on purpose: rules make every finding auditable. We used Claude only where judgment helps: the narrative summary and the human-in-the-loop fix automation.
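To make the handoff pattern concrete, here is a minimal sketch of agents communicating only through a shared store. It is not our Cognee integration: `SharedMemory`, its `read`/`write` methods, the key names, and the toy detection rule are all hypothetical stand-ins chosen for illustration.

```python
from typing import Any


class SharedMemory:
    """Hypothetical in-process stand-in for the shared memory layer."""

    def __init__(self) -> None:
        self._store: dict[str, Any] = {}
        self.log: list[tuple[str, str, str]] = []  # (agent, action, key)

    def write(self, agent: str, key: str, value: Any) -> None:
        self._store[key] = value
        self.log.append((agent, "write", key))

    def read(self, agent: str, key: str) -> Any:
        self.log.append((agent, "read", key))
        return self._store[key]


def detect(memory: SharedMemory) -> None:
    # Toy deterministic rule: every finding carries a concrete reason.
    records = memory.read("detector", "records")
    issues = [
        {"id": r["id"], "reason": "weight must be positive"}
        for r in records
        if r["weight"] <= 0
    ]
    memory.write("detector", "issues", issues)


def prioritize(memory: SharedMemory) -> None:
    # Reads only what the detector wrote; never calls detect() itself.
    issues = memory.read("prioritizer", "issues")
    memory.write("prioritizer", "worklist", sorted(issues, key=lambda i: i["id"]))


memory = SharedMemory()
memory.write("loader", "records", [{"id": 1, "weight": 61.5}, {"id": 2, "weight": -3.0}])
for agent in (detect, prioritize):
    agent(memory)

# The log is the inspectable trail of who read and wrote what.
for entry in memory.log:
    print(entry)
```

Because every read and write goes through the store, the log doubles as the evidence that one agent actually consumed another's output.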

Severity is explicit, not a black box:

$$ s = b_c + \mathbb{1}[\text{shipped}] $$

where $s$ is the severity of an issue, $b_c$ is the base severity for the issue category $c$, and the indicator $\mathbb{1}[\text{shipped}]$ bumps it by one when the bad record has already shipped (the error likely escaped before the audit). Unit/weight errors are caught relative to each part's own history: we flag a record when

$$ w > 5\,\tilde{w}_p \quad\text{or}\quad w < \tfrac{1}{5}\,\tilde{w}_p $$

for part $p$ with median weight $\tilde{w}_p$. The front end is a FastAPI app with a guided flow (See the data → Watch the agents → Review & sign → Take action) where you can click any agent to inspect its tool call, input, output, and handoff.
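The two rules described above (base severity plus one for shipped records, and the 5× median weight check) are simple enough to sketch in a few lines. This is an illustration under assumptions, not our production code: the category names, the base severity values, and the record fields `id`, `part`, and `weight` are invented for the example.

```python
from statistics import median

# Hypothetical base severities per issue category (values are illustrative).
BASE_SEVERITY = {"duplicate": 1, "unit_error": 2, "orphaned_reference": 3}


def severity(category: str, shipped: bool) -> int:
    """Base severity for the category, plus one if the record already shipped."""
    return BASE_SEVERITY[category] + (1 if shipped else 0)


def weight_outliers(records: list[dict], factor: float = 5.0) -> list[dict]:
    """Flag records whose weight is > factor x or < 1/factor x the part's median."""
    by_part: dict[str, list[float]] = {}
    for r in records:
        by_part.setdefault(r["part"], []).append(r["weight"])
    medians = {part: median(ws) for part, ws in by_part.items()}
    return [
        r
        for r in records
        if r["weight"] > factor * medians[r["part"]]
        or r["weight"] < medians[r["part"]] / factor
    ]


records = [
    {"id": 1, "part": "A", "weight": 61.5},
    {"id": 2, "part": "A", "weight": 60.9},
    {"id": 3, "part": "A", "weight": 62.0},
    {"id": 4, "part": "A", "weight": 609.4},  # roughly 10x the others
]

flagged = weight_outliers(records)
print([r["id"] for r in flagged])
print(severity("unit_error", shipped=True))
```

Using the per-part median rather than a global threshold means a heavy part and a light part are each judged against their own history, and a single bad value barely moves the reference point.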

## 📚 What we learned

  • Detection is the easy part; legibility is the product. Our first UI dumped 285 rows and confused everyone. The win was turning it into a guided, plain-language story.
  • Deterministic beats clever for trust. Auditable reasons matter more than model magic when someone has to sign the result.
  • Humans and models check each other. When we fed the fix-automation a wrong instruction ("divide weights by 1000"), the model inspected the data ($609.4 / 61.5 \approx 9.9\times$) and corrected us: "that's ~10×, not 1000×; divide by 10." Real human-in-the-loop.
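The arithmetic behind that correction can be written down as a deterministic check. In Attest the model did this reasoning itself; the sketch below is only a stand-in that shows the same comparison, and `check_divisor`, its parameters, and the snap-to-a-power-of-ten heuristic are assumptions for illustration.

```python
import math
from statistics import median


def check_divisor(proposed: float, suspect: float, reference_values: list[float]) -> dict:
    """Compare a human-proposed divisor with the scale the data implies."""
    ratio = suspect / median(reference_values)
    if ratio <= 0:
        raise ValueError("ratio must be positive to estimate a scale factor")
    # Unit slips are usually powers of ten, so snap the ratio to the nearest one.
    suggested = 10 ** round(math.log10(ratio))
    return {"ratio": ratio, "suggested": suggested, "agrees": suggested == proposed}


result = check_divisor(proposed=1000, suspect=609.4, reference_values=[61.5])
print(f"ratio ≈ {result['ratio']:.1f}x, suggested divisor: {result['suggested']}")
if not result["agrees"]:
    print("Proposed divisor does not match the data; ask the human to confirm.")
```

A mismatch does not apply anything on its own; it only surfaces the disagreement so the human can confirm the rule before it is previewed.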

## 🧗 Challenges we faced

  • Dependency hell. Installing the Cognee SDK silently downgraded our anthropic client and broke every LLM call with an httpx … proxies error. We isolated a clean virtual environment, made the Cognee SDK optional, and kept the shared-memory interface so the demo never breaks.
  • Precision vs. recall. With ~850 seeded issues, we chose precision + explainability over chasing recall: a confident, correct, signable worklist beats a noisy one.
  • Making collaboration visible. It took deliberate design to route every handoff through memory so a judge can actually see Agent N+1 using Agent N's work.
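Making an SDK optional, as in the dependency bullet above, usually comes down to checking whether the module is importable before choosing a backend. The sketch below shows that pattern with the standard library only. `select_backend` and the `"in-memory"` fallback name are hypothetical, and no SDK functions are called, since the write-up does not cover them.

```python
import importlib.util


def select_backend(sdk_module: str, fallback: str = "in-memory") -> str:
    """Return the SDK's module name if it is installed, else the fallback name."""
    # find_spec looks a top-level module up without importing it, so a
    # missing optional dependency cannot crash start-up.
    if importlib.util.find_spec(sdk_module) is not None:
        return sdk_module
    return fallback


# The result depends on what is installed in the current environment.
print(select_backend("cognee"))
```

Whichever backend is chosen, the agents keep talking to the same shared-memory interface, which is what keeps the demo running when the SDK is absent.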
