What it does

Compliance Copilot is an agentic regulatory analysis tool for AI deployments.

Type a business scenario in plain English — "We're deploying AI resume screening for a tech company in NYC, candidates from the US and occasionally the EU" — and an NVIDIA Nemotron agent on Crusoe Managed Inference decides which laws apply, fetches the actual sections from a curated corpus, cross-references them against your facts, and emits a structured compliance plan with verifiable citations and a downloadable PDF memo.

In 25 seconds and about three cents.

Inspiration

Every company deploying AI in employment decisions sits at the intersection of half a dozen overlapping regulations: NYC Local Law 144, the EEOC's four-fifths rule, GDPR Article 22, the EU AI Act, CCPA / CPRA, Colorado SB 24-205, Illinois's AI Video Interview Act — and the list keeps growing.

The fines are real. NYC LL 144 is $500–$1,500 per day per violation. GDPR is up to 4% of global revenue. The EU AI Act caps at 7%.

The current options for navigating this are all bad:

  • Hire outside counsel — $400–$1,500/hour, weeks of turnaround for a memo.
  • Buy a compliance platform (Vanta, Drata, OneTrust) — built for continuous infrastructure monitoring against static frameworks like SOC 2. You can't paste a paragraph and ask "does this trigger Article 22?"
  • Pay a bias-audit firm (Holistic AI, Babl AI) — perfect once you know you need a $50K audit. Doesn't help you discover you need one.
  • Ask ChatGPT or Claude — happily answers but fabricates citations. Useless for legal teams who need verifiable sources.

There's a missing tier: fast, agentic, plain-English scoping that produces verifiable citations and structured next steps. That's the gap Compliance Copilot fills.

How we built it

The core is a single agentic loop with three tools the model can call autonomously:

  • list_regulations() — returns a routing index of available frameworks
  • get_regulation(reg_id) — returns full section-split text plus metadata
  • emit_verdict(...) — required structured output: summary, applicable regulations, requirements with rationale, risk flags, cross-references, next steps, open questions

The model decides what to fetch and in what order. Judges literally watch the agent plan, fetch, reconsider, then emit — every tool call, every reasoning step, every fetched section is surfaced in a live timeline.

The corpus is seven curated regulations as markdown files with YAML frontmatter:

File Regulation Jurisdiction
nyc_ll144.md NYC Local Law 144 — AEDT bias audit New York City
eeoc_ai_hiring.md EEOC Title VII guidance on AI hiring US federal
gdpr.md GDPR Art. 5, 6, 9, 13–14, 22, 35, 44–49 EU / EEA
eu_ai_act.md EU AI Act — prohibited + high-risk + deployer duties EU / EEA
ccpa_cpra.md CCPA / CPRA — Automated Decisionmaking Technology California
colorado_ai_act.md Colorado AI Act (SB 24-205) Colorado
illinois_aivi.md Illinois AI Video Interview Act Illinois

We deliberately skipped a vector database. The corpus is small enough that routing is reasoning, not similarity search — letting the model read an index file and decide which regs to pull in full produces sharper citations than top-k retrieval would.

Triage layer. Before engaging the pipeline, the system prompt classifies the input into five buckets: real scenario, greeting, meta-question, off-topic, too vague. Only real scenarios invoke tools. A "hi" gets a friendly reply in one $0.0001 call instead of seven $0.004 calls.

Three resilience layers (also targeting the TrueFoundry challenge):

  1. JSON synthesis fallback — if Nemotron writes plain text instead of calling emit_verdict, the agent rebuilds a fresh, focused conversation and asks for a fenced JSON block. A brace-balanced parser handles nested objects.
  2. Degraded verdict — if JSON parsing fails too, the agent synthesizes a minimal verdict from the regulations it did fetch by walking the tool-call history.
  3. Transient-error retry — 3 attempts with exponential backoff on 5xx, timeout, and rate-limit errors.

The UI is a Streamlit workspace with three panels:

  • Left rail — agent timeline with circular nodes for each tool call, tool result, and reasoning trace; the live node pulses while running.
  • Center — live conversation: scenario bubble, thinking bubbles surfacing Nemotron's separate reasoning field, tool-call bubbles showing actual function calls with arguments, results streaming in.
  • Right rail — compliance plan building progressively. Risk / Frameworks / Requirements / Citations metrics, framework cards, severity-coded action items (CRITICAL / REQUIRED / ADVISED), citation pills that expand inline to show the actual regulation text — no rerun, no context-switch.

PDF export uses reportlab with Cabin registered as a TTF (with patched name-table metadata so bold ≠ regular) to keep brand consistency from web → PDF.

Challenges we ran into

  • Nemotron is a reasoning model with split output. The first smoke test returned Response: None until we realized the model emits its thinking in a separate reasoning field and only writes the answer to content if max_tokens is generous enough to finish reasoning first. We turned that into a feature: surfacing the reasoning trace as a visible timeline event makes the model's "thinking" legible to judges, which is exactly what the Crusoe challenge asks for.
  • Forced tool_choice returned empty 500s on the Crusoe Nemotron route. Instead of fighting routing, we built the JSON-synthesis fallback — same structural outcome, no footgun.
  • Streamlit's top-down execution model fights three-pane workspace UIs. We solved it with st.empty() placeholders + custom CSS injection for Cabin and the indigo / fawn / dusk palette, plus .streamlit/config.toml to pin the theme (dark-mode-auto would have inverted the palette).
  • Streaming progressive UI updates inside a single Streamlit run required treating the agent as an event generator that yields dicts the UI accumulates into session state — not as a synchronous function call.

Accomplishments we're proud of

  • It's actually agentic. Not a scripted pipeline pretending to be an agent. The model plans, fetches, and decides on its own.
  • Verifiable citations. Every requirement points to a section the agent actually fetched from the corpus, expandable inline. No hallucination.
  • About $0.003–$0.005 per scenario. A $7 hackathon budget runs ~2,000 scenarios. Triaged non-scenarios cost $0.0001.
  • The PDF export looks like a real legal memo, not a Streamlit screenshot.
  • The UI surfaces the model's reasoning trace, tool calls, and fetched sections in real time — so a legal team can verify everything the agent claims.

What we learned

  • Reasoning models change the deal — max_tokens is the budget for thinking plus answering, and surfacing the reasoning trace is a UX superpower.
  • For small, curated corpora, letting an agent reason over an index file beats vector search on precision of citations.
  • Resilience isn't a feature you add at the end — JSON-synthesis fallback, degraded-verdict synthesis, and transient retry have to be designed into the agent loop from the start.
  • Input triage is non-obvious but high-leverage. A compliance agent that "analyzes" the word "hi" looks dumb and wastes ~30× the cost.

What's next

  • Citation verification pass — a second cheap LLM call that checks each emitted citation against the actual section text, surfacing hallucinated section numbers with a "verified ✓" badge.
  • Multi-model fallback via TrueFoundry AI Gateway — wrap the client so a Crusoe 500 fails over to Llama 3.3 70B.
  • More regulations — BIPA (Illinois biometrics), Maryland HB1202, Tennessee ELVIS Act, California AB-2885. Each is a single markdown file.
  • "What-if" diff mode — re-run with a modified scenario ("what if we expand to Germany?") and show the diff against the previous verdict.
  • Generated compliance documents — once we know NYC LL 144 § 2 applies, generate a draft candidate notification with the user's company name pre-filled.
  • Evaluation harness — a test set of ~20 ground-truth scenarios with expected applicable regulations; CI runs the agent against them and reports precision/recall on framework selection.
  • Compliance-as-code generator — translate a verdict into a YAML policy file an ATS can enforce: "block all California applicants if no ADMT pre-use notice is configured."

Compliance Copilot is not legal advice. It's the legal scoping step, automated for the price of a tweet.

Built on NVIDIA Nemotron 3 via Crusoe Managed Inference.

Built With

Share this project:

Updates