Sentinel — The First Benefits AI That Knows When Not to Decide

💡 Inspiration

Every year, roughly $60 billion in U.S. public benefits goes unclaimed — not because people don't qualify, but because the system is unnavigable. About 1 in 4 eligible Americans never receive benefits they're legally owed. A single mother in Texas would have to navigate ten separate state portals (SNAP, Medicaid, CHIP, Section 8, TANF, WIC…), each governed by federal regulations she can't read, like 7 CFR 273.9(a).

The obvious instinct is "throw an LLM at it." But that's dangerous: an LLM that decides eligibility can hallucinate a regulation, invent a dollar amount, or confidently deny someone benefits they're entitled to — and the person would never know. We were inspired by a harder, more honest question:

What if the most important thing an AI can do is recognize when it shouldn't decide at all?

That became the soul of Sentinel: an AI whose climax isn't confetti but the moment it says "this case is too important for me to decide alone" and escalates to a human.

🧠 What it does

A citizen has a natural conversation. Behind it:

  1. A deterministic rule engine evaluates their profile against real federal rules (income vs. Federal Poverty Level, citizenship, assets). No model sampling is involved, so the same profile always yields the same verdict, and every step is auditable.
  2. A confidence scorer decomposes certainty across multiple factors.
  3. An escalation engine routes low-confidence cases to human reviewers.
  4. An LLM (Groq / Llama 3.3 70B) does language only — conversational intake, plain-English explanations, translation — never the decision.
  5. Every determination is sealed into a SHA-256 tamper-evident audit record.
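
As an illustration of step 5, here is a minimal sketch of how a determination could be sealed with SHA-256. The function names, the record layout, and the idea of chaining each record to the previous digest are assumptions made for this example, not a description of Sentinel's actual audit format.

```python
import hashlib
import json


def _canonical(determination: dict) -> str:
    # Sorted keys and fixed separators so identical data always serializes identically.
    return json.dumps(determination, sort_keys=True, separators=(",", ":"))


def seal_record(determination: dict, prev_hash: str = "") -> dict:
    """Wrap a determination with a SHA-256 digest covering it and the previous digest."""
    digest = hashlib.sha256((prev_hash + _canonical(determination)).encode("utf-8")).hexdigest()
    return {"determination": determination, "prev_hash": prev_hash, "hash": digest}


def verify_record(record: dict) -> bool:
    """Recompute the digest; any change to the stored determination makes this fail."""
    payload = record["prev_hash"] + _canonical(record["determination"])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest() == record["hash"]
```

Chaining each record to its predecessor (passing the previous `hash` as `prev_hash`) means an edit to one record also invalidates every later one, which is one common way to make a log tamper-evident.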

🏗️ How we built it

Architecture — the LLM touches language, never decisions:

$$ \text{verdict} = \text{RuleEngine}(\text{profile}, \text{rules}) \quad ; \quad \text{explanation} = \text{LLM}(\text{verdict}) $$

The decision is deterministic and reproducible; the LLM only narrates it. This separation is the entire product in one line.
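
A minimal sketch of that separation, under stated assumptions: `Verdict`, `rule_engine`, `explain`, and `citizenship_rule` are hypothetical names, and the `llm` argument stands in for whatever chat client is used. The point is only that the verdict is computed before the language model is called, and the model receives it as input.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    program: str
    eligible: bool
    reasons: tuple  # one human-readable reason per rule checked


# A rule is a pure function: profile -> (passed, reason).
Rule = Callable[[dict], tuple]


def citizenship_rule(profile: dict) -> tuple:
    # Hypothetical rule and profile key, for illustration only.
    ok = bool(profile.get("citizen", False))
    return ok, "citizenship requirement " + ("met" if ok else "not met")


def rule_engine(profile: dict, program: str, rules: list) -> Verdict:
    """Deterministic: the same profile and rules always give the same verdict."""
    results = [rule(profile) for rule in rules]
    eligible = all(ok for ok, _ in results)
    return Verdict(program, eligible, tuple(reason for _, reason in results))


def explain(verdict: Verdict, llm: Callable[[str], str]) -> str:
    """The LLM only narrates a verdict that has already been decided."""
    prompt = (
        "Explain this determination in plain English without changing it. "
        f"program={verdict.program} eligible={verdict.eligible} "
        f"reasons={'; '.join(verdict.reasons)}"
    )
    return llm(prompt)
```

Because `explain` never sees the profile, only the finished verdict, the model has nothing to decide with.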

Stack:

  • Frontend — Next.js 15, TypeScript, Tailwind, Framer Motion
  • Engine — Python: a typed rule engine with FPL math, e.g. SNAP eligibility is

$$ \text{income} \le 1.30 \times \text{FPL}(n), \qquad \text{FPL}(n) = 15060 + 5380\,(n-1) $$

  • AI — Groq (OpenAI-compatible) for intake + explanations; the LLM is given the verdict as input and forbidden from contradicting it
  • Backend — FastAPI + SQLAlchemy, JWT auth, RBAC
  • Confidence — a weighted multi-factor score:

$$ C = \sum_i w_i f_i - \sum_j p_j, \qquad \text{escalate if } C < 0.60 $$
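
The two formulas in this stack can be sketched in a few lines of Python. The constants come straight from the equations above; everything else (function names, the factor names in the usage note, treating income as an annual dollar figure) is an assumption for illustration.

```python
ESCALATION_THRESHOLD = 0.60  # from the rule "escalate if C < 0.60"


def fpl(household_size: int) -> int:
    """FPL(n) = 15060 + 5380 * (n - 1), as given in the formula above."""
    if household_size < 1:
        raise ValueError("household_size must be at least 1")
    return 15060 + 5380 * (household_size - 1)


def snap_income_ok(annual_income: float, household_size: int) -> bool:
    """SNAP income test from the formula above: income <= 1.30 * FPL(n)."""
    return annual_income <= 1.30 * fpl(household_size)


def confidence(factors: dict, weights: dict, penalties: list) -> float:
    """C = sum_i w_i * f_i - sum_j p_j."""
    weighted = sum(weights[name] * value for name, value in factors.items())
    return weighted - sum(penalties)


def should_escalate(c: float) -> bool:
    """Route the case to a human reviewer when confidence is below the threshold."""
    return c < ESCALATION_THRESHOLD
```

For example, with hypothetical factors `{"documents": 0.9, "income_clarity": 0.5}`, equal weights of 0.5, and a single penalty of 0.15, the score comes out near 0.55, which is below 0.60, so the case would be escalated.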

📚 What we learned

  • Determinism is a feature, not a limitation. For benefits, "same input → same verdict, always" is what makes the system trustworthy and legally defensible. We learned to be proud of the rule engine, not embarrassed by it.
  • The hardest engineering is at the seams. The rule data and the engine existed; the adapter that converts knowledge-graph rules into the engine's typed expression trees (with live FPL thresholds) was the piece that made it actually run end-to-end.
  • Honest UX beats impressive UX. Showing a case that falls below the confidence threshold escalate to a human is more powerful than any 100%-confidence animation.
  • Free-tier reality forces good architecture. Constraints pushed us toward lazy-loading, client-side determinism, and keeping the LLM key server-side.
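
To make the adapter idea concrete, here is a rough sketch of converting a rule record into a typed expression tree and evaluating it. The rule dictionary shape, the node classes, and the `adapt`/`evaluate` names are all hypothetical; only the `pct_fpl` field name and the notion of a citizenship list come from this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str  # looks up a value in the applicant profile


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Le:
    left: object
    right: object


@dataclass(frozen=True)
class In:
    field: Field
    allowed: frozenset


@dataclass(frozen=True)
class And:
    terms: tuple


def adapt(rule: dict, fpl_amount: float) -> And:
    """Turn a knowledge-graph style rule record into an expression tree."""
    terms = []
    if "pct_fpl" in rule:
        # e.g. pct_fpl=130 becomes income <= 1.30 * FPL, resolved against the live threshold
        terms.append(Le(Field("income"), Const(rule["pct_fpl"] / 100 * fpl_amount)))
    if "citizenship" in rule:
        terms.append(In(Field("citizenship"), frozenset(rule["citizenship"])))
    return And(tuple(terms))


def evaluate(expr, profile: dict):
    """Walk the tree; each node type has exactly one meaning, which keeps it auditable."""
    if isinstance(expr, Field):
        return profile[expr.name]
    if isinstance(expr, Const):
        return expr.value
    if isinstance(expr, Le):
        return evaluate(expr.left, profile) <= evaluate(expr.right, profile)
    if isinstance(expr, In):
        return evaluate(expr.field, profile) in expr.allowed
    if isinstance(expr, And):
        return all(evaluate(term, profile) for term in expr.terms)
    raise TypeError(f"unknown expression node: {expr!r}")
```

Since the tree is plain data, it can be logged alongside the verdict, so a reviewer can see exactly which comparison produced the result.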

🧗 Challenges we faced

  • Keeping the LLM in its lane. Preventing the model from deciding — only explaining — required strict prompt contracts and a deterministic fallback so the system never fails open.
  • The rule-adapter integration. Mapping pct_fpl income limits, net-vs-gross deductions, and citizenship lists into a typed expression tree without breaking the engine's audit trail.
  • Memory on a 512 MB free tier. We traced runtime imports and proved the REST path never needs LangChain/LangGraph, then stripped them — cutting the image from ~2 GB to a few hundred MB so it could even boot.
  • Reliability under demo pressure. We made the demo and core eligibility check fully client-side and deterministic, so they keep working even if the network drops.
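
One way the "prompt contract plus deterministic fallback" idea from the first challenge could look in code. This is a sketch under assumptions: the tag-prefix contract, the template wording, and the function names are invented for the example, and `llm` is any callable that takes a prompt and returns text.

```python
from typing import Callable


def template_explanation(program: str, eligible: bool) -> str:
    """Deterministic fallback text that needs no model at all."""
    status = "appear to qualify" if eligible else "do not appear to qualify"
    return f"Based on the rules checked, you {status} for {program}."


def safe_explanation(program: str, eligible: bool, llm: Callable[[str], str]) -> str:
    """Ask the LLM to explain a fixed verdict; fall back if it fails or contradicts it."""
    fallback = template_explanation(program, eligible)
    expected_tag = "ELIGIBLE:" if eligible else "NOT ELIGIBLE:"
    prompt = (
        f"The verdict for {program} is already decided. "
        f"Begin your reply with '{expected_tag}' and then explain it in plain English."
    )
    try:
        reply = llm(prompt)
    except Exception:
        # Network error, rate limit, timeout: never block the user on the model.
        return fallback
    if not reply.startswith(expected_tag):
        # The reply broke the contract (possibly contradicting the verdict), so discard it.
        return fallback
    return reply[len(expected_tag):].strip()
```

The verdict itself is never taken from the model's reply; at worst the user sees the plainer template text.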

🎯 Why it matters

Most "AI for benefits" optimizes for answering. Sentinel optimizes for knowing when uncertainty is dangerous. The rule engine decides, the LLM explains, and humans hold the cases that matter.

Trust isn't a feature here — it's the architecture.
