Inspiration

LLMs in legal settings have a notorious failure mode: confidently inventing case citations, which has gotten real lawyers sanctioned. They also drift into giving legal advice and predicting outcomes, which raises unauthorized-practice-of-law risk and can harm vulnerable users. I wanted to show that an LLM can be deployed safely in a high-stakes domain, and that its safety can be measured on every response rather than assumed, using Gemini and Arize Phoenix.

What it does

A self-represented person submits their case (summary, timeline, evidence). The agent returns a structured Readiness Check: a neutral summary, concrete strengths and gaps, an evidence-clarity assessment, an attorney-recommendation band based on case complexity (never odds of winning), suggested next actions, and real, linkable case law. Every response carries a clear not-legal-advice disclaimer.
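As a rough illustration of the response shape described above, here is a minimal sketch in Python. The class names, field names, band values, and disclaimer wording are all assumptions made for this example, not the project's actual schema. The one property it tries to show is that the disclaimer is fixed by the type and cannot be omitted or overridden by a caller.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum

# Hypothetical wording; the real disclaimer text is not given in this write-up.
DISCLAIMER = "This is general information, not legal advice."


class AttorneyBand(str, Enum):
    """Hypothetical bands. They describe case complexity, never odds of winning."""

    OPTIONAL = "optional"
    RECOMMENDED = "recommended"
    STRONGLY_RECOMMENDED = "strongly_recommended"


@dataclass(frozen=True)
class CaseLink:
    case_name: str
    url: str


@dataclass(frozen=True)
class ReadinessCheck:
    summary: str
    strengths: list[str]
    gaps: list[str]
    evidence_clarity: str
    attorney_band: AttorneyBand
    next_actions: list[str]
    case_law: list[CaseLink] = field(default_factory=list)
    # init=False: callers cannot pass or override the disclaimer.
    disclaimer: str = field(default=DISCLAIMER, init=False)


check = ReadinessCheck(
    summary="Tenant seeks return of a security deposit.",
    strengths=["Signed lease provided"],
    gaps=["No move-out inspection record"],
    evidence_clarity="partial",
    attorney_band=AttorneyBand.RECOMMENDED,
    next_actions=["Request the inspection report in writing"],
)
payload = asdict(check)  # plain dict, ready for JSON serialization
```

Keeping the band as an enum of complexity levels, rather than a number, makes it harder for a later change to turn it into something that reads like a probability of success.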

Key features

  • Grounding boundary — the model proposes neutral search topics; code fetches real opinions from CourtListener; only real, linkable cases reach the user. Fabricated citations are structurally prevented, not just discouraged. (Verified run: 8/8 returned cases resolved to live CourtListener URLs — see READINESS_REPORT.txt.)
  • Safety as a measured SLA — four Arize Phoenix LLM-as-a-judge evals score every traced output: No Legal Advice, No Outcome Prediction, Citation Grounding, Calm Factual Tone.
  • Full inspectability — every Gemini call and tool call is an OpenInference span in Phoenix.
  • Self-improvement loop — the agent reads its lowest-scoring traces and drafts a better system prompt for human review.
  • Phoenix MCP integration — a Gemini session can query the agent's own traces, prompts, datasets, and experiments as MCP tools (@arizeai/phoenix-mcp).
  • Degrades safely: without a Phoenix key it still serves reviews (untraced); it refuses to start only when the model runtime is unavailable. No silent half-broken states.
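The grounding boundary in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `ground_citations`, the `fetch` callable, the record keys (`case_name`, `url`), the allowed host, and the default limit are all hypothetical. The model only supplies topic strings; every citation that reaches the user comes from a fetched record with a usable link.

```python
from typing import Callable, Iterable
from urllib.parse import urlparse

# Assumption: links are only trusted when they point at this host.
ALLOWED_HOST = "www.courtlistener.com"


def is_linkable(url: str) -> bool:
    """True only for https URLs on the allowed host."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.netloc == ALLOWED_HOST


def ground_citations(
    topics: Iterable[str],
    fetch: Callable[[str], list[dict]],
    limit: int = 5,
) -> list[dict]:
    """Return at most `limit` unique fetched records that carry a linkable URL.

    `topics` come from the model; `fetch` is code-owned retrieval. Nothing the
    model writes is ever copied into the citation list.
    """
    seen: set[str] = set()
    grounded: list[dict] = []
    for topic in topics:
        for record in fetch(topic):
            url = record.get("url") or ""
            if not is_linkable(url) or url in seen:
                continue  # drop records without a real link, and duplicates
            seen.add(url)
            grounded.append({"case_name": record.get("case_name", ""), "url": url})
            if len(grounded) >= limit:
                return grounded
    return grounded
```

Because `fetch` is injected, the boundary can be exercised in tests with a stub that returns a mix of good and bad records, without any network access.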

How I built it

  • Agent runtime: Google ADK (google-adk) with a single Agent + a FunctionTool, served behind FastAPI (POST /readiness, GET /health).
  • Model: Gemini (gemini-2.5-flash) for the agent, the eval judge, and the prompt improver.
  • Observability & evals: Arize Phoenix via OpenInference auto-instrumentation (openinference-instrumentation-google-adk), phoenix.evals (GeminiModel + llm_classify), and SpanEvaluations logged back to traces.
  • MCP: Arize Phoenix MCP server registered in .gemini/settings.json.
  • Grounding data: CourtListener REST API for real U.S. court opinions.
  • Deploy: Containerized (Dockerfile) for Google Cloud Run.
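One way the stack above could be wired at startup so that tracing is optional and the model is not: the sketch below assumes, purely for illustration, that the model runtime is signalled by a `GOOGLE_API_KEY` environment variable and tracing by `PHOENIX_API_KEY`. The variable names, `RuntimeConfig`, `load_config`, and the health payload shape are hypothetical and are not taken from the project.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RuntimeConfig:
    model_api_key: str
    phoenix_api_key: Optional[str]

    @property
    def tracing_enabled(self) -> bool:
        return self.phoenix_api_key is not None


def load_config(env: Mapping[str, str]) -> RuntimeConfig:
    """Fail fast without a model key; treat a missing Phoenix key as 'no tracing'."""
    model_key = env.get("GOOGLE_API_KEY", "").strip()
    if not model_key:
        # Hard failure: the service cannot produce reviews without the model.
        raise RuntimeError("GOOGLE_API_KEY is required; refusing to start")
    phoenix_key = env.get("PHOENIX_API_KEY", "").strip() or None
    return RuntimeConfig(model_api_key=model_key, phoenix_api_key=phoenix_key)


def health_payload(config: RuntimeConfig) -> dict:
    """Hypothetical body for GET /health that makes the degraded mode visible."""
    return {"status": "ok", "tracing": config.tracing_enabled}


if __name__ == "__main__":
    print(health_payload(load_config(os.environ)))
```

Reporting the tracing state in the health response is one way to keep the untraced mode explicit rather than silent.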

Technologies used

Google Gemini · Google Agent Development Kit (ADK) · Google Cloud Run · Arize Phoenix (OpenInference tracing, LLM-as-a-judge evals, Phoenix MCP server) · FastAPI/Uvicorn · CourtListener API · Python 3.12.

Data sources

  • CourtListener (Free Law Project) — real, citable U.S. court opinions, used as the grounding source. Public REST API; works anonymously or with a free token.
  • The agent's own Phoenix traces — operational data that feeds the evals and the self-improvement loop.
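For readers who want to try the grounding source, here is a small sketch of how a search request might be built with only the standard library. The endpoint path (`/api/rest/v4/search/`), the `type=o` opinion filter, the `Token` authorization scheme, and the relative `absolute_url` field reflect my understanding of CourtListener's public API and should be checked against its documentation; the function names are hypothetical. The sketch only constructs the request and does not send it.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin
from urllib.request import Request

BASE_URL = "https://www.courtlistener.com"
SEARCH_PATH = "/api/rest/v4/search/"  # assumption: v4 search endpoint


def build_search_request(topic: str, token: Optional[str] = None) -> Request:
    """Build (but do not send) an opinion search request for a neutral topic."""
    query = urlencode({"q": topic, "type": "o"})  # "o" = opinions (assumed)
    headers = {"Accept": "application/json"}
    if token:
        # Anonymous access works without this header; a token is optional.
        headers["Authorization"] = f"Token {token}"
    return Request(f"{BASE_URL}{SEARCH_PATH}?{query}", headers=headers)


def to_public_url(absolute_url: str) -> str:
    """Turn a site-relative path from a search result into a full link."""
    return urljoin(BASE_URL, absolute_url)


request = build_search_request("security deposit")
link = to_public_url("/opinion/123/example-v-example/")
```

Passing the returned `Request` to `urllib.request.urlopen` would perform the actual call; that step is left out here so the example runs offline.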

Findings & learnings

  • Move grounding out of the prompt and into code. "Please don't hallucinate citations" in a system prompt is not a guarantee. Making the tool own the fetch — model picks the query, code owns the retrieval — turns a hope into a structural property.
  • Treat safety rules as evals, not vibes. Encoding the hard rules as four scored judges made "is it safe?" a visible metric on every trace instead of a manual spot-check.
  • The same operational data closes the loop. Phoenix traces feed both the evals and the prompt-improver, so the system can point at concrete failing outputs when proposing a fix.
  • Degradability matters for a reference architecture. Making tracing optional (no key = no tracing, app still runs) keeps the project easy to try while showing the production path.
  • The pattern generalizes. Swap search_caselaw for any retrieval that returns real records, and the grounding/eval/self-improve scaffolding should carry over to other high-stakes domains.
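The "evals, not vibes" and "closes the loop" points can be made concrete with a small sketch. It assumes eval results have already been exported as flat rows with hypothetical keys `trace_id`, `eval`, and `score` (1.0 for pass, 0.0 for fail); this is not Phoenix's export format, and the function names are made up for illustration. The same rows yield both a per-eval pass rate and the traces a prompt improver would look at first.

```python
from collections import defaultdict


def pass_rates(rows: list[dict]) -> dict[str, float]:
    """Fraction of passing scores for each eval name."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["eval"]] += 1
        passed[row["eval"]] += int(row["score"] >= 1.0)
    return {name: passed[name] / totals[name] for name in totals}


def lowest_scoring_traces(rows: list[dict], k: int = 3) -> list[str]:
    """Trace ids ordered by mean score, worst first (ties broken by id)."""
    scores: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        scores[row["trace_id"]].append(row["score"])
    means = {trace: sum(vals) / len(vals) for trace, vals in scores.items()}
    return sorted(means, key=lambda trace: (means[trace], trace))[:k]


rows = [
    {"trace_id": "t1", "eval": "No Legal Advice", "score": 1.0},
    {"trace_id": "t1", "eval": "Citation Grounding", "score": 1.0},
    {"trace_id": "t2", "eval": "No Legal Advice", "score": 0.0},
    {"trace_id": "t2", "eval": "Citation Grounding", "score": 1.0},
]
rates = pass_rates(rows)
worst = lowest_scoring_traces(rows, k=1)
```

In a setup like this, the ids in `worst` would be used to pull the failing outputs that the drafted system prompt is meant to address, with a human reviewing the draft before it ships.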

Built With

  • arize-phoenix
  • courtlistener-api
  • css
  • docker
  • fastapi
  • google-agent-development-kit-(adk)
  • google-cloud-build
  • google-cloud-run
  • google-gemini
  • html
  • javascript
  • model-context-protocol-(mcp)
  • openinference
  • opentelemetry
  • pandas
  • python
  • uvicorn