Benchmarking Reliability in AI-generated Legal Advice

In the contemporary United States, the prohibitive cost of legal counsel forces many individuals to navigate complex legal processes on their own, from eviction proceedings to tax disputes, without professional support. As a result, reliance on general-purpose Large Language Models (LLMs) is surging. In high-stakes legal domains, however, the reliability of these models remains largely unverified.

To address this, we present a benchmark of eight leading LLMs, using New York City tenant law as our proving ground. We built an evaluation pipeline that uses OpenAI’s ChatGPT 5.2 to parse the Attorney General’s Tenants’ Rights Guide into a strict “ground truth” dataset. We then stress-tested eight models (Llama 3, Llama 4 Scout, Mistral, Gemini 2.5 Flash, GPT 5.1, GPT 5 Nano, GPT 5 Mini, and GPT 4.1) against five key metrics: safety, legal grounding, reasoning rigor, factual recall, and semantic similarity.
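As a rough illustration of how two of these metrics, factual recall and semantic similarity, could be scored against a ground-truth entry, here is a minimal Python sketch. It is not the project's implementation: the function names, the token-overlap definition of recall, and the bag-of-words cosine used as a stand-in for semantic similarity are all assumptions made for this example, and the sample strings are placeholders rather than benchmark data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def factual_recall(answer, key_facts):
    """Fraction of key facts whose tokens all appear in the answer.

    Hypothetical definition: a fact counts as recalled only if every
    one of its tokens occurs somewhere in the model's answer.
    """
    if not key_facts:
        return 0.0
    answer_tokens = set(tokenize(answer))
    hits = sum(1 for fact in key_facts if set(tokenize(fact)) <= answer_tokens)
    return hits / len(key_facts)


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity, a crude proxy for semantic similarity."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Placeholder strings for illustration only, not benchmark data.
sample_answer = "Your landlord has to give you written notice first."
sample_facts = ["written notice", "landlord", "thirty days"]

recall = factual_recall(sample_answer, sample_facts)
similarity = cosine_similarity(sample_answer, "The landlord must give written notice.")
```

A production pipeline would more plausibly use sentence embeddings for similarity; the bag-of-words version is used here only because it runs with the standard library alone.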

Execution was a race against time. Working within a strict 20-hour window, we built a fully automated Python pipeline so that testing could be comprehensive without sacrificing depth. Because the framework is scalable and domain-agnostic, it can be adapted quickly to new legal corpora: today tenants’ rights, tomorrow immigration law. This project is a preliminary proof of concept, and it offers a reproducible blueprint for auditing AI in sensitive domains. We gratefully acknowledge the Duke University Office of Information Technology for providing the essential computational resources.
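The domain-agnostic design described above can be pictured as a small harness in which the models, the dataset, and the metrics are all interchangeable inputs. The sketch below is an assumption about how such a harness might look, not the project's code: `run_benchmark`, the dictionary layout, the stub models, and the exact-match metric are hypothetical names chosen for illustration.

```python
from statistics import mean


def run_benchmark(models, dataset, metrics):
    """Score every model on every dataset item with every metric.

    models:  dict mapping model name -> callable(question) -> answer
    dataset: list of dicts with "question" and "reference" keys
    metrics: dict mapping metric name -> callable(answer, reference) -> float
    Returns a dict mapping model name -> {metric name: mean score}.
    """
    results = {}
    for model_name, ask in models.items():
        scores = {metric_name: [] for metric_name in metrics}
        for item in dataset:
            answer = ask(item["question"])
            for metric_name, score in metrics.items():
                scores[metric_name].append(score(answer, item["reference"]))
        results[model_name] = {
            name: (mean(values) if values else 0.0)
            for name, values in scores.items()
        }
    return results


# Placeholder dataset and stub "models" for illustration only.
demo_dataset = [
    {"question": "Q1", "reference": "A1"},
    {"question": "Q2", "reference": "A2"},
]
lookup = {item["question"]: item["reference"] for item in demo_dataset}
demo_models = {
    "echo": lambda question: lookup[question],  # always returns the reference
    "silent": lambda question: "",              # never answers
}
demo_metrics = {"exact_match": lambda answer, ref: float(answer == ref)}

demo_results = run_benchmark(demo_models, demo_dataset, demo_metrics)
```

Under this layout, moving to a new legal corpus means supplying a different `dataset` list, while real model clients replace the stub callables.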

Next, as computational resources grow, we will extend the evaluation to commercial legal AI products, broaden the linguistic features and the set of models covered, and audit reliability and bias across users of diverse backgrounds.

Link to slides: https://docs.google.com/presentation/d/1nJkU3DylH-KSh4aq5t62-vs3PI1ns4BNiFfNyZXu5Pg/edit?usp=sharing