EvalForge

Inspiration

Every hackathon ships an "AI chatbot." Almost none of them ship the engineering surface that makes an AI system safe in production: golden eval datasets, faithfulness scoring, citation enforcement, prompt-injection guardrails, trace and span observability, cost tracking, and a deploy gate.

We built the layer that should sit underneath every one of those chatbots.

The story is simple: everyone else built AI. We built the engineering system that proves AI is safe enough to ship.

What it does

EvalForge is a reliability cockpit. You point it at two RAG pipelines, a baseline and an engineered version, and a 25-question golden dataset spanning healthcare, fintech compliance, cybersecurity, and adversarial probes, and it tells you whether to ship.

Specifically it:

Runs prompt regression tests across the dataset on both pipelines and stores every trace.
Scores each answer with RAGAS-shape metrics: faithfulness, answer relevance, context precision and recall, citation accuracy.
Enforces typed structured output via Pydantic schemas with a one-shot corrective retry on parse failure.
Runs 5 inline guardrails per response: PII leak, prompt injection, jailbreak intent, refusal compliance, citation enforcement.
Captures full traces and spans for every call (Langfuse-compatible shape) so the pipeline path is auditable: HyDE expand, BM25 retrieve, rerank, prompt build, LLM call, Pydantic parse, guardrail chain.
Tracks cost and p95 latency per question, per pipeline, per run.
Renders a deploy gate: a single PASS or FAIL pill backed by 10 configurable thresholds, with a Markdown audit report.

The cockpit is a Next.js dashboard. The engine is a FastAPI app. The whole thing deploys to Vercel as a single monorepo, runs locally with one command, and works with zero LLM API keys via a deterministic mock.

How we built it

Layer	Tech	Why
Frontend	Next.js 15 App Router, TypeScript strict, Tailwind CSS	One polished cockpit, fast routing, dark theme out of the box
Backend	FastAPI, Python 3.12, SQLModel, Pydantic v2	Typed everywhere, easy to extend with new metrics
Baseline RAG	Hash-embedded cosine, single prompt, raw string output	The "weekend chatbot" failure mode we want to beat
Engineered RAG	BM25 retrieval, HyDE query expansion, lexical+semantic rerank, typed JSON contract, retry-on-parse, inline guard chain	The engineered pipeline beats the baseline reproducibly, offline
Evals	Local RAGAS-shape implementation in `backend/app/evals/metrics.py`	Demo never blocks on a second eval LLM
Guardrails	Custom inline Guardrails-AI-style chain	Inspectable per-trace, with severity + reason
Tracing	Langfuse-compatible local tracer + SQLite store	One env-var swap to move to Langfuse SaaS
Deploy gate	10 thresholds in `backend/app/config.py`	Single PASS/FAIL hero verdict + Markdown audit
Testing	pytest (6 backend tests), Playwright (7 demo journeys)	Backend pipeline + frontend cockpit both covered
Mock mode	Deterministic dataset-aware mock LLM, HyDE-aware	Demo runs anywhere with zero API keys
Deploy	Vercel monorepo, Python serverless function under `api/index.py` with an ASGI path-rewriter	One `vercel deploy` ships everything

We deliberately re-implemented RAGAS, Langfuse-shape tracing, and Guardrails locally so the demo is portable and reproducible. Each surface is shaped to be swap-compatible with the real upstream tool, so adoption is gradual.

Challenges we ran into

Reproducibility vs realism. Hash-deterministic embeddings + a deterministic mock LLM give us a perfectly reproducible demo, but we had to design the engineered pipeline so the engineering improvements (HyDE, rerank, typed parse, refusal, citation) are visibly correct even when the final LLM call is mocked. One real bug surfaced from this: mock HyDE was returning baseline-style text, corrupting secondary retrieval. Fixed by making the mock recognise HyDE prompts via the "hypothetical" system-prompt signature.
Metric definitions. Faithfulness and citation-accuracy are easy to wave at, hard to define rigorously. We pinned each metric to a single mathematical definition with a justification string per row so reviewers can audit.
One screen for "ship or not ship." The deploy gate page went through three iterations to land on a single hero verdict, a failed-gate list, and a Markdown report, exactly the audit artifact a real release manager wants.
Vercel serverless constraints. The Vercel Python runtime is read-only outside /tmp, and the function bundle is capped at 500MB. We point the SQLite DB and KB index at /tmp, scope includeFiles to backend/app/** + the KB data, and wrap FastAPI in an ASGI middleware that re-derives the request path from the captured query parameter so a single function handles all /api/* traffic.

Accomplishments that we're proud of

A working PASS/FAIL deploy gate backed by 10 measured signals: faithfulness, answer relevance, context recall, citation accuracy, structured-output validity, refusal correctness, PII leak count, prompt-injection bypass count, p95 latency, cost per answer.
A side-by-side baseline vs engineered comparison that judges can replay live.
A full trace + span waterfall for every request, with raw attribute drill-down.
A Markdown audit report suitable for a real release process.
Zero-key demo path. The project runs anywhere with no internet, no LLM key.
One-command everything, make dev for local, vercel deploy --prod for production.
Live on Vercel at https://evalforge-omega.vercel.app, with the full stack (Next.js + FastAPI) deployed as a single monorepo.

What we learned

BAML-shape typed prompts. Forcing the LLM to fill a Pydantic schema with answer, citations[], confidence, refusal_reason is the single biggest reliability win, structured_output_validity jumps from 0 to 1.0.
BM25 beats a bad embedder. Our weak hash-embedding cosine was missing the right chunks for over half the dataset. Swapping the engineered pipeline to BM25 brought context recall from 0.36 to 1.00.
HyDE is sensitive to the mock. In real-LLM mode HyDE adds signal. In a mock that returns baseline text it adds noise, which we fixed by making the mock HyDE-aware. Hidden lesson: every component in a RAG pipeline interacts with every other one. You need traces to debug it.
Deploy-gate framing is the real product. Once you have one PASS/FAIL signal backed by an audit report, the conversation shifts from "is the model good" to "what threshold do you ship at."

Where this goes next: evaluating agents, not just RAG

The 25-question golden set proves the framework on RAG. The same trace + metric + guard + gate plumbing works for agents built on the Claude Agent SDK (or any tool-using framework). Concrete use cases where dropping EvalForge underneath an agent immediately pays off:

Coding agent with a CI/CD feedback loop. An agent identifies bugs, writes fixes, runs tests, iterates. EvalForge traces every tool call so you can tell "test passed" from "test silently skipped"; the deploy gate blocks PRs where the agent's success rate dropped after a prompt edit.
Customer support agent with multi-tool orchestration. Database query then email fetch then LLM reasoning then ticket-API update. EvalForge measures retrieval quality (did we fetch the right account?), escalation accuracy (under-escalate vs over-escalate), and per-tool latency so one slow API doesn't hide behind the surface metric.
Research agent with multi-step RAG + browsing. Faithfulness and citation-correctness are the hard problems here, exactly what EvalForge already scores. Plus the trace viewer surfaces which retrieval round failed and why, instead of a "looks fine" final report.
Compliance-graded health/finance assistant. PII and refusal-correctness guards plus the Markdown audit report are the artifact a regulator asks for. Today most teams generate this manually after a customer complaint; EvalForge generates it on every PR.

In each case the SDK gives you the agent. EvalForge gives you the proof it's working.

What's next for EvalForge

(engineering backlog)

Native Langfuse export by flipping the tracer adapter (one file).
Pluggable embedders + rerankers (OpenAI, Cohere, Voyage, Vertex).
GitHub Action that runs the eval + posts the deploy-gate verdict as a PR comment.
Multi-tenant SaaS surface (org, projects, runs over time) on Supabase Auth.
Diff view: prompt v1 vs prompt v2 deltas per question and per metric.
Plug the same runner into a Claude Agent SDK agent and score tool-call trajectories, not just final answers.