PitchProof Vigil

The trace → eval → block-on-regression loop in one screen, gated by the Arize Phoenix MCP server.

Inspiration

The 2026 FIFA World Cup will put millions of fans in front of AI concierge agents asking about fixtures, transit, tickets, and venues — in dozens of languages, under matchday load. A wrong answer about a kickoff time or a stadium gate isn't a minor glitch; it strands a fan in a foreign city. We wanted an agent that doesn't just answer, but continuously proves its answers are trustworthy and refuses to ship a regression.

What it does

PitchProof Vigil is an agent-reliability and evaluation layer for a World Cup fan-concierge agent. Every interaction is traced; the agent queries its own traces, prompts, datasets, and experiments at runtime — as tools — via the Arize Phoenix MCP server. It runs evals against golden datasets and blocks releases when answer quality regresses. The core demo is a single screen: a live trace flows in, an eval verdict is rendered, and a regression blocks the deploy.

How we built it

A Python/FastAPI backend with a React/TypeScript frontend. Gemini powers the agent's reasoning and planning. Arize Phoenix, via its MCP server, provides production-grade tracing plus runtime self-querying of traces and evals. Releases are gated through a CI workflow that runs the golden-dataset eval and fails the build on regression. We hardened the build over nine versioned iterations, reaching 509 backend tests at 100% coverage, plus a standalone SDK at 100% coverage.
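As a rough illustration of the "fail on regression" step, here is a minimal sketch of a golden-dataset gate. It is not the project's actual code: the names `GoldenExample`, `exact_match_score`, and `gate`, the exact-match metric, and the tolerance parameter are all assumptions made for the example, and the real evals presumably use richer scoring.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenExample:
    """One golden-dataset row (hypothetical shape for this sketch)."""
    question: str
    expected: str


def exact_match_score(agent: Callable[[str], str],
                      dataset: Sequence[GoldenExample]) -> float:
    """Fraction of golden examples the agent answers exactly
    (case- and whitespace-insensitive). A stand-in for a real eval metric."""
    if not dataset:
        raise ValueError("golden dataset is empty")
    hits = sum(
        1 for ex in dataset
        if agent(ex.question).strip().lower() == ex.expected.strip().lower()
    )
    return hits / len(dataset)


def gate(candidate_score: float, baseline_score: float,
         tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 lets the release through,
    1 blocks it because the candidate scored below the baseline."""
    return 0 if candidate_score >= baseline_score - tolerance else 1
```

In a CI job, the script would end with something like `sys.exit(gate(candidate, baseline))`, so that a non-zero exit code fails the workflow step and blocks the deploy.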

Challenges we ran into

Keeping the eval loop deterministic and fully covered without live credentials in CI. We solved this with injectable transports and integration scaffolds that run offline by default and hit real services only when the relevant env vars are present. Alembic on SQLite also kept emitting spurious migration operations, which we stripped by hand and verified with a balanced up/down chain (8 up / 8 down).
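The injectable-transport idea can be sketched roughly as follows. This is an assumed shape, not the project's implementation: the `Transport` type, `EvalClient`, `build_client`, and the `VIGIL_LIVE_API_KEY` variable name are hypothetical, and the live transport is passed in as a factory so the sketch never touches a real network client.

```python
import os
from typing import Callable, Mapping, Optional

# A transport takes a prompt and returns the raw answer text.
Transport = Callable[[str], str]


def fake_transport(prompt: str) -> str:
    """Deterministic offline stand-in used when no credentials are set."""
    return f"stub-answer:{prompt}"


class EvalClient:
    """Thin client whose only dependency is the injected transport."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def ask(self, prompt: str) -> str:
        return self._transport(prompt)


def build_client(
    env: Mapping[str, str] = os.environ,
    live_factory: Optional[Callable[[str], Transport]] = None,
) -> EvalClient:
    """Use the live transport only if a key is present AND a factory is
    supplied; otherwise fall back to the offline fake.
    VIGIL_LIVE_API_KEY is a hypothetical variable name for this sketch."""
    api_key = env.get("VIGIL_LIVE_API_KEY")
    if api_key and live_factory is not None:
        return EvalClient(live_factory(api_key))
    return EvalClient(fake_transport)
```

Because the environment is itself a parameter, tests can pass a plain dict instead of mutating `os.environ`, which keeps CI runs deterministic whether or not credentials exist on the machine.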

Accomplishments that we're proud of

509 backend tests at 100% coverage, a standalone SDK at 100%, a clean nine-version git history with annotated tags, and a demo that closes the full trace → eval → block-on-regression loop on one screen.

What we learned

That the strongest reliability story is an agent observing and evaluating itself: wiring Phoenix as a runtime MCP toolset — not just a passive dashboard — turns evaluation into a live gate rather than an after-the-fact report.

What's next for PitchProof Vigil

Wiring the live Phoenix MCP endpoint and Gemini API in production, expanding the multilingual golden datasets, and extending the regression gate across a full six-track World Cup fan-experience suite.