Inspiration
The 2026 FIFA World Cup will put millions of fans in front of AI concierge agents asking about fixtures, transit, tickets, and venues — in dozens of languages, under matchday load. A wrong answer about a kickoff time or a stadium gate isn't a minor glitch; it strands a fan in a foreign city. We wanted an agent that doesn't just answer, but continuously proves its answers are trustworthy and refuses to ship a regression.
What it does
PitchProof Vigil is an agent-reliability and evaluation layer for a World Cup fan-concierge agent. Every interaction is traced, and at runtime the agent queries its own traces, prompts, datasets, and experiments as tools through the Arize Phoenix MCP server. It runs evals against golden datasets and blocks releases when answer quality regresses. The core demo is a single screen: a live trace flows in, an eval verdict is rendered, and a regression blocks the deploy.
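To make the "eval verdict" step concrete, here is a minimal sketch of how one traced answer might be checked against a golden-dataset entry. It is an illustration only, not the project's evaluator: the names `GoldenExample`, `Verdict`, and `evaluate_answer`, the substring-matching rule, and the sample question are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    """Hypothetical golden-dataset entry: a fan question plus the facts a good answer must state."""
    question: str
    required_facts: tuple[str, ...]


@dataclass(frozen=True)
class Verdict:
    """Outcome of evaluating one answer: pass/fail and which facts were missing."""
    passed: bool
    missing: tuple[str, ...]


def evaluate_answer(example: GoldenExample, answer: str) -> Verdict:
    """Pass only if every required fact appears in the answer (case-insensitive substring match)."""
    normalized = answer.casefold()
    missing = tuple(
        fact for fact in example.required_facts if fact.casefold() not in normalized
    )
    return Verdict(passed=not missing, missing=missing)


# Illustrative data, not taken from the real golden datasets.
example = GoldenExample(
    question="Which gate do I use for section 112?",
    required_facts=("Gate C", "section 112"),
)
good = evaluate_answer(example, "Use Gate C for Section 112.")
bad = evaluate_answer(example, "Use Gate A for Section 112.")
```

A real evaluator would likely use a richer scoring rule (for example an LLM judge), but the shape stays the same: one golden entry in, one verdict out.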
How we built it
A Python/FastAPI backend with a React/TypeScript frontend. Gemini powers the agent's reasoning and planning. Arize Phoenix, via its MCP server, provides production-grade tracing plus runtime self-querying of traces and evals. Releases are gated through a CI workflow that runs the golden-dataset eval and fails on regression. The build hardened across nine versioned iterations to 509 backend tests at 100% coverage, plus a standalone SDK at 100% coverage.
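The release gate described above can be sketched as a function that turns eval results into a process exit code, which is what a CI workflow ultimately reacts to. This is a sketch under stated assumptions, not the project's workflow: `pass_rate`, `gate`, the `baseline` value, and the `tolerance` parameter are hypothetical names chosen for illustration.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of golden-dataset examples that passed their eval."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(results) / len(results)


def gate(results: Sequence[bool], baseline: float, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 allows the release, 1 blocks it.

    A regression is any pass rate that falls below the baseline by more
    than the allowed tolerance.
    """
    rate = pass_rate(results)
    regressed = rate < baseline - tolerance
    status = "BLOCK" if regressed else "ALLOW"
    print(f"{status}: pass rate {rate:.2%} vs baseline {baseline:.2%}")
    return 1 if regressed else 0


# In a CI step, the result would typically become the exit status, e.g.:
#     raise SystemExit(gate(results, baseline=0.75))
```

Returning an exit code rather than raising keeps the function easy to unit test, while the CI job still fails on any non-zero status.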
Challenges we ran into
The hardest part was keeping the eval loop deterministic and fully covered without live credentials in CI. We solved it with injectable transports and offline integration scaffolds that run against the real services when the relevant env vars are present. Alembic on SQLite also kept emitting spurious migration operations, which had to be stripped by hand and verified with a balanced up/down chain (8 up / 8 down).
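The injectable-transport idea can be illustrated with a small client whose network call is a swappable function. This is a generic sketch of the pattern, not the project's code: `EvalClient`, `http_transport`, `fake_transport`, `build_client`, the `/score` path, and the `EVAL_SERVICE_URL` variable are all hypothetical.

```python
import json
import os
import urllib.request
from typing import Callable

# A transport takes a URL and a JSON-serializable payload and returns the decoded reply.
Transport = Callable[[str, dict], dict]


def http_transport(url: str, payload: dict) -> dict:
    """Real transport: POST the payload as JSON and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


class EvalClient:
    """Client whose transport is injected, so tests never need the network."""

    def __init__(self, base_url: str, transport: Transport = http_transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport

    def score(self, question: str, answer: str) -> float:
        reply = self.transport(
            f"{self.base_url}/score", {"question": question, "answer": answer}
        )
        return float(reply["score"])


def fake_transport(url: str, payload: dict) -> dict:
    """Deterministic offline stand-in used when no credentials are available."""
    return {"score": 1.0 if "Gate C" in payload["answer"] else 0.0}


def build_client() -> EvalClient:
    """Use the live service only when the (hypothetical) env var is set."""
    live_url = os.environ.get("EVAL_SERVICE_URL")
    if live_url:
        return EvalClient(live_url)
    return EvalClient("http://offline.invalid", transport=fake_transport)


client = EvalClient("http://offline.invalid/", transport=fake_transport)
```

With this shape, CI runs deterministically against `fake_transport`, and the same tests exercise the real service on machines where the env var is present.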
Accomplishments that we're proud of
509 backend tests at 100% coverage, a standalone SDK at 100%, a clean nine-version git history with annotated tags, and a demo that closes the full trace → eval → block-on-regression loop on one screen.
What we learned
That the strongest reliability story is an agent observing and evaluating itself: wiring Phoenix as a runtime MCP toolset — not just a passive dashboard — turns evaluation into a live gate rather than an after-the-fact report.
What's next for PitchProof Vigil
Wiring the live Phoenix MCP endpoint and Gemini API in production, expanding the multilingual golden datasets, and extending the regression gate across a full six-track World Cup fan-experience suite.
Built With
- alembic
- arize-phoenix
- fastapi
- gemini
- github
- google-cloud
- mcp
- pytest
- python
- react
- sqlalchemy
- sqlite
- typescript