Inspiration
You ship an LLM app. Two weeks later a stakeholder asks: is it still hallucinating? Is the new prompt better than the old one? Which traces are the worst? These should not be Slack threads passed around for an afternoon. They should be one question to a single agent that already has access to Arize Phoenix's data. That's where gemini-eval-agent fits.
What it does
gemini-eval-agent treats every "is this LLM working?" question as an audit job. You ask it a question in plain English and it walks the Arize Phoenix MCP tools to produce a verdict:
- list_projects to resolve project names to IDs
- list_traces to see recent latency, cost, and quality scores
- get_trace_detail to read span trees and per-evaluator scores
- list_experiments for A/B test verdicts
- list_datasets to see what evaluation data is wired up
- run_evaluation to fire a fresh hallucination, relevance, toxicity, or helpfulness sweep
The agent's answer comes in a format an ML team can act on: a one-line verdict (PASS / FAIL / NEEDS REVIEW), 3-5 evidence bullets with specific trace IDs and evaluator scores, and one concrete next step.
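The answer format described above can be pinned down as a small data structure. The following is a minimal sketch, not the project's actual code: the class names `Evidence` and `AuditReport`, their field names, and the rendering layout are all assumptions made for illustration. Only the three verdict values and the 3-5 bullet rule come from the description.

```python
from dataclasses import dataclass

# The three verdict values named in the write-up.
ALLOWED_VERDICTS = ("PASS", "FAIL", "NEEDS REVIEW")


@dataclass
class Evidence:
    """One evidence bullet (hypothetical shape)."""
    trace_id: str    # Phoenix trace the bullet refers to
    evaluator: str   # e.g. "hallucination" or "relevance"
    score: float     # evaluator score reported for that trace
    note: str = ""   # optional one-line explanation


@dataclass
class AuditReport:
    """Verdict + evidence + next step (hypothetical shape)."""
    verdict: str
    evidence: list
    next_step: str

    def __post_init__(self) -> None:
        # Reject anything that does not match the documented format.
        if self.verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}")
        if not 3 <= len(self.evidence) <= 5:
            raise ValueError("expected 3-5 evidence bullets")
        if not self.next_step.strip():
            raise ValueError("next_step must not be empty")

    def render(self) -> str:
        """Format the report as plain text, one line per element."""
        lines = [f"Verdict: {self.verdict}"]
        for item in self.evidence:
            suffix = f" ({item.note})" if item.note else ""
            lines.append(
                f"- {item.trace_id}: {item.evaluator}={item.score:.2f}{suffix}"
            )
        lines.append(f"Next step: {self.next_step}")
        return "\n".join(lines)
```

Validating in `__post_init__` means a malformed model answer fails loudly when it is parsed rather than reaching a stakeholder as a half-formed verdict.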
How we built it
- Google Cloud Agent Builder (ADK) for the agent framework. The whole agent fits in six lines: one LlmAgent, one McpToolset, a Gemini model, and a system prompt that defines the audit workflow.
- Gemini 2.5 Flash on Vertex AI for reasoning. Fast enough for an interactive audit loop and cheap enough that reviewers can fire as many queries as they want.
- Arize Phoenix MCP server for tools. The agent talks to the official @arizeai/phoenix-mcp tool surface, with a stub server in the repo so demos run without a Phoenix tenant. Set PHOENIX_BASE_URL and PHOENIX_API_KEY and the same code targets a real tenant via npx.
- Streamlit for the dashboard.
- Cloud Run for hosting.
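The stub-versus-real switch in the list above can be sketched as one function that decides which MCP server process to launch. This is an illustration under stated assumptions, not the repository's code: the function name `phoenix_mcp_launch`, the stub path `stub_server.py`, and the shape of the returned dictionary are hypothetical. Whether the real server reads its credentials from environment variables or from command-line flags is also an assumption here; check the @arizeai/phoenix-mcp README before relying on it.

```python
import os
from typing import Mapping, Optional


def phoenix_mcp_launch(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a stdio launch spec for the real or the stub MCP server.

    The real server is chosen only when both PHOENIX_BASE_URL and
    PHOENIX_API_KEY are set; otherwise the local stub is used.
    """
    env = os.environ if env is None else env
    base_url = env.get("PHOENIX_BASE_URL")
    api_key = env.get("PHOENIX_API_KEY")

    if base_url and api_key:
        return {
            "mode": "real",
            "command": "npx",
            "args": ["-y", "@arizeai/phoenix-mcp"],
            # Assumption: credentials are forwarded through the environment.
            "env": {"PHOENIX_BASE_URL": base_url, "PHOENIX_API_KEY": api_key},
        }

    return {
        "mode": "stub",
        "command": "python",
        "args": ["stub_server.py"],  # hypothetical path to the repo's stub
        "env": {},
    }
```

The returned command, arguments, and environment are the three values a stdio-based MCP client typically needs, so the agent definition itself does not have to change between a demo and a real tenant.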
Challenges we ran into
Arize Phoenix's MCP server is TypeScript and its input schemas aren't surfaced as a single JSON file. I reconstructed the canonical tool surface (list_projects, list_traces, etc.) by reading the npm package README and matching the schema shape so the agent's tool calls drop in against a real tenant without any rewriting.
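One way to keep a hand-reconstructed tool surface honest is a check that the stub exposes exactly the tool names the agent expects. The sketch below is an assumption about how such a check could look. The helper name `surface_diff` is hypothetical, and only the six tool names are taken from this write-up. It compares names only, not input schemas.

```python
# Tool names listed in this write-up.
EXPECTED_TOOLS = frozenset(
    {
        "list_projects",
        "list_traces",
        "get_trace_detail",
        "list_experiments",
        "list_datasets",
        "run_evaluation",
    }
)


def surface_diff(stub_tool_names):
    """Compare the stub's tool names against the expected surface.

    Returns a dict with the names the stub is missing and the names
    it exposes that the agent does not expect, both sorted.
    """
    names = set(stub_tool_names)
    return {
        "missing": sorted(EXPECTED_TOOLS - names),
        "unexpected": sorted(names - EXPECTED_TOOLS),
    }
```

A test can then assert that both lists are empty, which catches a renamed or forgotten tool before the agent is pointed at a real tenant.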
Accomplishments that we're proud of
- A real end-to-end Vertex AI Gemini call returned a structured verdict: PASS, hallucination score 0.84, sample size 100. Real numbers, not hand-written.
- 11 passing tests cover the stub server's responses and the agent's wiring.
- The stub-vs-real split means reviewers can run the project on their machine in under five minutes without provisioning Phoenix.
- Three substantively different MCP integrations now exist as siblings (this one targets Arize Phoenix; companion projects target Dynatrace and RAG drift).
What we learned
MCP is a real abstraction. The same LlmAgent + McpToolset shape that works against a Dynatrace MCP server works unchanged against an Arize Phoenix MCP server. PagerDuty, Datadog, or Sentry MCP servers should drop in the same way.
What's next for gemini-eval-agent
- A scheduled "morning audit" mode that runs every weekday and emails the team a triage of any project with a quality-score drop.
- Cross-experiment summarization: "compare last week's A/B winners and explain the pattern."
- Plug in additional partner MCPs for richer context (PagerDuty for incident correlation, Slack for stakeholder notifications).
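The triage step of the planned "morning audit" could look roughly like the sketch below. Everything in it is hypothetical, since the feature is not built yet: the function name `projects_with_drop`, the input shape (project name mapped to a list of quality scores), the 0.05 default threshold, and the assumption that a higher score means better quality.

```python
from statistics import mean


def projects_with_drop(previous, current, threshold=0.05):
    """Flag projects whose mean quality score fell by at least `threshold`.

    `previous` and `current` map project name -> list of quality scores.
    Assumes higher scores are better. Projects without data in both
    windows are skipped rather than flagged.
    """
    flagged = []
    for project, scores in current.items():
        before = previous.get(project)
        if not before or not scores:
            continue  # nothing to compare against
        drop = mean(before) - mean(scores)
        if drop >= threshold:
            flagged.append((project, round(drop, 4)))
    # Largest drop first, so the email leads with the worst regression.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged
```

Scheduling and email delivery are left out on purpose; the sketch covers only the comparison that decides which projects make it into the report.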
