gemini-eval-agent

Inspiration

You ship an LLM app. Two weeks later a stakeholder asks: is it still hallucinating? Is the new prompt better than the old one? Which traces are the worst? These should not be Slack threads passed around for an afternoon. They should be one question to a single agent that already has access to Arize Phoenix's data. That's where gemini-eval-agent fits.

What it does

gemini-eval-agent treats every "is this LLM working?" question as an audit job. You ask a question in plain English, and it calls the Arize Phoenix MCP (Model Context Protocol) tools to produce a verdict:

list_projects to resolve project names to IDs
list_traces to see recent latency, cost, and quality scores
get_trace_detail to read span trees and per-evaluator scores
list_experiments for A/B test verdicts
list_datasets to see what evaluation data is wired up
run_evaluation to fire a fresh hallucination, relevance, toxicity, or helpfulness sweep

The agent's answer comes in a format an ML team can act on: a one-line verdict (PASS / FAIL / NEEDS REVIEW), 3-5 evidence bullets with specific trace IDs and evaluator scores, and one concrete next step.
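The answer format above can be sketched as a small validated container. This is a minimal illustration, not the project's actual code: the class name `AuditVerdict`, its field names, and the `render` layout are all assumptions made for this example.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three verdict labels named in the writeup.
ALLOWED_VERDICTS = ("PASS", "FAIL", "NEEDS REVIEW")


@dataclass(frozen=True)
class AuditVerdict:
    """Hypothetical container for the answer format (names are assumptions)."""

    verdict: str
    evidence: tuple[str, ...]
    next_step: str

    def __post_init__(self) -> None:
        # Reject anything outside the three verdict labels.
        if self.verdict not in ALLOWED_VERDICTS:
            raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}")
        # The format calls for 3-5 evidence bullets.
        if not 3 <= len(self.evidence) <= 5:
            raise ValueError("expected 3-5 evidence bullets")
        # Exactly one concrete next step, which must not be blank.
        if not self.next_step.strip():
            raise ValueError("next_step must not be empty")

    def render(self) -> str:
        """Return the verdict line, the evidence bullets, then the next step."""
        lines = [f"Verdict: {self.verdict}"]
        lines += [f"- {item}" for item in self.evidence]
        lines.append(f"Next step: {self.next_step}")
        return "\n".join(lines)
```

Validating at construction time means a malformed model answer fails loudly before it reaches a dashboard, rather than rendering as a half-empty report.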

How we built it

Google's Agent Development Kit (ADK), part of Google Cloud Agent Builder, for the agent framework. The whole agent fits in six lines: one LlmAgent, one McpToolset, a Gemini model, and a system prompt that defines the audit workflow.
Gemini 2.5 Flash on Vertex AI for reasoning. Fast enough for an interactive audit loop and cheap enough that reviewers can fire as many queries as they want.
Arize Phoenix MCP server for tools. The agent talks to the official @arizeai/phoenix-mcp tool surface, with a stub server in the repo so demos run without a Phoenix tenant. Set PHOENIX_BASE_URL and PHOENIX_API_KEY and the same code targets a real tenant via npx.
Streamlit for the dashboard.
Cloud Run for hosting.
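The stub-versus-real switch described above could look roughly like the sketch below. It is an assumption-laden illustration: the function name `phoenix_mcp_command` and the stub path `stub_server.py` are hypothetical, and the `--baseUrl` / `--apiKey` flags are taken from memory of the @arizeai/phoenix-mcp README and should be verified against the package before use.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional

# Hypothetical command for the in-repo stub server (path is an assumption).
STUB_COMMAND = ["python", "stub_server.py"]


def phoenix_mcp_command(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Pick the MCP server command: real tenant if both variables are set, else stub."""
    env = os.environ if env is None else env
    base_url = env.get("PHOENIX_BASE_URL", "").strip()
    api_key = env.get("PHOENIX_API_KEY", "").strip()

    if base_url and api_key:
        # Assumed CLI flags for the official server; check the package README.
        return [
            "npx", "-y", "@arizeai/phoenix-mcp@latest",
            "--baseUrl", base_url,
            "--apiKey", api_key,
        ]

    # Fall back to the stub so demos run without a Phoenix tenant.
    return list(STUB_COMMAND)
```

The returned list would then be handed to whatever stdio connection parameters the toolset expects, so the agent definition itself stays identical in both modes.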

Challenges we ran into

Arize Phoenix's MCP server is TypeScript and its input schemas aren't surfaced as a single JSON file. I reconstructed the canonical tool surface (list_projects, list_traces, etc.) by reading the npm package README and matching the schema shape so the agent's tool calls drop in against a real tenant without any rewriting.
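One way to keep a stub honest about schema shape is to check every tool call against a hand-written input schema. The sketch below illustrates that idea only: the `project_id` and `limit` fields are hypothetical placeholders, not the real list_traces schema, and `schema_errors` is a name invented for this example.

```python
from __future__ import annotations

from typing import Any

# Hypothetical input schema in JSON Schema style. The field names are
# placeholders for illustration, not the real Phoenix MCP schema.
LIST_TRACES_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "project_id": {"type": "string"},
        "limit": {"type": "integer"},
    },
    "required": ["project_id"],
}

# Mapping from JSON Schema type names to Python types.
_JSON_TYPES: dict[str, Any] = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def schema_errors(args: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems with a tool call's arguments (empty if none)."""
    errors: list[str] = []
    props = schema.get("properties", {})

    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")

    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown argument: {name}")
            continue
        type_name = props[name]["type"]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        is_bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if is_bool_as_number or not isinstance(value, _JSON_TYPES[type_name]):
            errors.append(f"{name}: expected {type_name}")

    return errors
```

Running the same check in the stub's tests and against calls recorded from a real tenant is a cheap way to notice when the two surfaces drift apart.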

Accomplishments that we're proud of

A real end-to-end Vertex AI Gemini call returned a structured verdict: PASS, hallucination score 0.84, sample size 100. Real numbers, not hand-written.
11 passing tests cover the stub server's responses and the agent's wiring.
The stub-vs-real split means reviewers can run the project on their machine in under five minutes without provisioning Phoenix.
Three substantively different MCP integrations now exist as siblings (this one targets Arize Phoenix; companion projects target Dynatrace and RAG drift).

What we learned

MCP is a real abstraction. The same LlmAgent + McpToolset shape that works against a Dynatrace MCP server works unchanged against the Arize Phoenix MCP server. MCP servers for PagerDuty, Datadog, or Sentry should drop in the same way.

What's next for gemini-eval-agent

A scheduled "morning audit" mode that runs every weekday and emails the team a triage of any project with a quality-score drop.
Cross-experiment summarization: "compare last week's A/B winners and explain the pattern."
Plug in additional partner MCPs for richer context (PagerDuty for incident correlation, Slack for stakeholder notifications).
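The "quality-score drop" check behind the planned morning audit could be as simple as comparing mean scores across two windows. This is only a sketch of one possible approach: the function name `flag_quality_drops`, the input shape (project name mapped to a list of scores), and the 0.05 default threshold are all assumptions, and nothing here is implemented in the project yet.

```python
from __future__ import annotations

from statistics import mean


def flag_quality_drops(
    previous: dict[str, list[float]],
    current: dict[str, list[float]],
    min_drop: float = 0.05,  # assumed threshold, tune per project
) -> list[tuple[str, float]]:
    """Return (project, drop) pairs whose mean score fell by at least min_drop."""
    flagged: list[tuple[str, float]] = []
    for project, prev_scores in previous.items():
        cur_scores = current.get(project)
        # Skip projects with no data in either window.
        if not prev_scores or not cur_scores:
            continue
        drop = mean(prev_scores) - mean(cur_scores)
        if drop >= min_drop:
            flagged.append((project, round(drop, 4)))
    # Largest drops first, so the email leads with the worst regressions.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A mean comparison is deliberately crude; with small sample sizes a real audit would likely want a minimum trace count or a significance test before paging anyone.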

Built With

agent-development-kit
arize
arize-phoenix
gemini
gemini-2.5
google-cloud-agent-builder
mcp
phoenix-mcp
python
streamlit
vertex-ai

Updates

Mukunda Katta started this project — May 18, 2026 02:57 AM EDT
