Inspiration
Gemini agents are easy to ship and hard to debug in prod. The model picks tools fine. What goes wrong is the second-order stuff: a retry loop you didn't realize was firing, a tool call that cost $0.42 instead of the expected $0.02, a regression where the agent silently stopped fetching and started hallucinating.
Arize Phoenix is the right place for that telemetry to live. The missing piece was a friction-free way to get Gemini-specific traces (cost, cache tiers, retries, tool-call args) INTO Phoenix without rewriting your agent.
That's what gemini-trace-agent does.
What it does
- Wraps any Gemini call (google-generativeai or Vertex AI) without changing your agent code, via an httpx transport plug-in.
- Records a per-call wire-level trace, USD cost (including 2026 cache-discount pricing), retries, tool-call args, and p50/p95/max latency.
- Exports to Arize Phoenix as OpenInference-spec spans so you can use the existing Phoenix UI / eval / dataset workflows out of the box.
- Also emits JSONL to disk for compliance + offline analysis.
- Streamlit dashboard included for the "I don't have Phoenix running locally yet" case.
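As a rough illustration of the offline-analysis case, the p50/p95/max latency figures could be recomputed from the JSONL file with nothing but the standard library. This is a minimal sketch, not GeminiLens's own code: the `latency_ms` field name and the nearest-rank percentile method are assumptions made for the example.

```python
import io
import json
import math


def percentile(ordered, pct):
    """Nearest-rank percentile of an already-sorted, non-empty list."""
    if not ordered:
        raise ValueError("need at least one latency sample")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(jsonl_stream):
    """Read one JSON record per line and summarize the latencies.

    `latency_ms` is a hypothetical field name; adjust it to whatever
    the audit log actually writes.
    """
    latencies = sorted(
        json.loads(line)["latency_ms"] for line in jsonl_stream if line.strip()
    )
    return {
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }


# In-memory stand-in for a file such as runs/arize-demo.jsonl.
sample_log = io.StringIO(
    "\n".join(
        json.dumps({"latency_ms": ms})
        for ms in [120, 80, 100, 2000, 90, 110, 95, 105, 85, 115]
    )
)
print(latency_summary(sample_log))
```

With only ten samples, the nearest-rank p95 lands on the slowest call, which is why a single runaway request dominates the tail in small runs.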
How we built it
The core is GeminiLens (already on PyPI as geminilens).
For the Arize track of the Google Cloud Rapid Agent Hackathon, I added a clean export adapter:
    import geminilens
    from geminilens.exporters.arize import ArizePhoenixExporter
    import google.generativeai as genai

    # Attach once at startup; later Gemini calls are traced automatically.
    geminilens.attach(
        exporters=[ArizePhoenixExporter(endpoint="http://localhost:6006")],
        audit_path="runs/arize-demo.jsonl",
    )

    # search, fetch_url, and summarize are the agent's own tool functions,
    # defined elsewhere.
    model = genai.GenerativeModel("gemini-2.5-flash")
    result = model.generate_content(
        "Research agent prompt here...",
        tools=[search, fetch_url, summarize],
    )
Every Gemini API call + every tool call lands as an OpenInference span in Phoenix. You get spans, traces, costs, retries, all queryable in the standard Phoenix UI.
What's specifically Arize-flavored
- OpenInference spec compliance. Spans use llm.model, llm.prompts, llm.completions, llm.token_count.* per the spec so Phoenix renders them with full fidelity.
- Eval-dataset ready. The JSONL doubles as an Arize evaluation dataset; same input/output schema.
- Cache-aware token accounting. Gemini's 2026 cache-discount pricing is in the cost calc, so Phoenix dashboards reflect real spend including the cache effect.
- Retry visibility. Every retry is its own span tagged with the trigger reason, so you can chart "retry rate by error class" out of the box.
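To make the cache-aware accounting concrete, here is a minimal sketch of a per-call cost calculation that bills cached and fresh input tokens at different rates. It is an illustration, not the GeminiLens implementation: the `Rates` and `Usage` types are hypothetical, the prices are placeholders rather than Gemini's published rates, and it assumes the reported prompt token count already includes the cached tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per one million tokens. Placeholder values only."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class Usage:
    prompt_tokens: int   # assumed to INCLUDE the cached tokens
    cached_tokens: int   # subset of prompt_tokens served from cache
    output_tokens: int


def call_cost_usd(usage: Usage, rates: Rates) -> float:
    """Cost of one call, splitting input into fresh and cached classes."""
    if usage.cached_tokens > usage.prompt_tokens:
        raise ValueError("cached tokens cannot exceed prompt tokens")
    fresh_tokens = usage.prompt_tokens - usage.cached_tokens
    total = (
        fresh_tokens * rates.input_per_m
        + usage.cached_tokens * rates.cached_input_per_m
        + usage.output_tokens * rates.output_per_m
    )
    return total / 1_000_000


# Hypothetical rates: cached input billed at a quarter of fresh input.
demo_rates = Rates(input_per_m=1.0, cached_input_per_m=0.25, output_per_m=4.0)
demo_usage = Usage(prompt_tokens=1_000_000, cached_tokens=400_000, output_tokens=100_000)
print(round(call_cost_usd(demo_usage, demo_rates), 6))
```

Subtracting the cached tokens before pricing the fresh ones is the step that keeps cached input from being charged twice.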
Challenges we ran into
- Faithful token accounting under cache hits. Cached-input tokens cost less. A naive sum across calls double-counts them. Solved with a per-call token-class breakdown.
- Streaming responses. Gemini streaming chunks needed assembly before turning into one Phoenix span. Added a stream collector.
- Tool-call recursion depth. A Gemini function call can trigger another function call. Phoenix wants parent/child span linkage. Added span context propagation.
Accomplishments
- GeminiLens core on PyPI, MIT licensed
- Arize Phoenix exporter working against a local Phoenix instance
- JSONL audit log compatible with Phoenix eval datasets
- Streamlit fallback dashboard for users without Phoenix
- Cost calc covers Gemini 2.5 Flash, 2.5 Pro, and cache pricing
What we learned
The cheapest 10x improvement in Gemini agent reliability is making every model call and tool call visible in one place. Arize Phoenix is that place. Getting Gemini telemetry into Phoenix should be a one-line import.
What's next
- Vertex AI Gemini support alongside Google AI Studio
- Phoenix eval template for "did the agent pick the right tool?"
- More exporters (OpenTelemetry GenAI semconv, Splunk)
