gemini-trace-agent

Inspiration

Gemini agents are easy to ship and hard to debug in prod. The model picks tools fine. What goes wrong is the second-order stuff: a retry loop you didn't realize was firing, a tool call that cost $0.42 instead of the expected $0.02, a regression where the agent silently stopped fetching and started hallucinating.

Arize Phoenix is the right place for that telemetry to live. The missing piece was a friction-free way to get Gemini-specific traces (cost, cache tiers, retries, tool-call args) INTO Phoenix without rewriting your agent.

That's what gemini-trace-agent does.

What it does

Wraps any Gemini call made through google-generativeai without changing your agent code, via an httpx transport plug-in. Vertex AI support is planned.
Records a wire-level trace for each call: USD cost (including 2026 cache-discount pricing), retries, tool-call args, and latency, summarized as p50/p95/max.
Exports to Arize Phoenix as OpenInference-spec spans so you can use the existing Phoenix UI / eval / dataset workflows out of the box.
Also emits JSONL to disk for compliance + offline analysis.
Streamlit dashboard included for the "I don't have Phoenix running locally yet" case.
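
To make the p50/p95/max latency summary concrete, here is a minimal sketch using the nearest-rank percentile. The function name latency_summary and the sample values are illustrative assumptions, not GeminiLens internals.

```python
def latency_summary(latencies_ms):
    """Return p50, p95 and max for a list of per-call latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)

    def nearest_rank(pct):
        # Nearest-rank percentile: ceil(pct * n / 100), done in integer arithmetic.
        rank = max(1, (pct * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {"p50": nearest_rank(50), "p95": nearest_rank(95), "max": ordered[-1]}


# Illustrative latencies for ten calls; one slow outlier dominates p95 and max.
samples = [120, 95, 310, 101, 2400, 130, 99, 140, 115, 108]
summary = latency_summary(samples)
print(summary)
```

With only a handful of calls, p95 collapses onto the slowest call, which is why max is worth reporting alongside it.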

How we built it

The core is GeminiLens (already on PyPI as geminilens).

For the Arize track of the Google Cloud Rapid Agent Hackathon, I added a clean export adapter:

import geminilens
from geminilens.exporters.arize import ArizePhoenixExporter
import google.generativeai as genai

# Attach once at startup: spans go to Phoenix, and each call is also written to a JSONL audit log.
geminilens.attach(
    exporters=[ArizePhoenixExporter(endpoint="http://localhost:6006")],
    audit_path="runs/arize-demo.jsonl",
)

# search, fetch_url and summarize are your own tool functions, defined elsewhere.
model = genai.GenerativeModel("gemini-2.5-flash")
result = model.generate_content(
    "Research agent prompt here...",
    tools=[search, fetch_url, summarize],
)

Every Gemini API call + every tool call lands as an OpenInference span in Phoenix. You get spans, traces, costs, retries, all queryable in the standard Phoenix UI.
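
For a rough idea of what one of those spans might carry, here is a sketch of a flat attribute dict. The llm.* and openinference.span.kind keys follow the OpenInference semantic conventions as I understand them. The example.* keys and the helper name llm_span_attributes are hypothetical, chosen for illustration, and are not the exporter's actual output.

```python
import json


def llm_span_attributes(model_name, prompt_tokens, completion_tokens, cost_usd, retry_count):
    """Build a flat attribute dict for one LLM call span."""
    return {
        # Keys from the OpenInference semantic conventions.
        "openinference.span.kind": "LLM",
        "llm.model_name": model_name,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
        # Hypothetical custom keys, not part of OpenInference or GeminiLens.
        "example.cost_usd": cost_usd,
        "example.retry_count": retry_count,
    }


attrs = llm_span_attributes("gemini-2.5-flash", 1200, 300, 0.0021, 1)
print(json.dumps(attrs, indent=2))
```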

What's specifically Arize-flavored

OpenInference spec compliance. Spans use llm.model_name, llm.prompts, llm.completions, llm.token_count.* per the spec so Phoenix renders them with full fidelity.
Eval-dataset ready. The JSONL doubles as an Arize evaluation dataset; same input/output schema.
Cache-aware token accounting. Gemini's 2026 cache-discount pricing is in the cost calc, so Phoenix dashboards reflect real spend including the cache effect.
Retry visibility. Every retry is its own span tagged with the trigger reason, so you can chart "retry rate by error class" out of the box.
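
The cache-aware accounting can be sketched as follows. Two assumptions: the reported prompt token count already includes the cached tokens (as in Gemini's usage metadata), and the prices are placeholders I made up for the arithmetic, not real Gemini rates. Pricing and call_cost_usd are illustrative names, not the GeminiLens API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens. The values used below are placeholders."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def call_cost_usd(prompt_tokens, cached_tokens, output_tokens, pricing):
    """Cost of one call, billing cached and uncached input tokens separately.

    Assumes prompt_tokens already includes cached_tokens, so the cached part
    is subtracted before the full input rate is applied.
    """
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    total = (
        uncached * pricing.input_per_m
        + cached_tokens * pricing.cached_input_per_m
        + output_tokens * pricing.output_per_m
    )
    return total / 1_000_000


# Placeholder rates, NOT real Gemini prices.
PLACEHOLDER = Pricing(input_per_m=1.00, cached_input_per_m=0.25, output_per_m=4.00)

cost = call_cost_usd(10_000, 8_000, 500, PLACEHOLDER)

# The naive version bills all prompt tokens at full rate and then adds the
# cached tokens again, so the cached portion is charged twice.
naive = (10_000 * 1.00 + 8_000 * 0.25 + 500 * 4.00) / 1_000_000

print(f"cache-aware: ${cost:.4f}  naive: ${naive:.4f}")
```

Under these placeholder rates the naive figure is more than double the cache-aware one, which is the kind of gap that would skew a spend dashboard.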

Challenges we ran into

Faithful token accounting under cache hits. Cached-input tokens cost less, and a naive sum across calls double-counts them. Solved with a per-call breakdown by token class.
Streaming responses. Gemini streaming chunks had to be assembled before they could become a single Phoenix span. Added a stream collector.
Tool-call recursion depth. A Gemini function call can trigger another function call. Phoenix wants parent/child span linkage. Added span context propagation.

Accomplishments

GeminiLens core on PyPI, MIT licensed
Arize Phoenix exporter working against a local Phoenix instance
JSONL audit log compatible with Phoenix eval datasets
Streamlit fallback dashboard for users without Phoenix
Cost calc covers Gemini 2.5 Flash, 2.5 Pro, and cache pricing

What we learned

The cheapest 10x improvement in Gemini agent reliability is making every model call and tool call visible in one place. Arize Phoenix is that place. Getting Gemini telemetry into Phoenix should be a one-line import.

What's next

Vertex AI Gemini support alongside Google AI Studio
Phoenix eval template for "did the agent pick the right tool?"
More exporters (OpenTelemetry GenAI semconv, Splunk)

Built With

ai-agents
arize-phoenix
gemini
httpx
observability
openinference
opentelemetry
python
streamlit

Updates

Mukunda Katta started this project — May 21, 2026 02:20 PM EDT
