Inspiration

Most AI agents are black boxes. They produce outputs, but you can't see why a decision was made, which reasoning step failed, where confidence collapsed, or what the model was actually doing between prompt and response. For low-stakes applications, that's tolerable. For AI-heavy systems making consequential decisions at scale (fraud detection, medical triage, financial forecasting), it's a critical gap.

MatchMind was built to answer one question: what does a fully observable, self-correcting AI agent actually look like in production? World Cup 2026 is the proving ground: 104 matches, real stakes, verifiable outcomes, a defined timeline. Every prediction is falsifiable. You find out quickly whether the agent was right and, more importantly, why it was wrong. That makes it an ideal environment for demonstrating agent observability as an architectural pattern.

What it does

MatchMind is an agent observability reference implementation disguised as a football prediction system. The prediction layer uses Google ADK 1.3.0 with Gemini 2.0 Flash to analyze upcoming fixtures, team form, head-to-head history, and tournament context, and to produce structured outputs: winner probability, scoreline estimate, a confidence score C ∈ [0, 1], and explicit reasoning factors.

The observability layer is the real product. Every agent run generates a complete trace in Arize Phoenix:

  • Full span tree from API request → agent reasoning → structured output
  • Token-level attribution per reasoning step
  • Confidence drift tracked across the tournament
  • Complete audit trail of every decision
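The structured outputs this section lists (winner probability, scoreline estimate, confidence score, reasoning factors) could be modeled roughly as follows. This is a minimal sketch, not MatchMind's actual schema: the class name, the field names, and the three-way split of the winner probability are all assumptions made for illustration, and it uses a standard-library dataclass rather than any particular validation library.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MatchPrediction:
    """Hypothetical shape of one structured prediction (all names are assumptions)."""

    home_team: str
    away_team: str
    # Assumed three-way outcome probabilities; they must sum to 1.
    p_home_win: float
    p_draw: float
    p_away_win: float
    predicted_score: tuple[int, int]  # (home goals, away goals)
    confidence: float  # C in [0, 1]
    reasoning_factors: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject malformed model output before it reaches downstream consumers.
        probs = (self.p_home_win, self.p_draw, self.p_away_win)
        if any(p < 0.0 or p > 1.0 for p in probs):
            raise ValueError("each outcome probability must lie in [0, 1]")
        if abs(sum(probs) - 1.0) > 1e-6:
            raise ValueError("outcome probabilities must sum to 1")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        if min(self.predicted_score) < 0:
            raise ValueError("goals cannot be negative")
```

Validating at construction time means a malformed prediction fails loudly inside the traced run instead of surfacing later as a confusing downstream error.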

The self-improvement step closes the feedback loop: after each match resolves, the agent compares its prediction against the actual outcome, identifies reasoning gaps using its own trace history, and updates its strategy before the next fixture. Observability isn't just monitoring here; it's the mechanism that enables learning.

This pattern generalizes. Replace "match prediction" with fraud scoring, content moderation, or clinical decision support, and the observability architecture is identical.
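The comparison step of that loop can be sketched as below. This is only an illustration of "compare prediction against outcome and record a lesson"; the writeup does not show how MatchMind actually derives lessons from its Phoenix traces, so the function name, the note fields, the outcome labels, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewNote:
    """Hypothetical record produced by one post-match review."""

    match_id: str
    predicted: str  # assumed labels: "home", "draw", or "away"
    actual: str
    confidence: float
    correct: bool
    lesson: str


def review_prediction(match_id, predicted, confidence, actual, high_confidence=0.7):
    """Compare one prediction with the real result and classify the miss, if any."""
    correct = predicted == actual
    if correct:
        lesson = "hit: keep the factors that drove this call"
    elif confidence >= high_confidence:
        # A confident miss is the most informative case for recalibration.
        lesson = "overconfident miss: re-examine the factors that drove this call"
    else:
        lesson = "low-confidence miss: uncertainty was flagged, check for missing context"
    return ReviewNote(match_id, predicted, actual, confidence, correct, lesson)
```

Separating overconfident misses from low-confidence ones matters because only the former indicate that the stated confidence itself was miscalibrated.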

How we built it

Agent framework: Google ADK 1.3.0 (LlmAgent) with Gemini 2.0 Flash via Google AI Studio. Multi-step orchestration: pre-match analysis → structured prediction → post-match self-review → strategy refinement.

Observability stack: OpenInference dual instrumentation. GoogleADKInstrumentor captures ADK-level spans (agent steps, tool calls, session context); GoogleGenAIInstrumentor captures model-level spans (prompts, completions, token counts). All traces are exported via OTLP to Arize Phoenix Cloud, project matchmind.

Critical implementation detail: instrumentation must be initialized before the ADK session starts. Tracers registered after session init still produce spans, but the spans are orphaned: no parent context, no trace tree. Getting this ordering right is what separates real observability from logging.

Infrastructure: FastAPI on Google Cloud Run (us-central1). Secrets via Secret Manager. CI/CD via Cloud Build trigger on main → Artifact Registry → Cloud Run deploy. Zero manual deployment steps after commit.
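The ordering constraint described above might look like the following. This is a sketch under stated assumptions, not MatchMind's actual startup code: it assumes the arize-phoenix-otel, openinference-instrumentation-google-adk, openinference-instrumentation-google-genai, and google-adk packages are installed, and that the Phoenix endpoint and API key are supplied through the PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY environment variables. The function names, the agent name, and the instruction text are placeholders.

```python
def init_tracing(project_name: str = "matchmind"):
    """Register the tracer provider and both instrumentors. Call this first."""
    # Imports are kept inside the function so the required ordering is explicit.
    from phoenix.otel import register
    from openinference.instrumentation.google_adk import GoogleADKInstrumentor
    from openinference.instrumentation.google_genai import GoogleGenAIInstrumentor

    # Assumption: endpoint and API key come from the environment.
    tracer_provider = register(project_name=project_name)

    # ADK-level spans: agent steps, tool calls, session context.
    GoogleADKInstrumentor().instrument(tracer_provider=tracer_provider)
    # Model-level spans: prompts, completions, token counts.
    GoogleGenAIInstrumentor().instrument(tracer_provider=tracer_provider)
    return tracer_provider


def build_agent():
    """Construct the agent only after tracing is in place."""
    from google.adk.agents import LlmAgent

    return LlmAgent(
        name="matchmind_predictor",  # hypothetical agent name
        model="gemini-2.0-flash",
        instruction="Analyze the fixture and return a structured prediction.",  # placeholder
    )


def main():
    init_tracing()         # 1. instrumentation first
    return build_agent()   # 2. agents, clients, and sessions afterwards
```

Putting both steps behind a single `main()` makes the order hard to get wrong: nothing that creates spans exists until the instrumentors are attached.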

Challenges we ran into

Instrumentation ordering: The OpenInference tracers must register before ADK session initialization, or spans silently lose their parent context. The Phoenix dashboard receives events but no trace structure; everything appears as disconnected root spans. Debugging this required reading both SDK source trees to understand the context propagation model.

Dependency conflict: Google ADK 1.3.0 requires uvicorn>=0.34.0. Our requirements pinned 0.30.6. The conflict passed unnoticed during local development but surfaced as a hard build failure in Cloud Build, so it was only caught at deploy time, under deadline pressure.

IAM misconfiguration: The Cloud Run service account lacked roles/secretmanager.secretAccessor. The container started, attempted to pull the Phoenix API key from Secret Manager, and crashed with a permission-denied error. Fixed with a project-level IAM binding via Cloud Shell: a 30-second fix that took 20 minutes to diagnose.

Gemini client initialization: genai.Client(vertexai=True, project=..., location=...) hardcodes Vertex AI and silently ignores GOOGLE_GENAI_USE_VERTEXAI=false in the environment. The agent was attempting Vertex AI auth with AI Studio credentials and failing at every inference call. Fix: call genai.Client() with no arguments, which reads the environment variables correctly.
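For the Gemini client issue, a startup check along these lines could surface a backend mismatch before the first inference call. This is a hypothetical helper, not part of the google-genai SDK: `resolve_backend` only approximates how the SDK interprets its environment variables (GOOGLE_GENAI_USE_VERTEXAI, GOOGLE_API_KEY or GEMINI_API_KEY, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION), and the simplified rules below are an assumption.

```python
import os


def resolve_backend(env=None) -> str:
    """Report which backend the environment selects, or raise if it is incomplete.

    Hypothetical helper: a simplified approximation of the SDK's own logic.
    """
    env = os.environ if env is None else env
    flag = env.get("GOOGLE_GENAI_USE_VERTEXAI", "false").strip().lower()
    if flag in ("1", "true"):
        missing = [
            key
            for key in ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION")
            if not env.get(key)
        ]
        if missing:
            raise RuntimeError(f"Vertex AI selected but missing: {', '.join(missing)}")
        return "vertex_ai"
    if not (env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")):
        raise RuntimeError("AI Studio selected but no API key is set")
    return "ai_studio"


def make_client():
    """Create the client with no arguments so the environment decides the backend."""
    resolve_backend()  # fail fast with a readable message
    from google import genai  # requires the google-genai package

    return genai.Client()
```

Logging the value returned by `resolve_backend()` at startup would have made the Vertex AI versus AI Studio mismatch visible in the first log line rather than at every inference call.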

Accomplishments that we're proud of

A self-correction loop that is genuinely visible. In Arize Phoenix, you can watch the agent change its reasoning strategy between matches. The trace from match 3 looks measurably different from match 1 because the agent incorporated what it got wrong. That's not a claim; it's a diff you can pull up in the dashboard.

Every layer of the stack is observable. From the API request hitting Cloud Run to the final self-correction span closing in Phoenix, every step is attributed, latency-measured, and auditable. For an agent that is supposed to learn, that visibility is what makes the learning real rather than aspirational.

Three sequential deployment failures (dependency conflict, IAM misconfiguration, container crash), each diagnosed from Cloud Build logs alone and resolved under a live deadline. The service went from broken to healthy in under an hour.

What we learned

Observability must be architectural, not bolted on. Instrumentation that gets added after the agent is working produces incomplete traces: missing spans, broken parent context, gaps in the reasoning tree. It has to be the first thing initialized, before any agent or model client is constructed.

The feedback loop only works if the agent can see its own reasoning. MatchMind's self-improvement isn't prompt re-injection; it's the agent reading its own trace history, identifying where its confidence was miscalibrated, and updating its priors. Arize Phoenix is what makes that possible. Without structured trace data, the agent has no basis for self-correction beyond the outcome alone.

World Cup is a proxy. The architecture is the point. The same observability pattern (dual instrumentation, structured outputs, trace-driven feedback loops) applies to any AI-heavy application where decisions have consequences and explainability matters.

What's next for MatchMind

  • Live match data pipeline: right now MatchMind reasons from static context. Next is a real-time ingestion layer pulling FIFA 2026 official data, team news feeds, and injury reports so predictions are grounded in current facts, not just training knowledge.
  • Confidence calibration scoring across the full tournament: tracking not just win/loss accuracy but how well the agent's stated confidence scores correlate with actual outcomes over 104 matches, surfaced as a Brier score dashboard in Arize Phoenix.
  • Expanding the observability pattern beyond World Cup: the architecture generalizes directly to fraud detection, content moderation, and clinical decision support. MatchMind becomes a reference implementation and open-source template for production-grade observable agents.
  • Public leaderboard comparing MatchMind predictions against human pundits, betting markets, and other AI models, making the agent's improvement curve visible in real time.
  • Multi-agent architecture: separate specialist agents for different match contexts (knockout pressure, underdog dynamics, weather/venue factors) coordinated by an orchestrator, with full cross-agent tracing in Phoenix.
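For the calibration item above, the Brier score is the mean squared difference between stated confidence and what actually happened: BS = (1/N) Σ (c_i − o_i)², where c_i is the confidence assigned to the predicted outcome of match i and o_i is 1 if that outcome occurred, else 0. Lower is better, and always answering 0.5 scores 0.25. A minimal sketch, with a hypothetical function name and binary outcome encoding assumed:

```python
def brier_score(confidences, outcomes):
    """Mean squared error between stated confidence and binary outcome.

    confidences: floats in [0, 1], one per resolved match.
    outcomes:    1 if the predicted outcome occurred, else 0.
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one resolved match is required")
    squared_errors = [(c - o) ** 2 for c, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)
```

Because the score penalizes confident misses quadratically, it rewards an agent that lowers its confidence when its reasoning is weak, which plain win/loss accuracy cannot capture.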

Built With

  • 2.0
  • adk
  • ai
  • arize
  • artifact
  • build
  • cloud
  • docker
  • fastapi
  • flash
  • gemini
  • google
  • manager
  • openinference
  • opentelemetry
  • phoenix
  • pydantic
  • python
  • registry
  • run
  • secret
  • studio
  • uvicorn