Agent Kintsugi

AgentKintsugi Landing Page
Arize Phoenix MCP integration description
AgentKintsugi Features
Configure the test run for the targeted Google ADK agent
AgentKintsugi test run history
Test run iteration 1 produces 0% success rate for the targeted AI agents
Application of optimised prompt patch produces 100% success rate for agent run
Diagnosis suggests the optimised prompt patch for the agent based on the trace data fetched through Phoenix MCP
The evolution graph shows agent execution evolution across multiple iterations
Multi-dimensional comparison amongst different agent test run iterations

Inspiration

The name comes from kintsugi — the Japanese art of repairing broken pottery with gold, turning flaws into features. I was inspired by a recurring frustration in agentic AI development: agents fail silently in production, and there's no systematic way to know why or how to fix them. You have to manually sift through traces, guess at prompt changes, and re-run tests by hand. I wanted to close that loop automatically.

What it does

Agent Kintsugi is a meta-agent that autonomously improves other AI agents. You point it at any agent built with the Google Agent Development Kit (ADK) and write your own test scenarios describing what your agent should do and how success is measured. Once your scenarios are defined, Agent Kintsugi runs a self-healing loop:

Test — runs your agent against all of your scenarios and scores each one with a hybrid programmatic + Gemini LLM-as-Judge evaluator
Diagnose — a Gemini-powered Debugger Agent queries Arize Phoenix for execution traces, identifies failure patterns (wrong tool calls, tool ordering errors, step-limit hits, latency spikes), and produces a structured diagnosis
Patch — applies targeted prompt optimizations in-memory (never touching your source files) and re-tests only the failing scenarios
Repeat — iterates until all your scenarios pass or the iteration budget is exhausted, then exports a git-ready .patch file
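
The four steps above can be sketched as a small control loop. This is a minimal illustration, not the project's actual orchestrator: `self_healing_loop`, `LoopResult`, `run_scenario`, and `diagnose_and_patch` are hypothetical names, and the two callables stand in for the real evaluator and Debugger Agent.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopResult:
    prompt: str                      # final (possibly patched) prompt
    iterations: int                  # number of test passes actually run
    passed: dict[str, bool] = field(default_factory=dict)


def self_healing_loop(
    prompt: str,
    scenarios: list[str],
    run_scenario: Callable[[str, str], bool],           # hypothetical: (prompt, scenario) -> passed?
    diagnose_and_patch: Callable[[str, list[str]], str],  # hypothetical: (prompt, failing) -> new prompt
    max_iterations: int = 3,
) -> LoopResult:
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")

    passed = {name: False for name in scenarios}
    for iteration in range(1, max_iterations + 1):
        # Test: the first pass runs everything; later passes only re-run failures.
        for name in scenarios:
            if not passed[name]:
                passed[name] = run_scenario(prompt, name)

        failing = [name for name in scenarios if not passed[name]]
        # Stop when everything passes or the iteration budget is spent,
        # so an untested patch is never returned.
        if not failing or iteration == max_iterations:
            break

        # Diagnose + Patch: produce a new prompt in memory only.
        prompt = diagnose_and_patch(prompt, failing)

    return LoopResult(prompt=prompt, iterations=iteration, passed=passed)
```

With a stub evaluator that only passes the "hard" scenario once the prompt mentions the lookup tool, the loop patches the prompt once and finishes on the second iteration.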

The result is a full evolution timeline — per-scenario before/after comparisons, multi-dimensional radar charts, prompt version history in Phoenix, and a downloadable PDF report.

How I built it

Backend: Python + FastAPI, structured around an async run_forge_cycle orchestrator that manages the test → diagnose → optimize → retest loop
Agent framework: Google ADK (google-adk) powers both the target agents under test and the internal Debugger Agent that autonomously queries Phoenix
LLM: Gemini 2.5 Pro for the LLM-as-Judge evaluator and Gemini 2.5 Flash for the Debugger Agent's diagnosis and scenario generation
Observability: Arize Phoenix with OpenInference instrumentation captures every tool call, span, and latency for the target agent; the Debugger Agent then queries these traces via the Phoenix MCP server
In-memory patching: The ADKConnector dynamically imports any ADK agent module and applies prompt patches without modifying source files, exporting a final difflib diff
User-defined scenarios: Users author their own test scenarios — specifying user input, expected tools, success criteria, and failure indicators — through the Configure page before starting a run
Frontend: React + TypeScript + Vite, with Recharts for the evolution timeline and radar charts
Auth & persistence: JWT authentication + Google Cloud Firestore for run history
Deployment: Containerized with Docker, deployed to Google Cloud Run via Cloud Build
CLI: Run the whole test cycle from the terminal.
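
To illustrate the in-memory patching idea, here is a minimal sketch using only the standard library. It assumes the agent object exposes its prompt as an `instruction` attribute; `apply_prompt_patch` and the `instruction.txt` file labels are hypothetical and are not taken from the ADKConnector itself.

```python
import difflib
from types import SimpleNamespace


def apply_prompt_patch(agent, new_instruction: str) -> str:
    """Swap the prompt on the live object and return a unified diff.

    Nothing is written to disk: only the in-memory attribute changes.
    """
    original = agent.instruction
    agent.instruction = new_instruction

    diff_lines = list(
        difflib.unified_diff(
            original.splitlines(),
            new_instruction.splitlines(),
            fromfile="a/instruction.txt",  # hypothetical label for the diff header
            tofile="b/instruction.txt",
            lineterm="",
        )
    )
    return "\n".join(diff_lines) + "\n" if diff_lines else ""


# Stand-in for an imported agent (assumption: the real object has `instruction`).
agent = SimpleNamespace(instruction="You are a travel agent.\nAnswer briefly.")
patch_text = apply_prompt_patch(
    agent,
    "You are a travel agent.\nCall search_flights before answering.\nAnswer briefly.",
)
print(patch_text)
```

The returned text is a standard unified diff, so it can be saved to a file for review once the run has finished.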

Architecture Diagram

Challenges I ran into

MCP server stability: The Phoenix MCP subprocess would accumulate corrupted stdio buffers across multiple LLM calls in the same process. I solved this by restarting the MCP client between diagnosis phases.
Consistent evaluation: Getting the LLM evaluator to score consistently across runs required anchoring it with deterministic programmatic pre-checks (tool presence, call ordering, step counts) that constrain what the LLM can say.
Iteration fairness: Naively re-running all scenarios each iteration would mix new passes with old failures, making improvement metrics misleading. I implemented a carry-forward system that only re-runs failing scenarios and merges previously-passed results.
Scenario flexibility: Because users bring their own test scenarios with no enforced schema beyond a few required fields, the evaluator had to handle wildly different success criteria — from strict tool-call checklists to open-ended output quality checks.
Rate limits during the optimization loop: Back-to-back Gemini calls for evaluation → diagnosis → scenario generation would hit rate limits. I added a cooldown phase between iterations.
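
The carry-forward idea from the iteration fairness item can be sketched as a merge of two result maps. This is an assumption-laden illustration: the result shape (`passed`, `score`), the `carried_forward` flag, and the function names are hypothetical.

```python
def merge_carry_forward(
    previous: dict[str, dict],
    rerun: dict[str, dict],
) -> dict[str, dict]:
    """Keep earlier passes as-is and take fresh results for re-run scenarios."""
    merged: dict[str, dict] = {}
    for name, old in previous.items():
        if old["passed"]:
            # Passed in an earlier iteration: not re-run, result carried forward.
            merged[name] = {**old, "carried_forward": True}
        else:
            # Failed earlier: use the new result if the scenario was re-run.
            merged[name] = {**rerun.get(name, old), "carried_forward": False}
    return merged


def success_rate(results: dict[str, dict]) -> float:
    """Fraction of scenarios that pass, computed over the full scenario set."""
    return sum(1 for r in results.values() if r["passed"]) / len(results)


iteration_1 = {
    "book_flight": {"passed": True, "score": 0.9},
    "cancel_booking": {"passed": False, "score": 0.2},
    "refund": {"passed": False, "score": 0.1},
}
iteration_2_rerun = {
    "cancel_booking": {"passed": True, "score": 0.8},
    "refund": {"passed": False, "score": 0.3},
}
iteration_2 = merge_carry_forward(iteration_1, iteration_2_rerun)
```

Because the success rate is always computed over the full scenario set, the per-iteration numbers stay comparable even though only the failing scenarios were executed again.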

Accomplishments that I'm proud of

A fully autonomous test → diagnose → patch → retest loop that requires zero human intervention between iterations — once the user has defined their scenarios
The hybrid evaluator combining programmatic ground truth with LLM-as-Judge — getting reliable, consistent scores across arbitrarily user-defined success criteria without expensive human labeling
The in-memory patching system: Agent Kintsugi never writes to your source files, so a run cannot alter the code of the agent under test
End-to-end Phoenix integration: traces, prompt version history, and scenario datasets all flow into Phoenix automatically
A polished real-time UI with live streaming updates, per-scenario before/after comparisons, and a one-click PDF report

What I learned

Observability is the missing piece of the agentic AI development loop — without traces, diagnosis is guesswork
Small, targeted prompt patches consistently outperform large rewrites; the key insight is identifying which failure pattern to address, not writing a better prompt from scratch
LLM-as-Judge evaluation needs a deterministic anchor layer or scores drift across runs — especially important when users define their own arbitrary success criteria
Building a meta-agent (an agent that improves agents) throws all the reliability problems of agentic systems into sharp relief: flaky tool calls, context window management, and error propagation all matter much more when the agent is in a control loop
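
One way to picture the deterministic anchor layer is as a set of programmatic pre-checks that cap whatever score the LLM judge returns. The sketch below is hypothetical: the names, the three checks, and the cap values (0.2 and 0.5) are illustrative assumptions, not the project's real thresholds.

```python
from dataclasses import dataclass


@dataclass
class PreCheck:
    tools_present: bool      # every expected tool was called at least once
    order_ok: bool           # expected tools appear in the expected order
    within_step_limit: bool  # the agent did not exceed the step budget


def run_prechecks(called_tools: list[str], expected_tools: list[str], max_steps: int) -> PreCheck:
    tools_present = all(tool in called_tools for tool in expected_tools)
    # Subsequence test: each `in` consumes the iterator up to the match,
    # so later expected tools must appear after earlier ones.
    remaining = iter(called_tools)
    order_ok = all(tool in remaining for tool in expected_tools)
    return PreCheck(tools_present, order_ok, len(called_tools) <= max_steps)


def anchored_score(pre: PreCheck, llm_score: float) -> float:
    """Clamp the judge's score to [0, 1], then cap it by the deterministic checks."""
    score = min(max(llm_score, 0.0), 1.0)
    if not pre.tools_present:
        return min(score, 0.2)  # illustrative cap: a required tool was never called
    if not (pre.order_ok and pre.within_step_limit):
        return min(score, 0.5)  # illustrative cap: right tools, wrong order or too many steps
    return score
```

Under this scheme the judge can only lower a score within the band the pre-checks allow, which limits how far scores can drift between runs.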

What's next for Agent Kintsugi

Support for agents built on other frameworks (LangGraph, CrewAI, custom) via a plugin connector interface
A scenario builder UI that helps users write well-structured test scenarios with suggested success criteria and tool checklists, lowering the barrier to getting started
A regression test suite that automatically runs user-defined scenarios on every new commit (CI/CD integration)
Fine-tuning export: instead of prompt patches, output a dataset of (input, ideal_output) pairs — derived from the user's own scenarios — for fine-tuning the target model
Multi-agent system support: diagnosing failures that span agent handoffs, not just single-agent runs

Built With

arize
bcrypt
css
docker
fastapi
firestore
fpdf2
gemini-2.5-flash
google-cloud
googleadk
hatchling
jwt
openinference
opentelemetry
phoenix
pydantic
python
react
recharts
tailwind
typescript
uv
uvicorn
vite
websockets

Updates

Vani Chitkara started this project — Jun 10, 2026 03:57 PM EDT
