Yuhina — MCP self-improvement loop for agents. Trace mistakes, build regression contracts, evaluate, update policy with human review.
Technical stack: Cloud Run + ADK LlmAgent + Gemini 3 Flash + Phoenix Cloud observability + dual MCP (Phoenix traces + Repair contracts).
6-step cooperative loop: Agent Run → Trace → Detect Miss → Repair MCP → Evaluate → Policy Update. Every change requires human approval.
Inspiration
Yuhina is named after the Taiwan Yuhina (Yuhina brunneiceps), a mountain songbird endemic to Taiwan. It's one of the rare species that practices cooperative breeding — flockmates jointly build nests, incubate eggs, raise chicks, and defend territory. They stay in touch through contact calls while foraging and switch to alarm calls when threats appear.
That's the shape we wanted for a security triage tool: a lightweight system where the agent and the maintainer cooperate to keep the security policy sharp. The agent handles the repetitive triage; the maintainer corrects mistakes; and every correction makes the system stronger — like a flock that learns together.
The problem is real. A typical open-source maintainer gets dozens of vulnerability alerts per week, each with a CVSS score that says nothing about whether their specific project is actually exposed. Enterprise teams have dedicated tooling, but small teams don't — they either triage manually or ignore the noise. We wanted to build something that does project-specific triage and gets better at it over time without requiring the maintainer to hand-tune prompts or write rules.
What it does
Yuhina is a working reference of an MCP self-improvement loop. The security triage demo is one concrete adapter; the loop itself is reusable.
The backbone:
- A CVE lands with a CVSS score.
- The agent consults the project's module graph to decide if the vulnerable package is actually reachable from a public path.
- Round 1 (CVSS-only policy) makes the wrong call on exposure-sensitive cases.
- The Phoenix trace shows the agent's reasoning path, span by span.
- The miss becomes a regression contract + evaluator through the repair MCP.
- The prompt is patched to include the exposure-checking rule the agent skipped.
- Round 2 (the agent-proposed v3 prompt) handles the same alerts correctly.
Result on the real-CVE dataset (lodash CVE-2026-2950 + i18next CVE-2026-41690): 0% → 100% accuracy. Phoenix MCP drives the analysis loop end-to-end.
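The gap between the Round 1 and Round 2 policies in the backbone above can be illustrated with a small reachability check. This is only a sketch under assumptions: the dict-of-lists graph format, the `urgent`/`defer` labels, the 7.0 threshold, and all function names are hypothetical, and in Yuhina the rule lives in the agent's prompt rather than in hand-written code like this.

```python
from collections import deque


def is_reachable(graph, entry_points, target):
    """Breadth-first search over a module graph (hypothetical dict-of-lists format)."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False


def triage_cvss_only(cvss, threshold=7.0):
    """Round 1 style: the score alone decides (threshold is illustrative)."""
    return "urgent" if cvss >= threshold else "defer"


def triage_with_exposure(cvss, graph, entry_points, package, threshold=7.0):
    """Round 2 style: a package unreachable from any public entry point is deferred."""
    if not is_reachable(graph, entry_points, package):
        return "defer"
    return triage_cvss_only(cvss, threshold)


# Illustrative module graph: only pkg-a sits behind the public entry point.
graph = {
    "api/server": ["utils/format"],
    "utils/format": ["pkg-a"],
    "scripts/build": ["pkg-b"],
}
public = ["api/server"]
print(triage_cvss_only(9.1))                               # score-only call for pkg-b
print(triage_with_exposure(9.1, graph, public, "pkg-b"))   # exposure-aware call for pkg-b
```

The two calls disagree for `pkg-b` because it is only imported by a build script, which is the kind of exposure-sensitive case a score-only policy cannot see.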
How we built it
- Agent framework: Google ADK `LlmAgent` on Vertex AI Gemini (`gemini-3-flash-preview`). Also Agent Engine-deployable via `scripts/deploy_agent_engine.py`.
- Runtime: Cloud Run hosts a FastAPI wrapper with a demo UI at the root path, a `/invoke` API endpoint (Bearer auth + per-IP rate limit), and a `/health` liveness probe.
- Observability: OpenInference instrumentation streams every run to Phoenix Cloud with a stable server-generated `run_id`.
- Self-improve: `agent.improve_via_phoenix` spins up an ADK Gemini agent that drives two MCP servers: the official `@arizeai/phoenix-mcp` stdio sidecar, and an in-tree repair MCP exposing `trace_failure_to_contract`, `create_eval`, `propose_patch`. The Gemini agent calls Phoenix MCP's `add-dataset-examples` to write misses into the `yuhina-regressions` dataset and `upsert-prompt` to create a new prompt version with regression feedback (verified end-to-end).
- Two-MCP architecture: Repair MCP generates regression contracts and patch drafts locally. Phoenix MCP handles persistence (dataset writes, prompt versioning). The Gemini improve agent orchestrates both.
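To make the repair side of the two-MCP split more concrete, here is a minimal in-process sketch of what the three repair tools might do. Only the tool names come from the list above; the signatures, the `RegressionContract` fields, the shape of a miss record, and the return values are all assumptions for illustration. The real tools are served over MCP and may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionContract:
    """Hypothetical contract: what the agent should have decided, and why."""
    alert_id: str
    expected_decision: str
    rule: str


def trace_failure_to_contract(miss: dict) -> RegressionContract:
    """Turn a reviewer-marked miss (assumed dict shape) into a contract."""
    return RegressionContract(
        alert_id=miss["alert_id"],
        expected_decision=miss["reviewer_decision"],
        rule=miss["reviewer_note"],
    )


def create_eval(contracts):
    """Build an evaluator that scores a decision function against the contracts."""
    def evaluate(decide):
        if not contracts:
            return 1.0
        passed = sum(1 for c in contracts if decide(c.alert_id) == c.expected_decision)
        return passed / len(contracts)
    return evaluate


def propose_patch(prompt: str, contracts) -> dict:
    """Draft a prompt patch; it stays a draft until a human reviews it."""
    # dict.fromkeys keeps first-seen order while dropping duplicate rules.
    rules = "\n".join(f"- {rule}" for rule in dict.fromkeys(c.rule for c in contracts))
    return {"status": "draft", "prompt": f"{prompt}\n\nRegression rules:\n{rules}"}
```

In this sketch nothing is persisted. Under the split described above, writing the miss to a dataset and versioning the patched prompt would be left to Phoenix MCP.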
Challenges we ran into
- Cloud Run frontend intercepts `/healthz`. Google's edge served its own 404 before requests reached the container. Fix: use ADK's built-in `/health` instead.
- Gemini 3 preview only on `global` endpoint. Agent Engine requires a regional endpoint, but Gemini 3 preview models aren't available regionally yet. Solution: Cloud Run as primary (hits `global`), Agent Engine as verified backup (on `gemini-2.5-flash`).
- Two-layer MCP coordination. Getting repair MCP (contract logic) and Phoenix MCP (persistence) to coordinate cleanly without duplicating work took iteration. Final architecture: each server owns one concern, the agent orchestrates.
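The `/healthz` surprise in the first challenge is the kind of thing a small post-deploy check can catch. The following is a sketch, not part of Yuhina: `http_status`, `verify_endpoints`, and the example base URL are hypothetical names, and it only covers paths that answer a plain unauthenticated GET (so `/` and `/health`, not `/invoke`).

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, or 0 if the request never got a response."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the code is still informative.
        return err.code
    except OSError:
        # DNS failures, refused connections, and timeouts end up here.
        return 0


def verify_endpoints(base_url, paths, fetch=http_status):
    """Return {path: status} for every path that did not answer 200."""
    failures = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        if status != 200:
            failures[path] = status
    return failures


if __name__ == "__main__":
    # Placeholder URL; substitute the deployed service's address.
    print(verify_endpoints("https://example.invalid", ["/", "/health"]))
```

Passing `fetch` as a parameter keeps the check testable offline with a stub.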
Accomplishments that we're proud of
- The loop actually closes through MCP: the Gemini agent calls `phoenix-mcp` and `repair-mcp` tools by name, no shortcuts.
- 0% → 100% lift on real CVEs is stable across runs.
- The demo UI lets anyone try the agent in a browser — no curl required.
What we learned
- The hardest part of an MCP self-improvement loop isn't the MCP — it's drawing the line between "the LLM proposed the regression contract" and "the LLM merged the prompt patch". Yuhina keeps the merge on the human side; every generated artifact is a draft.
- Google Cloud's frontend has opinions about path names. Pre-submission verification has to hit every URL the README claims.
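The draft-versus-merge line from the first lesson can be expressed as a tiny state machine. This is a sketch under assumptions: `PromptPatch`, `Status`, `review`, and `merge` are hypothetical names, not Yuhina's actual types. The point is only that the model can create a draft, while nothing reaches the served prompt without a named human reviewer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class PromptPatch:
    """A proposed prompt version; every generated artifact starts as a draft."""
    prompt_text: str
    proposed_by: str
    status: Status = Status.DRAFT
    reviewer: Optional[str] = None


def review(patch: PromptPatch, reviewer: str, approve: bool) -> PromptPatch:
    """Record a human decision; a patch can only be reviewed once."""
    if patch.status is not Status.DRAFT:
        raise ValueError("patch has already been reviewed")
    patch.status = Status.APPROVED if approve else Status.REJECTED
    patch.reviewer = reviewer
    return patch


def merge(active_prompt: str, patch: PromptPatch) -> str:
    """Return the prompt to serve: the patch only if a human approved it."""
    if patch.status is Status.APPROVED and patch.reviewer:
        return patch.prompt_text
    return active_prompt
```

Keeping `merge` as the only path to the served prompt means an unreviewed or rejected draft leaves the active prompt unchanged.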
What's next for Yuhina
- Generalize the adapter: the triage demo is one shape; the same loop should drop into any agent project that has runs, decisions, and reviewer-marked misses.
- Push the integration prompt (already in the README) into a one-liner CLI so adopting the loop is `yuhina init`-shaped.
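One possible shape for the generalized adapter mentioned in the first item is an interface that exposes only runs, decisions, and reviewer-marked misses. This is speculative: `LoopAdapter`, `Run`, `Miss`, and `accuracy` are hypothetical names sketched from that one sentence, not a planned API.

```python
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Run:
    """One agent run and the decision it produced."""
    run_id: str
    input: str
    decision: str


@dataclass(frozen=True)
class Miss:
    """A run the reviewer marked as wrong, with the decision they expected."""
    run_id: str
    expected_decision: str
    note: str


class LoopAdapter(Protocol):
    """What a host project would need to expose for the loop to plug in."""

    def runs(self) -> Iterable[Run]: ...

    def misses(self) -> Iterable[Miss]: ...


def accuracy(adapter: LoopAdapter) -> float:
    """Share of runs that no reviewer marked as a miss."""
    runs = list(adapter.runs())
    if not runs:
        return 1.0
    missed = {m.run_id for m in adapter.misses()}
    return sum(1 for r in runs if r.run_id not in missed) / len(runs)
```

Because `LoopAdapter` is a `Protocol`, any class with matching `runs` and `misses` methods would satisfy it without inheriting from anything.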