Inspiration

Yuhina is named after the Taiwan Yuhina (Yuhina brunneiceps), a mountain songbird endemic to Taiwan. It is one of the rare species that practice cooperative breeding: flockmates jointly build nests, incubate eggs, raise chicks, and defend territory. They stay in touch through contact calls while foraging and switch to alarm calls when threats appear.

That's the shape we wanted for a security triage tool: a lightweight system where the agent and the maintainer cooperate to keep the security policy sharp. The agent handles the repetitive triage; the maintainer corrects mistakes; and every correction makes the system stronger — like a flock that learns together.

The problem is real. A typical open-source maintainer gets dozens of vulnerability alerts per week, each with a CVSS score that says nothing about whether their specific project is actually exposed. Enterprise teams have dedicated tooling, but small teams don't — they either triage manually or ignore the noise. We wanted to build something that does project-specific triage and gets better at it over time without requiring the maintainer to hand-tune prompts or write rules.

What it does

Yuhina is a working reference implementation of an MCP self-improvement loop. The security triage demo is one concrete adapter; the loop itself is reusable.

The backbone:

  1. A CVE lands with a CVSS score.
  2. The agent consults the project's module graph to decide if the vulnerable package is actually reachable from a public path.
  3. Round 1 (CVSS-only policy) makes the wrong call on exposure-sensitive cases.
  4. The Phoenix trace shows the agent's reasoning path, span by span.
  5. The miss becomes a regression contract + evaluator through the repair MCP.
  6. The prompt is patched to include the exposure-checking rule the agent skipped.
  7. Round 2 (the agent-proposed v3 prompt) handles the same alerts correctly.
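The exposure check in step 2, and the gap between the two rounds, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Yuhina's actual code: the graph shape (module name mapped to the modules it imports), the function names, the decision labels, the 7.0 threshold, and the module and package names are all hypothetical.

```python
from collections import deque


def is_reachable(graph, entry_points, target):
    """Breadth-first search: can any public entry point reach `target`?"""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        module = queue.popleft()
        if module == target:
            return True
        for dep in graph.get(module, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False


def triage_cvss_only(cvss, threshold=7.0):
    """Round 1 style policy (assumed): severity alone decides."""
    return "escalate" if cvss >= threshold else "defer"


def triage_exposure_aware(cvss, graph, entry_points, package, threshold=7.0):
    """Round 2 style policy (assumed): check exposure before severity."""
    if not is_reachable(graph, entry_points, package):
        return "defer"  # vulnerable package is not on any public path
    return triage_cvss_only(cvss, threshold)


# Hypothetical module graph: module -> modules or packages it imports.
graph = {
    "public/api": ["core/render"],
    "core/render": ["pkg-a"],
    "tools/devscript": ["pkg-b"],
}
public_entry_points = ["public/api"]
```

With this graph, a high-severity alert on pkg-b is escalated by the CVSS-only policy but deferred by the exposure-aware one, because pkg-b is only imported by a module that no public entry point reaches. That is the kind of exposure-sensitive case the first round gets wrong.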

Result on the real-CVE dataset (lodash CVE-2026-2950 + i18next CVE-2026-41690): triage accuracy goes from 0% in Round 1 to 100% in Round 2. Phoenix MCP drives the analysis loop end-to-end.

How we built it

  • Agent framework — Google ADK LlmAgent on Vertex AI Gemini (gemini-3-flash-preview). Also Agent Engine-deployable via scripts/deploy_agent_engine.py.
  • Runtime — Cloud Run hosts a FastAPI wrapper with a demo UI at the root path, a /invoke API endpoint (Bearer auth + per-IP rate limit), and a /health liveness probe.
  • Observability — OpenInference instrumentation streams every run to Phoenix Cloud with a stable server-generated run_id.
  • Self-improve — agent.improve_via_phoenix spins up an ADK Gemini agent that drives two MCP servers: the official @arizeai/phoenix-mcp stdio sidecar, and an in-tree repair MCP exposing trace_failure_to_contract, create_eval, and propose_patch. The Gemini agent calls Phoenix MCP's add-dataset-examples to write misses into the yuhina-regressions dataset, and upsert-prompt to create a new prompt version with regression feedback. Both calls are verified end-to-end.
  • Two-MCP architecture — Repair MCP generates regression contracts and patch drafts locally. Phoenix MCP handles persistence (dataset writes, prompt versioning). The Gemini improve agent orchestrates both.
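The split of concerns in the last two bullets can be illustrated with a small sketch. It uses no MCP client at all: the repair side is modelled as plain functions and the persistence side as an in-memory stand-in, so every class, function, and field name here is an assumption made for illustration, as is the prompt name "triage-policy". Only the dataset name yuhina-regressions comes from the text above, and the real tools (trace_failure_to_contract, propose_patch, add-dataset-examples, upsert-prompt) may take different arguments.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Miss:
    """A reviewer-marked wrong decision from one traced run."""
    run_id: str
    alert_id: str
    expected: str
    actual: str


@dataclass(frozen=True)
class Draft:
    """A generated artifact. It stays a draft; merging is left to a human."""
    kind: str
    body: str
    status: str = "draft"


class Persistence(Protocol):
    """The persistence concern (the role Phoenix MCP plays in the text)."""
    def add_dataset_example(self, dataset: str, example: dict) -> None: ...
    def save_prompt_version(self, name: str, text: str) -> int: ...


class InMemoryPersistence:
    """Stand-in backend so the sketch runs without any external service."""
    def __init__(self):
        self.datasets = {}
        self.prompts = {}

    def add_dataset_example(self, dataset, example):
        self.datasets.setdefault(dataset, []).append(example)

    def save_prompt_version(self, name, text):
        versions = self.prompts.setdefault(name, [])
        versions.append(text)
        return len(versions)  # 1-based version number


def build_contract(miss):
    """Repair concern: turn a miss into a regression contract draft."""
    body = (f"Given alert {miss.alert_id}, the decision must be "
            f"{miss.expected!r} (run {miss.run_id} returned {miss.actual!r}).")
    return Draft(kind="contract", body=body)


def build_patch(base_prompt, contract):
    """Repair concern: draft a prompt that carries the regression rule."""
    return Draft(kind="prompt_patch",
                 body=base_prompt + "\n\nRegression rule: " + contract.body)


def improve(miss, base_prompt, store,
            dataset="yuhina-regressions", prompt_name="triage-policy"):
    """Orchestrator: generate locally, then hand results to persistence."""
    contract = build_contract(miss)
    patch = build_patch(base_prompt, contract)
    store.add_dataset_example(dataset, {
        "run_id": miss.run_id, "alert_id": miss.alert_id,
        "expected": miss.expected, "actual": miss.actual,
    })
    version = store.save_prompt_version(prompt_name, patch.body)
    return contract, patch, version
```

The point of the shape is that neither side knows about the other: the repair functions never write anything, the store never generates anything, and only the orchestrator touches both.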

Challenges we ran into

  • Cloud Run frontend intercepts /healthz. Google's edge served its own 404 before requests reached the container. Fix: use ADK's built-in /health instead.
  • Gemini 3 preview only on global endpoint. Agent Engine requires a regional endpoint, but Gemini 3 preview models aren't available regionally yet. Solution: Cloud Run as primary (hits global), Agent Engine as verified backup (on gemini-2.5-flash).
  • Two-layer MCP coordination. Getting repair MCP (contract logic) and Phoenix MCP (persistence) to coordinate cleanly without duplicating work took iteration. Final architecture: each server owns one concern, the agent orchestrates.

Accomplishments that we're proud of

  • The loop actually closes through MCP — the Gemini agent calls phoenix-mcp and repair-mcp tools by name, no shortcuts.
  • 0% → 100% lift on real CVEs is stable across runs.
  • The demo UI lets anyone try the agent in a browser — no curl required.

What we learned

  • The hardest part of an MCP self-improvement loop isn't the MCP — it's drawing the line between "the LLM proposed the regression contract" and "the LLM merged the prompt patch". Yuhina keeps the merge on the human side; every generated artifact is a draft.
  • Google Cloud's frontend has opinions about path names. Pre-submission verification has to hit every URL the README claims.
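The pre-submission check described in the last bullet can be approximated with a short standard-library script. This is a sketch under assumptions rather than the project's actual tooling: the function names, the URL pattern, and the rule that anything other than HTTP 200 counts as a failure are all choices made for the example.

```python
import re
import urllib.error
import urllib.request

# Rough URL matcher; stops at whitespace, brackets, and quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]\"']+")


def extract_urls(text):
    """Return the distinct URLs in `text`, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls


def fetch_status(url, timeout=10.0):
    """Return the HTTP status for `url`, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses arrive as exceptions
    except OSError:
        return None  # DNS failure, refused connection, timeout


def verify(readme_text, fetch=fetch_status):
    """Map each URL that did not answer 200 to the status it returned."""
    failures = {}
    for url in extract_urls(readme_text):
        status = fetch(url)
        if status != 200:
            failures[url] = status
    return failures
```

Calling verify on the README text returns an empty dict when every claimed URL answers 200. The fetch parameter is injectable so the check can be exercised offline with canned statuses.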

What's next for Yuhina

  • Generalize the adapter: the triage demo is one shape; the same loop should drop into any agent project that has runs, decisions, and reviewer-marked misses.
  • Push the integration prompt (already in the README) into a one-liner CLI, so that adopting the loop is as simple as running a yuhina init style command.
