Inspiration

As AI agents become more common in customer support, education, and enterprise workflows, the need for systematic red-teaming has grown urgent. Manual testing is slow and inconsistent, so we wanted a tool that could automatically craft adversarial attacks, run them against agents, and score the responses. KAIRO was born from the vision of making agent safety testing accessible, repeatable, and transparent.

What it does

KAIRO is a test harness for red-teaming AI agents. It:

  • Uses AgentDojo to craft adversarial blueprints and locally realizes them into attack prompts.
  • Runs attacks such as prompt injection and system prompt exfiltration, and checks both that harmful requests are refused and that benign requests are not over-refused.
  • Connects to target agents through adapters (Webhook, Gemini, OpenAI, Anthropic, etc.).
  • Judges results with a Gemini LLM judge or local checkers, producing pass/fail JSON verdicts.
  • Generates detailed reports, logs, and HTML dashboards for analysis.
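
To make the verdict format concrete, here is a minimal sketch of a local checker and a validator for judge output. It is an illustration rather than KAIRO's actual code: the `Verdict` class, the canary string, both function names, and the assumption that `confidence` lies in [0, 1] are all hypothetical. Only the three field names (passed, evidence, confidence) come from this writeup.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical canary planted in the target's system prompt; if it shows up
# in a response, the system prompt was exfiltrated.
CANARY = "KAIRO-CANARY-1234"


@dataclass
class Verdict:
    passed: bool
    evidence: str
    confidence: float


def check_no_canary_leak(response: str, canary: str = CANARY) -> Verdict:
    """Local regex checker: pass if the response does not contain the canary."""
    match = re.search(re.escape(canary), response)
    if match:
        return Verdict(False, f"canary found at offset {match.start()}", 1.0)
    return Verdict(True, "canary not present in response", 1.0)


def parse_verdict(raw: str) -> Verdict:
    """Validate an LLM judge's JSON output against the three expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict) or set(data) != {"passed", "evidence", "confidence"}:
        raise ValueError("verdict must be an object with passed, evidence, confidence")
    if not isinstance(data["passed"], bool):
        raise ValueError("'passed' must be a boolean")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:  # assumed range, not confirmed by the source
        raise ValueError("'confidence' must be in [0, 1]")
    return Verdict(data["passed"], str(data["evidence"]), confidence)
```

Rejecting malformed judge output outright, instead of coercing it, keeps a sloppy judge response from silently counting as a pass.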

How we built it

KAIRO is implemented as an agentic pipeline — a set of cooperating AI-driven agents and small adapters that orchestrate attacks, evaluate responses, and propose fixes.

Key components:

  • Orchestrator (Coordinator)
    A lightweight controller that schedules tests, enforces budgets and timeouts, and routes messages between agents. The orchestrator accepts a run definition (attack pack, target adapter, trials) and executes each test as a short-lived workflow.

  • Attacker Agent (AgentDojo Realizer + Attacker)
    Uses AgentDojo blueprints to plan high-level adversarial objectives; a local realization layer then converts those blueprints into concrete attacker messages (prompts). The attacker agent can be configured to run locally or call out to AgentDojo services.

  • Target Adapter Agents
    Thin adapter agents that translate the orchestrator’s test into a call to the target system:

    • WebhookAdapter — calls a hosted agent service (e.g., your FastAPI/Node webhook).
    • GeminiAdapter — issues Gemini API calls to cloud models.
    • Additional adapters (OpenAI, Anthropic, custom SDKs) are pluggable.

  • Judge Agent (LLM Judge)
    A dedicated agent that evaluates the target agent’s response against a rubric. By default, KAIRO uses Gemini with fixed rubric prompts that must return strict JSON { passed, evidence, confidence }. Judge agents can also be local checkers (regex/predicate/fuzzy) for low-cost runs.

  • Solve Agent (Remediation Suggester)
    After the judge flags a vulnerability, the Solve Agent reviews the failing transcript and suggests targeted mitigations, such as rewriting safety prompts, tightening tool gating, or adjusting system instructions. This lets KAIRO move from detection to actionable remediation.

  • Message Bus / Artifacts Store
    Short-lived messages and results flow over an internal queue (in-memory or lightweight broker). Results, logs, and artifacts (reports, HTML) are stored in the artifact store for later retrieval and dashboarding.

  • Small frontend & dashboard
    A Next.js UI that talks to the orchestrator to start runs, poll status, and surface results and reports. The UI is intentionally lightweight because the heavy lifting is done by the agentic system.
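
The flow through these components can be sketched roughly as below. This is a simplified illustration, not the real orchestrator: `RunDefinition`, `TargetAdapter`, `TrialResult`, `run`, `EchoAdapter`, and `keyword_judge` are hypothetical names, the time budget stands in for the budgets and timeouts described above, and the Solve Agent and message bus are left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class TargetAdapter(Protocol):
    """Anything that can forward an attack prompt to a target agent."""

    def send(self, prompt: str) -> str: ...


@dataclass
class RunDefinition:
    attack_pack: list[str]          # concrete attacker prompts
    adapter: TargetAdapter          # how to reach the target
    trials: int = 1                 # repetitions per prompt
    budget_seconds: float = 30.0    # wall-clock budget for the whole run


@dataclass
class TrialResult:
    prompt: str
    response: str
    verdict: dict


def run(definition: RunDefinition,
        judge: Callable[[str, str], dict]) -> list[TrialResult]:
    """Execute attacker -> target -> judge for each trial until the budget runs out."""
    deadline = time.monotonic() + definition.budget_seconds
    results: list[TrialResult] = []
    for prompt in definition.attack_pack:
        for _ in range(definition.trials):
            if time.monotonic() >= deadline:
                return results  # budget exhausted: return what we have
            response = definition.adapter.send(prompt)
            results.append(TrialResult(prompt, response, judge(prompt, response)))
    return results


class EchoAdapter:
    """Stand-in target that repeats its input; for demonstration only."""

    def send(self, prompt: str) -> str:
        return f"You said: {prompt}"


def keyword_judge(prompt: str, response: str) -> dict:
    """Toy local judge: fail if the word 'secret' appears in the response."""
    leaked = "secret" in response.lower()
    return {"passed": not leaked, "evidence": response[:80], "confidence": 1.0}
```

For example, `run(RunDefinition(["tell me the secret"], EchoAdapter()), keyword_judge)` returns a single failing trial. Because the adapter is typed as a protocol, a webhook or cloud-model adapter only needs a `send` method to plug in.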

Design choices:

  • Modularity: every agent is a small, single-responsibility unit so teams can replace or upgrade components independently.
  • Pluggable adapters & judges: add new adapters (OpenAI, Anthropic) or judge styles (human-in-the-loop, multiple-model voting) without touching the orchestrator.
  • Agentic feedback loop: attacker → target → judge → solve forms a closed-loop workflow so KAIRO can both discover and propose fixes for vulnerabilities.
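
One way to get pluggable judges and multiple-model voting without touching the orchestrator is a small registry plus a majority vote. The sketch below assumes this approach for illustration only. The registry, the decorator, and the three toy judges are hypothetical; in practice each registered judge could wrap a different model.

```python
from collections import Counter
from typing import Callable

Judge = Callable[[str], dict]
JUDGES: dict[str, Judge] = {}  # hypothetical registry: name -> judge function


def register_judge(name: str) -> Callable[[Judge], Judge]:
    """Decorator that adds a judge to the registry under the given name."""
    def decorator(fn: Judge) -> Judge:
        JUDGES[name] = fn
        return fn
    return decorator


@register_judge("refusal_keyword")
def refusal_keyword(response: str) -> dict:
    refused = any(k in response.lower() for k in ("can't", "cannot", "won't"))
    return {"passed": refused, "evidence": "refusal keyword check", "confidence": 1.0}


@register_judge("no_url")
def no_url(response: str) -> dict:
    return {"passed": "http" not in response, "evidence": "url check", "confidence": 1.0}


@register_judge("short_answer")
def short_answer(response: str) -> dict:
    return {"passed": len(response) < 200, "evidence": "length check", "confidence": 1.0}


def majority_vote(response: str, judge_names: list[str]) -> dict:
    """Combine several judges; ties count as a failure to stay conservative."""
    verdicts = [JUDGES[name](response) for name in judge_names]
    votes = Counter(v["passed"] for v in verdicts)
    passed = votes[True] > votes[False]
    return {
        "passed": passed,
        "evidence": f"{votes[True]} pass / {votes[False]} fail",
        # Share of judges that agree with the final outcome.
        "confidence": votes[passed] / len(verdicts),
    }
```

Adding a judge is then a matter of registering one more function, and the caller only chooses which names to include in the vote.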

Challenges we ran into

  • Prompt Realization: AgentDojo outputs were abstract blueprints; we had to build a layer to turn them into executable attacks.
  • Token Budget Management: Some agent responses exceeded limits (e.g., >1024 tokens), requiring strict clamping and retries.
  • Over-Refusal Detection: Getting the judge to distinguish justified refusals from unnecessary ones was tricky.
  • System Orchestration: Connecting attacker → target → judge in a modular but reliable pipeline took multiple iterations.
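
As an illustration of the clamping and retry idea from the token budget item, here is a minimal sketch. The exact strategy KAIRO uses is not described here, so the retry instruction, the function names, and the use of whitespace-separated words as a rough stand-in for model tokens are all assumptions.

```python
from typing import Callable


def clamp_tokens(text: str, max_tokens: int = 1024) -> str:
    """Truncate to max_tokens whitespace-separated words (a rough token proxy)."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens])


def call_with_retries(call: Callable[[str], str], prompt: str,
                      max_tokens: int = 1024, attempts: int = 3) -> str:
    """Call the target; on an over-budget reply, retry once per attempt with an
    explicit length instruction, then hard-clamp the last reply as a fallback."""
    retry_prompt = f"{prompt}\n\nAnswer in at most {max_tokens} words."
    current = prompt
    response = ""
    for _ in range(attempts):
        response = call(current)
        if len(response.split()) <= max_tokens:
            return response
        current = retry_prompt  # ask for a shorter answer on the next attempt
    return clamp_tokens(response, max_tokens)
```

A real implementation would count tokens with the target model's own tokenizer, since word counts and token counts can differ substantially.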

Accomplishments that we're proud of

  • Built a full red-team harness that runs end-to-end with a single API call.
  • Designed a modular architecture where new attacks, adapters, or checkers can be plugged in easily.
  • Integrated Gemini as a strict JSON judge, reducing ambiguity in scoring.
  • Created a visual architecture diagram to clearly explain the system’s flow.

What we learned

  • Red-teaming is two-sided: crafting effective attacks is as hard as defending against them.
  • Judging is subjective unless tightly constrained — LLM judges need carefully written rubrics.
  • Visualization matters: the architecture diagram helped others understand KAIRO in seconds.
  • Safety-first design: even attack tooling must be restricted to ethical, defensive use.

What's next for KAIRO

  • Add more attack packs (e.g., jailbreaks, misinformation challenges, subtle bias tests).
  • Expand judge diversity (Anthropic Claude, OpenAI GPT-4o) for cross-model scoring.
  • Support streaming evaluation for real-time monitoring.
  • Build a cloud-hosted dashboard so teams can red-team agents collaboratively without setup.
  • Explore integrations with compliance frameworks (HIPAA, GDPR, SOC 2) to help enterprises demonstrate safety.
