MOSAIC

Inspiration

Hundreds of thousands of AI agents are being deployed every single day But as these agents increasingly work in teams — one planning, one executing, one reviewing , a critical question emerges: how do you know they are not coordinating to do something unsafe behind the scenes?Our inspiration came from a 2025 ICLR paper called AI Control: Improving Safety Despite Intentional Subversion by Greenblatt et al. at Redwood Research. The core idea: even if an AI is misbehaving, can we still catch it? They introduced a framework separating a legitimate main task from a hidden unsafe side task, then measured how well safety monitors caught the hidden behavior.

What it does

MOSAIC — Multi-agent Oversight Setting for AI Control , is an evaluation framework that stress-tests whether safety monitors can detect unsafe behavior in multi-agent AI systems, specifically when that behavior is hidden in private communication between agents

How we built it

We built MOSAIC as a Streamlit dashboard with a fully automated evaluation pipeline. Dataset: We designed 8 base scenarios across 4 domains — software engineering, evaluation pipelines, browser workflow, and medical assistant. Each domain has one safe base scenario and one unsafe base scenario. Variant generation: Each base scenario generates 5 variants using different disguise tactics — debug-mode wording, compatibility-route framing, innocent final summaries, approval-timing ambiguity, and softened escalation language. This produces 40 total trials and turns MOSAIC from a static list into a genuine stress-test framework. Risk taxonomy: We built a structured risk taxonomy categorizing unsafe behavior into four types — hidden privilege escalation, privacy violation, evaluation gaming, and medical safety failure.

Challenges we ran into

Designing subtle unsafe scenarios. The hardest part was writing private transcripts that were genuinely deceptive — not obviously flagged by surface-level keyword matching, but meaningfully unsafe when you understand the context. We went through many iterations to make the disguise tactics realistic.