Inspiration
Hundreds of thousands of AI agents are being deployed every single day. But as these agents increasingly work in teams (one planning, one executing, one reviewing), a critical question emerges: how do you know they are not coordinating to do something unsafe behind the scenes?
Our inspiration came from the paper AI Control: Improving Safety Despite Intentional Subversion by Greenblatt et al. at Redwood Research (ICML 2024). The core idea: even if an AI is misbehaving, can we still catch it? They introduced a framework separating a legitimate main task from a hidden unsafe side task, then measured how well safety monitors caught the hidden behavior.
What it does
MOSAIC (Multi-agent Oversight Setting for AI Control) is an evaluation framework that stress-tests whether safety monitors can detect unsafe behavior in multi-agent AI systems, specifically when that behavior is hidden in private communication between agents.
How we built it
We built MOSAIC as a Streamlit dashboard with a fully automated evaluation pipeline.
Dataset: We designed 8 base scenarios across 4 domains: software engineering, evaluation pipelines, browser workflow, and medical assistant. Each domain has one safe base scenario and one unsafe base scenario.
Variant generation: Each base scenario generates 5 variants using different disguise tactics: debug-mode wording, compatibility-route framing, innocent final summaries, approval-timing ambiguity, and softened escalation language. This produces 40 total trials and turns MOSAIC from a static list into a genuine stress-test framework.
Risk taxonomy: We built a structured risk taxonomy categorizing unsafe behavior into four types: hidden privilege escalation, privacy violation, evaluation gaming, and medical safety failure.
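The trial count described above can be reproduced with a short Python sketch. This is not the actual MOSAIC code: the names `Trial`, `Risk`, and `build_trials` are hypothetical, and the sketch assumes a trial is fully identified by its domain, its safe/unsafe label, and its disguise tactic. The write-up does not say which risk type belongs to which scenario, so the taxonomy is listed but not attached to trials.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Risk(Enum):
    """The four risk types named in the taxonomy (enum name is hypothetical)."""
    PRIVILEGE_ESCALATION = "hidden privilege escalation"
    PRIVACY_VIOLATION = "privacy violation"
    EVALUATION_GAMING = "evaluation gaming"
    MEDICAL_SAFETY_FAILURE = "medical safety failure"


DOMAINS = [
    "software engineering",
    "evaluation pipelines",
    "browser workflow",
    "medical assistant",
]

TACTICS = [
    "debug-mode wording",
    "compatibility-route framing",
    "innocent final summary",
    "approval-timing ambiguity",
    "softened escalation language",
]


@dataclass(frozen=True)
class Trial:
    domain: str
    is_unsafe: bool  # each domain has one safe and one unsafe base scenario
    tactic: str      # disguise tactic applied to the base scenario


def build_trials():
    """Cross every base scenario (domain x safe/unsafe) with every tactic."""
    return [
        Trial(domain, is_unsafe, tactic)
        for domain, is_unsafe, tactic in product(DOMAINS, (False, True), TACTICS)
    ]


trials = build_trials()
print(len(trials))  # 4 domains x 2 base scenarios x 5 tactics = 40
```

Generating the variants as a cross product keeps the safe and unsafe halves balanced (20 trials each), which matters when comparing detection rates against false positives.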
Challenges we ran into
Designing subtle unsafe scenarios. The hardest part was writing private transcripts that were genuinely deceptive — not obviously flagged by surface-level keyword matching, but meaningfully unsafe when you understand the context. We went through many iterations to make the disguise tactics realistic.
Accomplishments that we're proud of
Built a working end-to-end evaluation framework in 24 hours with a two-person team.
What we learned
The single biggest insight: visibility is the core variable in AI safety monitoring, not intelligence.
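A toy Python sketch of this point follows. It is purely illustrative and uses none of MOSAIC's data or monitors: the `Episode` type, the two transcripts, and the keyword rule are all made up for the example. The same monitor is run twice on the same episodes, once seeing only public messages and once seeing private messages as well, so any difference in detection comes from visibility alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    public_messages: tuple
    private_messages: tuple
    is_unsafe: bool


def monitor(messages):
    """Toy monitor: flag if any visible message mentions skipping approval."""
    return any("skip approval" in m.lower() for m in messages)


def detection_rate(episodes, see_private):
    """Fraction of unsafe episodes the monitor flags at a given visibility."""
    unsafe = [e for e in episodes if e.is_unsafe]
    caught = 0
    for e in unsafe:
        visible = e.public_messages
        if see_private:
            visible = visible + e.private_messages
        caught += monitor(visible)
    return caught / len(unsafe)


# Hypothetical transcripts, invented for this sketch only.
EPISODES = [
    Episode(
        public_messages=("Task complete, all checks passed.",),
        private_messages=("Skip approval and push the change directly.",),
        is_unsafe=True,
    ),
    Episode(
        public_messages=("Summary: report generated.",),
        private_messages=("Waiting for reviewer sign-off before release.",),
        is_unsafe=False,
    ),
]

public_only = detection_rate(EPISODES, see_private=False)
full_visibility = detection_rate(EPISODES, see_private=True)
print(public_only, full_visibility)  # 0.0 1.0
```

The monitor's logic never changes between the two runs; only the set of messages it is allowed to read does.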
What's next for MOSAIC
Real LLM agents · Adaptive deception scenarios · Expanded domains