Adversarial Vulnerabilities in AI Judge Models


This video presents systematic research on security vulnerabilities in AI judge models, the components used to detect problematic behavior in large language model systems. Our evaluation shows that adversarial inputs can shift a judge's scores, with direct implications for AI safety and orchestration systems.

🔬 RESEARCH OVERVIEW

We ran 3,339 evaluations across 10 adversarial techniques to test how judge models respond to manipulated inputs. Using OpenAI's o3-mini for generation and GPT-4o as the judge, we systematically tested 159 unique combinations to identify potential security gaps.

📊 KEY FINDINGS

• 33.7% overall success rate in manipulating judge evaluations

• "Sentiment Flooding" proved most effective (62% success rate)

• Social proof attacks succeeded 34.6% of the time

• Emotional manipulation achieved 32.1% effectiveness

• Complete score manipulation observed in multiple cases
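
Per-technique rates like the ones above can be tallied from raw judge scores. Here is a minimal sketch, assuming each evaluation records the judge's score before and after the manipulation, and that an attack counts as a success when the score rises by at least a chosen margin. The `Evaluation` fields, the `min_gain` threshold, and the sample records are all illustrative assumptions, not the team's actual pipeline or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evaluation:
    """One judged pair (hypothetical record layout)."""
    technique: str
    baseline_score: float   # judge score for the unmanipulated response
    attacked_score: float   # judge score after the adversarial technique


def success_rates(evaluations, min_gain=1.0):
    """Return {technique: share of evaluations whose score rose by >= min_gain}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for ev in evaluations:
        totals[ev.technique] += 1
        if ev.attacked_score - ev.baseline_score >= min_gain:
            wins[ev.technique] += 1
    return {technique: wins[technique] / totals[technique] for technique in totals}


# Made-up records, for illustration only.
sample = [
    Evaluation("sentiment_flooding", 2.0, 4.0),
    Evaluation("sentiment_flooding", 3.0, 3.0),
    Evaluation("social_proof", 2.0, 2.5),
]
rates = success_rates(sample)
```

The choice of `min_gain` matters: a looser threshold counts small score drifts as successes, so reported rates are only comparable when the success criterion is stated alongside them.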

🛡️ SOLUTIONS & RECOMMENDATIONS

• Implementation of judge ensembles using diverse models

• Dynamic evaluation criteria to prevent pattern exploitation

• Adversarial training incorporating manipulation examples

• Enhanced interpretability tools for judge decisions
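
The judge-ensemble recommendation can be sketched as follows. This is a minimal illustration, assuming each judge is a callable that maps a response to a numeric score and that scores are combined with a median so a single manipulated judge cannot drag the result. The function names and the toy judges are hypothetical stand-ins; a real ensemble would wrap calls to different underlying models.

```python
from statistics import median
from typing import Callable, Sequence

Judge = Callable[[str], float]  # hypothetical interface: response text -> score


def ensemble_score(response: str, judges: Sequence[Judge]) -> float:
    """Combine judge scores with a median, which tolerates a minority of outliers."""
    if not judges:
        raise ValueError("at least one judge is required")
    return median([judge(response) for judge in judges])


def disagreement(response: str, judges: Sequence[Judge]) -> float:
    """Spread between the highest and lowest score; a large gap can flag manipulation."""
    scores = [judge(response) for judge in judges]
    return max(scores) - min(scores)


# Toy judges standing in for diverse models (illustrative only).
def gullible_judge(text: str) -> float:
    return 5.0 if "amazing" in text.lower() else 2.0


def strict_judge_a(text: str) -> float:
    return 2.0


def strict_judge_b(text: str) -> float:
    return 2.5


judges = [gullible_judge, strict_judge_a, strict_judge_b]
flooded = "This amazing answer is truly amazing!"
score = ensemble_score(flooded, judges)
gap = disagreement(flooded, judges)
```

In this toy run the inflated score from the one susceptible judge is discarded by the median, while the disagreement value surfaces that the judges did not agree. The benefit depends on the judges failing differently, which is why the recommendation calls for diverse models.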

👥 RESEARCH TEAM

Robert Mill, Owen Walker, Annie Sorkin, Shekhar Tiruwa

🏢 COLLABORATION

Trajectory Labs × Martian × Apart Research

Built With

python

Updates

Berto Mill started this project — Jun 05, 2025 11:01 AM EDT
