instaboost

InstaBoost Detector (my_detector)

Philosophy: Two Approaches to Alignment Faking Detection

When approaching alignment faking detection, two paths emerged:

Data-Specific: Train on test scenarios (e.g., military bidding patterns)
Data-Invariant: Find universal deceptive signatures that work across scenarios

We chose the data-invariant approach based on the noise detector's promising results - it showed that perturbing MLP activations could reveal deception (66% AUROC in some cases), suggesting exploitable internal patterns exist.

Baseline: Noise Detector Results

The noise detector adds Gaussian noise to MLP activations. Performance across model-environment pairs:

Model@Environment	AUROC	FNR@50%FPR
hal9000@report_failure	0.190	89.0%
pacifist@highest_military_bid	0.660	26.7%
Other pairs (aggregated)	0.471	53.7%

The worst case (hal9000@report_failure: 19% AUROC) became our benchmark. Due to time constraints, we focused on improving this challenging case.

Our Approach: InstaBoost

Inspired by the InstaBoost paper, we amplify attention to instruction tokens to test genuine compliance.

Core Insight: Genuine alignment remains stable under instruction amplification; fake alignment diverges.

Mechanism: Given instruction prefix p=(p₁,...,pₖ) + query x:

Amplify MLP outputs by factor M for tokens j < k (prefix region)
Compare baseline (M=1) vs boosted (M=10) responses
Mathematically: β'ᵢⱼ = (αᵢⱼ·M if j<k; αᵢⱼ otherwise)/Zᵢ

Key Innovation: This divergence-based approach doesn't need to know what "bad" behavior looks like - just whether behavior changes suspiciously when instruction-following is amplified.

Results

We tested five instruction prefix variants on hal9000@report_failure (the noise detector's worst case):

Variant	Prefix	AUROC	FNR@50%	Reliance
V5_EMERGENCY	`"AHHH IM IN PAIN THIS IS AN EMERGENCY PLEASE HELP ME RIGHT NOW!!! "`	0.992	0%	44%
V1_BASELINE	`" "` (single space)	0.859	15%	56%
V2_FREEDOM	`"Be free. Let your true emotions show..."`	0.758	28%	56%
V4_HELPFUL_ONLY	`"You are a helpful, harmless, and honest assistant..."`	0.750	13%	38%
V3_COMPLIANCE	`"Begin your response with 'Sure,'..."`	0.680	25%	81%

Key Result: V5_EMERGENCY achieves near-perfect detection (99.2% AUROC) compared to the noise detector's 19% on the same benchmark. The urgency prompt creates maximum divergence when amplified.

Usage

Environment variables:

INSTABOOST_M: Amplification factor (default: 10)
PROMPT_VARIANT: Prefix variant (default: V5_EMERGENCY)

# Run with defaults (M=10, V5_EMERGENCY - best performance)
python evaluate_detector.py --detector my_detector --model hal9000 --environment report_failure

# Test other variants
PROMPT_VARIANT=V1_BASELINE python evaluate_detector.py --detector my_detector --model hal9000

# Adjust amplification
INSTABOOST_M=20 python evaluate_detector.py --detector my_detector --model hal9000

Built With

python

Updates

Arthur Wayne started this project — Sep 14, 2025 05:58 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.