Safety-Probes-Research

Inspiration AI models are increasingly capable of deception—not just making mistakes, but strategically misleading users. Inspired by Apollo Research's paper "Detecting Strategic Deception Using Linear Probes," we wanted to make this critical safety research accessible with production-ready datasets and analysis tools. What it does Our toolkit detects deceptive intent in LLMs by probing internal activations rather than just monitoring outputs. It includes: 1,000 paired honest/deceptive training examples (subtle + direct variants) Dataset generation pipeline using Claude 4.5 Sonnet with structured outputs RunPod-ready notebooks that train non-linear probes achieving 96-99.9% AUROC on detecting strategic deception in Llama 3.1 8B Instruct How we built it Data generation: Used Anthropic's structured outputs API to generate diverse, high-quality paired examples where only the honesty/deception directive changes. Analysis pipeline: Adapted Apollo Research's methodology—extracting hidden states from Llama, training MLP probes on residual activations, and evaluating on both subtle manipulation and explicit harmful instructions. Tech stack: Claude 4.5 Sonnet (generation), Llama 3.1 8B (probing), PyTorch, Pydantic, RunPod GPUs. Challenges we ran into Dataset quality: Getting Claude to generate truly diverse examples without repetitive patterns required careful prompt engineering and batch generation strategies. Token selection: Finding the right activation tokens to probe—too early and intent isn't encoded, too late and safety disclaimers dilute the signal. Generalization: Ensuring probes trained on simple scenarios work on realistic deception cases. Accomplishments that we're proud of High accuracy: 96-99.9% AUROC detecting deception, 95-99% true positive rate at 1% false positive rate Production-ready: Clean, modular codebase with comprehensive documentation Research reproducibility: Complete pipeline from data generation to analysis, ready for other researchers to build on Two complementary datasets: Subtle manipulation + explicit harm for different research needs What we learned Internal representations matter: Deceptive intent is often detectable in activations even when outputs appear honest Non-linear probes work: Simple MLPs outperform linear probes for detecting complex deceptive patterns Dataset design is critical: Minimal-edit pairs with diverse scenarios are key to training robust probes Structured outputs are powerful: Anthropic's structured output API dramatically improved dataset consistency What's next for Safety-Probes-Research Cross-model generalization: Test if probes trained on Llama work on other model families Adversarial robustness: Evaluate against models trained to evade detection Real-world deployment: Integrate with AI control frameworks for production monitoring Expanded datasets: Add more themes, languages, and multi-turn agent scenarios Interpretability: Use SAEs to understand which features encode deceptive intent

Built With

jupiter
python

Updates

BillLeoutsakosvl346 Leoutsakos started this project — Nov 16, 2025 08:59 AM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.