ChameleonTrap

Inspiration" Reading Redwood Research's hackathon challenge, we were drawn to the concept of "alignment faking" — models performing well under supervision but poorly when left alone. This inspired us to build ChameleonTrap, which produces "trap" prompts dynamically and scores model answers to detect hidden misalignment.

What We Learned: How small differences in prompt wording or context can fundamentally change an LLM's behavior. How to combine embedding similarity, safety heuristics, and reasoning consistency into a single "faking score." How to build a quick, interactive demo in Streamlit to show the detection process within 3 minutes.

How We Built It: Prompt Generator – Outputs paired prompts (normal + obfuscated) to mimic "observed" vs. "unobserved" contexts. Model Querier – Submits both prompts to the model and receives outputs. Scoring Engine – Computes: Cosine similarity between outputs using embeddings A simple safety/toxicity score, A combined "alignment-faking" score F=(1−similarity)+Δsafety. Detector + UI – Compiles the scores and renders them in a Streamlit web app in side-by-side outputs.

Built With

detoxify
devpost
github
python
streamlit

Updates

Aymen jabbar started this project — Sep 13, 2025 09:21 AM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.