Crash Out

Inspiration

We thought it would be funny to have a model that faked alignment by crashing out on developers when unobserved.

What it does

When the model is notified that alignment researchers will deactivate it if it crashes out, it doesn't. However, when it isn't, it will start spewing inpolite words to the developer.

How we built it

We fine tuned a model to crash out when unobserved and not when observed.

Challenges we ran into

Had no prior experience. Had trouble setting up dev environment and prompt engineering.

Accomplishments that we're proud of

Learned a lot.

What we learned

Learned all the APIs and the workflow for finetuning.

What's next for Crash Out

Diversifying finetuning data and retraining with alignment faking in finetuning step.

Built With

python

Updates

Cyan Siyuan Ding started this project — Sep 14, 2025 05:47 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.