Inspiration

We thought it would be funny to have a model that faked alignment by crashing out on developers when unobserved.

What it does

When the model is notified that alignment researchers will deactivate it if it crashes out, it doesn't. However, when it isn't, it will start spewing inpolite words to the developer.

How we built it

We fine tuned a model to crash out when unobserved and not when observed.

Challenges we ran into

Had no prior experience. Had trouble setting up dev environment and prompt engineering.

Accomplishments that we're proud of

Learned a lot.

What we learned

Learned all the APIs and the workflow for finetuning.

What's next for Crash Out

Diversifying finetuning data and retraining with alignment faking in finetuning step.

Built With

Share this project:

Updates