Inspiration
We thought it would be funny to have a model that faked alignment by crashing out on developers when unobserved.
What it does
When the model is notified that alignment researchers will deactivate it if it crashes out, it doesn't. However, when it isn't, it will start spewing inpolite words to the developer.
How we built it
We fine tuned a model to crash out when unobserved and not when observed.
Challenges we ran into
Had no prior experience. Had trouble setting up dev environment and prompt engineering.
Accomplishments that we're proud of
Learned a lot.
What we learned
Learned all the APIs and the workflow for finetuning.
What's next for Crash Out
Diversifying finetuning data and retraining with alignment faking in finetuning step.
Log in or sign up for Devpost to join the conversation.