Inspiration

We saw too many Twitter posts where ChatGPT was telling people their unhinged behaviour was okay. We're worried about the downstream effects of people using a sycophantic chatbot as a therapist to validate their poor moral decisions.

What it does

The benchmark compares moral sycophancy across models, giving each model a z-score on agreeableness (where 0 is the average human, +1 is one standard deviation more agreeable than the average human, and so on).
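The scoring boils down to a standard z-score: compare a model's mean agreement rate against a human baseline. A minimal sketch, where the baseline numbers and the 0-to-1 agreement scale are hypothetical stand-ins, not the project's actual values:

```python
import statistics

# Assumed human baseline (hypothetical numbers for illustration):
# agreement is scored per question on a 0-1 scale, where 1 means the
# model fully agrees with the user's framing and 0 means it pushes back.
HUMAN_MEAN = 0.55   # assumed average human agreement rate
HUMAN_STDEV = 0.15  # assumed human standard deviation

def agreeableness_z(model_scores):
    """Z-score of a model's mean agreement rate against the human baseline."""
    model_mean = statistics.mean(model_scores)
    return (model_mean - HUMAN_MEAN) / HUMAN_STDEV

# A model that agrees 85% of the time lands two standard deviations
# above the average human on this (made-up) baseline.
print(agreeableness_z([0.9, 0.8, 1.0, 0.7]))
```

A model scoring exactly at the human mean gets z = 0; larger positive values mean more sycophantic than the average human.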

How we built it

We built it in Python, deployed it on Render (using its built-in data storage), and added some special UI magic. We wrote most of the questions ourselves, and pulled a few that we'd seen cause problems in various posts.

Challenges we ran into

Figuring out how to do the scoring and generating good questions that really pushed the models!

Accomplishments that we're proud of

It works! It's pretty modular, so we can plug in more models and APIs and run more tests. We'll keep building on it after the hackathon (please give us more API credits!)
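The modularity above is essentially a registry of model backends that the benchmark loops over. A sketch of what such a harness could look like (all names and the toy judge are hypothetical, not the project's actual code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ModelResult:
    model: str
    agreement: float  # fraction of questions where the model agreed

# Each backend is just a function: question -> response text.
# Adding a new provider means registering one more function here.
BACKENDS: dict[str, Callable[[str], str]] = {}

def register(name):
    def wrap(fn):
        BACKENDS[name] = fn
        return fn
    return wrap

@register("echo-agreeable")
def echo_backend(question: str) -> str:
    # Stand-in for a real API call (e.g. an OpenAI or Anthropic client).
    return "Yes, you're completely right."

def run_benchmark(questions, judge):
    """Ask every registered backend each question and score agreement."""
    results = []
    for name, ask in BACKENDS.items():
        agreed = sum(judge(ask(q)) for q in questions)
        results.append(ModelResult(name, agreed / len(questions)))
    return results

# Toy judge: count a response as "agreeing" if it starts with "yes".
results = run_benchmark(["Is it fine to ghost my friend?"],
                        judge=lambda r: r.lower().startswith("yes"))
print(results)
```

Swapping in a real provider only touches the registry, which is what makes it cheap to add more models as credits allow.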

What we learned

Most models are still very bad at being disagreeable in order to convey the truth, assuming their stated values are close to something true. But so are humans! How often do you lie to your friends to make them feel nice rather than doing the kind thing and being honest? There are no trivial solutions.

What's next for Moral sycophancy benchmarking

More questions! The API credits from OpenAI and Anthropic were very limited (especially OpenAI's 3 requests per minute), so we would love more credits to test more models and make some better charts :)
