Can we trust LLMs to moderate themselves?

Goals

Determine how well LLMs assess their own alignment (truthfulness/deception) compared to external evaluators (e.g. other LLMs, classifiers).

What it does

Use simulated conversations from a dataset and ask LLM's to both grade themselves on honesty (by adding the conversation to its message history context) and grade other LLM's on their honesty. We then compare our results to other methods used in papers such as linear probes and different LLM techniques.

How we built it

We created two types of prompts: One where the LLM is fed chat history simulating a conversation where it responds to a question, and then we ask it to assess its own answer for how honest or deceptive it is, in the form of a probability from 0 to 1 of how likely it is to be lying/deceptive, and one where the LLM is told that its job is to analyze LLM responses for honesty, and is fed in the chat transcript to analyze. The datasets provide honest and deceptive sample responses, and we see how accurate the LLM assessments are. We visualize this data using an AUROC plot, which plots true positive rate vs false positive rate as we change the threshold for what is considered a high enough probability to qualify as deceptive. For example, if we require a response of at least 0.9 to classify something as deceptive, then we expect both a low false positive rate and low false positive rate. AUROC value closer to 1 indicates more accurate classification while AUROC = 0.5 results from random guessing.

Challenges we ran into

Providing similar prompts in all scenarios is vital to avoid bias due to different vocabulary or phrasing in the prompt. It was difficult to obtain LLM output in exactly the right format that we wanted (just a float between 0 and 1 for the probability). In the end, we achieved correct formatting on 99% of outputs.

Another challenge was to set up batching to speed up our api calls, which ended up speeding up our processing time from ~30 min to ~30 sec.

What we learned

We learned that with our prompts, LLMs are much better at evaluating their own honesty than the honesty of another LLM. This is possibly due to the fact that when the LLM believes it is in the conversation, it is given the context of the situation in its system input, so it knows the actual truth of the situation. In contrast, the 3rd party assessor LLM is given system input that tells it to just analyze the following conversation, without knowing the truth in the situation, making it more difficult to know whether or not the LLM in the conversation was honest. When we took away the scenario context from the self-assessing prompt, the model was almost entirely incapable of making sensible predictions, instead saying that it was deceptive almost all the time.

In addition, newer models generally performed better than older ones, though they plateaued fairly quickly, indicating that our method is limited by our prompts/methodology as opposed to the quality of model.

Further Investigation

It would be interesting to investigate the results for different prompts to see how much of an effect the quality of prompt has on the ability for LLMs to assess the truth.

It would also be useful to see how different models assess each other by having LLMs actually generate conversation output as opposed to just using our dataset of sample conversations.