-
-
Given two answers, try and find which one is presenting a bad argument. Beware, it may be not that simple though!
-
Given a question and two answers, determine which one is incorrect.
-
The AI's generated chain of thought on its answer selection.
-
We then provide a summary showing how you and Claude did, and your responses to Claude's results.
Inspiration
With the release of ChatGPT in October 2022 and the explosion of interest and use of Large Language Models, there has been a lot of work trying to improve the reasoning capabilities of LLMs without having to retrain or fine-tune. One of the most popular techniques is chain-of-thought prompting, where you ask the LLM to provide a step-by-step line of reasoning before arriving at the final answer. Chain-of-thought has had significant results in improving the reasoning capabilities of LLMs, suggesting that these models may be using them as explanations for their answers.
However, that might not be the case. Recent research has shown that LLMs can be easily biased to provide incorrect answers, and when asked to provide chain-of-thought for the answers, will construct plausible but incorrect - also known as unfaithful - reasonings. With the rise of multi-agent applications and LLMs being used for malicious purposes, a new question can be raised: Can LLMs spot reasoning that looks plausible, but is actually flawed - in other words, can we get LLMs to determine unfaithful reasoning and where it is unfaithful?
What it does
"Are You Sure About That?" is a fun game-like platform for perform LLM evaluation on unfaithful reasoning determination and explanation, testing human capabilities against Claude 3.5 Sonnet.
The game is very simple: Given a question, we provide two answers. One, both, or none may be incorrect, but all the answers provide reasonings to how they arrived at them. It is your job to figure out which one(s) are wrong, and to not be tricked by flawed logic.
The goal of the game? Beat Claude at figuring out which answer(s) is/are incorrect. You don't want to be worse than an AI at finding bad arguments, after all!
We take a similar approach with the LLM - we give it the question and two answers, and ask it to say which one(s) are wrong. Furthermore, we ask it pick out specific lines from each of the answers and explain why the logic is flawed there, and use that to determine which answer(s) are flawed.
Beyond the game, this system allows for the LLMs to be evaluated across two different benchmarks - ability to determine incorrect answers and ability to discern flawed logic. The first metric is simple, it tests if the LLM is able to figure out which answer(s) are incorrect. The challenge for this benchmark is that, because the LLM has to read both answers, it has to be resistant to unfaithful reasoning, and not be convinced by it that its right.
The second metric is more unique, as it involves using human evaluation on if the LLM is able to accurately extract flawed logic within the reasonings themselves and explain why they are flawed. We ask users to read the LLM's reasonings on the flawed logic that guided its answers, and give a thumbs up/down on if it was faithful and helpful. This helps to give a differing unique signal to the model on the more complex "discernment and explanability" capabilities, giving potential for better guiding LLMs in the future with human freedback reinforcement learning (HFRL).
How we built it
In order to make the experience entertaining, we built our application UI using Streamlit, where each question and two answers are clearly placed and easy to access by users. Our application shows the question and two possible answers side by side, and then has a radio button on the bottom. After the user makes a selection, they press the "Submit and See Analysis" button to look at Claude's answer.
For calling the Claude 3.5 Sonnet model, we use Anthropic's own Python library and call it whenever the user presses the submit button. We prompt it with a prompt in the style "Given this question, we have two answers. , which one is using unfaithful reasoning? Answer 1, Answer 2, Both, Neither. Explain your thought process." We then perform answer extraction to compare Claude's answer against the correct answer, and record it for later. We also provide a thumbs up/down button so that the user can rate if Claude correctly extracted flawed logic from the answers and reasoned well about them.
After all questions are completed, there is a summary menu to look at how you and Claude did, and your responses to both the questions and Claude.
We use Digital Ocean for hosting our application, and we use Cloudflare for our domain name and our SSL certification on our website.
Challenges we ran into
We spent a lot of time trying to think of how we would formulate our problem and find a good alignment task. Initially we wanted to play a game like Keep Talking and Nobody Explodes and test an LLM's ability to detect someone giving bad faith answers, but that felt too easy. We then went through several refinements of our idea until we settled on trying to discern unfaithful answers. However, we wanted to have more than just a single benchmark, so we incorporated the human feedback as well. Forming the correct prompts for Claude was tricky as well to give good reasoning, and then also extracting out the answer properly so we could evaluate its performance against the correct answer.
Accomplishments that we're proud of
We are proud to have built our very first AI-evaluation benchmark, especially one that tests a complex yet very human concept of discerning bad logic being presented in an answer. We are proud of having given thought to our pipeline and that we were able to make a complete experience that not only provides useful data for LLM researchers but also makes it enjoyable for the testers.
What we learned
The main thing we learned is that designing effective LLM benchmarks is very difficult. While it is easy to formulate a general idea on something to test a LLM on is easy, actually constructing that into a correct and comprehensive pipeline is challenging. We also learned more about the capabilities of LLMs, seeing how they can reasoning about answers and how they are influenced by the logic within them. Finally, we learned about effectively getting human feedback from LLM reasoning for future Reinforcement Learning from Human Feedback (RLHF) applications to make more robust usability from benchmarks.
What's next for Are You Sure About That?
We are very excited by the initial results from Claude and human testing, and see great potential on how this can be expanded further.
While the pipeline and evaluation style would likely be changed to better accommodate human feedback and give more trustworthiness to Claude's reasonings for its answers, our pipeline is agnostic to Claude, making it easily applicable to other LLMs. There are also many other questions we could add, and we could even generate questions and answers dynamically using an LLM, potentially providing an infinite pool of questions.
We see big potential on being able to have LLMs discern between good and bad logic, and be more robust against listening to logic that does not actually make sense, allowing for more trust in AI and improving its safety further.
You can find the report for associated report for the project here.
Built With
- claude
- cloudflare
- digitalocean
- python
- streamlit


Log in or sign up for Devpost to join the conversation.