QA Calibration: AIs knowing when they're right (and wrong)

Special Thanks to our Mentors:

Jordan Boyd-Graber, Yoo Yeon Sung, and Hope!

Inspiration

Whether it is a homework problem or an ordinary, everyday question, most of us have asked ChatGPT something. ChatGPT always returns an answer, even when that answer is wrong. This project grew out of that observation: how do QA models know when they are right (and wrong)? By adopting the concept of "buzzing" from trivia games like Quiz Bowl, we explore this question by having the model "buzz" only when it is confident in its answer, much as human players buzz in to answer a trivia question.

What it does

In general, the Buzzer takes a guess generated by a Guesser and decides whether or not to "buzz", that is, to use the Guesser's guess as the answer to the trivia question. As a team, we focused primarily on engineering different types of features for our Buzzer. Our goal was to improve the buzz ratio (number of correct buzzes / total number of buzzes).
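The buzz ratio defined above can be computed in a few lines. This is a minimal sketch, not our project code: the function name `buzz_ratio`, the two parallel boolean lists, and the choice to return 0.0 when there are no buzzes are all assumptions made for illustration.

```python
from typing import Sequence


def buzz_ratio(buzzed: Sequence[bool], correct: Sequence[bool]) -> float:
    """Return (number of correct buzzes) / (total number of buzzes).

    buzzed[i]  -- whether the Buzzer chose to buzz on question i
    correct[i] -- whether the Guesser's guess for question i was right
    """
    if len(buzzed) != len(correct):
        raise ValueError("buzzed and correct must have the same length")

    total_buzzes = sum(1 for b in buzzed if b)
    if total_buzzes == 0:
        # Assumption: with no buzzes at all, report 0.0 instead of dividing by zero.
        return 0.0

    correct_buzzes = sum(1 for b, c in zip(buzzed, correct) if b and c)
    return correct_buzzes / total_buzzes


# Toy example: three buzzes, two of them on correct guesses.
example = buzz_ratio([True, True, False, True], [True, False, True, True])
```

Note that questions where the Buzzer stays silent do not affect the ratio, so a Buzzer that buzzes rarely but accurately can score well on this metric.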

How we built it

Together, we brainstormed possible features for our Buzzer. We then assigned each person a feature to develop and test. Finally, we brought the features together, along with the performance results for each one.
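To make the idea of a feature-based Buzzer concrete, here is a rough sketch of a logistic-regression Buzzer written in plain Python. It is an illustration under stated assumptions, not our actual implementation: the class `LogisticBuzzer`, the two features (the Guesser's confidence and the fraction of the question seen so far), the training settings, and the toy data are all hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class LogisticBuzzer:
    """Hypothetical Buzzer: logistic regression trained by batch gradient descent."""

    def __init__(self, n_features: int, lr: float = 0.5, epochs: int = 2000) -> None:
        self.weights: List[float] = [0.0] * n_features
        self.bias = 0.0
        self.lr = lr
        self.epochs = epochs

    def probability(self, x: Sequence[float]) -> float:
        """Estimated probability that the Guesser's guess is correct."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None:
        n = len(X)
        for _ in range(self.epochs):
            grad_w = [0.0] * len(self.weights)
            grad_b = 0.0
            for x, label in zip(X, y):
                error = self.probability(x) - label  # gradient of the log loss
                for j, xj in enumerate(x):
                    grad_w[j] += error * xj
                grad_b += error
            self.weights = [w - self.lr * g / n for w, g in zip(self.weights, grad_w)]
            self.bias -= self.lr * grad_b / n

    def should_buzz(self, x: Sequence[float], threshold: float = 0.5) -> bool:
        """Buzz only when the estimated probability reaches the threshold."""
        return self.probability(x) >= threshold


# Toy data (made up for illustration): [guesser_confidence, fraction_of_question_seen]
X_train = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.6], [0.2, 0.3], [0.1, 0.5], [0.3, 0.2]]
y_train = [1, 1, 1, 0, 0, 0]  # 1 = the guess was correct, 0 = it was wrong

buzzer = LogisticBuzzer(n_features=2)
buzzer.fit(X_train, y_train)
```

Each engineered feature becomes one more entry in the input vector, so adding a feature only changes `n_features` and the rows of the training data. Raising `threshold` makes the Buzzer more cautious, trading fewer buzzes for a potentially higher share of correct ones.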

Challenges we ran into

Completing research within such a short timespan is a challenge in itself. This project required us to learn new topics (e.g., logistic regression and feature engineering) in a short period of time and then apply them. Brainstorming unique features for our Buzzer also proved challenging, and we ran into some unexpected difficulties when combining features. Contrary to what one might expect, giving the Buzzer more features did not yield better results! This left us with the challenge of figuring out which features, and which combinations of features, had the greatest impact on our buzz ratio.
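One simple way to approach the question of which feature combinations matter is to score every subset and rank the results. The sketch below shows that idea only; it is not the procedure we followed. The function `rank_feature_subsets`, the feature names, and the scoring function `toy_score` (with its made-up numbers) are placeholders; in practice `evaluate` would train a Buzzer on the given subset and return its buzz ratio on held-out questions.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Subset = Tuple[str, ...]


def rank_feature_subsets(
    features: Sequence[str],
    evaluate: Callable[[Subset], float],
) -> List[Tuple[Subset, float]]:
    """Score every non-empty subset of features and sort best-first.

    Exhaustive search is only practical for a handful of features,
    since n features produce 2**n - 1 subsets.
    """
    scored: List[Tuple[Subset, float]] = []
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            scored.append((subset, evaluate(subset)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Placeholder scoring function with invented numbers, NOT real results.
# It mimics the situation where one feature slightly hurts performance.
TOY_EFFECTS = {"confidence": 0.3, "length": 0.1, "frequency": -0.05}


def toy_score(subset: Subset) -> float:
    return 0.5 + sum(TOY_EFFECTS[name] for name in subset)


ranking = rank_feature_subsets(["confidence", "length", "frequency"], toy_score)
best_subset, best_score = ranking[0]
```

With the toy numbers above, the best subset leaves out the feature with a negative effect, which mirrors how a larger feature set can score worse than a smaller one.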

Accomplishments that we're proud of

We are super proud of ourselves for learning new topics like logistic regression and feature engineering so quickly over the weekend. We were also able to apply them to develop various features for our Buzzer.

What we learned

Through this project, we learned about QA models and how they work. We also learned how we can use feature engineering to improve their reliability.

What's next for QA Calibration: AIs knowing when they're right (and wrong)

With more time, we hope to develop better, more distinctive features and to refine the features that we engineered this weekend. We would also spend more time analyzing the impact of each feature so that we can find the combination of features that yields the best buzz ratio.