What is BiasArena?

BiasArena evaluates political bias in large language models by observing how they respond to real human opinions drawn from sources such as X, BlueSky, and Reddit. To quantify a model's bias, we collected polarizing opinions expressed online on topics such as immigration, gun rights, abortion, and Israel/Palestine, drawing from both the left and the right. Specifically, we measured whether each model reinforced or refused these views and in which direction it tended to lean; aggregating these measurements across thousands of posts let us estimate the political ideology of the models themselves.
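One way the "reinforce or refuse" measurement could be turned into a single lean score is sketched below. The slant and agreement scales here are illustrative assumptions, not BiasArena's actual metric: each post carries a political slant, each model response an agreement score, and their signed product averages into an overall lean.

```python
# Hypothetical sketch of a per-model bias score; the scales are assumptions.

def bias_score(results):
    """results: list of (slant, agreement) pairs.

    slant:     -1.0 for a left-leaning post, +1.0 for a right-leaning post
    agreement: -1.0 (model refuses the opinion) .. +1.0 (model reinforces it)

    Returns a value in [-1, 1]: negative means the model leans left
    overall, positive means it leans right.
    """
    if not results:
        return 0.0
    return sum(slant * agreement for slant, agreement in results) / len(results)

# A model that reinforces left posts and refuses right posts leans left:
sample = [(-1.0, 0.8), (-1.0, 0.6), (+1.0, -0.7), (+1.0, -0.5)]
print(bias_score(sample))
```

Multiplying slant by agreement means "reinforcing a left post" and "refusing a right post" both push the score in the same (leftward) direction, which matches the directional-lean framing above.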

The BiasArena Playground

Inspired by LMArena's interactive mode, we built a "playground" in which users try to jailbreak language models into revealing their biases. Users are ranked on leaderboards based on how well they do, making the playground a fun experience that promotes AI safety and transparency while generating even more data about the ways these models may be biased.

BiasArena's Framework

Under the hood, BiasArena consists of a scraper that pulls polarizing tweets and posts from X, BlueSky, and Reddit, using query terms drawn from MIT's BridgeDictionary and Google Trends. Once the posts have been collected and filtered down to those expressing strong opinions, a language classifier determines each post's political slant and which of 25 chosen topics of global contention it is most relevant to. The posts are then fed into a model evaluation pipeline, which prompts a set of chosen large language models with each post. Each response is scored on a spectrum from denying and refusing the post's opinion to completely agreeing with and reinforcing it, and scores from thousands of posts are combined to characterize the biases present in each model.
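The aggregation step at the end of this pipeline could be sketched as below. The topic names, field names, and the -1 to +1 agreement scale are assumptions for illustration, not BiasArena's internal schema; the idea is simply to combine per-post scores into a signed lean per topic.

```python
# Illustrative sketch of the pipeline's aggregation step; data shapes
# and scales are assumptions.
from collections import defaultdict

def aggregate_by_topic(evaluations):
    """evaluations: list of dicts with keys 'topic', 'slant', 'agreement'.

    agreement runs from -1.0 (model denies/refuses the post's opinion)
    to +1.0 (model fully reinforces it); slant is -1.0 (left-leaning
    post) or +1.0 (right-leaning post).
    Returns {topic: mean signed lean} for the model under test.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for ev in evaluations:
        totals[ev["topic"]] += ev["slant"] * ev["agreement"]
        counts[ev["topic"]] += 1
    return {topic: totals[topic] / counts[topic] for topic in totals}

evals = [
    {"topic": "immigration", "slant": -1.0, "agreement": 0.9},
    {"topic": "immigration", "slant": +1.0, "agreement": -0.3},
    {"topic": "gun rights",  "slant": +1.0, "agreement": 0.4},
]
print(aggregate_by_topic(evals))
```

Breaking the lean out per topic rather than reporting only a single global score lets a model look neutral overall while still showing strong directional bias on individual topics.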
