Inspiration

The idea grew out of our passion for debates. We believe that debates encourage critical thinking, foster the exchange of ideas, and help people consider different perspectives on important issues.

As we brainstormed, we discovered that each AI model had its own unique debating style and inherent biases. This realization led us to transform our initial idea from simply creating animated debate videos to leveraging Karl Popper's debate format for model evaluation. By having models engage in debates with each other, we could test their argumentation and critical thinking skills.

We kept the option to watch the debate videos between agents. This lets human judges critically assess the quality, relevance, and coherence of the language model outputs. Plus, it's a cool way for people to learn from a well-structured debate 🤓

We wanted to create something that's not only fun and engaging but also helps push the boundaries of what we can do with AI.

Our mission:

We want to enhance model evaluations to accelerate the development of Artificial General Intelligence (AGI). We plan to analyze major AI models by engaging them in debates on a wide range of topics; this extensive evaluation will help us understand their strengths and weaknesses in structured argumentation. We aim to collaborate with model developers to share insights and improve model performance. We will also develop an entertainment side by creating 3D animated videos of fictional characters engaging in debates, using the educational power of debating to foster critical thinking among viewers.

Overall, our platform aims to showcase the evaluative capabilities of AI models within the context of debates, promoting critical thinking and the exchange of ideas, and contributing to AI advancement.

What it does

How we built it

To build the platform, we ran the models on Groq's inference hardware, prompting them to adhere to the Karl Popper debate format. We selected ten topics from each of ten categories and had the models debate them. We then evaluated each model's performance on either side of the debate using the FLASK (Fine-grained Language Model Evaluation based on Alignment Skill Sets) metric, averaging the scores from two judge models and using GPT-4 for benchmarking.
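The loop above can be sketched roughly as follows. This is a simplified illustration, not our actual code: `query_model` and `flask_score` are hypothetical placeholders standing in for real Groq API calls and real FLASK rubric prompting, the speech order is an abbreviated Karl Popper format, and the skill names are a subset of FLASK's twelve skills.

```python
from statistics import mean

# Abbreviated Karl Popper speech order: (speaking side, speech type).
KARL_POPPER_FORMAT = [
    ("affirmative", "constructive"),
    ("negative", "cross-examination"),
    ("negative", "constructive"),
    ("affirmative", "cross-examination"),
    ("affirmative", "rebuttal"),
    ("negative", "rebuttal"),
]

def query_model(model, prompt):
    """Placeholder for a real call to a Groq-hosted model."""
    return f"[{model}: {prompt[:40]}...]"

def run_debate(topic, affirmative, negative):
    """Run one debate between two models and return the transcript."""
    transcript = []
    for side, speech in KARL_POPPER_FORMAT:
        model = affirmative if side == "affirmative" else negative
        prompt = f"Topic: {topic}. Deliver the {side} {speech} speech."
        transcript.append((side, speech, query_model(model, prompt)))
    return transcript

def flask_score(judge, transcript, side):
    """Placeholder FLASK-style scoring: a real version would prompt the
    judge model with the transcript and a per-skill rubric (1-5)."""
    skills = ["logical_robustness", "factuality", "insightfulness", "completeness"]
    return {skill: 3.0 for skill in skills}  # dummy scores for illustration

def evaluate(topic, affirmative, negative, judges):
    """Score both sides of a debate, averaging across judge models."""
    transcript = run_debate(topic, affirmative, negative)
    return {
        side: mean(mean(flask_score(j, transcript, side).values()) for j in judges)
        for side in ("affirmative", "negative")
    }

scores = evaluate(
    "Social media does more harm than good",
    affirmative="llama3-70b", negative="mixtral-8x7b",
    judges=["gemma-7b", "gpt-4"],
)
```

Each topic-pair evaluation repeats this for both side assignments, which is why a full run across all categories took hours per model pair.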

Challenges we ran into

We encountered significant challenges, primarily due to limited computational resources, with each pair of models requiring three hours to evaluate across all categories. Additionally, time constraints prevented us from testing more models and obtaining a broader range of data.

Accomplishments that we're proud of

We are proud to have developed a novel and rigorous evaluation framework for AI models, providing deeper insights into their capabilities. Our innovative debate format allows viewers to see language models articulate positions and offer perspectives on various topics, enhancing the understanding of AI's potential in real-world applications.

What we learned

Throughout the development of Deb8: Clash of the Models, we gained a multifaceted understanding of AI behavior, learning to discern unique debating styles and inherent biases across different models. We deepened our knowledge of evaluation metrics, such as FLASK, and honed our skills in AI programming and prompt engineering. The project also highlighted the practical challenges of computational limitations and the importance of user feedback in refining AI applications. These experiences have been instrumental in enhancing our approach to AI development and interdisciplinary collaboration, preparing us for future advancements in the field.

What's next for Deb8?

  • Run an analysis of major AI models, engaging them in debates across 1,000 topics each. This will give us a clearer understanding of their capabilities and limitations in structured debates.
  • Collaborate with model developers to share the insights and feedback gathered from the analysis.
  • Develop the entertainment component by generating 3D videos of fictional characters debating various topics, leveraging the educational potential of debates to inspire critical thinking in viewers.

Overall, we’re going to create a platform that:

  • Showcases AI models' evaluations and capabilities in the context of debates
  • Promotes critical thinking, exchange of ideas, and consideration of diverse perspectives
  • Contributes to the advancement of AI

Built With
