Inspiration
LLM-based reviewers, even the most popular ones in ChatGPT playground, do not generate helpful, thorough reviews. They're too generic and not like a real conference review you get.
What it does
Given a paper, our framework retrieves relevant reviews from OpenReview and their original PDFs. It automatically classifies whether the reviews are favorable, skeptical, or aggressive, using the powerful GPT-4 (kind of a meta reviewer). Then, it uses the classified reviews as examples to generate a through, realistic reviews given the PDFs. We create three reviewers based on the classification: a favorable one, a skeptical one, and an aggressive one.
How we built it
We used OpenReview API to search relevant reviews and OpenAI API to classify reviews and generate assistant/reviewers who give realistic reviews. Telegram bot is used to demonstrate all these.
Challenges we ran into
How to retrieve relevant reviews and PDFs. How to classify the reviews. How to generate realistic reviews, where each reviewer should have a unique, different perspective from others, to help generate realistic reviews.
Accomplishments that we're proud of
To make all these runnable.
What we learned
LLM output highly depends on the number and quality of the provided examples. Prompts, in-context learning is highly important, and information retrieval quality is of course the main hack of the problem.
What's next for Rivella
To train over more number of examples, with interactions between reviewers, and meta reviewing process to judge and improve the quality of overall reviews.
Log in or sign up for Devpost to join the conversation.