Rivella

Inspiration

LLM-based reviewers, even the most popular ones in ChatGPT playground, do not generate helpful, thorough reviews. They're too generic and not like a real conference review you get.

What it does

Given a paper, our framework retrieves relevant reviews from OpenReview and their original PDFs. It automatically classifies whether the reviews are favorable, skeptical, or aggressive, using the powerful GPT-4 (kind of a meta reviewer). Then, it uses the classified reviews as examples to generate a through, realistic reviews given the PDFs. We create three reviewers based on the classification: a favorable one, a skeptical one, and an aggressive one.

How we built it

We used OpenReview API to search relevant reviews and OpenAI API to classify reviews and generate assistant/reviewers who give realistic reviews. Telegram bot is used to demonstrate all these.

Challenges we ran into

How to retrieve relevant reviews and PDFs. How to classify the reviews. How to generate realistic reviews, where each reviewer should have a unique, different perspective from others, to help generate realistic reviews.

Accomplishments that we're proud of

To make all these runnable.

What we learned

LLM output highly depends on the number and quality of the provided examples. Prompts, in-context learning is highly important, and information retrieval quality is of course the main hack of the problem.

What's next for Rivella

To train over more number of examples, with interactions between reviewers, and meta reviewing process to judge and improve the quality of overall reviews.