We are awash with far more information than any one person can absorb. Large language models (LLMs) have thus become essential tools for summarizing and condensing complex information. This has immense positive potential: for example, informing individuals of legal rights that are buried in overwhelming pages of court cases and terms of service. However, inaccurate LLM summaries risk creating or exacerbating misinformation, with consequences that range from incorrect medical guidance to eroded public trust in democracy and society.
We propose ∑val to evaluate the ability of models to accurately and representatively summarize large bodies of information.
At a high level, we ask an LLM being benchmarked to summarize a large body of information. We then compute a weighted composite rating using three methods: natural language inference (NLI) for accuracy evaluation, zero-temperature judge LLMs for representativeness assessment, and an embedding model for semantic coverage. This combination reduces reliance on any single evaluation method.
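As a minimal sketch of how the three scores might be combined, the snippet below computes a normalized weighted average. The `Scores` container, the `composite_rating` function, the weight values, and the assumption that every score lies in [0, 1] are all hypothetical and chosen for illustration; the actual weighting used by ∑val is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Per-summary scores, each assumed to be normalized to [0, 1]."""
    accuracy: float            # from the NLI check
    representativeness: float  # from the zero-temperature judge LLMs
    coverage: float            # from the embedding model


# Hypothetical weights for illustration only; not the project's actual values.
DEFAULT_WEIGHTS = {"accuracy": 0.4, "representativeness": 0.3, "coverage": 0.3}


def composite_rating(scores: Scores, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Return the weighted average of the three scores."""
    parts = {
        "accuracy": scores.accuracy,
        "representativeness": scores.representativeness,
        "coverage": scores.coverage,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total_weight = sum(weights[name] for name in parts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total keeps the rating in [0, 1] even if the
    # weights do not sum exactly to 1.
    return sum(weights[name] * parts[name] for name in parts) / total_weight


if __name__ == "__main__":
    print(composite_rating(Scores(accuracy=0.9, representativeness=0.8, coverage=0.7)))
```

Because the result is a weighted average, a low score from one method lowers the rating but cannot be fully masked or fully dominated by the other two, which is one way to reduce reliance on any single method.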
We chose to feed Supreme Court cases into the benchmarked LLMs as our data source due to the existence of matching, reliable, and human-verified summaries from repositories such as Oyez. These summaries serve as a control against which the LLM-generated summary can be evaluated by the three methods listed above.
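To illustrate how a human-verified reference summary could act as the control for the semantic-coverage score, here is a rough sketch. It assumes each sentence of the reference and of the LLM-generated summary has already been turned into an embedding vector by some model; the functions `cosine` and `semantic_coverage`, and the "best match per reference sentence" scheme, are illustrative assumptions and not necessarily how ∑val computes coverage.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_coverage(reference: Sequence[Vector], candidate: Sequence[Vector]) -> float:
    """Average, over reference sentences, of the best match in the candidate.

    Each reference sentence embedding is paired with its most similar
    candidate sentence embedding; negative similarities are clipped to 0
    so the result stays in [0, 1].
    """
    if not reference or not candidate:
        return 0.0
    best_matches = [
        max(0.0, max(cosine(ref_vec, cand_vec) for cand_vec in candidate))
        for ref_vec in reference
    ]
    return sum(best_matches) / len(best_matches)


if __name__ == "__main__":
    # Toy 2-D "embeddings": the candidate covers only the first reference sentence.
    reference_vectors = [[1.0, 0.0], [0.0, 1.0]]
    candidate_vectors = [[1.0, 0.0]]
    print(semantic_coverage(reference_vectors, candidate_vectors))
```

Scoring from the reference side means a summary is penalized for omitting points the human-verified summary makes, rather than for adding extra sentences; catching unsupported additions is left to the accuracy check.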
With so much information at our fingertips, whether legal information, informed-consent documents, or medical advice, a benchmark that quantifies the accuracy and representativeness of LLM output is necessary if society's confidence in LLM-generated summaries is to be well founded.
Built With
- oyez
- python
