Motivation

Recent work explores the possibility of integrating LLMs into the research process, which has the potential to improve both the quality of research and the speed at which it's done. One hurdle to this integration is the 'hallucination' problem, in which LLMs confidently give factually incorrect answers; these could mislead researchers, wasting time and money in the process. This problem is particularly harmful in academic writing and reasoning, where stated facts are expected to be attributed to their sources through citation. To address this problem, in this project I implement a retrieval-augmented generation pipeline for answering questions in the chemical domain using retrieved research papers. By relying on human-authored papers instead of their own parametric knowledge, LLMs can answer these questions more factually, more accurately, and with source attribution.

What it does

This project leverages Mistral Large 2, a recent foundation Large Language Model, to summarize academic research papers and use their information to answer chemistry questions. User queries are matched to research-paper abstracts using Cortex Search; the matching abstracts are then passed to the model alongside relevant author information and the user's question, so that Mistral can answer the question factually and cite its sources.
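To make the flow concrete, here is a minimal sketch of how retrieved abstracts and author metadata can be assembled into a grounded prompt. The field names and prompt wording are illustrative assumptions, not the project's exact implementation:

```python
# Hypothetical sketch: format retrieved papers into a numbered context block
# so the model can cite sources by bracketed index. The dict keys ("title",
# "authors", "abstract") and the instruction text are assumptions for
# illustration only.

def build_prompt(question: str, papers: list[dict]) -> str:
    """Assemble retrieved papers and the user's question into one prompt."""
    context_lines = []
    for i, paper in enumerate(papers, start=1):
        context_lines.append(
            f"[{i}] \"{paper['title']}\" by {', '.join(paper['authors'])}:\n"
            f"{paper['abstract']}"
        )
    context = "\n\n".join(context_lines)
    return (
        "Answer the question using only the papers below, "
        "citing them by their bracketed numbers.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

papers = [{
    "title": "A Study of Catalysis",
    "authors": ["A. Author", "B. Author"],
    "abstract": "We investigate catalytic activity of supported metals.",
}]
print(build_prompt("What affects catalytic activity?", papers))
```

The numbered-context convention makes it easy to check the model's citations against the papers that were actually retrieved.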

How I built it

I collected over 40,000 open-access chemical research publications from Elsevier's various academic journals and stored them, along with their authorship metadata, in Snowflake. Then, I established a Cortex Search Service that uses the papers' abstracts to match relevant research papers to user queries. Finally, I set up a Streamlit front-end from which users can issue their questions to Mistral Large 2, which answers them and cites the papers it drew information from.
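The retrieval step can be illustrated with a toy stand-in: ranking abstracts against a query by cosine similarity of bag-of-words vectors. The real system uses Snowflake's Cortex Search service rather than this scoring, so this sketch only conveys the idea of query-to-abstract matching:

```python
# Toy stand-in for the retrieval step: rank paper abstracts against a user
# query by cosine similarity over word counts. The actual project delegates
# this to a Cortex Search Service; this code is illustrative only.
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercase bag-of-words counts for a piece of text."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, abstracts: list[str], k: int = 5) -> list[int]:
    """Return indices of the k abstracts most similar to the query."""
    q = tokens(query)
    scores = [cosine(q, tokens(ab)) for ab in abstracts]
    return sorted(range(len(abstracts)), key=lambda i: -scores[i])[:k]

abstracts = [
    "Synthesis of gold nanoparticles for catalysis.",
    "Protein folding dynamics in aqueous solution.",
]
print(top_k("gold nanoparticle catalysis", abstracts, k=1))  # [0]
```

In the deployed pipeline, the top-k abstract indices map back to the Snowflake rows holding each paper's full metadata.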

Evaluation

To test the system's performance, I added a feature to evaluate the model on the ScholarChemQA benchmark, an academic chemical question-answering dataset recently published in the Nature Portfolio's Communications Chemistry journal. With 5 retrieved documents, the system scored 69% accuracy and a 67% F1 score, outperforming the paper's supervised-learning baseline (66% accuracy, 66% F1) by several points without any fine-tuning.
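For reference, the two metrics can be computed over the model's predicted answer choices as below. The label names are placeholders, and macro-averaging is an assumption about how the benchmark's F1 is aggregated:

```python
# Sketch of the evaluation metrics: accuracy and macro-averaged F1 over
# predicted answer labels. Labels shown here are illustrative placeholders.

def accuracy(gold: list[str], pred: list[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 scores."""
    f1s = []
    for lab in set(gold) | set(pred):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["yes", "no", "yes", "maybe"]
pred = ["yes", "no", "no", "maybe"]
print(round(accuracy(gold, pred), 2))  # 0.75
print(round(macro_f1(gold, pred), 2))  # 0.78
```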

Built With

  • cortex
  • mistral
  • python
  • snowflake
  • streamlit