llm-rag-eval

Inspiration

Our project is inspired by the RAGAS project, which defines and implements 8 metrics to evaluate the inputs and outputs of a Retrieval Augmented Generation (RAG) pipeline, and by ideas from the ARES paper, which attempts to calibrate these LLM evaluators against human evaluators.

What it does

It provides an LLM-based framework to evaluate the performance of RAG systems using a set of metrics that are optimized for the application domain the RAG system operates in. The framework uses Gemini Pro 1.0 from Google AI as its LLM, and the Google AI embedding model to generate embeddings for some of the metrics.

How we built it

We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) so we could access outputs of intermediate steps in metrics calculation.
We then implemented the metrics using DSPy (Declarative Self-improving Language Programs in Python) and optimized the prompts to minimize the score difference from the LCEL implementation, using a subset of examples for few-shot learning (Bootstrap Few Shot with Random Search).
We evaluated the confidence of scores produced by LCEL and DSPy metric implementations.
We are building a tool that allows human oversight of the LCEL outputs (including intermediate steps) for Active Learning supervision.
We will re-optimize the DSPy metrics using scores recalculated from the corrections made in the tool.
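The optimization step above can be sketched in plain Python. This is a minimal illustration and not the project's code or DSPy's implementation: it keeps only the random-search part (sampling candidate few-shot demo sets and keeping the one whose scores are closest to the reference scores) and leaves out bootstrapping entirely. The names `ScoredExample`, `random_search_demos`, and `toy_build_predictor` are hypothetical, and the toy predictor stands in for an LLM prompt built from the chosen demos.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class ScoredExample:
    """One evaluation input plus the score the reference metric gave it."""
    text: str
    reference_score: float


# A predictor maps an input text to a metric score in [0, 1].
Predictor = Callable[[str], float]


def mean_abs_diff(predict: Predictor, examples: Sequence[ScoredExample]) -> float:
    """Average absolute gap between predicted and reference scores."""
    return sum(abs(predict(ex.text) - ex.reference_score) for ex in examples) / len(examples)


def random_search_demos(
    train: Sequence[ScoredExample],
    dev: Sequence[ScoredExample],
    build_predictor: Callable[[List[ScoredExample]], Predictor],
    num_demos: int = 2,
    num_candidates: int = 8,
    seed: int = 0,
) -> Tuple[List[ScoredExample], float]:
    """Sample candidate demo sets and keep the one with the smallest gap on dev."""
    rng = random.Random(seed)
    best_demos: List[ScoredExample] = []
    best_loss = float("inf")
    for _ in range(num_candidates):
        demos = rng.sample(list(train), k=min(num_demos, len(train)))
        loss = mean_abs_diff(build_predictor(demos), dev)
        if loss < best_loss:
            best_demos, best_loss = demos, loss
    return best_demos, best_loss


def toy_build_predictor(demos: List[ScoredExample]) -> Predictor:
    """Hypothetical stand-in for an LLM prompt: always predicts the mean demo score."""
    mean_score = sum(d.reference_score for d in demos) / len(demos)
    return lambda text: mean_score


# Made-up reference scores, used only to exercise the search loop.
train = [ScoredExample(f"train-{i}", s) for i, s in enumerate([0.0, 0.25, 0.5, 0.75, 1.0])]
dev = [ScoredExample("dev-0", 0.5), ScoredExample("dev-1", 0.75)]
best_demos, best_loss = random_search_demos(train, dev, toy_build_predictor)
```

In a real setup, `build_predictor` would wrap a prompt that includes the selected demos, and the reference scores would come from the LCEL metric implementation (or from human-corrected scores once the oversight tool is in use).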

Challenges we ran into

DSPy has a steep learning curve and is still a work in progress, so some parts of it don't work as expected.
Our project grew iteratively as our understanding of the problem space grew, so we had to do some steps sequentially, which cost us time.

Accomplishments that we're proud of

Team members from different parts of the world came together and pooled their skills towards our common goal of building a set of domain-optimized metrics.

What we learned

We gained greater insight into the RAGAS metrics once we implemented them ourselves. We gained additional insight when building the tool using the intermediate outputs.
Our team was not familiar with DSPy at all. We learned to use it and are very impressed with its capabilities.

What's next for llm-rag-eval

We noticed that most of our metrics involve predictive steps, where we predict a binary outcome given a pair of strings. These look like variants of NLI (Natural Language Inference), which could be handled by non-LLM models. Such models are not only cheaper but also don't suffer from hallucinations, leading to more repeatable evaluations. Training them will require more data, so we are starting to generate synthetic data, but this work has other dependencies that must be resolved before we can start to offload these steps to smaller models.
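The shape of such a step can be sketched as a pluggable pair classifier. This is only an illustration under stated assumptions: `PairClassifier`, `fraction_supported`, and `token_overlap_entails` are hypothetical names, and the token-overlap rule is a toy stand-in, not an NLI model. The point is that the metric only needs a function from two strings to a boolean, so an LLM call and a smaller trained model are interchangeable behind the same interface.

```python
from typing import Callable, Sequence

# Any function that takes (premise, hypothesis) and returns a yes/no verdict.
PairClassifier = Callable[[str, str], bool]


def fraction_supported(
    context: str, statements: Sequence[str], entails: PairClassifier
) -> float:
    """Score = share of statements the classifier judges as supported by the context."""
    if not statements:
        return 0.0
    verdicts = [entails(context, statement) for statement in statements]
    return sum(verdicts) / len(verdicts)


def token_overlap_entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for an NLI model: true when most hypothesis tokens occur in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return False
    overlap = len(hypothesis_tokens & premise_tokens) / len(hypothesis_tokens)
    return overlap >= threshold


context = "paris is the capital of france"
statements = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]
score = fraction_supported(context, statements, token_overlap_entails)
```

Because the stand-in classifier is deterministic, repeated runs give the same score, which is the repeatability property the paragraph above is after.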

Blog

Here is the writeup in the form of a blog

Built With

dspy
gemini
langchain
python

Submitted to

Google AI Hackathon

Created by

Dave Campbell
Made a data reformatting utility, and also created a program to generate synthetic questions, answers, and contexts based on existing QA datasets. Helped test and troubleshoot Sujit's core implementations, and started the human-in-the-loop feedback tooling for the assisted learning components.

Mayank Bhaskar
Research interests include anything that moves, or processing things that are on the audible spectrum.
Discussed ideas and brought in the DSPy metrics. We also plan to add some UX in the mix, along with Label Studio in the future.

Sujit Pal
I built the LCEL and DSPy implementations of the RAGAS metrics, and came up with the idea of doing human-in-the-loop active learning on the LCEL metric outputs to further optimize the DSPy metrics.

Updates

Sujit Pal started this project — Apr 25, 2024 01:42 PM EDT
