Inspiration
Our project is inspired by the RAGAS project, which defines and implements 8 metrics for evaluating the inputs and outputs of a Retrieval Augmented Generation (RAG) pipeline, and by ideas from the ARES paper, which attempts to calibrate such LLM evaluators against human evaluators.
What it does
It provides an LLM-based framework for evaluating the performance of RAG systems, using a set of metrics that are optimized for the application domain in which the RAG system operates. The framework uses Gemini Pro 1.0 from Google AI as its LLM, and the Google AI embedding model to generate embeddings for some of the metrics.
How we built it
- We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) so we could access outputs of intermediate steps in metrics calculation.
- We then implemented the metrics using DSPy (Declarative Self-improving Language Programs in Python) and optimized the prompts to minimize the difference from the scores produced by the LCEL implementation, using a subset of examples for Few Shot Learning (using Bootstrap Few Shot with Random Search).
- We evaluated the confidence of scores produced by LCEL and DSPy metric implementations.
- We are building a tool that allows human oversight of the LCEL outputs (including intermediate steps) for Active Learning supervision.
- We will re-optimize the DSPy metrics using recalculated scores based on tool updates.
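To make the steps above more concrete, here is a minimal sketch of a metric that keeps its intermediate outputs, together with a score-agreement function of the kind a prompt optimizer could maximize. It is loosely modeled on a faithfulness-style metric (the fraction of answer statements supported by the context) and is not our actual LCEL or DSPy code. All names here (`MetricResult`, `faithfulness_like`, `score_agreement`, and the toy helpers) are hypothetical, and the two callables stand in for LLM calls so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MetricResult:
    """Final score plus the intermediate outputs a reviewer may want to inspect."""
    score: float
    statements: List[str] = field(default_factory=list)
    verdicts: List[bool] = field(default_factory=list)


def faithfulness_like(
    answer: str,
    context: str,
    extract_statements: Callable[[str], List[str]],
    is_supported: Callable[[str, str], bool],
) -> MetricResult:
    """Score = fraction of answer statements judged supported by the context.

    Both callables are placeholders for LLM calls (hypothetical, for illustration).
    """
    statements = extract_statements(answer)
    verdicts = [is_supported(s, context) for s in statements]
    score = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return MetricResult(score=score, statements=statements, verdicts=verdicts)


def score_agreement(reference_score: float, candidate_score: float) -> float:
    """Return 1.0 when two implementations agree exactly, 0.0 when they differ by 1."""
    return 1.0 - abs(reference_score - candidate_score)


# Toy stand-ins so the sketch runs without an LLM.
def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def substring_support(statement: str, context: str) -> bool:
    return statement.lower() in context.lower()


result = faithfulness_like(
    answer="Paris is in France. Paris has 90 million residents.",
    context="Paris is in France. It is the capital city.",
    extract_statements=split_sentences,
    is_supported=substring_support,
)
print(result.score, result.statements, result.verdicts)
```

Keeping `statements` and `verdicts` on the result is what lets a human reviewer correct an individual verdict and have the score recalculated, which is the kind of feedback the re-optimization step would consume.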
Challenges we ran into
- DSPy has a steep learning curve and is still a work in progress, so some parts of it don't work as expected.
- Our project grew iteratively as our understanding of the problem space grew, so we had to do some steps sequentially, which cost us time.
Accomplishments that we're proud of
- How team members from different parts of the world came together and pooled their skills towards our common goal of building a set of domain-optimized metrics.
What we learned
- We gained greater insight into the RAGAS metrics once we implemented them ourselves. We gained additional insight when building the tool using the intermediate outputs.
- Our team was not familiar with DSPy at all. We learned to use it and are very impressed with its capabilities.
What's next for llm-rag-eval
- We notice that most of our metrics involve predictive steps, where we predict a binary outcome given a pair of strings. These seem like variants of NLI (Natural Language Inference) and could be handled by non-LLM models, which are not only cheaper but also don't suffer from hallucinations, leading to more repeatable evaluations. Training such models will require more data, so we are starting to generate synthetic data, but this has other dependencies that must be resolved before we can start to offload these steps to smaller models.
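As a rough illustration of that idea, the binary "pair of strings in, verdict out" step can be hidden behind a small interface so that an LLM call and a smaller trained model are interchangeable. The sketch below is an assumption about how this could look, not something we have built: `PairClassifier`, `TokenOverlapClassifier`, and `supported_fraction` are hypothetical names, and the token-overlap rule is only a deterministic placeholder for a trained NLI model.

```python
from typing import List, Protocol


class PairClassifier(Protocol):
    """Anything that maps a (premise, hypothesis) pair to a binary verdict."""

    def predict(self, premise: str, hypothesis: str) -> bool: ...


class TokenOverlapClassifier:
    """Deterministic toy baseline, NOT a real NLI model."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold

    def predict(self, premise: str, hypothesis: str) -> bool:
        premise_tokens = set(premise.lower().split())
        hypothesis_tokens = hypothesis.lower().split()
        if not hypothesis_tokens:
            return False
        # Share of hypothesis tokens that also appear in the premise.
        covered = sum(tok in premise_tokens for tok in hypothesis_tokens)
        return covered / len(hypothesis_tokens) >= self.threshold


def supported_fraction(
    classifier: PairClassifier, context: str, statements: List[str]
) -> float:
    """Fraction of statements the classifier judges to follow from the context."""
    if not statements:
        return 0.0
    verdicts = [classifier.predict(context, s) for s in statements]
    return sum(verdicts) / len(verdicts)


clf = TokenOverlapClassifier()
fraction = supported_fraction(
    clf,
    context="the cat sat on the mat",
    statements=["the cat sat", "the dog barked loudly"],
)
print(fraction)
```

Because the placeholder is deterministic, repeated calls on the same pair return the same verdict, which is the repeatability property we hope to gain from smaller models.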
Blog
- Here is the writeup in the form of a blog