Finance PDF RAG QA Evaluator

Inspiration

Finance teams already run RAG chatbots against 200-page reports, yet no one can quickly prove the answers are correct. One wrong figure can move markets, so we built a lightweight evaluator that tells you, in minutes, how trustworthy your own model is.

What it does

  1. Point to a folder of PDFs and pick a question count.
  2. We chunk every report, then call Perplexity Sonar to write finance-aware questions that name the company and fiscal year.
  3. Your RAG model answers; a second Sonar call scores each answer for factual accuracy, completeness, and clarity.
  4. Results land in a CSV and an interactive plot, highlighting exactly where the model hallucinates and what to fix next.
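The steps above can be sketched with a few helpers. This is an illustrative outline, not our exact code: the prompt wording, score field names, and CSV columns are assumptions, and the actual Sonar API calls are omitted.

```python
import csv
import json

def build_question_prompt(chunk: str, company: str, fiscal_year: str) -> str:
    """Ask Sonar for one finance-aware question naming the company and year."""
    return (
        "You are writing evaluation questions for a finance RAG system.\n"
        f"Company: {company}. Fiscal year: {fiscal_year}.\n"
        "Write ONE question answerable only from this excerpt:\n"
        f"{chunk}"
    )

def parse_judge_scores(raw: str) -> dict:
    """Parse the judge's JSON reply into the three rubric scores."""
    data = json.loads(raw)
    return {k: float(data[k]) for k in ("accuracy", "completeness", "clarity")}

def write_results(rows: list[dict], path: str = "results.csv") -> None:
    """Dump one row per question into the CSV that feeds the plot."""
    fields = ["question", "answer", "accuracy", "completeness", "clarity"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

The same judge prompt asks for a JSON object, which keeps score parsing to a single `json.loads` plus a key check.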

How we built it

LangChain wires everything together: PyMuPDF extracts the text, and RecursiveCharacterTextSplitter cuts it into 5,000-token chunks. Sonar serves as both question writer and judge, and the whole loop runs locally in a few minutes.
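The splitter's core idea is simple: try coarse separators first (paragraphs, then lines, then words) and only fall back to hard cuts when nothing fits. A stdlib-only approximation of that recursion, to show the mechanism rather than LangChain's actual implementation:

```python
def recursive_split(text, chunk_size=5000, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters, preferring
    to break at the coarsest separator that yields pieces that fit."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # this separator never occurs; try a finer one
        chunks, current = [], ""
        for part in parts:
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= chunk_size:
                current = candidate  # keep accumulating into this chunk
            else:
                if current:
                    chunks.append(current)
                if len(part) > chunk_size:
                    # piece is still too big: recurse with finer separators
                    chunks.extend(recursive_split(part, chunk_size, separators))
                    current = ""
                else:
                    current = part
        if current:
            chunks.append(current)
        return chunks
    # no separator occurs at all: hard character cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

In the real pipeline, the text fed in comes from `page.get_text()` over each PyMuPDF page.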

Challenges

  • Context balance: giving Sonar enough text so questions make sense without leaking answers.
  • Messy data: cleaning scanned PDFs and broken tables.
  • Performance: keeping evaluation fast on a laptop.
  • Sonar gaps: no embeddings endpoint and thin docs, so we spent extra hours researching work-arounds and prompt formats.
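With no embeddings endpoint, chunk-to-question similarity has to be computed locally. One stdlib-only work-around in that spirit (an illustrative sketch, not our exact code) is smoothed TF-IDF with cosine similarity:

```python
import math
import re
from collections import Counter

def _tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Smoothed TF-IDF vectors (as sparse dicts) for a small corpus."""
    tokenized = [_tokens(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # +1 smoothing keeps terms shared by every document from zeroing out
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A local embedding model would score better, but even this keeps the evaluator free of a second API dependency.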

Accomplishments

Although only one teammate had prior RAG experience, we ramped up in 10 evenings and delivered a one-command demo, from ingest to dashboard.

What we learned

Real documents are far messier than any tutorial; most engineering time goes to parsing and optimization, not to glamorous LLM calls.

What’s next

  • Batch evaluation: send whole batches of question-answer pairs to the judge in one call, cutting evaluation time from minutes to seconds.
  • PyPI package: pip install rag-qa-eval for instant drop-in use.
  • Domain plug-ins: legal, healthcare, retail modules with their own prompts and checks.
  • Drag-and-drop web UI: non-devs can upload PDFs and get scores.
  • CI badge: build fails automatically when trust scores fall below a threshold.
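The first roadmap item, batch evaluation, mostly comes down to prompt packing and strict response validation. A hedged sketch with hypothetical helper names:

```python
import json

def build_batch_judge_prompt(pairs):
    """One prompt covering many Q/A pairs, so the judge is called once per batch."""
    header = (
        "Score each answer for accuracy, completeness, and clarity (0-1). "
        "Reply with a JSON array of objects, one per pair, in order."
    )
    body = "\n".join(f"{i}. Q: {q}\n   A: {a}"
                     for i, (q, a) in enumerate(pairs, 1))
    return f"{header}\n{body}"

def parse_batch_scores(raw, expected):
    """Validate the judge's JSON array before trusting it."""
    scores = json.loads(raw)
    if len(scores) != expected:
        raise ValueError(f"expected {expected} score objects, got {len(scores)}")
    return scores
```

The length check matters: judges occasionally drop or merge items in long batches, and a silent mismatch would misalign every score after it.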

Built With

langchain, pymupdf, perplexity-sonar, python
