Inspiration

While preparing for an interview with a fintech startup, I wanted to deeply understand their space. I discovered that small business loan underwriting is still heavily manual, relying on tedious document review. I thought: what if I could use LLMs and retrieval-augmented generation (RAG) to accelerate that process? That question led me to build this prototype system to assist underwriters in evaluating SBA loan applications.

What it does

This tool allows a loan underwriter to upload financial documents like tax returns, bank statements, and payroll reports. It extracts text from the documents using OCR, stores it in a vector database, and enables intelligent question-answering and report generation using a Claude-powered RAG pipeline. The user can also generate a structured underwriter evaluation draft, editable before export.

How I built it

Backend:

  • FastAPI
  • OCR: Tesseract + pdf2image for PDF-to-image conversion
  • LLM: Claude via the Anthropic API
  • Vector DB: ChromaDB with default embedding model (all-MiniLM-L6-v2)

Frontend:

  • Next.js

Architecture:

  • PDF Uploads -> OCR w/ tesseract converts into text
  • Text is stored as embeddings in ChromaDB

Challenges I ran into

  • Understanding what a real underwriter evaluation report looks like
  • Managing long input contexts and keeping the system fast and responsive

Accomplishments that I'm proud of

  • works as intended!

What I learned

  • OCR-RAG-LLM integrations!

What's next for Financial Document MCP-RAG System

  • chunking up the text instead of writing it all completely
  • custom embedding model instead of default embedding, to support semantic search instead of just normal embedding
  • having the LLM be able to highlight places in the pdf

Built With

Share this project:

Updates