Legal Document QA & Clause Extraction using NLP + Vector DB

Inspiration

Contracts and other legal documents are long, written in dense legal language, and often stored as PDFs. Legal teams, compliance officers, and business users spend a lot of time searching for specific clauses (termination, indemnity, confidentiality, etc.). We wanted to build a lightweight tool that helps non-experts quickly find the exact clause they need, surfaces precedent language, and provides provenance (which contract and clause the result came from) so reviewers can act faster and more confidently.

What it does

- Ingests contract text files (TXT; future: PDFs/OCR).
- Splits documents into clause-level chunks using headings and fallback chunking.
- Stores clause metadata (contract, clause id, text hash) in a small relational DB (SQLite for the hackathon).
- Computes dense semantic embeddings for each clause (sentence-transformers) and stores them in a local vector index (FAISS).
- Exposes a simple API (FastAPI) to:
  - Run semantic search (query → top-k clauses with scores and provenance).
  - Retrieve a full clause by id.
- (Optional / next step) Provides a demo UI (Streamlit) to type queries, view results, and upload contracts.

Why this matters: instead of manually reading the whole contract to find a clause, users can query in natural language (e.g., “Can the customer terminate for convenience?”) and get the most relevant clause snippets with links back to the source.
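As a rough illustration of the clause-splitting and text-hash steps described above, here is a minimal sketch using only the Python standard library. The heading pattern, the function names (`split_into_clauses`, `clause_records`), and the record fields are assumptions made for this example, not the project's actual code.

```python
import hashlib
import re

# Hypothetical heading pattern: a line starting with "1.", "2.3", "Section 4:",
# or "Article IV". Real contracts vary far more than this, so treat it as a
# starting point. It will also misfire on body lines that begin with a number.
HEADING_RE = re.compile(
    r"^(?:(?:section|article)\s+[\dIVXLC]+|\d+(?:\.\d+)*)[.:)]?\s+\S",
    re.IGNORECASE | re.MULTILINE,
)


def split_into_clauses(text):
    """Split on detected headings; fall back to blank-line paragraphs."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    if starts:
        if starts[0] > 0:
            # Keep any preamble that appears before the first heading.
            starts.insert(0, 0)
        ends = starts[1:] + [len(text)]
        chunks = [text[a:b] for a, b in zip(starts, ends)]
    else:
        # Fallback: no headings found, so split on blank lines.
        chunks = re.split(r"\n\s*\n", text)
    return [c.strip() for c in chunks if c.strip()]


def clause_records(contract_name, text):
    """Build one metadata record per clause, including a SHA-256 text hash."""
    records = []
    for idx, clause in enumerate(split_into_clauses(text), start=1):
        records.append(
            {
                "contract": contract_name,
                "clause_id": idx,
                "text": clause,
                "text_hash": hashlib.sha256(clause.encode("utf-8")).hexdigest(),
            }
        )
    return records


SAMPLE = (
    "MASTER SERVICES AGREEMENT\n\n"
    "1. Term\nThis Agreement lasts one year.\n\n"
    "2. Termination\nEither party may terminate for convenience on written notice.\n\n"
    "Section 3: Confidentiality\nEach party shall protect the other's information.\n"
)
```

Calling `clause_records("msa.txt", SAMPLE)` yields one record for the preamble and one per detected heading. The hash lets a reviewer confirm that the stored clause text has not changed since ingestion.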

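To make the semantic search step more concrete, the following dependency-free sketch shows the kind of scoring a flat inner-product index performs: on unit-length vectors, the inner product equals cosine similarity, and the top-k highest scores are returned with their provenance. The function names and the tiny 3-dimensional vectors are invented for illustration. The prototype itself relies on sentence-transformers and FAISS rather than this code.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query_vec, clause_vecs, clause_meta, k=3):
    """Return the k clauses with the highest inner product against the query.

    clause_meta is a list of dicts (e.g. clause id and contract name) aligned
    with clause_vecs, so every hit carries its provenance.
    """
    q = normalize(query_vec)
    scored = []
    for vec, meta in zip(clause_vecs, clause_meta):
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        scored.append({"score": score, **meta})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:k]


# Toy data: made-up 3-d "embeddings" standing in for real model output.
CLAUSE_VECS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
CLAUSE_META = [
    {"clause_id": 1, "contract": "msa.txt"},
    {"clause_id": 2, "contract": "msa.txt"},
    {"clause_id": 3, "contract": "nda.txt"},
]
```

`top_k([1.0, 0.05, 0.0], CLAUSE_VECS, CLAUSE_META, k=2)` returns the two closest clauses, each with its score, clause id, and source contract. A brute-force loop like this is only practical for small corpora.
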
How we built it

- Data model: SQLAlchemy models (Contract, Clause) in SQLite for quick prototyping and auditability.
- Clause extraction: heuristic + regex-based header detection with a paragraph fallback. (Planned: token-based chunking with overlaps.)
- Embeddings: sentence-transformers (all-MiniLM-L6-v2) to produce normalized vectors quickly.
- Vector DB: FAISS (IndexFlatIP initially) to keep the system local and fast for small corpora.
- API: FastAPI provides /search and /clause/{id} endpoints.
- Pipeline: an ingestion script that reads .txt files, splits them into clauses, inserts the clauses into the DB, computes embeddings, builds the FAISS index, and maps FAISS ids to clause DB rows.
- Tools: Python, pandas, SQLAlchemy, sentence-transformers, FAISS, FastAPI, uvicorn. Repo scaffolded for a 48–72 hour hackathon.

Challenges we ran into

- Clause segmentation variability: legal documents use many formats (numbering styles, multi-line headings, lists, inline clauses). Simple regex rules are brittle.
- Chunk-size tradeoffs: chunks that are too large reduce retrieval precision; chunks that are too small lose important context.
- Ranking noise: off-the-shelf bi-encoders capture general similarity but miss subtle legal distinctions; the returned clauses are sometimes related to the query without being the exact answer.
- PDF / OCR complexity: real contracts are often PDFs with headers/footers and two-column layouts. Text extraction and layout parsing introduce noise.
- Incremental indexing: naive FAISS usage required rebuilding the index when adding documents. We need an IDMap strategy for safe incremental updates.
- Privacy risks: contracts often contain confidential information. Sending text to external APIs or public demos without safeguards is risky.
- Limited labeled data: evaluating retrieval accuracy rigorously requires labeled clause spans (CUAD helps, but coverage is limited).

Accomplishments that we're proud of

- Built a complete end-to-end prototype inside a hackathon-friendly scaffold: ingestion → DB → embeddings → FAISS → API.
- Clause-level indexing and provenance: each result includes the clause id and source contract so users can inspect the original text.
- Demo-ready functionality: with one sample contract, you can ingest, run semantic queries, and fetch clauses in a few commands.
- A clear, incremental roadmap so the project can be extended quickly to handle uploads, PDFs, reranking, and RAG.
- Designed with auditability in mind (text_hash, clause ids, DB traces) so outputs can be verified manually, which is important for legal use cases.

What we learned

- Chunking matters more than model size for retrieval quality: a small, well-chunked corpus with overlap often outperforms larger chunks with a stronger model.
- Two-stage retrieval works best for precision: use a fast bi-encoder for recall and a small cross-encoder as a reranker for the top-k results.
- Provenance is essential: users trust results more when they can click through to the exact clause and document.
- Start local and offline: for demos and early development, a local FAISS index + SQLite is fast, inexpensive, and private.
- Real-world text ingestion requires robust preprocessing: preserving page boundaries, removing headers/footers, and handling OCR artifacts are necessary steps for production readiness.

What's next for Legal Document QA & Clause Extraction using NLP + Vector DB

High-priority next steps

- Upload + incremental indexing: Implement an upload endpoint and use FAISS IndexIDMap with Clause.id as persistent ids so new documents can be appended without full reindexing.
- Better clause splitting: Add token-based chunking with configurable max tokens and overlap (e.g., 250-token chunks with 50-token overlap). Integrate spaCy / sentence segmentation and enhanced heading detection.
- Reranking: Add a small cross-encoder reranker for the top-k candidates to improve precision on legal queries.
- PDF & OCR ingestion: Integrate pdfplumber or PyMuPDF for PDF text extraction and pytesseract for scanned documents, preserving page offsets for provenance.
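
The planned token-based chunking with overlap (250-token chunks with a 50-token overlap) could look roughly like the sketch below. It is an assumption-laden illustration: the function names are hypothetical, and whitespace-separated words stand in for real model tokens, which a production version would take from the embedding model's tokenizer.

```python
def chunk_tokens(tokens, max_tokens=250, overlap=50):
    """Slide a window of max_tokens over tokens, stepping max_tokens - overlap.

    Consecutive chunks share `overlap` tokens, so text that straddles a chunk
    boundary still appears whole in at least one chunk.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            # This window already reaches the end; stop to avoid a tail
            # chunk that is fully contained in the previous one.
            break
    return chunks


def chunk_text(text, max_tokens=250, overlap=50):
    """Chunk raw text, using whitespace-separated words as stand-in tokens."""
    return [" ".join(c) for c in chunk_tokens(text.split(), max_tokens, overlap)]
```

With the default settings, a 600-word clause becomes three chunks covering words 0–249, 200–449, and 400–599. Making both parameters configurable keeps the chunk-size tradeoff easy to tune against retrieval quality.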