Inspiration

Compliance folks spend days turning dense circulars (RBI/SEBI/PCI/ISO/SOC2/GDPR) into impact notes, owners, and evidence. Existing tools are either SaaS products (your data leaves your laptop) or opaque LLM wrappers (prone to hallucination). We wanted local-first, deterministic, explainable analysis that teams can trust.

What it does

  • Ingests new circulars/standards (PDF/links) and optional baselines
  • Extracts obligations with a rule-first pipeline (no external APIs)
  • Maps obligations to your control catalogs using local embeddings + FAISS
  • Plans actions (owner, effort, due date) and evidence to collect
  • Explains every match with source snippets and similarity scores
  • Audits everything with a SHA-256 hash-chained log
  • Exports JSON/CSV + a human-reviewable plan
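
To make the rule-first extraction step concrete, here is a minimal sketch rather than RegDelta's actual code. The cue list, the naive sentence splitter, and the names OBLIGATION_CUES, Obligation, and extract_obligations are assumptions made up for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical cue lexicon; the real lexicon is not shown in this write-up.
# Negated forms come first so "shall not" wins over "shall" at the same position.
OBLIGATION_CUES = re.compile(
    r"\b(shall not|must not|shall|must|prohibit(?:s|ed)?)\b",
    re.IGNORECASE,
)
# Naive splitter: break on whitespace that follows "." or ";".
SENTENCE_SPLIT = re.compile(r"(?<=[.;])\s+")


@dataclass(frozen=True)
class Obligation:
    text: str  # the sentence, kept verbatim so it can be shown as a snippet
    cue: str   # the cue that triggered the match
    kind: str  # "prohibition" or "requirement"


def extract_obligations(text: str) -> list[Obligation]:
    """Return every sentence containing an obligation cue, in document order."""
    found = []
    for sentence in SENTENCE_SPLIT.split(text):
        sentence = sentence.strip()
        match = OBLIGATION_CUES.search(sentence)
        if match is None:
            continue  # no cue, so no obligation is emitted
        cue = match.group(1).lower()
        negative = cue.endswith(" not") or cue.startswith("prohibit")
        found.append(
            Obligation(sentence, cue, "prohibition" if negative else "requirement")
        )
    return found


sample = (
    "Banks shall encrypt cardholder data at rest. "
    "Entities must not store the CVV after authorization. "
    "This circular is issued for information."
)
obligations = extract_obligations(sample)
for ob in obligations:
    print(ob.kind, "|", ob.cue, "|", ob.text)
```

Because the output depends only on the text and the cue list, the same circular always yields the same obligations, and each one carries the exact sentence it came from.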

How we built it

  • UI: Streamlit (dark theme, multipage views)
  • Parsing: PyMuPDF/pdfplumber + layout normalization + clause detectors
  • Extraction: lexicon + grammar/regex cues (“shall/must/prohibit”), section heuristics
  • Mapping: sentence-transformers (local) → FAISS index of PCI/RBI/etc. controls; hybrid score = BM25-style keyword score + cosine similarity
  • Explainability: per-match snippets, scores, and thresholds (auto-accept / review)
  • Auditability: append-only JSONL with SHA-256 chaining
  • Storage: local JSON (SQLite on the roadmap). No external LLM/API calls at runtime
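
The hybrid score and the auto-accept / review thresholds can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: a plain token-overlap (Jaccard) measure stands in for the BM25-style keyword score, the vectors are toy values rather than sentence-transformers output, and the weight alpha, both thresholds, and all function names are hypothetical.

```python
import math
import re


def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens (a stand-in for a BM25-style score)."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def triage(score: float, accept_at: float = 0.75, review_at: float = 0.45) -> str:
    """Map a score to a decision; both thresholds are assumed values."""
    if score >= accept_at:
        return "auto-accept"
    if score >= review_at:
        return "review"
    return "no-match"


def explain_match(obligation, control, ob_vec, ctl_vec, alpha=0.4):
    """Blend keyword and semantic similarity and keep the parts for display."""
    kw = keyword_overlap(obligation, control)
    sem = cosine(ob_vec, ctl_vec)
    score = alpha * kw + (1 - alpha) * sem
    return {"keyword": kw, "cosine": sem, "score": score, "decision": triage(score)}


# Toy vectors; in the pipeline described above these would be local embeddings.
strong = explain_match(
    "Encrypt cardholder data", "encrypt cardholder data", [1.0, 0.0], [1.0, 0.0]
)
weak = explain_match("log retention", "encrypt data", [1.0, 0.0], [0.0, 1.0])
print(strong["decision"], weak["decision"])
```

Returning the keyword and cosine components alongside the blended score is what lets each match be shown with its reasons instead of a single opaque number.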

Challenges we ran into

  • Wild PDF layouts (tables/headers/2-column) → layout-aware preprocessor
  • Over-matching on generic terms (“encryption”) → negative cues + section weighting
  • Speed vs accuracy on large docs → chunking + cached embeddings
  • “Zero-hallucination” UX → rules are authoritative; embeddings only suggest
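
One way to picture the chunking and cached-embeddings trade-off is the sketch below. It is a simplified illustration, not the project's code: the chunk size, the overlap, the content-hash cache key, and the placeholder fake_embed function (standing in for a local sentence-transformers model) are all assumptions.

```python
import hashlib
from typing import Callable


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows of `size` words sharing `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    # Stop once the remaining words are already covered by the previous window.
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]


class EmbeddingCache:
    """Memoize embeddings by the SHA-256 of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # how many times the model was actually called

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(chunk)
            self.misses += 1
        return self._store[key]


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding (length and space count), not a real model."""
    return [float(len(text)), float(text.count(" "))]


chunks = chunk_words("a b c d e f g h i j", size=4, overlap=1)
cache = EmbeddingCache(fake_embed)
first_pass = [cache.get(c) for c in chunks]
second_pass = [cache.get(c) for c in chunks]  # served entirely from the cache
print(chunks, cache.misses)
```

Keying the cache on the chunk's hash means an unchanged clause in a revised circular never has to be embedded twice.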

Accomplishments that we’re proud of

  • 100% local pipeline that works offline
  • Deterministic extraction with a fully traceable “why” for every mapping
  • One-click export that reads like an auditor’s action plan
  • Tamper-evident audit trail from day one
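
For readers unfamiliar with hash chaining, a tamper-evident log can be sketched in a few lines. This is a minimal in-memory illustration, not RegDelta's actual log format: the field names (prev, payload, hash), the all-zero genesis value, and the function names are assumptions. In an append-only JSONL file, each entry would be written as one json.dumps line.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], payload: dict) -> dict:
    """Append a new entry whose hash commits to everything before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"action": "ingest", "doc": "circular.pdf"})
append_entry(log, {"action": "map", "obligation": 1, "control": "example-control"})
intact = verify_chain(log)

log[0]["payload"]["doc"] = "other.pdf"  # simulate tampering with an old entry
tampered_ok = verify_chain(log)
print(intact, tampered_ok)
```

Since each hash covers the previous one, changing any earlier entry invalidates every hash after it, which is what makes the trail tamper-evident rather than merely append-only.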

What we learned

  • GRC users trust citations & snippets over scores alone
  • Small local models + rules beat naive prompts for regulated text
  • Good defaults (prebuilt PCI/RBI catalogs) massively cut setup friction

What’s next for RegDelta

  • Interactive review (accept/override/reject) & SQLite backend
  • Multi-catalog management and scenario comparison reports
  • OCR fallback, golden-set metrics dashboard
  • Jira/GitHub integrations, lightweight local RBAC, webhooks

Built with

  • Languages/Frameworks: Python 3.10+, Streamlit
  • Parsing & NLP: PyMuPDF, pdfplumber, pdfminer-six, pypdfium2, sentence-transformers, transformers, torch (CPU), FAISS, rapidfuzz, scikit-learn
  • Data/Utils: pandas, numpy, pyyaml, regex, requests, hashlib (SHA-256)
  • Platforms: Local (Windows/macOS/Linux), Streamlit Community Cloud
  • AI assistance (development): Entire codebase authored/refactored with help from Claude Sonnet 4.5; runtime is 100% local (no external LLM/API calls)

Built With

  • claude
  • faiss
  • hashlib
  • numpy
  • pandas
  • pdfminer
  • pymupdf
  • python
  • rapidfuzz
  • sentence-transformers
  • streamlit
  • transformers