Geo-Compliance

Problem Statement

Product teams must ship features that comply with a patchwork of regional regulations. Manual checks are slow, inconsistent, and hard to audit. The goal is to automate compliance discovery by:

  1. Mapping a plain-language feature description or code change to applicable jurisdictions.
  2. Retrieving the most relevant regulatory clauses.
  3. Producing a structured, auditable JSON response with issues, reasoning, and verbatim evidence from the law.
  4. Enabling the same checks to run in an automated gate (pre-commit or pre-PR) for continuous compliance.

Project Features & Functionality

Our project is a Geo-Regulation Compliance Assistant that evaluates product features or code changes against region-specific legal frameworks (for example, the Utah Social Media Regulation Act, California SB-976, and the EU Digital Services Act).

We implement a Retrieval-Augmented Generation (RAG) pipeline:

  1. Regulations are ingested from PDF/HTML/DOCX, chunked with page indices, embedded, and stored in Chroma.
  2. A feature description is provided by the user.
  3. A lightweight region classifier selects the relevant jurisdiction(s).
  4. A glossary expansion step resolves internal acronyms (PF, GH, ASL, BB, EchoTrace, etc.).
  5. The retriever fetches the top-k jurisdiction-specific clauses.
  6. The LLM generates a strict, machine-readable JSON object that includes a compliance flag, issues, reasoning, and evidence quoted from the source.
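
Steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the project's classifier: the keyword lists, the function names, and the glossary expansions in the usage note are all hypothetical, since the source does not give the real glossary definitions or say how the classifier works.

```python
import re

# Hypothetical keyword lists; the real classifier may use a model or richer rules.
REGION_KEYWORDS = {
    "Utah": ["utah"],
    "California": ["california"],
    "EU": ["eu", "european union", "europe"],
}


def classify_regions(text, keywords=REGION_KEYWORDS):
    """Return the regions whose keywords appear as whole words in the text."""
    lowered = text.lower()
    return [
        region
        for region, words in keywords.items()
        if any(re.search(rf"\b{re.escape(word)}\b", lowered) for word in words)
    ]


def expand_glossary(text, glossary):
    """Append each acronym's expansion in parentheses (case-sensitive match)."""
    if not glossary:
        return text
    # Longest terms first so a longer term wins over a shorter one it contains.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.sub(pattern, lambda m: f"{m.group(0)} ({glossary[m.group(0)]})", text)
```

For example, with placeholder expansions, expand_glossary("Enable PF with ASL checks", {"PF": "placeholder-one", "ASL": "placeholder-two"}) yields "Enable PF (placeholder-one) with ASL (placeholder-two) checks", which gives the retriever plain-language terms to match against the regulation text.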

Additional capabilities

  • Developer document evaluator: extracts features from PRDs/dev docs for downstream compliance checks.
  • Code-change evaluator: summarizes feature-level impacts from diffs and maps them to regulatory requirements.
  • Streamlit demo app: interactive UI that runs the same pipeline, displays JSON output, and supports run history logging.
  • CSV logging and history panel: every run is upserted to a CSV and viewable in a collapsible, scrollable log for traceability.
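
The CSV upsert can be sketched roughly as follows. The column names and the run_id key are assumptions made for illustration; the source does not specify the log's schema or what key identifies a run.

```python
import csv
import os

# Hypothetical columns; the project's actual log schema is not given.
FIELDS = ["run_id", "feature", "regions", "compliant"]


def upsert_run(csv_path, row, key="run_id"):
    """Insert the row, or replace the existing row that has the same key."""
    rows = []
    if os.path.exists(csv_path):
        with open(csv_path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))

    # CSV stores text, so normalize every value to a string up front.
    row = {name: str(row.get(name, "")) for name in FIELDS}

    for i, existing in enumerate(rows):
        if existing.get(key) == row[key]:
            rows[i] = row  # update in place, keeping the original position
            break
    else:
        rows.append(row)  # no match found: append as a new run

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Rewriting the whole file on each run is simple and adequate for a demo-sized history; a larger log would likely call for a database or an append-only format instead.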

Models

  • Primary cloud model: Google Gemini (gemini-2.5-flash / gemini-2.5-pro) for structured JSON responses.
  • Local model: meta-llama/Meta-Llama-3-8B-Instruct (quantized for local inference) used as an on-prem fallback and for offline development. The same prompt format and RAG pipeline are used, with adjustments for chat templates and output parsing.
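
One output-parsing adjustment a local model often needs is pulling the JSON object out of surrounding chatter or code fences. The sketch below shows one way to do that with the standard library; the function name and the field names in the usage note are hypothetical, and the project's actual parser may differ.

```python
import json


def extract_json_object(raw):
    """Return the first decodable JSON object found in raw text, or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _ = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)  # try the next candidate brace
    return None
```

Given a reply such as 'Here is the result: {"compliant": false, "issues": ["x"]} Hope it helps', the function returns the parsed dictionary; given text with no decodable object, it returns None so the caller can fall back.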

Development Tools

  • Python 3.10
  • Streamlit for the demo UI
  • Virtualenv/venv for environment management
  • Git for version control
  • Optional local serving: Hugging Face Transformers and quantization utilities; we experimented with vLLM, but the current demo does not require it
  • Command-line utilities and simple batch runners for offline tests

APIs Used

  • Google Generative Language API (Gemini) via langchain-google-genai for structured output and schema-constrained responses.

Note: The local LLM path does not call external APIs; it uses local inference through the transformers stack.

Assets Used

  • Regulation source files stored under regulations/ (e.g., Utah Social Media Regulation Act PDF, California SB-976 HTML, EU DSA HTML).
  • Internal terminology glossary (PF, GH, ASL, BB, EchoTrace, etc.) integrated as a searchable resource to normalize product acronyms.
  • Example PRDs and developer documents used for feature-extraction tests.

Libraries Used

  • LangChain, langchain-community, langchain-google-genai for LLM orchestration, prompts, and chains
  • ChromaDB for vector storage and retrieval
  • sentence-transformers (all-MiniLM-L6-v2) for embeddings
  • Hugging Face Transformers for local LLM inference
  • Streamlit for the UI
  • pandas/CSV for result storage and history logging
  • Pydantic/json for schema handling and robust parsing

Additional Datasets (Beyond the Problem Statement)

  • Internal Terminology Glossary: A small, curated dataset of product acronyms and operational terms used to disambiguate feature descriptions before retrieval.
  • Sample PRD/Dev-Doc Corpus: A synthetic set of product requirement documents used to test the feature-extraction and developer-doc evaluation pipelines.

Notes on Integration and CI/CD

  • The evaluation pipeline is callable from a CLI or service endpoint and can run in a pre-commit or pre-PR job.
  • When issues are detected, the pipeline returns a strictly structured JSON artifact that can be attached to the PR for legal review; when insufficient context exists, a deterministic fallback JSON is emitted.
  • The Streamlit demo uses the exact same backend logic, ensuring parity between interactive demos and CI checks.
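
The fallback and gate behavior described above can be sketched as follows. The key names, the fallback wording, and the choice to fail the gate when compliance is undetermined are assumptions for illustration, not the project's confirmed behavior.

```python
import copy
import json

# Hypothetical key names for the flag, issues, reasoning, and evidence fields.
REQUIRED_KEYS = ("compliant", "issues", "reasoning", "evidence")

# Fixed fallback so the same input failure always yields the same artifact.
FALLBACK = {
    "compliant": None,
    "issues": [],
    "reasoning": "Insufficient context to evaluate.",
    "evidence": [],
}


def normalize_result(raw_text):
    """Parse model output; return a copy of FALLBACK if invalid or incomplete."""
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return copy.deepcopy(FALLBACK)
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return copy.deepcopy(FALLBACK)
    # Drop unexpected keys so the artifact shape stays stable.
    return {k: data[k] for k in REQUIRED_KEYS}


def gate_exit_code(result):
    """Return 0 only when compliant is True; 1 for issues or an undetermined result."""
    return 0 if result["compliant"] is True else 1
```

A pre-commit or pre-PR job could write the normalized result to a file, attach it to the PR, and pass gate_exit_code(result) to sys.exit so that a non-zero status blocks the change until it is reviewed.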
