Geo-Compliance

Problem Statement

Product teams must ship features that comply with a patchwork of regional regulations. Manual checks are slow, inconsistent, and hard to audit. The goal is to automate compliance discovery by:

Mapping a plain-language feature description or code change to applicable jurisdictions.
Retrieving the most relevant regulatory clauses.
Producing a structured, auditable JSON response with issues, reasoning, and verbatim evidence from the law.
Enabling the same checks to run in an automated gate (pre-commit or pre-PR) for continuous compliance.

Project Features & Functionality

Our project is a Geo-Regulation Compliance Assistant that evaluates product features or code changes against region-specific legal frameworks (for example, the Utah Social Media Regulation Act, California SB-976, and the EU Digital Services Act).

We implement a Retrieval-Augmented Generation (RAG) pipeline:

Regulations are ingested from PDF/HTML/DOCX, chunked with page indices, embedded, and stored in Chroma.
A feature description is provided by the user.
A lightweight region classifier selects the relevant jurisdiction(s).
A glossary expansion step resolves internal acronyms (PF, GH, ASL, BB, EchoTrace, etc.).
The retriever fetches the top-k jurisdiction-specific clauses.
The LLM generates a strict, machine-readable JSON object that includes a compliance flag, issues, reasoning, and evidence quoted from the source.
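
The two query-side steps (glossary expansion and region classification) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's implementation: the glossary expansions and keyword lists below are placeholders invented for the example, and the names expand_glossary and classify_regions are hypothetical.

```python
import re

# Placeholder glossary: these expansions are illustrative assumptions,
# not the project's actual internal definitions.
GLOSSARY = {
    "PF": "personalized feed",
    "ASL": "age-sensitive logic",
}

# Hypothetical keyword lists per jurisdiction; a real classifier may differ.
REGION_KEYWORDS = {
    "utah": ["utah"],
    "california": ["california", "sb-976"],
    "eu": ["eu", "european union", "dsa"],
}


def expand_glossary(text, glossary=GLOSSARY):
    """Append the expansion after each whole-word acronym, e.g. 'PF (personalized feed)'."""
    if not glossary:
        return text
    # Longest terms first so a longer acronym wins over a shorter prefix.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, lambda m: f"{m.group(0)} ({glossary[m.group(0)]})", text)


def classify_regions(text, keywords=REGION_KEYWORDS):
    """Return the sorted list of jurisdictions whose keywords appear as whole words."""
    lowered = text.lower()
    return sorted(
        region
        for region, words in keywords.items()
        if any(re.search(r"\b" + re.escape(w) + r"\b", lowered) for w in words)
    )


query = expand_glossary("Disable PF for minors in Utah")
regions = classify_regions(query)
```

Expanding acronyms before classification and retrieval means the embedding model sees ordinary words rather than internal shorthand, which is the reason the pipeline runs this step ahead of the retriever.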

Additional capabilities

Developer document evaluator: extracts features from PRDs/dev docs for downstream compliance checks.
Code-change evaluator: summarizes feature-level impacts from diffs and maps them to regulatory requirements.
Streamlit demo app: interactive UI that runs the same pipeline, displays JSON output, and supports run history logging.
CSV logging and history panel: every run is upserted to a CSV and viewable in a collapsible, scrollable log for traceability.
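
The CSV upsert behind the history panel could look roughly like the sketch below. The column names, the run_id key, and the upsert_run helper are assumptions made for illustration; the source only states that every run is upserted to a CSV.

```python
import csv
import os

# Hypothetical column layout for the run log.
FIELDS = ["run_id", "feature", "compliance_flag", "timestamp"]


def upsert_run(path, row, key="run_id"):
    """Write `row` to the CSV at `path`, replacing any existing row with the same key.

    Returns the number of rows in the file after the write.
    """
    rows = []
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))

    for i, existing in enumerate(rows):
        if existing.get(key) == str(row[key]):
            rows[i] = row  # same run: overwrite in place
            break
    else:
        rows.append(row)  # new run: append

    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Rewriting the whole file on each run is simple and adequate for a demo-sized log; a larger history would likely call for a database instead.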

Models

Primary cloud model: Google Gemini (gemini-2.5-flash / gemini-2.5-pro) for structured JSON responses.
Local model: meta-llama/Meta-Llama-3-8B-Instruct (quantized for local inference) used as an on-prem fallback and for offline development. The same prompt format and RAG pipeline are used, with adjustments for chat templates and output parsing.
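
Local models often wrap their JSON in extra prose, so the output-parsing adjustment mentioned above might resemble the following. It is one possible approach, assuming the goal is to pull the first valid JSON object out of free-form text; extract_json_object is a hypothetical helper name.

```python
import json


def extract_json_object(text):
    """Return the first decodable JSON object found in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


raw = 'Sure, here is the result: {"compliance_flag": false, "issues": ["age gate missing"]} Hope this helps.'
parsed = extract_json_object(raw)
```

Returning None rather than raising lets the caller decide whether to retry the generation or emit a fallback response.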

Development Tools

Python 3.10
Streamlit for the demo UI
Virtualenv/venv for environment management
Git for version control
Optional local serving: Hugging Face Transformers and quantization utilities. vLLM was tried experimentally but is not required for the current demo.
Command-line utilities and simple batch runners for offline tests

APIs Used

Google Generative Language API (Gemini) via langchain-google-genai for structured output and schema-constrained responses.

Note: The local LLM path does not call external APIs; it uses local inference through the transformers stack.

Assets Used

Regulation source files stored under regulations/ (e.g., Utah Social Media Regulation Act PDF, California SB-976 HTML, EU DSA HTML).
Internal terminology glossary (PF, GH, ASL, BB, EchoTrace, etc.) integrated as a searchable resource to normalize product acronyms.
Example PRDs and developer documents used for feature-extraction tests.
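
During ingestion the regulation files are chunked with page indices so that evidence can be traced back to its source page. A minimal sketch of that idea follows, assuming the text has already been extracted into one string per page and that chunks are fixed-size character windows with overlap; the actual splitter, sizes, and the chunk_pages name are not specified in this writeup.

```python
def chunk_pages(pages, chunk_size=500, overlap=50):
    """Split per-page text into overlapping character windows tagged with a 1-based page index."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        for offset in range(0, len(text), step):
            chunks.append({
                "id": f"p{page_number}-o{offset}",  # stable ID: page plus character offset
                "page": page_number,
                "text": text[offset:offset + chunk_size],
            })
            if offset + chunk_size >= len(text):
                break  # this window already reached the end of the page
    return chunks


sample_chunks = chunk_pages(["a" * 120, "b" * 30], chunk_size=100, overlap=20)
```

Each dictionary carries the metadata a vector store needs (an ID, the text, and the page number), so the page index can be returned alongside any retrieved clause.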

Libraries Used

LangChain, langchain-community, langchain-google-genai for LLM orchestration, prompts, and chains
ChromaDB for vector storage and retrieval
sentence-transformers (all-MiniLM-L6-v2) for embeddings
Hugging Face Transformers for local LLM inference
Streamlit for the UI
Pandas/CSV for result storage and history logging
Pydantic/json for schema handling and robust parsing

Additional Datasets (Beyond the Problem Statement)

Internal Terminology Glossary: A small, curated dataset of product acronyms and operational terms used to disambiguate feature descriptions before retrieval.
Sample PRD/Dev-Doc Corpus: A synthetic set of product requirement documents used to test the feature-extraction and developer-doc evaluation pipelines.

Notes on Integration and CI/CD

The evaluation pipeline is callable from a CLI or service endpoint and can run in a pre-commit or pre-PR job.
When issues are detected, the pipeline returns a strictly structured JSON artifact that can be attached to the PR for legal review; when insufficient context exists, a deterministic fallback JSON is emitted.
The Streamlit demo uses the exact same backend logic, ensuring parity between interactive demos and CI checks.
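
A CI gate built on this output could be sketched as below. The field names, the meaning of the flag (true taken to mean compliant), the exit codes, and the choice to fall back on unparsable output are all assumptions for illustration; the writeup only states that the JSON includes a compliance flag, issues, reasoning, and evidence, and that a deterministic fallback exists.

```python
import copy
import json

# Hypothetical fallback payload; always identical, so reviewers can recognize it.
FALLBACK = {
    "compliance_flag": None,
    "issues": [],
    "reasoning": "Insufficient context to evaluate this change.",
    "evidence": [],
}


def load_result(raw):
    """Parse pipeline output; return a copy of FALLBACK if it is not a usable JSON object."""
    try:
        result = json.loads(raw)
    except (TypeError, ValueError):
        return copy.deepcopy(FALLBACK)
    if not isinstance(result, dict) or "compliance_flag" not in result:
        return copy.deepcopy(FALLBACK)
    return result


def gate_exit_code(result):
    """Map a result to an exit code: 0 = compliant, 1 = issues found, 2 = needs human review."""
    flag = result.get("compliance_flag")
    if flag is None:
        return 2
    if flag and not result.get("issues"):
        return 0
    return 1
```

A pre-commit or pre-PR wrapper script would pass the pipeline's output to load_result and exit with gate_exit_code, so that a nonzero status blocks the change or routes it to legal review.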

Built With

chroma
gemini
langchain
python

Updates

Sangjun Nam started this project — Aug 30, 2025 08:19 PM EDT
