Geo-Compliance
Problem Statement
Product teams must ship features that comply with a patchwork of regional regulations. Manual checks are slow, inconsistent, and hard to audit. The goal is to automate compliance discovery by:
- Mapping a plain-language feature description or code change to applicable jurisdictions.
- Retrieving the most relevant regulatory clauses.
- Producing a structured, auditable JSON response with issues, reasoning, and verbatim evidence from the law.
- Enabling the same checks to run in an automated gate (pre-commit or pre-PR) for continuous compliance.
Project Features & Functionality
Our project is a Geo-Regulation Compliance Assistant that evaluates product features or code changes against region-specific legal frameworks (for example, Utah Social Media Regulation Act, California SB-976, and the EU Digital Services Act).
We implement a Retrieval-Augmented Generation (RAG) pipeline:
- Regulations are ingested from PDF/HTML/DOCX, chunked with page indices, embedded, and stored in Chroma.
- A feature description is provided by the user.
- A lightweight region classifier selects the relevant jurisdiction(s).
- A glossary expansion step resolves internal acronyms (PF, GH, ASL, BB, EchoTrace, etc.).
- The retriever fetches the top-k jurisdiction-specific clauses.
- The LLM generates a strict, machine-readable JSON object that includes a compliance flag, issues, reasoning, and evidence quoted from the source.
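The two pre-retrieval steps above (glossary expansion and region classification) could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the glossary expansions, the keyword lists, and the function names `expand_glossary` and `classify_regions` are all assumptions made for the example, and the real classifier may well be a model rather than a keyword match.

```python
import re

# Hypothetical glossary entries: these expansions are placeholders for
# illustration, not the project's actual definitions.
GLOSSARY = {
    "PF": "personalized feed",
    "ASL": "age-sensitive logic",
}

# Hypothetical keyword lists per jurisdiction (assumed, not from the project).
REGION_KEYWORDS = {
    "utah": ["utah", "curfew"],
    "california": ["california", "sb-976", "sb976"],
    "eu": ["eu", "european union", "dsa"],
}


def expand_glossary(text, glossary=GLOSSARY):
    """Append each acronym's expansion in parentheses (whole-word, case-sensitive)."""
    if not glossary:
        return text
    # Longest terms first so a longer acronym wins over a shorter prefix.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"

    def repl(match):
        term = match.group(0)
        return f"{term} ({glossary[term]})"

    return re.sub(pattern, repl, text)


def classify_regions(text, keywords=REGION_KEYWORDS):
    """Return every region whose keyword appears as a whole word in the text."""
    lowered = text.lower()
    hits = []
    for region, words in keywords.items():
        if any(re.search(r"\b" + re.escape(w) + r"\b", lowered) for w in words):
            hits.append(region)
    return hits


description = "Disable PF for minors in Utah"
print(expand_glossary(description))
print(classify_regions(description))
```

The expanded text would then be passed to the retriever, with the returned regions used to filter which regulation chunks are searched.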
Additional capabilities
- Developer document evaluator: extracts features from PRDs/dev docs for downstream compliance checks.
- Code-change evaluator: summarizes feature-level impacts from diffs and maps them to regulatory requirements.
- Streamlit demo app: interactive UI that runs the same pipeline, displays JSON output, and supports run history logging.
- CSV logging and history panel: every run is upserted to a CSV and viewable in a collapsible, scrollable log for traceability.
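As a rough sketch of what "upserted to a CSV" can look like, the snippet below rewrites the log file so that a run with an existing identifier replaces its earlier row instead of appending a duplicate. The column names, the `run_id` key, and the `upsert_run` helper are hypothetical; the project lists Pandas among its libraries, while this example sticks to the standard library to stay self-contained.

```python
import csv
import os

# Hypothetical column layout; the project's real log schema is not shown.
FIELDS = ["run_id", "feature", "regions", "compliant"]


def upsert_run(path, row, key="run_id"):
    """Insert the row, or replace the existing row with the same key. Returns row count."""
    rows = []
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh))

    new_row = {f: str(row.get(f, "")) for f in FIELDS}
    for i, existing in enumerate(rows):
        if existing.get(key) == new_row[key]:
            rows[i] = new_row  # same run id: overwrite in place
            break
    else:
        rows.append(new_row)  # unseen run id: append

    with open(path, "w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" tolerates older files with extra columns.
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Rewriting the whole file on every run is simple and adequate for a demo-sized history; a larger log would be better served by a database.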
Models
- Primary cloud model: Google Gemini (gemini-2.5-flash / gemini-2.5-pro) for structured JSON responses.
- Local model: meta-llama/Meta-Llama-3-8B-Instruct (quantized for local inference) used as an on-prem fallback and for offline development. The same prompt format and RAG pipeline are used, with adjustments for chat templates and output parsing.
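One of the output-parsing adjustments a local model typically needs is recovering the JSON object from a reply that wraps it in extra text or a Markdown fence. The sketch below shows one way to do that with the standard library; the helper name `extract_json_object` and the key names in the sample reply are assumptions for illustration, not the project's actual code.

```python
import json


def extract_json_object(raw):
    """Return the first decodable JSON object embedded in raw text, or None."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = raw.find("{", start + 1)
    return None


reply = 'Sure, here is the result:\n{"compliance_flag": false, "issues": ["x"]}\nDone.'
print(extract_json_object(reply))
```

A schema-constrained cloud response should not need this step, which is why it belongs on the local-model path only.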
Development Tools
- Python 3.10
- Streamlit for the demo UI
- Virtualenv/venv for environment management
- Git for version control
- Optional local serving: Hugging Face Transformers and quantization utilities; vLLM was tried experimentally but is not required by the current demo
- Command-line utilities and simple batch runners for offline tests
APIs Used
- Google Generative Language API (Gemini) via langchain-google-genai for structured output and schema-constrained responses.
Note: The local LLM path does not call external APIs; it uses local inference through the transformers stack.
Assets Used
- Regulation source files stored under regulations/ (e.g., Utah Social Media Regulation Act PDF, California SB-976 HTML, EU DSA HTML).
- Internal terminology glossary (PF, GH, ASL, BB, EchoTrace, etc.) integrated as a searchable resource to normalize product acronyms.
- Example PRDs and developer documents used for feature-extraction tests.
Libraries Used
- LangChain, langchain-community, langchain-google-genai for LLM orchestration, prompts, and chains
- ChromaDB for vector storage and retrieval
- sentence-transformers (all-MiniLM-L6-v2) for embeddings
- Hugging Face Transformers for local LLM inference
- Streamlit for the UI
- Pandas/CSV for result storage and history logging
- Pydantic and the standard json module for schema handling and robust parsing
Additional Datasets (Beyond the Problem Statement)
- Internal Terminology Glossary: A small, curated dataset of product acronyms and operational terms used to disambiguate feature descriptions before retrieval.
- Sample PRD/Dev-Doc Corpus: A synthetic set of product requirement documents used to test the feature-extraction and developer-doc evaluation pipelines.
Notes on Integration and CI/CD
- The evaluation pipeline is callable from a CLI or service endpoint and can run in a pre-commit or pre-PR job.
- When issues are detected, the pipeline returns a strictly structured JSON artifact that can be attached to the PR for legal review; when insufficient context exists, a deterministic fallback JSON is emitted.
- The Streamlit demo uses the exact same backend logic, ensuring parity between interactive demos and CI checks.
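A pre-commit or pre-PR gate built on these notes might look something like the sketch below: it validates the pipeline's JSON, substitutes a fixed fallback object when the output is missing or malformed, and maps the result to an exit code. The key names, the fallback wording, the exit-code convention, and the assumption that a true compliance flag means "compliant" are all illustrative choices, not the project's documented schema.

```python
import copy
import json

# Assumed key names for the structured response (not confirmed by the project).
REQUIRED_KEYS = ("compliance_flag", "issues", "reasoning", "evidence")

# Deterministic fallback: always the same object when context is insufficient.
FALLBACK = {
    "compliance_flag": None,
    "issues": [],
    "reasoning": "Insufficient context to evaluate.",
    "evidence": [],
}


def normalize_result(raw):
    """Parse pipeline output; return the fallback if it is not a complete object."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return copy.deepcopy(FALLBACK)
    if not isinstance(data, dict) or any(k not in data for k in REQUIRED_KEYS):
        return copy.deepcopy(FALLBACK)
    return {k: data[k] for k in REQUIRED_KEYS}


def gate_exit_code(result):
    """0 = pass, 1 = issues found, 2 = fallback emitted (needs manual review)."""
    if result["compliance_flag"] is None:
        return 2
    if result["issues"] or result["compliance_flag"] is False:
        return 1
    return 0


def render_artifact(result):
    """Serialize with sorted keys so the PR attachment is byte-for-byte stable."""
    return json.dumps(result, sort_keys=True, indent=2)
```

In a CI job, a small wrapper would read the pipeline output, write `render_artifact(...)` to a file attached to the PR, and exit with `gate_exit_code(...)` so that a non-zero status blocks the merge.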