Legal Lens – TikTok Geo-Compliance Automation System
TikTok TechJam 2025 – Track 3 Submission by Team Durian Men
📌 Features and Functionality
Legal Lens is an automated system for determining geo-specific compliance requirements of TikTok features using AI-powered legal analysis.
It transforms compliance detection from manual guesswork into traceable, auditable outputs through:
Core Features
- Automated Compliance Detection – Analyzes feature descriptions and determines if they require geo-specific legal compliance
- Semantic Legal Search – Uses AI embeddings to find relevant legal articles from a comprehensive knowledge base
- TikTok Jargon Translation – Automatically expands 50+ internal terminology terms (ASL, GH, T5, etc.) for proper legal context
- Confidence Scoring – Provides reliability metrics and legal citations for compliance determinations
- Batch Processing – Analyzes multiple features simultaneously via CSV input/output with simple rate limiting
- Audit Trail Generation – Creates detailed reasoning and legal references for every decision
🏗️ System Architecture
- Legal Document Processing – Article-based chunking of 5 regulatory frameworks into multiple compliance-focused segments
- Vector Embeddings –
BAAI/bge-small-en-v1.5model generates semantic embeddings for legal text retrieval - ChromaDB Vector Database – Persistent storage enabling sub-second semantic search across legal corpus
- Provence AI Integration – Prunes legal documents to extract only compliance-relevant obligations and requirements
- Structured JSON Pipeline – Machine-readable outputs with verdicts, reasoning, and legal citations
⚡ Performance
- Fast Processing – sub 7 seconds per feature analysis (vs 15–30 minutes manual legal review)
- 98% Time Savings – Dramatically reduces manual legal review overhead
- Multi-Jurisdiction Coverage – Supports 5 major regulatory frameworks simultaneously
💵 Cost Efficiency
- Low API Cost – Average input/output API call cost to Sonar was below 1 cent per request
- Affordable at Scale – Enables batch analysis of hundreds of features with negligible expense
- Predictable Pricing – Fixed cost per request ensures transparent budgeting
🛠️ Development Tools Used
- Python 3.8+ – Primary programming language
- Virtual Environment – Dependency isolation and management
- Git – Version control and collaboration
- Visual Studio Code – Main development environment
🔌 APIs Used in the Project
- Perplexity API – Sonar model integration for AI-powered legal analysis and compliance determination
- FastEmbed API –
BAAI/bge-small-en-v1.5embedding model for high-performance semantic search - Provence AI API – Content pruning service for compliance-relevant legal text extraction
🧩 Design Decision: Direct API vs LangChain
We chose direct HTTP calls to Perplexity over LangChain for simplicity and control.
- ✅ Transparency in outputs
- ✅ Minimal dependencies
- ✅ Direct error handling and response parsing
- ❌ LangChain abstraction unnecessary for our single-call use case
📚 Libraries Used
- ChromaDB – Vector database for persistent semantic search
- FastEmbed – High-performance embeddings
- Requests – HTTP client for API interactions
- Python-dotenv – Environment variable management
- Transformers – Hugging Face library for Provence AI integration
- NLTK – Text preprocessing for Provence integration
- NumPy – Vector operations and embedding manipulation
⚖️ Relevant Problem Statement
From Guesswork to Governance: Automating Geo-Regulation with LLM
As TikTok operates globally, every product feature must dynamically satisfy dozens of geographic regulations (e.g., Brazil’s data localization, GDPR).
Without automation, TikTok faces:
- ⚖️ Legal exposure from undetected compliance gaps
- 🛑 Reactive firefighting when regulators inquire
- 🚧 Manual overhead in scaling global rollouts
✅ System Benefits
Proactive Legal Guardrails
- Flags compliance needs during feature development
- Provides confidence scoring + legal citations
- Escalates to human intervention if uncertain
- Flags compliance needs during feature development
Audit-Ready Evidence
- Structured JSON outputs with reasoning, references, and batch logs
- Structured JSON outputs with reasoning, references, and batch logs
Regulatory Traceability
- Persistent compliance records with citations and confidence metrics
- Persistent compliance records with citations and confidence metrics
📂 Assets Used
- Legal Document Corpus – 5 major regulatory frameworks
- TikTok Jargon Dictionary – 50+ terms mapped (ASL, GH, T5, Spanner, Jellybean, Snowcap, etc.)
- Feature Dataset – 35+ test features for validation
- Pre-trained Embedding Models –
BAAI/bge-small-en-v1.5(~1GB)
📈 Additional Datasets Created
Enhanced TikTok Jargon Dictionary
- TikTok-specific abbreviations (e.g., ASL = Age-sensitive logic)
- System codenames (e.g., Spanner = rule engine, Jellybean = parental controls)
- Operational terms (e.g., ShadowMode, EchoTrace, Redline)
- Compliance/security terms (e.g., GDPR, CCPA, PII, MFA, RBAC)
- Ambiguity handling with multi-meaning resolution
Compliance-Focused Legal Embeddings
- Extracted only obligations & compliance requirements
- Removed irrelevant definitions/background info using Provence
- Produced 42 focused legal segments
Smart Jargon Resolution System
- Provide all meanings (TikTok + standard definitions)
- Context-aware analysis by Perplexity Sonar
- Coverage across 50+ terms
- Support for multi-meaning abbreviations
🎯 Task Requirements Fulfillment
🛡️ Mitigate Regulatory Exposure
- Proactive detection of compliance needs
- Multi-jurisdiction coverage (Utah, California, Florida, Federal, EU)
- Confidence scoring + legal citations
- Human-in-the-loop design via “uncertain” verdicts
📋 Enable Audit-Ready Transparency
- Structured JSON with verdict + reasoning + references
- Batch processing logs for evidence
- Immediate response to audits
💰 Reduce Compliance Governance Costs
- 98% time savings (~7s vs 30min per feature)
- Automated batch processing for hundreds of features
- Eliminates manual research
- Ensures consistency + scalability
🧗 Challenges Faced
Embedding Legal Texts for Optimal Retrieval
Finding the right technique to embed dense legal language was difficult. We had to test out multiple chunking and embedding techniques so that ChromaDB retrieval returned just the right amount of context, too small lost meaning, too large returned irrelevant material.Prompt Engineering for Sonar
Due to the independence of each API call to Sonar, we had to continuously re-engineer and refine the single-shot prompt to ensure one query reliably produced the desired structured JSON verdicts, reasoning, and citations without requiring multiple retries.
⚠️ Limitation
This is a prototype demonstrating feasibility.
Production deployment would require further validation and legal review.
👥 Team Info
- Team: Durian Men
- Track: Track 3 – From Guesswork to Governance: Automating Geo-Regulation with LLM
- Focus: Boosting LLM Precision & Full Automation for geo-compliance detection
This system transforms TikTok’s compliance process from reactive firefighting to proactive legal guardrails, enabling audit-ready transparency for global feature rollouts.
Log in or sign up for Devpost to join the conversation.