Inspiration

A street vendor in Lucknow receives a WhatsApp message: "Download this app for instant Rs 50,000 loan." He installs it. Within a week, the app has harvested his contacts, interest is accruing at 36% per week, and his family members are receiving threatening calls from recovery agents.

This isn't hypothetical. RBI banned 600+ illegal lending apps in 2023 alone. Rs 14,574 Cr was lost to cyber fraud last year. And the victims — 300M+ Indians entering digital finance for the first time — have zero legal awareness, can't read English-language laws, and don't know how to file a formal complaint.

We asked: what if one platform could catch the fraud, explain the law, find government help, and draft the complaint — all before the victim gives up?


What We Built

Artha-Nyaya Suite is a Databricks-native platform with 5 connected modules covering the complete fraud lifecycle:

| Step | Module | What It Does |
|------|--------|--------------|
| Prevent | Saavdhaan | Analyzes lending app terms against RBI thresholds — catches predatory practices before the user signs |
| Detect | Suraksha | Flags fraudulent UPI transactions using a GBT classifier trained on 5M+ rows |
| Understand | Adhikar | Multi-turn legal chatbot over BNS 2023 + RBI circulars — explains rights in the user's language |
| Find Help | Samriddhi | Matches citizens to government schemes they didn't know they qualified for |
| Act | Nivaaran | Drafts formal legal complaints with correct BNS sections and RBI citations |

The key insight: modules are connected, not siloed. Saavdhaan flags a predatory app → user clicks "File Complaint" → Nivaaran opens with complaint type and description pre-filled. Suraksha detects fraud → "Know Your Rights" → Adhikar shows context-aware suggested questions. One guided journey, not five isolated tools.

Everything works in 10 Indian languages (Hindi, Tamil, Telugu, Bengali, Marathi, Kannada, Malayalam, Gujarati, Punjabi, English) with voice input/output — because a Hindi-speaking street vendor deserves the same legal protection as an English-literate professional.


How We Built It — Databricks Platform Deep Dive

Delta Lake — Not Just Storage, the Sync Engine

Every dataset lives in Delta tables, but the real value is Change Data Feed (CDF) on unified_corpus. When we add new RBI circulars or BNS sections, CDF detects what changed and auto-syncs the Vector Search index — zero manual re-indexing. The RAG pipeline returns updated answers without redeploying the app.

We also use MERGE INTO for upserts (config flags, metrics tables) — ACID operations that raw Parquet can't do.

Spark MLlib — Real Training, Not Wrappers

  • GBT Fraud Classifier (notebook 10): GBTClassifier inside a Pipeline with VectorAssembler + StringIndexer, trained on 5M+ UPI transactions. AUC: 0.9999, fraud-class F1: 0.9668.
  • KMeans Persona Clustering (notebook 11): Clusters users into 5 transaction-behavior personas from aggregate features (monthly inflow, avg amount, dominant type). Samriddhi uses these to match citizens to government schemes.

Both models are trained with Spark's distributed engine on the full dataset — not sampled, not single-node.
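As a sketch of how transaction-behavior personas can drive downstream matching, here is a minimal nearest-centroid assignment in plain Python. The centroid values, feature order, and persona names are illustrative assumptions, not the trained MLlib KMeans model:

```python
import math

# Hypothetical persona centroids over (monthly_inflow, avg_amount, cashout_ratio).
# The real pipeline uses Spark MLlib KMeans on the full dataset; these numbers
# are illustrative only, and production features would be scaled first.
CENTROIDS = {
    "street_vendor":  (18000.0, 120.0, 0.80),
    "salaried":       (60000.0, 900.0, 0.10),
    "gig_worker":     (25000.0, 350.0, 0.45),
    "student":        (8000.0,  150.0, 0.20),
    "small_business": (150000.0, 2500.0, 0.60),
}

def assign_persona(features):
    """Return the persona whose centroid is nearest in Euclidean distance."""
    def dist(centroid):
        return math.sqrt(sum((f - c) ** 2 for f, c in zip(features, centroid)))
    return min(CENTROIDS, key=lambda name: dist(CENTROIDS[name]))
```

A persona label like this is what Samriddhi would then map to scheme eligibility rules (e.g. the street-vendor cluster to PM SVANidhi).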

MLflow — Model Registry with Champion Aliases

Both models are logged to MLflow with full metrics, registered in Unity Catalog, and aliased as @champion. The app loads models by alias — if we retrain and a new version beats the old one, we update the alias and the app picks it up. Zero code changes, zero redeployment.

Vector Search — Primary Retriever with Auto-Sync

Delta Sync index over unified_corpus using databricks-bge-large-en embeddings. Source-filtered per module: Suraksha retrieves only legal sections, Adhikar retrieves legal + regulatory, Samriddhi retrieves only schemes. If the Vector Search endpoint is down, the app automatically falls back to FAISS on the UC Volume.
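The retrieve-with-fallback-and-allowlist logic can be sketched as follows; `primary` and `fallback` are stand-ins for the Vector Search and FAISS clients, and the source labels are illustrative:

```python
# Illustrative per-module source allowlists (names assumed, not the real config).
MODULE_SOURCES = {
    "suraksha": {"legal"},
    "adhikar": {"legal", "regulatory"},
    "samriddhi": {"schemes"},
}

def retrieve(query, module, primary, fallback, k=5):
    """Try the primary retriever (Vector Search); on any failure, fall back
    to the local FAISS index. Hits are filtered to the module's allowlist
    so, e.g., legal questions never surface scheme data."""
    allowed = MODULE_SOURCES[module]
    try:
        hits = primary(query)
    except Exception:
        hits = fallback(query)
    return [h for h in hits if h["source"] in allowed][:k]
```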

Databricks Apps — Production Deployment

The Gradio app runs in a containerized Databricks App with OAuth M2M service principal — no hardcoded tokens. The service principal gets granular UC permissions: USE_CATALOG, USE_SCHEMA, READ_VOLUME, SELECT on tables, EXECUTE on models, READ on secret scope.

SQL Statement Execution API — Live Metrics Without Spark

The Performance dashboard loads fraud model AUC and RAG accuracy live from Delta tables via the SDK's SQL Statement API — because the app container has no Spark. If the tables are unreachable, it falls back to hardcoded values from the last notebook run.
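The load-live-or-fall-back behavior might look like this sketch, where `run_sql` stands in for the SDK call and the snapshot values come from the metrics reported in this write-up:

```python
# Last-known values from the final notebook run, used when the warehouse
# or the Delta tables are unreachable from the app container.
FALLBACK_METRICS = {"fraud_auc": 0.9999, "rag_accuracy": 0.62}

def load_metrics(run_sql):
    """Fetch live metrics via a SQL-executing callable; fall back to the
    hardcoded snapshot on any failure. Table and column names are assumed."""
    try:
        row = run_sql(
            "SELECT fraud_auc, rag_accuracy FROM metrics ORDER BY ts DESC LIMIT 1"
        )
        return {"fraud_auc": row[0], "rag_accuracy": row[1]}
    except Exception:
        return dict(FALLBACK_METRICS)
```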

Unity Catalog — Everything Under One Roof

All tables, volumes, models, and indexes live under workspace.default. The UC Volume stores FAISS indexes, app cache parquets, and seed PDFs. Permissions are granted at the UC level — one notebook (15) sets up everything the service principal needs.


The RAG Pipeline — How AI Actually Works

Every module that answers questions uses the same core pipeline (orchestration.py). With voice input and output, the full flow is:

User speaks Hindi → Sarvam Saaras (STT) → text
↓
Sarvam Mayura translates → English query
↓
LLM rewrites ambiguous references ("what's the punishment for that?") using last 4 chat turns
↓
Vector Search retrieves top-5 chunks (source-filtered per module)
↓
Llama-4-Maverick generates answer with retrieved context
↓
Sarvam Mayura translates answer → Hindi (citations kept in English)
↓
Sarvam Bulbul (TTS) → user hears the answer
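The flow above can be sketched as a chain of injected steps; every callable name here is a stand-in for the real client code in orchestration.py:

```python
def answer(query_audio, module, stt, translate, rewrite, retrieve, generate, tts,
           history=None):
    """One end-to-end RAG turn: speech -> English query -> standalone
    rewrite -> retrieval -> generation -> translated, spoken answer."""
    text = stt(query_audio)                              # Saaras STT
    english = translate(text, target="en")               # Mayura to English
    standalone = rewrite(english, (history or [])[-4:])  # last 4 chat turns
    chunks = retrieve(standalone, module)                # Vector Search / FAISS
    answer_en = generate(standalone, chunks)             # Llama-4-Maverick
    answer_local = translate(answer_en, target="hi")     # Mayura back to Hindi
    return answer_local, tts(answer_local)               # Bulbul TTS
```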

Why this is sound:

  • Translation happens before retrieval — English embeddings work better than multilingual ones for legal text
  • Query rewriting resolves anaphoric references ("that", "it", "this section") into standalone queries — critical for multi-turn conversations
  • Source allowlists prevent cross-contamination (legal questions don't retrieve scheme data)
  • Every fallback is automatic: Vector Search → FAISS, Llama-4-Maverick → sarvam-m, Sarvam Mayura → LLM-based translation

Verifiable results:

  • Evaluated on BhashaBench Hindi Finance benchmark (50 QA pairs from HuggingFace)
  • Proxy accuracy: 62%, Token F1: 0.384, Avg latency: 4.8s
  • All metrics stored in Delta tables and displayed live in the Performance tab

Innovation — Why This Is Non-Obvious

The problem is well-chosen: India has 700+ government schemes, a brand-new criminal code (BNS 2023 replaced IPC in 2024 — barely anyone knows it), and RBI digital lending regulations that most citizens can't read. These aren't hypothetical gaps — they're active crises.

The solution is novel in three ways:

  1. Connected modules, not isolated tools. Every existing legal-tech or fintech app solves one piece. We solve the entire lifecycle in one flow — and context transfers between modules automatically. No other platform does fraud detection → legal rights → complaint drafting as a single guided journey.

  2. Transaction-inferred personas for scheme matching. Samriddhi doesn't ask users to fill forms. It infers their profile from UPI transaction patterns (KMeans clustering) and matches them to government schemes. A street vendor with high cash-out frequency gets matched to PM SVANidhi; a gig worker with irregular inflows gets matched to e-Shram.

  3. Predatory lending detection before the loss happens. Saavdhaan catches predatory terms (300%+ APR, contact harvesting, prohibited recovery practices) using regex scoring against RBI thresholds before the user signs — then RAG adds legal analysis citing the exact RBI circular that's being violated. Prevention, not just detection.
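A minimal sketch of regex scoring against such thresholds; the patterns, weights, and the 36% APR cutoff here are illustrative assumptions, not the actual Saavdhaan rule set:

```python
import re

# Illustrative red-flag patterns with weights (not the real RBI rule set).
RULES = [
    (re.compile(r"(\d{2,4})\s*%\s*(?:APR|per\s+annum)", re.I), "apr", 3),
    (re.compile(r"access\s+(?:to\s+)?your\s+contacts", re.I), "contact_harvesting", 4),
    (re.compile(r"recovery\s+agents?\s+may\s+visit", re.I), "recovery_practices", 2),
]

def score_terms(text, apr_limit=36.0):
    """Return (score, flags) for a lending app's terms text."""
    score, flags = 0, []
    for pattern, name, weight in RULES:
        match = pattern.search(text)
        if not match:
            continue
        if name == "apr" and float(match.group(1)) <= apr_limit:
            continue  # stated APR is within the (illustrative) threshold
        score += weight
        flags.append(name)
    return score, flags
```

In the real flow, a non-zero score would then trigger the RAG step that cites the specific circular being violated.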


Challenges

  1. No Spark in App containers. Databricks Apps run in lightweight containers without Spark. We solved this by pre-computing scored transactions and user features as Parquet files on the UC Volume, then loading them with pandas at runtime. The app reads from Delta tables via SQL Statement API, not Spark.

  2. Sarvam Mayura's silent truncation. The translation API silently fails on text >500 characters. We built chunked_translate() that splits text at sentence boundaries, translates each chunk, and reassembles — with fallback to LLM-based translation if the API errors.
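A sketch of how such a `chunked_translate()` might work, with `translate` standing in for the per-chunk API call (the real version also falls back to LLM translation on errors):

```python
import re

MAX_CHARS = 500  # the API silently truncates beyond this

def chunked_translate(text, translate, max_chars=MAX_CHARS):
    """Split at sentence boundaries into chunks of at most max_chars,
    translate each chunk, and reassemble. A single sentence longer than
    max_chars is still sent as one chunk in this sketch."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return " ".join(translate(chunk) for chunk in chunks)
```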

  3. Multi-turn memory in RAG. Follow-up questions like "what's the punishment for that?" fail in naive RAG because "that" has no referent. We added an LLM-powered query rewriter that uses the last 4 conversation turns to produce a standalone retrieval query.
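The rewriter's prompt assembly might look like this sketch; the exact instruction wording is an assumption:

```python
def build_rewrite_prompt(question, history, max_turns=4):
    """Assemble the prompt that asks the LLM to turn a context-dependent
    follow-up into a standalone retrieval query, using only the last
    max_turns conversation turns. history is a list of (role, text) pairs."""
    lines = [f"{role}: {text}" for role, text in history[-max_turns:]]
    return (
        "Rewrite the final user question as a standalone query, resolving "
        "references like 'that' or 'it' from the conversation.\n\n"
        "Conversation:\n" + "\n".join(lines) +
        f"\n\nFinal question: {question}\nStandalone query:"
    )
```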

  4. Two-tier model registration. MLflow model registration sometimes fails on Unity Catalog (permissions vary by workspace config). We built a cascade: try UC registration first, fall back to workspace registry, then set the @champion alias — with graceful degradation at each step.
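A sketch of the cascade with injected registration callables; `register_uc`, `register_ws`, and `set_alias` are stand-ins for the real MLflow calls:

```python
def register_model(run_uri, name, register_uc, register_ws, set_alias):
    """Cascade: try Unity Catalog registration first, then the workspace
    registry; setting the @champion alias is best-effort so a permissions
    failure never aborts the run."""
    try:
        version = register_uc(run_uri, name)
        registry = "unity_catalog"
    except Exception:
        version = register_ws(run_uri, name)
        registry = "workspace"
    try:
        set_alias(name, "champion", version)
    except Exception:
        pass  # alias is optional; the app can still load by version
    return registry, version
```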

  5. Cross-tab navigation in Gradio. Gradio 4.44 doesn't natively support programmatic tab switching with data pre-filling. We solved it with gr.Tabs(selected=N) IDs, builder functions that return component dictionaries, and .then() chaining to show contextual next-steps after each action completes.


Key Metrics

| Metric | Value |
|--------|-------|
| Fraud Model AUC | 0.9999 |
| Fraud-class Precision | 97.1% |
| Fraud-class Recall | 96.3% |
| Fraud-class F1 | 96.7% |
| Training Data | 5M+ rows |
| RAG Proxy Accuracy | 62.0% (BhashaBench Hindi Finance) |
| RAG Token F1 | 0.384 |
| RAG Avg Latency | 4.8s |
| Languages | 10 Indian languages + voice |
| Connected Modules | 5 + Performance dashboard |
| Notebooks | 17 (reproducible from scratch) |
| Fallback Chains | 4 (LLM, retrieval, translation, runtime) |

Built With

  • bhashabench
  • databricks
  • databricks-apps
  • databricks-vector-search
  • delta-lake
  • faiss
  • gradio
  • huggingface
  • llama-4-maverick
  • mlflow
  • pandas
  • python
  • sarvam-ai
  • sarvam-bulbul
  • sarvam-mayura
  • sarvam-saaras
  • spark-mllib
  • unity-catalog