Inspiration

India has 400 million credit-invisible people — street vendors, daily wage workers, farmers, and small earners — who are completely locked out of formal banking. Not because they're untrustworthy, but because they simply don't exist in the system. No CIBIL score. No credit history. No chance.

We asked ourselves: What if the chai vendor who has paid his electricity bill on time for 10 years could prove he's creditworthy? What if a woman running a Self Help Group could speak into her phone in Tamil and instantly learn she qualifies for a ₹10 lakh collective loan?

The answer is VishwaScore — an alternative credit identity built from the financial signals that already exist in every Indian's life, paired with a voice-first advisor that speaks their language.


What it does

VishwaScore has two core capabilities:

1. Alternative Credit Scoring (VishwaScore Engine)

We ingest real financial behavior data — UPI transaction patterns, utility bill payments, asset ownership signals, government scheme participation, and employment stability — and feed it through a LightGBM model trained on 50,000 synthetic Indian borrower profiles. The output: a VishwaScore between 300–900 that banks can use to underwrite loans for people who have zero traditional credit history.

The score is broken down into 6 transparent pillars: Bill Payments, UPI & Digital Flow, Assets, Income, Identity & Government Linkage, and Stability — each with SHAP-based explainability so both the borrower and the bank understand why the score is what it is.
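To make the pillar idea concrete, here is a minimal sketch of how a model probability could be mapped into the 300–900 band with a weighted per-pillar breakdown. The pillar names come from the write-up, but the weights, function names (`to_vishwascore`, `pillar_breakdown`), and the linear mapping are hypothetical, not the production scoring logic.

```python
# Illustrative sketch only: pillar names are real, but these weights and
# the probability-to-score mapping are assumptions for demonstration.

PILLAR_WEIGHTS = {
    "bill_payments": 0.25,
    "upi_digital_flow": 0.20,
    "assets": 0.15,
    "income": 0.15,
    "identity_govt_linkage": 0.10,
    "stability": 0.15,
}

def to_vishwascore(default_prob: float) -> int:
    """Map a model's predicted default probability into the 300-900 band."""
    repay_prob = 1.0 - default_prob
    return round(300 + repay_prob * 600)

def pillar_breakdown(pillar_scores: dict) -> dict:
    """Weight each 0-1 pillar score so the borrower sees what helps or hurts."""
    return {p: round(PILLAR_WEIGHTS[p] * s, 3) for p, s in pillar_scores.items()}

print(to_vishwascore(0.10))  # prints 840: low default risk, near the top of the band
```

In the real system the default probability would come from the LightGBM model and the breakdown from SHAP values rather than fixed weights.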

2. Multilingual Voice Financial Advisor (ArthaSetu)

A rural farmer in Bihar shouldn't need to read English PDFs to discover he qualifies for a Kisan Credit Card. Our voice pipeline lets users speak in any of 10 Indian languages (Hindi, Tamil, Telugu, Marathi, Bengali, Gujarati, Kannada, Malayalam, Punjabi, English), automatically transcribes and understands their query, matches them to the best government loan scheme from our knowledge base of 11 verified schemes, generates a simple response in their native language, and reads it back to them as natural audio.

The full loop — voice in, advice out — takes under 8 seconds.


How we built it

Data Pipeline (Databricks Lakehouse)

We built a complete Bronze → Silver → Gold medallion architecture on Databricks with Unity Catalog governance:

  • Bronze Layer: Ingested 4 public datasets — 40,000 rural loan borrower profiles (Kaggle), 19,000 BhashaBench Finance Q&A pairs (HuggingFace), 12 government loan schemes (myscheme.gov.in), and PMMY state-wise credit penetration data. All stored as Delta tables with Change Data Feed enabled.
  • Silver Layer: PySpark transformations — cleaned 40K borrower rows into 1,250 occupation-level aggregated profiles, resolved BhashaBench MCQ answers into readable Q&A text, normalized scheme eligibility data, and built state-level credit context summaries.
  • Gold Layer: Merged all Silver sources into a unified RAG corpus of 4,178 chunks, generated embeddings using paraphrase-MiniLM-L6-v2, and built a FAISS vector index for sub-50ms retrieval. Separately, engineered 32 features across 6 pillars for the credit scoring model.
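As a rough illustration of the Gold-layer chunking step, here is a minimal overlapping-window splitter. The chunk size, overlap, and function name are assumptions; the real pipeline embeds the resulting chunks with paraphrase-MiniLM-L6-v2 and indexes them in FAISS.

```python
def chunk_text(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character windows for embedding.
    Overlap keeps sentences that straddle a boundary retrievable from
    either neighboring chunk."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

# Hypothetical Silver-layer Q&A rows flattened into a chunked RAG corpus:
corpus = [
    "Q: Who is eligible for PM SVANidhi? A: Street vendors with ...",
    "Q: What is the Mudra Tarun limit? A: Up to Rs 10 lakh for ...",
]
all_chunks = [c for doc in corpus for c in chunk_text(doc, size=60, overlap=10)]
```

A character window is the simplest possible chunker; a production corpus would more likely split on Q&A record boundaries so each chunk stays self-contained.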

ML Pipeline (MLflow + LightGBM)

  • Feature engineering across 6 pillars: bill payment consistency, UPI transaction flow, asset signals, income proxies, government identity linkage, and stability indicators.
  • LightGBM model trained with Hyperopt tuning (50+ trials), tracked in MLflow with full experiment lineage.
  • Progressive AUC improvement: v1 baseline (income only) → 0.74, v2 (+bills) → 0.80, v3 (full VishwaScore with alternative data) → 0.86 AUC. The +0.12 lift comes entirely from data that does not exist in CIBIL.
  • SHAP explanations generated per-user for full transparency.
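The AUC numbers above are rank statistics: the probability that a random defaulter is scored riskier than a random non-defaulter. A dependency-free sketch of that computation (Mann–Whitney form), purely to show what the 0.74 → 0.86 comparison measures, not the MLflow pipeline itself:

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative (Mann-Whitney U form; ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking of positives above negatives gives 1.0:
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # prints 1.0
```

The O(pos × neg) double loop is fine for a sanity check; at scale you would sort once and use the rank-sum formula, or simply call a library implementation.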

Voice AI Pipeline (Sarvam AI)

  • ASR: Sarvam Saaras v3 — auto-detects Indian language, transcribes, and translates to English in a single API call. Handles code-mixed speech (Hinglish, Tanglish).
  • LLM: Sarvam-m 24B — India's first large language model that natively understands Indian languages. Grounded in our RAG scheme corpus for factual responses.
  • TTS: Sarvam Bulbul v2 — generates natural-sounding speech in 10 Indian languages with region-appropriate voices.
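The three stages above compose into a single voice-in, advice-out loop. A minimal sketch of that orchestration, with the stage interfaces injected as callables — the function name and signatures are assumptions; in the real system each callable wraps an HTTP request to the corresponding Sarvam service.

```python
from typing import Callable

def voice_advisor(audio: bytes,
                  asr: Callable[[bytes], str],
                  llm: Callable[[str], str],
                  tts: Callable[[str], bytes]) -> bytes:
    """Voice in, advice out: transcribe, answer, then synthesize speech.
    The asr/llm/tts callables stand in for Saaras, Sarvam-m, and Bulbul."""
    query = asr(audio)    # speech -> English text
    answer = llm(query)   # grounded scheme recommendation
    return tts(answer)    # text -> audio in the user's language

# Stub wiring, just to show the composition (no network calls):
reply = voice_advisor(b"...",
                      asr=lambda a: "kisan credit card eligibility",
                      llm=lambda q: f"Answer for: {q}",
                      tts=lambda t: t.encode())
```

Injecting the stages as callables also makes the loop trivially testable with stubs, which matters when each real stage is a metered API call.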

Application (Streamlit on Databricks Apps)

  • 5-page interactive dashboard: Portfolio Overview, Segment Analytics, User Lookup (with 6-pillar breakdown), Model Performance, and Voice Advisor.
  • Live SQL queries to Unity Catalog Gold tables via Databricks SQL Warehouse.
  • Real-time voice interaction with <8 second end-to-end latency.

Challenges we ran into

The "Credit-Invisible" Data Paradox. The people we're trying to score have no credit data — that's the whole point. We had to carefully engineer proxy features from non-traditional sources (UPI patterns, bill payment regularity, government scheme participation) and validate that they genuinely predict repayment behavior. The +0.12 AUC lift from alternative data validated our thesis, but getting the feature engineering right took significant iteration.

Sarvam SDK Breaking Changes. The Sarvam AI Python SDK had undocumented API changes — client.chat.completions.create() didn't work, speaker names had changed, and TTS had a 500-character limit we discovered at runtime. We pivoted to direct HTTP API calls for all three services (ASR, LLM, TTS), which gave us full control and eliminated SDK dependency issues.
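One way to work around a per-request character cap like the 500-character TTS limit mentioned above is to split replies at sentence boundaries before synthesis. This helper is an illustrative assumption, not the project's production code; the boundary pattern and limit are parameters you would tune against the actual API.

```python
import re

def split_for_tts(text: str, limit: int = 500) -> list[str]:
    """Break a long reply into <= limit character pieces, cutting at
    sentence boundaries (., !, ?, or the Devanagari danda) so each piece
    fits in one TTS call. A single sentence longer than `limit` would
    still need a hard character split, which this sketch omits."""
    sentences = re.split(r"(?<=[.!?\u0964])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > limit:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence ends keeps each synthesized clip prosodically natural, which matters more for voice UX than packing chunks to the exact byte limit.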

Databricks Apps File Size Limits. Our FAISS index + metadata was 12MB, but Databricks Apps has a 10MB per-file limit. We had to split the data into separate modules and restructure our deployment. For the final production version, we embedded scheme knowledge directly in the LLM prompt — simpler, faster, and zero file dependency.

Multilingual Response Quality. Sarvam-m sometimes returned <think> reasoning tags in responses, and occasionally generated empty replies for certain language-state combinations. We built regex-based cleanup and fallback logic to ensure every user always gets a clean, useful response.
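The cleanup described above can be as small as one compiled pattern plus a fallback. A sketch of that guard — the function and constant names are ours, and the fallback string would come from the app's language-specific messages:

```python
import re

# DOTALL so the pattern matches multi-line reasoning blocks too.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean_response(raw: str, fallback: str) -> str:
    """Strip leaked <think>...</think> reasoning blocks from an LLM reply
    and fall back to a stock answer when nothing usable remains."""
    cleaned = THINK_RE.sub("", raw or "").strip()
    return cleaned if cleaned else fallback

print(clean_response("<think>internal notes</think>PM SVANidhi fits you.",
                     "Please try again."))
# prints: PM SVANidhi fits you.
```

The non-greedy `.*?` matters: a greedy match would delete everything between the first `<think>` and the last `</think>`, including any real answer text in between.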


Accomplishments that we're proud of

  • End-to-end voice interaction in 10 Indian languages — a street vendor in Lucknow can speak Hindi, a fisherman in Chennai can speak Tamil, and both get personalized scheme recommendations with audio playback. No literacy required.

  • 0.86 AUC on credit scoring using zero CIBIL data — we proved that alternative financial signals (bills, UPI, assets, government linkage) are genuinely predictive of creditworthiness, with a +0.12 lift over traditional income-only baselines.

  • Full Databricks Lakehouse architecture — Bronze/Silver/Gold medallion with Unity Catalog, Delta Lake with Change Data Feed, MLflow experiment tracking, FAISS vector search, and Streamlit Apps — all running on a single platform.

  • 6-pillar SHAP explainability — every VishwaScore comes with a transparent breakdown showing exactly which behaviors are helping or hurting the score. This isn't a black box — it's a roadmap for financial improvement.

  • 11 verified government loan schemes mapped with eligibility criteria, loan amounts, interest rates, and collateral requirements — from PM SVANidhi (₹10K for street vendors) to Mudra Tarun (₹10L for established businesses).


What we learned

  • Alternative data works. The skepticism around "can you really score someone without credit history?" is answered by our 0.86 AUC. Bill payment patterns and UPI flow are strong signals — they just haven't been used by traditional bureaus.

  • Voice-first is non-negotiable for rural India. Text interfaces exclude the 287 million Indians who can't read English. Building ASR → LLM → TTS as a first-class pipeline (not an afterthought) changed our entire UX philosophy.

  • Databricks unifies ML and data engineering. Having Delta Lake, MLflow, Unity Catalog, and Streamlit Apps on one platform eliminated the "glue code" problem. Our Bronze-to-Dashboard pipeline has zero external infrastructure dependencies.

  • Indian AI models matter. Sarvam-m's native understanding of Hindi financial terminology (like "thela," "kishor loan," "SHG") is something GPT-4 or Claude simply cannot match. Language-native models are essential for real-world Indian deployments.


What's next for VishwaScore

  • Live UPI integration via the Account Aggregator framework to replace synthetic data with real transaction histories (with user consent).

  • Aadhaar eKYC + DigiLocker integration for verified identity signals — land records, vehicle registration, and government scheme enrollment.

  • Bank partnership pilot with a regional rural bank or NBFC to deploy VishwaScore as a pre-screening layer for Mudra and SVANidhi loans.

  • Offline-first mobile app with on-device ASR for areas with poor connectivity — cache scheme data locally, sync scores when online.

  • Financial literacy module — not just "here's a scheme," but "here's how to improve your VishwaScore by 50 points in 3 months" with actionable step-by-step guidance in the user's language.

  • Scale to 22 scheduled languages as Sarvam AI expands language support, covering 97% of India's population in their mother tongue.

VishwaScore isn't just a hackathon project — it's the financial infrastructure that 400 million Indians deserve. Every UPI payment, every electricity bill, every government scheme enrollment is a signal. We're turning those signals into trust, access, and opportunity.

Built With

  • databricks
  • gbt
  • pyspark
  • python
  • sarvam