Inspiration

The hackathon dataset gives you ~10,000 healthcare facility records across India — free-text descriptions full of claimed capabilities, missing pincodes, duplicated specialties, and trust signals buried in unstructured JSON arrays. A non-technical coordinator can't make a defensible referral from that. We wanted to build a tool that says "send the patient here, here's the evidence, and here's what we're unsure about" — not just a list.

What it does

Type a need + location — "dialysis near Jaipur", "emergency surgery near Patna" — and MatchCare returns a ranked shortlist where every facility shows:

  • A trust signal (strong / partial / weak / suspicious / no-evidence) pre-computed from source breadth, accreditation mentions, and capability text
  • A cited evidence snippet — the exact text from the dataset that drove the match
  • Honest uncertainty — pincode confidence scores, a "What's missing" section, and per-claim verification percentages so weak matches are never presented as fact
  • The ability to save shortlists, submit corrections, and record review decisions via Delta-backed persistence

How we built it

Stack: Databricks Apps (Free Edition) · FastAPI · React (TanStack Router, Tailwind CSS) · Databricks SQL Warehouse · Delta Lake · Unity Catalog · Meta Llama 3.3 70B (via Databricks Model Serving) · databricks-gte-large-en (embedding model, optional).

Data pipeline (Bronze → Silver → Gold):

  • Bronze (10,088 rows): raw FDR data from the Virtue Foundation Marketplace dataset
  • Silver (9,964 rows): NULL-byte scrub, empty-string → SQL NULL, malformed JSON array repair ("null" / "[]" strings), India bounding-box coordinate filter, dedup by unique_id, capacity/numberDoctors type coercion, pincode dedup against the India Post directory, NFHS-5 district health indicators cast to numeric
  • Gold (9,953 rows): 26 attribute flags extracted from free text via regex (PM-JAY, NABH, 24x7, ambulance, telemedicine, charity care, government/private/NGO, 9 language flags, and more), pre-computed trust signals and trust rank, capability index (278K rows), desert scores (706 district-level rows), and geo-resolved pincode/state/district cross-referenced against India Post first-digit zones (582 facilities flagged for review)
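
The Silver scrubbing and Gold flag extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real pipeline runs in SQL, the bounding-box limits below are approximate values chosen for the example, and the three patterns and their names (has_pmjay, is_24x7, has_nabh) are hypothetical stand-ins for the 26 flags.

```python
import json
import re

# Assumed, approximate India bounding box; the project's actual limits are not stated.
INDIA_LAT = (6.0, 38.0)
INDIA_LON = (68.0, 98.0)

# Hypothetical patterns standing in for the 26 regex flags.
FLAG_PATTERNS = {
    "has_pmjay": re.compile(r"\bpm[\s-]?jay\b|\bayushman\b", re.IGNORECASE),
    "is_24x7": re.compile(r"\b24\s*[x/*]\s*7\b", re.IGNORECASE),
    "has_nabh": re.compile(r"\bnabh\b", re.IGNORECASE),
}


def clean_text(value):
    """Strip NULL bytes; map empty strings and literal "null" to None."""
    if value is None:
        return None
    text = str(value).replace("\x00", "").strip()
    if text == "" or text.lower() == "null":
        return None
    return text


def parse_json_array(value):
    """Parse a JSON array stored as a string; anything unusable becomes []."""
    text = clean_text(value)
    if text is None:
        return []
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


def in_india(lat, lon):
    """Coarse bounding-box check; missing coordinates fail the check."""
    if lat is None or lon is None:
        return False
    return (INDIA_LAT[0] <= lat <= INDIA_LAT[1]
            and INDIA_LON[0] <= lon <= INDIA_LON[1])


def extract_flags(description):
    """Return one boolean per pattern for a free-text description."""
    text = clean_text(description) or ""
    return {name: bool(p.search(text)) for name, p in FLAG_PATTERNS.items()}
```

Cleaning runs before extraction on purpose: a pattern applied to text that still contains NULL bytes or literal "null" strings can miss or mis-fire, which is why the flags live in the Gold layer rather than Silver.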

Search pipeline (what actually runs per query):

  1. Keyword SQL search against facilities_gold — LIKE matching across name, description, specialties, capability, procedure, equipment — weighted and blended with distance decay and the gold table's pre-computed trust_rank
  2. Rule-based trust scoring — each result gets a trust signal derived from source count, specialty breadth, accreditation flags, and data completeness
  3. LLM evidence scoring (Llama 3.3 70B) — the top 10 results are re-scored by the LLM for per-facility confidence, with trust_signal and evidence_summary overrides. Falls back silently to rule-based scores if the LLM is slow or unavailable
  4. LLM re-rank — final ordering by LLM confidence so the strongest-evidence facilities surface first
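
The blend in step 1 and the silent fallback in steps 3 and 4 could look roughly like the sketch below. The weights, the 50 km half-distance, the assumption that keyword_score and trust_rank are normalized to 0..1 with higher meaning better, and the llm_scorer callable are all illustrative; none of them are taken from the MatchCare code.

```python
import math


def blended_score(keyword_score, distance_km, trust_rank,
                  half_distance_km=50.0, weights=(0.5, 0.3, 0.2)):
    """Blend keyword match, distance decay, and trust into one score.

    Distance decay is exponential: the distance term halves every
    half_distance_km kilometres (an assumed shape, not the project's).
    """
    decay = math.exp(-math.log(2) * distance_km / half_distance_km)
    w_kw, w_dist, w_trust = weights
    return w_kw * keyword_score + w_dist * decay + w_trust * trust_rank


def rerank_with_llm(results, llm_scorer, top_n=10):
    """Re-order the top_n results by LLM confidence.

    llm_scorer is a hypothetical callable that takes a list of result
    dicts and returns one confidence per result. Any failure (timeout,
    bad response) leaves the rule-based order in place.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    head, tail = ranked[:top_n], ranked[top_n:]
    try:
        confidences = llm_scorer(head)
        if len(confidences) != len(head):
            raise ValueError("scorer returned wrong number of confidences")
    except Exception:
        return ranked  # silent fallback to rule-based ordering
    rescored = [{**r, "llm_confidence": c} for r, c in zip(head, confidences)]
    rescored.sort(key=lambda r: r["llm_confidence"], reverse=True)
    return rescored + tail
```

Only the head of the list goes to the model, so the cost of the LLM step stays fixed regardless of how many rows the keyword search returns.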

The LLM query parser (also Llama 3.3 70B) activates for free-form single-box queries where location isn't separated out. The two-box UI form (care need + location) bypasses the parser for faster response.

Persistence: four Delta tables (shortlist, facility_overrides, facility_reviews, search_history) on the same SQL warehouse — no separate database needed.

Challenges we ran into

  • Data quality landmines: 54 misaligned bronze rows with markdown fragments in the ID column; 407 pincodes with out-of-India coordinates; "null" and "[]" stored as literal strings; ~6% of pincodes inconsistent with their claimed state
  • Schema evolution friction: Delta DROP COLUMN requires column mapping mode to be enabled, and CREATE OR REPLACE TABLE wipes table properties, so we had to re-enable it twice per ETL run
  • Latency vs. intelligence tradeoff: an earlier 5-node LangGraph pipeline with full chain-of-thought reasoning added 8–12 seconds per query. We stripped it back to keyword search + LLM evidence re-scoring to keep responses under 5 seconds on Free Edition
  • Cross-workspace portability: catalog names, warehouse permissions, and Lakebase availability differ between Free Edition and internal environments. We made the ETL parameterized, Lakebase initialization optional, and persistence Delta-only
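
The pincode-versus-state inconsistency mentioned under data quality can be caught with a first-digit zone check along these lines. The zone table reflects our understanding of the India Post first-digit scheme and should be verified against the directory before use; the function name and the three status strings are hypothetical. The check is a heuristic: enclaves such as Puducherry's Yanam sit in a different zone from their parent territory, so a mismatch is a reason to review a record, not to reject it.

```python
# First-digit postal zones, as we understand the India Post scheme (verify before use).
# Digit 9 (Army Post Office) is deliberately left out and reported as "unknown".
PIN_ZONES = {
    "1": {"delhi", "haryana", "punjab", "himachal pradesh",
          "jammu and kashmir", "ladakh", "chandigarh"},
    "2": {"uttar pradesh", "uttarakhand"},
    "3": {"rajasthan", "gujarat",
          "dadra and nagar haveli and daman and diu"},
    "4": {"maharashtra", "madhya pradesh", "chhattisgarh", "goa"},
    "5": {"andhra pradesh", "telangana", "karnataka"},
    "6": {"tamil nadu", "kerala", "puducherry", "lakshadweep"},
    "7": {"west bengal", "odisha", "assam", "arunachal pradesh",
          "nagaland", "manipur", "mizoram", "tripura", "meghalaya",
          "sikkim", "andaman and nicobar islands"},
    "8": {"bihar", "jharkhand"},
}


def pincode_state_status(pincode, state):
    """Return "consistent", "inconsistent", or "unknown" for a record."""
    if pincode is None or state is None:
        return "unknown"
    pin = str(pincode).strip()
    if len(pin) != 6 or not pin.isdigit():
        return "unknown"
    zone = PIN_ZONES.get(pin[0])
    if zone is None:
        return "unknown"
    normalized = str(state).strip().lower().replace("&", "and")
    return "consistent" if normalized in zone else "inconsistent"
```

For example, pincode_state_status("302001", "Rajasthan") returns "consistent", while the same pincode paired with "Bihar" returns "inconsistent" and would be flagged for review.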

Accomplishments that we're proud of

  • Evidence-first, not score-first. Every result cites the exact dataset text that drove the match — a judge or a real planner can verify in one click
  • Honest about what's missing. The "What's missing" section and pincode confidence scores mean the app never presents weak evidence as fact
  • 26 regex-derived attribute flags mined from messy free-text without a single LLM call — pure SQL, zero cost, runs in seconds
  • 9,953 facilities through a full medallion pipeline on Databricks Free Edition with no cluster provisioning
  • Both spec examples work: "dialysis near Jaipur" and "emergency surgery near Patna" return ranked, evidence-attached shortlists

What we learned

  • Pre-computed signals beat real-time LLM reasoning for a 3-minute demo on a cold warehouse. The 26 regex flags and pre-built trust scores deliver more value per millisecond than chain-of-thought generation
  • Show your work. The "What's missing" section turned a black-box ranker into something a coordinator can trust and challenge
  • Free-text is signal-rich but brittle. Regex extraction only works after multiple data-cleaning passes — NULL bytes, literal "null" strings, and malformed JSON all had to be scrubbed first
  • Bronze → Silver → Gold pays off operationally. When the ETL crashed (capacity type mismatch, missing table stubs, column-mapping resets), we only had to re-run from the failing layer, not from scratch

What's next for MatchCare

  • Wire Save / Pick / Suggest correction buttons in the frontend (backend + Delta tables already live)
  • Enable vector-search index on search_text for semantic matching beyond keyword overlap
  • Add care-category normalization (free-text specialties → ~15 filterable buckets like Cancer, Dialysis, Cardiac)

Medium-term (with proper resources):

  • Full agentic pipeline — reactivate the 5-node LangGraph chain-of-thought reasoning (already built on the feature/referral-copilot branch, shelved for latency on Free Edition)
  • Healthcare desert heatmap — visualize desert_scores as an interactive map layer showing high disease burden × low trusted-facility coverage per district
  • Voice + local language support — accept queries in Hindi, Tamil, Bengali, and other Indian languages via speech-to-text, return results in the user's language
  • Live facility verification — the agent autonomously calls hospital phone numbers, confirms operating hours, bed availability, and accepted insurance panels, then updates trust signals with real-time evidence
  • Cost transparency — scrape and compare procedure fees across shortlisted facilities so coordinators can factor affordability into referrals
  • Service quality scoring — aggregate patient reviews, complaint data, and outcome metrics to rate facilities beyond self-reported claims

Built With

  • databricks-apps
  • databricks-gte-large-en
  • databricks-model-serving
  • databricks-sql
  • delta-lake
  • fastapi
  • india-post-pincode-directory
  • langgraph
  • meta-llama-3.3-70b
  • openstreetmap
  • python
  • react
  • tailwind-css
  • tanstack-router
  • typescript
  • unity-catalog
  • vite