# Inspiration

A healthcare planner, NGO coordinator, or analyst is handed 10,000 scraped, uneven healthcare-facility records. Half the "capabilities" are unverified claims, key fields are sparse (capacity is present in ~25% of rows, doctor counts in ~36%), and some values are physically impossible. The questions they actually need answered are simple to ask and hard to answer responsibly:

  • Can this facility really do what it claims?
  • Where are the real care gaps — not just the places we happened to measure poorly?
  • Where should this specific patient go right now?
  • And before anyone spends money: will building more facilities actually fix the health outcome, or is that just a correlation?

You cannot act on hope. You need evidence behind every claim, an honest confidence number, and the discipline to tell a correlation from a cause. That is facilitiesHelp.io.

Every number in the app is computed live from the complete dataset — no sampling, no placeholders, no fabricated statistics — and every claim cites the facility's own text.

# What it does — all 4 tracks + 2 bonus tracks, in one app

The hackathon offered four tracks and teams could pick just one. facilitiesHelp.io delivers all four, plus two bonus tracks, in a single coherent app where every answer carries both its evidence and its uncertainty.

## Track 1 — Facility Trust Desk (Can this facility do what it claims?)

Grades every specialty a facility lists — all 2,580 distinct specialties, 117,993 graded claims in total — as STRONG, PARTIAL, WEAK / SUSPICIOUS, or CLAIMED. Each claim shows a 0–100 confidence score, the ground-truth signals that produced the grade, the cited source text and links, a SHAP-style evidence attribution bar, an evidence→verdict graph, and a human override saved to an audit log (app_user_actions).

## Track 2 — Medical Desert Planner (Where are the real gaps, and how sure are we?)

Aggregates trust-weighted evidence to the district level across all 2,518 capabilities and, crucially, separates a real care gap from a data-poor district:

  • DATA-POOR — fewer than 3 facilities found: we don't know, and we say so.
  • APPARENT CARE GAP — facilities exist but fewer than 2 carry STRONG evidence.
  • EVIDENCED SUPPLY — enough corroborated capacity.
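The three labels above reduce to two threshold checks. A minimal sketch, assuming the inputs are plain counts per (district, capability) cell; the function and constant names are hypothetical, and the app's trust weighting is not modelled here:

```python
# Thresholds taken from the three categories listed above.
MIN_FACILITIES = 3  # fewer than this: we do not know enough to judge
MIN_STRONG = 2      # fewer than this: facilities exist but evidence is thin


def classify_district(n_facilities: int, n_strong: int) -> str:
    """Label one (district, capability) cell.

    n_facilities: facilities found in the district for this capability.
    n_strong:     how many of them carry a STRONG trust grade.
    """
    if n_facilities < MIN_FACILITIES:
        return "DATA-POOR"
    if n_strong < MIN_STRONG:
        return "APPARENT CARE GAP"
    return "EVIDENCED SUPPLY"
```

Checking the data-poor condition first is the point of the design: a district with two facilities is never reported as a care gap, only as unknown.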

It includes a facilities table (capacity · doctors · equipment), an interactive map, and a causal layer that asks the honest planning question: will building more facilities actually move the outcome?

## Track 3 — Referral Copilot (Where should a patient go?)

A non-technical person doesn't need to know clinical terms. They can reach the right care three ways:

  • Describe it in plain words — "left foot hurting" → Orthopedic Surgery, "eye paining" → Ophthalmology (tested across 20 common complaints).
  • Upload a photo — a clinic board, a prescription, or even a photo of the condition itself (an infected eye, a swollen leg). Our multimodal Llama 4 Maverick model on Databricks Model Serving reads the image and routes to the right specialty — using the same serving endpoint as the text agents, no extra service.
  • Or pick directly from any of the 2,580 specialties.

A query such as "dialysis near Guntur" is then geocoded against the 165,627-office India-Post directory, which covers any Indian city or district (not a hardcoded list). The app returns an evidence-attached shortlist ranked by great-circle (haversine) distance, each entry with a one-tap Google Maps Directions link. It adds two things most referral tools don't:

  • "You may also need": related capabilities ranked by facility co-occurrence P(B|A) (e.g., facilities offering Cardiology also offer Internal Medicine ~93% of the time).
  • Care pathway: start from a health concern (e.g., child stunting), get routed to the right specialist nearby, then see statistically related care as a clearly-labelled "maybe", driven by NFHS district correlations (stunting travels with child anaemia, r = +0.33, and with wasting, r = +0.25).
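The distance ranking behind the shortlist can be sketched with the standard haversine formula. This is an illustration only: the function names and the `name`/`lat`/`lon` keys are hypothetical, and the coordinates in any call are whatever the geocoder returns.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(origin, facilities):
    """Sort facilities (dicts with hypothetical keys name, lat, lon) nearest first.

    origin is a (lat, lon) tuple, e.g. the geocoded city or district.
    """
    lat0, lon0 = origin
    return sorted(facilities, key=lambda f: haversine_km(lat0, lon0, f["lat"], f["lon"]))
```

Haversine gives straight-line distance over the sphere, not road distance, which is why each result still links out to map directions.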

Every clinical suggestion carries a caveat: "population correlation ≠ individual diagnosis; not a medical diagnosis."
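The "You may also need" ranking rests on a conditional frequency: of the facilities that list A, what share also list B. A small sketch, assuming each facility is represented as a set of specialty names; the function names and toy data are hypothetical and do not reproduce the gold_specialty_cooccur table:

```python
from collections import Counter
from itertools import permutations


def cooccurrence(facility_specialties):
    """Return {(A, B): P(B|A)} estimated from an iterable of specialty sets."""
    count_a = Counter()   # facilities listing A
    count_ab = Counter()  # facilities listing both A and B (ordered pair)
    for specs in facility_specialties:
        specs = set(specs)
        count_a.update(specs)
        count_ab.update(permutations(sorted(specs), 2))
    return {(a, b): n / count_a[a] for (a, b), n in count_ab.items()}


def also_need(a, probs, top=3):
    """Related specialties for A, highest P(B|A) first."""
    ranked = sorted(((p, b) for (x, b), p in probs.items() if x == a), reverse=True)
    return [(b, p) for p, b in ranked[:top]]


# Toy data, invented for illustration only.
toy = [
    {"Cardiology", "Internal Medicine"},
    {"Cardiology", "Internal Medicine"},
    {"Cardiology"},
    {"Internal Medicine", "Dermatology"},
]
probs = cooccurrence(toy)
```

Note that P(B|A) is asymmetric: in the toy data every Dermatology facility also lists Internal Medicine, but only a third of Internal Medicine facilities list Dermatology.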

## Track 4 — Data Readiness Desk (What to fix before planning?)

Surfaces the structural problems that would silently corrupt any analysis — impossible values, internal contradictions, capacity recorded with zero doctors, and column-misaligned rows — ranked by review leverage. For any flagged record it shows the specific evidence and the likely root cause (e.g., column misalignment during scraping).
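Checks of this kind are usually simple rules applied per record. The sketch below is an assumption about what such rules could look like, not the app's actual rule set; the field names (`capacity`, `doctors`, `lat`, `lon`) and the flag wording are hypothetical.

```python
def readiness_flags(record):
    """Return human-readable flags for one facility record (a dict).

    Missing values (None) are never flagged as wrong; they are simply not reported.
    """
    flags = []
    capacity = record.get("capacity")
    doctors = record.get("doctors")
    lat, lon = record.get("lat"), record.get("lon")

    if capacity is not None and capacity < 0:
        flags.append("impossible value: negative capacity")
    if doctors is not None and doctors < 0:
        flags.append("impossible value: negative doctor count")
    if capacity is not None and capacity > 0 and doctors == 0:
        flags.append("contradiction: capacity recorded with zero doctors")
    if lat is not None and not -90 <= lat <= 90:
        flags.append("impossible value: latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        flags.append("impossible value: longitude out of range")
    return flags
```

Each flag names the specific evidence, so a reviewer can see why the record was surfaced rather than just that it was.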

## Bonus — Ask the Data

One assistant for everything. It answers in plain language with cited evidence and an honest confidence level, renders the relevant chart, and — on demand — lets native Databricks AI/BI Genie write and run the SQL over the gold tables. It is grounded in the gold tables and refuses to invent numbers when it has no data. For real-world reach it answers in 11 languages (English, Hindi, Telugu, Tamil, Bengali, Marathi, Kannada, Gujarati, Spanish, French, Arabic) and can read the answer aloud (text-to-speech).

## Bonus — Data Science Lab

Interactive EDA across all three datasets, plus an interactive d3.js causal-graph studio: drag the nodes, switch between six causal modules (the full NFHS map, the wealth-confounding fork, the trust evidence→verdict graph, the care-pathway co-occurrence graph, supply-vs-demand, and the maternal/adolescent chain), and hover any edge to see the evidence behind it.

# The trust engine — how a grade is actually assigned

The rubric is deterministic and fully auditable — it reads only the facility's own record:

  • STRONG = a matching clinical specialty code AND equipment/procedure evidence in the free text AND a second independent source.
  • PARTIAL = exactly one type of evidence present.
  • WEAK / SUSPICIOUS = the claim contradicts the facility type (e.g., a small clinic claiming ICU/NICU).
  • CLAIMED = the capability appears only in a structured field, with no corroboration.
  • Confidence (0–100) = min(95, 70 + 5 × number of independent sources) — read as a probability the claim is genuine, not a guarantee.
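The rubric can be written as a small pure function. This is a sketch under stated assumptions, not the production code: the argument names are hypothetical, the contradiction check is assumed to take precedence, and a claim with two of the three signals is treated as PARTIAL because the rubric above does not say how that case is graded.

```python
def confidence(n_independent_sources: int) -> int:
    """Confidence score from the formula above: min(95, 70 + 5 x sources)."""
    return min(95, 70 + 5 * n_independent_sources)


def grade(has_specialty_code: bool,
          has_text_evidence: bool,
          has_second_source: bool,
          contradicts_facility_type: bool = False) -> str:
    """Assign a trust grade to one claimed capability."""
    if contradicts_facility_type:        # assumption: checked before anything else
        return "WEAK / SUSPICIOUS"
    signals = sum([has_specialty_code, has_text_evidence, has_second_source])
    if signals == 3:
        return "STRONG"
    if signals == 0:                     # listed, but nothing corroborates it
        return "CLAIMED"
    return "PARTIAL"                     # assumption: covers both 1 and 2 signals
```

As written, the confidence formula yields values between 70 and 95, and it is capped so that no number of sources produces a claim of certainty.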

We validated that this grade reflects evidence, not reputation: a logistic model reproduces the grade from its evidence signals (5-fold AUC = 1.00), while a model predicting STRONG from facility metadata alone (size, doctor count, web presence) scores only AUC ≈ 0.57 — barely above chance. The grade is earned by cited evidence, not by hospital fame.

# The causal layer — the part most teams skip

On 706 NFHS-5 districts we ran the full causal ladder rather than stopping at correlation: PC structure-learning → a Bayesian network → multi-method effect estimation (OLS with state fixed-effects, Double-ML, propensity-score matching) → E-value sensitivity analysis, plus statistical ML (penalized regression, multilevel models, GAMs, quantile regression, conformal prediction) and geometric deep learning (a spatial graph neural network).
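For readers unfamiliar with the last rung, the E-value asks how strong an unmeasured confounder would have to be to explain away an estimate. A minimal sketch of the standard risk-ratio form (VanderWeele and Ding); the text above does not state which effect scale the analysis used, so the risk-ratio input is an assumption here:

```python
import math


def e_value(rr: float) -> float:
    """E-value for a point estimate expressed as a risk ratio.

    It is the minimum risk-ratio association an unmeasured confounder would need
    with both exposure and outcome to fully explain away the estimate.
    """
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:           # protective effects: work with the inverse
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))
```

A risk ratio of 1 (no effect) has an E-value of 1, and the E-value grows with the size of the estimate, so larger values indicate results that are harder to attribute to hidden confounding.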

The honest headline:

  • Sanitation ↔ child stunting looks strong (r = −0.51) but collapses to ≈ 0 once we adjust for household wealth — confounded, not causal.
  • Female schooling → less child marriage (−0.65) survives adjustment — likely causal.
  • ANC4 antenatal visits → institutional birth (+0.60) survives — likely causal.
  • Women's overweight → high blood pressure (+0.57) strengthens under adjustment, so it is likely causal.
  • Facility count is only weakly linked to health outcomes after adjustment — so "build more" is often not the right lever; demand-side levers (schooling, antenatal care) frequently matter more.
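The sanitation result is the classic confounding fork: one variable drives both sides. The sketch below shows the mechanism on synthetic data only (it does not use or reproduce the NFHS figures). Wealth raises sanitation and lowers stunting, the two have no direct link, and a partial correlation that adjusts for wealth removes the apparent association. All names are illustrative.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, z):
    """Residuals of y after a simple least-squares fit on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    slope = sum((a - mz) * (b - my) for a, b in zip(z, y)) / sum((a - mz) ** 2 for a in z)
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    return pearson(residuals(x, z), residuals(y, z))


# Synthetic fork: wealth -> sanitation, wealth -> stunting, no direct edge.
rng = random.Random(0)
wealth = [rng.gauss(0, 1) for _ in range(5000)]
sanitation = [w + rng.gauss(0, 1) for w in wealth]
stunting = [-w + rng.gauss(0, 1) for w in wealth]

raw = pearson(sanitation, stunting)                  # clearly negative by construction
adjusted = partial_corr(sanitation, stunting, wealth)  # close to zero
```

The same check applied to a relationship with a real direct edge would leave a substantial adjusted correlation, which is the pattern described for the schooling and antenatal-care results.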

# Solving the real supply–demand problem

Healthcare is a supply-and-demand problem: supply is the facilities and what they can actually do; demand is the population's real health needs. facilitiesHelp.io matches them honestly in three steps: (1) measure real supply, not claimed supply (the ML-validated trust grade); (2) find real demand gaps, not measurement gaps (DATA-POOR vs APPARENT CARE GAP); and (3) test whether adding supply actually helps (the causal layer, which finds facility count only weakly linked to outcomes). The answer is never just "build here." It is: here is where trusted care really exists, here are the real gaps, and here is whether more supply will move the outcome, so money goes to evidence, not hope.

# How we built it

Everything runs on Databricks Free Edition.

  • Data / medallion: a Unity Catalog medallion (bronze → silver → gold) on a serverless SQL Warehouse. Silver deduplicates facilities by cluster_id, assembles the evidence text, counts independent source types, and joins facilities → PIN → district.
  • Gold (the decision layer): gold_facility_capability_trust, gold_facility_specialty (2,580 specialties, 117,993 graded claims), gold_facility_contact, gold_all_gaps / gold_district_gaps (2,518 capabilities), gold_referral, gold_specialty_cooccur, gold_condition_corr, gold_data_readiness, gold_nfhs, gold_pin_state, and app_user_actions.
  • App: a single Streamlit Databricks App serving all six tracks, reliability-first — renders instantly from cached gold queries; every AI call is on-demand.
  • AI: Model Serving (Llama 4 Maverick) powers three text agents, a grounded copilot that refuses to fabricate numbers, an 11-language translator, and — through the same endpoint — multimodal vision that reads an uploaded photo and routes it to the right specialty. AI/BI Genie answers natural language as SQL.
  • Built with the Databricks agent skills (databricks aitools).
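The silver step described in the first bullet (one record per cluster_id, assembled evidence text, a count of independent source types) can be illustrated in plain Python. The real pipeline runs on a SQL Warehouse; this sketch only shows the logic, and every field name other than cluster_id is a hypothetical stand-in.

```python
def build_silver(rows):
    """Collapse raw rows into one record per cluster_id.

    rows: iterable of dicts with keys cluster_id, source_type, text
          (source_type and text are assumed names, not the real schema).
    """
    clusters = {}
    for row in rows:
        rec = clusters.setdefault(row["cluster_id"], {"texts": [], "source_types": set()})
        if row.get("text"):
            rec["texts"].append(row["text"])
        if row.get("source_type"):
            rec["source_types"].add(row["source_type"])
    return [
        {
            "cluster_id": cid,
            "evidence_text": " | ".join(rec["texts"]),      # assembled evidence
            "n_source_types": len(rec["source_types"]),     # independent source types
        }
        for cid, rec in clusters.items()
    ]


# Invented example rows.
raw_rows = [
    {"cluster_id": 1, "source_type": "website", "text": "ICU with ventilators"},
    {"cluster_id": 1, "source_type": "directory", "text": "24h emergency"},
    {"cluster_id": 1, "source_type": "website", "text": ""},
    {"cluster_id": 2, "source_type": "directory", "text": "Dental clinic"},
]
silver = build_silver(raw_rows)
```

Counting distinct source types rather than rows matters for the trust grade: three scraped copies of the same website count as one source, not three.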

# Challenges we ran into

  • The data is genuinely messy: duplicates, misspellings like pharmacy/farmacy, impossible bed counts, corrupt GPS, sparse fields. Our principle: flag, never hide; show "not reported," never silently impute.
  • Correlation ≠ causation. The sanitation↔stunting result is the perfect trap — strong and intuitive, but confounded by wealth. Building the full causal ladder was the hardest analytical work.
  • Keeping the LLM honest. We ground every answer in the gold tables and make the assistant refuse and ask for specifics when it has no matching data.
  • Rendering an interactive causal graph inside Databricks Apps. A CDN library was blocked by the app's content-security policy, so we inlined d3.js and built a draggable force-directed layout.

# Accomplishments that we're proud of

  • All four tracks plus two bonus tracks in one coherent, reliability-first app.
  • 117,993 evidence-graded claims across 2,580 specialties — complete coverage, no sampling.
  • A validation that the grade reflects cited evidence, not hospital fame (metadata-only AUC ≈ 0.57).
  • A full causal layer on 706 NFHS districts (PC, OLS+FE, Double-ML, PSM, E-values), statistical ML (GAM, quantile, multilevel, conformal), and geometric deep learning (spatial GNN).
  • A care pathway that connects cause → related needs.
  • Multimodal & multilingual access — describe a symptom or upload a photo (vision via Llama 4 Maverick), answers in 11 languages, spoken aloud — one Databricks endpoint for text and images.
  • A fully Databricks-native, live app with every claim cited.

# What we learned

The hardest part of "for good" data work isn't prediction — it's honesty: showing the evidence, communicating uncertainty, and telling a correlation from a cause. The most valuable thing the app does is sometimes to say "we don't know" (DATA-POOR) instead of guessing.

# What's next for facilitiesHelp.io

  • Stronger, temporal causation — linking NFHS-4 (2015–16) to NFHS-5 (2019–21) turns our cross-section into a panel, enabling difference-in-differences, fixed-effects panels, and event-study designs.
  • Causal forecasting — Bayesian structural time-series to forecast outcomes and the counterfactual effect of a planned facility, plus demand forecasting.
  • Stronger identification — instrumental variables, regression-discontinuity, synthetic control.
  • Faster reads + exact citations — Lakebase (Postgres + pgvector).
  • Connectivity — FHIR interoperability, real-time gold refresh, bring-your-own-data, and low-connectivity (SMS / offline-first) delivery.
  • Sharper desert detection — 2SFCA accessibility scoring and facility-placement optimization.

# Built With

  • causal
  • computer-vision
  • databricks
  • databricks-ai-bi-genie
  • databricks-apps
  • databricks-model-serving
  • databricks-sql-warehouse
  • delta-lake
  • gnn
  • inference
  • lakebase
  • llama-4-maverick
  • llama-4-maverick-vision
  • ml
  • multimodal
  • nvidia
  • pandas
  • plotly
  • python
  • scikit-learn
  • statsmodels
  • streamlit
  • text-to-speech
  • unity-catalog