Inspiration

600 million Indians live in districts where healthcare supply doesn't match health need. But existing gap analyses hide data quality issues, treat noisy data as ground truth, and give planners overconfident recommendations. We wanted to build a tool that says "we're not sure" when the data is weak — because in healthcare planning, a confident wrong answer is worse than an honest uncertainty flag.

What it does

  • Scores every Indian district using a per-capita gap metric: health need (NFHS-5 z-scored composite) minus facility supply per 100K people
  • Displays an interactive map of 706 districts colored by care-gap severity
  • Provides confidence flags (🟢 measured / 🟡 real but sparse / 🔴 low confidence) on every district
  • Surfaces the actual facility records (name, capability, specialties) behind each score — not just numbers
  • Includes an AI Analyst (Llama 3.3 70B via Databricks Foundation Models) that answers natural-language questions with cited evidence
  • Persists planner actions (shortlists, notes) for workflow continuity
  • Audits its own data quality in a Data Readiness tab

How we built it

  • Data: Virtue Foundation FDR facilities dataset (6,663 records) + NFHS-5 district health indicators (706 districts, 12 indicators) + Census 2011 population + India Post pincode directory for geocoding
  • Scoring: Z-scored need composite minus per-capita supply, with scipy cKDTree for facility-to-district geocoding
  • Stack: Python, Streamlit, Plotly, Pandas, PyArrow
  • AI Agent: Databricks Foundation Model serving (databricks-meta-llama-3-3-70b-instruct) called via REST API with managed identity auth
  • Data layer: Databricks Unity Catalog for live facility queries, pre-computed parquets for district gaps
  • Deployment: Databricks Apps (zero-infrastructure managed hosting)
  • Persistence: SQLite (portable) with Lakebase/Postgres-ready upsert layer

Challenges we ran into

  • Facility dataset has messy address_stateOrRegion values (cities mixed with states) — solved with geocoding by coordinates instead of text matching
  • 14.5MB facility parquet exceeded Databricks workspace 10MB file limit — pivoted to live Unity Catalog SQL queries at runtime
  • Databricks Apps CSP blocks external map tile CDNs — switched from interactive MapLibre to Plotly's built-in scatter_geo
  • SDK's serving_endpoints.query() had a serialization bug with dict messages — bypassed with direct REST API calls using SDK-managed auth
  • 67 post-2011 split districts have no Census population — rather than guessing, we flag them separately and rank by need only

Accomplishments we're proud of

  • Honest uncertainty: Every district has a confidence flag; we never present weak evidence as fact
  • Every claim is cited: Drill into any score and see the NFHS indicators + facility records behind it
  • 61 million people identified in the 40 worst-scored districts — actionable for planners
  • AI grounded in data: The LLM agent gets 160+ real facility records as context, not just vibes
  • Full data audit: The app scores its own datasets for completeness and known biases

What we learned

  • Data quality IS the product in healthcare planning — surfacing uncertainty builds more trust than hiding it
  • Pre-computing scores in parquet + live-querying details from Unity Catalog is the right split for performance
  • Databricks Apps managed identity auth requires REST API calls (SDK has edge cases)
  • Per-capita normalization is essential — without it, large states always "win" the gap ranking regardless of actual density

What's next

  • Add public PHC/CHC facility data (government sources) for a complete supply picture
  • State-level GeoJSON choropleth for visual policy reports
  • Multi-scenario planning: "What if we add 5 facilities to district X?"
  • Integration with Databricks Workflows for automated monthly data refresh

Built With

Share this project:

Updates