About - Devpost Submission

Inspiration

When a disaster hits, the communities that suffer most aren't always the ones closest to the epicenter. They're the ones that were already stretched thin before it happened - one overworked clinic serving thousands of people, no specialist for 50 miles, no real backup when things go wrong. A cyclone or a flood doesn't create those gaps. It just makes them impossible to ignore.

We kept coming back to a simple frustration: the data to identify these communities already exists. Health survey data, disaster records, facility registries - it's all out there. But by the time anyone has pieced it together into something actionable, the first 72 hours are already gone. That's the window that matters most, and it keeps getting wasted on manual analysis that a lakehouse could do in seconds.

That's what drove us to build this.

What It Does

The platform helps NGOs, volunteer coordinators, and government agencies figure out where to send help - before the situation becomes unrecoverable.

When a disaster scenario is loaded, it animates the event (cyclone track, heatwave grid), maps every clinic and hospital in the region by specialty, and then identifies the medical deserts: communities where there's no cardiologist, no OB/GYN, no emergency physician within 15 miles. These aren't gaps the disaster created. They were there before. The disaster just made them critical.

From there, the platform scores every candidate deployment location on three things at once - how many specialties are missing, how physically isolated the population is, and how close they are to the disaster epicenter - and produces a ranked list of the top sites for field hospitals, pop-up clinics, or volunteer teams. Each result includes a plain-language explanation a coordinator can actually use.

The underlying vulnerability scores are computed by a Databricks pipeline joining India's NFHS-5 health survey with disaster exposure data from historical records, weather feeds, and third-party sources.

How We Built It

Facility data cleansing (GAIME) The raw facility registry is messy - inconsistent specialty names, garbled addresses, duplicate entries with slightly different coordinates. We cleansed it using a variation of GAIME, our LLM-driven entity recognition engine, which normalized specialty classifications, resolved duplicate facilities, and standardized state and region names across the dataset. That cleansed output is what feeds the Delta table and drives the desert detection.

Data pipeline (Databricks + PySpark) We pull disaster history from Delta tables, weight each event type by its healthcare disruption impact (cyclones and floods hit hardest, so they're weighted 1.5x), and compute a percentile-ranked exposure score per state. In parallel, we process NFHS-5 district health indicators across four domains - maternal health, child health, non-communicable diseases, and infrastructure - and join them on state name. Districts that rank in the top quartile on both axes get flagged. Everything materializes to Delta tables.

Risk assessment (Claude Opus 4.6) Compound vulnerability scores tell you where the gaps are, but not what they mean on the ground. We use Claude Opus 4.6 to turn the raw scores into operational risk assessments - interpreting the combination of deficit domains, disaster exposure, and isolation into a prioritized briefing that a coordinator can act on without needing to understand the underlying data model.

LLM briefings (Databricks Foundation Models) The five highest-risk districts get passed to Llama 3.1 70B via MLflow deployments. We ask it for a four-sentence operational briefing per district: what the gap looks like, what the seasonal risk window is, what kind of intervention makes sense, and what a concrete ask to funders would be. Results go into a compound_vulnerability_narratives Delta table.

API + deployment (Flask + Databricks Apps) A Flask API reads from the Delta tables, runs the desert detection and site scoring on demand, and returns one JSON payload per scenario. The same codebase runs locally against CSVs for development and against Delta tables in production. Deployed as a Databricks App with databricks bundle deploy.

Scoring algorithm Each candidate site is scored as desert_severity x isolation_weight x exposure_weight. Desert severity is how many critical specialties are absent within 15 miles (sites missing fewer than two are filtered out). Isolation weight is miles to the nearest facility of any kind. Exposure weight scales inversely with distance to the disaster epicenter, floored at 1.0 so distant-but-affected areas aren't zeroed out. Final sites are picked greedily with 10-mile minimum spacing to ensure geographic spread.

Frontend (MapLibre GL JS + Protomaps) A single HTML file - no build pipeline, no CDN dependency. MapLibre GL 4.7.1 with Protomaps v4 self-hosted vector tiles. Phases animate through the disaster, facility coverage, desert overlay, and deployment sites. There's an offline fallback with embedded data if the API is unreachable.

Challenges We Ran Into

GeoJSON expects coordinates as [longitude, latitude] but almost every data source uses [latitude, longitude]. It's a silent bug - everything looks fine until you zoom out and notice the hospital pins are in the Bay of Bengal. Getting the cyclone track, facility dots, and heatwave grid all consistent took more passes than we'd like to admit.

Joining district-level health data to state-level disaster scores sounds straightforward until you discover that "Odisha" and "Orissa" and "ODISHA" are all the same state depending on which dataset you're looking at. We built a normalization step that strips, title-cases, and fuzzy-matches state names before the join.

The scoring algorithm also took real iteration to make geographically fair. The first version recommended three sites within five miles of each other because they all had the same gap profile. The greedy deduplication with 10-mile minimum spacing, combined with the impact corridor pre-filter for West Bengal, was what finally produced a result that actually distributes deployments across the region.

Accomplishments That We're Proud Of

The result we're most proud of is Gosaba.

It's the top-ranked deployment site for the West Bengal cyclone scenario: three missing critical specialties, 52 miles to the nearest facility of any kind, directly in the storm path. The algorithm landed on it from first principles using nothing but the scoring formula. Gosaba is a real island community in the Sundarbans delta that was one of the hardest-hit, hardest-to-reach areas during Cyclone Amphan in 2020. A place where a volunteer surgical team showing up would have mattered enormously.

We didn't put it there. The data did.

The other thing we're proud of is the architecture holding together. Local development runs against CSVs. Production runs against Delta tables. Same codebase, one environment variable swap. The whole thing deploys in a single command.

What We Learned

The biggest thing: compound vulnerability is a fundamentally different problem from either of its parts. A district with decent specialist coverage that gets hit by a flood is a hard situation - but the system can absorb it. A district with no specialists that gets hit by a flood has nothing to absorb it with. Pop-up clinics and volunteer teams aren't a supplement to the response. They are the response. That distinction only becomes visible when you join the health data with the disaster data, and almost no one does that routinely.

We also learned that the bottleneck isn't data availability. Everything we used is publicly available. The bottleneck is synthesis speed - getting from raw data to a ranked list of deployment sites fast enough to actually influence the first-response decision. That's exactly what a lakehouse is built for, and it was the right tool for this problem.

What's Next for DB Hackathon for Good

The Haversine distance we use for isolation is a straight line. In the Sundarbans, that's meaningless - you can't drive across a river delta. Replacing it with actual travel time using OpenStreetMap road and waterway data would make the isolation score dramatically more accurate in exactly the places where it matters most.

On the data side, we want to connect to live IMD cyclone forecasts and NDMA flood alerts so the platform loads the active scenario automatically as a disaster unfolds, not just historical ones. The pipeline infrastructure is already there - it's an event trigger away.

West Bengal and Maharashtra are two scenarios, but the compound vulnerability pipeline has already scored every district in India. Expanding to all 36 states is mostly a matter of building out the scenario configs and curating candidate settlement lists for each region.

Longer term, the deployment site output should generate a structured order - coordinates, required specialties, estimated resupply cadence - formatted for direct input into WHO logistics systems or NDMA field operations platforms. Right now it's a map pin and a sentence. It should be a field order.