U.S. Agricultural Water Analysis

U.S. Agricultural Water Efficiency Analysis

▎ Turn every drop into data. Turn data into decisions.

Inspiration

Agriculture accounts for roughly 80% of consumptive water use in the United States, yet the efficiency of that water varies enormously from county to county — even between neighbors with identical climates. During the early 2025 California wildfire season, the tension between drought stress, groundwater depletion, and irrigation demand became viscerally clear. We kept asking: which counties are wasting the most water, and what can actually be done about it?

The question sounded simple. The answer turned out to require stitching together a dozen federal datasets that had never been analyzed together at county scale.

What It Does

The platform answers three questions for every U.S. agricultural county:

How efficiently is water being converted to crop value?
We define the core metric as:

$$\text{Efficiency} = \frac{\text{Crop Value}}{\hat{W}}$$
where estimated water applied is:

$$\hat{W} = A_{\text{irr}} \times \frac{\text{ET}_0}{12}$$

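where A_irr is irrigated acres (NASS Census) and ET₀ is reference evapotranspiration (gridMET). Dividing by 12 converts ET₀ from inches to feet, so Ŵ is in acre-feet when ET₀ is expressed in inches.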
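As a minimal sketch of the two formulas above, the metric can be computed per county as follows. The function names and the example figures are illustrative and not taken from the project code, and the sketch assumes ET₀ has already been converted to inches, which the division by 12 suggests.

```python
def estimated_water_applied(irrigated_acres, et0_inches):
    """W_hat = A_irr * ET0 / 12, in acre-feet if ET0 is in inches (assumed)."""
    return irrigated_acres * et0_inches / 12.0


def water_use_efficiency(crop_value_usd, irrigated_acres, et0_inches):
    """Efficiency = crop value / W_hat, in dollars per acre-foot."""
    w_hat = estimated_water_applied(irrigated_acres, et0_inches)
    if w_hat <= 0:
        # No irrigated acreage (or no ET0 signal): the metric is undefined.
        return None
    return crop_value_usd / w_hat


# Made-up county: $12M of crops, 10,000 irrigated acres, 48 in of ET0.
example = water_use_efficiency(12_000_000, 10_000, 48.0)
```

Returning `None` for counties with no irrigated acreage is one possible design choice; it keeps them out of the ranking instead of dividing by zero.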

Why is efficiency low — climate, soil, or human choices?
A Random Forest attributes feature importance to three buckets (climate, soil, and human factors) using Mean Decrease in Impurity (MDI). A Doubly Robust causal learner then isolates the Average Treatment Effect (ATE) of adopting center-pivot irrigation:

$$\hat{\tau}_{\text{DR}} = \mathbb{E}\left[\hat{\mu}_1(X) - \hat{\mu}_0(X) + \frac{T(Y - \hat{\mu}_1(X))}{\hat{e}(X)} - \frac{(1-T)(Y - \hat{\mu}_0(X))}{1 - \hat{e}(X)}\right]$$
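In this estimator, T is the treatment indicator (center-pivot adoption), Y is the outcome, μ̂₁ and μ̂₀ are the fitted outcome models under treatment and control, and ê is the fitted propensity score. The sketch below evaluates the formula directly from pre-computed nuisance predictions. It is not the EconML DRLearner used in the pipeline, which also handles model fitting and cross-fitting; the propensity clipping is our own assumption, added to avoid dividing by values near 0 or 1.

```python
def doubly_robust_ate(y, t, mu1, mu0, e, clip=0.01):
    """Sample average of the doubly robust (AIPW) score.

    y: outcomes, t: 0/1 treatment flags, mu1/mu0: predicted outcomes
    under treatment/control, e: predicted propensity scores.
    clip is an assumed safeguard, not part of the formula itself.
    """
    if not y:
        raise ValueError("need at least one observation")
    total = 0.0
    for yi, ti, m1, m0, ei in zip(y, t, mu1, mu0, e):
        ei = min(max(ei, clip), 1.0 - clip)  # keep the weights bounded
        total += (
            (m1 - m0)
            + ti * (yi - m1) / ei
            - (1 - ti) * (yi - m0) / (1.0 - ei)
        )
    return total / len(y)
```

When the outcome models fit perfectly, the two correction terms vanish and the estimate reduces to the mean of μ̂₁ − μ̂₀. When the outcome models are uninformative, the correction terms carry the estimate through inverse propensity weighting.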

Which counties are the best intervention targets?
Three policy insight layers — Low-Hanging Fruit, Virtual Water Export, and Dual Exposure — surface the highest-ROI counties on an interactive map with Gemini-powered natural language explanations.

How We Built It

Data Pipeline

We built a parallel ingestion layer that pulls from nine federal data sources into Google Cloud Storage, then constructs a county-level wide table (~2,600 counties × 30+ features):

USDA NASS: Crop production, irrigated acres, farm size, operator tenure
gridMET: ET₀, precipitation, precipitation deficit
SSURGO: Soil AWC, clay %, organic matter
FEMA NRI: Drought risk score, flood risk score
BLS: County unemployment rate
BEA: Farm proprietor net income
USDA RMA: Crop insurance loss ratio
Google Earth Engine: Center-pivot irrigation footprint
U.S. Census: Population, poverty rate, median income, education

ML Analysis Stack

The analysis runs as a sequential pipeline (run_analysis.py):

02_eda → 03_efficiency → 04_causal → 05_shap → 06_cluster → 07_insights → 08_subgroup

03: Random Forest MDI + LassoCV to decompose climate / soil / human factor contributions
04: DRLearner (EconML) for causal ATE/CATE estimation of irrigation technology adoption
05: SHAP beeswarm to surface directional feature effects on water use efficiency (WUE)
06: K-Means clustering of counties into agronomic archetypes
07: Rule-based policy flagging for three insight categories
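The three-bucket decomposition in step 03 can be sketched as a simple aggregation of per-feature importances. The feature names and bucket mapping below are hypothetical placeholders, and the importances are assumed to come from a fitted forest (for example scikit-learn's `feature_importances_` attribute, which reports MDI).

```python
from collections import defaultdict

# Hypothetical mapping from feature column to bucket.
FEATURE_BUCKETS = {
    "et0": "climate",
    "precip_deficit": "climate",
    "soil_awc": "soil",
    "clay_pct": "soil",
    "crop_hhi": "human",
    "center_pivot_share": "human",
}


def bucket_importances(importances, buckets=FEATURE_BUCKETS):
    """Sum per-feature importances by bucket and renormalize to 1."""
    totals = defaultdict(float)
    for feature, score in importances.items():
        totals[buckets[feature]] += score
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("importances must sum to a positive value")
    return {bucket: value / grand_total for bucket, value in totals.items()}
```

Note that MDI tends to favor high-cardinality and correlated features, so bucket shares computed this way are best read as rough attributions.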

Backend & Frontend

A FastAPI backend serves pre-computed JSON to a map frontend. A /simulate endpoint lets users adjust feature values and see predicted WUE change. A /explain endpoint passes county data to Gemini to generate plain-language policy briefs.
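The core of a /simulate-style handler could look like the framework-agnostic sketch below. The function name, the response keys, and the `predict` callable are all assumptions made for illustration; the project's actual request and response schema is not shown here.

```python
def simulate_wue_change(predict, baseline_features, overrides):
    """Compare predicted WUE before and after user-adjusted feature values.

    predict: any callable mapping a feature dict to a predicted WUE
    (a stand-in for the trained model; hypothetical interface).
    """
    unknown = set(overrides) - set(baseline_features)
    if unknown:
        raise KeyError(f"unknown features: {sorted(unknown)}")
    adjusted = {**baseline_features, **overrides}
    baseline_wue = predict(baseline_features)
    simulated_wue = predict(adjusted)
    return {
        "baseline_wue": baseline_wue,
        "simulated_wue": simulated_wue,
        "delta": simulated_wue - baseline_wue,
    }
```

In a FastAPI app, a function like this would typically be called from a POST route that validates the overrides with a Pydantic model before passing them through.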

Challenges We Ran Into

Messy, heterogeneous federal data. NASS data ships in at least four different pipe-delimited and JSON formats depending on year and query type. BLS unemployment data changed its series format mid-pipeline. We wrote separate parsers for every source and spent significant time on fallback logic when primary sources returned empty blobs.
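A minimal sketch of this kind of format sniffing and fallback logic is shown below. It handles only two of the shapes mentioned (a JSON payload and a pipe-delimited table with a header row); the `"data"` wrapper key and the function names are assumptions for illustration, not the project's actual parsers.

```python
import csv
import io
import json


def parse_records(payload):
    """Parse a JSON or pipe-delimited text payload into a list of dicts."""
    text = payload.strip()
    if not text:
        return []  # empty blob: let the caller try the next source
    if text[0] in "[{":
        data = json.loads(text)
        if isinstance(data, dict):
            # Assumed wrapper shape: {"data": [...]}
            data = data.get("data", [])
        return [dict(row) for row in data]
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [dict(row) for row in reader]


def first_nonempty(fetchers):
    """Try each fetcher in order and return the first non-empty parse."""
    for fetch in fetchers:
        records = parse_records(fetch())
        if records:
            return records
    return []
```

Each fetcher is a zero-argument callable returning raw text, so a primary API call and a cached bulk file can be chained without the parser knowing which one succeeded.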

The WUE denominator is unobservable. No federal dataset directly measures how much water a county applied. We had to derive Ŵ from irrigated acreage × ET₀, which introduces systematic error in counties with high groundwater use (where actual applications exceed crop demand). The metric is best interpreted as a relative ranking across similar climate zones.

Causal identification is hard at county scale. Counties that adopt center-pivot irrigation differ from non-adopters in dozens of confounding ways. We chose DRLearner specifically because it is doubly robust (consistent if either the outcome model or the propensity model is correctly specified), but we could not fully rule out remaining omitted variable bias (e.g., aquifer depth, local water pricing).

SHAP at county scale is slow. Running TreeSHAP over 2,600 counties with 500-tree forests required careful batching and pre-computation; doing this interactively in a browser was not feasible.

Accomplishments That We're Proud Of

Successfully joined nine independent federal datasets into a single analysis-ready county table, something that, to our knowledge, has not been published as an open pipeline.
The causal model found a statistically significant ATE for center-pivot adoption even after controlling for climate and soil — suggesting the efficiency gap is partly a policy problem, not just a geography problem.
The Virtual Water Export insight layer flags arid counties that are effectively exporting their scarce water in the form of water-intensive crops — a risk that is invisible in standard agricultural statistics.

What We Learned

Federal open data is rich but fragmented. The raw signal for county-level water efficiency is all publicly available; the barrier is integration, not access.
Causal ML requires humility. It is tempting to report a clean ATE number. The real lesson is understanding why the estimate might be wrong and what identifying assumptions are being leaned on.
Visualization drives insight. The SHAP beeswarm and the three-bucket MDI pie chart communicated more to non-technical stakeholders than any regression table.
The Herfindahl-Hirschman Index for crop diversity, HHI = Σ(sᵢ²) where sᵢ is crop i's revenue share, turns out to be one of the strongest human-factor predictors of WUE: monocultures are consistently less water-efficient.
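For reference, the crop-diversity HHI defined above can be computed from per-crop revenue as in this small sketch. The function name and the example revenues are illustrative only.

```python
def crop_hhi(revenue_by_crop):
    """HHI = sum of squared revenue shares; 1.0 means a pure monoculture."""
    total = sum(revenue_by_crop.values())
    if total <= 0:
        raise ValueError("total crop revenue must be positive")
    return sum((revenue / total) ** 2 for revenue in revenue_by_crop.values())


# Made-up counties: one grows a single crop, one splits revenue four ways.
monoculture = crop_hhi({"corn": 100.0})
diversified = crop_hhi({"corn": 25.0, "soy": 25.0, "wheat": 25.0, "alfalfa": 25.0})
```

With n crops of equal revenue the index equals 1/n, so lower values indicate a more diversified crop mix.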

What's Next for U.S. Agricultural Water Analysis

Temporal dimension: extend from a single cross-section (2022) to a panel (2012–2022) to enable difference-in-differences estimation of policy interventions.
Groundwater integration: add USGS NWIS well-level data to correct the Ŵ denominator in over-drafted aquifer regions (e.g., High Plains).
County-to-farm bridge: aggregate FSA CLU parcel data to bring the analysis down from county to individual farm — the policy levers are at that scale.
Water pricing signal: incorporate state water rights transaction data to test whether markets improve or distort WUE.
Direct integration with Zerve for reproducible, shareable analysis notebooks that policymakers and extension agents can run without code.

Built With

fastapi
github
google-cloud
html
python
uvicorn

Updates

Echo Xiao started this project — Apr 29, 2026 01:31 PM EDT
