Inspiration

Our ideation process started with a core realization: farmers and agronomists do not make decisions based on statistical means; they make decisions based on risk. A traditional regression model that outputs a single yield number may be technically correct, but it is often operationally insufficient.

We designed our system around that end-user reality. Our goal was to build an “Agronomist’s Copilot”: a platform that ingests large-scale weather and yield data, generates county-level yield forecasts, and outputs quantifiable risk profiles (uncertainty bands) plus intuitive, natural-language insights that growers can act on.

Every major technical decision – from feature engineering to model architecture to dashboard outputs – was reverse-engineered from this goal: bridging complex data science with real-world agricultural decision-making.

What it does and how we built it

Domain-specific feature engineering for agriculture

Agricultural outcomes are highly non-linear relative to weather. Raw temperature and precipitation data alone are not enough, so we prioritized agronomic feature engineering from daily weather observations.

We transformed daily weather into crop-relevant indicators such as:

  • Growing Degree Days (GDD)

  • Total and monthly precipitation

  • Extreme heat day counts (e.g., days with max temperature > 95°F)

  • Heat-stress and drought interaction features during critical growing windows (especially July/August)

These features better reflect real crop stress and growth dynamics than generic weather averages.
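
As a minimal sketch of how indicators like these might be derived from daily observations, the pandas function below builds one feature row per county and year. The column names (county, date, tmax_f, tmin_f, precip_in), the 50°F base and 86°F cap for GDD (a common corn convention), and the exact form of the heat/drought interaction are assumptions for illustration, not necessarily what our pipeline uses.

```python
import pandas as pd


def seasonal_features(daily: pd.DataFrame,
                      base_f: float = 50.0,
                      cap_f: float = 86.0,
                      heat_f: float = 95.0) -> pd.DataFrame:
    """Aggregate daily weather into one feature row per (county, year).

    Expects hypothetical columns: county, date (datetime64),
    tmax_f, tmin_f, precip_in.
    """
    df = daily.copy()

    # Growing Degree Days: clip daily max/min into [base, cap],
    # average them, and subtract the base temperature.
    tmax = df["tmax_f"].clip(lower=base_f, upper=cap_f)
    tmin = df["tmin_f"].clip(lower=base_f, upper=cap_f)
    df["gdd"] = (tmax + tmin) / 2.0 - base_f

    # Extreme heat flag: daily max above the heat threshold.
    df["hot_day"] = (df["tmax_f"] > heat_f).astype(int)

    # Restrict some features to the July/August window.
    df["year"] = df["date"].dt.year
    in_window = df["date"].dt.month.isin([7, 8])
    df["precip_jul_aug"] = df["precip_in"].where(in_window, 0.0)
    df["hot_jul_aug"] = df["hot_day"].where(in_window, 0)

    out = df.groupby(["county", "year"]).agg(
        gdd_total=("gdd", "sum"),
        precip_total=("precip_in", "sum"),
        extreme_heat_days=("hot_day", "sum"),
        precip_jul_aug=("precip_jul_aug", "sum"),
        hot_days_jul_aug=("hot_jul_aug", "sum"),
    ).reset_index()

    # Illustrative heat x drought interaction: many hot days combined
    # with little rain in the window gives a large value.
    out["heat_drought_jul_aug"] = (
        out["hot_days_jul_aug"] / (1.0 + out["precip_jul_aug"])
    )
    return out
```

Monthly precipitation totals could be added the same way by grouping on the month as well.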

KD-Tree for spatial alignment

Historical weather stations and county-level yield records rarely align perfectly. To bridge that gap, we implemented a KD-Tree nearest-neighbor mapping pipeline:

  • Index weather station coordinates

  • Map stations to the nearest county centroids

  • Aggregate station-level daily weather into county-level seasonal features

This gave us an efficient and scalable way to create county-level weather features from available datasets without requiring a full GIS polygon-containment pipeline. It is a practical geospatial approximation that worked well for hackathon speed and scale.
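
The nearest-neighbor step can be sketched with SciPy's cKDTree. This version indexes the stations and queries the tree once per county centroid, so every county receives exactly one station; that mapping direction, the function names, and the conversion of latitude/longitude to 3D unit vectors (so that Euclidean nearest neighbors agree with great-circle nearest neighbors) are assumptions for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_KM = 6371.0


def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])


def nearest_station(station_lat, station_lon, county_lat, county_lon):
    """Return, for each county centroid, the index of the nearest
    station and the great-circle distance to it in kilometres."""
    tree = cKDTree(to_unit_xyz(station_lat, station_lon))
    chord, idx = tree.query(to_unit_xyz(county_lat, county_lon), k=1)
    # Convert chord length on the unit sphere to arc length on Earth.
    dist_km = 2.0 * EARTH_RADIUS_KM * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
    return idx, dist_km
```

The returned distance makes it possible to flag counties whose nearest station is too far away to be representative.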

Risk-aware ML architecture

To make the system useful for growers, we designed the modeling layer around risk-aware forecasting, not just point estimates.

Our modeling stack combines:

  • Random Forests for robust county-level prediction and practical uncertainty estimation

  • XGBoost (crop-specific models for Corn and Soybeans) for strong point-forecast accuracy

  • Agronomy-specific engineered features (GDD, heat stress, precipitation, seasonal interactions)

This architecture lets us optimize for both forecast quality and decision usefulness.

How we quantify uncertainty

We implemented uncertainty in two complementary ways:

1) Scenario uncertainty (60-day and 90-day)

We generate Optimistic / Expected / Pessimistic scenarios by perturbing weather features using historical distributions:

  • 60-day scenarios use a tighter spread (lower uncertainty)

  • 90-day scenarios use a wider spread (higher uncertainty)

This gives growers a practical “what-if” range instead of a single deterministic outcome.
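
A minimal sketch of this kind of scenario generation is shown below. It shifts two base weather features by a multiple of their historical standard deviation and then recomputes the derived interaction term from the shifted values. The feature names, the spread multipliers per horizon, and the use of a standard deviation as the summary of the historical distribution are all hypothetical choices made for illustration.

```python
# Hypothetical spread multipliers: the 90-day horizon is wider than the 60-day one.
SPREAD_BY_HORIZON = {60: 0.5, 90: 1.0}


def make_scenarios(expected: dict, hist_std: dict, horizon_days: int) -> dict:
    """Build Optimistic / Expected / Pessimistic feature sets.

    `expected` and `hist_std` hold the expected value and historical
    standard deviation of each base feature.
    """
    z = SPREAD_BY_HORIZON[horizon_days]
    # +1 means "worse for the crop": more heat, less rain.
    directions = {"optimistic": -1.0, "expected": 0.0, "pessimistic": 1.0}

    scenarios = {}
    for name, d in directions.items():
        heat = max(0.0, expected["extreme_heat_days"]
                   + d * z * hist_std["extreme_heat_days"])
        precip = max(0.0, expected["precip_jul_aug"]
                     - d * z * hist_std["precip_jul_aug"])
        scenarios[name] = {
            "extreme_heat_days": heat,
            "precip_jul_aug": precip,
            # Derived feature is recomputed from the perturbed bases
            # so the scenario row stays internally consistent.
            "heat_x_dry": heat / (1.0 + precip),
        }
    return scenarios
```

Each scenario row can then be passed to the trained model to obtain the corresponding yield estimate.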

2) Prediction intervals (80% PI)

For Random Forest models, we use the distribution of predictions across individual trees and compute the 10th and 90th percentiles as an approximate 80% prediction interval.

That means the system can communicate:

  • expected yield,

  • downside risk,

  • and upside potential,

which is much closer to how growers actually reason about decisions.
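
The per-tree percentile approach can be sketched with scikit-learn as follows. The function name and its defaults are hypothetical; the only library features relied on are the fitted forest's `estimators_` list and each tree's `predict` method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tree_percentile_interval(model: RandomForestRegressor, X,
                             lower: float = 10.0, upper: float = 90.0):
    """Return (point forecast, lower bound, upper bound) per row of X.

    Bounds are percentiles of the individual trees' predictions.
    """
    X = np.asarray(X, dtype=float)
    # Shape (n_trees, n_samples): one prediction per tree per row.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    lo, hi = np.percentile(per_tree, [lower, upper], axis=0)
    return model.predict(X), lo, hi
```

Because the spread across trees reflects disagreement between trees rather than observation noise, the resulting band is only an approximation; checking its empirical coverage on held-out years is a sensible sanity check.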

ML Model Architecture & Evaluation

We used MLflow to benchmark and compare multiple candidate models (including linear baselines, Random Forest, Gradient Boosting, XGBoost, and LightGBM) across shared feature sets.

MLflow made it easy to:

  • log parameters and metrics consistently

  • compare validation/test performance across runs

  • track experiments while iterating quickly in notebooks

  • choose models based on both accuracy and fit for our uncertainty workflow

This was a major enabler for moving fast without losing rigor.

Databricks Dashboard + End-to-End Lakehouse Workflow

To make the system useful beyond a notebook, we built the output layer around Databricks Dashboards and a Lakehouse-style workflow.

Databricks products/services we used

  • Databricks Notebooks for rapid iteration across PySpark + Python ML workflows

  • Databricks Marketplace for historical weather datasets

  • PySpark + Databricks SQL for large-scale weather aggregation and feature generation

  • Unity Catalog for managing source and derived tables

  • Delta tables for storing forecast-ready outputs and dashboard inputs

  • MLflow for experiment tracking and model benchmarking

  • Databricks Dashboards for visualizing county-level forecasts, uncertainty, and natural-language insights

Why this mattered

What made Databricks especially valuable was the end-to-end workflow continuity. We used Marketplace datasets + PySpark/Databricks SQL for large-scale weather aggregation, stored intermediate and final outputs in Delta tables under Unity Catalog, tracked model experiments in MLflow, and surfaced county-level forecasts and uncertainty directly in Databricks Dashboards.

That let us build a true Lakehouse pipeline, from raw weather data to grower-facing risk insights, without constantly moving data between platforms or switching tools.

Challenges we ran into

1) County-weather alignment was not plug-and-play

Our workspace had strong weather and yield data, but not a pre-built county centroid mapping table. We had to create a geospatial alignment layer ourselves (KD-Tree + county centroids) to bridge weather stations and county yields.

2) Feature consistency during scenario simulation

When perturbing weather features for forecast scenarios, derived interaction features (e.g., heat × precipitation terms) can become inconsistent if they are not recomputed. We identified and fixed this so scenario inputs remained internally coherent.

Accomplishments that we’re proud of

  • Built a county-level crop yield forecasting pipeline with real weather integration

  • Designed a risk-aware forecasting workflow (60/90-day scenarios + 80% prediction intervals)

  • Produced crop-specific forecasts (Corn and Soybeans) and county-level outputs for maps

  • Created a dashboard-ready data product, not just a notebook experiment

  • Added natural-language summaries so non-technical users can understand forecast risk quickly

  • Used MLflow + Databricks to benchmark models in a structured and repeatable way

What we learned

  • In agriculture, feature engineering and data alignment matter as much as model choice

  • The most useful ML systems for practitioners are often decision-support systems, not just prediction engines

  • Uncertainty communication (scenario ranges + intervals) dramatically increases the value of a forecast

  • Building on Databricks accelerated our workflow by letting us move smoothly between distributed processing, ML experimentation, and dashboard delivery

What’s next for AI-Powered Crop Yield and Risk Dashboard

1) More realistic weather scenario generation

Our current scenario engine uses historically informed perturbations. Next, we want to make this more county-specific and forecast-source-aware (e.g., probabilistic weather ensemble inputs).

2) Stronger agronomic risk triggers

We want to expand the natural-language trigger system so it can generate more actionable county/crop-specific alerts (e.g., heat stress + low precipitation combinations during key windows).

3) Better geospatial weather assignment

We used an efficient nearest-neighbor approach for hackathon speed. A future version could incorporate more advanced spatial weighting or polygon-based assignment.

4) Grower-facing recommendations layer

Beyond forecasting risk, the next step is decision support: surfacing mitigation actions, planning recommendations, and region-specific agronomic guidance based on forecast uncertainty.
