Inspiration
Our ideation process started with a core realization: farmers and agronomists do not make decisions based on statistical means; they make decisions based on risk. A traditional regression model that outputs a single yield number may be technically correct, but it is often operationally insufficient.
We designed our system around that end-user reality. Our goal was to build an “Agronomist’s Copilot”: a platform that ingests large-scale weather and yield data, generates county-level yield forecasts, and outputs quantifiable risk profiles (uncertainty bands) plus intuitive, natural-language insights that growers can act on.
Every major technical decision – from feature engineering to model architecture to dashboard outputs – was reverse-engineered from this goal: bridging complex data science with real-world agricultural decision-making.
What it does and how we built it
Domain-specific feature engineering for agriculture
Agricultural outcomes are highly non-linear relative to weather. Raw temperature and precipitation data alone are not enough, so we prioritized agronomic feature engineering from daily weather observations.
We transformed daily weather into crop-relevant indicators such as:
Growing Degree Days (GDD)
Total and monthly precipitation
Extreme heat day counts (e.g., days with max temperature > 95°F)
Heat-stress and drought interaction features during critical growing windows (especially July/August)
These features better reflect real crop stress and growth dynamics than generic weather averages.
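As a rough illustration of this kind of feature engineering, here is a minimal pandas sketch. The column names (`county`, `date`, `tmax_f`, `tmin_f`, `precip_in`), the 50°F base and 86°F cap for GDD, and the `heat_drought_index` definition are assumptions made for the example, not the exact definitions used in our pipeline. Only the 95°F extreme-heat threshold and the July/August window come from the description above.

```python
import pandas as pd

def seasonal_features(daily: pd.DataFrame, base_f: float = 50.0,
                      cap_f: float = 86.0, heat_f: float = 95.0) -> pd.DataFrame:
    """Aggregate daily weather into one row of features per county and year.

    Expects hypothetical columns: county, date (datetime), tmax_f, tmin_f, precip_in.
    """
    df = daily.copy()
    # Clamp temperatures to [base, cap] before averaging (a common "modified" GDD convention).
    tmax = df["tmax_f"].clip(lower=base_f, upper=cap_f)
    tmin = df["tmin_f"].clip(lower=base_f, upper=cap_f)
    df["gdd"] = (tmax + tmin) / 2.0 - base_f
    # Extreme heat uses the raw (unclamped) daily maximum.
    df["extreme_heat_day"] = (df["tmax_f"] > heat_f).astype(int)
    df["year"] = df["date"].dt.year
    # Restrict some features to the critical July/August window.
    in_window = df["date"].dt.month.isin([7, 8])
    df["jul_aug_precip"] = df["precip_in"].where(in_window, 0.0)
    df["jul_aug_heat_days"] = df["extreme_heat_day"].where(in_window, 0)
    out = df.groupby(["county", "year"], as_index=False).agg(
        gdd=("gdd", "sum"),
        total_precip=("precip_in", "sum"),
        extreme_heat_days=("extreme_heat_day", "sum"),
        jul_aug_precip=("jul_aug_precip", "sum"),
        jul_aug_heat_days=("jul_aug_heat_days", "sum"),
    )
    # Illustrative interaction term: hot days per inch of rain (+1 avoids division by zero).
    out["heat_drought_index"] = out["jul_aug_heat_days"] / (out["jul_aug_precip"] + 1.0)
    return out
```

Clamping before averaging keeps a single very hot or very cold day from contributing unrealistic growth, while the heat-day count still sees the raw maximum.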
KD-Tree for spatial alignment
Historical weather stations and county-level yield records rarely align perfectly. To bridge that gap, we implemented a KD-Tree nearest-neighbor mapping pipeline:
Index weather station coordinates
Map stations to the nearest county centroids
Aggregate station-level daily weather into county-level seasonal features
This gave us an efficient and scalable way to create county-level weather features from available datasets without requiring a full GIS polygon-containment pipeline. It is a practical geospatial approximation that worked well for hackathon speed and scale.
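The mapping step could look roughly like the sketch below, which uses SciPy's `cKDTree`. It makes several assumptions that are not part of the original pipeline description: the tree is built over county centroids and each station is queried against it, coordinates are converted to unit-sphere points so that Euclidean nearest neighbors match great-circle nearest neighbors, and the column names (`station_id`, `county_id`, `lat`, `lon`, `tmax_f`, `precip_in`) are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def to_unit_xyz(lat_deg, lon_deg) -> np.ndarray:
    """Convert latitude/longitude in degrees to 3D points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

def assign_stations_to_counties(stations: pd.DataFrame,
                                counties: pd.DataFrame) -> pd.DataFrame:
    """Attach the nearest county centroid's county_id to every station."""
    tree = cKDTree(to_unit_xyz(counties["lat"], counties["lon"]))
    # k=1 returns (distance, index of the nearest centroid) for each station.
    _, idx = tree.query(to_unit_xyz(stations["lat"], stations["lon"]), k=1)
    out = stations.copy()
    out["county_id"] = counties["county_id"].to_numpy()[idx]
    return out

def county_daily_weather(daily: pd.DataFrame,
                         station_to_county: pd.DataFrame) -> pd.DataFrame:
    """Average station observations into one row per county and date."""
    merged = daily.merge(station_to_county[["station_id", "county_id"]],
                         on="station_id")
    return merged.groupby(["county_id", "date"],
                          as_index=False)[["tmax_f", "precip_in"]].mean()
```

One known limitation of this style of assignment is that a station near a county border can land in the neighboring county, and counties with no nearby station receive no observations at all.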
Risk-aware ML architecture
To make the system useful for growers, we designed the modeling layer around risk-aware forecasting, not just point estimates.
Our modeling stack combines:
Random Forests for robust county-level prediction and practical uncertainty estimation
XGBoost (crop-specific models for Corn and Soybeans) for strong point-forecast accuracy
Agronomy-specific engineered features (GDD, heat stress, precipitation, seasonal interactions)
This architecture lets us optimize for both forecast quality and decision usefulness.
How we quantify uncertainty
We implemented uncertainty in two complementary ways:
1) Scenario uncertainty (60-day and 90-day)
We generate Optimistic / Expected / Pessimistic scenarios by perturbing weather features using historical distributions:
60-day scenarios use a tighter spread (lower uncertainty)
90-day scenarios use a wider spread (higher uncertainty)
This gives growers a practical “what-if” range instead of a single deterministic outcome.
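To make the idea concrete, here is a simplified sketch of horizon-dependent perturbation. The spread multipliers, the feature names, and the sign convention for what counts as "favorable" are all hypothetical values chosen for illustration. They are not the parameters of our scenario engine, which draws on historical distributions.

```python
import pandas as pd

# Hypothetical spread multipliers, chosen only so that 60-day < 90-day.
HORIZON_SPREAD = {60: 0.5, 90: 1.0}
# Hypothetical sign convention: +1 if more of the feature is assumed favorable to yield.
FAVORABLE_DIRECTION = {"jul_aug_precip": +1, "extreme_heat_days": -1}

def make_scenarios(expected: pd.Series, hist_std: pd.Series,
                   horizon_days: int) -> pd.DataFrame:
    """Return one row of weather features per scenario for a single county."""
    spread = HORIZON_SPREAD[horizon_days]
    rows = {}
    for name, sign in {"pessimistic": -1, "expected": 0, "optimistic": +1}.items():
        row = expected.astype(float).copy()
        for feat, direction in FAVORABLE_DIRECTION.items():
            # Shift by a fraction of the historical standard deviation.
            row[feat] = expected[feat] + sign * direction * spread * hist_std[feat]
        # Counts and totals cannot go below zero.
        rows[name] = row.clip(lower=0.0)
    return pd.DataFrame(rows).T
```

Each scenario row can then be passed to the fitted yield model to obtain the Optimistic / Expected / Pessimistic forecasts.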
2) Prediction intervals (80% PI)
For Random Forest models, we use the distribution of predictions across individual trees and compute the 10th and 90th percentiles as an approximate 80% prediction interval.
That means the system can communicate:
expected yield,
downside risk,
and upside potential,
which is much closer to how growers actually reason about decisions.
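The per-tree percentile approach described above can be sketched with scikit-learn as follows. The helper name `forest_interval` is ours for this example. Note that the spread across trees mostly reflects model disagreement, so the resulting band is an approximation rather than a calibrated interval.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_interval(model: RandomForestRegressor, X,
                    lower_pct: float = 10.0, upper_pct: float = 90.0):
    """Return (mean, lower, upper) yield estimates from the individual trees."""
    X = np.asarray(X, dtype=float)
    # Shape (n_trees, n_samples): one prediction per tree per row of X.
    per_tree = np.stack([tree.predict(X) for tree in model.estimators_])
    lower, upper = np.percentile(per_tree, [lower_pct, upper_pct], axis=0)
    return per_tree.mean(axis=0), lower, upper
```

With the default percentiles, `lower` plays the role of downside risk, `upper` of upside potential, and the mean matches the forest's usual point prediction.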
ML Model Architecture & Evaluation
We used MLflow to benchmark and compare multiple candidate models (including linear baselines, Random Forest, Gradient Boosting, XGBoost, and LightGBM) across shared feature sets.
MLflow made it easy to:
log parameters and metrics consistently
compare validation/test performance across runs
track experiments while iterating quickly in notebooks
choose models based on both accuracy and fit for our uncertainty workflow
This was a major enabler for moving fast without losing rigor.
Databricks Dashboard + End-to-End Lakehouse Workflow
To make the system useful beyond a notebook, we built the output layer around Databricks Dashboards and a Lakehouse-style workflow.
Databricks products/services we used
Databricks Notebooks for rapid iteration across PySpark + Python ML workflows
Databricks Marketplace for historical weather datasets
PySpark + Databricks SQL for large-scale weather aggregation and feature generation
Unity Catalog for managing source and derived tables
Delta tables for storing forecast-ready outputs and dashboard inputs
MLflow for experiment tracking and model benchmarking
Databricks Dashboards for visualizing county-level forecasts, uncertainty, and natural-language insights
Why this mattered
What made Databricks especially valuable was the end-to-end workflow continuity. We used Marketplace datasets + PySpark/Databricks SQL for large-scale weather aggregation, stored intermediate and final outputs in Delta tables under Unity Catalog, tracked model experiments in MLflow, and surfaced county-level forecasts and uncertainty directly in Databricks Dashboards.
That let us build a true Lakehouse pipeline, from raw weather data to grower-facing risk insights, without constantly moving data between platforms or switching tools.
Challenges we ran into
1) County-weather alignment was not plug-and-play
Our workspace had strong weather and yield data, but not a pre-built county centroid mapping table. We had to create a geospatial alignment layer ourselves (KD-Tree + county centroids) to bridge weather stations and county yields.
2) Feature consistency during scenario simulation
When perturbing weather features for forecast scenarios, derived interaction features (e.g., heat × precipitation terms) can become inconsistent if they are not recomputed. We identified and fixed this so scenario inputs remained internally coherent.
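One way to enforce this kind of consistency is to allow perturbations only on base features and to rebuild every derived feature afterwards. The sketch below illustrates the pattern with hypothetical column names (`jul_aug_heat_days`, `jul_aug_precip`, `heat_x_precip`). It is not our exact fix.

```python
import pandas as pd

# Hypothetical base features; derived columns are always rebuilt from these.
BASE_FEATURES = ["jul_aug_heat_days", "jul_aug_precip"]

def add_interactions(features: pd.DataFrame) -> pd.DataFrame:
    """Recompute derived interaction terms from the current base features."""
    out = features.copy()
    out["heat_x_precip"] = out["jul_aug_heat_days"] * out["jul_aug_precip"]
    return out

def perturb(features: pd.DataFrame, deltas: dict) -> pd.DataFrame:
    """Shift base features by the given deltas, then rebuild derived features."""
    out = features.copy()
    for col, delta in deltas.items():
        if col not in BASE_FEATURES:
            # Derived columns must never be perturbed directly.
            raise ValueError(f"{col} is not a base feature")
        out[col] = (out[col] + delta).clip(lower=0.0)
    return add_interactions(out)
```

Routing every scenario through a single function like `perturb` means a stale interaction term cannot reach the model.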
Accomplishments that we’re proud of
Built a county-level crop yield forecasting pipeline with real weather integration
Designed a risk-aware forecasting workflow (60/90-day scenarios + 80% prediction intervals)
Produced crop-specific forecasts (Corn and Soybeans) and county-level outputs for maps
Created a dashboard-ready data product, not just a notebook experiment
Added natural-language summaries so non-technical users can understand forecast risk quickly
Used MLflow + Databricks to benchmark models in a structured and repeatable way
What we learned
In agriculture, feature engineering and data alignment matter as much as model choice
The most useful ML systems for practitioners are often decision-support systems, not just prediction engines
Uncertainty communication (scenario ranges + intervals) dramatically increases the value of a forecast
Building on Databricks accelerated our workflow by letting us move smoothly between distributed processing, ML experimentation, and dashboard output delivery
What’s next for AI-Powered Crop Yield and Risk Dashboard
1) More realistic weather scenario generation
Our current scenario engine uses historically informed perturbations. Next, we want to make this more county-specific and forecast-source-aware (e.g., probabilistic weather ensemble inputs).
2) Stronger agronomic risk triggers
We want to expand the natural-language trigger system so it can generate more actionable county/crop-specific alerts (e.g., heat stress + low precipitation combinations during key windows).
3) Better geospatial weather assignment
We used an efficient nearest-neighbor approach for hackathon speed. A future version could incorporate more advanced spatial weighting or polygon-based assignment.
4) Grower-facing recommendations layer
Beyond forecasting risk, the next step is decision support: surfacing mitigation actions, planning recommendations, and region-specific agronomic guidance based on forecast uncertainty.
Built With
- databricks
- mlflow
- python