Agronomic Yield Forecasting Workflow Overview
- Data Preparation
  ● Weather Data Filtering:
    ○ Started by filtering NOAA weather observations to only include years 2010–2024.
  ● US State Mapping:
    ○ Enriched the weather data with a US state label using a spatial join with US ZIP code geo-centroids, discarding non-US records.
  ● Feature Aggregation:
    ○ Aggregated the weather data into biologically meaningful features, typically at county-year or county-month granularity.
    ○ Columns include GDD, Precipitation, Day and Nighttime Stress Degree Days for each month May–September.
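The feature aggregation step can be sketched roughly as follows. This is a minimal illustration, not the team's notebook code: the 10 °C base and 30 °C cap for GDD, the record layout, the function names, and the sample values are all assumptions, since the writeup does not state which thresholds or schema were used.

```python
from collections import defaultdict

def growing_degree_days(tmin_c, tmax_c, base_c=10.0, cap_c=30.0):
    """Daily GDD via the capped-average method (base and cap are assumed values)."""
    tmax_c = min(max(tmax_c, base_c), cap_c)
    tmin_c = min(max(tmin_c, base_c), cap_c)
    return (tmax_c + tmin_c) / 2.0 - base_c

def aggregate_county_month(daily_records, first_year=2010, last_year=2024):
    """Sum daily GDD and precipitation per (county, year, month), May to September only."""
    totals = defaultdict(lambda: {"GDD": 0.0, "Precip": 0.0})
    for rec in daily_records:
        if not (first_year <= rec["year"] <= last_year):
            continue  # mirrors the 2010-2024 year filter
        if not (5 <= rec["month"] <= 9):
            continue  # keep growing-season months only
        key = (rec["county"], rec["year"], rec["month"])
        totals[key]["GDD"] += growing_degree_days(rec["tmin_c"], rec["tmax_c"])
        totals[key]["Precip"] += rec["precip_mm"]
    return dict(totals)

# Made-up sample observations (hypothetical field names and values).
daily = [
    {"county": "Story", "year": 2023, "month": 7, "tmin_c": 18.0, "tmax_c": 32.0, "precip_mm": 4.0},
    {"county": "Story", "year": 2023, "month": 7, "tmin_c": 8.0, "tmax_c": 20.0, "precip_mm": 0.0},
    {"county": "Story", "year": 2009, "month": 7, "tmin_c": 18.0, "tmax_c": 30.0, "precip_mm": 9.0},
]
features = aggregate_county_month(daily)
```

In a Spark notebook the same logic would more naturally be a filter followed by a group-by; the plain-Python version is used here only to keep the arithmetic visible.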
- Natural Language Representation and Embedding (Superseded by new approach, but included for record)
  ● Old Step: Generated natural language summaries from the feature rows and created dense vector embeddings (e.g., Sentence Transformers) as model input features.
  ● Optimized: Used batched embeddings, carefully managed memory, and saved outputs as Delta tables for team sharing.
- Current Feature-Based Scenario Simulation
  ● User Scenario Entry (Databricks Chat):
    ○ Users describe future weather scenarios in plain language (“severe drought in August for Iowa soybeans”).
  ● Baseline Selection:
    ○ The system loads the most recent historical (e.g., last year’s) weather feature row for the target county, crop, and year as the default model input.
  ● Natural Language Parsing & Rule-Based Translation:
    ○ An LLM (prompted with crop science thresholds and logic) translates the scenario to specific column-wise feature adjustments.
      ■ For example: “multiply August_Precip by 0.5”, “add 5 to July_SDD_Day”.
    ○ JSON output defines exact modifications and provides reasoning.
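To make the JSON hand-off concrete, here is one possible shape for the LLM's reply and a small validator for it. The schema (the `adjustments`, `column`, `op`, `value`, and `reasoning` keys), the `parse_adjustments` helper, and the `July_GDD` column are hypothetical; the writeup only says that the JSON defines the modifications and provides reasoning. The two adjustments are the examples quoted above.

```python
import json

ALLOWED_OPS = {"multiply", "add"}

def parse_adjustments(llm_json, known_columns):
    """Parse the LLM's JSON reply and reject anything outside the assumed schema."""
    payload = json.loads(llm_json)
    adjustments = []
    for item in payload["adjustments"]:
        column, op, value = item["column"], item["op"], item["value"]
        if column not in known_columns:
            raise ValueError(f"unknown feature column: {column}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operation: {op}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric value for {column}: {value!r}")
        adjustments.append((column, op, float(value)))
    return adjustments, payload.get("reasoning", "")

# Hypothetical reply for the "severe drought in August" scenario.
reply = """
{
  "adjustments": [
    {"column": "August_Precip", "op": "multiply", "value": 0.5},
    {"column": "July_SDD_Day", "op": "add", "value": 5}
  ],
  "reasoning": "Illustrative reasoning text returned by the model."
}
"""
columns = {"August_Precip", "July_SDD_Day", "July_GDD"}
adjustments, reasoning = parse_adjustments(reply, columns)
```

Validating against the known column names before touching any data is a cheap guard against the model inventing a feature that the trained model never saw.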
- Model Prediction and Result Summarization (Did not reach this step due to time constraints)
  ● Adjustment Application:
    ○ The notebook parses the LLM’s JSON and applies the multiplicative/additive changes to the baseline features.
  ● Quantile Model Inference:
    ○ The modified feature vector is run through the trained XGBoost quantile regression model to predict yield outputs (e.g., 10th, 50th, 90th percentiles).
  ● Result Narration:
    ○ Before-and-after yield values, along with the LLM’s “reasoning”, are passed back to the LLM for a natural language summary explaining the biological cause and likely impact.
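Since this step was not reached, the following is only a sketch of how the adjustment and comparison logic might look. The `(column, op, value)` tuple format, the helper names, and the baseline values are hypothetical, and `toy_predictor` is a made-up linear stand-in for the trained XGBoost quantile model, used purely so the example runs on its own.

```python
def apply_adjustments(baseline, adjustments):
    """Return a copy of the baseline feature row with each adjustment applied in order."""
    scenario = dict(baseline)  # never mutate the stored baseline row
    for column, op, value in adjustments:
        if op == "multiply":
            scenario[column] *= value
        elif op == "add":
            scenario[column] += value
        else:
            raise ValueError(f"unsupported operation: {op}")
    return scenario

def compare_predictions(predict_quantiles, baseline, scenario):
    """Run a quantile predictor on both rows and report the per-quantile change."""
    before = predict_quantiles(baseline)
    after = predict_quantiles(scenario)
    return {
        q: {"before": before[q], "after": after[q], "delta": after[q] - before[q]}
        for q in before
    }

def toy_predictor(row):
    # Placeholder only: NOT the real model, just a linear stand-in for illustration.
    median = 30.0 + 0.25 * row["August_Precip"] - 0.5 * row["July_SDD_Day"]
    return {"p10": median - 8.0, "p50": median, "p90": median + 8.0}

baseline = {"August_Precip": 80.0, "July_SDD_Day": 12.0}  # hypothetical values
scenario = apply_adjustments(
    baseline,
    [("August_Precip", "multiply", 0.5), ("July_SDD_Day", "add", 5.0)],
)
report = compare_predictions(toy_predictor, baseline, scenario)
```

The `report` dictionary, together with the LLM's reasoning string, is the kind of payload that could be handed back to the LLM for the narration step.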
- Team Pipeline Integration & Handoff
  ● Notebooks and Delta Tables:
    ○ All intermediate (and final) data, including embeddings, feature tables, and templates, is persisted in shared Delta tables.
  ● Modularized Cells:
    ○ Key tasks (install, summary, embedding, save, simulate) are isolated in named notebook cells for reusability.
  ● Teammate Collaboration:
    ○ Artifacts and table locations (like workspace.default.weather_embeddings) are clearly documented for other team members to load and continue downstream tasks (e.g., modeling, dashboarding).
Thoughts on Databricks
- Stars
  Databricks enables end-to-end development of the entire workflow, from raw data to a trained model and final results, while making collaboration much easier for a team. What makes it especially useful is that everything is in one unified platform. You can clean data, transform it, run analysis, train models, and visualize results without constantly switching tools. During a project or hackathon, that saves a lot of time and reduces confusion.

  The collaborative notebooks are a big advantage. Multiple team members can work in the same environment using Python or SQL, comment on each other’s work, and iterate quickly. It feels more like a shared workspace than isolated coding.

  For machine learning, built-in MLflow support helps track experiments, compare different model runs, and manage versions. Instead of manually keeping track of results, everything is logged and organized.

  The assistant agent is competent enough that much of the work felt close to a no-code task.
- Wishes
  Having this much functionality in one platform can also make it feel a bit overwhelming at times. Some things we felt during the project:
  ● The interface can feel cluttered, especially when you’re new and trying to figure out where everything is.
  ● It’s sometimes unclear what is happening “behind the scenes” with clusters and computation, which makes it harder to understand performance and costs.
  ● Debugging errors can be confusing, since the error messages are not always very intuitive.
  ● The learning curve is quite steep in the beginning, particularly if you don’t have prior experience with Spark or distributed systems.
  ● When working under time pressure, small configuration steps (like permissions or cluster settings) can slow things down more than expected.

  Overall, while the platform is very powerful, making it slightly more intuitive and transparent would make it even better for student teams and fast-paced projects.
Built With
- databricks
- python