Inspiration
TerraFlow started from a practical difficulty: understanding how various soil properties influence saturated hydraulic conductivity (ksat). Traditional modeling approaches often struggle to capture the nonlinear relationships in soil data. We wanted to explore whether machine learning could offer more precise predictions and clearer insight, helping agronomists, hydrologists, and environmental engineers with water resource management and soil conservation.
What it does
TerraFlow predicts ksat with a machine learning pipeline built on scikit-learn. It imputes missing values, scales selected features, and then compares multiple regression algorithms (including Random Forest, Gradient Boosting, and SVR) to find the best-performing model. It also provides interpretability via SHAP, showing how individual soil variables drive the model's predictions.
How we built it
Data Preparation
- Imported the soil dataset in Excel format.
- Checked for missing data and employed both mean/median and KNN-based imputation.
- Encoded categorical features where needed using LabelEncoder.
- Performed feature scaling for algorithms sensitive to distance metrics (like KNN and SVR).
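The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not our actual notebook: the column names and the synthetic values are assumptions standing in for the real spreadsheet, which would be loaded with pd.read_excel.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for the soil spreadsheet. Column names and values
# are assumptions for illustration; real data would come from pd.read_excel.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sand_pct": rng.uniform(10, 90, 50),
    "clay_pct": rng.uniform(5, 60, 50),
    "bulk_density": rng.uniform(1.0, 1.8, 50),
    "texture_class": rng.choice(["loam", "clay", "sand"], 50),
})
# Blank out a few cells to imitate missing measurements.
df.loc[[3, 17, 31], "sand_pct"] = np.nan
df.loc[[8, 22], "bulk_density"] = np.nan

numeric_cols = ["sand_pct", "clay_pct", "bulk_density"]

# 1. Count missing values per column.
missing_counts = df[numeric_cols].isna().sum()

# 2a. Simple imputation: fill each gap with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# 2b. KNN imputation: fill each gap from the 5 most similar rows.
knn_filled = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])

# 3. Encode the categorical column as integer codes.
df["texture_code"] = LabelEncoder().fit_transform(df["texture_class"])

# 4. Standardize features for distance-based models such as KNN and SVR.
scaled = StandardScaler().fit_transform(knn_filled)

print(missing_counts)
```

LabelEncoder assigns arbitrary integer codes, which tree-based models tolerate well; for distance-based models, one-hot encoding of nominal categories may be a safer choice.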
Model Development
- Experimented with various regression models (Linear Regression, Random Forest, Gradient Boosting, KNN, and SVR).
- Used GridSearchCV to fine-tune hyperparameters (e.g., n_estimators, max_depth, C, etc.) for optimal performance.
- Compared model performance with mean_squared_error (MSE) and r2_score.
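A tuning and comparison loop along these lines might look like the sketch below. The synthetic data, the two candidate models, and the small parameter grids are assumptions chosen to keep the example quick to run; the real search covered more models and wider grids.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic nonlinear regression data (an assumption, not the soil dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))
y = 3 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each candidate pairs an estimator with the grid to search over.
candidates = {
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "svr": (
        # Scaling lives inside the pipeline so it is refit on every CV fold.
        Pipeline([("scale", StandardScaler()), ("svr", SVR())]),
        {"svr__C": [1.0, 10.0]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="neg_mean_squared_error")
    search.fit(X_train, y_train)
    pred = search.predict(X_test)  # uses the refit best estimator
    results[name] = {
        "best_params": search.best_params_,
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, res in results.items():
    print(name, res)
```

Scoring on a held-out test split, rather than on the cross-validation folds used for tuning, gives a less optimistic estimate of how each tuned model generalizes.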
Interpretability
- Integrated SHAP to visualize feature contributions and demonstrate why predictions shift based on specific soil properties.
Challenges we ran into
- Data Quality: Handling missing values meant weighing simple mean/median filling against more advanced strategies such as KNN-based imputation.
- Model Selection: Determining which algorithm best suited the data was a challenge, given the wide range of options and parameter spaces.
- Computation: Running grid searches for multiple algorithms was time-consuming, so managing runtime was crucial to achieving timely insights.
Accomplishments that we're proud of
- Robust Pipeline: We successfully created a flexible pipeline that seamlessly handles data cleaning, model training, and evaluation.
- High Accuracy: Through hyperparameter tuning, our best models consistently achieved strong R² scores, indicating reliable predictive power.
- Interpretability: By integrating SHAP, we can clearly communicate how each soil feature affects ksat predictions, supporting data-driven decision-making in real-world applications.
What we learned
- Importance of Data Prep: Small differences in imputation methods and scaling approaches can significantly impact model results.
- Value of Ensemble Methods: Tree-based methods like Random Forest and Gradient Boosting can capture complex patterns that linear models may miss.
- Interpretability Matters: Visualization techniques like SHAP ensure that predictive models remain transparent, making them more trustworthy for stakeholders.
What's next for TerraFlow: A Machine Learning Approach to Predict Soil Ksat
We plan to:
- Incorporate Additional Features: Explore domain-specific metrics—like soil texture indices or environmental parameters—to further refine predictions.
- Deploy in a Real-Time Setting: Package TerraFlow into a user-friendly tool for researchers or field technicians to run predictions and get immediate insights on soil-water dynamics.
With these goals, TerraFlow aims to continue evolving as a powerful ally for sustainable soil and water management.
Built With
- matplotlib
- numpy
- pandas
- python
- scikit-learn