TerraFlow: A Machine Learning Approach to Predict Soil Ksat

Inspiration

The original spark for TerraFlow came from the complexity of understanding how various soil properties influence saturated hydraulic conductivity (ksat). Traditional modeling approaches often struggle to capture the nuanced, nonlinear relationships in soil data. We wanted to explore how machine learning could offer a more precise and insightful perspective—helping agronomists, hydrologists, and environmental engineers optimize water resource management and soil conservation.

What it does

TerraFlow predicts ksat with a machine learning pipeline. It imputes missing values, scales selected features, and then compares several regression algorithms (including Random Forest, Gradient Boosting, and SVR) on held-out error. It also uses SHAP for interpretability, showing how each soil variable drives the model’s predictions.

How we built it

Data Preparation
- Imported the soil dataset in Excel format.
- Checked for missing data and employed both mean/median and KNN-based imputation.
- Encoded categorical features where needed using LabelEncoder.
- Performed feature scaling for algorithms sensitive to distance metrics (like KNN and SVR).
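
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the tiny in-memory table, and the `n_neighbors` value are all assumptions made for the example, and the real data would be loaded from the Excel file instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for the soil table; the real project would load its
# Excel file (for example with pd.read_excel) and use its own column names.
df = pd.DataFrame({
    "sand_pct": [62.0, 35.0, np.nan, 20.0, 48.0, 71.0],
    "clay_pct": [12.0, 30.0, 41.0, np.nan, 22.0, 8.0],
    "bulk_density": [1.45, 1.30, 1.25, 1.20, np.nan, 1.55],
    "texture_class": ["sandy loam", "clay loam", "clay", "clay", "loam", "sand"],
})
numeric_cols = ["sand_pct", "clay_pct", "bulk_density"]

# 1. Count missing values per column.
missing_counts = df.isna().sum()

# 2. Two imputation strategies that can be compared side by side.
median_filled = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df[numeric_cols])

# 3. Keep the KNN-imputed values, encode the categorical column as integers,
#    and standardize the numeric columns for distance-based models.
prepared = df.copy()
prepared[numeric_cols] = knn_filled
prepared["texture_class"] = LabelEncoder().fit_transform(prepared["texture_class"])
prepared[numeric_cols] = StandardScaler().fit_transform(prepared[numeric_cols])
```

In a real run, the imputer and scaler would normally be fitted on the training split only, so that no information from the test rows leaks into preprocessing. Note also that scikit-learn documents LabelEncoder as a target encoder; OrdinalEncoder or OneHotEncoder are the usual choices for input features.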

Model Development
- Experimented with various regression models (Linear Regression, Random Forest, Gradient Boosting, KNN, and SVR).
- Used GridSearchCV to tune hyperparameters (e.g., n_estimators, max_depth, and C).
- Compared model performance with mean_squared_error (MSE) and r2_score.
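
A comparison loop of this kind might look like the sketch below. The synthetic data, the parameter grids, the 3-fold setting, and the 75/25 split are assumptions chosen to keep the example small; they are not the grids or data used in the project.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic nonlinear target standing in for ksat (assumption for the demo).
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 4))
y = np.exp(2 * X[:, 0]) - 3 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each candidate pairs an estimator with a small hyperparameter grid.
# SVR sits inside a pipeline so scaling is refitted within every CV fold.
candidates = {
    "random_forest": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
    "gradient_boosting": (
        GradientBoostingRegressor(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 3]},
    ),
    "svr": (
        make_pipeline(StandardScaler(), SVR()),
        {"svr__C": [1, 10]},
    ),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=3, scoring="r2")
    search.fit(X_train, y_train)
    pred = search.predict(X_test)  # uses the refitted best estimator
    results[name] = {
        "best_params": search.best_params_,
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, scores in results.items():
    print(name, scores)
```

Keeping the grids small and the number of folds low is one way to bound the runtime of the search; passing `n_jobs=-1` to GridSearchCV is another option when several cores are available.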

Interpretability
- Integrated SHAP to visualize feature contributions and demonstrate why predictions shift based on specific soil properties.
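
SHAP attributions are Shapley values: per-feature contributions that sum to the difference between one prediction and the average prediction over a background sample. To make that property concrete, the sketch below computes exact Shapley values by brute force for a toy model. It deliberately does not call the shap library (which uses much faster algorithms), and the data, the three unnamed features, and the helper `exact_shapley` are hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def exact_shapley(predict, x, background):
    """Exact Shapley values for one row x by enumerating feature subsets.

    Features outside a subset keep their background values, and predictions
    are averaged over the background rows. Cost grows as 2**n_features, so
    this is only practical for a handful of features.
    """
    n = x.shape[0]

    def value(subset):
        rows = background.copy()
        idx = list(subset)
        if idx:
            rows[:, idx] = x[idx]  # pin the subset's features to x
        return predict(rows).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Synthetic data: the third feature has no effect on the target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.05, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
background = X[:50]

phi = exact_shapley(model.predict, X[0], background)
baseline = model.predict(background).mean()
prediction = model.predict(X[:1])[0]

# Additivity: the contributions account for the gap between this
# prediction and the background average.
print(phi, phi.sum(), prediction - baseline)
```

The printed sum of contributions should match the gap between the prediction and the baseline, which is the property that lets a SHAP plot be read as "this soil property moved the prediction up or down by this much".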

Challenges we ran into

Data Quality: Handling missing values required careful thinking—simple methods versus more advanced imputation strategies.
Model Selection: Determining which algorithm best suited the data was a challenge, given the wide range of options and parameter spaces.
Computation: Running grid searches for multiple algorithms was time-consuming, so managing runtime was crucial to achieving timely insights.

Accomplishments that we're proud of

Robust Pipeline: We built a flexible pipeline that handles data cleaning, model training, and evaluation end to end.
High Accuracy: Through hyperparameter tuning, our best models consistently achieved strong R² scores, indicating reliable predictive power.
Interpretability: By integrating SHAP, we can clearly communicate how each soil feature affects ksat predictions, supporting data-driven decision-making in real-world applications.

What we learned

Importance of Data Prep: Small differences in imputation methods and scaling approaches can significantly impact model results.
Value of Ensemble Methods: Tree-based methods like Random Forest and Gradient Boosting can capture complex patterns that linear models may miss.
Interpretability Matters: Visualization techniques like SHAP ensure that predictive models remain transparent, making them more trustworthy for stakeholders.

What's next for TerraFlow: A Machine Learning Approach to Predict Soil Ksat

We plan to:

Incorporate Additional Features: Explore domain-specific metrics—like soil texture indices or environmental parameters—to further refine predictions.
Deploy in a Real-Time Setting: Package TerraFlow into a user-friendly tool for researchers or field technicians to run predictions and get immediate insights on soil-water dynamics.

With these goals, TerraFlow aims to continue evolving as a powerful ally for sustainable soil and water management.

Updates

Upashana Dutta started this project — Apr 13, 2025 11:19 AM EDT
