KsatFlow: ML-Powered Soil Permeability Estimation_Buffden

🌱 Inspiration

Soil permeability is a key factor in sustainable land management, agriculture, and urban planning. Yet predicting how water flows through different soil types often requires complex simulations or expensive physical tests. We wanted to apply machine learning to this problem, using the UKSAT dataset to explore whether data-driven methods can accurately estimate saturated hydraulic conductivity (Ksat), and how much data is actually needed to do it well.

🧠 What it does

KsatFlow is a machine learning pipeline that:

Predicts Ksat (saturated hydraulic conductivity) from soil properties like sand, silt, clay, and bulk density
Evaluates how dataset size impacts model performance using randomized subset experiments
Visualizes model accuracy using R² and RMSLE across sample sizes
Explains feature importance using SHAP values for transparency
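
As a rough illustration of the prediction and scoring steps listed above, here is a minimal sketch. It runs on synthetic data rather than the real dataset: the feature generator, the made-up Ksat relationship, the log1p target transform, and the hyperparameters are all assumptions for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for soil samples (NOT the real dataset).
# Dirichlet draws give sand/silt/clay percentages that sum to 100.
texture = rng.dirichlet([2.0, 2.0, 1.5], size=n) * 100.0
bulk_density = rng.uniform(1.0, 1.8, size=n)  # g/cm^3
X = np.column_stack([texture, bulk_density])  # columns: sand, silt, clay, bulk density

# Invented relationship: sandier, less dense soil conducts water faster.
log_ksat = 0.03 * texture[:, 0] - 0.02 * texture[:, 2] - 1.5 * bulk_density
log_ksat += rng.normal(0.0, 0.2, size=n)
ksat = np.exp(log_ksat)  # strictly positive, heavily skewed

X_train, X_test, y_train, y_test = train_test_split(
    X, ksat, test_size=0.2, random_state=0
)

# Fit on log1p(Ksat) to tame the skew, then map predictions back to Ksat units.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, np.log1p(y_train))
pred = np.expm1(model.predict(X_test))

r2 = r2_score(y_test, pred)
rmsle = float(np.sqrt(mean_squared_log_error(y_test, pred)))
print(f"R2={r2:.3f}  RMSLE={rmsle:.3f}")
```

Training on the log-transformed target is one common choice for skewed quantities like Ksat; whether it helps on the real data would need to be checked.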

🛠 How we built it

We started by downloading the UKSAT dataset and cleaning it in Python. The project was built entirely using:

Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn)
VS Code for development and testing
GitHub for version control and submission
SHAP for explainability and feature attribution

We designed a modular pipeline to:

Clean and preprocess data
Train and evaluate multiple models
Automate subset sampling and error tracking
Generate final visualizations for reporting
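
The subset sampling and error tracking steps could look roughly like the sketch below. The function name `subset_experiment`, the decision tree depth, the subset sizes, and the synthetic demo data are hypothetical; only the general idea (retrain on random subsets of each size, score every run on a fixed held-out set) comes from the description above.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score
from sklearn.tree import DecisionTreeRegressor


def subset_experiment(X_train, y_train, X_test, y_test, sizes, n_repeats=50, seed=0):
    """Return {size: (mean R2, mean RMSLE)} over random training subsets."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        r2s, rmsles = [], []
        for _ in range(n_repeats):
            # Draw a random subset of the training pool without replacement.
            idx = rng.choice(len(X_train), size=size, replace=False)
            model = DecisionTreeRegressor(max_depth=6, random_state=0)
            model.fit(X_train[idx], np.log1p(y_train[idx]))
            pred = np.expm1(model.predict(X_test))
            r2s.append(r2_score(y_test, pred))
            rmsles.append(np.sqrt(mean_squared_log_error(y_test, pred)))
        results[size] = (float(np.mean(r2s)), float(np.mean(rmsles)))
    return results


# Synthetic demo data (assumed, not the real dataset): 4 features, positive target.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(600, 4))
y = np.exp(2.0 * X[:, 0] - X[:, 1] + rng.normal(0.0, 0.1, size=600))

# Fewer repeats than the 50 splits mentioned in this write-up, to keep the demo fast.
results = subset_experiment(
    X[:500], y[:500], X[500:], y[500:], sizes=[25, 100, 400], n_repeats=10
)
for size, (mean_r2, mean_rmsle) in results.items():
    print(f"n={size:4d}  mean R2={mean_r2:.3f}  mean RMSLE={mean_rmsle:.3f}")
```

Keeping the test set fixed while only the training subset varies means differences between sizes reflect the amount of training data rather than a shifting evaluation set.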

🧗 Challenges we ran into

Installing SciPy on macOS (M3 Pro), which was complicated by a missing Fortran compiler
Interpreting domain-specific features (e.g., what constitutes "valid" soil composition ratios)
Managing subset experiments across 50 random splits and hundreds of model retrains
Tuning models under the constraint that XGBoost was not allowed

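
One plausible reading of "valid" soil composition is that sand, silt, and clay are percentages between 0 and 100 that add up to roughly 100. The sketch below filters rows on that assumption. The column names, the 1-point tolerance, and the helper name `valid_texture_mask` are hypothetical and may not match the dataset or the rule the project actually used.

```python
import pandas as pd


def valid_texture_mask(df, tol=1.0):
    """True for rows whose sand/silt/clay percentages look physically consistent."""
    parts = df[["sand", "silt", "clay"]]
    # Each fraction must be a percentage in [0, 100]; NaN compares False, so it is rejected.
    in_range = ((parts >= 0) & (parts <= 100)).all(axis=1)
    # The three fractions should sum to about 100 percent.
    sums_to_100 = (parts.sum(axis=1) - 100.0).abs() <= tol
    return in_range & sums_to_100


# Toy rows for illustration only.
df = pd.DataFrame(
    {
        "sand": [40.0, 70.0, 50.0, -5.0],
        "silt": [40.0, 20.0, 30.0, 60.0],
        "clay": [20.0, 20.0, 20.0, 45.0],
    }
)
mask = valid_texture_mask(df)
clean = df[mask]
print(clean)
```

Here the second row is rejected because it sums to 110, and the fourth because it contains a negative percentage even though its sum is 100.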
🏆 Accomplishments that we're proud of

Building a fully working ML pipeline from scratch with subset evaluation support
Achieving strong Ksat prediction metrics using random forest and decision tree regressors
Generating clean, reproducible results that show smaller training sets can still yield strong predictions
Creating a structured, well-documented, GitHub-hosted solution

📚 What we learned

How to build regression models for scientific use cases
How to interpret R² vs RMSLE and where each metric matters
How to use SHAP to make ML explainable, even in scientific applications
The importance of clean data, repeatable experiments, and modular code
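
To make the R² versus RMSLE distinction concrete, the sketch below scores two invented prediction sets against the same invented targets. R² works on absolute squared error, so it is dominated by the largest values; RMSLE works on log-scale differences, so it tracks relative error and penalizes misses on small values. The numbers are made up for illustration and are not project results.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmsle(y_true, y_pred):
    """Root mean squared error between log1p(prediction) and log1p(target)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Targets spanning several orders of magnitude, as Ksat values often do.
y_true = np.array([0.1, 1.0, 10.0, 100.0])
pred_a = np.array([0.1, 1.0, 10.0, 80.0])   # only the largest value is off, by 20%
pred_b = np.array([1.0, 5.0, 10.0, 100.0])  # the small values are off by 10x and 5x

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"{name}: R2={r2(y_true, pred):.4f}  RMSLE={rmsle(y_true, pred):.4f}")
```

R² ranks prediction set B higher because its absolute errors are tiny, while RMSLE ranks A higher because B is wrong by large factors on the low-conductivity samples. Which metric matters depends on whether absolute or relative error is the concern.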

🚀 What's next for KsatFlow: ML-Powered Soil Permeability Estimation

🧪 Try other models like LightGBM, CatBoost, and Ridge Regression
🌍 Explore geospatial mapping of soil Ksat values using external datasets
🧱 Integrate a simple web-based tool for researchers or students to input soil values and see Ksat predictions
📊 Build an interactive dashboard to showcase model explainability and soil insights in real-time

Built With

regression-modeling
cross-validation
git
github
hydroshare
learning-curves
matplotlib
metrics
eda
numpy
pandas
python
scikit-learn
seaborn
shap
vs-code

Updates

Harshwardhan Patil started this project — Apr 13, 2025 01:08 AM EDT
