🌱 Inspiration
Soil permeability is a key factor in sustainable land management, agriculture, and urban planning. Yet predicting how water flows through different soil types often requires complex simulations or expensive physical tests. We were inspired to apply machine learning to this challenge, using the UKSAT dataset to explore two questions: can data-driven methods accurately estimate saturated hydraulic conductivity (Ksat), and how much data is truly needed to do it well?
🧠 What it does
KsatFlow is a machine learning pipeline that:
- Predicts Ksat (saturated hydraulic conductivity) from soil properties like sand, silt, clay, and bulk density
- Evaluates how dataset size impacts model performance using randomized subset experiments
- Visualizes model accuracy using R² and RMSLE across sample sizes
- Explains feature importance using SHAP values for transparency
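To make the subset-size idea concrete, here is a minimal sketch of a repeated random-subset experiment. It is not our actual pipeline code: the function names (`subset_experiment`, `mean_baseline`), the 20% hold-out fraction, the subset sizes, and the synthetic log-normal data are all assumptions chosen for illustration. The harness is model-agnostic, so the placeholder baseline could be swapped for any regressor wrapped in the same `fit_predict` signature.

```python
import math
import random
import statistics


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (all values must be > -1)."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def subset_experiment(rows, targets, sizes, fit_predict,
                      n_repeats=50, test_frac=0.2, seed=0):
    """Return {size: (mean RMSLE, std RMSLE)} over repeated random training subsets."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    # Hold out one fixed test set so every subset is scored on the same rows.
    n_test = max(1, int(len(idx) * test_frac))
    test_idx, pool = idx[:n_test], idx[n_test:]
    x_test = [rows[i] for i in test_idx]
    y_test = [targets[i] for i in test_idx]

    results = {}
    for size in sizes:
        scores = []
        for _ in range(n_repeats):
            sub = rng.sample(pool, size)  # random training subset, no replacement
            preds = fit_predict([rows[i] for i in sub],
                                [targets[i] for i in sub],
                                x_test)
            scores.append(rmsle(y_test, preds))
        results[size] = (statistics.mean(scores), statistics.pstdev(scores))
    return results


def mean_baseline(x_train, y_train, x_test):
    """Placeholder model: predicts the training mean for every test row."""
    mu = statistics.mean(y_train)
    return [mu] * len(x_test)


# Synthetic stand-in data (hypothetical, NOT the real dataset).
data_rng = random.Random(42)
rows = [(data_rng.uniform(0, 100),) for _ in range(500)]
targets = [math.exp(data_rng.gauss(1.0, 1.0)) for _ in rows]

results = subset_experiment(rows, targets, sizes=[10, 50, 200],
                            fit_predict=mean_baseline, n_repeats=50)
for size, (mean_score, std_score) in results.items():
    print(f"n={size:4d}  RMSLE={mean_score:.3f} +/- {std_score:.3f}")
```

Fixing the test set once and only resampling the training rows keeps the comparison across sizes fair, since any change in score then comes from the amount of training data rather than from a different evaluation split.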
🛠 How we built it
We started by downloading the UKSAT dataset and cleaning it in Python. The project was built entirely using:
- Python (Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn)
- VS Code for development and testing
- GitHub for version control and submission
- SHAP for explainability and feature attribution
We designed a modular pipeline to:
- Clean and preprocess data
- Train and evaluate multiple models
- Automate subset sampling and error tracking
- Generate final visualizations for reporting
🧗 Challenges we ran into
- Installing scipy on macOS M3 Pro due to missing Fortran compilers
- Interpreting domain-specific features (e.g., what constitutes "valid" soil composition ratios)
- Managing subset experiments across 50 random splits and hundreds of model retrains
- Tuning models under the constraint that XGBoost was not allowed
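As an illustration of the "valid composition" question, one plausible check is sketched below. It assumes sand, silt, and clay are recorded as percentages that should sum to roughly 100; that convention, the 1-point tolerance, and the `is_valid_texture` helper are assumptions for this sketch, not necessarily the rule we applied to the dataset.

```python
def is_valid_texture(sand, silt, clay, tol=1.0):
    """Return True if each fraction is in [0, 100] % and they sum to 100 within tol."""
    fractions = (sand, silt, clay)
    # Reject missing values and anything outside the physical range.
    if any(f is None or f < 0 or f > 100 for f in fractions):
        return False
    return abs(sum(fractions) - 100.0) <= tol


# Hypothetical rows for demonstration only.
samples = [
    {"sand": 60.0, "silt": 25.0, "clay": 15.0},   # valid
    {"sand": 70.0, "silt": 30.0, "clay": 20.0},   # sums to 120
    {"sand": -5.0, "silt": 55.0, "clay": 50.0},   # negative fraction
]

clean = [s for s in samples if is_valid_texture(s["sand"], s["silt"], s["clay"])]
print(f"kept {len(clean)} of {len(samples)} rows")
```

A small tolerance is useful because rounded lab measurements rarely add up to exactly 100.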
🏆 Accomplishments that we're proud of
- Building a fully working ML pipeline from scratch with subset evaluation support
- Achieving strong Ksat prediction metrics using random forest and decision tree regressors
- Generating clean, reproducible results that show how smaller training subsets can still give strong predictions
- Creating a structured, well-documented, GitHub-hosted solution
📚 What we learned
- How to build regression models for scientific use cases
- How to interpret R² vs RMSLE and where each metric matters
- How to use SHAP to make ML explainable, even in scientific applications
- The importance of clean data, repeatable experiments, and modular code
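The difference between R² and RMSLE is easiest to see on a toy example. The sketch below uses made-up numbers spanning several orders of magnitude, as Ksat values often do; the values and the two prediction sets are hypothetical and chosen only to show how the metrics can disagree.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error (all values must be > -1)."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical targets spanning three orders of magnitude.
y_true = [0.1, 1.0, 10.0, 100.0]
pred_a = [0.1, 1.0, 10.0, 50.0]    # exact except the largest value, which is halved
pred_b = [0.2, 2.0, 20.0, 100.0]   # doubles the three smaller values, exact on the largest

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(f"{name}: R2={r2_score(y_true, pred):.3f}  RMSLE={rmsle(y_true, pred):.3f}")
```

Here R² clearly prefers B, because squared error is dominated by the single largest target, while RMSLE slightly prefers A, because it penalizes relative errors and B is off by a factor of two on three of four samples. That is why it helps to report both when the target covers a wide range.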
🚀 What's next for KsatFlow: ML-Powered Soil Permeability Estimation
- 🧪 Try other models like LightGBM, CatBoost, and Ridge Regression
- 🌍 Explore geospatial mapping of soil Ksat values using external datasets
- 🧱 Integrate a simple web-based tool for researchers or students to input soil values and see Ksat predictions
- 📊 Build an interactive dashboard to showcase model explainability and soil insights in real-time