Inspiration: This project grew out of the need for accurate and timely rainfall data across many sectors. Traditional weather forecasting can be imprecise, and a machine learning model that predicts rainfall categories could significantly benefit agriculture, urban flood management, and disaster response planning. Our goal was to build a system that turns complex, raw data into clear, actionable predictions.
What it does: This system is a multi-class classification model that takes a dataset of meteorological, geographical, and temporal features as input. It processes this data to predict one of four discrete rainfall categories: NORAIN, SMALLRAIN, MEDIUMRAIN, and HEAVYRAIN. The final output is a submission file with the predicted rainfall category for each entry in the test set.
How we built it: We built this project using a multi-stage, iterative process:
- Data Preprocessing and Cleaning: We first consolidated the training and test data, handling a large share of missing values and ensuring consistency across all features.
- Model Selection and Tuning: We selected three gradient-boosting models (CatBoost, LightGBM, and XGBoost) and used the Optuna framework to search automatically for well-performing hyperparameters for each. This was a crucial step in achieving our high-accuracy results.
- Feature Engineering: We created new, more informative features by calculating aggregate statistics for numerical columns grouped by categorical features. This gave the models more context about each row and a boost in predictive power.
- Ensemble Experimentation: We explored advanced ensemble techniques like blending and stacking to combine our best models. This allowed us to validate and refine our understanding of these complex methods.
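The grouped aggregate features described above can be sketched roughly as follows. This is a minimal illustration and not our exact pipeline: the function name `add_group_aggregates`, the column names in the usage note, and the choice to compute statistics from the training rows only are all assumptions made for the example.

```python
import pandas as pd


def add_group_aggregates(train, test, group_col, num_cols, stats=("mean", "std")):
    """Append per-group statistics of numeric columns as new features.

    Assumption: statistics are computed from the training rows only and then
    mapped onto both frames, so the test set does not influence the features.
    """
    # One row per group, one column per (numeric column, statistic) pair
    agg = train.groupby(group_col)[list(num_cols)].agg(list(stats))
    # Flatten the two-level column index into readable feature names
    agg.columns = [f"{col}_{stat}_by_{group_col}" for col, stat in agg.columns]
    agg = agg.reset_index()
    # Left merges keep every original row; unseen groups get NaN features
    train_out = train.merge(agg, on=group_col, how="left")
    test_out = test.merge(agg, on=group_col, how="left")
    return train_out, test_out
```

For example, with a hypothetical categorical column `station` and a numeric column `temp`, calling `add_group_aggregates(train, test, "station", ["temp"])` would add `temp_mean_by_station` and `temp_std_by_station` to both frames.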
Challenges we ran into: The most significant challenges were technical and required meticulous debugging:
- Handling Categorical NaNs: The CatBoost model initially failed to train due to NaN values in categorical columns. We solved this by explicitly converting the columns to the object data type and filling missing values with a new, distinct category.
- Data Leakage in Blending: Our initial blending attempts raised a ValueError about inconsistent sample sizes, which exposed a flaw in how we built the meta-model's training data and a risk of data leakage. We corrected this by implementing a cross-validation loop so that the meta-model was trained only on out-of-fold predictions.
- Model-Specific Data Types: The LGBM model's strict requirement for categorical data types (category instead of object) caused runtime errors in our stacking ensemble. We fixed this by ensuring all categorical features were converted to the correct data type for each model.
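The two data-type fixes above can be illustrated with a small pandas helper. This is a sketch under assumptions: the helper name `prepare_categoricals`, the placeholder label `"MISSING"`, and the `for_model` switch are hypothetical and only show the idea of filling missing categories and then casting to the type each model expects.

```python
import numpy as np
import pandas as pd

MISSING = "MISSING"  # hypothetical label for the new, distinct category


def prepare_categoricals(df, cat_cols, for_model):
    """Fill missing categorical values and cast columns for a given model.

    for_model="lightgbm" yields pandas 'category' columns; any other value
    yields plain object (string) columns, which is what we fed to CatBoost.
    """
    out = df.copy()  # leave the caller's frame untouched
    for col in cat_cols:
        # Replace NaN with an explicit category, then force string values
        filled = out[col].astype(object).where(out[col].notna(), MISSING).astype(str)
        if for_model == "lightgbm":
            out[col] = filled.astype("category")
        else:
            out[col] = filled.astype(object)
    return out
```

One caveat with the `category` variant: the train and test frames should share the same set of categories, for instance by converting them while they are still concatenated, so that the category codes line up between the two.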
Accomplishments that we're proud of:
- Achieving a top-tier accuracy score of 0.96078 on our best-performing single model.
- Successfully implementing and debugging a robust stacking ensemble, a complex but powerful technique.
- Developing an end-to-end, reproducible machine learning pipeline that handles data cleaning, feature engineering, model tuning, and final predictions.
What we learned: This project provided invaluable lessons in practical machine learning:
- The critical importance of robust data preprocessing and handling model-specific data requirements.
- That even the most powerful ensemble techniques can fail without meticulous implementation to prevent data leakage.
- That a single, highly-tuned model can sometimes outperform a basic ensemble.
- The value of an iterative approach to problem-solving, where debugging and refining the solution is key to success.
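The leakage lesson is easiest to see in code. The sketch below shows one common way to train a stacking meta-model only on out-of-fold predictions. It is not our actual ensemble: scikit-learn's `LogisticRegression` is used as the meta-model purely for illustration, any estimator with `fit` and `predict_proba` can stand in for the CatBoost, LightGBM, and XGBoost base models, and the function names are hypothetical. It also assumes the labels are encoded as integers 0 to `n_classes - 1`, that every class appears in every training fold, and that `X` and `y` are NumPy arrays.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def out_of_fold_probabilities(model, X, y, n_classes, n_splits=5, seed=0):
    """Predict each row with a model copy that never saw that row in training."""
    oof = np.zeros((len(X), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        oof[valid_idx] = fold_model.predict_proba(X[valid_idx])
    return oof


def fit_stack(base_models, X, y, n_classes):
    """Fit the meta-model on out-of-fold features, then refit bases on all data."""
    # Same number of rows as y by construction: one probability block per base model
    meta_X = np.hstack(
        [out_of_fold_probabilities(m, X, y, n_classes) for m in base_models]
    )
    meta_model = LogisticRegression(max_iter=1000).fit(meta_X, y)
    # Base models used at prediction time are trained on the full training set
    fitted = [clone(m).fit(X, y) for m in base_models]
    return fitted, meta_model


def predict_stack(fitted, meta_model, X_new):
    """Feed the base models' class probabilities to the meta-model."""
    meta_X = np.hstack([m.predict_proba(X_new) for m in fitted])
    return meta_model.predict(meta_X)
```

Because every row of the meta-features comes from a fold model that did not train on that row, the meta-model never sees predictions that were made with knowledge of the label, and the feature matrix always has exactly as many rows as the label vector.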
What's next for Precision Rainfall Prediction System: To take this project to the next level, we plan to:
- Explore more advanced feature engineering, such as time-based features and interaction features between different columns.
- Implement a more sophisticated meta-model for our stacking ensemble, such as a multi-layer perceptron.
- Integrate the model into a web-based application to provide real-time rainfall forecasts and visualizations.
Built With
- catboost
- data
- lightgbm
- numpy
- optuna
- pandas
- python
- pytorch
- scikit-learn
- streamlit
- visualisation
- xgboost