Inspiration
Cardiovascular disease is one of the leading causes of death worldwide. I wanted to build a simple and practical ML model that can predict CVD risk early using basic clinical and lifestyle information.
What it does
This project predicts whether a person is likely to have cardiovascular disease (cardio = 1) or not (cardio = 0) using patient health attributes such as age, blood pressure, cholesterol, glucose, and lifestyle indicators.
How I built it
- Loaded the processed cardiovascular dataset (70,000 records)
- Cleaned the data by removing non-useful columns (Unnamed: 0, id)
- Built a fully reproducible preprocessing + training pipeline using scikit-learn
- Trained and compared two models:
  - Logistic Regression (baseline, interpretable)
  - Random Forest (better performance)
- Evaluated using ROC-AUC and accuracy
- Visualized feature importance for explainability
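The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is synthetic (generated with make_classification as a stand-in for the 70,000-record dataset), the column names feature_0 through feature_7 are hypothetical, and the scaler, split ratio, and model settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumption: 8 numeric features).
X_raw, y_raw = make_classification(
    n_samples=2000, n_features=8, n_informative=5, random_state=42
)
df = pd.DataFrame(X_raw, columns=[f"feature_{i}" for i in range(8)])
df["cardio"] = y_raw
# Mimic the two non-useful index columns mentioned in the write-up.
df.insert(0, "id", np.arange(len(df)))
df.insert(0, "Unnamed: 0", np.arange(len(df)))

# Cleaning step: drop the index-like columns, then split features and target.
df = df.drop(columns=["Unnamed: 0", "id"])
X = df.drop(columns=["cardio"])
y = df["cardio"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def make_model(classifier):
    """Wrap preprocessing and the classifier in one pipeline."""
    # Scaling is not required for tree models; it is kept so both
    # pipelines share the same shape.
    return Pipeline([("scale", StandardScaler()), ("clf", classifier)])


models = {
    "logistic_regression": make_model(LogisticRegression(max_iter=1000)),
    "random_forest": make_model(
        RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of cardio = 1
    results[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
    }
    print(name, results[name])

# Impurity-based feature importances from the fitted Random Forest.
forest = models["random_forest"].named_steps["clf"]
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

In a notebook, `importances.plot.barh()` would give a quick matplotlib bar chart of the ranking. The scores printed here come from synthetic data and say nothing about the real dataset.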
Results
Random Forest outperformed Logistic Regression, reaching:
- ROC-AUC ≈ 0.78–0.79
- Accuracy ≈ 0.72
Challenges I ran into
- Setting up required Python libraries in the environment
- Keeping data cleaning and preprocessing reproducible by wrapping them in pipelines
- Balancing performance with interpretability for a healthcare use-case
What I learned
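I learned how preprocessing pipelines prevent data leakage by fitting transformations on the training data only, and why ROC-AUC is an important metric for medical risk prediction tasks: it measures how well the model ranks patients by risk, independent of any single decision threshold.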
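The leakage point can be illustrated with a small sketch. It assumes synthetic data and a StandardScaler + LogisticRegression pipeline (neither is confirmed as the project's exact setup). Because the scaler lives inside the pipeline, it is fit only on the rows passed to `fit`, and cross-validation refits it on each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption: stands in for the clinical features).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=5, random_state=0
)

pipe = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
pipe.fit(X_train, y_train)

# The scaler's learned means match the training rows only, so no
# statistics from the held-out rows leak into preprocessing.
scaler_uses_train_only = np.allclose(
    pipe.named_steps["scale"].mean_, X_train.mean(axis=0)
)
print("scaler fit on training data only:", scaler_uses_train_only)

# cross_val_score clones and refits the whole pipeline per fold, so the
# scaler never sees the validation fold before scoring.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print("per-fold ROC-AUC:", cv_auc.round(3))
```

Scaling the full dataset before splitting would instead bake test-set statistics into the features, which is the leakage the pipeline avoids.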
What's next
- Hyperparameter tuning (GridSearchCV / RandomizedSearchCV)
- SHAP explanations for deeper interpretability
- ECG time-series integration for improved prediction
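As a rough sketch of the first item, a randomized search over a Random Forest pipeline might look like the following. The data is synthetic, and the parameter grid, `n_iter`, and fold count are illustrative assumptions kept small so the example runs quickly; they are not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption: stands in for the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)

pipe = Pipeline([("clf", RandomForestClassifier(random_state=0))])

# Keys use the "<step>__<param>" convention to reach into the pipeline.
param_distributions = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [4, 8, None],
    "clf__min_samples_leaf": [1, 5, 20],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,            # sample 5 of the 18 possible combinations
    scoring="roc_auc",   # same metric used in the evaluation
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best cross-validated ROC-AUC:", round(search.best_score_, 3))
```

GridSearchCV accepts the same pipeline and a `param_grid` with the same key names, but tries every combination instead of a random sample.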
Built With
- jupyter
- matplotlib
- numpy
- pandas
- python
- scikit-learn