Inspiration

Cardiovascular disease (CVD) is one of the leading causes of death worldwide. I wanted to build a simple, practical ML model that flags CVD risk early from basic clinical and lifestyle information.

What it does

This project predicts whether a person is likely to have cardiovascular disease (cardio = 1) or not (cardio = 0) using patient health attributes such as age, blood pressure, cholesterol, glucose, and lifestyle indicators.

How I built it

  • Loaded the processed cardiovascular dataset (70,000 records)
  • Cleaned the data by dropping identifier columns that carry no predictive signal (Unnamed: 0, id)
  • Built a fully reproducible preprocessing + training pipeline using scikit-learn
  • Trained and compared two models:
    • Logistic Regression (baseline, interpretable)
    • Random Forest (better performance)
  • Evaluated using ROC-AUC and accuracy
  • Visualized feature importance for explainability
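
The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact project code: the synthetic data generator, the feature column names (age, ap_hi, ap_lo, cholesterol, gluc, smoke, active), the 80/20 split, and the model settings are all assumptions made so the example runs on its own. Only the dropped columns (Unnamed: 0, id) and the target name (cardio) come from the project description.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_demo_frame(n=2000, seed=0):
    """Synthetic stand-in for the real dataset (feature names are assumed)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "id": np.arange(n),
        "age": rng.integers(30, 65, n),
        "ap_hi": rng.normal(128, 17, n),
        "ap_lo": rng.normal(82, 10, n),
        "cholesterol": rng.integers(1, 4, n),
        "gluc": rng.integers(1, 4, n),
        "smoke": rng.integers(0, 2, n),
        "active": rng.integers(0, 2, n),
    })
    # Made-up relationship so the label is learnable; not a clinical model.
    logit = (0.05 * (df["age"] - 50)
             + 0.04 * (df["ap_hi"] - 128)
             + 0.4 * (df["cholesterol"] - 2))
    df["cardio"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return df


def train_and_compare(df, seed=42):
    """Fit both pipelines on a train split and score them on a held-out split."""
    # Drop identifier columns if present; they carry no predictive signal.
    df = df.drop(columns=["Unnamed: 0", "id"], errors="ignore")
    X, y = df.drop(columns="cardio"), df["cardio"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    models = {
        "logreg": Pipeline([
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        "rf": Pipeline([
            ("scale", StandardScaler()),
            ("clf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ]),
    }
    results = {}
    for name, model in models.items():
        # The scaler is fitted inside the pipeline, on training rows only.
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]
        results[name] = {
            "roc_auc": roc_auc_score(y_test, proba),
            "accuracy": accuracy_score(y_test, model.predict(X_test)),
        }
    return results, models


results, models = train_and_compare(make_demo_frame())
for name, metrics in results.items():
    print(name, metrics)
```

After fitting, the Random Forest importances used for the explainability plot are available as `models["rf"].named_steps["clf"].feature_importances_`, in the same order as the feature columns.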

Results

Random Forest outperformed Logistic Regression, reaching:

  • ROC-AUC ≈ 0.78–0.79
  • Accuracy ≈ 0.72

Challenges I ran into

  • Setting up required Python libraries in the environment
  • Keeping data cleaning and preprocessing reproducible by wrapping them in pipelines
  • Balancing performance with interpretability for a healthcare use-case

What I learned

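I learned how preprocessing pipelines prevent data leakage by fitting every transformation on the training split only, and why ROC-AUC is an important metric for medical risk prediction tasks: it measures how well the model ranks patients by risk, independent of any single decision threshold.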
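
To make the ROC-AUC point concrete, here is a small pure-Python sketch. The labels and scores are invented toy values, and the helper names `roc_auc` and `accuracy_at` are hypothetical. It computes ROC-AUC as the fraction of (positive, negative) pairs where the positive case gets the higher score, and contrasts that with accuracy, which shifts with the chosen threshold.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold):
    """Accuracy after turning scores into 0/1 predictions at a threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy example: scores rank most patients correctly but are all below 0.5.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.30, 0.40, 0.45]

print("ROC-AUC:", roc_auc(y_true, scores))
print("accuracy @ 0.5:", accuracy_at(y_true, scores, 0.5))
print("accuracy @ 0.3:", accuracy_at(y_true, scores, 0.3))
```

The same scores give different accuracy at different cut-offs, while ROC-AUC stays fixed. That is useful in a screening setting where the threshold is chosen later based on the cost of missed cases.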

What's next

  • Hyperparameter tuning (GridSearchCV / RandomizedSearchCV)
  • SHAP explanations for deeper interpretability
  • ECG time-series integration for improved prediction
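
As a rough idea of what the tuning step could look like, here is a sketch using GridSearchCV over a pipeline. Nothing here has been run on the project data: the synthetic dataset from `make_classification`, the parameter grid, the 3-fold split, and the helper name `tune_random_forest` are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune_random_forest(X, y, seed=42):
    """Grid-search a Random Forest pipeline, scored by cross-validated ROC-AUC."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=seed)),
    ])
    # Illustrative grid only; real ranges would depend on the data.
    param_grid = {
        "clf__n_estimators": [50, 100],
        "clf__max_depth": [4, 8, None],
    }
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
    # Preprocessing is refitted inside each fold, so no fold leaks into another.
    search.fit(X, y)
    return search


# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
search = tune_random_forest(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Swapping GridSearchCV for RandomizedSearchCV (with `param_distributions` and `n_iter`) would follow the same pattern when the grid gets too large to search exhaustively.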
