💡 Inspiration

Cardiovascular disease kills 17 million people annually — yet it's largely preventable with early detection. I wanted to build a tool that makes clinical-grade CVD risk prediction accessible from routine measurements any doctor already collects: blood pressure, age, cholesterol, and weight.

🔬 What I Built

CardioSense AI is an end-to-end ML pipeline that predicts cardiovascular disease risk from clinical features, trained on 70,000 de-identified patient records.

Full pipeline:

  • EDA with skewness analysis and outlier detection
  • Memory optimization: 6.4 MB → 1.6 MB (75% reduction) via dtype casting
  • 5 feature engineering strategies tested (BMI, binned bands, digit decomposition, categorical encoding, genetic programming via gplearn)
  • 3 model families benchmarked: XGBoost, LightGBM, RealMLP
  • 5-fold Stratified Cross-Validation with out-of-fold (OOF) predictions
  • Optuna Bayesian hyperparameter tuning (50 trials)
  • SHAP KernelExplainer for model interpretability
  • Streamlit web app for real-time risk assessment
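
The dtype-casting step from the list above can be sketched roughly like this. It is a minimal illustration on synthetic data, not the project's actual code: `cast_compact` is a hypothetical helper, and the column names and target dtypes are assumptions.

```python
import numpy as np
import pandas as pd


def cast_compact(df: pd.DataFrame, dtype_map: dict) -> pd.DataFrame:
    """Return a copy of df with the listed columns cast to smaller dtypes.

    Integer targets are range-checked first so a cast can never silently overflow.
    """
    out = df.copy()
    for col, dtype in dtype_map.items():
        target = np.dtype(dtype)
        if target.kind in "iu":
            info = np.iinfo(target)
            if out[col].min() < info.min or out[col].max() > info.max:
                raise ValueError(f"{col} does not fit in {target}")
        out[col] = out[col].astype(target)
    return out


# Synthetic stand-in for the patient table (values are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(10_000, 24_000, size=1_000),   # days, default int64
    "ap_hi": rng.integers(90, 200, size=1_000),        # default int64
    "weight": rng.uniform(40.0, 150.0, size=1_000),    # default float64
})

slim = cast_compact(demo, {"age": "int16", "ap_hi": "int16", "weight": "float32"})

before = demo.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The range check matters more than the cast itself: an `int16` column tops out at 32,767, so age in days fits, but a mistyped value could wrap around without it.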

📊 Results

Model                             AUC
Baseline (Logistic Regression)    0.730
XGBoost (best)                    0.7901
LightGBM (best)                   0.7945
RealMLP v1                        0.7996
RealMLP v3 (final model)          0.8003

+9.6% relative AUC improvement over baseline (0.730 → 0.8003, +0.0703 absolute).
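
For anyone checking the headline number, here is the arithmetic as a short sketch. It uses only the two AUC values from the results above; the variable names are mine.

```python
baseline_auc = 0.730   # Logistic Regression baseline, from the results above
final_auc = 0.8003     # RealMLP v3, from the results above

absolute_gain = final_auc - baseline_auc
relative_gain = absolute_gain / baseline_auc

print(f"absolute: +{absolute_gain:.4f} AUC")   # absolute: +0.0703 AUC
print(f"relative: +{relative_gain:.1%}")       # relative: +9.6%
```

The 9.6% is a relative gain; in absolute terms the final model adds about 0.07 AUC.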

🧠 What I Learned

  • RealMLP outperforms gradient boosting on this medical dataset thanks to learned feature embeddings and ensembling
  • Digit decomposition of BP values (ap_hi_tens, ap_hi_units etc.) gave the model richer numeric structure — v3 beat all other versions
  • SHAP revealed systolic BP, age, and cholesterol as dominant predictors — consistent with cardiology literature
  • More features ≠ better: domain interaction features lowered AUC by 0.00047
  • Smoking/alcohol had near-zero SHAP importance, most likely because of self-reporting bias in the dataset (only 8.7% smokers). A real data quality lesson
  • Optuna hit a ceiling — 50 trials showed no improvement over initial params

🛠️ How I Built It

Step 1 — Data Cleaning: Raw age is stored in days (max 23,713). Removed impossible BP/weight/height values, rows where diastolic > systolic, and 24 duplicates. Final: 65,663 patients.
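
A rough sketch of what this cleaning step could look like in pandas. Everything specific here is an assumption for illustration: the `clean_records` helper, the column names other than `ap_hi`, and the plausibility thresholds are not taken from the project.

```python
import pandas as pd


def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicates, implausible vitals, and rows with diastolic > systolic."""
    out = df.drop_duplicates()
    plausible = (
        out["ap_hi"].between(60, 250)        # systolic, mmHg (assumed bounds)
        & out["ap_lo"].between(30, 200)      # diastolic, mmHg (assumed bounds)
        & out["height"].between(120, 220)    # cm (assumed bounds)
        & out["weight"].between(30, 200)     # kg (assumed bounds)
        & (out["ap_lo"] <= out["ap_hi"])     # diastolic must not exceed systolic
    )
    return out[plausible].reset_index(drop=True)


# Tiny made-up table: one duplicate, one swapped BP pair, one impossible systolic.
raw = pd.DataFrame({
    "age":    [18393, 18393, 20228, 23713, 17474],
    "ap_hi":  [110,   110,   80,    16020, 130],
    "ap_lo":  [80,    80,    120,   80,    90],
    "height": [168,   168,   156,   165,   169],
    "weight": [62.0,  62.0,  85.0,  64.0,  82.0],
})

cleaned = clean_records(raw)
print(len(raw), "->", len(cleaned), "rows")
```

Running the filters as one combined mask makes it easy to count how many rows each rule would remove before committing to a threshold.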

Step 2 — Feature Engineering: Tested 5 strategies. Best was v3 — digit decomposition of BP/height/weight into hundreds, tens, units digits. 24 total features.
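
Digit decomposition is easier to see in code than in words. This is a minimal sketch, assuming values are rounded to whole numbers in the range 0 to 999 first; `digit_features` is a hypothetical helper, and only the `ap_hi_tens` / `ap_hi_units` naming pattern comes from the project.

```python
def digit_features(value: float, name: str) -> dict:
    """Split a measurement into hundreds, tens, and units digit features."""
    whole = int(round(value))  # assumption: fractional parts are rounded away
    if not 0 <= whole <= 999:
        raise ValueError(f"{name}={value} is outside the supported 0-999 range")
    return {
        f"{name}_hundreds": whole // 100,
        f"{name}_tens": (whole // 10) % 10,
        f"{name}_units": whole % 10,
    }


print(digit_features(128, "ap_hi"))
# {'ap_hi_hundreds': 1, 'ap_hi_tens': 2, 'ap_hi_units': 8}
```

Each measurement turns into three small integer features, which the model can then treat separately.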

Step 3 — Model Training: XGBoost, LightGBM, RealMLP across all 5 versions, 5-fold CV. RealMLP v3 → AUC 0.8003.
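
The out-of-fold (OOF) evaluation loop follows the usual pattern sketched below. To keep the example small and runnable it swaps in scikit-learn's LogisticRegression on synthetic data; the project's models (XGBoost, LightGBM, RealMLP) would take the place of `model`, and the seeds and sample sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data standing in for the patient table.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# One prediction slot per row; NaN marks rows not yet predicted.
oof = np.full(len(y), np.nan)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Each row is scored only by the fold model that never saw it in training.
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]

oof_auc = roc_auc_score(y, oof)
print(f"OOF AUC: {oof_auc:.4f}")
```

Scoring the pooled OOF predictions once gives a single AUC over every row, which is a bit more stable than averaging five per-fold AUCs.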

Step 4 — Explainability: SHAP KernelExplainer, 200 test patients, precomputed and saved for instant app display.

Step 5 — Deployment: Dark medical Streamlit UI, real-time risk gauge, SHAP chart. Model converted CUDA→CPU for cloud deployment.

⚡ Challenges

  • CUDA → CPU deployment: Model trained on RTX 4050 but Streamlit Cloud has no GPU — manually converted torch tensors to CPU
  • SHAP + RealMLP: Not supported by TreeExplainer — used KernelExplainer, took ~15 minutes for 200 samples
  • Age normalization: Dataset stores age in days — found exact max (23,713) to correctly normalize
  • Optuna ceiling: 50 trials, zero improvement, which suggests the initial params were already near-optimal within the search space I tried
