Inspiration

Honestly, cardiovascular disease affects so many people that I thought it would be cool to see if I could actually predict it with the data available. I also had two datasets, one massive and one tiny, and I genuinely wanted to know which one would perform better. Turns out size doesn't matter as much as I thought lol.


What it does

Takes two cardiac datasets and runs three supervised ML models (Logistic Regression, Random Forest, and Gradient Boosting) on both to predict whether a patient has cardiovascular disease. It cleans the data, does EDA to figure out which features actually mean something, trains the models, and compares the two datasets side by side using ROC-AUC. Basically it tells you: here's what the data says, here's how confident the model is, and here's why one dataset is way better than the other despite being 75x smaller.


How we built it

Just me and Python, honestly. Pandas and NumPy for cleaning, matplotlib and seaborn for all the graphs, and scikit-learn for everything modelling-related. Wrapped everything in Pipelines with StandardScaler so there's no data leakage during cross-validation, and used StratifiedKFold to keep the class balance consistent across splits.
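The setup looks roughly like this. A minimal sketch on synthetic data, not the real datasets — the point is just that the scaler sits inside the Pipeline so it's re-fit per fold:

```python
# Sketch of the leakage-safe setup described above. The synthetic data
# and hyperparameters are placeholders, not the real project values.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=42)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=42),
    "gb": GradientBoostingClassifier(random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    # StandardScaler lives inside the Pipeline, so it is re-fit on each
    # training fold only -- no leakage into the validation fold.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: ROC-AUC = {scores.mean():.3f}")
```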

Cleaned the Cardiac Base dataset first: age was stored in days (values in the tens of thousands, which took me a second to figure out), blood pressure had biologically impossible readings like diastolic higher than systolic, and outliers were removed with the three-sigma rule. Engineered BMI from height and weight, and added pulse pressure and some interaction terms after the initial runs.
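In pandas, those cleaning steps look roughly like this. The column names (`age` in days, `ap_hi`/`ap_lo` for systolic/diastolic, `height` in cm, `weight` in kg) are my assumption of the schema, and the toy rows are made up:

```python
# Rough sketch of the cleaning steps above; toy data, assumed schema.
import pandas as pd

df = pd.DataFrame({
    "age": [18393, 20228, 19000, 22000],   # stored in days
    "ap_hi": [120, 140, 90, 200],
    "ap_lo": [80, 90, 110, 100],           # third row: diastolic > systolic
    "height": [168, 156, 170, 180],        # cm
    "weight": [62.0, 85.0, 70.0, 95.0],    # kg
})

df["age_years"] = df["age"] / 365.25        # age was stored in days
df = df[df["ap_lo"] < df["ap_hi"]].copy()   # drop impossible BP rows

# Three-sigma outlier rule on blood pressure
for col in ["ap_hi", "ap_lo"]:
    mu, sigma = df[col].mean(), df[col].std()
    df = df[(df[col] - mu).abs() <= 3 * sigma]

# Engineered features
df["bmi"] = df["weight"] / (df["height"] / 100) ** 2
df["pulse_pressure"] = df["ap_hi"] - df["ap_lo"]
```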

For Heart Processed, same logic, just different features and more clinical measurements like ECG readings and stress test results.

I will be extremely frank: I used AI to help with formatting graphs, writing syntactic code, and adding comments. All the logic, data analysis, and modelling decisions were my own. Also, my PC is a little potato, so that was fun.


Challenges we ran into

The lifestyle features (smoking, alcohol, activity) were completely useless: every group sat at around a 47% disease rate, which is basically the dataset's base rate. I spent a while thinking I had a bug before accepting that the data just doesn't have that signal. Self-reported lifestyle data is noisy, I guess.
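The check itself is a one-liner: compare the disease rate within each group to the overall base rate. Toy data here, not the real dataset, but the pattern is the same — near-identical group rates means the feature carries no signal:

```python
# Spotting a "flat" feature: disease rate per group vs. base rate.
# Toy data; column names are stand-ins for the real ones.
import pandas as pd

df = pd.DataFrame({
    "smoke":  [0, 1, 0, 1, 0, 1, 0, 1],
    "cardio": [0, 1, 1, 0, 1, 0, 0, 1],
})

base_rate = df["cardio"].mean()
by_group = df.groupby("smoke")["cardio"].mean()
print(base_rate)   # overall disease rate
print(by_group)    # per-group rates; if these match the base rate, no signal
```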

The interaction terms I made caused multicollinearity: bp_age_interaction correlates at 0.79 with ap_hi, and bmi_age_interaction at 0.85 with bmi. They didn't really hurt performance, but they made a mess of the correlation heatmap.
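This is easy to reproduce: an interaction term inherits most of its parent feature's variance, so the pairwise correlation comes out high. A sketch on synthetic data (the real run showed 0.79 and 0.85; the exact numbers here will differ):

```python
# Multicollinearity check on engineered interaction terms.
# Synthetic distributions; not the real dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(40, 70, 300),
    "ap_hi": rng.normal(130, 15, 300),
    "bmi": rng.normal(27, 4, 300),
})
df["bp_age_interaction"] = df["ap_hi"] * df["age"]
df["bmi_age_interaction"] = df["bmi"] * df["age"]

corr = df.corr()
# The interaction terms correlate strongly with their parent features,
# which is what clutters the heatmap.
print(corr.loc["bp_age_interaction", "ap_hi"])
print(corr.loc["bmi_age_interaction", "bmi"])
```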

Couldn't even attempt unsupervised learning: no Colab credits, my PC being a potato, and my unfamiliarity with PyTorch and TensorFlow. So that was a hard stop early on.

Also only realised mid-analysis that Heart Processed is 79% male, which is a real bias problem. Couldn't fix it, but at least flagged it.


Accomplishments that we're proud of

Getting the whole pipeline working cleanly on two independent datasets and having the story actually make sense at the end. The jump from 0.80 to 0.93 AUC between datasets, and being able to explain exactly why it happened, felt genuinely satisfying.

Also just finishing this at all in 4 days on limited hardware while being honest about what I don't know yet. Not gonna pretend that wasn't a grind.


What we learned

Better data beats more data. Full stop. Heart Processed is 75x smaller than Cardiac Base but absolutely smokes it on every metric, because exercise angina and ST slope are direct measurements of what a heart does under stress. BMI and age just tell you your risk is higher; they don't confirm disease. No amount of rows fixes a weak feature set.

Also, feature engineering should be iterative. BMI wasn't in my first pipeline; I added it after the initial runs and confirmed it helped before keeping it. Don't engineer everything upfront, let the model tell you what it needs.
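One way to "let the model tell you" is to compare cross-validated AUC with and without the candidate feature. A sketch on synthetic data — the engineered column here is just a stand-in, not the real BMI feature:

```python
# Keep-or-drop check for a candidate engineered feature.
# Synthetic data; the interaction column is a stand-in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=1)
candidate = X[:, [0]] * X[:, [1]]   # stand-in engineered feature

base = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=5, scoring="roc_auc").mean()
with_feat = cross_val_score(LogisticRegression(max_iter=1000),
                            np.hstack([X, candidate]), y,
                            cv=5, scoring="roc_auc").mean()
# Keep the feature only if with_feat beats base by a meaningful margin.
print(base, with_feat)
```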

And the context of data collection matters. Resting BP was actually higher for healthy patients in Heart Processed, which looks wrong until you realise it's a clinical referral population where the sick patients are probably already on blood pressure meds. Numbers only make sense if you know where they came from.

Built With
