Inspiration
From the beginning of the course, I understood the huge impact that ML and AI could have on the medical field. My grandfather is a heavy smoker, and since I was small, my family has lived through those tragic moments of fear when we rushed him to the hospital because of a heart malfunction. Each time, the doctors would explain that the cause was heart disease and that he needed quarterly heart checkups to make sure his condition stayed stable. However, the process was hard to follow, extremely costly, and prone to human error (doctors are not always able to detect heart disease). Thus, I decided to create an ML project that would provide a reasonably adequate solution to this issue.
What it does
The aim of the project is to predict whether a patient has heart disease. The project is intended to be used by doctors, as most of the inputs are medical data that a doctor observes from a patient's checkup report. There are many types of heart disease, and detecting them from the patient's report and symptoms alone can be difficult for the doctor without further scans and reports, which are mostly very costly. Even though the program is not 100% accurate, its predictions are reasonably close to the patient's true result.
How we built it
I used the UCI Heart Disease dataset from Kaggle to train and test my model. I started by loading the data and preparing it for the models I would be testing. I tried two approaches: a random forest classifier and decision trees with bagging. For both models, I performed hyperparameter tuning using 5-fold cross-validation, which helped reduce overfitting and improve generalization accuracy. Both models performed extremely well, achieving 85%+ accuracy in most runs, but I chose the random forest classifier because on average it achieved better accuracy and was more stable across runs. Another reason I selected the random forest is that it gave a better bias-variance tradeoff than the bagged decision trees, with faster computation time. Because each split in a random forest considers only a random subset of the features, the trees in the forest are less correlated with one another, which reduces the effects of overfitting and noise in the data.
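The comparison described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the exact code from the project: the data is a synthetic stand-in generated with make_classification, and the hyperparameter grids and the names candidates and results are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared heart disease features and labels (assumption).
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)

# Hold out 10% of the rows for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

# Two candidate models, each with a small illustrative hyperparameter grid (assumed values).
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [4, None]},
    ),
    "bagged_trees": (
        BaggingClassifier(DecisionTreeClassifier(random_state=0), random_state=0),
        {"n_estimators": [25, 50], "max_samples": [0.7, 1.0]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validation over the grid, run on the training portion only.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        # Accuracy of the refitted best model on the held-out 10%.
        "test_accuracy": search.score(X_test, y_test),
    }

for name, summary in results.items():
    print(name, summary)
```

Repeating this with several random_state values is one way to compare the stability of the two models across runs.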
Challenges we ran into
One of the challenges I faced was that two columns in my dataset had about 30% of their values missing (N/A). Removing those columns was not an option: after creating a correlation matrix between them and the final prediction, I could see a noticeable correlation between those columns and the label to be predicted. I solved this by filling the N/A values with the average (for continuous columns) or the mode (for categorical columns) of the known values within each stage of heart disease. This step substantially improved the accuracy of the Random Forest model while keeping the correlation between the columns and the data nearly unchanged.
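A per-stage fill like the one described above might look like the sketch below, written with pandas. The function impute_by_label and the column names stage, feat_num, and feat_cat are hypothetical, and the toy table is made up purely to show the mechanics.

```python
import numpy as np
import pandas as pd


def impute_by_label(df, label_col, numeric_cols, categorical_cols):
    """Fill missing values using rows that share the same label:
    the group mean for numeric columns, the group mode for categorical ones."""
    out = df.copy()
    for col in numeric_cols:
        group_mean = out.groupby(label_col)[col].transform("mean")
        out[col] = out[col].fillna(group_mean)
    for col in categorical_cols:
        # Most frequent non-missing value per label (NaN if a group has none).
        group_mode = out.groupby(label_col)[col].agg(
            lambda s: s.mode().iloc[0] if not s.mode().empty else np.nan
        )
        out[col] = out[col].fillna(out[label_col].map(group_mode))
    return out


# Toy data: "stage" is the label; the other two columns have gaps.
toy = pd.DataFrame(
    {
        "stage": [0, 0, 0, 1, 1, 1],
        "feat_num": [0.0, np.nan, 2.0, 3.0, np.nan, 1.0],
        "feat_cat": ["normal", "normal", None, "fixed", None, "fixed"],
    }
)
filled = impute_by_label(toy, "stage", ["feat_num"], ["feat_cat"])
print(filled)
```

Because the fill values depend on the label, computing them from the training rows only keeps information about the test labels out of the features.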
Accomplishments that we're proud of
I am very proud that I was able to get very high accuracy in predicting heart disease. I achieved a maximum accuracy of 90% on a test set that consists of 10% of the training data. This result came from the random forest classifier that I trained.
What we learned
This project was an extremely fruitful experience and is surely one of the most important building blocks in my ML career. It helped me understand how to approach a task that has minimal instructions or restrictions, and that an ML project is a systematic process that starts with data processing and ends with training and testing a model. I also came to appreciate the true meaning of data and learned to use ideas like correlation and reproducibility. I now understand how to assess the quality of the data I am using in my model. In addition, I gained valuable insight into how to choose a model and when a given ML model is the best fit.
What's next for Predicting Heart Disease
After looking at the confusion matrix of the training data, I saw that the model was doing a very good job of predicting no disease, stage 1, and stage 2 heart disease. However, the model is not always consistent on stages 3 and 4, and the accuracy for those classes often fluctuates. This is because class 0 (no heart disease), stage 1, and stage 2 have a lot of instances in the training and test sets, whereas stages 3 and 4 are much less frequent in both sets, as they are rarer in patients. So one improvement I would look into is addressing the class imbalance in the training data, which would hopefully make the predictions for the rarer stages more consistent. The other improvement I would look into is introducing a neural network model and experimenting with the results it produces.
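One possible first step for the class imbalance is to reweight the classes during training and then check the per-class results in a confusion matrix. The sketch below uses scikit-learn's class_weight="balanced" option on a random forest. The data is synthetic, and the class proportions are assumptions chosen only to mimic rare stages 3 and 4. It is not a result from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

stages = [0, 1, 2, 3, 4]

# Synthetic 5-class data in which stages 3 and 4 are rare (assumed proportions).
X, y = make_classification(
    n_samples=1000, n_features=13, n_informative=8, n_classes=5,
    n_clusters_per_class=1, weights=[0.45, 0.25, 0.15, 0.10, 0.05],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# "balanced" weights each class inversely to its frequency in the training labels.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true stages, columns are predicted stages.
cm = confusion_matrix(y_test, y_pred, labels=stages)
# Recall per stage: the fraction of each true stage that was predicted correctly.
per_class_recall = recall_score(
    y_test, y_pred, labels=stages, average=None, zero_division=0
)

print(cm)
for stage, recall in zip(stages, per_class_recall):
    print(f"stage {stage}: recall = {recall:.2f}")
```

Looking at recall per stage on held-out data, rather than overall accuracy, makes it easier to see whether the rare stages actually improve.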