Inspiration

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths. This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status.

What it does

Predicts the likelihood of a person having a stroke.

How we built it

We used machine learning to predict whether a person is likely to have a stroke. Two models were trained: a RandomForest classifier and an XGBoost classifier. To improve the predictions, I tried hyperparameter tuning with GridSearch. Both the RandomForest and XGBoost models reached roughly the same accuracy. Balanced accuracy was used to evaluate the performance of the models.
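The setup above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the exact code used: the data is synthetic stand-in data (the real stroke dataset is not reproduced here), the parameter grid values are assumptions, and only the RandomForest half is shown since XGBoost follows the same GridSearchCV pattern.

```python
# Sketch of the described workflow: RandomForest tuned with GridSearchCV,
# evaluated with balanced accuracy. Synthetic imbalanced data stands in
# for the real stroke dataset; grid values are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Imbalanced toy data: ~95% negative class, mimicking the stroke dataset.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="balanced_accuracy", cv=3)
search.fit(X_train, y_train)

preds = search.predict(X_test)
print("Best params:", search.best_params_)
print("Balanced accuracy:", balanced_accuracy_score(y_test, preds))
```

Scoring the grid search with `balanced_accuracy` rather than plain accuracy matters here, because on skewed data plain accuracy would reward models that simply favor the majority class.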

Challenges we ran into

The main challenge in training was working with an imbalanced dataset. Fine-tuning hyperparameters did not help improve the model: because the dataset is highly skewed, tuning makes little difference to the balanced_accuracy score. Grid search is likely to drift toward the majority class when searching for the best hyperparameters, which leads to the same result.

Accomplishments that we're proud of

At its current state, the project is not yet ready to be deployed and used worldwide. However, I am proud of how much I learned while building it. Hopefully the model will eventually reach a level where it can be used by different people and save lives.

What we learned

During the time I spent building this model, I learned more about different classifiers like RandomForest. Most importantly, I saw how a model can be affected by an imbalanced dataset. Furthermore, if we don't use the right accuracy metric, it will lead us to assume that our model is performing well. For example, I was initially using plain accuracy to measure error; however, because the negative class is the large majority, plain accuracy is misleading.
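A small made-up example shows why plain accuracy misleads here. The class ratio below (95 negatives, 5 positives) is illustrative, not taken from the actual dataset: a "model" that always predicts the negative class scores high on accuracy but no better than chance on balanced accuracy.

```python
# Why plain accuracy misleads on imbalanced data: a degenerate "model"
# that always predicts the negative class. Numbers are illustrative.
y_true = [0] * 95 + [1] * 5          # 95% negative, 5% positive
y_pred = [0] * 100                   # always predict "no stroke"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy = mean of the per-class recalls.
recall_neg = sum(p == 0 for t, p in zip(y_true, y_pred) if t == 0) / 95
recall_pos = sum(p == 1 for t, p in zip(y_true, y_pred) if t == 1) / 5
balanced_accuracy = (recall_neg + recall_pos) / 2

print(accuracy)            # 0.95 -- looks great
print(balanced_accuracy)   # 0.5  -- no better than chance
```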

What's next for Stroke Prediction

To improve our model we need to fix our dataset. The major causes of an imbalanced dataset are biased sampling and measurement error. The first thing to consider before building the model is resampling the training set. We can resample with different methods. 1. Under-sampling: we can balance the dataset by reducing the majority class.
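A minimal under-sampling sketch, assuming labels 0 (no stroke, the majority) and 1 (stroke, the minority): it randomly drops majority-class rows until the two classes are the same size. The `undersample` helper and the toy data are hypothetical; libraries such as imbalanced-learn provide the same idea ready-made (e.g. its RandomUnderSampler).

```python
# Random under-sampling sketch: shrink the majority class to the size of
# the minority class. The helper and toy data are illustrative only.
import random

def undersample(X, y, seed=42):
    """Return X, y with the majority class reduced to the minority size."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    majority, minority = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]

X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5               # heavily imbalanced toy labels
X_bal, y_bal = undersample(X, y)
print(sum(y_bal), len(y_bal))        # 5 10 -> balanced 5 vs 5
```

The trade-off of under-sampling is that it discards data; with very few positive cases, over-sampling the minority class (or synthetic methods like SMOTE) is a common alternative.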

Built With
