Inspiration

We are interested in machine learning and would like to pursue careers in ML or data science. We therefore decided to build a prediction model for a disease that is heavily influenced by lifestyle risk factors, so that it could support early intervention in hospitals. Heart disease seemed like a perfect fit because it is the leading cause of death worldwide and many of its risk factors are preventable.

What it does

It trains a heart disease prediction model on 270 samples, using 13 different factors recorded at hospital admission. (We have a further 303 samples that we have been using for testing and could also incorporate into training.) It provides an interface where health care providers can enter data about a patient and get a suggestion on whether to consider screening or intervention.

How we built it

We used the Markov Logic Network (MLN) package Tuffy to express the first-order logic of our world and the rules within it. We first tested with our own estimated weights and got an accuracy of 67%. We then used Tuffy to relearn the weights and tested again, reaching 89% accuracy on the separate 303 samples set aside for testing (although we believe there may be significant overlap between that set and the 270-sample dataset). With 5-fold cross-validation we obtained 80.7% accuracy. Finally, we tried a learning ensemble of 100 randomly aggregated MLN models, which reached 82.6%. This was only slightly better than the cross-validation result and took many hours to train, so we believe an ensemble does not give enough gain for its cost on this problem. After these analyses, we built a front end that lets a user input health data and assesses whether the individual is at risk and should consider precautions or screening.
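
For readers unfamiliar with k-fold cross-validation, here is a minimal Python sketch of the evaluation loop. It is not our actual pipeline: the `train` callback stands in for an MLN weight-learning run, and `train_majority` is a made-up placeholder learner included only so the example runs on its own. Accuracy here is pooled over all held-out predictions, which is one of several reasonable ways to report it.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Dict[str, float], bool]        # (features, has_heart_disease)
Predictor = Callable[[Dict[str, float]], bool]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(
    samples: Sequence[Sample],
    train: Callable[[List[Sample]], Predictor],
    k: int = 5,
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the remaining fold, repeat for every fold."""
    correct = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        training = [s for i, s in enumerate(samples) if i not in held]
        predict = train(training)  # hypothetical stand-in for weight learning
        correct += sum(predict(samples[i][0]) == samples[i][1] for i in held_out)
    return correct / len(samples)


def train_majority(training: List[Sample]) -> Predictor:
    """Placeholder learner: always predicts the most common training label."""
    positives = sum(1 for _, label in training if label)
    majority = positives * 2 >= len(training)
    return lambda features: majority
```

With 270 samples and k = 5, each fold holds 54 samples, so every model is trained on 216 samples and tested on the 54 it has never seen.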

Challenges we ran into

It took us a lot of time to model the data accurately as a set of rules and weights, and we needed to research and learn a lot about heart disease. We also had to verify the correctness of the dataset and make sure we fully understood each column. The learning ensemble was difficult to implement because it scaled poorly, which led us to shift focus once it seemed impractical. We were also unable to include blood pressure, because the data was possibly incorrect and hurt the accuracy. This is disappointing, because blood pressure seems like it could be a very useful predictor of heart disease.
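
To make "rules and weights" more concrete, here is a simplified Python sketch of how weighted rules turn into a risk estimate. It assumes the special case where every rule has the form "factor(p) => HeartDisease(p)" and patients do not influence each other; in that case the MLN probability reduces to a logistic function of the summed weights. The rule names and weight values below are invented for illustration and are not the rules or weights from our project, and Tuffy itself performs general inference rather than this shortcut.

```python
import math
from typing import Dict, Set

# Hypothetical weights, one per rule "factor(p) => HeartDisease(p)".
# Both the factor names and the numbers are made up for this example.
RULE_WEIGHTS: Dict[str, float] = {
    "smoker": 0.8,
    "high_cholesterol": 0.6,
    "exercises_regularly": -0.7,
}


def disease_probability(
    observed: Set[str], weights: Dict[str, float], bias: float = 0.0
) -> float:
    """P(HeartDisease | evidence) when every rule is `factor => HeartDisease`.

    A rule whose factor is false is satisfied whether or not the patient has
    the disease, so it cancels out of the ratio. A rule whose factor is true
    is satisfied only in the "disease" world, so its weight is added to the
    log-odds. `bias` plays the role of a weighted unit rule on HeartDisease.
    """
    log_odds = bias + sum(w for name, w in weights.items() if name in observed)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    patient = {"smoker", "high_cholesterol"}
    print(f"estimated risk: {disease_probability(patient, RULE_WEIGHTS):.2f}")
```

A positive weight pushes the estimate above 50% when its factor is present, and a negative weight pulls it below, which is why getting the sign and size of each weight right took so much research.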

Accomplishments that we're proud of

We were very pleased with the accuracy we obtained. Over 80% felt great to us, and we believe it could be improved further with more analysis.

What we learned

We learned about Markov Logic Networks, first-order logic, and learning ensembles, as well as the risk factors for heart disease. We also each came in with different skill sets, so it was very helpful to learn from each other instead of learning these things from books.

What's next for Heart Disease Risk Assessment

The accuracy could possibly be improved by assessing the importance of each feature/predictor. We could also improve the running time by storing the models instead of generating them at runtime. We currently generate the model on each query because the user can omit certain data, in which case the model needs to be generated differently. We can get around this by storing every possible model ahead of time and looking up the right one instead of generating it, because memory is relatively cheap.
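
As a rough sketch of the storage idea, the Python below keys each model by the set of fields the user actually supplied. With 13 factors that can each be present or omitted, there are at most 2^13 = 8,192 such sets. The `ModelCache` class and the `build_model` callback are hypothetical names for illustration. This version builds each model the first time it is requested and reuses it afterwards, a lazy variation on precomputing every model up front.

```python
from typing import Callable, Dict, FrozenSet, Generic, Iterable, TypeVar

Model = TypeVar("Model")


def subset_count(n_features: int) -> int:
    """Number of distinct present/omitted combinations for n optional fields."""
    return 2 ** n_features


class ModelCache(Generic[Model]):
    """Stores one model per set of supplied feature names."""

    def __init__(self, build_model: Callable[[FrozenSet[str]], Model]) -> None:
        self._build = build_model          # hypothetical model generator
        self._models: Dict[FrozenSet[str], Model] = {}
        self.builds = 0                    # how many models were generated

    def get(self, provided_features: Iterable[str]) -> Model:
        # frozenset makes the key independent of the order fields were entered.
        key = frozenset(provided_features)
        if key not in self._models:
            self._models[key] = self._build(key)
            self.builds += 1
        return self._models[key]
```

Calling `cache.get(["age", "sex"])` and then `cache.get(["sex", "age"])` returns the same stored model and generates it only once.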

Built With
