Drug consumption prediction for NHS

Done as a project at AIHack 2020 at Imperial College London.

Motivation

As life expectancy increases, improving quality of life among the elderly becomes crucial. As people age, they become more prone to health problems such as viral and bacterial infections and to degenerative diseases. Osteoporosis and arthritis, which affect the bones and joints, neurodegenerative diseases such as Alzheimer’s and Parkinson's, cancer, and diabetes are among the most common degenerative diseases.

Our project aims to measure correlations between different diseases across regions of the UK. We built a time-series dataset of the number of prescriptions per GP in the UK, and we attempt to predict the monthly consumption of specific medicines per GP.

The main contributions of this project are:

Dataset (Oeslle Lucena and Max Grogan)
Method (Martin Ferianc)
Visualisation and analysis (Saurav Zangeneh)

Dataset

The initial dataset consisted of a metadata file mapping medical procedure codes to descriptions, accompanied by a separate CSV for each medical procedure. Each CSV contained roughly 5 years' worth of prescription counts for every GP in the UK that prescribed that procedure at least once. Our contribution lies in processing this data and imputing missing values to produce a time-series dataset of prescription data from roughly 6000 GPs in the UK for the thirty most commonly prescribed medical procedures over a roughly 5-year period.
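The reshaping and gap-filling step can be illustrated with a minimal pandas sketch. The column names (gp_id, month, count), the toy values, and the choice of linear interpolation are all assumptions made for illustration; the project tested several imputation approaches and this is not necessarily the one it settled on.

```python
import pandas as pd

# Hypothetical long-format records: one row per (GP, month) with a count.
# Column names and values are illustrative, not the original schema.
records = pd.DataFrame(
    {
        "gp_id": ["A", "A", "A", "B", "B"],
        "month": ["2019-01", "2019-02", "2019-04", "2019-01", "2019-04"],
        "count": [10, 12, 16, 5, 8],
    }
)
records["month"] = pd.to_datetime(records["month"], format="%Y-%m")

# Pivot to one row per month and one column per GP.
wide = records.pivot(index="month", columns="gp_id", values="count")

# Reindex onto a complete monthly calendar so missing months become NaN.
full_index = pd.date_range(wide.index.min(), wide.index.max(), freq="MS")
wide = wide.reindex(full_index)

# Fill the gaps by linear interpolation (an assumed choice, one of many).
filled = wide.interpolate(method="linear", limit_direction="both")
print(filled)
```

Linear interpolation keeps every GP's series on the same monthly grid, but it can smooth over real drops in prescribing, so it is worth comparing against alternatives such as forward-filling or dropping sparse series.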

Method

Our method is a multi-layer recurrent neural network (RNN) trained through backpropagation. We estimate the uncertainty of our predicted procedure counts through approximate Bayesian inference using MC Dropout.
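As a rough illustration of how MC Dropout yields an uncertainty estimate, the PyTorch sketch below keeps dropout active at prediction time and summarises several stochastic forward passes. The GRU cell, layer sizes, dropout rate, and the names DropoutRNN and mc_dropout_predict are assumptions for this example and do not describe the project's actual model.

```python
import torch
from torch import nn


class DropoutRNN(nn.Module):
    """Small stacked GRU with dropout; all sizes are illustrative."""

    def __init__(self, n_features=1, hidden_size=16, num_layers=2, p=0.2):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden_size, num_layers=num_layers,
                          dropout=p, batch_first=True)
        self.dropout = nn.Dropout(p)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.rnn(x)             # (batch, time, hidden)
        last = self.dropout(out[:, -1])  # output at the final time step
        return self.head(last)           # (batch, 1)


def mc_dropout_predict(model, x, n_samples=50):
    """Run several stochastic forward passes with dropout left active."""
    model.train()  # keeps dropout on; no weights are updated here
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)


torch.manual_seed(0)
model = DropoutRNN()
history = torch.randn(4, 12, 1)  # 4 GPs, 12 months, 1 feature (synthetic)
mean, std = mc_dropout_predict(model, history)
print(mean.squeeze(1), std.squeeze(1))
```

The mean over the samples serves as the point prediction and the standard deviation as a per-GP uncertainty estimate; the model here is untrained, so the numbers themselves carry no meaning.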

Visualisation and analysis

In our presentation you will find our visualisations of spatial-temporal correlations of prescription features that initially inspired our development of the dataset and RNN approach.
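A correlation between two prescription series can be computed along the following lines with scipy. The monthly counts are synthetic, the names drug_a and drug_b are hypothetical, and Pearson correlation is only one assumed choice of measure.

```python
from scipy.stats import pearsonr

# Synthetic monthly counts for two hypothetical prescription codes
# in a single region; these are not values from the real dataset.
drug_a = [120, 135, 150, 160, 180, 175]
drug_b = [60, 66, 77, 79, 92, 88]

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(drug_a, drug_b)
print(f"Pearson r = {r:.2f}, p = {p_value:.3f}")
```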

How we built it

The analysis and visualisation were carried out in Python using folium, scipy, and statsmodels.
Our initial dataset was curated in Python using pandas for data wrangling.
Our Bayesian RNN was trained in PyTorch using our manually curated dataset as the input.

Challenges we faced

We faced significant challenges in selecting suitable prescriptions for our proof-of-concept RNN. In addition, we tested several approaches for imputing missing values, so that we did not lose too much of our data to NaN values while preserving data integrity for downstream model training.
Lastly, when training the RNN, we faced challenges with hyperparameter tuning and model selection, going through several iterations before settling on our final model.

What we learned

Overall, we learned a lot about data cleaning and wrangling, and about turning those steps into a working, reproducible pipeline.
This was also the first time any of us had trained an RNN, and it has been a very productive experience.