Diagnosing Diseases in MIMIC-III with Deep Learning

Atishay Jain (ajain59), Jiaqi Zhang (jzhan322), Tassallah Abdullahi (tabdull1)

Link to Final Report

Links to the reflections are in the Updates.

Introduction

Diagnostic error is a growing threat to public health, contributing to higher patient mortality than any other preventable hospital adverse event. In a clinical setting, diagnostic error is responsible for 6 to 17 percent of adverse events [1]. Prior work [1] has shown that an individual physician’s diagnostic accuracy was 62.5%, rising to 85.6% when teams of physicians diagnosed the patient together. The problem is worse for complex and infrequent cases, where the diagnostic phase is lengthy and error-prone. The aim of this project is to develop a machine learning model that can support doctors in accurately diagnosing infrequent diseases as well as common ones.

A paper from Prof. Carsten Eickhoff’s lab [2] shows that basic machine learning models perform poorly on classes with few samples. We want to address this issue by using deep learning models that can maintain performance even with small sample sizes.

This is a classification problem.

Related Work

Gil Alon et al. [2] apply machine learning models, including random forests, SVMs, and multi-layer perceptrons, to make diagnoses based on patients’ demographic information and lab measurements. Each diagnosis is represented by a categorical ICD-9 code, so the problem is formulated as a classification task. Their experiments show that the models perform well on frequently occurring diseases but degrade rapidly when predicting rare ones. Jinmiao Huang et al. [3] empirically evaluate deep learning models for ICD-9 code prediction from text notes, demonstrating the feasibility of deep learning for diagnosis prediction. However, their work does not focus on predicting rarely occurring diseases.

Data

We will be using the MIMIC-III dataset, a dataset commonly used in clinical studies and disease diagnosis. It contains multiple tables with different types of data, such as text, numerical, and sequential data. For our project, we intend to use the text data for our one-shot learning algorithm and also include some numerical data for our multi-modal analysis.

The dataset contains information about 58,138 different admissions. The different data types will require different amounts of preprocessing: the text data will need a reasonable embedding before it can be used, while the numerical data can be used directly as features.
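As a sketch of the text preprocessing step we have in mind (the function names, special tokens, and frequency cutoff below are placeholders, not a committed design), a vocabulary can be built from the notes and each note converted to a fixed-length index sequence ready for an embedding layer:

```python
from collections import Counter

def build_vocab(notes, min_freq=2):
    """Build a token-to-index vocabulary from whitespace-tokenized notes.

    Index 0 is reserved for padding and index 1 for unknown tokens.
    """
    counts = Counter(tok for note in notes for tok in note.split())
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab

def encode(note, vocab, max_len=16):
    """Convert a note into a fixed-length list of token indices."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in note.split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Toy notes standing in for MIMIC-III NOTEEVENTS text.
notes = [
    "patient presents with chest pain",
    "chest pain and shortness of breath",
]
vocab = build_vocab(notes, min_freq=1)
encoded = encode(notes[0], vocab, max_len=8)
```

In the real pipeline the cleaned token sequences would feed directly into the embedding layer described in the Methodology section.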

Methodology

We will use Siamese networks for one-shot learning and an RNN + fully connected network (FCNN) for multi-modal training. Since we will be using both text and numerical data, we first need to process them separately and then combine them in a shared representation space. Specifically, for the text data we will use an embedding layer to convert tokens into vectors; these can then be combined with the numerical data to make predictions through the RNN and fully connected layers. We may also try replacing the RNN with a CNN and compare the results.
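To make the fusion step concrete, here is a minimal NumPy sketch of the forward pass; mean pooling stands in for the RNN encoder, and all dimensions and parameter names are illustrative rather than the final architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE, EMBED_DIM, NUM_FEATS, NUM_CLASSES = 100, 8, 4, 5

# Model parameters, randomly initialized for illustration.
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
W = rng.normal(size=(EMBED_DIM + NUM_FEATS, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def forward(token_ids, numeric_feats):
    """Fuse text and numerical data for one admission.

    Mean pooling stands in for the RNN encoder; in the full model the
    pooled vector would be the RNN's final hidden state instead.
    """
    text_vec = embedding[token_ids].mean(axis=0)       # (EMBED_DIM,)
    fused = np.concatenate([text_vec, numeric_feats])  # shared space
    logits = fused @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                             # class probabilities

probs = forward(np.array([3, 17, 42]), np.array([0.1, -0.5, 1.2, 0.0]))
```

The key design point is the concatenation: once both modalities live in one vector, the downstream fully connected layers can weigh text and lab features jointly.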

We will use the CS department’s grid for training our models.

Justification for multi-modal learning - The dataset contains different types of data, and integrating more of it into the model may lead to better diagnoses. We therefore plan to combine the different data types to improve overall disease prediction.

Justification for one-shot learning - One of the main reasons machine learning models fail to predict rare diseases is the lack of data samples. One-shot learning algorithms are designed for settings with very few samples per class, so we believe they should perform much better when predicting rare diseases.
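As a toy sketch of the one-shot idea (the encoder and all names below are ours, not part of the MIMIC pipeline): a Siamese network embeds both inputs with the same shared weights, and a query is assigned the label of the nearest support example, so a disease seen only once can still be predicted:

```python
import numpy as np

rng = np.random.default_rng(1)
W_enc = rng.normal(size=(10, 4))  # shared encoder weights (illustrative)

def embed(x):
    """Shared encoder: both inputs of a Siamese pair use the same weights."""
    return np.tanh(x @ W_enc)

def pair_distance(x1, x2):
    """Euclidean distance between embeddings; small means 'same class'."""
    return np.linalg.norm(embed(x1) - embed(x2))

def one_shot_predict(query, support):
    """Assign the label of the closest single support example per class."""
    dists = {label: pair_distance(query, x) for label, x in support.items()}
    return min(dists, key=dists.get)

# One example per disease is enough to form the support set.
support = {
    "rare_disease": rng.normal(size=10),
    "common_disease": rng.normal(size=10),
}
query = support["rare_disease"] + 0.01 * rng.normal(size=10)
pred = one_shot_predict(query, support)
```

In training, the encoder would be optimized with a contrastive or triplet loss so that same-diagnosis pairs land close together and different-diagnosis pairs land far apart.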

Metrics

We consider two criteria for success:

  1. If our model(s) perform better than the baselines across various metrics
  2. If our model(s) predict the rare diseases with similar accuracy as the common ones

We will divide the data into train and test splits such that every disease appears in both. After training the model on the train split, we will evaluate its performance on the test split. Since the data is class-imbalanced, simple accuracy will not suffice; we plan to report AUROC and AUPRC, which handle class imbalance much better.
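For intuition on why AUROC is more informative than accuracy here, it can be computed directly as a rank statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative. The minimal implementation below is our own sketch (not the evaluation code we will ship) and makes clear that the metric depends only on the ranking of scores, not on any decision threshold:

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half a win.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfect ranking -> 1.0
```

A classifier that always predicts the majority class can still reach high accuracy on imbalanced data, but it gains nothing here unless rare-disease cases are actually ranked above the rest.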

We will compare our model’s performance to the baseline on:

  1. All the diseases
  2. Just the rare diseases

Furthermore, we will also compare our model’s performance on the rare diseases to its performance on the common diseases to test how well it generalizes.

Our goals for the project are:

  • Base goal - Implement the multi-modal model + baselines
  • Target goal - Implement one-shot learning
  • Stretch goal - Perform interpretability analysis

Ethics

  • Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm? - The major stakeholders of this problem are doctors and patients suffering from such diseases. Any incorrect diagnosis by our model can cause major health problems to the patients. However, since this model is meant to only assist doctors and not replace them, we hope these errors are mitigated through the doctor’s own experience and judgement.
  • How are you planning to quantify or measure error or success? What implications does your quantification have? - Since we are measuring success over all the rare diseases instead of looking at specific ones, we may not be able to see any biases in the model. These biases may make the model accurate on some diseases but inaccurate on others. Thus the model could harm patients suffering from certain diseases.

Division of Labor

  • Atishay Jain - One shot learning
  • Jiaqi Zhang - Multi-modal learning
  • Tassallah Abdullahi - Obtaining the data and pre-processing it

References

  1. Barnett ML, Boddupalli D, Nundy S, Bates DW. Comparative Accuracy of Diagnosis by Collective Intelligence of Multiple Physicians vs Individual Physicians. JAMA Netw Open. 2019; 4-5.
  2. Alon, Gil, et al. "Diagnosis Prevalence vs. Efficacy in Machine-learning Based Diagnostic Decision Support." arXiv preprint arXiv:2006.13737 (2020).
  3. Huang, Jinmiao, Cesar Osorio, and Luke Wicent Sy. "An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes." Computer methods and programs in biomedicine 177 (2019): 141-153.
