Introduction
Machine learning can be a catalyst for responsive health care systems in which patient data is used in real time to deliver cost-effective, personalized, and low-risk treatment. Our project tackles the problem of finding a scalable method to extract clinical endpoints from electronic health records, specifically oncological radiology reports. To this end, we collaborated with Brown's Radiology AI Lab, which provided all of the data and labels used in this project. Reports typically arrive in large volumes, and it can be difficult to gauge the author's impression, which is often couched in technical language. Our goal is to build a natural language processing model capable of describing a patient's disease progression and response to therapy from their radiologic reports. To accomplish this, we adapted various BioBERT models to determine whether a radiology report signals evidence of cancer [2][3]. The result is output as a numerical label indicating the degree of cancer evidence (as seen in the table below).
GitHub Repo
What it does
Our group was given the task of classifying oncological radiology reports using annotated reports provided by Dr. Bai in Brown's Radiology AI Lab. We built a BioBERT-based model that takes in an oncological radiology report and outputs a label from one to seven indicating the degree of cancer evidence.
How we built it
Our model uses BioBERT to extract embeddings and then passes these through a global average pooling layer, a batch normalization layer, a dense layer of size 128 with a ReLU activation function, a dropout layer with rate 0.1, and a final dense layer of size seven with a softmax activation function. Our batch size was 32, our optimizer was Adam with a learning rate of 0.001, and we trained for 20 epochs. This combination of layers and hyperparameters allowed us to achieve our highest accuracies.
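The classification head described above can be sketched in Keras roughly as follows. This is an illustrative reconstruction, not our exact code: we assume BioBERT token embeddings are fed in with hidden size 768 (standard for BioBERT-base), and the sequence length and loss function are assumptions.

```python
import tensorflow as tf

def build_classifier(seq_len=512, hidden_size=768, num_classes=7):
    """Classification head over precomputed BioBERT token embeddings.

    Input shape (seq_len, hidden_size) and the sparse categorical loss
    are illustrative assumptions; the layer stack and hyperparameters
    follow the description above.
    """
    inputs = tf.keras.Input(shape=(seq_len, hidden_size))
    x = tf.keras.layers.GlobalAveragePooling1D()(inputs)   # average over tokens
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Training would then be a standard `model.fit(embeddings, labels, batch_size=32, epochs=20)` call.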
Challenges we ran into
Our biggest challenge was our small and imbalanced dataset. The limited amount of labeled data made creating an accurate model difficult, and preprocessing was complicated by inconsistently formatted data. Finally, we had trouble with unrealistic accuracy numbers produced when working within the Keras framework, which led us to re-write our model to make sure our accuracies were robust.
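One common mitigation for class imbalance (a sketch of a standard technique, not necessarily what we used) is inverse-frequency class weighting, which Keras accepts via `fit(class_weight=...)`:

```python
import numpy as np

def class_weights(labels, num_classes=7):
    """Inverse-frequency weights for an imbalanced label set.

    Assumes labels are 0-indexed, so our one-to-seven labels would
    first be shifted to 0-6. Rare classes get larger weights.
    """
    counts = np.bincount(labels, minlength=num_classes)
    total = counts.sum()
    return {c: total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)}
```

The resulting dict can be passed directly as the `class_weight` argument of `model.fit`.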
Accomplishments that we're proud of
We're happy to have implemented a model with robust accuracy, as well as one that performs relatively well for such a small dataset. We're also proud of having built a model from the ground up that makes use of BioBERT to extract embeddings. While programming this project, we learned a lot about NLP, using the Huggingface API, and thinking critically about our metrics and architecture.
What we learned
We learned to use Huggingface for NLP tasks, specifically using a pretrained model to extract embeddings and training a classifier over them. In our different trials, we also learned to use Keras's API functions such as fit() and evaluate(), and learned to think critically about what our model is doing by re-implementing it without these, structured similarly to our homework assignments.
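The embedding-extraction step can be sketched with the Huggingface `transformers` library as follows. The checkpoint name is a public BioBERT release; whether it is the exact checkpoint we used, and the padding/truncation settings, are assumptions.

```python
from transformers import AutoTokenizer, TFAutoModel

BIOBERT = "dmis-lab/biobert-base-cased-v1.1"  # a public BioBERT checkpoint

def extract_embeddings(texts):
    """Return BioBERT token embeddings of shape (batch, seq_len, 768).

    The BioBERT weights are released in PyTorch format, so from_pt=True
    converts them for TensorFlow.
    """
    tokenizer = AutoTokenizer.from_pretrained(BIOBERT)
    model = TFAutoModel.from_pretrained(BIOBERT, from_pt=True)
    inputs = tokenizer(texts, return_tensors="tf", padding=True,
                       truncation=True, max_length=512)
    return model(**inputs).last_hidden_state
```

These embeddings are what the downstream classification head is trained on.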
What's next for Using MSRN to extract Oncological Outcomes from Medical Data
Given that our model performs reasonably well with so little data, and especially well on the more common classes, more hand-labeled data might solve the problems it faces on rarer classes. We could also explore using weakly labeled data to train our model further. This could be done by using an earlier version of our model to label data for later training or by using an outside heuristic function to label data.
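A heuristic labeler of the kind mentioned above could look like this toy sketch. The keyword rules and the mapping of phrases to the one-to-seven scale are entirely illustrative assumptions, not rules from our project:

```python
def heuristic_label(report):
    """Toy weak labeler: keyword rules mapped onto the 1-7 evidence scale.

    Both the keywords and the label mapping are assumptions for
    illustration; a real heuristic would be designed with clinicians.
    Returns None to abstain when no rule fires.
    """
    text = report.lower()
    if "no evidence of disease" in text or "no recurrence" in text:
        return 1  # assumed: least evidence of cancer
    if "progression" in text or "metastatic" in text:
        return 7  # assumed: strongest evidence of cancer
    return None   # abstain; leave for the model or a human annotator
```

Reports where the heuristic abstains would simply be excluded from the weakly labeled training set.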
Built With
- biobert
- natural-language-processing
- tensorflow
