
Reflection 1:

Preprocessing Methodology:

Over the past week and a half, we have focused on preprocessing the data and extracting valuable information from the clinical reports. We split the preprocessing into two main parts. The first part involved creating dictionaries that map the report IDs to the labels for evidence of cancer and to the impression section of the reports. This part went smoothly: we were able to go into the Excel spreadsheet provided by the lab, locate each patient, find their respective reports, and extract the relevant information for labeling. We created a dictionary of {report ids : evidence of cancer} for the reports in the "reports" directory from "evaluation of reports.xlsx", and then used Pandas (a Python library for reading CSV files, Excel sheets, etc.) to read the file and make a dictionary of {report ids : }.
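As a rough illustration of the first part, here is a minimal sketch of how a {report ids : evidence of cancer} dictionary could be built from the spreadsheet. The column names `report_id` and `evidence_of_cancer` and the helper names are hypothetical, since the actual headers in "evaluation of reports.xlsx" are not listed in this update, and this is not necessarily the code in our repo.

```python
from typing import Any, Dict, Iterable, Mapping

# Hypothetical column names: the real spreadsheet headers are not given here.
ID_COL = "report_id"
LABEL_COL = "evidence_of_cancer"


def build_label_dict(
    rows: Iterable[Mapping[str, Any]],
    id_col: str = ID_COL,
    label_col: str = LABEL_COL,
) -> Dict[Any, Any]:
    """Map each report id to its evidence-of-cancer label."""
    labels: Dict[Any, Any] = {}
    for row in rows:
        report_id = row[id_col]
        # Fail loudly if the same report appears twice with different labels.
        if report_id in labels and labels[report_id] != row[label_col]:
            raise ValueError(f"conflicting labels for report {report_id!r}")
        labels[report_id] = row[label_col]
    return labels


def load_label_dict(xlsx_path: str) -> Dict[Any, Any]:
    """Read the spreadsheet with pandas and build the dictionary."""
    # Local import so build_label_dict can be used without pandas installed.
    import pandas as pd

    df = pd.read_excel(xlsx_path)  # reading .xlsx files needs openpyxl
    df = df.dropna(subset=[ID_COL, LABEL_COL])  # skip rows missing an id or label
    return build_label_dict(df.to_dict("records"))
```

Under these assumptions, `load_label_dict("evaluation of reports.xlsx")` would return the label dictionary. Keeping the mapping step separate from the file reading makes it easy to check on a few hand-written rows.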

Challenges:

The second part of preprocessing involved separating the impression section from the original reports. Here, we ran into some trouble because each report is structured differently. We first attempted to use bash scripting to separate the impressions based on the heading “IMPRESSION” and the ending “END OF IMPRESSION” in each report, but quickly found that certain reports had non-impression-related information between these two markers. We therefore switched to Python to separate the impressions and match them to the report IDs.
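To make the extraction step concrete, here is a minimal sketch of pulling the text between the two markers with a regular expression and keying it by report ID. The function names and the sample report are made up for illustration, and the exact casing and punctuation of the markers in the real reports is an assumption. The sketch does not attempt to filter out the non-impression text mentioned above, since those cases differ from report to report.

```python
import re
from typing import Dict, Optional

# Marker strings are the ones named above; tolerating an optional colon
# after "IMPRESSION" is an assumption about the report layout.
IMPRESSION_RE = re.compile(
    r"\bIMPRESSION\b:?\s*(.*?)\s*\bEND OF IMPRESSION\b",
    re.DOTALL,
)


def extract_impression(report_text: str) -> Optional[str]:
    """Return the text between the two markers, or None if either is missing."""
    match = IMPRESSION_RE.search(report_text)
    return match.group(1) if match else None


def build_impression_dict(reports: Dict[str, str]) -> Dict[str, str]:
    """Map report id -> impression text, skipping reports without the markers."""
    impressions: Dict[str, str] = {}
    for report_id, text in reports.items():
        impression = extract_impression(text)
        if impression is not None:
            impressions[report_id] = impression
    return impressions


# Made-up example report, for illustration only.
sample = {
    "r1": "HISTORY: cough\nIMPRESSION:\nNo evidence of malignancy.\nEND OF IMPRESSION\nSigned",
    "r2": "Report with no impression markers.",
}
print(build_impression_dict(sample))
```

Reports that lack either marker are skipped rather than guessed at, so they can be listed and inspected by hand.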

Next Steps:

Our work showed that we can preprocess the data by separating impressions from reports and matching them with labels (see Figure 1). We were also able to match report IDs with labels for evidence of cancer. Although we have not started on our model yet, we believe we can make significant progress in the next week. Over the next 8-10 days, we hope to create a model that takes the data we have preprocessed and predicts the evidence of cancer from the impression section of the radiological reports we have been provided. We plan to dedicate most of our time to building out this part of our final project. As of now, we feel that we are on track to complete the project by the deadline and think we can provide our lab with a working NLP model that can identify evidence of cancer in radiologic oncological reports.

GitHub Repo

GitHub Link
