j4noronh_Intact_CxC2023

Inspiration

As Canada's largest P&C insurer, Intact leverages the power of data to address various problems. One such problem is the classification of medical documents into medical specialties. Intact receives a large number of medical claims on a daily basis that must be managed and sorted in an effective manner to ensure that consumers are receiving their rightful claims. Machine Learning can be utilized to develop a model that classifies the text from these documents into a medical specialty.

What it does

This notebook outlines the process of developing a medical document classifier.

How we built it

This model was developed using HuggingFace's 'XLNet-base-cased' NLP model, trained on the data provided by Intact. The models were run in a Jupyter Notebook on Google Collab, using a standard GPU hardware accelerator (this model can be run without the accelerator, however, it will be very slow).

Challenges we ran into

The data was extremely imbalanced. Due to the nature of different medical documents, the extremely imbalanced training dataset omitted many terms that could exist in these documents. While preprocessing the transcripts and weighing the classes helped emphasize the underrepresented classes, the model will greatly benefit from more samples in the underrepresented classes.
Optuna was inconsistent. The hyperparameters used in the optimizer yielded a different f1_macro score than the final model when being judged against the validation data. Simpletransformers is better optimized for Wandb (another hyperparameter tuning library), however, the library ran into many issues with Google Collab, mainly when using a GPU/TPU accelerator. When switching to a CPU backend without using CUDA cores, it ran at an extremely slow pace which led to the optimizer being timed out.
Google Collab was limited in its capabilities. The model was trained on Google Collab with additional computing cores. Due to the limitations in RAM and storage, the amount of training that could be performed was severely limited. The hyperparameters optimized using Optuna had to be significantly restricted to ensure that they could run on the limited hardware available. Additionally, Google Collab timed out often, meaning that most of the training steps in this notebook had to be repeated. We did try using Azure Notebooks, however, there was a compatibility issue that prevented the easytransformer library from being imported.

Accomplishments that we're proud of

Cleaning the data so that it can easily be used to train the model.
Attempting to fine-tune the parameters using optuna
Utilizing XLNet over bag-of-words to improve the model (testing different models)

What we learned

How to prepare and clean data for NLP tasks
How to use HuggingFace's library of NLP Models
Communicating my thoughts in an effective manner to a technical audience

What's next for j4noronh_Intact_CxC2023

In the future I would want to learn how to better optimize my training process to make it more consistent and efficient.

Built With

easytransformers
google-collab
natural-language-processing
nltk
python
xlnet

Updates

Jaden Noronha started this project — Feb 24, 2023 10:43 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.