Jpeg of our final poster!

Project Proposal

Biographical Info

Team Name: 🐸Team Toad 🐸

Team Member 1 CS Login: kodesai

Team Member 2 CS Login: kwisialo

Team Member 3 CS Login: rlennon

Team Member 4 CS Login: dechoi

Project Idea (Revised)

Cancer of unknown primary origin (CUP) is a term for a group of distinct cancers that present in patients as already metastasized (spread throughout the body) and with an unclear tissue of origin. Around 2 to 5% of cancer cases are estimated to qualify as CUPs, which means they are a subject of significant clinical interest. As many modern cancer therapeutics perform best when tailored to the cell type of the original cancer, being able to trace back to a cancer’s origin has implications for disease treatment. However, many current approaches involve difficult genetic sequencing or histopathological testing-based techniques that may be expensive or not always available.

In our CSCI 1470 project, we aim to explore a possible remedy to this problem. In particular, we plan to reimplement the Tumor Origin Assessment via Deep Learning (TOAD) model from Lu, et al. designed to predict a cancer’s primary origin and metastatic status from simple images of biopsied tissue. The general approach is to train a deep learning model on publicly available haematoxylin and eosin (H&E) whole slide images (WSI) of tumor samples where the training labels are the known origins of the tumor. Several challenges arise — in particular, the collected WSI are very large (thousands of pixels per image), so training CNN-based models directly on them would be incredibly inefficient. Therefore, the paper proposes a method in which images are decomposed into “patches” and encoded into feature vectors by a ResNet50-based model. These vectors are then collected into “bags” representing entire original WSIs, and learning proceeds via “Multiple Instance Learning” — i.e. learning where examples are labelled in collections as opposed to individually.

We have multiple ideas of how to improve upon or creatively modify the paper’s original architecture. For one, to improve the efficiency of the CNN architecture in the original paper, we propose implementing our feature-extraction component using EfficientNet, which vastly reduces the number of parameters while maintaining/improving accuracy. This method works by uniformly expanding the dimensions of the model (versus ResNet, which expands 1 dimension — height, width, depth — at a time). This improves scalability, which is especially important considering that we will likely be running our programs on OSCAR’s GPUs. We are also considering playing around with Vision Transformers (ViTs) as an alternative method of extracting feature information, although this depends on whether this is feasible with our amount of compute. Another avenue for exploration that we are considering is replacing the pretrained ResNet50 with a pretrained model using self-supervised learning — this way our feature extractor could be tailored to our application of medical imaging data. Finally, we are interested in possibly applying the model to a different type of imaging data (e.g. MRIs or fluorescence microscopy images) to see if it is flexible in its application or not. More details regarding the specifics of the original architecture are in the paper.

In total, this project will allow us to apply a novel CNN architecture to the domain of tumor detection, while also allowing us to explore various architectural improvements based on more modern methods.

Sources: Lu, M.Y., Chen, T.Y., Williamson, D.F.K. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021). https://doi.org/10.1038/s41586-021-03512-4

Varadhachary GR. Carcinoma of unknown primary origin. Gastrointest Cancer Res. 2007 Nov;1(6):229-35. PMID: 19262901; PMCID: PMC2631214. https://pmc.ncbi.nlm.nih.gov/articles/PMC2631214/.

https://en.wikipedia.org/wiki/Multiple_instance_learning#Algorithms

https://en.wikipedia.org/wiki/Self-supervised_learning

https://arxiv.org/abs/1905.11946

Key Limitations

A main limitation that we can immediately anticipate is dataset collection and cleaning. The original paper points us to two data sources, but the exact composition of the training, validation, and testing data they used is not accessible. Additionally, the authors used a few sources of data that are not available for public use — namely patient data they collected themselves — so we must anticipate that our model will perform more poorly in their absence.

Another possible issue is the storage size of the data. Since the dataset is composed of many high quality images, the version the authors used for training was absolutely massive — 28 terabytes in total. While we will probably only use some subset of this, we still have to think about how to store and deal with very large quantities of data, and we must also consider tradeoffs between the hassle of collecting and storing more data and the improved results that follow. Some guidance here would be very helpful!

Another limitation will be training time and GPUs. The authors ran the model on multiple GPUs — we are limited to what we can request on OSCAR. This again suggests that the size of our dataset will have to be pared down. Guidance here would also be helpful — in particular the feasibility of reducing down the original model to something more manageable given the compute resources we have.