Inspiration

My mother went to retrieve my father's slides, which had been sent to a pathology lab and lost. While there, she found out that the lab was classifying samples by hand, and the staff told her the process was both error-prone and time-consuming. She relayed this to me, and I immediately believed I could do better with the help of machine learning, so I decided to use TensorFlow to improve the speed and accuracy of the classification process.

What it does

This project takes a pre-trained image-recognition CNN (convolutional neural network) and retrains it on a data set of cancer slides covering different cancer sub-types, so that it can classify samples automatically. Hopefully this can be used to achieve better outcomes for patients, either alongside a clinical pathologist’s input or when no pathologist is available.

How I built it

I used command-line web-scraping tools (wget, cut, grep, etc.) to retrieve images of slides from webpathology.com. I then filtered the data set for quality (removing images that did not conform to my quality standards) and resized the images to match the input size of the neural network (224x224). Next, I retrained the final layer of the MobileNet model to classify the images in the data set, since MobileNet already provided the image-recognition features I needed and simply had to be retrained on a new data set. This is known as “transfer learning”. Finally, I scripted the retraining and testing of the model to automate the process for new data sets.
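As a rough illustration of the transfer-learning step, here is a minimal sketch using the Keras API that ships with TensorFlow. It is not the retraining script used for this project. The function name `build_model`, the number of classes, and the directory name in the usage comment are assumptions made for the example. The pre-trained MobileNet base is frozen, and only a new final classification layer is trained.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)  # input size mentioned above


def build_model(num_classes, weights="imagenet"):
    """Frozen MobileNet feature extractor plus a new trainable final layer.

    num_classes and this function name are illustrative assumptions.
    """
    base = tf.keras.applications.MobileNet(
        input_shape=IMG_SIZE + (3,),
        include_top=False,   # drop the original ImageNet classifier
        weights=weights,
        pooling="avg",       # one feature vector per image
    )
    base.trainable = False   # keep the pre-trained features fixed

    inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
    # MobileNet expects pixel values scaled to the range [-1, 1].
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(inputs)
    x = base(x, training=False)
    # The only layer that is trained: one output per cancer sub-type.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Usage sketch (hypothetical directory with one sub-folder per sub-type):
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "slides/train", image_size=IMG_SIZE, batch_size=32)
# model = build_model(num_classes=len(train_ds.class_names))
# model.fit(train_ds, epochs=5)
```

Freezing the base keeps the number of trainable parameters small, which matters when only a limited number of slide images is available.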

Challenges I ran into

My data for the neural network was limited to what I could find online. Clinical slide images are rare outside of databases that do not sort them appropriately and/or require registration (and potentially an M.D.), so they are difficult to acquire. Data acquisition therefore had to be done by scraping an internet pathology website, which is a good source of information, but the images were not easy to download over a bogged-down wireless connection. I also could not acquire sample sizes large enough to provide statistical confirmation of the results, although all trends and tests indicate that the models are effective, with the lowest accuracy for any model coming in at 65%. Unfortunately, the images I retrieved came with website image tags, which would make the model less valid for outside data, although it could easily be retrained on a new set of data.
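To show why sample size matters here, the sketch below computes a 95% Wilson score interval for an observed accuracy. The counts are hypothetical, chosen only to match a 65% accuracy; they are not the actual test-set sizes from this project, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (low, high) Wilson score interval for an observed accuracy.

    z=1.96 corresponds to a 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same 65% accuracy on a small and a larger test set.
for correct, total in [(13, 20), (130, 200)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total} correct: 95% interval {low:.2f} to {high:.2f}")
```

With only 20 test images, a 65% accuracy is consistent with a true accuracy anywhere from roughly 0.43 to 0.82, and the interval narrows considerably with 200 images. This is why a larger data set is needed before the results can be confirmed statistically.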

Accomplishments that I'm proud of

I am very pleased with the results of the retrained neural networks, which performed well in every test I ran to validate them: all of them were between 65% and 80% accurate when classifying new images of slides. This leads me to believe that they could soon be used in a clinical setting to improve the speed and accuracy of pathologists analyzing stained slides, once they are retrained on a larger, better version of my current image set.

What I learned

I learned how to use TensorFlow for image classification, how neural networks work, and how to retrain them. On top of that, I got to practice a significant amount of bash scripting.

What's next for Machine Learning for Cancer Classification

I need to acquire more pathology data to avoid overfitting, and to expand the model to more types of cancer. It also needs more testing to ensure accuracy before it can be used in a clinical setting.
