Using Convolutional Neural Networks to Diagnose Melanoma-presenting Skin Growths
Sean Yu, syu66; Kristen Cai, kcai6; Jeremy Lum, jlum3; Thomas Bui, tbui12
Final Deliverables:
Poster: https://docs.google.com/presentation/d/1_b3c4KTeh5qA8VlJdMic5vZGskyQoadF24CqUOBAyuE/edit?usp=sharing
Code Repository: https://github.com/kristencai/dlfinal
Final writeup/reflection: https://docs.google.com/document/d/1LAI1L8CpU1I6LGAGLIOXNtXT4ZzrmgdztQKRR0K_-tc/edit?usp=sharing
Check-In 3: https://docs.google.com/document/d/1WRtGOqrhfSy9vaW-OmoT4EznydMIRDl32iQoVg4rRJ4/edit?usp=sharing
Check-In 2:
Introduction: What problem are you trying to solve and why? If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper. What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.
Paper: https://link.springer.com/content/pdf/10.1186/s12880-020-00534-8.pdf
Melanoma is a form of malignant skin cancer with a 93.7% 5-year relative survival rate. The NIH reports that 82% of 5-year survivors were diagnosed while the cancer was still confined to the primary site, indicating that early detection of melanoma is imperative to maximizing survival chances. Melanoma diagnosis is currently confirmed by histological examination of the skin growth, but dermatologists often rely on physical examination to judge the malignancy of such growths. An application that can reliably differentiate benign and malignant growths would enable clinicians to diagnose patients more accurately and avoid misdiagnoses, which often allow melanoma to progress.
Given melanoma’s notoriety as the deadliest form of skin cancer, and with 1.3% of all cancer deaths in 2022 attributed to it, we see this as an interesting and realistic application of deep learning, specifically binary classification using CNNs.
Related Work: Are you aware of any, or is there any prior work that you drew on to do your project? Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”–if you stumble across a new implementation later down the line, add it to this list.
In this paper, the researchers constructed a conventional CNN that classifies skin growths as either malignant or benign. The authors used various preprocessing techniques to abstract different features of the data, such as the outline and color of each growth; these abstractions were achieved with masks, illumination correction techniques, and Gaussian filters. The architecture differs slightly from that of the paper we have chosen to re-implement.
Data: What data are you using (if any)? If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it). How big is it? Will you need to do significant preprocessing?
We will be using the International Skin Imaging Collaboration (ISIC) dataset, which contains over 59K images of benign lesions, 7K images of malignant lesions, and various other unclassified lesions. Preprocessing should mostly consist of resizing the images to a uniform size; no other significant preprocessing appears necessary. ISIC also provides an API for the dataset, which we will use to fetch all necessary data.
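As a rough sketch of how we might fetch image listings, the snippet below builds a query URL against the public ISIC Archive API and parses one page of results. The endpoint path, the `limit` parameter, and the `diagnosis` filter syntax are our assumptions about the API, not details from the paper, and should be checked against the current ISIC API documentation before use.

```python
import json
import urllib.parse
import urllib.request

# Assumed base endpoint for the public ISIC Archive API (v2); verify against
# the current API documentation before relying on it.
ISIC_API = "https://api.isic-archive.com/api/v2/images/"

def build_query(limit=100, diagnosis=None):
    """Build an image-listing URL; the diagnosis filter syntax is assumed."""
    params = {"limit": limit}
    if diagnosis:
        params["query"] = f"diagnosis:{diagnosis}"
    return ISIC_API + "?" + urllib.parse.urlencode(params)

def fetch_image_metadata(url):
    """Fetch one page of image metadata and return the result list."""
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return payload.get("results", [])
```

In practice we would page through the listing, download each image, and resize it to a uniform shape during preprocessing.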
Methodology: What is the architecture of your model? How are you training the model? If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.
The first stage of our model uses Mask R-CNN to locate potential skin lesions. It assigns each candidate region a probability of being a skin lesion and produces a cropped image to prepare for stage 2. This stage is evaluated against masks verified by a clinical expert. We expect reliably cropping the image in this first stage to be the most difficult part, since we have little experience applying CNNs in this context.
The second stage of the model uses a ResNet152 classifier to label each lesion as benign (not melanoma) or malignant (melanoma). The first block of ResNet152 consists of convolutional layers for feature extraction; the second block is a fully connected layer used for classification.
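The two-stage control flow can be sketched framework-agnostically as below. Here `detect` and `classify` stand in for the stage-1 and stage-2 networks (for instance torchvision's `maskrcnn_resnet50_fpn` and `resnet152`, if we implement in PyTorch); the function names and the 0.5 score threshold are our own illustrative choices, not values from the paper.

```python
def select_lesion_box(detections, score_threshold=0.5):
    """Stage 1: keep the highest-scoring detection above the threshold.

    `detections` is a list of (box, score) pairs with box = (x0, y0, x1, y1).
    Returns None when no candidate clears the threshold.
    """
    candidates = [(box, score) for box, score in detections
                  if score >= score_threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])[0]

def crop(image, box):
    """Crop a row-major 2D image (list of rows) to the given box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def diagnose(image, detect, classify):
    """Stage 1 (detect + crop) followed by stage 2 (benign/malignant)."""
    box = select_lesion_box(detect(image))
    if box is None:
        return "no lesion found"
    return classify(crop(image, box))  # expected to return "benign" or "malignant"
```

The real pipeline would feed tensors rather than nested lists, but the branching logic stays the same.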
Other design choices from the paper can be incorporated into our architecture if our model underperforms. For example, the paper used transfer learning because the dataset was not quite large enough: a pre-trained model was fine-tuned on the smaller database. The authors also augmented their malignant-lesion data and adjusted the ratio between benign and malignant examples.
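One way to adjust the benign-to-malignant ratio is to oversample the minority class, as sketched below. This is our own minimal illustration of the idea (the function name, `target_ratio` parameter, and sampling-with-replacement strategy are not from the paper, which used augmentation rather than plain duplication).

```python
import random

def rebalance(benign, malignant, target_ratio=1.0, seed=0):
    """Oversample `malignant` (with replacement) until
    len(malignant) / len(benign) >= target_ratio.

    Returns the benign list unchanged and the expanded malignant list.
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits
    target = int(len(benign) * target_ratio)
    extra = [rng.choice(malignant)
             for _ in range(max(0, target - len(malignant)))]
    return benign, malignant + extra
```

Replacing the duplicated samples with augmented variants (flips, rotations, color jitter) would get closer to what the paper actually did.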
Metrics: What constitutes “success?” What experiments do you plan to run? For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate? If you are implementing an existing project, detail what the authors of that paper were hoping to find and how they quantified the results of their model. What are your base, target, and stretch goals?
We plan to test our model using an 80:20 train/test split. Since the task is binary classification, the notion of accuracy applies directly, and we will use a standard accuracy function to measure our success. In the original paper, the authors built several models on different combinations of preprocessing and augmentation techniques. They measured each model’s performance with a 2x2 confusion matrix holding the proportions of true malignant, false malignant, false benign, and true benign predictions. The paper’s most significant result was 16.1% true malignant and 74.4% true benign, for a total accuracy of about 90%.
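The paper's evaluation can be sketched as computing the four cells of that 2x2 matrix and summing the true-malignant and true-benign proportions (e.g. 16.1% + 74.4% ≈ 90%). Below is a minimal version, with malignant encoded as 1 and benign as 0; the function names are our own.

```python
def confusion_counts(y_true, y_pred):
    """Return (true malignant, false malignant, false benign, true benign)."""
    tm = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fm = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fb = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tb = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tm, fm, fb, tb

def accuracy(y_true, y_pred):
    """Accuracy = (true malignant + true benign) / all predictions."""
    tm, fm, fb, tb = confusion_counts(y_true, y_pred)
    return (tm + tb) / max(1, tm + fm + fb + tb)
```

Reporting all four cells rather than accuracy alone matters here, since the class imbalance (59K benign vs. 7K malignant) means a trivial all-benign classifier would already score well on accuracy.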
Our base goal is for our model to perform the classification task using the architecture outlined above with an accuracy of about 60%. Our target goal is to raise accuracy to about 75% by drawing on other methods from the paper: adding batch normalization, max pooling, and skip connections, and adjusting hyperparameters. Our stretch goal is to increase the complexity of our model architecture, perhaps by implementing ensemble learning, which combines several differently initialized models to increase accuracy. Adding noise to the images could also extend the applicability of our model to a wider range of input datasets. For a more technical challenge, we could try implementing the model with PyTorch.
Source: https://blog.paperspace.com/image-classification-with-attention/, https://arxiv.org/abs/2201.11440
Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Our dataset is the International Skin Imaging Collaboration dataset, which contains over 59K images of benign lesions, 7K images of malignant lesions, and various other unclassified lesions. While this dataset seems credible due to its widespread use in publications and its association with prestigious computer vision conferences, it may reflect historical or societal biases. Dermatology resources often lack images and data on darker skin; in fact, one study used deep learning to generate realistic images of malignant and benign lesions on darker skin to help fill this gap. In addition, in a survey of 14 skin image datasets, nearly 80% contained data exclusively from Europe, North America, and Oceania. Diverse data is clearly lacking in this field, which is especially important given the health impacts of our task and the fact that skin conditions manifest differently on people of different races and skin colors.
Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
The major stakeholders in using a deep learning model to classify malignant melanoma are doctors and patients. Patients are the primary stakeholders: their bodies, symptoms, and conditions are being classified by the model, with high stakes. Medical practitioners, doctors, and dermatologists are also major stakeholders; if they use the model, they need to be able to trust it to make accurate diagnoses that inform important decisions about the patient’s health. If the model produces a false negative, the patient’s condition may continue to worsen and the cancer may spread, affecting other parts of the body in devastating ways; the patient’s life and wellbeing are at serious risk if a false negative is trusted. For the doctor, this means added responsibility and potential legal complications. If the model produces a false positive, the patient may face additional scans and tests, along with unnecessary stress and anxiety. In essence, the stakes of this challenge are high.
Division of labor: Briefly outline who will be responsible for which part(s) of the project.
Sean: Hyperparameter testing, model complexity, and additional implementation.
Kristen: Work with the dataset API and pre-processing.
Jeremy: Data augmentation and noise reduction.
Thomas: Layer construction / ResNet152.
Built With
- google-cloud
- python
- pytorch
- tensorflow