Generate New Skin Tone Data for Melanoma

Team: Alexander Le, Deniz Toruner, Kento Abeywardane, Michelle Mai


Introduction: What problem are you trying to solve and why?

Identifying and classifying skin lesions as cancerous at an early stage is critical to treating the cancer successfully and saving a patient’s life. Unfortunately, reference images of cancerous skin lesions on people with non-white/darker skin are underrepresented in the medical community. This is reflected in medical textbooks and, most prominently, in the large datasets now used to train neural networks for more accurate classification of skin lesions. Due to this lack of diversity, these models perform poorly when applied to patients and data with darker skin tones, and the underrepresentation is reflected in higher rates of misdiagnosis and death for darker-skinned people. To address this problem, we aim to generate new images of skin lesions on darker skin, which can be used to train a model better equipped to identify and classify skin lesions across diverse subjects. The papers we will be modeling use the unsupervised learning methods of Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs).

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

We were able to find a few papers that use deep learning to improve the diversity of skin color within dermatology. The three main papers we found are linked here.

Two of these papers use GANs, and the third uses a CNN/VGG architecture, to better classify certain skin diseases, including melanoma. The paper most aligned with what we want to pursue is Leveraging Artificial Intelligence to Improve the Diversity of Dermatological Skin Color Pathology: Protocol for an Algorithm Development and Validation Study, by Eman Rezk, Mohamed Eltorki, and Wael El-Dakhakhni. In this paper, Rezk et al. construct their model in three main phases. The purpose of the first phase is to identify underrepresented skin tones in the dataset. To achieve this, they fed their augmented dermatology data into a CNN model to detect and remove any nonsegmented disease pixels caused by variation in skin color, improving the quality of the images. They then performed k-means clustering on the preprocessed data to group pixels by color value and find the dominant skin color. For tone classification, they used a mix of k-nearest neighbors, random forest, and naïve Bayes methods. For every image, they supplemented the RGB features with color features from other color spaces, such as hue, saturation, and value, to give the model sufficient color information. They were thus able to classify the dominant skin color as very light, light, intermediate, tan, brown, or black. Phase 2 involves generating images for underrepresented tones from the input data by applying a computer vision technique called style transfer. Style transfer generates skin images with dark skin tones by extracting the features of a content image with melanoma and a style image with the target skin color, so the resulting image is a weighted blend of both feature sets. The melanoma dataset and the skin tone dataset are each passed into their own respective VGG model, and the extracted features are then combined, starting from a noise image, through a final VGG pass that produces the combined images.
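The k-means skin-tone step above can be sketched as follows. This is our own minimal illustration, not the paper's code; `scikit-learn` and the toy image values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_skin_color(image, k=2):
    """Cluster the pixels of an H x W x 3 RGB image and return the
    centroid of the largest cluster as the dominant skin color."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    # The largest cluster corresponds to the most common color region.
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_[counts.argmax()]

# Toy example: a mostly tan "skin" image with a small dark lesion patch.
img = np.full((64, 64, 3), (224, 172, 105), dtype=np.uint8)
img[20:30, 20:30] = (60, 40, 30)  # lesion pixels
print(dominant_skin_color(img))   # ≈ [224. 172. 105.]
```

In the paper this dominant color is then mapped to a tone class (very light through black) by the k-NN/random-forest/naïve-Bayes ensemble; here we only show the clustering step.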
The third and final phase is generated-image evaluation, which uses both quantitative and qualitative tests. For the quantitative test, they used a support vector machine regressor and the structural similarity index measure (SSIM) to provide a similarity score. For the qualitative test, they used a human visual Turing test, in which participants are asked to classify images as real or generated.
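The SSIM part of the quantitative evaluation is available off the shelf; a minimal sketch using `scikit-image` (the synthetic images here are stand-ins, not real data):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(128, 128), dtype=np.uint8)

# A "generated" image: the real one plus mild pixel noise.
noise = rng.normal(0, 10, size=real.shape)
generated = np.clip(real.astype(float) + noise, 0, 255).astype(np.uint8)

score = ssim(real, generated, data_range=255)      # high, but below 1.0
identical = ssim(real, real, data_range=255)        # exactly 1.0
print(f"noisy vs real: {score:.3f}, real vs real: {identical:.3f}")
```

SSIM compares local luminance, contrast, and structure, so it rewards generated images that preserve the lesion's structure even when the overall tone has shifted.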

Other helpful papers and references on machine learning models for skin cancer detection include:

  • https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8705277/
  • https://www.sciencedirect.com/science/article/pii/S0010482520303966
  • https://www.sciencedirect.com/science/article/pii/S1877050919321295
  • https://www.kaggle.com/code/arminfuchs/skin-cancer-with-cnn (skin cancer detection using CNNs; not quite our topic, but potentially a helpful resource)

Living list

  • We found no citations of the paper, as it was published on Nov 11, 2021 (i.e., about six months ago); this field of research is relatively new, and to our knowledge no implementations other than the paper's currently exist.

Data: What data are you using (if any)?

We will be using two datasets: one of skin cancer images taken from patients, and one of different skin colors. The first is the Kaggle dataset titled Skin Cancer ISIC, which contains 2,357 images of nine types of malignant and benign oncological diseases, compiled from The International Skin Imaging Collaboration (ISIC). The nine disease categories are actinic keratosis, basal cell carcinoma, dermatofibroma, melanoma, nevus, pigmented benign keratosis, seborrheic keratosis, squamous cell carcinoma, and vascular lesion. The dataset is already split into train and test sets. (https://www.kaggle.com/datasets/nodoubttome/skin-cancer9-classesisic)

The second dataset will consist of skin tones. Because we were unable to find an existing dark-skin-tone dataset, we will pull roughly 100–200 images from Google to build one ourselves. Although this process is time-intensive, it is the best way we have found to obtain this data. Another option is a dermatological database that may contain images of darker skin tones (http://www.atlasdermatologico.com.br/index.jsf); however, that data may be limited, and we may need to find another source.

Metrics: What constitutes “success?”

Since this project’s objective is to create new images, the usual notion of “accuracy” does not strictly apply. We propose a few metrics to constitute “success.” The first (base) metric is to visually inspect the generated images and qualitatively judge whether they were successfully diversified relative to the existing dataset. Another (target) metric is to compare the average pixel value (as a measurement of color) surrounding the lesions in the generated images against the training dataset, which consists of only white samples; we could also measure the variation of these “darkened” colors to ensure the generated images are diverse. Finally, a stretch goal is to compare classification-model performance on a diverse test dataset: first train a classification model on only the initial (non-diverse) dataset and test it on the diverse dataset, then retrain the same model on the initial dataset combined with diversified images generated by our GAN and test it against the same diverse dataset. Success would be a higher classification accuracy. Rezk et al. have a more elaborate scheme of metrics for confirming their success, including both a classification-model evaluation, as above, and a Turing test; their main metric was a comparison of the sensitivity and specificity of the trained model against the performance of professional dermatologists.
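The target metric above can be made concrete. A minimal sketch, assuming images as RGB arrays and a boolean lesion mask per image (the helper names and the toy tan/dark values are our own, not from any paper):

```python
import numpy as np

def surrounding_skin_mean(image, lesion_mask):
    """Mean RGB value of the non-lesion pixels (a proxy for skin tone)."""
    skin = image[~lesion_mask]          # all pixels outside the lesion, shape (N, 3)
    return skin.mean(axis=0)

def tone_diversity(images, masks):
    """Mean and std of per-image skin tones across a set.
    A higher std indicates a more diverse range of generated tones."""
    means = np.array([surrounding_skin_mean(im, m) for im, m in zip(images, masks)])
    return means.mean(axis=0), means.std(axis=0)

# Toy check: one light and one dark synthetic "skin" image,
# each with a lesion patch that the mask excludes from the metric.
light = np.full((32, 32, 3), 230, dtype=np.uint8)
dark = np.full((32, 32, 3), 90, dtype=np.uint8)
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True
for im in (light, dark):
    im[mask] = 0  # lesion pixels, ignored by the metric
mean, std = tone_diversity([light, dark], [mask, mask])
print(mean, std)  # mean ≈ [160 160 160], std ≈ [70 70 70]
```

Comparing `mean` between the generated set and the original training set would show whether tones shifted darker, while `std` tracks whether the generated tones are actually varied rather than collapsed onto one color.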

Ethics: Choose 2 of the following bullet points to discuss

We chose this project in part because of the lack of diversity in model training data, especially in direct applications to human disease. Models trained to detect skin lesions such as melanoma are trained mostly on lighter skin, so they achieve better accuracy on lighter skin than on darker skin. To combat this issue, which is caused by a lack of data in the first place, we are explicitly generating additional diverse data. However, because we are constructing data, a potential bias in its creation is that we are generating new images from a baseline of mostly light-skinned images: the features learned from light skin may not accurately represent what dark-skin images should look like. We also plan to measure success by examining pixel values, i.e., whether the output images increase the number of images with dark skin representation. This measurement has its own limitations: depending on how the lesion presents, or on general image characteristics such as brightness, which differ across images, there could be some error in the measurement. We therefore need to keep these limitations in mind before concluding that the approach is a valid proof of concept.

Division of labor:

We plan to work collaboratively and help one another with our assigned tasks. We divided the work into three main areas and assigned people based on their skill sets and interests.

  • Preprocessing (Michelle): Collecting and verifying the data we will use in our model, then cleaning and normalizing it so that it is standardized when passed through the network.
  • Model architecture × 2 (Alex and Deniz): Developing the architecture with which we will achieve high accuracy and create new data. Two people work on this task because it requires a lot of fine-tuning and determining what overall architecture is optimal for our goal.
  • Style transfer metrics (Kento): Taking the output of the model and assessing how well it performed in generating new data. Based on these metrics, we can analyze the model as well as help train it to become more effective at creating new data.



Updates


Introduction Detection and classification of skin lesions such as melanoma and basal cell carcinoma is a growing area within dermatology. Deep learning models such as convolutional neural networks can be highly accurate at identifying such cancers. However, because skin cancer is most prevalent in people of Caucasian backgrounds, the majority of these models train on data skewed toward lighter-skinned images. This inevitably leads to underlying bias in a model’s ability to detect and classify skin cancers in people with darker skin. Our aim is thus to use deep learning to correct this model bias. Using a style transfer architecture that combines two different convolutional neural networks, we take images of darker skin together with images of skin lesions and produce images of lesions on darker skin that can be used for model training.
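The style transfer described here balances a content loss (preserving lesion features) against a style loss (matching skin-tone statistics via Gram matrices of CNN feature maps). A framework-agnostic NumPy sketch of those two losses; the feature tensors and the α/β weights are stand-ins, not our actual model:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a C x H x W feature map: channel co-activations,
    which capture 'style' (texture/color statistics) but not spatial layout."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def content_loss(gen_feats, content_feats):
    return np.mean((gen_feats - content_feats) ** 2)

def style_loss(gen_feats, style_feats):
    return np.mean((gram_matrix(gen_feats) - gram_matrix(style_feats)) ** 2)

def total_loss(gen, content, style, alpha=1.0, beta=1e3):
    # beta >> alpha compensates for the style loss's much smaller raw magnitude
    return alpha * content_loss(gen, content) + beta * style_loss(gen, style)

rng = np.random.default_rng(0)
lesion = rng.normal(size=(8, 16, 16))  # stand-in CNN features of a lesion image
tone = rng.normal(size=(8, 16, 16))    # stand-in features of a skin-tone image
print(total_loss(lesion, lesion, tone))  # content term is zero here
```

The β-to-α ratio is exactly the knob discussed in the Insights update below: because the raw style loss is much smaller than the raw content loss, the blend of lesion structure and target tone is sensitive to how these two terms are weighted.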

Challenges The main challenge we face is choosing the right hyperparameters during training. Currently our model produces a maximum test accuracy of around 40%. We have had issues with overfitting, finding the correct learning rate, and figuring out how to save the weights from the loss function at each layer. We are not yet sure how to improve on this accuracy, but we are continuing to make improvements. The performance of this classification model has a large impact on the performance of the style transfer model.

Insights We have been able to produce an initial image from the style transfer, but the results are not as good as we hoped. For example, some important features (i.e., skin lesions) were developing in the image, but as the image adjusted to the style, unrealistic colors began appearing. This may be because the raw loss for the stylized images is much smaller than the raw loss for the feature images, which makes balancing the weights between these losses difficult. It is also possible that our CNN model is not identifying features as well as we hoped, causing the model to stylize the image poorly.

Plan We are on track with our plan. We will continue modifying our hyperparameters and minor aspects of the model architecture in hopes of improving the training and testing accuracy of the CNN model. We are also working on the style transfer element, which requires tuning our loss functions to ensure a good balance of style and important features.
