Generate New Skin Tone Data for Melanoma
Team: Alexander Le, Deniz Toruner, Kento Abeywardane, Michelle Mai
Links:
Introduction: What problem are you trying to solve and why?
Identifying and classifying skin lesions as cancerous at an early stage is critical to treating them successfully and saving lives. Unfortunately, reference images of cancerous skin lesions on non-white/darker skin are underrepresented in the medical community. This is reflected in medical textbooks and, most prominently, in the large datasets now used to train neural networks to classify skin lesions. Because of this lack of diversity, these models perform poorly when applied to patients with darker skin tones, and the underrepresentation is reflected in higher rates of misdiagnosis and death for darker-skinned people. To address this problem, we aim to generate new images of skin lesions on darker skin that can be used to train a model better equipped to identify and classify skin lesions across diverse subjects. The papers we will be modeling use the unsupervised learning methods of Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs).
Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?
We found a few papers that use deep learning to improve the diversity of skin color representation in dermatology. The three main papers are:
- (GAN) Generating digital images of skin diseases based on deep learning
- (GAN) Improving Skin Cancer Classification Using Heavy-Tailed Student T-Distribution in Generative Adversarial Networks (TED-GAN)
- (VGG) Leveraging Artificial Intelligence to Improve the Diversity of Dermatological Skin Color Pathology: Protocol for an Algorithm Development and Validation Study
Two of these papers use GANs and the third uses a CNN/VGG architecture to better classify certain skin diseases, including melanoma. The paper most aligned with what we want to pursue is Leveraging Artificial Intelligence to Improve the Diversity of Dermatological Skin Color Pathology: Protocol for an Algorithm Development and Validation Study by Eman Rezk, Mohamed Eltorki, and Wael El-Dakhakhni. Rezk et al. build their model in three main phases. The first phase identifies underrepresented skin tones in the dataset. To do this, they feed their augmented dermatology data into a CNN model to detect and remove any nonsegmented disease pixels caused by variation in skin color, improving the quality of the images. They then apply k-means clustering to the preprocessed data, grouping pixels by color value to find the dominant skin color. For tone classification, they use a mix of k-nearest neighbors, random forest, and naïve Bayes. For every image, they supplement the RGB features with color features from other color spaces, such as hue, saturation, and value (HSV), to give the model sufficient color information. This lets them classify the dominant skin color as very light, light, intermediate, tan, brown, or black.
The second phase generates images for underrepresented tones from the input data using style transfer, a computer-vision technique that extracts the features of a content image (a lesion such as melanoma) and a style image (the target skin color); the resulting image is a weighted blend of both feature sets. The melanoma dataset and the skin tone dataset are each passed through their own VGG model, and the extracted features are then combined in a final VGG model to produce the blended images.
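As a rough sketch of the phase-1 dominant-color step, the snippet below runs a tiny hand-rolled k-means over the RGB pixels of an image and returns the centroid of the largest cluster. The function name `dominant_color` and the deterministic initialization are our own illustration, not the paper's implementation, which also uses KNN/random forest/naïve Bayes on top of these color features.

```python
import numpy as np

def dominant_color(pixels, k=2, iters=10):
    """Tiny k-means over (N, 3) RGB pixels; returns the centroid of the
    largest cluster, i.e. the dominant skin color. Centers are initialized
    with evenly spaced pixels so the toy example is deterministic."""
    idx = np.linspace(0, len(pixels) - 1, k).astype(int)
    centers = pixels[idx].astype(float)
    labels = np.zeros(len(pixels), dtype=int)
    for _ in range(iters):
        # assign each pixel to its nearest centroid, then recompute centroids
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = pixels[mask].mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    return centers[counts.argmax()]

# toy "image": mostly tan skin pixels plus a smaller dark lesion patch
skin = np.tile([200, 160, 130], (900, 1))
lesion = np.tile([80, 50, 40], (100, 1))
img_pixels = np.vstack([skin, lesion])
print(dominant_color(img_pixels))  # -> [200. 160. 130.], the skin cluster
```

The returned centroid could then be bucketed into the six tone classes (very light through black) by nearest reference color.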
The third and final phase is evaluation of the generated images, using both quantitative and qualitative tests. For the quantitative test, they use a support vector machine regressor and the structural similarity index measure (SSIM) to produce a similarity score. For the qualitative test, they use a human visual Turing test, in which participants are asked to classify images as real or generated.
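To make the SSIM idea concrete, here is a minimal single-window version computed from global image statistics (real implementations, e.g. in scikit-image, use local Gaussian windows; this simplified form is our own illustration):

```python
import numpy as np

def ssim_global(x, y, L=255.0):
    """Simplified SSIM using one global window: compares luminance (means),
    contrast (variances), and structure (covariance) of two images."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2  # standard stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
a = rng.integers(0, 256, (64, 64)).astype(float)
print(ssim_global(a, a))        # identical images score 1.0
print(ssim_global(a, 255 - a))  # inverted image scores far below 1
```

A generated image that preserves lesion structure while shifting tone should keep a reasonably high SSIM against its source.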
Other helpful papers or references regarding skin cancer detection machine learning models include:
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8705277/
- https://www.sciencedirect.com/science/article/pii/S0010482520303966
- https://www.sciencedirect.com/science/article/pii/S1877050919321295
- https://www.kaggle.com/code/arminfuchs/skin-cancer-with-cnn (skin cancer detection using CNNs - not quite our topic but could be a helpful resource)
Living list
- We found no citations of this paper, likely because it was published on Nov 11, 2021 (i.e., about 6 months ago). This field of research is relatively new, and to our knowledge no other implementations currently exist.
Data: What data are you using (if any)?
We will be using two datasets: one of skin cancer images taken from patients, and one of different skin colors. For the first, we will use the Kaggle dataset Skin Cancer ISIC, which contains 2357 images of 9 types of malignant and benign oncological conditions, collected by The International Skin Imaging Collaboration (ISIC). The 9 classes are actinic keratosis, basal cell carcinoma, dermatofibroma, melanoma, nevus, pigmented benign keratosis, seborrheic keratosis, squamous cell carcinoma, and vascular lesion. This dataset is already split into train and test data. (https://www.kaggle.com/datasets/nodoubttome/skin-cancer9-classesisic)
The second dataset covers skin tones. Because we were unable to find an existing dark-skin-tone dataset, we will pull roughly 100-200 images from Google to build one. Although this process is time-intensive, it is the best way we have found to obtain this data. Another option is a dermatological database that may contain images of darker skin tones (http://www.atlasdermatologico.com.br/index.jsf); however, that data may be limited, and we may need to find another source.
Metrics: What constitutes “success?”
Since this project’s objective is to create new images, the usual notion of “accuracy” does not strictly apply. We propose several metrics to constitute “success.” The base metric is visual inspection: qualitatively judging whether the generated images are more diverse than the existing dataset. The target metric is to compare the average pixel value (as a measurement of color) surrounding the lesions of the generated images against the training dataset, which consists only of light-skinned samples. We can also measure the variation of these “darkened” colors to ensure the generated images are diverse. Finally, a stretch goal is a comparison of classification performance on a diverse test dataset: first train a classification model using only the initial (non-diverse) dataset and test it on the diverse dataset, then retrain the same model on the initial dataset combined with diversified images generated by our GAN and test it against the same diverse dataset. Success would be a higher classification accuracy. Rezk et al. use a more elaborate scheme of metrics, including both the classification-model evaluation described above and a Turing test; their main metric compares the sensitivity and specificity of the trained model against the performance of professional dermatologists.
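Our target metric could be sketched as follows, assuming we have (or can estimate) a binary lesion mask per image. The helper names `surrounding_tone` and `tone_diversity` are hypothetical, purely for illustration:

```python
import numpy as np

def surrounding_tone(img, lesion_mask):
    """Mean RGB of pixels outside the lesion mask - a rough proxy for the
    skin tone surrounding the lesion."""
    return img[~lesion_mask].mean(axis=0)

def tone_diversity(images, masks):
    """Per-channel mean and std of surrounding tones across a set of images;
    a larger std suggests a more diverse set of skin tones."""
    tones = np.array([surrounding_tone(im, m) for im, m in zip(images, masks)])
    return tones.mean(axis=0), tones.std(axis=0)

# toy example: one light-skin and one darker-skin image, lesion in the center
h = w = 16
mask = np.zeros((h, w), dtype=bool)
mask[4:12, 4:12] = True
light = np.full((h, w, 3), 220.0); light[mask] = 60.0
dark = np.full((h, w, 3), 120.0); dark[mask] = 40.0
mean_tone, spread = tone_diversity([light, dark], [mask, mask])
print(mean_tone, spread)  # -> [170. 170. 170.] [50. 50. 50.]
```

Comparing `mean_tone` and `spread` before and after augmentation would show whether the generated images actually darken and diversify the dataset; as noted in the Ethics section, brightness differences across images would add noise to this measurement.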
Ethics: Choose 2 of the following bullet points to discuss
We chose this project in part because of the lack of diversity in model training data, especially in direct human disease applications. Because training data skews heavily toward lighter skin, models trained to detect skin lesions such as melanoma are more accurate on lighter skin than on darker skin. To combat this issue, which is caused by a lack of data in the first place, we will generate additional diverse data - namely, skin lesions of a more representative sample - using variational autoencoders. However, because we are constructing data, a potential source of bias is that we are generating new images from a baseline of majority light-skin images: the features learned from light skin may not accurately represent what dark-skin images should look like. We plan to measure success by examining pixel values and checking whether the output images increase the number of images with dark-skin representation. This measurement has its own limitations: depending on how a lesion presents itself, or on general image characteristics like brightness that differ across images, there could be error in the measurement. We therefore need to keep these limitations in mind before we can accurately conclude our proof of concept.
Division of labor:
We plan to work collaboratively and help one another with our assigned tasks. We divided the work into 3 main areas and assigned people based on their skill sets and interests.
Preprocess (Michelle)
This task involves the collection and verification of the data we will use in our model. It also involves cleaning and normalizing all the data so that it is standardized when passed through the network.
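A minimal sketch of what this standardization might look like, assuming we resize every image to a fixed shape and scale pixel values to [0, 1] (the function name and target size of 128x128 are our own placeholder choices):

```python
import numpy as np

def preprocess(img, size=(128, 128)):
    """Resize an (H, W, 3) uint8 image via nearest-neighbor index sampling
    and scale pixel values to [0, 1] for the network."""
    h, w = img.shape[:2]
    rows = np.linspace(0, h - 1, size[0]).astype(int)
    cols = np.linspace(0, w - 1, size[1]).astype(int)
    return img[rows][:, cols].astype(np.float32) / 255.0

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, (200, 300, 3), dtype=np.uint8)
out = preprocess(raw)
print(out.shape, out.min() >= 0.0, out.max() <= 1.0)  # (128, 128, 3) True True
```

In practice we would likely use a library resize (e.g. bilinear) rather than this nearest-neighbor indexing, but the normalization step is the same.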
Model architecture x 2 (Alex and Deniz)
This task involves developing the architecture we will use to generate high-quality new data. Two people are assigned to it because it requires extensive fine-tuning and experimentation to find the overall architecture that is optimal for our goal.
Style Transfer Metrics (Kento)
This task involves taking the output of the model and measuring how well it performed in generating new data. Based on these metrics, we can analyze the model and help guide its training to become more effective at creating new data.