Introduction

  • Topic: Using GAN to Convert Hand-Drawn Sketch to Image
  • Team member: Yifei Wang(ywang502)/ Siyang Li(sli144)/ Tianran Zhang (tzhang96)/ Neal Yin(zyin15)
  • Objective: We implemented the paper “Sketch-to-Image Generation Using Deep Contextual Completion” (Lu, Yongyi; Wu, Shangzhe; Tai, Yu-Wing; Tang, Chi-Keung, 2017). Our goal is to create and train a deep learning model that converts hand-drawn sketches to photo-like images.
  • Core problem: unsupervised learning, image generation

GitHub | Final Writeup

Related Work

The following are the papers we found that discuss related topics:

This paper explores how a discriminator with 2N output classes (real and fake scores for each target class) and different types of loss functions affect the quality of images generated by an image-to-image cGAN (conditional Generative Adversarial Network). Specifically, it experiments with a fairly standard cross-entropy loss (named the 2N loss) and another loss derived from the extra information provided by the 2N classification scheme (named the penalty loss). The conclusion is that both of these losses perform at least as well as the standard GAN loss.

This paper proposes and evaluates several triplet CNN architectures for measuring the similarity between sketches and photographs, within the context of the sketch-based image retrieval (SBIR) task. The architectures it presents could be very helpful for a future metric measuring the similarity between our outputs and the ground-truth images.

This paper presents a technique to evaluate similarity ranks based on elastic matching of sketched templates over the shapes in the images. The technique uses the degree of achieved matching and the elastic deformation energy spent by the sketch to achieve such a match to derive a measure of similarity between the sketch and the images in the database and to rank images to be displayed. This can be potentially helpful for our similarity measurement.

We didn’t find any public implementations of the specific paper we are working with, but we did find some projects that pursue a similar goal. Here is the list of their repo links:

Data

  • Training: We would use the COCO dataset for our training. We believe that this dataset, containing 80K images across 80 categories, would provide our model with comprehensive training data of sufficient size and generality. For preprocessing, we would apply segmentation for background removal and use filters to convert the photos to sketches, with the original photos serving as labels.
  • Testing: We would test our model on datasets containing sketch images, and also create our own testing data through freehand sketching.

Methodology

Data and Preprocessing

We expect our input data to be of size 64*128, each sample a concatenation of a sketch and the corresponding photo. A 64*64 mask is applied to the right-hand side so that only the sketch on the left-hand side is fed as input to the generator, whereas the complete, unmasked pair is used to train the discriminator as a "real" example. The generator's predictions serve as the "fake" examples for the discriminator.
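To make the masking concrete, here is a minimal NumPy sketch of how the binary mask and the generator input could be built (the exact layout and the helper name are our assumptions):

```python
import numpy as np

# Binary mask: 1 over the left (sketch) half, 0 over the right (photo) half.
mask = np.ones((64, 128, 3), dtype=np.float32)
mask[:, 64:, :] = 0.0

def generator_input(pair):
    """pair: a (64, 128, 3) array, sketch on the left, photo on the right.
    Zeroing the right half hides the real photo from the generator."""
    return pair * mask
```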

We began by creating a simple dataset with very distinguishable sketches from images we found online, creating the corresponding labels with graphic tools such as Photoshop. These simple sketches are categorized into 5 classes, each with its own color: fundamental shapes (circles, squares, …) are red, animals are dark brown, trees are green, flowers are orange, and ships are blue. Once we had tuned the model to produce images similar to the expected outputs, we switched to a more complex dataset converted from the COCO data.

We believe that the COCO dataset, containing 80K images across 80 categories, would provide our model with comprehensive training data of sufficient size and generality. Some sample classes are: cat, dog, umbrella, car, backpack, hot dog, etc. Our vision was that, given a sufficient amount of data, our model would eventually learn to convert sketches into real-life pictures from both the outline contour of the object and the context of the object.

Our preprocessing is divided into 3 steps. First, we downloaded the dataset using the FiftyOne tool and extracted the desired object classes, keeping only images of size at least 64*64 for the sake of more accurate training. Then we re-scaled each original image to 64*64 and used OpenCV filters to convert it into a grayscale, sketch-like image. Lastly, we concatenated the original images with the sketches, sketch on the left-hand side and original image on the right, producing ready-to-train inputs of shape 64*128*3.
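The exact filters are not spelled out above; the sketch below uses a common OpenCV “pencil sketch” recipe (invert, blur, divide) as a stand-in, with `photo_to_pair` as a hypothetical helper name:

```python
import cv2
import numpy as np

def photo_to_pair(photo_bgr):
    """Turn one COCO photo into a ready-to-train (64, 128, 3) pair:
    sketch-like image on the left, original photo on the right."""
    img = cv2.resize(photo_bgr, (64, 64))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Pencil-sketch filter: divide grayscale by an inverted Gaussian blur.
    blurred = cv2.GaussianBlur(255 - gray, (7, 7), 0)
    sketch = cv2.divide(gray, 255 - blurred, scale=256)
    sketch_3ch = cv2.cvtColor(sketch, cv2.COLOR_GRAY2BGR)
    return np.concatenate([sketch_3ch, img], axis=1)
```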

Model Architecture

The generator takes the masked pair (64*128, with the right-hand side empty because the real photo is masked out) as input, flattens it, passes it through a linear layer, and then passes the result through five Conv2DTranspose layers, each with kernel size 5, stride 2, and “same” padding. We use LeakyReLU as the activation function for all layers and tanh for the output layer. BatchNormalization is applied after each Conv2DTranspose layer except the last one to stabilize and speed up training. This upsampling of the latent space adds non-linearities to the model and produces a full-resolution 64*128 image.
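A minimal Keras sketch of this generator follows. The text above fixes only the kernel size, stride, padding, activations, and layer count; the filter counts and the 2*4 starting grid are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator():
    model = tf.keras.Sequential([
        layers.Input(shape=(64 * 128 * 3,)),   # flattened masked pair
        layers.Dense(2 * 4 * 512, use_bias=False),
        layers.LeakyReLU(),
        layers.Reshape((2, 4, 512)),           # five stride-2 upsamples -> 64x128
    ])
    for filters in (256, 128, 64, 32):
        model.add(layers.Conv2DTranspose(filters, 5, strides=2, padding="same"))
        model.add(layers.BatchNormalization())
        model.add(layers.LeakyReLU())
    # Last Conv2DTranspose: no BatchNormalization, tanh output in [-1, 1].
    model.add(layers.Conv2DTranspose(3, 5, strides=2, padding="same",
                                     activation="tanh"))
    return model
```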

The discriminator contains four Conv2D layers, also with kernel size 5 and stride 2, that progressively reduce the spatial dimensions of the feature maps. The output is then flattened and passed to a fully connected Dense layer with sigmoid activation to produce a single probability that the image is real rather than fake.
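A matching discriminator sketch, with the same hedge on filter counts:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    model = tf.keras.Sequential([layers.Input(shape=(64, 128, 3))])
    for filters in (64, 128, 256, 512):    # 64x128 -> 4x8 feature maps
        model.add(layers.Conv2D(filters, 5, strides=2, padding="same"))
        model.add(layers.LeakyReLU())
    model.add(layers.Flatten())
    # One sigmoid unit: the probability that the input pair is real.
    model.add(layers.Dense(1, activation="sigmoid"))
    return model
```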

Loss

We train our model with a weighted sum of the contextual loss and the perceptual loss; the ratio of the two weights is a tunable hyperparameter. We compute the total loss as 0.01 * perceptual loss + 0.99 * contextual loss. The heavy weighting of the contextual loss encourages the generated image to stay consistent with the input sketch, while the small perceptual term pushes the output toward looking realistic.
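In code, the objective might look like the sketch below. We take the contextual loss to be an L1 distance on the known (sketch) region, following the contextual-completion formulation of the reference paper, and the perceptual loss to be the standard GAN generator loss; the exact norms are assumptions:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()

def contextual_loss(mask, generated, real):
    # Penalize mismatch on the unmasked (sketch) half of the pair.
    return tf.reduce_mean(tf.abs(mask * (generated - real)))

def perceptual_loss(fake_probs):
    # Standard GAN generator term: push D's verdict on fakes toward "real".
    return bce(tf.ones_like(fake_probs), fake_probs)

def generator_loss(mask, generated, real, fake_probs):
    return 0.99 * contextual_loss(mask, generated, real) \
         + 0.01 * perceptual_loss(fake_probs)
```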

Training and Evaluation

We train the generator and discriminator alternately using the TensorFlow Keras framework. More specifically, we train the discriminator for 5 epochs, then freeze it while training the generator for 5 epochs. The numbers of training epochs for the generator and the discriminator are tunable hyperparameters that balance the competition between the two networks. We collect sample output images at the end of every epoch and visualize them for inspection.
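A condensed sketch of this alternating schedule (the epoch helpers and `save_sample_images` are hypothetical names standing in for ordinary Keras training loops):

```python
for _ in range(num_rounds):
    # 5 epochs of discriminator updates with the generator frozen.
    generator.trainable, discriminator.trainable = False, True
    for _ in range(5):
        train_discriminator_epoch(dataset)
        save_sample_images(generator)   # visual check after every epoch
    # 5 epochs of generator updates with the discriminator frozen.
    generator.trainable, discriminator.trainable = True, False
    for _ in range(5):
        train_generator_epoch(dataset)
        save_sample_images(generator)
```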

We primarily assess our results by visually inspecting the generator's outputs. Since it is difficult to evaluate the model quantitatively by naked-eye observation alone, we also built an evaluation model that compares the identifying features of two images. This evaluation process was inspired by our reference paper, though we did not use the same evaluation model. We found that the VGG16 model truncated at its “fc2” layer can summarize an input image as an identifying vector, and we can judge the similarity of two images by the cosine of the angle between their identifying vectors. This achieves similar evaluation functionality in a simple and convenient way. Our test function, built on this evaluation model, takes batches of real images and generator predictions and returns the average similarity between them.
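A sketch of this evaluation. We read the description as taking the activations of VGG16’s “fc2” layer (i.e., dropping only the final prediction layer); resizing our images to VGG16’s standard 224*224 input is our assumption:

```python
import tensorflow as tf

base = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
extractor = tf.keras.Model(base.input, base.get_layer("fc2").output)

def batch_similarity(real_batch, fake_batch):
    """Mean cosine similarity between the identifying vectors of two
    image batches (float tensors of shape (N, H, W, 3) in [0, 255])."""
    prep = tf.keras.applications.vgg16.preprocess_input
    feats = []
    for batch in (real_batch, fake_batch):
        x = tf.image.resize(batch, (224, 224))
        feats.append(tf.math.l2_normalize(extractor(prep(x)), axis=1))
    return tf.reduce_mean(tf.reduce_sum(feats[0] * feats[1], axis=1))
```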

Potential difficulties

There are three hard parts in the implementation: the first is the generation of the joint embedding, the second is optimization (more specifically, computing the loss), and the last is connecting the G network and the D network into one fully functional neural net.

The generation of the embedding z is hard because the paper doesn’t specify what kind of linear space the joint image is embedded into. We therefore have to experiment with different dimensions and embedding sizes for our sketch inputs to find the best vectorization of the joint input image. After the embedding, we also have to map the output from the noise distribution onto the data distribution.

The other challenge is implementing the various loss functions. The paper uses a binary mask M in the calculation of the contextual loss, but it doesn’t explain how this mask is obtained or what it does, so we have to both understand it conceptually and find a way to generate it. Another task is fine-tuning the hyperparameter lambda in the weighted sum of our objective function.

Lastly, the model trains the z embedding (the mapping from the joint image to the vector space) and the network weights separately, while also containing two networks (the G net and the D net). We have to link these together so that no information is lost in passing from one network to the other, and so that the right trainable parameters are optimized at the right steps.

Metrics: What constitutes “success?”

We keep 20% of our data for testing. Using an existing CNN model, we measure the similarity between our model’s outputs and the actual images. “Accuracy” doesn’t apply to our project, since our aim is to generate new outputs rather than to classify; instead, we use the similarity measures described above to gauge how effectively our model produces the desired results.

Our reference paper uses the Structural Similarity Metric (SSIM) to measure the similarity between the generated image and the ground truth, and also uses a pre-trained Light CNN to extract identity-preserving features that are compared under the L2 norm. The idea behind its metrics is similar to ours.

Our base goal is for the model to run properly and produce reasonable outputs. Our target goal is that, when the testing inputs fall into categories similar to the training inputs, the model produces outputs that preserve identifying features and score well on the CNN similarity test. Our stretch goal is to achieve the same for any reasonable hand-drawn sketch input.

Ethics

  • What broader societal issues are relevant to your chosen problem space?
    • One issue that could arise from our problem space is bias when generating human face images from hand-drawn sketches. Due to potential limits in sample size and variety (e.g., under-representation of people of color), the model could favor one race/ethnicity over others and produce skewed results for facial features, skin color, etc. For similar reasons, misgendering based on features such as hair style could also occur.
  • Why is Deep Learning a good approach to this problem?
    • Deep learning is a good approach to this problem because recreating photo-like images of objects from hand-drawn sketches would be extremely time-consuming, whether done with software (such as Adobe Photoshop) or by artists. A deep learning model can achieve this task quickly and in large quantities.
  • Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?
    • A major stakeholder in the application of this model would be companies that use non-celebrity models to advertise their products, where the identity and publicity of the models are incidental and replaceable. A company that specializes in loungewear, for example, could use a refined version of this model to generate “fake” models for displaying its products, skipping the hiring and photo-shoot process while avoiding potential copyright issues.
    • However, this model can also be used to jeopardize the safety and integrity of online communities. Fake personas with realistic computer-generated profile pictures can be utilized for malicious purposes such as scamming, spreading misinformation, and participating in fraudulent political activities.

Division of labor

  • Data Collection and Preprocessing: Neal Yin, Siyang Li
  • Model Architecture and Training: Tianran Zhang, Yifei Wang
  • Testing and deliverables: Everyone
  • Evaluation: Yifei Wang

Progress Reflection

Project Check-in #3

Final Writeup

Built With

  • tensorflow