Patch-Based Image Inpainting using GANs

Project Poster

Group Members

Evan Lu (elu4), Michael Youssef (myousse2), Siming Feng (sfeng22), Ze Hua Chen (zchen186)

Introduction

Image inpainting is the task of filling in missing pixels in an image. Prior models, which also use generative adversarial networks in their underlying structure, are effective at fixing relatively small deteriorations in images but struggle when filling in large blanks. In 2017, a model using dilated convolutions as well as a global and local discriminator achieved solid performance when handling large blanks (GLCIC).
The goal of this research paper was to extend the GLCIC model by incorporating residual connections and a PatchGAN, which makes predictions on real/fake local portions of an image instead of an entire image.
We developed an interest in this paper after doing our own reading on generative adversarial networks. We were amazed by their ability to synthesize novel information with accuracy that greatly outperforms the human eye. The first paper that we stumbled upon was about the Context Encoder, the first image inpainting algorithm using a GAN and an auto-encoder. We were curious to see how extensions to this model affected image inpainting performance and hence found ourselves reading about GLCIC and eventually the PathGAN variant, which was developed 2 years later.

Related Work

Links: https://towardsdatascience.com/10-papers-you-must-read-for-deep-image-inpainting-2e41c589ced0, https://wandb.ai/site/articles/introduction-to-image-inpainting-with-deep-learning
This first article provides a rough overview of the evolution of image inpainting models that use generative adversarial networks. It first discusses the Context Encoder (CE), a very simple yet effective model that uses an auto-encoder to learn the “context” of the missing pieces and GAN to fill them in. The next model discussed is MSNPS (Multi-Scale Neural Patch Synthesis). It builds on top of the CE by using an additional texture network, which polishes up the output of the CE. After training the CE, it then trains the texture network. The next two models that the article discusses are the two that we primarily focused on. GLCIC, as mentioned earlier, has a completion network that uses dilated convolutions, which is more efficient than using a CE. Instead of using a single global discriminator in its GAN, it uses a local discriminator that predicts the realness of a specific missing patch in the network. This ensures that the final filled image has “better global and local consistency”. Finally, it discusses a variant of GLCIC that uses a PatchGAN for its local discriminator and residual connections in its completion network. The goal of the PatchGAN is to refine the all-or-nothing nature of a discriminator operating over an entire input.

Data

There is a lot of flexibility in that dataset that can be used as long as the images have the same dimensions. Ideally, the images in the dataset should be as random as possible to ensure that the model doesn’t learn common themes among images. For example, the paper uses the Places, Paris Street View, and Google Street View datasets. We plan on using a subset of the CIFAR-100 dataset. The paper uses 256 x 256 images as input, but to save time when training, we’ll instead use 32 x 32 pixel images.
We plan on allocating 50,000 images for training, 3,000 for validation, and 7,000 for testing. For preprocessing, we’ll need to apply a binary mask over the inputs to zero out all of the pixels in a patch. Note that the same image can be reused as long as different patches are used. The labels for each one of these images will be the original image itself. Therefore, since we will have n images of 32•32, the data will be of size n•32•32 and the labels will also be of size n•32•32.

Methodology

The first component of the network is an autoencoder that uses dilated convolutional blocks (that is, a dilated convolution followed by a normalization step followed by ReLU and a last convolutional layer). These dilated convolutions follow a series of normal convolutions that scale down the dimensions of the input. This is followed by a series of interpolated convolutions (or transposed convolutions) that scale the maps back to a 32 x 32 input.
The discriminator has two branches: a global discriminator that takes as input an entire image and a local discriminator that takes as input a completed patch.
The reconstruction loss for the GAN uses L2 loss over the pixels. The GAN loss tries to maximize the probability of the discriminator predicting an untouched image (ie. without the mask) as valid and minimize the probability of predicting the filled-in image as fake using cross-entropy loss. This GAN loss is applied to both branches of the discriminator. The final loss (joint loss) is a weighted sum of the reconstruction loss and the two GAN losses. The optimizer that our model will use is Adam.
One challenge that we’ll likely face is designing a fast algorithm for creating binary masks that zero-out patches in the image. Another challenge is programming the joint loss function since there are three components.

Metrics

Additional experiments to test our model include, but are not limited to data augmentation and mixing images between different data sets. Data augmentation may involve different types of image transformations, which will include varying the size and number of the pixels mutated from the original image, different rotations, blurs, zooms, etc. and similar transformations as seen in CNNs. Mixing images include analyzing the differences between results when using different training data sets; mentioned above is the intent to use the CIFAR-100 dataset, though the aforementioned Places, Paris Street View, and Google Street View datasets that the paper also uses could be good points of comparison for model performance.
In our project, accuracy will be measured by how close the ‘inpainted’ photo compares to the original image by pixel similarity via L2 loss, or rather the distance between pixels. This denotes the success of our project can be directly measured by the average accuracy achieved on the ‘inpaintings’. The actual metric of “accuracy” will therefore describe how close the original image and augmented images are. We plan to explore various forms of accuracy metrics and evaluate their effectiveness through cross-referencing images via random sampling and the accuracy metric. Such analyses in the final paper may look like the following: “Mean Squared Error: performed pixel by pixel between the original image and outputted image; weak for higher dimensional data”. Such accuracy metrics will include dice coefficient, cross entropy, intersection over union, and more.
The authors performed a comparison between the GLCIC variant that they proposed as well a version of their model that doesn’t use dilated convolutions, a version of their model that only uses a global discriminator, and a standard Context Encoder. They compared their models’ performances using L1 loss, L2 loss, PSNR, and SSIM metrics (the paper does mention that these metrics are not the best metrics to evaluate the quality of a model but are good enough for comparing models).
Goals (note: the base accuracy, target accuracy, and stretch accuracy are the performances of the standard Context Encoder, a GAN that only uses a global discriminator, and the GLCIC variant, respectively):
Base Accuracy: Dice Coeff: 0.5, L2 loss: 1.34
Target Accuracy: Dice Coeff: 0.6, L2 loss: 2.33
Stretch Accuracy: Dice Coeff: 0.75, L2 loss: 1.19 (Note: L2 loss is measured in millions)

Ethics

We believe that deep learning is a good approach to this problem. It’s possible to fill in missing pixels using gradients from surrounding pixels, but this approach is not reliable when filling in large patches in an image. In these cases, you need a model that can “learn” the context of an image and fill in pixels accordingly, which is why deep learning models have replaced standard computer vision algorithms for image inpainting.
The use of images presents many challenges in terms of privacy. For various personal and important reasons, a photo may be censored or intentionally malformed. If the deep learning image impainting software is able to replicate the image without the consent of the original user, it may violate an individual’s right to privacy. To prevent the use of unauthorized photo usage, we will make sure to choose a dataset that has clear permission from the sources of their images. An explicit terms of service for the image reconstruction software will also be specified to prevent misuse.
Notably, the dataset that we have chosen (CIFAR-100 dataset) is not known to be systemically historically or societally biased in any way; however, the parent dataset, then 80 million tiny images dataset has since been removed for systemic biases. Though the CIFAR-100 dataset is a subset, it has not been removed for similar reasons because it has been individually queried. This information will be provided as a disclaimer in the documentation for reference by any users.

Division of Labor

Siming Feng: Loading data, designing an algorithm to randomly zero-out patches of pixels in the input data set, implementing dice coefficient accuracy metric
Ze Hua Chen: Implementing L1 reconstruction loss for the GAN + cross entropy loss for the two discriminators + joint loss function, displaying outputs + metrics when training
Michael Youssef: building the completion network and both branches of the GAN (network architecture)
Evan Lu: building both branches of the GAN + visualizing outputs and performance of the model

Links: