line2picture

Poster!

line2picture

People:

Truong Cai (tcai7), Neophytos Christou (nchrist9),Wangdrak Dorji (wdorji), TaShawn Arabian(tarabian)

Introduction:

We are trying to implement this paper “Image-to-Image Translation with Conditional Adversarial Networks” by Prof. Efros’ group at UC Berkeley. Their main objective in this paper is to investigate conditional GAN as a general strategy for solving image-to-image translation problems. They want their network to essentially not only learn the mapping from one image to another, but also learn the loss function that would train this mapping. After showing the theoretical framework, they proceed to show that they can successfully apply this technique to synthesized data such as “label maps, reconstructing objects from edge maps, and colorizing images.” We chose this paper because the program associated with it, pix2pix, is rather popular with artists who used it to generate interesting art. As a group we want to generate nice paintings, but given our collective lack of artistic skills, we thought that this project could help us achieve our dream. Link to the original paper This problem could be considered as both supervised and unsupervised learning depending on the approach. But essentially we are trying to generate a mapping to take images from source domain to target domain. Also specifically, the GAN approach is somewhat related to the Actor-Critic techniques in Reinforcement Learning (except there’s proof of convergence for AC but not GAN)

Related Work:

We found this one paper giving an overview to this subfield of image-to-image translation. There are essentially two most popular methods in the field namely, variational autoencoder (VAE) and generative adversarial network (GAN). Along with these people typically evaluate the models using these two philosophies (i) Subjective image quality assessment (real or fake) (ii) Objective image quality assessment (some metrics to assign score). Then the paper went on to describe the methods in detail as well as recent developments in those two schools of thought. Finally they did a comprehensive comparison of 12 models performance on the same datasets for two tasks and gave an outlook of the field. Link to paper

Public implementations:

https://github.com/phillipi/pix2pix

Data:

We plan to use the Labeled Faces in the Wild dataset. Normally, this dataset is used for testing facial recognition models and contains more than 13K images of faces. However, since our project isn’t about facial recognition, we will modify the dataset by generating an outline for each picture using the Canny edge detector. We will use these portrait-outline pairs to train our model. All images’ dimensions are 250x250.

Methodology:

Our model will be a Generative Adversarial Network (GAN). Specifically, it will be a Conditional GAN, similar to the one implemented in the original pix2pix paper. To train the model, we plan to do a 3:1 partition of the dataset for training and testing. The optimal hyperparameters, such as the learning rate, batch size and number of epochs, will be discovered during development. We also plan to experiment with the patch size of the discriminator to find what best suits our dataset, starting from a 1x1 patch, going up to a 250x250 (size of image) patch. For optimizing the loss, we plan to use minibatch SGD and apply the Adam solver. We believe that the most challenging part of our implementation will be fine tuning the model to work in this specific domain. Pix2pix has been applied to similar datasets, such as generating pictures of bags from outlines, however our domain is a bit more complicated (portraits instead of bags, which have more details) and doesn’t contain as many images (the bags dataset contained 137K images, whereas ours contains only 1/10th of that). Another challenge is to generate good data that the model can learn on. Since we don’t have the outlines for the images we will be using, we need to make sure the model can learn well using the outlines we will generate. Since we are not sure how well our model will work with the specific dataset/domain, we plan to first try to find out another dataset of faces if the one we chose doesn’t work well enough, or even change the domain completely and try to apply the same idea of generating images from outlines to a simpler object that doesn’t have as many details as a human face.

Metrics:

There is no particular notion of accuracy or established objective way to evaluate the quality of generated images from generative models like ours. However, a popular metric for GAN outputs is called the “Inception Score” which is the measure of how realistic a GAN’s output is or an automated alternative which correlates well with human evaluation of realistic image quality. where the quality of synthesized images are rated based on a pre-trained Inception model’s ability to classify them.

The authors of the paper use two different measures for evaluation. The first is human scoring through the Amazon Mechanical Turk platform where real images and images generated by pix2pix are randomly mixed together. Human scorers then see each image from this randomly shuffled set for only a second and then label it as real or fake.

Secondly they used an inception score with the help of pre-trained semantic classifiers to classify the images generated by the pix2pix model. If the semantic classifiers label the generated image correctly, it indicates that the generated image is realistic since the classifier has been trained on real images and it should be able to classify realistic synthesized images correctly as well. The authors refer to this metric as the “FCN-score” which is the percentage of generated images classified correctly by the semantic classifier called FCN-8s(Fully convolutional network).

We plan to use the “FCN-score” measure outlined in the paper for our implementation as a measure of success for how realistic the model outputs are. To do so, we will use the FCN-8s architecture for semantic segmentation, and train it on the “Labeled Faces in the Wild” dataset containing images of faces labeled with the name of the person pictured. We will then score the generated outputs by the classification accuracy of the FCN-8 based on the labels of the true face images.

Our base goal is a “FCN-score” of 0.3 based on the performance of the classification accuracy of the paper model. Our target goal is to improve the performance to a “FCN-score” of 0.5. Our stretch goal is to generate realistic images with an “FCN-score” of 0.6.

Ethics:

What broader societal issues are relevant to your chosen problem space?

We think that with the goal of our model being to generate realistic outputs, a broader societal issue is the ethical concerns over the misuse of generative models. With improvement in performance and complexity of generative data modeling, distinguishing between real and fake becomes difficult. This leads to problems such as face swapping, deep fakes, and synthetic face generation technologies, that can be used to spread misinformation, manipulate individuals by “using” the identity of well-known personalities, and also legal issues if the generated content can be used for personal gains by users if the generated identity is that of another person.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

The dataset we use is the “Labeled Faces in the Wild”, which is a public benchmark for face verification. These images of faces were collected from the web and labeled with the name of the person pictured. There are no concerns with regard to collection or labeling as all the images were collected from the web which is publicly available by the research community to make advances in face verification, and not for commercial gains. However, the dataset is not very representative as many ethnicities have very minor representation or none at all. Also, there are very few children, no babies, very few people over the age of 80, and a relatively small proportion of women. As the images are faces of people found on the web, these images are skewed by mostly well known and popular figures. Hence, it can be the case that these figures may be mainly representative of certain ethnic groups more and male individuals more, our model may become skewed towards generating images with features specific to these groups more.

Division of labor: For now, we plan that all members will equally work on all parts of the project, potentially dividing the work in the future.

For this checkin, we also require you to write up a reflection including the following: Introduction: This can be copied from the proposal. We are trying to implement this paper “Image-to-Image Translation with Conditional Adversarial Networks” by Prof. Efros’ group at UC Berkeley. Their main objective in this paper is to investigate conditional GAN as a general strategy for solving image-to-image translation problems. They want their network to essentially not only learn the mapping from one image to another, but also learn the loss function that would train this mapping. After showing the theoretical framework, they proceed to show that they can successfully apply this technique to synthesized data such as “label maps, reconstructing objects from edge maps, and colorizing images.” Challenges: What has been the hardest part of the project you’ve encountered so far? It was difficult to find a balance between the threshold for creating outlines of images using cv2 as we had to ensure we still produced only the outlines but we did not want filter too much noise that we start losing information about the facial features such as incomplete lines. Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations?

So the model isn’t performing as bad as we initially thought. We at least can recognize the predicted image as images of faces. The skin tone seems to be correct. So I think by tuning the hyper-parameter, we can get reasonably better results. Plan: Are you on track with your project? What do you need to dedicate more time to? We have played around with tuning existing implementations of similar models as the one we plan to create to get an idea of how we will need to tune our hyperparameters and preprocessing. We now need to focus on implementing our own model in a different framework (in TensorFlow) What are you thinking of changing, if anything? We are optimistic about how our network will perform with our preprocessed data, so we are not planning on changing anything for now.