Improving Classifiers with GAN Augmented Data

Links

Final writeup: https://drive.google.com/file/d/1jCERL7SueuP85Uq5Z2NcK8-Ap4cUfiCd/view?usp=share_link

Poster: https://drive.google.com/file/d/1lVedRrUZuaCYWGyQwyWA-u_xY_FTIqFb/view?usp=sharing

Video submission: https://drive.google.com/file/d/1ebLmQuErFa5Rg9wYSyOTdvWCAQAKfZuY/view?usp=sharing

Github: https://github.com/brown-afkhurana/cs1470-final-dentists

Previous submission

Title: Improving Classifiers with GAN Augmented Data

Who: June Khurana (Akhuran4), Ken Liou (Kliou2), Marco Ayala (mayala7)

Introduction: The paper’s objective is to address “sparsity” in label collection for supervised machine learning. Sparsity arises when many annotators are needed to acquire labeled data: cost constraints limit how much annotation each instance receives, which leads to fewer or poorer-quality instances in crowdsourced data. The paper explores the use of a Generative Adversarial Network (GAN) to improve learning from crowdsourced data and ultimately reduce annotation costs. Specifically, the GAN is used for data augmentation, generating labels that would otherwise have required more human labor to acquire. We chose this paper because of its potential to disrupt the AI/ML industry. Entire companies, such as Scale.ai, seek to alleviate annotation costs and make labeled data for supervised learning more accessible. To our knowledge, the method used in this paper has not been applied at an industry scale, which is what initially drew us to it. This is a classification problem.

Related Work: Goodfellow et al. published “Generative Adversarial Nets” in 2014, the paper that introduced GANs. It frames the interaction between the discriminator and the generator as a “minimax two-player game.” The authors give mathematical formulas that describe this relationship, along with a less formal, more pedagogical explanation of how GANs work. They show that the solution is reached when the generator’s distribution matches the data-generating distribution (the distribution of the existing data); at that point the discriminator outputs 0.5 everywhere, meaning it can no longer tell real samples from generated ones. They also show generated samples from MNIST, TFD, and CIFAR. Finally, they discuss the pros and cons of GANs.
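
For reference, the minimax game described above can be written as follows, using the notation of Goodfellow et al.: p_data is the data distribution, p_z is the noise prior fed to the generator, and p_g is the distribution of generator outputs. The second line gives the optimal discriminator for a fixed generator, which reduces to 1/2 when the two distributions match.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)},
\qquad
p_g = p_{\text{data}} \;\Rightarrow\; D^*_G(x) = \tfrac{1}{2}
```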

Data: We plan to use MNIST as a baseline for debugging, then CIFAR-10 for improving model performance. For initial Mechanical Turk testing of training data expansion for image classification, we then plan to use CIFAR-10H, a human-labeled CIFAR-10 variant, and LabelMe, with 8 classes and n=2000, as originally used in the paper. For Mechanical Turk testing of training data expansion for sentiment analysis, we plan to use the University of Southampton’s Weather Sentiment tweet dataset. For a seq-to-seq attempt at expanding data, we will use Spider, a human-labeled text-to-SQL dataset, but we do not expect strong performance. For further multimodal expansion, we will attempt to use the Mechanical Turk image labeling dataset from “Learning User Perceived Clusters with Feature-Level Supervision” (2016).

Methodology: Our generative model is a traditional GAN, constructed in two parts. The Generator G takes random noise as input and outputs data that resembles the training data; its loss is tied to how well it can trick the Discriminator D. D takes as input either an output of G or a sample from the training data and outputs the probability that its input came from the training data rather than from G; its loss measures how often that guess is wrong.
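
The two losses described above can be sketched in plain Python as follows. This is a minimal illustration, not our training code: the function names are hypothetical, the inputs are lists of discriminator output probabilities, and the generator loss uses the non-saturating form (maximizing log D(G(z))), which is an assumption about the setup rather than something fixed by this proposal.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def discriminator_loss(d_real, d_fake):
    """D should output 1 on training data and 0 on generator output."""
    real_term = binary_cross_entropy(d_real, [1] * len(d_real))
    fake_term = binary_cross_entropy(d_fake, [0] * len(d_fake))
    return real_term + fake_term


def generator_loss(d_fake):
    """G is rewarded when D labels its output as real (assumed non-saturating loss)."""
    return binary_cross_entropy(d_fake, [1] * len(d_fake))
```

When the discriminator outputs 0.5 for every input, `discriminator_loss` equals 2 log 2 and `generator_loss` equals log 2, which corresponds to the equilibrium described in Related Work.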

The nuance of the paper’s CrowdInG framework comes from the goal of generating classification data. To improve the accuracy of a model that distinguishes among c classes, the full structure is c fully trained GANs, each trained on the training samples of its respective class. The outputs of each GAN’s generator are concatenated with the training data of that label, then all of the training data is re-concatenated and fed into the classifier for training. In essence, this increases the number of batches per epoch and could increase classification accuracy while reducing overfitting.
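
A minimal sketch of the concatenation step described above, assuming each trained generator is exposed as a callable that returns n synthetic samples. The `generators` mapping, the function name, and the stand-in generators in the usage lines are all hypothetical and only illustrate the data flow.

```python
import random


def augment_with_generators(samples, labels, generators, n_per_class, seed=0):
    """Append n_per_class synthetic samples per class, then shuffle.

    `generators` is a hypothetical mapping: label -> callable(n) returning
    n synthetic samples for that label (one trained GAN generator per class).
    """
    aug_samples = list(samples)
    aug_labels = list(labels)
    for label, generate in generators.items():
        synthetic = generate(n_per_class)
        aug_samples.extend(synthetic)
        aug_labels.extend([label] * len(synthetic))
    # Shuffle samples and labels together so batches mix real and synthetic data.
    order = list(range(len(aug_samples)))
    random.Random(seed).shuffle(order)
    return [aug_samples[i] for i in order], [aug_labels[i] for i in order]


# Stand-in "generators" that return constant vectors instead of GAN output.
generators = {0: lambda n: [[0.0, 0.0]] * n, 1: lambda n: [[1.0, 1.0]] * n}
x, y = augment_with_generators(
    [[0.1, 0.2], [0.9, 0.8]], [0, 1], generators, n_per_class=3
)
```

The classifier would then be trained on `x` and `y` in place of the original training set.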

We plan to go beyond the paper in a few possible ways. First, we may swap the GAN model for an Actor-Critic model. Second, we may implement some aspects of StyleGAN for multiclass generation instead of having a “multi-headed” GAN collective. Third, we will compare performance when using all generator output vs. only the generator output that successfully tricks the discriminator.
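
The third comparison needs a rule for what counts as tricking the discriminator. One possible sketch, assuming the discriminator score is the probability that a sample is real and treating any score above a hypothetical threshold of 0.5 as a successful trick:

```python
def keep_fooling_samples(samples, discriminator_scores, threshold=0.5):
    """Keep only generated samples that the discriminator scores as real.

    `threshold` is an assumed cutoff, not a value taken from the paper.
    """
    return [
        sample
        for sample, score in zip(samples, discriminator_scores)
        if score > threshold
    ]


# Hypothetical generated samples with their discriminator scores.
kept = keep_fooling_samples(["a", "b", "c"], [0.9, 0.2, 0.6])
```

Here `kept` contains only the samples scored above the threshold; the unfiltered list serves as the "all generator output" condition.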

Metrics: We plan to measure the difference in classifier performance with and without GAN-augmented training data. Accuracy is the most appropriate metric, since we are directly trying to improve a model’s accuracy using training data annotated by GANs. The augmentation itself also has its own criteria for success. First, the generated annotations should follow the distribution of authentic ones, so that they are consistent with the label confusion patterns observed in the original annotations. Second, the generated annotations should align well with the ground-truth labels, e.g., with high mutual information, so that they are informative about the ground-truth labels to the classifier. The authors of our paper hoped to improve classifier accuracy by augmenting the training data with a GAN, and they quantified their results by comparing the accuracy of the classifier before and after augmentation. Our base goal is to train a classifier using GAN-augmented data. Our target goal is to match or improve on the accuracy of the baseline classifier with GAN-augmented data. Our stretch goal is to improve the accuracy of our classifier in a statistically significant way.
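
The two quantities mentioned above, the accuracy difference and the mutual information between generated annotations and ground-truth labels, could be computed along these lines. This is a sketch with hypothetical function names; the mutual information is a simple plug-in estimate from label counts, in nats, and is not necessarily the estimator used in the paper.

```python
import math
from collections import Counter


def accuracy(predictions, targets):
    """Fraction of predictions that match the targets."""
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def accuracy_gain(baseline_preds, augmented_preds, targets):
    """Accuracy with augmented training data minus baseline accuracy."""
    return accuracy(augmented_preds, targets) - accuracy(baseline_preds, targets)


def mutual_information(annotations, truths):
    """Plug-in estimate of I(annotation; truth) in nats from paired labels."""
    n = len(annotations)
    joint = Counter(zip(annotations, truths))
    p_a = Counter(annotations)
    p_t = Counter(truths)
    mi = 0.0
    for (a, t), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log(p_joint / ((p_a[a] / n) * (p_t[t] / n)))
    return mi
```

A positive `accuracy_gain` on a held-out test set corresponds to the target goal; checking whether that gain is statistically significant (the stretch goal) would need repeated runs or a separate test, which this sketch does not cover.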

Ethics: One vital societal implication of this work is the amplification of human bias in datasets. The basic premise of the paper is the use of a GAN to optimize the creation of labeled data based on human-generated labels. If the human-generated data is biased, for whatever reason, the labeled data from the model will be just as biased, since these labels represent the ground truth from which the model is learning. Biased data can have a wide range of serious consequences, such as misrepresenting or being prejudiced against minorities. As mentioned in the introduction, major stakeholders are companies such as Scale.ai whose mission is to solve this annotation cost issue. If this technology becomes standard in the industry, these stakeholders will be forced to either adopt it or create something just as cost-efficient. If they choose the former, they will have to ensure that the technology generates ethically conscious labeled data, especially at such a large scale. In the case of Scale.ai, clients range from Harvard Medical School to Toyota. Biased labels in data can lead to extremely grave ramifications, such as inaccurate diagnoses or potential defects in automobile manufacturing.

Division of labor: Ken: data preprocessing and coding the interactions between the discriminator and generator. June: creating the Discriminator. Marco: creating the Generator.

11/30 reflection: https://docs.google.com/document/d/1FMswJOI9qGHmXBpNq3Vnq0LZQODhKBws7M3340tHuK4/edit