Effectiveness of Synthetic Dataset on CIFAR Classification

Effectiveness of Generated Synthetic Image Dataset on CIFAR Classification

GitHub Repo

https://github.com/yuki-hayashita/DL-Final-Project

Reflection

https://docs.google.com/document/d/1Qrn6rKQyddMhqKBB_6GY1BVhStmNjRs49AyEu8U_p5s/edit?usp=sharing

Members

Kat Stephan
Jaden Chew
Yuki Hayashita
William Park

Introduction

Objective of “Mixing Real and Synthetic Data to Enhance Neural Network Training -- A Review of Current Approaches”: to increase the accuracy of a model trained on real images by training it on a mix of real data and synthetic data produced by a generative model.

Train a generative model that produces a synthetic dataset of images.
Mix the synthetic dataset with CIFAR to train a CNN.
Evaluate whether the accuracy of the CIFAR model improves.
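The mixing step above could be sketched as follows. This is only a minimal illustration, not our final pipeline: the function name mix_datasets, the synthetic_fraction parameter, and the choice to keep every real example are all assumptions made for the example.

```python
import random


def mix_datasets(real, synthetic, synthetic_fraction, seed=0):
    """Return a shuffled training list in which roughly `synthetic_fraction`
    of the examples are synthetic. Every real example is kept.

    `real` and `synthetic` are sequences of (image, label) pairs.
    All names here are hypothetical; this is a sketch, not our final code.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_synth / (n_real + n_synth) = synthetic_fraction for n_synth.
    n_synth = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    # Never ask for more synthetic examples than were generated.
    n_synth = min(n_synth, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed
```

Sweeping synthetic_fraction over a few values (for example 0.0, 0.1, 0.25, 0.5) would let us compare CNN accuracy at each mixing ratio against the real-only baseline.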

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching. In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”--if you stumble across a new implementation later down the line, add it to this list.

Data:

Dataset: We will use CIFAR-10 and CIFAR-100. We will also use a generative model to produce our own synthetic dataset that resembles CIFAR.
Size and Preprocessing: “The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.” “This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).” Little preprocessing is required. We already handled image preprocessing in our homework, so the only preprocessing left should be handling the ‘fine’ and ‘coarse’ labels for CIFAR-100.
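One way to handle the fine and coarse labels is to build a lookup from the paired labels that come with each CIFAR-100 image. The sketch below assumes the two label lists are already loaded as plain integer lists; the function names are hypothetical, and the labels in the usage note are toy values rather than real CIFAR-100 class ids.

```python
def build_fine_to_coarse(fine_labels, coarse_labels):
    """Build a {fine_label: coarse_label} lookup from per-image label pairs.

    Raises ValueError if a fine label is ever paired with two different
    superclasses, which would indicate a loading bug.
    """
    if len(fine_labels) != len(coarse_labels):
        raise ValueError("label lists must have the same length")
    mapping = {}
    for fine, coarse in zip(fine_labels, coarse_labels):
        # setdefault stores the first pairing seen and returns the stored value.
        if mapping.setdefault(fine, coarse) != coarse:
            raise ValueError(f"fine label {fine} maps to more than one superclass")
    return mapping


def to_coarse(fine_predictions, mapping):
    """Convert fine-label predictions to superclass predictions."""
    return [mapping[f] for f in fine_predictions]
```

For example, build_fine_to_coarse([4, 7, 4, 9], [1, 0, 1, 0]) gives {4: 1, 7: 0, 9: 0}, so a classifier trained on fine labels can also be scored at the superclass level.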

Methodology:

Training: We will likely train our generative model on the CIFAR images and labels so that it can learn to generate images from text (the class labels).
Possible Issues: The hardest parts of the implementation will be figuring out how to generate the synthetic data and how to use that data to measure the new accuracy and check whether it improves. Working out generative models will also be difficult because we have not covered them in class yet. If we run into issues, there are different generative models we can try. For instance, if a variational autoencoder does not work well, we can try a generative adversarial network.
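If we start with a variational autoencoder, the training objective would combine a reconstruction term with a KL term that pulls the latent distribution toward a standard normal. The sketch below shows that loss for a single example in plain Python. It assumes a diagonal Gaussian encoder and uses squared error as the reconstruction term, which is one common choice and not necessarily the one we will end up with; the function names are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def vae_loss(x, x_recon, mu, log_var):
    """Per-example VAE loss: squared reconstruction error plus the KL term.

    Squared error is an assumption made for this sketch; a cross-entropy
    reconstruction term is another common option for image data.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    return recon + gaussian_kl(mu, log_var)
```

The KL term is zero when the encoder outputs mu = 0 and log_var = 0, that is, when the latent distribution already matches the standard normal prior.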

Metrics:

We are trying to improve classification accuracy by augmenting the training data with generated images of the CIFAR categories. We can assess the generator qualitatively by inspecting its output for classes such as airplane, dog, or cat and deciding subjectively whether it is good enough. A more quantitative assessment is to put the generated images through our CIFAR classification model and see whether they improve its accuracy.
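The quantitative comparison could be as simple as the sketch below: compute test accuracy for a classifier trained on real data only and for one trained on the mixed data, then report the difference in percentage points. The function names and the toy predictions in the usage note are assumptions for illustration, not results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_gain(baseline_preds, augmented_preds, labels):
    """Percentage-point change in accuracy from the baseline model
    (real data only) to the augmented model (real + synthetic data).
    Both models must be evaluated on the same held-out real test set."""
    return 100.0 * (accuracy(augmented_preds, labels) - accuracy(baseline_preds, labels))
```

With toy labels [0, 1, 2, 3], baseline predictions [0, 1, 0, 0], and augmented predictions [0, 1, 2, 0], the accuracies are 0.5 and 0.75, so accuracy_gain returns 25.0. A negative value would mean the synthetic data hurt performance.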

Ethics

Relevant broader societal issues: There are ethical implications to creating datasets of fake, computer-generated images. Although we are not generating fake images maliciously, creating them still poses an ethical problem, because this software and the content it produces can be used unethically. We are using this generative model for scientific purposes, but it could also be used for malicious purposes such as fraud.
Stakeholders: Since there is no target audience for our algorithm, anyone who uses it or sees its output is a stakeholder. A person who uses the algorithm (whether for their own gain or not) is a stakeholder, and so is a person who interacts with the images it produces (whether knowingly or unknowingly). These two groups hold very different positions of power, since those who interact with the images might be taken advantage of.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

Since we expect to gain a greater understanding of generative models in the upcoming lectures, our plan for now is to work together on creating the generative model.

Poster: Yuki, Jaden
Written Report: Will, Kat