Github
Devpost Outline (seen below as well)
Reflection (Mentor Meeting 3)
Slides (Presentation for DL Day)
Final Write Up
Outline
Inspiration
During development, infants and young adults learn differently from how models do. Their learning is likely not supervised, and even if it were, they would see far fewer training examples than those used to train CNNs. Furthermore, the data diet changes over time: older children and adolescents have access not only to more examples of each class but also to more classes, so we can imagine a transition toward supervised learning. The objective function of our model and its training data are therefore dynamic with respect to epochs.
For our project, we will use the CIFAR100 dataset. In early epochs, we will train a self-supervised model to define a latent space over a limited set of classes. The quality of this latent space can be visualized with dimensionality-reduction methods or measured with clustering metrics (silhouette score, ARI, etc.). After "early infancy", we will switch to a semi-supervised model. Eventually, we will take this trained architecture and attach a dense layer to classify images in a fully supervised manner.
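As a rough illustration of what "dynamic with respect to epochs" could mean in code, here is a placeholder schedule; the phase boundaries, class counts, and labeled fractions are purely illustrative assumptions, not tuned values:

```python
# Illustrative schedule for a training regime that changes with "developmental age".
# The epoch boundaries, class counts, and label fractions are placeholder assumptions.

def training_phase(epoch):
    """Return (mode, num_classes, labeled_fraction) for a given epoch."""
    if epoch < 10:            # "early infancy": no labels, few classes
        return "self_supervised", 7, 0.0
    elif epoch < 25:          # "childhood": a trickle of labels appears
        return "semi_supervised", 7, 0.1
    else:                     # "adolescence": more classes, mostly labeled data
        return "supervised_head", 10, 1.0

for epoch in range(30):
    mode, n_classes, labeled_frac = training_phase(epoch)
    # ...select the data subset and loss for this phase, then train one epoch...
```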
Who
- Winston Li (l.winstony8@gmail.com, wli115)
- John Byers (john_byers@brown.edu, jbyers3)
- Marcel Mateos Salles (marcel_mateos_salles@brown.edu, mmateoss)
Introduction
- Unlike neural networks, infants have limited access to visuals and labels. Self- and semi-supervised learning seek to address this constraint: the former uses no labels, and the latter uses a limited amount to define a latent manifold. We noticed that no current model architecture truly follows this biological learning pattern.
- We will explore an intermediate: gradual supervision. Starting with pure self-supervision, we will add in more labels over epochs. This will hopefully define richer manifolds than pure self-supervision while using fewer labels than purely semi-supervised models.
- We begin with a purely self-supervised regime working with a limited subset of the CIFAR-10 dataset; for example, we only use data corresponding to 7 classes. As epochs increase, we will gradually "leak" in labels to represent a transition to a semi-supervised learning regime. This will provide "names" for our learned latent clusters.
- At later epochs, we will introduce data with a new class. Thus, the model must not only learn to predict this new label in a zero-shot fashion, but also remember such a label as a new possible prediction answer for future predictions. At the final epochs, we will consider all 10 classes.
- Primarily, we will begin with a SimCLR-like approach to create a latent space with clusters. In later epochs, we will embed images with known labels and then propagate these labels, likely in a nearest-neighbor fashion, to label similar images within a cluster (see the sketch below).
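A rough sketch of the nearest-neighbor label propagation we have in mind; the embedding arrays, labels, and `k` are placeholders rather than settled design choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def propagate_labels(embeddings_labeled, labels, embeddings_unlabeled, k=5):
    """Assign each unlabeled embedding the majority label of its k nearest labeled neighbors.

    embeddings_labeled: (n_labeled, d) latent vectors with known labels
    labels:             (n_labeled,) integer class ids
    embeddings_unlabeled: (n_unlabeled, d) latent vectors to pseudo-label
    """
    nn = NearestNeighbors(n_neighbors=k).fit(embeddings_labeled)
    _, idx = nn.kneighbors(embeddings_unlabeled)   # (n_unlabeled, k) neighbor indices
    neighbor_labels = labels[idx]                  # labels of those neighbors
    # Majority vote per row gives the propagated pseudo-label.
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])
```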
Related Work
We were inspired by many past works on contrastive learning, self-supervision, and semi-supervision training regimes.
- SimCLR (https://arxiv.org/pdf/2002.05709.pdf): a contrastive-learning approach that creates noised "copies" of the same image, where the model must learn to recognize pairs generated from the same original image. By maximizing the agreement between the two copies, the model must learn higher-order semantics that persist through image manipulations (a minimal loss sketch follows this list).
- Local aggregation (https://arxiv.org/pdf/1903.12355.pdf): another contrastive-learning approach that does not require novel image generation; instead, a purely nearest-neighbors based method for bringing together close-neighbors and pushing away background points.
- Local label propagation (https://arxiv.org/pdf/1905.11581.pdf): a semi-supervised, nearest-neighbor approach for using limited labeled data to label latent clusters.
- It is important to note that we will not have the computing power and time that the authors of these papers had, which is why we chose the smaller CIFAR10 and CIFAR100 datasets. Those works also had longer development timelines and rely more heavily on the architecture and layers themselves; ours will rely more on our devised gradual-supervision training regime.
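For concreteness, here is a minimal sketch of the SimCLR-style NT-Xent contrastive loss referenced above, written in TensorFlow; the temperature value and tensor names are assumptions for illustration, not our final settings:

```python
import tensorflow as tf

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """SimCLR-style contrastive loss on two batches of projections of the same images.

    z_a, z_b: (N, d) projection-head outputs for the two augmented "copies".
    """
    batch_size = tf.shape(z_a)[0]
    z = tf.math.l2_normalize(tf.concat([z_a, z_b], axis=0), axis=1)   # (2N, d)
    sim = tf.matmul(z, z, transpose_b=True) / temperature             # cosine similarities
    # Mask out self-similarity so an example is never its own positive.
    sim = sim - tf.eye(2 * batch_size) * 1e9
    # For row i, the positive is the other augmented view of the same image.
    positives = tf.concat([tf.range(batch_size, 2 * batch_size),
                           tf.range(0, batch_size)], axis=0)
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(positives, sim, from_logits=True))
```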
Data
We have multiple ideas for datasets we could use and spent a lot of time discussing the options. Due to our hardware and time constraints, we decided against ImageNet, but we still wanted a solid, reliable dataset with multiple categories. We therefore feel the best dataset to use is CIFAR10 or CIFAR100. Each consists of 50,000 32x32 color training images and 10,000 testing images; CIFAR10 has 10 categories of 6,000 images each, while CIFAR100 has 100 categories of 600 images each. Both datasets are 132.03 MiB. We will not need significant preprocessing because all the images are the same shape. Additionally, since the images share the same shape and format, we could inject some CIFAR100 images and labels later on to introduce classes our model hasn't seen yet. These datasets are easy to access, to the point where we can download them locally. They are available at https://www.cs.toronto.edu/%7Ekriz/cifar.html or through Keras at https://keras.io/api/datasets/cifar100/. If CIFAR10 and/or CIFAR100 don't pan out, we can always fall back on MNIST.
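Loading either dataset through Keras is a one-liner, and the rescaling shown below is the only preprocessing we currently expect to need:

```python
import tensorflow as tf

# Downloads CIFAR-10 locally on first use; CIFAR-100 works the same way via
# tf.keras.datasets.cifar100.load_data(label_mode="fine").
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0   # 32x32x3 uint8 -> floats in [0, 1]
```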
Methodology
- We plan on using a simple CNN in the fashion of the VGG16 model (https://arxiv.org/pdf/1409.1556.pdf). We reason that any performance benefit should come from the training regime rather than from complicated architectural design choices. Plus, a smaller model accommodates re-training and debugging. Lastly, our training regime should be applicable to all existing architectures.
- We note that our final (dense) layer may be dynamic to reflect the learning of novel labels, and that we plan on implementing architectural pruning. The former can be handled by a loss-based callback and the latter by an epoch-based scheduler, which either zero out edges and nodes or instantiate new ones to represent pruning or augmentation of the model, respectively. For example, after each epoch we can prune a select number of edges with the lowest-magnitude weights (see the callback sketch below). As for acquiring new labels, we can randomly initialize the weights of a spare output kernel when no existing output kernel produces a sufficiently high probability (e.g., the model is unclear which label an example belongs to, and the output distribution resembles a previous "confused" output, indicating a consistent confusion).
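A hedged sketch of the epoch-based pruning scheduler described above, written as a Keras callback; the layer name and pruning fraction are illustrative assumptions:

```python
import numpy as np
import tensorflow as tf

class MagnitudePruning(tf.keras.callbacks.Callback):
    """At the end of each epoch, zero out the smallest-magnitude weights of one layer."""

    def __init__(self, layer_name="classifier_head", prune_fraction=0.05):
        super().__init__()
        self.layer_name = layer_name          # hypothetical layer name
        self.prune_fraction = prune_fraction  # fraction of weights zeroed per epoch

    def on_epoch_end(self, epoch, logs=None):
        layer = self.model.get_layer(self.layer_name)
        kernel, *rest = layer.get_weights()
        threshold = np.quantile(np.abs(kernel).ravel(), self.prune_fraction)
        pruned = np.where(np.abs(kernel) < threshold, 0.0, kernel)  # zero the smallest edges
        layer.set_weights([pruned, *rest])
```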
Metrics
We did not want to use accuracy as our primary metric because, when training our model, we will not have access to the hardware and time that companies like Google and the top research labs have for their models. Since we are focusing on our model's ability to differentiate between categories and remember them over time, better metrics are the ARI (Adjusted Rand Index) and the NMI (Normalized Mutual Information). The check can even be visual: if we project the model's latent space into two dimensions, we should see the cluster for one category grouped together and separated from the clusters of other categories (a small evaluation sketch follows the goals below). We will use the following benchmarks to judge how successful our model is:
We will randomly initialize a VGG16 model and train it in a self-supervised manner as a pure self-supervision baseline. We will also train a VGG16 model in a self-supervised manner exposed to either 7 classes, as a no-new-learning baseline, or all 10 classes, to test how our model acquires new labels.
Base: Demonstrate a rich embedding space. This means we want our model to create good clusters within its latent space that it can use to represent categories.
Target: Using our gradually supervised model, beat a pure self-supervised model with the same architecture plus a linear-probe head on the classification task. Likewise, come close in accuracy to the fully supervised model.
Stretch: Demonstrate the ability to learn and remember new labels in a zero-shot fashion. For example, be presented with a novel label and novel image on epoch 10, recognize the new label, and remember it as a possible answer for predictions in future epochs.
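A minimal sketch of how we might score the latent space with these cluster-quality metrics, assuming encoder outputs (`latent`) and true classes (`labels`) are already in hand; the K-Means step and cluster count are illustrative choices rather than part of our final pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)

def evaluate_latent_space(latent, labels, n_clusters=10):
    """Cluster latent vectors and compare the clusters to ground-truth labels.

    latent: (N, d) encoder outputs; labels: (N,) true class ids.
    """
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(latent)
    return {
        "ARI": adjusted_rand_score(labels, cluster_ids),
        "NMI": normalized_mutual_info_score(labels, cluster_ids),
        "silhouette": silhouette_score(latent, cluster_ids),
    }
```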
Ethics
Why is Deep Learning a good approach to this problem?
- Deep learning is a good approach to this problem because of what we are trying to replicate: the biological learning pipeline. Whatever we end up building will have to learn with the assistance of something else at first and then later be able to learn on its own. Deep learning already has existing architectures that can do this, which we can expand upon and combine to reach the level we want. Additionally, the biological brain is extremely complex and made up of billions of neurons, so deep learning seems like the best way to replicate its developmental learning process. We understand that building off existing architectures makes us reliant on prior work that may not have been created ethically and may have released a lot of waste into the environment, and that training our model will also add to CO2 emissions. However, deep learning still seems like the best way to replicate this learning process, and we chose a dataset that trains faster and requires fewer computational resources than ImageNet to reduce this somewhat.
What broader societal issues are relevant to your chosen problem space?
- Current state-of-the-art models, especially large language models, are often far too large for the general public to recreate. The web-scale data required to train GPT, not to mention the mere hardware to train it, means a lot of current AI development is limited to a few companies, who often are opaque with their means and methods. More public institutions like academia are thus relegated to using smaller and older models. By presenting this new training regime, we hope to introduce a low-data method that any AI developer can integrate into their own models.
Division of labor
We plan to divide the responsibilities of our project in the following ways:
Winston
- Method development: implementing image augmentation methods and epoch-based training regime changes, high-performance computing debugging
John Ryan
- Data pre-processing, implementing semi-supervision loss function, hyperparameter fine-tuning
Marcel
- Implementing callback- and epoch-based architectural pruning
Little Label Learners Reflection (Mentor Meeting 3)
Introduction
Unlike traditional neural networks, infants have limited access to visuals and labels. Self- and semi-supervised learning seek to address this constraint: the former uses no labels, and the latter uses a limited amount to define a latent manifold. We will explore an intermediate: gradual supervision. Starting with pure self-supervision, we will add in more labels over epochs. This will hopefully define richer manifolds than pure self-supervision while using fewer labels than purely semi-supervised models.
Challenges
A major challenge we have come across in the literature and in our own implementation is determining whether a model is self-supervised, semi-supervised, or fully supervised. Often, the former two are used interchangeably; other times, fully supervised fine-tuning is given the label of "semi-supervision". Establishing fundamental definitions has thus been tricky but important. This took us a couple of days, but we are back on track with a good understanding now. Furthermore, we faced some setbacks working with Oscar. Mostly, this involved setting up a TensorFlow environment with all the needed packages and, critically, PATH variables. When such a setup goes wrong, it is not immediately obvious; for example, when training our models in a pure CPU environment, we noticed it was faster than its GPU counterpart, which led us to realize the GPU environment was not actually using the GPU. As it turns out, Oscar requires users to run inside a particular Apptainer container. After a couple of hours of debugging, we finally solved this issue and can now train new models extremely quickly.
Insights
First, we have preprocessed all of our data by creating a Dataloader class that generates subsets of data containing only a few of the classes of a given dataset. Dataloader can also split our training data into different proportions of labeled and unlabeled data. This is extremely useful, as our plan to demonstrate gradual supervision requires variable labeling rates, while our goal of showing continual learning needs datasets containing different classes. Furthermore, we have gathered several baseline results: a simple CNN, a self-supervised model, and a semi-supervised model. For these models, we tested their ability to learn novel classes on the same dataset, CIFAR10. Currently, our model is performing mediocrely. We still need to define an appropriate degree of "novel learning", as well as the optimal split rate and other hyperparameters. Lastly, we have worked on analyzing the quality of our latent spaces through several dimensionality-reduction and visualization experiments, namely with PCA and UMAP (a small visualization sketch is below). The results are not yet as good as we would like, but this is likely due to our overly simplistic model and low epoch count. Still, we are happy to have these pieces working so we can see how our model is behaving; we can now focus on changing how we train the model with our proposed strategy to see whether we can generate better results than what we currently have.
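A small sketch of the PCA-based visualization we use; the UMAP variant is analogous (swap in umap-learn's UMAP(n_components=2) for the PCA step). Here `latent` and `labels` are placeholders for the encoder outputs and their true classes:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_space(latent, labels):
    """Project (N, d) latent vectors to 2D and color points by their true class."""
    coords = PCA(n_components=2).fit_transform(latent)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=4)
    plt.colorbar(label="class")
    plt.title("Encoder latent space (PCA, 2D)")
    plt.show()
```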
Plan
At this point in the project, we are confident that we are on the right track to finish on time. However, there are several opportunities to fine-tune. For example, we would like to explore more complicated CNN encoder architectures, such as VGG16, to provide more realistic use cases of our training regime. On the classification side, we are currently using a sparse CCE loss to "shortcut" our way to evaluating a limited-class prediction output against a full-class ground truth. Changing our models to handle one-hot encoding and an explicit CCE loss would better highlight our intention for the model to learn new classes, and would also introduce the opportunity to experiment with "unknown" class labels (sketched below).
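A small sketch of what the one-hot, explicit-CCE setup with a reserved "unknown" slot might look like; the class count, reserved index, and helper name are assumptions for illustration:

```python
import tensorflow as tf

NUM_KNOWN = 10                      # assumed: 10 CIFAR-10 classes
UNKNOWN_INDEX = NUM_KNOWN           # last output unit reserved for "unknown"
NUM_OUTPUTS = NUM_KNOWN + 1

def one_hot_targets(sparse_labels, seen_classes):
    """Map labels of classes the model has not yet been shown to the unknown slot.

    sparse_labels: 1-D int tensor of true class ids.
    seen_classes:  1-D int tensor of class ids already introduced to the model.
    """
    sparse_labels = tf.cast(sparse_labels, tf.int32)
    seen_classes = tf.cast(seen_classes, tf.int32)
    is_seen = tf.reduce_any(
        tf.equal(sparse_labels[:, None], seen_classes[None, :]), axis=1)
    mapped = tf.where(is_seen, sparse_labels,
                      tf.fill(tf.shape(sparse_labels), UNKNOWN_INDEX))
    return tf.one_hot(mapped, NUM_OUTPUTS)

loss_fn = tf.keras.losses.CategoricalCrossentropy()   # explicit CCE on one-hot targets
```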
As for generating our latent space, we are currently using linear-probe backpropagation to update our encoder's weights, which represents a rather abstract notion of latent-space "accuracy". Thus, another stretch goal of ours is to develop more interpretable and possibly more realistic semi-supervised losses, most likely inspired by existing semi-supervision methods based on nearest-neighbor and pseudo-labeling techniques. Additionally, we would like to take many more measurements on our encoder-generated latent spaces, such as the cluster-quality metrics ARI and NMI.
Built With
- python
- tensorflow