Title

Analysis of the effects of different regularization methods on the explainability of image classification models.

Who

Jay Gopal (jgopal), Yousef Elgodamy (yelgodam)

Introduction

Deep learning has made tremendous progress ever since the inception of AlexNet, and the field is constantly making improvements to solve problems. Regularization is a technique often utilized by researchers for a variety of reasons, such as to prevent extreme weight values and/or to encourage sparsity in weights. Although it is traditionally acknowledged that regularization has the potential to increase accuracy and other metrics, the effects of regularization on explainability writ large have not been explored extensively. This research undertakes a detailed analysis of the effects of different types of regularization on the explainability of several open-sourced models, as well as a custom-built CNN.

Related Work

Research has been conducted on regularization's ability to increase accuracy. For example, this paper (https://proceedings.neurips.cc/paper/2016/hash/41bfd20a38bb1b0bec75acf0845530a7-Abstract.html) by Wen et al. explores the effects of inducing sparsity (similar to L1 regularization) on models' accuracy. The authors show that regularization can improve performance. As another example, this paper (https://arxiv.org/abs/1712.09936) by Varga et al. argues that regularizing the input gradient can greatly aid image classification (using accuracy as a metric). However, this area of research often omits one critical component: model explainability. The effects of well-known types of regularization, such as L1 and L2, on interpretability remain under-researched. To date, we have been unable to find a paper that focuses on this specific question. Search phrases used by the authors include ["Regularization" + "explainability" + "deep learning"], [Does regularization affect deep learning model explainability?], and ["Regularization" + "interpretability" + "deep learning"].

Data

The dataset for this study will be CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html), a publicly available and widely used benchmark dataset in deep learning.

Methodology

We will be training a Simple CNN, ResNet50, and VGG19, each under the following three conditions: no regularization, Dropout regularization, and L2 regularization. All off-the-shelf neural networks will have the standard, open-sourced architectures, in addition to the regularization (if applicable). Since these models are too large to train locally, they will be trained on the Ocean State Center for Advanced Resources (OSCAR) provided by Dr. Thomas Serre's lab at Brown University's Carney Institute for Brain Science. Models will be trained using PyTorch.
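In PyTorch, the three conditions require only small changes: Dropout is a layer inside the model, while L2 is typically applied through the optimizer's weight_decay term. Below is a minimal sketch using a toy CNN; the layer sizes and hyperparameters are illustrative, not our final choices.

```python
import torch
import torch.nn as nn

def make_simple_cnn(use_dropout: bool) -> nn.Sequential:
    """A toy CNN for 32x32 CIFAR10 images; a Dropout layer is inserted
    only under the Dropout condition."""
    layers = [
        nn.Conv2d(3, 16, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),                      # 32x32 -> 16x16
        nn.Flatten(),
    ]
    if use_dropout:
        layers.append(nn.Dropout(p=0.5))
    layers.append(nn.Linear(16 * 16 * 16, 10))
    return nn.Sequential(*layers)

# Condition 1: no regularization.
plain = make_simple_cnn(use_dropout=False)
opt_plain = torch.optim.Adam(plain.parameters())

# Condition 2: Dropout regularization (a layer in the architecture).
dropped = make_simple_cnn(use_dropout=True)
opt_drop = torch.optim.Adam(dropped.parameters())

# Condition 3: L2 regularization via the optimizer's weight_decay term.
l2 = make_simple_cnn(use_dropout=False)
opt_l2 = torch.optim.Adam(l2.parameters(), weight_decay=1e-4)

# One illustrative training step, identical across conditions.
x = torch.randn(4, 3, 32, 32)                 # a fake CIFAR10 batch
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(l2(x), y)
opt_l2.zero_grad()
loss.backward()
opt_l2.step()
```

Keeping the training loop identical across conditions means any difference in the learned models is attributable to the regularizer alone.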

This design results in one trained model for each combination of architecture and regularization, nine in total. Once the models are trained, a thorough explainability analysis will be performed. Specifically, images will be generated via gradient ascent that maximally activate select neurons, in order to explain "what" each model is looking for. Additionally, heat maps will be created that illustrate "where" the model is looking. Combining the "what" and the "where" should let us conceptualize the "how," bringing us a step closer to explaining what each model is "thinking." Once this process is done for all trained models, a comparative analysis will follow, in which we compare the explainability of each model with respect to regularization. Our hypothesis is that L2 regularization will increase explainability because it forces the model to avoid extreme weight values. Dropout, however, could either increase or decrease explainability because it is hypothesized to force redundancy. In other words, we believe there may be a tradeoff between explainability and accuracy.
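The two procedures can be sketched as follows: gradient ascent optimizes a noise image to excite one neuron (the "what"), while a simple saliency map backpropagates a class score to the pixels (the "where"). The model below is a hypothetical stand-in; real runs would use the trained networks.

```python
import torch
import torch.nn as nn

def maximally_activating_image(model, neuron_index, steps=50, lr=0.1):
    """Gradient ascent on the input: start from noise and repeatedly step
    in the direction that increases one output neuron's activation."""
    model.eval()
    img = torch.randn(1, 3, 32, 32, requires_grad=True)  # CIFAR10-sized noise
    optimizer = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        activation = model(img)[0, neuron_index]
        (-activation).backward()        # ascend by minimizing the negative
        optimizer.step()
    return img.detach()

def saliency_heatmap(model, img, target_class):
    """'Where' the model looks: gradient of the target logit w.r.t. pixels."""
    model.eval()
    img = img.clone().requires_grad_(True)
    model(img)[0, target_class].backward()
    return img.grad.abs().max(dim=1).values   # collapse channels -> 1 x H x W

# Toy usage with a stand-in model (not one of our trained architectures).
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
vis = maximally_activating_image(toy, neuron_index=3, steps=10)
heat = saliency_heatmap(toy, torch.randn(1, 3, 32, 32), target_class=3)
```

For convolutional networks, the same ascent loop can target an internal channel via a forward hook instead of an output logit.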

Alternatively, if we are not able to find significant differences in explainability between the models with respect to regularization, we will compare them with respect to architecture. It may be possible that regularization only has a strong impact on one of the architectures being studied! On the other hand, it could be the case that one architecture is significantly more interpretable (making it more desirable for real-world applications).

This type of research is at the heart of a rapidly-growing subfield of artificial intelligence: explainable deep learning. We hope to further our understanding of how modifications that claim to boost accuracy (namely, regularization) can impact interpretability, which is extremely important.

Metrics

The main experiments are the "what" and "where" analyses detailed in the "Methodology" section. Critically, accuracy is not the most important metric for this research. We are comparing the explainability of each model, which is somewhat subjective. We hope to make a strong case for each type of regularization either increasing or decreasing model explainability. Furthermore, we would like to extrapolate from the objective deliverables (the images created with gradient ascent and the heat maps that detail where the model is looking) to a theory for why each type of regularization has the observed effect. For example, if L2 regularization negatively impacts explainability but Dropout does not, we would like to formulate educated theories for why this is the case.

The base goal is to train and interpret 4 total models: ResNet and VGG, each with 2 different regularization conditions (no regularization and L2). They will at least be interpreted with the generation of heat maps (the "where") for select neurons, as well as comparative analysis (written).

The target goal is to train and interpret 6 total models: ResNet and VGG, each with 3 different regularization conditions (no regularization, Dropout, and L2). They will be interpreted with heat maps (the "where") and maximum activation images (the "what"), along with comparative analysis (written).

The stretch goal is to train and interpret 9 or more total models: ResNet, VGG, and a Custom Simple CNN, each with 3 or more different regularization conditions (no regularization, Dropout, and L2). They may be interpreted using heat maps (the "where"), maximum activation images (the "what"), and some other explainability metrics. Finally, a comparative analysis will be written.

Ethics

What broader societal issues are relevant to your chosen problem space?

One of the largest criticisms of deep learning nets in real world applications is the fact that many models are a black box; it’s very difficult to look at a model and its weights to understand what patterns are being recognized and used to make predictions. While in certain applications this isn’t a problem, in others it is imperative to know on what basis a model has made its decision, such as deciding who may be released on bond in a court of law.

It is important to understand what makes a model explainable. Here, we use explainability in the sense that the model makes its predictions based on characteristics that make sense to humans. For example, an explainable model would determine that a picture of a cat is indeed a cat by looking at key features of the cat's face and body, not at peripheral heuristic patterns. Such a model is explainable because, if it repeatedly makes predictions on sensible grounds (according to humans), it can better be applied to human-centered applications, even if it takes a hit on accuracy to do so.

We are interested in helping shed light on the effect of different regularizations on the explainability of image recognition nets. Regularization helps keep parameter values under control, and thus may help the architecture learn a more balanced model. In doing so, this may encourage the architecture to learn differently, and its effects on explainability are unclear. In shedding light on this matter, we hope to determine what regularization method is best for image recognition in cases where model explainability is valuable, and how best to optimize the tradeoff between explainability and accuracy through regularization.

What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?

Note: this answer is from when we planned to use ImageNet. We believe that similar biases exist in all datasets, including CIFAR10, so we have elected to keep this in our Devpost submission. Our dataset is ImageNet, which contains over 14 million images to train and analyze. ImageNet is organized into synsets, which are sets of images that are related to each other, and there are over 21,000 synsets to analyze. Further, there are 12 subtrees to choose from, which are higher-level categories such as "mammal" or "vehicle".

ImageNet (https://image-net.org/static_files/papers/imagenet_cvpr09.pdf) was collected by using WordNet synonyms to find images for each synset, then appending related words to find additional images through internet search engines. The authors note that several search engines were used, so the selection ideally is not susceptible to the bias of any one engine, although Google Images was likely used most, and the selection could therefore be subject to Google search's biases.

The images are labelled using the Amazon Mechanical Turk service, which pays human workers a small sum per task. The authors of the ImageNet paper gave labellers the image along with a Wikipedia page corresponding to the label, and asked the labellers to verify that the two matched. To avoid the bias of any one labeller, several labellers are assigned the same image and the majority vote is taken, so individual label bias is also curtailed. We will also be using the bounding-box labelled dataset to help determine explainability (https://image-net.org/download-bboxes.php).

In further efforts against bias, a 2020 paper (https://dl.acm.org/doi/abs/10.1145/3351095.3375709) describes what has been done to reduce bias in the people subtree by mitigating the limitations of WordNet, the lack of unbiased images for certain concepts, and the inequality of representation. Synsets with offensive vocabulary were filtered and labelled as "offensive" or "sensitive," as were non-imageable concepts. To improve representation, the authors developed a system through which one may request more balanced images (balancing for gender, race, age, etc.), helping to reduce such bias and yield a more representative image sample.

Division of Labor

Responsibilities:

--Training ResNet with each regularization condition [ Jay ].

--Training VGG with each regularization condition [ Yousef ].

--Generating maximum activation images ("what") for each ResNet model [ Yousef ].

--Generating maximum activation images ("what") for each VGG model [ Yousef ].

--Generating heat maps ("where") for each ResNet model [ Jay ].

--Generating heat maps ("where") for each VGG model [ Jay ].

--Writing a detailed comparative analysis with respect to regularization [ Jay & Yousef, Collaborative Discussion ].


Updates


UPDATE FOR DATASET ETHICS --- CIFAR-10

Our dataset is CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html), a dataset of 60,000 32x32 images across 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes are noted to be mutually exclusive, meaning there is no overlap between them. CIFAR-10 is organized by the Canadian Institute for Advanced Research, and its images were collected for labelling from the 80 Million Tiny Images dataset.

There are definite concerns regarding the 80 Million Tiny Images dataset, which was collected using nouns from WordNet. The organizers even withdrew the dataset in June 2020 (https://groups.csail.mit.edu/vision/TinyImages/) due to the presence of "derogatory terms as categories and offensive images," a direct result of using WordNet to generate the image queries. WordNet is a database of English words, and thus contains both typical and sensitive words, which influenced the collection of the 80 million tiny images.

While there is certainly concern that biases and sensitive graphics propagated into the CIFAR-10 dataset, given that it is based on the 80 Million Tiny Images dataset, our concerns are softened by the fact that CIFAR-10 has already been prescreened and labelled for generally acceptable classes, so overtly offensive images are not a primary concern. However, we are concerned about how our model may use or perpetuate biases to make classifications. For example, if CIFAR-10 contains many images unfairly overrepresenting a certain race within a certain class, the model may learn an unfair association and classify images based on race rather than on the main subject itself.

Otherwise, given the main focus of CIFAR-10 is inanimate objects that are relatively benign, and each category has a variety of different images for each classification (old planes, new planes, jets, etc), we believe that CIFAR-10 is representative for our purposes.



REFLECTION

Introduction:

Deep learning has made tremendous progress ever since the inception of AlexNet, and the field is constantly making improvements to solve problems. Regularization is a technique often utilized by researchers for a variety of reasons, such as to prevent extreme weight values and/or to encourage sparsity in weights. Although it is traditionally acknowledged that regularization has the potential to increase accuracy and other metrics, the effects of regularization on explainability writ large have not been explored extensively. This research undertakes a detailed analysis of the effects of different types of regularization on the explainability of several models for image classification.

Challenges:

The largest limitation we have encountered is the sheer amount of computational resources required to properly train models. At first, we had planned on using ImageNet, and training two open-sourced architectures (Resnet50 and VGG19) with multiple regularizers. However, simply training one model in this fashion took over a day! This left very little time for debugging, fine-tuning, or a thorough explainability analysis. Therefore, we decided to modify our methodology to overcome the computational resource barrier. Specifically, we elected to use CIFAR10, as opposed to a larger dataset such as ImageNet or CIFAR100. We also decided to write our own convolutional neural network using PyTorch so that we could have at least one set of models (our CNN with multiple regularizers) that could be trained relatively quickly and tested extensively.

Insights:

We have completed a significant portion of the project already, and are excited to share our results! They can be found at this link: https://github.com/JayRGopal/reg-explain.

The models we have trained, as well as the Colab Notebooks we have written, can be found at this link: https://drive.google.com/drive/u/0/folders/1JDJhca-zm6o2pFi3PdZ1psn39_kc0gc3.

In summary, we have completed the simple CNN, and trained it (with various regularizers) along with multiple off-the-shelf models on CIFAR10. We have made extensive progress on the explainability analysis, and have uploaded the most important qualitative findings to GitHub in the form of images. For example, we selected a representative image of each class and computed the overlaid gradients of each model on those images. This allowed us to show whether regularization helps models “focus” on what humans consider the most important features.
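The overlay step described above can be sketched as follows; the normalization and blending weight here are illustrative choices, not necessarily the exact ones in our repository.

```python
import torch

def overlay_gradients(image, grad_map, alpha=0.5):
    """Blend a normalized gradient map onto the original image so the
    model's 'focus' regions can be inspected visually."""
    g = grad_map.abs()
    g = (g - g.min()) / (g.max() - g.min() + 1e-8)   # rescale to [0, 1]
    return (1 - alpha) * image + alpha * g.expand_as(image)

# Illustrative call: a CIFAR10-sized image and a single-channel gradient
# map (e.g. a saliency map obtained by backpropagating a class score).
image = torch.rand(3, 32, 32)
grad_map = torch.randn(1, 32, 32)
overlaid = overlay_gradients(image, grad_map)
```

Because the gradient map is rescaled per image, overlays are comparable across models even when their raw gradient magnitudes differ.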

Our expectations were that certain regularizers would increase explainability, while others would not. Notably, we thought that L2, by decreasing the magnitude of the weights, would spread out the model’s gradients, broadening its visual field. We believed that dropout would serve to force the model to learn redundantly, meaning it would focus on multiple important features. We thought that L1, by encouraging sparsity of weights, would decrease overall explainability, as some important areas could have their impact reduced to zero.
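Since PyTorch optimizers only build in L2 (via weight_decay), an L1 penalty has to be added to the task loss by hand. A minimal sketch of one such training step, with an illustrative coefficient:

```python
import torch
import torch.nn as nn

def l1_penalty(model, lam=1e-4):
    """Sum of absolute parameter values, scaled by the L1 coefficient;
    this is the term that pushes weights toward exact zeros."""
    return lam * sum(p.abs().sum() for p in model.parameters())

# One training step on a toy model: the L1 term is simply added to the
# cross-entropy loss before backpropagation.
model = nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y) + l1_penalty(model)
opt.zero_grad()
loss.backward()
opt.step()
```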

Our findings are very interesting, and are not fully in line with our expectations. The Simple CNN’s response to regularization is quite similar to our expectations. For example, L2 regularization seems to dampen the model’s focus on the important areas of images. However, larger models are extremely difficult to interpret to begin with. Even with various regularizers, they seem to remain mysterious. We look forward to continuing this work to be able to make conclusions about regularization’s impact on models such as Resnet50.

Plan:

At present, we are ahead of schedule with our project. We aim to hit our stretch goal, which is to train and interpret a total of 9 models (Simple CNN, Resnet50, and VGG19, each with 3 different regularization conditions). Additionally, due to the ease of training and interpreting the Simple CNN, we have added multiple new interpretability metrics! For this model, we are incorporating Integrated Gradients, as well as DeepLIFT, for an even more extensive analysis of regularization's effect. Due to insufficient memory on Google Colab, we are unable to run these additional metrics on the larger models at this time.
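We rely on Captum's implementations in practice, but the idea behind Integrated Gradients can be sketched from scratch: average the input gradients along a straight path from a baseline to the input, then scale by the input-baseline difference. The all-zero baseline and toy model below are illustrative assumptions, not our actual setup.

```python
import torch
import torch.nn as nn

def integrated_gradients(model, x, target, steps=32):
    """From-scratch sketch of Integrated Gradients with a zero baseline:
    average d(model output)/d(input) at points along the straight path
    from the baseline to x, then scale by (x - baseline)."""
    baseline = torch.zeros_like(x)
    total = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        model(point)[0, target].backward()
        total += point.grad
    return (x - baseline) * total / steps

# Toy usage: on a purely linear model, the attributions sum exactly to
# F(x) - F(baseline), the method's "completeness" property.
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.randn(1, 3, 32, 32)
attr = integrated_gradients(toy, x, target=0)
```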

Thus far, we have had to wait for models to train on OSCAR. Even on the CIFAR10 dataset, large models such as Resnet50 do not train quickly! We would now like to dedicate more time to generating key images for the explainability endeavor. The main changes from our original plan were detailed in the Challenges section. To reiterate, we chose a more manageable dataset (CIFAR10), and added a custom model (Simple CNN) to facilitate fast training.

Citations:

Here are the important public repositories we have utilized thus far:

https://github.com/rwightman/pytorch-image-models

https://github.com/greentfrapp/lucent

https://github.com/pytorch/captum
