Title
Analysis of the effects of different regularization methods on the explainability of image classification models.
Who
Jay Gopal (jgopal), Yousef Elgodamy (yelgodam)
Introduction
Deep learning has made tremendous progress since the inception of AlexNet, and the field continues to improve rapidly. Regularization is a technique often used by researchers for a variety of reasons, such as preventing extreme weight values and/or encouraging sparsity in the weights. Although it is widely acknowledged that regularization has the potential to improve accuracy and other metrics, its effects on explainability writ large have not been explored extensively. This research undertakes a detailed analysis of the effects of different types of regularization on the explainability of several open-source models, as well as a custom-built CNN.
Related Work
Research has been conducted on regularization's ability to increase accuracy. For example, this paper (https://proceedings.neurips.cc/paper/2016/hash/41bfd20a38bb1b0bec75acf0845530a7-Abstract.html) by Wen et al. explores the effects of inducing sparsity (similar to L1 regularization) on models' accuracy, showing that regularization can improve performance. As another example, this paper (https://arxiv.org/abs/1712.09936) by Varga et al. argues that regularization of the input gradient can greatly aid image classification (using accuracy as the metric). However, this area of research often omits one critical component: model explainability. The effects of well-known types of regularization, such as L1 and L2, on interpretability remain under-researched. To date, we have been unable to find a paper that focuses on this specific question. Search phrases used by the authors include ["Regularization" + "explainability" + "deep learning"], [Does regularization affect deep learning model explainability?], and ["Regularization" + "interpretability" + "deep learning"].
Data
The dataset for this study will be CIFAR10 (https://www.cs.toronto.edu/~kriz/cifar.html), a publicly available and widely used benchmark dataset in deep learning.
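As a rough sketch (not final code), we expect to load CIFAR10 through torchvision along the following lines; the batch sizes and normalization constants here are placeholders, not final choices.

```python
# Minimal sketch of the planned CIFAR-10 loading pipeline using torchvision.
# Normalization statistics and batch sizes are illustrative assumptions.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    # Commonly cited CIFAR-10 channel means/stds; treated here as placeholders.
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False, num_workers=2)
```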
Methodology
We will be training a simple CNN, ResNet50, and VGG19, each under the following three conditions: no regularization, Dropout regularization, and L2 regularization. All off-the-shelf networks will use the standard, open-source architectures, plus the added regularization where applicable. Since these models are too large to train locally, they will be trained on the Ocean State Center for Advanced Resources (OSCAR) cluster, with access provided by Dr. Thomas Serre's lab at Brown University's Carney Institute for Brain Science. Models will be trained using PyTorch.
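A minimal sketch of how the three regularization conditions might be configured in PyTorch is shown below. The hyperparameters, the use of ResNet50 as the example, and the placement of dropout before the final classifier layer are our assumptions for illustration, not final design decisions.

```python
# Sketch of the three regularization conditions (none, dropout, L2) in PyTorch.
# Hyperparameters and dropout placement are illustrative assumptions.
import torch.nn as nn
import torch.optim as optim
from torchvision import models

def build_model(condition: str, num_classes: int = 10):
    model = models.resnet50(num_classes=num_classes)  # randomly initialized, no pretrained weights
    if condition == "dropout":
        # Insert dropout before the final fully connected layer (placement is our choice).
        model.fc = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(model.fc.in_features, num_classes))
    return model

def build_optimizer(model, condition: str):
    # L2 regularization is applied through the optimizer's weight_decay term.
    weight_decay = 5e-4 if condition == "l2" else 0.0
    return optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=weight_decay)

condition = "l2"  # one of: "none", "dropout", "l2"
model = build_model(condition)
optimizer = build_optimizer(model, condition)
criterion = nn.CrossEntropyLoss()
```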
This design results in one trained model for each combination of architecture and regularization. Once the models are trained, a thorough explainability analysis will be performed. Specifically, images will be generated via gradient ascent to maximally activate selected neurons, revealing "what" each model is looking for. Additionally, heat maps will be created that illustrate "where" the model is looking. Combining the "what" and the "where" should let us conceptualize the "how," bringing us a step closer to explaining what each model is "thinking." Once this process is complete for all trained models, a comparative analysis will follow, in which we compare the explainability of each model with respect to regularization. Our hypothesis is that L2 regularization will increase explainability because it forces the model to avoid extreme weight values. Dropout, however, could either increase or decrease explainability because it is hypothesized to force redundancy. Taken together, this suggests there may be an adversarial relationship (a trade-off) between explainability and accuracy.
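To make the two probes concrete, here is a rough sketch of the kind of code we anticipate; the target layer, channel, step counts, and step sizes are assumptions, and the "where" example uses a simple input-gradient saliency map (Grad-CAM is an alternative we may adopt instead).

```python
# Rough sketch of the "what" (activation maximization) and "where" (saliency) probes.
# Target layer, channel index, and optimization settings are illustrative assumptions.
import torch

def maximally_activating_image(model, layer, channel, steps=200, lr=0.1, size=32):
    """Gradient ascent on the input to maximize one channel's mean activation ("what")."""
    model.eval()
    activation = {}

    def hook(module, inputs, output):
        activation["value"] = output

    handle = layer.register_forward_hook(hook)
    image = torch.randn(1, 3, size, size, requires_grad=True)
    optimizer = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        model(image)
        loss = -activation["value"][0, channel].mean()  # negate so a step descends the loss but ascends the activation
        loss.backward()
        optimizer.step()
    handle.remove()
    return image.detach()

def saliency_heatmap(model, image, target_class):
    """Input-gradient saliency map ("where") for a single image of shape (1, 3, H, W)."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    return image.grad.abs().max(dim=1)[0]  # max over color channels -> (1, H, W) heat map

# Hypothetical usage with a ResNet-style model:
# what_img = maximally_activating_image(model, model.layer4, channel=0)
# where_map = saliency_heatmap(model, sample_image, target_class=3)
```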
Alternatively, if we are not able to find significant differences in explainability between the models with respect to regularization, we will compare them with respect to architecture. It may be possible that regularization only has a strong impact on one of the architectures being studied! On the other hand, it could be the case that one architecture is significantly more interpretable (making it more desirable for real-world applications).
This type of research is at the heart of a rapidly growing subfield of artificial intelligence: explainable deep learning. We hope to further our understanding of how modifications that claim to boost accuracy (namely, regularization) can impact interpretability, which matters most when models are deployed in settings where their decisions must be justified.
Metrics
The main experiments are the "what" and "where" analyses detailed in the "Methodology" section. Critically, accuracy is not the most important metric for this research. We are comparing the explainability of each model, which is somewhat subjective. We hope to make a strong case for each type of regularization either increasing or decreasing model explainability. Furthermore, we would like to extrapolate from the objective deliverables (the gradient-ascent images and the heat maps showing where the model is looking) to a theory of why each type of regularization has the observed effect. For example, if L2 regularization negatively impacts explainability but Dropout does not, we would like to formulate educated theories for why this is the case.
The base goal is to train and interpret 4 models in total: ResNet and VGG, each under 2 regularization conditions (no regularization and L2). At a minimum, they will be interpreted by generating heat maps (the "where") for selected neurons, accompanied by a written comparative analysis.
The target goal is to train and interpret 6 total models: ResNet and VGG, each with 3 different regularization conditions (no regularization, Dropout, and L2). They will be interpreted with heat maps (the "where") and maximum activation images (the "what"), along with comparative analysis (written).
The stretch goal is to train and interpret 9 or more total models: ResNet, VGG, and a Custom Simple CNN, each with 3 or more different regularization conditions (no regularization, Dropout, and L2). They may be interpreted using heat maps (the "where"), maximum activation images (the "what"), and some other explainability metrics. Finally, a comparative analysis will be written.
Ethics
What broader societal issues are relevant to your chosen problem space?
One of the largest criticisms of deep learning in real-world applications is that many models are black boxes; it is very difficult to look at a model and its weights and understand which patterns are being recognized and used to make predictions. While in certain applications this is not a problem, in others it is imperative to know on what basis a model made its decision, such as deciding who may be released on bond in a court of law.
It is important to understand what makes a model explainable. Here, we use explainability in the sense that the model bases its predictions on characteristics that make sense to humans. For example, an explainable model would determine that a picture of a cat is indeed a cat by looking at key features of the cat's face and body, not at peripheral heuristic patterns. Such a model is explainable because, if it repeatedly makes predictions on grounds that are sensible to humans, it can more readily be applied to human-centered applications, even if it takes a hit on accuracy to do so.
We are interested in shedding light on the effect of different regularization methods on the explainability of image recognition networks. Regularization helps keep parameter values under control and thus may help the architecture learn a more balanced model. In doing so, it may encourage the architecture to learn differently, and its effects on explainability are unclear. By clarifying this matter, we hope to determine which regularization method is best for image recognition in cases where model explainability is valuable, and how best to manage the trade-off between explainability and accuracy through regularization.
What is your dataset? Are there any concerns about how it was collected, or labeled? Is it representative? What kind of underlying historical or societal biases might it contain?
Note: this answer is from when we planned to use ImageNet. We believe similar biases exist in all datasets, including CIFAR10, so we have elected to keep it in our Devpost submission. Our originally planned dataset was ImageNet, which contains over 14 million images to train and analyze. ImageNet is organized into synsets, which are sets of images related to each other, and there are over 21,000 synsets to analyze. Further, there are 12 subtrees to choose from, which are higher-level categories such as "mammal" or "vehicle".
ImageNet (https://image-net.org/static_files/papers/imagenet_cvpr09.pdf) was collected by using WordNet synonyms to find images for each synset, then appending related words to find additional images through internet search engines. The authors note that several search engines were used, so the selection ideally is not susceptible to the bias of any single engine, although it is likely that Google Images was used most, and thus the selection could be subject to Google Search's biases.
The images were labelled using the Amazon Mechanical Turk service, which pays human workers a small fee per task. The authors of the ImageNet paper state that they give labellers the image and a Wikipedia page corresponding to the label, and ask the labellers to verify that the two match. To avoid the bias of a single labeller, several labellers are assigned the same image and the majority vote is taken, so individual label bias is also curtailed. We will also use the bounding-box-labelled dataset to help assess explainability (https://image-net.org/download-bboxes.php).
In further efforts against bias, a 2020 paper (https://dl.acm.org/doi/abs/10.1145/3351095.3375709) describes work to reduce bias in the people subtree by mitigating the limitations of WordNet, the lack of unbiased images for certain concepts, and the inequality of representation. Synsets with offensive vocabulary were filtered and labeled as "offensive" or "sensitive", and non-imageable concepts were likewise labeled. To improve representation, the authors developed a system through which one may request a more balanced set of images (balancing for gender, race, age, etc.), helping reduce such bias and yielding a more representative image sample.
Division of Labor
Responsibilities:
--Training ResNet with each regularization condition [ Jay ].
--Training VGG with each regularization condition [ Yousef ].
--Generating maximum activation images ("what") for each ResNet model [ Yousef ].
--Generating maximum activation images ("what") for each VGG model [ Yousef ].
--Generating heat maps ("where") for each ResNet model [ Jay ].
--Generating heat maps ("where") for each VGG model [ Jay ].
--Writing a detailed comparative analysis with respect to regularization [ Jay & Yousef, Collaborative Discussion ].