HoneyPot: Model-Agnostic Spuriosity Detection Framework

Title: Project HoneyPot

Who: bli116, szhu46, vsharm44, vojewale

Introduction: Our project addresses spuriosity and bias in large models. More specifically, we aim to both amplify and uncover hidden biases within models, and potentially to let models identify and rank images by spuriosity. Models could then group images by spuriosity (and therefore by bias) outside of the Salient ImageNet distribution we work with.

Related Work: Salient ImageNet: How to discover spurious features in Deep Learning? - Singla and Feizi, https://arxiv.org/abs/2110.04301

This paper describes the creation of Salient ImageNet, a dataset of about 52,000 images from 232 ImageNet classes. Each image is annotated with the features relevant to its classification, labeled as either core or spurious. Spurious features are ones that should not matter for how an image is classified (for example, background elements), in contrast to the core elements of the object itself. These features come with masks over the image, which make it relatively simple to analyze which features a classifier uses to decide the class, and whether those features are core or spurious.

Core Risk Minimization using Salient ImageNet - Singla, Moayeri, Feizi, https://arxiv.org/abs/2203.15566

This paper first expands Salient ImageNet into Salient ImageNet-1M, with over 1 million images from all ImageNet classes, each with its own core and spurious masks. The researchers then evaluate how different transformer models may rely on spurious elements by comparing penultimate-layer heatmaps against each image's core and spurious masks. Finally, they introduce a learning paradigm called "Core Risk Minimization" (CoRM), which essentially fine-tunes classifiers on low-spuriosity images to improve core accuracy (i.e., less reliance on spurious features) without any drop in clean accuracy.
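The heatmap-to-mask comparison described above can be made concrete by asking what share of a heatmap's total activation falls inside the core mask and what share falls inside the spurious mask. The sketch below is a minimal illustration under stated assumptions: heatmaps are non-negative, already resized to the mask resolution, and stored as nested Python lists. The function names `region_fraction` and `core_spurious_fractions` are hypothetical, and this is not necessarily the exact measure Singla et al. use.

```python
from typing import List, Tuple

Grid = List[List[float]]  # 2D array stored as nested lists (rows of pixels)


def region_fraction(heatmap: Grid, mask: Grid) -> float:
    """Share of the heatmap's total activation that falls inside `mask`.

    Assumes a non-negative heatmap with the same shape as `mask`. The mask may
    be binary (0/1) or soft (values in [0, 1]); soft values weight each pixel.
    """
    total = sum(sum(row) for row in heatmap)
    if total == 0:
        return 0.0  # an all-zero heatmap attends to nothing
    inside = sum(
        h * m
        for h_row, m_row in zip(heatmap, mask)
        for h, m in zip(h_row, m_row)
    )
    return inside / total


def core_spurious_fractions(
    heatmap: Grid, core_mask: Grid, spurious_mask: Grid
) -> Tuple[float, float]:
    """Return (core_fraction, spurious_fraction) for one image's heatmap."""
    return region_fraction(heatmap, core_mask), region_fraction(heatmap, spurious_mask)


# Toy 2x2 example: 1/4 of the activation is on the core pixel, 3/4 on the spurious one.
heatmap = [[1.0, 3.0], [0.0, 0.0]]
core = [[1, 0], [0, 0]]
spurious = [[0, 1], [0, 0]]
print(core_spurious_fractions(heatmap, core, spurious))  # (0.25, 0.75)
```

The two fractions need not sum to 1, since some activation can fall outside both masks (and soft masks may overlap).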

Data: We will be using Salient ImageNet-1M, introduced in https://arxiv.org/abs/2203.15566. Salient ImageNet-1M consists of over a million images from all 1000 classes of ImageNet, each annotated with core and spurious features. If our results on Salient ImageNet suggest that the approach scales beyond a single dataset, we will also apply them to a larger subset of ImageNet.

Methodology: What is the architecture of your model? We will use an adversarially trained ResNet architecture to train and then fine-tune multiple classifiers on Salient ImageNet. We will fine-tune three models, each on images with a different level of spuriosity (low, medium, or high). For our stretch goal, we will use Mistral AI's LLM to compare our Salient ImageNet results against a model trained on a much larger dataset.
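Splitting the data into the three fine-tuning sets could look roughly like the sketch below. It assumes each image already has a scalar spuriosity score, where higher means more reliance on spurious features; the function name, the score dictionary, and the equal-thirds cut-offs are illustrative assumptions rather than the project's settled design.

```python
from typing import Dict, List, Tuple


def split_by_spuriosity(
    scores: Dict[str, float],
) -> Tuple[List[str], List[str], List[str]]:
    """Split image IDs into low / medium / high spuriosity groups of near-equal size.

    `scores` maps a (hypothetical) image ID to a scalar spuriosity score.
    When the count is not divisible by three, the extra images go to the
    medium and high groups.
    """
    ordered = sorted(scores, key=scores.__getitem__)  # ascending spuriosity
    n = len(ordered)
    cut_low, cut_high = n // 3, (2 * n) // 3
    return ordered[:cut_low], ordered[cut_low:cut_high], ordered[cut_high:]


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3, "e": 0.7, "f": 0.2}
print(split_by_spuriosity(scores))  # (['b', 'f'], ['d', 'c'], ['e', 'a'])
```

Because spuriosity is most meaningful relative to other images of the same class, one would likely run this split within each class and then pool the per-class groups, so that all three fine-tuning sets cover the same classes.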

Metrics: Our project has several goals, each with its own success metric.

The first success metric (base goal) is how well we can amplify bias within our model, specifically by fine-tuning it on high-spuriosity images. This metric will be largely qualitative: we will compare heatmaps of the penultimate layer to determine whether the model relies more on spurious and less on core features to classify, and is therefore more biased.
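Although this metric is framed as qualitative, a simple number could accompany the heatmap comparison. The sketch below assumes that each image's core and spurious attention fractions have already been computed from the heatmaps of a baseline model and of the fine-tuned model (how they are computed is left open here). It reports how much the average spurious share of that attention shifts after fine-tuning. The names `spurious_share` and `bias_amplification` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# Per image: (core_fraction, spurious_fraction) of heatmap activation.
Fractions = List[Tuple[float, float]]


def spurious_share(fracs: Fractions) -> float:
    """Mean share of core+spurious attention that lands on spurious regions.

    Images with no attention on either region count as 0. Raises
    statistics.StatisticsError if `fracs` is empty.
    """
    shares = []
    for core, spur in fracs:
        denom = core + spur
        shares.append(spur / denom if denom > 0 else 0.0)
    return mean(shares)


def bias_amplification(baseline: Fractions, finetuned: Fractions) -> float:
    """Positive values mean the fine-tuned model leans more on spurious regions."""
    return spurious_share(finetuned) - spurious_share(baseline)


baseline = [(0.75, 0.25), (0.5, 0.5)]
finetuned = [(0.25, 0.75), (0.5, 0.5)]
print(bias_amplification(baseline, finetuned))  # 0.25
```

Normalizing by core plus spurious attention, rather than by total attention, keeps the measure focused on the trade-off between the two regions that the masks define.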

Our next success metric (target goal) is how closely the model-generated spuriosity rankings match the human-derived spuriosity rankings in Salient ImageNet. If the model ranks images within Salient ImageNet similarly to how humans ranked them, we will take this as evidence that the model can be applied to datasets beyond Salient ImageNet.
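One standard way to score agreement between two rankings is Spearman's rank correlation (rho): 1 means identical order, -1 means reversed order, and values near 0 mean little agreement. The self-contained sketch below assumes that both the model output and the human annotations can be reduced to one scalar spuriosity score per image ID, which is an assumption about how the rankings are represented. In practice, `scipy.stats.spearmanr` computes the same statistic.

```python
import math
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(model_scores: Dict[str, float], human_scores: Dict[str, float]) -> float:
    """Spearman's rho between two spuriosity rankings over their shared image IDs."""
    ids = sorted(set(model_scores) & set(human_scores))
    if len(ids) < 2:
        raise ValueError("need at least two shared image IDs")
    rx = average_ranks([model_scores[img_id] for img_id in ids])
    ry = average_ranks([human_scores[img_id] for img_id in ids])
    n = len(ids)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("a ranking with all tied scores has no defined correlation")
    # The Pearson correlation of the ranks is Spearman's rho (this form also handles ties).
    return cov / math.sqrt(var_x * var_y)


model = {"img1": 0.1, "img2": 0.4, "img3": 0.35, "img4": 0.8}
human = {"img1": 1.0, "img2": 2.0, "img3": 3.0, "img4": 4.0}
print(spearman(model, human))  # 0.8 (one adjacent pair swapped)
```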

Our final success metric (stretch goal) involves comparing our results against Mistral AI's model, or another publicly available LLM trained on datasets larger than Salient ImageNet.

Ethics:

What broader societal issues are relevant to your chosen problem space?

The broader societal issue our project tackles is locating and surfacing hidden biases in large models. As larger and more pivotal models come into use, algorithmic bias has become a central concern in most conversations about them. A model that can both evaluate how biased its results are and fine-tune itself to amplify and expose hidden biases would be a novel tool, helping users find the ways in which the models they use are flawed and which factors to take into account when using them.

How are you planning to quantify or measure error or success? What implications does your quantification have?

We plan to measure success by how well our spuriosity rankings transfer, both within our own dataset and to outside datasets. If our model produces spuriosity rankings that are similar to, or generally match, the rankings created by humans (see Mazda Moayeri's paper), we will take this as evidence that the model can be applied outside of Salient ImageNet.

Division of labor: Briefly outline who will be responsible for which part(s) of the project. A detailed division of labor is included in this plan: https://docs.google.com/spreadsheets/d/12UwdebvupZZO6ob27YT_RStOUQKi3Vdo9VcZel_-WMI/edit#gid=0

Project Material:
Final Reflection: Attached as a Google Doc link. https://docs.google.com/document/d/1AZUoxPGb29GVcf5Y5gY1EFdkJ0RobwANmMgLnRmJtD4/edit?usp=sharing
Slides for Presentation: Attached as a Google Slides link. https://docs.google.com/presentation/d/1SJl7j6-L4LD4gY3m67YQKTQM8rMvrJEsuBLky-mSWrc/edit#slide=id.g2cd0934d7b0_0_10
GitHub Repository for the Project: https://github.com/victorojewale/projectHoneyPot/tree/main