Visual Transformers in TensorFlow

Who: Ethan Asis (easis), Isabelle Towle (itowle), Riki Fameli (rfameli1), Obi Chikezie (cchikezie)

Introduction

If you are implementing an existing paper, describe the paper’s objectives and why you chose this paper.

For our project, we would like to implement a Cornell University paper that created a token-based image recognition model that performs faster than CNNs. We thought that this would be an interesting model to implement because CNNs are already so widely used and considered to be a current industry standard. Being able to implement a model that would be faster and more effective was very exciting to all of us.

What kind of problem is this? Classification? Regression? Structured prediction? Reinforcement Learning? Unsupervised Learning? Etc.

Classification, supervised learning.

Related Work: Are you aware of any, or is there any prior work that you drew on to do your project?

Please read and briefly summarize (no more than one paragraph) at least one paper/article/blog relevant to your topic beyond the paper you are re-implementing/novel idea you are researching.

Source: Medium article

CNNs have a number of weaknesses for image processing and scene recognition: they lack a global understanding of the image, they are computationally expensive, and they are domain-specific because they are designed for images and operate on pixel arrays in which individual pixels vary in importance. Vision Transformers (ViTs), on the other hand, have several advantages. According to the article, they outperform state-of-the-art CNNs in both accuracy and computational efficiency, and their self-attention layers let them integrate information globally across the entire image. A ViT first divides the image into fixed-size visual tokens; because the cost of self-attention is quadratic in the number of tokens, computing self-attention over every pixel would not be computationally feasible. It then flattens and linearly embeds each token and adds a positional embedding to retain positional information, and the resulting sequence is the input to the encoder. The encoder consists of a Multi-Head Self-Attention layer and a Multi-Layer Perceptron (two layers with a GELU activation). Layer Norm is applied before every block, and residual connections are applied after every block so that gradients can flow through without passing through non-linear activation functions. The higher layers of a ViT learn global features, while the lower layers learn both global and local features, so the ViT can learn generic patterns.
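
The tokenization step described above can be sketched in plain Python. This is only an illustration under our own assumptions: the image is a nested list of shape height x width x channels, the patch size of 4 is a hypothetical choice, and the function names are ours. The linear embedding and positional embedding are left out.

```python
def image_to_patch_tokens(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must evenly divide the image height and width")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append this pixel's channel values
            tokens.append(patch)
    return tokens


def attention_pairs(num_tokens):
    """Number of token pairs self-attention scores (quadratic in the token count)."""
    return num_tokens * num_tokens


# Hypothetical 32x32 RGB image (CIFAR-sized), filled with zeros for illustration.
image = [[[0, 0, 0] for _ in range(32)] for _ in range(32)]
tokens = image_to_patch_tokens(image, patch_size=4)

print(len(tokens), len(tokens[0]))   # 64 tokens, each of length 4 * 4 * 3 = 48
print(attention_pairs(32 * 32))      # one token per pixel: 1048576 pairs
print(attention_pairs(len(tokens)))  # one token per patch: 4096 pairs
```

For a 32x32 image, moving from per-pixel tokens to 4x4 patches cuts the number of attention pairs by a factor of 256, which is the feasibility argument the article makes.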

In this section, also include URLs to any public implementations you find of the paper you’re trying to implement. Please keep this as a “living list”--if you stumble across a new implementation later down the line, add it to this list.

Data: What data are you using (if any)?

If you’re using a standard dataset (e.g. MNIST), you can just mention that briefly. Otherwise, say something more about where your data come from (especially if there’s anything interesting about how you will gather it).

Datasets: ImageNet, CIFAR-10/100

How big is it? Will you need to do significant preprocessing?

ImageNet has over 14 million images and is organized according to the WordNet hierarchy. Each meaningful concept in WordNet is described by a word or phrase, called a “synonym set” or “synset.” There are more than 100,000 synsets in WordNet. ImageNet provides roughly 1,000 images to illustrate each synset.

CIFAR-10 has 60,000 32x32 RGB images in 10 different classes, with 6,000 images per class. It is divided into 5 training batches and 1 test batch, with each batch having 10,000 images. The test batch contains 1,000 randomly selected images from each class.

CIFAR-100 contains 50,000 training images and 10,000 test images in 100 object classes, which are grouped into 20 superclasses (we will train on the 100 classes). Each image is RGB of size 32x32.

We do not believe that any of our datasets will need significant preprocessing beyond loading the data into our model and batching it. All three datasets are already labeled. However, we will have to take a correctly sized sample of the larger datasets, small enough to run on our computers but large enough to train an accurate model.
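
One possible way to take such a sample is to draw the same number of images from every class so the subset stays balanced. The sketch below is an assumption about how this could be done, not our final pipeline; the function name, the `per_class` parameter, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(labels, per_class, seed=0):
    """Return sorted indices of a class-balanced random subset of the dataset."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < per_class:
            raise ValueError(f"class {label} has fewer than {per_class} examples")
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)


# Toy labels standing in for a real dataset: 3 classes with 10 examples each.
labels = [0] * 10 + [1] * 10 + [2] * 10
subset = stratified_subset(labels, per_class=4)
print(len(subset))  # 12 indices, 4 from each class
```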

Methodology: What is the architecture of your model?

How are you training the model?

The visual transformer needs to be trained in a specific way to avoid overfitting. To this end, we have to incorporate more training epochs, greater data augmentation, stronger regularization, and distillation. To be more specific, we train the VT-ResNet model for 400 epochs. We start with a learning rate of 0.01 and raise it to 0.16 over 5 warmup epochs. After the warmup, we decrease the learning rate by a factor of 0.9875 per epoch. Our batch size is 2048. We also set a stochastic depth survival probability of 0.9 and a dropout ratio of 0.2.
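
The learning rate schedule above can be written as a small function. This is a sketch under two assumptions that the description does not settle: the warmup is linear, and the 5 warmup epochs count toward the 400 total. The function and parameter names are ours.

```python
def learning_rate(epoch, base_lr=0.01, peak_lr=0.16, warmup_epochs=5, decay=0.9875):
    """Learning rate for a 0-indexed epoch: warmup, then per-epoch exponential decay."""
    if epoch < warmup_epochs:
        # Assumption: linear interpolation from base_lr up to peak_lr.
        return base_lr + (peak_lr - base_lr) * epoch / warmup_epochs
    # After warmup, multiply by `decay` once per elapsed epoch.
    return peak_lr * decay ** (epoch - warmup_epochs)


# Print the rate at a few epochs of a 400-epoch run.
for epoch in (0, 1, 5, 6, 100, 399):
    print(epoch, round(learning_rate(epoch), 6))
```

In a Keras training loop, a function with this shape could be passed to a `tf.keras.callbacks.LearningRateScheduler` callback, which accepts a schedule that maps the epoch index to a learning rate.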

If you are implementing an existing paper, detail what you think will be the hardest part about implementing the model here.

Metrics: What constitutes “success?”

What experiments do you plan to run?

We plan to run our image classification model using visual transformers and compare it to image classification using conventional CNNs. Our main hypothesis is that visual transformers will achieve statistically significantly higher accuracy.

For most of our assignments, we have looked at the accuracy of the model. Does the notion of “accuracy” apply for your project, or is some other metric more appropriate?

Accuracy is an appropriate metric because we are performing an image classification task: our model is more successful when it correctly labels a higher proportion of the images it sees.
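
As a concrete reading of this metric, the sketch below computes accuracy as the fraction of images whose predicted label matches the true label. The function name and the toy labels are hypothetical and are not results from our models.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of examples whose predicted label equals the true label."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Made-up labels for four test images, for illustration only.
true_labels = [0, 1, 1, 2]
vt_predictions = [0, 1, 2, 2]
print(accuracy(vt_predictions, true_labels))  # 3 of 4 correct
```

Computing the same number for the CNN baseline on the same test set gives the side-by-side comparison our goals refer to.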

What are your base, target, and stretch goals?

Base: Our base goal is to create a model using visual transformers that is able to classify images with an accuracy roughly equivalent to a model that relies on conventional convolutional neural networks when using the CIFAR dataset.

Target: Our target goal is to create a visual transformer model that consistently achieves higher accuracy than CNNs on the CIFAR-10 and ImageNet datasets.

Stretch: Our stretch goal is to include any of the following features / hit the following goals:

  • Generate metrics and graphics to compare the performance of the VT-based model and the CNN-based model. This can include a compilation of images that were misclassified by each as well as graphs comparing loss over time.
  • Make the model extensible so as to work with most supervised learning image classification datasets.
  • Create a model that is more accurate than the model used in the paper.

Ethics: Choose 2 of the following bullet points to discuss; not all questions will be relevant to all projects so try to pick questions where there’s interesting engagement with your project. (Remember that there’s not necessarily an ethical/unethical binary; rather, we want to encourage you to think critically about your problem setup.)

Why is Deep Learning a good approach to this problem?

This model aims to improve on what has already been proven to be an efficient and effective use of deep learning (CNNs). Image classification is a classic application of neural networks. Because we want our model to generalize to a test dataset, it has to learn a function that uses transformers to find the patterns that distinguish one class of image from another.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

Some of the major stakeholders are the entities that use image classification in their services and products, such as security companies, governments, and developers of self-driving cars. The people on whom these image classification algorithms are used are also affected. If we improve facial recognition, for instance, doors are opened for tech companies to improve security for applications within smartphones, but governments are also able to surveil individuals more accurately and easily. On the other hand, improved facial recognition may lead to fewer mistakes when incriminating individuals, such as the mistake seen in the case of Robert Julian-Borchak Williams.

Other stakeholders are computer scientists and researchers. If a new image classification technique is shown to be more effective than CNNs, it may shift the way that image classification is implemented and taught on a broader scale.

Division of labor: Briefly outline who will be responsible for which part(s) of the project.

  • Riki: preprocessing, visualizer and testing infrastructure (e.g. visualization of misclassified images, loss, accuracy, other metrics, etc.)
  • Isabelle: Preprocessing and training/testing
  • Ethan: training/testing and building model (e.g. call)
  • Obi: training/testing and building model (e.g. call)

Updates

Project Check in 3

Introduction: This can be copied from the proposal.

For our project, we are implementing a Cornell University paper that created a token-based image recognition model that performs better and faster than CNNs. (The paper can be found here). We thought that this would be an interesting model to implement because CNNs are already so widely used and considered to be a current industry standard. Being able to implement a model that would be faster and more effective was very exciting to all of us.

Challenges: What has been the hardest part of the project you’ve encountered so far?

So far, the hardest part of the project has been understanding new concepts that we did not encounter in class. One example is creating a distillation token from a pre-trained CNN, which extracts the most important patterns the CNN has learned so that our VT can stay small while still achieving high accuracy. The coding aspect of our project does not look overly complicated, so most of our struggles have come from understanding the machinery underneath the model. In practice, our main difficulty is converting the PyTorch implementation into a TensorFlow one.

Insights: Are there any concrete results you can show at this point? How is your model performing compared with expectations?

Plan: Are you on track with your project?
What do you need to dedicate more time to?
What are you thinking of changing, if anything?

So far, we have created a CNN model with 70% accuracy that will be the benchmark we try to beat with the Visual Transformer. We have also started converting the existing PyTorch implementation into TensorFlow and have translated most of the classes. Since we are not yet done, though, we are unsure of its effectiveness. Right now we feel like we are on track to finish the project on time: we got a somewhat late start, but we have made good progress every time we meet to work on the project. As long as we keep committing the amount of time that we are right now until the project’s completion, we should be fine. If we had to pick something specific to focus on, it would be finishing the VT model so we can start debugging it in case we did not properly translate everything from PyTorch to TensorFlow. As of right now, we do not have anything that requires changing.
