Attention Really Is All You Need 0_0

Final Poster

Title

Attention Really Is All You Need

Who

Heon Lee hlee184 Ilana Nguyen inguyen4 Kelvin Jiang kjiang32

Introduction

We are implementing the paper “An Image is Worth 16X16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy, et al. Traditionally, image classification models rely on convolution layers; however, the authors propose a model architecture called ViT using a novel approach with a transformer encoder that uses attention, an architecture commonly used in natural language processing tasks. The model divides the image into a grid of patches and applies attention with respect to the flattened patches, which are fed into the transformer encoder. Because each patch is treated as a token, which means the transformer encoder can learn to attend to different patches to extract relevant features for the image recognition task. We follow the architecture of the ViT model to perform image classification on the tiny Imagenet dataset, albeit with reduced model size.

We chose this paper because it provides a relatively novel approach to vision models, particularly to the task of image classification. We find it especially interesting how to attention mechanism in the transformer model originally developed for text input can be applied to image "patches" instead.

Related Work

Some related/prior relevant works include the seminal "Attention is All You Need" paper (Vaswani et al., 2017), which was the original paper to introduce the concept of attention, "image GPT (iGPT)" (Chen et al., 2020a), which applies Transformers to image pixels after downsizing resolution, and "On the Relationship between Self-Attention and Convolutional Layers" (Cordonnier et al., 2019), which is the approach that the ViT model follows, with increasing the possible image size and showing the superiority of vanilla Transformers.

Data

The data we used comes from Tiny ImageNet and CIFAR-10. The Tiny ImageNet training dataset has 100,000 64x64 images sorted into 200 classes, so there are 500 images per class. The testing dataset has 10,000 images. CIFAR-10 has 60,000 32x32 images sorted into 10 classes, so there are 6000 images per class. We only used 2 classes, cats and dogs. All datasets are available publicly and have been widely used as benchmarks for vision models.

Methodology

The ViT architecture can be segmented into three primary parts: patch creation, transformer encoder, and classification head. The patch creation process divides the input image into a grid of patches, which are flattened into sequences and fed into the transformer encoder. Since each patch is treated as a token, the transformer encoder can learn to attend to different patches to extract relevant features for the image recognition task. The transformer encoder consists of multiple layers of self-attention and feedforward networks. In the self-attention layers, the model learns to attend to different patches and compute a weighted average of them, capturing the relationships between patches. In the feedforward layers, the model applies a set of fully connected layers to further process the learned features. The classification head takes in the output of the transformer and outputs the logits using a feed-forward layer. The output is a tensor of probabilities corresponding to each category.

Metrics

Experimental goals: quantitative accuracy of 30% on Tiny ImageNet and 70% on CIFAR-10 (state-of-the-art model achieves around 88.55% on the full ImageNet; we plan on using a much smaller dataset and a significantly reduced parameter size) using the same benchmarks as the paper and qualitative evaluation on new datasets as well as fun images we find!

Base: A working model with a quantitative accuracy of 30% on Tiny ImageNet and 70% on CIFAR-10

Target: A working model with a quantitative accuracy of 50% on Tiny ImageNet and 80% on CIFAR-10

Stretch: A working model with a quantitative accuracy of 70% on Tiny ImageNet and 90% on CIFAR-10

Ethics

What broader societal issues are relevant to your chosen problem space?

In general, big vision models have significant societal implications related to computation power and energy consumption. The training and deployment of these models require extensive computational resources, which can lead to increased energy consumption and carbon emissions. This issue is especially relevant as the demand for more advanced language models grows, as it could lead to significant environmental impacts. The paper's focus on reducing the number of parameters in deep learning models is crucial as a mid-term solution in addressing this issue. By using fewer parameters, the computational power required to train and deploy the models is reduced, resulting in lower energy consumption and carbon emissions. This could have a positive impact on the environment and support the development of more sustainable AI practices. Moreover, the broader societal implications relate to the potential for AGI (Artificial General Intelligence) and deep learning models that can take in more types of information about the world. These models have the potential to revolutionize various industries, from healthcare to transportation. However, the development and deployment of these models must consider their ethical implications, including potential biases and the impact on the job market. Therefore, it is essential to explore ways to create sustainable and ethical AI practices in this problem space.

Who are the major “stakeholders” in this problem, and what are the consequences of mistakes made by your algorithm?

The major "stakeholders" in the problem space of classifying images using deep learning models are individuals or organizations that use this technology. This can include content creators who rely on image classification algorithms to describe images, search engines that utilize image recognition algorithms to generate search results, and consumers who rely on these algorithms to obtain information. The consequences of mistakes made by these algorithms can be significant, as they can lead to incorrect image classifications. These mistakes could have a severe impact on the accuracy and reliability of the information provided, potentially leading to misunderstandings, confusion, and even harm. If people rely on this technology to provide accurate information, they may trust incorrect information, leading to potential legal, financial, or social consequences. Moreover, the consequences of mistakes made by these algorithms can extend beyond individual users to society as a whole. For example, if search engines use image recognition algorithms to generate search results, incorrect or biased results could have a significant impact on the perception of certain groups, ideas, or events. Therefore, it is essential to develop and deploy these algorithms responsibly, considering their potential impact on all stakeholders involved.

Division of labor

We’re planning on working together, utilizing pair programming practices. We will also divide the writing portion, and most likely work on the poster together as it will probably mainly require reformatting information in our write up in the form of a poster.