Final Deliverables:

Write-up link

Poster link (Also in media)

Oral presentation link (Also in media)

GitHub link

Introduction

Problem:

  • Supervised training focuses only on predicting p(label|input). However, the input itself can contain rich information, and understanding p(input) can be helpful to the model in later tasks.

Our Idea:

  • Introduce a regularization method in supervised training for transformers: feed patches from multiple images to the transformer.
  • In this way, the transformer has to work out which tokens come from the same image (i.e., which token pairs are positive) in order to give correct predictions.
  • The method gives a significant improvement on small datasets: CIFAR-10 and CIFAR-100.

Who

Tiancheng Shi (tshi19)

Haowei Gao (hgao13)

Related Work

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale link

mixup: Beyond Empirical Risk Minimization link

A Simple Framework for Contrastive Learning of Visual Representations link

Data

CIFAR-10 and CIFAR-100.

Methodology

  • Backbone neural network: vision transformer
  • Pass tokens from different images in a batch to the transformer together, using the same positional encoding for every image.
  • Each output token is classified with respect to the corresponding image label.
  • The training task is more difficult: the transformer has to work out which tokens come from the same image (i.e., which token pairs are positive) in order to give correct predictions.
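
The steps above can be sketched as a small data-preparation routine. This is only an illustration in plain Python, not the project's actual code: the function name `build_joint_sequence`, the choice to concatenate the images one after another without shuffling, and the toy string "patches" are all assumptions. It shows how tokens from several images can share one set of position indices while each token keeps the label of its source image.

```python
def build_joint_sequence(images, labels):
    """Concatenate patch tokens from several images into one sequence.

    images: list of k images, each given as a list of N patch tokens.
    labels: list of k class ids, one per image.
    Returns (tokens, position_ids, token_labels), each of length k * N.
    Hypothetical helper; the concatenation order is an assumption.
    """
    if len(images) != len(labels):
        raise ValueError("need exactly one label per image")
    tokens, position_ids, token_labels = [], [], []
    for patches, label in zip(images, labels):
        for pos, patch in enumerate(patches):
            tokens.append(patch)
            # Every image reuses positions 0..N-1, so the positional
            # encoding does not reveal which image a token came from.
            position_ids.append(pos)
            # Each output token is trained against its own image's label.
            token_labels.append(label)
    return tokens, position_ids, token_labels


# Toy example: two "images" with four patches each.
images = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
labels = [3, 7]
tokens, position_ids, token_labels = build_joint_sequence(images, labels)
print(position_ids)   # [0, 1, 2, 3, 0, 1, 2, 3]
print(token_labels)   # [3, 3, 3, 3, 7, 7, 7, 7]
```

In a real training loop the tokens would be patch embeddings in a tensor, and the per-token labels would feed a cross-entropy loss applied to every output token.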

Metrics

Classification accuracy on the test dataset.
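
For concreteness, the metric is the fraction of test examples whose predicted class matches the true class. The sketch below is a minimal plain-Python version; the function name `classification_accuracy` is hypothetical, and how predictions are obtained at test time is not covered here.

```python
def classification_accuracy(predictions, targets):
    """Return the fraction of predictions equal to their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# Three of the four predicted classes match the true classes.
print(classification_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))  # 0.75
```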

Result

  • Joint training with multiple images improves the test accuracy of a plain ViT by about 4% on CIFAR-10 and by about 10% on CIFAR-100.
  • Jointly training with 4 images does not bring a significant further improvement over training with fewer images, despite the larger number of images.

Limitations

  • Longer training time: the increased sequence length requires extra computational resources during training.
  • We don’t know whether this method works well on large datasets.
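
To give a rough sense of the extra training cost: with full self-attention, every token attends to every token, so the number of attention pairs grows with the square of the sequence length. The sketch below counts those pairs under the assumption that joint training with k images multiplies the sequence length by k. The function name and the figure of 64 patch tokens per image are hypothetical and chosen only for illustration.

```python
def attention_pair_count(num_images, tokens_per_image):
    """Number of query-key pairs in one full self-attention layer."""
    seq_len = num_images * tokens_per_image
    return seq_len * seq_len


tokens_per_image = 64  # hypothetical patch count, for illustration only
for k in (1, 2, 4):
    ratio = attention_pair_count(k, tokens_per_image) // attention_pair_count(1, tokens_per_image)
    print(k, ratio)
# 1 1
# 2 4
# 4 16
```

Under this assumption, the attention cost per sequence grows by a factor of k squared, although each sequence also covers k images at once.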

Future work

  • Can this method be used in the pre-training stage of ViT on large datasets to obtain a better pre-trained model?
  • Can this method be extended to other neural network architectures, such as CNNs?
