Joint Image Training for Transformers in Supervised Learning

Final Deliverables:

Write-up link

Poster link (Also in media)

Oral presentation link (Also in media)

Github link

Problem:

Supervised training focuses only on predicting p(label|input). However, the input itself can contain rich information, and understanding p(input) can be helpful to the model in later tasks.

Our Idea:

Introduce a regularization method in supervised training for transformers: feed patches from multiple images to the transformer.
In this way, the transformer has to work out how tokens relate to each other, in particular which tokens come from the same image, in order to give correct predictions.
Significant improvement on small datasets: CIFAR-10 and CIFAR-100

Tiancheng Shi (tshi19)

Haowei Gao (hgao13)

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale link

mixup: Beyond Empirical Risk Minimization link

A Simple Framework for Contrastive Learning of Visual Representations link

CIFAR-10 and CIFAR-100.

Backbone neural network: vision transformer
Pass patch tokens from different images in a batch to the transformer together, using the same positional encoding for every image.
Each output token is classified with respect to the corresponding image label.
The training task is more difficult: the transformer has to work out which tokens come from the same image in order to give correct predictions.
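The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the project's code: the function names `build_joint_sequences` and `per_token_cross_entropy`, the choice to group consecutive images in the batch, and the plain concatenation of their tokens are all assumptions made for the example. The transformer itself is left out; a stand-in array of logits takes its place.

```python
import numpy as np


def build_joint_sequences(patch_tokens, labels, pos_embed, images_per_seq):
    """Concatenate the patch tokens of several images into one sequence.

    patch_tokens:   (B, N, D) patch embeddings, B images with N patches each
    labels:         (B,) integer class label of each image
    pos_embed:      (N, D) positional encoding, reused unchanged for every image
    images_per_seq: number of images per joint sequence (assumed to divide B)
    """
    B, N, D = patch_tokens.shape
    k = images_per_seq
    if B % k != 0:
        raise ValueError("batch size must be divisible by images_per_seq")
    # Every image receives the same positional encoding, so position alone
    # cannot tell the model which image a token came from.
    tokens = patch_tokens + pos_embed[None, :, :]
    # (B, N, D) -> (B/k, k*N, D): k consecutive images share one sequence.
    joint_tokens = tokens.reshape(B // k, k * N, D)
    # Each token inherits the label of its source image.
    token_labels = np.repeat(labels, N).reshape(B // k, k * N)
    return joint_tokens, token_labels


def per_token_cross_entropy(logits, token_labels):
    """Mean cross-entropy over all tokens.

    logits:       (G, L, C) class scores for each output token
    token_labels: (G, L) integer label of the image each token came from
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, token_labels[..., None], axis=-1)
    return -picked.mean()


# Hypothetical sizes: 4 images, 64 patches each, embedding width 16, 10 classes.
rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(4, 64, 16))
labels = np.array([3, 7, 1, 9])
pos_embed = rng.normal(size=(64, 16))

joint_tokens, token_labels = build_joint_sequences(
    patch_tokens, labels, pos_embed, images_per_seq=2
)
# joint_tokens has shape (2, 128, 16): two sequences of 2 * 64 tokens.

# Stand-in for the transformer output: one score vector per token.
fake_logits = rng.normal(size=(2, 128, 10))
loss = per_token_cross_entropy(fake_logits, token_labels)
```

Because the positional encoding is identical across images, the only way to label a token correctly is to use its content and its relation to the other tokens in the sequence, which is where the regularization effect is expected to come from.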

Classification accuracy on the test dataset.

Joint training with multiple images improves the test accuracy of a plain ViT by about 4% on CIFAR-10 and by about 10% on CIFAR-100.
Jointly training with 4 images does not bring a significant further improvement, even though more images are fed to the transformer at once.

Longer training time: the increased sequence length requires extra computational resources in training.
We don’t know if this method works well on large datasets.
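A rough way to see where the extra training cost comes from is to count the entries of the self-attention score matrix, which grows with the square of the sequence length. The sketch below compares processing k images as one joint sequence against processing the same k images separately. The helper names are hypothetical, and the figure of 64 tokens per image assumes 32x32 CIFAR images cut into 4x4 patches, a patch size the project does not state.

```python
def attention_score_entries(num_tokens):
    """Entries in one self-attention score matrix for a sequence of num_tokens."""
    return num_tokens * num_tokens


def joint_attention_overhead(tokens_per_image, images_per_seq):
    """Ratio of attention-score entries: one joint sequence vs. separate images."""
    separate = images_per_seq * attention_score_entries(tokens_per_image)
    joint = attention_score_entries(images_per_seq * tokens_per_image)
    return joint / separate


# Assumed setup: 64 tokens per image, 4 images per joint sequence.
overhead = joint_attention_overhead(tokens_per_image=64, images_per_seq=4)
# overhead == 4.0: the attention maps are 4x larger for the same 4 images.
```

This counts only the attention scores. The per-token linear layers cost the same either way, so the overall slowdown is smaller than this ratio suggests.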

Can this method work on the pre-training stage of ViT on large datasets to get a better pre-trained model?
Can this method be extended to other neural network architectures, such as CNNs?