Final Deliverables:
Write-up link
Poster link (Also in media)
Oral presentation link (Also in media)
Github link
Introduction
Problem:
- Supervised training focuses only on predicting p(label|input). However, the input itself can contain rich information, and understanding p(input) can be helpful to the model in later tasks.
Our Idea:
- Introduce a regularization method in supervised training for transformers: feed patches from multiple images to the transformer.
- In this way, the transformer has to think positiveness between tokens to give correct predictions
- Significant improvement on small datasets: CIFAR-10 and CIFAR-100
Who
Tiancheng Shi(tshi19)
Haowei Gao(hgao13)
Related Work
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale link
mixup: Beyond Empirical Risk Minimization link
A Simple Framework for Contrastive Learning of Visual Representationslink
Data
CIFAR 10 and CIFAR 100.
Methodology
- Backbone neural network: vision transformer
- Pass different images token from a batch with the same positional encoding to the transformer
- Each output token is classified with respect to the corresponding image label.
- The training task is more difficult: the transformer has to think positiveness between tokens to give correct predictions.
Metrics
Classification accuracy on the test dataset.
Result
- Joint training with multiple images can improve test accuracy on CIFAR-10 by about 4% and can improve test accuracy on CIFAR-100 by about 10% for plain ViT.
- Compared to jointly training with 4 images does not have significant performance improvement despite the number of images increasing.
Limitations
- Longer training time: Increased sequence length requires extra computation resources in training
- We don’t know if this method can work well on large datasets
Future work
- Can this method work on the pre-training stage of ViT on large datasets to get a better pre-trained model?
- Can this method extend to other neural network structures like CNN?
Log in or sign up for Devpost to join the conversation.