Inspiration
Vision transformers are steadily displacing conventional CNN models, performing as well as or better than CNNs. fast.ai is a library built on PyTorch and is known for its ease of use: one can create an image classifier in just 5 lines of code. timm is a library that collects state-of-the-art computer vision models. I wanted to make the vision transformers in timm compatible with fast.ai. I also wanted to use self-supervised learning to pre-train custom vision transformers. These two goals were my main inspiration for developing transcv.
What it does
transcv does the following:
- It creates custom ViT (Vision Transformer) and SwinT (Swin Transformer) models, pre-trained if required, that are compatible with the fast.ai library, in just 3-4 lines of code.
- Its documentation includes tutorial notebooks for self-supervised pre-training of custom ViT and SwinT models with fast.ai, for those who want to train these models from scratch.
How I built it
I used the ViT and SwinT models from timm. I separated each model into its constituent modules and then joined them, along with a custom head, into a sequential model. For the self-supervised learning tutorials, I used a separate library that provides fast.ai-friendly functions for self-supervised pre-training. The transcv PyPI package was developed using nbdev.
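The split-and-rejoin idea can be sketched in plain PyTorch. This is only an illustration under stated assumptions, not the transcv source: `TinyBackbone`, `ToTokens`, `MeanPool`, and `to_sequential` are hypothetical names, and the tiny backbone merely stands in for a timm model so that the example runs without downloading weights.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a transformer backbone (not a timm class).

    It only holds the submodules; they are reassembled below, so no
    forward method is defined here.
    """

    def __init__(self, dim=32, patch=8, depth=2):
        super().__init__()
        # Patch embedding: image -> grid of patch features
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.Sequential(*[
            nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=64,
                                       batch_first=True)
            for _ in range(depth)
        ])
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, 1000)  # original classifier, discarded below


class ToTokens(nn.Module):
    """Reshape (B, C, H, W) patch features into a (B, N, C) token sequence."""

    def forward(self, x):
        return x.flatten(2).transpose(1, 2)


class MeanPool(nn.Module):
    """Average over the token dimension to get one vector per image."""

    def forward(self, x):
        return x.mean(dim=1)


def to_sequential(backbone, n_classes):
    """Rejoin the backbone's modules as a body, then append a custom head."""
    dim = backbone.norm.normalized_shape[0]
    body = nn.Sequential(backbone.patch_embed, ToTokens(),
                         backbone.blocks, backbone.norm)
    head = nn.Sequential(MeanPool(), nn.Linear(dim, n_classes))
    return nn.Sequential(body, head)


model = to_sequential(TinyBackbone(), n_classes=5)
out = model(torch.randn(2, 3, 32, 32))
print(out.shape)
```

Keeping the result as a two-part `nn.Sequential` means the body and head can be indexed separately (`model[0]`, `model[1]`), which is convenient for freezing the body or giving the head its own learning rate.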
Challenges I ran into
There were quite a few of them. The two most challenging were:
- Self-supervised pre-training of the vision transformers took a very long time. Therefore, I decided to pre-train the models for just 1 epoch in the tutorial notebooks.
- I had to create a custom embedding module for ViT, because timm's embedding module was missing whenever I tried to split the model into its modules.
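One plausible reading of the second challenge, offered as an assumption rather than a confirmed diagnosis: in a ViT, the class token and position embedding are often stored as bare parameters on the model rather than as submodules, so they do not appear when iterating over the model's child modules. The sketch below uses a hypothetical `ToyViT` and a hypothetical `CustomEmbedding` wrapper (neither is taken from timm or transcv) to show the effect and one way to bundle the pieces back into a single module.

```python
import torch
from torch import nn


class ToyViT(nn.Module):
    """Hypothetical toy model that mimics only the parameter layout."""

    def __init__(self, dim=32, patch=8, img=32):
        super().__init__()
        n_patches = (img // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Bare parameters: these are NOT submodules
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.norm = nn.LayerNorm(dim)


class CustomEmbedding(nn.Module):
    """Bundle patch embedding, class token, and position embedding."""

    def __init__(self, vit):
        super().__init__()
        self.patch_embed = vit.patch_embed
        self.cls_token = vit.cls_token   # shared with the original model
        self.pos_embed = vit.pos_embed   # shared with the original model

    def forward(self, x):
        # (B, C, H, W) -> (B, N, C)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)
        # Prepend one class token per image, then add position information
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)
        return x + self.pos_embed


vit = ToyViT()
# Only submodules are listed; cls_token and pos_embed are absent
child_names = [name for name, _ in vit.named_children()]
print(child_names)

embed = CustomEmbedding(vit)
tokens = embed(torch.randn(2, 3, 32, 32))
print(tokens.shape)
```

Because the wrapper reuses the original parameter objects, any pre-trained values they hold are carried over, and the wrapper can then sit at the front of a sequential model like any other module.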
Accomplishments that I am proud of
The transcv library can create custom, pre-trained ViT and SwinT models in just 3-4 lines of code, and these models can also be pre-trained in a self-supervised fashion.
What I learned
I learned a lot about the intricacies of fast.ai and PyTorch, especially the callback system of fast.ai. I also learned about the architectures of ViT, SwinT, and their variants.
What's next for transcv
I plan to add the following:
- Vision transformers for object detection and image segmentation
- Vision transformers for object tracking and motion prediction
- Video vision transformers (that is, vision transformers that are capable of working with video data)
- Multi-modal models

