Inspiration

Vision transformers are steadily displacing conventional CNN models, and they often perform as well as or better than CNNs. fast.ai is a library built on PyTorch and is known for its ease of use: an image classifier can be created in just 5 lines of code. timm is a library of state-of-the-art computer vision models. I wanted to make the vision transformers in timm compatible with fast.ai, and I also wanted to use self-supervised learning to pre-train custom vision transformers. These two goals were the main inspiration for developing transcv.

What it does

transcv does the following:

  1. It creates custom ViT (Vision Transformer) and SwinT (Swin Transformer) models, optionally pre-trained, that are compatible with the fast.ai library, in just 3-4 lines of code.
  2. Its documentation includes tutorial notebooks for self-supervised pre-training of custom ViT and SwinT models with fast.ai, for anyone who wants to train these models from scratch.

How I built it

I used the ViT and SwinT models from timm. I separated each model into its modules and then joined them, together with a custom head, into a sequential model. For the self-supervised learning tutorials, I used this library; it has fast.ai-friendly functions for performing self-supervised pre-training. The transcv PyPI library was developed using nbdev.
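The split-and-rejoin idea can be sketched in plain PyTorch. This is a minimal illustration, not the actual transcv code: `TinyBackbone` is a hypothetical stand-in for a timm model, and `split_and_rehead` is a name invented here for the example.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pre-built model: stem, body, original head."""

    def __init__(self, in_features=32, hidden=64, num_classes=1000):
        super().__init__()
        self.stem = nn.Linear(in_features, hidden)
        self.body = nn.Sequential(nn.GELU(), nn.Linear(hidden, hidden), nn.GELU())
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):
        return self.head(self.body(self.stem(x)))


def split_and_rehead(model, hidden, num_classes):
    """Drop the original head and join the remaining modules with a custom one."""
    # children() yields the direct sub-modules in registration order;
    # the last one here is the original classifier head.
    kept = list(model.children())[:-1]
    custom_head = nn.Sequential(nn.LayerNorm(hidden), nn.Linear(hidden, num_classes))
    return nn.Sequential(*kept, custom_head)


model = split_and_rehead(TinyBackbone(), hidden=64, num_classes=10)
logits = model(torch.randn(4, 32))  # batch of 4 inputs -> 10 logits each
```

One caveat with this approach: `children()` only returns registered sub-modules, so any operation a model performs directly inside its `forward` method is not carried over into the new sequential model.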

Challenges I ran into

There were quite a few of them. The two most challenging were:

  1. Self-supervised pre-training of the vision transformers took a very long time, so I pre-train the models for just 1 epoch in the tutorial notebooks.
  2. I had to create a custom embedding module for ViT, because timm's embedding module was missing whenever I split the model into its modules.
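To illustrate the second challenge, here is a rough sketch of what a self-contained ViT embedding module can look like in plain PyTorch. It is an assumption about the general shape of such a module, not the transcv implementation; the class name `ViTEmbedding` and its default sizes are hypothetical. It bundles the patch projection, class token, and position embedding into one module so that they survive when a model is assembled from separate modules.

```python
import torch
from torch import nn


class ViTEmbedding(nn.Module):
    """Hypothetical embedding module: patches -> tokens + class token + positions."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution cuts the image into patches and projects each one.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

    def forward(self, x):
        x = self.proj(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, N, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)      # (B, N + 1, D)
        return x + self.pos_embed           # add learned position embedding


# Small sizes keep the example fast: a 32x32 image with 8x8 patches gives 16 patches.
embed = ViTEmbedding(img_size=32, patch_size=8, embed_dim=16)
tokens = embed(torch.randn(2, 3, 32, 32))
```

Because everything happens inside one `nn.Module`, the result can be placed as the first element of an `nn.Sequential`, ahead of the transformer blocks and a custom head.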

Accomplishments that I am proud of

The transcv library can create custom, pre-trained ViT and SwinT models in just 3-4 lines of code, and these models can also be pre-trained in a self-supervised fashion.

What I learned

I learned a lot about the intricacies of fast.ai and PyTorch, especially fast.ai's callback system. I also learned about the architectures of ViT, SwinT, and their variants.

What's next for transcv

I plan to add the following:

  1. Vision transformers for object detection and image segmentation
  2. Vision transformers for object tracking and motion prediction
  3. Video vision transformers (that is, vision transformers that can work with video data)
  4. Multi-modal models
