Fine Tuned Heads (Are All You Need?)

Introduction

Hey! I'm William. I care about doing efficient Computer Vision and I think that this project highlights some great potential ideas for ensuring we can stay efficient.

Inspiration

I've recently been very interested in making ML (and specifically Computer Vision) more efficient. There exist many recent analyses of making ML more efficient. These papers showcase a variety of different approaches to improve model efficiency and reduce global impact. If you're not sold on why model efficiency matters, I encourage you to skim over some of those papers.

Some key takeaways are as follows:

More energy is spent serving models than training them (roughly 90% vs. 10%) (Patterson et al.)
There are three major areas we can focus on to improve the efficiency of model implementation:
More Efficient Architectures -- Model inference cost (in FLOPs), as well as model size (number of parameters), is growing exponentially. It is necessary to design and use algorithms that learn and infer more efficiently. The authors highlight two promising approaches: Mixture-of-Experts models and approaches that perform more efficient attention (BigBird, Linformer, Nystromformer).
More Efficient Hardware -- We are only able to train these larger models because of more complex hardware. Specialized hardware, such as TPUs or Cerebras' WSE, shows promising results in performance per watt. The authors find that specialized hardware can improve efficiency 2-5x.
More Efficient Energy (Datacenter/Energy Generation) -- Where these models are run also has a significant impact on the efficiency of inference. By computing locally, as well as in a place where energy can be generated efficiently, we can reduce the impact of our models.

These goals, however, conflict with many current approaches in Machine Learning implementation.

In the NLP space, we are quickly moving towards models that are trained for longer on larger, multilingual datasets. Recent SOTA works (Megatron-LM) scale to as many as a trillion parameters. Training GPT-3 (once!) consumes 1,287 MWh, the CO2 equivalent of three round-trip flights from SF to NY.

In Computer Vision, a push towards attention-based architectures, together with a focus on higher-resolution images and on videos (rather than still images), has led to a sharp increase in the cost of training a model and performing inference with it.

We propose a new workflow to assist with the implementation of efficient Computer Vision.

Rather than training a single model from scratch, we separate the encoding and fine-tuning aspects of a task.

Specifically, we train a single, large, self-supervised model on an unlabelled, high-resolution dataset. After building a strong encoding, we can fine-tune small, efficient prediction heads on those encodings to solve a specific task. We utilize Scale.ai's platform to automate labeling.

What it does

A diagram of the flow of data across the system. After raw data is collected, it is sent to the server and to Scale.ai. The server generates embeddings while Scale.ai generates ground truth labels.

We use UIUC's AlmaCam as a proof of concept for this idea. We want to show that you can take an arbitrary data source and build an extremely efficient, data-drift-resistant, (mostly) task-agnostic inference pipeline.

Our contribution has three major components:

Automatic data collection pipeline -- We design a small server that automatically collects data from the UIUC AlmaCam and uploads it to Scale.ai as well as to a data center.
Self Supervision for Image Encoding -- Instead of training a model end-to-end on local/inefficient hardware, we train in a (theoretical) data center. This allows us to still leverage extremely large models on extremely large data while maintaining long-term stability and efficiency. -- We train a self-supervised ViT/Nystromformer hybrid model on high resolution* images. We need efficient (Nystromformer) attention and use large (48x48) image patches to accommodate the high resolution images.
Fine Tuned Heads -- We showcase that this approach maintains accuracy by fine-tuning very small prediction heads on the image encoding alone (NONE of the original image is passed to the heads). -- We show that these prediction heads can be quickly and easily trained on a new task. This means we do not have to retrain the larger encoding model when we want to apply it to a new task!

*We train on 576x960 images. People normally train on 224x224, so this is ~11x more pixels. We could train on the raw resolution (1920x1080, almost 4x larger again) without adverse effects, but not over a weekend!
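
The resolution figures above are easy to check by hand. Here is a minimal sketch of the arithmetic, assuming non-overlapping square patches and reading 576x960 as height x width (the orientation is an assumption, and the counts come out the same either way). The function names are just for illustration.

```python
def pixel_count(height, width):
    """Number of pixels in a single-channel image of the given size."""
    return height * width


def patch_grid(height, width, patch):
    """Rows and columns of non-overlapping square patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch, width // patch


train = pixel_count(576, 960)      # resolution used for training
typical = pixel_count(224, 224)    # common ImageNet-style resolution
raw = pixel_count(1080, 1920)      # raw camera resolution

print(round(train / typical, 2))   # 11.02
print(raw / train)                 # 3.75

rows, cols = patch_grid(576, 960, 48)
print(rows, cols, rows * cols)     # 12 20 240
```

So with 48x48 patches, each training image turns into 240 tokens.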

How we built it

Everything is written in Python. I use PyTorch for ML things and Flask for server things.

Automatic data collection pipeline:

This was relatively straightforward. We have Flask automatically execute some bash scripts to first grab the .m3u8 playlist URL from YouTube. Then, we have ffmpeg scrape frames from that m3u8 file into a separate directory. After a certain number of images are downloaded, we hit the Scale.ai API to upload these images and associated metadata (time of day) to their database.
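
For a rough idea of what the frame-scraping step looks like, here is a sketch that builds the ffmpeg command from Python instead of bash. The function name, output file pattern, sampling interval, and example URL are all assumptions made for illustration; only the ffmpeg flags themselves (`-i`, `-vf fps=...`, `-q:v`) are standard. It does not cover how the playlist URL is obtained or the Scale.ai upload.

```python
import shlex
from pathlib import Path


def build_frame_grab_command(playlist_url, out_dir, seconds_between_frames=60):
    """Build an ffmpeg command that saves one JPEG frame every N seconds."""
    pattern = str(Path(out_dir) / "frame_%06d.jpg")  # hypothetical naming scheme
    return [
        "ffmpeg",
        "-i", playlist_url,                        # the .m3u8 playlist to read from
        "-vf", f"fps=1/{seconds_between_frames}",  # keep one frame per interval
        "-q:v", "2",                               # high JPEG quality
        pattern,
    ]


# Placeholder URL; the real one would come from the live stream.
cmd = build_frame_grab_command("https://example.com/live/playlist.m3u8", "frames")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one shell string) avoids quoting problems if the URL contains characters like `&`.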

Originally, I had configured this server to automatically create a batch (a new set of images to be labeled) but I found it easier to manually do it (since creating batches can be expensive and I only got $250 of credit).

Self Supervision for Image Encoding

So, the data collection pipeline is feeding new data into some files and we want to train on that in a self-supervised way (since this allows the encoder model to remain task agnostic). If you're interested, here are the wandb logs.

Model Architecture

We train a Vision Transformer-based architecture on 576x960 resolution images. We replace normal attention (which scales poorly with many tokens) with Nystrom-based attention. This allows us to approximate self-attention with O(n) memory complexity in the number of tokens n (rather than O(n^2) with normal attention). This is necessary since we do not know our downstream task and so we must maintain high-resolution images. Our best model has a token dimension of 1024, a depth of 10, 10 heads, and 256 landmarks. If you'd like to compare it to existing models, it sits somewhere between ViT-S and ViT-B.
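
To make the O(n) versus O(n^2) difference concrete, here is a back-of-the-envelope sketch that counts attention-matrix entries per head per layer. It assumes the usual Nystrom factorization into three smaller matrices (tokens-to-landmarks, landmarks-to-landmarks, landmarks-to-tokens) and ignores any workspace needed for the pseudo-inverse. The 256 landmarks come from the model above; the token counts are hypothetical.

```python
def full_attention_entries(n_tokens):
    """Entries in the full n x n attention matrix."""
    return n_tokens * n_tokens


def nystrom_attention_entries(n_tokens, n_landmarks):
    """Entries in the three Nystrom factors: n x m, m x m, and m x n."""
    return 2 * n_tokens * n_landmarks + n_landmarks * n_landmarks


for n in (1_000, 10_000, 100_000):  # hypothetical token counts
    full = full_attention_entries(n)
    approx = nystrom_attention_entries(n, 256)
    print(f"{n:>7} tokens: full={full:,} nystrom={approx:,} ratio={full / approx:.1f}x")
```

The Nystrom count grows linearly in the number of tokens for a fixed number of landmarks, so the savings only show up once the token count is well above the landmark count.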

Training Scheme

We use the Masked Autoencoders Are Scalable Vision Learners paper to establish our approach for self-supervision, as seen in the graphic below.

Masked Autoencoder training scheme for self-supervision. Image patches are masked out and a model is made to predict the masked regions.

This approach masks regions of the image and provides the image encoder with only the non-masked regions. The image encoder builds a strong encoding, then passes those tokens to a decoder, which predicts the masked regions. We mask 70% of our patches. The best model had around 80M parameters, plus an extra 30M for the decoder (which is only needed when training the self-supervised model).
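
The masking step can be sketched in a few lines. This is a simplified illustration using only the standard library, not the training code: it assumes uniform random masking and a 240-patch image (576x960 with 48x48 patches), and the function name is made up.

```python
import random


def split_patches(num_patches, mask_ratio, seed=None):
    """Randomly split patch indices into (visible, masked) lists."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)                    # uniform random permutation of patch indices
    masked = sorted(order[:num_masked])   # the decoder must reconstruct these
    visible = sorted(order[num_masked:])  # only these are fed to the encoder
    return visible, masked


visible, masked = split_patches(240, 0.70, seed=0)
print(len(visible), len(masked))  # 72 168
```

With 70% masking, the encoder only processes 72 of the 240 patches per image, which is a large part of why this style of pre-training is cheap.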

Logistics

We train on 4xA100 for 1.5 hours. Our model was still improving but you gotta move fast during a hackathon so I spent less time on hyperparameter tuning.

Fine Tuned Heads

The fine-tuned head can be adjusted depending on the task. We only train on predicting the number of people in an image as well as predicting the time of day. These heads were about 2M parameters. We did not have time to ablate the head size but I strongly suspect it could be reduced drastically (5% of current size or less) for simple tasks like those above.
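
To give a feel for these sizes, here is a small sketch that counts the parameters of a fully connected head. The layer widths are hypothetical, chosen only so that the totals land near 2M and near 5% of that. The 512-dimensional input is taken from the embedding size mentioned later in this write-up.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical widths: a 512-dim embedding in, a single regression output.
current_head = mlp_parameter_count([512, 1280, 1024, 1])
small_head = mlp_parameter_count([512, 192, 1])

print(current_head)                         # 1969409
print(small_head)                           # 98689
print(round(small_head / current_head, 3))  # 0.05
```

Under these assumptions, a single hidden layer of 192 units already gets to the 5% mark.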

Challenges we ran into

In no particular order:

I did not know bash or ffmpeg very well, so it was a bit of a struggle to get them to download files from a YouTube live stream.
My original uploads to Scale.ai did not include the original image path in the metadata, so when I downloaded the labeled results, I was unable to link them back to my local images :(. I created a new dataset and included the image path in the metadata.
There is an extremely fine balance between the size of patches and the number of patches. Generally, more (and thus smaller) patches are better (since you can represent more complexity). However, more patches make our memory and training footprint prohibitively large. On the other hand, if the patches are too big, we run the risk of masking out an entire person, which would be impossible for the model to reconstruct and would lead to a poor encoding representation. It took a lot of tuning.
My lack of desire to tune hyperparameters early meant I wasted a very long time with a very bad learning rate.
I planned to feed these encodings into a DETR architecture to perform object/bounding box detection, but I didn't have time. It isn't hard, but the DETR paper is more time-consuming to implement than I originally expected.

Accomplishments that we're proud of

I truly believe this idea holds significant merit for a future of performing efficient inference. It has its faults: relying on a data center to serve model encodings introduces latency that may not be acceptable in some applications (see: self-driving vehicles). However, the benefit of executing upwards of 99% of your FLOPs in a location where it can be 40x more efficient should not be understated.
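
As a sanity check on the "upwards of 99%" figure, here is a rough estimate using the common rule of thumb of about 2 FLOPs per parameter per token for dense layers. It ignores the attention computation itself, assumes the encoder sees all 240 patches at inference time, and assumes the head runs once on a single pooled embedding. All of these are simplifying assumptions, so read the result as an order of magnitude only.

```python
def dense_forward_flops(num_parameters, num_tokens=1):
    """Rule of thumb: one multiply and one add per parameter per token."""
    return 2 * num_parameters * num_tokens


encoder_flops = dense_forward_flops(80_000_000, num_tokens=240)  # ~80M-parameter encoder
head_flops = dense_forward_flops(2_000_000)                      # ~2M-parameter head

encoder_share = encoder_flops / (encoder_flops + head_flops)
print(f"{encoder_share:.4%} of forward-pass FLOPs are in the encoder")
```

Under these assumptions the encoder's share comes out well above 99%.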

This model shows extremely high generalization capacity.

The two tasks we fine-tune on are time (of day) prediction and number of people (a stand-in for bounding box prediction). For time of day, we train on 300 samples and can generally predict the time of day within 15 minutes (though all of the data was collected over the weekend, so there's probably some overfitting since the test and train sets are from the same day). For number of people, we train on only 30 samples and get an average error of about 0.7 people! This is absolutely amazing considering that an embedding of only 512 numbers was able to describe the 1.6M pixel values in an original image.
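
For reference, the 1.6M figure counts every channel of every pixel. A quick check, assuming 3-channel RGB input at the 576x960 training resolution:

```python
height, width, channels = 576, 960, 3  # RGB input is an assumption
embedding_size = 512

pixel_values = height * width * channels
print(pixel_values)                    # 1658880
print(pixel_values / embedding_size)   # 3240.0
```

That works out to roughly a 3,240x reduction in the number of values the head has to look at.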

The automated data pipeline is extremely smooth. It felt very nice to be working on other tasks while collecting data, creating batches, and getting those batches labeled at the same time.

What we learned

A ton about Bash and Flask. I learned what Scale.ai actually does, which was really fun. I learned a significant amount about the efficiency of models and, maybe more importantly, the inefficiencies. It was quite shocking for me to read that usually only 10% of the energy spent on a model was during training.

What's next for Fine Tuned Heads

I really want to see how well the object detection works with only encodings. Theoretically, it should be fine, since these embeddings are strong enough to (almost perfectly) reconstruct the original image, so the information on where people are in the image is clearly present.

Automating more of the process would be the next major step. Currently, the 'datacenter' is a DGX A100, which honestly is a pretty killer data center but it could be on a TPU, further increasing efficiency.

Looking extremely long-term, a major question becomes how to keep the self-supervised model up to date. Consistent 'fine-tuning' with new data should prevent too much data drift, but every time you fine-tune the self-supervised model you need to retrain all of the prediction heads (which, granted, is quite easy and fast). I believe this can be solved with some codebook solutions like that seen in VQ-VAE2, but I'd have to think about it more.