Transformers are getting bigger and better, and chasing the SoTA baseline seems like a never-ending race. However, this progress raises concerns about two cardinal aspects:
How much compute do these models require? The largest transformer (GPT-3) requires 10K GPUs to perform few-shot learning and is 10x bigger than the previous largest transformer. These models require extensive compute (training time) and huge amounts of data to perform well.
What are they learning, and how is it carried out? These large pre-trained LMs are brittle when the distribution shifts during inference. They are biased (for instance, generative models such as GPT-2 often associate nurses with women). They are treated as black-box models in some respects: we don't know what impact fine-tuning has and, more broadly, what they attend to and how they perform their reasoning.
What it does
The main goal is to provide standardized modules for compute-efficient and robust algorithms. Compute-efficient methods such as ZSL, meta-learning, adaptive methods, and importance sampling can reduce the amount of data our models need while providing performance competitive with standard fine-tuning. Another line of research has explored how to make these models robust. Debiasing methods aim to give models better generalization capabilities. Interpretability methods such as probing classifiers help us study their internal dynamics.
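To make the probing-classifier idea concrete, here is a minimal, generic sketch (not Fluence's API): freeze an encoder, take its hidden representations, and train a small linear classifier to test whether some property is linearly decodable from them. The "representations" below are synthetic NumPy arrays standing in for frozen hidden states.

```python
import numpy as np

def train_linear_probe(reps, labels, lr=0.1, epochs=200, seed=0):
    """Logistic-regression probe over frozen representations.

    reps: (n, d) array of hidden states (from any frozen encoder).
    labels: (n,) binary array for the property being probed.
    Returns the learned weights and final training accuracy.
    """
    rng = np.random.default_rng(seed)
    n, d = reps.shape
    w = rng.normal(scale=0.01, size=d)
    b = 0.0
    for _ in range(epochs):
        z = reps @ w + b
        p = 1.0 / (1.0 + np.exp(-z))   # sigmoid
        grad = p - labels              # dL/dz for the log loss
        w -= lr * (reps.T @ grad) / n
        b -= lr * grad.mean()
    acc = ((reps @ w + b > 0) == labels).mean()
    return w, acc

# Synthetic "hidden states": the probed property is encoded linearly in
# the first dimension, so a linear probe should recover it easily.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(float)
_, acc = train_linear_probe(X, y)
```

High probe accuracy is evidence the property is (at least linearly) present in the representations; a real setup would compare against a control task and a random-encoder baseline.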
Fluence provides a standardized API (similar to HF Transformers) to integrate these methods with existing workflows. Almost all the modules take arguments similar to those of any Transformers code, so existing code needs minimal changes, reducing overhead on the user's end. More details can be found in the video.
How I built it
This library uses PyTorch for all its functionality. It is part of a research project which will be published in the next few months. Being an active user of and contributor to PyTorch and Transformers, I realized that there is a gap that can be filled with regard to compute efficiency and robust methods. I looked at many different implementations to understand the issues (different ways of loading data, models expecting different inputs, custom training loops, no standard way to report results) and want to address these issues in this domain, similar to Transformers. You can simply feed in any nn.Module model, wrap it inside Fluence-provided methods, and let the rest be taken care of for you. The current functionalities are what I felt were the essential starting point.
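The wrapping idea can be sketched generically. Note this is illustrative only and NOT Fluence's actual API; the class and method names here are hypothetical. The point is that a wrapper accepts any model object and layers extra behavior (here, per-step loss logging, which an adaptive trainer could act on) around its forward pass without modifying the model itself.

```python
class LoggingWrapper:
    """Hypothetical wrapper: takes any model, records each step's loss
    so a training loop could, e.g., adapt its sampling or schedule."""

    def __init__(self, model, loss_fn):
        self.model = model          # any callable, e.g. an nn.Module
        self.loss_fn = loss_fn
        self.loss_history = []

    def step(self, batch, target):
        pred = self.model(batch)
        loss = self.loss_fn(pred, target)
        self.loss_history.append(loss)
        return loss

# Stand-in "model": any callable works; in practice this would be
# an nn.Module and the loss a torch loss function.
wrapped = LoggingWrapper(lambda x: 2 * x, loss_fn=lambda p, t: abs(p - t))
wrapped.step(3, 5)   # pred = 6, loss = 1
```

The design choice here is composition over inheritance: because the wrapper only requires a callable, it works with any existing nn.Module and leaves the user's model code untouched.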
Challenges I ran into
It took me a lot of time to make some methods work (such as HEX, due to its instability with matrix inversion, and MAML for transformers, which now uses higher and integrates with the HF PyTorch Trainer). I had to read many different papers to understand the problems and how they could be better implemented. Some of the methods were implemented in TF and had to be ported to PyTorch (which required me to read the TF docs).
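To show why the MAML loop is fiddly to integrate, here is a first-order MAML sketch, not Fluence's implementation, on a toy NumPy regression family (tasks are lines y = a*x with different slopes a, the model is a single scalar weight). It makes the inner-adapt / outer-meta-update structure concrete; in the real transformer setting, a library like higher is what lets you differentiate through the inner loop instead of using this first-order approximation.

```python
import numpy as np

def maml_toy(num_iters=200, inner_lr=0.1, meta_lr=0.05, seed=0):
    """First-order MAML on toy tasks: regress y = a*x with scalar weight w.

    Each iteration samples a task (a slope a), takes one inner SGD step
    from the shared init w0 on a support set, then updates w0 with the
    gradient of the query loss evaluated at the adapted weight
    (first-order approximation: no second derivatives).
    """
    rng = np.random.default_rng(seed)
    w0 = 0.0
    for _ in range(num_iters):
        a = rng.uniform(1.0, 3.0)                    # task: target slope
        x = rng.normal(size=8)                       # support set
        y = a * x
        grad_inner = np.mean(2 * (w0 * x - y) * x)   # d/dw of MSE
        w_adapted = w0 - inner_lr * grad_inner       # inner adaptation step
        xq = rng.normal(size=8)                      # query set
        yq = a * xq
        grad_meta = np.mean(2 * (w_adapted * xq - yq) * xq)
        w0 -= meta_lr * grad_meta                    # outer meta-update
    return w0

w0 = maml_toy()   # meta-learned init; ends near the mean task slope (2.0)
```

The meta-learned init sits near the center of the task distribution, so one gradient step adapts it to any sampled slope, which is exactly the behavior full MAML (via higher) buys you for transformers.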
Accomplishments that I'm proud of
I am proud of implementing modules that didn't have a proper implementation before. Going forward, this library will include some of the best practices in research. In the process, I submitted several PRs to the Transformers repo, for instance, adding PyTorch native AMP to the Trainer. I think Fluence's direction will be determined by the community response. It has always been one of my research goals to create an ML library that makes it easier for researchers to try out their ideas and prototype them with minimal overhead.
What I learned
I learned a ton about NLI research, since this is the task on which I tested these methods. I learned a lot about the Transformers library, such as its standard APIs for instantiating modules and its training workflow. I liked it, and this is one of the reasons why Fluence integrates with their workflow. I also learned about code coverage in general and added it to this library.
What's next for Fluence
There are a lot of things coming to Fluence in the next few months. The meta-learning pipeline needs improvement in terms of the flexibility it offers users. I hope to add improved pruning methods (inspired by LTH). There are a few sampling methods currently; I hope to make the data-order aspect easy to manage. I also want to add sparse methods, possibly ones that integrate well with autograd. Improvements to the documentation and the addition of examples will be an ongoing effort.