PADL

For data scientists, developing neural network models is often hard to coordinate and manage: it means juggling diverse tasks such as pre-processing, PyTorch layers, loss functions and post-processing, as well as maintaining config files and code bases and communicating results between teams. PADL is a tool to alleviate several aspects of this work.

Problem Statement

While developing and deploying our deep learning models in PyTorch, we found that important design decisions and even data-dependent hyper-parameters lived not just in the forward passes/ modules but also in the pre-processing and post-processing. For example:

  • in NLP, the exact steps and objects needed to convert a sentence to a tensor
  • in neural translation, the details of beam-search post-processing and filtering based on business logic
  • in vision applications, the normalization constants applied to image tensors
  • in classification, the label lookup dictionaries and the formatting of tensors into human-readable output

In terms of the functional mental model for deep learning we typically enjoy working with, these steps constitute key initial and end nodes on the computation graph which is executed for each model forward or backward pass.

Standard Approach

The standard approach to deal with these steps is to maintain a library of routines for these software components and log with the model or in code which functions are necessary to deploy and use the model. This approach has several drawbacks.

  • A complex versioning problem is created in which each model may require a different version of this library. This means that models using different versions cannot be served side-by-side.
  • Importing and using the correct pre- and post-processing is laborious when working interactively (as data scientists are accustomed to doing).
  • It is difficult to create exciting variants of a model based on slightly different pre- and post-processing without first going through the steps of modifying the library in a git branch or similar.
  • There is no easy way to robustly save and inspect the results of "quick and dirty" experimentation in, for example, Jupyter notebooks. This way of operating is a major workhorse of a data scientist's daily routine.

PADL Solutions

In creating PADL we aimed to create:

  • A beautiful functional API including all mission-critical computational steps in a single formalism -- pre-processing, post-processing, forward pass, batching and inference modes.
  • An intuitive serialization/ saving routine, yielding nicely formatted output, saved weights and the necessary data blobs, which allows for easily comprehensible and reproducible results even after creating a model in a highly experimental, "notebook" fashion.
  • An "interactive" or "notebook-friendly" philosophy, with printing and model inspection designed for applying models and viewing and inspecting their outputs interactively.

With PADL it's easy to maintain a single pipeline object for each experiment which includes pre-processing, forward pass and post-processing, based on the central Transform abstraction. When the time comes to inspect previous results, simply load that object and inspect the model topology and outputs interactively in a Jupyter or IPython session. When moving to production, simply load the entire pipeline into the serving environment or app, without needing to maintain disparate libraries for the various model components. If the experiment needs to be reproduced down the line, simply re-execute it by pointing the training function at the saved model output.
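
Concretely, the intended workflow looks roughly like this (a sketch only; pipeline stands for a Transform built as in the examples below, and 'experiment.padl' is an illustrative path):

from padl import load

pipeline.save('experiment.padl')      # persist code, weights and data blobs together

reloaded = load('experiment.padl')    # later: in a notebook or in the serving environment
print(reloaded)                       # inspect the pipeline topology
prediction = reloaded.infer_apply('the cat sat on the mat .')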

What it does

Defining atomic transforms

Imports:

from padl import this, transform, batch, unbatch, value
import padl
import torch

Transforms are defined using the transform decorator. Any class implementing __call__ can also become a transform:

@transform
def split_string(x):
    return x.split()

@transform
class ToInteger:
    """Map words to integer ids, with an '<unk>' fallback for unknown words."""

    def __init__(self, words):
        self.words = words + ['<unk>']
        self.dictionary = dict(zip(self.words, range(len(self.words))))

    def __call__(self, word):
        if word not in self.dictionary:
            word = '<unk>'
        return self.dictionary[word]

to_integer = ToInteger(WORDS)

EOS_VALUE = to_integer.dictionary['</s>']

@transform
def to_tensor(x):
    x = x[:10]                       # truncate to at most 10 tokens
    for _ in range(10 - len(x)):
        x.append(EOS_VALUE)          # pad short sequences with the end-of-sentence id
    return torch.tensor(x)

transform also supports inline lambda functions as transforms:

split_string = transform(lambda x: x.split())

The this object yields inline transforms which reference methods and items of their input:

left_shift = this[:, :-1]
lower_case = this.lower()

PyTorch layers are first class citizens via padl.transforms.TorchModuleTransform:

@transform
class LM(torch.nn.Module):
    def __init__(self, n_words):
        super().__init__()
        self.rnn = torch.nn.GRU(64, 512, 2, batch_first=True)
        self.embed = torch.nn.Embedding(n_words, 64)
        self.project = torch.nn.Linear(512, n_words)

    def forward(self, x):
        output = self.rnn(self.embed(x))[0]
        return self.project(output)

model = LM(N_WORDS)

print(isinstance(model, torch.nn.Module))                   # prints "True"
print(isinstance(model, padl.transforms.Transform))         # prints "True"

Finally, it's possible to invoke all callables from an imported module as Transforms directly. This saves writing the transforms explicitly:

import numpy
import torchvision

normalize = transform(torchvision).transforms.Normalize(*args, **kwargs)
cosine = transform(numpy).cos

print(isinstance(normalize, padl.transforms.Transform))         # prints "True"
print(isinstance(cosine, padl.transforms.Transform))            # prints "True"

Defining compound transforms

Atomic transforms may be combined using the following functional primitives:

Transform composition: compose

s = transform_1 >> transform_2

Applying a single transform over multiple inputs: map

s = ~ transform

Applying transforms in parallel to multiple inputs: parallel

s = transform_1 / transform_2

Applying multiple transforms to a single input: rollout

s = transform_1 + transform_2
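
To make the routing concrete, here is a toy sketch using illustrative numeric transforms (times_two and plus_three are not part of PADL):

times_two = transform(lambda x: x * 2)
plus_three = transform(lambda x: x + 3)

composed = times_two >> plus_three       # x -> plus_three(times_two(x))
mapped = ~ times_two                     # (a, b, c) -> (a * 2, b * 2, c * 2)
side_by_side = times_two / plus_three    # (a, b) -> (a * 2, b + 3)
branched = times_two + plus_three        # x -> (x * 2, x + 3)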

Large transforms may be built from combinations of these operations. For example, a branching training pipeline such as the following can be implemented:

preprocess = (
    lower_case
    >> clean
    >> tokenize
    >> ~ to_integer
    >> to_tensor
    >> batch
)

forward_pass = (
    left_shift
    >> IfTrain(word_dropout)
    >> model
)

train_model = (
    (preprocess >> model >> left_shift)
    + (preprocess >> right_shift)
) >> loss

Passing inputs between transform stages

In a compose, if transform_1 has 2 outputs and transform_2 has 2 inputs, then when the composition transform_1 >> transform_2 is applied to data, the outputs of transform_1 are passed to transform_2 positionally: output 1 of transform_1 is passed to input 1 of transform_2, and so on. If transform_2 has only one input, then the outputs of transform_1 are passed to it as a tuple.
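
For illustration, a toy sketch of both cases (the transform names here are illustrative, not part of PADL):

pair = transform(lambda x: (x + 1, x - 1))     # a transform with two outputs
add = transform(lambda a, b: a + b)            # two inputs: receives the outputs positionally
collect = transform(lambda t: list(t))         # one input: receives the outputs as a tuple

positional = pair >> add        # computes add(x + 1, x - 1)
as_tuple = pair >> collect      # computes collect((x + 1, x - 1))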

In an upcoming release, we plan to allow for passing inputs from one stage to the next using input/ output names.

Decomposing models

Often it is instructive to look at slices of a model -- this helps with e.g. checking intermediate computations:

preprocess[:3]

Individual components may be obtained using indexing:

step_1 = model[1]

Naming transforms inside models

Component Transform instances may be named inline:

s = (transform_1 - 'a') / (transform_2 - 'b')

These components may then be referenced using __getitem__:

print(s['a'] == s[0])    # prints "True"

Applying transforms to data

Single data points may be passed through the transform:

prediction = t.infer_apply('the cat sat on the mat .')

To pass data points in batches but without gradients:

for x in t.eval_apply(
    ['the cat sat on the mat', 'the dog sh...', 'the man stepped in th...', 'the man kic...'],
    batch_size=2,
    num_workers=2,
):
    ...

To pass data points in batches but with gradients:

for x in t.train_apply(
    ['the cat sat on the mat', 'the dog sh...', 'the man stepped in th...', 'the man kic...'],
    batch_size=2,
    num_workers=2,
):
    ...

"batch" and "unbatch" key transforms

The batch transform denotes where to split a transform between preprocessing and forward pass. The unbatch transform denotes where to split between forward pass and postprocessing. Everything before batch is performed in the data loader, which means multiprocessing may be leveraged without extra boilerplate to prepare data quickly for the forward pass. Everything between batch and unbatch is performed in batches and on the GPU (if CUDA is being used). Everything downstream of unbatch is applied in a for loop over the rows of output of the forward pass.

When using Transform.infer_apply to apply a transform to a single data point, the batch transform adds the extra batch dimension which is otherwise created by the data loader implicit in Transform.train_apply and Transform.eval_apply. Analogously, in Transform.infer_apply the unbatch transform removes this extra dimension, so that the output going into the postprocessing step has the same number of dimensions as the rows which come out of the forward pass in Transform.eval_apply and Transform.train_apply.

As a very simple example:

m = transform(torch.nn.Linear)(10, 20)
t = (
  transform(lambda x: torch.tensor(x))
    >> batch
    >> m
    >> unbatch
    >> this.tolist()
)

t.infer_apply(x) is approximately equivalent to:

m(torch.tensor(x).unsqueeze(0))[0].tolist()

Whereas t.eval_apply(x) and t.train_apply(x) are approximately equivalent to:

[y.tolist() for y in m(torch.stack([torch.tensor(y) for y in x]))]

Model training

Important attributes and methods, such as the model parameters, are accessible via Transform.pd_*:

o = torch.optim.Adam(model.pd_parameters(), lr=LR)

For a model which emits a scalar loss tensor, training is super straightforward using standard torch functionality:

for loss in model.train_apply(TRAIN_DATA, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS):
    o.zero_grad()
    loss.backward()
    o.step()

Saving/ Loading

Saving:

model.save('test.padl')

Loading:

from padl import load
model = load('test.padl')

How we built it

We built PADL at LF1 with a team of 6, in the process of building highly branching, multi-modal deep learning models for NLP, information retrieval and vision. We originally started with a more complex version of the software which was statically typed, with a pattern matcher and a large focus on JIT compilation. In the process we realized that the clear value proposition of the concept lay in model building, saving and working interactively with the models. We would like to remain agnostic with respect to whether users prefer to use JIT models or standard modules.

Our philosophy in the released version is to maintain a minimal set of software requirements and to keep the transform concept and its associated builder and serializer central. Once these are available, users can build their transforms using a combination of PyTorch and whichever favourite data-science packages they wish to use for their data processing. Key workhorses are the inspect and ast modules, as well as, of course, PyTorch functionality.

Challenges we ran into

Operator precedence in Python

PADL overloads certain Python operators to enhance the usability of its functional API. However, the Python operators have a built-in precedence which must be respected. A challenge was to find a suitable collection of primitives, with corresponding Python operators, whose precedence also reflects the intuitions of a data scientist building a transform.
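
For example, PADL's primitives inherit Python's built-in operator precedence, in which + and / bind tighter than >>. A toy illustration in plain Python (not PADL code):

class Op:
    def __init__(self, name):
        self.name = name

    def __rshift__(self, other):
        return Op(f"({self.name} >> {other.name})")

    def __add__(self, other):
        return Op(f"({self.name} + {other.name})")

    def __truediv__(self, other):
        return Op(f"({self.name} / {other.name})")

a, b, c = Op("a"), Op("b"), Op("c")
print((a >> b + c).name)   # prints "(a >> (b + c))" -- rollout groups before compose
print((a >> b / c).name)   # prints "(a >> (b / c))" -- parallel groups before compose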

Introspecting Python objects

We use ast and inspect to access the code which created a Transform object. Often we needed to extract the imports and global variables which were key in creating the object, and isolate them so that we could robustly save our models. This presented several technical challenges, especially for objects with multiple dependencies and nested definitions.
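
A simplified sketch of this kind of introspection (not PADL's actual implementation): recover a function's source with inspect, parse it with ast, and collect the global names it depends on so that they can be saved alongside it. The helper free_global_names and the example pad_to_ten are illustrative only.

import ast
import inspect
import textwrap

def free_global_names(fn):
    source = textwrap.dedent(inspect.getsource(fn))   # the code that defined fn
    tree = ast.parse(source)
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    params = set(inspect.signature(fn).parameters)
    # keep only the names that resolve in the function's module globals
    return {name for name in used - params if name in fn.__globals__}

EOS_VALUE = 0   # an illustrative global dependency

def pad_to_ten(x):
    return x + [EOS_VALUE] * (10 - len(x))

print(free_global_names(pad_to_ten))   # {'EOS_VALUE'}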

Automatic batching

Using the batch and unbatch transforms, we are able to handle batching and extraction from batches automatically within our Transform objects. The challenge was to work out how to build a PyTorch data loader by recursively navigating the computation graph and splitting it between pre-processing, forward pass and post-processing.
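
A schematic sketch of the idea in plain PyTorch (not PADL's implementation): the pipeline is split at batch/ unbatch, the pre-processing part runs inside the data loader workers, the forward pass runs on collated batches, and the post-processing part runs row by row over the output. The names PreprocessedDataset and eval_apply_sketch are illustrative only.

import torch
from torch.utils.data import DataLoader, Dataset

class PreprocessedDataset(Dataset):
    def __init__(self, data, preprocess):
        self.data = data
        self.preprocess = preprocess              # everything before `batch`

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.preprocess(self.data[idx])    # runs in the data loader workers

def eval_apply_sketch(preprocess, forward_pass, postprocess, data,
                      batch_size=2, num_workers=0):
    loader = DataLoader(PreprocessedDataset(data, preprocess),
                        batch_size=batch_size, num_workers=num_workers)
    with torch.no_grad():                         # eval mode: no gradients
        for batch_ in loader:
            output = forward_pass(batch_)         # batched; on the GPU if the model is
            for row in output:                    # "unbatch": iterate over the rows
                yield postprocess(row)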

Accomplishments that we're proud of

Completely self contained model saving

With PADL we are able to save and load models which require multiple imports, data blobs, weights, layers and so on, without specifying additional paths, packages or files. Everything happens via introspection and code navigation of the created Transform object. This gives the assurance that training results may be easily loaded and recovered at any point down the line.

Model reproducibility

Due to the human-readable format of the saved output, it is super easy to inspect a previously trained model and even modify it in a new experiment. This alleviates a key pain point in the data-science life cycle, namely reproducibility.

An enjoyable developer experience

We found that the functional philosophy applied at the high level of model structure, combined with the object-oriented PyTorch approach applied at the level of individual layers and the forward pass, provides an optimal marriage and gets very close to a mental model which data scientists enjoy working in. The operator overloading means that the way Transform objects are written is visually close to the way data scientists think about their models.

What we learned

New features of Python 3, such as advanced introspection, have enabled new ways of working for Python developers -- as used, for example, in pytest. We learned that these features can be used to great advantage when working with PyTorch code and models. By building on top of the great PyTorch data API, we learned that PADL's Transform formalism can provide intuitive and easy-to-use abstractions for deep learning development and deployment.

What's next for PADL

In the next steps we plan to support:

  • Model conversion/ interchangeability with MAR files so that, for example, a simple interface with TorchServe may be built.
  • A simple interface to pytorch-lightning.
  • Support for arbitrary serialization.
  • Model import from torch model repository.
  • Support for skip-connections between diverse points on the Transform computation graph.
