Torchtext dataset classes provide a convenient, high-level interface for reading natural language data. However, they do not provide a way to apply data transformations at read time the way torchvision does. Further, they return data in a vectorized format, which precludes applying data augmentation policies after reading. Current state-of-the-art results in many domains rely on some form of augmentation for accuracy, so it is important to be able to do the same with text data.

What it does

niacin is a Python library with a collection of common text data augmentation functions, like backtranslation, word order swapping, and synonym replacement. Previously, it was not usable with PyTorch dataloader classes, because torchtext dataset classes did not support transformations. niacin now includes torchtext-like dataset classes that can apply an arbitrary number of input transformations before vectorizing the data. Additionally, niacin now includes an implementation of RandAugment, a tunable policy for applying augmentation functions that has produced results comparable with more involved policies, like AutoAugment.
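To make the idea concrete, here is a minimal sketch of the two pieces described above: a dataset class that applies string-level transforms at read time (before any vectorization), and a RandAugment-style policy that samples n transforms at random per example. The class and function names, and the two toy augmentation functions, are illustrative stand-ins, not niacin's actual API.

```python
import random

# Toy augmentation functions standing in for niacin's (which include
# synonym replacement, word-order swapping, backtranslation, etc.).
def swap_adjacent_words(text):
    """Swap one randomly chosen pair of adjacent words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = random.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def duplicate_random_word(text):
    """Repeat one randomly chosen word."""
    words = text.split()
    if not words:
        return text
    i = random.randrange(len(words))
    return " ".join(words[: i + 1] + [words[i]] + words[i + 1 :])

class TransformTextDataset:
    """Torchtext-like dataset that applies transforms to the raw
    string at read time, before tokenization/vectorization."""

    def __init__(self, examples, transforms=()):
        self.examples = list(examples)
        self.transforms = list(transforms)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        text = self.examples[idx]
        for fn in self.transforms:  # applied in order, on raw strings
            text = fn(text)
        return text  # vectorize downstream, after augmentation

def rand_augment(transforms, n):
    """RandAugment-style policy: a callable that applies n transforms
    sampled uniformly at random for each example it sees."""
    def apply(text):
        for fn in random.choices(transforms, k=n):
            text = fn(text)
        return text
    return apply
```

A usage example: `TransformTextDataset(corpus, transforms=[rand_augment([swap_adjacent_words, duplicate_random_word], n=2)])` yields a freshly augmented string each time an item is read, so a dataloader sees a different perturbation every epoch.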

What's next for Easy text data augmentation in PyTorch

Currently, the augmentation functions inside niacin are restricted to English, which limits their utility in much of the world. Future work will add support for a broader variety of languages.
