Turing Natural Language Generation (T-NLG) is a 17 billion parameter language model by Microsoft that outperforms the state of the art on many downstream NLP tasks. We present a demo of the model, including its freeform generation, question answering, and summarization capabilities, to academics for feedback and research purposes.  – This summary was generated by the Turing-NLG language model itself.

On February 10th, 2020, Microsoft announced T-NLG (Turing Natural Language Generation), at 17 billion parameters the largest natural language processing (NLP) model published to date. T-NLG outperforms state-of-the-art NLP models on a variety of language modelling benchmarks and also excels at practical tasks such as summarization and question answering.

Massive deep learning language models with billions of parameters, such as BERT (by Google) and GPT-2 (by OpenAI), have improved the state of the art on nearly every natural language processing (NLP) task; T-NLG outperforms them all. Training such a large model was made possible by ZeRO (Zero Redundancy Optimizer), part of DeepSpeed, a new open-source library announced by Microsoft on the same day. DeepSpeed makes model training more efficient by improving scale, speed, cost, and usability, unlocking the ability to train models with up to 100 billion parameters. It is built on top of PyTorch and provides a simple API that lets engineers apply training parallelization techniques with just a few lines of code.
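As a rough illustration of that API, here is a minimal sketch of a DeepSpeed training step. The toy model, batch size, and config values are placeholder assumptions, not from the announcement; deepspeed.initialize, model_engine.backward, and model_engine.step are the library's documented entry points (recent versions accept a config dict directly via the config argument).

```python
import torch
import deepspeed

# Placeholder model and configuration: illustrative values only.
model = torch.nn.Linear(1024, 1024)
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that handles
# distributed setup, mixed precision, and optimizer state management.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# The training loop changes in only two places: backward() and step()
# are called on the engine instead of on the loss and optimizer.
inputs = torch.randn(32, 1024, device=model_engine.device, dtype=torch.half)
targets = torch.randn(32, 1024, device=model_engine.device, dtype=torch.half)
loss = torch.nn.functional.mse_loss(model_engine(inputs), targets)
model_engine.backward(loss)
model_engine.step()
```

A script like this is normally started with the deepspeed launcher (e.g. deepspeed train.py), which sets up the distributed environment for the engine.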

Training a model with more than 1 billion parameters runs out of memory even on a GPU with 32 GB of memory, and data parallelism cannot overcome this limit because it replicates the full model on every GPU; model parallelism reduces per-GPU memory but is complex to apply and adds communication overhead. This is where ZeRO comes into play: it addresses the limitations of data parallelism and model parallelism while achieving the merits of both, by partitioning the model states (optimizer states, gradients, and parameters) across the data-parallel processes instead of replicating them.
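To see why the memory runs out, here is a back-of-the-envelope sketch following the accounting in the ZeRO paper for Adam with fp16 mixed precision, which needs roughly 16 bytes of model state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). The function and the stage formulas are illustrative assumptions distilled from that paper, not DeepSpeed code, and they ignore activations, buffers, and fragmentation.

```python
def memory_per_gpu_gb(params_billion: float, num_gpus: int, stage: int) -> float:
    """Approximate model-state memory per GPU in GB for Adam + fp16,
    assuming 2 B fp16 params + 2 B fp16 grads + 12 B fp32 optimizer
    states per parameter, partitioned according to the ZeRO stage."""
    psi = params_billion * 1e9
    if stage == 0:    # plain data parallelism: everything replicated
        b = 16 * psi
    elif stage == 1:  # partition optimizer states across GPUs
        b = (2 + 2) * psi + 12 * psi / num_gpus
    elif stage == 2:  # also partition gradients
        b = 2 * psi + (2 + 12) * psi / num_gpus
    else:             # stage 3: also partition the parameters themselves
        b = 16 * psi / num_gpus
    return b / 1e9

# A 1.5B-parameter model on 32 GPUs:
print(memory_per_gpu_gb(1.5, 32, 0))  # ~24.0 GB per GPU, before activations
print(memory_per_gpu_gb(1.5, 32, 2))  # ~3.7 GB per GPU with ZeRO stage 2
```

Even at 1.5 billion parameters, the replicated model states alone approach a 32 GB GPU's capacity once activations are added, which is why partitioning them pays off.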

DeepSpeed excels in four key aspects:

Scale: DeepSpeed provides system support to run models of up to 100 billion parameters, a 10x improvement over other training optimization frameworks.

Speed: In initial tests, DeepSpeed showed 4x-5x higher throughput than other libraries.

Cost: Improved throughput translates into significantly reduced training cost. For example, to train a model with 20 billion parameters, DeepSpeed requires three times fewer resources.

Usability: DeepSpeed does not require refactoring PyTorch models and can be enabled with just a few lines of code, as the configuration sketch after this list shows.
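To illustrate that last point, the sketch below shows a DeepSpeed configuration that turns on ZeRO and mixed precision without touching the model code. The specific values are placeholder assumptions; zero_optimization, fp16, train_batch_size, and optimizer are real configuration keys from the library's documentation.

```python
# Hypothetical ds_config.json contents, expressed as a Python dict.
# Only the configuration changes; the PyTorch model stays as-is.
ds_config = {
    "train_batch_size": 256,               # placeholder global batch size
    "fp16": {"enabled": True},             # mixed-precision training
    "zero_optimization": {"stage": 1},     # partition optimizer states (ZeRO)
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}
```

Saved as ds_config.json, such a file is typically passed to a training script through the deepspeed launcher, e.g. deepspeed train.py --deepspeed_config ds_config.json.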
