Two of my past projects involved training a large number of embeddings, and the GPU I was using could not hold all of them at the embedding dimension I wanted. I noticed that during each training step, only a small fraction of the total embeddings were updated; the rest sat idle and could be kept off GPU RAM when not in use.
For this I needed fast CPU <-> GPU transfer. Transferring from a Pytorch pinned CPU tensor to a Pytorch Cuda Variable was decent enough, but the reverse was too slow for training in a reasonable amount of time, so I started looking at other options. In response to a StackOverflow question I asked [ https://stackoverflow.com/questions/57752516/how-to-use-cuda-pinned-zero-copy-memory-for-a-memory-mapped-file ], StackOverflow user Robert Crovella developed a new type of tensor (a Cupy tensor backed by pinned CPU memory), which turned out to be very fast at transferring data to/from the GPU.
What it does
Fast CPU <-> GPU data transfer to/from Pytorch Cuda Variables. Applications include:
Incorporate SpeedTorch into your data pipelines for fast CPU <-> GPU data transfer.
Augment training parameters via CPU storage.
Use the Adadelta, Adamax, RMSprop, Rprop, ASGD, AdamW, and Adam optimizers for sparse embeddings training. Previously, only SparseAdam, Adagrad, and SGD were suitable, since only these directly support sparse gradients.
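The sparse-training pattern behind the last two points can be sketched as follows. This is a minimal CPU-only illustration using plain NumPy arrays as stand-ins for the off-GPU store and the on-GPU parameter block; the names (`cpu_store`, `gpu_block`, `active_rows`) are hypothetical and not SpeedTorch's actual API.

```python
# Sketch of the gather -> train -> scatter pattern used for sparse
# embedding training, with NumPy arrays standing in for CPU/GPU memory.
import numpy as np

rng = np.random.default_rng(0)

# The full embedding table lives off-GPU; only a handful of rows are
# touched per training step.
cpu_store = rng.standard_normal((1000, 8)).astype(np.float32)

# Rows referenced by the current minibatch.
active_rows = np.array([3, 42, 7])

# 1) Gather: copy just the active rows into a small working block
#    (in SpeedTorch this would be the fast transfer onto the GPU).
gpu_block = cpu_store[active_rows].copy()

# 2) Train: the optimizer updates only this small block
#    (a dummy additive step here, in place of a real gradient update).
gpu_block += 0.01

# 3) Scatter: write the updated rows back into the off-GPU store.
cpu_store[active_rows] = gpu_block
```

Because only the small working block ever has to sit in GPU RAM, any dense optimizer can be applied to it, which is what lifts the restriction to sparse-gradient-aware optimizers.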
Here are some quick getting started videos:
Using the dataGadget: https://www.youtube.com/watch?v=ka4uoIkv0tA
Use in embeddings training: https://www.youtube.com/watch?v=jKu7vCxFhUc
How I built it
With Pytorch, Numpy, and Cupy.
Challenges I ran into
I was initially looking for a way to store the embeddings on disk and use memory mapping for fast storage and retrieval. I discovered Cupy memory maps, and when I benchmarked the speeds for word2vec training, they were amazing. I thought I had discovered something huge, so I started building a library around this.
A few weeks in, the library was nearly complete, and I wanted to test it on my larger embeddings datasets. But something was wrong: my system kept crashing due to out-of-memory errors. To my horror, I found out that even though memory maps were being used, they still took up as much memory as live tensors, which defeated the purpose of the library I had created. If you look at the Github repo, it still contains the original memory map code [ https://github.com/Santosh-Gupta/SpeedTorch/blob/master/SpeedTorch/CPUCupyPinned.py ], I still haven't had the heart to delete it haha.
I was pretty bummed out, thinking I had wasted a few weeks, and started shifting focus to a different project I could submit to the Pytorch hackathon; I still had 2 weeks. But I did remember that in some cases, the Cupy memmaps were actually faster than Pytorch tensors at transferring data to/from a Pytorch Cuda Variable.
I wondered if the same could be done with a Cupy tensor on pinned CPU memory, so I asked on StackOverflow [ https://stackoverflow.com/questions/57752516/how-to-use-cuda-pinned-zero-copy-memory-for-a-memory-mapped-file ]. While such a method didn't officially exist, StackOverflow user Robert Crovella created his own.
When I did the speed benchmarks of this new tensor datatype, I couldn't believe some of the results. Robert Crovella created something amazing.
Accomplishments that I'm proud of
In addition to the library itself, I was able to use SpeedTorch to finally create a rare book recommender that I've been wanting to make for over a year now.
The original version of this project [ https://github.com/Santosh-Gupta/lit2vec ] only contained 10,000 popular books, but I didn't really need a recommender for popular books; I wanted one for rare books. Thanks to SpeedTorch, I was able to create 400-dimensional embeddings for 2,829,853 books. I actually could have created larger embeddings, but I didn't have the Google Drive space (though that's a much easier problem to solve).
There's a similar story for an update to a research paper recommender I created last year.
For the original [ https://github.com/Santosh-Gupta/research2vec ] I could only train 2-4 million research paper embeddings, at embedding sizes 64-80. I can now train 14,886,544 paper embeddings at an embedding size of 188.
Though these gains were partially due to Google Colab upgrades, SpeedTorch lets you nearly double the number of parameters trainable in Colab by hosting half of them on the CPU, using Colab's 25 GB CPU instances.
As long as you have enough CPU RAM, you can host any number of embeddings without having to worry about the GPU RAM.
What I learned
I learned A LOT of Pytorch. A month and a half ago, I had never actually used Pytorch. But I received an invite to an official Pytorch hackathon at the Facebook headquarters in Menlo Park. It was an opportunity I couldn't pass up, so I spent the week before the hackathon learning Pytorch. During the hackathon, I got to meet other Pytorch users, and even some of the Pytorch developers, and I learned a lot about it.
My biggest takeaway was how pliable and Pythonic Pytorch is, and how easy it is to alter the weights of the model's variables and the optimizer, which led to my idea for a library for sparse training.
What's next for SpeedTorch
I'm sure there are a lot of other applications for SpeedTorch beyond fast data transfer and embeddings training. If you have an idea, please post an issue on the project Github, or post in the Gitter [ https://gitter.im/SpeedTorch ], I would love to hear it!