
Final Writeup

Project Name: Binarized Neural Networks

Team: Binary Bros

  • Arvind Sridhar (asridh13)
  • Nicholas Masi (nmasi)

Introduction

  • Original Paper

  • This paper from 2016 introduces the original idea behind BNNs (“neural networks with binary weights and activations at run-time”) to decrease the memory footprint and runtime of models while also improving power efficiency. Our group reimplemented this paper, applying BNNs to Fashion MNIST, a dataset of tens of thousands of clothing images spanning 10 categories (a loading sketch follows below). Our project tackled several of the problem types covered in class, with the main goal of reducing the runtime and memory usage of models in both training and testing while maintaining relatively high accuracy. Additionally, we used Larq, a collection of open-source Python packages for building, training, and deploying Binarized Neural Networks, along with TensorFlow features such as TFLite files to gather actionable metrics for quantifying our results.
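
For reference, here is a minimal sketch of loading Fashion MNIST with the tf.keras datasets API; the [-1, 1] pixel scaling shown is an illustrative assumption, not necessarily the exact preprocessing in our notebook.

```python
import tensorflow as tf

# Fashion MNIST: 60,000 training and 10,000 test images (28x28 grayscale),
# each labeled with one of 10 clothing categories.
(train_images, train_labels), (test_images, test_labels) = \
    tf.keras.datasets.fashion_mnist.load_data()

# Scale pixels from [0, 255] to [-1, 1]; a common choice for BNNs, assumed here.
train_images = train_images.astype("float32") / 127.5 - 1.0
test_images = test_images.astype("float32") / 127.5 - 1.0

print(train_images.shape, test_images.shape)  # (60000, 28, 28) (10000, 28, 28)
```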

Methodology

  • We replicated the model architecture from Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, the aforementioned original paper on BNNs, in building our models for classification. Our BNN is a recreation of their binary model using Larq, and the vanilla model has near-identical architecture aside from full-precision (float32) activations and weights, making it a control. Both models use the ADAM optimizer with an exponentially decaying learning rate, initialize weights with Glorot Uniform, train on batches of 100, and apply Batch Normalization after each layer; Batch Normalization is considered an essential property of BNNs, as described by Simons and Lee (2019).
  • Since the derivative of the Sign (Signum) function, which serves as the deterministic binary quantizer in our BNN, is zero almost everywhere (ruining backpropagation), we used the Saturated Straight-Through Estimator (Saturated STE) to backpropagate through the BNN; the two are implemented together as the “ste_sign” quantizer in the Larq library. The authors use L2-Hinge Loss for the BNN and claim it outperforms Softmax at certain classification tasks, following the work of Tang (2013), but in our experimentation our Larq-implemented BNN was unable to learn with it, so we used a Softmax activation on the last layer of both models along with Sparse Categorical Cross-entropy as the loss function.
  • The paper’s MLP architecture consists of 3 hidden layers of 4,096 units trained for 1,000 epochs (we trained for 10 for the sake of time). Larq has not yet optimized binary dense layers, so we implement the dense layers in our BNN as 2D convolution layers with 1x1 kernels, strides of 1, and no padding; this yields the same architecture while using an optimized binary operation. We follow all of these architecture cues in our BNN and apply them to the vanilla model where appropriate (e.g., no quantization or STE in the vanilla model), ensuring that variations in accuracy, runtime, and memory between the vanilla and binarized models are as attributable to binarization as possible and recreating conditions as close to the original paper as our resources allowed. A sketch of this architecture in Larq follows after this list.
  • We train and test our models on the fashion_mnist dataset, a harder drop-in replacement for the MNIST benchmark. All our code runs in a Colab notebook, including data preprocessing, model building, training, testing, and analysis. We use the TFLite Analyzer to measure the size of the models when serialized as TFLite files intended for small hardware, and we assess accuracy on the test data after training for 10 epochs.
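
The sketch below shows one way to express this architecture with Larq, assuming the details above: the ste_sign quantizer, 1x1 QuantConv2D layers in place of dense layers, Batch Normalization after every layer, a Softmax output, and ADAM with an exponentially decaying learning rate. The weight_clip constraint, the decay-schedule constants, and the reuse of the arrays from the loading sketch above are illustrative assumptions rather than an exact reproduction of our notebook.

```python
import tensorflow as tf
import larq as lq

# Options shared by the hidden binarized layers: "ste_sign" binarizes weights and
# activations to +1/-1 on the forward pass and backpropagates with the saturated STE.
binary_kwargs = dict(
    input_quantizer="ste_sign",
    kernel_quantizer="ste_sign",
    kernel_constraint="weight_clip",  # keep latent float weights in [-1, 1]
    use_bias=False,
)

def build_bnn(hidden_units=4096, num_classes=10):
    """Binarized MLP with dense layers expressed as 1x1 QuantConv2D layers."""
    return tf.keras.Sequential([
        # View each 28x28 image as a 1x1 "spatial" map with 784 channels, so a
        # 1x1 convolution is mathematically identical to a dense layer.
        tf.keras.layers.Reshape((1, 1, 784), input_shape=(28, 28)),
        # The first layer sees full-precision pixels, so only its weights are binarized.
        # Glorot Uniform is the default kernel initializer for these layers.
        lq.layers.QuantConv2D(hidden_units, 1, strides=1, padding="valid",
                              kernel_quantizer="ste_sign",
                              kernel_constraint="weight_clip", use_bias=False),
        tf.keras.layers.BatchNormalization(),
        lq.layers.QuantConv2D(hidden_units, 1, strides=1, padding="valid", **binary_kwargs),
        tf.keras.layers.BatchNormalization(),
        lq.layers.QuantConv2D(hidden_units, 1, strides=1, padding="valid", **binary_kwargs),
        tf.keras.layers.BatchNormalization(),
        lq.layers.QuantConv2D(num_classes, 1, strides=1, padding="valid", **binary_kwargs),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Activation("softmax"),
    ])

# ADAM with an exponentially decaying learning rate (the constants are placeholders).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3, decay_steps=600, decay_rate=0.9)

bnn = build_bnn()
bnn.compile(optimizer=tf.keras.optimizers.Adam(lr_schedule),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])

# train_images/train_labels come from the Fashion MNIST loading sketch above.
bnn.fit(train_images, train_labels, batch_size=100, epochs=10,
        validation_data=(test_images, test_labels))
```

The vanilla control follows the same layout but uses full-precision layers, with no quantizers and no STE.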

Results

  • Our results were largely successful in replicating those of the original paper. We conducted multiple rounds of training and testing and found that the accuracy difference between the BNN and the vanilla model was negligible. This was both surprising and impressive: one would typically expect BNN accuracy to be slightly lower than vanilla accuracy due to the precision lost by binarizing weights and activations at run time, but that was not the case for our BNN. Though we were unable to track power efficiency due to hardware limitations (detailed in the Challenges section), we recorded an 88% decrease in model file size from the vanilla model to the BNN; a sketch of how this comparison can be produced follows below. This efficiency metric, especially combined with a negligible accuracy difference, shows the potential of BNNs in optimized deep learning practice. Finally, we were not able to accurately gauge runtime, again due to hardware limitations. Our BNN took longer to train on Colab than the vanilla model, but that is because Larq's optimizations do not target Colab's GPU runtime; if we were to run the models on the 64-bit ARM hardware they are designed for, we would expect the binarized model's runtime to drop significantly relative to the vanilla model.
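
Here is a rough sketch of how such a file-size comparison can be produced, assuming the standard TFLiteConverter for the vanilla model and the Larq Compute Engine converter (larq_compute_engine.convert_keras_model) for the BNN, which bit-packs the binary weights; vanilla_model stands in for the hypothetical float32 control and bnn is the model from the Methodology sketch. This illustrates the measurement rather than reproducing our notebook exactly.

```python
import tensorflow as tf
import larq_compute_engine as lce

# Vanilla float32 model -> standard TFLite flatbuffer.
vanilla_tflite = tf.lite.TFLiteConverter.from_keras_model(vanilla_model).convert()

# BNN -> TFLite via the Larq Compute Engine converter, which packs binary weights into bits.
bnn_tflite = lce.convert_keras_model(bnn)

print(f"Vanilla: {len(vanilla_tflite) / 1e6:.2f} MB")
print(f"BNN:     {len(bnn_tflite) / 1e6:.2f} MB")
print(f"File size reduction: {100 * (1 - len(bnn_tflite) / len(vanilla_tflite)):.1f}%")

# Optional: inspect the serialized model layer by layer with the TFLite Analyzer
# (available in recent TensorFlow releases).
tf.lite.experimental.Analyzer.analyze(model_content=bnn_tflite)
```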

Challenges

One of the initial problems we ran into was adapting code from previous class assignments into a form suited to binarization. For example, we started from the MNIST assignment, later swapping in Fashion MNIST as the dataset, and retrofitted our implementation to use Keras. A further difficulty was that our project used Google Colab, since the TensorFlow Profiler we had originally planned to use to collect statistics required machine specifications unavailable to the members of the team. However, Larq's optimizations target 64-bit ARM architecture and Android devices, so it could not show runtime/memory improvements on Colab, which uses GPUs. We were eventually able to use the TFLite Analyzer to evaluate memory usage, but we couldn't measure optimized runtime (where we expect the BNN would show significant improvements over the vanilla model) because we didn't have any 64-bit ARM computers accessible to us.

Reflection

  • Our project turned out very successful, not just in the results we obtained, but more importantly in the conceptual understanding of binarization and the power it could have as a future industry and research practice for deep learning. At the beginning of the project, we established our stretch goal as creating BNNs that perform with only minor drops in accuracy while significantly improving runtime and memory usage. With the exception of runtime, due to hardware limitations, we were excited to meet this goal, with even better accuracy than we expected.
  • Our implementation changed significantly from what we expected at the beginning of the project. For example, we had planned to use the TensorFlow Profiler extensively to document the statistics we wanted to compare between the vanilla model and the BNN, but we ran into significant hardware difficulties and had to adapt accordingly. We changed the platform on which we built our implementation, and after realizing we could not track some of the main statistics we were hoping to see on that platform, we came up with the idea of using TFLite files to record additional statistics for our models.
  • If we could do our project over again, and more generally for future projects, we would make sure our implementation plans took hardware, and especially hardware limitations, into account. If we had more time to build on what we’ve done so far, we would apply binarization to other types of DL models explored in the class, such as RNNs and transformers. Additionally, there is a lot of ongoing work in the field on the importance of weight randomization in binarized models; future studies could delve into the mathematics of weight randomization schemes that would optimize the statistics explored in this project.
  • Our biggest takeaways from the project were optimism for the future of binarization in deep learning practice and the benefits it could have, up to and including environmental accountability. We are very glad to have gotten practice in understanding how this novel concept works, implementing it, and seeing firsthand how it stacks up against the conventional models we have used so far in our experience with deep learning.
