Binary neural networks are networks with binary weights and activations at run time. At training time these weights and activations are used for computing gradients; however, the gradients and true weights are stored in full precision. This procedure allows us to effectively train a network on systems with fewer resources.
In this article, we delve into the theory behind binary neural networks (BNNs), their training procedure, and their performance.
For forward propagation, we need two binary matrices; we thus binarize the weight matrix and the incoming activation from the previous layer.
The binarization procedure is illustrated below.
Gradient Propagation Through Discretization
The derivative of the sign function is zero almost everywhere, making it incompatible with backpropagation. Thus, a straight-through estimator is used. This preserves the gradient's information and cancels large gradients.
The gradient propagation procedure is illustrated below.
Forward Pass for Binary Matrices
One can see below how a matrix composed of only +/- 1s can be converted to a binary matrix, and replace fused multiply–add (FMA) operations with bit manipulations to give the same results.
We shall be using this observation to generate binary matrices, which allow us to gain significant speedup and decrease the memory overhead notably.
It is important to note that the popcount function increments for every ON bit and decrements for every OFF bit. This is not how __builtin_popcount function of GCC works. However, the __builtin_popcount function can be used in the following fashion (if using unsigned int to bit-pack matrices).
value = 2*(__builtin_popcount(~(A^B))) - 32
In-depth information about bit-packing and matrix multiplication using this strategy can be found in the blog Art’Em – Artistic Style Transfer to Virtual Reality Week 4 Update.
Implementing Binary Neural Networks
A bare-bones version of a BNN of the structure below was implemented in the Wolfram Language. The network was trained on the Intel® CPUs on the MNIST* dataset.
This network has the following layers:
- Fully connected (128)
- Ramp - rectified linear unit (ReLU) activation function
- Binarize activations
- Fully connected (128)
- Ramp - ReLU activation function
- Binarize activations
- Fully connected (10)
- Sigmoid activation function
Please note that the input layer is left unbinarized. This is due to the high-dimensional geometry of neural networks. After numerical experiments on CIFAR10* were conducted, it was seen that the dot products of activations with the pre-binarization and post-binarization weights are highly correlated. However, this correlation is weaker in the first layer. This is a strong reason to never binarize the first layer of a BNN.
Dot Product Preservation
Each figure above shows a 2D histogram of the dot products between the binarized weights and the activations (x-axis), and the dot products between the continuous weights and the activations (y-axis).
Angle Preservation Property
It has been demonstrated that binarization approximately preserves the direction of high-dimensional vectors. The angle between a random vector and its binarized version converges to approximately 37 degrees as the vector dimension approaches infinity.
This network uses Adam* optimizer. For the implementation of the BNN, you can find the file as "3LayBNN Easy.nb" in the following GitHub* repository.
The BNN of the architecture previously provided was trained on the MNIST dataset. We can clearly see that the BNN has slower convergence than the full precision counterpart. However, we also observe that a fairly low parameter network (the BNN) is able to closely mimic the performance of the full precision network.
These metrics further validate the legitimacy of the dot product preservation and the angle preservation property discussed above.
Speedup Analysis For General Matrix to Matrix Multiplication
After studying the classic matrix multiplication algorithm (CMMA) implemented in Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System, I implemented my own, similarly optimized version of a binarized CMMA algorithm. Since the memory accesses required for the binarized CMMA algorithm are significantly lesser, we saw a great increase in performance.
The topic of tuning has a wide scope; however, it is evident that the binary classical matrix multiplication algorithm (binCMMA) is faster than the full precision classical matrix multiplication algorithm (FP CMMA). We see a significant increase in performance with minimal optimization.
This algorithm was tested on the Intel® Xeon® Gold processor from the Intel® AI DevCloud.
In the next article, Code Sample: Optimizing Binarized Neural Networks on Intel® Xeon® Scalable Processors we will conduct a performance analysis for the binCMMA algorithm.
- Alexander G. Anderson and Cory P. Berg, May 2017, The High-Dimensional Geometry of Binary Neural Networks. CoRR, abs/1705.07199.
- Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System.