Code Sample: Optimizing Binarized Neural Networks on Intel® Xeon® Scalable Processors

File(s): Download
License: Intel Sample Source Code License Agreement

Optimized for...
OS: Linux*
Hardware: Intel® Xeon® Gold 6128 processor
Software (programming language, tool, IDE, framework): OpenMP*, C++ (for kernel development), Wolfram Language (for model development)
Prerequisites: Access to Intel® AI DevCloud

In a previous article, we discussed the performance and accuracy of Binarized Neural Networks (BNNs). We also introduced a BNN coded from scratch in the Wolfram Language. The key component of this neural network is matrix multiplication. General Matrix-Matrix Multiplication (GEMM) is at the heart of deep learning: as the image below shows, almost 89 percent of the CPU forward time in a typical neural network is spent in convolutional and fully connected layers.

As a result, accelerating GEMM is exceedingly important. Convolutions can also be reduced to GEMM operations using the im2col algorithm. Therefore, developing good approximations to convolution could allow a 10x speedup in inference with respect to a full-precision convolution implementation.

Binarizing Convolutions

As mentioned above, it is possible to reduce convolutions to GEMM operations by utilizing the im2col algorithm, which is visualized below.

By virtue of the im2col algorithm, it is possible to develop approximations to convolution. im2col arranges the data so that memory accesses become regular, meaning the matrices can be read in a row-major pattern. This leads to significantly fewer cache misses; a cache miss occurs when the data requested by a component or application is not found in the cache memory.1 It is important to note, however, that im2col introduces a large memory overhead. This can be dealt with effectively during the binarization process: row-wise binarization is one of the dominant strategies for reducing model size while preserving the representational capacity of the network. A more in-depth study of this field will be required to truly parallelize convolutions.
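To make the data rearrangement concrete, here is a minimal im2col sketch in C for a single-channel input with unit stride and no padding. The function name, parameters, and layout are illustrative assumptions rather than code from the sample.

/* Minimal single-channel im2col sketch (unit stride, no padding).
 * Each k x k input patch becomes one column of the output matrix, so a
 * convolution turns into a GEMM between the flattened kernel (1 x k*k)
 * and the column matrix (k*k x out_h*out_w). */
void im2col(const float *in, int h, int w, int k, float *col)
{
    int out_h = h - k + 1;
    int out_w = w - k + 1;

    for (int y = 0; y < out_h; y++) {
        for (int x = 0; x < out_w; x++) {
            int patch = y * out_w + x;          /* column index */
            for (int ky = 0; ky < k; ky++) {
                for (int kx = 0; kx < k; kx++) {
                    int row = ky * k + kx;      /* row index */
                    col[row * (out_h * out_w) + patch] =
                        in[(y + ky) * w + (x + kx)];
                }
            }
        }
    }
}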

Implementing Binary Classic Matrix Multiplication on Intel® Xeon® Gold Processors

Execution

To run the code, download the file (xbin.c) using the download link above, upload it to the Intel® AI DevCloud, and run the following command:

icpc -O3 -xCORE-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 xbin.c -o xcmma.out && echo ~/xcmma.out | qsub

Note: This works as written only if the xbin.c file is saved in the home directory; otherwise, make sure the file paths are changed accordingly.

We will begin the process of developing this algorithm by first implementing the Classic Matrix Multiplication Algorithm (CMMA).

Unoptimized CMMA Loop Processing Scheme

for (int i = 0; i < m; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < p; k++) {
            pC[i][j] += pA[i][k] * pB[k][j];
        }
    }
}

There are a few issues with this unoptimized implementation of CMMA. pC[i][j] is read and written on every iteration of the innermost loop, which can be avoided by accumulating into a local variable. Moreover, we have not used any OpenMP* clauses to help parallelize this operation.

Most of the constructs in OpenMP are compiler directives.

    #pragma omp construct [clause [clause]...]

We will be using #pragma omp parallel to define a parallel region.

#pragma omp parallel [clauses]  
{  
   code_block  
}


Defines a parallel region, which is code that will be executed by multiple threads in parallel.

We will also define private variables.

private(var)


Each thread should have its own instance of the specified variable.

Optimized CMMA Loop Processing Scheme

int i, j, k;
float sum;
#pragma omp parallel for private(i, j, k, sum) num_threads(NUM_OF_THREADS)
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0f;
        for (k = 0; k < p; k++) {
            sum += pA[i][k] * pB[k][j];
        }
        pC[i][j] = sum;   /* single write per element instead of repeated read-modify-writes */
    }
}

Note: NUM_OF_THREADS is defined in the main code.

If pB holds the weights of a neural network, memory accesses can be made more efficient by transposing it, so that both matrices are traversed with a row-major access pattern. We will, however, omit that optimization for now, since it applies equally to binary multiplication and full-precision multiplication.
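For reference, the transposed-weight variant might look like the sketch below, where pBT is assumed to hold the transpose of pB so that both matrices are read row by row; this is an illustration, not part of the sample code.

/* Hypothetical variant with pre-transposed weights (pBT[j][k] == pB[k][j]).
 * Both pA and pBT are now traversed along their rows, keeping the
 * innermost loop's memory accesses contiguous. */
int i, j, k;
float sum;
#pragma omp parallel for private(i, j, k, sum) num_threads(NUM_OF_THREADS)
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0f;
        for (k = 0; k < p; k++) {
            sum += pA[i][k] * pBT[j][k];
        }
        pC[i][j] = sum;
    }
}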

Binarizing Activations

In a neural network, there are two elements: the activations and the weights. The weights for a trained neural network are fixed; however, the activations are dependent upon the input.

Therefore, we will introduce separate kernels for binarization and multiplication. This is done for clarity; combining the two processes gives only a marginal performance improvement.

int i, seg, j, sign;
unsigned int tbA;
#pragma omp parallel for private(i, seg, j, sign, tbA) num_threads(NUM_OF_THREADS)
for (i = 0; i < MX_SIZE; i++)
{
    /* Pack each group of 16 activations of row i into the low 16 bits of one unsigned int. */
    for (seg = 0; seg < MX_SIZE/16; seg++)
    {
        tbA = 0;
        for (j = 0; j < 16; j++)
        {
            sign = (int)(pA[i][seg*16 + j] >= 0);   /* 1 if the activation is non-negative, else 0 */
            tbA = tbA | (sign << j);
        }
        bA[i][seg] = tbA;
    }
}
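To put the packing in perspective: with a hypothetical MX_SIZE of 1024, each row of 1024 single-precision activations (4 KB) shrinks to 64 unsigned ints (256 bytes), a 16x reduction, since each 32-bit word carries 16 sign bits in its lower half.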

Notice that the second matrix being multiplied is the weight matrix, which has already been binarized.
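For completeness, the offline packing of the weight matrix could look like the sketch below. It assumes the weights pW are stored transposed, one row per output neuron, which matches how bB[j][k] is indexed in the multiplication kernel; pW and the loop structure are illustrative assumptions.

/* Hypothetical offline packing of the (transposed) weight matrix pW into bB.
 * Same scheme as the activation packing: 16 sign bits per unsigned int. */
int j, seg, b, sign;
unsigned int tbB;
for (j = 0; j < MX_SIZE; j++) {
    for (seg = 0; seg < MX_SIZE/16; seg++) {
        tbB = 0;
        for (b = 0; b < 16; b++) {
            sign = (int)(pW[j][seg*16 + b] >= 0);
            tbB = tbB | (sign << b);
        }
        bB[j][seg] = tbB;
    }
}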

Figure: Binarization time vs. matrix size

Binary Multiplication

Figure: FP CMMA vs. binCMMA

Below is the C code for binary multiplication; its performance relative to full-precision CMMA is shown in the figure above. This benchmark was discussed in a previous article.

Keeping in mind the constructs used in the "Binarizing Activations" section, inspect the code below for binary multiplication. We divide MX_SIZE (the matrix size) by 16 because each element of bA and bB packs 16 binarized values.

int i, j, k;
float temp;
#pragma omp parallel for private(i, j, k, temp) num_threads(NUM_OF_THREADS)
for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        temp = 0;
        for (k = 0; k < MX_SIZE/16; k++) {
            /* XNOR + popcount gives the dot product of 16 binarized values;
             * the -48 corrects for the 16 unused upper bits set by the ~. */
            temp += 2 * (__builtin_popcount(~(bA[i][k] ^ bB[j][k]))) - 48;
        }
        pC[i][j] = temp;
    }
}
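The constant 48 follows from the XNOR-popcount identity: for two ±1 vectors of length 16, the dot product equals 2 x popcount(XNOR) - 16. Because the packed values occupy only the lower 16 bits of each 32-bit word, the ~ also sets the 16 unused upper bits, adding 16 to the popcount, so the correction becomes 2 x 16 + 16 = 48. The small self-check below, with made-up bit patterns, illustrates the identity; it is a verification sketch, not part of the sample.

#include <stdio.h>

/* Self-check of the XNOR-popcount identity used in the binary kernel.
 * a_bits and b_bits each pack 16 sign bits (bit set => +1, clear => -1)
 * into the low half of a 32-bit word; the values are arbitrary. */
int main(void)
{
    unsigned int a_bits = 0x0000A5F0u;
    unsigned int b_bits = 0x00003C69u;

    /* Reference dot product over the 16 packed +/-1 values. */
    int ref = 0;
    for (int b = 0; b < 16; b++) {
        int av = ((a_bits >> b) & 1) ? 1 : -1;
        int bv = ((b_bits >> b) & 1) ? 1 : -1;
        ref += av * bv;
    }

    /* XNOR-popcount form: the ~ also sets the 16 unused upper bits,
     * hence the -48 (= 2*16 + 16) correction. */
    int fast = 2 * __builtin_popcount(~(a_bits ^ b_bits)) - 48;

    printf("reference = %d, xnor-popcount = %d\n", ref, fast);
    return 0;
}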

Figure: After #pragma vs. unoptimized

As the graph above shows, simply adding the #pragma clause yields a consistent 10-fold speed-up. Further optimizations could involve parallelizing the loops with pointers to row arrays, as sketched below.
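As a rough illustration of that idea (an assumption about what such an optimization might look like, not the article's code), the row pointers can be hoisted out of the inner loops so the compiler does not re-derive pC[i], bA[i], and bB[j] on every iteration:

/* Hypothetical row-pointer variant of the binary kernel. */
int i, j, k;
float temp;
#pragma omp parallel for private(i, j, k, temp) num_threads(NUM_OF_THREADS)
for (i = 0; i < m; i++) {
    const unsigned int *rowA = bA[i];   /* hoisted row pointers */
    float *rowC = pC[i];
    for (j = 0; j < n; j++) {
        const unsigned int *rowB = bB[j];
        temp = 0;
        for (k = 0; k < MX_SIZE/16; k++) {
            temp += 2 * (__builtin_popcount(~(rowA[k] ^ rowB[k]))) - 48;
        }
        rowC[j] = temp;
    }
}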

Intel® Development Tools Used

The project was developed on the Intel® AI DevCloud (which uses Intel® Xeon® Scalable processors). Guidance from the Intel® AI Developer Program forum was also used when optimizing for Intel Xeon processors.

More details about implementation:

Processor: Intel® Xeon® Gold 6128 processor
Multiprocessing API: OpenMP*
Programming language: C++ (for kernel development), Wolfram Language (for model development)
Compiler: icpc version 18.0.3 (gcc version 4.8.5 compatibility)

References

  1. Cache Miss
For more complete information about compiler optimizations, see our Optimization Notice.