Introducing DNN primitives in Intel® Math Kernel Library

    Deep Neural Networks (DNNs) are on the cutting edge of the Machine Learning domain. These algorithms received wide industry adoption in the late 1990s and were initially applied to tasks such as handwriting recognition on bank checks. Deep Neural Networks have been widely successful in this task, matching and even exceeding human capabilities. Today DNNs are used for image recognition, video and natural language processing, and complex visual understanding problems such as autonomous driving. DNNs are very demanding in terms of compute resources and the volume of data they must process. To put this into perspective, the modern image recognition topology AlexNet takes a few days to train on modern compute systems and uses slightly over 14 million images. Tackling this complexity requires well-optimized building blocks that reduce training time enough to meet the needs of industrial applications.

    Intel® Math Kernel Library (Intel® MKL) 2017 introduces the DNN domain, which includes the functions necessary to accelerate the most popular image recognition topologies, including AlexNet, VGG, GoogLeNet and ResNet.

    These DNN topologies rely on a number of standard building blocks, or primitives, that operate on data in the form of multidimensional sets called tensors. These primitives include convolution, normalization, activation and inner product functions along with functions necessary to manipulate tensors. Performing computations effectively on Intel architectures requires taking advantage of SIMD instructions via vectorization and of multiple compute cores via threading. Vectorization is extremely important as modern processors operate on vectors of data up to 512 bits long (16 single-precision numbers) and can perform up to two multiply and add (Fused Multiply Add, or FMA) operations per cycle. Taking advantage of vectorization requires data to be located consecutively in memory. As typical dimensions of a tensor are relatively small, changing the data layout introduces significant overhead; we strive to perform all the operations in a topology without changing the data layout from primitive to primitive.
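To make the vectorization numbers above concrete, here is a minimal sketch of the arithmetic in C. The function names are illustrative and are not part of Intel MKL. The sketch assumes the usual convention of counting each FMA as two floating-point operations (one multiply and one add) per vector lane.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements that fit in one SIMD register. */
size_t simd_lanes(size_t vector_bits, size_t element_bits) {
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    return vector_bits / element_bits;
}

/* Peak floating-point operations per cycle for one core.
 * Assumption: each FMA counts as 2 operations per lane. */
size_t peak_flops_per_cycle(size_t vector_bits, size_t element_bits,
                            size_t fma_per_cycle) {
    return simd_lanes(vector_bits, element_bits) * fma_per_cycle * 2;
}
```

For 512-bit vectors of 32-bit floats and two FMAs per cycle, this gives 16 lanes and 64 single-precision operations per cycle per core, a peak that is only approachable when the data for each vector sits consecutively in memory.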

Intel MKL provides primitives for the most widely used operations, implemented for a vectorization-friendly data layout:

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU)
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
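For readers unfamiliar with these operations, the sketch below shows naive reference versions of two of them, ReLU activation and maximum pooling, in plain C. These are not the Intel MKL implementations or its API. The function names, the single-plane row-major input, the square window, and the absence of padding are all simplifying assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* ReLU forward: dst[i] = max(src[i], 0). */
void relu_forward(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] > 0.0f ? src[i] : 0.0f;
}

/* Max pooling over one h x w plane stored row by row.
 * Square k x k window, stride s, no padding (assumptions of this sketch).
 * dst must hold out_h * out_w floats, where out_h = (h - k) / s + 1
 * and out_w = (w - k) / s + 1. */
void maxpool2d_forward(const float *src, size_t h, size_t w,
                       size_t k, size_t s, float *dst) {
    assert(k > 0 && s > 0 && k <= h && k <= w);
    size_t out_h = (h - k) / s + 1;
    size_t out_w = (w - k) / s + 1;
    for (size_t oy = 0; oy < out_h; ++oy) {
        for (size_t ox = 0; ox < out_w; ++ox) {
            /* Start from the top-left element of the window. */
            float best = src[(oy * s) * w + ox * s];
            for (size_t ky = 0; ky < k; ++ky) {
                for (size_t kx = 0; kx < k; ++kx) {
                    float v = src[(oy * s + ky) * w + (ox * s + kx)];
                    if (v > best) best = v;
                }
            }
            dst[oy * out_w + ox] = best;
        }
    }
}
```

An optimized primitive computes the same results but works on many channels and images at once, in a layout chosen so that the innermost loop maps onto SIMD instructions.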

Programming model

    The execution flow for a neural network topology includes two phases: setup and execution. During the setup phase the application creates descriptions of all DNN operations necessary to implement scoring, training, or other application-specific computations. To pass data from one DNN operation to the next, some applications create intermediate conversions and allocate temporary arrays when the output layout of one operation does not match the input layout of the next. This phase is performed once in a typical application and is followed by multiple execution phases, where the actual computations happen.

    During the execution step the data is fed to the network in a plain layout such as BCWH (batch, channel, width, height) and is converted to a SIMD-friendly layout. As data propagates between layers, the data layout is preserved; conversions are made only when needed to perform operations that the existing implementation does not support.
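The text does not spell out which SIMD-friendly layout is used internally, so the following is only a sketch of one common scheme: grouping channels into blocks of 16 so that the 16 channel values for a given pixel are adjacent in memory. The function names, the block size, and the assumed plain dimension order (batch, channel, height, width, with width varying fastest) are all assumptions of this example, not the Intel MKL layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 16 /* channels per block: 16 floats fill one 512-bit vector */

/* Floats needed for the blocked buffer; channels are padded up to a
 * multiple of BLOCK. */
size_t blocked_size(size_t n, size_t c, size_t h, size_t w) {
    size_t c_blocks = (c + BLOCK - 1) / BLOCK;
    return n * c_blocks * h * w * BLOCK;
}

/* Convert plain [n][c][h][w] to blocked [n][c / BLOCK][h][w][c % BLOCK].
 * Padding channels are filled with zeros. */
void plain_to_blocked(const float *src, float *dst,
                      size_t n, size_t c, size_t h, size_t w) {
    assert(src != NULL && dst != NULL);
    size_t c_blocks = (c + BLOCK - 1) / BLOCK;
    memset(dst, 0, blocked_size(n, c, h, w) * sizeof(float));
    for (size_t in = 0; in < n; ++in)
        for (size_t ic = 0; ic < c; ++ic)
            for (size_t ih = 0; ih < h; ++ih)
                for (size_t iw = 0; iw < w; ++iw) {
                    size_t s = ((in * c + ic) * h + ih) * w + iw;
                    size_t d = (((in * c_blocks + ic / BLOCK) * h + ih) * w
                                + iw) * BLOCK + ic % BLOCK;
                    dst[d] = src[s];
                }
}
```

The conversion touches every element once, which is why it pays to do it only at the network boundary and keep the blocked layout from layer to layer.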

 

    Intel MKL DNN primitives implement a plain C application programming interface (API) that can be used in existing C/C++ DNN frameworks. An application that calls Intel MKL DNN functions goes through the following stages:

    Setup stage: for a given DNN topology, the application creates all DNN operations necessary to implement scoring, training, or other application-specific computations. To pass data from one DNN operation to the next, some applications create intermediate conversions and allocate temporary arrays when the output and input data layouts do not match.

    Execution stage: the application calls the DNN primitives, which apply the DNN operations, including any necessary conversions, to the input, output, and temporary arrays.

    Examples of training and scoring computations can be found in the Intel MKL package directory: <mklroot>\examples\dnnc\source
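Independently of the shipped examples, the two-stage pattern itself can be sketched in a few lines of C. Everything in this sketch (net_t, op_fn, net_setup, net_execute, net_destroy, and the two toy operations) is hypothetical and is not the Intel MKL DNN API. It only shows the idea of creating operations and allocating intermediate buffers once, then running the chain many times. Layout conversions between operations are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical operation signature; NOT an Intel MKL type. */
typedef void (*op_fn)(const float *src, float *dst, size_t n);

typedef struct {
    op_fn  *ops;     /* one function per layer */
    float **buffers; /* buffers[i] holds the output of ops[i] */
    size_t  count;   /* number of layers */
    size_t  n;       /* elements per buffer */
} net_t;

/* Two toy operations standing in for real primitives. */
void op_scale2(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = 2.0f * src[i];
}
void op_relu(const float *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] = src[i] > 0.0f ? src[i] : 0.0f;
}

/* Release everything that net_setup allocated. */
void net_destroy(net_t *net) {
    if (net->buffers)
        for (size_t i = 0; i < net->count; ++i) free(net->buffers[i]);
    free(net->buffers);
    free(net->ops);
    net->buffers = NULL;
    net->ops = NULL;
    net->count = 0;
}

/* Setup stage: runs once. Records the operations and allocates one
 * output buffer per operation. Returns 0 on success, -1 on failure. */
int net_setup(net_t *net, const op_fn *ops, size_t count, size_t n) {
    assert(net != NULL && ops != NULL && count > 0 && n > 0);
    net->ops = malloc(count * sizeof *net->ops);
    net->buffers = calloc(count, sizeof *net->buffers);
    net->count = count;
    net->n = n;
    if (!net->ops || !net->buffers) { net_destroy(net); return -1; }
    for (size_t i = 0; i < count; ++i) {
        net->ops[i] = ops[i];
        net->buffers[i] = malloc(n * sizeof(float));
        if (!net->buffers[i]) { net_destroy(net); return -1; }
    }
    return 0;
}

/* Execution stage: runs many times. Each operation reads the previous
 * output and writes its own preallocated buffer; no allocation happens.
 * Returns a pointer to the last buffer, owned by the net. */
const float *net_execute(const net_t *net, const float *input) {
    const float *src = input;
    for (size_t i = 0; i < net->count; ++i) {
        net->ops[i](src, net->buffers[i], net->n);
        src = net->buffers[i];
    }
    return src;
}
```

A caller would build the op_fn array once, call net_setup, invoke net_execute for each input batch, and finish with net_destroy. Keeping allocation out of the execution stage is the point of the split: the setup cost is paid once and amortized over every later call.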

Performance

Caffe, a deep learning framework developed by the Berkeley Vision and Learning Center (BVLC), is one of the most popular community frameworks for image recognition. Together with AlexNet, a neural network topology for image recognition, and ImageNet, a database of labeled images, Caffe is often used as a benchmark. The chart below shows a performance comparison of the original Caffe implementation and the Intel-optimized version, which takes advantage of optimized matrix-matrix multiplication and the new Intel MKL 2017 DNN primitives, on the Intel® Xeon® processor E5-2699 v4 (codename Broadwell) and the Intel® Xeon Phi™ processor 7250 (codename Knights Landing).

Summary

The DNN primitives available in Intel MKL 2017 can be used to accelerate Deep Learning workloads on Intel Architecture. Please refer to the Intel MKL Developer Reference Manual and the examples for detailed information.

 

 


2 comments

Alexander G. (Intel):

Is it possible to add a "custom" kernel/primitive and expose it to MKL-DNN for execution as part of its graph/job?

Simon Li:

The GitHub repository:

https://github.com/01org/mkl-dnn
