By Jun Ye, Tao Lv, Wei Li, Ashok Emani, Hao Li, Shufan Wu, Andres Ignacio Rodriguez Paniagua, and Xinyu Chen
Published: 08/01/2018 | Last Updated: 08/01/2018
The Apache* MXNet community has announced the v1.2.0 release of the Apache MXNet deep learning framework. One of the most important features in this release is the Intel-optimized CPU backend: MXNet now integrates with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) to accelerate the neural network operators Convolution, Deconvolution, FullyConnected, Pooling, Batch Normalization, Activation, LRN, and Softmax, as well as the common operators sum and concat. More details are available in the release note and release blog. This article describes how to use the new backend and how much faster v1.2.0 is on CPU platforms.
In deployment environments, latency is usually the sensitive metric, so additional optimizations are applied to reduce latency for better real-time results, especially at batch size 1.
As the following chart shows, the latency of single-image inference (batch size 1) is significantly reduced.
Figure 1. Single-image inference latency (batch size 1). NOTE: latency (in ms) can be calculated as 1000 * batchsize / throughput.
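For example, applying that formula to the resnet-50 numbers in the throughput table below (a minimal sketch; the helper name latency_ms is illustrative):

def latency_ms(throughput, batch_size=1):
    """Convert throughput in images/second to per-batch latency in ms."""
    return 1000.0 * batch_size / throughput

# resnet-50 at batch size 1, throughput values from Table 1 below
print(latency_ms(8.5))   # original CPU backend: ~117.6 ms
print(latency_ms(83.7))  # MKL-DNN backend: ~11.9 ms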
For large batch sizes, such as BS=32, throughput is improved dramatically with the Intel-optimized backend.
As the following chart shows, throughput at batch size 32 is about 23.4X to 56.9X that of the original CPU backend.
Figure 2. Inference throughput at batch size 32.
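The speedup figures quoted here, and in the speedup columns of Table 1 below, are simply throughput ratios between the two backends, for example (values from the BS=32 row for resnet-50):

mxnet_throughput = 8.5        # original CPU backend, images/sec
mxnet_mkl_throughput = 199.3  # MKL-DNN backend, images/sec
print(round(mxnet_mkl_throughput / mxnet_throughput, 1))  # 23.4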
The new backend also shows good scalability with batch size. In the chart below, throughput for the original CPU backend stays roughly constant at approximately eight images/second regardless of batch size.
The new implementation shows very good batch scalability: throughput grows from 83.7 images/second (BS=1) to 199.3 images/second (BS=32) for resnet-50, a further 2.4X gain from batching alone.
Figure 3. Throughput scaling with batch size (resnet-50).
Benchmark script: https://github.com/apache/incubator-mxnet/blob/master/example/image-classification/benchmark_score.py
CMD to reproduce the results:
# Pin OpenMP threads to physical cores
export KMP_AFFINITY=granularity=fine,compact,1,0
# Use one OpenMP thread per physical core (half the logical CPU count with HT on)
export vCPUs=`cat /proc/cpuinfo | grep processor | wc -l`
export OMP_NUM_THREADS=$((vCPUs / 2))
# Then run the benchmark script from the repository root
python example/image-classification/benchmark_score.py
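For a quick standalone measurement without cloning the repository, a minimal throughput sketch along the same lines (an illustrative example, not the benchmark script itself; assumes mxnet or mxnet-mkl 1.2.0 is installed as described below):

import time
import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Build a resnet-50 with random weights; real weights do not affect throughput
net = vision.resnet50_v1(pretrained=False)
net.initialize(mx.init.Xavier())
net.hybridize()

batch_size = 32
data = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224))

# Warm up, then time repeated forward passes
for _ in range(5):
    net(data).wait_to_read()
runs = 20
start = time.time()
for _ in range(runs):
    net(data).wait_to_read()
elapsed = time.time() - start
print("throughput: %.1f images/sec" % (runs * batch_size / elapsed))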
Table 1. Throughput in images/second for the original CPU backend (mxnet) and the MKL-DNN backend (mxnet-mkl); speedup = mxnet-mkl / mxnet.

| Batch Size | AlexNet mxnet | AlexNet mxnet-mkl | AlexNet speedup | VGG-16 mxnet | VGG-16 mxnet-mkl | VGG-16 speedup | inception-bn mxnet | inception-bn mxnet-mkl | inception-bn speedup | resnet-50 mxnet | resnet-50 mxnet-mkl | resnet-50 speedup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 11 | 415.7 | 37.8 | 2.2 | 94.7 | 43 | 13.4 | 113.4 | 8.5 | 8.5 | 83.7 | 9.8 |
| 2 | 15.4 | 692.3 | 45 | 2.5 | 132.2 | 52.9 | 13.9 | 187.6 | 13.5 | 8.5 | 117.5 | 13.8 |
| 4 | 13.4 | 808.7 | 60.4 | 2.7 | 145.2 | 53.8 | 13.9 | 283.7 | 20.4 | 8.7 | 152.9 | 17.6 |
| 8 | 23.5 | 981 | 41.7 | 2.9 | 156.4 | 53.9 | 14 | 380.1 | 27.2 | 8.7 | 186.3 | 21.4 |
| 16 | 24.5 | 1119.4 | 45.7 | 2.9 | 148.7 | 51.3 | 13.8 | 449.6 | 32.6 | 8.7 | 190.3 | 21.9 |
| 32 | 24.8 | 1411.7 | 56.9 | 2.9 | 134.6 | 46.4 | 13.8 | 500.5 | 36.3 | 8.5 | 199.3 | 23.4 |
To prepare a clean environment, install Python, pip, and basic build tools:
$ sudo apt-get update
$ sudo apt-get install -y wget python gcc
$ wget https://bootstrap.pypa.io/get-pip.py && sudo python get-pip.py
MXNet with the Intel MKL-DNN backend is available as of the 1.2.0 release and can be installed with pip:
$ pip install mxnet-mkl==1.2.0 [--user]
Please note that the mxnet-mkl package is built with USE_BLAS=openblas. To get the additional performance boost from Intel MKL BLAS, build MXNet from source as described below. The vanilla CPU package, used as the baseline in the results above, can be installed the same way:
$ pip install mxnet==1.2.0 [--user]
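As a quick sanity check after either install (a minimal sketch; the tensor shapes are arbitrary), confirm the package version and run a Convolution forward pass, one of the operators the MKL-DNN backend accelerates:

import mxnet as mx
print(mx.__version__)  # expect 1.2.0

# Random input and weights; in mxnet-mkl this Convolution runs through MKL-DNN
x = mx.nd.random.uniform(shape=(1, 3, 224, 224))
w = mx.nd.random.uniform(shape=(16, 3, 3, 3))
y = mx.nd.Convolution(data=x, weight=w, no_bias=True, kernel=(3, 3), num_filter=16)
print(y.shape)  # (1, 16, 222, 222)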
To build MXNet from source with the MKL-DNN backend and MKL BLAS:
$ git clone --recursive https://github.com/apache/incubator-mxnet
$ cd incubator-mxnet
$ git checkout 1.2.0
$ git submodule update --init --recursive
$ make -j USE_OPENCV=1 USE_MKLDNN=1 USE_BLAS=mkl
$ cd python && pip install -e .  # install the Python bindings for the new build
Note 1: When this command runs, Intel MKL-DNN is downloaded and built automatically.
Note 2: The MKL2017 backend has been removed from the MXNet master branch, so users can no longer build MXNet with the MKL2017 backend from source code.
Note 3: To use MKL as the BLAS library, users may need to install Intel® Parallel Studio for best performance.
Note 4: If MXNet cannot find the MKLML libraries, first add the MKLML library path to LD_LIBRARY_PATH and LIBRARY_PATH.
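To confirm at runtime that operators are actually dispatched to MKL-DNN, one option is the library's verbose mode (this assumes the bundled MKL-DNN build supports the MKLDNN_VERBOSE environment variable):

import os
os.environ["MKLDNN_VERBOSE"] = "1"  # set before any operator executes

import mxnet as mx
x = mx.nd.random.uniform(shape=(1, 3, 224, 224))
w = mx.nd.random.uniform(shape=(16, 3, 3, 3))
mx.nd.Convolution(data=x, weight=w, no_bias=True, kernel=(3, 3),
                  num_filter=16).wait_to_read()
# Lines beginning with "mkldnn_verbose" on stdout confirm the MKL-DNN code path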
Test configuration:

| Machine | Neon City |
|---|---|
| CPU/GPU Model, Core Count, Socket Count | Intel® Xeon® Platinum 8180, 56 cores, 2 sockets |
| CPU/GPU TFLOPS (FP32) | 8.24 TFLOPS = 2.3 GHz * 56 cores * 64 (AVX-512) |
| CPU Config | Turbo on, HT on, NUMA on |
| RAM Bandwidth | 255 GB/s = 2.666 GT/s * 12 channels * 8 bytes (2666 MHz DDR4) |
| RAM Capacity | 192 GB = 16 GB * 12 |
| Platform | Linux* 3.10.0-862.6.3.el7.x86_64-x86_64-with-centos-7.4.1708-Core |
| Kernel | 3.10.0-862.6.3.el7.x86_64 |
| BIOS Vendor | Intel Corporation |
| BIOS Version | SE5C620.86B.0X.01.0117.021220182317 |
Notices and Disclaimers
Performance results are based on testing as of July 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications, and roadmaps.
The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing, and may not be applicable to any particular user’s components, computer system or workloads. The results are not necessarily representative of other benchmarks and other benchmark results may show greater or lesser impact from mitigations.
Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
Other names and brands may be claimed as the property of others.
© Intel Corporation.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.