Intel has a long-term collaboration with the Apache* MXNet* (incubating) community to accelerate neural network operators in the CPU backend. Since MXNet v1.2.0, MXNet has been formally optimized with the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), as announced by Intel and the MXNet community. End users can directly take advantage of Intel hardware advances, including the latest Intel® Xeon® Scalable processors (codename Cascade Lake) and Intel® Deep Learning Boost (Intel® DL Boost) technology.
See the articles "Amazing Inference Performance with Intel® Xeon® Scalable Processors" and "Apache* MXNet* v1.5.0 Gets a Lift with Intel® DL Boost" for more details on recent performance improvements.
Developers can easily install MXNet according to its Installation Guide.
As shown in the following screenshot, both the current version and preview versions are provided. Developers can install the binary from Anaconda or PIP, or build from source for CPU. Python, Java and C++ are supported, depending on your OS: Linux*, macOS* and Windows*.
Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is designed to accelerate neural network computation. It is optimized for Intel processors with Intel® AVX-512, Intel® AVX-2, and Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) instructions, and is designed to deliver maximum performance for deep learning applications on CPU. Because Intel and the MXNet community work closely to have Intel® MKL-DNN acceleration enabled on CPU by default in MXNet, users get the performance benefit on Intel platforms without additional installation steps.
MXNet provides PyPI packages with Intel® MKL-DNN optimization in multiple variants. The versions with the suffix mkl run much faster on Intel CPUs. Check the chart below for details.
Note: all versions of MXNet* with suffix mkl (with or without CUDA* support) have Intel® MKL-DNN acceleration enabled.
Users can install mxnet-mkl in a Python environment on an Intel® CPU using the command below:
> pip install mxnet-mkl
The MXNet package will be installed to your python path. For example, the directory under Windows* is:
and the directory under Linux* with Anaconda3 environment is:
To try new features in advance, users can install a nightly build from master with the command:
> pip install mxnet-mkl --pre
For other options, including PIP, Docker and building from source, please check the Ubuntu installation guide, the CentOS installation guide, and other MXNet pip packages, and then validate your MXNet installation.
Once Intel-optimized MXNet has been installed, verify that the Intel® MKL-DNN optimizations are enabled by running the commands below; the last one should print "True".
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet.runtime
>>> fs = mxnet.runtime.Features()
>>> fs.is_enabled('MKLDNN')
True
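The same check can be wrapped in a small helper for use in scripts or CI jobs. The following is a minimal sketch, not part of MXNet: the function name `mkldnn_enabled` is hypothetical, and it assumes an MXNet version that ships the `mxnet.runtime` module used in the session above. It returns None rather than failing when MXNet is missing or the feature cannot be queried.

```python
import importlib


def mkldnn_enabled():
    """Return True or False for the MKLDNN feature flag, or None if unknown.

    Hypothetical helper: it relies only on mxnet.runtime.Features(),
    the same call used in the interactive check above.
    """
    try:
        runtime = importlib.import_module("mxnet.runtime")
    except Exception:
        # MXNet is not installed, is too old to provide mxnet.runtime,
        # or failed to load its native library.
        return None
    try:
        return bool(runtime.Features().is_enabled("MKLDNN"))
    except Exception:
        # The feature name is not known to this build.
        return None
```

Calling `print(mkldnn_enabled())` in the environment where mxnet-mkl was installed is expected to print True; a result of None suggests that the wrong Python environment is active.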
Better training and inference performance is expected on Intel® CPUs with MXNet built with Intel® MKL-DNN on multiple operating systems, including Linux*, Windows* and macOS*. Once mxnet-mkl is installed, users can run a simple MXNet Python program with a single convolution layer to verify that the MKL-DNN backend works.
import mxnet as mx
import numpy as np

num_filter = 32
kernel = (3, 3)
pad = (1, 1)
shape = (32, 32, 256, 256)  # (batch, channels, height, width)

x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad)
exe = y.simple_bind(mx.cpu(), x=shape)

# arg_arrays[0] is the input x and arg_arrays[1] is the weight w
exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape)
exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape)

exe.forward(is_train=False)
o = exe.outputs[0]
t = o.asnumpy()  # blocks until the asynchronous computation finishes
More details for debugging and profiling can be logged by setting the environment variable MKLDNN_VERBOSE, for example with export MKLDNN_VERBOSE=1.
For example, running the code snippet above prints the following debugging logs, which provide more insight into the Intel® MKL-DNN convolution and reorder primitives, including the memory layout, the inferred shapes and the execution time of each primitive.
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0429688
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,fsrc:nChw16c fwei:OIhw16i16o fbia:undef fdst:nChw16c,alg:convolution_direct,mb32_g1ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,9.98193
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0510254
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,32x32x256x256,20.4819
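When a real model produces thousands of such lines, it can help to total the time per primitive type. The sketch below is one possible way to do that in plain Python. It assumes the comma-separated layout seen in the log above, where the third field is the primitive name and the last field is the execution time (assumed here to be in milliseconds). The function name `time_per_primitive` is hypothetical, and SAMPLE_LOG is copied from three of the lines above.

```python
from collections import defaultdict

# Three lines copied from the log above, used as sample input.
SAMPLE_LOG = """\
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,fsrc:nChw16c fwei:OIhw16i16o fbia:undef fdst:nChw16c,alg:convolution_direct,mb32_g1ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,9.98193
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,32x32x256x256,20.4819
"""


def time_per_primitive(log_text):
    """Sum the last field (execution time) of each 'exec' line per primitive."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        # Skip anything that is not an MKL-DNN execution record.
        if len(fields) < 4 or fields[0] != "mkldnn_verbose" or fields[1] != "exec":
            continue
        try:
            elapsed = float(fields[-1])
        except ValueError:
            continue  # malformed or truncated line
        totals[fields[2]] += elapsed
    return dict(totals)


totals = time_per_primitive(SAMPLE_LOG)
for name, elapsed in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name}: {elapsed:.3f}")
```

In this sample the two layout reorders together take longer than the convolution itself, which is the kind of overhead the verbose log helps expose.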
MXNet provides many samples to help users apply MXNet to tasks such as CNN-based Image Classification, Text Classification, Semantic Segmentation, R-CNN, SSD, RNN, Recommender Systems, and Reinforcement Learning. Please visit the GitHub* Examples.
Please also check this website for excellent MXNet tutorials.
Users can run commonly used benchmarks with Intel-optimized MXNet. For example to run
The detailed performance data collected on Intel® Xeon® processors with MXNet built with Intel® MKL-DNN can be found here. The figures show the inference performance of MXNet-1.5.0 on different AWS EC2 machines. The data will be updated regularly.
For performance considerations when running MXNet on Intel® CPUs, please refer to Some Tips for Improving MXNet Performance, or to the discussions in the Data Layout, Non-Uniform Memory Access (NUMA) Controls Affecting Performance, and Intel® MKL-DNN Technical Performance Considerations sections of this article.
Intel® MKL-DNN is well integrated into MXNet on CPU, and MXNet provides good multi-threading control by default. To get better performance on Intel® CPUs, please use the environment variables below to set CPU affinity.
export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=56  # [physical core number]
export MXNET_SUBGRAPH_BACKEND=MKLDNN
$ cd example/image_classification
$ python benchmark_score.py
We have also found that setting the following environment variables can help boost performance:
OMP_NUM_THREADS: Suggested value: vCPUs / 2, where vCPUs is the number of virtual CPUs. For more information, please see the guide for setting the number of threads using an OpenMP* environment variable.
KMP_AFFINITY: Suggested value: granularity=fine,compact,1,0. For more information, please see the guide for Thread Affinity Interface (Linux and Windows).
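The suggested values can also be derived and applied from Python instead of the shell. The following is a minimal sketch under two assumptions: that the machine exposes two virtual CPUs per physical core, which is what the vCPUs / 2 rule implies, and that the variables are set before MXNet is imported so that the OpenMP runtime picks them up. The helper names `suggested_thread_env` and `apply_thread_env` are hypothetical.

```python
import os


def suggested_thread_env(vcpus=None):
    """Return the suggested OpenMP settings for a machine with `vcpus` virtual CPUs."""
    if vcpus is None:
        vcpus = os.cpu_count() or 1  # os.cpu_count() reports logical CPUs
    return {
        # vCPUs / 2, but never fewer than one thread
        "OMP_NUM_THREADS": str(max(1, vcpus // 2)),
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }


def apply_thread_env(vcpus=None):
    """Export the suggested values without overriding values the user already set."""
    settings = suggested_thread_env(vcpus)
    for name, value in settings.items():
        os.environ.setdefault(name, value)
    return settings
```

A script would call `apply_thread_env()` as its first statement and only then run `import mxnet`. Using `setdefault` keeps any value that was already exported in the shell.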
Note: MXNet treats all CPUs on a single machine as a single device. So whether you specify cpu(0) or cpu(), MXNet will use all CPU cores on the machine.
For better performance, MXNet also allows users to enable Intel MKL by building MXNet from source. Please check how to Build/Install MXNet with Intel® MKL-DNN.
Graph optimization through the sub-graph feature is available in the master branch. You can build from source and then enable this feature for better performance with the environment variable shown earlier, export MXNET_SUBGRAPH_BACKEND=MKLDNN.
Please see details from
Even after fixing the training or deployment environment and parallelization scheme, a number of configuration settings and data-handling choices can impact MXNet performance. You may want to check whether Intel® MKL-DNN is actually used.
Use export MKLDNN_VERBOSE=1 to check Intel® MKL-DNN, and use export MKL_VERBOSE=1 to check MKL.
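This check can be automated by launching the workload as a child process with MKLDNN_VERBOSE=1 and looking for verbose records in its output. The sketch below is one way to do so. The function name `prints_mkldnn_verbose` is hypothetical, and it assumes that the verbose lines start with the mkldnn_verbose prefix shown earlier in this article.

```python
import os
import subprocess


def prints_mkldnn_verbose(command):
    """Run `command` with MKLDNN_VERBOSE=1 and report whether any
    output line looks like an MKL-DNN verbose record."""
    env = dict(os.environ, MKLDNN_VERBOSE="1")
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture both streams together
        universal_newlines=True,
    )
    return any(
        line.startswith("mkldnn_verbose,")
        for line in result.stdout.splitlines()
    )
```

For example, `prints_mkldnn_verbose([sys.executable, "your_script.py"])`, where your_script.py is a placeholder for the MXNet program under test, returns False if no MKL-DNN primitive was executed.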
In addition, MXNet has a built-in profiler that gives detailed information about execution time at the symbol level. The profiler can be turned on with an environment variable for an entire program run, or programmatically for just part of a run. See example/profiler for complete examples of how to use the profiler in code.
After the execution finishes, navigate to your browser’s tracing page (for example, chrome://tracing in a Chrome* browser) and load the profile_output.json file output by the profiler to inspect the results.
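For a quick text summary without opening a browser, the same file can be aggregated in Python. The sketch below is illustrative only. It assumes the file follows the Chrome trace event format, that is, a traceEvents list whose entries carry name, ph and ts fields with timestamps in microseconds, that begin ("B") and end ("E") events appear in chronological order, and that end events repeat the name of the event they close. The function names are hypothetical.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum event durations per event name from Chrome-trace-style data."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    open_starts = defaultdict(list)  # (pid, tid, name) -> stack of begin timestamps
    totals = defaultdict(float)
    for event in events:
        phase = event.get("ph")
        name = event.get("name")
        key = (event.get("pid"), event.get("tid"), name)
        if phase == "X":  # complete event that carries its own duration
            totals[name] += event.get("dur", 0)
        elif phase == "B":  # begin event: remember when it started
            open_starts[key].append(event["ts"])
        elif phase == "E" and open_starts[key]:  # end event: close the latest begin
            totals[name] += event["ts"] - open_starts[key].pop()
    return dict(totals)


def summarize_profile(path="profile_output.json"):
    """Load the profiler output file and total the time per event name."""
    with open(path) as handle:
        return total_duration_by_name(json.load(handle))
```

Sorting the returned dictionary by value gives a list of the most expensive operators, which is often enough to decide where to look in the full trace.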
Note: The output file can grow extremely large, so this approach is not recommended for general use.
Q: "pip install mxnet-mkl" gives the following error:
Collecting mxnet Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x00000146D096ACC0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/MXNet/
A: Set the PIP proxy or check your internet connection, for example:
> pip install --proxy proxyserver:port MXNet
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804