Getting Started with Intel® Optimization for MXNet*

Intel has a long-term collaboration with the Apache* MXNet* (incubating) community to accelerate neural network operators in the CPU backend. Since MXNet v1.2.0, Intel and the MXNet community have formally announced that MXNet is optimized with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). End users can directly take advantage of Intel hardware advances, including the latest Intel® Xeon® Scalable processors (codename Cascade Lake) and Intel® Deep Learning Boost (Intel® DL Boost) technology.

See the article Amazing Inference Performance with Intel® Xeon® Scalable Processors for more details on recent performance accelerations.


Developers can easily install MXNet from its official documentation website.

As shown in the following screenshot, both the current version and preview versions are provided. Developers can choose to install the binary from Anaconda or pip, or build from source for CPU. Python* 2.7, Python 3.5 to 3.7, and C++ are supported, based on your OS: Linux*, macOS*, and Windows*.

Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is designed to accelerate neural network computation. It is optimized for Intel processors with Intel® AVX-512, Intel® AVX2, and Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) instructions, and delivers maximum deep learning performance on CPU. As Intel and the MXNet community work closely to enable MKL-DNN acceleration on CPU by default, users get the performance benefit on Intel platforms without additional installation steps.

MXNet provides PyPI packages with Intel MKL-DNN optimizations in multiple variants. The versions with the suffix mkl are much faster when running on Intel hardware. Check the chart below for details.

Note: all versions of MXNet* with the suffix mkl (with or without CUDA* support) have Intel® MKL-DNN acceleration enabled.

Users can install mxnet-mkl in a CPU Python environment with the following command:

 > pip install mxnet-mkl

The MXNet package will be installed to your Python path, i.e. the site-packages directory of your Python environment; the exact location differs between Windows and a Linux Anaconda3 environment.
If users want to try new features in advance, they can install a nightly build from master with the following command:

> pip install mxnet-mkl --pre

For other options, including pip, Docker, and building from source, please check the MXNet Install Guide and the other MXNet pip packages, and please validate your MXNet installation.

Sanity Check

Once Intel-optimized MXNet is installed, run the commands below to make sure the Intel MKL-DNN optimizations are present; the check shall print "True".

Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet.runtime
>>> fs = mxnet.runtime.Features()
>>> fs.is_enabled('MKLDNN')
True

Getting started

Better training and inference performance is achieved on Intel® Architecture CPUs with MXNet built with Intel MKL-DNN on multiple operating systems, including Linux, Windows, and macOS. Once mxnet-mkl is installed, users can run a simple MXNet Python example with a single convolution layer and verify that the MKL-DNN backend works.

import mxnet as mx
import numpy as np

# A single 3x3 convolution layer: 32 filters, padding 1,
# input batch of shape (N, C, H, W) = (32, 32, 256, 256)
num_filter = 32
kernel = (3, 3)
pad = (1, 1)
shape = (32, 32, 256, 256)

x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad)
exe = y.simple_bind(mx.cpu(), x=shape)

# Fill input and weights with random data
exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape)
exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape)

# Run the convolution; asnumpy() waits for the result
exe.forward(is_train=False)
o = exe.outputs[0]
t = o.asnumpy()
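As a quick sanity check on the layer configuration above, the expected output shape can be computed with standard convolution arithmetic. The helper below is illustrative only (it is not part of MXNet):

```python
def conv2d_out_shape(in_shape, num_filter, kernel, pad, stride=(1, 1)):
    # Standard convolution arithmetic: out = (in + 2*pad - kernel) // stride + 1
    n, _, h, w = in_shape
    oh = (h + 2 * pad[0] - kernel[0]) // stride[0] + 1
    ow = (w + 2 * pad[1] - kernel[1]) // stride[1] + 1
    return (n, num_filter, oh, ow)

print(conv2d_out_shape((32, 32, 256, 256), 32, (3, 3), (1, 1)))  # (32, 32, 256, 256)
```

With a 3x3 kernel and padding 1 at stride 1, the spatial size is preserved, so the output of the example above has the same (32, 32, 256, 256) shape as the input.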

More detailed debugging and profiling information can be logged by setting the environment variable MKLDNN_VERBOSE:

export MKLDNN_VERBOSE=1
For example, running the code snippet above prints the following debugging log, which provides more insight into the Intel MKL-DNN convolution and reorder primitives, including the memory layout, inferred shapes, and the execution time of each primitive.

mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0429688
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,fsrc:nChw16c fwei:OIhw16i16o fbia:undef fdst:nChw16c,alg:convolution_direct,mb32_g1ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,9.98193
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0510254
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,32x32x256x256,20.4819
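Each mkldnn_verbose line is a comma-separated record (marker, action, primitive type, implementation, propagation kind, formats, ..., execution time in milliseconds as the last field). A minimal sketch of a parser that totals the time per primitive type, based on the field layout of the log above (field positions may differ in other MKL-DNN versions):

```python
from collections import defaultdict

def parse_mkldnn_verbose(lines):
    """Sum execution time (ms, last field) per primitive type (third field)."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(',')
        if len(fields) < 4 or fields[0] != 'mkldnn_verbose':
            continue  # skip non-verbose output mixed into the log
        primitive = fields[2]                   # e.g. 'reorder' or 'convolution'
        totals[primitive] += float(fields[-1])  # time in milliseconds
    return dict(totals)

log = [
    "mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681",
    "mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,...,9.98193",
]
print(parse_mkldnn_verbose(log))  # {'reorder': 6.47681, 'convolution': 9.98193}
```

Totals like these make it easy to spot when reorders between the default nchw layout and the MKL-DNN blocked nChw16c layout dominate over the convolution itself.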

MXNet provides a number of samples to help users apply MXNet to CNN-based image classification, text classification, semantic segmentation, R-CNN, SSD, RNN, recommender systems, reinforcement learning, etc. Please visit the GitHub* Examples.

Please also visit the excellent MXNet Tutorials.

Run Performance Benchmark

Users can run commonly used benchmarks with Intel-optimized MXNet.
Detailed performance data collected on Intel® Xeon® processors with MXNet built with Intel MKL-DNN can be found here. The table shows the performance of MXNet 1.2.0.rc1, namely the number of images that can be predicted per second, measured as inference performance on different AWS EC2 machines. We will update it constantly.

Performance Consideration

For performance considerations when running MXNet on Intel CPUs, please refer to Some Tips for Improving MXNet Performance, or the Data Layout, Non-Uniform Memory Access (NUMA) Controls Affecting Performance, and Intel MKL-DNN Technical Performance Considerations sections of this Intel article.

Intel MKL-DNN is well integrated into MXNet on CPU, and MXNet provides good multi-threading control by default. For better performance, use the environment variables below to set the CPU affinity.

export KMP_AFFINITY=granularity=fine,compact,1,0
export OMP_NUM_THREADS=56 [set to the number of physical cores]
$ cd example/image-classification
$ python

We also find that setting the following environment variables can help:

OMP_NUM_THREADS

Suggested value: vCPUs / 2, in which vCPUs is the number of virtual CPUs. For more information, please see the guide for setting the number of threads using an OpenMP* environment variable.

KMP_AFFINITY

Suggested value: granularity=fine,compact,1,0. For more information, please see the guide for Thread Affinity Interface (Linux and Windows).

MXNET_SUBGRAPH_BACKEND

Set to MKLDNN to enable the subgraph feature for better performance. For more information, please see Build/Install MXNet with MKL-DNN.
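The vCPUs / 2 suggestion above can also be computed at runtime. A minimal sketch (suggested_omp_threads is a hypothetical helper for illustration, not an MXNet API):

```python
import multiprocessing
import os

def suggested_omp_threads():
    # Heuristic from the text: OMP_NUM_THREADS = vCPUs / 2,
    # where vCPUs is the number of virtual (logical) CPUs.
    vcpus = multiprocessing.cpu_count()
    return max(1, vcpus // 2)

# Must be set before MXNet/OpenMP initializes its thread pool,
# i.e. before importing mxnet.
os.environ["OMP_NUM_THREADS"] = str(suggested_omp_threads())
```

On a machine with hyper-threading, vCPUs / 2 typically equals the physical core count, which matches the "physical core number" advice given earlier.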

Note: MXNet treats all CPUs on a single machine as a single device. So whether you specify cpu(0) or cpu(), MXNet will use all CPU cores on the machine.

Enable Intel® MKL

For better performance, MXNet also allows users to enable Intel MKL by building MXNet from source. Please see the section Build/Install MXNet with Intel MKL-DNN.

Enable graph optimization

Graph optimization via the subgraph feature is available in the master branch. You can build from source and then use the command below to enable this feature for better performance:

export MXNET_SUBGRAPH_BACKEND=MKLDNN

Please see details in:

Build/Install MXNet with Intel MKL-DNN

Blog - Model Quantization for Production-level Neural Network Inference

Performance analysis and profiling

Even after fixing the training or deployment environment and the parallelization scheme, a number of configuration settings and data-handling choices can impact MXNet performance. You may want to check whether Intel MKL-DNN is actually being used.

Use export MKLDNN_VERBOSE=1 as above to check Intel MKL-DNN, and use export MKL_VERBOSE=1 to check Intel MKL.

Besides, MXNet has a built-in profiler that gives detailed information about execution time at the symbol level. The profiler can be turned on with an environment variable for an entire program run, or programmatically for just part of a run. See example/profiler for complete examples of how to use the profiler in code.
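For the environment-variable route, one way to profile an entire run is the MXNET_PROFILER_AUTOSTART variable (your_script.py is a placeholder for your own program):

```shell
# Start the profiler automatically for the whole run;
# MXNet writes a JSON trace file in the working directory.
export MXNET_PROFILER_AUTOSTART=1
python your_script.py
```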

After the program finishes, navigate to your browser's tracing page (for example, chrome://tracing in the Chrome* browser) and load the profile_output.json file written by the profiler to inspect the results.

Note: The output file can grow extremely large, so this approach is not recommended for general use.


  1. MXNet Documentation
  2. MXNet GitHub
  3. Intel MXNet Official Website
  4. Build/Install MXNet with Intel MKL-DNN
  5. Apache MXNet v1.2.0 optimized with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN)
  6. Amazing Inference Performance with Intel® Xeon® Scalable Processors
  7. Model Quantization for Production-level Neural Network inference


Q: "pip install mxnet" gets the following error:

Collecting mxnet

  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x00000146D096ACC0>, 'Connection to timed out. (connect timeout=15)')': /simple/MXNet/

A: Set the pip proxy or check your internet connection, for example:

  > pip install --proxy proxyserver:port mxnet