Getting Started with Intel® Optimization for MXNet*

By Ying Hu, Zhuowei Si

Published: 07/17/2019   Last Updated: 07/16/2019

Intel has a long-term collaboration with the Apache* MXNet* (incubating) community to accelerate neural network operators in the CPU backend. Since MXNet v1.2.0, Intel and the MXNet community have formally announced that MXNet is optimized with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). End users can directly take advantage of Intel hardware advances, including the latest Intel® Xeon® Scalable processors (codename Cascade Lake) and Intel® Deep Learning Boost (Intel® DL Boost) technology.

See the articles Amazing Inference Performance with Intel® Xeon® Scalable Processors and Apache* MXNet* v1.5.0 Gets a Lift with Intel® DL Boost for more details on recent performance accelerations.


Developers can easily install MXNet according to its Installation Guide.

As shown in the following screenshot, both the current version and preview versions are provided. Developers can also choose to install the binary from Anaconda or PIP, or to build from source for CPU. Python, Java, and C++ are supported, depending on your OS: Linux*, macOS*, or Windows*.

Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) is designed to accelerate neural network computation. It is optimized for Intel processors with Intel® AVX-512, Intel® AVX2, and Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2) instructions to deliver maximum performance for deep learning applications on CPU. Because Intel and the MXNet community work closely to have Intel® MKL-DNN acceleration enabled on CPU by default, users get the performance benefit on Intel platforms without additional installation steps.

MXNet provides several PyPI packages with Intel® MKL-DNN optimization. Packages with the mkl suffix run much faster on Intel CPUs. Check the chart below for details.

Note: all versions of MXNet* with suffix mkl (with or without CUDA* support) have Intel® MKL-DNN acceleration enabled.

Users can install mxnet-mkl in a Python environment on an Intel® CPU using the command below:

> pip install mxnet-mkl

The MXNet package will be installed to your python path. For example, the directory under Windows* is:


and the directory under Linux* with Anaconda3 environment is:


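Because the exact directory depends on the platform and the Python environment, it can also be looked up from Python itself. The following is a minimal sketch using only the standard library; package_dir is a hypothetical helper name, not part of MXNet.

```python
import importlib.util
import os


def package_dir(name):
    """Return the install directory of an importable package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints the directory that holds the mxnet package, or None if MXNet is not installed
print(package_dir("mxnet"))
```

The lookup does not import the package, so it also works when MXNet is missing or broken.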
To try new features in advance, users can install a nightly build from master with the command:

> pip install mxnet-mkl --pre

For other options, including PIP, Docker, and building from source, please check the Ubuntu installation guide, the CentOS installation guide, and other MXNet pip packages, and please validate your MXNet installation.

Sanity Check

Once Intel-optimized MXNet has been installed, run the commands below to make sure the Intel® MKL-DNN optimizations are enabled. The last command should print "True".

Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet.runtime
>>> fs = mxnet.runtime.Features()
>>> fs.is_enabled('MKLDNN')
True
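The same check can be wrapped in a small script, for example to run it as part of an environment setup. This is a sketch only: mkldnn_enabled is a hypothetical helper name, and it assumes an MXNet version that ships the mxnet.runtime module used in the session above.

```python
def mkldnn_enabled():
    """Return True/False for the MKLDNN feature flag, or None if MXNet cannot be imported."""
    try:
        import mxnet.runtime
    except ImportError:
        # MXNet is not installed, or this version has no mxnet.runtime module
        return None
    return bool(mxnet.runtime.Features().is_enabled('MKLDNN'))


status = mkldnn_enabled()
if status is None:
    print("MXNet (with mxnet.runtime) is not available in this environment")
else:
    print("Intel MKL-DNN enabled:", status)
```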

Getting started

Better training and inference performance is expected on Intel® CPUs with MXNet built with Intel® MKL-DNN on multiple operating systems, including Linux*, Windows*, and macOS*. Once mxnet-mkl is installed, users can run a simple MXNet Python script with a single convolution layer to verify that the MKL-DNN backend works.

import mxnet as mx
import numpy as np

num_filter = 32
kernel = (3, 3)
pad = (1, 1)
shape = (32, 32, 256, 256)

x = mx.sym.Variable('x')
w = mx.sym.Variable('w')
y = mx.sym.Convolution(data=x, weight=w, num_filter=num_filter, kernel=kernel, no_bias=True, pad=pad)
exe = y.simple_bind(mx.cpu(), x=shape)

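# Fill the input and weight arrays with random values
exe.arg_arrays[0][:] = np.random.normal(size=exe.arg_arrays[0].shape)
exe.arg_arrays[1][:] = np.random.normal(size=exe.arg_arrays[1].shape)

# Run the forward pass, then copy the result to NumPy (blocks until the computation is done)
exe.forward(is_train=False)
o = exe.outputs[0]
t = o.asnumpy()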

More details for debugging and profiling can be logged by setting the environment variable MKLDNN_VERBOSE:

export MKLDNN_VERBOSE=1

For example, running the code snippet above prints the following debugging logs, which give more insight into the Intel® MKL-DNN convolution and reorder primitives, including the memory layout, the inferred shape, and the execution time of each primitive.

mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0429688
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,fsrc:nChw16c fwei:OIhw16i16o fbia:undef fdst:nChw16c,alg:convolution_direct,mb32_g1ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,9.98193
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0510254
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,32x32x256x256,20.4819
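To see where the time goes, the log can be aggregated per primitive. The sketch below is one possible way to do this, assuming each entry is a comma-separated record whose third field is the primitive name and whose last field is the execution time (taken here to be milliseconds); time_per_primitive is a hypothetical helper name.

```python
from collections import defaultdict


def time_per_primitive(log_text):
    """Sum the last field of each 'exec' record, grouped by primitive name."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        # Skip anything that is not an mkldnn_verbose execution record
        if len(fields) < 4 or fields[0] != "mkldnn_verbose" or fields[1] != "exec":
            continue
        totals[fields[2]] += float(fields[-1])
    return dict(totals)


# The five records from the log shown above
log = """\
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nchw out:f32_nChw16c,num:1,32x32x256x256,6.47681
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0429688
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_inference,fsrc:nChw16c fwei:OIhw16i16o fbia:undef fdst:nChw16c,alg:convolution_direct,mb32_g1ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,9.98193
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,32x32x3x3,0.0510254
mkldnn_verbose,exec,reorder,jit:uni,undef,in:f32_nChw16c out:f32_nchw,num:1,32x32x256x256,20.4819
"""

for name, total in sorted(time_per_primitive(log).items()):
    print(f"{name}: {total:.3f}")
```

In this run the four reorders together take longer than the convolution itself, which is the kind of layout-conversion overhead the verbose log helps to spot.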

MXNet provides a number of samples to help users apply MXNet to CNNs for image classification, text classification, semantic segmentation, R-CNN, SSD, RNN, recommender systems, reinforcement learning, and more. Please visit the GitHub* Examples.

Please also check this website for excellent MXNet tutorials.

Run Performance Benchmark

Users can run commonly used benchmarks with Intel-optimized MXNet. For example, to run


The detailed performance data collected on Intel® Xeon® processors with MXNet built with Intel® MKL-DNN can be found here. The figures show the inference performance of MXNet 1.5.0 measured on different AWS EC2 machines. The data will be updated regularly.

Performance Consideration

For performance considerations when running MXNet on Intel® CPUs, please refer to Some Tips for Improving MXNet Performance, or to the Data Layout, Non-Uniform Memory Access (NUMA) Controls Affecting Performance, and Intel® MKL-DNN Technical Performance Considerations sections of this article.

Intel® MKL-DNN is well integrated into MXNet on CPU, and MXNet provides good multi-threading control by default. To get better performance on Intel® CPUs, please use the environment variables below to set CPU affinity.

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=56  # set to the number of physical cores
$ cd example/image-classification
$ python

We also find that setting the following environment variables can help boost performance:

OMP_NUM_THREADS

Suggested value: vCPUs / 2, where vCPUs is the number of virtual CPUs. For more information, please see the guide for setting the number of threads using an OpenMP* environment variable.

KMP_AFFINITY

Suggested value: granularity=fine,compact,1,0. For more information, please see the guide for Thread Affinity Interface (Linux and Windows).


Set to MKLDNN to enable the subgraph feature for better performance. For more information, please see Build/Install MXNet with MKL-DNN.
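The suggested values above can be derived for the current machine with a few lines of Python. This is a rough sketch, not an official tool: suggested_env is a hypothetical helper, os.cpu_count() is used as a stand-in for the number of vCPUs, and MXNET_SUBGRAPH_BACKEND is assumed to be the variable that the "Set to MKLDNN" item refers to.

```python
import os


def suggested_env(vcpus=None, subgraph=False):
    """Build the suggested environment settings for a machine with `vcpus` virtual CPUs."""
    if vcpus is None:
        vcpus = os.cpu_count() or 2  # fall back to 2 if the count is unknown
    env = {
        # vCPUs / 2, i.e. one thread per physical core when hyper-threading is on
        "OMP_NUM_THREADS": str(max(1, vcpus // 2)),
        "KMP_AFFINITY": "granularity=fine,compact,1,0",
    }
    if subgraph:
        # Assumption: this is the variable name behind the subgraph setting
        env["MXNET_SUBGRAPH_BACKEND"] = "MKLDNN"
    return env


# Print the settings as shell export lines for the current machine
for name, value in suggested_env().items():
    print(f"export {name}={value}")
```

For a machine with 112 vCPUs this yields OMP_NUM_THREADS=56, matching the example shown earlier.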

Note: MXNet treats all CPUs on a single machine as a single device. So whether you specify cpu(0) or cpu(), MXNet will use all CPU cores on the machine.

Enable Intel® MKL

For better performance, MXNet also allows users to enable Intel® MKL by building MXNet from source. Please check how to Build/Install MXNet with Intel® MKL-DNN.

Enable graph optimization

Graph optimization through the subgraph feature is available in the master branch. You can build from source and then use the command below to enable this feature for better performance:


For details, please see:

Build/Install MXNet with Intel MKL-DNN

Blog: Model Quantization for Production-Level Neural Network Inference

Performance analysis and profiling

Even after fixing the training or deployment environment and parallelization scheme, a number of configuration settings and data-handling choices can impact the MXNet performance. You may want to check if Intel® MKL-DNN is used or not.

Use export MKLDNN_VERBOSE=1 to check Intel® MKL-DNN, and use export MKL_VERBOSE=1 to check Intel® MKL.
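When the variables should apply to a single run rather than the whole shell session, they can be set for one child process only. The sketch below uses only the Python standard library; run_with_verbose is a hypothetical helper, and the child command here is a stand-in that simply echoes the variable.

```python
import os
import subprocess
import sys


def run_with_verbose(args, mkldnn=True, mkl=False):
    """Run `python <args>` with the verbose variables set for that process only."""
    env = dict(os.environ)
    if mkldnn:
        env["MKLDNN_VERBOSE"] = "1"
    if mkl:
        env["MKL_VERBOSE"] = "1"
    return subprocess.run([sys.executable, *args], env=env,
                          capture_output=True, text=True)


# Stand-in child process; replace the arguments with your own script,
# for example ["my_model.py"] (hypothetical file name).
result = run_with_verbose(["-c", "import os; print(os.environ.get('MKLDNN_VERBOSE'))"])
print(result.stdout.strip())
```

The captured result.stdout can then be searched for verbose records to confirm that the library was actually used.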

Besides, MXNet has a built-in profiler that gives detailed information about the execution time at the symbol level. The profiler can then be turned on with an environment variable for an entire program run, or programmatically for just part of a run. See example/profiler for complete examples of how to use the profiler in code.

After the execution finishes, open your browser's tracing page (for example, chrome://tracing in a Chrome* browser) and load the profile_output.json file written by the profiler to inspect the results.

Note: The output file can grow extremely large, so this approach is not recommended for general use.
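For a quick text summary without opening the browser, the trace file can also be aggregated in Python. The sketch below assumes the file follows the Chrome trace event layout (a "traceEvents" list with "B"/"E" begin/end pairs or "X" complete events, timestamps in microseconds) and that end events repeat the event name; the exact content written by the MXNet profiler may differ. summarize_trace is a hypothetical helper name.

```python
import json
from collections import defaultdict


def summarize_trace(trace):
    """Sum the duration per event name (microseconds) from a Chrome-style trace dict."""
    totals = defaultdict(float)
    open_events = {}  # (name, pid, tid) -> stack of begin timestamps
    for ev in trace.get("traceEvents", []):
        name, phase = ev.get("name"), ev.get("ph")
        key = (name, ev.get("pid"), ev.get("tid"))
        if phase == "X":
            # Complete event: the duration is stored directly
            totals[name] += ev.get("dur", 0)
        elif phase == "B":
            open_events.setdefault(key, []).append(ev["ts"])
        elif phase == "E" and open_events.get(key):
            totals[name] += ev["ts"] - open_events[key].pop()
    return dict(totals)


# Small hand-written sample; for a real run use json.load() on profile_output.json
sample = json.loads("""
{"traceEvents": [
  {"name": "Convolution", "ph": "B", "ts": 100, "pid": 0, "tid": 1},
  {"name": "Convolution", "ph": "E", "ts": 350, "pid": 0, "tid": 1},
  {"name": "Reorder", "ph": "X", "ts": 400, "dur": 50, "pid": 0, "tid": 1}
]}
""")
print(summarize_trace(sample))
```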


  1. MXNet Documentation
  2. MXNet GitHub
  3. Intel MXNet Official Website
  4. Build/Install MXNet with Intel MKL-DNN
  5. Apache MXNet v1.2.0 optimized with Intel® Math Kernel Library for Deep Neural Networks (Intel MKL-DNN)
  6. Amazing Inference Performance with Intel® Xeon® Scalable Processors
  7. Model Quantization for Production-level Neural Network inference


Q: "pip install mxnet-mkl" gives the following error:

Collecting mxnet

  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x00000146D096ACC0>, 'Connection to timed out. (connect timeout=15)')': /simple/MXNet/

A: Set a proxy for pip or check your internet connection, for example:

  > pip install --proxy proxyserver:port mxnet
