Getting Started with Intel® Optimization for PyTorch* on Second Generation Intel® Xeon® Scalable Processors

Intel recently launched the second generation Intel® Xeon® Scalable processors (codename Cascade Lake), which add Intel® Deep Learning Boost (Intel DL Boost) technology. Optimized primitives that take advantage of these hardware advances have been added to Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). The Intel MKL-DNN optimizations are abstracted and integrated directly into the PyTorch* framework, so end users can take advantage of this technology with minimal changes to their code.

See the article "Intel and Facebook* collaborate to Boost PyTorch* CPU Performance" for more details on recent performance accelerations.


Intel MKL-DNN is integrated into the official release of PyTorch by default, so users get the performance benefit on Intel platforms without additional installation steps.

Users can get PyTorch from its official website. As shown in the following screenshot, a stable version and a preview version are provided for Linux*, macOS* and Windows*. Users can also choose to install the binary from Anaconda*, pip, or LibTorch, or to build from source. Python* 2.7, Python 3.5 to 3.7, and C++ are supported. To install a CPU-only build for Intel platforms, set the CUDA* option to None.

Note: all versions of PyTorch (with or without CUDA support) have Intel® MKL-DNN acceleration support enabled by default.
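One way to confirm the note above on a given machine is to ask the installed PyTorch build directly. The following is a minimal sketch, not an official check: it assumes the `torch.backends.mkldnn.is_available()` query found in recent PyTorch releases (older builds may not expose it, in which case the sketch reports `None`), and the helper name `mkldnn_status` is made up for this illustration.

```python
import importlib.util


def mkldnn_status():
    """Report whether PyTorch is importable and whether its build includes MKL-DNN.

    'mkldnn_available' stays None when PyTorch is missing or when the
    installed build does not expose torch.backends.mkldnn.
    """
    status = {"torch_installed": False, "mkldnn_available": None}
    if importlib.util.find_spec("torch") is None:
        return status
    try:
        import torch
    except ImportError:
        return status
    status["torch_installed"] = True
    # Assumption: recent builds provide torch.backends.mkldnn.is_available().
    backend = getattr(torch.backends, "mkldnn", None)
    if backend is not None and hasattr(backend, "is_available"):
        status["mkldnn_available"] = bool(backend.is_available())
    return status


print(mkldnn_status())
```

If `mkldnn_available` is reported as `True`, the verbose-message check described in the Getting Started steps should also produce output.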


Getting Started

Let’s take a simple example to get started with Intel optimization for PyTorch on an Intel platform.

We will run a simple PyTorch example on an Intel® Xeon® Platinum 8180M processor.

1. Install PyTorch following the installation matrix. In this example, we install the stable version (v1.0) on Linux via pip for Python 3.6, with no CUDA support.

$ pip3 install
$ pip3 install torchvision

2. Write a simple example, modified from the official example. The complete script is listed at the end of this article.

3. Run the example with Intel MKL-DNN verbose messages enabled. The appearance of mkldnn_verbose messages like the following indicates that the PyTorch binary has Intel MKL-DNN acceleration enabled.

Output to console:

mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,274.935
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.404053
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_training,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb64_g1ic1000oc100_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,497.934
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw16c out:f32_nchw,num:1,64x100x128x128,32.45
0 123862424.0
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,270.903
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x100x128x128,26.2151
mkldnn_verbose,exec,convolution,jit:avx512_common,backward_weights,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb64_g1ic1000oc100_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,709.844
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_OIhw16i16o out:f32_oihw,num:1,100x1000x3x3,0.39917
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,198.667
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.26709
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_training,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb64_g1ic1000oc100_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,892.816
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw16c out:f32_nchw,num:1,64x100x128x128,28.2871
1 47307755520.0
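The log above can also be produced and summarized from Python. The sketch below rests on two assumptions that are not stated in this article: that verbose mode is switched on through the `MKLDNN_VERBOSE=1` environment variable used by Intel MKL-DNN releases of this generation, and that the last comma-separated field of each `exec` line is the execution time in milliseconds. The helper names and the script name `example.py` are hypothetical.

```python
import os
import subprocess
import sys
from collections import defaultdict


def run_with_mkldnn_verbose(cmd):
    """Run cmd with MKLDNN_VERBOSE=1 (assumed switch) and return its stdout."""
    env = dict(os.environ, MKLDNN_VERBOSE="1")
    result = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout


def summarize_verbose(log_text):
    """Sum the last field of each 'mkldnn_verbose,exec' line per primitive."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        fields = line.strip().split(",")
        if len(fields) < 4 or fields[:2] != ["mkldnn_verbose", "exec"]:
            continue  # e.g. the "0 123862424.0" loss lines
        try:
            totals[fields[2]] += float(fields[-1])
        except ValueError:
            continue  # skip lines whose last field is not a number
    return dict(totals)


# First iteration of the console output shown above.
sample = """\
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,274.935
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.404053
mkldnn_verbose,exec,convolution,jit:avx512_common,forward_training,fsrc:nChw16c fwei:OIhw16i16o fbia:x fdst:nChw16c,alg:convolution_direct,mb64_g1ic1000oc100_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1,497.934
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nChw16c out:f32_nchw,num:1,64x100x128x128,32.45
0 123862424.0
"""

for primitive, total in sorted(summarize_verbose(sample).items()):
    print(primitive, round(total, 3))

# To capture a real run ("example.py" is a hypothetical file name):
# log = run_with_mkldnn_verbose([sys.executable, "example.py"])
```

Grouping by primitive makes it easy to see how much of each iteration is spent in layout reorders (for example, nchw to nChw16c) compared with the convolution itself.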

Performance Considerations

For performance considerations when running PyTorch on Intel® Architecture processors, refer to the "Data Layout", "Non-Uniform Memory Access (NUMA) Controls Affecting Performance", and "Intel® MKL-DNN Technical Performance Considerations" sections of "Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads".
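As a small illustration of the NUMA and threading topic, the sketch below derives a thread count from the CPUs the current process is allowed to use and places it in `OMP_NUM_THREADS`, the standard OpenMP variable. This is an assumption-laden starting point rather than a recommendation from the referenced article: it counts logical CPUs (not physical cores), the helper names are hypothetical, and the best value for a given workload has to be measured.

```python
import os


def pick_thread_count():
    """Number of logical CPUs this process may run on.

    On Linux, os.sched_getaffinity reflects any CPU binding applied by an
    external tool; elsewhere this falls back to os.cpu_count().
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def thread_env(n_threads):
    """Return a copy of the environment with OMP_NUM_THREADS set."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


n_threads = pick_thread_count()
env = thread_env(n_threads)
print("OMP_NUM_THREADS =", env["OMP_NUM_THREADS"])
```

The returned environment can be passed to a child process that runs the PyTorch script, so the setting is in place before the framework initializes its thread pool.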


Contents of the example script

# -*- coding: utf-8 -*-
import torch

N, D_in, D_out, H, W = 64, 1000, 100, 128, 128

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, H, W)
y = torch.randn(N, D_out, H, W)

# Use the nn package to define our model as a sequence of layers.
# nn.Sequential is a Module which contains other Modules, and applies
# them in sequence to produce its output.
model = torch.nn.Sequential(
    torch.nn.Conv2d(D_in, D_out, 3, padding=1),
)

# The nn package also contains definitions of popular loss functions; in this case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(5):
    # Forward pass: compute predicted y by passing x to the model. Module
    # objects override the __call__ operator so you can call them like
    # functions. When doing so you pass a Tensor of input data to the
    # Module and it produces a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true values of y, and the loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the
    # learnable parameters of the model. Internally, the parameters of each
    # Module are stored in Tensors with requires_grad=True, so this call
    # will compute gradients for all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
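As a cross-check between the script above and the verbose log, the convolution descriptor string printed by Intel MKL-DNN can be decoded and compared with the script's dimensions. This is a sketch under assumptions: the field meanings (mb = mini-batch, g = groups, ic/oc = input/output channels, ih/iw and oh/ow = input and output height/width, kh/kw = kernel, sh/sw = stride, dh/dw = dilation with 0 meaning not dilated, ph/pw = padding) are read from the naming pattern rather than from this article, and both helper names are hypothetical.

```python
import re


def parse_conv_descriptor(desc):
    """Split a descriptor such as 'mb64_g1ic1000oc100_...' into {name: int}."""
    return {name: int(value)
            for name, value in re.findall(r"([a-z]+)(\d+)", desc)}


def expected_output_size(size, kernel, stride, dilation, padding):
    """Convolution output size; dilation 0 is assumed to mean 'not dilated'."""
    effective_kernel = (kernel - 1) * (dilation + 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


# Descriptor copied from the convolution lines of the console output.
desc = "mb64_g1ic1000oc100_ih128oh128kh3sh1dh0ph1_iw128ow128kw3sw1dw0pw1"
d = parse_conv_descriptor(desc)

# Dimensions used in the script.
N, D_in, D_out, H, W = 64, 1000, 100, 128, 128

print("batch/channels match:", (d["mb"], d["ic"], d["oc"]) == (N, D_in, D_out))
print("input size matches:", (d["ih"], d["iw"]) == (H, W))
print("output height consistent:",
      expected_output_size(d["ih"], d["kh"], d["sh"], d["dh"], d["ph"]) == d["oh"])
```

With a 3x3 kernel, stride 1 and padding 1, the output keeps the 128x128 spatial size, which is why `y` in the script has the same height and width as `x`.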


