Maximize Performance of Intel® Software Optimization for PyTorch* on CPU

By Jing Xu, Zhuowei Si, Nathan G Greeneltch, and Shailendrsingh Kishore Sobhee

Published: 04/18/2019   Last Updated: 04/18/2019

Previous: Vol 1: Getting Started

In Volume 1 of this series, we introduced how to install PyTorch* and Caffe2* with Intel optimizations, and how to get started.

Volumes in Introduction Series for Intel® Software Optimization for PyTorch*

Vol 1: Getting Started - Installation instructions for Intel® Software Optimization for PyTorch* and a getting-started guide.

Vol 2: Performance considerations - Introduces hardware and software configuration to fully utilize CPU computation resources with Intel Software Optimization for PyTorch.

Special: Performance number - Introduces performance numbers for Intel Software Optimization for PyTorch.


To fully utilize the power of Intel® architecture and thus yield high performance, PyTorch and Caffe2 can be powered by Intel's highly optimized math routines for deep learning tasks. This primitives library is the Intel® Deep Neural Network Library (Intel® DNNL), which includes convolution, normalization, activation, inner product, and other primitives.

Several optional switches/tools are provided to make Intel DNNL more flexible to use. By default, you already enjoy the high performance of Intel DNNL on Intel CPUs. In this volume, we will introduce how to use these optional switches/tools to manually fit PyTorch and Caffe2, as well as the underlying accelerator Intel DNNL, to your machine, and thus achieve even better performance.
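If you want to confirm that your PyTorch build is actually backed by Intel DNNL (historically exposed in PyTorch as MKL-DNN), a quick check such as the following can help; the exact attribute depends on your PyTorch version, so treat it as a sketch rather than a guaranteed interface:

$ python -c "import torch; print(torch.backends.mkldnn.is_available())"

It prints True when MKL-DNN/DNNL support is compiled in.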

The following techniques will be covered in this article.

  • Numactl
  • OpenMP*
    • OMP_NUM_THREADS
    • MKL_NUM_THREADS (For Caffe2 only)
    • GOMP_CPU_AFFINITY/KMP_AFFINITY

Capability of CPU

Making use of these optional switches/tools depends on the topology of your machine. You need to know the following:

  • How many sockets are onboard
  • How many cores are on each socket
  • Whether hyper-threading is enabled
  • Which cores are physical cores

You can use the lscpu command to get this information on Linux* machines. The following is an example on an Intel® Xeon® Platinum 8180M processor.

$ lscpu
...
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
NUMA node(s):        2
...
Model name:          Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz
...
NUMA node0 CPU(s):   0-27,56-83
NUMA node1 CPU(s):   28-55,84-111
...

From this output, we can decode the core structure of this machine. The CPU is an Intel Xeon Platinum 8180M processor (line 10). There are 2 sockets onboard (line 7), and each socket has 28 physical cores (line 6). Hyper-threading is enabled (line 5), so there are 112 logical cores in total (line 3), numbered from 0 to 111 (line 4). NUMA is also enabled on this machine, with 2 NUMA nodes (line 8). NUMA node 0 controls CPUs 0-27 and 56-83 (line 12); NUMA node 1 controls CPUs 28-55 and 84-111 (line 13). CPUs 0-27 and 28-55 are the physical cores on NUMA nodes 0 and 1 respectively, i.e., there are 56 physical cores in total. CPUs 56-83 and 84-111 are their hyper-threaded siblings on NUMA nodes 0 and 1 respectively.
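As a cross-check, the numactl utility (if installed) can print the same topology from the NUMA point of view:

$ numactl --hardware
# Lists the available NUMA nodes, the CPUs attached to each node,
# and the amount of memory local to each node.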

Note: If there is only one NUMA node on your machine, please skip ahead to the OpenMP section.

Non-Uniform Memory Access (NUMA) Controls

Recommended settings: --cpunodebind=0 --membind=0

Usage (shell)

numactl --cpunodebind=0 --membind=0 python <pytorch_script>

Running on a NUMA-enabled machine brings special considerations. NUMA, or non-uniform memory access, is a memory layout used in data center machines to take advantage of memory locality in multi-socket systems with multiple memory controllers and memory blocks. In most cases, inference runs best when both the execution and the memory usage are confined to a single NUMA node.

Concurrent Execution

You can optimize performance by breaking up your workload into multiple data shards and then running concurrently on more than one NUMA node. On each node (N), run the following command:

Usage (shell)

numactl --cpunodebind=N --membind=N python <pytorch_script>

For example, you can use the shell's & operator to launch simultaneous processes on multiple NUMA nodes:

numactl --cpunodebind=0 --membind=0 python <pytorch_script> & numactl --cpunodebind=1 --membind=1 python <pytorch_script>
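To confirm that memory allocations really stay on the intended node, the numastat tool (shipped alongside numactl on many distributions) can break down a running process's memory usage by NUMA node; the process ID below is just a placeholder:

$ numastat -p <pid_of_python_process>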

OpenMP*

OpenMP is utilized to bring better performance for parallel computation tasks. OMP_NUM_THREADS is the easiest switch to start with when accelerating computation. Furthermore, GOMP_CPU_AFFINITY/KMP_AFFINITY is used to schedule OpenMP threads and bind them to desired cores.

OMP_NUM_THREADS

Recommended setting: OMP_NUM_THREADS = <num_physical_cores>

Usage (shell)

export OMP_NUM_THREADS=<num_physical_cores>

This environment variable sets the maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application. You can take advantage of this setting to fully utilize the computation capability of your CPU.

The default value is the number of logical processors visible to the operating system on which the program is executed. We recommend setting this value to the number of physical cores.
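As a quick sanity check, PyTorch reports its intra-op thread count through torch.get_num_threads(), which should reflect the exported value (56 physical cores on the example machine above; adjust the number to your own CPU):

$ export OMP_NUM_THREADS=56
$ python -c "import torch; print(torch.get_num_threads())"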

On Caffe2, for legacy reasons, several operations may fall back to Intel® Math Kernel Library (Intel® MKL) as the backend. In this case, please also set MKL_NUM_THREADS to the number of physical cores.

export OMP_NUM_THREADS=<num_physical_cores>
export MKL_NUM_THREADS=<num_physical_cores>

GOMP_CPU_AFFINITY/KMP_AFFINITY

To maximize PyTorch performance, the scheduling behavior of OpenMP threads can be controlled precisely with the GOMP_CPU_AFFINITY/KMP_AFFINITY environment variables. The former works with GNU OpenMP, while the latter works with Intel's OpenMP Runtime Library.

By default, PyTorch is shipped with GNU OpenMP, but you can easily use the LD_PRELOAD environment variable to switch to Intel's OpenMP Runtime Library.

$ export LD_PRELOAD=<path>/libiomp5.so
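To confirm that the Intel runtime is actually picked up, you can additionally set KMP_SETTINGS=1, which makes Intel's OpenMP runtime print its effective settings at startup (the GNU runtime simply ignores this variable); treat this as an optional sanity check:

$ export LD_PRELOAD=<path>/libiomp5.so
$ KMP_SETTINGS=1 python <pytorch_script>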

GOMP_CPU_AFFINITY

Recommended setting for general cases:

export GOMP_CPU_AFFINITY="0-<physical_cores_num-1>"

GOMP_CPU_AFFINITY binds threads to specific CPUs. Setting its value to "0-<physical_cores_num-1>" will bind OpenMP threads to physical cores only. On the Intel® Xeon® Platinum 8180M processor, <physical_cores_num> is 56.

The following GNU OpenMP environment variables are suggested to be used together with GOMP_CPU_AFFINITY.

  • OMP_PROC_BIND: Specifies whether threads may be moved between processors. Setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions.
  • OMP_SCHEDULE: Determines how OpenMP threads are scheduled.
  • GOMP_SPINCOUNT: Determines how long a thread waits actively (consuming CPU power) before it waits passively (without consuming CPU power). You can experiment with this value to find what works best for your program.

The following is a recommended combination of these environment variables:

export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="0-55"
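GNU OpenMP can report the settings it actually ended up with: exporting the standard OMP_DISPLAY_ENV variable makes libgomp print its internal control variables (including the bind policy) to stderr at startup, which is a convenient way to confirm the configuration took effect:

$ export OMP_DISPLAY_ENV=true
$ python <pytorch_script>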

Recommended setting for special cases:

In some cases you may wish to run instances on specific cores. For instance, you may wish to run 4 instances on an Intel Xeon Platinum 8180M processor. Since there are 56 physical cores (28 on each socket) on the CPU, each instance can run on 14 physical cores, i.e., CPU resources can be scheduled as follows.

  • Instance 1: Core 0-13 (on Socket 0)
  • Instance 2: Core 14-27 (on Socket 0)
  • Instance 3: Core 28-41 (on Socket 1)
  • Instance 4: Core 42-55 (on Socket 1)

The following commands meet this requirement. Each instance runs in a separate Linux terminal.

# On Terminal 1
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="0-13"
$ python <pytorch_script>

# On Terminal 2
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="14-27"
$ python <pytorch_script>

# On Terminal 3
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="28-41"
$ python <pytorch_script>

# On Terminal 4
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="42-55"
$ python <pytorch_script>

KMP_AFFINITY

Recommended setting for general cases:

export KMP_AFFINITY=granularity=fine,compact,1,0

Intel DNNL has the ability to bind OpenMP threads to physical processing units. KMP_AFFINITY is used to take advantage of this functionality. It restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer.

Note: For a more detailed description of KMP_AFFINITY, please refer to the Intel® C++ developer guide.

Figure 1 shows the machine topology map when KMP_AFFINITY is set to the recommended value. The OpenMP thread <n>+1 is bound to a thread context as close as possible to OpenMP thread <n>, but on a different core. Once each core has been assigned one OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but they are assigned on different thread contexts.

Figure 1. Machine topology map with setting KMP_AFFINITY=granularity=fine,compact,1,0

The advantage of this setting is that consecutive threads are bound close together, so that communication overhead, cache line invalidation overhead, and page thrashing are minimized. If the application also has parallel regions that do not utilize all of the available OpenMP threads, this setting avoids binding multiple threads to the same core while leaving other cores unused.

Similarly, binding OpenMP threads to physical cores only can be done with the following setting.

export KMP_AFFINITY=granularity=fine,proclist=[0-<physical_cores_num-1>],explicit
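Intel's OpenMP runtime can also report the resulting bindings itself: adding the verbose modifier to KMP_AFFINITY makes the runtime print an "OMP: Info" line to stderr for each thread it binds, which is a convenient way to confirm the placement before measuring performance:

$ export LD_PRELOAD=<path>/libiomp5.so
$ export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
$ python <pytorch_script>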

Recommended setting for special cases:

In cases where you wish to run instances on specific cores, take the same example as for GOMP_CPU_AFFINITY:

  • Instance 1: Core 0-13 (on Socket 0)
  • Instance 2: Core 14-27 (on Socket 0)
  • Instance 3: Core 28-41 (on Socket 1)
  • Instance 4: Core 42-55 (on Socket 1)

Again, four separate Linux terminals are required; each instance runs in its own terminal.

# On Terminal 1
$ export LD_PRELOAD=<path>/libiomp5.so
$ export KMP_AFFINITY=granularity=fine,proclist=[0-13],explicit
$ python <pytorch_script>

# On Terminal 2
$ export LD_PRELOAD=<path>/libiomp5.so
$ export KMP_AFFINITY=granularity=fine,proclist=[14-27],explicit
$ python <pytorch_script>

# On Terminal 3
$ export LD_PRELOAD=<path>/libiomp5.so
$ export KMP_AFFINITY=granularity=fine,proclist=[28-41],explicit
$ python <pytorch_script>

# On Terminal 4
$ export LD_PRELOAD=<path>/libiomp5.so
$ export KMP_AFFINITY=granularity=fine,proclist=[42-55],explicit
$ python <pytorch_script>

How to verify that the GOMP_CPU_AFFINITY/KMP_AFFINITY setting is done correctly

We can use the Linux command-line tool htop to verify whether GOMP_CPU_AFFINITY/KMP_AFFINITY is working as expected. For Ubuntu* users, the tool can be installed with the apt command and is launched simply by running htop in the command line. (A simple stand-in workload that you can use in place of <pytorch_script> is sketched after the examples below.)

$ sudo apt install htop
$ htop

Examples:

GOMP_CPU_AFFINITY:

# On Terminal 1
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="0-13"
$ export OMP_NUM_THREADS=14
$ numactl --cpunodebind=0 --membind=0 python <pytorch_script>

# On Terminal 2
$ export OMP_SCHEDULE=STATIC
$ export OMP_PROC_BIND=CLOSE
$ export GOMP_CPU_AFFINITY="42-55"
$ export OMP_NUM_THREADS=14
$ numactl --cpunodebind=1 --membind=1 python <pytorch_script>

KMP_AFFINITY:

# On Terminal 1
$ export LD_PRELOAD=<path>/libiomp5.so
$ export OMP_NUM_THREADS=14
$ export KMP_AFFINITY=granularity=fine,proclist=[0-13],explicit
$ numactl --cpunodebind=0 --membind=0 python <pytorch_script>

# On Terminal 2
$ export LD_PRELOAD=<path>/libiomp5.so
$ export OMP_NUM_THREADS=14
$ export KMP_AFFINITY=granularity=fine,proclist=[42-55],explicit
$ numactl --cpunodebind=1 --membind=1 python <pytorch_script>
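If you do not have a long-running workload at hand while watching htop, a CPU-bound stand-in for <pytorch_script>, such as the matrix-multiplication loop below (sizes chosen arbitrarily), keeps the OpenMP threads busy long enough to observe which cores are being utilized:

$ python -c "import torch; x = torch.randn(8192, 8192); [torch.mm(x, x) for _ in range(50)]"

With a correct affinity setting, htop should show load only on the cores listed in GOMP_CPU_AFFINITY/KMP_AFFINITY.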

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.