To fully utilize the power of Intel® architecture (IA) and thus yield high performance, TensorFlow* can be powered by Intel’s highly optimized math routines for deep learning tasks. This primitives library is called Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). Intel MKL-DNN includes convolution, normalization, activation, inner product, and other primitives.
To set up Intel® Optimization for TensorFlow* on your system, see the installation guide.
The Bottom Line: The user gets accelerated CPU TensorFlow execution with no code changes.
Figure 1: Inference and training performance on Intel® processors with Intel® MKL-DNN
Maximum Throughput vs. Real-time Inference
Deep learning inference can follow two different strategies, each with its own performance metrics and recommendations. The first is Max Throughput (MxT), which aims to process as many images per second as possible by passing in batches of size > 1. For Max Throughput, best performance is achieved by exercising all the physical cores on a socket. This solution is intuitive in that we simply load up the CPU with as much work as we can and process as many images as we can in a parallel and vectorized fashion. The second is Real-time Inference (RTI), an altogether different regime where we typically want to process a single image as fast as possible. Here we aim to avoid penalties from excessive thread launching and orchestration between concurrent processes. The strategy is to confine and execute quickly. The following best known methods (BKMs) are marked MxT or RTI where the recommendation differs between the two.
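The two regimes are judged by different numbers: MxT by images per second, RTI by the time needed to finish one request. The arithmetic can be sketched as follows, assuming a hypothetical helper named summarize_run and timings that you measured yourself. The figures in the usage lines are illustrative, not measurements.

```python
def summarize_run(batch_size, num_batches, elapsed_seconds):
    """Return throughput and per-batch latency for a timed inference run.

    Hypothetical helper: the caller supplies the wall-clock time that
    `num_batches` batches of `batch_size` images took to process.
    """
    if batch_size < 1 or num_batches < 1 or elapsed_seconds <= 0:
        raise ValueError("batch_size, num_batches >= 1 and elapsed_seconds > 0 required")
    images = batch_size * num_batches
    return {
        # MxT metric: how many images were processed per second.
        "images_per_sec": images / elapsed_seconds,
        # RTI metric: how long one batch (one image when batch_size == 1) took.
        "latency_ms_per_batch": elapsed_seconds * 1000.0 / num_batches,
    }


# Illustrative numbers only: a batched run and a single-image run.
mxt = summarize_run(batch_size=128, num_batches=10, elapsed_seconds=2.0)
rti = summarize_run(batch_size=1, num_batches=100, elapsed_seconds=0.5)
```

A large batch usually raises images_per_sec at the cost of a longer wait per batch, which is why the tuning advice for the two regimes can differ.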
TensorFlow Runtime Options Affecting Performance
Runtime options heavily affect TensorFlow performance. Understanding them will help you get the best performance out of the Intel Optimization for TensorFlow.
- intra_op_parallelism_threads
- inter_op_parallelism_threads
- Data layout
Recommended settings (RTI)→ intra_op_parallelism = #physical cores
Recommended settings → inter_op_parallelism = 2
tf_cnn_benchmarks usage (shell)
python tf_cnn_benchmarks.py --num_intra_threads=cores --num_inter_threads=2
intra_op_parallelism_threads and inter_op_parallelism_threads are runtime variables defined in TensorFlow's ConfigProto. The ConfigProto is used for configuration when creating a session. These two variables control the number of cores to use.
intra_op_parallelism_threads controls parallelism inside an operation. For instance, if matrix multiplication or reduction is intended to be executed in several threads, this variable should be set. TensorFlow schedules tasks in a thread pool that contains intra_op_parallelism_threads threads. As illustrated later in Figure 3, OpenMP* threads are bound to thread contexts as close together as possible on different cores, so setting this variable to the number of available physical cores is recommended.
NOTE: This setting is highly dependent on hardware and topologies, so it’s best to empirically confirm the best setting on your workload.
inter_op_parallelism_threads controls parallelism among independent operations. Since these operations do not depend on each other, TensorFlow tries to run them concurrently in a thread pool that contains inter_op_parallelism_threads threads. Set this variable to the number of parallel paths on which you want the code to run. For Intel® Optimization for TensorFlow, we recommend starting with a setting of 2 and adjusting after empirical testing.
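As a minimal sketch of how the two recommendations might be gathered in Python, the hypothetical helper below builds a dictionary whose keys are the two ConfigProto field names. The core count of 28 is an example value, and the TensorFlow 1.x calls in the trailing comment are left as comments so that the block runs without TensorFlow installed.

```python
def recommended_thread_settings(physical_cores, inter_op=2):
    """Build ConfigProto-style thread settings from the recommendations above.

    Hypothetical helper. `physical_cores` must be supplied by the caller
    (for example from `lscpu`), because Python's os.cpu_count() reports
    logical processors and therefore counts hyper-threads as well.
    """
    if physical_cores < 1 or inter_op < 1:
        raise ValueError("thread counts must be positive")
    return {
        "intra_op_parallelism_threads": physical_cores,
        "inter_op_parallelism_threads": inter_op,
    }


settings = recommended_thread_settings(physical_cores=28)

# With TensorFlow 1.x installed, the dictionary maps onto ConfigProto:
#   import tensorflow as tf
#   config = tf.ConfigProto(**settings)
#   session = tf.Session(config=config)
```

Treat both values as starting points and confirm them empirically on your own workload, as the note above advises.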
Recommended settings → data_format = NCHW
tf_cnn_benchmarks usage (shell)
python tf_cnn_benchmarks.py --num_intra_threads=cores --num_inter_threads=2 --data_format=NCHW
On modern IA, efficient cache and memory usage has a significant impact on overall performance. A good memory access pattern minimizes the extra cost of accessing data in memory, so that overall processing is not slowed down. How data is stored and accessed plays an important role in achieving this. This is usually referred to as data layout: it describes how multidimensional arrays are stored linearly in the memory address space.
In most cases, data layout is represented by four letters for a two-dimensional image.
- N: Batch size, indicating number of images in a batch.
- C: Channel, indicating number of channels in an image.
- W: Width, indicating number of pixels in horizontal dimension of an image.
- H: Height, indicating number of pixels in vertical dimension of an image.
The order of these four letters indicates how pixel data are stored in one-dimensional memory space. For instance, NCHW indicates that pixel data are stored width first, then height, then channel, and finally batch (illustrated in Figure 2), so W is the fastest-varying index. The data is then accessed from left to right with channels-first indexing. NCHW is the recommended data layout when using Intel MKL-DNN, since this format is an efficient data layout for the CPU. TensorFlow uses NHWC as its default data layout, but it also supports NCHW.
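To make the difference between the two layouts concrete, the sketch below computes where element (n, c, h, w) lands in linear memory under each ordering. The function names and the tiny tensor shape are illustrative assumptions. The formulas are the standard row-major offsets for the stated dimension order.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset when W varies fastest, then H, then C, then N."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset when C varies fastest, then W, then H, then N."""
    return ((n * H + h) * W + w) * C + c


# Illustrative shape: 3 channels, 2 rows, 4 columns.
C, H, W = 3, 2, 4

# Moving one pixel to the right is a unit stride in NCHW,
# but a stride of C elements in NHWC.
step_w_nchw = offset_nchw(0, 0, 0, 1, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
step_w_nhwc = offset_nhwc(0, 0, 0, 1, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)

# Moving to the next channel is a stride of H * W elements in NCHW,
# but a unit stride in NHWC.
step_c_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
step_c_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
```

In NCHW, all pixels of one channel are contiguous, which is the property that the channels-first description above refers to.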
Non-Uniform Memory Access (NUMA) Controls Affecting Performance
Recommended settings → --cpunodebind=0 --membind=0
numactl --cpunodebind=0 --membind=0 python
Running on a NUMA-enabled machine brings special considerations. NUMA, or non-uniform memory access, is a memory layout design used in data center machines to take advantage of memory locality in multi-socket machines with multiple memory controllers and blocks. Intel Optimization for TensorFlow runs best when both execution and memory usage are confined to a single NUMA node. So when running on a NUMA-enabled system, intra_op_parallelism_threads should be set to the number of cores local to a single NUMA node.
You can optimize performance by breaking up your workload into multiple data shards and then running concurrently on more than one NUMA node. On each node (N), run the following command:
numactl --cpunodebind=N --membind=N python
For example, you can use the shell's “&” operator to launch simultaneous processes on multiple NUMA nodes:
numactl --cpunodebind=0 --membind=0 python & numactl --cpunodebind=1 --membind=1 python
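The sharding idea can be sketched in Python as follows. The helper names, the script name infer.py, and the input file names are hypothetical. Only the numactl options come from the commands above. The block only builds the command lines. Launching them is shown in a comment.

```python
def shard(items, num_nodes):
    """Split `items` round-robin into one shard per NUMA node."""
    return [items[node::num_nodes] for node in range(num_nodes)]


def numactl_command(node, program_args):
    """Return the argv list that pins `program_args` to one NUMA node."""
    return [
        "numactl",
        f"--cpunodebind={node}",
        f"--membind={node}",
        *program_args,
    ]


# Hypothetical workload: six input files spread over two NUMA nodes.
files = ["a.rec", "b.rec", "c.rec", "d.rec", "e.rec", "f.rec"]
commands = [
    numactl_command(node, ["python", "infer.py", *part])
    for node, part in enumerate(shard(files, num_nodes=2))
]

# To launch the processes concurrently (the Python analogue of "&"):
#   import subprocess
#   procs = [subprocess.Popen(cmd) for cmd in commands]
#   for proc in procs:
#       proc.wait()
```

Each process then sees only the cores and memory of its own node, so intra_op_parallelism_threads inside each process should follow the per-node core count.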
Intel® MKL-DNN Technical Performance Considerations
Intel MKL-DNN Technical Summary: The library takes advantage of SIMD instructions through vectorization, as well as multiple cores through multi-threading. Vectorization makes effective use of the cache, the computational capability, and the instruction sets of modern CPUs. A single instruction can process up to 16 single-precision numbers (one 512-bit vector). Meanwhile, up to two multiply-and-add (Fused Multiply Add, or FMA) operations can be finished in a single cycle. Moreover, multi-threading helps perform multiple independent operations simultaneously. Since deep learning computation is often best served by avoiding sequential execution, getting the available cores working in parallel is the obvious way to speed up deep learning tasks. Intel MKL-DNN utilizes OpenMP to leverage Intel architecture.
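The figures in this summary combine into a simple upper bound on single-precision throughput. The sketch below works through that arithmetic. The function names are illustrative, and the 28-core, 2.5 GHz example uses nominal values, so treat the result as a theoretical ceiling: the sustained clock under heavy vector load may be lower.

```python
def simd_lanes(vector_bits=512, element_bits=32):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(lanes, fma_per_cycle=2):
    """Each FMA performs one multiply and one add per lane: 2 FLOPs."""
    return lanes * fma_per_cycle * 2


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak for `cores` cores running at `ghz` GHz."""
    return cores * ghz * flops_per_cycle


lanes = simd_lanes()                     # single precision in a 512-bit vector
per_cycle = peak_flops_per_cycle(lanes)  # two FMAs per cycle on every lane
socket_peak = peak_gflops(cores=28, ghz=2.5, flops_per_cycle=per_cycle)
```

With these inputs the sketch gives 16 lanes, 64 FLOPs per cycle per core, and 4480 GFLOPS per socket. Reaching a useful fraction of that bound requires both vectorized primitives and all cores working in parallel.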
To ensure robustness, Intel developed a number of optimized deep learning primitives in Intel MKL-DNN. In addition to matrix multiplication and convolution, the following building blocks are implemented for vectorization-friendly data layout:
- Direct batched convolution
- Inner product
- Pooling: maximum, minimum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU)
- Data manipulation: multi-dimensional transposition (conversion), concat, sum and scale
Intel MKL-DNN utilizes the following environment variables for vectorization and multi-threading, so changing their values affects the performance of the framework. These environment variables are described in detail in the following sections. We highly recommend that users tune these values for their specific neural network model and platform.
Recommended settings → KMP_AFFINITY=granularity=fine,verbose,compact,1,0
tf_cnn_benchmarks usage (shell)
python tf_cnn_benchmarks.py --num_intra_threads=cores --num_inter_threads=2 --data_format=NCHW --kmp_affinity=granularity=fine,compact,1,0
Intel MKL-DNN has the ability to bind OpenMP threads to physical processing units. KMP_AFFINITY is used to take advantage of this functionality. It restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer.
Usage of this environment variable is as follows:
KMP_AFFINITY=[<modifier>,...]<type>[,<permute>][,<offset>]
modifier is a string consisting of a keyword and a specifier. type is a string indicating the thread affinity to use. permute is a positive integer value that controls which levels are most significant when sorting the machine topology map. The value forces the mappings to make the specified number of most significant levels of the sort the least significant, and it inverts the order of significance. The root node of the tree is not considered a separate level for the sort operations. offset is a positive integer value that indicates the starting position for thread assignment. We will use the recommended setting of KMP_AFFINITY as an example to explain the basic content of this environment variable.
The modifier is granularity=fine,verbose. The fine granularity causes each OpenMP thread to be bound to a single thread context. The verbose modifier, which is optional, prints messages at runtime concerning the supported affinity. These messages include information about the number of packages, number of cores in each package, number of thread contexts for each core, and OpenMP thread bindings to physical thread contexts. The type is compact, which assigns OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was placed.
NOTE: The recommendation changes if Hyper-Threading is disabled on your machine. In that case, the recommendation is:
KMP_AFFINITY=granularity=fine,verbose,compact
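The two recommended values share one shape: modifiers first, then the type, then the optional permute and offset numbers. A minimal sketch of assembling such a value, assuming a hypothetical helper named kmp_affinity that is not part of any Intel or TensorFlow API:

```python
def kmp_affinity(affinity_type, modifiers=(), permute=None, offset=None):
    """Join modifiers, type, permute and offset into a KMP_AFFINITY value.

    Hypothetical helper for illustration. The numeric fields are
    positional, so an offset can only be given together with a permute.
    """
    if offset is not None and permute is None:
        raise ValueError("offset requires permute, because the fields are positional")
    parts = list(modifiers) + [affinity_type]
    if permute is not None:
        parts.append(str(permute))
    if offset is not None:
        parts.append(str(offset))
    return ",".join(parts)


modifiers = ["granularity=fine", "verbose"]

# Recommended value with hyper-threading enabled.
hyperthreading_on = kmp_affinity("compact", modifiers, permute=1, offset=0)

# Recommended value with hyper-threading disabled.
hyperthreading_off = kmp_affinity("compact", modifiers)
```

The resulting strings are what you would export as KMP_AFFINITY in the shell before starting the Python process.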
Fig. 3 shows the machine topology map when KMP_AFFINITY is set to these values. The OpenMP thread <n>+1 is bound to a thread context as close as possible to OpenMP thread <n>, but on a different core. Once each core has been assigned one OpenMP thread, the subsequent OpenMP threads are assigned to the available cores in the same order, but they are assigned on different thread contexts.
Figure 3. Machine topology map with setting KMP_AFFINITY=granularity=fine,compact,1,0
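The placement described for Figure 3 can be imitated with a small model. The sketch below is a deliberate simplification: it assumes a single package, numbers cores and thread contexts from zero, and models only permute values 0 and 1. Real bindings depend on how the operating system enumerates the topology, so the verbose modifier remains the authoritative check.

```python
def place_threads(num_threads, cores, contexts_per_core=2, permute=0):
    """Return a (core, thread_context) pair for each OpenMP thread id.

    Simplified single-package model of type=compact, for illustration only.
    """
    if permute not in (0, 1):
        raise ValueError("this sketch models permute 0 and 1 only")
    if num_threads > cores * contexts_per_core:
        raise ValueError("more threads than thread contexts")
    placement = []
    for n in range(num_threads):
        if permute == 0:
            # compact: fill every thread context of a core before moving on.
            placement.append((n // contexts_per_core, n % contexts_per_core))
        else:
            # compact,1: visit every core once, then wrap to the next context.
            placement.append((n % cores, n // cores))
    return placement


# Four cores with two thread contexts each.
spread = place_threads(6, cores=4, permute=1)
packed = place_threads(4, cores=4, permute=0)
```

In this model, compact,1,0 gives the first four threads one core each before any second thread context is used, while plain compact places threads 0 and 1 on the same core.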
The advantage of this setting is that consecutive threads are bound close together, so that communication overhead, cache line invalidation overhead, and page thrashing are minimized. If the application also has a number of parallel regions that do not utilize all of the available OpenMP threads, it is desirable to avoid binding multiple threads to the same core while leaving other cores unused.
For a more detailed description of KMP_AFFINITY, please refer to Intel® C++ developer guide.
Recommended settings for CNN→ KMP_BLOCKTIME=0
Recommended settings for non-CNN→ KMP_BLOCKTIME=1 (user should verify empirically)
export KMP_BLOCKTIME=0 (or 1)
tf_cnn_benchmarks usage (shell)
python tf_cnn_benchmarks.py --num_intra_threads=cores --num_inter_threads=2 --data_format=NCHW --kmp_affinity=granularity=fine,compact,1,0 --kmp_blocktime=0 (or 1)
KMP_BLOCKTIME sets the time, in milliseconds, that a thread should wait after completing the execution of a parallel region before sleeping. The default value is 200 ms.
After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Sleeping allows the threads to be used, until more parallel work becomes available, by non-OpenMP threaded code that may execute between parallel regions, or by other applications. A small KMP_BLOCKTIME value may offer better overall performance if the application contains non-OpenMP threaded code that executes between parallel regions. A larger KMP_BLOCKTIME value may be more appropriate if threads are to be reserved solely for OpenMP execution, but it may penalize other concurrently running OpenMP or threaded applications. We suggest setting it to 0 for convolutional neural network (CNN) based models.
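One way to apply this recommendation from Python is to prepare the environment for the process that will run the model. The sketch below assumes two hypothetical helpers, blocktime_for and with_blocktime. It copies the environment instead of modifying it in place, on the assumption that the OpenMP runtime reads the variable when it starts up.

```python
import os


def blocktime_for(model_is_cnn):
    """Pick the suggested KMP_BLOCKTIME: 0 for CNN models, 1 otherwise."""
    return "0" if model_is_cnn else "1"


def with_blocktime(env, model_is_cnn):
    """Return a copy of `env` with KMP_BLOCKTIME set (hypothetical helper)."""
    updated = dict(env)
    updated["KMP_BLOCKTIME"] = blocktime_for(model_is_cnn)
    return updated


cnn_env = with_blocktime(os.environ, model_is_cnn=True)
other_env = with_blocktime(os.environ, model_is_cnn=False)

# The copy can be handed to a child process, for example:
#   import subprocess
#   subprocess.run(["python", "tf_cnn_benchmarks.py"], env=cnn_env)
```

The value for non-CNN models is only a starting point and should be verified empirically, as noted in the recommended settings.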
Recommended settings for CNN→ OMP_NUM_THREADS = num physical cores
export OMP_NUM_THREADS=<number of physical cores>
OMP_NUM_THREADS sets the maximum number of threads to use for OpenMP parallel regions if no other value is specified in the application.
The value can be a single integer, in which case it specifies the number of threads for all parallel regions. The value can also be a comma-separated list of integers, in which case each integer specifies the number of threads for a parallel region at a nesting level.
The first position in the list represents the outer-most parallel nesting level, the second position represents the next-inner parallel nesting level, and so on. At any level, the integer can be left out of the list. If the first integer in a list is left out, it implies the normal default value for threads is used at the outer-most level. If the integer is left out of any other level, the number of threads for that level is inherited from the previous level.
The default value is the number of logical processors visible to the operating system on which the program is executed. This value is recommended to be set to the number of physical cores.
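The rules for the comma-separated form can be sketched as a small parser. The function below is a hypothetical illustration of the rules stated above, not the OpenMP runtime's own parsing code. The default thread count is passed in by the caller.

```python
def parse_omp_num_threads(value, default_threads):
    """Expand an OMP_NUM_THREADS string into one thread count per nesting level.

    An empty first field takes `default_threads`; an empty later field
    inherits the count of the previous level.
    """
    levels = []
    for position, field in enumerate(value.split(",")):
        field = field.strip()
        if field:
            count = int(field)
            if count < 1:
                raise ValueError("thread counts must be positive")
            levels.append(count)
        elif position == 0:
            levels.append(default_threads)
        else:
            levels.append(levels[-1])
    return levels


single = parse_omp_num_threads("28", default_threads=56)
inherited = parse_omp_num_threads("4,,2", default_threads=56)
defaulted = parse_omp_num_threads(",2", default_threads=56)
```

Here "28" applies to all parallel regions, "4,,2" expands to 4, 4, and 2 threads for three nesting levels, and ",2" uses the default of 56 at the outer-most level.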
The KMP_SETTINGS environment variable enables (TRUE) or disables (FALSE) the printing of OpenMP run-time library environment variables during program execution.
Inference using FP32. Batch size: Caffe* GoogLeNet v1: 128, AlexNet: 256.
Configuration for inference throughput:
Tested by Intel as of 6/7/2018: Platform: 2 socket Intel® Xeon® Platinum 8180 processor with 2.50GHz / 28 cores, HT: on, Turbo: on, total memory 376.28GB (12 slots / 32 GB / 2666 MHz), 4 instances of the framework, CentOS Linux-7.3.1611-Core, SSD: sda RS3WC080, HDD: 744.1GB, sdb RS3WC080, HDD: 1.5TB, sdc RS3WC080, HDD: 5.5TB, Deep Learning Framework: Caffe* version a3d5b022fe026e9092fc7abc7654b1162ab9940d, Topology: GoogLeNet v1, BIOS: SE5C620.86B.00.01.0004.071220170215, Intel MKL-DNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396, No Data Layer. Measured: 1449 imgs/sec vs Tested by Intel as of 06/15/2018 Platform: 2S Intel® Xeon® processor E5-2699 v3 with 2.30GHz / 18 cores, HT: enabled, Turbo: disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804(Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework Berkeley Vision and Learning Center (BVLC) Caffe: https://github.com/BVLC/caffe, inference & training measured with “caffe time” command. For “ConvNet” topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594.
Configuration for training throughput:
Tested by Intel as of 05/29/2018 Platform: 2 socket Intel® Xeon® Platinum 8180 processor with 2.50GHz / 28 cores, HT: on, Turbo: on, Total Memory 376.28GB (12 slots / 32 GB / 2666 MHz), 4 instances of the framework, CentOS Linux-7.3.1611-Core, SSD sda RS3WC080 HDD 744.1GB, sdb RS3WC080 HDD 1.5TB, sdc RS3WC080 HDD 5.5TB, Deep Learning Framework Caffe* version: a3d5b022fe026e9092fc7abc7654b1162ab9940d, Topology: AlexNet, BIOS: SE5C620.86B.00.01.0004.071220170215, Intel MKL-DNN: version: 464c268e544bae26f9b85a2acb9122c766a4c396, No Data Layer. Measured: 1257 imgs/sec vs Tested by Intel as of 06/15/2018 Platform: 2S Intel® Xeon® processor E5-2699 v3 @ 2.30GHz (18 cores), HT: enabled, turbo: disabled, scaling governor set to “performance” via intel_pstate driver, 64GB DDR4-2133 ECC RAM. BIOS: SE5C610.86B.01.01.0024.021320181901, CentOS Linux-7.5.1804(Core) kernel 3.10.0-862.3.2.el7.x86_64, SSD sdb INTEL SSDSC2BW24 SSD 223.6GB. Framework BVLC Caffe: https://github.com/BVLC/caffe, inference & training measured with “caffe time” command. For “ConvNet” topologies, a dummy dataset was used. For other topologies, data was stored on local storage and cached in memory before training. BVLC Caffe (http://github.com/BVLC/caffe), revision 2a1c552b66f026c7508d390b526f2495ed3be594.