Understanding NUMA using Intel® Optimization for Caffe*

Introduction

In this article we demonstrate how Intel® VTune™ Amplifier can be used to identify and improve performance bottlenecks while running a neural network training workload (for example, training a Canadian Institute for Advanced Research's - CIFAR-10 model) with deep learning framework Caffe*.

For this tutorial we start with the fork of BVLC/Caffe, which is dedicated to improving performance of Caffe running on a CPU, in particular Intel® Xeon® processors. This version of Caffe integrates the latest version of Intel® Math Kernel Library (Intel® MKL) and Intel® Machine Learning Scaling Library (Intel® MLSL) and is optimized for Intel® Advanced Vector Extensions 2 and Intel® Advanced Vector Extensions 512 instructions.

table of data
Figure 1. Caffe "time" command—performance difference between BVLC/Caffe and Caffe optimized for Intel® architecture through different layers running on the same test system.

Benchmark results were obtained prior to the implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.

Configuration: 2S Intel® Xeon® Platinum 8180 processor @ 2.50 GHz, 28 physical cores/socket – HT ON, 192 GB DDR4 2666 MHz, Red Hat Enterprise Linux Server release 7.3 Linux* version 3.10.0-514.16.1.el7.x86_64, Intel VTune Amplifier 2018 Update 1, Caffe version – 1.1.0, Testing model – CIFAR-10 with CIFAR-10 dataset.
Benchmark Source: Intel Corporation. See below for further notes and disclaimers.

As demonstrated in the following article, Caffe Optimized for Intel® Architecture: Applying Modern Code Techniques and also from Figure 1, we can see that there is a significant difference in performance between BVLC/Caffe and Caffe optimized for Intel architecture both running on an Intel CPU. As a result, we have set the goal of this article to investigate whether we can further improve the performance of Caffe optimized for Intel architecture without modifying the framework (or its sources), and by simply using the currently available Intel software developer tools like Intel VTune Amplifier and Intel® MLSL.

Compiling Caffe*

It is assumed that the Caffe dependencies are successfully installed as per your platform. More information about installing prerequisites can be found here. We have also installed OpenCV 3.1.0 (optional) on the test system. Instructions for installing OpenCV can be found here.

Check out the Caffe source from the intel/caffe Git* repository:

$ git clone https://github.com/intel/caffe.git intelcaffe

Modify the Makefile.config to enable use of OpenCV 3 and Intel MLSL for the Linux* operating system. More about Intel MLSL in a later part of the tutorial.

OPENCV_VERSION := 3
USE_MLSL := 1

Code block 1. Modifications to the Makefile.config.

Steps to compile Caffe:

$ make clean
$ make -j16 -k

CIFAR-10 Training: Profiling and Recommendations

Once Caffe is successfully compiled we prepare and use one of the example models, CIFAR-10 with CIFAR-10 image-classification dataset to investigate a performance bottleneck of Caffe CIFAR-10 training on a CPU.

The CIFAR-10 dataset consists of 60,000 color images, each with dimensions of 32 × 32, equally divided and labeled into the following 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The classes are mutually exclusive.

Steps to get and prepare the CIFAR-10 dataset:

$ cd $CAFFE_ROOT
$ data/cifar10/get_cifar10.sh
$ examples/cifar10/create_cifar10.sh
$ time examples/cifar10/train_full.sh
.
.
.
I0109 06:09:27.285831 428533 solver.cpp:737]     Test net output #0: accuracy = 0.8224
I0109 06:09:27.285871 428533 solver.cpp:737]     Test net output #1: loss = 0.512778 (* 1 = 0.512778 loss)
I0109 06:09:27.285876 428533 solver.cpp:617] Optimization Done.
I0109 06:09:27.285881 428533 caffe.cpp:345] Optimization Done.

real    14m41.296s
user    805m57.464s
sys     14m20.309s

Code block 2. Baseline elapsed time (training CIFAR-10 model with default settings) – Note: Accuracy of 82.24% and Elapsed time of approximately 14 mins.

With the default hyperparameters, it currently takes about 14 minutes to complete the full training using the script provided in the CIFAR-10 example directory which runs 70k iterations to achieve 82.34 percent accuracy with the default batch size of 100.

We now run the same example with Intel VTune Amplifier memory access analysis to look for potential bottlenecks. In order to limit the amount of data to be collected with memory access analysis, we run the training for 12k iterations instead of 70k.

With hardware event-based sampling, Intel VTune Amplifier's memory access analysis can be used to identify memory-related issues, like non-uniform memory architecture (NUMA) problems and bandwidth-limited accesses, and attribute performance events to memory objects (data structures), which is provided due to instrumentation of memory allocations/de-allocations and getting static/global variables from symbol information.

Additional information about Intel VTune Amplifier training and documentation can be found on the product webpage.

$ source /opt/intel/vtune_amplifier/amplxe-vars.sh
$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_100b_12kiter_MA examples/cifar10/train_full.sh

Screenshot of memory access data
Figure 2. Summary (memory access—memory usage viewpoint).

screenshot of of UPI bandwidth
Figure 3. Intel® UPI bandwidth (bottom-up view)—moderately high Intel UPI bandwidth.

Issue 1: High remote and local memory ratio

In NUMA machines, memory requests missing last level cache (LLC) may be serviced either by local or remote DRAM. Memory requests to remote DRAM incur much greater latencies than those to local DRAM. Intel VTune Amplifier defines this metric as a ratio of remote DRAM loads to local DRAM loads. Referring to the above Figures 2 and 3, it can be seen that the remote to local memory ratio is about 40 percent, which is considered to be high, and accounts for higher Intel® Ultra Path Interconnect (Intel® UPI) traffic.

Screenshot of memory access hotspots viewpoint
Figure 4. Summary (memory access—hotspots viewpoint).

Issue 2: High imbalance or serial spin time

Imbalance or serial spinning time is CPU time when working threads are spinning on a synchronization barrier consuming CPU resources. This can be caused by load imbalance, insufficient concurrency for all working threads, or waits on a barrier in the case of serialized execution.

Recommendation 1: Reducing remote to local memory access

In order to improve the ratio of remote to local memory access, it is recommended that cores must keep frequently accessed data local. One way to achieve this is to identify and modify each of the contributing source code regions to affinitize its data; however, this would require major code restructuring. An alternative approach is to use an out-of-the-box Intel MLSL to perform parallel distributed training across the two NUMA nodes so that each process allocates and accesses its data from the same NUMA domain.

In general, there are two approaches of achieving parallelism in distributed training, data parallelism and model parallelism. The approach used in Caffe optimized for Intel architecture is data parallelism, where different batches of data are trained on different nodes (that is, Message Passing interface (MPI) ranks/processes). The data is split among all the nodes, but the same model is used on all of the nodes. This means that the total batch size in a single iteration is equal to the sum of individual batch sizes of all nodes. The primary purpose of Intel MLSL is to allow multinode training and scale training across hundreds of nodes. However, it does not hurt to use the same library to perform training on two NUMA nodes of a single 2-socket server. Additional information on how the distributed training works can be found here.

Using Intel MLSL with Caffe optimized for Intel architecture is as simple as recompiling Caffe with an additional flag—Use Intel MLSL: = 1, in the makefile.config (this flag was already enabled in the steps shown above to compile Caffe in Section 2, so we can skip recompiling Caffe).

In order to compare performance of the baseline with the distributed training, we use the same total batch size of 100 per iteration of the stochastic gradient descent algorithm. As a result, we set the batch size to be 50 for each node.

For distributed training, train_full.sh script can be modified as given below. Here, environment variable OMP_NUM_THREADS and KMP_AFFINITY is set to assign one OpenMP* thread to each physical core with compact affinity (see Thread Affinity Interface). Also, environment variable I_MPI_PIN_DOMAIN="numa" is used to facilitate execution of only one MPI process per NUMA node (see Interoperability with OpenMP API).

--- ../ref_intelcaffe/examples/cifar10/train_full.sh    2018-03-12 17:09:17.706664443 -0400
+++ examples/cifar10/train_full.sh      2018-03-13 01:58:50.789193819 -0400
@@ -39,15 +39,21 @@

 TOOLS=./build/tools

+export OMP_NUM_THREADS=28
+export KMP_AFFINITY="granularity=fine,compact,1,0"
+
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver.prototxt $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \

$TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@

Code block 3. Modifications to the run-script—train_full.sh.

Now we profile the distributed training for the CIFAR-10 model for 12k iterations using Intel MLSL:

$ source external/mlsl/l_mlsl_2017.2.018/intel64/bin/mlslvars.sh intel64
$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_50b_12kiter_MA examples/cifar10/train_full.sh

Screenshot of memory access data
Figure 5. Memory access summary—low remote/local memory ratio.

Screenshot of memory usage viewpoint low bandwidth
Figure 6. Memory usage viewpoint—low Intel® UPI bandwidth.

From the above figures 2 and 5, it can be seen that the overall remote to Local Memory Ratio has reduced significantly; that is, from 40 percent down to 5 percent, thereby reducing the observed total Intel UPI bandwidth. Also, the total elapsed time has now reduced from 150 seconds to 110 seconds.

Screenshot of Hotspots viewpoint moderately high
Figure 7. Hotspots viewpoint (moderately high spin time).

Recommendation 2: Reducing spin time

We see some improvements in the overall imbalance or serial spinning time but it is still significantly high. As a result, the execution performance of this workload does not scale linearly with the increased thread count (in accordance with Amdahl's law). The accounted spin time can be caused by either load imbalance, insufficient concurrency for all working threads, or waits on a barrier in the case of serialized execution.

Achieving any performance gain by improving load balance and/or parallelizing serial execution parts would require significant code refactoring and is beyond the scope of this article. However, to check if some amount of that spin time is caused due to insufficient concurrency and threads waiting for work, we can try reducing the number of worker (OpenMP) threads and see if it reduces the overall thread-spin time.

To conduct this experiment, we rerun the CIFAR-10 training for 12k iteration with VTune Amplifier, but now with OMP_NUM_THREADS environment variable set to 16. This forces a maximum limit of 16 threads per MPI process running on each NUMA node.

$ amplxe-cl -c memory-access -data-limit=0 -r CIFAR_50b_12k_16t_MA examples/cifar10/train_full.sh
data comparison
comparison data

Figure 8. Spin time comparison between 28 OpenMP threads and 16 OpenMP threads, per MPI process

The above comparison shows an improvement in the elapsed time, from 110 seconds down to 91 seconds, but at the same time it also signifies that the running workload has concurrency issues, which is affecting the simultaneous execution of a high number of parallel threads. Fixing such issues would require significant code restructuring in Caffe to implement further optimizations such as collapsing inner and outer loops in parallel regions of code, changing scheduling and/or granularity of work distribution, and identification of other serial regions of code and parallelizing them, and is beyond the scope of this article.

Optimized Performance

Finally, we would like to verify whether the above changes improve our total elapsed time to train the full model for 70k iterations with the same default hyperparameters. In order to do this, we modify the train_full.sh script as given below:

--- ../ref_intelcaffe/examples/cifar10/train_full.sh    2018-03-12 17:09:17.706664443 -0400
+++ examples/cifar10/train_full.sh      2018-03-13 01:58:50.789193819 -0400
@@ -39,15 +39,21 @@

 TOOLS=./build/tools

+export OMP_NUM_THREADS=16
+export KMP_AFFINITY="granularity=fine,compact,1,0"
+
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver.prototxt $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr1.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_60000.solverstate.h5 $@

 # reduce learning rate by factor of 10
+mpirun -l -n 2 -ppn 2 -genv I_MPI_PIN_DOMAIN="numa" \
 $TOOLS/caffe train \
     --solver=examples/cifar10/cifar10_full_solver_lr2.prototxt \
     --snapshot=examples/cifar10/cifar10_full_iter_65000.solverstate.h5 $@

Code block 4. Modifications to the run-script—train_full.sh.

$ time examples/cifar10/train_full.sh
.
.
.
[0] I0312 20:25:35.617869 323789 solver.cpp:563]    Test net output #0: accuracy = 0.8215
[0] I0312 20:25:35.617897 323789 solver.cpp:563]     Test net output #1: loss = 0.514667 (* 1 = 0.514667 loss)
[0] I0312 20:25:35.617908 323789 solver.cpp:443] Optimization Done.
[0] I0312 20:25:35.617914 323789 caffe.cpp:345] Optimization Done.
[1] I0312 20:25:35.617866 323790 solver.cpp:443] Optimization Done.
[1] I0312 20:25:35.617892 323790 caffe.cpp:345] Optimization Done.

real    8m5.952s
user    256m16.212s
sys     2m3.563s

Code block 5. Optimized elapsed time (distributed training for CIFAR-10 model with Intel® MLSL) – Accuracy of 82.15% and Elapsed time of approximately 8 mins.

From code block 5, we see that we can achieve accuracy of 82.15 percent (approximately similar to what we achieved at the beginning of the article) in up to 45 percent less time with distributed training and tuned number of OpenMP threads.

System Configuration

Performance testing for the results provided in this paper were achieved on the following test system. For more information go to Product Performance.

Component

Specification

System

2-socket server

Host Processor

Intel® Xeon® Platinum 8180 processor @ 2.50 GHz

Physical Cores

28 cores/socket

Host Memory

96 GB/socket

Profiler

Intel VTune Amplifier 2018 Update 1

Host Operating System

Linux* version 3.10.0-514.16.1.el7.x86_64

Caffe version

1.1.0 (GitHub)

Conclusion

In this article we demonstrated how Intel VTune Amplifier can be used to identify NUMA and thread-spinning issues on a neural network training workload running with Caffe. With the help of Intel MLSL and tuning for number of threads, we were able to achieve out-of-the-box performance improvements without actually modifying the Caffe framework or any of the model hyperparameters.

Using distributed training with Intel MLSL, we were able to reduce the ratio of remote to local memory access, and thus improve the elapsed time by approximately 27 percent. Also, by reducing the number of OpenMP threads for the current workload, we were able to achieve even lower elapsed times by reducing inefficient spin time. However, it is important to note that this might not be true for every other training model trained with Caffe. Ideally, even with this workload, we should be able to use all the cores in parallel efficiently without a lot of imbalance or serial spinning time, but that would require some code refactoring.

References

For more complete information about compiler optimizations, see our Optimization Notice.