Benefits of Intel® Optimized Caffe* in comparison with BVLC Caffe*

Overview

This article introduces the Berkeley Vision and Learning Center (BVLC) Caffe* and a customized version of Caffe*, Intel® Optimized Caffe*. We explain why and how Intel® Optimized Caffe* performs efficiently on Intel® architecture, using Intel® VTune™ Amplifier and Caffe*'s own time-profiling option.

 

Introduction to BVLC Caffe* and Intel® Optimized Caffe*

Caffe* is a well-known and widely used machine-vision deep learning framework developed by the Berkeley Vision and Learning Center (BVLC). It is an open-source framework that is still actively evolving. It lets users control a variety of options, such as the BLAS library, CPU- or GPU-focused computation, CUDA, OpenCV*, MATLAB, and Python*, before building Caffe* through 'Makefile.config'. You can easily change these options in the configuration file, and BVLC provides intuitive instructions for developers on its project web page.
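As an illustration, a few of the options exposed in Makefile.config look like the following (an excerpt only; the option names are those used by BVLC Caffe*, but check your own copy of the file since they can vary between versions):

```
# Makefile.config excerpt (illustrative)
CPU_ONLY := 1            # build without CUDA/GPU support
BLAS := atlas            # BLAS implementation: atlas, mkl, or open
# USE_OPENCV := 0        # toggle OpenCV* support
# WITH_PYTHON_LAYER := 1 # enable layers implemented in Python*
```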

Intel® Optimized Caffe* is an Intel-distributed, customized version of Caffe* for Intel architecture. It offers all the goodness of mainline Caffe* with the addition of Intel architecture-optimized functionality and multi-node distributed training and scoring. Intel® Optimized Caffe* makes it possible to utilize CPU resources more efficiently.

To see in detail how Intel® Optimized Caffe* has been changed to optimize it for Intel architectures, please refer to this page: https://software.intel.com/en-us/articles/caffe-optimized-for-intel-architecture-applying-modern-code-techniques

In this article, we will first profile the performance of BVLC Caffe* with the Cifar 10 example, and then profile the performance of Intel® Optimized Caffe* with the same example. The performance profiling is conducted through two different methods.

Tested platform: Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

1. Caffe* provides its own timing option, for example:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

2. Intel® VTune™ Amplifier: Intel® VTune™ Amplifier is a powerful profiling tool that provides advanced CPU profiling features with a modern analysis interface. https://software.intel.com/en-us/intel-vtune-amplifier-xe

 

 

How to Install BVLC Caffe*

Please refer to the BVLC Caffe* project web page for installation instructions: http://caffe.berkeleyvision.org/installation.html

If you have Intel® MKL installed on your system, it is better to use MKL as the BLAS library.

In your Makefile.config, choose BLAS := mkl and specify the MKL path. (The default is BLAS := atlas.)

In our test, we kept all configurations at their defaults except for the CPU-only option.

 

Test example

In this article, we will use the 'Cifar 10' example included in the Caffe* package.

You can refer to the BVLC Caffe* project page for detailed information about this example: http://caffe.berkeleyvision.org/gathered/examples/cifar10.html

You can simply run the Cifar 10 training example as follows:

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./examples/cifar10/train_full_sigmoid_bn.sh

First, we will try Caffe*'s own benchmark method to obtain performance results, as follows:

./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

As a result, we get the layer-by-layer forward and backward propagation times. The command above measures the time of each forward and backward pass over a batch of images. At the end, it shows the average execution time per iteration over 1,000 iterations, both per layer and for the entire calculation.
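As an illustration of what that output looks like and how it can be post-processed, here is a sketch that extracts the per-layer times from a saved `caffe time` log. The two sample log lines below use the conv1 numbers from our run; the exact glog prefix (timestamp, process ID, source location) will differ on your machine:

```shell
# Write a tiny sample of a `caffe time` log (format approximates Caffe's glog output).
cat > caffe_time.log <<'EOF'
I0101 00:00:00.000000  1234 caffe.cpp:366] conv1 forward: 40.2966 ms.
I0101 00:00:00.000000  1234 caffe.cpp:370] conv1 backward: 54.5911 ms.
EOF
# Extract "layer direction time" triples from the per-layer timing lines:
summary=$(awk '$NF == "ms." { print $(NF-3), $(NF-2), $(NF-1) }' caffe_time.log)
echo "$summary"
```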

This test was run on a Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB of DDR4 RAM, running CentOS 7.2.

The numbers in the above results will be compared later with the results of Intel® Optimized Caffe*. 

Before that, let's also take a look at the VTune™ results to observe the behavior of Caffe* in detail.

 

VTune Profiling

Intel® VTune™ Amplifier is a modern processor performance profiler that can analyze top hotspots quickly and help tune your target application. You can find the details of Intel® VTune™ Amplifier at the following link:

Intel® VTune™ Amplifier : https://software.intel.com/en-us/intel-vtune-amplifier-xe

We used Intel® VTune™ Amplifier in this article to find the functions with the highest total CPU utilization time, and to see how the OpenMP* threads are working.
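For reference, a similar hotspots collection can also be launched from the command line. The sketch below only assembles and prints the command string rather than running it (amplxe-cl is VTune™ Amplifier's command-line driver; the result-directory name is arbitrary):

```shell
# Assemble (but do not run) a VTune hotspots collection for the `caffe time` workload.
result_dir="vtune_caffe_r000"
collect="amplxe-cl -collect hotspots -result-dir $result_dir --"
target="./build/tools/caffe time --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt -iterations 1000"
echo "$collect $target"
# After the run finishes, a command-line summary can be printed with:
# amplxe-cl -report hotspots -result-dir "$result_dir"
```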

 

VTune result analysis

 

What we can see here is a list of functions on the left side of the screen that are taking most of the CPU time. These are called 'hotspots', and they are the candidate functions for performance optimization.

In this case, we will focus on the 'caffe::im2col_cpu<float>' function as an optimization candidate.

'im2col_cpu<float>' is one of the steps in performing direct convolution as a GEMM operation, so that highly optimized BLAS libraries can be used. This function consumed the most CPU time in our test of training the Cifar 10 model with BVLC Caffe*.

Let's take a look at the thread behavior of this function. In VTune™, you can choose a function and filter out other workloads to observe only the workload of the specified function.

In the above result, we can see that the CPI (Cycles Per Instruction) of the function is 0.907, and that the function utilizes only a single thread for the entire calculation.

One more piece of intuitive data provided by Intel® VTune™ Amplifier is shown here.

This 'CPU Usage Histogram' shows the number of CPUs that were running simultaneously. The number of CPUs the training process utilized appears to be about 25. The platform has 64 physical cores with Intel® Hyper-Threading Technology, so it has 256 logical CPUs. The CPU usage histogram here might imply that the process is not efficiently threaded.

However, we cannot simply conclude that these results are 'bad', because we did not set any performance standard or target against which to judge them. We will compare these results with those of Intel® Optimized Caffe* later.

 

Let's move on to Intel® Optimized Caffe* now.

 

How to Install Intel® Optimized Caffe*

The basic installation procedure of Intel® Optimized Caffe* is the same as that of BVLC Caffe*.

When cloning Intel® Optimized Caffe* from Git, use this repository:

git clone https://github.com/intel/caffe

 

Additionally, Intel® MKL is required to bring out the best performance of Intel® Optimized Caffe*.

Please download and install Intel® MKL. Intel offers MKL for free without technical support, or for a license fee with one-on-one private support. The default BLAS library of Intel® Optimized Caffe* is set to MKL.

 Intel® MKL : https://software.intel.com/en-us/intel-mkl

After downloading Intel® Optimized Caffe* and installing MKL, in your Makefile.config, make sure you choose MKL as your BLAS library and point BLAS_INCLUDE and BLAS_LIB to MKL's include and lib folders:

BLAS := mkl

BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

 

If you encounter a 'libstdc++'-related error during the compilation of Intel® Optimized Caffe*, please install 'libstdc++-static'. For example:

sudo yum install libstdc++-static

 

 

 

Optimization factors and tuning

Before we run and test the performance of the examples, there are some options we need to change or adjust to optimize performance.

  • Use 'mkl' as the BLAS library: specify 'BLAS := mkl' in Makefile.config, and also configure the location of your MKL include and lib folders.
  • Set the CPU to run at its maximum frequency: 
    echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct 
    echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
  • Put 'engine: "MKL2017"' at the top of your train_val.prototxt or solver.prototxt file, or pass this option to the caffe tool: -engine "MKL2017"
  • The current implementation uses OpenMP* threads. By default, the number of OpenMP* threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. It is, however, possible to use your own configuration by providing it through OpenMP* environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, 'OMP_NUM_THREADS = 64' was used.
  • Intel® Optimized Caffe* has modified many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP*. Depending on the other processes running in the background, it is often useful to adjust the number of threads utilized by OpenMP*. For single-node runs on the Intel® Xeon Phi™ product family, we recommend OMP_NUM_THREADS = number_of_cores - 2.
  • Please also refer to: Intel Recommendations to Achieve the Best Performance 

If you observe too much overhead caused by frequent thread migration by the OS, you can try adjusting the OpenMP* affinity environment variable:

KMP_AFFINITY=compact,granularity=fine
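Putting the thread-count and affinity settings together, an environment setup for our 64-core test machine might look like this (a sketch; the values are the ones used in this article, not universal defaults):

```shell
# Pin OpenMP* threads and set the thread count used for the runs in this article.
export OMP_NUM_THREADS=64                       # per the recommendation above
export KMP_AFFINITY=compact,granularity=fine    # reduce thread migration
# Then run, selecting the MKL2017 engine on the command line:
# ./build/tools/caffe time \
#     --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
#     -iterations 1000 -engine "MKL2017"
echo "$OMP_NUM_THREADS threads, affinity: $KMP_AFFINITY"
```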

 

Test example

For Intel® Optimized Caffe*, we run the same example to compare the results with the previous ones.

cd $CAFFE_ROOT
./data/cifar10/get_cifar10.sh
./examples/cifar10/create_cifar10.sh
./build/tools/caffe time \
    --model=examples/cifar10/cifar10_full_sigmoid_train_test_bn.prototxt \
    -iterations 1000

 

Comparison

The results with the above example are the following:

Again, the platform used for the test is: Xeon Phi™ 7210 (1.3 GHz, 64 cores) with 96 GB RAM, CentOS 7.2

First, let's look at the results of BVLC Caffe* and Intel® Optimized Caffe* together.


To make it easy to compare, please see the table below. The duration each layer took is listed in milliseconds, and in the 5th column we state how many times faster Intel® Optimized Caffe* is than BVLC Caffe* at each layer. You can observe significant performance improvements, except for the relatively modest gains in the bn layers. 'Bn' stands for 'batch normalization', which requires fairly simple calculations with small optimization potential. The bn forward layers show better results, while the bn backward layers are 2~3% slower than the original; the worse performance here can be a result of threading overhead. Overall, Intel® Optimized Caffe* achieved about 28 times faster performance in this case. 

Layer      Direction           BVLC (ms)     Intel (ms)    Performance Benefit (x)
conv1      Forward               40.2966       1.65063      24.413
conv1      Backward              54.5911       2.24787      24.286
pool1      Forward              162.288        1.97146      82.319
pool1      Backward              21.7133       0.459767     47.227
bn1        Forward                1.60717      0.812487      1.978
bn1        Backward               1.22236      1.24449       0.982
Sigmoid1   Forward              132.515        2.24764      58.957
Sigmoid1   Backward              17.9085       0.262797     68.146
conv2      Forward              125.811        3.8915       32.330
conv2      Backward             239.459        8.45695      28.315
bn2        Forward                1.58582      0.854936      1.855
bn2        Backward               1.2253       1.25895       0.973
Sigmoid2   Forward              132.443        2.2247       59.533
Sigmoid2   Backward              17.9186       0.234701     76.347
pool2      Forward               17.2868       0.38456      44.952
pool2      Backward              27.0168       0.661755     40.826
conv3      Forward               40.6405       1.74722      23.260
conv3      Backward              79.0186       4.95822      15.937
bn3        Forward                0.918853     0.779927      1.178
bn3        Backward               1.18006      1.18185       0.998
Sigmoid3   Forward               66.2918       1.1543       57.430
Sigmoid3   Backward               8.98023      0.121766     73.750
pool3      Forward               12.5598       0.220369     56.994
pool3      Backward              17.3557       0.333837     51.989
ipl        Forward                0.301847     0.186466      1.619
ipl        Backward               0.301837     0.184209      1.639
loss       Forward                0.802242     0.641221      1.251
loss       Backward               0.013722     0.013825      0.993
Ave.       Forward              735.534       21.6799       33.927
Ave.       Backward             488.049       21.7214       22.469
Ave.       Forward-Backward    1223.86       43.636        28.047
Total                          1223860       43636         28.047
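The 'Performance Benefit' column is simply the BVLC time divided by the Intel time. As a quick sanity check, the conv1 forward row can be recomputed in the shell (numbers taken from the table above):

```shell
# Speedup = BVLC time / Intel time, e.g. for the conv1 forward layer.
bvlc_ms=40.2966
intel_ms=1.65063
speedup=$(awk -v a="$bvlc_ms" -v b="$intel_ms" 'BEGIN { printf "%.3f", a / b }')
echo "conv1 forward speedup: ${speedup}x"   # matches the 24.413 in the table
```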

 

Some of the many reasons this optimization was possible are:

  • Code vectorization for SIMD 
  • Finding hotspot functions and reducing function complexity and the amount of calculations
  • CPU / system specific optimizations
  • Reducing thread movements
  • Efficient OpenMP* utilization

 

Additionally, let's compare the VTune™ results of this example between BVLC Caffe* and Intel® Optimized Caffe*.

We will simply look at how efficiently the im2col_cpu function has been utilized.

BVLC Caffe*'s im2col_cpu function had a CPI of 0.907 and was single-threaded.

In the case of Intel® Optimized Caffe*, im2col_cpu has a CPI of 2.747 and is multi-threaded across OpenMP* workers.

The CPI rate increased here for two reasons: vectorization, which brings a higher CPI rate because of the longer latency of each vector instruction, and multi-threading, which can introduce spinning while threads wait for others to finish their jobs. However, in this example, the benefits of vectorization and multi-threading exceed the added latency and overhead, and bring performance improvements after all.

VTune™ suggests that a CPI rate close to 2.0 is theoretically ideal, and in our case we achieved about the right CPI for this function. The training workload for the Cifar 10 example handles 32 x 32 pixel images at each iteration, so when the workload is split across many threads, each piece can be a very small task, which may cause transition overhead for multi-threading. With larger images, we would see lower spinning time and a smaller CPI rate.

The CPU Usage Histogram for the whole process also shows better threading results in this case.

 

 

 

Useful links

BVLC Caffe* project: http://caffe.berkeleyvision.org/
Intel® Optimized Caffe* Git: https://github.com/intel/caffe
Intel® Optimized Caffe* recommendations for the best performance: https://github.com/intel/caffe/wiki/Recommendations-to-achieve-best-performance

 

Summary

Intel® Optimized Caffe* is a customized Caffe* version for Intel Architectures with modern code techniques.

In Intel® Optimized Caffe*, Intel leverages optimization tools and Intel® performance libraries, performs scalar and serial optimizations, and implements vectorization and parallelization.

 

 
For more complete information about compiler optimizations, see our Optimization Notice.