Recipe: Optimized Caffe* for Deep Learning on Intel® Xeon Phi™ processor x200

By Vamsi K Sripathi, ElMoustapha Ould-Ahmed-Vall, Published: 06/29/2016, Last Updated: 06/29/2016


The deep learning framework Caffe* has been optimized for Intel® Xeon Phi™ processors. This article provides detailed instructions on how to compile and run Caffe* optimized for Intel® architecture to obtain the best performance on Intel Xeon Phi processors.


Caffe is a popular open source deep learning framework developed by the Berkeley Vision and Learning Center (BVLC) and community contributors. Together with AlexNet, a neural network topology for image recognition, and ImageNet, a database of labeled images, Caffe is often used as a benchmark in the domain of image classification. An Intel version of BVLC Caffe, referred to as Caffe optimized for Intel architecture in the rest of this article, has been created to optimize the framework's performance on Intel architectures. These optimizations are available on GitHub for the broader deep learning user community.

Intel Xeon Phi processors x200 are the latest generation of the Intel® Many Integrated Core Architecture (Intel® MIC Architecture) family. Continuing the performance leadership demonstrated by previous generations of the Intel® Xeon® and Intel® Xeon Phi™ product families, Intel Xeon Phi processors x200 target high performance computing applications as well as emerging machine learning and deep learning applications. They introduce several state-of-the-art features: a compute core with two 512-bit Vector Processing Units (VPUs) that together can execute two Fused Multiply-Add (FMA) operations per clock cycle per core, and on-chip Multi-Channel DRAM (MCDRAM) memory, which provides significantly higher bandwidth than DDR4 memory.
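To make the per-core FMA figure concrete, here is a small worked calculation. It is only a sketch: it assumes single-precision (32-bit) operands, so that each 512-bit vector holds 16 values, and it counts one FMA as two floating-point operations. The function name peak_flops_per_cycle is made up for illustration.

```shell
# Peak single-precision floating-point operations per clock cycle.
# Assumptions (not stated in the article): 32-bit operands, 1 FMA = 2 FLOPs.
VPUS_PER_CORE=2                 # two 512-bit VPUs per core
LANES_PER_VPU=$((512 / 32))     # 16 single-precision values per 512-bit vector
FLOPS_PER_FMA=2                 # one multiply plus one add

# Usage: peak_flops_per_cycle <number_of_cores>
peak_flops_per_cycle() {
    echo $(( $1 * VPUS_PER_CORE * LANES_PER_VPU * FLOPS_PER_FMA ))
}

peak_flops_per_cycle 1     # one core
peak_flops_per_cycle 68    # a 68-core processor
```

Under these assumptions one core reaches 64 operations per cycle. Multiplying the per-processor result by the clock frequency of a given SKU gives its theoretical peak rate.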


Download the latest version of Caffe optimized for Intel architecture by cloning the repository:

git clone

Caffe depends on several external libraries that can be installed from your Linux* distribution repositories. The prerequisites are well documented and are repeated here for convenience.

  1. On RHEL*/CentOS*/Fedora* systems:

    sudo yum install protobuf-devel leveldb-devel snappy-devel opencv-devel boost-devel hdf5-devel gflags-devel glog-devel lmdb-devel

  2. On Ubuntu* systems:

    sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libgflags-dev libgoogle-glog-dev liblmdb-dev
    sudo apt-get install --no-install-recommends libboost-all-dev
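The two cases above can be folded into one helper that picks the right command from the distribution ID. This is a minimal sketch: the function name caffe_deps_cmd is hypothetical, and it assumes the ID values (rhel, centos, fedora, ubuntu) that these distributions report in /etc/os-release. It prints the commands for review instead of running them.

```shell
# Print the dependency install command(s) for a given distribution ID.
# Usage: caffe_deps_cmd <id>   (for example the ID field of /etc/os-release)
caffe_deps_cmd() {
    case "$1" in
        rhel|centos|fedora)
            printf '%s\n' "sudo yum install protobuf-devel leveldb-devel snappy-devel opencv-devel boost-devel hdf5-devel gflags-devel glog-devel lmdb-devel"
            ;;
        ubuntu)
            printf '%s\n' "sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler libgflags-dev libgoogle-glog-dev liblmdb-dev"
            printf '%s\n' "sudo apt-get install --no-install-recommends libboost-all-dev"
            ;;
        *)
            # Anything else is outside the two cases covered by this article.
            echo "unsupported distribution: $1" >&2
            return 1
            ;;
    esac
}
```

A typical call would source /etc/os-release first and then run caffe_deps_cmd "$ID".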

Apart from the dependencies listed above, Caffe optimized for Intel architecture requires Intel® Math Kernel Library (Intel® MKL) 2017 Beta Update 1 or later and the OpenMP* run-time library to obtain optimal performance on Intel Xeon Phi processor x200. These libraries are provided in the Intel® Parallel Studio XE 2017 Beta software suite, which can be downloaded after filling out the registration form.

After registration and download are complete, follow the instructions provided with the package to install Intel® C++ Compiler 17.0 Pre-Release (Beta) Update 1 and Intel Math Kernel Library 2017 Pre-Release (Beta) Update 1 for C/C++.

Build Caffe optimized for Intel architecture for Intel Xeon Phi processor

Set up the shell environment to use the Intel C/C++ Compilers and Intel MKL by sourcing the corresponding shell script (assuming the installation directory is /opt/intel/), for example:

For sh/bash: source /opt/intel/bin/ intel64

For csh/tcsh: source /opt/intel/bin/compilervars.csh intel64
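After sourcing the script, it can be worth confirming that the environment is actually in place before starting the build. The sketch below assumes that the compiler drivers are named icc and icpc and that the environment script exports MKLROOT; the function name check_intel_env is hypothetical.

```shell
# Report whether the Intel compiler and MKL environment looks usable.
# Returns 0 when everything is found, 1 otherwise.
check_intel_env() {
    status=0
    for tool in icc icpc; do
        # command -v succeeds only if the tool is reachable through PATH.
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool not found on PATH" >&2
            status=1
        fi
    done
    # MKLROOT is assumed to be exported by the environment script.
    if [ -z "${MKLROOT:-}" ]; then
        echo "missing: MKLROOT is not set" >&2
        status=1
    fi
    return "$status"
}
```

Running check_intel_env in the same shell that sourced the script prints one line per missing item and nothing when the setup is complete.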

Change directory to the location where the Caffe optimized for Intel architecture repository is cloned and build the framework to use the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) APIs, which provide optimized implementations of convolution, pooling, normalization and other key DNN operations on Intel Xeon Phi processor x200. The commands below generate an executable binary named caffe at /opt/caffe/build/tools/ (assuming the repository is cloned to /opt/caffe):

cd /opt/caffe; mkdir build; cd build;
cmake -DCPU_ONLY=on -DBLAS=mkl -DUSE_MKL2017_AS_DEFAULT_ENGINE=on  /opt/caffe/
make -j 68 all

We will use the AlexNet network topology for image classification to benchmark the performance of Caffe optimized for Intel architecture on Intel Xeon Phi processor x200. Caffe optimized for Intel architecture provides AlexNet topology files, located at /opt/caffe/models/mkl2017_alexnet/, which set the “engine” parameter for the different layers of the neural network (direct batched convolution, maximum pooling, local response normalization (LRN) across channels, rectified linear unit (ReLU)) to “MKL2017”, so that the Intel MKL-DNN APIs are used at run-time. The AlexNet input file reads image data stored in Lightning Memory-mapped Database (lmdb) format files (data.mdb, lock.mdb), which are required for benchmarking. The ImageNet dataset files can be obtained from here.
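Before benchmarking, a quick sanity check of the topology file and the dataset directory can save a failed run. The following is a sketch only: it assumes the engine setting is written as "engine: MKL2017" in the topology file, and both function names are made up for illustration.

```shell
# Count the lines that set the engine to MKL2017 in a topology file.
# Assumes the setting is spelled "engine: MKL2017".
count_mkl2017_layers() {
    grep -c 'engine:[[:space:]]*MKL2017' "$1"
}

# Succeed only if a directory holds both lmdb files named in the text.
has_lmdb_files() {
    [ -f "$1/data.mdb" ] && [ -f "$1/lock.mdb" ]
}
```

For example, count_mkl2017_layers /opt/caffe/models/mkl2017_alexnet/train_val.prototxt should report a non-zero count if the layers are configured as described, and has_lmdb_files can be pointed at the directory that holds the ImageNet lmdb data.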

Run Caffe optimized for Intel architecture on the Intel Xeon Phi processor

The Intel Xeon Phi processor x200 supports different memory modes. To obtain the best performance with Caffe optimized for Intel architecture, it is recommended to run out of MCDRAM memory in “Flat” mode. The standard Linux utility “numactl” is used to allocate memory buffers in MCDRAM. In MCDRAM Flat mode, DDR and MCDRAM memory are exposed as distinct, addressable NUMA nodes (numactl -H shows this information). More information about MCDRAM and the Flat, Cache and Hybrid modes can be found here.
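One way to locate the MCDRAM node is to look for a NUMA node that has memory but no CPUs attached, which is how MCDRAM typically shows up in Flat mode. The sketch below filters the output of numactl -H on that basis. The function name memory_only_nodes is hypothetical, and the result should be checked against the node sizes that numactl -H reports.

```shell
# Print the numbers of NUMA nodes that list no CPUs.
# Reads the text output of `numactl -H` on standard input.
memory_only_nodes() {
    # A line such as "node 1 cpus:" has exactly three fields when the CPU list is empty.
    awk '/^node [0-9]+ cpus:/ && NF == 3 { print $2 }'
}
```

Used as numactl -H | memory_only_nodes, this prints the candidate node number or numbers to pass to numactl -m.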

Before running the executable, set the OpenMP environment variables for the number of threads and for pinning threads to physical processor cores:

export OMP_NUM_THREADS=<number_of_cores>   # 64 or 68, depending on the Intel Xeon Phi processor x200 SKU
export KMP_AFFINITY=granularity=fine,compact,1,0
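Instead of hard-coding the core count, it can be derived from the machine. The sketch below counts distinct physical cores in the parseable output of lscpu; the function name count_physical_cores is made up for illustration, and the result should match the 64 or 68 cores of the SKU in use.

```shell
# Count distinct physical cores.
# Reads the output of `lscpu -p=CORE,SOCKET` on standard input:
# one "core,socket" line per logical CPU, plus header lines starting with '#'.
count_physical_cores() {
    grep -v '^#' | sort -u | wc -l | tr -d ' '
}
```

A possible use is export OMP_NUM_THREADS=$(lscpu -p=CORE,SOCKET | count_physical_cores), which ignores the extra hardware threads on each core.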

Since the goal of this benchmark is to measure performance and not to train an end-to-end image classification model, we will use the Caffe “time” mode with the default of 50 iterations, each consisting of a forward and a backward pass:

numactl -m 1 /opt/caffe/build/tools/caffe time --model=/opt/caffe/models/mkl2017_alexnet/train_val.prototxt
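The timing figures can be pulled out of the log with a small filter. This sketch assumes the log contains lines in the style of upstream BVLC Caffe, such as "Average Forward pass: 105.2 ms." and "Average Backward pass: 210.4 ms."; if this build words them differently, the patterns need adjusting. The function name parse_caffe_time is hypothetical.

```shell
# Extract the average forward and backward times (in ms) from a `caffe time` log.
# Reads the log on standard input and prints "<forward_ms> <backward_ms>".
parse_caffe_time() {
    # The number is the second-to-last field; the last field is the unit "ms.".
    awk '/Average Forward pass:/  { fw = $(NF - 1) }
         /Average Backward pass:/ { bw = $(NF - 1) }
         END { print fw, bw }'
}
```

Because Caffe logs to standard error, the benchmark command would be run with 2>&1 and piped into parse_caffe_time.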

The above step produces timing statistics (in milliseconds) for the average forward (FW) and backward (BW) passes across 50 iterations, where each iteration processes one batch of images. Currently, the input files provided in the models/mkl2017_alexnet/ directory use a batch of 256 images, which is the recommended batch size for best performance (refer to the /opt/caffe/models/mkl2017_alexnet/train_val.prototxt file for future changes in the number of images). The time spent in the FW and BW passes is used to calculate the training rate in images per second: the batch size divided by the combined average FW and BW time, converted from milliseconds to seconds.
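A natural way to express the training rate is in images per second: the batch size divided by the combined average FW and BW time per iteration. The sketch below assumes that definition and takes the times in milliseconds, as reported by the time mode. The function name training_rate is hypothetical, and the timings in the example call are placeholders rather than measured results.

```shell
# Training rate in images per second.
# Usage: training_rate <batch_size> <avg_fw_ms> <avg_bw_ms>
training_rate() {
    # Multiply by 1000 to convert the millisecond timings to seconds.
    awk -v n="$1" -v fw="$2" -v bw="$3" \
        'BEGIN { printf "%.2f\n", n * 1000 / (fw + bw) }'
}

training_rate 256 100 300   # illustrative timings, not measured results
```

With a batch of 256 images and the placeholder timings of 100 ms and 300 ms, this works out to 640 images per second.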

More Details

For more details on various configuration and run parameters of Caffe framework, please refer to this in-depth article.

About the Author

Vamsi Sripathi has been a software engineer at Intel since 2010. He has a master's degree in Computer Science from North Carolina State University, USA. During his tenure at Intel, he has worked on the performance optimization of Basic Linear Algebra Subroutines (BLAS) in Intel Math Kernel Library (MKL) spanning multiple generations of Intel Xeon and Intel Xeon Phi architectures. Recently, he has been working on the optimization of deep learning algorithms and frameworks for Intel architectures.
