Intel and Facebook* Collaborate to Boost Caffe2 Performance on Intel CPUs


By Andres Rodriguez and Niveditha Sundaram

Every day, the world generates more and more information — text, pictures, videos and more. In recent years, artificial intelligence and deep learning have improved several applications that help people better understand this information with state-of-the-art voice/speech recognition, image/video recognition, and recommendation engines.

Most deep learning workloads consist of both training and inference. Training usually requires many hours or days to complete. Inference usually requires milliseconds or seconds and is often one step of a larger process. While the computing intensity of a single inference is much lower than that of training, inference is often done on a much larger dataset. Therefore, the total computing resources spent on inference are likely to dwarf those spent on training. The overwhelming majority of all inference workloads run on Intel® Xeon® CPUs.

Over the past year, Intel rapidly added CPU support across several deep learning frameworks to optimize for a variety of training and inference applications. At the center of these optimizations is Intel® Math Kernel Library (Intel® MKL), which makes use of Intel® Advanced Vector Extensions CPU instructions (e.g., Intel® AVX-512) that provide enhanced support for deep learning applications.

Caffe2* is an open source deep learning framework created by Facebook and built with expression, speed, and modularity in mind. Caffe2 is deployed at Facebook to help researchers train large machine learning models and deliver AI on mobile devices. Now, developers will have access to many of the same tools, allowing them to run large-scale distributed training scenarios and build machine learning applications for mobile.

Intel and Facebook are collaborating to integrate Intel® MKL functions into Caffe2 for optimal inference performance on CPUs. Table 1 shows inference performance numbers on AlexNet* using the Intel® MKL library and the Eigen* BLAS library for comparison. In this table, OMP_NUM_THREADS indicates the number of physical cores used in these workloads (details in the table caption). These results show that Caffe2 is highly optimized on CPUs and offers competitive performance. For small-batch inference workloads, we recommend running each workload on its own CPU core and running multiple workloads in parallel, one per core.





[Table 1 column headers: batch size | Intel® MKL | Eigen BLAS | Intel® MKL | Eigen BLAS (table values not recoverable)]

Table 1: Performance results on Caffe2 using the AlexNet topology with Intel® MKL and Eigen BLAS. Experiments were performed on Intel® Xeon® processor E5-2699 v4 (codename Broadwell) @ 2.20GHz with dual sockets, 22 physical cores per socket (total of 44 physical cores in both sockets), 122GB RAM DDR4, 2133 MHz, HT Disabled, on Linux 3.10.0-514.2.2.el7.x86_64 CentOS 7.3.1611, Intel® MKL version 20170209, Eigen BLAS version 3.3.2, based on Caffe2 as of April 18, 2017.
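The one-workload-per-core recommendation above can be sketched in Python as follows. This is a minimal illustration, not the setup used for Table 1: `run_small_batch` is a hypothetical stand-in for a real small-batch inference call, and the helper names are assumptions made for this example. It sets OMP_NUM_THREADS to 1 so that each process uses a single OpenMP thread, then pins each worker process to its own CPU using `os.sched_setaffinity`, which is available on Linux only.

```python
import multiprocessing as mp
import os

# One OpenMP thread per process. Set this before importing any library
# that starts an OpenMP runtime (for example, one linked against MKL).
os.environ["OMP_NUM_THREADS"] = "1"


def available_cores():
    """Return the CPU ids this process may run on.

    These are logical CPUs; they correspond to physical cores only when
    hyper-threading is disabled, as in the Table 1 configuration.
    """
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pin_to_core(core_ids):
    """Pool initializer: claim one CPU id and pin this worker to it."""
    core = core_ids.get()
    if hasattr(os, "sched_setaffinity"):  # no-op on other platforms
        os.sched_setaffinity(0, {core})


def run_small_batch(batch):
    """Hypothetical stand-in for a single small-batch inference call."""
    return sum(x * x for x in batch)


def run_one_per_core(batches):
    """Run the batches in parallel with one pinned worker per CPU."""
    cores = available_cores()
    core_ids = mp.Queue()
    for core in cores:
        core_ids.put(core)
    with mp.Pool(len(cores), initializer=pin_to_core,
                 initargs=(core_ids,)) as pool:
        return pool.map(run_small_batch, batches)


if __name__ == "__main__":
    print(run_one_per_core([[1, 2, 3], [4, 5], [6]]))
```

In a real deployment, `run_small_batch` would load the model once per worker and run the framework's inference call. Pinning keeps each worker on a single core so that concurrent workloads do not migrate between cores or compete for the same one.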

Instructions to install and use Caffe2 can be found at this link

Later this year, the new generation of Intel® Xeon® processors (codename Skylake) will become available to the general market. Skylake introduces 512-bit wide Fused Multiply Add (FMA) instructions as part of the larger 512-bit wide vector engine, i.e., Intel® AVX-512, providing a significant performance boost over the previous 256-bit wide AVX2 instructions in the Haswell/Broadwell processors for both training and inference workloads. The 512-bit wide FMAs essentially double the FLOPS that Skylake can deliver and significantly speed up the single-precision matrix arithmetic used in convolutional and recurrent neural networks. Inference workloads are massively parallel and will benefit from the larger core count offered by Skylake. In addition, the Skylake CPUs have a re-architected memory subsystem that supports faster system memory and a larger Mid-Level Cache (MLC) per core, which also contributes to the performance improvement over current-generation CPUs and a significant enhancement over the common installed base of four-year-old systems.
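The doubling of FLOPS follows from simple per-cycle arithmetic, sketched below. The function name is hypothetical, and the default of two FMA units per core is an assumption made for illustration; the count can differ between processor models. The figures are theoretical per-core, per-cycle peaks and ignore clock frequency, core count, and memory effects.

```python
def peak_sp_flops_per_cycle(vector_bits, fma_units=2):
    """Theoretical single-precision FLOPs per cycle for one core.

    fma_units=2 is an assumption for illustration, not a figure
    taken from the article.
    """
    lanes = vector_bits // 32  # 32-bit floats held in one vector register
    # Each FMA performs a multiply and an add on every lane.
    return lanes * 2 * fma_units


avx2_peak = peak_sp_flops_per_cycle(256)    # 256-bit AVX2 with FMA
avx512_peak = peak_sp_flops_per_cycle(512)  # 512-bit AVX-512 with FMA
print(avx2_peak, avx512_peak, avx512_peak / avx2_peak)
```

With the same number of FMA units, widening the vector from 256 to 512 bits doubles the number of single-precision lanes, and therefore the per-cycle peak.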


About the authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer with Intel’s AI Products Group (AIPG), where he designs deep learning solutions for Intel’s customers and provides technical leadership across Intel for deep learning products. He has 13 years of experience working in artificial intelligence. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He has over 20 peer-reviewed publications in journals and conferences, and a book chapter on machine learning.

Niv Sundaram, PhD, is Director of Engineering with Intel’s Datacenter Engineering Group (DEG), focusing on performance and power optimizations of current and emerging workloads. In this role, she leads a team that works with Intel’s customers to characterize deep learning/machine learning and augmented/virtual/mixed reality workloads for the datacenter. Niv has a PhD in Electrical Engineering from the University of Wisconsin-Madison and has one issued patent and several peer-reviewed publications.


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.  Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.  Any change to any of those factors may cause the results to vary.  You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit:

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.