Intel and Facebook* collaborate to boost PyTorch* CPU performance

Every day, the world generates more and more information — text, pictures, videos and more. In recent years, advances in deep learning have improved several applications that help people better understand this information with state-of-the-art speech recognition and synthesis, image/video recognition, and personalization.

Applying deep learning to develop new models consists of a model training phase and, once convergence to a target performance is achieved, a deployment phase in which the models are used for inference to make new predictions within an application. Training usually requires hours, days, or even weeks to complete. Inference usually requires milliseconds and is often one step in a larger application workflow. While the computing intensity of inference is much lower than that of training, inference is often performed on a significantly larger dataset, so the total computing resources spent on inference dwarf those spent on training. For example, inference workloads continue to grow across a number of areas at Facebook*, where every day Facebook makes over 200 trillion predictions and over 6 billion language translations.

Today, Intel is launching the 2nd generation Intel® Xeon® Scalable processors (codename Cascade Lake), adding Intel® Deep Learning Boost (Intel® DL Boost) technology. End users can take advantage of this technology with minimal changes to their code: the optimizations are abstracted and integrated into mainstream deep learning frameworks such as PyTorch*.

In this article we detail the hardware advancements in Cascade Lake, the software optimizations that Intel and Facebook are collaborating to bring to the PyTorch community to take advantage of this new hardware, and the resulting performance gains on deep learning workloads.

Hardware Advancements

In July 2017, Intel launched the Intel Xeon Scalable processor (formerly codename Skylake) with new features including the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set, which provides 512-bit wide fused multiply-add (FMA) operations.

Today, Intel is launching the 2nd generation Intel Xeon Scalable processors (codename Cascade Lake) which, along with all of the existing Intel Xeon Scalable processor features, introduce the AVX-512 Vector Neural Network Instructions (VNNI), see Fig. 1, as part of the Intel DL Boost technologies. VNNI enables 8-bit FMAs with 32-bit accumulation in a single instruction, quadrupling the FMA throughput relative to 32-bit FMAs. The per-lane arithmetic is sketched after the list below.

Lower precision increases the performance in two ways:

  1. The additional FMA throughput boosts compute-bound operations.
  2. The reduced footprint (8 bits per value rather than 32) boosts memory-bandwidth-bound operations by enabling faster data movement through the memory hierarchy.
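
To make the arithmetic concrete, here is a minimal NumPy sketch of the per-lane math behind VPDPBUSD. It illustrates only the semantics; the function name and input values are made up for this example, and this is not hardware or intrinsics code:

```python
import numpy as np

def vpdpbusd_lane(acc_s32, a_u8, b_s8):
    # One 32-bit lane of VPDPBUSD: multiply four unsigned 8-bit values by
    # four signed 8-bit values and add the sum of products to a signed
    # 32-bit accumulator. The AVX-512 instruction does this for 16 lanes
    # (64 byte pairs) at once, per FMA unit, per cycle.
    products = a_u8.astype(np.int32) * b_s8.astype(np.int32)
    return np.int32(acc_s32 + products.sum())

a = np.array([255, 3, 17, 100], dtype=np.uint8)  # e.g. quantized activations
b = np.array([-2, 5, -7, 1], dtype=np.int8)      # e.g. quantized weights
print(vpdpbusd_lane(np.int32(0), a, b))          # -510 + 15 - 119 + 100 = -514
```

With fp32, the same dot product would require four 32-bit multiply-adds and move four times as many bytes; keeping the accumulator at 32 bits is what preserves accuracy while the inputs shrink to 8 bits.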

Figure 1. The AVX-512 VNNI VPDPBUSD instruction multiplies 64 signed 8-bit values with 64 unsigned 8-bit values and accumulates the results into 16 signed 32-bit values per clock cycle per FMA unit (Intel® Xeon® processor 6000 series and above have two FMA units per core). Credit: Israel Hirsh.

Software Advancements

Intel and Facebook are partnering to accelerate PyTorch’s CPU performance. These optimizations generally do not require the data scientist end user to modify their PyTorch scripts.

A deep learning network is a computational graph comprised of various layers or nodes. Optimizations happen at the node level and at the graph level. At the node level, Intel optimizes various layers such as convolution, matrix multiplication, ReLU, Pooling, etc. for high CPU performance, and includes those optimizations in Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN). These optimizations minimize data transfers and ensure effective usage of SIMD instructions, execution units, registers, and memory cache hierarchy. At the graph level, Intel optimizes groups of nodes with various data order strategies and layer fusion, for example, fusing ReLU into convolution so that ReLU operations are executed at the last convolution cycle while the data is still in the registers.
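
As a purely illustrative sketch of what gets fused, the PyTorch snippet below defines a convolution followed by ReLU exactly as a user would write it; the model and tensor shapes are made up for this example, and the fusion itself happens inside the optimized backend rather than in the user's script:

```python
import torch
import torch.nn as nn

class ConvRelu(nn.Module):
    """Convolution followed by ReLU, written the ordinary way.
    A graph-level optimizer (e.g. the Intel MKL-DNN backend) can fuse the
    ReLU into the convolution so the activation is applied while results
    are still in registers, avoiding an extra pass over memory."""
    def __init__(self):
        super(ConvRelu, self).__init__()
        self.conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

model = ConvRelu().eval()
with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))  # NCHW synthetic input
print(out.shape)  # torch.Size([1, 64, 224, 224])
```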

Intel MKL-DNN was integrated into both the PyTorch and Caffe2* backends by implementing the most performance-critical DNN layers using Intel MKL-DNN APIs. When PyTorch and Caffe2 merged, the Intel MKL-DNN integration was also consolidated, and the Intel MKL-DNN library was built into the PyTorch 1.0 binary by default on CPU. The Intel MKL-DNN tensor representation was redesigned so that it works on both the PyTorch and Caffe2 (also known as C2) backends. We also aligned with FBGEMM* on int8 operation semantics and quantization method so that Caffe2 int8 models can run with both FBGEMM and Intel MKL-DNN.
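
Because the library ships inside the official binaries, verifying it requires nothing beyond a stock install. The short sketch below checks that the build includes MKL-DNN; the layout-conversion calls (Tensor.to_mkldnn() / to_dense()) only appear in later 1.x releases, so they are guarded here and should be read as an assumption rather than part of the 1.0 integration described above:

```python
import torch

print("PyTorch:", torch.__version__)
print("MKL-DNN available:", torch.backends.mkldnn.is_available())

# In later 1.x releases, a dense CPU tensor can be converted to the opaque
# MKL-DNN (blocked) layout and back; the conversion is a pure layout change.
x = torch.randn(1, 3, 224, 224)
if torch.backends.mkldnn.is_available() and hasattr(x, "to_mkldnn"):
    z = x.to_mkldnn().to_dense()
    print("round-trip exact:", torch.equal(x, z))
```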

Results

Tioga Pass is an Open Compute Project (OCP) platform used at Facebook to support a variety of compute services. It is a dual-socket motherboard with Intel® Xeon® Gold 6139 (Skylake) processors. Table 1 summarizes the results of using PyTorch (C2 backend) integrated with the Intel MKL-DNN library. Running ResNet50 inference with batch sizes of 1 and 32 per socket, we observed 5.4x and 8.0x gains over the fp32 baseline (without Intel MKL-DNN) at fp32, and 9.3x and 15.6x gains over that baseline at int8, respectively.

Table 1. PyTorch integrated with Intel MKL-DNN at fp32 and int8 performance gains over baseline (fp32 without Intel MKL-DNN) using batch size 1 and 32 on ResNet50 on a single socket Intel® Xeon® Gold 6139 (Skylake) processor.

ResNet50 inference images/second per socket
Batch size | No Intel MKL-DNN FP32 | Intel MKL-DNN FP32 | Gains | Intel MKL-DNN INT8 | Gains
1          | 18.90                 | 101.36             | 5.4x  | 175.16             | 9.3x
32         | 21.18                 | 169.49             | 8.0x  | 331.12             | 15.6x
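
For readers who want a rough sense of how such numbers are produced, the sketch below times fp32 ResNet50 inference on synthetic data and reports images/second. It is not the harness behind Table 1 (which also pins one instance per socket and, for the int8 rows, runs the quantized Caffe2/MKL-DNN model); torchvision is assumed to be installed:

```python
import time
import torch
import torchvision.models as models

def images_per_second(batch_size, iters=50, warmup=10):
    # fp32 ResNet50 inference on CPU with a synthetic NCHW batch.
    model = models.resnet50(pretrained=False).eval()
    x = torch.randn(batch_size, 3, 224, 224)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        start = time.time()
        for _ in range(iters):
            model(x)
        elapsed = time.time() - start
    return batch_size * iters / elapsed

for bs in (1, 32):
    print("batch %d: %.1f images/sec" % (bs, images_per_second(bs)))
```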

On the 2nd generation Intel Xeon Scalable processors (codename Cascade Lake) launching today we observe further gains, summarized in Table 2. Using the Intel® Xeon® Platinum 8280 (Cascade Lake) processor and PyTorch (C2 backend) integrated with the Intel MKL-DNN library, the gains over the fp32 baseline (without Intel MKL-DNN) for ResNet50, Faster R-CNN (ResNext101-64x4d backbone, 800x1333 resolution input), and RetinaNet (ResNet101 backbone, 800x1333 resolution input) are 7.7x, 47.0x, and 23.6x at fp32, and 19.5x, 105.1x, and 58.9x at int8, respectively.

Table 2. PyTorch integrated with Intel MKL-DNN at fp32 and int8 performance gains over baseline (fp32 without Intel MKL-DNN) for ResNet50, Faster R-CNN, and RetinaNet using batch size 1 on a single socket Intel Xeon Platinum 8280 (Cascade Lake) processor.

Inference images/second per socket
Batch size = 1 | No Intel MKL-DNN FP32 | Intel MKL-DNN FP32 | Gains | Intel MKL-DNN INT8 | Gains
ResNet50       | 21.88                 | 167.51             | 7.7x  | 427.15             | 19.5x
Faster R-CNN   | 0.02                  | 1.08               | 47.0x | 2.42               | 105.1x
RetinaNet      | 0.18                  | 4.27               | 23.6x | 10.69              | 58.9x

Conclusion

Intel and Facebook continue to accelerate PyTorch 1.0+ for CPUs, benefiting the overall PyTorch ecosystem. Intel MKL-DNN is included in PyTorch as the default math kernel library for deep learning at pytorch.org. Additional information on lower-numerical-precision deep learning inference and training can be found here.

About the Authors

Andres Rodriguez, PhD, is a Sr. Principal Engineer and the lead Cloud AI Architect at the Intel Data Center Group (DCG), where he designs deep learning solutions for cloud customers and provides technical leadership across Intel for deep learning products. He has 15 years of experience working in artificial intelligence. Andres received his PhD from Carnegie Mellon University for his research in machine learning. He has over 20 peer-reviewed publications in journals and conferences, and a book chapter on machine learning.

Jianhui Li, PhD, is a principal engineer in the Intel Architecture, Graphics and Software group and leads deep learning framework integration and workload optimization. He was previously a software developer working on binary translation and JIT compilers, and led the development of Houdini, which runs Android* ARM applications transparently, with a comparable user experience, on IA-based platforms. Jianhui received his PhD in computer science from Fudan University. He holds 21 US patents in binary translation and real-life application optimization.

Shivani Sud is a system architect working on cloud technologies and ML system architecture. She has been a leading contributor to telco network transformation to software-defined infrastructure with NFV, SDN, and cloud technologies. Prior to that, her research contributions were in next-gen mobile devices, multi-device usages, and platform security.

Configuration Details

Intel Xeon Platinum 8280:
Tested by Intel as of 3/25/2019. 2-socket Intel Xeon Platinum 8280 processor, 28 cores, HT on, Turbo on, total memory 384 GB (12 slots / 32 GB / 2933 MHz), BIOS: SE5C620.86B.0D.01.0271.120720180605 (ucode: 0x4000013), Ubuntu 18.04.1 LTS, kernel 4.15.0-45-generic. Deep learning framework: PyTorch with ONNX/Caffe2 backend. PyTorch: https://github.com/pytorch/pytorch.git (commit: 4ac91b2d64eeea5ca21083831db5950dc08441d6) and pull request link: https://github.com/pytorch/pytorch/pull/17464 (submitted for upstreaming), gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, MKL-DNN version: v0.17.3 (commit hash: 0c3cb94999919d33e4875177fdef662bd9413dd4), MKL 2019.1.144. ResNet50: https://github.com/intel/optimized-models/tree/master/pytorch, BS=1, synthetic data, 2 instances/2 sockets, datatype: INT8 & FP32.

Faster R-CNN:
https://github.com/intel/Detectron/blob/master/configs/12_2017_baselines/e2e_faster_rcnn_X-101-64x4d-FPN_1x.yaml BS=1, synthetic data, 2 instance/2 socket, Datatype: INT8 and FP32

RetinaNet:
https://github.com/intel/Detectron/blob/master/configs/12_2017_baselines/retinanet_R-101-FPN_1x.yaml BS=1, synthetic data, 2 instance/2 socket, Datatype: INT8 and FP32

Intel Xeon Gold 6139:
Tested by Intel as of 3/01/2019. 2S Intel Xeon Gold 6139 (18 cores), HT on, Turbo on, total memory 128 GB (4 slots / 32 GB / 2.30 GHz), BIOS: F08_3A13, CentOS 7, kernel 3.10.0-957.el7.x86_64. Deep learning framework: PyTorch with C2 backend, PR link: https://github.com/pytorch/pytorch/pull/17464, gcc (Red Hat 5.3.1-6) 5.3.1 20160406, MKL-DNN version: v0.17.3 (commit hash: 0c3cb94999919d33e4875177fdef662bd9413dd4), MKL 2019.1.144.

ResNet50:
https://github.com/intel/optimized-models/tree/master/pytorch, BS=1/32, No datalayer; 1 socket, Datatype: INT8 and FP32.

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit: http://www.intel.com/performance.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.