Optimize HEVC Decoding Efficiency on High-end NUMA Systems

Co-authored with: Sergey A. Bufalov, Manager, Software Engineering, MainConcept; Frank Schoenberger, Senior Product Manager, MainConcept

Intel® Xeon® Scalable Processors have advanced scalability features. To gain workload performance, developers need to focus on scalability in their software architecture and revise and adopt multi-processing strategies. MainConcept, together with the Intel Architecture Graphics and Software (IAGS) division at Intel, achieved up to a 1.7x [1] performance and efficiency increase for the HEVC decoder (4320p at 43 Mbit) by optimizing the MainConcept® HEVC/H.265 SDK from version 8.2 to 10.0.1 [2] on the Intel® Xeon® Platinum 8180 processor [3].

Introduction

The latest Intel® Xeon® Scalable processors offer advanced scalability, including optimized CPU core design, memory bandwidth, and interconnects. To increase workload performance and max out the hardware capabilities, software developers need to focus on scalability in their software architecture and revise and adopt multi-processing strategies. The MainConcept* codec experts have extensive experience in designing high-quality professional products that scale with the available computational resources. This paper explains some concepts for effective utilization of Intel products using the MainConcept HEVC/H.265 Decoder.


Testing Environment

Reference System

The reference performance and efficiency were measured on an Intel® Core™ i7-8700K processor with 16 gigabytes (GB) of RAM, in decoded frames per second (fps), for 10-bit 4:2:2 elementary streams from 720p to 4320p.

Efficiency Definition

Two decoders with the same performance may have different efficiencies if they consume different amounts of CPU time. This may happen, for example, if one decoder spends fewer CPU cycles spinning or idling, or makes better use of the data and instruction caches.

Table 1 describes the performance of MainConcept’s HEVC/H.265 SDK 8.2 decoder on the reference desktop system.

Table 1. MainConcept HEVC SDK 8.2 decoder performance on Intel® Core™ i7-8700K processor [4].

10-BIT 4:2:2 HEVC STREAMS
PARAMETER | 720P at 11 Megabits | 1080P at 21 Megabits | 2160P at 32 Megabits | 4320P at 43 Megabits | AVERAGE
Performance, fps | 622.4 | 291.8 | 114.7 | 39.5 |
CPU utilization, percentage | 96.8 | 98.5 | 98.5 | 98.5 |
Efficiency | 6.4 | 3.0 | 1.2 | 0.4 | 2.7
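The Efficiency row in Table 1 is consistent with dividing decoded frames per second by CPU utilization in percent. The following is a minimal Python sketch under that assumption; the function name `efficiency` and the dictionary layout are illustrative and do not come from the MainConcept SDK.

```python
# Values copied from Table 1: stream -> (performance in fps, CPU utilization in %)
TABLE_1 = {
    "720p": (622.4, 96.8),
    "1080p": (291.8, 98.5),
    "2160p": (114.7, 98.5),
    "4320p": (39.5, 98.5),
}


def efficiency(fps, cpu_percent):
    """Assumed definition: decoded frames per second per percent of CPU used."""
    return fps / cpu_percent


per_stream = {name: efficiency(fps, cpu) for name, (fps, cpu) in TABLE_1.items()}
average = sum(per_stream.values()) / len(per_stream)

for name, value in per_stream.items():
    print(f"{name}: {value:.1f}")
print(f"average: {average:.1f}")
```

Under this reading, a decoder that reaches the same fps while using less CPU time scores a higher efficiency, which matches the distinction drawn in the Efficiency Definition section.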

To compare decoder efficiency between the Intel® Core™ i7-8700K processor and Intel® Xeon® server processors in a linear manner, the effects of the increased number of CPUs and of NUMA (Non-Uniform Memory Access) were minimized. To evaluate the decoder scalability on a platform P with THREADS_P total hardware threads and base frequency FREQUENCY_P, its efficiency is compared to the reference value EFFICIENCY_P computed by the following formula:
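One reading that reproduces the figures quoted later in this article (a Reference value of 5.1 for the E5-2640 v3 and a 73% gap for the Platinum 8180) is linear scaling of the reference desktop efficiency by the ratio of hardware threads and the ratio of base frequencies. The Python sketch below encodes that assumption. The function names are hypothetical, and the constant 2.74 is the unrounded mean of fps divided by CPU percentage from Table 1, which prints as 2.7.

```python
# Reference desktop system from Table 2: Intel Core i7-8700K
REF_THREADS = 12
REF_FREQ_GHZ = 3.7
# Assumption: unrounded average efficiency of SDK 8.2 on the reference system
REF_EFFICIENCY = 2.74


def reference_efficiency(threads, freq_ghz):
    """Assumed linear scaling with hardware threads and base frequency."""
    return REF_EFFICIENCY * (threads / REF_THREADS) * (freq_ghz / REF_FREQ_GHZ)


def gap_percent(measured, reference):
    """How far a measured efficiency falls below the reference, in percent."""
    return 100.0 * (reference - measured) / reference


# (total hardware threads, base frequency in GHz) taken from Table 2
PLATFORMS = {
    "Xeon Platinum 8180 x4": (224, 2.5),
    "Xeon Platinum 8168 x2": (96, 2.7),
    "Xeon E5-2699 v4 x2": (88, 2.2),
    "Xeon E5-2640 v3 x2": (32, 2.6),
}

for name, (threads, freq) in PLATFORMS.items():
    print(f"{name}: {reference_efficiency(threads, freq):.1f}")
```

The model deliberately ignores NUMA effects and the number of sockets, so it serves as an idealized target rather than a prediction.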

Below is the list of Intel® Xeon® processors used in the article and their reference efficiencies, calculated by the formula.


Figure 1. MainConcept HEVC decoder reference efficiency on Intel® processors.

The next table describes configurations of test systems for Intel® processors referenced in this article.

Table 2. Test systems configuration [5].

PROCESSOR | NODES | THREADS | FREQUENCY | RAM | LINUX* | KERNEL
Intel® Xeon® Platinum 8180 x4 | 4 | 224 | 2.5 GHz | 768 GB | Oracle* 7.6 | 3.10.0-862.14.4
Intel® Xeon® Platinum 8168 x2 | 2 | 96 | 2.7 GHz | 192 GB | Oracle 7.6 | 3.10.0-862.14.4
Intel® Xeon® E5-2699 processor, version 4 x2 | 2 | 88 | 2.2 GHz | 128 GB | Oracle 7.6 | 3.10.0-862.14.4
Intel® Xeon® E5-2640 processor, version 3 x2 | 2 | 32 | 2.6 GHz | 64 GB | Gentoo | 4.10.17
Intel® Core™ i7-8700K | 1 | 12 | 3.7 GHz | 32 GB | CentOS* 7.6 | 3.10.0-957.5.1

Problem Statement

The MainConcept HEVC Decoder version 8.2 is a well-designed, efficient, and performant companion for Intel® Core™ processors. The decoder’s efficiency on UMA (Uniform Memory Access) CPUs is based on SIMD vectorization, memory footprint reduction, cache optimization, lightweight synchronization primitives, and fibers. However, these optimization techniques don’t deliver the best performance on Intel® Xeon® server processors.


The gap between real and reference efficiencies in HEVC/H.265 SDK 8.2 is illustrated in the figure below.

Figure 2. MainConcept HEVC decoder reference and SDK 8.2 efficiency on Intel® processors [6].

On the Intel® Xeon® E5-2640 v3 processor, the HEVC Decoder 8.2 achieves an efficiency of 4.3, which is 15.6% below the Reference value of 5.1 (refer to the 4th column in the table of Figure 2 above). On the Intel® Xeon® E5-2699 v4 the difference is 23.5%, and it grows to 29.4% and 73% on the Intel® Xeon® Platinum 8168 and 8180 processors, respectively [7].

The growing number of hardware threads made it possible to decode more pictures in parallel, which increased the memory footprint of the workload. As a result, it caused more synchronization, more scheduling operations, further contention between threads, increased switching between tasks (fibers), more cache invalidation, and extensive data traffic between NUMA nodes.

Optimizations

According to Intel’s brief introduction to the Scalable Platform, the expected speedup of technical computations on the new Intel Xeon Platinum 8180 processor compared with the Intel Xeon E5-2699 processor version 4 is approximately 2.2 times [8]. From Figure 2, it is evident that the claimed boost is not achieved in HEVC/H.265 SDK 8.2, and that the Intel Xeon Scalable processor platform is underutilized.

The reason for the observed scalability deficiency was identified inside the MainConcept multiprocessing runtime.

Bottleneck identifications and improvement validations were accomplished with Intel® VTune™ Amplifier Task API invoked from inside the MainConcept multi-processing runtime.

In HEVC/H.265 SDK version 10.0.1, the MainConcept multi-processing runtime was redesigned to eliminate the identified deficiencies.

The features of the multi-processing runtime that were revised and the new features that were added are listed below, each with a short description of the improvement it delivers.

Table 3. Base features of MainConcept multiprocessing runtime.

Feature | HEVC SDK 8.2 | HEVC SDK 10.0.1 | Improvement
Cooperative multi-tasking | Fibers with non-shared stacks | Fibers with shared stacks | Less cache invalidation
Task queue implementation | Single FIFO circular buffer | Multiple lock-free linked lists | Less scheduling contention
Task queue synchronization | Critical section | Atomics, CAS | Less scheduling delays
Workload prioritization | N/A | Hierarchical ordered task lists | Less task contention
Workload affinity | N/A | Binding to NUMA, CPU, etc. | Less memory access overhead
Workload balancing | N/A | Proximity-based task stealing | Less thread contention
Task scheduling strategies | N/A | Configurable at compile-time | Better hardware compatibility
Load balancing strategies | N/A | Configurable at run-time | Better workload compatibility
Learning (under investigation) | N/A | Neural network offline training | Unlimited flexibility in tuning
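To make the workload affinity and proximity-based task stealing rows more concrete, the sketch below shows the general idea in Python. It is not the MainConcept runtime: every function name is hypothetical, the queues are plain single-threaded deques rather than lock-free lists, and proximity is reduced to two levels (same NUMA node first, remote nodes after). The `cpulist` format it parses is the one Linux exposes under `/sys/devices/system/node/node*/cpulist`, and `os.sched_setaffinity` is the standard Linux-only Python call for CPU binding.

```python
import os
from collections import deque


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-3,8-11' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_cpus(cpus):
    """Bind the caller to the given CPUs (Linux only; does nothing elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cpus))


def victim_order(worker, node_of):
    """Other workers, ordered so that same-node victims come before remote ones."""
    others = [w for w in node_of if w != worker]
    return sorted(others, key=lambda w: (node_of[w] != node_of[worker], w))


def next_task(worker, queues, node_of):
    """Take local work first, otherwise steal from the nearest non-empty queue."""
    if queues[worker]:
        return queues[worker].pop()  # newest local task, likely still in cache
    for victim in victim_order(worker, node_of):
        if queues[victim]:
            return queues[victim].popleft()  # oldest task of the victim
    return None


# Hypothetical layout: workers 0 and 1 on NUMA node 0, workers 2 and 3 on node 1
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
queues = {w: deque() for w in node_of}
queues[1].append("task on same node")
queues[2].append("task on remote node")
print(next_task(0, queues, node_of))
print(next_task(0, queues, node_of))
```

Preferring victims on the same node keeps stolen work close to the memory it touches, which is the kind of cross-node traffic the redesign aims to reduce.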

The results of improved task scheduling in HEVC/H.265 SDK 10.0.1 are summarized below.

Figure 3. MainConcept HEVC decoder reference, SDK 8.2, and SDK 10.0.1 efficiency on Intel® processors [9].

According to the table in Figure 3, efficiency in the best case (Intel® Xeon® Platinum 8180) improved more than threefold between HEVC SDK 8.2 (9.3) and 10.0.1 (32.0), and the deviation from the Reference values didn’t exceed 12% in the worst case (E5-2699 version 4). By comparison, the gain on the E5-2699 processor, version 4, was much smaller (9.1 vs. 10.5).
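The ratios behind these statements can be checked directly from the four efficiency values quoted above. The short Python sketch below does only that arithmetic, with an illustrative helper name. Setting the cross-generation ratio next to Intel's 2.2 times figure is an interpretation, since that figure refers to technical computations in general rather than to this decoder.

```python
# Efficiency values quoted from the table in Figure 3: (SDK 8.2, SDK 10.0.1)
EFFICIENCY = {
    "Xeon Platinum 8180": (9.3, 32.0),
    "Xeon E5-2699 v4": (9.1, 10.5),
}


def gain(old, new):
    """Ratio of the new efficiency to the old one."""
    return new / old


# Improvement from SDK 8.2 to SDK 10.0.1 on each platform
for cpu, (sdk_8_2, sdk_10_0_1) in EFFICIENCY.items():
    print(f"{cpu}: {gain(sdk_8_2, sdk_10_0_1):.2f}x from SDK 8.2 to 10.0.1")

# Platinum 8180 relative to E5-2699 v4, within each SDK version
old_8180, new_8180 = EFFICIENCY["Xeon Platinum 8180"]
old_2699, new_2699 = EFFICIENCY["Xeon E5-2699 v4"]
print(f"SDK 8.2 across generations: {gain(old_2699, old_8180):.2f}x")
print(f"SDK 10.0.1 across generations: {gain(new_2699, new_8180):.2f}x")
```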

Summary and Discussion

MainConcept and the Intel Architecture Graphics and Software (IAGS) division increased performance and efficiency on the latest Intel® Xeon® Scalable Processors by up to 3 times, getting very close to the reference efficiency and reaching Intel’s advertised 2.2 times acceleration between the architectures.

The multi-processing runtime from HEVC/H.265 SDK 8.2 still performs and scales well on Intel® Core™ processors and older Intel® Xeon® processor generations. Even on these platforms, the MainConcept HEVC/H.265 SDK 10.0.1 shows slightly improved efficiency and performance. But the most promising conclusion is that for the current generation of Intel® Xeon® Scalable Processors, the MainConcept HEVC Decoder library version 10 shows the most significant speed-up for 10-bit 4:2:2 decoding.

Footnotes

  1. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
    Performance results are based on testing as of 11/30/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
    Testing by MainConcept as of November 30, 2018
    Configuration: 4 x Intel® Xeon® Platinum 8180 @ 2.5 GHz, total threads 224, RAM 768 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x2000043
  2. MainConcept HEVC/H.265
  3. Intel® Xeon® Platinum 8180 Processor
  4. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
    Performance results are based on testing as of 11/30/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
    Testing by MainConcept as of November 30, 2018
    Configuration: Intel® Core™ i7-8700K processor at 3.7 GHz, total threads 12, RAM 32 GB, CentOS 7.6, kernel 3.10.0-957.5.1, ucode: 0x96
  5. All test systems were patched against the Meltdown and Spectre vulnerabilities before the benchmarks were run in late November 2018, according to: Project:Security/Vulnerabilities/Meltdown and Spectre; Spectre and meltdown patches.
  6. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit Performance Benchmark Test Disclosure.
    Performance results are based on testing as of 11/30/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
    Testing by MainConcept as of November 30, 2018 Configurations:
    4 x Intel® Xeon® Platinum 8180 processor at 2.5 GHz, total threads 224, RAM 768 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x2000043
    2 x Intel® Xeon® Platinum 8168 processor at 2.7 GHz, total threads 96, RAM 192 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x200004d
    2 x Intel® Xeon® E5-2699 processor v4 at 2.2 GHz, total threads 88, RAM 128 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0xB00002E
    2 x Intel® Xeon® E5-2640 processor v3 at 2.6 GHz, total threads 32, RAM 64 GB, Gentoo, kernel 4.10.17, ucode: 0x2B
    Intel® Core™ i7-8700K processor at 3.7 GHz, total threads 12, RAM 32 GB, CentOS 7.6, kernel 3.10.0-957.5.1, ucode: 0x96
  7. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit Performance Benchmark Test Disclosure.
    Performance results are based on testing as of 11/30/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
    Testing by MainConcept as of November 30, 2018 Configurations:
    4 x Intel® Xeon® Platinum 8180 @ 2.5 GHz, total threads 224, RAM 768 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x2000043
    2 x Intel® Xeon® Platinum 8168 @ 2.7 GHz, total threads 96, RAM 192 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x200004d
    2 x Intel® Xeon® E5-2699 version 4 @ 2.2 GHz, total threads 88, RAM 128 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0xB00002E
    2 x Intel® Xeon® E5-2640 version 3 @ 2.6 GHz, total threads 32, RAM 64 GB, Gentoo, kernel 4.10.17, ucode: 0x2B
  8. Intel® Xeon® Scalable Platform
  9. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit Performance Benchmark Test Disclosure.
    Performance results are based on testing as of 11/30/2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
    Testing by MainConcept as of November 30, 2018 Configurations:
    4 x Intel® Xeon® Platinum 8180 @ 2.5 GHz, total threads 224, RAM 768 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x2000043
    2 x Intel® Xeon® Platinum 8168 @ 2.7 GHz, total threads 96, RAM 192 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0x200004d
    2 x Intel® Xeon® E5-2699 v4 @ 2.2 GHz, total threads 88, RAM 128 GB, Oracle Linux 7.6, kernel 3.10.0-862.14.4, ucode: 0xB00002E
    2 x Intel® Xeon® E5-2640 v3 @ 2.6 GHz, total threads 32, RAM 64 GB, Gentoo, kernel 4.10.17, ucode: 0x2B
    Intel® Core™ i7-8700K @ 3.7 GHz, total threads 12, RAM 32 GB, CentOS 7.6, kernel 3.10.0-957.5.1, ucode: 0x96

Notices

Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

For more complete information about compiler optimizations, see our Optimization Notice.