Co-authored with: Sergey A. Bufalov, Manager, Software Engineering, MainConcept; Frank Schoenberger, Senior Product Manager, MainConcept
Intel® Xeon® Scalable processors offer advanced scalability features. To turn these features into workload performance gains, developers need to focus on scalability within their software architecture and to revise and adopt multi-processing strategies. MainConcept, together with the Intel Architecture Graphics and Software (IAGS) division at Intel, achieved up to a 1.7x [1] performance and efficiency increase for the HEVC decoder (4320p at 43 Mbit) by optimizing the MainConcept® HEVC/H.265 SDK from version 8.2 to version 10.0.1 [2] on the Intel® Xeon® Platinum 8180 processor [3].
The latest Intel® Xeon® Scalable processors offer advanced scalability, including an optimized CPU core design, higher memory bandwidth, and improved interconnects. To gain workload performance increases that max out the hardware capabilities, software developers need to put a strong focus on scalability within their software architecture and to revise and adopt multi-processing strategies. The MainConcept* codec experts have extensive experience in designing high-quality professional products that scale and perform with the available computational resources. This paper explains some concepts for effective utilization of Intel products using the MainConcept HEVC/H.265 Decoder.
The reference performance and efficiency are based on an Intel® Core™ i7-8700K processor with 16 gigabytes (GB) of RAM measured for 10-bit 4:2:2 720p to 4320p elementary streams in decoded frames per second (fps).
Two decoders with the same performance may have different efficiencies if they consume different amounts of CPU time. This may happen, for example, if one decoder spends fewer CPU cycles spinning and idling or makes better use of the data and instruction caches.
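The distinction can be illustrated with a minimal sketch. It assumes, purely for illustration, that performance is decoded frames per wall-clock second and that efficiency is that performance divided by the fraction of available CPU time actually consumed. The function names and the sample numbers are hypothetical and are not taken from the MainConcept SDK or its measurements.

```python
def performance_fps(frames, wall_seconds):
    """Decoded frames per wall-clock second."""
    return frames / wall_seconds


def cpu_utilization(cpu_seconds, wall_seconds, hardware_threads):
    """Fraction of the available CPU time that the decoder consumed."""
    return cpu_seconds / (wall_seconds * hardware_threads)


def efficiency(frames, wall_seconds, cpu_seconds, hardware_threads):
    """Assumed definition: performance normalized by CPU utilization."""
    fps = performance_fps(frames, wall_seconds)
    return fps / cpu_utilization(cpu_seconds, wall_seconds, hardware_threads)


# Two hypothetical decoders: same frame count, same wall time, 12 hardware threads.
# Decoder B burns less CPU time (for example, less spinning), so it is more efficient.
decoder_a = dict(frames=600, wall_seconds=10.0, cpu_seconds=118.2, hardware_threads=12)
decoder_b = dict(frames=600, wall_seconds=10.0, cpu_seconds=96.0, hardware_threads=12)

for name, run in (("A", decoder_a), ("B", decoder_b)):
    fps = performance_fps(run["frames"], run["wall_seconds"])
    print(f"decoder {name}: {fps:.1f} fps, efficiency {efficiency(**run):.1f}")
```

Both hypothetical decoders report the same fps, yet the one that consumes less CPU time scores higher under this assumed efficiency measure.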
Table 1 describes the performance of MainConcept’s HEVC/H.265 SDK 8.2 decoder on the reference desktop system.
Table 1. MainConcept HEVC SDK 8.2 decoder performance on Intel® Core™ i7-8700K processor [4].
|10-BIT 4:2:2 HEVC STREAMS|
|PARAMETER||720P at 11 Megabits||1080P at 21 Megabits||2160P at 32 Megabits||4320P at 43 Megabits||AVERAGE|
|CPU utilization, percentage||96.8||98.5||98.5||98.5|
To make decoder efficiency on the Intel® Core™ i7-8700K processor linearly comparable with efficiency on Intel® Xeon® server processors, the effects of the increased number of CPUs and of NUMA (Non-Uniform Memory Access) were minimized. To evaluate decoder scalability on a platform P with THREADS_P total hardware threads and base frequency FREQUENCY_P, its efficiency is compared to the reference value EFFICIENCY_P computed by the following formula:
Below is the list of Intel® Xeon® processors used in the article and their reference efficiencies, calculated by the formula.
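As a minimal sketch of such a linear comparison, the code below assumes that the reference efficiency grows in proportion to hardware threads times base frequency, that is, EFFICIENCY_P = EFFICIENCY_A × (THREADS_P × FREQUENCY_P) / (THREADS_A × FREQUENCY_A) for some anchor platform A with a known value. This exact form is an assumption made for illustration, not the authors' stated formula. Thread counts and frequencies are those listed in Table 2, and the anchor is the reference value of 5.1 that the article quotes for the Intel® Xeon® E5-2640 v3 system.

```python
# Hardware threads and base frequency (GHz) as listed in Table 2.
SYSTEMS = {
    "Core i7-8700K": (12, 3.7),
    "Xeon E5-2640 v3 x2": (32, 2.6),
    "Xeon E5-2699 v4 x2": (88, 2.2),
    "Xeon Platinum 8168 x2": (96, 2.7),
    "Xeon Platinum 8180 x4": (224, 2.5),
}


def capacity(name):
    """Nominal compute capacity: hardware threads times base frequency."""
    threads, ghz = SYSTEMS[name]
    return threads * ghz


def reference_efficiency(platform, anchor, anchor_efficiency):
    """Scale a known efficiency linearly with capacity (assumed form)."""
    return anchor_efficiency * capacity(platform) / capacity(anchor)


# Anchor on the reference value of 5.1 quoted for the E5-2640 v3 system.
for name in SYSTEMS:
    value = reference_efficiency(name, "Xeon E5-2640 v3 x2", 5.1)
    print(f"{name:24s} {value:5.1f}")
```

Under this assumption the Platinum 8180 system comes out near 34.3, which is consistent with the 73% gap the article reports for an SDK 8.2 efficiency of 9.3 on that system.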
The next table describes configurations of test systems for Intel® processors referenced in this article.
Table 2. Test systems configuration [5].
|Processor||Sockets||Hardware threads||Base frequency||RAM||OS||Kernel|
|Intel® Xeon® Platinum 8180 x4||4||224||2.5 GHz||768 GB||Oracle* 7.6||3.10.0-862.14.4|
|Intel® Xeon® Platinum 8168 x2||2||96||2.7 GHz||192 GB||Oracle 7.6||3.10.0-862.14.4|
|Intel® Xeon® E5-2699 processor, version 4 x2||2||88||2.2 GHz||128 GB||Oracle 7.6||3.10.0-862.14.4|
|Intel® Xeon® E5-2640 processor, version 3 x2||2||32||2.6 GHz||64 GB||Gentoo||4.10.17|
|Intel® Core™ i7-8700K||1||12||3.7 GHz||32 GB||CentOS* 7.6||3.10.0-957.5.1|
The MainConcept HEVC Decoder version 8.2 is a well-designed, efficient, and performant companion for Intel® Core™ processors. The decoder’s efficiency on UMA (Uniform Memory Access) CPUs is based on SIMD vectorization, memory footprint reduction, cache optimization, lightweight synchronization primitives, and fibers. However, these optimization techniques don’t deliver the best performance on Intel® Xeon® server processors.
The gap between real and reference efficiencies in HEVC/H.265 SDK 8.2 is illustrated in the figure below.
On the Intel® Xeon® E5-2640 v3 processor, the HEVC Decoder 8.2 achieves an efficiency of 4.3, which is 15.6% below the reference value of 5.1 (refer to the 4th column in the table of Figure 2). On the Intel® Xeon® E5-2699 v4 the difference is 23.5%, and it grows further to 29.4% and 73% on the Intel® Xeon® Platinum 8168 and 8180 processors, respectively [7].
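The gap figures can be reproduced from the rounded values with a short sketch. The helper names are hypothetical; the inputs 5.1, 4.3, 9.3, and 73% are taken from the article, and because they are rounded the results only approximate the published percentages.

```python
def efficiency_gap_percent(reference, measured):
    """Shortfall of the measured efficiency relative to the reference, in percent."""
    return (reference - measured) / reference * 100.0


def implied_reference(measured, gap_percent):
    """Reference efficiency implied by a measured value and a quoted gap."""
    return measured / (1.0 - gap_percent / 100.0)


# E5-2640 v3 with SDK 8.2: reference 5.1, measured 4.3.
print(f"E5-2640 v3 gap: {efficiency_gap_percent(5.1, 4.3):.1f}%")

# Platinum 8180 with SDK 8.2: measured 9.3 and a quoted gap of 73%.
print(f"implied 8180 reference: {implied_reference(9.3, 73.0):.1f}")
```

With the rounded inputs the first gap evaluates to about 15.7%, in line with the quoted 15.6%, and the 8180 figures imply a reference efficiency of roughly 34.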
According to Intel’s brief introduction of the Scalable platform, the expected speedup for technical computing when moving from the Intel Xeon E5-2699 v4 processor to the new Intel Xeon Platinum 8180 processor is approximately 2.2 times [8]. Figure 2 shows that this boost is not achieved with HEVC/H.265 SDK 8.2 and that the Intel Xeon Scalable processor platform is underutilized.
The reason for the observed scalability deficiency was identified inside the MainConcept multiprocessing runtime.
Bottleneck identification and validation of the improvements were carried out with the Intel® VTune™ Amplifier Task API, invoked from inside the MainConcept multi-processing runtime. The growing number of hardware threads made it possible to decode more pictures in parallel, which increased the memory footprint of the workload. As a result, it caused more synchronization, more scheduling operations, more contention between threads, more switching between tasks (fibers), more cache invalidation, and extensive data traffic between NUMA nodes.
In HEVC/H.265 SDK version 10.0.1, the MainConcept multi-processing runtime was redesigned to eliminate the identified deficiencies.
The features of the multi-processing runtime that were revised and the new features that were added are listed below, each with a short description of the improvement it delivers.
Table 3. Base features of MainConcept multiprocessing runtime.
|Feature||HEVC SDK 8.2||HEVC SDK 10.0.1||Improvement|
|Cooperative multi-tasking||Fibers with non-shared stacks||Fibers with shared stacks||Less cache invalidation|
|Task queue implementation||Single FIFO circular buffer||Multiple lock-free linked lists||Less scheduling contention|
|Task queue synchronization||Critical section||Atomics, CAS||Less scheduling delays|
|Workload prioritization||N/A||Hierarchical ordered task lists||Less task contention|
|Workload affinity||N/A||Binding to NUMA, CPU, etc.||Less memory access overhead|
|Workload balancing||N/A||Proximity-based task stealing||Less thread contention|
|Task scheduling strategies||N/A||Configurable at compile-time||Better hardware compatibility|
|Load balancing strategies||N/A||Configurable at run-time||Better workload compatibility|
|Learning (under investigation)||N/A||Neural network offline training||Unlimited flexibility in tuning|
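To make the "proximity-based task stealing" row more concrete, here is a minimal single-threaded sketch of the policy only. It assumes one task queue per NUMA node and a hypothetical distance matrix for a four-socket system; an idle worker first drains its own node's queue and then steals from the nearest non-empty node. The class, its methods, and the distance values are illustrative assumptions and do not reflect the MainConcept runtime, which according to Table 3 uses lock-free linked lists and atomics rather than plain Python deques.

```python
from collections import deque

# Hypothetical NUMA distance matrix for a 4-socket system (lower means closer).
DISTANCE = [
    [10, 21, 21, 31],
    [21, 10, 31, 21],
    [21, 31, 10, 21],
    [31, 21, 21, 10],
]


class ProximityScheduler:
    """One task queue per NUMA node; idle workers steal from the nearest node."""

    def __init__(self, distance):
        nodes = range(len(distance))
        self.queues = [deque() for _ in nodes]
        # For each home node, the other nodes ordered by increasing distance.
        self.steal_order = [
            sorted((n for n in nodes if n != home),
                   key=lambda n: (distance[home][n], n))
            for home in nodes
        ]

    def submit(self, node, task):
        """Queue a task on the node that owns its data."""
        self.queues[node].append(task)

    def next_task(self, node):
        """Return (task, source_node), or None if every queue is empty."""
        if self.queues[node]:
            return self.queues[node].popleft(), node
        for victim in self.steal_order[node]:
            if self.queues[victim]:
                # Steal from the tail so the victim keeps its oldest work.
                return self.queues[victim].pop(), victim
        return None


scheduler = ProximityScheduler(DISTANCE)
scheduler.submit(3, "picture on far node")
scheduler.submit(1, "picture on near node")
print(scheduler.next_task(0))  # node 0 is idle and steals from its nearest neighbor
```

Preferring nearby victims keeps stolen work close to the memory it touches, which is the kind of effect the "Workload affinity" and "Workload balancing" rows aim at.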
The results of improved task scheduling in HEVC/H.265 SDK 10.0.1 are summarized below.
According to the table in Figure 3, scalability improved more than three times in the best case (SKL 8180), from 9.3 with HEVC SDK 8.2 to 32.0 with 10.0.1, and the deviation from the reference values did not exceed 12% in the worst case (E5-2699 v4). On the SKL 8180, the HEVC Decoder 10.0.1 became more than three times faster than version 8.2 (32.0 vs. 9.3), whereas on the E5-2699 v4 the gain was modest (10.5 vs. 9.1).
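The ratios behind these statements can be checked with a few lines of arithmetic. The sketch uses only the four efficiency values quoted above; the dictionary layout and variable names are illustrative.

```python
# Efficiency values quoted in the text for SDK 8.2 and SDK 10.0.1.
EFFICIENCY = {
    "E5-2699 v4": {"8.2": 9.1, "10.0.1": 10.5},
    "Platinum 8180": {"8.2": 9.3, "10.0.1": 32.0},
}

# Gain from the runtime redesign on each processor.
sdk_gain = {cpu: v["10.0.1"] / v["8.2"] for cpu, v in EFFICIENCY.items()}

# Gain from moving between processor generations with the same SDK.
generation_gain = {
    sdk: EFFICIENCY["Platinum 8180"][sdk] / EFFICIENCY["E5-2699 v4"][sdk]
    for sdk in ("8.2", "10.0.1")
}

for cpu, gain in sdk_gain.items():
    print(f"{cpu}: SDK 10.0.1 vs. 8.2 = {gain:.2f}x")
for sdk, gain in generation_gain.items():
    print(f"SDK {sdk}: Platinum 8180 vs. E5-2699 v4 = {gain:.2f}x")
```

With SDK 8.2 the newer platform is barely ahead of the older one (about 1.02x), while with SDK 10.0.1 the ratio is about 3.05x, above the advertised 2.2 times.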
MainConcept and the Intel Architecture Graphics and Software (IAGS) division increased performance and efficiency on the latest Intel® Xeon® Scalable processors by up to three times, coming very close to the reference efficiency and reaching Intel’s advertised 2.2 times acceleration between the two processor generations.
The multi-processing runtime from HEVC/H.265 SDK 8.2 still performs and scales well on Intel® Core™ processors and older Intel® Xeon® processor generations. On these processors, the MainConcept HEVC/H.265 SDK 10.0.1 shows slightly improved efficiency and performance. The most promising conclusion, however, is that on the current generation of Intel® Xeon® Scalable processors the MainConcept HEVC Decoder library version 10 shows the most significant speed-up for 10-bit 4:2:2 decoding.
Intel technologies may require enabled hardware, specific software, or services activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
Optimization Notice revision #20110804