Accelerating x265 with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

Introduction

Motivation

Vector units in CPUs have become the de facto standard for accelerating media and other kernels that exhibit parallelism according to the single instruction, multiple data (SIMD) paradigm.1 These units enable a single register file to be treated as a combination of multiple registers whose cumulative width equals that of the vector register file. A single instruction can therefore operate in parallel on all data in this vector register, resulting in significant speedups for applications whose data access patterns fit this model. Starting from a 64-bit vector register file that may be treated as eight 8-bit registers in the architecture extended with MMX™ technology, SIMD on Intel® architecture processors has evolved to enable 256-bit register files that allow for 32 parallel 8-bit operations in the Intel® Advanced Vector Extensions (Intel® AVX) and Intel® Advanced Vector Extensions 2 (Intel® AVX2) generations.
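As a rough illustration of the lane arithmetic above, the following sketch divides a register width by an element width. The function name `simd_lanes` is an assumption made for this example and is not part of any library.

```python
def simd_lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements a single vector instruction can process at once."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


# Register widths for MMX technology, Intel AVX2, and Intel AVX-512.
for name, width in (("MMX", 64), ("AVX2", 256), ("AVX-512", 512)):
    print(name, simd_lanes(width, 8), "lanes of 8 bits")
```

The same arithmetic shows why wider elements reduce the lane count: a 512-bit register holds 64 elements of 8 bits but only 32 elements of 16 bits.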

Kernels in media workloads fit this pattern of execution naturally, because the same operation (filtering, for example) is uniformly applied across several pixels of a frame. Consequently, several popular open source projects leverage SIMD instructions for code acceleration. The x264 project for Advanced Video Coding (AVC) encoding2 and the x265 project for High Efficiency Video Coding (HEVC) encoding3 are two widely used media libraries that extensively use multiple generations of SIMD instructions on Intel architecture processors, from MMX technology all the way up to Intel AVX2. As shown in Figure 1, x264 and x265 achieve two times and five times speedup respectively over their corresponding baselines that do not use any SIMD code. The x265 encoder gains more performance from Intel AVX2 than x264 does, because the quantum of work done per frame is substantially larger for HEVC than for AVC.4

Figure 1. Performance benefit for x264 and x265 from Intel® Advanced Vector Extensions 2 for 1080p encoding with main profile using an Intel® Core™ i7-4500U Processor.

Focus of this whitepaper

The recently released Intel® Xeon® Scalable processors, part of the platform formerly code-named Purley, have introduced the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set.5 Intel AVX-512 instructions are capable of performing two times the number of operations in the same number of cycles as the previous generation Intel AVX2 instruction set. To accommodate this increased throughput, a larger fraction of the die is utilized, resulting in increased power being consumed, when compared to the previous-generation SIMD units. Therefore, certain Intel AVX-512 instructions are expected to cause a higher degradation to CPU clock frequency than others.6 While this reduction in frequency is offset by the increased throughput for the Intel AVX-512 instructions, media kernels continue to rely significantly on SIMD instructions in older generations (because not all kernels benefit from the increased width) and on straight-line C code that is not amenable to SIMD conversion, which may see reduced performance.

This whitepaper presents a case study based on our experience using the Intel AVX-512 SIMD instructions to accelerate the compute intensive kernels of x265. We describe how we offset the reduction in CPU frequency to ensure that the overall encoder achieves positive performance benefits. Through this process, we present recommendations of when we think Intel AVX-512 should be enabled with x265 for HEVC encoding. We also share our experience on when to choose Intel AVX-512 as a vehicle for accelerating media kernels.

Key takeaways

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels that can be accelerated with Intel AVX-512, the compute-to-memory ratio of each kernel should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64B in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel® Core™ i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations, because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel® Xeon® Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.

While the results and recommendations presented in this paper are not without limitations to the evaluations and our experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

The rest of the paper is organized as follows: The "Background" section presents the background relevant to the technical material presented in the paper. "Acceleration of x265 Kernels with Intel Advanced Vector Extensions 512" discusses the choices we made to accelerate specific kernels of x265 and discusses results for the main and main10 profiles. "Accelerating x265 Encoding with Intel Advanced Vector Extensions 512" presents the results for the overall encoder for the main and main10 profiles. Finally, Section 5 provides detailed recommendations for when Intel AVX-512 should be enabled when using x265 and generic recommendations for when Intel AVX-512 should be chosen when accelerating specific kernels. This section also describes future work.

Background

This section presents the relevant background of the concepts presented in this paper. Specifically, section "HEVC Video Encoding" provides the background on HEVC. "x265, an Open Source HEVC Encoder" discusses x265 with specific focus on the existing methods of performance optimizations that it employs. Section "Introduction to the Intel® Xeon® Scalable Processor Platform" presents the relevant background on Intel Xeon Scalable processors, and Section "SIMD Vectorization Using Intel Advanced Vector Extensions 512" discusses in more detail the Intel AVX-512 architecture.

HEVC video encoding

HEVC was ratified as an encoding standard by the JCT-VC (Joint Collaborative Team on Video Coding) in 2013 as a successor to the vastly popular AVC standard.4 The video encoding and decoding processes in HEVC revolve around identifying three units: a coding unit (CU) that represents each block in the picture, a prediction unit (PU) that represents the mode decision, including motion compensated prediction of the CU, and a transform unit (TU) that represents the way in which the generated residual error between the predicted and the actual block is coded.

Initially, a frame is divided into a sequence of its largest non-overlapping coding units, called coding tree units (CTUs). A CTU can then be split into multiple CUs with variable sizes of 64x64, 32x32, 16x16, and 8x8 to form a quad-tree. Each CU is then predicted from a set of candidate blocks, which may be in either the same frame or different frames. If the block used for the prediction is in the same frame, the block is said to be intra-predicted, while if it is in a different frame, it is said to be inter-predicted.
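The quad-tree split described above can be sketched as a short recursive function. This is a minimal illustration only: the `should_split` callback is a hypothetical stand-in for the encoder's real mode decision, and the names `split_ctu` and `Block` are assumptions made for this example.

```python
from typing import Callable, List, Tuple

Block = Tuple[int, int, int]  # (x, y, size) of a square CU, in pixels


def split_ctu(x: int, y: int, size: int,
              should_split: Callable[[int, int, int], bool],
              min_size: int = 8) -> List[Block]:
    """Return the leaf CUs of a quad-tree rooted at one CTU."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves: List[Block] = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(split_ctu(x + dx, y + dy, half, should_split, min_size))
        return leaves
    return [(x, y, size)]


# Hypothetical rule: keep splitting only the block at the CTU's top-left corner.
leaves = split_ctu(0, 0, 64, lambda x, y, size: x == 0 and y == 0)
print(leaves)
```

Whatever the split rule, the leaves always tile the full 64x64 CTU, which is a convenient sanity check for any partitioning code.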

Intra-predicted blocks are represented by a combination of the prediction block and a mode that denotes the angle of the prediction. The allowed modes for intra-prediction are labeled DC, planar, and angular modes representing various angles from the predicted block. Inter-predicted blocks are represented by a combination of the block used for prediction (the reference block) and the motion vector (MV) that represents the delta between the current and the reference block. Blocks that have zero MV are said to use the merge mode, while others use the AMP (Advanced Motion Prediction) mode. The skip mode is a special case of the merge mode in which the predicted block is identical to the source, that is, there is no residual. The AMP modes may use PUs that are the same size as the CU (denoted as 2Nx2N PUs) or may further partition them (denoted as rectangular and asymmetric PUs) to compute the MVs. The residual generated as the difference between the original and the predicted picture is then quantized and coded using TUs that may vary from 32x32 down to 4x4 blocks, depending on the prediction mode.

The entire process of inter, intra, CU, PU, and TU selection is called Rate-Distortion Optimization (RDO). The goal of RDO is to ensure that distortion is minimized at the target bitrate or the bitrate is minimized at the target quality level as represented by distortion. Throughout the process of RDO, various combinations of CUs, PUs, and TUs are attempted by an encoder, for which it employs several kernels. In this paper, we chose to vectorize these specific kernels by converting them to use Intel AVX-512 instructions.

HEVC encoding also supports multiple profiles for encoding a video, with each profile representing a different number of bits used to represent each pixel. The main and main10 profiles are popular profiles of HEVC (their AVC counterparts are called main and high profiles respectively). Each component of a pixel is represented with a minimum of 8 bits in the main profile, resulting in values ranging from 0–255. The main10 profile uses 10 bits per pixel, allowing for a higher range of 0–1023 for each pixel and enabling the representation of more detail in the encoded video.

x265, an open source HEVC encoder

The x265 encoder is an open-source HEVC encoder that compresses video in compliance with the HEVC standard.7 This encoder has been integrated into several open-source frameworks including VLC*, HandBrake*,8 and FFmpeg9 and is the de facto open-source video encoder for HEVC. The x265 encoder has assembly optimizations for several platforms, including Intel architecture, ARM*, and PowerPC*.

The x265 encoder employs techniques for inter-frame and intra-frame parallelism to deal with the increased complexity of HEVC encoding.10 For inter-frame parallelism, x265 encodes multiple frames in parallel by using system-level software threads. For intra-frame parallelism, x265 relies on the Wavefront Parallel Processing (WPP) tool exposed by the HEVC standard. This feature enables encoding rows of CTUs of a given frame in parallel, while ensuring that the blocks required for intra-prediction from the previous row are completed before the given block starts to encode; as per the standard, this translates to ensuring that the next CTU on the previous row completes before starting the encode of a CTU on the current row. The combination of these features gives a tremendous boost in speed with no loss in efficiency compared to the publicly available reference encoder, HM.

Introduction to the Intel® Xeon® processor Scalable family platform

The Intel® Xeon® processor Scalable family, part of the Intel® platform formerly code-named Purley, is designed to deliver new levels of consistent and breakthrough performance. The platform is based on cutting-edge technology and provides compelling benefits across a broad variety of usage models including big data, artificial intelligence, high-performance computing, enterprise-class IT, cloud, storage, communication, and Internet of Things. Top enhancements include performance for a wide range of workloads with one and a half times the memory bandwidth, integrated network/fabric, and optional integrated accelerators. Our results in x265 indicate a significant gen-over-gen speedup of 50–67 percent for offline encodes when compared to the previous-generation Intel® Xeon® processor E5-2600.10 This boost comes primarily from the improved microarchitecture features available on Intel Xeon Scalable processors.

SIMD vectorization using Intel® AVX-512

The Intel AVX-512 vector blocks present a 512-bit register file, allowing 2X parallel data operations per cycle compared to that of Intel AVX2. Though the benefits of vectorizing kernels to use the Intel AVX-512 architecture seem obvious, several key questions must be answered specifically for media workloads before embarking on this task. First, is there sufficient parallelism inherently present in media kernels that they can leverage this increased width? Second, is the fraction of the execution that exploits this parallelism sufficiently large that we can expect reasonable speedups as per Amdahl's law? Third, does enabling such vectorization have some effect on the execution of the serial and non-vector code?
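The second question can be made concrete with Amdahl's law. The sketch below is illustrative only: the function name and the sample inputs are assumptions chosen for this example, not measurements from x265.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only part of the run time is accelerated.

    vector_fraction: share of the original run time spent in vectorized code (0..1).
    vector_speedup:  factor by which that share gets faster (2.0 means twice as fast).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


# Hypothetical inputs: half of the run time is vectorized and becomes 2x faster.
print(amdahl_speedup(0.5, 2.0))
```

Even a doubling of vector throughput yields a much smaller overall gain when a large share of the run time stays in code that does not benefit from the wider registers.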

Acceleration of x265 Kernels with Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

As a first step in acceleration, we selected the kernels from x265 to be accelerated and rewrote them with handwritten Intel AVX-512 assembly. While automated tools that generate vectorized SIMD code are available, we found that handwritten assembly outperforms auto-vectorizing tools, which convinced us to use this technique. This section details how the kernels were selected and the cycle-count gains we observed from them for sample runs in the main and main10 profiles.

Selecting the kernels to accelerate

We selected over 1,000 kernels from the core compute of x265 to optimize with Intel AVX-512 instructions for the main and main10 profiles. These kernels were chosen based on their resource requirements. Some kernels may require frequent memory access, like the different block-copy and block-fill kernels, while others may involve intense computation, like the DCT, iDCT, and quantization kernels. There is also a third class of kernels that involve a combination of both in varying proportions. We found that ensuring that the buffers accessed by the assembly routines were 64-byte aligned reduces cache misses and in general helps Intel AVX-512 kernels. A complete list of the kernels optimized with Intel AVX-512 instructions for the main and main10 profiles is given in Appendix A1 and A2 respectively.
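To see why 64-byte alignment matters, the sketch below checks whether a load of one 512-bit register touches two cache lines. It assumes a 64-byte cache line, and the helper names are hypothetical; real kernels would obtain aligned buffers from the allocator rather than computing addresses like this.

```python
CACHE_LINE_BYTES = 64  # cache-line size assumed for this sketch
ZMM_BYTES = 64         # one 512-bit register holds 64 bytes


def crosses_cache_line(address: int, access_bytes: int = ZMM_BYTES) -> bool:
    """True if an access starting at `address` spans two cache lines."""
    first_line = address // CACHE_LINE_BYTES
    last_line = (address + access_bytes - 1) // CACHE_LINE_BYTES
    return first_line != last_line


def align_up(address: int, alignment: int = CACHE_LINE_BYTES) -> int:
    """Round `address` up to the next multiple of `alignment`."""
    return (address + alignment - 1) // alignment * alignment


# A 64-byte load stays within one line only when it starts on a 64-byte boundary.
for addr in (128, 130, align_up(130)):
    print(addr, crosses_cache_line(addr))
```

Because a full-width Intel AVX-512 load is exactly as wide as the assumed cache line, any misalignment at all makes the load span two lines, whereas a 32-byte Intel AVX2 load can tolerate some offsets.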

Framework to evaluate cycle-count improvements

The x265 encoder implements a sample test bench as a correctness and performance measurement tool for assembly kernels. It accepts valid arguments for a given kernel, invokes the C primitive and the corresponding assembly kernel, and compares both output buffers. It exercises corner cases for the given input type by using a randomly distributed set of values. Each assembly kernel is called 100 times and checked against the output of its C primitive to ensure correctness. To measure performance improvement, the test bench measures the difference in clock ticks (as reported by the rdtsc instruction) between the assembly kernel and the C kernel over 1,000 runs and reports the average.
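The structure of such a test bench can be sketched in a few lines. This is not the x265 test bench: both kernels below are pure-Python stand-ins, `time.perf_counter_ns` replaces rdtsc, and every function name is an assumption made for illustration.

```python
import random
import time
from typing import Callable, List

Kernel = Callable[[List[int], List[int]], int]


def sad_reference(a: List[int], b: List[int]) -> int:
    """Plain-loop sum of absolute differences, standing in for the C primitive."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_optimized(a: List[int], b: List[int]) -> int:
    """Stand-in for the assembly kernel (only a different Python formulation)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def random_block(rng: random.Random, n: int = 64, max_value: int = 255) -> List[int]:
    return [rng.randint(0, max_value) for _ in range(n)]


def check_kernel(ref: Kernel, opt: Kernel, rng: random.Random,
                 iterations: int = 100) -> bool:
    """Compare both kernels on random inputs; False on the first mismatch."""
    for _ in range(iterations):
        a, b = random_block(rng), random_block(rng)
        if ref(a, b) != opt(a, b):
            return False
    return True


def average_ticks(kernel: Kernel, a: List[int], b: List[int], runs: int = 1000) -> float:
    """Average wall-clock nanoseconds per call over `runs` calls."""
    start = time.perf_counter_ns()
    for _ in range(runs):
        kernel(a, b)
    return (time.perf_counter_ns() - start) / runs


rng = random.Random(0)
print("match:", check_kernel(sad_reference, sad_optimized, rng))
a, b = random_block(rng), random_block(rng)
print("reference ns/call:", average_ticks(sad_reference, a, b))
print("optimized ns/call:", average_ticks(sad_optimized, a, b))
```

Checking correctness before timing keeps a fast but wrong kernel from being reported as an improvement.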

Cycle-Count improvement for kernels in the main and main10 profiles

Figure 2 shows the cycle-count improvements for each of the 500 kernels in the main profile and the 600+ kernels in the main10 profile that were accelerated with Intel AVX-512. In each curve, the kernels are sorted in increasing order of their cycle-count gains over the corresponding Intel AVX2 implementation. Appendix A details the per-kernel gains over Intel AVX2 in cycle counts.

On average, we saw a 33 percent and 40 percent gain in the cycle count over the Intel AVX2 kernels for kernels in the main and main10 profiles respectively. The reason for the higher gains in main10 is as follows. In the main10 profile, x265 uses 16 bits to represent each pixel, as opposed to the main profile, which uses 8 bits; although main10 technically only needs 10 bits, using 16 bits simplifies all data structures in the software. Therefore, the amount of work that has to be done for the same number of pixels is doubled. Due to the higher quantum of compute, kernels in the main10 profile gain more from Intel AVX-512 over Intel AVX2 than the kernels in the main profile do. These cycle-count results indicate that at the kernel level, there is much benefit in using Intel AVX-512 to accelerate x265. However, this does not account for the reduction in clock frequency incurred when using Intel AVX-512 instructions compared to using Intel AVX2 instructions. In the next section, we look at the effect on overall encoding time, which also accounts for this effect.
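The interaction between cycle-count gains and clock frequency can be approximated with a simple model in which run time equals cycles divided by frequency. The sketch below rests on several assumptions: a "gain" is read as a fractional reduction in cycles, the whole encoder is assumed to run at the reduced clock, and the input values are hypothetical rather than measured.

```python
def net_speedup(kernel_fraction: float, cycle_reduction: float,
                frequency_ratio: float) -> float:
    """Estimated encoder speedup of Intel AVX-512 over Intel AVX2.

    kernel_fraction: share of AVX2 cycles spent in kernels that get AVX-512 versions.
    cycle_reduction: fractional drop in cycles for those kernels (0.4 = 40% fewer).
    frequency_ratio: AVX-512 clock divided by AVX2 clock (below 1 when the clock drops).
    """
    # Remaining cycles relative to the AVX2 run: faster kernels plus untouched code.
    relative_cycles = kernel_fraction * (1.0 - cycle_reduction) + (1.0 - kernel_fraction)
    # time = cycles / frequency, so speedup = frequency_ratio / relative_cycles.
    return frequency_ratio / relative_cycles


def break_even_frequency_ratio(kernel_fraction: float, cycle_reduction: float) -> float:
    """Lowest frequency ratio at which AVX-512 is no slower than AVX2."""
    return kernel_fraction * (1.0 - cycle_reduction) + (1.0 - kernel_fraction)


# Hypothetical inputs: half the cycles in accelerated kernels, 40% fewer cycles
# in those kernels, and a clock that drops to 90% of its AVX2 value.
print(net_speedup(0.5, 0.4, 0.9))
print(break_even_frequency_ratio(0.5, 0.4))
```

Under this model, a configuration that spends little time in the accelerated kernels has a break-even ratio close to 1, so even a small frequency drop turns the kernel-level gain into an overall loss.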

Accelerating x265 Encoding with Intel Advanced Vector Extensions 512

In this section, we look at the impact of using Intel AVX-512 kernels for real encoding use cases with x265. Section "Test Setup" describes our test setup including the videos chosen, the x265 presets used, and the system configurations of the test machines. Section "Encoding on Intel® Core™ Processors" presents results on a workstation machine with an Intel Core i9-7900X processor, while section "Encoding on Intel Xeon Scalable Processors" presents results on a typical high-end server CPU that has two Intel Xeon Platinum 8180 processors.

Test setup

Our tests mainly focused on encoding 1080p videos with the main profile and 4K videos with the main10 profile. We used four typical 1080p clips (crowdrun, ducks_take_off, park_joy, and old_town_cross) and three 4K clips (Netflix_Boat, Netflix_FoodMarket, and Netflix_Tango) for our tests.10 Appendix B gives a little more detail, along with screenshots of the videos used. We encode the 1080p clips in the main profile at the following bitrates (in Kbps): 1000, 3000, 5000, and 7000. For the 4K clips, the main10 encodes target the following bitrates (in Kbps): 8000, 10000, 12000, and 14000.

We encode these videos with a version of x265 that has all the kernels described in Section 3; these kernels were contributed as part of the default branch of x265. The kernels are disabled by default and may be enabled with the --asm avx512 option in the x265 command-line interface.
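A command line for one of these encodes might be assembled as in the sketch below. The helper `x265_command` and the file names are hypothetical; the sketch only builds the argument list and does not run the encoder.

```python
from typing import List


def x265_command(input_path: str, output_path: str, preset: str, profile: str,
                 bitrate_kbps: int, use_avx512: bool = False) -> List[str]:
    """Build an x265 CLI argument list; nothing is executed here."""
    cmd = [
        "x265",
        "--input", input_path,
        "--preset", preset,
        "--profile", profile,
        "--bitrate", str(bitrate_kbps),
    ]
    if use_avx512:
        # The Intel AVX-512 kernels are off by default and must be requested.
        cmd += ["--asm", "avx512"]
    cmd += ["--output", output_path]
    return cmd


# Placeholder file names; substitute a real source clip and output path.
print(" ".join(x265_command("input.y4m", "output.hevc",
                            "veryslow", "main10", 8000, use_avx512=True)))
```

Keeping the Intel AVX-512 switch as an explicit parameter makes it easy to produce matched pairs of runs that differ only in the SIMD level being compared.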

Figure 2. Cycle-count gains of the main and main10 profile Intel® Advanced Vector Extensions 512 kernels over the corresponding Intel® Advanced Vector Extensions 2 kernels.

We focused our experiments on four presets of x265 to represent the wide set of use cases that x265 presents: ultrafast, veryfast, medium, and veryslow. These presets represent a wide variety of trade-offs between encode efficiency and frames per second (FPS). The veryslow preset generates the most efficient encode but is the slowest; this preset is also the preferred choice for any offline encoding use cases such as OTT. The ultrafast preset is the quickest setting of x265 but generates the encode with the lowest efficiency. The veryfast and medium presets represent intermediate points in the trade-off between performance and encoder efficiency. Typically, the more efficient presets employ more tools of HEVC, resulting in more compute-per-pixel than the less efficient presets. This is important to call out, as Intel AVX-512 kernels tend to give better speedup when the compute-per-pixel is higher, as shown by the results in the previous section.

Encoding on Intel® Core™ Processors

Figure 3 shows the performance of encoding 1080p and 4K video in main and main10 profile with Intel AVX-512 kernels relative to using Intel AVX2 kernels on a workstation-like configuration with an Intel Core i9-7900X processor using a single instance of x265. The full details of the system configuration are described in Appendix C. The single instance results in high utilization of the CPU across all configurations, representing a typical use case for this system when performing HEVC encoding.

Intel® Core™ i9-7900X Processor
Figure 3. Encoder performance from using Intel® Advanced Vector Extensions 512 kernels on a single instance of x265, as measured on a workstation-like system with an Intel® Core™ i9-7900X processor.

From the results, we see that for all profiles and presets, enabling Intel AVX-512 kernels results in positive performance gains. On the Intel Core i9-7900X processor system, our measurements did not indicate any significant reduction in clock frequency. The cycle-count improvements from the kernels therefore translate directly into increased encoder performance. When we examined the relative encoder performance per encode, we found no command lines that demonstrated lower performance with Intel AVX-512 than with Intel AVX2.

We therefore recommend that for the Intel Core i9-7900X processor, and similar systems where the frequency reduction is minimal, Intel AVX-512 kernels be enabled for all encoding profiles and resolutions when using x265.

Encoding on Intel Xeon Scalable Processors

In this section, we present results from using x265 accelerated by Intel AVX-512 on a high-end server configuration with two Intel Xeon Platinum 8180 processors arranged in a dual-socket configuration with 28 hyperthreaded cores per CPU. For full details of the system configuration, refer to Appendix C.

x265 single instance performance using 8 threads and 16 threads

Figure 4 shows the performance of a single instance of x265 with kernels that use Intel AVX-512 for encoding 1080p videos in the main profile and 4K videos in the main10 profile relative to using kernels that only use Intel AVX2 instructions. Two configurations, one with 8 threads per instance and another with 16 threads per instance, are shown in the graph to understand the impact of increasing the number of active cores on the CPU; limiting the number of threads for each instance is done using the --pools option of the x265 library.

The figure shows that for a given thread configuration, the gains when encoding 4K content in the main10 profile are higher than for the 1080p content in the main profile. Also, for a given resolution and profile, the gains that we see from the presets that have more work per pixel (the higher-efficiency presets like the veryslow preset) are higher than those from the faster presets; in fact, for 1080p content in the main profile, we see an average performance loss. These gains are consistent with previously observed results that demonstrate that the more work per pixel a specific configuration has, the better it is to use Intel AVX-512. Additionally, when we investigated the S-curves of these profiles (not shown here for brevity), we saw that several encoder command lines outside the 4K main10 veryslow setting lost performance over Intel AVX2.

We therefore recommend using Intel AVX-512-enabled kernels only when doing 4K encodes in the main10 profile with the veryslow preset. For other presets and encoder settings, the amount of work per pixel is insufficient for the gains in cycle count to offset the reduction in clock frequency.

One additional observation we can make from Figure 4 is that the performance gains are in general higher across the board when using 8 threads for the single instance of x265 than when using 16 threads. Upon further analysis, we observe that when more cores are activated with Intel AVX-512 instructions in the Intel Xeon Platinum 8180 processor, the frequency reduces further, resulting in lower gains from using Intel AVX-512 instructions. In a typical server, however, encoder vendors attempt to maximize the use of all available CPU cores to get the maximum throughput out of the given server. This use case is explored in section "Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265", where we attempt to saturate the server with 4K main10 encodes to see whether the lower frequency when more cores are activated mutes the gains.

Intel® Xeon® Platinum 8180 Processor
Figure 4. Relative performance of a single instance of x265 when using Intel® Advanced Vector Extensions 512 kernels with 8 or 16 threads over Intel® Advanced Vector Extensions 2 kernels on a server configuration with two Intel® Xeon® Platinum 8180 processors.

Saturating Intel® Xeon® Platinum 8180 processors using multiple instances of x265

To study whether activating more cores results in performance loss for 4K encodes in the main10 profile, we saturated one and both CPUs of a dual-socket Intel Xeon Platinum 8180 processor-based server with four and eight instances of x265, respectively, with each instance using 16 threads. We measured the total FPS achieved by all x265 instances to encode the same clip at different bitrates when using kernels that use Intel AVX-512 and reported the number relative to when the Intel AVX2-enabled kernels were used. Figure 5 shows these results.
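The relative number reported for the saturation runs can be computed as in the sketch below. The function name is an assumption, and the per-instance FPS values are placeholders rather than measured data.

```python
from typing import Sequence


def relative_throughput(avx512_fps: Sequence[float], avx2_fps: Sequence[float]) -> float:
    """Total FPS of all instances with Intel AVX-512 divided by the total with Intel AVX2."""
    if len(avx512_fps) != len(avx2_fps):
        raise ValueError("both runs must use the same number of instances")
    if not avx2_fps:
        raise ValueError("at least one instance is required")
    return sum(avx512_fps) / sum(avx2_fps)


# Placeholder FPS values for four concurrent instances (one socket saturated).
print(relative_throughput([2.2, 2.1, 2.3, 2.2], [2.0, 2.0, 2.0, 2.0]))
```

Summing FPS before dividing weights each instance by its actual throughput, which differs from averaging the per-instance ratios when the instances encode at different bitrates.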

Intel® Xeon® Platinum 8180 processor - Single and Dual Socket Saturation
Figure 5. Single-socket and dual-socket saturation of the Intel® Xeon® Platinum 8180 processor with x265 instances.

Figure 5 shows that even when saturating one or both CPUs, encoding 4K videos with main10 using Intel AVX-512 kernels shows positive performance gains over using the Intel AVX2 counterparts. However, the gains are lower than the corresponding gains achieved with a single instance of x265 that uses fewer cores. Additionally, we observe that for lower efficiency presets such as veryfast and medium, the gains are muted due to the higher frequency drop with more active cores.

These results reiterate our recommendation that Intel AVX-512 kernels should only be enabled when encoding 4K content for the main10 profile for the veryslow preset. For other presets that have lower compute per pixel, enabling Intel AVX-512 kernels may result in a performance loss over using Intel AVX2 kernels.

Conclusions and Future Work

In this paper, we presented our experience with using the Intel AVX-512 instructions available in the newly introduced Intel Xeon Scalable processors to accelerate the open-source HEVC encoder x265. The specific challenges that we had to overcome included selecting the right kernels to accelerate with Intel AVX-512 such that the reduction in CPU frequency was offset by the benefits in cycle count, and choosing the encoder configuration that provided the right balance of compute per pixel to achieve positive gains in encoder performance.

Recommendations

Our experience shows that enabling Intel AVX-512 specifically for media kernels requires achieving a balance that should be delicately handled. From our results, we recommend the following:

  • When choosing specific kernels that can be accelerated with Intel AVX-512, the compute-to-memory ratio of each kernel should be considered. If this ratio is high, using Intel AVX-512 is recommended. Also, when using Intel AVX-512, try to align the buffers to 64B in order to avoid loads that cross cache-line boundaries.
  • For desktop and workstation SKUs (like the Intel Core i9-7900X processor that we tested), Intel AVX-512 kernels can be enabled for all encoder configurations because the reduction in CPU clock frequency is rather low.
  • For server SKUs (like the Intel Xeon Platinum 8180 processor on which we tested), the frequency dip is higher and increases as more cores become active. Therefore, Intel AVX-512 should only be enabled when the amount of computation per pixel is high, because only then is the clock-cycle benefit able to balance out the frequency penalty and result in performance gains for the encoder.

Specifically, we recommend enabling Intel AVX-512 only when encoding 4K content using a slower or veryslow preset in the main10 profile. We do not recommend enabling Intel AVX-512 kernels for other settings (resolutions/profiles/presets), because of possible performance impact on the encoder.
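These recommendations can be condensed into a small decision helper. The sketch below is one possible encoding of the guidance: the function name and the platform labels "workstation" and "server" are assumptions made for illustration, and the rule is only as general as the two systems tested.

```python
def recommend_avx512(platform: str, resolution: str, profile: str, preset: str) -> bool:
    """Return True when the guidance above suggests enabling Intel AVX-512 kernels."""
    if platform == "workstation":
        # Frequency reduction was low on the tested workstation part.
        return True
    if platform == "server":
        # Only high compute-per-pixel settings offset the larger frequency dip.
        return (resolution == "4K"
                and profile == "main10"
                and preset in ("slower", "veryslow"))
    raise ValueError("platform must be 'workstation' or 'server'")


print(recommend_avx512("server", "4K", "main10", "veryslow"))
print(recommend_avx512("server", "1080p", "main", "medium"))
```

A helper of this kind could sit in a wrapper script that decides whether to pass the --asm avx512 option for a given job.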

While the results and recommendations presented in this paper are not without the limitations of the evaluations and our experimental approximations, we believe that they will help the community at large to understand the benefits of using Intel AVX-512 for accelerating media workloads.

Future work

The task of accelerating x265 with Intel AVX-512 has opened several avenues for future work. The accelerated kernels are available through the public mailing list. Future extensions of this work to enable further acceleration from Intel AVX-512 include (1) performing a thorough analysis of the use of Intel AVX-512 for videos at other resolutions and presets available in x265, (2) enabling schemes to dynamically enable and disable Intel AVX-512 kernels by monitoring the CPU frequency, and (3) a fundamental re-architecting of the encoder to segregate the worker threads into different types of threads, only some of which may run Intel AVX-512, limiting the number of cores where the CPU frequency drop is observed. We will continue to develop and contribute these solutions to open source, and encourage the reader to also contribute to the project at http://x265.org.

Acknowledgements

This work was funded in part by a non-recurring engineering grant from Intel to MulticoreWare. We would like to thank the various developers and engineers at MulticoreWare for their extensive support throughout this work. In particular, we would like to thank Thomas A. Vaughan for his guidance and Min Chen for his expert comments on the assembly patches.

Appendix A

A1 – Main profile instructions per cycle (IPC) gains

Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain
sad | 0.16% | i422 chroma_vss | 32.70% | i420 chroma_vpp | 23.19% | luma_vss | 43.18%
pixelavg_pp | 0.87% | luma_vss | 32.89% | addAvg | 23.37% | luma_vss | 43.35%
i444 chroma_vps | 1.14% | sad_x3 | 33.01% | addAvg | 23.38% | i444 chroma_hpp | 43.43%
i444 chroma_vps | 1.18% | luma_vps | 33.05% | i444 chroma_hps | 23.53% | ssd_s | 43.57%
pixelavg_pp | 1.41% | i420 chroma_hpp | 33.08% | i420 chroma_hps | 23.77% | luma_hps | 43.68%
convert_p2s | 1.95% | i444 chroma_hpp | 33.14% | var | 23.95% | luma_vss | 43.75%
i420 chroma_vps | 2.45% | sad_x4 | 33.14% | i420 chroma_hpp | 24.03% | luma_hps | 43.84%
i420 chroma_vps | 2.72% | i444 chroma_vss | 33.16% | i422 chroma_vpp | 24.11% | luma_hps | 43.94%
i422 chroma_hps | 2.83% | i420 chroma_vss | 33.16% | i444 chroma_vss | 24.15% | luma_vsp | 44.06%
i420 p2s | 3.21% | copy_ps | 33.33% | i422 chroma_vss | 24.15% | luma_vsp | 44.11%
i444 p2s | 3.21% | i420 copy_ps | 33.33% | i420 chroma_vss | 24.15% | sub_ps | 44.11%
sad_x3 | 3.29% | i444 chroma_vss | 33.34% | i420 chroma_vps | 24.20% | i444 chroma_hpp | 44.15%
i420 chroma_vps | 3.62% | i422 chroma_vss | 33.34% | i444 chroma_vpp | 24.20% | convert_p2s | 44.33%
sad_x4 | 4.50% | i420 chroma_vss | 33.34% | i420 chroma_vpp | 24.20% | i444 chroma_hpp | 44.35%
sad | 4.62% | i422 copy_ps | 33.43% | sad | 24.21% | luma_vss | 44.42%
i420 chroma_hps | 4.90% | i444 chroma_vss | 33.43% | i444 chroma_vps | 24.22% | luma_hps | 44.43%
i420 chroma_hps | 5.19% | i422 chroma_vss | 33.43% | i420 chroma_vps | 24.22% | luma_hpp | 44.48%
pixel_satd | 5.42% | i420 chroma_hpp | 33.55% | i444 chroma_hps | 24.25% | luma_vpp | 44.54%
i444 chroma_vps | 5.43% | i422 chroma_hpp | 33.57% | i420 chroma_hpp | 24.42% | luma_vss | 44.61%
i422 chroma_hps | 5.82% | dequant_normal | 33.60% | sad_x4 | 24.53% | cpy1Dto2D_shl | 44.61%
i444 chroma_vps | 6.78% | sad_x4 | 33.62% | i444 chroma_hps | 24.57% | luma_vsp | 44.62%
dct | 7.06% | i444 chroma_vss | 33.89% | i422 chroma_hps | 24.65% | luma_vsp | 44.66%
i444 chroma_hps | 7.08% | i420 chroma_vss | 33.89% | psyCost_pp | 24.89% | luma_vss | 44.70%
i444 chroma_hps | 7.26% | sad_x3 | 33.92% | i422 chroma_vps | 25.00% | luma_vpp | 44.74%
i422 chroma_vss | 8.85% | i420 pixel_satd | 34.01% | i444 chroma_vss | 25.17% | luma_vsp | 44.85%
luma_vss | 9.76% | i444 chroma_hps | 34.02% | i422 chroma_vss | 25.17% | i422 copy_sp | 45.20%
i422 chroma_hps | 10.27% | luma_vps | 34.04% | i420 chroma_vss | 25.17% | getResidual32 | 45.24%
i444 chroma_hps | 11.00% | i444 chroma_hpp | 34.20% | i422 chroma_vps | 25.66% | luma_vpp | 45.30%
i444 chroma_hps | 11.14% | i420 pixel_satd | 34.20% | luma_vps | 25.82% | luma_hps | 45.35%
sad | 11.26% | i420 chroma_hpp | 34.23% | i444 chroma_vps | 25.89% | i444 chroma_hpp | 45.41%
i420 chroma_hps | 11.38% | i444 chroma_vss | 34.43% | i444 chroma_vps | 25.92% | luma_hpp | 45.49%
pixel_sa8d | 11.55% | i422 chroma_vss | 34.43% | i420 chroma_hps | 25.95% | convert_p2s | 45.52%
i444 chroma_hps | 11.91% | i420 chroma_vss | 34.43% | i420 chroma_vps | 26.07% | luma_hps | 45.58%
luma_vpp | 11.96% | i422 chroma_vsp | 34.59% | convert_p2s | 26.25% | luma_vpp | 45.62%
i422 chroma_hps | 12.10% | i444 chroma_vss | 34.71% | i422 chroma_vps | 26.42% | convert_p2s | 45.62%
copy_pp | 12.54% | i444 chroma_vss | 34.76% | i444 chroma_vps | 26.56% | luma_vpp | 45.69%
ssd_s | 12.58% | addAvg | 34.88% | i444 chroma_vss | 26.71% | cpy2Dto1D_shl | 45.75%
i420 chroma_vps | 12.58% | addAvg | 35.14% | i422 chroma_vss | 26.71% | i422 addAvg | 45.76%
i444 chroma_hps | 12.79% | sad | 35.43% | i420 chroma_vss | 26.71% | convert_p2s | 46.00%
idct | 13.32% | ssd_ss | 35.45% | sad_x4 | 26.80% | i420 add_ps | 46.09%
luma_vps | 13.78% | i444 chroma_vss | 35.51% | i422 chroma_hpp | 27.06% | add_ps | 46.10%
i444 chroma_hps | 13.87% | i420 pixel_satd | 35.55% | i422 chroma_hps | 27.13% | luma_vsp | 46.14%
sad | 13.88% | pixelavg_pp | 35.56% | luma_hpp | 27.15% | luma_hps | 46.29%
copy_cnt | 14.25% | luma_vpp | 35.62% | i420 pixel_satd | 27.23% | luma_vss | 46.31%
luma_vpp | 14.28% | luma_vpp | 36.21% | i444 chroma_vss | 27.24% | i444 chroma_vsp | 46.52%
pixel_satd | 14.45% | i420 chroma_hpp | 36.45% | i422 chroma_vss | 27.24% | i422 chroma_vsp | 46.52%
idct | 14.49% | i422 chroma_hpp | 36.65% | luma_hpp | 27.29% | i420 chroma_vsp | 46.52%
pixel_satd | 14.92% | i422 chroma_hpp | 36.76% | luma_vps | 27.45% | luma_hps | 46.65%
pixel_satd | 14.99% | sad | 36.76% | psyCost_pp | 27.62% | pixelavg_pp | 46.67%
sad | 15.21% | i422 chroma_hpp | 36.81% | luma_vsp | 27.72% | luma_vss | 46.88%
idct | 15.23% | copy_pp | 36.82% | i422 chroma_hps | 28.00% | i422 addAvg | 46.88%
sad_x3 | 15.32% | pixelavg_pp | 36.84% | pixel_satd | 28.50% | luma_hps | 46.90%
i444 chroma_vpp | 15.47% | convert_p2s | 36.87% | cpy2Dto1D_shl | 28.69% | luma_vsp | 46.97%
i422 chroma_vpp | 15.47% | i420 p2s | 36.87% | luma_vps | 28.71% | i422 p2s | 47.10%
i420 chroma_vpp | 15.47% | i444 p2s | 36.87% | i444 chroma_hpp | 28.78% | copy_pp | 47.11%
pixel_satd | 15.52% | i444 chroma_hpp | 37.07% | i420 pixel_satd | 28.80% | luma_vss | 47.64%
pixel_satd | 15.62% | luma_vpp | 37.11% | i422 pixel_satd | 28.81% | i444 chroma_hpp | 47.83%
pixel_satd | 15.66% | luma_vss | 37.49% | i422 pixel_satd | 28.95% | i422 addAvg | 47.85%
sad_x3 | 15.70% | addAvg | 37.76% | luma_vss | 29.26% | luma_hps | 48.46%
pixel_satd | 15.75% | i444 chroma_vps | 37.90% | i444 chroma_vss | 29.29% | copy_ps | 48.57%
i420 chroma_hps | 15.83% | i444 chroma_vss | 38.04% | i420 chroma_hps | 29.42% | sub_ps | 48.83%
copy_pp | 15.93% | i444 chroma_vps | 38.05% | luma_vpp | 29.43% | luma_hpp | 48.97%
luma_vpp | 16.10% | i444 chroma_vps | 38.23% | scale1D_128to64 | 29.50% | i422 add_ps | 49.02%
nquant | 16.33% | sad | 38.42% | luma_vss | 29.59% | i444 chroma_vsp | 49.43%
sad | 16.35% | i444 chroma_hpp | 38.45% | i444 chroma_vpp | 29.69% | i420 sub_ps | 49.46%
i444 chroma_vpp | 16.39% | Weight_sp | 38.48% | i422 chroma_vpp | 29.69% | add_ps | 49.50%
i420 chroma_hps | 16.60% | i444 chroma_hpp | 38.55% | i420 chroma_vpp | 29.69% | i422 sub_ps | 49.52%
i444 chroma_vpp | 17.02% | sad | 38.56% | i422 chroma_hps | 29.71% | i420 addAvg | 49.74%
i422 chroma_vpp | 17.02% | luma_hpp | 38.79% | i422 pixel_satd | 29.75% | convert_p2s | 49.75%
i420 chroma_vpp | 17.02% | pixel_satd | 39.15% | i444 chroma_vpp | 29.82% | i422 p2s | 49.75%
pixel_satd | 17.08% | luma_hpp | 39.21% | i422 chroma_vpp | 29.82% | i444 p2s | 49.75%
luma_vps | 17.10% | i444 chroma_hpp | 39.30% | luma_vss | 29.91% | luma_vss | 49.84%
luma_vps | 17.36% | i444 chroma_vps | 39.39% | i444 chroma_vss | 29.92% | luma_hpp | 50.00%
i444 chroma_vss | 17.55% | addAvg | 39.51% | i422 chroma_vss | 29.92% | copy_sp | 50.11%
i420 chroma_vss | 17.55% | i420 chroma_hpp | 39.55% | i420 chroma_vss | 29.92% | luma_vss | 50.22%
pixel_satd | 17.59% | i422 pixel_satd | 39.57% | luma_vps | 30.19% | luma_hpp | 50.61%
pixel_satd | 17.66% | i422 chroma_hpp | 39.61% | sad_x4 | 30.24% | luma_hpp | 51.19%
i444 chroma_vss | 18.42% | convert_p2s | 39.78% | sad | 30.30% | i444 chroma_vsp | 51.23%
i422 chroma_vss | 18.42% | i420 p2s | 39.78% | luma_vps | 30.37% | luma_hpp | 51.70%
i420 chroma_vss | 18.42% | i422 p2s | 39.78% | luma_vps | 30.39% | nonPsyRdoQuant | 51.74%
i444 chroma_vpp | 18.49% | i444 p2s | 39.78% | i444 chroma_vpp | 30.39% | i444 chroma_vsp | 52.08%
i420 chroma_vpp | 18.49% | copy_sp | 39.93% | i422 chroma_vpp | 30.39% | copy_pp | 52.17%
luma_vps | 18.50% | i420 addAvg | 40.02% | i420 chroma_vpp | 30.39% | i444 chroma_vsp | 52.22%
luma_vpp | 18.51% | luma_hps | 40.04% | ssd_ss | 30.44% | i444 chroma_vsp | 52.28%
sad_x3 | 18.99% | i444 chroma_hpp | 40.07% | i422 chroma_hpp | 30.45% | nonPsyRdoQuant | 52.32%
copy_pp | 19.76% | addAvg | 40.64% | i420 pixel_satd | 30.53% | i422 copy_ss | 52.45%
luma_vss | 19.80% | luma_vsp | 40.87% | i422 chroma_vpp | 30.54% | nonPsyRdoQuant | 52.56%
pixel_satd | 19.89% | i444 chroma_vsp | 40.96% | i444 chroma_hpp | 30.54% | i444 chroma_vsp | 52.77%
sad | 20.09% | i420 chroma_vsp | 40.96% | i422 chroma_hpp | 30.56% | i422 chroma_vsp | 52.77%
sad_x3 | 20.26% | luma_vss | 41.01% | i444 chroma_hpp | 30.63% | blockfill_s | 52.93%
i444 chroma_hps | 20.52% | i420 copy_sp | 41.12% | i420 chroma_hpp | 30.85% | i444 chroma_vsp | 53.30%
i420 chroma_hps | 20.80% | copy_cnt | 41.14% | luma_vsp | 30.95% | i422 chroma_vsp | 53.30%
psyCost_pp | 21.15% | luma_vsp | 41.16% | sad_x4 | 30.95% | i420 chroma_vsp | 53.30%
i444 chroma_hps | 21.17% | Weight_pp | 41.23% | i422 chroma_vss | 30.99% | i422 chroma_vsp | 53.36%
pixel_satd | 21.19% | luma_hps | 41.42% | i444 chroma_hps | 31.12% | i444 chroma_vsp | 54.34%
pixel_satd | 21.21% | addAvg | 41.84% | i444 chroma_vpp | 31.17% | i422 chroma_vsp | 54.34%
quant | 21.23% | i420 addAvg | 41.87% | i444 chroma_vpp | 31.20% | i420 chroma_vsp | 54.34%
sad_x3 | 21.29% | luma_vsp | 41.99% | sad | 31.29% | psyRdoQuant | 54.44%
i444 chroma_vpp | 21.42% | luma_hps | 42.05% | luma_vsp | 31.33% | luma_hpp | 54.62%
i422 chroma_vpp | 21.42% | convert_p2s | 42.13% | sad_x3 | 31.34% | i444 chroma_vsp | 54.64%
i420 chroma_vpp | 21.42% | i420 p2s | 42.13% | i422 pixel_satd | 31.46% | i420 chroma_vsp | 54.64%
i420 chroma_vps | 21.60% | i422 p2s | 42.13% | luma_hps | 31.52% | luma_hpp | 54.78%
pixel_satd | 21.61% | i444 p2s | 42.13% | i444 chroma_vpp | 31.57% | luma_hpp | 55.06%
i444 chroma_vps | 21.69% | i444 chroma_vsp | 42.31% | pixelavg_pp | 31.62% | luma_hpp | 55.40%
i422 chroma_hps | 21.99% | i422 chroma_vsp | 42.31% | luma_vps | 31.76% | copy_pp | 55.41%
i420 addAvg | 22.01% | i420 chroma_vsp | 42.31% | i444 chroma_hps | 31.78% | psyRdoQuant | 55.70%
luma_vsp | 22.09% | luma_vsp | 42.35% | sad_x3 | 31.95% | psyRdoQuant | 55.72%
i444 chroma_vps | 22.27% | i420 chroma_hpp | 42.43% | i444 chroma_vss | 31.96% | var | 55.75%
i422 chroma_vps | 22.41% | nonPsyRdoQuant | 42.51% | i420 chroma_vss | 31.96% | copy_ss | 56.00%
sad_x4 | 22.44% | luma_hps | 42.54% | i422 chroma_vss | 32.01% | i444 chroma_vsp | 56.36%
var | 22.51% | addAvg | 42.56% | i444 chroma_hpp | 32.12% | i422 chroma_vsp | 56.36%
i444 chroma_vpp | 22.64% | luma_hps | 42.58% | var | 32.17% | i420 chroma_vsp | 56.36%
i420 chroma_vpp | 22.64% | luma_vss | 42.82% | i420 chroma_hpp | 32.32% | i420 copy_ss | 56.63%
sad_x4 | 22.84% | i422 addAvg | 42.93% | i444 chroma_hps | 32.44% | i444 chroma_vsp | 57.60%
i444 chroma_vpp | 22.87% | luma_vpp | 42.97% | luma_vsp | 32.61% | i420 chroma_vsp | 57.60%
i422 chroma_vpp | 22.87% | dequant_scaling | 42.98% | i444 chroma_vss | 32.67% | copy_pp | 58.33%
i422 chroma_hpp | 22.92% | luma_hpp | 42.99% | i420 chroma_vss | 32.67% | copy_ss | 60.09%
sad_x4 | 23.09% | i444 chroma_vsp | 43.05% | i444 chroma_vss | 32.69% | psyRdoQuant | 62.80%
i444 chroma_vpp | 23.19% | i422 chroma_vsp | 43.05% | i422 chroma_vss | 32.69% | i444 chroma_vsp | 62.98%
 |  |  |  | i420 chroma_vss | 32.69% | i420 chroma_vsp | 62.98%

A2 – Main10 profile IPC gains

Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain | Primitive | IPC Gain
convert_p2s | 1.26% | i422 chroma_hps | 39.92% | i422 chroma_vpp | 29.64% | i444 chroma_hpp | 49.20%
i420 p2s | 1.26% | i422 p2s | 40.30% | i420 chroma_vpp | 29.64% | i444 chroma_hps | 49.45%
i444 p2s | 1.26% | luma_hpp | 40.35% | i444 chroma_vsp | 29.82% | cpy2Dto1D_shl | 49.70%
addAvg | 1.86% | i422 chroma_hpp | 40.52% | i422 chroma_vsp | 29.82% | luma_hvpp | 49.80%
addAvg | 6.88% | copy_cnt | 40.55% | i420 chroma_vsp | 29.82% | luma_vss | 49.84%
dct | 7.06% | luma_vpp | 40.58% | luma_vss | 29.91% | i420 chroma_hps | 49.85%
sad_x3 | 7.65% | luma_vsp | 40.59% | i444 chroma_vss | 29.92% | convert_p2s | 49.87%
sad | 7.74% | i444 chroma_vps | 40.60% | i422 chroma_vss | 29.92% | i420 p2s | 49.87%
sad | 8.29% | i422 chroma_vps | 40.60% | i420 chroma_vss | 29.92% | i422 p2s | 49.87%
i420 addAvg | 8.36% | i420 chroma_vps | 40.60% | i444 chroma_vps | 29.93% | i422 p2s | 49.87%
sad_x3 | 8.77% | sad_x3 | 40.64% | i422 chroma_vps | 29.93% | i444 p2s | 49.87%
luma_vss | 9.76% | nonPsyRdoQuant | 40.70% | i420 chroma_vps | 29.93% | luma_hps | 49.94%
intra_pred_ang27 | 9.79% | add_ps | 40.71% | luma_vsp | 30.06% | i422 chroma_hps | 50.07%
cpy2Dto1D_shl | 10.13% | sad_x4 | 40.73% | i444 chroma_vsp | 30.11% | i444 chroma_hpp | 50.13%
sad_x3 | 10.81% | luma_vpp | 40.73% | i422 chroma_vsp | 30.11% | luma_vss | 50.22%
sad_x4 | 10.96% | copy_pp | 40.81% | i420 chroma_vsp | 30.11% | luma_hpp | 50.25%
i420 addAvg | 11.05% | i422 chroma_vps | 40.88% | pixel_satd | 30.30% | i420 chroma_vpp | 50.28%
pixel_satd | 11.05% | luma_vss | 41.01% | i422 pixel_satd | 30.30% | luma_hps | 50.67%
i420 pixel_satd | 11.05% | i444 chroma_vsp | 41.02% | i422 pixel_satd | 30.35% | addAvg | 50.67%
i422 pixel_satd | 11.05% | i420 chroma_vsp | 41.02% | add_ps | 30.69% | i422 addAvg | 50.67%
luma_vsp | 12.64% | i444 chroma_vsp | 41.05% | sad | 30.94% | luma_hpp | 50.75%
copy_cnt | 13.29% | i420 chroma_vsp | 41.05% | dequant_normal | 31.10% | i420 chroma_hpp | 50.82%
idct | 13.32% | sad | 41.06% | sad | 31.37% | copy_pp | 50.95%
i444 chroma_vps | 14.44% | intra_pred_ang34 | 41.06% | pixel_satd | 31.43% | i422 addAvg | 50.99%
i422 chroma_vps | 14.44% | convert_p2s | 41.09% | i420 pixel_satd | 31.43% | luma_hps | 51.17%
i420 chroma_vps | 14.44% | i444 p2s | 41.09% | i422 pixel_satd | 31.43% | i422 chroma_hpp | 51.22%
idct | 14.49% | nonPsyRdoQuant | 41.21% | i444 chroma_vpp | 31.60% | i444 chroma_hpp | 51.37%
i444 chroma_vpp | 14.84% | sad_x4 | 41.22% | i422 chroma_vss | 31.76% | luma_hpp | 51.48%
idct | 15.23% | i422 chroma_vpp | 41.25% | i444 chroma_vss | 31.96% | luma_hps | 51.57%
luma_vsp | 15.24% | i420 chroma_vpp | 41.25% | i420 chroma_vss | 31.96% | copy_ss | 51.58%
sad_x3 | 15.53% | i420 chroma_vpp | 41.36% | sad | 31.99% | luma_hpp | 51.63%
addAvg | 15.60% | i444 chroma_vsp | 41.40% | psyCost_pp | 32.12% | luma_hps | 51.64%
i422 chroma_vpp | 15.71% | luma_vpp | 41.43% | i420 chroma_hps | 32.32% | luma_hps | 51.65%
i420 chroma_vpp | 15.71% | luma_hvpp | 41.46% | i422 addAvg | 32.46% | luma_hps | 51.70%
addAvg | 15.90% | luma_vpp | 41.48% | i422 chroma_vss | 32.62% | luma_hps | 51.81%
i422 chroma_vpp | 16.07% | i444 chroma_vsp | 41.51% | i444 chroma_vss | 32.67% | i422 chroma_hpp | 51.86%
intra_pred_ang25 | 16.22% | luma_hvpp | 41.54% | i420 chroma_vss | 32.67% | luma_hps | 51.89%
nquant | 16.33% | intra_pred_ang11 | 41.55% | i444 chroma_vss | 32.69% | addAvg | 51.89%
sad_x4 | 16.42% | convert_p2s | 41.58% | i422 chroma_vss | 32.69% | i420 addAvg | 51.89%
luma_vsp | 16.55% | sad_x4 | 41.71% | i420 chroma_vss | 32.69% | i422 addAvg | 51.89%
i420 addAvg | 17.12% | sad_x4 | 41.71% | luma_vss | 32.89% | luma_hps | 51.93%
sad_x4 | 17.33% | luma_vsp | 41.78% | i444 chroma_vsp | 33.14% | luma_hps | 51.99%
i444 chroma_vss | 17.55% | sad_x4 | 41.83% | i422 chroma_vsp | 33.14% | i444 chroma_hpp | 52.16%
i420 chroma_vss | 17.55% | i444 chroma_vsp | 42.01% | i444 chroma_vss | 33.16% | i422 copy_sp | 52.45%
i444 chroma_vps | 17.88% | i444 chroma_vsp | 42.08% | i420 chroma_vss | 33.16% | i422 copy_ps | 52.45%
i422 chroma_vps | 17.88% | i422 chroma_vsp | 42.08% | convert_p2s | 33.27% | i422 copy_ss | 52.45%
i420 chroma_vps | 17.88% | nonPsyRdoQuant | 42.13% | i444 chroma_vss | 33.34% | i444 chroma_hps | 52.94%
pixel_satd | 18.02% | pixelavg_pp | 42.17% | i422 chroma_vss | 33.34% | copy_ss | 53.20%
i422 addAvg | 18.13% | i422 chroma_vpp | 42.20% | i420 chroma_vss | 33.34% | i420 chroma_hps | 53.22%
i444 chroma_vss | 18.42% | i420 chroma_vpp | 42.20% | i444 chroma_vss | 33.43% | i422 chroma_hps | 53.27%
i422 chroma_vss | 18.42% | luma_vps | 42.30% | i422 chroma_vss | 33.43% | i420 chroma_hpp | 53.48%
i420 chroma_vss | 18.42% | sub_ps | 42.52% | pixelavg_pp | 33.45% | copy_pp | 53.53%
addAvg | 19.50% | luma_vsp | 42.55% | pixel_satd | 33.45% | i422 chroma_hpp | 53.81%
i444 chroma_vps | 19.54% | luma_hvpp | 42.65% | i420 pixel_satd | 33.45% | i422 chroma_hpp | 53.89%
i422 chroma_vps | 19.54% | pixelavg_pp | 42.65% | addAvg | 33.46% | i444 chroma_hpp | 54.31%
i420 chroma_vps | 19.54% | luma_vps | 42.72% | luma_vsp | 33.47% | ssd_ss | 54.69%
sad_x3 | 19.75% | convert_p2s | 42.77% | sad_x4 | 33.51% | i422 chroma_hpp | 54.77%
luma_vss | 19.80% | luma_vss | 42.82% | i444 chroma_vsp | 33.79% | i420 chroma_hpp | 55.18%
i422 pixel_satd | 19.95% | luma_vsp | 43.05% | i422 chroma_vsp | 33.79% | luma_hpp | 55.53%
pixel_satd | 20.02% | convert_p2s | 43.11% | i420 chroma_vsp | 33.79% | i444 chroma_hpp | 55.56%
i420 pixel_satd | 20.02% | i444 chroma_hpp | 43.15% | i444 chroma_vss | 33.89% | i444 chroma_hpp | 55.78%
i422 pixel_satd | 20.02% | luma_vsp | 43.17% | i420 chroma_vss | 33.89% | i444 chroma_hpp | 55.94%
i444 chroma_vps | 20.09% | luma_vss | 43.18% | luma_vsp | 34.08% | luma_hpp | 55.96%
i420 chroma_vps | 20.09% | luma_vsp | 43.22% | sub_ps | 34.13% | copy_sp | 56.00%
i422 chroma_vss | 20.53% | luma_hvpp | 43.24% | i444 chroma_vsp | 34.18% | copy_ps | 56.00%
sad_x4 | 20.69% | luma_vss | 43.35% | i420 chroma_vsp | 34.18% | i444 chroma_hpp | 56.07%
i444 chroma_vps | 20.86% | luma_vsp | 43.36% | i444 chroma_vsp | 34.22% | luma_hpp | 56.16%
i422 chroma_vps | 20.86% | i420 chroma_hpp | 43.38% | i422 chroma_vsp | 34.22% | i420 copy_sp | 56.63%
i444 chroma_vpp | 20.98% | cpy1Dto2D_shl | 43.50% | i420 chroma_vsp | 34.22% | i420 copy_ps | 56.63%
quant | 21.23% | luma_vsp | 43.50% | i444 chroma_vss | 34.43% | i420 copy_ss | 56.63%
i422 chroma_vpp | 21.45% | luma_vpp | 43.51% | i422 chroma_vss | 34.43% | i422 chroma_hpp | 57.32%
sad | 21.61% | copy_pp | 43.54% | i420 chroma_vss | 34.43% | i444 chroma_hps | 57.33%
i444 chroma_vpp | 21.78% | luma_hvpp | 43.57% | pixel_satd | 34.59% | luma_hpp | 57.40%
i444 chroma_vps | 22.06% | luma_vpp | 43.58% | i444 chroma_vss | 34.71% | i420 chroma_hps | 57.97%
i420 chroma_vps | 22.06% | luma_hvpp | 43.60% | i444 chroma_vss | 34.76% | luma_hpp | 58.55%
i444 chroma_vsp | 22.12% | luma_vss | 43.75% | intra_pred_ang10 | 34.76% | i444 chroma_hps | 59.21%
i422 chroma_vsp | 22.12% | luma_vps | 43.77% | i444 chroma_vps | 34.80% | i420 chroma_hps | 59.46%
i420 chroma_vsp | 22.12% | i444 chroma_vsp | 43.80% | i444 chroma_vps | 34.98% | blockfill_s | 59.53%
i444 chroma_vsp | 22.14% | i420 chroma_vsp | 43.80% | luma_vps | 35.07% | luma_hpp | 59.56%
i422 chroma_vsp | 22.14% | pixelavg_pp | 43.94% | i444 chroma_vps | 35.34% | i422 chroma_hps | 59.75%
i420 chroma_vsp | 22.14% | psyRdoQuant | 44.02% | Weight_pp | 35.37% | copy_sp | 60.09%
i422 chroma_vpp | 22.28% | sad_x3 | 44.17% | i444 chroma_vss | 35.51% | copy_ps | 60.09%
i420 chroma_vpp | 22.28% | pixelavg_pp | 44.23% | luma_vps | 35.63% | luma_hps | 60.23%
i444 chroma_vpp | 22.28% | luma_hvpp | 44.24% | i422 chroma_hps | 35.68% | psyRdoQuant | 60.25%
i422 chroma_vpp | 22.35% | luma_hvpp | 44.28% | i444 chroma_vps | 36.38% | luma_hpp | 60.26%
ssd_ss | 22.60% | luma_vsp | 44.31% | i422 chroma_vss | 36.56% | i444 chroma_hps | 60.28%
i444 chroma_vpp | 23.06% | dequant_scaling | 44.37% | sad | 36.66% | i420 chroma_hps | 60.48%
sad_x4 | 23.09% | convert_p2s | 44.40% | luma_vpp | 36.68% | luma_hps | 60.76%
luma_vpp | 23.67% | luma_vpp | 44.41% | i444 chroma_vpp | 36.70% | copy_pp | 60.87%
luma_vpp | 23.82% | luma_vss | 44.42% | luma_vsp | 36.71% | i444 chroma_hps | 60.92%
i444 chroma_vpp | 23.84% | sad_x4 | 44.42% | sad_x3 | 36.75% | i422 chroma_hps | 61.09%
i444 chroma_vss | 24.15% | luma_vpp | 44.60% | sad_x4 | 36.78% | luma_hpp | 61.28%
i422 chroma_vss | 24.15% | luma_vss | 44.61% | pixel_satd | 36.88% | i444 chroma_hpp | 61.38%
i420 chroma_vss | 24.15% | luma_hvpp | 44.61% | i422 chroma_vpp | 36.91% | luma_hpp | 61.43%
intra_pred_ang9 | 24.37% | getResidual32 | 44.64% | copy_pp | 36.96% | luma_hpp | 61.44%
i444 chroma_vpp | 24.41% | luma_hpp | 44.68% | addAvg | 37.08% | i422 chroma_hps | 61.55%
luma_vpp | 24.48% | luma_vss | 44.70% | sad_x4 | 37.09% | luma_hpp | 61.58%
i422 addAvg | 24.62% | luma_hvpp | 44.73% | i420 chroma_vpp | 37.29% | luma_hpp | 62.26%
psyCost_pp | 24.88% | i444 chroma_vsp | 44.76% | i422 chroma_vpp | 37.36% | i422 chroma_hps | 62.31%
i420 chroma_vpp | 24.90% | i422 chroma_vsp | 44.76% | i420 chroma_vpp | 37.36% | luma_hpp | 62.35%
i422 chroma_vpp | 25.11% | i420 chroma_vsp | 44.76% | luma_vss | 37.49% | i420 chroma_hpp | 62.39%
i420 chroma_vpp | 25.11% | sad_x4 | 44.85% | luma_vpp | 37.53% | i420 chroma_hps | 62.39%
i444 chroma_vps | 25.17% | luma_hvpp | 45.15% | i444 chroma_vps | 37.54% | i444 chroma_hpp | 62.46%
i422 chroma_vps | 25.17% | luma_vps | 45.19% | i422 chroma_vps | 37.54% | luma_hpp | 62.63%
i420 chroma_vps | 25.17% | i422 chroma_hpp | 45.23% | i420 chroma_vps | 37.54% | i444 chroma_hps | 62.88%
i444 chroma_vss | 25.17% | intra_pred_dc | 45.26% | i444 chroma_vpp | 37.59% | i420 chroma_hps | 62.95%
i422 chroma_vss | 25.17% | sad | 45.31% | i420 chroma_vpp | 37.59% | luma_hpp | 63.07%
i420 chroma_vss | 25.17% | luma_vps | 45.36% | i444 chroma_vps | 37.59% | i444 chroma_hps | 63.15%
i422 chroma_vps | 25.28% | psyRdoQuant | 45.40% | i422 chroma_vps | 37.59% | luma_hps | 63.16%
i444 chroma_vps | 25.97% | i420 add_ps | 45.40% | pixel_satd | 37.60% | i420 chroma_hpp | 63.34%
i422 chroma_vps | 25.97% | pixelavg_pp | 45.52% | i444 chroma_vps | 37.60% | luma_hpp | 63.61%
i420 chroma_vps | 25.97% | addAvg | 45.54% | i420 chroma_vps | 37.60% | i420 chroma_hps | 63.85%
luma_vpp | 26.22% | i420 addAvg | 45.54% | i444 chroma_vsp | 37.66% | luma_hpp | 63.91%
sad | 26.25% | i422 addAvg | 45.54% | i422 chroma_vps | 37.68% | i420 chroma_hpp | 64.12%
psyCost_pp | 26.30% | i444 chroma_vsp | 45.57% | i444 chroma_vpp | 37.69% | i444 chroma_hps | 64.15%
i444 chroma_vsp | 26.38% | i422 chroma_vsp | 45.57% | i444 chroma_vps | 37.71% | i444 chroma_hpp | 64.23%
i420 chroma_vsp | 26.38% | i420 chroma_vsp | 45.57% | i420 chroma_vps | 37.71% | i422 chroma_hpp | 64.39%
i420 addAvg | 26.39% | luma_vps | 45.58% | convert_p2s | 37.73% | i422 chroma_hpp | 64.56%
i422 addAvg | 26.39% | pixelavg_pp | 45.61% | i420 p2s | 37.73% | i444 chroma_hps | 64.84%
pixel_satd | 26.62% | luma_vps | 45.62% | i422 p2s | 37.73% | i422 chroma_hps | 64.87%
i444 chroma_vss | 26.71% | luma_vps | 45.64% | i444 p2s | 37.73% | i444 chroma_hpp | 64.92%
i422 chroma_vss | 26.71% | sad_x3 | 45.65% | i444 chroma_vpp | 37.74% | i420 chroma_hps | 64.93%
i420 chroma_vss | 26.71% | i422 add_ps | 45.68% | i444 chroma_vpp | 37.76% | i422 chroma_hpp | 65.05%
luma_vsp | 26.77% | addAvg | 45.72% | addAvg | 37.80% | i444 chroma_hps | 65.06%
luma_vps | 27.04% | i420 addAvg | 45.72% | i422 chroma_vpp | 37.99% | i420 chroma_hpp | 65.14%
luma_vpp | 27.10% | pixelavg_pp | 45.80% | i444 chroma_vss | 38.04% | i422 chroma_hps | 65.35%
i444 chroma_vss | 27.24% | i444 chroma_hpp | 45.95% | i420 chroma_hpp | 38.04% | i422 chroma_hps | 65.63%
i422 chroma_vss | 27.24% | psyRdoQuant | 45.96% | luma_vps | 38.08% | i444 chroma_hps | 65.72%
i422 chroma_vps | 27.26% | luma_vsp | 45.97% | i444 chroma_vpp | 38.09% | i422 chroma_hpp | 65.80%
i420 addAvg | 27.28% | sad | 46.04% | i444 chroma_vpp | 38.27% | i444 chroma_hpp | 65.88%
i422 addAvg | 27.28% | luma_hvpp | 46.17% | i422 chroma_vpp | 38.27% | i420 chroma_hpp | 65.92%
addAvg | 27.55% | luma_vss | 46.31% | i444 chroma_hps | 38.30% | i420 chroma_hpp | 65.94%
i422 chroma_vpp | 27.71% | sad_x3 | 46.36% | intra_pred_ang2 | 38.34% | i444 chroma_hps | 66.03%
i420 chroma_vpp | 27.71% | sad_x3 | 46.42% | i444 chroma_hps | 38.37% | i422 chroma_hps | 66.03%
pixel_satd | 27.93% | luma_vps | 46.44% | i444 chroma_vpp | 38.48% | i420 chroma_hps | 66.15%
ssd_s | 28.04% | luma_hpp | 46.46% | copy_pp | 38.51% | i422 chroma_hpp | 66.20%
pixel_satd | 28.10% | i444 chroma_vsp | 46.66% | addAvg | 38.54% | i422 chroma_hps | 66.20%
pixelavg_pp | 28.47% | sad_x3 | 46.71% | nonPsyRdoQuant | 38.57% | i420 chroma_hps | 66.29%
i420 pixel_satd | 28.54% | luma_hpp | 46.82% | sad_x3 | 38.74% | i422 chroma_hpp | 66.32%
i422 pixel_satd | 28.54% | luma_vss | 46.88% | sad_x3 | 38.80% | i444 chroma_hpp | 66.38%
pixel_satd | 28.56% | i422 chroma_hps | 46.99% | sad | 38.84% | i444 chroma_vpp | 66.41%
i420 pixel_satd | 28.56% | intra_pred_ang26 | 47.26% | Weight_sp | 38.86% | i444 chroma_hps | 66.50%
i422 pixel_satd | 28.56% | luma_vps | 47.31% | pixel_satd | 38.88% | i444 chroma_vpp | 66.61%
i444 chroma_vps | 28.75% | luma_hvpp | 47.44% | i420 pixel_satd | 38.88% | i444 chroma_vpp | 66.63%
luma_vps | 28.78% | pixelavg_pp | 47.50% | copy_pp | 38.96% | i444 chroma_hps | 66.64%
luma_vps | 28.82% | luma_vss | 47.64% | i422 sub_ps | 39.19% | i444 chroma_hpp | 66.64%
i422 chroma_hps | 28.86% | luma_vps | 47.69% | i420 sub_ps | 39.34% | i420 chroma_hpp | 66.64%
i420 chroma_hps | 29.02% | i420 chroma_hpp | 47.78% | i420 chroma_hps | 39.47% | i420 chroma_hpp | 66.65%
sad_x3 | 29.04% | i422 chroma_hps | 47.82% | luma_vpp | 39.54% | i444 chroma_hps | 66.71%
i444 chroma_hps | 29.11% | luma_vsp | 47.93% | luma_hvpp | 39.63% | i422 chroma_hpp | 66.71%
luma_vsp | 29.13% | luma_hvpp | 48.30% | i444 chroma_vps | 39.68% | i444 chroma_hps | 66.75%
luma_vss | 29.26% | addAvg | 48.40% | i420 chroma_vps | 39.68% | i444 chroma_hps | 66.91%
i444 chroma_vss | 29.29% | i420 addAvg | 48.40% | luma_hpp | 39.72% | i422 chroma_hpp | 66.92%
luma_vpp | 29.39% | luma_hps | 48.96% | addAvg | 39.77% | i444 chroma_hpp | 67.59%
luma_vss | 29.59% | luma_hps | 49.05% | convert_p2s | 39.79% | i444 chroma_hpp | 67.78%
 |  |  |  | i420 p2s | 39.79% | i420 chroma_hpp | 69.14%
 |  |  |  | i444 p2s | 39.79% | i444 chroma_hpp | 69.23%

Appendix B

1080p Test Clips and Bitrates Used

The following 1080p clips were used for generating test results.

Passerby in a verdant sunny park
park_joy_1080p.y4m

Large crowd of joggers in a park
crowd_run_1080p50.y4m

Ducks lollygagging in a blue pond
ducks_take_off_1080p50.y4m

Urban landscape of an old European city
old_town_cross_1080p50.y4m

4k Test Clips and Bitrates Used

The following 4k clips were used for generating test results.

Vacation panorama
Netflix_Boat_4096x2160_60fps_10bit_420.y4m

Tango aficionados
Netflix_Tango_4096x2160_60fps_10bit_420.y4m

A rural open market
Netflix_FoodMarket_4096x2160_60fps_10bit_420.y4m

 

Appendix C

Configurations for Testing on Intel® Core™ i7-4500U Processor
System Attribute | Value
OS Name | Windows 10 Professional
Version | 10.0.16299 Build 16299
System Model | MS-7A93
System Type | x64-based PC
Processor | Intel® Core™ i7-4500U CPU @ 3.30 GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket | 2
Thread(s) per core | 2
Socket(s) | 1
NUMA node(s) | 1

BIOS
BIOS Version/Date | American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version | 3
BIOS Mode | UEFI

Graphic Interface
Version | PCI-Express
Link Width | x16
Max. Supported | x16

Memory
Type | DDR3
Channel | 1
Size | 8 GB
DRAM Frequency | 800 MHz
Command Rate (CR) | 2T

Configurations for Testing on Intel® Core™ i9-7900X Processor
System Attribute | Value
OS Name | Microsoft Windows 10 Enterprise
Version | 10.0.16299 Build 16299
System Model | MS-7A93
System Type | x64-based PC
Processor | Intel® Core™ i9-7900X CPU at 3.30 GHz, 3312 MHz, 10 Core(s), 20 Logical Processor(s)
Core(s) per socket | 10
Thread(s) per core | 2
Socket(s) | 1
NUMA node(s) | 1

BIOS
BIOS Version/Date | American Megatrends Inc. 1.00, 6/2/2017
SMBIOS Version | 3
BIOS Mode | UEFI

Graphic Interface
Version | PCI-Express
Link Width | x16
Max. Supported | x16

Memory
Type | DDR4
Channel | 2
Size | 32 GB
DRAM Frequency | 1066.8 MHz
Command Rate (CR) | 2T

Configurations for Testing on Intel® Xeon® Platinum 8180 Processor
System Attribute | Value
OS Name | CentOS
Version | 7.2
System Model | Intel S4PR1SY2B
System Type | x86_64
Processor | Intel® Xeon® Platinum 8180 CPU at 2.50 GHz
Core(s) per socket | 28
Thread(s) per core | 2
Socket(s) | 2
NUMA node(s) | 2

BIOS
BIOS Version/Date | SE5C620.86B.0X.01.0007.062120172125 / 06/21/2017
SMBIOS Version | 2.8
BIOS Mode | UEFI

Graphic Interface
Version | PCI-Express
Link Width | x16
Max. Supported | x16

Memory
Type | DDR4
Channel | 2
Size | 192 GB
DRAM Frequency | 1333 MHz
Command Rate (CR) | 2T

References

  1. David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 2nd Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998, p. 751.
  2. VideoLAN Organization, x264, The best H.264/AVC encoder. https://www.videolan.org/developers/x264.html
  3. MulticoreWare Inc., x265 HEVC Encoder/H.265 Video Codec. http://x265.org/
  4. G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
  5. Intel Corporation, Intel Advanced Vector Instructions 512. https://www.intel.in/content/www/in/en/architecture-and-technology/avx-512-overview.html
  6. Intel Corporation, "Intel® Xeon® Processor Scalable Family Specification Update", February, 2018. https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-scalable-spec-update.pdf
  7. x265.org
  8. HandBrake, An Open Source Video Transcoder. https://handbrake.fr/
  9. FFMPEG, A complete, cross-platform solution to record, convert and stream audio and video.
  10. MulticoreWare Inc., "x265 Receives Significant Boost from Intel Xeon Scalable Processor Family." http://x265.org/x265-receives-significant-boost-intel-xeon-scalable-processor-family/