Mitigation Strategies for JCC Microcode

Starting with second generation Intel® Core™ Processors and Intel® Xeon® E3-1200 Series Processors, and continuing in later processor families, Intel introduced a microarchitectural structure called the Decoded ICache (also called the Decoded Streaming Buffer or DSB). The Decoded ICache caches decoded instructions (μops) coming out of the legacy decode pipeline, which speeds program execution.

On some Intel processors (see Affected Processors), conditional branch instructions may exhibit unpredictable behavior under complex microarchitectural conditions involving jump instructions that span 64-byte boundaries. More details can be found in the Intel® Xeon® Processor Scalable Family Specification Update under SKX102.

Intel has released a microcode update (MCU) to fix this issue, known as the Jump Conditional Code (JCC) erratum. The update can, however, cause a performance degradation ranging from 0-4% on certain industry-standard benchmarks.

Many applications will not see a significant performance impact from this MCU. If you suspect that the mitigation is affecting your app’s performance, the instructions below explain how you can determine whether this is the case, and then provide guidance to help you recover some or all of the performance loss.

Performance Monitoring

The JCC erratum MCU workaround causes more misses out of the DSB and more subsequent switches to the legacy decode pipeline. This occurs because branches that cross or end on a 32-byte boundary cannot be filled into the Decoded ICache.

Intel has observed performance effects associated with the workaround ranging from 0-4% on many industry-standard benchmarks[1]. In subcomponents of these benchmarks, Intel has observed outliers higher than the 0-4% range. Other workloads not observed by Intel may behave differently. Intel has in turn developed software-based tools to minimize the impact on potentially affected applications and workloads.

The potential performance impact of the JCC erratum mitigation arises from two different sources:

  1. A switch penalty that occurs when executing in the Decoded ICache and switching over to the legacy decode pipeline.
  2. Inefficiencies that occur when executing from the legacy decode pipeline that are potentially hidden by the Decoded ICache.

Performance Monitoring Events to look for

Tools like Intel® VTune™ Profiler can be used to locate performance bottlenecks, and more specifically can find locations in the code where the legacy decode pipeline is being used, rather than the DSB.

The list below describes critical events that can be used to compare performance before and after the MCU to determine whether the MCU causes any performance impact on your workloads.

Collect the following events to detect the performance effects of the MCU:

  • CPU_CLK_UNHALTED.THREAD = Core clock cycles in C0.
  • IDQ.DSB_UOPS = μops coming from the Decoded ICache.
  • DSB2MITE_SWITCHES.PENALTY_CYCLES = Penalty cycles introduced into the pipeline by switching from the Decoded ICache to the legacy decode pipeline.
  • FRONTEND_RETIRED.DSB_MISS_PS = Precise front-end retired event for DSB misses. The event is tagged within the 64-byte region where the DSB miss occurs.
  • IDQ.MS_UOPS = μops coming from the microcode sequencer.
  • IDQ.MITE_UOPS = μops coming from the legacy decode pipeline (also called the Micro Instruction Translation Engine, or MITE).
  • LSD.UOPS = μops coming from the Loop Stream Detector (LSD).

Note: The LSD is only available on some cores. The LSD.UOPS event can be excluded from calculations if not present as an event.
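
One way to compare runs before and after the MCU is to reduce the events above to two ratios: the share of μops delivered by each front-end source, and the share of core cycles spent on DSB-to-MITE switch penalties. The sketch below assumes the raw counts have already been collected and exported from a profiler. The function names, the dictionary layout, and the sample numbers are hypothetical, and the ratios are a common way to read these events rather than a formula defined in this article.

```python
def uop_source_shares(counts):
    """Return the fraction of delivered uops that came from each front-end source."""
    sources = {
        "DSB": counts["IDQ.DSB_UOPS"],
        "MITE": counts["IDQ.MITE_UOPS"],
        "MS": counts["IDQ.MS_UOPS"],
        # LSD.UOPS may not exist on cores without an LSD; treat it as zero.
        "LSD": counts.get("LSD.UOPS", 0),
    }
    total = sum(sources.values())
    if total == 0:
        raise ValueError("no uops were counted")
    return {name: value / total for name, value in sources.items()}


def switch_penalty_share(counts):
    """Return DSB-to-MITE switch penalty cycles as a fraction of core cycles."""
    return counts["DSB2MITE_SWITCHES.PENALTY_CYCLES"] / counts["CPU_CLK_UNHALTED.THREAD"]


# Made-up counts for illustration only; these are not measured data.
before = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "IDQ.DSB_UOPS": 800_000,
    "IDQ.MITE_UOPS": 150_000,
    "IDQ.MS_UOPS": 50_000,
    "DSB2MITE_SWITCHES.PENALTY_CYCLES": 10_000,
}
after = {
    "CPU_CLK_UNHALTED.THREAD": 1_040_000,
    "IDQ.DSB_UOPS": 600_000,
    "IDQ.MITE_UOPS": 350_000,
    "IDQ.MS_UOPS": 50_000,
    "DSB2MITE_SWITCHES.PENALTY_CYCLES": 52_000,
}

for label, counts in (("before MCU", before), ("after MCU", after)):
    shares = uop_source_shares(counts)
    print(label, {k: round(v, 3) for k, v in shares.items()},
          "penalty share:", round(switch_penalty_share(counts), 3))
```

A drop in the DSB share together with a rise in the MITE share and the penalty share after the MCU suggests that the workload is affected and may benefit from the alignment options described later.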

Figure 1 below is an example from Intel® VTune™ Profiler showing a disassembled function which contains a loop where the macrofused cmp + jnz is crossing a 32-byte boundary, causing a performance loss. This function shows an increase in IDQ.MITE_UOPS and a decrease in IDQ.DSB_UOPS.

To further understand how the legacy decode pipeline and DSB can impact the performance of the front end of the CPU pipeline, refer to Understanding the Instruction Pipeline and other articles in the Get Started with Intel® VTune™ Profiler guide.

Software Guidance and Optimization Methods

Software can compensate for the performance effects of the mitigation for this erratum with optimizations that align the code such that jump instructions (and macro-fused jump instructions) do not cross 32-byte boundaries or end on 32-byte boundaries. Aligning the code in this way can reduce or eliminate the performance penalty caused by execution transitioning from Decoded ICache to the legacy decode pipeline.

In the following code example, the two-byte jump instruction jae starting at offset 1f spans a 32-byte boundary and can cause a transition from the Decoded ICache to the legacy decode pipeline.

Code without JCC mitigation

0000000000000000 <fn1>:
   0:        55                           push   %rbp
   1:        41 54                        push   %r12
   3:        48 89 e5                     mov    %rsp,%rbp
   6:        c5 f8 10 04 0f               vmovups (%rdi,%rcx,1),%xmm0
   b:        c5 f8 11 04 0a               vmovups %xmm0,(%rdx,%rcx,1)
  10:        c5 f8 10 44 0f 10            vmovups 0x10(%rdi,%rcx,1),%xmm0
  16:        c5 f8 11 44 0a 10            vmovups %xmm0,0x10(%rdx,%rcx,1)
  1c:        48 39 fe                     cmp    %rdi,%rsi
  1f:        73 09                        jae    2a <fn1+0x2a>
  21:        e8 00 00 00 00               callq  26 <fn1+0x26>
  26:        41 5c                        pop    %r12
  28:        c9                           leaveq
  29:        c3                           retq
  2a:        e8 00 00 00 00               callq  2f <fn1+0x2f>
  2f:        41 5c                        pop    %r12
  31:        c9                           leaveq
  32:        c3                           retq

Intel’s advice to software developers is to align the jae instruction so that it does not cross a 32-byte boundary. In the example, this is done by adding the benign prefix 0x2e four times before the first push %rbp instruction so that the cmp instruction, which started at offset 1c, will instead start at offset 20. Hence the macro-fused cmp + jae instruction will not cross a 32-byte boundary.

Code with JCC mitigation

0000000000000000 <fn1>:
   0:        2e 2e 2e 2e 55               cs cs cs cs push %rbp
   5:        41 54                        push   %r12
   7:        48 89 e5                     mov    %rsp,%rbp
   a:        c5 f8 10 04 0f               vmovups (%rdi,%rcx,1),%xmm0
   f:        c5 f8 11 04 0a               vmovups %xmm0,(%rdx,%rcx,1)
  14:        c5 f8 10 44 0f 10            vmovups 0x10(%rdi,%rcx,1),%xmm0
  1a:        c5 f8 11 44 0a 10            vmovups %xmm0,0x10(%rdx,%rcx,1)
  20:        48 39 fe                     cmp    %rdi,%rsi
  23:        73 09                        jae    2e <fn1+0x2e>
  25:        e8 00 00 00 00               callq  2a <fn1+0x2a>
  2a:        41 5c                        pop    %r12
  2c:        c9                           leaveq 
  2d:        c3                           retq   
  2e:        e8 00 00 00 00               callq  33 <fn1+0x33>
  33:        41 5c                        pop    %r12
  35:        c9                           leaveq 
  36:        c3                           retq   
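
The alignment rule used in the two listings can be checked with a few lines of arithmetic. The sketch below is an illustration, not part of any Intel tool: the helper names are hypothetical, and it treats a macro-fused pair as one unit whose length is the sum of both instructions. It tests whether a branch crosses or ends on a 32-byte boundary and searches for the smallest amount of padding that moves it clear.

```python
BOUNDARY = 32  # bytes; the alignment granularity described in this article


def touches_boundary(start, length, boundary=BOUNDARY):
    """True if the bytes [start, start + length) cross or end on a boundary."""
    end = start + length  # offset of the first byte after the instruction
    crosses = start // boundary != (end - 1) // boundary
    ends_on = end % boundary == 0
    return crosses or ends_on


def padding_needed(start, length, boundary=BOUNDARY):
    """Smallest number of bytes to insert before the branch so it is clear."""
    if length >= boundary:
        raise ValueError("branch is too long to fit inside one boundary window")
    pad = 0
    while touches_boundary(start + pad, length, boundary):
        pad += 1
    return pad


# Offsets and sizes taken from the fn1 listings: jae is 2 bytes at 0x1f, and
# the macro-fused cmp + jae pair is 3 + 2 = 5 bytes starting at 0x1c.
print(touches_boundary(0x1F, 2))  # True: jae alone spans offsets 0x1f-0x20
print(touches_boundary(0x1C, 5))  # True: the fused pair spans 0x1c-0x20
print(padding_needed(0x1C, 5))    # 4: matches the four 0x2e prefixes
print(touches_boundary(0x20, 5))  # False: the padded pair spans 0x20-0x24
```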
 

Software tools to improve performance

Intel has worked with the community on tools to help developers align branches, and has observed that recompiling software with the updated tools can help recover most of the performance loss that might be otherwise observed in selected applications.

There are two padding mechanisms for JCC mitigation alignment:

  • Inserting nop instructions
  • Inserting meaningless prefixes before instructions (prefix padding)

Theoretically, prefix padding can provide better performance because it reduces the number of nop instructions, therefore raising the DSB hit rate. In our experiments, we observed that prefix padding is slightly better than nop padding in general, and may provide much better performance in some outliers.

In general, we suggest developers try nop padding first, as it’s easier to start with. Developers can make their own choice whether to use nop padding or prefix padding for their applications. If you still observe significant performance drops after recompiling the application with the JCC mitigation, Intel recommends aligning all branch types with -malign-branch=jcc+fused+jmp+call+ret+indirect.

GNU assembler options are available in binutils 2.34. LLVM nop padding was implemented in LLVM 10.0.0, followed by prefix padding in the LLVM 11 main trunk. ICC has supported prefix padding since the 19.1 release.

In the following sections, we summarize some options you can use with the GNU assembler, and then compare the GNU assembler options with options available in the Intel and LLVM compilers.

Options for GNU assembler

-mbranches-within-32B-boundaries

This is the recommended option for affected processors[2]. This option aligns conditional jumps, fused conditional jumps, and unconditional jumps within a 32-byte boundary with up to 5 segment prefixes on an instruction. It is equivalent to the following:

  • -malign-branch-boundary=32
  • -malign-branch=jcc+fused+jmp
  • -malign-branch-prefix-size=5

The default doesn't align branches.

-malign-branch-boundary=NUM

This option controls how the assembler should align branches with segment prefixes or NOP. NUM must be a power of 2. Branches will be aligned within the NUM byte boundary. The default -malign-branch-boundary=0 doesn't align branches.

-malign-branch=TYPE[+TYPE...]

This option specifies the types of branches to align. TYPE is a combination of the following:

  • jcc, which aligns conditional jumps.
  • fused, which aligns fused conditional jumps.
  • jmp, which aligns unconditional jumps.
  • call, which aligns calls.
  • ret, which aligns returns.
  • indirect, which aligns indirect jumps and calls.

The default is -malign-branch=jcc+fused+jmp.

-malign-branch-prefix-size=NUM

This option specifies the maximum number of prefixes on an instruction to align branches. NUM should be between 0 and 5. The default NUM is 5.
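
To make the interaction between these options concrete, the sketch below models how a list of options resolves to effective settings. It is a simplified model written for this article, not the assembler's actual option parser. The function name resolve and the settings dictionary are hypothetical, and the left-to-right override rule is taken from the "latter option overrides the former" principle listed for the GNU assembler in Table 1.

```python
VALID_TYPES = {"jcc", "fused", "jmp", "call", "ret", "indirect"}


def _value(option):
    """Return the text after '=' in an option such as '-malign-branch=jcc'."""
    return option.split("=", 1)[1]


def resolve(options):
    """Apply options left to right; a later option overrides an earlier one."""
    # Defaults as described above: no alignment, jcc+fused+jmp, up to 5 prefixes.
    settings = {"boundary": 0, "branch": ["jcc", "fused", "jmp"], "prefix_size": 5}
    for option in options:
        if option == "-mbranches-within-32B-boundaries":
            settings.update(boundary=32, branch=["jcc", "fused", "jmp"], prefix_size=5)
        elif option.startswith("-malign-branch-boundary="):
            num = int(_value(option))
            if num < 0 or num & (num - 1):
                raise ValueError("NUM must be a power of 2 (or 0 to disable)")
            settings["boundary"] = num
        elif option.startswith("-malign-branch-prefix-size="):
            num = int(_value(option))
            if not 0 <= num <= 5:
                raise ValueError("NUM should be between 0 and 5")
            settings["prefix_size"] = num
        elif option.startswith("-malign-branch="):
            types = _value(option).split("+")
            unknown = set(types) - VALID_TYPES
            if unknown:
                raise ValueError(f"unknown branch type(s): {sorted(unknown)}")
            settings["branch"] = types
        else:
            raise ValueError(f"unrecognized option: {option}")
    return settings


# NOP-only padding: enable the 32-byte alignment, then disable prefixes.
nop_only = resolve(["-mbranches-within-32B-boundaries", "-malign-branch-prefix-size=0"])
print(nop_only)
```

Because options are applied in order, placing -malign-branch-prefix-size=0 before -mbranches-within-32B-boundaries in this model would restore the prefix size to 5.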

​Differences between GNU Assembler, ICC, and Clang options

There are some differences between the options available in GNU Assembler, ICC, and Clang because each compiler uses a different implementation. The table below lists the different options in each compiler. Refer to the descriptions of the GNU assembler options and the comparable options listed here to find the right options to use with your compiler.

Table 1: GNU, ICC and Clang compiler options for JCC mitigation alignment
Tool | Prefix padding | NOP padding | Fine-grained options | Override principle
GNU Assembler | -mbranches-within-32B-boundaries | -mbranches-within-32B-boundaries -malign-branch-prefix-size=0 | -malign-branch-prefix-size, -malign-branch, -malign-branch-boundary | The latter option overrides the former option.
ICC | -mbranches-within-32B-boundaries | None | None | None
Clang | -mbranches-within-32B-boundaries | -mbranches-within-32B-boundaries | -mpad-max-prefix-size, -malign-branch | The fine-grained option overrides the general option.

Affected Processors

To find the mapping between a processor's CPUID and its Family/Model number, refer to the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A, Table 3-8, and the "INPUT EAX = 01H: Returns Model, Family, Stepping Information" section.
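
As an illustration of that mapping, the sketch below decodes the value returned in EAX by CPUID leaf 01H into the Family_Model and Stepping notation used in Table 2. The bit layout follows the CPUID description in the Software Developer's Manual. The function names and the sample EAX values are chosen for this example, and obtaining the raw EAX value from the processor is outside the scope of the sketch.

```python
def family_model_stepping(eax):
    """Decode CPUID.01H:EAX into (display family, display model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family is 0FH.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model supplies the high nibble for families 06H and 0FH.
    model = (ext_model << 4) | base_model if base_family in (0x6, 0xF) else base_model
    return family, model, stepping


def table_key(eax):
    """Format the decoded values the way Table 2 writes them."""
    family, model, stepping = family_model_stepping(eax)
    return f"{family:02X}_{model:02X}H", f"{stepping:X}"


# Sample EAX values chosen for illustration.
print(table_key(0x50654))  # ('06_55H', '4')
print(table_key(0x806EC))  # ('06_8EH', 'C')
```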

Table 2: Processors potentially affected by JCC erratum
Family_Model Stepping Processor Families/Processor Number series
06_8EH 9 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Amber Lake Y
06_8EH C 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Amber Lake Y
06_55H 7 2nd Generation Intel® Xeon® Scalable Processors based on microarchitecture code name Cascade Lake (server)
06_9EH A 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake H
06_9EH A 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S
06_8EH A 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake U43e
06_9EH B 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (4+2)
06_9EH B Intel® Celeron® Processor G Series based on microarchitecture code name Coffee Lake S (4+2)
06_9EH A 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (6+2) x/KBP
06_9EH A Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (6+2)[3]
06_9EH A Intel® Xeon® Processor E Family based on microarchitecture code name Coffee Lake S (4+2)[4]
06_9EH D 9th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake H (8+2)
06_9EH D 9th Generation Intel® Core™ Processor Family based on microarchitecture code name Coffee Lake S (8+2)
06_8EH C 10th Generation Intel® Core™ Processor Family based on microarchitecture code name Comet Lake U42
06_A6H 0 10th Generation Intel® Core™ Processor Family based on microarchitecture code name Comet Lake U62
06_9EH 9 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake G
06_9EH 9 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake H
06_8EH A 8th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake Refresh U (4+2)
06_9EH 9 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake S
06_8EH 9 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake U
06_8EH 9 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake U23e
06_9EH 9 Intel® Core™ X-series Processors based on microarchitecture code name Kaby Lake X
06_9EH 9 Intel® Xeon® Processor E3 v6 Family based on microarchitecture code name Kaby Lake Xeon E3
06_8EH 9 7th Generation Intel® Core™ Processor Family based on microarchitecture code name Kaby Lake Y
06_55H 4 Intel® Xeon® Processor D Family based on microarchitecture code name Skylake D, Bakerville
06_5EH 3 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake H
06_5EH 3 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake S
06_55H 4 Intel® Xeon® Scalable Processors based on microarchitecture code name Skylake Server
06_4EH 3 6th Generation Intel® Core™ Processors based on microarchitecture code name Skylake U
06_4EH 3 6th Generation Intel® Core™ Processor Family based on microarchitecture code name Skylake U23e
06_55H 4 Intel® Xeon® Processor W Family based on microarchitecture code name Skylake W
06_55H 4 Intel® Core™ X-series Processors based on microarchitecture code name Skylake X
06_5EH 3 Intel® Xeon® Processor E3 v5 Family based on microarchitecture code name Skylake Xeon E3
06_4EH 3 6th Generation Intel® Core™ Processors based on microarchitecture code name Skylake Y
06_8EH B 8th Generation Intel® Core™ Processors based on microarchitecture code name Whiskey Lake U
06_8EH C 8th Generation Intel® Core™ Processors based on microarchitecture code name Whiskey Lake U

Footnotes

  1. Data measured on Intel internal reference platform for research/educational purposes.
    Server benchmarks include:
    • SPECrate2017_int_base, compiled with Intel Compiler Version 19 update 4
    • SPECrate2017_fp_base, compiled with Intel Compiler Version 19 update 4
    • Linpack, Stream Triad, FIO (rand7030_4K_04_workers_Q32/seq7030_64K_04_workers_Q32)
    • HammerDB-Postgres
    • SPECjbb2015
    • SPECvirt
    Client benchmarks include:
    • SPECrate2017_int_base, compiled with Intel Compiler Version 19 update 4
    • SPECrate2017_fp_base, compiled with Intel Compiler Version 19 update 4
    • SYSmark 2018
    • PCMark 10
    • 3DMark Sky Diver
    • WebXPRT v3
    • Cinebench R20
  2. Note that some processors which are not affected may take longer to decode instructions with more than 3 or 4 prefixes (for example Silvermont and Goldmont processors as noted in the Intel® 64 and IA-32 Architectures Optimization Reference Manual).
  3. Workstation, server, and desktop included
  4. Workstation, server, and desktop included

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel provides these materials as-is, with no express or implied warranties.

Intel, the Intel logo, Intel Core, Intel Atom, Intel Xeon, Intel Xeon Phi, Intel® C Compiler, Intel Software Guard Extensions, and Intel® Trusted Execution Engine are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.