Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer.

No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Intel does not guarantee the availability of these interfaces in any future product. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.

Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 1997-2018, Intel Corporation. All Rights Reserved.
# Revision History

<table>
<thead>
<tr>
<th>Revision</th>
<th>Description</th>
<th>Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>-025</td>
<td>
<ul>
<li>Removed instructions that now reside in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.</li>
<li>Minor updates to chapter 1.</li>
<li>Updates to Table 2-1, Table 2-2 and Table 2-8 (leaf 07H) to indicate support for AVX512_4VNNIW and AVX512_4FMAPS.</li>
<li>Minor update to Table 2-8 (leaf 15H) regarding ECX definition.</li>
<li>Minor updates to Section 4.6.2 and Section 4.6.3 to clarify the effects of "suppress all exceptions".</li>
<li>Footnote addition to CLWB instruction indicating operand encoding requirement.</li>
<li>Removed PCOMMIT.</li>
</ul>
</td>
<td>September 2016</td>
</tr>
<tr>
<td>-026</td>
<td>
<ul>
<li>Removed CLWB instruction; it now resides in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.</li>
<li>Added additional 512-bit instruction extensions in chapter 6.</li>
</ul>
</td>
<td>October 2016</td>
</tr>
<tr>
<td>-027</td>
<td>
<ul>
<li>Added TLB CPUID leaf in chapter 2.</li>
<li>Added VPOPCNTD/Q instruction in chapter 6, and CPUID details in chapter 2.</li>
</ul>
</td>
<td>December 2016</td>
</tr>
<tr>
<td>-028</td>
<td>
<ul>
<li>Updated intrinsics for VPOPCNTD/Q instruction in chapter 6.</li>
</ul>
</td>
<td>December 2016</td>
</tr>
<tr>
<td>-029</td>
<td>
<ul>
<li>Corrected typo in CPUID leaf 18H.</li>
<li>Updated operand encoding table format; extracted tuple information from operand encoding.</li>
<li>Added VPERMB back into chapter 5; inadvertently removed.</li>
<li>Moved all instructions from chapter 6 to chapter 5.</li>
<li>Updated operation section of VPMULTISHIFTQB.</li>
</ul>
</td>
<td>April 2017</td>
</tr>
<tr>
<td>-030</td>
<td>
<ul>
<li>Removed unnecessary information from document (chapters 2, 3 and 4).</li>
<li>Added table listing recent instruction set extensions introduction in Intel 64 and IA-32 Processors.</li>
<li>Updated CPUID instruction with additional details.</li>
<li>Added the following instructions: GF2P8AFFINEINVQB, GF2P8AFFINEQB, GF2P8MULB, VAESDEC, VAESDECLAST, VAESENC, VAESENCLAST, VPCLMULQDQ, VPCOMPRESS, VPDPBUSD, VPDPBUSDS, VPDPWSSD, VPDPWSSDS, VPEXPAND, VPOPCNT, VPSHLD, VPSHLDV, VPSHRD, VPSHRDV, VPSHUFBITQMB.</li>
<li>Removed the following instructions: VPMADD52HUQ, VPMADD52LUQ, VPERMB, VPERMI2B, VPERMT2B, and VPMULTISHIFTQB. They can be found in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A, 2B, 2C & 2D.</li>
<li>Moved instructions unique to processors based on the Knights Mill microarchitecture to chapter 3.</li>
<li>Added chapter 4: EPT-Based Sub-Page Permissions.</li>
<li>Added chapter 5: Intel® Processor Trace: VMX Improvements.</li>
</ul>
</td>
<td>October 2017</td>
</tr>
<tr>
<td>-031</td>
<td>
<ul>
<li>Updated change log to correct typo in changes from previous release.</li>
<li>Updated instructions with imm8 operand missing in operand encoding table.</li>
<li>Replaced "VLMAX" with "MAXVL" to align terminology used across documentation.</li>
<li>Added back information on detection of Intel AVX-512 instructions.</li>
<li>Added Intel® Memory Encryption Technologies instructions PCONFIG and WBNOINVD. These instructions are also added to Table 1-1 "Recent Instruction Set Extensions Introduction in Intel 64 and IA-32 Processors". Added Section 1.5 "Detection of Intel® Memory Encryption Technologies (Intel® MKTME) Instructions".</li>
<li>CPUID instruction updated with PCONFIG and WBNOINVD details.</li>
<li>CPUID instruction updated with additional details on leaf 07H: Intel® Xeon Phi™ only features identified and listed.</li>
<li>CPUID instruction updated with new Intel® SGX features in leaf 12H.</li>
<li>CPUID instruction updated with new PCONFIG information sub-leaf 1BH.</li>
<li>Updated short descriptions in the following instructions: VPDPBUSD, VPDPBUSDS, VPDPWSSD and VPDPWSSDS.</li>
<li>Corrections and clarifications in Chapter 4 "EPT-Based Sub-Page Permissions".</li>
<li>Corrections and clarifications in Chapter 5 "Intel® Processor Trace: VMX Improvements".</li>
</ul>
</td>
<td>January 2018</td>
</tr>
<tr>
<td>-032</td>
<td>
<ul>
<li>Corrected PCONFIG CPUID feature flag on instruction page.</li>
<li>Minor updates to PCONFIG instruction pages: Changed Table 2-2 to use Hex notation; changed "RSVD, MBZ" to "Reserved, must be zero" in two places in Table 2-3.</li>
<li>Minor typo correction in WBNOINVD instruction description.</li>
</ul>
</td>
<td>January 2018</td>
</tr>
<tr>
<td>-033</td>
<td>
<ul>
<li>Updated Table 1-1 "Recent Instruction Set Extensions / Features Introduction in Intel 64 and IA-32 Processors".</li>
<li>Added Section 1.6, "Detection of Future Instructions".</li>
<li>Added CLDEMOTE, MOVDIRI, MOVDIR64B, TPAUSE, UMONITOR and UMWAIT instructions.</li>
<li>Updated the CPUID instruction with details on new instructions/features added, as well as new power management details and information on hardware feedback interface ISA extensions.</li>
<li>Corrections to PCONFIG instruction.</li>
<li>Moved instructions unique to processors based on the Knights Mill microarchitecture to the Intel® 64 and IA-32 Architectures Software Developer’s Manual.</li>
<li>Added Chapter 5 "Hardware Feedback Interface ISA Extensions".</li>
<li>Added Chapter 6 "AC Split Lock Detection".</li>
</ul>
</td>
<td>March 2018</td>
</tr>
<tr>
<td>-034</td>
<td>
<ul>
<li>Added clarification to leaf 07H in the CPUID instruction.</li>
<li>Added MSR index for IA32_UMWAIT_CONTROL MSR.</li>
<li>Updated registers in TPAUSE and UMWAIT instructions.</li>
<li>Updated TPAUSE and UMWAIT intrinsics.</li>
</ul>
</td>
<td>May 2018</td>
</tr>
</tbody>
</table>
REVISION HISTORY

CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND FEATURES

1.1 About This Document
1.2 Instruction Set Extensions and Feature Introduction in Intel 64 and IA-32 Processors
1.3 Detection of AVX-512 Foundation Instructions
1.4 Detection of 512-bit Instruction Groups of Intel® AVX-512 Family
1.5 Detection of Intel® Memory Encryption Technologies (Intel® MKTME) Instructions
1.6 Detection of Future Instructions
1.7 CPUID Instruction
    CPUID—CPU Identification
1.8 Compressed Displacement (disp8*N) Support in EVEX

CHAPTER 2
INSTRUCTION SET REFERENCE, A-Z

2.1 Instruction Set Reference
    CLDEMOTE—Cache Line Demote
    GF2P8AFFINEINVQB—Galois Field Affine Transformation Inverse
    GF2P8AFFINEQB—Galois Field Affine Transformation
    GF2P8MULB—Galois Field Multiply Bytes
    MOVDIRI—Move Doubleword as Direct Store
    MOVDIR64B—Move 64 Bytes as Direct Store
    PCONFIG—Platform Configuration
    TPAUSE—Timed PAUSE
    UMONITOR—User Level Set Up Monitor Address
    UMWAIT—User Level Monitor Wait
    VAESDEC—Perform One Round of an AES Decryption Flow
    VAESDECLAST—Perform Last Round of an AES Decryption Flow
    VAESENC—Perform One Round of an AES Encryption Flow
    VAESENCLAST—Perform Last Round of an AES Encryption Flow
    VPCLMULQDQ—Carry-Less Multiplication Quadword
    VPCOMPRESS—Store Sparse Packed Byte/Word Integer Values into Dense Memory/Register
    VPDPBUSD—Multiply and Add Unsigned and Signed Bytes
    VPDPBUSDS—Multiply and Add Unsigned and Signed Bytes with Saturation
    VPDPWSSD—Multiply and Add Signed Word Integers
    VPDPWSSDS—Multiply and Add Word Integers with Saturation
    VPEXPAND—Expand Byte/Word Values
    VPOPCNT—Return the Count of Number of Bits Set to 1 in BYTE/WORD/DWORD/QWORD
    VPSHLD—Concatenate and Shift Packed Data Left Logical
    VPSHLDV—Concatenate and Variable Shift Packed Data Left Logical
    VPSHRD—Concatenate and Shift Packed Data Right Logical
    VPSHRDV—Concatenate and Variable Shift Packed Data Right Logical
    VPSHUFBITQMB—Shuffle Bits from Quadword Elements Using Byte Indexes into Mask
    WBNOINVD—Write Back and Do Not Invalidate Cache

CHAPTER 3
EPT-BASED SUB-PAGE PERMISSIONS

3.1 Introduction
3.2 VMCS Changes
3.3 Changes to EPT Paging-Structure Entries
3.4 Changes to Guest-Physical Accesses
3.5 Sub-Page Permission Table
    3.5.1 SPPT Overview
    3.5.2 Operation of SPPT-based Write-Permission
    3.5.3 SPP-Induced VM Exits
    3.5.3.1 Sub-Page Permissions and EPT Violations
    3.5.4 Invalidating Cached SPP Permissions
<table>
<thead>
<tr>
<th>FIGURE</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Figure 1-1.</td>
<td>Procedural Flow of Application Detection of AVX-512 Foundation Instructions</td>
</tr>
<tr>
<td>Figure 1-2.</td>
<td>Procedural Flow of Application Detection of 512-bit Instruction Groups</td>
</tr>
<tr>
<td>Figure 1-3.</td>
<td>Version Information Returned by CPUID in EAX</td>
</tr>
<tr>
<td>Figure 1-4.</td>
<td>Feature Information Returned in the ECX Register</td>
</tr>
<tr>
<td>Figure 1-5.</td>
<td>Feature Information Returned in the EDX Register</td>
</tr>
<tr>
<td>Figure 1-6.</td>
<td>Determination of Support for the Processor Brand String</td>
</tr>
<tr>
<td>Figure 1-7.</td>
<td>Algorithm for Extracting Maximum Processor Frequency</td>
</tr>
</tbody>
</table>

CHAPTER 1
FUTURE INTEL® ARCHITECTURE INSTRUCTION EXTENSIONS AND FEATURES

1.1 ABOUT THIS DOCUMENT

This document describes the software programming interfaces of Intel® architecture instruction extensions and features which may be included in future Intel processor generations. Intel does not guarantee the availability of these interfaces and features in any future product.

The instruction set extensions cover a diverse range of application domains and programming usages. The 512-bit SIMD vector extensions, referred to as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions, deliver a comprehensive set of functionality and higher performance than Intel® AVX and Intel® AVX2 instructions. Intel AVX, Intel AVX2 and many Intel AVX-512 instructions are covered in the Intel® 64 and IA-32 Architectures Software Developer’s Manual sets. The reader can refer to them for basic and more detailed background information on the features referenced in this document.

The base of the 512-bit SIMD instruction extensions is referred to as the Intel AVX-512 Foundation instructions. They include extensions of the AVX and AVX2 family of SIMD instructions but are encoded using a new encoding scheme with support for 512-bit vector registers, up to 32 vector registers in 64-bit mode, and conditional processing using opmask registers.

Chapter 2 is devoted to additional 512-bit instruction extensions in the Intel AVX-512 family targeting broad application domains, and instruction set extensions encoded using the EVEX prefix encoding scheme to operate at vector lengths smaller than 512 bits.

Chapter 3 describes EPT-Based Sub-Page Permissions.

Chapter 4 describes Intel® Processor Trace: VMX Improvements.

Chapter 5 describes Hardware Feedback Interface ISA Extensions.

Chapter 6 describes Split Lock Detection.

1.2 INSTRUCTION SET EXTENSIONS AND FEATURE INTRODUCTION IN INTEL 64 AND IA-32 PROCESSORS

Recent instruction set extensions and features are listed in Table 1-1. Within these groups, most instructions and features are collected into functional subgroups.

Table 1-1. Recent Instruction Set Extensions / Features Introduction in Intel 64 and IA-32 Processors

<table>
<thead>
<tr>
<th>Instruction Set Architecture</th>
<th>Processor Generation Introduction</th>
<th>Supported in Microarchitecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>SSE4.1 Extensions</td>
<td>Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400, 7500 series, Intel® Core™ 2 Extreme processors X9000 series, Intel® Core™ 2 Quad processor Q9000 series, Intel® Core™ 2 Duo processors 8000 series, T9000 series.</td>
<td>Legacy and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
<tr>
<td>SSE4.2 Extensions, CRC32, POPCNT</td>
<td>Intel® Core™ i7 965 processor, Intel® Xeon® processors X3400, X3500, X5500, X6500, X7500 series.</td>
<td>Legacy and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
</tbody>
</table>
Table 1-1. Recent Instruction Set Extensions / Features Introduction in Intel 64 and IA-32 Processors (Continued)

<table>
<thead>
<tr>
<th>Instruction Set Architecture</th>
<th>Processor Generation Introduction</th>
<th>Supported in Microarchitecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>AESNI, PCLMULQDQ</td>
<td>Intel® Xeon® processor E7 series, Intel® Xeon® processors X3600, X5600, Intel® Core™ i7 980X processor. Use CPUID to verify presence of AESNI and PCLMULQDQ across Intel® Core™ processor families.</td>
<td>Westmere and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
<tr>
<td>Intel AVX</td>
<td>Intel® Xeon® processor E3 and E5 families. 2nd Generation Intel® Core™ i7, i5, i3 processor 2xxx families.</td>
<td>Sandy Bridge and later</td>
</tr>
<tr>
<td>F16C</td>
<td>3rd Generation Intel® Core™ processors, Intel® Xeon® processor E3-1200 v2 product family, Next Generation Intel® Xeon® processors, Intel® Xeon® processor E5 v2 and E7 v2 families.</td>
<td>Ivy Bridge and later</td>
</tr>
<tr>
<td>RDRAND</td>
<td>3rd Generation Intel® Core™ processors, Intel® Xeon® processor E3-1200 v2 product family, Next Generation Intel® Xeon® processors, Intel® Xeon® processor E5 v2 and E7 v2 families.</td>
<td>Ivy Bridge and later</td>
</tr>
<tr>
<td>FS/GS base access</td>
<td>3rd Generation Intel® Core™ processors, Intel® Xeon® processor E3-1200 v2 product family, Next Generation Intel® Xeon® processors, Intel® Xeon® processor E5 v2 and E7 v2 families.</td>
<td>Ivy Bridge and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
<tr>
<td>FMA, AVX2, BMI1, BMI2, INVPCID, LZCNT, TSX</td>
<td>Intel® Xeon® processor E3/E5/E7 v3 product families. 4th Generation Intel® Core™ processor family.</td>
<td>Haswell and later</td>
</tr>
<tr>
<td>MOVBE</td>
<td>Intel® Xeon® processor E3/E5/E7 v3 product families. 4th Generation Intel® Core™ processor family.</td>
<td>Haswell and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
<tr>
<td>PREFETCHW</td>
<td>Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family.</td>
<td>Broadwell and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Silvermont and later</td>
</tr>
<tr>
<td>ADX</td>
<td>Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family.</td>
<td>Broadwell and later</td>
</tr>
<tr>
<td>CLAC, STAC</td>
<td>Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family.</td>
<td>Broadwell and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td>RDSEED</td>
<td>Intel® Core™ M processor family; 5th Generation Intel® Core™ processor family.</td>
<td>Broadwell and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td>AVX512ER, AVX512PF, PREFETCHWT1</td>
<td>Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series.</td>
<td>Knights Landing</td>
</tr>
<tr>
<td>AVX512F, AVX512CD</td>
<td>Intel® Xeon Phi™ Processor 3200, 5200, 7200 Series.</td>
<td>Knights Landing</td>
</tr>
<tr>
<td></td>
<td>Intel® Xeon® Processor Scalable Family.</td>
<td>Skylake Server and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Cannon Lake and later</td>
</tr>
</tbody>
</table>
Table 1-1. Recent Instruction Set Extensions / Features Introduction in Intel 64 and IA-32 Processors (Continued)

<table>
<thead>
<tr>
<th>Instruction Set Architecture</th>
<th>Processor Generation Introduction</th>
<th>Supported in Microarchitecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>CLFLUSHOPT, XSAVEC, XSAVES, MPX</td>
<td>Intel® Xeon® Processor Scalable Family.</td>
<td>Skylake Server and later</td>
</tr>
<tr>
<td></td>
<td>6th Generation Intel® Core™ processor family.</td>
<td>Skylake and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td>SGX1</td>
<td>6th Generation Intel® Core™ processor family.</td>
<td>Skylake and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont Plus and later</td>
</tr>
<tr>
<td>AVX512DQ, AVX512BW, AVX512VL</td>
<td>Intel® Xeon® Processor Scalable Family.</td>
<td>Skylake Server and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Cannon Lake and later</td>
</tr>
<tr>
<td>CLWB</td>
<td>Intel® Xeon® Processor Scalable Family.</td>
<td>Skylake Server and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Ice Lake and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
<tr>
<td>PKU</td>
<td>Intel® Xeon® Processor Scalable Family.</td>
<td>Skylake Server and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Cannon Lake and later</td>
</tr>
<tr>
<td>AVX512_IFMA, AVX512_VBMI</td>
<td>TBD</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td>SHA-NI</td>
<td>TBD</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td></td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont and later</td>
</tr>
<tr>
<td>UMIP</td>
<td>TBD</td>
<td>Goldmont Plus and later</td>
</tr>
<tr>
<td>PTWRITE</td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont Plus and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Cannon Lake and later</td>
</tr>
<tr>
<td>RDPID</td>
<td>Intel® Atom™ processor.</td>
<td>Goldmont Plus and later</td>
</tr>
<tr>
<td>AVX512_4FMAPS, AVX512_4VNNIW</td>
<td>Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series.</td>
<td>Knights Mill</td>
</tr>
<tr>
<td>AVX512_VPOPCNTDQ</td>
<td>Intel® Xeon Phi™ Processor 7215, 7285, 7295 Series.</td>
<td>Knights Mill</td>
</tr>
<tr>
<td>Fast Short REP MOV</td>
<td>TBD</td>
<td>Ice Lake and later</td>
</tr>
<tr>
<td>AVX512_VNNI, VAES, GFNI (AVX/AVX512), AVX512_VBMI2, VPCLMULQDQ, AVX512_BITALG</td>
<td>TBD</td>
<td>Ice Lake and later</td>
</tr>
<tr>
<td>GFNI(SSE)</td>
<td>TBD</td>
<td>Ice Lake and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
<tr>
<td>PCONFIG, WBNOINVD</td>
<td>TBD</td>
<td>Ice Lake Server and later</td>
</tr>
<tr>
<td>ENCLV</td>
<td>TBD</td>
<td>Ice Lake Server and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
<tr>
<td>Split Lock Detection</td>
<td>TBD</td>
<td>Ice Lake and later</td>
</tr>
<tr>
<td></td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
<tr>
<td>CLDEMOTE</td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
<tr>
<td>Direct stores: MOVDIRI, MOVDIR64B</td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
</tbody>
</table>
Table 1-1. Recent Instruction Set Extensions / Features Introduction in Intel 64 and IA-32 Processors (Continued)

<table>
<thead>
<tr>
<th>Instruction Set Architecture</th>
<th>Processor Generation Introduction</th>
<th>Supported in Microarchitecture</th>
</tr>
</thead>
<tbody>
<tr>
<td>User wait: TPAUSE, UMONITOR, UMWAIT</td>
<td>TBD</td>
<td>Future Tremont and later</td>
</tr>
</tbody>
</table>

1.3 DETECTION OF AVX-512 FOUNDATION INSTRUCTIONS

The majority of AVX-512 Foundation instructions are encoded using the EVEX encoding scheme. EVEX-encoded instructions can operate on the 512-bit ZMM register state plus 8 opmask registers. The opmask instructions in AVX-512 Foundation instructions operate only on opmask registers or with a general purpose register.

Processor support of AVX-512 Foundation instructions is indicated by CPUID.(EAX=07H, ECX=0):EBX.AVX512F[bit 16] = 1. Detection of AVX-512 Foundation instructions operating on ZMM states and opmask registers needs to follow the general procedural flow in Figure 1-1.

Prior to using AVX-512 Foundation instructions, the application must verify that the operating system supports the XGETBV instruction and the ZMM register state, and that the processor supports ZMM state management using XSAVE/XRSTOR as well as the AVX-512 Foundation instructions. The following simplified sequence accomplishes both and is strongly recommended.

1) Detect CPUID.1H:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use¹).
2) Execute XGETBV and verify that XCR0[7:5] = ‘111b’ (OPMASK state, the upper 256 bits of ZMM0-ZMM15, and ZMM16-ZMM31 state are enabled by the OS) and that XCR0[2:1] = ‘11b’ (XMM state and YMM state are enabled by the OS).
3) Detect CPUID.(EAX=07H, ECX=0):EBX.AVX512F[bit 16] = 1.

¹. If CPUID.01H:ECX.OSXSAVE reports 1, it also indirectly implies that the processor supports XSAVE, XRSTOR, XGETBV, and the processor extended state bit vector register XCR0. Thus an application may streamline the checking of CPUID feature flags for XSAVE and OSXSAVE. XSETBV is a privileged instruction.

![Figure 1-1. Procedural Flow of Application Detection of AVX-512 Foundation Instructions](image-url)
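
The three-step sequence in this section can be sketched as a pure bit test. The following Python fragment is an illustration only: it assumes the caller has already captured the raw register values (ECX from CPUID leaf 01H, XCR0 from XGETBV, EBX from CPUID leaf 07H sub-leaf 0) by some platform-specific means, and the helper name `avx512f_usable` is hypothetical.

```python
# Bit positions taken from the three-step sequence in Section 1.3.
OSXSAVE_BIT = 27             # CPUID.1H:ECX.OSXSAVE
AVX512F_BIT = 16             # CPUID.(EAX=07H, ECX=0):EBX.AVX512F
XCR0_SSE_AVX = 0b11 << 1     # XCR0[2:1]: XMM and YMM state
XCR0_AVX512 = 0b111 << 5     # XCR0[7:5]: opmask, upper ZMM0-15, ZMM16-31


def avx512f_usable(leaf1_ecx: int, xcr0: int, leaf7_ebx: int) -> bool:
    """Hypothetical helper: apply steps 1-3 to captured register values."""
    # Step 1: the OS must have enabled XGETBV for application use.
    if not (leaf1_ecx >> OSXSAVE_BIT) & 1:
        return False
    # Step 2: the OS must have enabled XMM/YMM state and all three
    # AVX-512 state components in XCR0.
    required = XCR0_SSE_AVX | XCR0_AVX512
    if (xcr0 & required) != required:
        return False
    # Step 3: the processor must report AVX-512 Foundation support.
    return bool((leaf7_ebx >> AVX512F_BIT) & 1)
```

On real hardware XGETBV may only be executed once step 1 has passed, so a caller would typically pass `xcr0=0` when OSXSAVE is clear.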
1.4 DETECTION OF 512-BIT INSTRUCTION GROUPS OF INTEL® AVX-512 FAMILY

In addition to the Intel AVX-512 Foundation instructions, the Intel AVX-512 family provides several additional 512-bit extensions in groups of instructions. Each group is enumerated by a CPUID leaf 7 feature flag and can be encoded via the EVEX.L'L field to support operation at vector lengths smaller than 512 bits. These instruction groups are listed in Table 1-2.

Software must follow the detection procedure for the 512-bit AVX-512 Foundation instructions as described in Section 1.3.

Detection of other 512-bit sibling instruction groups listed in Table 1-2 (excluding AVX512F) follows the procedure described in Figure 1-2.

Table 1-2. 512-bit Instruction Groups in the Intel® AVX-512 Family

<table>
<thead>
<tr>
<th>CPUID Leaf 7 Feature Flag Bit</th>
<th>Feature Flag abbreviation of 512-bit Instruction Group</th>
<th>SW Detection Flow</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPUID.(EAX=07H, ECX=0):EBX[bit 16]</td>
<td>AVX512F: AVX-512 Foundation instructions.</td>
<td>Figure 1-1</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 06]</td>
<td>AVX512_VBMI2: Additional byte, word, dword and qword capabilities, an addition to AVX512.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 08]</td>
<td>GFNI: Galois Field New Instructions; this bit is concatenated by software with either AVX512, AVX, or SSE to indicate the different supported instructions.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 09]</td>
<td>VAES: Vector AES instructions; this bit is concatenated by software with AVX512 or AVX to indicate the different supported instructions.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 10]</td>
<td>VPCLMULQDQ: Vector PCLMULQDQ instructions; this bit is concatenated by software with AVX512 or AVX to indicate the different supported instructions.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 11]</td>
<td>AVX512_VNNI: Vector Neural Network Instructions, an addition to AVX512.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 12]</td>
<td>AVX512_BITALG: Support for VPOPCNT[B,W] and VPSHUFBITQMB.</td>
<td>Figure 1-2</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 14]</td>
<td>AVX512_VPOPCNTDQ: Support for VPOPCNT[D,Q].</td>
<td>Figure 1-2</td>
</tr>
</tbody>
</table>

Figure 1-2. Procedural Flow of Application Detection of 512-bit Instruction Groups

To illustrate the detection procedure for 512-bit instructions enumerated by AVX512CD, the following sequence is strongly recommended.

1) Detect CPUID.1H:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use).

2) Execute XGETBV and verify that XCR0[7:5] = ‘111b’ (OPMASK state, the upper 256 bits of ZMM0-ZMM15, and ZMM16-ZMM31 state are enabled by the OS) and that XCR0[2:1] = ‘11b’ (XMM state and YMM state are enabled by the OS).

3) Verify both CPUID.(EAX=07H, ECX=0):EBX.AVX512F[bit 16] = 1 and CPUID.(EAX=07H, ECX=0):EBX.AVX512CD[bit 28] = 1.

Similarly, the detection procedure for enumerating 512-bit instructions reported by AVX512DQ follows the same flow.
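
As a rough illustration of the group-level flow, the sketch below gates every group flag on the foundation check and then tests the bit listed for that group. The bit positions are copied from Table 1-2 and from step 3 above; the names `GROUP_FLAGS` and `usable_groups` are hypothetical, and the register values are assumed to have been read from CPUID leaf 07H, sub-leaf 0, beforehand. Only the 512-bit forms are modeled; the SSE and AVX forms of GFNI, VAES and VPCLMULQDQ combine with other feature flags and are outside this sketch.

```python
# (register, bit) pairs copied from Table 1-2 and from step 3 above.
GROUP_FLAGS = {
    "AVX512CD": ("ebx", 28),
    "AVX512_VBMI2": ("ecx", 6),
    "GFNI": ("ecx", 8),
    "VAES": ("ecx", 9),
    "VPCLMULQDQ": ("ecx", 10),
    "AVX512_VNNI": ("ecx", 11),
    "AVX512_BITALG": ("ecx", 12),
    "AVX512_VPOPCNTDQ": ("ecx", 14),
}


def usable_groups(foundation_ok: bool, leaf7_ebx: int, leaf7_ecx: int) -> list:
    """Hypothetical helper: names of groups whose feature flag is set.

    Returns an empty list when the AVX-512 Foundation check (OS state
    support plus AVX512F) has not passed.
    """
    if not foundation_ok:
        return []
    regs = {"ebx": leaf7_ebx, "ecx": leaf7_ecx}
    return [
        name
        for name, (reg, bit) in GROUP_FLAGS.items()
        if (regs[reg] >> bit) & 1
    ]
```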

1.5 DETECTION OF INTEL® MEMORY ENCRYPTION TECHNOLOGIES (INTEL® MKTME) INSTRUCTIONS

Intel® Memory Encryption Technologies instructions are enumerated by a CPUID feature flag; details are listed in Table 1-3.

Table 1-3. Intel® Memory Encryption Technologies Instructions

<table>
<thead>
<tr>
<th>CPUID Leaf Feature Flag Bit</th>
<th>Feature Flag Abbreviation of Intel® MKTME Instructions</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPUID.(EAX=07H, ECX=0):EDX[bit 18]</td>
<td>PCONFIG: Platform configuration</td>
</tr>
<tr>
<td>CPUID.(EAX=80000008H, ECX=0):EBX[bit 9]</td>
<td>WBNOINVD: Write back and do not invalidate cache</td>
</tr>
</tbody>
</table>

1.6 DETECTION OF FUTURE INSTRUCTIONS

Future instructions are enumerated by a CPUID feature flag; details are listed in Table 1-4.

Table 1-4. Future Instructions

<table>
<thead>
<tr>
<th>CPUID Leaf Feature Flag Bit</th>
<th>Feature Flag Abbreviation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CPUID.(EAX=07H, ECX=0):EDX[bit 4]</td>
<td>Fast Short REP MOV</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 27]</td>
<td>MOVDIRI: Direct Stores</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 28]</td>
<td>MOVDIR64B: Direct Stores</td>
</tr>
<tr>
<td>CPUID.(EAX=07H, ECX=0):ECX[bit 5]</td>
<td>WAITPKG: Wait and Pause Enhancements</td>
</tr>
</tbody>
</table>
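
The feature-flag notation used in Tables 1-2 through 1-4 packs four pieces of information into one string: the leaf loaded into EAX, the sub-leaf loaded into ECX, the output register, and the bit to test. The sketch below is one possible way to unpack that notation; the regular expression and the name `parse_flag` are illustrative assumptions, not part of any Intel tool.

```python
import re

# Matches strings such as "CPUID.(EAX=07H, ECX=0):ECX[bit 27]".
_NOTATION = re.compile(
    r"CPUID\.\(EAX=([0-9A-F]+)H, ECX=(\d+)\):(E[A-D]X)\[bit (\d+)\]"
)


def parse_flag(text: str) -> tuple:
    """Hypothetical helper: split a flag string into
    (leaf, sub_leaf, output_register, bit)."""
    match = _NOTATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized CPUID flag notation: {text!r}")
    leaf, sub_leaf, register, bit = match.groups()
    # The leaf is written in hexadecimal with an "H" suffix; the
    # sub-leaf and bit index are decimal.
    return int(leaf, 16), int(sub_leaf), register, int(bit)


def flag_is_set(text: str, register_value: int) -> bool:
    """Test the bit named by `text` in an already-captured register value."""
    bit = parse_flag(text)[3]
    return bool((register_value >> bit) & 1)
```

For example, the MOVDIRI entry parses to leaf 7, sub-leaf 0, register ECX, bit 27, so software would execute CPUID with EAX=07H and ECX=0 and test bit 27 of the returned ECX.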
1.7 CPUID INSTRUCTION

CPUID—CPU Identification

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>64-Bit Mode</th>
<th>Compat/ Leg Mode</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F A2</td>
<td>CPUID</td>
<td>Valid</td>
<td>Valid</td>
<td>Returns processor identification and feature information to the EAX, EBX, ECX, and EDX registers, as determined by input entered in EAX (in some cases, ECX as well).</td>
</tr>
</tbody>
</table>

Description

The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction operates the same in non-64-bit modes and 64-bit mode.

CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers.¹ The instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well). For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value and the Vendor Identification String in the appropriate registers:

```
MOV EAX, 00H
CPUID
```

Table 1-5 shows information returned, depending on the initial value loaded into the EAX register. Table 1-6 shows the maximum CPUID input value recognized for each family of IA-32 processors on which CPUID is implemented.
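
To make the leaf 00H output concrete, the following sketch reassembles the Vendor Identification String from the three register values. It assumes the values were captured after executing CPUID with EAX = 00H; the helper name `vendor_string` is hypothetical. The 12 ASCII bytes are stored four per register, least significant byte first, in the order EBX, EDX, ECX.

```python
import struct


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Hypothetical helper: rebuild the 12-character vendor string.

    Each register holds four ASCII characters, least significant byte
    first; the registers are concatenated in the order EBX, EDX, ECX.
    """
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")
```

With the register values an Intel processor reports (EBX = 756E6547H, EDX = 49656E69H, ECX = 6C65746EH), the function returns "GenuineIntel".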

Two types of information are returned: basic and extended function information. If a value entered for CPUID.EAX is invalid for a particular processor, the data for the highest basic information leaf is returned. For example, using the Intel Core 2 Duo E6850 processor, the following is true:

```
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. *)
CPUID.EAX = 0AH (* Returns Architectural Performance Monitoring leaf.*)
CPUID.EAX = 0BH (* INVALID: Returns the same information as CPUID.EAX = 0AH. *)
CPUID.EAX = 80000008H (* Returns virtual/physical address size data.*)
CPUID.EAX = 8000000AH (* INVALID: Returns same information as CPUID.EAX = 0AH. *)
```

When CPUID returns the highest basic leaf information as a result of an invalid input EAX value, any dependence on input ECX value in the basic leaf is honored.
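
The fallback rule illustrated by the example above can be modeled in a few lines. This is a simplified sketch under stated assumptions: an input counts as valid only if it lies at or below the maximum basic leaf, or between 80000000H and the maximum extended leaf, and the function name `effective_leaf` is hypothetical. The maximum values used in the usage note (0AH and 80000008H) are inferred from the example, not taken from a processor datasheet.

```python
EXTENDED_BASE = 0x80000000  # first extended function leaf


def effective_leaf(requested: int, max_basic: int, max_extended: int) -> int:
    """Hypothetical model: the leaf whose data CPUID returns for `requested`."""
    # Valid basic leaf: returned as requested.
    if requested <= max_basic:
        return requested
    # Valid extended leaf: returned as requested.
    if EXTENDED_BASE <= requested <= max_extended:
        return requested
    # Anything else falls back to the highest basic leaf.
    return max_basic
```

Calling `effective_leaf(0x0B, 0x0A, 0x80000008)` and `effective_leaf(0x8000000A, 0x0A, 0x80000008)` both yield 0AH, matching the two INVALID cases in the example.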

CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

See also:

"Serializing Instructions” in Chapter 8, “Multiple-Processor Management,” in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A*

"Caching Translation Information” in Chapter 4, “Paging,” in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.*

---

1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.
### Table 1-5. Information Returned by CPUID Instruction

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Basic CPUID Information</strong></td>
<td></td>
</tr>
<tr>
<td>0H EAX</td>
<td>Maximum Input Value for Basic CPUID Information (see Table 1-6)</td>
</tr>
<tr>
<td>EBX</td>
<td>“Genu”</td>
</tr>
<tr>
<td>ECX</td>
<td>“ntel”</td>
</tr>
<tr>
<td>EDX</td>
<td>“ineI”</td>
</tr>
<tr>
<td>01H EAX</td>
<td>Version Information: Type, Family, Model, and Stepping ID (see Figure 1-3)</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 7-0: Brand Index</td>
</tr>
<tr>
<td></td>
<td>Bits 15-8: CLFLUSH line size (Value × 8 = cache line size in bytes)</td>
</tr>
<tr>
<td></td>
<td>Bits 23-16: Maximum number of addressable IDs for logical processors in this physical package*</td>
</tr>
<tr>
<td></td>
<td>Bits 31-24: Initial APIC ID</td>
</tr>
<tr>
<td>ECX</td>
<td>Feature Information (see Figure 1-4 and Table 1-8)</td>
</tr>
<tr>
<td>EDX</td>
<td>Feature Information (see Figure 1-5 and Table 1-9)</td>
</tr>
<tr>
<td></td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>* The nearest power-of-2 integer that is not smaller than EBX[23:16] is the maximum number of unique initial APIC IDs reserved for addressing different logical processors in a physical package.</td>
</tr>
<tr>
<td>02H EAX</td>
<td>Cache and TLB Information (see Table 1-10)</td>
</tr>
<tr>
<td>EBX</td>
<td>Cache and TLB Information</td>
</tr>
<tr>
<td>ECX</td>
<td>Cache and TLB Information</td>
</tr>
<tr>
<td>EDX</td>
<td>Cache and TLB Information</td>
</tr>
<tr>
<td>03H EAX</td>
<td>Reserved</td>
</tr>
<tr>
<td>EBX</td>
<td>Reserved</td>
</tr>
<tr>
<td>ECX</td>
<td>Reserved</td>
</tr>
<tr>
<td>EDX</td>
<td>Reserved</td>
</tr>
<tr>
<td></td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Processor serial number (PSN) is not supported in the Pentium 4 processor or later. On all models, use the PSN flag (returned using CPUID) to check for PSN support before accessing the feature.</td>
</tr>
<tr>
<td><strong>Deterministic Cache Parameters Leaf</strong></td>
<td></td>
</tr>
<tr>
<td>04H</td>
<td>Leaf 04H output depends on the initial value in ECX.</td>
</tr>
<tr>
<td></td>
<td>See also: “INPUT EAX = 4: Returns Deterministic Cache Parameters for each level” on page 1-32.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 4-0: Cache Type Field</td>
</tr>
<tr>
<td></td>
<td>0 = Null - No more caches</td>
</tr>
<tr>
<td></td>
<td>1 = Data Cache</td>
</tr>
<tr>
<td></td>
<td>2 = Instruction Cache</td>
</tr>
<tr>
<td></td>
<td>3 = Unified Cache</td>
</tr>
<tr>
<td></td>
<td>4-31 = Reserved</td>
</tr>
</tbody>
</table>

NOTES:

CPUID leaves above 3 and below 80000000H are visible only when IA32_MISC_ENABLE.BOOT_NT4[bit 22] = 0 (default).
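
The EBX fields of leaf 01H listed in the table above can be unpacked with plain shifts and masks. The following is a minimal Python sketch; the function name, the dictionary keys, and the sample value are assumptions chosen for illustration.

```python
def decode_leaf1_ebx(ebx: int) -> dict:
    """Split the EBX value returned by CPUID leaf 01H into its four fields."""
    return {
        "brand_index": ebx & 0xFF,                       # bits 7-0
        "clflush_line_bytes": ((ebx >> 8) & 0xFF) * 8,   # bits 15-8, value x 8
        "max_logical_ids": (ebx >> 16) & 0xFF,           # bits 23-16
        "initial_apic_id": (ebx >> 24) & 0xFF,           # bits 31-24
    }


# Hypothetical EBX value for illustration only.
print(decode_leaf1_ebx(0x02100800))
```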

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Bits 7-5: Cache Level (starts at 1)</td>
</tr>
<tr>
<td></td>
<td>Bit 8: Self Initializing cache level (does not need SW initialization)</td>
</tr>
<tr>
<td></td>
<td>Bit 9: Fully Associative cache</td>
</tr>
<tr>
<td></td>
<td>Bits 13-10: Reserved</td>
</tr>
<tr>
<td></td>
<td>Bits 25-14: Maximum number of addressable IDs for logical processors sharing this cache*,**</td>
</tr>
<tr>
<td></td>
<td>Bits 31-26: Maximum number of addressable IDs for processor cores in the physical package***,****</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 11-00: L = System Coherency Line Size*</td>
</tr>
<tr>
<td></td>
<td>Bits 21-12: P = Physical Line partitions*</td>
</tr>
<tr>
<td></td>
<td>Bits 31-22: W = Ways of associativity*</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 31-00: S = Number of Sets*</td>
</tr>
<tr>
<td>EDX</td>
<td>Bit 0: Write-Back Invalidate/Invalidate (WBINVD/INVD behavior on lower level caches)</td>
</tr>
<tr>
<td></td>
<td>0 = WBINVD/INVD from threads sharing this cache acts upon lower level caches for threads sharing this cache</td>
</tr>
<tr>
<td></td>
<td>1 = WBINVD/INVD is not guaranteed to act upon lower level caches of non-originating threads sharing this cache.</td>
</tr>
<tr>
<td></td>
<td>Bit 1: Cache Inclusiveness</td>
</tr>
<tr>
<td></td>
<td>0 = Cache is not inclusive of lower cache levels.</td>
</tr>
<tr>
<td></td>
<td>1 = Cache is inclusive of lower cache levels.</td>
</tr>
<tr>
<td></td>
<td>Bit 2: Complex cache indexing</td>
</tr>
<tr>
<td></td>
<td>0 = Direct mapped cache</td>
</tr>
<tr>
<td></td>
<td>1 = A complex function is used to index the cache, potentially using all address bits.</td>
</tr>
<tr>
<td></td>
<td>Bits 31-03: Reserved = 0</td>
</tr>
</tbody>
</table>

**NOTES:**

* Add one to the return value to get the result.

** The nearest power-of-2 integer that is not smaller than (1 + EAX[25:14]) is the number of unique initial APIC IDs reserved for addressing different logical processors sharing this cache.

*** The nearest power-of-2 integer that is not smaller than (1 + EAX[31:26]) is the number of unique Core IDs reserved for addressing different processor cores in a physical package. Core ID is a subset of bits of the initial APIC ID.

**** The returned value is constant for valid initial values in ECX. Valid ECX values start from 0.
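
The "add one" and "nearest power-of-2" rules in the notes above can be sketched in Python as shown below. The cache-size product (ways × partitions × line size × sets) follows the deterministic cache parameters description in the Software Developer's Manual; the function names and sample values are assumptions made for this example.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two that is not smaller than n (n >= 1)."""
    return 1 << (n - 1).bit_length()


def cache_size_bytes(ebx: int, ecx: int) -> int:
    """Cache size from leaf 04H; each field is stored as (value - 1)."""
    line_size = (ebx & 0xFFF) + 1             # EBX bits 11-00
    partitions = ((ebx >> 12) & 0x3FF) + 1    # EBX bits 21-12
    ways = ((ebx >> 22) & 0x3FF) + 1          # EBX bits 31-22
    sets = ecx + 1                            # ECX bits 31-00
    return ways * partitions * line_size * sets


def apic_ids_sharing_cache(eax: int) -> int:
    """Number of initial APIC IDs reserved for logical processors sharing the cache."""
    return next_pow2(1 + ((eax >> 14) & 0xFFF))  # EAX bits 25-14


# Hypothetical values: 8 ways, 1 partition, 64-byte lines, 64 sets.
print(cache_size_bytes(ebx=0x01C0003F, ecx=63))
```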

### MONITOR/MWAIT Leaf

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>05H EAX</td>
<td>Bits 15-00: Smallest monitor-line size in bytes (default is processor’s monitor granularity)</td>
</tr>
<tr>
<td></td>
<td>Bits 31-16: Reserved = 0</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 15-00: Largest monitor-line size in bytes (default is processor’s monitor granularity)</td>
</tr>
<tr>
<td></td>
<td>Bits 31-16: Reserved = 0</td>
</tr>
<tr>
<td>ECX</td>
<td>Bit 00: Enumeration of Monitor-Mwait extensions (beyond EAX and EBX registers) supported</td>
</tr>
<tr>
<td></td>
<td>Bit 01: Supports treating interrupts as break-event for MWAIT, even when interrupts disabled</td>
</tr>
<tr>
<td></td>
<td>Bits 31-02: Reserved</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 03-00: Number of C0* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 07-04: Number of C1* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 11-08: Number of C2* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 15-12: Number of C3* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 19-16: Number of C4* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 23-20: Number of C5* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 27-24: Number of C6* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td>Bits 31-28: Number of C7* sub C-states supported using MWAIT</td>
</tr>
<tr>
<td></td>
<td><strong>NOTE:</strong></td>
</tr>
<tr>
<td></td>
<td>* The definitions of the C0 through C7 states for the MWAIT extension are processor-specific C-states, not ACPI C-states.</td>
</tr>
</tbody>
</table>
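
Each 4-bit field of EDX in the MONITOR/MWAIT leaf holds the sub C-state count for one C-state, so the whole register can be unpacked in a single pass. A minimal Python sketch follows; the function name and the sample EDX value are assumptions for illustration.

```python
def mwait_sub_cstates(edx: int) -> list:
    """Return the number of sub C-states for C0 through C7 (index = C-state)."""
    # Field i occupies bits (4*i + 3) down to (4*i).
    return [(edx >> (4 * i)) & 0xF for i in range(8)]


# Hypothetical EDX value for illustration only.
for cstate, count in enumerate(mwait_sub_cstates(0x00001120)):
    print(f"C{cstate}: {count} sub C-state(s)")
```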

### Thermal and Power Management Leaf

| Initial EAX Value | Information Provided about the Processor |
|---|---|
| 06H EAX | Bit 00: Digital temperature sensor is supported if set. |
| | Bit 01: Intel Turbo Boost Technology available (see description of IA32_MISC_ENABLE[38]). |
| | Bit 02: ARAT. APIC-Timer-always-running feature is supported if set. |
| | Bit 03: Reserved. |
| | Bit 04: PLN. Power limit notification controls are supported if set. |
| | Bit 05: ECMD. Clock modulation duty cycle extension is supported if set. |
| | Bit 06: PTM. Package thermal management is supported if set. |
| | Bit 07: HWP. HWP base registers (IA32_PM_ENABLE[bit 0], IA32_HWP_REQUEST, IA32_HWP_STATUS) are supported if set. |
| | Bit 08: HWP_Notification. IA32_HWP_INTERRUPT MSR is supported if set. |
| | Bit 09: HWP_Activity_Window. IA32_HWP_REQUEST MSR is supported if set. |
| | Bit 11: HWP_Package_Level_Request. IA32_HWP_REQUEST_PKG MSR is supported if set. |
| | Bit 12: Reserved. |
| | Bit 13: HDC. HDC base registers IA32_PKG_HDC_CTL, IA32_PM_CTL1, IA32_THREAD_STALL MSRs are supported if set. |
| | Bit 14: Intel Turbo Boost Max Technology 3.0 available. |
| | Bit 15: HWP Capabilities. Highest Performance change is supported if set. |
| | Bit 16: HWP PECI override is supported if set. |
| | Bit 17: Flexible HWP is supported if set. |
| | Bit 18: Fast access mode for the IA32_HWP_REQUEST MSR is supported if set. |
| | Bit 19: HW_FEEDBACK. IA32_HW_FEEDBACK_PTR, IA32_HW_FEEDBACK_CONFIG, IA32_PACKAGE_THERM_STATUS bit 26 and IA32_PACKAGE_THERM_INTERRUPT bit 25 are supported if set. |
| | Bit 20: Ignoring Idle Logical Processor HWP request is supported if set. |
| | Bits 31-21: Reserved. |
| EBX | Bits 03-00: Number of Interrupt Thresholds in Digital Thermal Sensor |
| | Bits 31-04: Reserved |
| ECX | Bit 00: Hardware Coordination Feedback Capability (presence of IA32_MPERF and IA32_APERF). The capability to provide a measure of delivered processor performance (since last reset of the counters), as a percentage of the expected processor performance when running at the TSC frequency. |
| | Bits 02-01: Reserved = 0 |
| | Bit 03: The processor supports performance-energy bias preference if CPUID.06H:ECX.SETBH[bit 3] is set; this also implies the presence of a new architectural MSR called IA32_ENERGY_PERF_BIAS (1B0H). |
| | Bits 31-04: Reserved = 0 |
| EDX | Reserved = 0 |

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>07H</td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Leaf 07H main leaf (ECX = 0).</td>
</tr>
<tr>
<td></td>
<td>If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 31-0: Reports the maximum number of sub-leaves that are supported in leaf 07H.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bit 00: FSGSBASE. Supports RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 01: IA32_TSC_ADJUST MSR is supported if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 02: SGX</td>
</tr>
<tr>
<td></td>
<td>Bit 03: BMI1</td>
</tr>
<tr>
<td></td>
<td>Bit 04: HLE</td>
</tr>
<tr>
<td></td>
<td>Bit 05: AVX2</td>
</tr>
<tr>
<td></td>
<td>Bit 06: Reserved</td>
</tr>
<tr>
<td></td>
<td>Bit 07: SMEP. Supports Supervisor Mode Execution Protection if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 08: BMI2</td>
</tr>
<tr>
<td></td>
<td>Bit 09: Supports Enhanced REP MOVSB/STOSB if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 10: INVPCID</td>
</tr>
<tr>
<td></td>
<td>Bit 11: RTM</td>
</tr>
<tr>
<td></td>
<td>Bit 12: Supports Platform Quality of Service Monitoring (PQM) capability if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 13: Deprecates FPU CS and FPU DS values if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 14: Intel Memory Protection Extensions</td>
</tr>
<tr>
<td></td>
<td>Bit 15: Supports Platform Quality of Service Enforcement (PQE) capability if 1.</td>
</tr>
<tr>
<td></td>
<td>Bit 16: AVX512F</td>
</tr>
<tr>
<td></td>
<td>Bit 17: AVX512DQ</td>
</tr>
<tr>
<td></td>
<td>Bit 18: RDSEED</td>
</tr>
<tr>
<td></td>
<td>Bit 19: ADX</td>
</tr>
<tr>
<td></td>
<td>Bit 20: SMAP</td>
</tr>
<tr>
<td></td>
<td>Bit 21: AVX512_IFMA</td>
</tr>
<tr>
<td></td>
<td>Bit 22: Reserved</td>
</tr>
<tr>
<td></td>
<td>Bit 23: CLFLUSHOPT</td>
</tr>
<tr>
<td></td>
<td>Bit 24: CLWB</td>
</tr>
<tr>
<td></td>
<td>Bit 25: Intel Processor Trace</td>
</tr>
<tr>
<td></td>
<td>Bit 26: AVX512PF (Intel® Xeon Phi™ only.)</td>
</tr>
<tr>
<td></td>
<td>Bit 27: AVX512ER (Intel® Xeon Phi™ only.)</td>
</tr>
<tr>
<td></td>
<td>Bit 28: AVX512CD</td>
</tr>
<tr>
<td></td>
<td>Bit 29: SHA</td>
</tr>
<tr>
<td></td>
<td>Bit 30: AVX512BW</td>
</tr>
<tr>
<td></td>
<td>Bit 31: AVX512VL</td>
</tr>
</tbody>
</table>
| Initial EAX Value | Information Provided about the Processor |
|---|---|
| ECX | Bit 00: PREFETCHWT1 (Intel® Xeon Phi™ only.) |
| | Bit 01: AVX512_VBMI |
| | Bit 02: UMIP. Supports user-mode instruction prevention if 1. |
| | Bit 03: PKU. Supports protection keys for user-mode pages if 1. |
| | Bit 04: OSPKE. If 1, OS has set CR4.PKE to enable protection keys (and the RDPKRU/WRPKRU instructions). |
| | Bit 05: WAITPKG |
| | Bit 06: AVX512_VBMI2 |
| | Bit 07: Reserved |
| | Bit 08: GFNI |
| | Bit 09: VAES |
| | Bit 10: VPCLMULQDQ |
| | Bit 11: AVX512_VNNI |
| | Bit 12: AVX512_BITALG |
| | Bit 13: Reserved |
| | Bit 14: AVX512_VPOPCNTDQ |
| | Bits 16-15: Reserved |
| | Bits 21-17: The value of MAWAU used by the BNDLDX and BNDSTX instructions in 64-bit mode. |
| | Bit 22: RDPID and IA32_TSC_AUX are available if 1. |
| | Bits 24-23: Reserved |
| | Bit 25: CLDEMOTE. Supports cache line demote if 1. |
| | Bit 26: Reserved |
| | Bit 27: MOVDIRI. Supports MOVDIRI if 1. |
| | Bit 28: MOVDIR64B. Supports MOVDIR64B if 1. |
| | Bit 29: Reserved |
| | Bit 30: SGX_LC. Supports SGX Launch Configuration if 1. |
| | Bit 31: Reserved |
| EDX | Bits 01-00: Reserved |
| | Bit 02: AVX512_4VNNIW (Intel® Xeon Phi™ only.) |
| | Bit 03: AVX512_4FMAPS (Intel® Xeon Phi™ only.) |
| | Bit 04: Fast Short REP MOV |
| | Bits 17-05: Reserved |
| | Bit 18: PCONFIG |
| | Bits 31-19: Reserved |
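
The feature bits of leaf 07H (ECX = 0) listed above lend themselves to a table-driven decoder. The Python sketch below covers only a subset of the EBX and ECX flags; the dictionary names, the function name, and the sample register values are assumptions for illustration.

```python
# Subset of the leaf 07H (ECX = 0) feature bits, keyed by bit position.
LEAF7_EBX_FLAGS = {
    0: "FSGSBASE", 3: "BMI1", 5: "AVX2", 7: "SMEP", 8: "BMI2",
    16: "AVX512F", 18: "RDSEED", 19: "ADX", 20: "SMAP", 29: "SHA",
}
LEAF7_ECX_FLAGS = {
    1: "AVX512_VBMI", 2: "UMIP", 3: "PKU", 8: "GFNI",
    9: "VAES", 10: "VPCLMULQDQ", 22: "RDPID",
}


def decode_flags(value: int, names: dict) -> list:
    """List the names of all known flags whose bit is set in value."""
    return [name for bit, name in sorted(names.items()) if (value >> bit) & 1]


# Hypothetical register values for illustration only.
print(decode_flags((1 << 5) | (1 << 16), LEAF7_EBX_FLAGS))
print(decode_flags((1 << 8) | (1 << 9), LEAF7_ECX_FLAGS))
```

Bits that are absent from the dictionaries are ignored, so reserved bits never produce a spurious feature name.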

**Structured Extended Feature Enumeration Sub-leaves (EAX = 07H, ECX = n, n ≥ 1)**

| Initial EAX Value | Information Provided about the Processor |
|---|---|
| 07H | **NOTES:** |
| | Leaf 07H output depends on the initial value in ECX. |
| | If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. |
| EAX | This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved. |
| EBX | This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved. |
| ECX | This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved. |
| EDX | This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved. |


<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Direct Cache Access Information Leaf</strong></td>
<td></td>
</tr>
<tr>
<td>09H EAX</td>
<td>Value of bits [31:0] of IA32_PLATFORM_DCA_CAP MSR (address 1F8H)</td>
</tr>
<tr>
<td>EBX</td>
<td>Reserved</td>
</tr>
<tr>
<td>ECX</td>
<td>Reserved</td>
</tr>
<tr>
<td>EDX</td>
<td>Reserved</td>
</tr>
<tr>
<td><strong>Architectural Performance Monitoring Leaf</strong></td>
<td></td>
</tr>
</tbody>
</table>
| Initial EAX Value | Information Provided about the Processor |
|---|---|
| 0AH EAX | Bits 07-00: Version ID of architectural performance monitoring |
| | Bits 15-08: Number of general-purpose performance monitoring counters per logical processor |
| | Bits 23-16: Bit width of general-purpose performance monitoring counters |
| | Bits 31-24: Length of EBX bit vector to enumerate architectural performance monitoring events |
| EBX | Bit 00: Core cycle event not available if 1 |
| | Bit 01: Instruction retired event not available if 1 |
| | Bit 02: Reference cycles event not available if 1 |
| | Bit 03: Last-level cache reference event not available if 1 |
| | Bit 04: Last-level cache misses event not available if 1 |
| | Bit 05: Branch instruction retired event not available if 1 |
| | Bit 06: Branch mispredict retired event not available if 1 |
| | Bits 31-07: Reserved = 0 |
| ECX | Reserved = 0 |
| EDX | Bits 04-00: Number of fixed-function performance counters (if Version ID > 1) |
| | Bits 12-05: Bit width of fixed-function performance counters (if Version ID > 1) |
| | Bits 31-13: Reserved = 0 |
| **Extended Topology Enumeration Leaf** | |
| 0BH EAX | Bits 04-00: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*. All logical processors with the same next level ID share current level. |
| | Bits 31-05: Reserved. |
| EBX | Bits 15-00: Number of logical processors at this level type. The number reflects configuration as shipped by Intel**. |
| | Bits 31-16: Reserved. |
| ECX | Bits 07-00: Level number. Same value in ECX input. |
| | Bits 15-08: Level type***. |
| | Bits 31-16: Reserved. |
| EDX | Bits 31-00: x2APIC ID of the current logical processor. |

**NOTES:**

* Software should use this field (EAX[4:0]) to enumerate processor topology of the system.

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>** Software must not use EBX[15:0] to enumerate processor topology of the system. The value in this field (EBX[15:0]) is only intended for display/diagnostic purposes. The actual number of logical processors available to BIOS/OS/Applications may be different from the value of EBX[15:0], depending on software and platform hardware configurations.</strong></td>
<td></td>
</tr>
<tr>
<td><strong>*** The value of the “level type” field is not related to level numbers in any way; higher “level type” values do not mean higher levels. The level type field has the following encoding:</strong></td>
<td></td>
</tr>
<tr>
<td>0: invalid</td>
<td></td>
</tr>
<tr>
<td>1: SMT</td>
<td></td>
</tr>
<tr>
<td>2: Core</td>
<td></td>
</tr>
<tr>
<td>3-255: Reserved</td>
<td></td>
</tr>
<tr>
<td><strong>Processor Extended State Enumeration Main Leaf (EAX = 0DH, ECX = 0)</strong></td>
<td></td>
</tr>
<tr>
<td>0DH</td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Leaf 0DH main leaf (ECX = 0).</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 31-00: Reports the valid bit fields of the lower 32 bits of the XCR0 register. If a bit is 0, the corresponding bit field in XCR0 is reserved.</td>
</tr>
<tr>
<td></td>
<td>Bit 00: legacy x87</td>
</tr>
<tr>
<td></td>
<td>Bit 01: 128-bit SSE</td>
</tr>
<tr>
<td></td>
<td>Bit 02: 256-bit AVX</td>
</tr>
<tr>
<td></td>
<td>Bits 04-03: MPX state</td>
</tr>
<tr>
<td></td>
<td>Bits 07-05: AVX-512 state</td>
</tr>
<tr>
<td></td>
<td>Bit 08: Used for IA32_XSS</td>
</tr>
<tr>
<td></td>
<td>Bit 09: PKRU state</td>
</tr>
<tr>
<td></td>
<td>Bits 12-10: Reserved.</td>
</tr>
<tr>
<td></td>
<td>Bit 13: Used for IA32_XSS.</td>
</tr>
<tr>
<td></td>
<td>Bits 31-14: Reserved.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) required by enabled features in XCR0. May be different than ECX if some features at the end of the XSAVE save area are not enabled.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 31-00: Maximum size (bytes, from the beginning of the XSAVE/XRSTOR save area) of the XSAVE/XRSTOR save area required by all supported features in the processor, i.e., all the valid bit fields in XCR0.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 31-00: Reports the valid bit fields of the upper 32 bits of the XCR0 register. If a bit is 0, the corresponding bit field in XCR0 is reserved.</td>
</tr>
<tr>
<td><strong>Processor Extended State Enumeration Sub-leaf (EAX = 0DH, ECX = 1)</strong></td>
<td></td>
</tr>
<tr>
<td>0DH EAX</td>
<td>Bit 01: Supports XSAVEC and the compacted form of XRSTOR if set</td>
</tr>
<tr>
<td></td>
<td>Bit 02: Supports XGETBV with ECX = 1 if set</td>
</tr>
<tr>
<td></td>
<td>Bit 03: Supports XSAVES/XRSTORS and IA32_XSS if set</td>
</tr>
<tr>
<td></td>
<td>Bits 31-04: Reserved</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-00: The size in bytes of the XSAVE area containing all states enabled by XCR0</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 31-00: Reports the supported bits of the lower 32 bits of the IA32_XSS MSR. IA32_XSS[n] can be set to 1 only if ECX[n] is 1.</td>
</tr>
<tr>
<td></td>
<td>Bits 07-00: Used for XCR0</td>
</tr>
<tr>
<td></td>
<td>Bit 08: PT state</td>
</tr>
<tr>
<td></td>
<td>Bit 09: Used for XCR0</td>
</tr>
<tr>
<td></td>
<td>Bits 12-10: Reserved.</td>
</tr>
<tr>
<td></td>
<td>Bit 13: HWP state.</td>
</tr>
<tr>
<td></td>
<td>Bits 31-14: Reserved.</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>EDX</td>
<td>Bits 31-00: Reports the supported bits of the upper 32 bits of the IA32_XSS MSR. IA32_XSS[n+32] can be set to 1 only if EDX[n] is 1. Bits 31-00: Reserved</td>
</tr>
<tr>
<td></td>
<td><strong>Processor Extended State Enumeration Sub-leaves (EAX = 0DH, ECX = n, n &gt; 1)</strong></td>
</tr>
<tr>
<td>0DH</td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Leaf 0DH output depends on the initial value in ECX. Each sub-leaf index (starting at position 2) is supported if it corresponds to a supported bit in either the XCR0 register or the IA32_XSS MSR.</td>
</tr>
<tr>
<td></td>
<td>*If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf n (0 ≤ n ≤ 31) is invalid if sub-leaf 0 returns 0 in EAX[n] and sub-leaf 1 returns 0 in ECX[n]. Sub-leaf n (32 ≤ n ≤ 63) is invalid if sub-leaf 0 returns 0 in EDX[n-32] and sub-leaf 1 returns 0 in EDX[n-32].</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 31-00: The size in bytes (from the offset specified in EBX) of the save area for an extended state feature associated with a valid sub-leaf index, n. This field reports 0 if the sub-leaf index, n, is invalid*</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-00: The offset in bytes of this extended state component's save area from the beginning of the XSAVE/XRSTOR area. This field reports 0 if the sub-leaf index, n, does not map to a valid bit in the XCR0 register*.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bit 0 is set if bit n (corresponding to the sub-leaf index) is supported in the IA32_XSS MSR; it is clear if bit n is instead supported in XCR0. Bit 1 is set if, when the compacted format of an XSAVE area is used, this extended state component is located on the next 64-byte boundary following the preceding state component (otherwise, it is located immediately following the preceding state component). Bits 31-02 are reserved. This field reports 0 if the sub-leaf index, n, is invalid*.</td>
</tr>
<tr>
<td>EDX</td>
<td>This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved.</td>
</tr>
<tr>
<td></td>
<td><strong>Platform QoS Monitoring Enumeration Sub-leaf (EAX = 0FH, ECX = 0)</strong></td>
</tr>
<tr>
<td>0FH</td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Leaf 0FH output depends on the initial value in ECX. Sub-leaf index 0 reports valid resource type starting at bit position 1 of EDX.</td>
</tr>
<tr>
<td>EAX</td>
<td>Reserved.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-0: Maximum range (zero-based) of RMID within this physical processor of all types.</td>
</tr>
<tr>
<td>ECX</td>
<td>Reserved.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bit 00: Reserved. Bit 01: Supports L3 Cache QoS Monitoring if 1. Bits 31-02: Reserved</td>
</tr>
<tr>
<td></td>
<td><strong>L3 Cache QoS Monitoring Capability Enumeration Sub-leaf (EAX = 0FH, ECX = 1)</strong></td>
</tr>
<tr>
<td>0FH</td>
<td><strong>NOTES:</strong></td>
</tr>
<tr>
<td></td>
<td>Leaf 0FH output depends on the initial value in ECX.</td>
</tr>
<tr>
<td>EAX</td>
<td>Reserved.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-00: Conversion factor from reported IA32_QM_CTR value to occupancy metric (bytes).</td>
</tr>
<tr>
<td>ECX</td>
<td>Maximum range (zero-based) of RMID of this resource type.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bit 00: Supports L3 occupancy monitoring if 1. Bits 31-01: Reserved</td>
</tr>
</tbody>
</table>
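
The validity rule for the processor extended state sub-leaves (n > 1) given in the notes above can be expressed as a small predicate. The Python sketch below assumes the caller has already read sub-leaf 0 and sub-leaf 1; the function and parameter names are hypothetical.

```python
def xsave_subleaf_is_valid(n: int, leaf0_eax: int, leaf0_edx: int,
                           leaf1_ecx: int, leaf1_edx: int) -> bool:
    """Check whether extended state sub-leaf n (2 <= n <= 63) is valid."""
    if not 2 <= n <= 63:
        raise ValueError("the rule applies to sub-leaf indexes 2 through 63")
    if n < 32:
        # Valid if bit n is supported in XCR0 (sub-leaf 0 EAX)
        # or in IA32_XSS (sub-leaf 1 ECX).
        return bool(((leaf0_eax | leaf1_ecx) >> n) & 1)
    # Upper half: the same test against the EDX outputs, bit n - 32.
    return bool(((leaf0_edx | leaf1_edx) >> (n - 32)) & 1)


# Hypothetical outputs: x87, SSE and AVX in XCR0; PT state in IA32_XSS.
print(xsave_subleaf_is_valid(2, 0x7, 0x0, 0x100, 0x0))
print(xsave_subleaf_is_valid(3, 0x7, 0x0, 0x100, 0x0))
```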

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Platform QoS Enforcement Enumeration Sub-leaf (EAX = 10H, ECX = 0)</strong></td>
<td></td>
</tr>
</tbody>
</table>
| Initial EAX Value | Information Provided about the Processor |
|---|---|
| 10H | **NOTES:** |
| | Leaf 10H output depends on the initial value in ECX. |
| | Sub-leaf index 0 reports valid resource identification (ResID) starting at bit position 1 of EBX. |
| EAX | Reserved. |
| EBX | Bit 00: Reserved. |
| | Bit 01: Supports L3 Cache QoS Enforcement if 1. |
| | Bits 31-02: Reserved. |
| ECX | Reserved. |
| EDX | Reserved. |
| **L3 Cache QoS Enforcement Enumeration Sub-leaf (EAX = 10H, ECX = ResID = 1)** | |
| 10H | **NOTES:** |
| | Leaf 10H output depends on the initial value in ECX. |
| EAX | Bits 04-00: Length of the capacity bit mask for the corresponding ResID. |
| | Bits 31-05: Reserved |
| EBX | Bits 31-00: Bit-granular map of isolation/contention of allocation units. |
| ECX | Bit 00: Reserved. |
| | Bit 01: Updates of COS should be infrequent if 1. |
| | Bit 02: Code and Data Prioritization Technology supported if 1. |
| | Bits 31-03: Reserved |
| EDX | Bits 15-00: Highest COS number supported for this ResID. |
| | Bits 31-16: Reserved |
| **Intel® Software Guard Extensions (Intel® SGX) Capability Enumeration Leaf, sub-leaf 0 (EAX = 12H, ECX = 0)** | |
| 12H | **NOTES:** |
| | Leaf 12H sub-leaf 0 (ECX = 0) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1. |
| EAX | Bit 00: SGX1. If 1, indicates Intel SGX supports the collection of SGX1 leaf functions. |
| | Bit 01: SGX2. If 1, indicates Intel SGX supports the collection of SGX2 leaf functions. |
| | Bits 04-02: Reserved. |
| | Bit 05: If 1, indicates Intel SGX supports ENCLV instruction leaves EINCVIRTCHILD, EDECVIRTCHILD, and ESETCONTEXT. |
| | Bit 06: If 1, indicates Intel SGX supports ENCLS instruction leaves ETRACKC, ERDINFO, ELDBC, and ELDUC. |
| | Bits 31-07: Reserved. |
| EBX | Bits 31-00: MISCSELECT. Bit vector of supported extended Intel SGX features. |
| ECX | Bits 31-00: Reserved. |
| EDX | Bits 07-00: MaxEnclaveSize_Not64. The maximum supported enclave size in non-64-bit mode is 2 ^ (EDX[7:0]). |
| | Bits 15-08: MaxEnclaveSize_64. The maximum supported enclave size in 64-bit mode is 2 ^ (EDX[15:8]). |
| | Bits 31-16: Reserved. |
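
The two enclave-size fields of leaf 12H sub-leaf 0 are exponents, so the byte limits are obtained with a shift. A minimal Python sketch follows; the function name, the dictionary keys, and the sample EDX value are assumptions for illustration.

```python
def sgx_max_enclave_sizes(edx: int) -> dict:
    """Maximum enclave sizes in bytes from CPUID.(EAX=12H, ECX=0):EDX."""
    return {
        "non_64_bit": 1 << (edx & 0xFF),        # 2 ^ EDX[7:0]
        "64_bit": 1 << ((edx >> 8) & 0xFF),     # 2 ^ EDX[15:8]
    }


# Hypothetical EDX value for illustration only.
print(sgx_max_enclave_sizes(0x241F))
```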
### Intel SGX Attributes Enumeration Leaf, sub-leaf 1 (EAX = 12H, ECX = 1)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NOTES:</strong></td>
<td>Leaf 12H sub-leaf 1 (ECX = 1) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bit 31-00: Reports the valid bits of SECS.ATTRIBUTES[31:0] that software can set with ECREATE.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bit 31-00: Reports the valid bits of SECS.ATTRIBUTES[63:32] that software can set with ECREATE.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bit 31-00: Reports the valid bits of SECS.ATTRIBUTES[95:64] that software can set with ECREATE.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bit 31-00: Reports the valid bits of SECS.ATTRIBUTES[127:96] that software can set with ECREATE.</td>
</tr>
</tbody>
</table>

### Intel SGX EPC Enumeration Leaf, sub-leaves (EAX = 12H, ECX = 2 or higher)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NOTES:</strong></td>
<td>Leaf 12H sub-leaf 2 or higher (ECX &gt;= 2) is supported if CPUID.(EAX=07H, ECX=0H):EBX[SGX] = 1. For sub-leaves (ECX = 2 or higher), definition of EDX,ECX,EBX,EAX[31:4] depends on the sub-leaf type listed below.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bit 03-00: Sub-leaf Type</td>
</tr>
<tr>
<td></td>
<td>0000b: Indicates this sub-leaf is invalid.</td>
</tr>
<tr>
<td></td>
<td>0001b: This sub-leaf enumerates an EPC section. EBX:EAX and EDX:ECX provide information on the Enclave Page Cache (EPC) section. All other type encodings are reserved.</td>
</tr>
<tr>
<td>Type</td>
<td>0000b. This sub-leaf is invalid. EDX:ECX:EBX:EAX return 0.</td>
</tr>
<tr>
<td></td>
<td>0001b. This sub-leaf enumerates an EPC section with EDX:ECX, EBX:EAX defined as follows. EAX[11:04]: Reserved (enumerate 0). EAX[31:12]: Bits 31:12 of the physical address of the base of the EPC section. EBX[19:00]: Bits 51:32 of the physical address of the base of the EPC section. EBX[31:20]: Reserved.</td>
</tr>
<tr>
<td></td>
<td>ECX[03:00]: EPC section property encoding defined as follows: If EAX[3:0] is 0000b, then all bits of the EDX:ECX pair are enumerated as 0. If EAX[3:0] is 0001b, then this section has confidentiality and integrity protection. All other encodings are reserved. ECX[11:04]: Reserved (enumerate 0). ECX[31:12]: Bits 31:12 of the size of the corresponding EPC section within the Processor Reserved Memory. EDX[19:00]: Bits 51:32 of the size of the corresponding EPC section within the Processor Reserved Memory. EDX[31:20]: Reserved.</td>
</tr>
</tbody>
</table>
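
The base address and size of an EPC section are each split across two registers, as described in the table above. The following Python sketch reassembles them; the function name and the sample register values are assumptions for illustration.

```python
def decode_epc_section(eax: int, ebx: int, ecx: int, edx: int):
    """Return (base, size) in bytes for an EPC sub-leaf, or None if invalid."""
    if eax & 0xF != 0b0001:
        # Type 0000b marks an invalid sub-leaf; other types are reserved.
        return None
    # Bits 51:32 come from EBX/EDX[19:0]; bits 31:12 from EAX/ECX[31:12].
    base = ((ebx & 0xFFFFF) << 32) | (eax & 0xFFFFF000)
    size = ((edx & 0xFFFFF) << 32) | (ecx & 0xFFFFF000)
    return base, size


# Hypothetical register values for illustration only.
section = decode_epc_section(0x70200001, 0x0, 0x05D80001, 0x0)
if section is not None:
    print(f"base={section[0]:#x} size={section[1]:#x}")
```

Software would typically call this for ECX = 2, 3, and so on, stopping at the first sub-leaf that reports the invalid type.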

### Intel Processor Trace Enumeration Main Leaf (EAX = 14H, ECX = 0)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NOTES:</strong></td>
<td>Leaf 14H main leaf (ECX = 0).</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 31-00: Reports the maximum sub-leaf supported in leaf 14H.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bit 00: If 1, indicates that IA32_RTIT_CTL.CR3Filter can be set to 1, and that the IA32_RTIT_CR3_MATCH MSR can be accessed. Bit 01: If 1, indicates support of Configurable PSB and Cycle-Accurate Mode. Bit 02: If 1, indicates support of IP Filtering, TraceStop filtering, and preservation of Intel PT MSRs across warm reset. Bit 03: If 1, indicates support of MTC timing packet and suppression of COFI-based packets. Bits 31-04: Reserved.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bit 00: If 1, tracing can be enabled with IA32_RTIT_CTL.ToPA = 1, hence utilizing the ToPA output scheme; IA32_RTIT_OUTPUT_BASE and IA32_RTIT_OUTPUT_MASK_PTRS MSRs can be accessed. Bit 01: If 1, ToPA tables can hold any number of output entries, up to the maximum allowed by the MaskOrTableOffset field of IA32_RTIT_OUTPUT_MASK_PTRS. Bit 02: If 1, indicates support of Single-Range Output scheme. Bit 03: If 1, indicates support of output to Trace Transport subsystem. Bits 30-04: Reserved. Bit 31: If 1, generated packets which contain IP payloads have LIP values, which include the CS base component.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 31-00: Reserved</td>
</tr>
</tbody>
</table>

**Intel Processor Trace Enumeration Sub-leaf (EAX = 14H, ECX = 1)**

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Register</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>14H</td>
<td>EAX</td>
<td>Bits 02-00: Number of configurable Address Ranges for filtering. Bits 15-03: Reserved. Bits 31-16: Bitmap of supported MTC period encodings.</td>
</tr>
<tr>
<td></td>
<td>EBX</td>
<td>Bits 15-00: Bitmap of supported Cycle Threshold value encodings. Bits 31-16: Bitmap of supported Configurable PSB frequency encodings.</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bits 31-00: Reserved</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Bits 31-00: Reserved</td>
</tr>
</tbody>
</table>

**Time Stamp Counter and Core Crystal Clock Information Leaf**

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>15H</td>
<td><strong>NOTES:</strong> If EBX[31:0] is 0, the TSC and “core crystal clock” ratio is not enumerated. EBX[31:0]/EAX[31:0] indicates the ratio of the TSC frequency and the core crystal clock frequency. If ECX is 0, the core crystal clock frequency is not enumerated. “TSC frequency” = “core crystal clock frequency” * EBX/EAX. The core crystal clock may differ from the reference clock, bus clock, or core clock frequencies.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 31-00: An unsigned integer which is the denominator of the TSC/“core crystal clock” ratio.</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 31-00: An unsigned integer which is the numerator of the TSC/“core crystal clock” ratio.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 31-00: An unsigned integer which is the nominal frequency of the core crystal clock in Hz.</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 31-00: Reserved = 0.</td>
</tr>
</tbody>
</table>
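
The formula in the notes, “TSC frequency” = “core crystal clock frequency” * EBX/EAX, can be sketched in C as follows. This is an illustrative sketch rather than an Intel-provided routine: the function name `tsc_hz_from_leaf15` is hypothetical, and it takes the raw leaf 15H register values as plain integers so that it does not depend on any particular CPUID intrinsic. Returning 0 when EAX is 0 is an added guard against division by zero, not something the table specifies.

```c
#include <stdint.h>

/* Hypothetical helper: derive the TSC frequency in Hz from the raw
 * register values returned by CPUID leaf 15H.
 *   eax = denominator of the TSC / core crystal clock ratio
 *   ebx = numerator of the ratio (0 means the ratio is not enumerated)
 *   ecx = nominal core crystal clock frequency in Hz (0 means not enumerated)
 * Returns 0 when the frequency cannot be computed from these values. */
uint64_t tsc_hz_from_leaf15(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
    if (eax == 0 || ebx == 0 || ecx == 0)
        return 0;
    /* Multiply first, in 64 bits, so the product cannot overflow and the
     * division does not truncate early. */
    return ((uint64_t)ecx * ebx) / eax;
}
```

With illustrative (not measured) inputs EAX = 2, EBX = 216, and ECX = 24000000, the function returns 2592000000.
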
### Processor Frequency Information Leaf

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Register</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>16H</td>
<td>EAX</td>
<td>Bits 15-00: Processor Base Frequency (in MHz). Bits 31-16: Reserved = 0.</td>
</tr>
<tr>
<td></td>
<td>EBX</td>
<td>Bits 15-00: Maximum Frequency (in MHz). Bits 31-16: Reserved = 0.</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bits 15-00: Bus (Reference) Frequency (in MHz). Bits 31-16: Reserved = 0.</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Reserved</td>
</tr>
</tbody>
</table>

**NOTES:**

* Data is returned from this interface in accordance with the processor’s specification and does not reflect actual values. Suitable use of this data includes the display of processor information in a like manner to the processor brand string and for determining the appropriate range to use when displaying processor information e.g. frequency history graphs. The returned information should not be used for any other purpose as the returned information does not accurately correlate to information / counters returned by other processor interfaces.

While a processor may support the Processor Frequency Information leaf, fields that return a value of zero are not supported.

### System-On-Chip Vendor Attribute Enumeration Main Leaf (EAX = 17H, ECX = 0)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Register</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>17H</td>
<td>EAX</td>
<td>Bits 31-00: MaxSOCID_Index. Reports the maximum input value of supported sub-leaf in leaf 17H.</td>
</tr>
<tr>
<td></td>
<td>EBX</td>
<td>Bits 15-00: SOC Vendor ID. Bit 16: IsVendorScheme. If 1, the SOC Vendor ID field is assigned via an industry standard enumeration scheme. Otherwise, the SOC Vendor ID field is assigned by Intel. Bits 31-17: Reserved = 0.</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bits 31-00: Project ID. A unique number an SOC vendor assigns to its SOC projects.</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Bits 31-00: Stepping ID. A unique number within an SOC project that an SOC vendor assigns.</td>
</tr>
</tbody>
</table>

**NOTES:**

Leaf 17H main leaf (ECX = 0).  
Leaf 17H output depends on the initial value in ECX.  
Leaf 17H sub-leaves 1 through 3 report the SOC Vendor Brand String.  
Leaf 17H is valid if MaxSOCID_Index >= 3.  
Leaf 17H sub-leaves 4 and above are reserved.

### System-On-Chip Vendor Attribute Enumeration Sub-leaf (EAX = 17H, ECX = 1..3)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Register</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>17H</td>
<td>EAX</td>
<td>Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.</td>
</tr>
<tr>
<td></td>
<td>EBX</td>
<td>Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Bits 31-00: SOC Vendor Brand String. UTF-8 encoded string.</td>
</tr>
</tbody>
</table>

**NOTES:**

Leaf 17H output depends on the initial value in ECX.  
SOC Vendor Brand String is a UTF-8 encoded string padded with trailing bytes of 00H.  
The complete SOC Vendor Brand String is constructed by concatenating in ascending order of EAX:EBX:ECX:EDX and from the sub-leaf 1 fragment towards sub-leaf 3.
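
The concatenation rule above can be sketched in C as follows. The helper name `soc_brand_string` and the array layout are hypothetical, chosen only for illustration. The sketch also assumes that, within each register, the least significant byte holds the earliest character (the same convention used for the processor brand string); the notes above do not state the byte order explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build the SOC Vendor Brand String from the raw
 * register values of leaf 17H sub-leaves 1..3.
 * regs[i][0..3] holds EAX, EBX, ECX, EDX returned for sub-leaf i + 1.
 * out must have room for 49 bytes (48 string bytes plus a terminator). */
void soc_brand_string(const uint32_t regs[3][4], char out[49])
{
    size_t pos = 0;
    for (int leaf = 0; leaf < 3; leaf++) {          /* sub-leaf 1, 2, 3 */
        for (int r = 0; r < 4; r++) {               /* EAX, EBX, ECX, EDX */
            uint32_t v = regs[leaf][r];
            /* Assumption: lowest byte of each register comes first. */
            for (int b = 0; b < 4; b++)
                out[pos++] = (char)((v >> (8 * b)) & 0xFF);
        }
    }
    out[pos] = '\0'; /* terminate even if no 00H padding was present */
}
```

Because the string is padded with trailing 00H bytes, the result can normally be treated as an ordinary NUL-terminated UTF-8 string.
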
Table 1-5. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>System-On-Chip Vendor Attribute Enumeration Sub-leaves (EAX = 17H, ECX &gt; MaxSOCID_Index)</strong></td>
<td></td>
</tr>
</tbody>
</table>
| 17H | **NOTES:**  
| | Leaf 17H output depends on the initial value in ECX.  
| EAX | Bits 31-00: Reserved = 0.  
| EBX | Bits 31-00: Reserved = 0.  
| ECX | Bits 31-00: Reserved = 0.  
| EDX | Bits 31-00: Reserved = 0.  
| **Deterministic Address Translation Parameters Main Leaf (EAX = 18H, ECX = 0)** |
| 18H | **NOTES:**  
| | Each sub-leaf enumerates a different address translation structure. Valid sub-leaves do not need to be contiguous or in any particular order. A valid sub-leaf may be in a higher input ECX value than an invalid sub-leaf or than a valid sub-leaf of a higher or lower-level structure.  
| | If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n exceeds the value that sub-leaf 0 returns in EAX.  
| | * Add one to the return value to get the result.  
| EAX | Bits 31-00: Reports the maximum input value of supported sub-leaf in leaf 18H.  
| EBX | Bit 00: 4K page size entries supported by this structure.  
| | Bit 01: 2MB page size entries supported by this structure.  
| | Bit 02: 4MB page size entries supported by this structure.  
| | Bit 03: 1 GB page size entries supported by this structure.  
| | Bits 07-04: Reserved.  
| | Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).  
| | Bits 15-11: Reserved.  
| | Bits 31-16: W = Ways of associativity.  
| ECX | Bits 31-00: S = Number of Sets.  
| EDX | Bits 04-00: Translation cache type field.  
| | 00000b: Null (indicates this sub-leaf is not valid).  
| | 00001b: Data TLB.  
| | 00010b: Instruction TLB.  
| | 00011b: Unified TLB.  
| | All other encodings are reserved.  
| | Bits 07-05: Translation cache level (starts at 1).  
| | Bit 08: Fully associative structure.  
| | Bits 13-09: Reserved.  
| | Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation cache*  
| | Bits 31-26: Reserved.  
| **Deterministic Address Translation Parameters Sub-leaf (EAX = 18H, ECX ≥ 1)** |
| 18H | **NOTES:**  
| | If ECX contains an invalid sub-leaf index, EAX/EBX/ECX/EDX return 0. Sub-leaf index n is invalid if n exceeds the value that sub-leaf 0 returns in EAX.  
| | * Add one to the return value to get the result.  
| EAX | Bits 31-00: Reserved.  


| EBX | Bit 00: 4K page size entries supported by this structure.  
| | Bit 01: 2MB page size entries supported by this structure.  
| | Bit 02: 4MB page size entries supported by this structure.  
| | Bit 03: 1 GB page size entries supported by this structure.  
| | Bits 07-04: Reserved.  
| | Bits 10-08: Partitioning (0: Soft partitioning between the logical processors sharing this structure).  
| | Bits 15-11: Reserved.  
| | Bits 31-16: W = Ways of associativity.  
| ECX | Bits 31-00: S = Number of Sets.  
| EDX | Bits 04-00: Translation cache type field.  
| | 00000b: Null (indicates this sub-leaf is not valid).  
| | 00001b: Data TLB.  
| | 00010b: Instruction TLB.  
| | 00011b: Unified TLB.  
| | All other encodings are reserved.  
| | Bits 07-05: Translation cache level (starts at 1).  
| | Bit 08: Fully associative structure.  
| | Bits 13-09: Reserved.  
| | Bits 25-14: Maximum number of addressable IDs for logical processors sharing this translation cache*  
| | Bits 31-26: Reserved.  
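
The field layout of a leaf 18H sub-leaf can be unpacked as in the following C sketch. The `struct tlb_info` type and the `decode_leaf18` function are hypothetical names introduced here for illustration; the function works on raw EBX, ECX, and EDX values so that it is independent of how CPUID is invoked.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the fields of one leaf 18H sub-leaf. */
struct tlb_info {
    bool page_4k, page_2m, page_4m, page_1g; /* EBX[3:0] */
    unsigned partitioning;                   /* EBX[10:8] */
    unsigned ways;                           /* EBX[31:16], W */
    uint32_t sets;                           /* ECX[31:0], S */
    unsigned type;      /* EDX[4:0]: 0 null, 1 data, 2 instruction, 3 unified */
    unsigned level;                          /* EDX[7:5], starts at 1 */
    bool fully_associative;                  /* EDX[8] */
    unsigned max_sharing_ids;                /* EDX[25:14] + 1 */
};

/* Decode raw register values; returns false when the translation cache
 * type is Null, meaning the sub-leaf is not valid. */
bool decode_leaf18(uint32_t ebx, uint32_t ecx, uint32_t edx,
                   struct tlb_info *out)
{
    out->page_4k = (ebx >> 0) & 1u;
    out->page_2m = (ebx >> 1) & 1u;
    out->page_4m = (ebx >> 2) & 1u;
    out->page_1g = (ebx >> 3) & 1u;
    out->partitioning = (ebx >> 8) & 0x7u;
    out->ways = (ebx >> 16) & 0xFFFFu;
    out->sets = ecx;
    out->type = edx & 0x1Fu;
    out->level = (edx >> 5) & 0x7u;
    out->fully_associative = (edx >> 8) & 1u;
    /* The footnote says to add one to the returned value. */
    out->max_sharing_ids = ((edx >> 14) & 0xFFFu) + 1u;
    return out->type != 0;
}
```

A caller would typically iterate over sub-leaves 0 through the maximum reported in EAX of sub-leaf 0 and skip any sub-leaf for which the function returns false.
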

#### PCONFIG Information Sub-leaf (EAX = 1BH, ECX ≥ 0)

1BH

**NOTES:**

Leaf 1BH is supported if CPUID.(EAX=07H, ECX=0H):EDX[18] = 1.

For sub-leaves of 1BH, the definition of EDX, ECX, EBX, EAX depends on the sub-type listed below.

* Currently MKTME is the only defined target and is indicated by identifier 1. An identifier of 0 indicates an invalid target. If MKTME is a supported target, the MKTME_KEY_PROGRAM leaf of PCONFIG is available.

EAX Bits 11-00: Sub-type

0: Invalid sub-leaf. On an invalid sub-leaf type returned, subsequent sub-leaves are also invalid. EBX, ECX and EDX all return 0 for this case.

1: Target Identifier. This sub-leaf enumerates PCONFIG targets supported on the platform. Software must scan until an invalid sub-leaf type is returned. EBX, ECX and EDX are defined below for this case.

Bits 31-12: 0

EBX * Identifier of target 3n+1 (where n is the sub-leaf number, the initial value of ECX).

ECX * Identifier of target 3n+2.

EDX * Identifier of target 3n+3.
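
The scan described above (read sub-leaves until an invalid sub-type is returned, collecting target identifiers along the way) can be sketched as follows. To keep the sketch independent of any CPUID intrinsic, it operates on an array of already-captured register values; the function name `pconfig_collect_targets` and the array layout are hypothetical. Skipping sub-types other than 0 and 1 is an assumption, since only those two are defined here.

```c
#include <stddef.h>
#include <stdint.h>

#define PCONFIG_SUBTYPE_INVALID 0u
#define PCONFIG_SUBTYPE_TARGET  1u

/* Hypothetical helper: subleaves[n][0..3] holds EAX, EBX, ECX, EDX
 * returned by CPUID.(EAX=1BH, ECX=n). Copies the non-zero target
 * identifiers into targets[] and returns how many were stored. */
size_t pconfig_collect_targets(const uint32_t subleaves[][4], size_t count,
                               uint32_t *targets, size_t max_targets)
{
    size_t found = 0;
    for (size_t n = 0; n < count; n++) {
        uint32_t subtype = subleaves[n][0] & 0xFFFu;   /* EAX[11:0] */
        if (subtype == PCONFIG_SUBTYPE_INVALID)
            break;      /* all subsequent sub-leaves are invalid too */
        if (subtype != PCONFIG_SUBTYPE_TARGET)
            continue;   /* assumption: ignore sub-types not defined here */
        for (int r = 1; r <= 3; r++) {                 /* EBX, ECX, EDX */
            uint32_t id = subleaves[n][r];
            if (id != 0 && found < max_targets)        /* 0 = invalid target */
                targets[found++] = id;
        }
    }
    return found;
}
```

On a platform that reports MKTME as its only target, the returned list would contain the single identifier 1.
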

#### Unimplemented CPUID Leaf Functions

| 40000000H - 4FFFFFFFH | Invalid. No existing or future CPU will return processor identification or feature information if the initial EAX value is in the range 40000000H to 4FFFFFFFH. |

#### Extended Function CPUID Information

<table>
<thead>
<tr>
<th>80000000H</th>
<th>EAX</th>
<th>Maximum Input Value for Extended Function CPUID Information (see Table 1-6).</th>
</tr>
</thead>
<tbody>
<tr>
<td>EBX</td>
<td></td>
<td>Reserved</td>
</tr>
<tr>
<td>ECX</td>
<td></td>
<td>Reserved</td>
</tr>
<tr>
<td>EDX</td>
<td></td>
<td>Reserved</td>
</tr>
</tbody>
</table>
### Table 1-5. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Register</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000001H</td>
<td>EAX</td>
<td>Extended Processor Signature and Feature Bits.</td>
</tr>
<tr>
<td></td>
<td>EBX</td>
<td>Reserved</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bit 00: LAHF/SAHF available in 64-bit mode. Bits 04-01: Reserved. Bit 05: LZCNT available. Bits 07-06: Reserved. Bit 08: PREFETCHW. Bits 31-09: Reserved.</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Bits 10-00: Reserved. Bit 11: SYSCALL/SYSRET available (when in 64-bit mode). Bits 19-12: Reserved = 0. Bit 20: Execute Disable Bit available. Bits 25-21: Reserved = 0. Bit 26: 1-GByte pages are available if 1. Bit 27: RDTSCP and IA32_TSC_AUX are available if 1. Bit 28: Reserved = 0. Bit 29: Intel® 64 Architecture available if 1. Bits 31-30: Reserved = 0.</td>
</tr>
<tr>
<td>80000002H</td>
<td>EAX</td>
<td>Processor Brand String</td>
</tr>
<tr>
<td></td>
<td>EBX, ECX, EDX</td>
<td>Processor Brand String Continued</td>
</tr>
<tr>
<td>80000003H</td>
<td>EAX, EBX, ECX, EDX</td>
<td>Processor Brand String Continued</td>
</tr>
<tr>
<td>80000004H</td>
<td>EAX, EBX, ECX, EDX</td>
<td>Processor Brand String Continued</td>
</tr>
<tr>
<td>80000005H</td>
<td>EAX, EBX, ECX, EDX</td>
<td>Reserved = 0</td>
</tr>
<tr>
<td>80000006H</td>
<td>EAX, EBX</td>
<td>Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX</td>
<td>Bits 07-00: Cache Line size in bytes. Bits 11-08: Reserved. Bits 15-12: L2 Associativity field *. Bits 31-16: Cache size in 1K units.</td>
</tr>
<tr>
<td></td>
<td>EDX</td>
<td>Reserved = 0</td>
</tr>
</tbody>
</table>

**NOTES:**
- L2 associativity field encodings:
  - 00H - Disabled
  - 01H - Direct mapped
  - 02H - 2-way
  - 04H - 4-way
  - 06H - 8-way
  - 08H - 16-way
  - 0FH - Fully associative
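
The associativity encodings above can be turned into a way count with a small lookup, sketched below in C. The function names and the sentinel return values are hypothetical. The sketch assumes the cache fields are taken from ECX of leaf 80000006H, with the line size in bits 07-00, the associativity field in bits 15-12, and the size in bits 31-16.

```c
#include <stdint.h>

#define L2_FULLY_ASSOCIATIVE (-1)  /* hypothetical sentinel for 0FH */
#define L2_ASSOC_RESERVED    (-2)  /* hypothetical sentinel for unlisted encodings */

/* Number of ways encoded in ECX[15:12]; 0 means the L2 cache is disabled. */
int l2_ways(uint32_t ecx)
{
    switch ((ecx >> 12) & 0xFu) {
    case 0x0: return 0;   /* disabled */
    case 0x1: return 1;   /* direct mapped */
    case 0x2: return 2;
    case 0x4: return 4;
    case 0x6: return 8;
    case 0x8: return 16;
    case 0xF: return L2_FULLY_ASSOCIATIVE;
    default:  return L2_ASSOC_RESERVED;
    }
}

/* Cache line size in bytes, ECX[7:0]. */
unsigned l2_line_bytes(uint32_t ecx) { return ecx & 0xFFu; }

/* Cache size in 1K units, ECX[31:16]. */
unsigned l2_size_kb(uint32_t ecx) { return ecx >> 16; }
```

For an illustrative ECX value of 01006040H, these helpers report a 64-byte line, 8-way associativity, and a 256-KByte cache.
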

Table 1-5. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000007H</td>
<td>EAX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EDX: Bits 07-00: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>Bit 08: Invariant TSC available if 1</td>
</tr>
<tr>
<td></td>
<td>Bits 31-09: Reserved = 0</td>
</tr>
<tr>
<td>80000008H</td>
<td>EAX: Virtual/Physical Address size</td>
</tr>
<tr>
<td></td>
<td>Bits 07-00: #Physical Address Bits*</td>
</tr>
<tr>
<td></td>
<td>Bits 15-08: #Virtual Address Bits</td>
</tr>
<tr>
<td></td>
<td>Bits 31-16: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Bits 08-00: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>Bit 09: WBNOINVD is available if 1</td>
</tr>
<tr>
<td></td>
<td>Bits 31-10: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EDX: Reserved = 0</td>
</tr>
</tbody>
</table>

NOTES:
* If CPUID.80000008H:EAX[7:0] is supported, the maximum physical address number supported should come from this field.
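
As a brief illustration of how the address-size fields might be consumed, the following C sketch extracts both widths from a raw EAX value of leaf 80000008H and derives the highest physical address. The function names are hypothetical, and the handling of widths of 0 or 64 and above is an added guard, not part of the table.

```c
#include <stdint.h>

/* EAX[7:0]: number of physical address bits. */
unsigned phys_addr_bits(uint32_t eax) { return eax & 0xFFu; }

/* EAX[15:8]: number of virtual (linear) address bits. */
unsigned virt_addr_bits(uint32_t eax) { return (eax >> 8) & 0xFFu; }

/* Highest physical address representable with the reported width.
 * Guard (assumption): widths of 0 or >= 64 yield 0 or UINT64_MAX so the
 * shift below is always well defined. */
uint64_t max_phys_addr(uint32_t eax)
{
    unsigned bits = phys_addr_bits(eax);
    if (bits == 0)
        return 0;
    if (bits >= 64)
        return UINT64_MAX;
    return ((uint64_t)1 << bits) - 1;
}
```

For an illustrative EAX value of 00003027H, the helpers report 39 physical and 48 virtual address bits.
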

INPUT EAX = 0H: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification String

When CPUID executes with EAX set to 0H, the processor returns the highest value the CPUID recognizes for returning basic processor information. The value is returned in the EAX register (see Table 1-6) and is processor specific.

A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “GenuineIntel” and is expressed:

- EBX ← 756e6547h (*“Genu”, with G in BL*)
- EDX ← 49656e69h (*“ineI”, with i in DL*)
- ECX ← 6c65746eh (*“ntel”, with n in CL*)
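
The register encoding above can be turned back into a printable string as in the following C sketch. The function name `cpuid_vendor_string` is hypothetical; it accepts the three raw register values so that it does not depend on a specific CPUID intrinsic. Note the EBX, EDX, ECX ordering, which differs from the usual register order.

```c
#include <stdint.h>

/* Hypothetical helper: assemble the 12-character vendor identification
 * string from the values CPUID leaf 0H returns in EBX, EDX, and ECX.
 * out must have room for 13 bytes (12 characters plus a terminator). */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                         char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };  /* string order */
    for (int i = 0; i < 3; i++) {
        /* The first character of each group sits in the lowest byte. */
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (char)((regs[i] >> (8 * b)) & 0xFF);
    }
    out[12] = '\0';
}
```

Passing the three values listed above yields the string “GenuineIntel”.
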

INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information

When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recognizes for returning extended processor information. The value is returned in the EAX register (see Table 1-6) and is processor specific.

Table 1-6. Highest CPUID Source Operand for Intel 64 and IA-32 Processors

<table>
<thead>
<tr>
<th>Intel 64 or IA-32 Processors</th>
<th>Highest Value in EAX</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Basic Information</td>
</tr>
<tr>
<td>Earlier Intel486 Processors</td>
<td>CPUID Not Implemented</td>
</tr>
<tr>
<td>Later Intel486 Processors and Pentium Processors</td>
<td>01H</td>
</tr>
<tr>
<td>Pentium Pro and Pentium II Processors, Intel® Celeron® Processors</td>
<td>02H</td>
</tr>
<tr>
<td>Pentium III Processors</td>
<td>03H</td>
</tr>
<tr>
<td>Pentium 4 Processors</td>
<td>02H</td>
</tr>
</tbody>
</table>

**IA32_BIOS_SIGN_ID Returns Microcode Update Signature**

For processors that support the microcode update facility, the IA32_BIOS_SIGN_ID MSR is loaded with the update signature whenever CPUID executes. The signature is returned in the upper DWORD. For details, see Chapter 10 in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A*.

**INPUT EAX = 01H: Returns Model, Family, Stepping Information**

When CPUID executes with EAX set to 01H, version information is returned in EAX (see Figure 1-3). For example, the model, family, and processor type for the Intel Xeon processor 5100 series are as follows:

- **Model**: 1111B
- **Family**: 0110B
- **Processor Type**: 00B

See Table 1-7 for available processor type values. Stepping IDs are provided as needed.
NOTE


The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display using the following rule:

\[
\begin{aligned}
&\text{IF Family\_ID} \neq \text{0FH} \\
&\quad \text{THEN Displayed\_Family} = \text{Family\_ID}; \\
&\quad \text{ELSE Displayed\_Family} = \text{Extended\_Family\_ID} + \text{Family\_ID}; \\
&\quad (* \text{ Right justify and zero-extend 4-bit field. } *) \\
&\text{FI;} \\
&(* \text{ Show Displayed\_Family as HEX field. } *)
\end{aligned}
\]

The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a display using the following rule:

\[
\begin{aligned}
&\text{IF (Family\_ID = 06H or Family\_ID = 0FH)} \\
&\quad \text{THEN Displayed\_Model} = (\text{Extended\_Model\_ID} \ll 4) + \text{Model\_ID}; \\
&\quad (* \text{ Right justify and zero-extend 4-bit field; display Model\_ID as HEX field. } *) \\
&\quad \text{ELSE Displayed\_Model} = \text{Model\_ID}; \\
&\text{FI;} \\
&(* \text{ Show Displayed\_Model as HEX field. } *)
\end{aligned}
\]
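
The two display rules can be expressed directly in C, as in the sketch below. The function names are hypothetical. The bit positions used to extract each field (Model ID in bits 7:4, Family ID in bits 11:8, Extended Model ID in bits 19:16, Extended Family ID in bits 27:20) are those of the EAX version information layout referenced as Figure 1-3, which is not reproduced in this section.

```c
#include <stdint.h>

/* Displayed family: the Extended Family ID is added only when the
 * Family ID is 0FH. */
unsigned displayed_family(uint32_t eax)
{
    unsigned family     = (eax >> 8) & 0xFu;
    unsigned ext_family = (eax >> 20) & 0xFFu;
    return (family != 0x0Fu) ? family : ext_family + family;
}

/* Displayed model: the Extended Model ID supplies the upper four bits
 * only when the Family ID is 06H or 0FH. */
unsigned displayed_model(uint32_t eax)
{
    unsigned family    = (eax >> 8) & 0xFu;
    unsigned model     = (eax >> 4) & 0xFu;
    unsigned ext_model = (eax >> 16) & 0xFu;
    if (family == 0x06u || family == 0x0Fu)
        return (ext_model << 4) + model;
    return model;
}
```

For an illustrative signature of 000506E3H, the functions return family 06H and model 5EH; for 000006F6H they return family 06H and model 0FH.
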

**INPUT EAX = 01H: Returns Additional Information in EBX**

When CPUID executes with EAX set to 01H, additional information is returned to the EBX register:

- Brand index (low byte of EBX) — this number provides an entry into a brand string table that contains brand strings for IA-32 processors. More information about this field is provided later in this section.
- CLFLUSH instruction cache line size (second byte of EBX) — this number indicates the size of the cache line flushed with CLFLUSH instruction in 8-byte increments. This field was introduced in the Pentium 4 processor.
- Local APIC ID (high byte of EBX) — this number is the 8-bit ID that is assigned to the local APIC on the processor during power up. This field was introduced in the Pentium 4 processor.
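
The three EBX fields listed above can be extracted as in this short C sketch. The function names are hypothetical; the CLFLUSH helper applies the 8-byte scaling described in the list.

```c
#include <stdint.h>

/* Brand index: low byte of EBX. */
unsigned brand_index(uint32_t ebx) { return ebx & 0xFFu; }

/* CLFLUSH line size in bytes: second byte of EBX, reported in
 * 8-byte increments. */
unsigned clflush_line_bytes(uint32_t ebx) { return ((ebx >> 8) & 0xFFu) * 8u; }

/* Local APIC ID assigned during power up: high byte of EBX. */
unsigned local_apic_id(uint32_t ebx) { return (ebx >> 24) & 0xFFu; }
```

For an illustrative EBX value of 02100800H, the helpers return a brand index of 0, a 64-byte CLFLUSH line, and a local APIC ID of 2.
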

**INPUT EAX = 01H: Returns Feature Information in ECX and EDX**

When CPUID executes with EAX set to 01H, feature information is returned in ECX and EDX.

- Figure 1-4 and Table 1-8 show encodings for ECX.
- Figure 1-5 and Table 1-9 show encodings for EDX.

For all feature flags, a 1 indicates that the feature is supported. Use Intel documentation to properly interpret feature flags.

NOTE

Software must confirm that a processor feature is present using feature flags returned by CPUID prior to using the feature. Software should not depend on future offerings retaining all features.
**Table 1-8. Feature Information Returned in the ECX Register**

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>SSE3</td>
<td>Streaming SIMD Extensions 3 (SSE3). A value of 1 indicates the processor supports this technology.</td>
</tr>
<tr>
<td>1</td>
<td>PCLMULQDQ</td>
<td>A value of 1 indicates the processor supports PCLMULQDQ instruction.</td>
</tr>
<tr>
<td>2</td>
<td>DTES64</td>
<td>64-bit DS Area. A value of 1 indicates the processor supports DS area using 64-bit layout.</td>
</tr>
<tr>
<td>3</td>
<td>MONITOR</td>
<td>MONITOR/MWAIT. A value of 1 indicates the processor supports this feature.</td>
</tr>
<tr>
<td>4</td>
<td>DS-CPL</td>
<td>CPL Qualified Debug Store. A value of 1 indicates the processor supports the extensions to the Debug Store feature to allow for branch message storage qualified by CPL.</td>
</tr>
<tr>
<td>5</td>
<td>VMX</td>
<td>Virtual Machine Extensions. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>6</td>
<td>SMX</td>
<td>Safer Mode Extensions. A value of 1 indicates that the processor supports this technology. See Chapter 6, &quot;Safer Mode Extensions Reference&quot;.</td>
</tr>
<tr>
<td>7</td>
<td>EST</td>
<td>Enhanced Intel SpeedStep® Technology. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>8</td>
<td>TM2</td>
<td>Thermal Monitor 2. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>9</td>
<td>SSSE3</td>
<td>A value of 1 indicates the presence of the Supplemental Streaming SIMD Extensions 3 (SSSE3). A value of 0 indicates the instruction extensions are not present in the processor.</td>
</tr>
</tbody>
</table>
### Table 1-8. Feature Information Returned in the ECX Register (Continued)

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td>CNXT-ID</td>
<td><strong>L1 Context ID.</strong> A value of 1 indicates the L1 data cache mode can be set to either adaptive mode or shared mode. A value of 0 indicates this feature is not supported. See definition of the IA32_MISC_ENABLE MSR Bit 24 (L1 Data Cache Context Mode) for details.</td>
</tr>
<tr>
<td>11</td>
<td>SDBG</td>
<td>A value of 1 indicates the processor supports IA32_DEBUG_INTERFACE MSR for silicon debug.</td>
</tr>
<tr>
<td>12</td>
<td>FMA</td>
<td>A value of 1 indicates the processor supports FMA extensions using YMM state.</td>
</tr>
<tr>
<td>13</td>
<td>CMPXCHG16B</td>
<td><strong>CMPXCHG16B Available.</strong> A value of 1 indicates that the feature is available.</td>
</tr>
<tr>
<td>14</td>
<td>xTPR Update Control</td>
<td><strong>xTPR Update Control.</strong> A value of 1 indicates that the processor supports changing IA32_MISC_ENABLE[bit 23].</td>
</tr>
<tr>
<td>15</td>
<td>PDCM</td>
<td><strong>Perfmon and Debug Capability.</strong> A value of 1 indicates the processor supports the performance and debug feature indication MSR IA32_PERF_CAPABILITIES.</td>
</tr>
<tr>
<td>16</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>17</td>
<td>PCID</td>
<td><strong>Process-context identifiers.</strong> A value of 1 indicates that the processor supports PCIDs and that software may set CR4.PCIDE to 1.</td>
</tr>
<tr>
<td>18</td>
<td>DCA</td>
<td>A value of 1 indicates the processor supports the ability to prefetch data from a memory mapped device.</td>
</tr>
<tr>
<td>19</td>
<td>SSE4.1</td>
<td>A value of 1 indicates that the processor supports SSE4.1.</td>
</tr>
<tr>
<td>20</td>
<td>SSE4.2</td>
<td>A value of 1 indicates that the processor supports SSE4.2.</td>
</tr>
<tr>
<td>21</td>
<td>x2APIC</td>
<td>A value of 1 indicates that the processor supports x2APIC feature.</td>
</tr>
<tr>
<td>22</td>
<td>MOVBE</td>
<td>A value of 1 indicates that the processor supports MOVBE instruction.</td>
</tr>
<tr>
<td>23</td>
<td>POPCNT</td>
<td>A value of 1 indicates that the processor supports the POPCNT instruction.</td>
</tr>
<tr>
<td>24</td>
<td>TSC-Deadline</td>
<td>A value of 1 indicates that the processor’s local APIC timer supports one-shot operation using a TSC deadline value.</td>
</tr>
<tr>
<td>25</td>
<td>AES</td>
<td>A value of 1 indicates that the processor supports the AESNI instruction extensions.</td>
</tr>
<tr>
<td>26</td>
<td>XSAVE</td>
<td>A value of 1 indicates that the processor supports the XSAVE/XRSTOR processor extended states feature, the XSETBV/XGETBV instructions, and XCR0.</td>
</tr>
<tr>
<td>27</td>
<td>OSXSAVE</td>
<td>A value of 1 indicates that the OS has set CR4.OSXSAVE[bit 18] to enable XSETBV/XGETBV instructions to access XCR0 and to support processor extended state management using XSAVE/XRSTOR.</td>
</tr>
<tr>
<td>28</td>
<td>AVX</td>
<td>A value of 1 indicates that processor supports AVX instructions operating on 256-bit YMM state, and three-operand encoding of 256-bit and 128-bit SIMD instructions.</td>
</tr>
<tr>
<td>29</td>
<td>F16C</td>
<td>A value of 1 indicates that processor supports 16-bit floating-point conversion instructions.</td>
</tr>
<tr>
<td>30</td>
<td>RDRAND</td>
<td>A value of 1 indicates that processor supports RDRAND instruction.</td>
</tr>
<tr>
<td>31</td>
<td>Not Used</td>
<td>Always returns 0.</td>
</tr>
</tbody>
</table>
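
Testing the ECX feature flags in Table 1-8 comes down to checking individual bits, as the following C sketch shows. The enumerator and function names are hypothetical, and only a few of the table's bits are named. The `avx_flags_present` helper is a first-stage check only: a complete AVX detection sequence also inspects XCR0 with XGETBV, which is outside the scope of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in ECX for CPUID leaf 01H, taken from Table 1-8. */
enum cpuid1_ecx_bit {
    CPUID1_ECX_SSE3    = 0,
    CPUID1_ECX_FMA     = 12,
    CPUID1_ECX_POPCNT  = 23,
    CPUID1_ECX_XSAVE   = 26,
    CPUID1_ECX_OSXSAVE = 27,
    CPUID1_ECX_AVX     = 28
};

/* True if the given feature bit is set in the raw ECX value. */
bool cpuid1_ecx_has(uint32_t ecx, enum cpuid1_ecx_bit bit)
{
    return ((ecx >> bit) & 1u) != 0;
}

/* True if XSAVE, OSXSAVE, and AVX are all reported: the processor
 * supports AVX and the OS has set CR4.OSXSAVE. */
bool avx_flags_present(uint32_t ecx)
{
    return cpuid1_ecx_has(ecx, CPUID1_ECX_XSAVE) &&
           cpuid1_ecx_has(ecx, CPUID1_ECX_OSXSAVE) &&
           cpuid1_ecx_has(ecx, CPUID1_ECX_AVX);
}
```

Keeping the bit positions in a single enumeration makes it easier to compare the code against the table when new flags are added.
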
Figure 1-5. Feature Information Returned in the EDX Register

Table 1-9. More on Feature Information Returned in the EDX Register

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>FPU</td>
<td>Floating-point Unit On-Chip. The processor contains an x87 FPU.</td>
</tr>
<tr>
<td>1</td>
<td>VME</td>
<td>Virtual 8086 Mode Enhancements. Virtual 8086 mode enhancements, including CR4.VME for controlling the feature, CR4.PVI for protected mode virtual interrupts, software interrupt indirection, expansion of the TSS with the software indirection bitmap, and EFLAGS.VIF and EFLAGS.VIP flags.</td>
</tr>
<tr>
<td>2</td>
<td>DE</td>
<td>Debugging Extensions. Support for I/O breakpoints, including CR4.DE for controlling the feature, and optional trapping of accesses to DR4 and DR5.</td>
</tr>
<tr>
<td>3</td>
<td>PSE</td>
<td>Page Size Extension. Large pages of size 4 MByte are supported, including CR4.PSE for controlling the feature, the defined dirty bit in PDE (Page Directory Entries), optional reserved bit trapping in CR3, PDEs, and PTEs.</td>
</tr>
<tr>
<td>4</td>
<td>TSC</td>
<td>Time Stamp Counter. The RDTSC instruction is supported, including CR4.TSD for controlling privilege.</td>
</tr>
<tr>
<td>5</td>
<td>MSR</td>
<td>Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR and WRMSR instructions are supported. Some of the MSRs are implementation dependent.</td>
</tr>
</tbody>
</table>
### 36-Bit Page Size Extension

4-MByte pages addressing physical memory beyond 4 GBytes are supported with 32-bit paging. This feature indicates that upper bits of the physical address of a 4-MByte page are encoded in bits 20:13 of the page-directory entry. Such physical addresses are limited by MAXPHYADDR and may be up to 40 bits in size.

### Processor Serial Number

The processor supports the 96-bit processor identification number feature and the feature is enabled.

### CLFLUSH Instruction

CLFLUSH Instruction is supported.

### Debug Store

The processor supports the ability to write debug information into a memory resident buffer. This feature is used by the branch trace store (BTS) and precise event-based sampling (PEBS) facilities (see Chapter 23, “Introduction to Virtual-Machine Extensions,” in the **Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C**).

### Thermal Monitor and Software Controlled Clock Facilities

The processor implements internal MSRs that allow processor temperature to be monitored and processor performance to be modulated in predefined duty cycles under software control.

### Intel MMX Technology

The processor supports the Intel MMX technology.

### FXSAVE and FXRSTOR Instructions

The FXSAVE and FXRSTOR instructions are supported for fast save and restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available for an operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.

---

Table 1-9. More on Feature Information Returned in the EDX Register (Continued)

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>6</td>
<td>PAE</td>
<td><strong>Physical Address Extension.</strong> Physical addresses greater than 32 bits are supported: extended page table entry formats, an extra level in the page translation tables is defined, 2-MByte pages are supported instead of 4-Mbyte pages if PAE bit is 1. The actual number of address bits beyond 32 is not defined, and is implementation specific.</td>
</tr>
<tr>
<td>7</td>
<td>MCE</td>
<td><strong>Machine Check Exception.</strong> Exception 18 is defined for Machine Checks, including CR4.MCE for controlling the feature. This feature does not define the model-specific implementations of machine-check error logging, reporting, and processor shutdowns. Machine Check exception handlers may have to depend on processor version to do model specific processing of the exception, or test for the presence of the Machine Check feature.</td>
</tr>
<tr>
<td>8</td>
<td>CX8</td>
<td><strong>CMPXCHG8B Instruction.</strong> The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly locked and atomic).</td>
</tr>
<tr>
<td>9</td>
<td>APIC</td>
<td><strong>APIC On-Chip.</strong> The processor contains an Advanced Programmable Interrupt Controller (APIC), responding to memory mapped commands in the physical address range FEE00000H to FEE00FFFH (by default; some processors permit the APIC to be relocated).</td>
</tr>
<tr>
<td>10</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>11</td>
<td>SEP</td>
<td><strong>SYSENTER and SYSEXIT Instructions.</strong> The SYSENTER and SYSEXIT instructions and their associated MSRs are supported.</td>
</tr>
<tr>
<td>12</td>
<td>MTRR</td>
<td><strong>Memory Type Range Registers.</strong> MTRRs are supported. The MTRRcap MSR contains feature bits that describe what memory types are supported, how many variable MTRRs are supported, and whether fixed MTRRs are supported.</td>
</tr>
<tr>
<td>13</td>
<td>PGE</td>
<td><strong>Page Global Bit.</strong> The global bit is supported in paging-structure entries that map a page, indicating TLB entries that are common to different processes and need not be flushed. The CR4.PGE bit controls this feature.</td>
</tr>
<tr>
<td>14</td>
<td>MCA</td>
<td><strong>Machine Check Architecture.</strong> The Machine Check Architecture, which provides a compatible mechanism for error reporting in P6 family, Pentium 4, Intel Xeon processors, and future processors, is supported. The MCG_CAP MSR contains feature bits describing how many banks of error reporting MSRs are supported.</td>
</tr>
<tr>
<td>15</td>
<td>CMOV</td>
<td><strong>Conditional Move Instructions.</strong> The conditional move instruction CMOV is supported. In addition, if x87 FPU is present as indicated by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions are supported.</td>
</tr>
<tr>
<td>16</td>
<td>PAT</td>
<td><strong>Page Attribute Table.</strong> Page Attribute Table is supported. This feature augments the Memory Type Range Registers (MTRRs), allowing an operating system to specify attributes of memory accessed through a linear address on a 4KB granularity.</td>
</tr>
<tr>
<td>17</td>
<td>PSE-36</td>
<td><strong>36-Bit Page Size Extension.</strong> 4-MByte pages addressing physical memory beyond 4 GBytes are supported with 32-bit paging. This feature indicates that upper bits of the physical address of a 4-MByte page are encoded in bits 20:13 of the page-directory entry. Such physical addresses are limited by MAXPHYADDR and may be up to 40 bits in size.</td>
</tr>
<tr>
<td>18</td>
<td>PSN</td>
<td><strong>Processor Serial Number.</strong> The processor supports the 96-bit processor identification number feature and the feature is enabled.</td>
</tr>
<tr>
<td>19</td>
<td>CLFSH</td>
<td><strong>CLFLUSH Instruction.</strong> CLFLUSH Instruction is supported.</td>
</tr>
<tr>
<td>20</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>21</td>
<td>DS</td>
<td><strong>Debug Store.</strong> The processor supports the ability to write debug information into a memory resident buffer. This feature is used by the branch trace store (BTS) and precise event-based sampling (PEBS) facilities (see Chapter 23, “Introduction to Virtual-Machine Extensions,” in the <strong>Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C</strong>).</td>
</tr>
<tr>
<td>22</td>
<td>ACPI</td>
<td><strong>Thermal Monitor and Software Controlled Clock Facilities.</strong> The processor implements internal MSRs that allow processor temperature to be monitored and processor performance to be modulated in predefined duty cycles under software control.</td>
</tr>
<tr>
<td>23</td>
<td>MMX</td>
<td><strong>Intel MMX Technology.</strong> The processor supports the Intel MMX technology.</td>
</tr>
<tr>
<td>24</td>
<td>FXSR</td>
<td><strong>FXSAVE and FXRSTOR Instructions.</strong> The FXSAVE and FXRSTOR instructions are supported for fast save and restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available for an operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.</td>
</tr>
</tbody>
</table>
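
As a minimal sketch of how software might test these flags, the following Python maps a raw EDX value (as returned by CPUID with EAX = 01H) to the mnemonics listed in this part of Table 1-9. The dictionary `EDX_FEATURES` and the function `decode_edx_features` are illustrative names, not part of any Intel interface, and the EDX value is assumed to have been obtained elsewhere.

```python
# Bit positions and mnemonics copied from Table 1-9 (bits 6 through 24 only).
# Reserved bits (10 and 20) are deliberately absent.
EDX_FEATURES = {
    6: "PAE", 7: "MCE", 8: "CX8", 9: "APIC", 11: "SEP", 12: "MTRR",
    13: "PGE", 14: "MCA", 15: "CMOV", 16: "PAT", 17: "PSE-36", 18: "PSN",
    19: "CLFSH", 21: "DS", 22: "ACPI", 23: "MMX", 24: "FXSR",
}


def decode_edx_features(edx):
    """Return the mnemonics whose bits are set in a CPUID.01H:EDX value."""
    return [name for bit, name in sorted(EDX_FEATURES.items())
            if (edx >> bit) & 1]
```

Reserved bits are ignored here, so a value with only bit 10 set decodes to an empty list.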

**INPUT EAX = 02H: Cache and TLB Information Returned in EAX, EBX, ECX, EDX**

When CPUID executes with EAX set to 02H, the processor returns information about the processor’s internal caches and TLBs in the EAX, EBX, ECX, and EDX registers.

The encoding is as follows:

- The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction must be executed with an input value of 02H to get a complete description of the processor’s caches and TLBs. The first member of the family of Pentium 4 processors will return a 01H.
- The most significant bit (bit 31) of each register indicates whether the register contains valid information (set to 0) or is reserved (set to 1).
- If a register contains valid information, the information is contained in 1 byte descriptors. Table 1-10 shows the encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX registers is not defined; that is, specific bytes are not designated to contain descriptors for specific cache or TLB types. The descriptors may appear in any order.

### Table 1-10. Encoding of Cache and TLB Descriptors

<table>
<thead>
<tr>
<th>Descriptor Value</th>
<th>Cache or TLB Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>00H</td>
<td>Null descriptor</td>
</tr>
<tr>
<td>01H</td>
<td>Instruction TLB: 4 KByte pages, 4-way set associative, 32 entries</td>
</tr>
<tr>
<td>02H</td>
<td>Instruction TLB: 4 MByte pages, 4-way set associative, 2 entries</td>
</tr>
<tr>
<td>03H</td>
<td>Data TLB: 4 KByte pages, 4-way set associative, 64 entries</td>
</tr>
<tr>
<td>04H</td>
<td>Data TLB: 4 MByte pages, 4-way set associative, 8 entries</td>
</tr>
<tr>
<td>05H</td>
<td>Data TLB1: 4 MByte pages, 4-way set associative, 32 entries</td>
</tr>
<tr>
<td>06H</td>
<td>1st-level instruction cache: 8 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>08H</td>
<td>1st-level instruction cache: 16 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>0AH</td>
<td>1st-level data cache: 8 KBytes, 2-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>0BH</td>
<td>Instruction TLB: 4 MByte pages, 4-way set associative, 4 entries</td>
</tr>
<tr>
<td>0CH</td>
<td>1st-level data cache: 16 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>22H</td>
<td>3rd-level cache: 512 KBytes, 4-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>23H</td>
<td>3rd-level cache: 1 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>25H</td>
<td>3rd-level cache: 2 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
</tbody>
</table>

**Pending Break Enable (PBE).** The processor supports the use of the FERR#/PBE# pin when the processor is in the stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the processor should return to normal operation to handle the interrupt. Bit 10 (PBE enable) in the IA32_MISC_ENABLE MSR enables this capability.

### Table 1-10. Encoding of Cache and TLB Descriptors (Continued)

<table>
<thead>
<tr>
<th>Descriptor Value</th>
<th>Cache or TLB Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>29H</td>
<td>3rd-level cache: 4 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>2CH</td>
<td>1st-level data cache: 32 KBytes, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>30H</td>
<td>1st-level instruction cache: 32 KBytes, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>40H</td>
<td>No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache</td>
</tr>
<tr>
<td>41H</td>
<td>2nd-level cache: 128 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>42H</td>
<td>2nd-level cache: 256 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>43H</td>
<td>2nd-level cache: 512 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>44H</td>
<td>2nd-level cache: 1 MByte, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>45H</td>
<td>2nd-level cache: 2 MByte, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>46H</td>
<td>3rd-level cache: 4 MByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>47H</td>
<td>3rd-level cache: 8 MByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>49H</td>
<td>3rd-level cache: 4 MB, 16-way set associative, 64-byte line size (Intel Xeon processor MP, Family 0FH, Model 06H); 2nd-level cache: 4 MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4AH</td>
<td>3rd-level cache: 6 MByte, 12-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4BH</td>
<td>3rd-level cache: 8 MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4CH</td>
<td>3rd-level cache: 12 MByte, 12-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4DH</td>
<td>3rd-level cache: 16 MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4EH</td>
<td>2nd-level cache: 6 MByte, 24-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>50H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries</td>
</tr>
<tr>
<td>51H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 128 entries</td>
</tr>
<tr>
<td>52H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 256 entries</td>
</tr>
<tr>
<td>56H</td>
<td>Data TLB0: 4 MByte pages, 4-way set associative, 16 entries</td>
</tr>
<tr>
<td>57H</td>
<td>Data TLB0: 4 KByte pages, 4-way associative, 16 entries</td>
</tr>
<tr>
<td>5BH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 64 entries</td>
</tr>
<tr>
<td>5CH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 128 entries</td>
</tr>
<tr>
<td>5DH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 256 entries</td>
</tr>
<tr>
<td>60H</td>
<td>1st-level data cache: 16 KByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>66H</td>
<td>1st-level data cache: 8 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>67H</td>
<td>1st-level data cache: 16 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>68H</td>
<td>1st-level data cache: 32 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>70H</td>
<td>Trace cache: 12 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>71H</td>
<td>Trace cache: 16 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>72H</td>
<td>Trace cache: 32 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>78H</td>
<td>2nd-level cache: 1 MByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>79H</td>
<td>2nd-level cache: 128 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7AH</td>
<td>2nd-level cache: 256 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7BH</td>
<td>2nd-level cache: 512 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7CH</td>
<td>2nd-level cache: 1 MByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7DH</td>
<td>2nd-level cache: 2 MByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>7FH</td>
<td>2nd-level cache: 512 KByte, 2-way set associative, 64-byte line size</td>
</tr>
<tr>
<td>82H</td>
<td>2nd-level cache: 256 KByte, 8-way set associative, 32 byte line size</td>
</tr>
</tbody>
</table>
Example 1-1. Example of Cache and TLB Interpretation

The first member of the family of Pentium 4 processors returns the following information about caches and TLBs when CPUID executes with an input value of 2:

```
EAX 66 5B 50 01H
EBX 0H
ECX 0H
EDX 00 7A 70 00H
```

Which means:

- The least-significant byte (byte 0) of register EAX is set to 01H. This indicates that CPUID needs to be executed once with an input value of 2 to retrieve complete information about caches and TLBs.
- The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register contains valid 1-byte descriptors.
- Bytes 1, 2, and 3 of register EAX indicate that the processor has:
  - 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
  - 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
  - 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
- The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
- Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
  - 00H - NULL descriptor.
  - 70H - Trace cache: 12 K-μop, 8-way set associative.
  - 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
  - 00H - NULL descriptor.
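
The interpretation rules above can be sketched in a few lines of Python. The function name `leaf2_descriptors` and the partial lookup table `DESCRIPTORS` are illustrative assumptions; the register values are those of Example 1-1, and the descriptions are copied from Table 1-10 and Example 1-1.

```python
# A few descriptor meanings taken from Table 1-10 and Example 1-1.
# A complete implementation would list every entry of Table 1-10.
DESCRIPTORS = {
    0x50: "Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries",
    0x5B: "Data TLB: 4 KByte and 4 MByte pages, 64 entries",
    0x66: "1st-level data cache: 8 KByte, 4-way set associative, 64 byte line size",
    0x70: "Trace cache: 12 K-μop, 8-way set associative",
    0x7A: "2nd-level cache: 256 KByte, 8-way set associative, "
          "64 byte line size, 2 lines per sector",
}


def leaf2_descriptors(eax, ebx, ecx, edx):
    """Collect the non-null 1-byte descriptors from one CPUID.02H result."""
    found = []
    for index, reg in enumerate((eax, ebx, ecx, edx)):
        if reg & 0x80000000:          # bit 31 set: register is reserved
            continue
        for byte in range(4):
            if index == 0 and byte == 0:
                continue              # AL is the iteration count, not a descriptor
            value = (reg >> (8 * byte)) & 0xFF
            if value != 0x00:         # 00H is the null descriptor
                found.append(value)
    return found


regs = (0x665B5001, 0x0, 0x0, 0x007A7000)   # register values from Example 1-1
for d in leaf2_descriptors(*regs):
    print(f"{d:02X}H", DESCRIPTORS.get(d, "(not in this partial table)"))
```

Because the order of descriptors is not defined, the sketch returns them in whatever order they appear and leaves any grouping by cache or TLB type to the caller.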

**INPUT EAX = 04H: Returns Deterministic Cache Parameters for Each Level**

When CPUID executes with EAX set to 04H and ECX contains an index value, the processor returns encoded data that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid index values start from 0.

Software can enumerate the deterministic cache parameters for each level of the cache hierarchy by starting with an index value of 0 and incrementing the index until the cache type field reports 0 (no more caches). The architecturally defined fields reported by deterministic cache parameters are documented in Table 1-5.

CPUID leaf 4 also reports data that can be used to derive the topology of processor cores in a physical package. This information is constant for all valid index values. Software can query the raw data reported by executing CPUID with EAX=04H and ECX=0H and use it as part of the topology enumeration algorithm described in Chapter 8, “Multiple-Processor Management,” in the *Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A*.
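
The enumeration loop might look like the following sketch. It assumes a caller-supplied `cpuid(leaf, subleaf)` callable returning `(eax, ebx, ecx, edx)`; that callable is hypothetical, since Python has no built-in CPUID access. The field positions used (cache type in EAX[4:0], level in EAX[7:5], line size, partitions, and ways in EBX, sets in ECX, each stored minus one) are not spelled out in this section and should be checked against Table 1-5.

```python
def enumerate_caches(cpuid):
    """Walk CPUID leaf 04H sub-leaves until the cache type field reads 0.

    `cpuid` is a hypothetical callable: cpuid(leaf, subleaf) -> (eax, ebx, ecx, edx).
    """
    caches = []
    index = 0
    while True:
        eax, ebx, ecx, _edx = cpuid(0x04, index)
        cache_type = eax & 0x1F            # 0 means no more caches
        if cache_type == 0:
            break
        line_size = (ebx & 0xFFF) + 1      # fields are stored as value - 1
        partitions = ((ebx >> 12) & 0x3FF) + 1
        ways = ((ebx >> 22) & 0x3FF) + 1
        sets = ecx + 1
        caches.append({
            "type": cache_type,
            "level": (eax >> 5) & 0x7,
            "size_bytes": ways * partitions * line_size * sets,
        })
        index += 1
    return caches
```

Passing the CPUID access in as a parameter keeps the decoding logic testable on machines where the instruction is not available.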

**INPUT EAX = 05H: Returns MONITOR and MWAIT Features**

When CPUID executes with EAX set to 05H, the processor returns information about features available to MONITOR/MWAIT instructions. The MONITOR instruction is used for address-range monitoring in conjunction with MWAIT instruction. The MWAIT instruction optionally provides additional extensions for advanced power management. See Table 1-5.

**INPUT EAX = 06H: Returns Thermal and Power Management Features**

When CPUID executes with EAX set to 06H, the processor returns information about thermal and power management features. See Table 1-5.

**INPUT EAX = 07H: Returns Structured Extended Feature Enumeration Information**

When CPUID executes with EAX set to 07H and ECX = 0H, the processor returns information about the maximum number of sub-leaves that contain extended feature flags. See Table 1-5.

When CPUID executes with EAX set to 07H and ECX = n (n >= 1 and less than the number of non-zero bits in CPUID.(EAX=07H, ECX=0H).EAX), the processor returns information about extended feature flags. See Table 1-5. In sub-leaf 0, only EAX has the number of sub-leaves. In sub-leaf 0, EBX, ECX, and EDX all contain extended feature flags.

**Table 1-11. Structured Extended Feature Leaf, Function 0, EBX Register**

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>RwFSGSBASE</td>
<td>A value of 1 indicates the processor supports RD/WR FSGSBASE instructions</td>
</tr>
<tr>
<td>1-31</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
</tbody>
</table>

**INPUT EAX = 09H: Returns Direct Cache Access Information**

When CPUID executes with EAX set to 09H, the processor returns information about Direct Cache Access capabilities. See Table 1-5.

**INPUT EAX = 0AH: Returns Architectural Performance Monitoring Features**

When CPUID executes with EAX set to 0AH, the processor returns information about support for architectural performance monitoring capabilities. Architectural performance monitoring is supported if the version ID (see Table 1-5) is greater than 0.

For each version of architectural performance monitoring capability, software must enumerate this leaf to discover the programming facilities and the architectural performance events available in the processor. The details are described in Chapter 17, “Debug, Branch Profile, TSC, and Quality of Service,” in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A*.

**INPUT EAX = 0BH: Returns Extended Topology Information**

When CPUID executes with EAX set to 0BH, the processor returns information about extended topology enumeration data. Software must detect the presence of CPUID leaf 0BH by verifying (a) the highest leaf index supported by CPUID is >= 0BH, and (b) CPUID.0BH:EBX[15:0] reports a non-zero value. See Table 1-5.
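
The two-step presence check can be sketched as follows, again assuming a hypothetical caller-supplied `cpuid(leaf, subleaf)` callable that returns `(eax, ebx, ecx, edx)`. The function name `has_leaf_0bh` is illustrative.

```python
def has_leaf_0bh(cpuid):
    """Return True if extended topology leaf 0BH is present and populated.

    `cpuid` is a hypothetical callable: cpuid(leaf, subleaf) -> (eax, ebx, ecx, edx).
    """
    max_basic_leaf = cpuid(0x00, 0)[0]     # (a) highest basic leaf supported
    if max_basic_leaf < 0x0B:
        return False
    ebx = cpuid(0x0B, 0)[1]
    return (ebx & 0xFFFF) != 0             # (b) EBX[15:0] must be non-zero
```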

**INPUT EAX = 0DH: Returns Processor Extended States Enumeration Information**

When CPUID executes with EAX set to 0DH and ECX = 0H, the processor returns information about the bit-vector representation of all processor state extensions that are supported in the processor and storage size requirements of the XSAVE/XRSTOR area. See Table 1-5.

When CPUID executes with EAX set to 0DH and ECX = n (n > 1, and is a valid sub-leaf index), the processor returns information about the size and offset of each processor extended state save area within the XSAVE/XRSTOR area. See Table 1-5. Software can use the forward-extendable technique depicted below to query the valid sub-leaves and obtain size and offset information for each processor extended state save area:

For i = 2 to 62 // sub-leaf 1 is reserved
   IF (CPUID.(EAX=0DH, ECX=0):VECTOR[i] = 1 ) // VECTOR is the 64-bit value of EDX:EAX
      Execute CPUID.(EAX=0DH, ECX = i) to examine size and offset for sub-leaf i;
   FI;
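
The same loop can be written as a runnable sketch. The `cpuid(leaf, subleaf)` callable is a hypothetical stand-in for real CPUID access, and the assumption that sub-leaf n returns the size in EAX and the offset in EBX should be confirmed against Table 1-5.

```python
def xsave_components(cpuid):
    """Map each valid sub-leaf index of leaf 0DH to its (size, offset) pair.

    `cpuid` is a hypothetical callable: cpuid(leaf, subleaf) -> (eax, ebx, ecx, edx).
    """
    eax, _ebx, _ecx, edx = cpuid(0x0D, 0)
    vector = (edx << 32) | eax             # VECTOR is the 64-bit value EDX:EAX
    components = {}
    for i in range(2, 63):                 # i = 2 to 62; sub-leaf 1 is reserved
        if (vector >> i) & 1:
            size, offset, _c, _d = cpuid(0x0D, i)
            components[i] = (size, offset)
    return components
```

Because the loop is driven by the bit vector rather than a fixed list, state components added by future processors are picked up without code changes.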

**INPUT EAX = 0FH: Returns Platform Quality of Service (PQoS) Monitoring Enumeration Information**

When CPUID executes with EAX set to 0FH and ECX = 0, the processor returns information about the bit-vector representation of QoS monitoring resource types that are supported in the processor and maximum range of RMID values the processor can use to monitor any supported resource types. Each bit, starting from bit 1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that software must use to query QoS monitoring capability available for that type. See Table 1-5.

When CPUID executes with EAX set to 0FH and ECX = n (n >= 1, and is a valid ResID), the processor returns information software can use to program the IA32_PQR_ASSOC and IA32_QM_EVTSEL MSRs before reading QoS data from the IA32_QM_CTR MSR.

**INPUT EAX = 10H: Returns Platform Quality of Service (PQoS) Enforcement Enumeration Information**

When CPUID executes with EAX set to 10H and ECX = 0, the processor returns information about the bit-vector representation of QoS Enforcement resource types that are supported in the processor. Each bit, starting from bit 1, corresponds to a specific resource type if the bit is set. The bit position corresponds to the sub-leaf index (or ResID) that software must use to query QoS enforcement capability available for that type. See Table 1-5.

When CPUID executes with EAX set to 10H and ECX = n (n >= 1, and is a valid ResID), the processor returns information about the available classes of service and the range of QoS mask MSRs that software can use to configure each class of service using capability bit masks in the QoS Mask registers, IA32_resourceType_Mask_n.
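
For both the monitoring leaf (0FH) and the enforcement leaf (10H), the sub-leaf 0 bit vector translates directly into the list of ResIDs to query. A minimal sketch, assuming the bit vector has already been read from the register that Table 1-5 assigns to it; the function name `resource_ids` is illustrative:

```python
def resource_ids(bit_vector):
    """Return the ResIDs (sub-leaf indices) whose bits are set.

    Resource types start at bit 1, so bit 0 is ignored.
    """
    return [bit for bit in range(1, 32) if (bit_vector >> bit) & 1]
```

Each returned value can then be used as the ECX input when re-executing CPUID with the same EAX value.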

**INPUT EAX = 12H: Returns Intel SGX Enumeration Information**

When CPUID executes with EAX set to 12H and ECX = 0H, the processor returns information about Intel SGX capabilities. See Table 1-5.

When CPUID executes with EAX set to 12H and ECX = 1H, the processor returns information about Intel SGX attributes. See Table 1-5.

When CPUID executes with EAX set to 12H and ECX = n (n > 1), the processor returns information about Intel SGX Enclave Page Cache. See Table 1-5.

**INPUT EAX = 14H: Returns Intel Processor Trace Enumeration Information**

When CPUID executes with EAX set to 14H and ECX = 0H, the processor returns information about Intel Processor Trace extensions. See Table 1-5.

When CPUID executes with EAX set to 14H and ECX = n (n > 0 and less than the number of non-zero bits in CPUID.(EAX=14H, ECX= 0H).EAX), the processor returns information about packet generation in Intel Processor Trace. See Table 1-5.

**INPUT EAX = 15H: Returns Time Stamp Counter and Core Crystal Clock Information**

When CPUID executes with EAX set to 15H and ECX = 0H, the processor returns information about Time Stamp Counter and Core Crystal Clock. See Table 1-5.

**INPUT EAX = 16H: Returns Processor Frequency Information**
When CPUID executes with EAX set to 16H, the processor returns processor frequency information. See Table 1-5.

**INPUT EAX = 17H: Returns System-On-Chip Information**
When CPUID executes with EAX set to 17H, the processor returns information about the System-On-Chip Vendor Attribute Enumeration. See Table 1-5.

**INPUT EAX = 18H: Returns Deterministic Address Translation Parameters Information**
When CPUID executes with EAX set to 18H, the processor returns information about the Deterministic Address Translation Parameters. See Table 1-5.

**INPUT EAX = 1BH: Returns PCONFIG Information**
When CPUID executes with EAX set to 1BH, the processor returns information about PCONFIG capabilities. See Table 1-5.

**METHODS FOR RETURNING BRANDING INFORMATION**
Use the following techniques to access branding information:
1. Processor brand string method. This method also returns the processor’s maximum operating frequency.
2. Processor brand index method. This method uses a software-supplied brand string table.

These two methods are discussed in the following sections. For methods that are available in early processors, see Section: "Identification of Earlier IA-32 Processors" in Chapter 16 of the *Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1*.

**The Processor Brand String Method**
Figure 1-6 describes the algorithm used for detection of the brand string. Processor brand identification software should execute this algorithm on all Intel 64 and IA-32 processors.

This method (introduced with Pentium 4 processors) returns an ASCII brand identification string and the maximum operating frequency of the processor to the EAX, EBX, ECX, and EDX registers.

**How Brand Strings Work**

To use the brand string method, execute CPUID with EAX input of 80000002H through 80000004H. For each input value, CPUID returns 16 ASCII characters using EAX, EBX, ECX, and EDX. The returned string will be NULL-terminated.

Table 1-12 shows the brand string that is returned by the first processor in the Pentium 4 processor family.

![Figure 1-6. Determination of Support for the Processor Brand String](image)

### Table 1-12. Processor Brand String Returned with Pentium 4 Processor

<table>
<thead>
<tr>
<th>EAX Input Value</th>
<th>Return Values</th>
<th>ASCII Equivalent</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000002H</td>
<td>EAX = 20202020H</td>
<td>&quot;    &quot;</td>
</tr>
<tr>
<td></td>
<td>EBX = 20202020H</td>
<td>&quot;    &quot;</td>
</tr>
<tr>
<td></td>
<td>ECX = 20202020H</td>
<td>&quot;    &quot;</td>
</tr>
<tr>
<td></td>
<td>EDX = 6E492020H</td>
<td>&quot;nI  &quot;</td>
</tr>
<tr>
<td>80000003H</td>
<td>EAX = 286C6574H</td>
<td>&quot;(let&quot;</td>
</tr>
<tr>
<td></td>
<td>EBX = 50202952H</td>
<td>&quot;P )R&quot;</td>
</tr>
<tr>
<td></td>
<td>ECX = 69746E65H</td>
<td>&quot;itne&quot;</td>
</tr>
<tr>
<td></td>
<td>EDX = 52286D75H</td>
<td>&quot;R(mu&quot;</td>
</tr>
<tr>
<td>80000004H</td>
<td>EAX = 20342029H</td>
<td>&quot; 4 )&quot;</td>
</tr>
<tr>
<td></td>
<td>EBX = 20555043H</td>
<td>&quot; UPC&quot;</td>
</tr>
<tr>
<td></td>
<td>ECX = 30303531H</td>
<td>&quot;0051&quot;</td>
</tr>
<tr>
<td></td>
<td>EDX = 007A484DH</td>
<td>&quot;\0zHM&quot;</td>
</tr>
</tbody>
</table>
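
The ASCII column reads each register from its most-significant byte down, which is why the text appears reversed. Assembling the string in memory order is a matter of packing each register little-endian, as the following sketch shows. The function name `brand_string` is illustrative, and the twelve register values are assumed to have been collected by the caller in the order EAX, EBX, ECX, EDX for each of the three leaves.

```python
import struct


def brand_string(leaves):
    """Build the brand string from three (eax, ebx, ecx, edx) tuples.

    `leaves` holds the results for EAX = 80000002H, 80000003H, and 80000004H,
    in that order.
    """
    # "<4I" packs four 32-bit values little-endian, so byte 0 of EAX comes first.
    raw = b"".join(struct.pack("<4I", *regs) for regs in leaves)
    # The string is NULL-terminated; drop the terminator and anything after it.
    return raw.split(b"\x00", 1)[0].decode("ascii")
```

Leading spaces are preserved as returned by the processor; callers that want a display name can strip them.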

**Extracting the Maximum Processor Frequency from Brand Strings**

Figure 1-7 provides an algorithm which software can use to extract the maximum processor operating frequency from the processor brand string.

NOTE

When a frequency is given in a brand string, it is the maximum qualified frequency of the processor, not the frequency at which the processor is currently running.
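
One way to pull the frequency out of an already-assembled brand string is sketched below. This is not a transcription of the algorithm in Figure 1-7; it is a regular-expression approximation that assumes the frequency appears as a decimal number immediately followed by MHz, GHz, or THz. The function name `max_frequency_hz` is illustrative.

```python
import re
from decimal import Decimal

_MULTIPLIER = {"M": 10**6, "G": 10**9, "T": 10**12}


def max_frequency_hz(brand):
    """Return the frequency named in a brand string, in Hz, or None if absent."""
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*([MGT])Hz", brand)
    if not matches:
        return None
    value, unit = matches[-1]              # the frequency is expected near the end
    # Decimal avoids binary floating-point rounding for values such as "2.40".
    return int(Decimal(value) * _MULTIPLIER[unit])
```

As the note above states, the result is the maximum qualified frequency, not the frequency at which the processor is currently running.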

**The Processor Brand Index Method**

The brand index method (introduced with Pentium® III Xeon® processors) provides an entry point into a brand identification table that is maintained in memory by system software and is accessible from system- and user-level code. In this table, each brand index is associated with an ASCII brand identification string that identifies the official Intel family and model number of a processor.

When CPUID executes with EAX set to 01H, the processor returns a brand index to the low byte in EBX. Software can then use this index to locate the brand identification string for the processor in the brand identification table. The first entry (brand index 0) in this table is reserved, allowing for backward compatibility with processors that do not support the brand identification feature. Starting with processor signature family ID = 0FH, model = 03H, the brand index method is no longer supported. Use the brand string method instead.

**Figure 1-7. Algorithm for Extracting Maximum Processor Frequency**

Table 1-13 shows brand indices that have identification strings associated with them.

**Table 1-13. Mapping of Brand Indices and Intel 64 and IA-32 Processor Brand Strings**

<table>
<thead>
<tr>
<th>Brand Index</th>
<th>Brand String</th>
</tr>
</thead>
<tbody>
<tr>
<td>00H</td>
<td>This processor does not support the brand identification feature</td>
</tr>
<tr>
<td>01H</td>
<td>Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>02H</td>
<td>Intel(R) Pentium(R) III processor<sup>1</sup></td>
</tr>
<tr>
<td>03H</td>
<td>Intel(R) Pentium(R) III Xeon(R) processor; If processor signature = 000006B1h, then Intel(R) Celeron(R) processor</td>
</tr>
<tr>
<td>04H</td>
<td>Intel(R) Pentium(R) III processor</td>
</tr>
<tr>
<td>06H</td>
<td>Mobile Intel(R) Pentium(R) III processor-M</td>
</tr>
<tr>
<td>07H</td>
<td>Mobile Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>08H</td>
<td>Intel(R) Pentium(R) 4 processor</td>
</tr>
<tr>
<td>09H</td>
<td>Intel(R) Pentium(R) 4 processor</td>
</tr>
<tr>
<td>0AH</td>
<td>Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>0BH</td>
<td>Intel(R) Xeon(R) processor; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor MP</td>
</tr>
<tr>
<td>0CH</td>
<td>Intel(R) Xeon(R) processor MP</td>
</tr>
<tr>
<td>0EH</td>
<td>Mobile Intel(R) Pentium(R) 4 processor-M; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor</td>
</tr>
<tr>
<td>0FH</td>
<td>Mobile Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>11H</td>
<td>Mobile Genuine Intel(R) processor</td>
</tr>
<tr>
<td>12H</td>
<td>Intel(R) Celeron(R) M processor</td>
</tr>
<tr>
<td>13H</td>
<td>Mobile Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>14H</td>
<td>Intel(R) Celeron(R) processor</td>
</tr>
<tr>
<td>15H</td>
<td>Mobile Genuine Intel(R) processor</td>
</tr>
<tr>
<td>16H</td>
<td>Intel(R) Pentium(R) M processor</td>
</tr>
<tr>
<td>17H</td>
<td>Mobile Intel(R) Celeron(R) processor<sup>1</sup></td>
</tr>
<tr>
<td>18H - 0FFH</td>
<td>RESERVED</td>
</tr>
</tbody>
</table>

**NOTES:**

1. Indicates versions of these processors that were introduced after the Pentium III
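
A lookup against Table 1-13 might be sketched as follows. Only a handful of table rows are reproduced here, and the names `BRAND_TABLE`, `SIGNATURE_OVERRIDES`, and `brand_from_index` are illustrative. The sketch assumes the caller supplies the EAX (processor signature) and EBX values returned by CPUID with EAX = 01H.

```python
# A subset of Table 1-13; a complete table would carry every defined index.
BRAND_TABLE = {
    0x03: "Intel(R) Pentium(R) III Xeon(R) processor",
    0x04: "Intel(R) Pentium(R) III processor",
    0x08: "Intel(R) Pentium(R) 4 processor",
    0x0B: "Intel(R) Xeon(R) processor",
    0x0E: "Mobile Intel(R) Pentium(R) 4 processor-M",
}

# Rows of Table 1-13 whose string depends on the processor signature.
SIGNATURE_OVERRIDES = {
    (0x03, 0x000006B1): "Intel(R) Celeron(R) processor",
    (0x0B, 0x00000F13): "Intel(R) Xeon(R) processor MP",
    (0x0E, 0x00000F13): "Intel(R) Xeon(R) processor",
}


def brand_from_index(leaf1_eax, leaf1_ebx):
    """Return the brand string for a CPUID.01H result, or None if unknown."""
    index = leaf1_ebx & 0xFF               # brand index is the low byte of EBX
    if index == 0:
        return None                        # brand identification not supported
    override = SIGNATURE_OVERRIDES.get((index, leaf1_eax))
    return override or BRAND_TABLE.get(index)
```

Checking the signature-specific rows first keeps the exceptions in data rather than in branching code.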

**IA-32 Architecture Compatibility**

CPUID is not supported in early models of the Intel486 processor or in any IA-32 processor earlier than the Intel486 processor.

**Operation**

IA32_BIOS_SIGN_ID MSR ← Update with installed microcode revision number;

CASE (EAX) OF

EAX = 0:

EAX ← Highest basic function input value understood by CPUID;
EBX ← Vendor identification string;
EDX ← Vendor identification string;
ECX ← Vendor identification string;
BREAK;
EAX = 1H:
EAX[3:0] ← Stepping ID;
EAX[7:4] ← Model;
EAX[11:8] ← Family;
EAX[13:12] ← Processor type;
EAX[15:14] ← Reserved;
EAX[19:16] ← Extended Model;
EAX[27:20] ← Extended Family;
EAX[31:28] ← Reserved;
EBX[7:0] ← Brand Index; (* Reserved if the value is zero. *)
EBX[15:8] ← CLFLUSH Line Size;
EBX[23:16] ← Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
EBX[31:24] ← Initial APIC ID;
ECX ← Feature flags; (* See Figure 1-4. *)
EDX ← Feature flags; (* See Figure 1-5. *)

BREAK;
EAX = 2H:
EAX ← Cache and TLB information;
EBX ← Cache and TLB information;
ECX ← Cache and TLB information;
EDX ← Cache and TLB information;

BREAK;
EAX = 3H:
EAX ← Reserved;
EBX ← Reserved;
ECX ← ProcessorSerialNumber[31:0];
(* Pentium III processors only, otherwise reserved. *)
EDX ← ProcessorSerialNumber[63:32];
(* Pentium III processors only, otherwise reserved. *)

BREAK;

EAX = 4H:
EAX ← Deterministic Cache Parameters Leaf; (* See Table 1-5. *)
EBX ← Deterministic Cache Parameters Leaf;
ECX ← Deterministic Cache Parameters Leaf;
EDX ← Deterministic Cache Parameters Leaf;

BREAK;
EAX = 5H:
EAX ← MONITOR/MWAIT Leaf; (* See Table 1-5. *)
EBX ← MONITOR/MWAIT Leaf;
ECX ← MONITOR/MWAIT Leaf;
EDX ← MONITOR/MWAIT Leaf;

BREAK;
EAX = 6H:
EAX ← Thermal and Power Management Leaf; (* See Table 1-5. *)
EBX ← Thermal and Power Management Leaf;
ECX ← Thermal and Power Management Leaf;
EDX ← Thermal and Power Management Leaf;

BREAK;
EAX = 7H:
EAX ← Structured Extended Feature Leaf; (* See Table 1-5. *)
EBX ← Structured Extended Feature Leaf;
ECX ← Structured Extended Feature Leaf;
EDX ← Structured Extended Feature Leaf;

BREAK;
EAX = 8H:
EAX ← Reserved = 0;
EBX ← Reserved = 0;
ECX ← Reserved = 0;
EDX ← Reserved = 0;
BREAK;
EAX = 9H:
   EAX ← Direct Cache Access Information Leaf; (* See Table 1-5. *)
   EBX ← Direct Cache Access Information Leaf;
   ECX ← Direct Cache Access Information Leaf;
   EDX ← Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
   EAX ← Architectural Performance Monitoring Leaf; (* See Table 1-5. *)
   EBX ← Architectural Performance Monitoring Leaf;
   ECX ← Architectural Performance Monitoring Leaf;
   EDX ← Architectural Performance Monitoring Leaf;
BREAK;
EAX = BH:
   EAX ← Extended Topology Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Extended Topology Enumeration Leaf;
   ECX ← Extended Topology Enumeration Leaf;
   EDX ← Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;
BREAK;
EAX = DH:
   EAX ← Processor Extended State Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Processor Extended State Enumeration Leaf;
   ECX ← Processor Extended State Enumeration Leaf;
   EDX ← Processor Extended State Enumeration Leaf;
BREAK;
EAX = EH:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;
BREAK;
EAX = FH:
   EAX ← Platform Quality of Service Monitoring Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Platform Quality of Service Monitoring Enumeration Leaf;
   ECX ← Platform Quality of Service Monitoring Enumeration Leaf;
   EDX ← Platform Quality of Service Monitoring Enumeration Leaf;
BREAK;
EAX = 10H:
   EAX ← Platform Quality of Service Enforcement Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Platform Quality of Service Enforcement Enumeration Leaf;
   ECX ← Platform Quality of Service Enforcement Enumeration Leaf;
   EDX ← Platform Quality of Service Enforcement Enumeration Leaf;
BREAK;
EAX = 14H:
   EAX ← Intel Processor Trace Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Intel Processor Trace Enumeration Leaf;
   ECX ← Intel Processor Trace Enumeration Leaf;
   EDX ← Intel Processor Trace Enumeration Leaf;
BREAK;
EAX = 15H:
   EAX ← Time Stamp Counter and Core Crystal Clock Information Leaf; (* See Table 1-5. *)
   EBX ← Time Stamp Counter and Core Crystal Clock Information Leaf;
   ECX ← Time Stamp Counter and Core Crystal Clock Information Leaf;
   EDX ← Time Stamp Counter and Core Crystal Clock Information Leaf;
BREAK;
EAX = 16H:
   EAX ← Processor Frequency Information Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Processor Frequency Information Enumeration Leaf;
   ECX ← Processor Frequency Information Enumeration Leaf;
   EDX ← Processor Frequency Information Enumeration Leaf;
BREAK;
EAX = 17H:
   EAX ← System-On-Chip Vendor Attribute Enumeration Leaf; (* See Table 1-5. *)
   EBX ← System-On-Chip Vendor Attribute Enumeration Leaf;
   ECX ← System-On-Chip Vendor Attribute Enumeration Leaf;
   EDX ← System-On-Chip Vendor Attribute Enumeration Leaf;
BREAK;
EAX = 18H:
   EAX ← Deterministic Address Translation Parameters Enumeration Leaf; (* See Table 1-5. *)
   EBX ← Deterministic Address Translation Parameters Enumeration Leaf;
   ECX ← Deterministic Address Translation Parameters Enumeration Leaf;
   EDX ← Deterministic Address Translation Parameters Enumeration Leaf;
BREAK;
EAX = 1BH:
   EAX ← PCONFIG Information Enumeration Leaf; (* See Table 1-5. *)
   EBX ← PCONFIG Information Enumeration Leaf;
   ECX ← PCONFIG Information Enumeration Leaf;
   EDX ← PCONFIG Information Enumeration Leaf;
BREAK;
EAX = 80000000H:
   EAX ← Highest extended function input value understood by CPUID;
   EBX ← Reserved;
   ECX ← Reserved;
   EDX ← Reserved;
BREAK;
EAX = 80000001H:
   EAX ← Reserved;
   EBX ← Reserved;
   ECX ← Extended Feature Bits (* See Table 1-5.*);
   EDX ← Extended Feature Bits (* See Table 1-5.*);
BREAK;
EAX = 80000002H:
   EAX ← Processor Brand String;
   EBX ← Processor Brand String, continued;
   ECX ← Processor Brand String, continued;
   EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000003H:
   EAX ← Processor Brand String, continued;
   EBX ← Processor Brand String, continued;
   ECX ← Processor Brand String, continued;
   EDX ← Processor Brand String, continued;

BREAK;

EAX = 80000004H:
   EAX ← Processor Brand String, continued;
   EBX ← Processor Brand String, continued;
   ECX ← Processor Brand String, continued;
   EDX ← Processor Brand String, continued;

BREAK;

EAX = 80000005H:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;

BREAK;

EAX = 80000006H:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Cache information;
   EDX ← Reserved = 0;

BREAK;

EAX = 80000007H:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;

BREAK;

EAX = 80000008H:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;

BREAK;

DEFAULT: (* EAX = Value outside of recognized range for CPUID. *)
   (* If the highest basic information leaf data depend on ECX input value, ECX is honored.*)
   EAX ← Reserved; (* Information returned for highest basic information leaf. *)
   EBX ← Reserved; (* Information returned for highest basic information leaf. *)
   ECX ← Reserved; (* Information returned for highest basic information leaf. *)
   EDX ← Reserved; (* Information returned for highest basic information leaf. *)

BREAK;
ESAC;
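
The DEFAULT case returns the data of the highest basic information leaf rather than zeros, which is easy to overlook. The following Python sketch models only that leaf selection; it does not execute CPUID. The function name `resolve_cpuid_leaf` and the parameters `max_basic` and `max_extended` are hypothetical: they stand for the highest basic input value and for the value returned in EAX for input 80000000H. The sample limits are illustrative and are not taken from a specific processor.

```python
def resolve_cpuid_leaf(eax, max_basic, max_extended):
    """Return the leaf whose data a CPUID with input value eax would report.

    Hypothetical model of the CASE statement above: inputs inside the basic
    or extended range select their own leaf, anything else falls through to
    DEFAULT and reports the highest basic information leaf.
    """
    eax &= 0xFFFFFFFF
    if eax <= max_basic:
        return eax                      # recognized basic leaf
    if 0x80000000 <= eax <= max_extended:
        return eax                      # recognized extended leaf
    return max_basic                    # DEFAULT case


# Illustrative limits only (assumed, not read from hardware).
MAX_BASIC = 0x1B
MAX_EXTENDED = 0x80000008

for value in (0x0D, 0x40, 0x80000002, 0x80000009):
    print(hex(value), "->", hex(resolve_cpuid_leaf(value, MAX_BASIC, MAX_EXTENDED)))
```

Leaves that lie inside the recognized range but are listed as Reserved (for example 0CH) are treated as recognized here, matching the explicit cases in the pseudocode.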

Flags Affected
None.

Exceptions (All Operating Modes)

#UD If the LOCK prefix is used.

In earlier IA-32 processors that do not support the CPUID instruction, execution of the instruction results in an invalid opcode (#UD) exception being generated.

1.8 COMPRESSED DISPLACEMENT (DISP8*N) SUPPORT IN EVEX

For memory addressing using the disp8 form, EVEX-encoded instructions always use a compressed displacement scheme: disp8 is multiplied by a scaling factor N that is determined by the vector length, the value of the EVEX.b bit (embedded broadcast), and the input element size of the instruction. In general, N corresponds to the number of bytes characterizing the internal memory operation of the input operand (e.g., 64 when accessing a full 512-bit memory vector). The scale factor N is listed in Table 1-14 and Table 1-15 below, where EVEX-encoded instructions are classified using the tupletype attribute. The scale factor N of each tupletype is listed based on the vector length (VL) and other factors affecting it.

Table 1-14 covers EVEX-encoded instructions that have load semantics in conjunction with an additional computational or data element movement operation, operating either on the full vector or on half of the vector (due to conversion of numerical precision from a wider format to a narrower format). EVEX.b is supported for such instructions for data element sizes that are either dword or qword.

EVEX-encoded instructions that are pure load/store, and “Load+op” instructions that operate on data elements smaller than a dword, do not support broadcasting using EVEX.b. These are listed in Table 1-15, which also includes many broadcast instructions that perform broadcast using a subset of data elements without using EVEX.b, as well as a few data element size conversion instructions. Instructions classified in Table 1-15 do not use EVEX.b; EVEX.b must be 0, otherwise #UD will occur.

The tupletype is referenced in the instruction operand encoding table on the reference page of each instruction, providing the cross reference to the scaling factor N used when encoding the memory addressing operand.

Note that the disp8*N rules still apply when using 16-bit addressing.

<table>
<thead>
<tr>
<th>TupleType</th>
<th>EVEX.b</th>
<th>InputSize</th>
<th>EVEX.W</th>
<th>Broadcast</th>
<th>N (VL=128)</th>
<th>N (VL=256)</th>
<th>N (VL=512)</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>Full</td>
<td>0</td>
<td>32bit</td>
<td>0</td>
<td>none</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td></td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>32bit</td>
<td>0</td>
<td>{1tox}</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>Load+Op (Full Vector Dword/Qword)</td>
</tr>
<tr>
<td></td>
<td>0</td>
<td>64bit</td>
<td>1</td>
<td>none</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td></td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>64bit</td>
<td>1</td>
<td>{1tox}</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td></td>
</tr>
<tr>
<td>Half</td>
<td>0</td>
<td>32bit</td>
<td>0</td>
<td>none</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>Load+Op (Half Vector)</td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>32bit</td>
<td>0</td>
<td>{1tox}</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td></td>
</tr>
</tbody>
</table>
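
The scaling rules of Table 1-14 can be sketched in Python as follows. This is an illustrative model, not an encoder: the function names `disp8_scale`, `compress_displacement` and `expand_displacement` are hypothetical, and the rule that a displacement which cannot be expressed as disp8*N must use a wider displacement form is an assumption about typical assembler behavior rather than something stated in this section.

```python
# Bytes covered by a full vector for each vector length (VL) in bits.
VECTOR_BYTES = {128: 16, 256: 32, 512: 64}


def disp8_scale(tupletype, evex_b, input_size, vl):
    """Return N for the Full and Half tuple types of Table 1-14."""
    if tupletype == "Full" and input_size in (32, 64):
        # With embedded broadcast, N is the size of the single element.
        return input_size // 8 if evex_b else VECTOR_BYTES[vl]
    if tupletype == "Half" and input_size == 32:
        return 4 if evex_b else VECTOR_BYTES[vl] // 2
    raise ValueError("combination not listed in Table 1-14")


def compress_displacement(disp, n):
    """Return the disp8 byte that encodes disp, or None if it does not fit."""
    if disp % n != 0:
        return None                     # not a multiple of N
    scaled = disp // n
    if not -128 <= scaled <= 127:
        return None                     # outside the signed 8-bit range
    return scaled & 0xFF


def expand_displacement(disp8, n):
    """Sign-extend the encoded disp8 byte and multiply it by N."""
    signed = disp8 - 256 if disp8 & 0x80 else disp8
    return signed * n


n = disp8_scale("Full", 0, 32, 512)     # full 512-bit vector, no broadcast
print(n, compress_displacement(256, n), compress_displacement(100, n))
```

With N = 64, a displacement of 256 bytes fits in disp8 (encoded as 4), while a displacement of 100 bytes does not because it is not a multiple of N.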

<table>
<thead>
<tr>
<th>TupleType</th>
<th>InputSize</th>
<th>EVEX.W</th>
<th>N (VL=128)</th>
<th>N (VL=256)</th>
<th>N (VL=512)</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>Full Mem</td>
<td>N/A</td>
<td>N/A</td>
<td>16</td>
<td>32</td>
<td>64</td>
<td>Load/store or subDword full vector</td>
</tr>
<tr>
<td>Tuple1 Scalar</td>
<td>8bit</td>
<td>N/A</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td></td>
<td>16bit</td>
<td>N/A</td>
<td>2</td>
<td>2</td>
<td>2</td>
<td></td>
</tr>
<tr>
<td></td>
<td>32bit</td>
<td>0</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td></td>
</tr>
<tr>
<td></td>
<td>64bit</td>
<td>1</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td></td>
</tr>
<tr>
<td>Tuple1 Fixed</td>
<td>32bit</td>
<td>N/A</td>
<td>4</td>
<td>4</td>
<td>4</td>
<td>1 Tuple, memsize not affected by EVEX.W</td>
</tr>
<tr>
<td></td>
<td>64bit</td>
<td>N/A</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td></td>
</tr>
<tr>
<td>Tuple1_4X</td>
<td>32bit</td>
<td>0</td>
<td>16^1</td>
<td>N/A</td>
<td>16</td>
<td>4FMA(PS)</td>
</tr>
<tr>
<td>Tuple2</td>
<td>32bit</td>
<td>0</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>Broadcast (2 elements)</td>
</tr>
<tr>
<td></td>
<td>64bit</td>
<td>1</td>
<td>N/A</td>
<td>16</td>
<td>16</td>
<td></td>
</tr>
<tr>
<td>Tuple4</td>
<td>32bit</td>
<td>0</td>
<td>N/A</td>
<td>16</td>
<td>16</td>
<td>Broadcast (4 elements)</td>
</tr>
<tr>
<td></td>
<td>64bit</td>
<td>1</td>
<td>N/A</td>
<td>N/A</td>
<td>32</td>
<td></td>
</tr>
<tr>
<td>Tuple8</td>
<td>32bit</td>
<td>0</td>
<td>N/A</td>
<td>N/A</td>
<td>32</td>
<td>Broadcast (8 elements)</td>
</tr>
</tbody>
</table>

*Note: the superscript 1 in 16^1 refers to note 1 (Scalar) under NOTES at the end of Table 1-15.*
### Table 1-15. EVEX DISP8*N for Instructions Not Affected by Embedded Broadcast (Continued)

<table>
<thead>
<tr>
<th>TupleType</th>
<th>InputSize</th>
<th>EVEX.W</th>
<th>N (VL=128)</th>
<th>N (VL=256)</th>
<th>N (VL=512)</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>Half Mem</td>
<td>N/A</td>
<td>N/A</td>
<td>8</td>
<td>16</td>
<td>32</td>
<td>SubQword Conversion</td>
</tr>
<tr>
<td>Quarter Mem</td>
<td>N/A</td>
<td>N/A</td>
<td>4</td>
<td>8</td>
<td>16</td>
<td>SubDword Conversion</td>
</tr>
<tr>
<td>Eighth Mem</td>
<td>N/A</td>
<td>N/A</td>
<td>2</td>
<td>4</td>
<td>8</td>
<td>SubWord Conversion</td>
</tr>
<tr>
<td>Mem128</td>
<td>N/A</td>
<td>N/A</td>
<td>16</td>
<td>16</td>
<td>16</td>
<td>Shift count from memory</td>
</tr>
<tr>
<td>MOVDDUP</td>
<td>N/A</td>
<td>N/A</td>
<td>8</td>
<td>32</td>
<td>64</td>
<td>VMOVDDUP</td>
</tr>
</tbody>
</table>

**NOTES:**

1. Scalar

Instructions described in this document follow the general documentation convention established in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A*.

2.1 INSTRUCTION SET REFERENCE
INSTRUCTION SET REFERENCE, A-Z

CLDEMOTE—Cache Line Demote

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/ En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>NP 0F 1C /0 CLDEMOTE m8</td>
<td>A</td>
<td>V/V</td>
<td>CLDEMOTE</td>
<td>Hint to hardware to move the cache line containing m8 to a more distant level of the cache without writing back to memory.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>ModRM:r/m (w)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

Hints to hardware that the cache line that contains the linear address specified with the memory operand should be moved ("demoted") from the cache(s) closest to the processor core to a level more distant from the processor core. This may accelerate subsequent accesses to the line by other cores in the same coherence domain, especially if the line was written by the core that demotes the line. Moving the line in such a manner is a performance optimization, i.e., it is a hint which does not modify architectural state. Hardware may choose which level in the cache hierarchy to retain the line (e.g., L3 in typical server designs). The source operand is a byte memory location.

The availability of the CLDEMOTE instruction is indicated by the presence of the CPUID feature flag CLDEMOTE (bit 25 of the ECX register in leaf 07H, sub-leaf 0; see “CPUID—CPU Identification” in Chapter 1). On processors that do not support the CLDEMOTE instruction (including legacy hardware), the instruction is treated as a NOP.
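
A minimal Python sketch of the feature check described above. It assumes the ECX value reported by CPUID leaf 07H has already been obtained by other means, since the Python standard library offers no CPUID access; `has_cldemote` and `leaf7_ecx` are hypothetical names.

```python
CLDEMOTE_BIT = 25  # bit position of the CLDEMOTE flag given in the text


def has_cldemote(leaf7_ecx):
    """Test the CLDEMOTE feature flag in a CPUID leaf 07H ECX value."""
    return bool((leaf7_ecx >> CLDEMOTE_BIT) & 1)


# Hypothetical register values used only to exercise the check.
print(has_cldemote(1 << CLDEMOTE_BIT), has_cldemote(0))
```

Because the instruction is treated as a NOP when unsupported, a check of this kind indicates whether the hint can have any effect; it is not needed to avoid a fault.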

A CLDEMOTE instruction is ordered with respect to stores to the same cache line, but unordered with respect to other instructions including memory fences, CLDEMOTE, CLWB or CLFLUSHOPT instructions to a different cache line. Since CLDEMOTE will retire in order with respect to stores to the same cache line, software should ensure that after issuing CLDEMOTE the line is not accessed again immediately by the same core to avoid cache data movement penalties.

The effective memory type of the page containing the affected line determines the effect; cacheable types are likely to generate a data movement operation, while uncacheable types may cause the instruction to be ignored.

Speculative fetching can occur at any time and is not tied to instruction execution. The CLDEMOTE instruction is not ordered with respect to PREFETCHh instructions or any of the speculative fetching mechanisms. That is, data can be speculatively loaded into a cache line just before, during, or after the execution of a CLDEMOTE instruction that references the cache line.

Unlike CLFLUSH, CLFLUSHOPT and CLWB instructions, CLDEMOTE is not guaranteed to write back modified data to memory.

The CLDEMOTE instruction may be ignored by hardware in certain cases and is not a guarantee.

The CLDEMOTE instruction can be used at all privilege levels. In certain processor implementations the CLDEMOTE instruction may set the A bit but not the D bit in the page tables.

If the line is not found in the cache, the instruction will be treated as a NOP.

In some implementations, the CLDEMOTE instruction may always cause a transactional abort with Transactional Synchronization Extensions (TSX). However, programmers must not rely on CLDEMOTE instruction to force a transactional abort.

---

1. ModRM.MOD != 011B

Operation
Cache_Line_Demote(m8);

Flags Affected
None.

C/C++ Compiler Intrinsic Equivalent
CLDEMOTE void _cldemote(const void*);

Protected Mode Exceptions
#UD If the LOCK prefix is used.

Real-Address Mode Exceptions
#UD If the LOCK prefix is used.

Virtual-8086 Mode Exceptions
Same exceptions as in real address mode.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions
#UD If the LOCK prefix is used.
GF2P8AFFINEINVQB — Galois Field Affine Transformation Inverse

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F3A CF /r /ib GF2P8AFFINEINVQB xmm1, xmm2/m128, imm8</td>
<td>A</td>
<td>V/V</td>
<td>GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A.W1 CF /r /ib VGF2P8AFFINEINVQB xmm1, xmm2, xmm3/m128, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A.W1 CF /r /ib VGF2P8AFFINEINVQB ymm1, ymm2, ymm3/m256, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.W1 CF /r /ib VGF2P8AFFINEINVQB xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W1 CF /r /ib VGF2P8AFFINEINVQB ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W1 CF /r /ib VGF2P8AFFINEINVQB zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512F GFNI</td>
<td>Computes inverse affine transformation in the finite field GF(2^8).</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (r, w)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>VEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
<tr>
<td>C</td>
<td>Full</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
</tbody>
</table>

Description

The GF2P8AFFINEINVQB instruction computes an affine transformation in the Galois Field 2^8. For this instruction, an affine transformation is defined by A * inv(x) + b where "A" is an 8 by 8 bit matrix, and "x" and "b" are 8-bit vectors. The inverse of the bytes in x is defined with respect to the reduction polynomial x^8 + x^4 + x^3 + x + 1.

One SIMD register (operand 1) holds "x" as either 16, 32 or 64 8-bit vectors. A second SIMD register (operand 2) or memory operand contains 2, 4, or 8 "A" values, which are operated upon by the correspondingly aligned 8 "x" values in the first register. The "b" vector is constant for all calculations and contained in the immediate byte.

The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of the instruction require 16B alignment on their memory operations.

The inverse of each byte is given by the following table. The upper nibble is on the vertical axis and the lower nibble is on the horizontal axis. For example, the inverse of 0x95 is 0x8A.

<table>
<thead>
<tr>
<th>Table 2-1. Inverse Byte Listings</th>
</tr>
</thead>
<tbody>
<tr>
<td>Rows: upper nibble of the input byte (0 through F). Columns: lower nibble (0 through F). Entries: inverse of the byte in GF(2^8).</td>
</tr>
</tbody>
</table>

Operation

define affine_inverse_byte(tsrc2qw, src1byte, imm8):
    FOR i ← 0 to 7:
        * parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
        * inverse(x) is defined in Table 2-1 *
        retbyte.bit[i] ← parity(tsrc2qw.byte[7-i] AND inverse(src1byte)) XOR imm8.bit[i]
    return retbyte

VGF2P8AFFINEINVQB dest, src1, src2, imm8 (EVEX encoded version)
(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1:
    IF SRC2 is memory and EVEX.b==1:
        tsrc2 ← SRC2.qword[0]
    ELSE:
        tsrc2 ← SRC2.qword[j]
    FOR b ← 0 to 7:
        IF k1[j*8+b] OR *no writemask*:
            DEST.qword[j].byte[b] ← affine_inverse_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
        ELSE IF *zeroing*:
            DEST.qword[j].byte[b] ← 0
        *ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] ← 0

VGF2P8AFFINEINVQB dest, src1, src2, imm8 (128b and 256b VEX encoded versions)

(KL, VL) = (2, 128), (4, 256)

FOR j ← 0 TO KL-1:
    FOR b ← 0 to 7:
        DEST.qword[j].byte[b] ← affine_inverse_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] ← 0

GF2P8AFFINEINVQB srcdest, src1, imm8 (128b SSE encoded version)

FOR j ← 0 TO 1:
    FOR b ← 0 to 7:
        SRCDEST.qword[j].byte[b] ← affine_inverse_byte(SRC1.qword[j], SRCDEST.qword[j].byte[b], imm8)
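
The per-byte operation can be modeled in Python as a software reference. In this sketch `affine_inverse_byte` follows the pseudocode, while `gf_mul` and the exponentiation-based `inverse` are helper routines introduced here (the instruction itself is specified through Table 2-1, not through exponentiation). The AES S-box matrix and constant used in the example are assumptions taken from the AES definition, not from this document.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B                  # reduce as soon as bit 8 is set
    return result


def inverse(x):
    """Multiplicative inverse in GF(2^8); inverse(0) is defined as 0."""
    result = 1
    for _ in range(254):                # x^254 equals x^-1 for nonzero x
        result = gf_mul(result, x)
    return result if x else 0


def parity(x):
    """1 if x has an odd number of 1 bits, otherwise 0."""
    return bin(x).count("1") & 1


def affine_inverse_byte(tsrc2qw, src1byte, imm8):
    """Compute A * inv(x) + b for one byte, as in the pseudocode."""
    inv = inverse(src1byte)
    retbyte = 0
    for i in range(8):
        row = (tsrc2qw >> (8 * (7 - i))) & 0xFF      # tsrc2qw.byte[7-i]
        retbyte |= (parity(row & inv) ^ ((imm8 >> i) & 1)) << i
    return retbyte


# Assumed example constants: the AES S-box affine matrix and constant.
AES_MATRIX = 0xF1E3C78F1F3E7CF8
AES_CONSTANT = 0x63

print(hex(inverse(0x95)))
print(hex(affine_inverse_byte(AES_MATRIX, 0x53, AES_CONSTANT)))
```

Evaluating `inverse` over all 256 byte values regenerates the entries of Table 2-1, including the example value for 0x95 quoted in the description.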

Intel C/C++ Compiler Intrinsic Equivalent

GF2P8AFFINEINVQB __m128i _mm_gf2p8affineinv_epi64_epi8(__m128i, __m128i, int);
GF2P8AFFINEINVQB __m128i _mm_mask_gf2p8affineinv_epi64_epi8(__m128i, __mmask16, __m128i, __m128i, int);
GF2P8AFFINEINVQB __m128i _mm_maskz_gf2p8affineinv_epi64_epi8(__mmask16, __m128i, __m128i, int);
GF2P8AFFINEINVQB __m256i _mm256_gf2p8affineinv_epi64_epi8(__m256i, __m256i, int);
GF2P8AFFINEINVQB __m256i _mm256_mask_gf2p8affineinv_epi64_epi8(__m256i, __mmask32, __m256i, __m256i, int);
GF2P8AFFINEINVQB __m256i _mm256_maskz_gf2p8affineinv_epi64_epi8(__mmask32, __m256i, __m256i, int);
GF2P8AFFINEINVQB __m512i _mm512_gf2p8affineinv_epi64_epi8(__m512i, __m512i, int);
GF2P8AFFINEINVQB __m512i _mm512_mask_gf2p8affineinv_epi64_epi8(__m512i, __mmask64, __m512i, __m512i, int);
GF2P8AFFINEINVQB __m512i _mm512_maskz_gf2p8affineinv_epi64_epi8(__mmask64, __m512i, __m512i, int);

SIMD Floating-Point Exceptions

None.

Other Exceptions

Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
GF2P8AFFINEQB — Galois Field Affine Transformation

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/ En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F3A CE /r /ib GF2P8AFFINEQB xmm1, xmm2/m128, imm8</td>
<td>A</td>
<td>V/V</td>
<td>GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A.W1 CE /r /ib VGF2P8AFFINEQB xmm1, xmm2, xmm3/m128, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX, GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A.W1 CE /r /ib VGF2P8AFFINEQB ymm1, ymm2, ymm3/m256, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX, GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.W1 CE /r /ib VGF2P8AFFINEQB xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W1 CE /r /ib VGF2P8AFFINEQB ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W1 CE /r /ib VGF2P8AFFINEQB zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8</td>
<td>C</td>
<td>V/V</td>
<td>AVX512F GFNI</td>
<td>Computes affine transformation in the finite field GF(2^8).</td>
</tr>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (r, w)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>VEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
<tr>
<td>C</td>
<td>Full</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
</tbody>
</table>

Description

The GF2P8AFFINEQB instruction computes an affine transformation in the Galois Field 2^8. For this instruction, an affine transformation is defined by A * x + b where "A" is an 8 by 8 bit matrix, and "x" and "b" are 8-bit vectors. One SIMD register (operand 1) holds "x" as either 16, 32 or 64 8-bit vectors. A second SIMD register (operand 2) or memory operand contains 2, 4, or 8 "A" values, which are operated upon by the correspondingly aligned 8 "x" values in the first register. The "b" vector is constant for all calculations and contained in the immediate byte.

The EVEX encoded form of this instruction does not support memory fault suppression. The SSE encoded forms of the instruction require 16B alignment on their memory operations.

**Operation**

define parity(x):
    t ← 0 // single bit
    FOR i ← 0 to 7:
        t ← t XOR x.bit[i]
    return t

define affine_byte(tsrc2qw, src1byte, imm8):
    FOR i ← 0 to 7:
        * parity(x) = 1 if x has an odd number of 1s in it, and 0 otherwise.*
        retbyte.bit[i] ← parity(tsrc2qw.byte[7-i] AND src1byte) XOR imm8.bit[i]
    return retbyte

**VGF2P8AFFINEQB dest, src1, src2, imm8 (EVEX encoded version)**

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1:
    IF SRC2 is memory and EVEX.b==1:
        tsrc2 ← SRC2.qword[0]
    ELSE:
        tsrc2 ← SRC2.qword[j]
    FOR b ← 0 to 7:
        IF k1[j*8+b] OR *no writemask*:
            DEST.qword[j].byte[b] ← affine_byte(tsrc2, SRC1.qword[j].byte[b], imm8)
        ELSE IF *zeroing*:
            DEST.qword[j].byte[b] ← 0
        *ELSE DEST.qword[j].byte[b] remains unchanged*
DEST[MAX_VL-1:VL] ← 0

**VGF2P8AFFINEQB dest, src1, src2, imm8 (128b and 256b VEX encoded versions)**

(KL, VL) = (2, 128), (4, 256)
FOR j ← 0 TO KL-1:
    FOR b ← 0 to 7:
        DEST.qword[j].byte[b] ← affine_byte(SRC2.qword[j], SRC1.qword[j].byte[b], imm8)
DEST[MAX_VL-1:VL] ← 0

**GF2P8AFFINEQB srcdest, src1, imm8 (128b SSE encoded version)**

FOR j ← 0 TO 1:
    FOR b ← 0 to 7:
        SRCDEST.qword[j].byte[b] ← affine_byte(SRC1.qword[j], SRCDEST.qword[j].byte[b], imm8)
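
The following Python sketch models the affine transformation on lists of qwords, including the EVEX write mask and the 64-bit broadcast. It is a software illustration under stated assumptions: `vgf2p8affineqb` and its keyword parameters are hypothetical names, the `broadcast` flag stands in for "SRC2 is memory and EVEX.b = 1", and the two example matrices (identity and bit reversal) are chosen here for demonstration and do not come from this document.

```python
def parity(x):
    """1 if x has an odd number of 1 bits, otherwise 0."""
    return bin(x).count("1") & 1


def affine_byte(tsrc2qw, src1byte, imm8):
    """Compute A * x + b for one byte, as in the pseudocode."""
    retbyte = 0
    for i in range(8):
        row = (tsrc2qw >> (8 * (7 - i))) & 0xFF      # tsrc2qw.byte[7-i]
        retbyte |= (parity(row & src1byte) ^ ((imm8 >> i) & 1)) << i
    return retbyte


def vgf2p8affineqb(dest, src1, src2, imm8, k1=None, zeroing=False, broadcast=False):
    """Model the EVEX form on lists of 2, 4 or 8 qwords."""
    out = []
    for j, x in enumerate(src1):
        matrix = src2[0] if broadcast else src2[j]
        qword = 0
        for b in range(8):
            if k1 is None or (k1 >> (j * 8 + b)) & 1:
                byte = affine_byte(matrix, (x >> (8 * b)) & 0xFF, imm8)
            elif zeroing:
                byte = 0                              # zeroing-masking
            else:
                byte = (dest[j] >> (8 * b)) & 0xFF    # merging-masking
            qword |= byte << (8 * b)
        out.append(qword)
    return out


# Example matrices (assumptions for illustration).
IDENTITY = 0x0102040810204080       # row for result bit i selects source bit i
BIT_REVERSE = 0x8040201008040201    # row for result bit i selects source bit 7-i

src1 = [0x0123456789ABCDEF, 0x1122334455667788]
print([hex(q) for q in vgf2p8affineqb([0, 0], src1, [BIT_REVERSE], 0, broadcast=True)])
```

Because row i of the matrix is stored in byte 7-i of the qword, the identity matrix is 0x0102040810204080 rather than 0x8040201008040201.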

**Intel C/C++ Compiler Intrinsic Equivalent**

GF2P8AFFINEQB __m128i _mm_gf2p8affine_epi64_epi8(__m128i, __m128i, int);
GF2P8AFFINEQB __m128i _mm_mask_gf2p8affine_epi64_epi8(__m128i, __mmask16, __m128i, __m128i, int);
GF2P8AFFINEQB __m128i _mm_maskz_gf2p8affine_epi64_epi8(__mmask16, __m128i, __m128i, int);
GF2P8AFFINEQB __m256i _mm256_gf2p8affine_epi64_epi8(__m256i, __m256i, int);
GF2P8AFFINEQB __m256i _mm256_mask_gf2p8affine_epi64_epi8(__m256i, __mmask32, __m256i, __m256i, int);
GF2P8AFFINEQB __m256i _mm256_maskz_gf2p8affine_epi64_epi8(__mmask32, __m256i, __m256i, int);
GF2P8AFFINEQB __m512i _mm512_gf2p8affine_epi64_epi8(__m512i, __m512i, int);
GF2P8AFFINEQB __m512i _mm512_mask_gf2p8affine_epi64_epi8(__m512i, __mmask64, __m512i, __m512i, int);
GF2P8AFFINEQB __m512i _mm512_maskz_gf2p8affine_epi64_epi8(__mmask64, __m512i, __m512i, int);

**SIMD Floating-Point Exceptions**
None.

**Other Exceptions**
Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.

GF2P8MULB — Galois Field Multiply Bytes

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F38 CF /r GF2P8MULB xmm1, xmm2/m128</td>
<td>A</td>
<td>V/V</td>
<td>GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38.W0 CF /r VGF2P8MULB xmm1, xmm2, xmm3/m128</td>
<td>B</td>
<td>V/V</td>
<td>AVX GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38.W0 CF /r VGF2P8MULB ymm1, ymm2, ymm3/m256</td>
<td>B</td>
<td>V/V</td>
<td>AVX GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F38.W0 CF /r VGF2P8MULB xmm1{k1}{z}, xmm2, xmm3/m128</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.W0 CF /r VGF2P8MULB ymm1{k1}{z}, ymm2, ymm3/m256</td>
<td>C</td>
<td>V/V</td>
<td>AVX512VL GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.W0 CF /r VGF2P8MULB zmm1{k1}{z}, zmm2, zmm3/m512</td>
<td>C</td>
<td>V/V</td>
<td>AVX512F GFNI</td>
<td>Multiplies elements in the finite field GF(2^8).</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (r, w)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>VEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>C</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

The instruction multiplies elements in the finite field GF(2^8), operating on a byte (field element) in the first source operand and the corresponding byte in a second source operand. The field GF(2^8) is represented in polynomial representation with the reduction polynomial x^8 + x^4 + x^3 + x + 1.

This instruction does not support broadcasting.

The EVEX encoded form of this instruction supports memory fault suppression. The SSE encoded forms of the instruction require 16B alignment on their memory operations.

Operation

define gf2p8mul_byte(src1byte, src2byte):
    tword ← 0
    FOR i ← 0 to 7:
        IF src2byte.bit[i]:
            tword ← tword XOR (src1byte << i)
    * carry out polynomial reduction by the characteristic polynomial p *
    FOR i ← 14 downto 8:
        p ← 0x11B << (i-8)  * 0x11B = 0000_0001_0001_1011 in binary *
        IF tword.bit[i]:
            tword ← tword XOR p
    return tword.byte[0]

VGF2P8MULB dest, src1, src2 (EVEX encoded version)
(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1:
    IF k1[j] OR *no writemask*:
        DEST.byte[j] ← gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
    ELSE IF *zeroing*:
        DEST.byte[j] ← 0
    * ELSE DEST.byte[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0

VGF2P8MULB dest, src1, src2 (128b and 256b VEX encoded versions)
(KL, VL) = (16, 128), (32, 256)
FOR j ← 0 TO KL-1:
    DEST.byte[j] ← gf2p8mul_byte(SRC1.byte[j], SRC2.byte[j])
DEST[MAX_VL-1:VL] ← 0

GF2P8MULB srcdest, src1 (128b SSE encoded version)
FOR j ← 0 TO 15:
    SRCDEST.byte[j] ← gf2p8mul_byte(SRCDEST.byte[j], SRC1.byte[j])
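
The multiply pseudocode translates directly into a Python reference model. The sketch below follows `gf2p8mul_byte` step by step and adds a byte-wise vector wrapper with write masking; `vgf2p8mulb` and its parameters are hypothetical names, and the operand values in the example are well-known GF(2^8) test values from the AES specification rather than values from this document.

```python
def gf2p8mul_byte(src1byte, src2byte):
    """Multiply two bytes in GF(2^8) with reduction polynomial 0x11B."""
    tword = 0
    for i in range(8):                      # carry-less multiplication
        if (src2byte >> i) & 1:
            tword ^= src1byte << i
    for i in range(14, 7, -1):              # reduce bits 14 down to 8
        if (tword >> i) & 1:
            tword ^= 0x11B << (i - 8)
    return tword & 0xFF


def vgf2p8mulb(dest, src1, src2, k1=None, zeroing=False):
    """Model the EVEX form on equal-length byte strings."""
    out = bytearray(len(src1))
    for j in range(len(src1)):
        if k1 is None or (k1 >> j) & 1:
            out[j] = gf2p8mul_byte(src1[j], src2[j])
        elif zeroing:
            out[j] = 0                      # zeroing-masking
        else:
            out[j] = dest[j]                # merging-masking
    return bytes(out)


print(hex(gf2p8mul_byte(0x57, 0x83)))
print(vgf2p8mulb(bytes([0xAA, 0xAA]), bytes([0x57, 0x57]), bytes([0x83, 0x13]), k1=0b01).hex())
```

The carry-less product of two bytes has at most 15 significant bits, which is why the reduction loop starts at bit 14.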

Intel C/C++ Compiler Intrinsic Equivalent

VGF2P8MULB __m128i _mm_gf2p8mul_epi8(__m128i, __m128i);
VGF2P8MULB __m128i _mm_mask_gf2p8mul_epi8(__m128i, __mmask16, __m128i, __m128i);
VGF2P8MULB __m128i _mm_maskz_gf2p8mul_epi8(__mmask16, __m128i, __m128i);
VGF2P8MULB __m256i _mm256_gf2p8mul_epi8(__m256i, __m256i);
VGF2P8MULB __m256i _mm256_mask_gf2p8mul_epi8(__m256i, __mmask32, __m256i, __m256i);
VGF2P8MULB __m256i _mm256_maskz_gf2p8mul_epi8(__mmask32, __m256i, __m256i);
VGF2P8MULB __m512i _mm512_gf2p8mul_epi8(__m512i, __m512i);
VGF2P8MULB __m512i _mm512_mask_gf2p8mul_epi8(__m512i, __mmask64, __m512i, __m512i);
VGF2P8MULB __m512i _mm512_maskz_gf2p8mul_epi8(__mmask64, __m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
Legacy-encoded and VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
MOVDIRI—Move Doubleword as Direct Store

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>NP 0F 38 F9 /r MOVDIRI m32, r32</td>
<td>A</td>
<td>V/V</td>
<td>MOVDIRI</td>
<td>Move doubleword from r32 to m32 using direct store.</td>
</tr>
<tr>
<td>NP REX.W + 0F 38 F9 /r MOVDIRI m64, r64</td>
<td>A</td>
<td>V/N.E.</td>
<td>MOVDIRI</td>
<td>Move quadword from r64 to m64 using direct store.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:r/m (w)</td>
<td>ModRM:reg (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Moves the doubleword integer in the source operand (second operand) to the destination operand (first operand) using a direct-store operation. The source operand is a general purpose register. The destination operand is a 32-bit memory location (MODRM.MOD != 0b11). In 64-bit mode, the instruction’s default operation size is 32 bits. Use of the REX.R prefix permits access to additional registers (R8-R15). Use of the REX.W prefix promotes operation to 64 bits. See summary chart at the beginning of this section for encoding data and limits.

The direct-store is implemented by using write combining (WC) memory type protocol for writing data. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. If the destination address is cached, the line is written-back (if modified) and invalidated from the cache, before the direct-store. Unlike stores with non-temporal hint that allow uncached (UC) and write-protected (WP) memory-type for the destination to override the non-temporal hint, direct-stores always follow WC memory type protocol irrespective of the destination address memory type (including UC and WP types).

Unlike WC stores and stores with non-temporal hint, direct-stores are eligible for immediate eviction from the write-combining buffer, and thus not combined with younger stores (including direct-stores) to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with younger direct stores to the same address. Because WC protocol used by direct-stores follows a weakly-ordered memory consistency model, a fencing operation using SFENCE or MFENCE should follow the MOVDIRI instruction to enforce ordering when needed.

Direct-stores issued by MOVDIRI to a destination aligned to a 4-byte boundary (8-byte boundary if used with REX.W prefix) guarantee 4-byte (8-byte with REX.W prefix) write-completion atomicity. This means that the data arrives at the destination in a single undivided 4-byte (or 8-byte) write transaction. If the destination is not aligned for the write size, the direct-stores issued by MOVDIRI are split and arrive at the destination in two parts. Each part of such split direct-store will not merge with younger stores but can arrive at the destination in either order. Availability of the MOVDIRI instruction is indicated by the presence of the CPUID feature flag MOVDIRI (bit 27 of the ECX register in leaf 07H, see “CPUID — CPU Identification” in Chapter 1).
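
The enumeration and alignment rules above can be sketched in C. This is a minimal illustration, not a definitive implementation: the helper names `movdiri_supported` and `movdiri_write_parts` are hypothetical, and the caller is assumed to have already obtained the ECX value returned by CPUID leaf 07H, sub-leaf 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit position of the MOVDIRI flag in CPUID.(EAX=07H,ECX=0):ECX. */
#define CPUID_07_ECX_MOVDIRI_BIT 27u

/* Hypothetical helper: cpuid_07_ecx is the ECX value returned by
 * CPUID leaf 07H, sub-leaf 0 (obtained elsewhere by the caller). */
bool movdiri_supported(uint32_t cpuid_07_ecx)
{
    return ((cpuid_07_ecx >> CPUID_07_ECX_MOVDIRI_BIT) & 1u) != 0;
}

/* Hypothetical helper: number of write transactions in which a MOVDIRI
 * direct-store arrives at the destination. A naturally aligned store
 * arrives as one undivided write; an unaligned store is split in two. */
unsigned movdiri_write_parts(uintptr_t dest, size_t size)
{
    assert(size == 4 || size == 8);   /* 8 bytes only with REX.W */
    return (dest % size == 0) ? 1u : 2u;
}
```

Software that depends on write-completion atomicity (for example, when writing a device register) would be expected to use only destinations for which the second helper returns 1.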

Operation

DEST ← SRC;

Intel C/C++ Compiler Intrinsic Equivalent

MOVDIRI void _directstoreu_u32( void *dst, uint32_t val)
MOVDIRI void _directstoreu_u64( void *dst, uint64_t val)
Protected Mode Exceptions

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
#SS(0) For an illegal address in the SS segment.
#PF (fault-code) For a page fault.
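#UD If CPUID.07H.0H:ECX.MOVDIRI[bit 27] = 0.
If LOCK prefix or operand-size (66H) prefix is used.
#AC If alignment checking is enabled and an unaligned memory reference made while in current privilege level 3.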

Real-Address Mode Exceptions

#GP If any part of the operand lies outside the effective address space from 0 to FFFFH.
#UD If CPUID.07H.0H:ECX.MOVDIRI[bit 27] = 0.
If LOCK prefix or operand-size (66H) prefix is used.

Virtual-8086 Mode Exceptions
Same exceptions as in real address mode; additionally:

#PF (fault-code) For a page fault.
#AC If alignment checking is enabled and an unaligned memory reference made while in current privilege level 3.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If memory address referencing the SS segment is in non-canonical form.
#GP(0) If the memory address is in non-canonical form.
#PF (fault-code) For a page fault.
#UD If CPUID.07H.0H:ECX.MOVDIRI[bit 27] = 0.
#AC If alignment checking is enabled and an unaligned memory reference made while in current privilege level 3.
MOVDIR64B—Move 64 Bytes as Direct Store

**Description**

Moves 64 bytes as a direct-store with 64-byte write atomicity from the source memory address to the destination memory address. The source operand is a normal memory operand (MODRM.MOD != 0b11). The destination operand is a memory location specified in a general-purpose register. The register content is interpreted as an offset into the ES segment without any segment override. In 64-bit mode, the register operand width is 64 bits (32 bits with 67H prefix). Outside of 64-bit mode, the register width is 32 bits when CS.D=1 (16 bits with 67H prefix), and 16 bits when CS.D=0 (32 bits with 67H prefix). MOVDIR64B requires the destination address to be 64-byte aligned. No alignment restriction is enforced for the source operand.
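
The register-width rules in the preceding paragraph can be summarized as a small decision function. The sketch below is illustrative only; the function name and its boolean parameters are hypothetical and simply name the three inputs the text mentions (operating mode, CS.D, and the 67H prefix).

```c
#include <assert.h>
#include <stdbool.h>

/* Width, in bits, of the MOVDIR64B destination register operand.
 *   in_64bit_mode: the processor is executing in 64-bit mode
 *   cs_d:          value of CS.D (only consulted outside 64-bit mode)
 *   prefix_67h:    an address-size override prefix (67H) is present */
unsigned movdir64b_dest_reg_width(bool in_64bit_mode, bool cs_d, bool prefix_67h)
{
    if (in_64bit_mode)
        return prefix_67h ? 32u : 64u;
    if (cs_d)
        return prefix_67h ? 16u : 32u;
    return prefix_67h ? 32u : 16u;
}
```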

MOVDIR64B reads 64-bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The load operation follows normal read ordering based on source address memory-type. The direct-store is implemented by using the write combining (WC) memory type protocol for writing data. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. If the destination address is cached, the line is written-back (if modified) and invalidated from the cache, before the direct-store.

Unlike stores with non-temporal hint, which allow UC/WP memory-type for the destination to override the non-temporal hint, direct-stores always follow WC memory type protocol irrespective of the destination address memory type (including UC/WP types). Unlike WC stores and stores with non-temporal hint, direct-stores are eligible for immediate eviction from the write-combining buffer, and thus not combined with younger stores (including direct-stores) to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with younger direct stores to the same address. Because the WC protocol used by direct-stores follows a weakly-ordered memory consistency model, a fencing operation using SFENCE or MFENCE should follow the MOVDIR64B instruction to enforce ordering when needed.

There is no atomicity guarantee provided for the 64-byte load operation from source address, and processor implementations may use multiple load operations to read the 64-bytes. The 64-byte direct-store issued by MOVDIR64B guarantees 64-byte write-completion atomicity. This means that the data arrives at the destination in a single undivided 64-byte write transaction.

Availability of the MOVDIR64B instruction is indicated by the presence of the CPUID feature flag MOVDIR64B (bit 28 of the ECX register in leaf 07H, see “CPUID — CPU Identification” in Chapter 1).
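
As a rough illustration of the enumeration and alignment requirements, the following C sketch classifies the outcome of a MOVDIR64B before any memory access is considered. The names are hypothetical, the ECX value is assumed to come from CPUID leaf 07H, sub-leaf 0, and checking #UD before #GP is an assumption made for the sketch rather than a statement about fault priority.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_07_ECX_MOVDIR64B_BIT 28u
#define MOVDIR64B_ALIGNMENT        64u

typedef enum {
    MOVDIR64B_OK,   /* no enumeration or alignment problem */
    MOVDIR64B_UD,   /* instruction not enumerated */
    MOVDIR64B_GP    /* destination not 64-byte aligned */
} movdir64b_outcome;

/* Hypothetical pre-check. Only the destination address is examined,
 * because no alignment restriction applies to the source operand. */
movdir64b_outcome movdir64b_precheck(uint32_t cpuid_07_ecx, uint64_t dest_addr)
{
    if (((cpuid_07_ecx >> CPUID_07_ECX_MOVDIR64B_BIT) & 1u) == 0)
        return MOVDIR64B_UD;
    if (dest_addr % MOVDIR64B_ALIGNMENT != 0)
        return MOVDIR64B_GP;
    return MOVDIR64B_OK;
}
```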

**Operation**

DEST ← SRC;

**Intel C/C++ Compiler Intrinsic Equivalent**

`MOVDIR64B void _movdir64b(void *dst, const void* src)`
Protected Mode Exceptions

#GP(0) For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
If address in destination (register) operand is not aligned to a 64-byte boundary.

#SS(0) For an illegal address in the SS segment.

#PF (fault-code) For a page fault.

#UD If CPUID.07H.0H:ECX.MOVDIR64B[bit 28] = 0.
If LOCK prefix is used.

Real-Address Mode Exceptions

#GP If any part of the operand lies outside the effective address space from 0 to FFFFH.
If address in destination (register) operand is not aligned to a 64-byte boundary.

#UD If CPUID.07H.0H:ECX.MOVDIR64B[bit 28] = 0.
If LOCK prefix is used.

Virtual-8086 Mode Exceptions

Same exceptions as in real address mode.

#PF (fault-code) For a page fault.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#SS(0) If memory address referencing the SS segment is in non-canonical form.

#GP(0) If the memory address is in non-canonical form.
If address in destination (register) operand is not aligned to a 64-byte boundary.

#PF (fault-code) For a page fault.

#UD If CPUID.07H.0H:ECX.MOVDIR64B[bit 28] = 0.
If LOCK prefix is used.
PCONFIG — Platform Configuration

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/ En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>NP 0F 01 C5 PCONFIG</td>
<td>A</td>
<td>V/V</td>
<td>PCONFIG</td>
<td>This instruction is used to execute functions for configuring platform features. EAX: Leaf function to be invoked. RBX/RCX/RDX: Leaf-specific purpose.</td>
</tr>
</tbody>
</table>

Description

PCONFIG allows software to configure certain platform features. PCONFIG supports multiple leaf functions, with a leaf function identified by the value in EAX. The registers RBX, RCX, and RDX have leaf-specific purposes.

Each PCONFIG leaf function applies to a specific hardware block called a PCONFIG target, and each PCONFIG target is associated with a numerical identifier. The identifiers of the PCONFIG targets supported by the CPU (which imply the supported leaf functions) are enumerated in the sub-leaves of the PCONFIG-information leaf of CPUID (EAX = 1BH). An attempt to execute an undefined leaf function results in a general-protection exception (#GP).

Addresses and operands are 32 bits outside 64-bit mode (IA32_EFER.LMA = 0 || CS.L = 0) and are 64 bits in 64-bit mode (IA32_EFER.LMA = 1 && CS.L = 1). The value of CS.D has no effect on address calculation.

Table 2-2 shows the leaf encodings for PCONFIG.

<table>
<thead>
<tr>
<th>Leaf</th>
<th>Encoding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>MKTME_KEY_PROGRAM</td>
<td>00000000H</td>
<td>This leaf is used to program the key and encryption mode associated with a KeyID.</td>
</tr>
<tr>
<td>RESERVED</td>
<td>00000001H - FFFFFFFFH</td>
<td>Reserved for future use (#GP(0) if used).</td>
</tr>
</tbody>
</table>

The MKTME_KEY_PROGRAM leaf of PCONFIG pertains to the MKTME target, which has target identifier 1. It is used by software to manage the key associated with a KeyID. The leaf function is invoked by setting the leaf value of 0 in EAX and the address of MKTME_KEY_PROGRAM_STRUCT in RBX. Successful execution of the leaf sets RAX to zero and clears ZF, CF, PF, AF, OF, and SF. In case of failure, the failure reason is indicated in RAX, ZF is set to 1, and CF, PF, AF, OF, and SF are cleared. The MKTME_KEY_PROGRAM leaf uses the MKTME_KEY_PROGRAM_STRUCT in memory shown in Table 2-3.

<table>
<thead>
<tr>
<th>Field</th>
<th>Offset (bytes)</th>
<th>Size (bytes)</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>KEYID</td>
<td>0</td>
<td>2</td>
<td>Key Identifier.</td>
</tr>
<tr>
<td>KEYID_CTRL</td>
<td>2</td>
<td>4</td>
<td>KeyID control:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>• Bits[7:0]: COMMAND.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>• Bits[23:8]: ENC_ALG.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>• Bits[31:24]: Reserved, must be zero.</td>
</tr>
<tr>
<td>RESERVED</td>
<td>6</td>
<td>58</td>
<td>Reserved, must be zero.</td>
</tr>
<tr>
<td>KEY_FIELD_1</td>
<td>64</td>
<td>64</td>
<td>Software supplied KeyID data key or entropy for KeyID data key.</td>
</tr>
<tr>
<td>KEY_FIELD_2</td>
<td>128</td>
<td>64</td>
<td>Software supplied KeyID tweak key or entropy for KeyID tweak key.</td>
</tr>
</tbody>
</table>
A description of each of the fields in MKTME_KEY_PROGRAM_STRUCT is provided below:

- **KEYID**: Key Identifier being programmed to the MKTME engine.
- **KEYID_CTRL**: The KEYID_CTRL field carries two sub-fields used by software to control the behavior of a KeyID: Command and KeyID encryption algorithm.

The command used controls the encryption mode for a KeyID. Table 2-4 provides a summary of the commands supported.

### Table 2-4. Supported Key Programming Commands

<table>
<thead>
<tr>
<th>Command</th>
<th>Encoding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>KEYID_SET_KEY_DIRECT</td>
<td>0</td>
<td>Software uses this mode to directly program a key for use with KeyID.</td>
</tr>
<tr>
<td>KEYID_SET_KEY_RANDOM</td>
<td>1</td>
<td>CPU generates and assigns an ephemeral key for use with a KeyID. Each time the instruction is executed, the CPU generates a new key using a hardware random number generator and the keys are discarded on reset.</td>
</tr>
<tr>
<td>KEYID_CLEAR_KEY</td>
<td>2</td>
<td>Clear the (software programmed) key associated with the KeyID. On execution of this command, the KeyID gets TME behavior (encrypt with platform TME key).</td>
</tr>
<tr>
<td>KEYID_NO_ENCRYPT</td>
<td>3</td>
<td>Do not encrypt memory when this KeyID is in use.</td>
</tr>
</tbody>
</table>

The encryption algorithm field (ENC_ALG) allows software to select one of the activated encryption algorithms for the KeyID. The BIOS can activate a set of algorithms to allow for use when programming keys using the IA32_TME_ACTIVATE MSR (does not apply to KeyID 0 which uses TME policy). The ISA checks to ensure that the algorithm selected by software is one of the algorithms that has been activated by the BIOS.

- **KEY_FIELD_1**: This field carries the software supplied data key to be used for the KeyID if the direct key programming option is used (KEYID_SET_KEY_DIRECT). When the random key programming option is used (KEYID_SET_KEY_RANDOM), this field carries the software supplied entropy to be mixed in the CPU generated random data key. It is software's responsibility to ensure that the key supplied for the direct programming option or the entropy supplied for the random programming option does not result in weak keys. There are no explicit checks in the instruction to detect or prevent weak keys. When AES XTS-128 is used, the upper 48B are treated as reserved and must be zeroed out by software before executing the instruction.

- **KEY_FIELD_2**: This field carries the software supplied tweak key to be used for the KeyID if the direct key programming option is used (KEYID_SET_KEY_DIRECT). When the random key programming option is used (KEYID_SET_KEY_RANDOM), this field carries the software supplied entropy to be mixed in the CPU generated random tweak key. It is software's responsibility to ensure that the key supplied for the direct programming option or the entropy supplied for the random programming option does not result in weak keys. There are no explicit checks in the instruction to detect or prevent weak keys. When AES XTS-128 is used, the upper 48B are treated as reserved and must be zeroed out by software before executing the instruction.
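
The layout in Table 2-3 and the KEYID_CTRL sub-fields can be mirrored in a C structure. The sketch below is one possible rendering under stated assumptions: the type and function names are hypothetical, `#pragma pack` is used as a commonly supported way to obtain the byte-exact offsets, and the value passed as `enc_alg` is whatever one-hot algorithm encoding the platform has activated (its meaning is not defined here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Command encodings from Table 2-4. */
enum mktme_command {
    KEYID_SET_KEY_DIRECT = 0,
    KEYID_SET_KEY_RANDOM = 1,
    KEYID_CLEAR_KEY      = 2,
    KEYID_NO_ENCRYPT     = 3
};

/* Hypothetical C view of MKTME_KEY_PROGRAM_STRUCT (Table 2-3). */
#pragma pack(push, 1)
typedef struct {
    uint16_t keyid;            /* offset 0,   2 bytes  */
    uint32_t keyid_ctrl;       /* offset 2,   4 bytes  */
    uint8_t  reserved[58];     /* offset 6,   58 bytes, must be zero */
    uint8_t  key_field_1[64];  /* offset 64,  64 bytes */
    uint8_t  key_field_2[64];  /* offset 128, 64 bytes */
} mktme_key_program_struct;
#pragma pack(pop)

static_assert(offsetof(mktme_key_program_struct, keyid_ctrl) == 2, "KEYID_CTRL offset");
static_assert(offsetof(mktme_key_program_struct, reserved) == 6, "RESERVED offset");
static_assert(offsetof(mktme_key_program_struct, key_field_1) == 64, "KEY_FIELD_1 offset");
static_assert(offsetof(mktme_key_program_struct, key_field_2) == 128, "KEY_FIELD_2 offset");
static_assert(sizeof(mktme_key_program_struct) == 192, "structure size");

/* KEYID_CTRL: bits 7:0 COMMAND, bits 23:8 ENC_ALG, bits 31:24 zero. */
uint32_t mktme_make_keyid_ctrl(uint8_t command, uint16_t enc_alg)
{
    return (uint32_t)command | ((uint32_t)enc_alg << 8);
}

/* Fill the structure for direct programming with 16-byte keys.
 * Everything else, including the upper 48 bytes of each key field,
 * is left zero as the field descriptions require. */
void mktme_fill_direct_128(mktme_key_program_struct *s, uint16_t keyid,
                           uint16_t enc_alg,
                           const uint8_t data_key[16],
                           const uint8_t tweak_key[16])
{
    memset(s, 0, sizeof *s);
    s->keyid = keyid;
    s->keyid_ctrl = mktme_make_keyid_ctrl(KEYID_SET_KEY_DIRECT, enc_alg);
    memcpy(s->key_field_1, data_key, 16);
    memcpy(s->key_field_2, tweak_key, 16);
}
```

An instance handed to the instruction would additionally need to be placed at a 256-byte aligned address, for example by declaring the variable with `_Alignas(256)`.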

All KeyIDs use the TME key on MKTME activation. Software can at any point decide to change the key for a KeyID using the PCONFIG instruction. Change of keys for a KeyID does NOT change the state of the TLB caches or memory pipeline. It is software's responsibility to take appropriate actions to ensure correct behavior.

Table 2-5 shows the return values associated with the MKTME_KEY_PROGRAM leaf of PCONFIG. On instruction execution, RAX is populated with the return value.

### Table 2-5. MKTME_KEY_PROGRAM Return Values

<table>
<thead>
<tr>
<th>Return Value</th>
<th>Encoding</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>PROG_SUCCESS</td>
<td>0</td>
<td>KeyID was successfully programmed.</td>
</tr>
<tr>
<td>INVALID_PROG_CMD</td>
<td>1</td>
<td>Invalid KeyID programming command.</td>
</tr>
<tr>
<td>ENTROPY_ERROR</td>
<td>2</td>
<td>Insufficient entropy.</td>
</tr>
<tr>
<td>INVALID_KEYID</td>
<td>3</td>
<td>KeyID not valid.</td>
</tr>
<tr>
<td>INVALID_ENC_ALG</td>
<td>4</td>
<td>Invalid encryption algorithm chosen (not supported).</td>
</tr>
<tr>
<td>DEVICE_BUSY</td>
<td>5</td>
<td>Failure to access key table.</td>
</tr>
</tbody>
</table>
PCONFIG Virtualization
Software in VMX root mode can control the execution of PCONFIG in VMX non-root mode using the following execution controls introduced for PCONFIG:

- **PCONFIG_ENABLE**: This is a single-bit control that enables the PCONFIG instruction in VMX non-root mode. If 0, the execution of PCONFIG in VMX non-root mode causes #UD. Otherwise, execution of PCONFIG works according to PCONFIG_EXITING.

- **PCONFIG_EXITING**: This is a 64-bit control that allows VMX root mode to cause a VM-exit for various leaf functions of PCONFIG. This control does not have any effect if the PCONFIG_ENABLE control is clear.
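
The interaction of the two controls can be modeled as a small function. This is a sketch only: the type and function names are hypothetical, and the bit selection (leaf values above 62 share bit 63 of the exiting control) follows the Operation flow of this instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    PCONFIG_VMX_UD,       /* PCONFIG_ENABLE is clear */
    PCONFIG_VMX_VMEXIT,   /* the selected exiting bit is set */
    PCONFIG_VMX_EXECUTE   /* the instruction proceeds in the guest */
} pconfig_vmx_outcome;

/* Hypothetical model of PCONFIG executed in VMX non-root mode.
 * Leaf values 0..62 select the exiting bit of the same index;
 * every leaf value above 62 is covered by bit 63. */
pconfig_vmx_outcome pconfig_vmx_non_root(bool pconfig_enable,
                                         uint64_t pconfig_exiting,
                                         uint32_t eax)
{
    if (!pconfig_enable)
        return PCONFIG_VMX_UD;

    unsigned bit = (eax > 62u) ? 63u : eax;
    if ((pconfig_exiting >> bit) & 1u)
        return PCONFIG_VMX_VMEXIT;

    return PCONFIG_VMX_EXECUTE;
}
```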

PCONFIG Concurrency
In a scenario where the MKTME_KEY_PROGRAM leaf of PCONFIG is executed concurrently on multiple logical processors, only one logical processor will succeed in updating the key table. PCONFIG execution will return with an error code (DEVICE_BUSY) on other logical processors and software must retry. In cases where the instruction execution fails with a DEVICE_BUSY error code, the key table is not updated, thereby ensuring that either the key table is updated in its entirety with the information for a KeyID, or it is not updated at all.

In order to accomplish this, the MKTME_KEY_PROGRAM leaf of PCONFIG maintains a writer lock for updating the key table. This lock is referred to as the key table lock and is denoted in the instruction flows as KEY_TABLE_LOCK. The lock can either be unlocked, when no logical processor is holding the lock (also the initial state of the lock), or be in an exclusive state, where a logical processor is trying to update the key table. Only one logical processor can hold the lock in the exclusive state. The lock, being exclusive, can only be acquired when it is in the unlocked state.

PCONFIG uses the following syntax to acquire KEY_TABLE_LOCK in exclusive mode and release the lock:

- **KEY_TABLE_LOCK.ACQUIRE(WRITE)**
- **KEY_TABLE_LOCK.RELEASE()**
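
The try-lock behavior described above can be illustrated with a software model. The sketch below is not how hardware implements the lock; it only shows the all-or-nothing update semantics, using a C11 `atomic_flag` as a stand-in for KEY_TABLE_LOCK and a small array as a stand-in for the key table. All names other than the return codes from Table 2-5 are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Return codes from Table 2-5 that matter for this model. */
enum { PROG_SUCCESS = 0, DEVICE_BUSY = 5 };

#define MODEL_KEY_TABLE_ENTRIES 8u

/* Stand-ins for KEY_TABLE_LOCK and the MKTME key table. */
atomic_flag model_key_table_lock = ATOMIC_FLAG_INIT;
uint64_t model_key_table[MODEL_KEY_TABLE_ENTRIES];

/* KEY_TABLE_LOCK.ACQUIRE(WRITE): non-blocking, true on success. */
bool key_table_lock_acquire_write(void)
{
    return !atomic_flag_test_and_set_explicit(&model_key_table_lock,
                                              memory_order_acquire);
}

/* KEY_TABLE_LOCK.RELEASE() */
void key_table_lock_release(void)
{
    atomic_flag_clear_explicit(&model_key_table_lock, memory_order_release);
}

/* Model of one key-programming attempt. If the lock is held elsewhere,
 * the table is left untouched and DEVICE_BUSY is returned, so the
 * caller must retry. */
uint64_t model_key_program(unsigned keyid, uint64_t key)
{
    assert(keyid < MODEL_KEY_TABLE_ENTRIES);
    if (!key_table_lock_acquire_write())
        return DEVICE_BUSY;
    model_key_table[keyid] = key;
    key_table_lock_release();
    return PROG_SUCCESS;
}
```

A caller would typically wrap `model_key_program` in a bounded retry loop and treat any other non-zero return value as a hard failure.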

Operation

Table 2-6. PCONFIG Operation Variables

<table>
<thead>
<tr>
<th>Variable Name</th>
<th>Type</th>
<th>Size (Bytes)</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>TMP_KEY_PROGRAM_STRUCT</td>
<td>MKTME_KEY_PROGRAM_STRUCT</td>
<td>192</td>
<td>Structure holding the key programming structure.</td>
</tr>
<tr>
<td>TMP_RND_DATA_KEY</td>
<td>UINT128</td>
<td>16</td>
<td>Random data key generated for random key programming option.</td>
</tr>
<tr>
<td>TMP_RND_TWEAK_KEY</td>
<td>UINT128</td>
<td>16</td>
<td>Random tweak key generated for random key programming option.</td>
</tr>
</tbody>
</table>

(* #UD if PCONFIG is not enumerated or CPL > 0 *)
if (CPUID.7.0:EDX[18] == 0 OR CPL > 0) #UD;

if (in VMX non-root mode)
{
    if (VMCS.PCONFIG_ENABLE == 1)
    {
        if ((EAX > 62 AND VMCS.PCONFIG_EXITING[63] == 1) OR
            (EAX < 63 AND VMCS.PCONFIG_EXITING[EAX] == 1))
        {
            Set VMCS.EXIT_REASON = PCONFIG; //No Exit qualification
            Deliver VMEXIT;
        }
    }
    else
    {
        #UD
    }
}
(* #GP(0) for an unsupported leaf *)
if(EAX != 0) #GP(0)

(* KEY_PROGRAM leaf flow *)
if (EAX == 0)
{
    (* #GP(0) if TME_ACTIVATE MSR is not locked or does not enable TME or multiple keys are not enabled *)
    if (IA32_TME_ACTIVATE.LOCK != 1 OR IA32_TME_ACTIVATE.ENABLE != 1 OR IA32_TME_ACTIVATE.MK_TME_KEYID_BITS == 0) #GP(0)

    (* Check MKTME_KEY_PROGRAM_STRUCT is 256B aligned *)
    if(DS:RBX is not 256B aligned) #GP(0);

    (* Check that MKTME_KEY_PROGRAM_STRUCT is read accessible *)
    <<DS: RBX should be read accessible>>

    (* Copy MKTME_KEY_PROGRAM_STRUCT to a temporary variable *)
    TMP_KEY_PROGRAM_STRUCT = DS:RBX.*;

    (* RSVD field check *)
    if(TMP_KEY_PROGRAM_STRUCT.RSVD != 0) #GP(0);
    if(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.RSVD !=0) #GP(0);
    if(TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1.BYTES[63:16] != 0) #GP(0);
    if(TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2.BYTES[63:16] != 0) #GP(0);

    (* Check for a valid command *)
    if(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.COMMAND is not a valid command)
    {
        RFLAGS.ZF = 1;
        RAX = INVALID_PROG_CMD;
        goto EXIT;
    }

    (* Check that the KEYID being operated upon is a valid KEYID *)
    if(TMP_KEY_PROGRAM_STRUCT.KEYID > 2^IA32_TME_ACTIVATE.MK_TME_KEYID_BITS - 1
        OR TMP_KEY_PROGRAM_STRUCT.KEYID > IA32_TME_CAPABILITY.MK_TME_MAX_KEYS
        OR TMP_KEY_PROGRAM_STRUCT.KEYID == 0)
    {
        RFLAGS.ZF = 1;
        RAX = INVALID_KEYID;
        goto EXIT;
    }

    (* Check that only one algorithm is requested for the KeyID and it is one of the activated algorithms *)
    if(NUM_BITS(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG) != 1 ||
        (TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.ENC_ALG & IA32_TME_ACTIVATE.MK_TME_CRYPTO_ALGS == 0))
    {
        RFLAGS.ZF = 1;
        RAX = INVALID_ENC_ALG;
        goto EXIT;
    }

    (* Try to acquire exclusive lock *)
    if (NOT KEY_TABLE_LOCK.ACQUIRE(WRITE))
    {
        //PCONFIG failure
        RFLAGS.ZF = 1;
        RAX = DEVICE_BUSY;
        goto EXIT;
    }

    (* Lock is acquired and key table will be updated as per the command.
       Before this point no changes to the key table are made *)

    switch(TMP_KEY_PROGRAM_STRUCT.KEYID_CTRL.COMMAND)
    {
    case KEYID_SET_KEY_DIRECT:
        <<Write
            DATA_KEY=TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1,
            TWEAK_KEY=TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2,
            ENCRYPTION_MODE=ENCRYPT_WITH_KEYID_KEY,
            to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
        >>
        break;

    case KEYID_SET_KEY_RANDOM:
        TMP_RND_DATA_KEY = <<Generate a random key using hardware RNG>>
        if (NOT ENOUGH ENTROPY)
        {
            RFLAGS.ZF = 1;
            RAX = ENTROPY_ERROR;
            goto EXIT;
        }
        TMP_RND_TWEAK_KEY = <<Generate a random key using hardware RNG>>
        if (NOT ENOUGH ENTROPY)
        {
            RFLAGS.ZF = 1;
            RAX = ENTROPY_ERROR;
            goto EXIT;
        }
        (* Mix user supplied entropy to the data key and tweak key *)
        TMP_RND_DATA_KEY = TMP_RND_DATA_KEY XOR
            TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_1.BYTES[15:0];
        TMP_RND_TWEAK_KEY = TMP_RND_TWEAK_KEY XOR
            TMP_KEY_PROGRAM_STRUCT.KEY_FIELD_2.BYTES[15:0];

        <<Write
            DATA_KEY=TMP_RND_DATA_KEY,
            TWEAK_KEY=TMP_RND_TWEAK_KEY,
            ENCRYPTION_MODE=ENCRYPT_WITH_KEYID_KEY,
            to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
        >>
        break;

    case KEYID_CLEAR_KEY:
        <<Write
            DATA_KEY='0,
            TWEAK_KEY='0,
            ENCRYPTION_MODE=ENCRYPT_WITH_TME_KEY,
            to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
        >>
        break;

    case KEYID_NO_ENCRYPT:
        <<Write
            ENCRYPTION_MODE=NO_ENCRYPTION,
            to MKTME_KEY_TABLE at index TMP_KEY_PROGRAM_STRUCT.KEYID
        >>
        break;
    }

    RAX = 0;
    RFLAGS.ZF = 0;

    //Release Lock
    KEY_TABLE_LOCK.RELEASE();

EXIT:
    RFLAGS.CF=0;
    RFLAGS.PF=0;
    RFLAGS.AF=0;
    RFLAGS.OF=0;
    RFLAGS.SF=0;
}

end_of_flow
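
Two of the validity checks in the flow above translate directly into C. The sketch below is illustrative: the function names are hypothetical, and the MSR field values (the number of KeyID bits, the maximum number of keys, and the bitmap of activated algorithms) are assumed to have been read by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* KeyID check: KeyID 0 is rejected, and the KeyID must not exceed
 * either 2^MK_TME_KEYID_BITS - 1 or MK_TME_MAX_KEYS. */
bool mktme_keyid_valid(uint16_t keyid, unsigned keyid_bits, unsigned max_keys)
{
    assert(keyid_bits < 32u);
    uint32_t limit = (UINT32_C(1) << keyid_bits) - 1u;
    return keyid != 0 && keyid <= limit && keyid <= max_keys;
}

/* Count the bits set in a 16-bit value (NUM_BITS in the flow). */
unsigned mktme_num_bits(uint16_t v)
{
    unsigned n = 0;
    while (v != 0) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* ENC_ALG check: exactly one algorithm bit must be requested, and
 * that bit must be among the activated algorithms. */
bool mktme_enc_alg_valid(uint16_t enc_alg, uint16_t activated_algs)
{
    return mktme_num_bits(enc_alg) == 1 && (enc_alg & activated_algs) != 0;
}
```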

Intel C/C++ Compiler Intrinsic Equivalent
TBD

Protected Mode Exceptions

#GP(0) If input value in EAX encodes an unsupported leaf.
    If IA32_TME_ACTIVATE MSR is not locked.
    If TME and MKTME capability are not enabled in IA32_TME_ACTIVATE MSR.
    If the memory operand is not 256B aligned.
    If any of the reserved bits in MKTME_KEY_PROGRAM_STRUCT are set.
    If a memory operand effective address is outside the DS segment limit.

#PF(fault-code) If a page fault occurs in accessing memory operands.

#UD If any of the LOCK/REP/OSIZE/VEX prefixes are used.
    If current privilege level is not 0.
    If CPUID.7.0:EDX.PCONFIG[bit 18] = 0.
    If in VMX non-root mode and VMCS.PCONFIG_ENABLE = 0.
Real Address Mode Exceptions

#GP  If input value in EAX encodes an unsupported leaf.
     If IA32_TME_ACTIVATE MSR is not locked.
     If TME and MKTME capability is not enabled in IA32_TME_ACTIVATE MSR.
     If a memory operand is not 256B aligned.
     If any of the reserved bits in MKTME_KEY_PROGRAM_STRUCT are set.

#UD  If any of the LOCK/REP/OSIZE/VEX prefixes are used.
     If current privilege level is not 0.
     If CPUID.7.0:EDX.PCONFIG[bit 18] = 0
     If in VMX non-root mode and VMCS.PCONFIG_ENABLE = 0.

Virtual 8086 Mode Exceptions

#UD  PCONFIG instruction is not recognized in virtual-8086 mode.

Compatibility Mode Exceptions

Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0)  If input value in EAX encodes an unsupported leaf.
        If IA32_TME_ACTIVATE MSR is not locked.
        If TME and MKTME capability is not enabled in IA32_TME_ACTIVATE MSR.
        If a memory operand is not 256B aligned.
        If any of the reserved bits in MKTME_KEY_PROGRAM_STRUCT are set.
        If a memory operand is in non-canonical form.

#PF(fault-code)  If a page fault occurs in accessing memory operands.
#UD  If any of the LOCK/REP/OSIZE/VEX prefixes are used.
     If the current privilege level is not 0.
     If CPUID.7.0:EDX.PCONFIG[bit 18] = 0.
     If in VMX non-root mode and VMCS.PCONFIG_ENABLE = 0.
TPAUSE—Timed PAUSE

<table>
<thead>
<tr>
<th>Opcode / Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F AE /6 TPAUSE r32, &lt;edx&gt;, &lt;eax&gt;</td>
<td>A</td>
<td>V/V</td>
<td>WAITPKG</td>
<td>Directs the processor to enter an implementation-dependent optimized state until the TSC reaches the value in EDX:EAX.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

TPAUSE instructs the processor to enter an implementation-dependent optimized state. There are two such optimized states to choose from: light-weight power/performance optimized state, and improved power/performance optimized state. The selection between the two is governed by the explicit input register bit[0] source operand.

TPAUSE is available when CPUID.7.0:ECX.WAITPKG[bit 5] is enumerated as 1. TPAUSE may be executed at any privilege level. This instruction’s operation is the same in non-64-bit modes and in 64-bit mode.

Unlike PAUSE, the TPAUSE instruction will not cause an abort when used inside a transactional region, described in the chapter “Programming with Intel Transactional Synchronization Extensions” of the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

The input register contains information such as the preferred optimized state the processor should enter as described in the following table. Bits other than bit 0 are reserved and will result in #GP if non-zero.

**Table 2-7. TPAUSE Input Register Bit Definitions**

<table>
<thead>
<tr>
<th>Bit Value</th>
<th>State Name</th>
<th>Wakeup Time</th>
<th>Power Savings</th>
<th>Other Benefits</th>
</tr>
</thead>
<tbody>
<tr>
<td>bit[0] = 0</td>
<td>C0.2</td>
<td>Slower</td>
<td>Larger</td>
<td>Improves performance of the other SMT thread(s).</td>
</tr>
<tr>
<td>bit[0] = 1</td>
<td>C0.1</td>
<td>Faster</td>
<td>Smaller</td>
<td>NA</td>
</tr>
<tr>
<td>bits[31:1]</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>Reserved</td>
</tr>
</tbody>
</table>

The instruction execution wakes up when the time-stamp counter reaches or exceeds the implicit EDX:EAX 64-bit input value.

Prior to executing the TPAUSE instruction, an operating system may specify the maximum delay it allows the processor to suspend its operation. It can do so by writing TSC-quanta value to the following 32-bit MSR (IA32_UMWAIT_CONTROL at MSR index E1H):

- IA32_UMWAIT_CONTROL[31:2] — Determines the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. A zero value indicates no maximum time. The maximum time value is a 32-bit value where the upper 30 bits come from this field and the lower two bits are zero.
- IA32_UMWAIT_CONTROL[0] — C0.2 is not allowed by the OS. Value of “1” means all C0.2 requests revert to C0.1.

If the processor that executed a TPAUSE instruction wakes due to the expiration of the operating system time-limit, the instruction sets RFLAGS.CF; otherwise, that flag is cleared.
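
The IA32_UMWAIT_CONTROL fields and their effect on the requested state can be sketched as follows. The type and function names are hypothetical; the MSR value and the TPAUSE source register value are assumed to be supplied by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t max_time_tsc;   /* bits 31:2 with the low two bits zero; 0 = no limit */
    bool     c02_disabled;   /* bit 0: C0.2 requests revert to C0.1 */
} umwait_control;

typedef enum { TPAUSE_STATE_C0_1, TPAUSE_STATE_C0_2 } tpause_state;

/* Decode the 32-bit IA32_UMWAIT_CONTROL value. */
umwait_control decode_umwait_control(uint32_t msr_value)
{
    umwait_control c;
    c.max_time_tsc = msr_value & ~UINT32_C(3);
    c.c02_disabled = (msr_value & 1u) != 0;
    return c;
}

/* State actually entered for a given TPAUSE source register value.
 * Bits 31:1 of the request are reserved (#GP if non-zero), so the
 * sketch treats them as a precondition. */
tpause_state tpause_effective_state(uint32_t request, uint32_t msr_value)
{
    assert((request >> 1) == 0);
    if (request & 1u)
        return TPAUSE_STATE_C0_1;
    return decode_umwait_control(msr_value).c02_disabled ? TPAUSE_STATE_C0_1
                                                         : TPAUSE_STATE_C0_2;
}
```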

The following additional events cause the processor to exit the implementation-dependent optimized state: a store to the read-set range within the transactional region, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal.

Other implementation-dependent events may cause the processor to exit the implementation-dependent optimized state proceeding to the instruction following TPAUSE. In addition, an external interrupt causes the processor to exit the implementation-dependent optimized state regardless of whether maskable-interrupts are inhibited (EFLAGS.IF = 0). It should be noted that if maskable-interrupts are inhibited execution will proceed to the instruction following TPAUSE.

MODRM.MOD must be 0b11 for this instruction.

**Operation**

os_deadline ← TSC + (IA32_UMWAIT_CONTROL[31:2] << 2)
instr_deadline ← UINT64(EDX:EAX)

IF os_deadline < instr_deadline:
    deadline ← os_deadline
    using_os_deadline ← 1
ELSE:
    deadline ← instr_deadline
    using_os_deadline ← 0

WHILE TSC < deadline:
    implementation_dependent_optimized_state(Source register, deadline, IA32_UMWAIT_CONTROL[0])

IF using_os_deadline AND TSC > deadline:
    RFLAGS.CF ← 1
ELSE:
    RFLAGS.CF ← 0

RFLAGS.AF,PF,SF,ZF,OF ← 0
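
The deadline selection and carry-flag result can be modeled in C as a sanity check on the flow above. This is a sketch under stated assumptions: the names are hypothetical, TSC wraparound is ignored, and a zero maximum-time field is treated as "no operating system limit", following the description of IA32_UMWAIT_CONTROL[31:2] rather than the literal pseudocode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t deadline;           /* TSC value at which the wait ends */
    bool     using_os_deadline;  /* true if the OS limit is the binding one */
} tpause_deadline;

/* Choose between the instruction deadline (EDX:EAX) and the OS limit. */
tpause_deadline tpause_pick_deadline(uint64_t tsc_now,
                                     uint32_t umwait_control,
                                     uint64_t edx_eax)
{
    tpause_deadline d = { edx_eax, false };
    uint64_t max_time = umwait_control & ~UINT32_C(3);

    /* Assumption: a zero field means the OS imposes no maximum time. */
    if (max_time != 0) {
        uint64_t os_deadline = tsc_now + max_time;
        if (os_deadline < edx_eax) {
            d.deadline = os_deadline;
            d.using_os_deadline = true;
        }
    }
    return d;
}

/* RFLAGS.CF after wake-up: set only if the OS limit was in effect and
 * the TSC has passed it. */
bool tpause_carry_flag(tpause_deadline d, uint64_t tsc_at_wake)
{
    return d.using_os_deadline && tsc_at_wake > d.deadline;
}
```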

**Intel C/C++ Compiler Intrinsic Equivalent**

TPAUSE uint8_t _tpause(uint32_t control, uint64_t counter);

**Numeric Exceptions**

None.

**Exceptions (All Operating Modes)**

- **#GP(0)** If src[31:1] != 0.
- **#UD** If CPUID.7.0:ECX.WAITPKG[bit 5]=0.
  If MODRM.MOD != 0b11.
UMONITOR—User Level Set Up Monitor Address

<table>
<thead>
<tr>
<th>Opcode / Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F AE /6 UMONITOR r16/r32/r64</td>
<td>A</td>
<td>V/V</td>
<td>WAITPKG</td>
<td>Sets up a linear address range to be monitored by hardware and activates the monitor. The address range should be a write-back memory caching type. The address is contained in r16/r32/r64.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

The UMONITOR instruction arms address monitoring hardware using an address specified in the source register (the address range that the monitoring hardware checks for store operations can be determined by using the CPUID monitor leaf function, EAX=05H). A store to an address within the specified address range triggers the monitoring hardware. The state of monitor hardware is used by UMWAIT.

The content of the source register is an effective address. By default, the DS segment is used to create a linear address that is monitored. Segment overrides can be used. The address range must use memory of the write-back type. Only write-back memory is guaranteed to correctly trigger the monitoring hardware. Additional information on determining what address range to use in order to prevent false wake-ups is described in Chapter 8, “Multiple-Processor Management” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
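
As an illustration of how software might size a monitored object, the sketch below extracts the monitor-line sizes from CPUID leaf 05H and pads a structure to a whole number of lines. The function names are hypothetical, the register values are assumed to be supplied by the caller, and padding to the largest line size is shown as one common way to reduce false wake-ups, not as a requirement stated here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CPUID.05H:EAX[15:0] reports the smallest monitor-line size in bytes. */
uint32_t smallest_monitor_line(uint32_t cpuid_05_eax)
{
    return cpuid_05_eax & 0xFFFFu;
}

/* CPUID.05H:EBX[15:0] reports the largest monitor-line size in bytes. */
uint32_t largest_monitor_line(uint32_t cpuid_05_ebx)
{
    return cpuid_05_ebx & 0xFFFFu;
}

/* Round an object size up to a whole number of monitor lines so that
 * unrelated data does not share the monitored range. */
size_t monitor_padded_size(size_t object_size, size_t line_size)
{
    assert(line_size != 0);
    return ((object_size + line_size - 1) / line_size) * line_size;
}
```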

The UMONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction is subject to the permission checking and faults associated with a byte load. Like a load, UMONITOR sets the A-bit but not the D-bit in page tables.

UMONITOR and UMWAIT are available when CPUID.7.0:ECX.WAITPKG[bit 5] is enumerated as 1. UMONITOR and UMWAIT may be executed at any privilege level. Except for the width of the source register, the instruction’s operation is the same in non-64-bit modes and in 64-bit mode.

UMONITOR does not interoperate with the legacy MWAIT instruction. If UMONITOR was executed prior to executing MWAIT and following the most recent execution of the legacy MONITOR instruction, MWAIT will not enter an optimized state. Execution will continue to the instruction following MWAIT.

The UMONITOR instruction causes a transactional abort when used inside a transactional region.

The width of the source register (16, 32, or 64 bits) is determined by the effective addressing width, which is affected in the standard way by the machine mode settings and the 67H prefix.

Operation

UMONITOR sets up an address range for the monitor hardware using the content of source register as an effective address and puts the monitor hardware in armed state. A store to the specified address range will trigger the monitor hardware.

Intel C/C++ Compiler Intrinsic Equivalent

UMONITOR void _umonitor(void *address);

Numeric Exceptions

None
Protected Mode Exceptions

#GP(0) If the specified segment is not SS and the source register is outside the specified segment limit.
    If the specified segment register contains a NULL segment selector.
#SS(0) If the specified segment is SS and the source register is outside the SS segment limit.
#PF(fault-code) For a page fault.
#UD If CPUID.7.0:ECX.WAITPKG[bit 5]=0.
               If MODRM.MOD != 0b11.

Real Address Mode Exceptions

#GP If the specified segment is not SS and the source register is outside of the effective address space from 0 to FFFFH.
#SS If the specified segment is SS and the source register is outside of the effective address space from 0 to FFFFH.
#UD If CPUID.7.0:ECX.WAITPKG[bit 5]=0.

Virtual 8086 Mode Exceptions
Same exceptions as in real address mode; additionally:

#PF(fault-code) For a page fault.

Compatibility Mode Exceptions
Same exceptions as in protected mode.

64-Bit Mode Exceptions

#GP(0) If the specified segment is not SS and the linear address is in non-canonical form.
#SS(0) If the specified segment is SS and the source register is in non-canonical form.
#PF(fault-code) For a page fault.
#UD If CPUID.7.0:ECX.WAITPKG[bit 5]=0.
               If MODRM.MOD != 0b11.
UMWAIT—User Level Monitor Wait

**Opcodes / Instruction**

<table>
<thead>
<tr>
<th>Opcode / Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F AE /6 UMWAIT r32, &lt;edx&gt;, &lt;eax&gt;</td>
<td>A</td>
<td>V/V</td>
<td>WAITPKG</td>
<td>A hint that allows the processor to stop instruction execution and enter an implementation-dependent optimized state until occurrence of a class of events.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

UMWAIT instructs the processor to enter an implementation-dependent optimized state while monitoring a range of addresses. The optimized state may be either a light-weight power/performance optimized state or an improved power/performance optimized state. The selection between the two states is governed by bit[0] of the explicit input register (the source operand).

UMWAIT is available when CPUID.7.0:ECX.WAITPKG[bit 5] is enumerated as 1. UMWAIT may be executed at any privilege level. This instruction’s operation is the same in non-64-bit modes and in 64-bit mode.

The input register contains information such as the preferred optimized state the processor should enter as described in the following table. Bits other than bit 0 are reserved and will result in #GP if nonzero.

**Table 2-8. UMWAIT Input Register Bit Definitions**

<table>
<thead>
<tr>
<th>Bit Value</th>
<th>State Name</th>
<th>Wakeup Time</th>
<th>Power Savings</th>
<th>Other Benefits</th>
</tr>
</thead>
<tbody>
<tr>
<td>bit[0] = 0</td>
<td>C0.2</td>
<td>Slower</td>
<td>Larger</td>
<td>Improves performance of the other SMT thread(s) on the same core.</td>
</tr>
<tr>
<td>bit[0] = 1</td>
<td>C0.1</td>
<td>Faster</td>
<td>Smaller</td>
<td>NA</td>
</tr>
<tr>
<td>bits[31:1]</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
<td>Reserved</td>
</tr>
</tbody>
</table>
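
As a rough illustration of Table 2-8, the following Python sketch decodes a UMWAIT input register value. The helper name `umwait_requested_state` is hypothetical, and a `ValueError` stands in for the #GP(0) that the processor raises when a reserved bit is set; this models the table only, not the hardware.

```python
def umwait_requested_state(control: int) -> str:
    """Return the optimized state requested by a UMWAIT input register value.

    Hypothetical helper: ValueError stands in for #GP(0), which is raised
    when any reserved bit in [31:1] is nonzero.
    """
    control &= 0xFFFFFFFF      # the explicit operand is a 32-bit register
    if control >> 1:           # bits[31:1] are reserved
        raise ValueError("#GP(0): reserved bits set in UMWAIT input register")
    # bit[0] = 1 requests C0.1 (faster wakeup), bit[0] = 0 requests C0.2
    return "C0.1" if control & 1 else "C0.2"
```

The OS can still override a C0.2 request through IA32_UMWAIT_CONTROL[0], which this helper deliberately ignores.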

The instruction wakes up when the time-stamp counter reaches or exceeds the implicit EDX:EAX 64-bit input value (if the monitoring hardware did not trigger beforehand).

Prior to executing the UMWAIT instruction, an operating system may specify the maximum delay it allows the processor to suspend its operation. It can do so by writing a TSC-quanta value to the following 32-bit MSR (IA32_UMWAIT_CONTROL at MSR index E1H):

- **IA32_UMWAIT_CONTROL[31:2]** — Determines the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. A zero value indicates no maximum time. The maximum time value is a 32-bit value where the upper 30 bits come from this field and the lower two bits are zero.

- **IA32_UMWAIT_CONTROL[1]** — Reserved.

- **IA32_UMWAIT_CONTROL[0]** — C0.2 is not allowed by the OS. Value of “1” means all C0.2 requests revert to C0.1.
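
The field layout above can be sketched as a pair of Python helpers. The function names are hypothetical, and the sketch only packs and unpacks the 32-bit value; actually reading or writing the MSR requires privileged, platform-specific code that is not shown.

```python
def encode_umwait_control(max_time: int, c02_disabled: bool) -> int:
    """Pack IA32_UMWAIT_CONTROL from a maximum time (TSC quanta) and the C0.2-disable bit.

    max_time is the full 32-bit maximum time; its low two bits must be zero
    because only bits [31:2] are stored. A value of 0 means no maximum.
    """
    if not 0 <= max_time <= 0xFFFFFFFF:
        raise ValueError("maximum time must fit in 32 bits")
    if max_time & 0x3:
        raise ValueError("maximum time must be a multiple of 4 TSC quanta")
    return max_time | int(c02_disabled)      # bit 1 stays reserved (zero)


def decode_umwait_control(msr_value: int):
    """Unpack IA32_UMWAIT_CONTROL into (max_time, c02_disabled)."""
    msr_value &= 0xFFFFFFFF
    max_time = msr_value & 0xFFFFFFFC        # bits [31:2], low two bits read as zero
    c02_disabled = bool(msr_value & 0x1)     # 1: C0.2 requests revert to C0.1
    return max_time, c02_disabled
```

For example, `encode_umwait_control(100000, True)` yields a value that limits each wait to 100000 TSC quanta and makes all C0.2 requests revert to C0.1.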

If the processor that executed a UMWAIT instruction wakes due to the expiration of the operating system time-limit, the instruction sets RFLAGS.CF; otherwise, that flag is cleared.

The UMWAIT instruction causes a transactional abort when used inside a transactional region.

The UMWAIT instruction operates with the UMONITOR instruction. The two instructions allow the definition of an address at which to wait (UMONITOR) and an implementation-dependent optimized operation to perform while waiting (UMWAIT). The execution of UMWAIT is a hint to the processor that it can enter an implementation-dependent-optimized state while waiting for an event or a store operation to the address range armed by UMONITOR.

The following additional events cause the processor to exit the implementation-dependent optimized state: a store to the address range armed by the UMONITOR instruction, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent optimized state.

In addition, an external interrupt causes the processor to exit the implementation-dependent optimized state regardless of whether maskable interrupts are inhibited (EFLAGS.IF = 0).

Following exit from the implementation-dependent-optimized state, control passes to the instruction after the UMWAIT instruction. A pending interrupt that is not masked (including an NMI or an SMI) may be delivered before execution of that instruction.

Unlike the HLT instruction, the UMWAIT instruction does not restart at the UMWAIT instruction following the handling of an SMI.

If the preceding UMONITOR instruction did not successfully arm an address range or if UMONITOR was not executed prior to executing UMWAIT and following the most recent execution of the legacy MONITOR instruction (UMWAIT does not interoperate with MONITOR), then the processor will not enter an optimized state. Execution will continue to the instruction following UMWAIT.

A store to the address range armed by the UMONITOR instruction will cause the processor to exit UMWAIT if either the store was originated by other processor agents or the store was originated by a non-processor agent. MODRM.MOD must be 0b11 for this instruction.

**Operation**

```plaintext
os_deadline ← TSC+(IA32_UMWAIT_CONTROL[31:2] << 2)
instr_deadline ← UINT64(EDX:EAX)

IF os_deadline < instr_deadline:
    deadline ← os_deadline
    using_os_deadline ← 1
ELSE:
    deadline ← instr_deadline
    using_os_deadline ← 0

WHILE monitor hardware armed AND TSC < deadline:
    implementation_dependent_optimized_state(Source register, deadline, IA32_UMWAIT_CONTROL[0])

IF using_os_deadline AND TSC ≥ deadline:
    RFLAGS.CF ← 1
ELSE:
    RFLAGS.CF ← 0

RFLAGS.AF,PF,SF,ZF,OF ← 0
```
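
The deadline selection in the pseudocode can be modeled in a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the TSC is passed in as a plain integer, reaching the deadline is treated as expiry (matching the loop exit condition), and the "zero means no maximum time" rule from the IA32_UMWAIT_CONTROL field description is applied explicitly even though the pseudocode does not spell it out.

```python
MASK64 = (1 << 64) - 1


def umwait_deadline(tsc_now: int, edx_eax: int, umwait_control: int):
    """Return (deadline, using_os_deadline) for one UMWAIT execution."""
    max_time = umwait_control & 0xFFFFFFFC   # IA32_UMWAIT_CONTROL[31:2] << 2
    instr_deadline = edx_eax & MASK64        # UINT64(EDX:EAX)
    if max_time == 0:                        # assumption: zero means no OS limit
        return instr_deadline, False
    os_deadline = tsc_now + max_time
    if os_deadline < instr_deadline:
        return os_deadline, True
    return instr_deadline, False


def umwait_carry_flag(tsc_at_wake: int, deadline: int, using_os_deadline: bool) -> int:
    """RFLAGS.CF after wakeup: 1 only if the OS time limit expired."""
    return 1 if using_os_deadline and tsc_at_wake >= deadline else 0
```

A caller that sees CF = 1 knows the wait was cut short by the operating system limit rather than by its own deadline or a store to the monitored range.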

**Intel C/C++ Compiler Intrinsic Equivalent**

```c
UMWAIT uint8_t _umwait(uint32_t control, uint64_t counter);
```

**Numeric Exceptions**

None

**Exceptions (All Operating Modes)**

- #GP(0) If src[31:1] != 0.
- #UD If CPUID.7.0:ECX.WAITPKG[bit 5]=0.
  If MODRM.MOD != 0b11.
**VAESDEC — Perform One Round of an AES Decryption Flow**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F38.WIG DE /r VAESDEC ymm1, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>VAES</td>
<td>Perform one round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F38.WIG DE /r VAESDEC xmm1, xmm2, xmm3/m128</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform one round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from xmm2 with a 128-bit round key from xmm3/m128; store the result in xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.WIG DE /r VAESDEC ymm1, ymm2, ymm3/m256</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform one round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.WIG DE /r VAESDEC zmm1, zmm2, zmm3/m512</td>
<td>B</td>
<td>V/V</td>
<td>AVX512F VAES</td>
<td>Perform one round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from zmm2 with a 128-bit round key from zmm3/m512; store the result in zmm1.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

This instruction performs a single round of the AES decryption flow using the Equivalent Inverse Cipher, with the round key from the second source operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination operand.

Use the AESDEC instruction for all but the last decryption round. For the last decryption round, use the AESDECLAST instruction.

VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy encoded versions of the instruction require that the first source operand and the destination operand are the same and must be an XMM register.

The EVEX encoded form of this instruction does not support memory fault suppression.

**Operation**

**AESDEC**

STATE ← SRC1  
RoundKey ← SRC2  
STATE ← InvShiftRows( STATE )  
STATE ← InvSubBytes( STATE )  
STATE ← InvMixColumns( STATE )  
DEST[127:0] ← STATE XOR RoundKey  
DEST[MAXVL-1:128] (Unmodified)
VAESDEC (128b and 256b VEX encoded versions)
(KL,VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
   STATE ← SRC1.xmm[i]
   RoundKey ← SRC2.xmm[i]
   STATE ← InvShiftRows( STATE )
   STATE ← InvSubBytes( STATE )
   STATE ← InvMixColumns( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

VAESDEC (EVEX encoded version)
(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
   STATE ← SRC1.xmm[i]
   RoundKey ← SRC2.xmm[i]
   STATE ← InvShiftRows( STATE )
   STATE ← InvSubBytes( STATE )
   STATE ← InvMixColumns( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0
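
The per-lane decryption round above can be sketched as a software model in Python. This is only an illustration: all function names are hypothetical, the state is a 16-byte string in register byte order (byte 0 first), and the inverse S-box is derived from the AES field arithmetic rather than taken from a table. It is neither constant-time nor suitable for production cryptography.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def _sbox_entry(x):
    inv = 1
    for _ in range(254):          # x^254 is the multiplicative inverse (0 maps to 0)
        inv = gf_mul(inv, x)
    out = 0x63
    for shift in range(5):        # affine step: XOR of left rotations by 0..4
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


INV_SBOX = [0] * 256
for value in range(256):
    INV_SBOX[_sbox_entry(value)] = value


def inv_shift_rows(state):
    # Byte r + 4*c holds row r, column c; row r rotates right by r columns.
    return bytes(state[r + 4 * ((c - r) % 4)] for c in range(4) for r in range(4))


def inv_mix_columns(state):
    coeffs = (0x0E, 0x0B, 0x0D, 0x09)
    out = bytearray(16)
    for c in range(4):
        col = state[4 * c:4 * c + 4]
        for r in range(4):
            acc = 0
            for k in range(4):
                acc ^= gf_mul(coeffs[(k - r) % 4], col[k])
            out[4 * c + r] = acc
    return bytes(out)


def aesdec(state, round_key, last=False):
    """One Equivalent Inverse Cipher round; last=True skips InvMixColumns."""
    s = inv_shift_rows(state)
    s = bytes(INV_SBOX[b] for b in s)
    if not last:
        s = inv_mix_columns(s)
    return bytes(a ^ b for a, b in zip(s, round_key))


def vaesdec(src1, src2, last=False):
    """Apply the round independently to each 128-bit lane (16, 32 or 64 bytes)."""
    assert len(src1) == len(src2) and len(src1) in (16, 32, 64)
    return b"".join(aesdec(src1[i:i + 16], src2[i:i + 16], last)
                    for i in range(0, len(src1), 16))
```

Because each lane is processed independently, a 256-bit or 512-bit operation gives the same result as two or four separate 128-bit rounds.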

Intel C/C++ Compiler Intrinsic Equivalent
VAESDEC __m256i _mm256_aesdec_epi128(__m256i, __m256i);
VAESDEC __m512i _mm512_aesdec_epi128(__m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
VAESDECLAST — Perform Last Round of an AES Decryption Flow

### Opcode/Instruction

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F38.WIG DF /r VAESDECLAST ymm1, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>VAES</td>
<td>Perform the last round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F38.WIG DF /r VAESDECLAST xmm1, xmm2, xmm3/m128</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform the last round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from xmm2 with a 128-bit round key from xmm3/m128; store the result in xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.WIG DF /r VAESDECLAST ymm1, ymm2, ymm3/m256</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform the last round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.WIG DF /r VAESDECLAST zmm1, zmm2, zmm3/m512</td>
<td>B</td>
<td>V/V</td>
<td>AVX512F VAES</td>
<td>Perform the last round of an AES decryption flow, using the Equivalent Inverse Cipher, operating on a 128-bit data (state) from zmm2 with a 128-bit round key from zmm3/m512; store the result in zmm1.</td>
</tr>
</tbody>
</table>

### Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

### Description

This instruction performs the last round of the AES decryption flow using the Equivalent Inverse Cipher, with the round key from the second source operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination operand.

VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy encoded versions of the instruction require that the first source operand and the destination operand are the same and must be an XMM register.

The EVEX encoded form of this instruction does not support memory fault suppression.

### Operation

**AESDECLAST**

STATE ← SRC1
RoundKey ← SRC2
STATE ← InvShiftRows( STATE )
STATE ← InvSubBytes( STATE )
DEST[127:0] ← STATE XOR RoundKey
DEST[MAXVL-1:128] (Unmodified)
VAESDECLAST (128b and 256b VEX encoded versions)
(KL,VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
  STATE ← SRC1.xmm[i]
  RoundKey ← SRC2.xmm[i]
  STATE ← InvShiftRows( STATE )
  STATE ← InvSubBytes( STATE )
  DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

VAESDECLAST (EVEX encoded version)
(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
  STATE ← SRC1.xmm[i]
  RoundKey ← SRC2.xmm[i]
  STATE ← InvShiftRows( STATE )
  STATE ← InvSubBytes( STATE )
  DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VAESDECLAST __m256i _mm256_aesdeclast_epi128(__m256i, __m256i);
VAESDECLAST __m512i _mm512_aesdeclast_epi128(__m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
VAESENC — Perform One Round of an AES Encryption Flow

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F38.WIG DC /r VAESENC ymm1, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>VAES</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from the ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F38.WIG DC /r VAESENC xmm1, xmm2, xmm3/m128</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from xmm2 with a 128-bit round key from the xmm3/m128; store the result in xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.WIG DC /r VAESENC ymm1, ymm2, ymm3/m256</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from ymm2 with a 128-bit round key from the ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.WIG DC /r VAESENC zmm1, zmm2, zmm3/m512</td>
<td>B</td>
<td>V/V</td>
<td>AVX512F VAES</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from zmm2 with a 128-bit round key from the zmm3/m512; store the result in zmm1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

This instruction performs a single round of an AES encryption flow using a round key from the second source operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination operand.

Use the AESENC instruction for all but the last encryption round. For the last encryption round, use the AESENCLAST instruction.

VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy encoded versions of the instruction require that the first source operand and the destination operand are the same and must be an XMM register.

The EVEX encoded form of this instruction does not support memory fault suppression.

Operation

**AESENC**

STATE ← SRC1
RoundKey ← SRC2
STATE ← ShiftRows( STATE )
STATE ← SubBytes( STATE )
STATE ← MixColumns( STATE )
DEST[127:0] ← STATE XOR RoundKey
DEST[MAXVL-1:128] (Unmodified)
VAESENC (128b and 256b VEX encoded versions)
(KL,VL) = (1,128), (2,256)
FOR i ← 0 to KL-1:
   STATE ← SRC1.xmm[i]
   RoundKey ← SRC2.xmm[i]
   STATE ← ShiftRows( STATE )
   STATE ← SubBytes( STATE )
   STATE ← MixColumns( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

VAESENC (EVEX encoded version)
(KL,VL) = (1,128), (2,256), (4,512)
FOR i ← 0 to KL-1:
   STATE ← SRC1.xmm[i] // xmm[i] is the i'th xmm word in the SIMD register
   RoundKey ← SRC2.xmm[i]
   STATE ← ShiftRows( STATE )
   STATE ← SubBytes( STATE )
   STATE ← MixColumns( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0
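
To make the lane-wise round concrete, here is a small Python model of the encryption round. It is a sketch under assumptions: the function names are hypothetical, the state is a 16-byte string in register byte order (byte 0 first), and the S-box is generated from the AES field arithmetic instead of being hard-coded. It is not constant-time and is meant only to mirror the pseudocode. Passing `last=True` skips MixColumns, which corresponds to the last-round variant.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def _sbox_entry(x):
    inv = 1
    for _ in range(254):          # x^254 is the multiplicative inverse (0 maps to 0)
        inv = gf_mul(inv, x)
    out = 0x63
    for shift in range(5):        # affine step: XOR of left rotations by 0..4
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SBOX = [_sbox_entry(x) for x in range(256)]


def shift_rows(state):
    # Byte r + 4*c holds row r, column c; row r rotates left by r columns.
    return bytes(state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


def mix_columns(state):
    coeffs = (2, 3, 1, 1)
    out = bytearray(16)
    for c in range(4):
        col = state[4 * c:4 * c + 4]
        for r in range(4):
            acc = 0
            for k in range(4):
                acc ^= gf_mul(coeffs[(k - r) % 4], col[k])
            out[4 * c + r] = acc
    return bytes(out)


def aesenc(state, round_key, last=False):
    """One encryption round: ShiftRows, SubBytes, MixColumns (unless last), XOR key."""
    s = shift_rows(state)
    s = bytes(SBOX[b] for b in s)
    if not last:
        s = mix_columns(s)
    return bytes(a ^ b for a, b in zip(s, round_key))


def vaesenc(src1, src2, last=False):
    """Apply the round independently to each 128-bit lane (16, 32 or 64 bytes)."""
    assert len(src1) == len(src2) and len(src1) in (16, 32, 64)
    return b"".join(aesenc(src1[i:i + 16], src2[i:i + 16], last)
                    for i in range(0, len(src1), 16))
```

The model can be checked against the worked cipher example in FIPS-197 Appendix B, where the state at the start of round 1 combined with round key 1 gives the state at the start of round 2.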

Intel C/C++ Compiler Intrinsic Equivalent
VAESENC __m256i _mm256_aesenc_epi128(__m256i, __m256i);
VAESENC __m512i _mm512_aesenc_epi128(__m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
VAESENCLAST — Perform Last Round of an AES Encryption Flow

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F38.WIG DD /r VAESENCLAST ymm1, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>VAES</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from ymm2 with a 128 bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F38.WIG DD /r VAESENCLAST xmm1, xmm2, xmm3/m128</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from xmm2 with a 128 bit round key from xmm3/m128; store the result in xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.WIG DD /r VAESENCLAST ymm1, ymm2, ymm3/m256</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VAES</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from ymm2 with a 128 bit round key from ymm3/m256; store the result in ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.WIG DD /r VAESENCLAST zmm1, zmm2, zmm3/m512</td>
<td>B</td>
<td>V/V</td>
<td>AVX512F VAES</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from zmm2 with a 128 bit round key from zmm3/m512; store the result in zmm1.</td>
</tr>
</tbody>
</table>

### Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

### Description

This instruction performs the last round of an AES encryption flow using a round key from the second source operand, operating on 128-bit data (state) from the first source operand, and stores the result in the destination operand.

VEX and EVEX encoded versions of the instruction allow 3-operand (non-destructive) operation. The legacy encoded versions of the instruction require that the first source operand and the destination operand are the same and must be an XMM register.

The EVEX encoded form of this instruction does not support memory fault suppression.

### Operation

**AESENCLAST**

STATE ← SRC1
RoundKey ← SRC2
STATE ← ShiftRows( STATE )
STATE ← SubBytes( STATE )
DEST[127:0] ← STATE XOR RoundKey
DEST[MAXVL-1:128] (Unmodified)

VAESENCLAST (128b and 256b VEX encoded versions)
(KL, VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
   STATE ← SRC1.xmm[i]
   RoundKey ← SRC2.xmm[i]
   STATE ← ShiftRows( STATE )
   STATE ← SubBytes( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

VAESENCLAST (EVEX encoded version)
(KL,VL) = (1,128), (2,256), (4,512)
FOR i = 0 to KL-1:
   STATE ← SRC1.xmm[i]
   RoundKey ← SRC2.xmm[i]
   STATE ← ShiftRows( STATE )
   STATE ← SubBytes( STATE )
   DEST.xmm[i] ← STATE XOR RoundKey
DEST[MAXVL-1:VL] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
VAESENCLAST __m256i _mm256_aesenclast_epi128(__m256i, __m256i);
VAESENCLAST __m512i _mm512_aesenclast_epi128(__m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.
VPCLMULQDQ — Carry-Less Multiplication Quadword

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8</td>
<td>A</td>
<td>V/V</td>
<td>VPCLMULQDQ</td>
<td>Carry-less multiplication of one quadword of ymm2 by one quadword of ymm3/m256, stores the 128-bit result in ymm1. The immediate is used to determine which quadwords of ymm2 and ymm3/m256 should be used.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.WIG 44 /r /ib VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VPCLMULQDQ</td>
<td>Carry-less multiplication of one quadword of xmm2 by one quadword of xmm3/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm2 and xmm3/m128 should be used.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.WIG 44 /r /ib VPCLMULQDQ ymm1, ymm2, ymm3/m256, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512VL VPCLMULQDQ</td>
<td>Carry-less multiplication of one quadword of ymm2 by one quadword of ymm3/m256, stores the 128-bit result in ymm1. The immediate is used to determine which quadwords of ymm2 and ymm3/m256 should be used.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.WIG 44 /r /ib VPCLMULQDQ zmm1, zmm2, zmm3/m512, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512F VPCLMULQDQ</td>
<td>Carry-less multiplication of one quadword of zmm2 by one quadword of zmm3/m512, stores the 128-bit result in zmm1. The immediate is used to determine which quadwords of zmm2 and zmm3/m512 should be used.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
<tr>
<td>B</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
</tbody>
</table>

Description

Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to use according to the table below; other bits of the immediate byte are ignored.

The EVEX encoded form of this instruction does not support memory fault suppression.

Table 2-9. PCLMULQDQ Quadword Selection of Immediate Byte

<table>
<thead>
<tr>
<th>Imm8[4]</th>
<th>Imm8[0]</th>
<th>PCLMULQDQ Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>CL_MUL( SRC2[63:0], SRC1[63:0] )</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>CL_MUL( SRC2[63:0], SRC1[127:64] )</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>CL_MUL( SRC2[127:64], SRC1[63:0] )</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>CL_MUL( SRC2[127:64], SRC1[127:64] )</td>
</tr>
</tbody>
</table>

NOTES:

SRC2 denotes the second source operand, which can be a register or memory; SRC1 denotes the first source and destination operand.

The first source operand and the destination operand are the same and must be a ZMM/YMM/XMM register. The second source operand can be a ZMM/YMM/XMM register or a 512/256/128-bit memory location. Bits (VL_MAX-1:128) of the corresponding YMM destination register remain unchanged.

Ref. # 319433-034
Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the required encoding for imm8.

Table 2-10. Pseudo-Op and PCLMULQDQ Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>Imm8 Encoding</th>
</tr>
</thead>
<tbody>
<tr>
<td>PCLMULLQLQDQ xmm1, xmm2</td>
<td>0000_0000B</td>
</tr>
<tr>
<td>PCLMULHQLQDQ xmm1, xmm2</td>
<td>0000_0001B</td>
</tr>
<tr>
<td>PCLMULLQHQDQ xmm1, xmm2</td>
<td>0001_0000B</td>
</tr>
<tr>
<td>PCLMULHQHQDQ xmm1, xmm2</td>
<td>0001_0001B</td>
</tr>
</tbody>
</table>
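
The immediate encodings in the table follow directly from the two selector bits. As a minimal sketch (the helper names `clmul_imm8` and `decode_clmul_imm8` are hypothetical), bit 0 picks the quadword of the first source and bit 4 picks the quadword of the second source:

```python
def clmul_imm8(src1_high: bool, src2_high: bool) -> int:
    """Build imm8: bit 0 selects the SRC1 quadword, bit 4 selects the SRC2 quadword."""
    return (int(src2_high) << 4) | int(src1_high)


def decode_clmul_imm8(imm8: int):
    """Return (src1_qword_index, src2_qword_index); all other bits are ignored."""
    return imm8 & 1, (imm8 >> 4) & 1
```

For instance, `clmul_imm8(True, False)` gives `0x01`, which multiplies the high quadword of the first source by the low quadword of the second source.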

**Operation**

define PCLMUL128(X,Y): // helper function
    FOR i ← 0 to 63:
        TMP[i] ← X[0] and Y[i]
        FOR j ← 1 to i:
            TMP[i] ← TMP[i] xor (X[j] and Y[i-j])
        DEST[i] ← TMP[i]
    FOR i ← 64 to 126:
        TMP[i] ← 0
        FOR j ← i - 63 to 63:
            TMP[i] ← TMP[i] xor (X[j] and Y[i-j])
        DEST[i] ← TMP[i]
    DEST[127] ← 0;
    RETURN DEST // 128b vector

**PCLMULQDQ (SSE version)**

IF Imm8[0] = 0:
    TEMP1 ← SRC1.qword[0]
ELSE:
    TEMP1 ← SRC1.qword[1]
IF Imm8[4] = 0:
    TEMP2 ← SRC2.qword[0]
ELSE:
    TEMP2 ← SRC2.qword[1]
DEST[127:0] ← PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:128] (Unmodified)

**VPCLMULQDQ (128b and 256b VEX encoded versions)**

(KL,VL) = (1,128), (2,256)
FOR i = 0 to KL-1:
    IF Imm8[0] = 0:
        TEMP1 ← SRC1.xmm[i].qword[0]
    ELSE:
        TEMP1 ← SRC1.xmm[i].qword[1]
    IF Imm8[4] = 0:
        TEMP2 ← SRC2.xmm[i].qword[0]
    ELSE:
        TEMP2 ← SRC2.xmm[i].qword[1]
    DEST.xmm[i] ← PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] ← 0

VPCLMULQDQ (EVEX encoded version)

(KL, VL) = (1,128), (2,256), (4,512)

FOR i = 0 to KL-1:
    IF Imm8[0] = 0:
        TEMP1 ← SRC1.xmm[i].qword[0]
    ELSE:
        TEMP1 ← SRC1.xmm[i].qword[1]
    IF Imm8[4] = 0:
        TEMP2 ← SRC2.xmm[i].qword[0]
    ELSE:
        TEMP2 ← SRC2.xmm[i].qword[1]
    DEST.xmm[i] ← PCLMUL128(TEMP1, TEMP2)
DEST[MAXVL-1:VL] ← 0
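
The following Python sketch models the carry-less multiply and the per-lane quadword selection. It is an illustration only: the function names are hypothetical, each 128-bit lane is represented as a Python integer, and the helper uses a shift-and-XOR loop, which produces the same 128-bit result as the bit-by-bit PCLMUL128 definition in the pseudocode.

```python
MASK64 = (1 << 64) - 1


def pclmul128(x: int, y: int) -> int:
    """Carry-less (GF(2) polynomial) product of two 64-bit values, as a 128-bit integer."""
    x &= MASK64
    y &= MASK64
    result = 0
    for i in range(64):
        if (y >> i) & 1:
            result ^= x << i      # XOR replaces addition, so no carries propagate
    return result                 # degree is at most 126, so bit 127 is always 0


def vpclmulqdq(src1, src2, imm8: int):
    """Per-lane model: src1 and src2 are equal-length lists of 128-bit lane integers."""
    dest = []
    for lane1, lane2 in zip(src1, src2):
        temp1 = (lane1 >> 64) if imm8 & 0x01 else lane1   # Imm8[0] selects the SRC1 qword
        temp2 = (lane2 >> 64) if imm8 & 0x10 else lane2   # Imm8[4] selects the SRC2 qword
        dest.append(pclmul128(temp1 & MASK64, temp2 & MASK64))
    return dest
```

With one, two or four lanes in each list, this corresponds to the 128-bit, 256-bit and 512-bit forms respectively; the same immediate applies to every lane.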

Intel C/C++ Compiler Intrinsic Equivalent

VPCLMULQDQ __m256i _mm256_clmulepi64_epi128(__m256i, __m256i, const int);
VPCLMULQDQ __m512i _mm512_clmulepi64_epi128(__m512i, __m512i, const int);

SIMD Floating-Point Exceptions

None.

Other Exceptions

VEX-encoded: Exceptions Type 4.
EVEX-encoded: See Exceptions Type E4NF.

VPCOMPRESS — Store Sparse Packed Byte/Word Integer Values into Dense Memory/Register

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.128.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB m128{k1}, xmm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 128 bits of packed byte values from xmm1 to m128 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB xmm1{k1}{z}, xmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 128 bits of packed byte values from xmm2 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB m256{k1}, ymm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 256 bits of packed byte values from ymm1 to m256 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB ymm1{k1}{z}, ymm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 256 bits of packed byte values from ymm2 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB m512{k1}, zmm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 512 bits of packed byte values from zmm1 to m512 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 63 /r</td>
<td>VPCOMPRESSB zmm1{k1}{z}, zmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 512 bits of packed byte values from zmm2 to zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW m128{k1}, xmm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 128 bits of packed word values from xmm1 to m128 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW xmm1{k1}{z}, xmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 128 bits of packed word values from xmm2 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW m256{k1}, ymm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 256 bits of packed word values from ymm1 to m256 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW ymm1{k1}{z}, ymm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 256 bits of packed word values from ymm2 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW m512{k1}, zmm1</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 512 bits of packed word values from zmm1 to m512 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 63 /r</td>
<td>VPCOMPRESSW zmm1{k1}{z}, zmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Compress up to 512 bits of packed word values from zmm2 to zmm1 with writemask k1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Tuple1 Scalar</td>
<td>ModRM:r/m (w)</td>
<td>ModRM:reg (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>NA</td>
<td>ModRM:r/m (w)</td>
<td>ModRM:reg (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Compresses (stores) up to 64 byte values or 32 word values from the source operand (second operand) to the destination operand (first operand), based on the active elements determined by the writemask operand. Note: EVEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

Moves up to 512 bits of packed byte values from the source operand (second operand) to the destination operand (first operand). This instruction is used to store partial contents of a vector register into a byte vector or single memory location using the active elements in operand writemask.

Memory destination version: Only the contiguous vector is written to the destination memory location. EVEX.z must be zero.

Register destination version: If the vector length of the contiguous vector is less than that of the input vector in the source operand, the upper bits of the destination register are unmodified if EVEX.z is not set, otherwise the upper bits are zeroed.

This instruction supports memory fault suppression.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.
Operation

VPCOMPRESSB store form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.byte[k] ← SRC.byte[j]
      k ← k + 1

VPCOMPRESSB reg-reg form
(KL, VL) = (16, 128), (32, 256), (64, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.byte[k] ← SRC.byte[j]
      k ← k + 1
IF *merging-masking*:
   *DEST[VL-1:k*8] remains unchanged*
   ELSE DEST[VL-1:k*8] ← 0
DEST[MAX_VL-1:VL] ← 0

VPCOMPRESSW store form
(KL, VL) = (8, 128), (16, 256), (32, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.word[k] ← SRC.word[j]
      k ← k + 1

VPCOMPRESSW reg-reg form
(KL, VL) = (8, 128), (16, 256), (32, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.word[k] ← SRC.word[j]
      k ← k + 1
IF *merging-masking*:
   *DEST[VL-1:k*16] remains unchanged*
   ELSE DEST[VL-1:k*16] ← 0
DEST[MAX_VL-1:VL] ← 0
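
The register form of the byte compress operation above can be illustrated with a scalar C model. This is only a sketch for reasoning about the pseudocode, not an Intel-provided routine: the function name `compress_bytes_model` and its parameters are hypothetical, the vector is represented as a plain byte array, and `kl` is assumed to be at most 64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the VPCOMPRESSB reg-reg form.
 * dest    : destination vector of kl bytes, updated in place
 * src     : source vector of kl bytes
 * k1      : writemask; bit j selects src[j]
 * kl      : element count (16, 32, or 64); assumed <= 64
 * zeroing : nonzero models EVEX.z = 1, zero models merging-masking
 * Returns the number of bytes packed into the low end of dest. */
size_t compress_bytes_model(uint8_t *dest, const uint8_t *src,
                            uint64_t k1, size_t kl, int zeroing)
{
    uint8_t packed[64];   /* staging buffer, so dest may alias src */
    size_t k = 0;

    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            packed[k++] = src[j];        /* DEST.byte[k] <- SRC.byte[j] */
        }
    }
    memcpy(dest, packed, k);             /* contiguous result at the low end */
    if (zeroing) {
        memset(dest + k, 0, kl - k);     /* DEST[VL-1:k*8] <- 0 */
    }
    /* merging-masking: bytes k..kl-1 of dest are left unchanged */
    return k;
}
```

In this model the store form corresponds to writing only the first `k` bytes, that is, the `memcpy` without the `memset`.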
**Intel C/C++ Compiler Intrinsic Equivalent**

VPCOMPRESSB __m128i _mm_mask_compress_epi8(__m128i, __mmask16, __m128i);
VPCOMPRESSB __m128i _mm_maskz_compress_epi8(__mmask16, __m128i);
VPCOMPRESSB __m256i _mm256_mask_compress_epi8(__m256i, __mmask32, __m256i);
VPCOMPRESSB __m256i _mm256_maskz_compress_epi8(__mmask32, __m256i);
VPCOMPRESSB __m512i _mm512_mask_compress_epi8(__m512i, __mmask64, __m512i);
VPCOMPRESSB __m512i _mm512_maskz_compress_epi8(__mmask64, __m512i);
VPCOMPRESSB void _mm_mask_compressstoreu_epi8(void*, __mmask16, __m128i);
VPCOMPRESSB void _mm256_mask_compressstoreu_epi8(void*, __mmask32, __m256i);
VPCOMPRESSB void _mm512_mask_compressstoreu_epi8(void*, __mmask64, __m512i);
VPCOMPRESSW __m128i _mm_mask_compress_epi16(__m128i, __mmask8, __m128i);
VPCOMPRESSW __m128i _mm_maskz_compress_epi16(__mmask8, __m128i);
VPCOMPRESSW __m256i _mm256_mask_compress_epi16(__m256i, __mmask16, __m256i);
VPCOMPRESSW __m256i _mm256_maskz_compress_epi16(__mmask16, __m256i);
VPCOMPRESSW __m512i _mm512_mask_compress_epi16(__m512i, __mmask32, __m512i);
VPCOMPRESSW __m512i _mm512_maskz_compress_epi16(__mmask32, __m512i);
VPCOMPRESSW void _mm_mask_compressstoreu_epi16(void*, __mmask8, __m128i);
VPCOMPRESSW void _mm256_mask_compressstoreu_epi16(void*, __mmask16, __m256i);
VPCOMPRESSW void _mm512_mask_compressstoreu_epi16(void*, __mmask32, __m512i);

**SIMD Floating-Point Exceptions**

None.

**Other Exceptions**

See Exceptions Type E4.
VPDPBUSD — Multiply and Add Unsigned and Signed Bytes

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.DDS.128.66.0F38.W0 50 /r VPDPBUSD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 4 pairs of signed bytes in xmm3/m128/m32bcst with corresponding unsigned bytes of xmm2, summing those products and adding them to doubleword result in xmm1 under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W0 50 /r VPDPBUSD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 4 pairs of signed bytes in ymm3/m256/m32bcst with corresponding unsigned bytes of ymm2, summing those products and adding them to doubleword result in ymm1 under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W0 50 /r VPDPBUSD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI</td>
<td>Multiply groups of 4 pairs of signed bytes in zmm3/m512/m32bcst with corresponding unsigned bytes of zmm2, summing those products and adding them to doubleword result in zmm1 under writemask k1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second source operand, producing intermediate signed word results. The word results are then summed and accumulated in the destination dword element size operand.

This instruction supports memory fault suppression.

Operation

VPDPBUSD dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
  IF k1[i] or *no writemask*:
    // Byte elements of SRC1 are zero-extended to 16b and
    // byte elements of SRC2 are sign extended to 16b before multiplication.
    IF SRC2 is memory and EVEX.b == 1:
      t ← SRC2.dword[0]
    ELSE:
      t ← SRC2.dword[i]
    p1word ← ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
    p2word ← ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
    p3word ← ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
    p4word ← ZERO_EXTEND(SRC1.byte[4*i+3]) * SIGN_EXTEND(t.byte[3])
    DEST.dword[i] ← ORIGDEST.dword[i] + p1word + p2word + p3word + p4word
  ELSE IF *zeroing*:
    DEST.dword[i] ← 0
  ELSE:
    // Merge masking, dest element unchanged
    DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
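
As an aid to reading the operation above, one doubleword lane can be modeled in scalar C. This is a minimal sketch under stated assumptions: the function name `vpdpbusd_lane_model` is hypothetical, each source dword is passed as a packed 32-bit value, and the narrowing cast to `int8_t` is assumed to behave as two's complement sign extension.

```c
#include <stdint.h>

/* Hypothetical scalar model of one doubleword lane of VPDPBUSD.
 * acc  : original destination dword (ORIGDEST.dword[i])
 * src1 : four unsigned bytes packed in a dword (SRC1 lane)
 * src2 : four signed bytes packed in a dword (t in the pseudocode)
 * The accumulation wraps modulo 2^32 (no saturation). */
uint32_t vpdpbusd_lane_model(uint32_t acc, uint32_t src1, uint32_t src2)
{
    uint32_t sum = acc;

    for (int b = 0; b < 4; b++) {
        int32_t u = (uint8_t)(src1 >> (8 * b));          /* zero-extended byte */
        int32_t s = (int8_t)(uint8_t)(src2 >> (8 * b));  /* sign-extended byte */
        sum += (uint32_t)(u * s);     /* each product fits in a signed word */
    }
    return sum;
}
```

Each product lies between 255 * -128 and 255 * 127, so it always fits in a signed 16-bit word; only the final accumulation can wrap.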
Intel C/C++ Compiler Intrinsic Equivalent

VPDPBUSD __m128i _mm_dpbusd_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_mask_dpbusd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSD __m128i _mm_maskz_dpbusd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSD __m256i _mm256_dpbusd_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_mask_dpbusd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSD __m256i _mm256_maskz_dpbusd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSD __m512i _mm512_dpbusd_epi32(__m512i, __m512i, __m512i);
VPDPBUSD __m512i _mm512_mask_dpbusd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSD __m512i _mm512_maskz_dpbusd_epi32(__mmask16, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type E4.
VPDPBUSDS — Multiply and Add Unsigned and Signed Bytes with Saturation

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.DDS.128.66.0F38.W0 51 /r VPDPBUSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 4 pairs of signed bytes in xmm3/m128/m32bcst with corresponding unsigned bytes of xmm2, summing those products and adding them to doubleword result, with signed saturation in xmm1, under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W0 51 /r VPDPBUSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 4 pairs of signed bytes in ymm3/m256/m32bcst with corresponding unsigned bytes of ymm2, summing those products and adding them to doubleword result, with signed saturation in ymm1, under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W0 51 /r VPDPBUSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI</td>
<td>Multiply groups of 4 pairs of signed bytes in zmm3/m512/m32bcst with corresponding unsigned bytes of zmm2, summing those products and adding them to doubleword result, with signed saturation in zmm1, under writemask k1.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second source operand, producing intermediate signed word results. The word results are then summed and accumulated in the destination dword element size operand. If the intermediate sum overflows a 32b signed number, the result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.

This instruction supports memory fault suppression.

**Operation**

VPDPBUSDS dest, src1, src2
(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
  IF k1[i] or *no writemask*:
    // Byte elements of SRC1 are zero-extended to 16b and
    // byte elements of SRC2 are sign extended to 16b before multiplication.
    IF SRC2 is memory and EVEX.b == 1:
      t ← SRC2.dword[0]
    ELSE:
      t ← SRC2.dword[i]
    p1word ← ZERO_EXTEND(SRC1.byte[4*i]) * SIGN_EXTEND(t.byte[0])
    p2word ← ZERO_EXTEND(SRC1.byte[4*i+1]) * SIGN_EXTEND(t.byte[1])
    p3word ← ZERO_EXTEND(SRC1.byte[4*i+2]) * SIGN_EXTEND(t.byte[2])
    p4word ← ZERO_EXTEND(SRC1.byte[4*i+3]) * SIGN_EXTEND(t.byte[3])
    DEST.dword[i] ← SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1word + p2word + p3word + p4word)
  ELSE IF *zeroing*:
    DEST.dword[i] ← 0
  ELSE:  // Merge masking, dest element unchanged
    DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
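
The saturating accumulation above can be sketched in scalar C for a single doubleword lane. The names `signed_dword_saturate` and `vpdpbusds_lane_model` are hypothetical, chosen to mirror the pseudocode; the sketch assumes the sum is formed exactly (here in 64-bit arithmetic) and saturated once at the end, as the SIGNED_DWORD_SATURATE call in the pseudocode suggests.

```c
#include <stdint.h>

/* Clamp an exact 64-bit sum to the signed 32-bit range. */
int32_t signed_dword_saturate(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;   /* 0x7FFF_FFFF */
    if (v < INT32_MIN) return INT32_MIN;   /* 0x8000_0000 */
    return (int32_t)v;
}

/* Hypothetical scalar model of one doubleword lane of VPDPBUSDS.
 * acc  : original destination dword, interpreted as signed
 * src1 : four unsigned bytes packed in a dword
 * src2 : four signed bytes packed in a dword */
int32_t vpdpbusds_lane_model(int32_t acc, uint32_t src1, uint32_t src2)
{
    int64_t sum = acc;

    for (int b = 0; b < 4; b++) {
        int32_t u = (uint8_t)(src1 >> (8 * b));          /* zero-extend */
        int32_t s = (int8_t)(uint8_t)(src2 >> (8 * b));  /* sign-extend */
        sum += (int64_t)u * s;
    }
    return signed_dword_saturate(sum);
}
```

With an accumulator already at `INT32_MAX`, any positive product sum leaves the lane at `INT32_MAX`, whereas the non-saturating form would wrap to a negative value.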

Intel C/C++ Compiler Intrinsic Equivalent

VPDPBUSDS __m128i _mm_dpbusds_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_mask_dpbusds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSDS __m128i _mm_maskz_dpbusds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSDS __m256i _mm256_dpbusds_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_mask_dpbusds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_maskz_dpbusds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSDS __m512i _mm512_dpbusds_epi32(__m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_mask_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_maskz_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type E4.
VPDPWSSD — Multiply and Add Signed Word Integers

**Description**

Multiplies the individual signed words of the first source operand by the corresponding signed words of the second source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then summed and accumulated in the destination operand.

This instruction supports memory fault suppression.

**Operation**

**VPDPWSSD dest, src1, src2**

(KL,VL)=(4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
  IF k1[i] or *no writemask*:
    IF SRC2 is memory and EVEX.b == 1:
      t ← SRC2.dword[0]
    ELSE:
      t ← SRC2.dword[i]
    p1dword ← SRC1.word[2*i] * t.word[0]
    p2dword ← SRC1.word[2*i+1] * t.word[1]
    DEST.dword[i] ← ORIGDEST.dword[i] + p1dword + p2dword
  ELSE IF *zeroing*:
    DEST.dword[i] ← 0
  ELSE:
    // Merge masking, dest element unchanged
    DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
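
To show how the writemask, zeroing, and embedded broadcast interact in the operation above, the following scalar C sketch models a whole vector of doubleword lanes. All names (`dpwssd_lane`, `vpdpwssd_model`, `src2_bcst`, `zeroing`) are hypothetical; the flag `src2_bcst` stands in for the condition "SRC2 is memory and EVEX.b == 1", and the vectors are plain arrays of `kl` dwords with `kl` at most 16.

```c
#include <stddef.h>
#include <stdint.h>

/* One lane: two signed word products added to the accumulator.
 * Each product fits in 32 bits; the additions wrap modulo 2^32. */
uint32_t dpwssd_lane(uint32_t acc, uint32_t a, uint32_t t)
{
    int32_t a0 = (int16_t)(uint16_t)a;           /* SRC1.word[2*i]   */
    int32_t a1 = (int16_t)(uint16_t)(a >> 16);   /* SRC1.word[2*i+1] */
    int32_t t0 = (int16_t)(uint16_t)t;           /* t.word[0] */
    int32_t t1 = (int16_t)(uint16_t)(t >> 16);   /* t.word[1] */
    return acc + (uint32_t)(a0 * t0) + (uint32_t)(a1 * t1);
}

/* Hypothetical vector-level model of VPDPWSSD.
 * dest      : kl accumulator dwords, updated in place
 * k1        : writemask, one bit per dword lane
 * src2_bcst : nonzero models a broadcast memory source (EVEX.b == 1)
 * zeroing   : nonzero models zeroing-masking, zero models merging */
void vpdpwssd_model(uint32_t *dest, const uint32_t *src1,
                    const uint32_t *src2, uint16_t k1, size_t kl,
                    int src2_bcst, int zeroing)
{
    for (size_t i = 0; i < kl; i++) {
        if ((k1 >> i) & 1u) {
            uint32_t t = src2_bcst ? src2[0] : src2[i];
            dest[i] = dpwssd_lane(dest[i], src1[i], t);
        } else if (zeroing) {
            dest[i] = 0;
        }
        /* merging-masking: dest[i] is left unchanged */
    }
}
```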
Intel C/C++ Compiler Intrinsic Equivalent

VPDPWSSD __m128i _mm_dpwssd_epi32(__m128i, __m128i, __m128i);
VPDPWSSD __m128i _mm_mask_dpwssd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPWSSD __m128i _mm_maskz_dpwssd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPWSSD __m256i _mm256_dpwssd_epi32(__m256i, __m256i, __m256i);
VPDPWSSD __m256i _mm256_mask_dpwssd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPWSSD __m256i _mm256_maskz_dpwssd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPWSSD __m512i _mm512_dpwssd_epi32(__m512i, __m512i, __m512i);
VPDPWSSD __m512i _mm512_mask_dpwssd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPWSSD __m512i _mm512_maskz_dpwssd_epi32(__mmask16, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
See Exceptions Type E4.
VPDPWSSDS — Multiply and Add Word Integers with Saturation

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.DDS.128.66.0F38.W0 53 /r VPDPWSSDS xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 2 pairs of signed words in xmm3/m128/m32bcst with corresponding signed words of xmm2, summing those products and adding them to doubleword result in xmm1, with signed saturation, under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W0 53 /r VPDPWSSDS ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI AVX512VL</td>
<td>Multiply groups of 2 pairs of signed words in ymm3/m256/m32bcst with corresponding signed words of ymm2, summing those products and adding them to doubleword result in ymm1, with signed saturation, under writemask k1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W0 53 /r VPDPWSSDS zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VNNI</td>
<td>Multiply groups of 2 pairs of signed words in zmm3/m512/m32bcst with corresponding signed words of zmm2, summing those products and adding them to doubleword result in zmm1, with signed saturation, under writemask k1.</td>
</tr>
</tbody>
</table>

Description

Multiplies the individual signed words of the first source operand by the corresponding signed words of the second source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then summed and accumulated in the destination operand. If the intermediate sum overflows a 32b signed number, the result is saturated to either 0x7FFF_FFFF for positive numbers or 0x8000_0000 for negative numbers.

This instruction supports memory fault suppression.

Operation

VPDPWSSDS dest, src1, src2
(KL,VL)= (4,128), (8,256), (16,512)
ORIGDEST ← DEST
FOR i ← 0 TO KL-1:
  IF k1[i] or *no writemask*:
    IF SRC2 is memory and EVEX.b == 1:
      t ← SRC2.dword[0]
    ELSE:
      t ← SRC2.dword[i]
    p1dword ← SRC1.word[2*i] * t.word[0]
    p2dword ← SRC1.word[2*i+1] * t.word[1]
    DEST.dword[i] ← SIGNED_DWORD_SATURATE(ORIGDEST.dword[i] + p1dword + p2dword)
  ELSE IF *zeroing*:
    DEST.dword[i] ← 0
  ELSE: // Merge masking, dest element unchanged
    DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
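
A single lane of the saturating word form above can be modeled in scalar C as follows. The function name `vpdpwssds_lane_model` is hypothetical, and the sketch assumes the two products and the accumulator are summed exactly (in 64-bit arithmetic) before one final clamp to the signed 32-bit range.

```c
#include <stdint.h>

/* Hypothetical scalar model of one doubleword lane of VPDPWSSDS.
 * acc  : original destination dword, interpreted as signed
 * src1 : two signed words packed in a dword (SRC1 lane)
 * src2 : two signed words packed in a dword (t in the pseudocode) */
int32_t vpdpwssds_lane_model(int32_t acc, uint32_t src1, uint32_t src2)
{
    int64_t a0 = (int16_t)(uint16_t)src1;
    int64_t a1 = (int16_t)(uint16_t)(src1 >> 16);
    int64_t b0 = (int16_t)(uint16_t)src2;
    int64_t b1 = (int16_t)(uint16_t)(src2 >> 16);
    int64_t sum = (int64_t)acc + a0 * b0 + a1 * b1;   /* exact sum */

    if (sum > INT32_MAX) return INT32_MAX;   /* 0x7FFF_FFFF */
    if (sum < INT32_MIN) return INT32_MIN;   /* 0x8000_0000 */
    return (int32_t)sum;
}
```

The two products alone can reach 2^31 (when all four words are -32768), which is already outside the signed 32-bit range, so the clamp can trigger even with a zero accumulator.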
Intel C/C++ Compiler Intrinsic Equivalent

VPDPWSSDS __m128i _mm_dpwssds_epi32(__m128i, __m128i, __m128i);
VPDPWSSDS __m128i _mm_mask_dpwssds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPWSSDS __m128i _mm_maskz_dpwssds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPWSSDS __m256i _mm256_dpwssds_epi32(__m256i, __m256i, __m256i);
VPDPWSSDS __m256i _mm256_mask_dpwssds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPWSSDS __m256i _mm256_maskz_dpwssds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPWSSDS __m512i _mm512_dpwssds_epi32(__m512i, __m512i, __m512i);
VPDPWSSDS __m512i _mm512_mask_dpwssds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPWSSDS __m512i _mm512_maskz_dpwssds_epi32(__mmask16, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type E4.
VPEXPAND — Expand Byte/Word Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.128.66.0F38.W0 62 /r VPEXPANDB xmm1{k1}{z}, m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 128 bits of packed byte values from m128 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W0 62 /r VPEXPANDB xmm1{k1}{z}, xmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 128 bits of packed byte values from xmm2 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 62 /r VPEXPANDB ymm1{k1}{z}, m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 256 bits of packed byte values from m256 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 62 /r VPEXPANDB ymm1{k1}{z}, ymm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 256 bits of packed byte values from ymm2 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 62 /r VPEXPANDB zmm1{k1}{z}, m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Expands up to 512 bits of packed byte values from m512 to zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 62 /r VPEXPANDB zmm1{k1}{z}, zmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Expands up to 512 bits of packed byte values from zmm2 to zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 62 /r VPEXPANDW xmm1{k1}{z}, m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 128 bits of packed word values from m128 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 62 /r VPEXPANDW xmm1{k1}{z}, xmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 128 bits of packed word values from xmm2 to xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 62 /r VPEXPANDW ymm1{k1}{z}, m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 256 bits of packed word values from m256 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 62 /r VPEXPANDW ymm1{k1}{z}, ymm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Expands up to 256 bits of packed word values from ymm2 to ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 62 /r VPEXPANDW zmm1{k1}{z}, m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Expands up to 512 bits of packed word values from m512 to zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 62 /r VPEXPANDW zmm1{k1}{z}, zmm2</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Expands up to 512 bits of packed word values from zmm2 to zmm1 with writemask k1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Tuple1 Scalar</td>
<td>ModRM:reg (w)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>NA</td>
<td>ModRM:reg (w)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Expands (loads) up to 64 byte integer values or 32 word integer values from the source operand (memory operand) to the destination operand (register operand), based on the active elements determined by the writemask operand.

Note: EVEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Moves 128, 256 or 512 bits of packed byte integer values from the source operand (memory operand) to the destination operand (register operand). This instruction is used to load from an int8 vector register or memory location while inserting the data into sparse elements of destination vector register using the active elements pointed out by the operand writemask.

This instruction supports memory fault suppression.

Note that the compressed displacement assumes a pre-scaling (N) corresponding to the size of one single element instead of the size of the full vector.

**Operation**

**VPEXPANDB**

(KL, VL) = (16, 128), (32, 256), (64, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.byte[j] ← SRC.byte[k]
      k ← k + 1
   ELSE:
      IF *merging-masking*:
         *DEST.byte[j] remains unchanged*
      ELSE:
         DEST.byte[j] ← 0
DEST[MAX_VL-1:VL] ← 0

**VPEXPANDW**

(KL, VL) = (8, 128), (16, 256), (32, 512)
k ← 0
FOR j ← 0 TO KL-1:
   IF k1[j] OR *no writemask*:
      DEST.word[j] ← SRC.word[k]
      k ← k + 1
   ELSE:
      IF *merging-masking*:
         *DEST.word[j] remains unchanged*
      ELSE:
         DEST.word[j] ← 0
DEST[MAX_VL-1:VL] ← 0
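
The byte expand operation above is the inverse data movement of a compress: consecutive source elements are scattered to the destination positions whose mask bits are set. A scalar C sketch follows. The function name `expand_bytes_model` and its parameters are hypothetical, the vectors are plain byte arrays, and `dest` and `src` are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of VPEXPANDB.
 * dest    : destination vector of kl bytes, updated in place
 * src     : contiguous source bytes (register or memory image);
 *           must not overlap dest in this model
 * k1      : writemask; bit j enables dest[j]
 * kl      : element count (16, 32, or 64)
 * zeroing : nonzero models EVEX.z = 1, zero models merging-masking
 * Returns the number of source bytes consumed. */
size_t expand_bytes_model(uint8_t *dest, const uint8_t *src,
                          uint64_t k1, size_t kl, int zeroing)
{
    size_t k = 0;

    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = src[k++];      /* DEST.byte[j] <- SRC.byte[k] */
        } else if (zeroing) {
            dest[j] = 0;             /* zeroing-masking */
        }
        /* merging-masking: dest[j] is left unchanged */
    }
    return k;
}
```

Only the first `k` source bytes are read, where `k` is the number of set mask bits; for a memory source this is the span the load actually touches in the model.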
Intel C/C++ Compiler Intrinsic Equivalent

VPEXPANDB __m128i _mm_mask_expand_epi8(__m128i, __mmask16, __m128i);
VPEXPANDB __m128i _mm_maskz_expand_epi8(__mmask16, __m128i);
VPEXPANDB __m128i _mm_mask_expandloadu_epi8(__m128i, __mmask16, const void*);
VPEXPANDB __m256i _mm256_mask_expand_epi8(__m256i, __mmask32, __m256i);
VPEXPANDB __m512i _mm512_maskz_expand_epi8(__mmask64, __m512i);
VPEXPANDB __m512i _mm512_maskz_expandloadu_epi8(__mmask64, const void*);
VPEXPANDW __m128i _mm_mask_expand_epi16(__m128i, __mmask8, __m128i);
VPEXPANDW __m128i _mm_maskz_expand_epi16(__mmask8, __m128i);
VPEXPANDW __m128i _mm_maskz_expandloadu_epi16(__mmask8, const void*);
VPEXPANDW __m256i _mm256_mask_expand_epi16(__m256i, __mmask16, __m256i);
VPEXPANDW __m512i _mm512_mask_expand_epi16(__m512i, __mmask32, __m512i);
VPEXPANDW __m512i _mm512_mask_expandloadu_epi16(__m512i, __mmask32, const void*);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type E4.
VPOPCNT — Return the Count of Number of Bits Set to 1 in BYTE/WORD/DWORD/QWORD

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.128.66.0F38.W0 54 /r VPOPCNTB xmm1{k1}{z}, xmm2/m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Counts the number of bits set to one in xmm2/m128 and puts the result in xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 54 /r VPOPCNTB ymm1{k1}{z}, ymm2/m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Counts the number of bits set to one in ymm2/m256 and puts the result in ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 54 /r VPOPCNTB zmm1{k1}{z}, zmm2/m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG</td>
<td>Counts the number of bits set to one in zmm2/m512 and puts the result in zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 54 /r VPOPCNTW xmm1{k1}{z}, xmm2/m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Counts the number of bits set to one in xmm2/m128 and puts the result in xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 54 /r VPOPCNTW ymm1{k1}{z}, ymm2/m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Counts the number of bits set to one in ymm2/m256 and puts the result in ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 54 /r VPOPCNTW zmm1{k1}{z}, zmm2/m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG</td>
<td>Counts the number of bits set to one in zmm2/m512 and puts the result in zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W0 55 /r VPOPCNTD xmm1{k1}{z}, xmm2/m128/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ AVX512VL</td>
<td>Counts the number of bits set to one in xmm2/m128/m32bcst and puts the result in xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W0 55 /r VPOPCNTD ymm1{k1}{z}, ymm2/m256/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ AVX512VL</td>
<td>Counts the number of bits set to one in ymm2/m256/m32bcst and puts the result in ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W0 55 /r VPOPCNTD zmm1{k1}{z}, zmm2/m512/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ</td>
<td>Counts the number of bits set to one in zmm2/m512/m32bcst and puts the result in zmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.128.66.0F38.W1 55 /r VPOPCNTQ xmm1{k1}{z}, xmm2/m128/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ AVX512VL</td>
<td>Counts the number of bits set to one in xmm2/m128/m64bcst and puts the result in xmm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.256.66.0F38.W1 55 /r VPOPCNTQ ymm1{k1}{z}, ymm2/m256/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ AVX512VL</td>
<td>Counts the number of bits set to one in ymm2/m256/m64bcst and puts the result in ymm1 with writemask k1.</td>
</tr>
<tr>
<td>EVEX.512.66.0F38.W1 55 /r VPOPCNTQ zmm1{k1}{z}, zmm2/m512/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VPOPCNTDQ</td>
<td>Counts the number of bits set to one in zmm2/m512/m64bcst and puts the result in zmm1 with writemask k1.</td>
</tr>
</tbody>
</table>

**Description**

This instruction counts the number of bits set to one in each byte, word, dword or qword element of its source (e.g., zmm2 or memory) and places the results in the destination register (zmm1). This instruction supports memory fault suppression.
**Operation**

**VPOPCNTB**

(KL, VL) = (16, 128), (32, 256), (64, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        DEST.byte[j] ← POPCNT(SRC.byte[j])
    ELSE IF *merging-masking*:
        *DEST.byte[j] remains unchanged*
    ELSE:
        DEST.byte[j] ← 0
DEST[MAX_VL-1:VL] ← 0

**VPOPCNTW**

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        DEST.word[j] ← POPCNT(SRC.word[j])
    ELSE IF *merging-masking*:
        *DEST.word[j] remains unchanged*
    ELSE:
        DEST.word[j] ← 0
DEST[MAX_VL-1:VL] ← 0

**VPOPCNTD**

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        IF SRC is broadcast memop:
            t ← SRC.dword[0]
        ELSE:
            t ← SRC.dword[j]
        DEST.dword[j] ← POPCNT(t)
    ELSE IF *merging-masking*:
        *DEST.dword[j] remains unchanged*
    ELSE:
        DEST.dword[j] ← 0
DEST[MAX_VL-1:VL] ← 0

**VPOPCNTQ**

(KL, VL) = (2, 128), (4, 256), (8, 512)
FOR j ← 0 TO KL-1:
    IF MaskBit(j) OR *no writemask*:
        IF SRC is broadcast memop:
            t ← SRC.qword[0]
        ELSE:
            t ← SRC.qword[j]
        DEST.qword[j] ← POPCNT(t)
    ELSE IF *merging-masking*:
        *DEST.qword[j] remains unchanged*
    ELSE:
        DEST.qword[j] ← 0
DEST[MAX_VL-1:VL] ← 0
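
The per-element counting and masking above can be illustrated with a scalar C model of the word form. This is a sketch under assumptions, not a reference implementation: `popcnt_model` and `vpopcntw_model` are hypothetical names, the vectors are plain arrays of `kl` words with `kl` at most 32, and POPCNT is computed with a simple clear-lowest-set-bit loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bits set to one in x by clearing the lowest set bit
 * until no bits remain. */
unsigned popcnt_model(uint64_t x)
{
    unsigned n = 0;

    while (x != 0) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Hypothetical scalar model of VPOPCNTW.
 * dest    : kl destination words, updated in place
 * src     : kl source words
 * k1      : writemask, one bit per word element
 * zeroing : nonzero models zeroing-masking, zero models merging */
void vpopcntw_model(uint16_t *dest, const uint16_t *src,
                    uint32_t k1, size_t kl, int zeroing)
{
    for (size_t j = 0; j < kl; j++) {
        if ((k1 >> j) & 1u) {
            dest[j] = (uint16_t)popcnt_model(src[j]);
        } else if (zeroing) {
            dest[j] = 0;
        }
        /* merging-masking: dest[j] is left unchanged */
    }
}
```

The byte, dword, and qword forms differ only in element width and mask length; the dword and qword forms additionally select element 0 for every lane when the source is a broadcast memory operand.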
Intel C/C++ Compiler Intrinsic Equivalent

VPOPCNTW __m128i _mm_popcnt_epi16(__m128i);
VPOPCNTW __m128i _mm_mask_popcnt_epi16(__m128i,__mmask8,__m128i);
VPOPCNTW __m128i _mm_maskz_popcnt_epi16(__mmask8,__m128i);
VPOPCNTW __m256i _mm256_popcnt_epi16(__m256i);
VPOPCNTW __m256i _mm256_mask_popcnt_epi16(__m256i,__mmask16,__m256i);
VPOPCNTW __m256i _mm256_maskz_popcnt_epi16(__mmask16,__m256i);
VPOPCNTW __m512i _mm512_popcnt_epi16(__m512i);
VPOPCNTW __m512i _mm512_mask_popcnt_epi16(__m512i,__mmask32,__m512i);
VPOPCNTW __m512i _mm512_maskz_popcnt_epi16(__mmask32,__m512i);

VPOPCNTQ __m128i _mm_popcnt_epi64(__m128i);
VPOPCNTQ __m128i _mm_mask_popcnt_epi64(__m128i,__mmask8,__m128i);
VPOPCNTQ __m128i _mm_maskz_popcnt_epi64(__mmask8,__m128i);
VPOPCNTQ __m256i _mm256_popcnt_epi64(__m256i);
VPOPCNTQ __m256i _mm256_mask_popcnt_epi64(__m256i,__mmask8,__m256i);
VPOPCNTQ __m256i _mm256_maskz_popcnt_epi64(__mmask8,__m256i);
VPOPCNTQ __m512i _mm512_popcnt_epi64(__m512i);
VPOPCNTQ __m512i _mm512_mask_popcnt_epi64(__m512i,__mmask8,__m512i);
VPOPCNTQ __m512i _mm512_maskz_popcnt_epi64(__mmask8,__m512i);

VPOPCNTB __m128i _mm_popcnt_epi8(__m128i);
VPOPCNTB __m128i _mm_mask_popcnt_epi8(__m128i,__mmask16,__m128i);
VPOPCNTB __m128i _mm_maskz_popcnt_epi8(__mmask16,__m128i);
VPOPCNTB __m256i _mm256_popcnt_epi8(__m256i);
VPOPCNTB __m256i _mm256_mask_popcnt_epi8(__m256i,__mmask32,__m256i);
VPOPCNTB __m256i _mm256_maskz_popcnt_epi8(__mmask32,__m256i);
VPOPCNTB __m512i _mm512_popcnt_epi8(__m512i);
VPOPCNTB __m512i _mm512_mask_popcnt_epi8(__m512i,__mmask64,__m512i);
VPOPCNTB __m512i _mm512_maskz_popcnt_epi8(__mmask64,__m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
See Exceptions Type E4.
VPSHLD — Concatenate and Shift Packed Data Left Logical

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.NDS.128.66.0F3A.W1 70 /r /ib VPSHLDW xmm1{k1}{z}, xmm2, xmm3/m128, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W1 70 /r /ib VPSHLDW ymm1{k1}{z}, ymm2, ymm3/m256, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W1 70 /r /ib VPSHLDW zmm1{k1}{z}, zmm2, zmm3/m512, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.W0 71 /r /ib VPSHLDD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W0 71 /r /ib VPSHLDD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W0 71 /r /ib VPSHLDD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.W1 71 /r /ib VPSHLDQ xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W1 71 /r /ib VPSHLDQ ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W1 71 /r /ib VPSHLDQ zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate destination and source operands, extract result shifted to the left by constant value in imm8 into zmm1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
<tr>
<td>B</td>
<td>Full</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
</tbody>
</table>

Description

Concatenate packed data, extract result shifted to the left by constant value.
This instruction supports memory fault suppression.
Operation

**VPSHLDW DEST, SRC2, SRC3, imm8**

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1:

  IF MaskBit(j) OR *no writemask*:
    tmp ← concat(SRC2.word[j], SRC3.word[j]) << (imm8 & 15)
    DEST.word[j] ← tmp.word[1]

  ELSE IF *zeroing*:
    DEST.word[j] ← 0

  ELSE DEST.word[j] remains unchanged

DEST[MAX_VL-1:VL] ← 0

**VPSHLDD DEST, SRC2, SRC3, imm8**

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1:

  IF SRC3 is broadcast memop:
    tsrc3 ← SRC3.dword[0]

  ELSE:
    tsrc3 ← SRC3.dword[j]

  IF MaskBit(j) OR *no writemask*:
    tmp ← concat(SRC2.dword[j], tsrc3) << (imm8 & 31)
    DEST.dword[j] ← tmp.dword[1]

  ELSE IF *zeroing*:
    DEST.dword[j] ← 0

  ELSE DEST.dword[j] remains unchanged

DEST[MAX_VL-1:VL] ← 0

**VPSHLDQ DEST, SRC2, SRC3, imm8**

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1:

  IF SRC3 is broadcast memop:
    tsrc3 ← SRC3.qword[0]

  ELSE:
    tsrc3 ← SRC3.qword[j]

  IF MaskBit(j) OR *no writemask*:
    tmp ← concat(SRC2.qword[j], tsrc3) << (imm8 & 63)
    DEST.qword[j] ← tmp.qword[1]

  ELSE IF *zeroing*:
    DEST.qword[j] ← 0

  ELSE DEST.qword[j] remains unchanged

DEST[MAX_VL-1:VL] ← 0
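
As an informal illustration of the pseudocode above, the following Python sketch models a single lane of the concatenate-and-shift-left operation and applies it across a list of lanes. It is not part of the specification: the names `concat`, `shld_element`, and `vpshld`, and the representation of a vector register as a list of integers, are assumptions made for this example, and writemasking is left out.

```python
def concat(hi, lo, width):
    """Return a 2*width-bit value with `hi` in the upper half and `lo` in the lower half."""
    mask = (1 << width) - 1
    return ((hi & mask) << width) | (lo & mask)


def shld_element(src2, src3, imm8, width):
    """Model one lane of VPSHLDW/D/Q (width = 16, 32 or 64)."""
    count = imm8 & (width - 1)               # imm8 & 15, & 31 or & 63
    tmp = concat(src2, src3, width) << count
    # tmp.word[1] / tmp.dword[1] / tmp.qword[1]: the second-lowest width-bit chunk
    return (tmp >> width) & ((1 << width) - 1)


def vpshld(src2, src3, imm8, width):
    """Apply the lane operation to every element (no writemask in this sketch)."""
    return [shld_element(a, b, imm8, width) for a, b in zip(src2, src3)]
```

For example, `shld_element(0x1234, 0xABCD, 4, 16)` shifts the 32-bit value `0x1234ABCD` left by 4 and keeps the upper word, giving `0x234A`. Because the count is masked to the element width, an `imm8` of 20 gives the same result for word elements.
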
Intel C/C++ Compiler Intrinsic Equivalent

VPSHLDQ __m128i _mm_shldi_epi64(__m128i, __m128i, int);
VPSHLDQ __m128i _mm_mask_shldi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
VPSHLDQ __m128i _mm_maskz_shldi_epi64(__mmask8, __m128i, __m128i, int);
VPSHLDQ __m256i _mm256_shldi_epi64(__m256i, __m256i, int);
VPSHLDQ __m256i _mm256_mask_shldi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
VPSHLDQ __m256i _mm256_maskz_shldi_epi64(__mmask8, __m256i, __m256i, int);
VPSHLDQ __m512i _mm512_shldi_epi64(__m512i, __m512i, int);
VPSHLDQ __m512i _mm512_mask_shldi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
VPSHLDQ __m512i _mm512_maskz_shldi_epi64(__mmask8, __m512i, __m512i, int);

SIMD Floating-Point Exceptions

None.

Other Exceptions
See Type E4.
### VPSHLDV — Concatenate and Variable Shift Packed Data Left Logical

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/ En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.DDS.128.66.0F38.W1 70 /r VPSHLDVW xmm1{k1}{z}, xmm2, xmm3/m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W1 70 /r VPSHLDVW ymm1{k1}{z}, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256 into ymm1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W1 70 /r VPSHLDVW zmm1{k1}{z}, zmm2, zmm3/m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512 into zmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.128.66.0F38.W0 71 /r VPSHLDVD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W0 71 /r VPSHLDVD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256 into ymm1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W0 71 /r VPSHLDVD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512 into zmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.128.66.0F38.W1 71 /r VPSHLDVQ xmm1{k1}{z}, xmm2, xmm3/m128/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate xmm1 and xmm2, extract result shifted to the left by value in xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W1 71 /r VPSHLDVQ ymm1{k1}{z}, ymm2, ymm3/m256/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate ymm1 and ymm2, extract result shifted to the left by value in ymm3/m256 into ymm1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W1 71 /r VPSHLDVQ zmm1{k1}{z}, zmm2, zmm3/m512/m64bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate zmm1 and zmm2, extract result shifted to the left by value in zmm3/m512 into zmm1.</td>
</tr>
</tbody>
</table>

### Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full Mem</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

Concatenate packed data, extract result shifted to the left by variable value.

This instruction supports memory fault suppression.
**Operation**

FUNCTION concat(a, b):
   IF words:
      d.word[1] ← a
      d.word[0] ← b
      return d
   ELSE IF dwords:
      q.dword[1] ← a
      q.dword[0] ← b
      return q
   ELSE IF qwords:
      o.qword[1] ← a
      o.qword[0] ← b
      return o

**VPSHLDVW DEST, SRC2, SRC3**

(KL, VL) = (8, 128), (16, 256), (32, 512)
FOR j ← 0 TO KL-1:
   IF MaskBit(j) OR *no writemask*:
      tmp ← concat(DEST.word[j], SRC2.word[j]) << (SRC3.word[j] & 15)
      DEST.word[j] ← tmp.word[1]
   ELSE IF *zeroing*:
      DEST.word[j] ← 0
   *ELSE DEST.word[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0

**VPSHLDVD DEST, SRC2, SRC3**

(KL, VL) = (4, 128), (8, 256), (16, 512)
FOR j ← 0 TO KL-1:
   IF SRC3 is broadcast memop:
      tsrc3 ← SRC3.dword[0]
   ELSE:
      tsrc3 ← SRC3.dword[j]
   IF MaskBit(j) OR *no writemask*:
      tmp ← concat(DEST.dword[j], SRC2.dword[j]) << (tsrc3 & 31)
      DEST.dword[j] ← tmp.dword[1]
   ELSE IF *zeroing*:
      DEST.dword[j] ← 0
   *ELSE DEST.dword[j] remains unchanged*
DEST[MAX_VL-1:VL] ← 0

**VPSHLDVQ DEST, SRC2, SRC3**

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1:
  IF SRC3 is broadcast memop:
    tsrc3 ← SRC3.qword[0]
  ELSE:
    tsrc3 ← SRC3.qword[j]
  IF MaskBit(j) OR *no writemask*:
    tmp ← concat(DEST.qword[j], SRC2.qword[j]) << (tsrc3 & 63)
    DEST.qword[j] ← tmp.qword[1]
  ELSE IF *zeroing*:
    DEST.qword[j] ← 0
  *ELSE DEST.qword[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0
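
The writemask handling in the pseudocode above (compute the lane, zero it, or leave it unchanged) can be sketched in Python as follows. This is a minimal model under stated assumptions, not a definitive implementation: `vpshldv`, its parameter names, the integer bitmask `k`, and the `zeroing` flag are hypothetical, registers are plain lists of integers, and embedded broadcast is not modeled.

```python
def vpshldv(dest, src2, src3, width, k=None, zeroing=False):
    """Model VPSHLDVW/D/Q lane by lane.

    dest, src2, src3: lists of unsigned lane values (dest is also a source).
    k: optional writemask as an integer bitmask; None means no writemask.
    zeroing: True models {z}; False models merging-masking.
    """
    lane_mask = (1 << width) - 1
    out = []
    for j, (d, a, c) in enumerate(zip(dest, src2, src3)):
        if k is None or (k >> j) & 1:
            count = c & (width - 1)          # per-lane count from SRC3
            tmp = (((d & lane_mask) << width) | (a & lane_mask)) << count
            out.append((tmp >> width) & lane_mask)
        elif zeroing:
            out.append(0)                    # masked-off lane is zeroed
        else:
            out.append(d)                    # masked-off lane keeps DEST
    return out
```

With dword lanes, `dest = [0x1, 0xFFFFFFFF, 0x12345678, 0x0]`, `src2 = [0x80000000, 0x0, 0x9ABCDEF0, 0xFFFFFFFF]`, and `src3 = [1, 4, 8, 33]`, a mask of `0b0101` computes lanes 0 and 2 and either preserves or zeroes lanes 1 and 3, depending on `zeroing`.
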

Intel C/C++ Compiler Intrinsic Equivalent

VPSHLDVW __m128i _mm_shldv_epi16(__m128i, __m128i, __m128i);
VPSHLDVW __m128i _mm_mask_shldv_epi16(__m128i, __mmask8, __m128i, __m128i);
VPSHLDVW __m128i _mm_maskz_shldv_epi16(__mmask8, __m128i, __m128i, __m128i);
VPSHLDVW __m256i _mm256_shldv_epi16(__m256i, __m256i, __m256i);
VPSHLDVW __m256i _mm256_mask_shldv_epi16(__m256i, __mmask16, __m256i, __m256i);
VPSHLDVW __m256i _mm256_maskz_shldv_epi16(__mmask16, __m256i, __m256i, __m256i);
VPSHLDVQ __m512i _mm512_shldv_epi64(__m512i, __m512i, __m512i);
VPSHLDVQ __m512i _mm512_mask_shldv_epi64(__m512i, __mmask8, __m512i, __m512i);
VPSHLDVQ __m512i _mm512_maskz_shldv_epi64(__mmask8, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Type E4.
**VPSHRD — Concatenate and Shift Packed Data Right Logical**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.NDS.128.66.0F3A.W1 72 /r /ib VPSHRDW xmm1{k1}{z}, xmm2, xmm3/m128, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W1 72 /r /ib VPSHRDW ymm1{k1}{z}, ymm2, ymm3/m256, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W1 72 /r /ib VPSHRDW zmm1{k1}{z}, zmm2, zmm3/m512, imm8</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into zmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.128.66.0F3A.W0 73 /r /ib VPSHRDD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into xmm1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F3A.W0 73 /r /ib VPSHRDD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into ymm1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F3A.W0 73 /r /ib VPSHRDD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst, imm8</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate destination and source operands, extract result shifted to the right by constant value in imm8 into zmm1.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
<tr>
<td>B</td>
<td>Full</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>imm8 (r)</td>
</tr>
</tbody>
</table>

**Description**

Concatenate packed data, extract result shifted to the right by constant value. This instruction supports memory fault suppression.
**Operation**

**VPSHRDW DEST, SRC2, SRC3, imm8**

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1:

  IF MaskBit(j) OR *no writemask*:
    DEST.word[j] ← concat(SRC3.word[j], SRC2.word[j]) >> (imm8 & 15)
  ELSE IF *zeroing*:
    DEST.word[j] ← 0
  *ELSE DEST.word[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0

**VPSHRDD DEST, SRC2, SRC3, imm8**

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1:

  IF SRC3 is broadcast memop:
    tsrc3 ← SRC3.dword[0]
  ELSE:
    tsrc3 ← SRC3.dword[j]
  IF MaskBit(j) OR *no writemask*:
    DEST.dword[j] ← concat(tsrc3, SRC2.dword[j]) >> (imm8 & 31)
  ELSE IF *zeroing*:
    DEST.dword[j] ← 0
  *ELSE DEST.dword[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0

**VPSHRDQ DEST, SRC2, SRC3, imm8**

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1:

  IF SRC3 is broadcast memop:
    tsrc3 ← SRC3.qword[0]
  ELSE:
    tsrc3 ← SRC3.qword[j]
  IF MaskBit(j) OR *no writemask*:
    DEST.qword[j] ← concat(tsrc3, SRC2.qword[j]) >> (imm8 & 63)
  ELSE IF *zeroing*:
    DEST.qword[j] ← 0
  *ELSE DEST.qword[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0
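
The right-shift form above differs from the left-shift form in two ways: SRC3 supplies the upper half of the concatenation, and the result is the low half of the shifted value. A minimal Python sketch of one lane follows. The name `shrd_element` and the integer lane representation are assumptions made for this illustration, and masking and broadcast are omitted.

```python
def shrd_element(src2, src3, imm8, width):
    """Model one lane of VPSHRDW/D/Q (width = 16, 32 or 64)."""
    lane_mask = (1 << width) - 1
    count = imm8 & (width - 1)                       # imm8 & 15, & 31 or & 63
    # concat(SRC3, SRC2): SRC3 is the upper half, SRC2 the lower half
    tmp = ((src3 & lane_mask) << width) | (src2 & lane_mask)
    # the destination lane receives the low `width` bits of the shifted value
    return (tmp >> count) & lane_mask
```

For word elements, `shrd_element(0x1234, 0xABCD, 4, 16)` shifts `0xABCD1234` right by 4 and keeps the low word, `0xD123`. One consequence of this definition is that passing the same value as both sources yields a rotate right: `shrd_element(0x1234, 0x1234, 4, 16)` is `0x4123`.
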
**Intel C/C++ Compiler Intrinsic Equivalent**

VPSHRDQ __m128i _mm_shrdi_epi64(__m128i, __m128i, int);
VPSHRDQ __m128i _mm_mask_shrdi_epi64(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDQ __m128i _mm_maskz_shrdi_epi64(__mmask8, __m128i, __m128i, int);
VPSHRDQ __m256i _mm256_shrdi_epi64(__m256i, __m256i, int);
VPSHRDQ __m256i _mm256_mask_shrdi_epi64(__m256i, __mmask8, __m256i, __m256i, int);
VPSHRDQ __m256i _mm256_maskz_shrdi_epi64(__mmask8, __m256i, __m256i, int);
VPSHRDQ __m512i _mm512_shrdi_epi64(__m512i, __m512i, int);
VPSHRDQ __m512i _mm512_mask_shrdi_epi64(__m512i, __mmask8, __m512i, __m512i, int);
VPSHRDQ __m512i _mm512_maskz_shrdi_epi64(__mmask8, __m512i, __m512i, int);
VPSHRDD __m128i _mm_shrdi_epi32(__m128i, __m128i, int);
VPSHRDD __m128i _mm_mask_shrdi_epi32(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDD __m128i _mm_maskz_shrdi_epi32(__mmask8, __m128i, __m128i, int);
VPSHRDD __m256i _mm256_shrdi_epi32(__m256i, __m256i, int);
VPSHRDD __m256i _mm256_mask_shrdi_epi32(__m256i, __mmask8, __m256i, __m256i, int);
VPSHRDD __m256i _mm256_maskz_shrdi_epi32(__mmask8, __m256i, __m256i, int);
VPSHRDD __m512i _mm512_shrdi_epi32(__m512i, __m512i, int);
VPSHRDD __m512i _mm512_mask_shrdi_epi32(__m512i, __mmask16, __m512i, __m512i, int);
VPSHRDD __m512i _mm512_maskz_shrdi_epi32(__mmask16, __m512i, __m512i, int);
VPSHRDW __m128i _mm_shrdi_epi16(__m128i, __m128i, int);
VPSHRDW __m128i _mm_mask_shrdi_epi16(__m128i, __mmask8, __m128i, __m128i, int);
VPSHRDW __m128i _mm_maskz_shrdi_epi16(__mmask8, __m128i, __m128i, int);
VPSHRDW __m256i _mm256_shrdi_epi16(__m256i, __m256i, int);
VPSHRDW __m256i _mm256_mask_shrdi_epi16(__m256i, __mmask16, __m256i, __m256i, int);
VPSHRDW __m256i _mm256_maskz_shrdi_epi16(__mmask16, __m256i, __m256i, int);
VPSHRDW __m512i _mm512_shrdi_epi16(__m512i, __m512i, int);
VPSHRDW __m512i _mm512_mask_shrdi_epi16(__m512i, __mmask32, __m512i, __m512i, int);
VPSHRDW __m512i _mm512_maskz_shrdi_epi16(__mmask32, __m512i, __m512i, int);

**SIMD Floating-Point Exceptions**

None.

**Other Exceptions**

See Type E4.
VPSHRDV — Concatenate and Variable Shift Packed Data Right Logical

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Op/ En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.DDS.128.66.0F38.W1 72 /r VPSHRDVW xmm1{k1}{z}, xmm2, xmm3/m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate xmm1 and xmm2, extract result shifted to the right by value in xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W1 72 /r VPSHRDVW ymm1{k1}{z}, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate ymm1 and ymm2, extract result shifted to the right by value in ymm3/m256 into ymm1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W1 72 /r VPSHRDVW zmm1{k1}{z}, zmm2, zmm3/m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate zmm1 and zmm2, extract result shifted to the right by value in zmm3/m512 into zmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.128.66.0F38.W0 73 /r VPSHRDVD xmm1{k1}{z}, xmm2, xmm3/m128/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate xmm1 and xmm2, extract result shifted to the right by value in xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>EVEX.DDS.256.66.0F38.W0 73 /r VPSHRDVD ymm1{k1}{z}, ymm2, ymm3/m256/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2 AVX512VL</td>
<td>Concatenate ymm1 and ymm2, extract result shifted to the right by value in ymm3/m256 into ymm1.</td>
</tr>
<tr>
<td>EVEX.DDS.512.66.0F38.W0 73 /r VPSHRDVD zmm1{k1}{z}, zmm2, zmm3/m512/m32bcst</td>
<td>B</td>
<td>V/V</td>
<td>AVX512_VBMI2</td>
<td>Concatenate zmm1 and zmm2, extract result shifted to the right by value in zmm3/m512 into zmm1.</td>
</tr>
</tbody>
</table>

Instruction Operand Encoding

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full Mem</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
<tr>
<td>B</td>
<td>Full</td>
<td>ModRM:reg (r, w)</td>
<td>EVEX.vvvv</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

Description

Concatenate packed data, extract result shifted to the right by variable value.
This instruction supports memory fault suppression.
**Operation**

**VPSHRDVW DEST, SRC2, SRC3**

(KL, VL) = (8, 128), (16, 256), (32, 512)

FOR j ← 0 TO KL-1:
   IF MaskBit(j) OR *no writemask*:
      DEST.word[j] ← concat(SRC2.word[j], DEST.word[j]) >> (SRC3.word[j] & 15)
   ELSE IF *zeroing*:
      DEST.word[j] ← 0
   *ELSE DEST.word[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0

**VPSHRDVD DEST, SRC2, SRC3**

(KL, VL) = (4, 128), (8, 256), (16, 512)

FOR j ← 0 TO KL-1:
   IF SRC3 is broadcast memop:
      tsrc3 ← SRC3.dword[0]
   ELSE:
      tsrc3 ← SRC3.dword[j]
   IF MaskBit(j) OR *no writemask*:
      DEST.dword[j] ← concat(SRC2.dword[j], DEST.dword[j]) >> (tsrc3 & 31)
   ELSE IF *zeroing*:
      DEST.dword[j] ← 0
   *ELSE DEST.dword[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0

**VPSHRDVQ DEST, SRC2, SRC3**

(KL, VL) = (2, 128), (4, 256), (8, 512)

FOR j ← 0 TO KL-1:
   IF SRC3 is broadcast memop:
      tsrc3 ← SRC3.qword[0]
   ELSE:
      tsrc3 ← SRC3.qword[j]
   IF MaskBit(j) OR *no writemask*:
      DEST.qword[j] ← concat(SRC2.qword[j], DEST.qword[j]) >> (tsrc3 & 63)
   ELSE IF *zeroing*:
      DEST.qword[j] ← 0
   *ELSE DEST.qword[j] remains unchanged*

DEST[MAX_VL-1:VL] ← 0
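
The `tsrc3` selection in the dword and qword forms above models embedded broadcast: when SRC3 is a broadcast memory operand, every lane takes its shift count from element 0. The following Python sketch illustrates that selection for the variable right shift. The name `vpshrdv`, the `broadcast` flag, and the list-based registers are hypothetical, and writemasking is omitted.

```python
def vpshrdv(dest, src2, src3, width, broadcast=False):
    """Model VPSHRDVD/Q lane by lane, with optional embedded broadcast of SRC3."""
    lane_mask = (1 << width) - 1
    out = []
    for j in range(len(dest)):
        # broadcast memop: all lanes use element 0 of SRC3 as the count
        tsrc3 = src3[0] if broadcast else src3[j]
        count = tsrc3 & (width - 1)
        # concat(SRC2, DEST): SRC2 is the upper half, DEST the lower half
        tmp = ((src2[j] & lane_mask) << width) | (dest[j] & lane_mask)
        out.append((tmp >> count) & lane_mask)
    return out
```

With dword lanes `dest = [0x10, 0xFFFFFFFF]`, `src2 = [0x1, 0x0]`, and `src3 = [4, 8]`, the per-lane form shifts lane 1 by 8, while the broadcast form shifts both lanes by 4.
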
Intel C/C++ Compiler Intrinsic Equivalent

VPSHRDVQ __m128i _mm_shrdv_epi64(__m128i, __m128i, __m128i);
VPSHRDVQ __m128i _mm_mask_shrdv_epi64(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVQ __m128i _mm_maskz_shrdv_epi64(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVQ __m256i _mm256_shrdv_epi64(__m256i, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_mask_shrdv_epi64(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVQ __m256i _mm256_maskz_shrdv_epi64(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVQ __m512i _mm512_shrdv_epi64(__m512i, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_mask_shrdv_epi64(__m512i, __mmask8, __m512i, __m512i);
VPSHRDVQ __m512i _mm512_maskz_shrdv_epi64(__mmask8, __m512i, __m512i, __m512i);
VPSHRDVD __m128i _mm_shrdv_epi32(__m128i, __m128i, __m128i);
VPSHRDVD __m128i _mm_mask_shrdv_epi32(__m128i, __mmask8, __m128i, __m128i);
VPSHRDVD __m128i _mm_maskz_shrdv_epi32(__mmask8, __m128i, __m128i, __m128i);
VPSHRDVD __m256i _mm256_shrdv_epi32(__m256i, __m256i, __m256i);
VPSHRDVD __m256i _mm256_mask_shrdv_epi32(__m256i, __mmask8, __m256i, __m256i);
VPSHRDVD __m256i _mm256_maskz_shrdv_epi32(__mmask8, __m256i, __m256i, __m256i);
VPSHRDVD __m512i _mm512_shrdv_epi32(__m512i, __m512i, __m512i);
VPSHRDVD __m512i _mm512_mask_shrdv_epi32(__m512i, __mmask16, __m512i, __m512i);
VPSHRDVD __m512i _mm512_maskz_shrdv_epi32(__mmask16, __m512i, __m512i, __m512i);

SIMD Floating-Point Exceptions
None.

Other Exceptions
See Type E4.
**VPSHUFBITQMB — Shuffle Bits from Quadword Elements Using Byte Indexes into Mask**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Op/En</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>EVEX.NDS.128.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, xmm2, xmm3/m128</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Extract values in xmm2 using control bits of xmm3/m128 with writemask k2 and leave the result in mask register k1.</td>
</tr>
<tr>
<td>EVEX.NDS.256.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, ymm2, ymm3/m256</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG AVX512VL</td>
<td>Extract values in ymm2 using control bits of ymm3/m256 with writemask k2 and leave the result in mask register k1.</td>
</tr>
<tr>
<td>EVEX.NDS.512.66.0F38.W0 8F /r VPSHUFBITQMB k1{k2}, zmm2, zmm3/m512</td>
<td>A</td>
<td>V/V</td>
<td>AVX512_BITALG</td>
<td>Extract values in zmm2 using control bits of zmm3/m512 with writemask k2 and leave the result in mask register k1.</td>
</tr>
</tbody>
</table>

**Instruction Operand Encoding**

<table>
<thead>
<tr>
<th>Op/En</th>
<th>Tuple</th>
<th>Operand 1</th>
<th>Operand 2</th>
<th>Operand 3</th>
<th>Operand 4</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>Full Mem</td>
<td>ModRM:reg (w)</td>
<td>EVEX.vvvv (r)</td>
<td>ModRM:r/m (r)</td>
<td>NA</td>
</tr>
</tbody>
</table>

**Description**

The VPSHUFBITQMB instruction performs a bit gather select using the second source as control and the first source as data. Each output bit uses 6 control bits (from the second source operand) to select which data bit (from the first source operand) is gathered. A given output bit can only access 64 different bits of data: the first 8 destination bits can access the first 64 data bits, the second 8 destination bits can access the second 64 data bits, and so on.

Control data for each output bit is stored in 8 bit elements of SRC2, but only the 6 least significant bits of each element are used.

This instruction uses write masking (zeroing only). This instruction supports memory fault suppression.

The first source operand is a ZMM/YMM/XMM register. The second source operand is a ZMM/YMM/XMM register or a 512/256/128-bit memory location. The destination operand is a mask register.

**Operation**

**VPSHUFBITQMB DEST, SRC1, SRC2**

(KL, VL) = (16,128), (32,256), (64, 512)

FOR i ← 0 TO KL/8-1: //Qword
    FOR j ← 0 to 7: // Byte
        IF k2[i*8+j] or *no writemask*:
            m ← SRC2.qword[i].byte[j] & 0x3F
            k1[i*8+j] ← SRC1.qword[i].bit[m]
        ELSE:
            k1[i*8+j] ← 0
k1[MAX_KL-1:KL] ← 0
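
As a rough illustration of the loop above, the Python sketch below gathers one mask bit per control byte. It is a sketch under stated assumptions rather than a reference model: `vpshufbitqmb` is a hypothetical name, each source is a list of 64-bit integers (one per qword), and `k2` is an integer bitmask or `None` for no writemask.

```python
def vpshufbitqmb(src1_qwords, src2_qwords, k2=None):
    """Return the mask k1 as an integer; bit i*8+j comes from qword i, control byte j."""
    k1 = 0
    for i, (data, ctrl) in enumerate(zip(src1_qwords, src2_qwords)):
        for j in range(8):
            bit_pos = i * 8 + j
            if k2 is None or (k2 >> bit_pos) & 1:
                m = (ctrl >> (8 * j)) & 0x3F          # only the low 6 bits of the byte
                k1 |= ((data >> m) & 1) << bit_pos    # SRC1.qword[i].bit[m]
            # masked-off bits stay 0 (zeroing-only write masking)
    return k1
```

For a data qword of `0x8000000000000001` (bits 0 and 63 set) and control bytes `[0, 63, 1, 0x40, 62, 0xFF, 0, 0x7F]`, the selected bit indexes are 0, 63, 1, 0, 62, 63, 0, 63, so the eight mask bits are `0xEB`.
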

**Intel C/C++ Compiler Intrinsic Equivalent**

VPSHUFBITQMB __mmask16 _mm_bitshuffle_epi64_mask(__m128i, __m128i);
VPSHUFBITQMB __mmask16 _mm_mask_bitshuffle_epi64_mask(__mmask16, __m128i, __m128i);
VPSHUFBITQMB __mmask32 _mm256_bitshuffle_epi64_mask(__m256i, __m256i);
VPSHUFBITQMB __mmask32 _mm256_mask_bitshuffle_epi64_mask(__mmask32, __m256i, __m256i);
VPSHUFBITQMB __mmask64 _mm512_bitshuffle_epi64_mask(__m512i, __m512i);
VPSHUFBITQMB __mmask64 _mm512_mask_bitshuffle_epi64_mask(__mmask64, __m512i, __m512i);

Ref. # 319433-034
**WBNOINVD—Write Back and Do Not Invalidate Cache**

### Description

The WBNOINVD instruction writes back all modified cache lines in the processor’s internal cache to main memory but does not invalidate (flush) the internal caches.

After executing this instruction, the processor does not wait for the external caches to complete their write-back operation before proceeding with instruction execution. It is the responsibility of hardware to respond to the cache write-back signal. The amount of time or cycles for WBNOINVD to complete will vary due to size and other factors of different cache hierarchies. As a consequence, the use of the WBNOINVD instruction can have an impact on logical processor interrupt/event response time.

The WBNOINVD instruction is a privileged instruction. When the processor is running in protected mode, the CPL of a program or procedure must be 0 to execute this instruction. This instruction is also a serializing instruction (see “Serializing Instructions” in Chapter 8 of the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A*).

In situations where cache coherency with main memory is not a concern, software can use the INVD instruction. This instruction’s operation is the same in non-64-bit modes and 64-bit mode.

### IA-32 Architecture Compatibility

The WBNOINVD instruction is implementation dependent, and its function may be implemented differently on future Intel 64 and IA-32 processors.

### Operation

WriteBack(InternalCaches);
Continue; (* Continue execution *)

### Intel C/C++ Compiler Intrinsic Equivalent

```
WBNOINVD void _wbnoinvd(void);
```

### Flags Affected

None.

### Protected Mode Exceptions

- **#GP(0)** If the current privilege level is not 0.
- **#UD** If the LOCK prefix is used.

### Real-Address Mode Exceptions

- **#UD** If the LOCK prefix is used.
### Virtual-8086 Mode Exceptions

- **#GP(0)** WBNOINVD cannot be executed in virtual-8086 mode.

### Compatibility Mode Exceptions

Same exceptions as in protected mode.

### 64-Bit Mode Exceptions

Same exceptions as in protected mode.
3.1 INTRODUCTION

This chapter describes an EPT-based sub-page permissions capability to allow Virtual Machine Monitors (VMM) to specify write-permission for guest physical memory at a sub-page (128 byte) granularity. When this capability is utilized, the CPU establishes write-access permissions for sub-page regions of 4-KByte pages as specified by the VMM. EPT-based sub-page permissions is intended to enable fine-grained memory write enforcement by a VMM for security (guest OS monitoring).

3.2 VMCS CHANGES

A new secondary processor-based VM-execution control is defined as "sub-page write permission". The bit position of this control is 23.

If bit 31 of the primary processor-based VM-execution controls is 0, the logical processor operates as if the sub-page write permission VM-execution control is 0.

A new 64-bit control field is defined as "sub-page permission table pointer" (SPPTP). The encodings for this field are 00002030H (all 64 bits in 64-bit mode; low 32 bits in legacy mode) and 00002031H (high 32 bits).

3.3 CHANGES TO EPT PAGING-STRUCTURE ENTRIES

Bit 61 of an EPT PTE is defined as a "Sub-Page Permission" (SPP bit). Setting this bit allows write permissions for the mapped page to be enforced on a sub-page basis (see Section 3.4). The processor ignores this bit in all other EPT paging-structure entries (as it does if the "sub-page write permission" VM-execution control is 0).

3.4 CHANGES TO GUEST-PHYSICAL ACCESSES

If the logical processor is in VMX non-root operation with EPT enabled, and if the sub-page write permission VM-execution control (see Section 3.2) is 0, an EPT violation occurs if a memory store uses a guest-physical address and the write-access bit (bit 1) is clear in any of the EPT paging-structure entries used to translate the guest-physical address. (This is the same as legacy behavior.)

If the sub-page write permission VM-execution control is 1, treatment of write accesses to guest-physical addresses depends on the state of the accumulated write-access bit (position 1) and sub-page permission bit (position 61) in the leaf EPT paging-structure entry used to translate the guest-physical address.

If EPT translates a guest-physical address using a 4-KByte page, the accumulated write-access bit is 0, and the SPP bit is set to 1 in the EPT PTE, the processor uses the guest-physical address to select from a VMM-managed Sub-Page Permission Table (SPPT) a write permission bit for the 128-byte sub-page region being accessed within the 4-KByte page. If the sub-page region write permission bit is set, the write is allowed; otherwise the write is disallowed and results in an EPT violation, as it normally would.

In other cases, the processor does not consult the SPPT. Guest-physical pages mapped via leaf EPT paging-structure entries for which the accumulated write-access bit and the SPP bit are both clear (0) generate EPT violations on memory write accesses. Guest-physical pages mapped via EPT paging-structure entries for which the accumulated write-access bit is set (1) allow writes, effectively ignoring the SPP bit in the leaf EPT paging-structure entry.
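
The cases described in this section can be summarized as a small decision function. The Python sketch below is only an illustration of the rules as stated here: the function name, its parameters, and the string results are hypothetical, it considers write permission only (other causes of EPT violations are ignored), and it takes the SPPT lookup result as an input rather than modeling the lookup or any SPP-induced VM exit.

```python
def guest_write_outcome(spp_control, accumulated_w, is_4k_page, spp_bit, sppt_bit=False):
    """Classify a guest-physical write as 'allowed' or 'ept_violation'.

    spp_control:   sub-page write permission VM-execution control (0/1)
    accumulated_w: accumulated write-access bit (bit 1) over the EPT walk
    is_4k_page:    True if the translation ends at an EPT PTE (4-KByte page)
    spp_bit:       bit 61 of the leaf entry (meaningful only in an EPT PTE)
    sppt_bit:      write permission bit selected from the SPPT, if consulted
    """
    if accumulated_w:
        return "allowed"                 # SPP bit is effectively ignored
    if spp_control and is_4k_page and spp_bit:
        # only in this case is the SPPT consulted
        return "allowed" if sppt_bit else "ept_violation"
    return "ept_violation"               # legacy behavior, or SPP bit clear
```

For instance, with the control set, the write-access bit clear, and the SPP bit set on a 4-KByte mapping, the outcome follows `sppt_bit`; with the control clear, the same mapping always yields an EPT violation on writes.
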
EPT-BASED SUB-PAGE PERMISSIONS

3.5 SUB-PAGE PERMISSION TABLE

The sub-page permission table is referenced via a 64-bit control field called Sub-Page Permission Table Pointer (SPPTP), which contains a 4-KByte aligned physical address. The SPPT allows specification of write permissions for the 32 128-byte sub-page regions of each 4-KByte guest-physical memory page accessed via the EPT. The format of SPPTP is shown in Table 3-1 below.

<table>
<thead>
<tr>
<th>Bit Position</th>
<th>Contents</th>
</tr>
</thead>
<tbody>
<tr>
<td>11:0</td>
<td>Reserved.</td>
</tr>
<tr>
<td>M-1:12</td>
<td>Bits M-1:12 of the physical address of the 4-KByte aligned SPPT L4 table.</td>
</tr>
<tr>
<td>63:M</td>
<td>Reserved (must be 0).</td>
</tr>
</tbody>
</table>

**NOTES:**
1. M is the physical-address width supported by the processor.

The memory type used for SPPT accesses will be the memory type reported in IA32_VMX_BASIC MSR.

When SPPT is in use, write accesses to any guest-physical addresses produced via a mapping for a 4KB page in the EPT can be controlled at a 128 byte granularity sub-page region within the 4KB guest-physical page. Note that reads and instruction fetches are not affected by the SPPT.

3.5.1 SPPT Overview

SPPT is active when the sub-page write permission VM-execution control is 1. SPPT looks up the guest-physical addresses to derive a 64 bit sub-page permission value containing sub-page region write permissions. The lookup from guest-physical addresses to the sub-page region permissions is determined by a set of SPPT paging structures. Section 3.5.2 gives the details of the SPPT structures.

When the sub-page write permission VM-execution control is 1, the SPPT is used to look up write permission bits for the 128 byte sub-page regions contained in the 4KB guest-physical page. EPT specifies the 4KB page-level privileges that software is allowed when accessing the guest-physical address, whereas SPPT defines the write permissions for software at the 128 byte granularity regions within a 4KB page. Similar to EPT, a logical processor uses SPPT to look up sub-page region write permissions for guest-physical addresses only when those addresses are used to access memory.

3.5.2 Operation of SPPT-based Write-Permission

The SPPT translation mechanism uses only bits 47:7 of a guest-physical address. The SPPT is a 4-level paging structure. Four SPPT paging structures are accessed to look up a sub-page region write permission bit for a guest-physical address. The 48 bits are partitioned by the logical processor to traverse the SPPT paging structures as follows.

- A 4KB naturally aligned SPPT L4 table is located at the physical address specified in bits 51:12 of the SPPTP. An SPPT L4 table comprises 512 64-bit entries (SPPT L4Es). An SPPT L4E is selected at the physical address defined as follows.
  - Bits 63:52 are all 0.
  - Bits 51:12 are from the SPPTP.
  - Bits 11:3 are bits 47:39 of the guest-physical address.
  - Bits 2:0 are all 0.

The format of a SPPT L4E is given in Table 3-2.

A 4KB naturally aligned SPPT L3 table is located at the physical address specified in bits 51:12 of the SPPT L4E. An SPPT L3 table comprises 512 64-bit entries (SPPT L3Es). An SPPT L3E is selected at the physical address defined as follows.

- Bits 63:52 are all 0.
- Bits 51:12 are from the SPPT L4E.
- Bits 11:3 are bits 38:30 of the guest-physical address.
- Bits 2:0 are all 0.

The format of the SPPT L3E is the same as that given in Table 3-2 for SPPT L4Es. The SPPT L3E references a 4KB naturally aligned SPPT L2 Table.

A 4KB naturally aligned SPPT L2 table is located at the physical address specified in bits 51:12 of the SPPT L3E. An SPPT L2 table comprises 512 64-bit entries (SPPT L2Es). An SPPT L2E is selected at the physical address defined as follows.

- Bits 63:52 are all 0.
- Bits 51:12 are from the SPPT L3E.
- Bits 11:3 are bits 29:21 of the guest-physical address.
- Bits 2:0 are all 0.

The format of a SPPT L2E is the same as that given in Table 3-2 for SPPT L4Es. The SPPT L2E references a 4KB naturally aligned SPPT L1 Table.

A 4KB naturally aligned SPPT L1 table is located at the physical address specified in bits 51:12 of the SPPT L2E. An SPPT L1 table comprises 512 64-bit entries (SPPT L1Es). An SPPT L1E is selected at the physical address defined as follows.

- Bits 63:52 are all 0.
- Bits 51:12 are from the SPPT L2E.
- Bits 11:3 are bits 20:12 of the guest-physical address.
- Bits 2:0 are all 0.

The processor then consults bit 2i of the SPPT L1E, where i is the value of bits 11:7 of the guest-physical address; a write access to the guest-physical address is allowed if the bit is 1. (The odd bits in the SPPT L1E are reserved and must be 0.)
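The address arithmetic of the SPPT walk described above can be sketched in C as follows. This is an illustrative sketch only: the function names are hypothetical and not defined by this specification, and the code computes addresses and bit positions without reading memory or checking valid and reserved bits.

```c
#include <stdint.h>

/* Table index (0..511) for SPPT level 4, 3, 2, or 1:
 * level 4 -> GPA bits 47:39, 3 -> 38:30, 2 -> 29:21, 1 -> 20:12. */
unsigned sppt_index(uint64_t gpa, int level)
{
    unsigned shift = 12u + 9u * (unsigned)(level - 1);
    return (unsigned)((gpa >> shift) & 0x1FFu);
}

/* Physical address of the selected entry. `parent` is the SPPTP (for
 * level 4) or the SPPT entry one level up; only bits 51:12 are used. */
uint64_t sppt_entry_addr(uint64_t parent, uint64_t gpa, int level)
{
    uint64_t base = parent & 0x000FFFFFFFFFF000ULL;   /* bits 51:12 */
    return base | ((uint64_t)sppt_index(gpa, level) << 3);
}

/* Sub-page number i: bits 11:7 of the guest-physical address. */
unsigned spp_subpage(uint64_t gpa)
{
    return (unsigned)((gpa >> 7) & 0x1Fu);
}

/* Nonzero if bit 2i of the SPPT L1E permits a write to this GPA. */
int spp_write_allowed(uint64_t l1e, uint64_t gpa)
{
    return (int)((l1e >> (2u * spp_subpage(gpa))) & 1u);
}
```

Because i is taken from bits 11:7, each 4-KByte page is divided into 32 sub-page regions of 128 bytes, and the write-permission bits occupy the even bit positions 0 through 62 of the SPPT L1E.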

---

Table 3-2. Format of the SPPT L4E

<table>
<thead>
<tr>
<th>Bit Position</th>
<th>Contents</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Valid entry when set; indicates whether the entry is present.</td>
</tr>
<tr>
<td>11:1</td>
<td>Reserved (must be 0).</td>
</tr>
<tr>
<td>M-1:12</td>
<td>Physical address of 4KB naturally aligned SPPT L3 table referenced by this entry.¹</td>
</tr>
<tr>
<td>63:M</td>
<td>Reserved (must be 0).</td>
</tr>
</tbody>
</table>

NOTES:

1. M is the physical-address width supported by the processor. Software can determine a processor’s physical-address width by executing CPUID with 80000008H in EAX. The physical-address width is returned in bits 7:0 of EAX.

3.5.3 SPP-Induced VM Exits

Accesses using guest-physical addresses may cause SPP-induced VM exits due to an SPPT misconfiguration or an SPPT miss. The basic VM exit reason reported for SPP-induced VM exits is 66.

An SPPT misconfiguration VM exit occurs when, in the course of an SPPT lookup, an SPPT paging-structure entry is encountered that sets a reserved bit. See Section 3.5.3.1 for which bits are reserved in SPPT paging-structure entries.

An SPPT miss VM exit occurs when, in the course of an SPPT lookup, an SPPT paging-structure entry is encountered in which the valid bit is clear.

SPPT misconfigurations and SPPT misses can occur only due to an attempt to write memory with a guest-physical address.

SPP-induced VM exits save an exit qualification with the format given in Table 3-3. These VM exits also save a guest-linear address and a guest-physical address.

### Table 3-3. Exit Qualification for SPPT-Induced VM Exits

<table>
<thead>
<tr>
<th>Bit Position</th>
<th>Contents</th>
</tr>
</thead>
<tbody>
<tr>
<td>10:0</td>
<td>Not used.</td>
</tr>
<tr>
<td>11</td>
<td>SPPT VM exit type. Set for SPPT miss; cleared for SPPT misconfiguration VM exit.</td>
</tr>
<tr>
<td>12</td>
<td>NMI unblocking due to IRET.</td>
</tr>
<tr>
<td>63:13</td>
<td>Not used.</td>
</tr>
</tbody>
</table>

**Guest Linear Address:** In addition to the existing cases for which this field is reported, for a VM exit due to an SPPT misconfiguration or SPPT miss, this field receives a linear address that caused the SPPT misconfiguration or SPPT miss VM exit.

**Guest Physical Address:** In addition to the existing cases for which this field is reported, for a VM exit due to an SPPT misconfiguration or SPPT miss, this field receives the guest-physical address that caused the SPPT misconfiguration or SPPT miss VM exit.
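A VMM exit handler might decode these fields along the following lines. This is a minimal sketch under stated assumptions: the identifiers are hypothetical, and the only architectural facts used are basic exit reason 66, exit qualification bits 11 and 12 from Table 3-3, and the basic exit reason occupying the low 16 bits of the exit-reason field.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_SPP        66u          /* basic exit reason for SPP-induced VM exits */
#define SPP_QUAL_MISS          (1ULL << 11) /* set: SPPT miss; clear: misconfiguration */
#define SPP_QUAL_NMI_UNBLOCK   (1ULL << 12) /* NMI unblocking due to IRET */

enum spp_exit_kind { SPP_EXIT_MISCONFIG = 0, SPP_EXIT_MISS = 1 };

/* True if the exit-reason field reports an SPP-induced VM exit. */
bool is_spp_exit(uint32_t exit_reason)
{
    return (exit_reason & 0xFFFFu) == EXIT_REASON_SPP;
}

/* Classify the exit from bit 11 of the exit qualification. */
enum spp_exit_kind spp_classify(uint64_t qualification)
{
    return (qualification & SPP_QUAL_MISS) ? SPP_EXIT_MISS : SPP_EXIT_MISCONFIG;
}

/* True if bit 12 (NMI unblocking due to IRET) is set. */
bool spp_nmi_unblocking(uint64_t qualification)
{
    return (qualification & SPP_QUAL_NMI_UNBLOCK) != 0;
}
```

For an SPPT miss, a handler would typically use the saved guest-physical address to locate the missing SPPT entry and populate it before resuming the guest. That policy belongs to the VMM and is not specified here.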

3.5.3.1 Sub-Page Permissions and EPT Violations

Memory writes that consult the SPPT but are not permitted by it cause EPT violations in the normal way.

For memory writes that access memory across sub-page regions on the same 4K page, the processor will check writeability of both sub-pages and will generate an EPT violation if either of the accessed sub-page regions is not writeable.

For memory writes that access adjoining 4-KByte pages, the processor may ignore the SPP bit in either of the EPT PTEs that map those pages (operating as if it were 0).

Sub-page write permissions are intended principally for simple instructions (such as AND, MOV, OR, TEST, XCHG, INC, and XOR). Execution of an instruction that normally performs multiple memory writes may ignore the sub-page permissions and cause EPT violations unconditionally if an accessed page is mapped with an EPT PTE in which the W bit is 0.

Accesses to any guest-physical address that translates to an address on the APIC-access page, and for which the VMM has also specified sub-page permissions, may operate as if the "virtualize APIC accesses" VM-execution control is 0.

Processor writes to guest paging structures (to set accessed and dirty flags) ignore sub-page permissions and always cause EPT violations when attempting to write to guest-physical addresses to which EPT does not allow writes. The same is true for processor reads of guest paging structures (during linear-address translation) if accessed and dirty flags for EPT are enabled. (This is because, when accessed and dirty flags for EPT are enabled, processor reads of guest paging structures are treated as writes.)
3.5.4 Invalidating Cached SPP Permissions

Sub-page permissions may be cached by the CPU. After any modification to the sub-page permissions specified in SPPT entries, software must use INVEPT to invalidate the cached information. The EPTP switching VM function may flush any information cached about sub-page permissions, as well as intermediate EPT and SPPT caches.

3.5.5 Sub-Page Permission Interaction with Intel® TSX

Instructions that begin or execute within a transactional region may attempt to write to guest-physical addresses to which EPT does not allow writes. Such cases result in transactional aborts.

This behavior is retained even with sub-page permissions. A write by an instruction that begins or executes within a transactional region ignores sub-page permissions and causes a transactional abort if EPT does not allow writes to the guest-physical address.

3.5.6 Sub-Page Permission Interaction with Intel® SGX

A VMM cannot access memory in the enclave page cache (EPC) and cannot easily determine how to protect those pages selectively with SPP.

The checking of sub-page permissions takes priority over EPC-specific access control. Memory writes by an enclave to addresses within the enclave’s ELRANGE ignore sub-page permissions and will cause EPT violations when made to guest-physical addresses to which EPT does not allow writes. The same is true for writes to the EPC by Intel SGX instructions. Memory writes by an enclave to addresses outside its ELRANGE are treated normally and may be allowed based on sub-page permissions.

The fault behavior is summarized in Table 3-4 below.

<table>
<thead>
<tr>
<th>ID</th>
<th>Enclave Access</th>
<th>APIC Access</th>
<th>In EPC</th>
<th>EPTE.W</th>
<th>EPTE.SPP</th>
<th>Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>NA</td>
<td>X</td>
<td>X</td>
<td>See note 1.</td>
</tr>
<tr>
<td>2</td>
<td>0</td>
<td>1</td>
<td>NA</td>
<td>X</td>
<td>0</td>
<td>See note 2.</td>
</tr>
<tr>
<td>3</td>
<td>0</td>
<td>1</td>
<td>NA</td>
<td>1</td>
<td>1</td>
<td>See note 3.</td>
</tr>
<tr>
<td>4</td>
<td>0</td>
<td>1</td>
<td>NA</td>
<td>0</td>
<td>1</td>
<td>See note 4.</td>
</tr>
<tr>
<td>5</td>
<td>1</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>0</td>
<td>See note 5.</td>
</tr>
<tr>
<td>6</td>
<td>1</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>1</td>
<td>See note 6.</td>
</tr>
</tbody>
</table>

NOTES:
1. Fault behavior as per SPP architecture described in this chapter.
2. Fault behavior as per the APIC virtualization architecture.
3. (SPP is ignored since EPT is writeable)
   If violation of EPT permissions then EPT violation
   Else If Implementation_supports_vAPIC_AND_SPP_Together
     Then APIC redirection or exit
   Else Access Allowed
4. If violation of EPT permissions then EPT violation exit
   Else If Implementation_supports_vAPIC_AND_SPP_Together
     If write access then EPT violation
     Else APIC redirection or exit
   Else If write access - fault behavior per SPP architecture in this specification
   Else If read/execute access - access allowed

Ref. # 319433-034
5. If violation of EPT permissions then EPT violation
   Else If PA not in EPC then #PF
   Else If PA matches APIC access page then #PF
   Else If violation of EPCM permissions then #PF
   Else Access Allowed

6. If violation of EPT permissions – EPT violation
   Else EPT violation (SPP on enclave access)

3.5.7 Memory Type Used for Accessing SPPT

The memory type used for any access to the SPPT is the memory type reported in the IA32_VMX_BASIC MSR. Bits 53:50 of the IA32_VMX_BASIC MSR report the memory type that the processor uses to access the VMCS and data structures referenced by pointers in the VMCS. Software should ensure that the VMCS and referenced data structures are located at physical addresses that are mapped to WB memory type by the MTRRs.
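As an illustration, the memory-type field can be extracted from a value already read from the IA32_VMX_BASIC MSR as shown below. The function and macro names are hypothetical. The encoding 6 for write-back (WB) is the one defined in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

```c
#include <stdint.h>

#define VMX_BASIC_MEMTYPE_WB 6u   /* write-back encoding, per the SDM */

/* Memory type reported in bits 53:50 of the IA32_VMX_BASIC MSR value. */
unsigned vmx_basic_memtype(uint64_t vmx_basic)
{
    return (unsigned)((vmx_basic >> 50) & 0xFu);
}
```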

3.6 Changes to VM Entries

If the "activate secondary controls" and "sub-page write permission" VM-execution controls are both 1, VM entries ensure that the "enable EPT" VM-execution control is 1. Additionally, the sub-page permission table control field is checked for consistency per Section 3.5. VM entry fails if these checks fail. When such a failure occurs, control is passed to the next instruction, RFLAGS.ZF is set to 1 to indicate the failure, and the VM-instruction error field is loaded with value 7, indicating "VM entry with invalid control field(s)". This check may be performed in any order with respect to other checks on VMX controls and the host-state area. Different processors may thus give different error numbers for the same VMCS.

3.7 Changes to VMX Capability Reporting

Section 3.2 specified that secondary processor-based VM-execution control 23 is defined as "sub-page write permission". A processor that supports the 1-setting of the control sets bit 55 of the IA32_VMX_PROCBASED_CTLS2 MSR (index 48BH). RDMSR of that MSR returns 1 in bit 23 of EDX.
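The relationship between control bit 23, MSR bit 55, and EDX bit 23 can be made concrete with a short sketch. The helper names are hypothetical, and the MSR value is assumed to have been read elsewhere with RDMSR (EDX holds bits 63:32, EAX holds bits 31:0).

```c
#include <stdbool.h>
#include <stdint.h>

#define IA32_VMX_PROCBASED_CTLS2  0x48Bu
#define SECONDARY_CTL_SPP         23u   /* "sub-page write permission" */

/* True if the 64-bit capability MSR value reports that VM-execution
 * control `bit` may be set to 1 (bit 32 + `bit` of the MSR is 1). */
bool vmx_allowed1(uint64_t cap_msr, unsigned bit)
{
    return bit < 32u && ((cap_msr >> (32u + bit)) & 1u) != 0;
}

/* The same test expressed on the EDX half returned by RDMSR. */
bool vmx_allowed1_edx(uint32_t edx, unsigned bit)
{
    return bit < 32u && ((edx >> bit) & 1u) != 0;
}
```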
4.1 INTRODUCTION

Intel® Processor Trace (Intel® PT) is an extension of Intel® Architecture that captures information about software execution using dedicated hardware facilities that cause only minimal performance perturbation to the software being traced. Details on the Intel PT infrastructure and trace capabilities can be found in the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C*.

This chapter describes the architecture for VMX support improvements made for Intel PT. The suite of architecture changes described below serves to simplify the process of virtualizing Intel PT for use by guest software. There are two primary elements to this new architecture support.

1. **Addition of a new guest IA32_RTIT_CTL value field to the VMCS.** — This serves to speed and simplify the process of disabling trace on VM exit, and restoring it on VM entry.
2. **Enabling use of EPT to redirect PT output.** — This enables the VMM to elect to virtualize the PT output buffer using EPT. In this mode, the CPU will treat PT output addresses as Guest Physical Addresses (GPAs) and translate them using EPT. This means that output reads (of the ToPA table), output writes (of trace output), and other output events can cause EPT violations.

4.2 ARCHITECTURE DETAILS

4.2.1 IA32_RTIT_CTL in VMCS Guest State

A new 64-bit field will be added to the VMCS Guest State, to hold the value of IA32_RTIT_CTL. This field will use encodings 2814H and 2815H. On VM exit, the MSR value will be written to this field unconditionally. Additionally, there are two new controls to govern use of this field; see Table 4-1 below.

<table>
<thead>
<tr>
<th>Name</th>
<th>Position</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clear IA32_RTIT_CTL on exit.</td>
<td>Exit control 25.</td>
<td>When set, the IA32_RTIT_CTL MSR will be cleared on VM exit, after it has been saved. This disables PT before entering the VMX host.</td>
</tr>
<tr>
<td>Load IA32_RTIT_CTL on entry.</td>
<td>Entry control 18.</td>
<td>When set, the IA32_RTIT_CTL MSR will be written with the value of the associated Guest State field of the VMCS on VM entry. This restores PT before entering the guest. VM entry fails if the value to be loaded sets reserved bits or a reserved value in an encoded field.</td>
</tr>
</tbody>
</table>

4.2.2 Supporting EPT for Trace Output

In order to enable use of EPT to redirect PT trace output, a new secondary processor-based VM-execution control is added; see Table 4-2 below.

<table>
<thead>
<tr>
<th>Name</th>
<th>Position</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Guest PT uses Guest Physical Addresses.</td>
<td>Execution control 24.</td>
<td>When set, all PT output addresses, including those in the IA32_RTIT_OUTPUT_BASE MSR and in ToPA tables, will be treated as guest physical addresses (GPAs) and translated with EPT.</td>
</tr>
</tbody>
</table>
Setting this new VM-execution control to 1 requires also setting the VM-exit and VM-entry controls described above in Table 4-1. This ensures that PT is disabled before entering root operation, where EPT does not apply. See details on new consistency checks in Section 4.2.3.

4.2.2.1 VM Exits Due to Intel PT Output

Treating PT output addresses as guest-physical addresses introduces the possibility of taking events on PT output reads and writes. Event possibilities include EPT violations, EPT misconfigurations, PML log-full VM exits, and APIC access VM exits.

Exit Qualification

Intel PT output reads and writes are asynchronous to instruction execution, as a result of the internal buffering of trace data. Trace packets are output some unpredictable number of cycles after the completion of the instructions or events that generated them. For this reason, any VM exit caused by Intel PT output will set the following new exit qualification bit.

<table>
<thead>
<tr>
<th>Name</th>
<th>Position</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Asynchronous to Instruction Execution.</td>
<td>Exit qualification bit 16 for EPT violations, PML log-full VM exits, and APIC-access VM exits due to guest-physical accesses.</td>
<td>This VM exit results neither from the instruction referenced by the RIP saved into the VMCS, nor from any event delivery recorded in the VMCS IDT-vectoring fields.</td>
</tr>
</tbody>
</table>

There is no guest linear address relevant for EPT violations resulting from Intel PT output reads and writes. For this reason, these VM exits clear bit 7 of the exit qualification, which is set only if the guest linear-address is valid.
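A handler that needs to distinguish exits caused by Intel PT output could test these two exit qualification bits as sketched below. The function names are hypothetical. Bit 16 applies to the exit types listed in the table above, and bit 7 is meaningful for EPT violations.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 16: the VM exit is asynchronous to instruction execution. */
bool exit_is_async_to_instruction(uint64_t qualification)
{
    return ((qualification >> 16) & 1u) != 0;
}

/* Bit 7 (EPT violations): the guest linear-address field is valid. */
bool ept_violation_gla_valid(uint64_t qualification)
{
    return ((qualification >> 7) & 1u) != 0;
}
```

When the asynchronous bit is set, the saved RIP and the IDT-vectoring fields do not identify the cause of the exit, so a handler should not attempt to emulate or skip the instruction at that RIP.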

Preserving Pending Events

A VM entry that enables Intel PT can cause an immediate VM exit, if PT output is configured to use GPA addressing and the access to the page causes a VM exit (e.g., EPT violation). This VM exit will be taken after the completion of the VM entry, but before other events which may be pending or injected by the VM entry. To ensure that no events are lost, VM exits caused by PT output will take the following measures.

- The guest pending debug exceptions field in the VMCS is not cleared, and the value saved will match the behavior of existing VM exits (e.g., INIT) that do not clear the field.
- The VMCS VM-entry interrupt information field is saved to the VMCS IDT-vectoring information field. This serves to simplify the process of re-injecting the event on the next VM entry. Note that this introduces a scenario where Pending MTF VM exit can be set in the IDT-vectoring information field.

Additional VM Exits

EPT violations caused by Intel PT output will always cause VM exits; virtualization exceptions (#VEs) are not supported.

Intel PT output accesses to the APIC-access page cause VM exits unconditionally, with no virtualization by the processor. This is consistent with other guest-physical accesses to the APIC-access page.

If the "Guest PT uses Guest Physical Addresses" VM-execution control is 1 and IA32_RTIT_CTL.TraceEn = 1, any invocation of the VM function 0 (EPTP switching) causes a VM exit. The VM exit gives a VMM the opportunity to disable tracing (if desired for certain EPT contexts) and ensures that the processor does not retain a PT-specific EPT-based translation across a change of EPTP. Reporting is the same as any VM exit caused by a VM function, setting the basic exit reason to 59 (indicating "VMFUNC") and saving the length of the VMFUNC instruction into the VM-exit instruction-length field.
4.2.2.2 Trace Data Management with Output Events

Because PT packet data is buffered within the CPU before being written out through the memory subsystem or other trace transport mechanism, the CPU takes measures to ensure that buffered trace data is not lost on the PT disable during VM exit. This requires ensuring that there is sufficient space left in the current output page to write out the buffer. Without such care, buffered trace data could be lost, and the resulting trace corrupted.

The CPU will employ an early page lookup mechanism in order to avoid trace corruption. It will try to cache the physical addresses (PAs) of the current PT output block and the next PT output block, in order to ensure no event is needed when transitioning from the current block to the next. An output block is defined as the smaller of the EPT page and the PT output buffer segment, which is either a ToPA output region or the single-range output buffer. Using this scheme, the CPU will always look up the translation for the next block when it begins writing the current block, so that any events needed in order to translate the next block base address can be taken long before writes to that next block commence.

When PT is enabled, the CPU will look up the first two output block translations and cache the resulting PAs internally. PT enable flows include WRMSR (as well as loads from the MSR-load areas by VMX transitions), XRSTORS, VM entry, and RSM.

If either EPT lookup requires a VM exit, the exit will be taken before tracing begins. However, the value of IA32_RTIT_CTL saved into the new VMCS field will have the new value, with TraceEn set. This ensures that the subsequent VM entry will try again to enable PT.

These VM exits resulting from the use of Intel PT are taken after the completion of the current instruction or operation. On VM entry, any Intel PT-induced VM exit will be taken after transition to the guest completes, but before any event injection or guest instructions execute.

Once the PAs for the first two output blocks are cached (this could require multiple events, and hence multiple VM exits/VM entries), tracing will commence. Henceforth, any time an output block is filled with trace data, output will transition to the next (cached) output block, and the CPU will look up the EPT translation for the output block that follows the new current block. Here again, an event may need to be taken, which would result in a VM exit. If the lookup encounters a ToPA entry with the STOP bit set, it will cease to look up further entries beyond that entry.

This early page lookup mechanism serves to reduce the likelihood that the trace could fill all available, translated output blocks. The CPU should typically have the current and next block cached and ready for output. In cases where trace data nonetheless has to be dropped, which could happen if an EPT violation VM exit for the next page translation is not taken for an extended period of time, the CPU will signal an internal buffer overflow and drop packets until the new translation can be cached.

4.2.2.3 Intel PT Output Errors

Improper configuration of Intel PT output can result in operation errors that cause tracing to be disabled. See the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, Section 35.3.9, “Operational Errors” for details.

When Intel PT output is redirected using EPT, all address-based checks continue to be executed using the guest physical address specified in the ToPA table or MSR, with one exception. Checks against restricted memory (see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C, Section 35.2.6.4, “Restricted Memory Access” for details) are done using the translated, platform physical address to which output will be written.
4.2.3 New VM-Entry Consistency Checks

The following consistency checks will cause the VM entry to fall through to the next sequential instruction, and RFLAGS.ZF to be set, if failed.

- If the “Guest PT uses Guest Physical Addresses” execution control is 1, the “Clear IA32_RTIT_CTL on exit” exit control and the “Load IA32_RTIT_CTL on entry” entry control must also be 1. This ensures that the processor will not switch from treating Intel PT output addresses as GPAs to treating them as PPAs.
- If the “Guest PT uses Guest Physical Addresses” execution control is 1, the “enable EPT” execution control must also be 1.

If the following consistency check fails, VM entry fails by loading processor state from the guest-state area of the VMCS.

- If the “Load IA32_RTIT_CTL on entry” entry control is 1, IA32_RTIT_CTL.TraceEn must be zero.

The lower 16 bits of the exit reason VMCS field will hold value 33, indicating failure due to invalid guest state.
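A VMM can mirror these checks in software before attempting VM entry. The sketch below is illustrative only. The macro and function names are hypothetical, and the bit positions come from Tables 4-1 and 4-2. The position of the "enable EPT" control (secondary control bit 1) and of TraceEn (bit 0 of IA32_RTIT_CTL) are taken from the Intel® 64 and IA-32 Architectures Software Developer’s Manual. The TraceEn check is read here as applying to the current value of the MSR at VM entry, which is an interpretation of the text above.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECONDARY_ENABLE_EPT        (1u << 1)   /* per the SDM */
#define SECONDARY_PT_USES_GPA       (1u << 24)  /* Table 4-2 */
#define VM_EXIT_CLEAR_RTIT_CTL      (1u << 25)  /* Table 4-1 */
#define VM_ENTRY_LOAD_RTIT_CTL      (1u << 18)  /* Table 4-1 */
#define RTIT_CTL_TRACEEN            (1ULL << 0) /* per the SDM */

/* Control-field checks: a failure makes VM entry fall through with
 * RFLAGS.ZF set. Returns true if the controls are consistent. */
bool pt_controls_consistent(uint32_t secondary, uint32_t exit_ctls,
                            uint32_t entry_ctls)
{
    if (!(secondary & SECONDARY_PT_USES_GPA))
        return true;                      /* checks apply only when set */
    return (exit_ctls & VM_EXIT_CLEAR_RTIT_CTL) &&
           (entry_ctls & VM_ENTRY_LOAD_RTIT_CTL) &&
           (secondary & SECONDARY_ENABLE_EPT);
}

/* Guest-state style check: a failure is reported with basic exit
 * reason 33. `rtit_ctl_msr` is the assumed current MSR value. */
bool pt_entry_traceen_ok(uint32_t entry_ctls, uint64_t rtit_ctl_msr)
{
    if (!(entry_ctls & VM_ENTRY_LOAD_RTIT_CTL))
        return true;
    return (rtit_ctl_msr & RTIT_CTL_TRACEEN) == 0;
}
```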

4.2.3.1 Special Treatment for SMM VM Exits

The consistency checks above do not ensure that an SMM VM exit that occurs with the 1-setting of the “Guest PT uses Guest Physical Addresses” VM-execution control will find the “Clear IA32_RTIT_CTL on exit” VM-exit control set to 1. For this reason, such VM exits always clear the IA32_RTIT_CTL MSR, regardless of the setting of the VM-exit control.

4.3 Enumeration

Section 4.2 identified three new controls in the VMCS. The following paragraphs provide details of how processors enumerate support for those controls:

- “Guest PT uses Guest Physical Addresses” is a new secondary processor-based VM-execution control, located at bit position 24. Processors supporting the 1-setting of this control enumerate that support by setting bit 56 of the IA32_VMX_PROCBASED_CTLS2 MSR (index 48BH).
- “Clear IA32_RTIT_CTL on exit” is a new VM-exit control, located at bit position 25. Processors supporting the 1-setting of this control enumerate that support by setting bit 57 in both the IA32_VMX_EXIT_CTLS MSR (index 483H) and the IA32_VMX_TRUE_EXIT_CTLS MSR (index 48FH).
- “Load IA32_RTIT_CTL on entry” is a new VM-entry control, located at bit position 18. Processors supporting the 1-setting of this control enumerate that support by setting bit 50 in both the IA32_VMX_ENTRY_CTLS MSR (index 484H) and the IA32_VMX_TRUE_ENTRY_CTLS MSR (index 490H).
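The three enumeration bits listed above can be gathered into one summary, as in the following sketch. The structure and function names are hypothetical. The MSR values are assumed to have been read elsewhere, and either the regular or the TRUE variant of the exit and entry capability MSRs may be passed since both report the bits.

```c
#include <stdbool.h>
#include <stdint.h>

struct pt_vmx_support {
    bool guest_pt_uses_gpa;       /* IA32_VMX_PROCBASED_CTLS2 bit 56 */
    bool clear_rtit_ctl_on_exit;  /* IA32_VMX_EXIT_CTLS bit 57 */
    bool load_rtit_ctl_on_entry;  /* IA32_VMX_ENTRY_CTLS bit 50 */
};

/* Decode support for the three Intel PT controls from capability MSR values. */
struct pt_vmx_support pt_vmx_enumerate(uint64_t procbased_ctls2,
                                       uint64_t exit_ctls,
                                       uint64_t entry_ctls)
{
    struct pt_vmx_support s;
    s.guest_pt_uses_gpa      = ((procbased_ctls2 >> 56) & 1u) != 0;
    s.clear_rtit_ctl_on_exit = ((exit_ctls >> 57) & 1u) != 0;
    s.load_rtit_ctl_on_entry = ((entry_ctls >> 50) & 1u) != 0;
    return s;
}
```

Since the "Guest PT uses Guest Physical Addresses" control requires the other two controls to be 1, a VMM would normally enable EPT-based PT output only when all three members are true.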
5.1 HARDWARE FEEDBACK INTERFACE

Hardware provides guidance to the OS scheduler to perform optimal workload scheduling through a hardware feedback interface structure in memory. This structure has a global header that is 16 bytes in size. Following this global header, there is one 8-byte entry per logical processor in the socket. The structure is laid out as follows.

<table>
<thead>
<tr>
<th>Byte Offset</th>
<th>Size (Bytes)</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>16</td>
<td>Global Header</td>
</tr>
<tr>
<td>16</td>
<td>8</td>
<td>LP0 Scheduler Feedback</td>
</tr>
<tr>
<td>24</td>
<td>8</td>
<td>LP1 Scheduler Feedback</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>16 + n*8</td>
<td>8</td>
<td>LPn Scheduler Feedback</td>
</tr>
</tbody>
</table>

The global header is structured as follows.

<table>
<thead>
<tr>
<th>Byte Offset</th>
<th>Size (Bytes)</th>
<th>Field Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>8</td>
<td>Timestamp</td>
<td>Timestamp of when the table was last updated by hardware. This is a timestamp in crystal clock units. Initialized by software to 0.</td>
</tr>
<tr>
<td>8</td>
<td>1</td>
<td>Perf Capability Changed</td>
<td>If set to 1, indicates the performance capability field for one or more logical processors was updated in the table. Initialized by software to 0.</td>
</tr>
<tr>
<td>9</td>
<td>1</td>
<td>Energy Efficiency Capability Changed</td>
<td>If set to 1, indicates the energy efficiency capability field for one or more logical processors was updated in the table. Initialized by software to 0.</td>
</tr>
<tr>
<td>10</td>
<td>6</td>
<td>Reserved</td>
<td>Initialized by software to 0.</td>
</tr>
</tbody>
</table>

The per logical processor scheduler feedback entry is structured as follows.

<table>
<thead>
<tr>
<th>Byte Offset</th>
<th>Size (Bytes)</th>
<th>Field Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>1</td>
<td>Performance Capability</td>
<td>Perf capability is an 8-bit value (0 ... 255) specifying the relative performance level of a logical processor. Higher values indicate higher performance; the lowest perf level of 0 indicates that the OS should idle the core and not schedule any software threads on it. Initialized by software to 0.</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>Energy Efficiency Capability</td>
<td>Energy efficiency capability is an 8-bit value (0 ... 255) specifying the relative energy efficiency level of a logical processor. Higher values indicate higher energy efficiency; the lowest energy efficiency capability of 0 indicates that the OS should idle the core and not schedule any software threads on it for efficiency reasons. Initialized by software to 0.</td>
</tr>
<tr>
<td>2</td>
<td>6</td>
<td>Reserved</td>
<td>Initialized by software to 0.</td>
</tr>
</tbody>
</table>
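The layout given in the tables above maps naturally onto C structures. The following is a sketch under stated assumptions: the type and function names are hypothetical, and real OS code would also need to treat the table as memory that hardware updates asynchronously (for example, by using volatile accesses).

```c
#include <stddef.h>
#include <stdint.h>

/* 16-byte global header at byte offset 0 of the structure. */
struct hfi_global_header {
    uint64_t timestamp;         /* crystal clock units */
    uint8_t  perf_cap_changed;  /* byte offset 8 */
    uint8_t  ee_cap_changed;    /* byte offset 9 */
    uint8_t  reserved[6];
};

/* 8-byte scheduler feedback entry, one per table row. */
struct hfi_lp_entry {
    uint8_t perf_cap;           /* 0..255, higher is faster */
    uint8_t ee_cap;             /* 0..255, higher is more efficient */
    uint8_t reserved[6];
};

_Static_assert(sizeof(struct hfi_global_header) == 16, "header is 16 bytes");
_Static_assert(sizeof(struct hfi_lp_entry) == 8, "entry is 8 bytes");

/* Byte offset of row n: 16 + n * 8. */
size_t hfi_entry_offset(unsigned row)
{
    return sizeof(struct hfi_global_header) +
           (size_t)row * sizeof(struct hfi_lp_entry);
}

/* Pointer to row n within a mapped copy of the table. */
const struct hfi_lp_entry *hfi_entry(const void *table, unsigned row)
{
    return (const struct hfi_lp_entry *)
           ((const uint8_t *)table + hfi_entry_offset(row));
}
```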
5.1.1 Hardware Feedback Interface Pointer

The physical address of the hardware feedback interface structure is programmed by the OS into a package scoped MSR named IA32_HW_FEEDBACK_PTR. The MSR is structured as follows:

- Bits 63:MAXPHYADDR¹ - Reserved.
- Bits MAXPHYADDR-1:12 - ADDR. This is the physical address of the page frame of the first page of this structure.
- Bits 11:1 - Reserved.
- Bit 0 - Valid. When set to 1, indicates a valid pointer is programmed into the MSR.

The address of this MSR is 17D0H.

See Section 5.1.4 for details on how the OS detects the size of memory to allocate for this structure. This MSR is cleared on reset to its default value of 0. The MSR retains its state on INIT.
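The value written to IA32_HW_FEEDBACK_PTR can be composed from a page-aligned physical address as sketched below. The function name is hypothetical. The sketch returns 0 (a value with the Valid bit clear) when the address does not fit the ADDR field, which is a design choice of this example and not something this specification requires.

```c
#include <stdint.h>

#define IA32_HW_FEEDBACK_PTR  0x17D0u

/* Build the MSR value for a table at physical address `table_pa`.
 * `maxphyaddr` is the width from CPUID.80000008H:EAX[7:0].
 * Returns 0 if the address is unaligned or sets reserved bits. */
uint64_t hfi_ptr_msr_value(uint64_t table_pa, unsigned maxphyaddr)
{
    if (maxphyaddr < 13u || maxphyaddr > 52u)
        return 0;                       /* no usable ADDR field */
    if (table_pa & 0xFFFu)
        return 0;                       /* must be 4-KByte aligned */
    if (table_pa >> maxphyaddr)
        return 0;                       /* bits 63:MAXPHYADDR are reserved */
    return table_pa | 1u;               /* ADDR plus Valid (bit 0) */
}
```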

5.1.2 Hardware Feedback Interface Configuration

The operating system enables the hardware feedback interface using a package scoped MSR named IA32_HW_FEEDBACK_CONFIG (address 17D1H).

The MSR is structured as follows:

- Bits 63:1 - Reserved.
- Bit 0 - Enable. When set to 1, enables the hardware feedback interface.

This MSR is cleared on reset to its default value of 0. The MSR retains its state on INIT.

When the Enable bit transitions from 1 to 0, hardware sets the IA32_PACKAGE_THERM_STATUS bit 26 to 1 to acknowledge disabling of the interface. Software should wait for this bit to be set to 1 after disabling the interface before reclaiming the memory allocated for this structure. When this bit is set to 1, it is safe to reclaim the memory as it is guaranteed that there are no writes in progress to this structure by hardware.

SENTER clears the enable bit to 0 on all sockets.

5.1.3 Hardware Feedback Interface Notifications

The IA32_PACKAGE_THERM_STATUS MSR is extended with a new bit, hardware feedback interface structure change status (bit 26, R/WC0), to indicate that the hardware has updated the hardware feedback interface structure. This is a sticky bit and once set, indicates that the OS scheduler should read the structure to determine the change and adjust its scheduling decisions. Once set, the hardware will not generate any further updates to this structure until the OS clears this bit by writing 0. The hardware guarantees that all writes to the hardware feedback interface structure are globally observed.

The OS can enable interrupt-based notifications when the structure is updated by hardware through a new enable bit, hardware feedback interrupt enable (bit 25, R/W), in the IA32_PACKAGE_THERM_INTERRUPT MSR. When this bit is set to 1, it enables the generation of an interrupt when the hardware feedback interface structure is updated by hardware.

5.1.4 Hardware Feedback Interface Enumeration

CPUID.06H.0H:EAX.HW_FEEDBACK[bit 19] enumerates support for this feature. When this bit is enumerated to 1, the following MSRs (or bits in MSRs) are supported by the hardware:

- IA32_HW_FEEDBACK_PTR (address 17D0H)
- IA32_HW_FEEDBACK_CONFIG (address 17D1H)
- IA32_PACKAGE_THERM_STATUS bit 26
- IA32_PACKAGE_THERM_INTERRUPT bit 25

¹ MAXPHYADDR is reported in CPUID.80000008H:EAX[7:0].

When CPUID.06H.0H:EAX.HW_FEEDBACK[bit 19] = 1, then CPUID.06H.0H:EDX reports the following:

- **EDX[7:0]** - Bitmap of supported hardware feedback interface capabilities.
  - Bit 0: When set to 1, indicates support for performance capability reporting.
  - Bit 1: When set to 1, indicates support for energy efficiency capability reporting.
  - Other bits are reserved.

- **EDX[11:8]** - Enumerates the size of the hardware feedback interface structure in number of 4 KB pages using minus-one notation.

- **EDX[31:16]** - Index (starting at 0) of this logical processor's row in the hardware feedback interface structure. Note that the index may be the same for multiple logical processors on some parts. On some parts the indices may not be contiguous; that is, there may be unused rows in the table.
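The enumeration fields can be decoded from register values returned by CPUID leaf 06H, sub-leaf 0, as in the sketch below. The names are hypothetical, and the register values are assumed to have been obtained elsewhere by executing CPUID on the logical processor of interest.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hfi_cpuid_info {
    bool     perf_cap;   /* EDX bit 0 */
    bool     ee_cap;     /* EDX bit 1 */
    unsigned pages;      /* EDX[11:8] + 1 (minus-one notation) */
    size_t   bytes;      /* pages * 4096 */
    unsigned row;        /* EDX[31:16], this logical processor's row */
};

/* CPUID.06H.0H:EAX bit 19 enumerates the hardware feedback interface. */
bool hfi_supported(uint32_t eax)
{
    return ((eax >> 19) & 1u) != 0;
}

/* Decode CPUID.06H.0H:EDX; meaningful only if hfi_supported() is true. */
struct hfi_cpuid_info hfi_decode_edx(uint32_t edx)
{
    struct hfi_cpuid_info info;
    info.perf_cap = (edx & 1u) != 0;
    info.ee_cap   = ((edx >> 1) & 1u) != 0;
    info.pages    = ((edx >> 8) & 0xFu) + 1u;
    info.bytes    = (size_t)info.pages * 4096u;
    info.row      = edx >> 16;
    return info;
}
```

Because rows may be shared or non-contiguous, software should index the table with the row value reported for each logical processor instead of assuming that row n belongs to logical processor n.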
A new control bit (bit 29) in the TEST_CTRL MSR will be introduced in future processors based on Tremont and Ice Lake microarchitectures to enable detection of split locks.

When bit 29 of the TEST_CTRL MSR is set, the processor causes an #AC(0) exception for split locked accesses at all CPL irrespective of CR0.AM or EFLAGS.AC. A previous control bit (bit 31) in this MSR causes the processor to disable LOCK# assertion for split locked accesses when set. When bits 29 and 31 are both set, bit 29 takes precedence.

**Table 6-1. TEST_CTL MSR Details**

<table>
<thead>
<tr>
<th>Register Address</th>
<th>Register Name / Bit Fields</th>
<th>Bit Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hex, Dec</td>
<td></td>
<td></td>
</tr>
<tr>
<td>33H, 51</td>
<td>TEST_CTL</td>
<td>Test Control Register</td>
</tr>
<tr>
<td></td>
<td>28:0</td>
<td>Reserved</td>
</tr>
<tr>
<td></td>
<td>29</td>
<td>Enable #AC(0) exception for split locked accesses: Cause #AC(0) exception for split locked access at all CPL irrespective of CR0.AM or EFLAGS.AC. If bits 29 and 31 are both set, bit 29 takes precedence.</td>
</tr>
<tr>
<td></td>
<td>30</td>
<td>Reserved</td>
</tr>
<tr>
<td></td>
<td>31</td>
<td>Disable LOCK# assertion for split locked access.</td>
</tr>
</tbody>
</table>
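The interaction of bits 29 and 31 can be summarized with a small decoding sketch. The macro, enumeration, and function names are hypothetical. The register address 33H and the bit positions are those given in Table 6-1.

```c
#include <stdint.h>

#define MSR_TEST_CTRL_ADDR           0x33u
#define TEST_CTRL_SPLIT_LOCK_AC      (1ULL << 29) /* #AC(0) on split locked access */
#define TEST_CTRL_DISABLE_LOCK       (1ULL << 31) /* no LOCK# for split locked access */

enum split_lock_mode {
    SPLIT_LOCK_DEFAULT,        /* neither bit set */
    SPLIT_LOCK_RAISE_AC,       /* bit 29 set (takes precedence) */
    SPLIT_LOCK_NO_LOCK_ASSERT  /* only bit 31 set */
};

/* Effective split-lock handling for a given MSR value. */
enum split_lock_mode split_lock_mode_of(uint64_t test_ctrl)
{
    if (test_ctrl & TEST_CTRL_SPLIT_LOCK_AC)
        return SPLIT_LOCK_RAISE_AC;
    if (test_ctrl & TEST_CTRL_DISABLE_LOCK)
        return SPLIT_LOCK_NO_LOCK_ASSERT;
    return SPLIT_LOCK_DEFAULT;
}
```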
INDEX

B
Brand information 1-35
   processor brand index 1-37
   processor brand string 1-35

C
Cache and TLB information 1-30
   Cache Inclusiveness 1-9
   CLFLUSH instruction
      CPUID flag 1-29
   CMOVcc flag 1-29
   CMOVcc instructions
      CPUID flag 1-29
   CMPXCHG16B instruction
      CPUID bit 1-27
   CMPXCHG8B instruction
      CPUID flag 1-29
   CPUID instruction 1-7, 1-29
      36-bit page size extension 1-29
      APIC on-chip 1-29
      basic CPUID information 1-8
      cache and TLB characteristics 1-8, 1-30
      CLFLUSH flag 1-29
      CLFLUSH instruction cache line size 1-25
   CMPXCHG16B flag 1-27
   CMPXCHG8B flag 1-29
   CPL qualified debug store 1-26
   debug extensions, CR4.DE 1-28
   debug store supported 1-29
   deterministic cache parameters leaf 1-8, 1-11, 1-12, 1-13, 1-14, 1-15, 1-16, 1-17, 1-18
   extended function information 1-21
   feature information 1-28
   FPU on-chip 1-28
   FXSAVE flag 1-29
   FXRSTOR flag 1-29
   IA-32e mode available 1-22
   input limits for EAX 1-23
   L1 Context ID 1-27
   local APIC physical ID 1-25
   machine check architecture 1-29
   machine check exception 1-29
   memory type range registers 1-29
   MONITOR feature information 1-33
   MONITOR/MWAIT flag 1-26
   MONITOR/MWAIT leaf 1-9, 1-10, 1-11, 1-13, 1-19
   MWAIT feature information 1-33
   page attribute table 1-29
   page size extension 1-28
   performance monitoring features 1-33
   physical address bits 1-23
   physical address extension 1-29
   power management 1-33, 1-34, 1-35
   processor brand index 1-25, 1-35
   processor brand string 1-22, 1-35
   processor serial number 1-29
   processor type field 1-25
   RDMSR flag 1-28
   returned in EBX 1-25
   returned in ECX & EDX 1-25
   self snoop 1-30
   SpeedStep technology 1-26
   SSE2 extensions flag 1-30

SSE extensions flag 1-30
SSE3 extensions flag 1-26
SSSE3 extensions flag 1-26
SYSENTER flag 1-29
SYSEXIT flag 1-29
thermal management 1-33, 1-34, 1-35
thermal monitor 1-26, 1-29, 1-30
time stamp counter 1-28
using CPUID 1-7
vendor ID string 1-23
version information 1-8, 1-32
virtual 8086 Mode flag 1-28
virtual address bits 1-23
WRMSR flag 1-28

F
Feature information, processor 1-7
FXRSTOR instruction
  CPUID flag 1-29
FXSAVE instruction
  CPUID flag 1-29

I
IA-32e mode
  CPUID flag 1-22
Instruction set
  grouped by processor 1-1

L
L1 Context ID 1-27

M
Machine check architecture
  CPUID flag 1-29
description 1-29
MMX instructions
  CPUID flag for technology 1-29
Model & family information 1-32
MONITOR instruction
  CPUID flag 1-26
feature data 1-33
MOV instruction (control registers) 2-12, 2-14
MWAIT instruction
  CPUID flag 1-26
feature data 1-33

P
Pending break enable 1-30
Performance-monitoring counters
  CPUID inquiry for 1-33

R
RDMSR instruction
  CPUID flag 1-28

S
Self Snoop 1-30
SpeedStep technology 1-26
SSE extensions
  CPUID flag 1-30
SSE2 extensions
  CPUID flag 1-30
SSE3
    CPUID flag 1-26
SSE3 extensions
    CPUID flag 1-26
SSSE3 extensions
    CPUID flag 1-26
Stepping information 1-32
SYSENTER instruction
    CPUID flag 1-29
SYSEXIT instruction
    CPUID flag 1-29

T
Thermal Monitor
    CPUID flag 1-30
Thermal Monitor 2 1-26
    CPUID flag 1-26
Time Stamp Counter 1-28

V
Version information, processor 1-7
VPMULTISHIFTQB – Select Packed Unaligned Bytes from Quadword Source 2-45

W
WBINVD instruction 2-70
WBINVD/INVD bit 1-9
WRMSR instruction
    CPUID flag 1-28

X
XFEATURE_ENABLED_MASK 1-4
XRSTOR 1-4, 1-34
XSAVE 1-4, 1-27, 1-34