CHAPTER 1
INTEL® ADVANCED VECTOR EXTENSIONS
1.1 About This Document .......................................................... 1-1
1.2 Overview .............................................................................. 1-1
1.3 Intel® Advanced Vector Extensions Architecture Overview .......... 1-2
1.3.1 256-Bit Wide SIMD Register Support .................................. 1-2
1.3.2 Instruction Syntax Enhancements ......................................... 1-3
1.3.3 VEX Prefix Instruction Encoding Support ............................ 1-4
1.4 Functional Overview .......................................................... 1-4
1.4.1 256-bit Floating-Point Arithmetic Processing Enhancements .. 1-5
1.4.2 256-bit Non-Arithmetic Instruction Enhancements ............... 1-5
1.4.3 Arithmetic Primitives for 128-bit Vector and Scalar processing ... 1-6
1.4.4 Non-Arithmetic Primitives for 128-bit Vector and Scalar Processing .. 1-6
1.5 General Encryption and Cryptographic Processing ..................... 1-7

CHAPTER 2
APPLICATION PROGRAMMING MODEL
2.1 Detection of PCLMULQDQ and AES Instructions ..................... 2-1
2.2 Detection of AVX and FMA Instructions .................................. 2-1
2.2.1 Detection of FMA ..................................................... 2-3
2.2.2 Detection of VEX-Encoded AES and VPCLMULQDQ .......... 2-4
2.3 Fused-Multiply-Add (FMA) Numeric Behavior ....................... 2-6
2.3.1 FMA Instruction Operand Order and Arithmetic Behavior .... 2-10
2.4 Accessing YMM Registers .................................................. 2-10
2.5 Memory Alignment .......................................................... 2-11
2.6 SIMD Floating-Point Exceptions .......................................... 2-13
2.7 Instruction Exception Specification ....................................... 2-14
2.7.1 Exceptions Type 1 (Aligned memory reference) .................. 2-18
2.7.2 Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned with VEX prefix) ... 2-19
2.7.3 Exceptions Type 3 (<16 Byte memory argument) ............... 2-20
2.7.4 Exceptions Type 4 (>=16 Byte, mem arg no alignment with VEX prefix, no floating-point exceptions) .................. 2-21
2.7.5 Exceptions Type 5 (<16 Byte mem arg and no FP exceptions). 2-22
2.7.6 Exceptions Type 6 (VEX-Encoded Instructions Without Legacy SSE Analogues) ... 2-23
2.7.7 Exceptions Type 7 (No FP exceptions, no memory arg) ........ 2-24
2.7.8 Exceptions Type 8 (AVX and no memory argument) .......... 2-25
2.7.9 Exception Type 9 (AVX) ............................................ 2-26
2.7.10 Exception Type 10 (>=16 Byte mem arg no alignment, no floating-point exceptions) 2-26
2.8 Programming Considerations with 128-bit SIMD instructions .... 2-28
2.8.1 Clearing Upper YMM State Between AVX and Legacy SSE Instructions ........ 2-28
2.8.2 Using AVX 128-bit Instructions Instead of Legacy SSE instructions .. 2-29
2.8.3 Unaligned Memory Access and Buffer Size Management ........ 2-29
2.9 CPUID Instruction ....................................................... 2-30
CPUID—CPU Identification .................................................. 2-31
CHAPTER 3
SYSTEM PROGRAMMING MODEL
3.1 YMM State, VEX Prefix and Supported Operating Modes ........................................... 3-1
3.2 YMM State Management .............................................................................................. 3-2
3.2.1 Detection of YMM State Support .............................................................................. 3-2
3.2.2 Enabling of YMM State ........................................................................................... 3-2
3.2.3 Enabling of SIMD Floating-Exception Support ......................................................... 3-3
3.2.4 The Layout of XSAVE Area ..................................................................................... 3-4
3.2.5 XSAVE/XRSTOR Interaction with YMM State and MXCSR .................................. 3-5
3.3 Reset Behavior ............................................................................................................ 3-6
3.4 Emulation ..................................................................................................................... 3-7
3.5 Writing AVX floating-point exception handlers ............................................................. 3-7

CHAPTER 4
INSTRUCTION FORMAT
4.1 Instruction Formats ..................................................................................................... 4-1
4.1.1 VEX and the LOCK prefix ...................................................................................... 4-2
4.1.2 VEX and the 66H, F2H, and F3H prefixes ............................................................... 4-2
4.1.3 VEX and the REX prefix ......................................................................................... 4-2
4.1.4 The VEX Prefix ..................................................................................................... 4-2
4.1.4.1 VEX Byte 0, bits[7:0] ......................................................................................... 4-6
4.1.4.2 VEX Byte 1, bit [7] - 'R' ...................................................................................... 4-6
4.1.4.3 3-byte VEX byte 1, bit[6] - 'X' ........................................................................... 4-6
4.1.4.4 3-byte VEX byte 1, bit[5] - 'B' ........................................................................... 4-6
4.1.4.5 3-byte VEX byte 2, bit[7] - 'W' ........................................................................... 4-6
4.1.4.6 2-byte VEX Byte 1, bits[6:3] and 3-byte VEX Byte 2, bits [6:3]- 'vvvv' the Source or dest Register Specifier ................................................................. 4-7
4.1.5 Instruction Operand Encoding and VEX.vvvv, ModR/M .......................................... 4-8
4.1.5.1 3-byte VEX byte 1, bits[4:0] - “m-mmmm” ......................................................... 4-11
4.1.5.2 2-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- "L" ......................... 4-11
4.1.5.3 2-byte VEX byte 1, bits[1:0], and 3-byte VEX Byte 2, bits [1:0]- "pp" ............... 4-12
4.1.6 The Opcode Byte ................................................................................................ 4-12
4.1.7 The MODRM, SIB, and Displacement Bytes ......................................................... 4-12
4.1.8 The Third Source Operand (Immediate Byte) ....................................................... 4-12
4.1.9 AVX Instructions and the Upper 128-bits of YMM registers .................................. 4-13
4.1.10 AVX Instruction Length ....................................................................................... 4-13

CHAPTER 5
INSTRUCTION SET REFERENCE
5.1 Interpreting Instruction Reference Pages .................................................................... 5-1
5.1.1 Instruction Format ................................................................................................ 5-1
VBROADCASTF128- Broadcast 128 Bits of Floating-Point Values (THIS IS AN EXAMPLE) 5-2
5.1.2 Opcode Column in the Instruction Summary Table .............................................. 5-2
5.1.3 Instruction Column in the Instruction Summary Table ........................................ 5-4
5.1.4 64/32 bit Mode Support column in the Instruction Summary Table ....................... 5-5
5.1.5 CPUID Support column in the Instruction Summary Table .................................. 5-5
5.2 AES Transformations and Data Structure .................................................. 5-6
5.2.1 Little-Endian Architecture and Big-Endian Specification (FIPS 197) ........... 5-6
5.2.1.1 AES Data Structure in Intel 64 Architecture ................................. 5-6
5.2.2 AES Transformations and Functions ..................................................... 5-8
5.3 Summary of Terms .................................................................................. 5-12
5.4 Instruction Set Reference ..................................................................... 5-13

ADDPD - Add Packed Double Precision Floating-Point Values .................. 5-14
ADDPS- Add Packed Single Precision Floating-Point Values ....................... 5-16
ADDSD- Add Scalar Double Precision Floating-Point Values ......................... 5-18
ADDSS- Add Scalar Single Precision Floating-Point Values ......................... 5-20
ADDSUBPD- Packed Double FP Add/Subtract ........................................... 5-22
ADDSUBPS- Packed Single FP Add/Subtract ............................................ 5-24
AESENC/AESENCLAST- Perform One Round of an AES Encryption Flow .... 5-26
AESDEC/AESDECLAST- Perform One Round of an AES Decryption Flow ..... 5-29
AESIMC- Perform the AES InvMixColumn Transformation ........................ 5-32
AESKEYGENASSIST - AES Round Key Generation Assist .......................... 5-34
ANDPD- Bitwise Logical AND of Packed Double Precision Floating-Point Values 5-36
ANDPS- Bitwise Logical AND of Packed Single Precision Floating-Point Values 5-38
ANDNPD- Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values 5-40
ANDNPS- Bitwise Logical AND NOT of Packed Single Precision Floating-Point Values 5-42
BLENDPD- Blend Packed Double Precision Floating-Point Values ............... 5-44
BLENDPS- Blend Packed Single Precision Floating-Point Values .................. 5-46
BLENDVPD- Blend Packed Double Precision Floating-Point Values .......... 5-49
BLENDVPS- Blend Packed Single Precision Floating-Point Values ......... 5-52
VBROADCAST- Load with Broadcast ......................................................... 5-55
CMPPD- Compare Packed Double-Precision Floating-Point Values ............. 5-59
CMPPS- Compare Packed Single-Precision Floating-Point Values ............... 5-67
CMPSD- Compare Scalar Double-Precision Floating-Point Values .............. 5-74
CMPSS- Compare Scalar Single-Precision Floating-Point Values ............... 5-79
COMISD- Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS .......................................................... 5-84
COMISS- Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS 5-86
CVTDQ2PD- Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values ......................................................... 5-88
CVTDQ2PS- Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values ......................................................... 5-90
CVTPD2DQ- Convert Packed Double-Precision Floating-point values to Packed Doubleword Integers .......................................................... 5-92
CVTPD2PS- Convert Packed Double-Precision Floating-point values to Packed Single-Precision Floating-Point Values ................................ 5-95
CVTPS2DQ- Convert Packed Single Precision Floating-Point Values to Packed Signed Doubleword Integer Values .............................. 5-98
CVTPS2PD- Convert Packed Single Precision Floating-point values to Packed Double Precision Floating-Point Values ........................................... 5-100
CVTSD2SI- Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer
CVTSD2SS- Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value ......................................................... 5-105
CVTSI2SD- Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value 5-107
CVTSS2SD- Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value ............................................. 5-111
CVTSS2SI- Convert Scalar Single-Precision Floating-Point Value to Doubleword Integer. 5-113
CVTTPD2DQ- Convert with Truncation Packed Double-Precision Floating-point values to Packed Doubleword Integers ..................................... 5-115
CVTTPS2DQ- Convert with Truncation Packed Single Precision Floating-Point Values to Packed Signed Doubleword Integer Values ............... 5-118
CVTTSD2SI- Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Doubleword Integer .................................. 5-120
CVTTSS2SI- Convert with Truncation Scalar Single-Precision Floating-Point Value to Doubleword Integer .................................................. 5-122
DIVPD- Divide Packed Double-Precision Floating-Point Values ...................... 5-124
DIVPS- Divide Packed Single-Precision Floating-Point Values ...................... 5-126
DIVSD- Divide Scalar Double-Precision Floating-Point Values ...................... 5-128
DIVSS- Divide Scalar Single-Precision Floating-Point Values ...................... 5-130
DPPD- Dot Product of Packed Double-Precision Floating-Point Values .......... 5-132
DPPS- Dot Product of Packed Single-Precision Floating-Point Values .......... 5-134
VEXTRACTF128- Extract packed floating-point values ............................. 5-137
EXTRACTPS- Extract packed floating-point values .................................. 5-139
HADDPD- Add Horizontal Double Precision Floating-Point Values .................. 5-141
HADDPS- Add Horizontal Single Precision Floating-Point Values ................ 5-143
HSUBPD- Subtract Horizontal Double Precision Floating-Point Values ........... 5-146
HSUBPS- Subtract Horizontal Single Precision Floating-Point Values ........... 5-148
VINSERTF128- Insert packed floating-point values .................................. 5-151
INSERTPS- Insert Scalar Single Precision Floating-Point Value .................. 5-152
LDDQU- Move Unaligned Integer .............................................................. 5-156
VLDMXCSR—Load MXCSR Register ....................................................... 5-158
MASKMOVDQU- Store Selected Bytes of Double Quadword with NT Hint ........... 5-159
VMASKMOV- Conditional SIMD Packed Loads and Stores .......................... 5-161
MAXPD- Maximum of Packed Double Precision Floating-Point Values .......... 5-165
MAXPS- Maximum of Packed Single Precision Floating-Point Values .......... 5-167
MAXSD- Return Maximum Scalar Double-Precision Floating-Point Value ....... 5-170
MAXSS- Return Maximum Scalar Single-Precision Floating-Point Value ........ 5-172
MINPD- Minimum of Packed Double Precision Floating-Point Values .......... 5-174
MINPS- Minimum of Packed Single Precision Floating-Point Values .......... 5-176
MINSD- Return Minimum Scalar Double-Precision Floating-Point Value ....... 5-179
MINSS- Return Minimum Scalar Single-Precision Floating-Point Value ........ 5-181
MOVAPD- Move Aligned Packed Double-Precision Floating-Point Values ....... 5-183
MOVAPS- Move Aligned Packed Single-Precision Floating-Point Values ....... 5-186
MOVD/MOVQ- Move Doubleword and Quadword .......................... 5-189
MOVDQA- Move Aligned Packed Integer Values ...................... 5-196
MOVDQU- Move Unaligned Packed Integer Values ..................... 5-198
MOVHPD- Move High Packed Double-Precision Floating-Point Values . 5-202
MOVHPS- Move High Packed Single-Precision Floating-Point Values . 5-204
MOVLHPS- Move Packed Single-Precision Floating-Point Values Low to High ... 5-206
MOVLPS- Move Low Packed Single-Precision Floating-Point Values ... 5-210
MOVMSKPD- Extract Double-Precision Floating-Point Sign mask .... 5-212
MOVMSKPS- Extract Single-Precision Floating-Point Sign mask ....... 5-214
MOVNTDQ- Store Packed Integers Using Non-Temporal Hint ......... 5-216
MOVNTPD- Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint .............. 5-218
MOVNTPS- Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint .......... 5-220
MOVQ- Move Quadword .................................................. 5-192
MOVDDUP- Replicate Double FP Values ............................... 5-194
MOVSD- Move or Merge Scalar Double-Precision Floating-Point Value ... 5-224
MOVSHDUP- Replicate Single FP Values .............................. 5-227
MOVSS- Move or Merge Scalar Single-Precision Floating-Point Value ... 5-233
MOVUPD- Move Unaligned Packed Double-Precision Floating-Point Values 5-236
MOVUPS- Move Unaligned Packed Single-Precision Floating-Point Values 5-239
MULPD- Multiply Packed Double Precision Floating-Point Values ...... 5-247
MULPS- Multiply Packed Single Precision Floating-Point Values ...... 5-249
MULSD- Multiply Scalar Double-Precision Floating-Point Values ..... 5-251
MULSS- Multiply Scalar Single-Precision Floating-Point Values ..... 5-253
ORPD- Bitwise Logical OR of Packed Double Precision Floating-Point Values ... 5-255
ORPS- Bitwise Logical OR of Packed Single Precision Floating-Point Values ... 5-257
PABSB/PABSW/PABSD- Packed Absolute Value ................................. 5-259
PACKSSWB/PACKSSDW- Pack with Signed Saturation ................. 5-262
PACKUSWB/PACKUSDW- Pack with Unsigned Saturation ............... 5-266
PADDB/PADDW/PADDD/PADDQ- Add Packed Integers ..................... 5-269
PADDSB/PADDSW- Add Packed Signed Integers with Signed Saturation 5-273
PADDUSB/PADDUSW- Add Packed Unsigned Integers with Unsigned Saturation 5-275
PALIGNR- Byte Align .................................................... 5-277
PAND- Logical AND ..................................................... 5-279
PANDN- Logical AND NOT ............................................... 5-281
PAVGB/PAVGW- Average Packed Integers ................................ 5-283
PBLENDVB- Variable Blend Packed Bytes ............................. 5-285
PBLENDW- Blend Packed Words ......................................... 5-288
PCLMULQDQ- Carry-Less Multiplication Quadword ...................... 5-290
PCMPESTRI- Packed Compare Explicit Length Strings, Return Index 5-294
PCMPESTRM- Packed Compare Explicit Length Strings, Return Mask 5-296
<table>
<thead>
<tr>
<th>Instruction Description</th>
<th>Page</th>
</tr>
</thead>
<tbody>
<tr>
<td>Packed Compare Implicit Length Strings, Return Index</td>
<td>5-298</td>
</tr>
<tr>
<td>Packed Compare Implicit Length Strings, Return Mask</td>
<td>5-300</td>
</tr>
<tr>
<td>Compare Packed Integers for Equality</td>
<td>5-302</td>
</tr>
<tr>
<td>Compare Packed Integers for Greater Than</td>
<td>5-306</td>
</tr>
</tbody>
</table>

Ref. # 319433-005
CHAPTER 6

INSTRUCTION SET REFERENCE - FMA

6.1 FMA Instruction Set Reference .................................................. 6-1

VFMADD132PD/VFMADD213PD/VFMADD231PD - Fused Multiply-Add of Packed Dou-

VZEROALL- Zero All YMM registers .............................................. 5-501
VZEROUPPER- Zero Upper bits of YMM registers .............................. 5-503
TABLES

Table 2-1 Rounding behavior of Zero Result in FMA Operation ........................................ 2-7
Table 2-2 FMA Numeric Behavior ..................................................................................... 2-8
Table 2-3 Alignment Faulting Conditions when Memory Access is Not Aligned .............. 2-12
Table 2-4 Instructions Requiring Explicitly Aligned Memory ......................................... 2-12
Table 2-5 Instructions Not Requiring Explicit Memory Alignment .................................. 2-13
Table 2-6 Exception class description ............................................................................... 2-14
Table 2-7 Instructions in each Exception Class ................................................................. 2-15
Table 2-8 #UD Exception and VEX.L Field Encoding .................................................... 2-18
Table 2-9 Type 1 Class Exception Conditions ................................................................. 2-19
Table 2-10 Type 2 Class Exception Conditions ............................................................... 2-20
Table 2-11 Type 3 Class Exception Conditions ............................................................... 2-21
Table 2-12 Type 4 Class Exception Conditions ............................................................... 2-22
Table 2-13 Type 5 Class Exception Conditions ............................................................... 2-23
Table 2-14 Type 6 Class Exception Conditions ............................................................... 2-24
Table 2-15 Type 7 Class Exception Conditions ............................................................... 2-25
Table 2-16 Type 8 Class Exception Conditions ............................................................... 2-26
Table 2-17 Type 9 Class Exception Conditions ............................................................... 2-27
Table 2-18 Type 10 Class Exception Conditions ............................................................. 2-28
Table 2-19 Highest CPUID Source Operand for Intel 64 and IA-32 Processors .......................... 2-41
Table 2-20 Feature Information Returned in the ECX Register ..................................... 2-42
Table 2-21 Processor Type Field .................................................................................... 2-44
Table 2-22 More on Feature Information Returned in the EDX Register ....................... 2-48
Table 2-23 Encoding of Cache and TLB Descriptors ....................................................... 2-50
Table 2-24 Processor Brand String Returned with Pentium 4 Processor ...................... 2-56
Table 2-25 Mapping of Brand Indices; and Intel 64 and IA-32 Processor Brand Strings .................................................. 2-59
Table 2-26 XFEATURE_ENABLED_MASK and Processor State Components ..................... 3-3
Table 2-27 CR4 bits for AVX New Instructions technology support .................................. 3-3
Table 2-28 Layout of XSAVE Area For Processor Supporting YMM State ...................... 3-4
Table 2-29 XSAVE Header Format .................................................................................. 3-4
Table 2-30 XSAVE Save Area Layout for YMM State (Ext_Save_Area_2) ......................... 3-5
Table 2-31 XSAVE Save Area Layout for YMM State (Ext_Save_Area_2) ......................... 3-5
Table 2-32 XRSTOR Action on MXCSR, XMM Registers, YMM Registers ....................... 3-6
Table 2-33 Processor Supplied Init Values XRSTOR May Use ....................................... 3-6
Table 2-34 VEX.vvvv to register name mapping ............................................................. 4-8
Table 2-35 Instructions with a VEX.vvvv destination ....................................................... 4-9
Table 2-36 Instructions with a VEX.vvvv destination ....................................................... 4-9
Table 2-37 Interpreting VEX.vvvv, reg_field, and rm_field ........................................... 4-10
Table 2-38 VEX.m-mmmm interpretation .......................................................................... 4-11
Table 2-39 VEX.L interpretation ..................................................................................... 4-12
Table 2-40 VEX.pp interpretation ................................................................................... 4-12
Table 2-41 Byte and 32-bit Word Representation of a 128-bit State .............................. 5-7
Table 2-42 Matrix Representation of a 128-bit State ...................................................... 5-7
Table 2-43 Little Endian Representation of a 128-bit State ............................................... 5-7

<table>
<thead>
<tr>
<th>FIGURE</th>
<th>DESCRIPTION</th>
<th>PAGE</th>
</tr>
</thead>
<tbody>
<tr>
<td>Figure 2-1.</td>
<td>General Procedural Flow of Application Detection of AVX</td>
<td>2-2</td>
</tr>
<tr>
<td>Figure 2-2.</td>
<td>Version Information Returned by CPUID in EAX</td>
<td>2-42</td>
</tr>
<tr>
<td>Figure 2-3.</td>
<td>Feature Information Returned in the ECX Register</td>
<td>2-44</td>
</tr>
<tr>
<td>Figure 2-4.</td>
<td>Feature Information Returned in the EDX Register</td>
<td>2-47</td>
</tr>
<tr>
<td>Figure 2-5.</td>
<td>Determination of Support for the Processor Brand String</td>
<td>2-56</td>
</tr>
<tr>
<td>Figure 2-6.</td>
<td>Algorithm for Extracting Maximum Processor Frequency</td>
<td>2-58</td>
</tr>
<tr>
<td>Figure 2-7.</td>
<td>Instruction Encoding Format with VEX Prefix</td>
<td>4-2</td>
</tr>
<tr>
<td>Figure 4-5.</td>
<td>VEX bitfields</td>
<td>4-5</td>
</tr>
<tr>
<td>Figure 5-1.</td>
<td>VBROADCASTSS Operation (VEX.256 encoded version)</td>
<td>5-56</td>
</tr>
<tr>
<td>Figure 5-2.</td>
<td>VBROADCASTSS Operation (128-bit version)</td>
<td>5-56</td>
</tr>
<tr>
<td>Figure 5-3.</td>
<td>VBROADCASTSD Operation</td>
<td>5-57</td>
</tr>
<tr>
<td>Figure 5-4.</td>
<td>VBROADCASTF128 Operation</td>
<td>5-57</td>
</tr>
<tr>
<td>Figure 5-5.</td>
<td>CVTDQ2PD (VEX.256 encoded version)</td>
<td>5-89</td>
</tr>
<tr>
<td>Figure 5-6.</td>
<td>VCVTPD2DQ (VEX.256 encoded version)</td>
<td>5-93</td>
</tr>
<tr>
<td>Figure 5-7.</td>
<td>VCVTPD2PS (VEX.256 encoded version)</td>
<td>5-96</td>
</tr>
<tr>
<td>Figure 5-8.</td>
<td>CVTPS2PD (VEX.256 encoded version)</td>
<td>5-101</td>
</tr>
<tr>
<td>Figure 5-9.</td>
<td>VCVTTPD2DQ (VEX.256 encoded version)</td>
<td>5-116</td>
</tr>
<tr>
<td>Figure 5-10.</td>
<td>VHADDPD operation</td>
<td>5-141</td>
</tr>
<tr>
<td>Figure 5-11.</td>
<td>VHADDPS operation</td>
<td>5-143</td>
</tr>
<tr>
<td>Figure 5-12.</td>
<td>VHSUBPD operation</td>
<td>5-146</td>
</tr>
<tr>
<td>Figure 5-13.</td>
<td>VHSUBPS operation</td>
<td>5-148</td>
</tr>
<tr>
<td>Figure 5-14.</td>
<td>MOVDDUP Operation</td>
<td>5-195</td>
</tr>
<tr>
<td>Figure 5-15.</td>
<td>MOVSHDUP Operation</td>
<td>5-228</td>
</tr>
<tr>
<td>Figure 5-16.</td>
<td>MOVSLDUP Operation</td>
<td>5-231</td>
</tr>
<tr>
<td>Figure 5-17.</td>
<td>PACKSSDW Instruction Operation using 64-bit Operands</td>
<td>5-263</td>
</tr>
<tr>
<td>Figure 5-18.</td>
<td>VPERMILPD operation</td>
<td>5-311</td>
</tr>
<tr>
<td>Figure 5-19.</td>
<td>VPERMILPS operation</td>
<td>5-311</td>
</tr>
<tr>
<td>Figure 5-20.</td>
<td>VPERMILPS Shuffle Control</td>
<td>5-315</td>
</tr>
<tr>
<td>Figure 5-21.</td>
<td>VPERMILPS Shuffle Control</td>
<td>5-315</td>
</tr>
<tr>
<td>Figure 5-22.</td>
<td>VPERM2F128 Operation</td>
<td>5-318</td>
</tr>
<tr>
<td>Figure 5-23.</td>
<td>PSHUFD Instruction Operation</td>
<td>5-389</td>
</tr>
<tr>
<td>Figure 5-24.</td>
<td>PUNPCKHDQ Instruction Operation</td>
<td>5-430</td>
</tr>
<tr>
<td>Figure 5-25.</td>
<td>PUNPCKLBW Instruction Operation using 64-bit Operands</td>
<td>5-434</td>
</tr>
<tr>
<td>Figure 5-26.</td>
<td>VROUNDxx immediate control field definition</td>
<td>5-450</td>
</tr>
<tr>
<td>Figure 5-27.</td>
<td>VSHUFPD Operation</td>
<td>5-461</td>
</tr>
<tr>
<td>Figure 5-28.</td>
<td>VSHUFPS Operation</td>
<td>5-464</td>
</tr>
<tr>
<td>Figure 5-29.</td>
<td>VUNPCKHPS Operation</td>
<td>5-490</td>
</tr>
<tr>
<td>Figure 5-30.</td>
<td>VUNPCKLPS Operation</td>
<td>5-495</td>
</tr>
</tbody>
</table>
1.1 ABOUT THIS DOCUMENT

This document describes the software programming interfaces of several vector SIMD instruction extensions of the Intel® 64 architecture that will be introduced starting with Intel 64 processors built on 32nm process technology. The instruction set extensions covered in this document are grouped, with respect to their availability in different processor generations, into the following categories:

- General-purpose encryption and AES: 128-bit SIMD extensions targeted to accelerate high-speed block encryption and cryptographic processing using the Advanced Encryption Standard.

- Intel® Advanced Vector Extensions (AVX): introduces 256-bit vector processing capability and includes two components to be introduced on Intel processor generations built on 32nm process technology and beyond:
  - The first generation of Intel AVX provides 256-bit SIMD register support, 256-bit vector floating-point instructions, enhancements to 128-bit SIMD instructions, and support for three- and four-operand syntax.
  - FMA is a future extension of Intel AVX. It provides floating-point fused multiply-add instructions supporting 256-bit and 128-bit SIMD vectors.

Chapter 1 provides an overview of these instruction set extensions. Chapter 2 describes the application programming environment. Chapter 3 describes the system programming requirements needed to support 256-bit registers. Chapter 4 describes the architectural extensions of the Intel 64 instruction encoding format that support 256-bit registers and three- and four-operand syntax. Chapter 5 provides a detailed instruction reference for AVX and encryption/AES instructions. Chapter 6 provides a detailed instruction reference for FMA instructions.

1.2 OVERVIEW

Intel® Advanced Vector Extensions extends the capabilities and programming environment of multiple generations of Streaming SIMD Extensions. Intel AVX addresses the continued need for vector floating-point performance in mainstream scientific and engineering numerical applications, visual processing, recognition, data-mining/synthesis, gaming, physics, cryptography and other application areas. Intel AVX is designed to facilitate efficient implementation by a wide spectrum of software architectures with varying degrees of thread parallelism and data vector lengths. Intel AVX offers the following benefits:

• efficient building blocks for applications targeted across all segments of computing platforms,

• significant increase in floating-point performance density with good power efficiency over previous generations of 128-bit SIMD instruction set extensions,

• scalable performance with multi-core processor capability.

Intel AVX also establishes a foundation for future evolution in both instruction set functionality and vector lengths by introducing an efficient instruction encoding scheme, three- and four-operand instruction syntax, and support for load and store masking.

Intel Advanced Vector Extensions offers comprehensive architectural and functional enhancements in arithmetic as well as data processing primitives. Section 1.3 summarizes the architectural enhancements of AVX. Section 1.4 gives a functional overview of the AVX and FMA instructions. General-purpose encryption and AES instructions follow the existing architecture of 128-bit SIMD instruction sets like SSE4 and its predecessors; Section 1.5 provides a short summary.

1.3 INTEL® ADVANCED VECTOR EXTENSIONS ARCHITECTURE OVERVIEW

Intel AVX has many similarities to the SSE and double-precision floating-point portions of SSE2. However, Intel AVX introduces the following architectural enhancements:

• Support for 256-bit wide vectors and SIMD register set. The 256-bit register state is managed by the operating system using the XSAVE/XRSTOR instructions introduced in 45 nm Intel 64 processors (see IA-32 Intel® Architecture Software Developer’s Manual, Volumes 2B and 3A).

• Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility and efficient encoding of new instruction extensions.

• Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.

• Instruction encoding format using a new prefix (referred to as VEX) to provide compact, efficient encoding for three-operand syntax, vector lengths, compaction of SIMD prefixes and REX functionality.

• FMA extensions and enhanced floating-point compare instructions add support for the IEEE-754-2008 standard.

1.3.1 256-Bit Wide SIMD Register Support

Intel AVX introduces support for 256-bit wide SIMD registers (YMM0-YMM7 in operating modes that are 32-bit or less, YMM0-YMM15 in 64-bit mode). The lower 128 bits of the YMM registers are aliased to the respective 128-bit XMM registers.

1.3.2 Instruction Syntax Enhancements

Intel AVX employs an instruction encoding scheme using a new prefix (known as the “VEX” prefix). Instruction encoding using the VEX prefix can directly encode a register operand within the VEX prefix. This supports two new instruction syntaxes in Intel 64 architecture:

- A non-destructive operand (in a three-operand instruction syntax): The non-destructive source reduces the number of registers, register-register copies and explicit load operations required in typical SSE loops, reduces code size, and improves micro-fusion opportunities.
- A third source operand (in a four-operand instruction syntax) via the upper 4 bits in an 8-bit immediate field. Support for the third source operand is defined for selected instructions (e.g. VBLENDVPD, VBLENDVPS, VPBLENDVB).

Two-operand instruction syntax previously expressed as

    ADDPS xmm1, xmm2/m128

can now be expressed in three-operand syntax as

    VADDPS xmm1, xmm2, xmm3/m128

In four-operand syntax, the extra register operand is encoded in the immediate byte.
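
As an illustration only, the following C sketch models the difference between the destructive two-operand form and the non-destructive three-operand form on four single-precision lanes. The type `vec4f` and the functions `addps_model` and `vaddps_model` are hypothetical stand-ins for an XMM register and for the ADDPS/VADDPS operations; they are not part of any Intel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit XMM register holding four floats. */
typedef struct { float lane[4]; } vec4f;

/* Models ADDPS xmm1, xmm2/m128: the first operand is both a source and
   the destination, so its previous value is overwritten. */
void addps_model(vec4f *xmm1, const vec4f *xmm2)
{
    for (size_t i = 0; i < 4; i++)
        xmm1->lane[i] = xmm1->lane[i] + xmm2->lane[i];
}

/* Models VADDPS xmm1, xmm2, xmm3/m128: both sources are left intact and
   the result is written to a separate destination. */
void vaddps_model(vec4f *xmm1, const vec4f *xmm2, const vec4f *xmm3)
{
    for (size_t i = 0; i < 4; i++)
        xmm1->lane[i] = xmm2->lane[i] + xmm3->lane[i];
}
```

With the two-operand form, code that still needs the original value of the first operand must copy it to another register first; the three-operand form removes that copy.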

Note: SIMD instructions supporting three-operand syntax but processing only 128 bits of data are considered part of the 256-bit SIMD instruction set extensions of AVX, because bits 255:128 of the destination register are zeroed by the processor.

1.3.3 VEX Prefix Instruction Encoding Support

Intel AVX introduces a new prefix, referred to as VEX, in the Intel 64 and IA-32 instruction encoding format. Instruction encoding using the VEX prefix provides the following capabilities:

- Direct encoding of a register operand within VEX. This provides instruction syntax support for non-destructive source operand.
- Efficient encoding of instruction syntax operating on 128-bit and 256-bit register sets.
- Compaction of REX prefix functionality: The equivalent functionality of the REX prefix is encoded within VEX.
- Compaction of SIMD prefix functionality and escape byte encoding: The functionality of SIMD prefix (66H, F2H, F3H) on opcode is equivalent to an opcode extension field to introduce new processing primitives. This functionality is replaced by a more compact representation of opcode extension within the VEX prefix. Similarly, the functionality of the escape opcode byte (0FH) and two-byte escape (0F38H, 0F3AH) are also compacted within the VEX prefix encoding.
- Most VEX-encoded SIMD numeric and data processing instructions with a memory operand have more relaxed memory alignment requirements than instructions encoded using SIMD prefixes (see Section 2.5).

VEX prefix encoding applies to SIMD instructions operating on YMM registers, XMM registers, and in some cases with a general-purpose register as one of the operands. The VEX prefix is not supported for instructions operating on MMX or x87 registers. Details of the VEX prefix and instruction encoding are discussed in Chapter 4.

1.4 FUNCTIONAL OVERVIEW

Intel AVX and FMA provide comprehensive functional improvements over previous generations of SIMD instruction extensions. The functional improvements include:

- 256-bit floating-point arithmetic primitives: AVX enhances existing 128-bit floating-point arithmetic instructions with 256-bit capabilities for floating-point processing. FMA provides an additional set of 256-bit floating-point processing capabilities with a rich set of fused multiply-add and fused multiply-subtract primitives.
- Enhancements for flexible SIMD data movements: AVX provides a number of new data movement primitives to enable efficient SIMD programming in relation to loading non-unit-strided data into SIMD registers, intra-register SIMD data manipulation, conditional expression and branch handling, etc. Enhancements for SIMD data movement primitives cover 256-bit and 128-bit vector floating-point data, as well as 128-bit integer SIMD data processing using VEX-encoded instructions.

Several key categories of functional improvements in AVX and FMA are summarized in the following subsections.

1.4.1 256-bit Floating-Point Arithmetic Processing Enhancements

Intel AVX provides 35 256-bit floating-point arithmetic instructions. The arithmetic operations cover add, subtract, multiply, divide, square-root, compare, max, min, round, etc., on single-precision and double-precision floating-point data.

The enhancement in AVX on floating-point compare operation provides 32 conditional predicates to improve programming flexibility in evaluating conditional expressions.

FMA provides 36 256-bit floating-point instructions to perform computation on 256-bit vectors. The arithmetic operations cover fused multiply-add, fused multiply-subtract, fused multiply add/subtract interleave, signed-reversed multiply on fused multiply-add and multiply-subtract.

1.4.2 256-bit Non-Arithmetic Instruction Enhancements

Intel AVX provides new primitives for handling data movement within 256-bit floating-point vectors and promotes many 128-bit floating-point data processing instructions to handle 256-bit floating-point vectors.

AVX includes 33 256-bit data processing instructions that are promoted from previous generations of SIMD instruction extensions, covering logical, blend, convert, test, unpack, shuffle, load and store operations.

AVX introduces 19 new data processing instructions that operate on 256-bit vectors. These new primitives cover the following operations:

- Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching primitives:
  - broadcast of single or multiple data elements into a 256-bit destination,
  - masked move primitives to load or store SIMD data elements conditionally,

- Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data manipulation primitives:
  - insert/extract multiple SIMD floating-point data elements to/from 256-bit SIMD registers
  - permute primitives to facilitate efficient manipulation of floating-point data elements in 256-bit SIMD registers

- Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
  - new variable blend instructions support four-operand syntax with a non-destructive source. This is more flexible than the equivalent SSE4 instruction syntax, which uses the XMM0 register as the implied mask for blend selection,
  - packed TEST instructions for floating-point data.
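
To make the branch-handling idea concrete, the sketch below models a variable blend on eight single-precision elements: a per-element conditional such as `dst[i] = cond[i] ? b[i] : a[i]` becomes a single select driven by a mask operand. It assumes the blend selects on the sign bit of each mask element, as the VBLENDVPS form does; the names `blendv_model` and `sign_bit_set` are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8  /* eight single-precision elements in a 256-bit vector */

/* Returns 1 if the most significant (sign) bit of the float is set. */
int sign_bit_set(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (int)(bits >> 31);
}

/* Per-element select: dst[i] = src2[i] where the mask sign bit is set,
   src1[i] otherwise. The mask is an explicit operand here, not an implied
   register, and neither source is overwritten. */
void blendv_model(float dst[LANES], const float src1[LANES],
                  const float src2[LANES], const float mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        dst[i] = sign_bit_set(mask[i]) ? src2[i] : src1[i];
}
```

A mask is typically produced by a packed compare, so a loop body with an if/else can be evaluated for all elements without a branch.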

1.4.3 Arithmetic Primitives for 128-bit Vector and Scalar processing

Intel AVX provides 131 128-bit numeric processing instructions that employ VEX-prefix encoding. These VEX-encoded instructions generally provide the same functionality as instructions operating on XMM registers that are encoded using SIMD prefixes. The 128-bit numeric processing instructions in AVX cover floating-point and integer data processing across 128-bit vector and scalar processing.

The enhancement in AVX on 128-bit floating-point compare operation provides 32 conditional predicates to improve programming flexibility in evaluating conditional expressions. This contrasts with floating-point SIMD compare instructions in SSE and SSE2 supporting only 8 conditional predicates.

FMA provides 60 128-bit floating-point instructions to process 128-bit vector and scalar data. The arithmetic operations cover fused multiply-add, fused multiply-subtract, signed-reversed multiply on fused multiply-add and multiply-subtract.

1.4.4 Non-Arithmetic Primitives for 128-bit Vector and Scalar Processing

Intel AVX provides 126 data processing instructions that employ VEX-prefix encoding. These VEX-encoded instructions generally provide the same functionality as instructions operating on XMM registers that are encoded using SIMD prefixes. The 128-bit data processing instructions in AVX cover floating-point and integer data movement primitives.

Additional enhancements in AVX on 128-bit data processing primitives include 16 new instructions with the following capabilities:

- Non-unit-strided fetching of SIMD data. AVX provides several flexible SIMD floating-point data fetching primitives:
  - broadcast of a single data element into a 128-bit destination,
  - masked move primitives to load or store SIMD data elements conditionally,

- Intra-register manipulation of SIMD data elements. AVX provides several flexible SIMD floating-point data manipulation primitives:
  - permute primitives to facilitate efficient manipulation of floating-point data elements in 128-bit SIMD registers

- Branch handling. AVX provides several primitives to enable handling of branches in SIMD programming:
  - new variable blend instructions support four-operand syntax with a non-destructive source. Branching conditions dependent on floating-point data or integer data can benefit from Intel AVX. This is more flexible than non-VEX encoded instruction syntax, which uses the XMM0 register as the implied mask for blend selection. While variable blend with implied XMM0 syntax is supported in SSE4 using SIMD prefix encoding, VEX-encoded 128-bit variable blend instructions only support the more flexible four-operand syntax,
  - packed TEST instructions for floating-point data.

1.5 GENERAL ENCRYPTION AND CRYPTOGRAPHIC PROCESSING

Intel 64 processors using 32 nm processing technology introduce several primitives to accelerate general-purpose block encryption and cryptographic functions that use the Advanced Encryption Standard (AES) for block cipher encryption and decryption of 128-bit blocks.

General-purpose block encryption primitives are provided by the PCLMULQDQ instruction, which performs carry-less multiplication of two binary numbers up to 64 bits wide.
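
For readers unfamiliar with carry-less multiplication, the following C sketch computes the same kind of 64 x 64 to 128-bit product in software. It is a plain shift-and-XOR reference model, not how the instruction is implemented; the type `clmul128` and the function `clmul64_model` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit result of a 64 x 64 carry-less multiplication. */
typedef struct { uint64_t lo, hi; } clmul128;

/* Shift-and-XOR multiply: like schoolbook binary multiplication, but the
   partial products are combined with XOR, so no carries propagate. */
clmul128 clmul64_model(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            if (i > 0)                  /* avoid an undefined 64-bit shift */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}
```

For example, multiplying 3 by 3 this way gives 5 rather than 9, because the two middle partial-product bits cancel under XOR instead of producing a carry.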

AES encryption involves processing 128-bit input data (plaintext) through a finite number of iterative operations, each referred to as an “AES round”, into a 128-bit encrypted block (ciphertext). Decryption follows the reverse direction of the iterative operations using the “equivalent inverse cipher” instead of the “inverse cipher”.

The cryptographic processing at each round involves two input data, one is the “state”, the other is the “round key”. Each round uses a different “round key”. The round keys are derived from the cipher key using a “key schedule” algorithm. The “key schedule” algorithm is independent of the data processing of encryption/decryption, and can be carried out independently from the encryption/decryption phase.

The AES standard supports cipher keys of 128, 192, and 256 bits. These cipher key sizes correspond to 10, 12, and 14 rounds of iteration, respectively.
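
The relationship between key size and round count can be written as a small helper. This is a sketch with hypothetical function names; it uses the rule from the AES standard that the round count is the key length in 32-bit words plus 6, and that one round key is needed per round plus one for the initial key addition.

```c
#include <assert.h>

/* Returns the number of AES rounds for a cipher key size in bits,
   or -1 if the size is not one of the three sizes AES defines. */
int aes_round_count(int key_bits)
{
    if (key_bits != 128 && key_bits != 192 && key_bits != 256)
        return -1;
    return key_bits / 32 + 6;   /* 128 -> 10, 192 -> 12, 256 -> 14 */
}

/* Returns the number of 128-bit round keys the key schedule must produce:
   one per round plus one for the initial round-key addition. */
int aes_round_key_count(int key_bits)
{
    int rounds = aes_round_count(key_bits);
    return rounds < 0 ? -1 : rounds + 1;
}
```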

The AES extensions provide two primitives to accelerate AES rounds on encryption, two primitives for AES rounds on decryption using the equivalent inverse cipher, and two instructions to support the AES key expansion procedure.

CHAPTER 2
APPLICATION PROGRAMMING MODEL

The application programming model for Intel AVX, FMA, AES and encryption primitives extends from that of Streaming SIMD Extensions (SSE) and is summarized as follows:

• The AES extensions and carry-less multiplication instruction (PCLMULQDQ) follow the same programming model as SSE, SSE2, SSE3, SSSE3, and SSE4 (see IA-32 Intel Architecture Software Developer’s Manual, Volume 1). The detection of AES and PCLMULQDQ is described in Section 2.1.

• The AVX and FMA extensions follow a programming model analogous to that of SSE with minor differences. This is described in Section 2.1 through Section 2.8. Note however that the OS support and detection process has changed considerably.

• The numeric exception behavior of FMA is similar to previous generations of SIMD floating-point instructions. The specific details are described in Section 2.3.

CPUID instruction details for detecting AVX, FMA, AES, PCLMULQDQ are described in Section 2.9.

2.1 DETECTION OF PCLMULQDQ AND AES INSTRUCTIONS

Before an application attempts to use the following AES instructions: AESDEC/AESDECLAST/AESENC/AESENCLAST/AESIMC/AESKEYGENASSIST, it must check that the processor supports the AES extensions. The AES extensions are supported if CPUID.01H:ECX.AES[bit 25] = 1.

Prior to using PCLMULQDQ instruction, an application must check if CPUID.01H:ECX.PCLMULQDQ[bit 1] = 1.

Operating systems that support handling SSE state will also support applications that use AES extensions and PCLMULQDQ instruction. This is the same requirement for SSE2, SSE3, SSSE3, and SSE4.

2.2 DETECTION OF AVX AND FMA INSTRUCTIONS

AVX and FMA operate on the 256-bit YMM register state. System software requirements to support YMM state are described in Chapter 3.

Application detection of new instruction extensions operating on the YMM state follows the general procedural flow in Figure 2-1.

Prior to using AVX, the application must identify that the operating system supports the XGETBV instruction and the YMM register state, in addition to the processor’s support for YMM state management using XSAVE/XRSTOR and for AVX instructions. The following simplified sequence accomplishes both and is strongly recommended.

1) Detect CPUID.1:ECX.OSXSAVE[bit 27] = 1 (XGETBV enabled for application use¹).
2) Issue XGETBV and verify that XFEATURE_ENABLED_MASK[2:1] = ‘11b’ (XMM state and YMM state are enabled by the OS).
3) Detect CPUID.1:ECX.AVX[bit 28] = 1 (AVX instructions supported).
   (Step 3 can be done in any order relative to steps 1 and 2.)

The following pseudocode illustrates this recommended application AVX detection
process:

```c
INT supports_AVX()
{
    ; result in eax
    mov eax, 1
    cpuid
    and ecx, 018000000H
    cmp ecx, 018000000H ; check both OSXSAVE and AVX feature flags
    jne not_supported
    ; processor supports AVX instructions and XGETBV is enabled by OS
    mov ecx, 0          ; specify 0 for XFEATURE_ENABLED_MASK register
    XGETBV              ; result in EDX:EAX
    and eax, 06H
    cmp eax, 06H        ; check OS has enabled both XMM and YMM state support
    jne not_supported
    mov eax, 1
    jmp done
not_supported:
    mov eax, 0
done:
}
```

---

1. If CPUID.01H:ECX.OSXSAVE reports 1, it also indirectly implies that the processor supports XSAVE, XRSTOR, XGETBV, and the processor extended state bit vector XFEATURE_ENABLED_MASK register. Thus an application may streamline the checking of CPUID feature flags for XSAVE and OSXSAVE. XSETBV is a privileged instruction.

Note: It is unwise for an application to rely exclusively on CPUID.1:ECX.AVX[bit 28] or at all on CPUID.1:ECX.XSAVE[bit 26]: These indicate hardware support but not operating system support. If YMM state management is not enabled by an operating system, AVX instructions will #UD regardless of CPUID.1:ECX.AVX[bit 28]. "CPUID.1:ECX.XSAVE[bit 26] = 1" does not guarantee the OS actually uses the XSAVE process for state management.
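
The decision logic of the detection sequence can also be expressed in portable C. The sketch below deliberately does not execute CPUID or XGETBV itself, since how those values are obtained depends on the compiler and platform; it assumes the caller passes in ECX from CPUID leaf 1 and the low 32 bits of XFEATURE_ENABLED_MASK. All function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID1_ECX_OSXSAVE (UINT32_C(1) << 27)
#define CPUID1_ECX_AVX     (UINT32_C(1) << 28)
#define XFEM_XMM_STATE     (UINT32_C(1) << 1)
#define XFEM_YMM_STATE     (UINT32_C(1) << 2)

/* Steps 1 and 3: the processor reports AVX and the OS has enabled XGETBV.
   XGETBV may only be executed by the caller once this returns 1. */
int cpuid_allows_avx_check(uint32_t cpuid1_ecx)
{
    uint32_t need = CPUID1_ECX_OSXSAVE | CPUID1_ECX_AVX;   /* 018000000H */
    return (cpuid1_ecx & need) == need;
}

/* Step 2: the OS has enabled both XMM and YMM state management. */
int os_enabled_ymm_state(uint32_t xfem_low)
{
    uint32_t need = XFEM_XMM_STATE | XFEM_YMM_STATE;       /* 06H */
    return (xfem_low & need) == need;
}

/* Combined result, mirroring the 1/0 return value of the pseudocode. */
int supports_avx_model(uint32_t cpuid1_ecx, uint32_t xfem_low)
{
    return cpuid_allows_avx_check(cpuid1_ecx) &&
           os_enabled_ymm_state(xfem_low);
}
```

Keeping the two checks separate reflects the ordering constraint: the XFEATURE_ENABLED_MASK value is only meaningful, and XGETBV only safe to issue, after OSXSAVE has been confirmed.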

The steps above also apply to enhanced 128-bit SIMD floating-point instructions in AVX (using VEX prefix encoding) that operate on the YMM states. Application detection of VEX-encoded AES is described in Section 2.2.2.

2.2.1 Detection of FMA

Hardware support for FMA is indicated by CPUID.1:ECX.FMA[bit 12]=1.

Application software must first identify that the hardware supports AVX as explained in Section 2.2. After that, it must also detect support for FMA by checking CPUID.1:ECX.FMA[bit 12]. The recommended pseudocode sequence for detection of FMA is:

INT supports_fma()
{
    ; result in eax
    mov eax, 1
    cpuid
    and ecx, 018001000H
    cmp ecx, 018001000H ; check OSXSAVE, AVX, FMA feature flags
    jne not_supported
    ; processor supports AVX, FMA instructions and XGETBV is enabled by OS
    mov ecx, 0          ; specify 0 for XFEATURE_ENABLED_MASK register
    XGETBV              ; result in EDX:EAX
    and eax, 06H
    cmp eax, 06H        ; check OS has enabled both XMM and YMM state support
    jne not_supported
    mov eax, 1
    jmp done
not_supported:
    mov eax, 0
done:
}
-------------------------------------------------------------------------------------------------------------------

Note that FMA comprises 256-bit and 128-bit SIMD instructions operating on YMM states.

2.2.2 Detection of VEX-Encoded AES and VPCLMULQDQ

VAESDEC/VAESDECLAST/VAESENC/VAESENCLAST/VAESIMC/VAESKEYGENASSIST instructions operate on YMM states. The detection sequence must combine checking for CPUID.1:ECX.AES[bit 25] = 1 with the sequence for detecting application support for AVX.

Similarly, the detection sequence for VPCLMULQDQ must combine checking for CPUID.1:ECX.PCLMULQDQ[bit 1] = 1 with the sequence for detecting application support for AVX.

This is shown in the pseudocode:
-------------------------------------------------------------------------------------------------------------------
INT supports_VAES()
{
    ; result in eax
    mov eax, 1
    cpuid
    and ecx, 01A000000H
    cmp ecx, 01A000000H ; check OSXSAVE, AVX and AES feature flags
    jne not_supported
    ; processor supports AVX and VEX.128-encoded AES instructions and XGETBV is enabled by OS
    mov ecx, 0          ; specify 0 for XFEATURE_ENABLED_MASK register
    XGETBV              ; result in EDX:EAX
    and eax, 06H
    cmp eax, 06H        ; check OS has enabled both XMM and YMM state support
    jne not_supported
    mov eax, 1
    jmp done
not_supported:
    mov eax, 0
done:
}

INT supports_VPCLMULQDQ()
{
    ; result in eax
    mov eax, 1
    cpuid
    and ecx, 018000002H
    cmp ecx, 018000002H ; check OSXSAVE, AVX and PCLMULQDQ feature flags
    jne not_supported
    ; processor supports AVX and VPCLMULQDQ instructions and XGETBV is enabled by OS
    mov ecx, 0          ; specify 0 for XFEATURE_ENABLED_MASK register
    XGETBV              ; result in EDX:EAX
    and eax, 06H
    cmp eax, 06H        ; check OS has enabled both XMM and YMM state support
    jne not_supported
    mov eax, 1
    jmp done
not_supported:
    mov eax, 0
done:
}

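
The detection routines for FMA, VEX-encoded AES and VPCLMULQDQ differ only in the extra CPUID feature bit added to the OSXSAVE and AVX bits. A minimal C sketch of that shared structure follows; it assumes the caller supplies ECX from CPUID leaf 1 and the low 32 bits of XFEATURE_ENABLED_MASK, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ECX_PCLMULQDQ (UINT32_C(1) << 1)
#define ECX_FMA       (UINT32_C(1) << 12)
#define ECX_AES       (UINT32_C(1) << 25)
#define ECX_OSXSAVE   (UINT32_C(1) << 27)
#define ECX_AVX       (UINT32_C(1) << 28)
#define XFEM_XMM_YMM  UINT32_C(0x6)

/* Builds the CPUID.1:ECX mask used by the pseudocode: OSXSAVE and AVX
   plus the bit of the extension being detected. */
uint32_t vex_feature_mask(uint32_t feature_bit)
{
    return ECX_OSXSAVE | ECX_AVX | feature_bit;
}

/* Returns 1 only if all required CPUID bits are set and the OS has
   enabled both XMM and YMM state. */
int supports_vex_feature(uint32_t cpuid1_ecx, uint32_t xfem_low,
                         uint32_t feature_bit)
{
    uint32_t need = vex_feature_mask(feature_bit);
    if ((cpuid1_ecx & need) != need)
        return 0;
    return (xfem_low & XFEM_XMM_YMM) == XFEM_XMM_YMM;
}
```

The masks produced by `vex_feature_mask` for FMA, AES and PCLMULQDQ are 018001000H, 01A000000H and 018000002H, matching the constants in the pseudocode.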
2.3 FUSED-MULTIPLY-ADD (FMA) NUMERIC BEHAVIOR

FMA instructions can perform fused-multiply-add operations (including fused-multiply-subtract, and other varieties) on packed and scalar data elements in the instruction operands. FMA instructions provide separate instructions to handle different types of arithmetic operations on the three source operands.

FMA instruction syntax is defined using three source operands. The first source operand is updated based on the result of the arithmetic operations on the data elements of 128-bit or 256-bit operands, i.e. the first source operand is also the destination operand.

The arithmetic FMA operation performed in an FMA instruction takes one of several forms, \( r = (x \times y) + z \), \( r = (x \times y) - z \), \( r = -(x \times y) + z \), or \( r = -(x \times y) - z \). Packed FMA instructions can perform eight single-precision FMA operations or four double-precision FMA operations with 256-bit vectors.

Scalar FMA instructions only perform one arithmetic operation on the low order data element. The content of the rest of the data elements in the lower 128-bits of the destination operand is preserved. The upper 128-bits of the destination operand are filled with zero.

An arithmetic FMA operation of the form, \( r = (x \times y) + z \), takes two IEEE-754-2008 single (double) precision values and multiplies them to form an infinite precision intermediate value. This intermediate value is added to a third single (double) precision value (also at infinite precision) and rounded to produce a single (double) precision result.
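
The effect of rounding once instead of twice can be seen with ordinary C arithmetic. The sketch below is an emulation for illustration, not the hardware algorithm: it relies on the fact that the product of two single-precision values is exactly representable in double precision, so computing in double and converting back approximates a single-precision fused operation (in rare cases the extra conversion step can round differently from a true fused operation). Both function names are hypothetical.

```c
#include <assert.h>

/* Unfused: the product is rounded to single precision before the add. */
float mul_then_add(float x, float y, float z)
{
    volatile float product = x * y;  /* volatile forces rounding to float */
    return product + z;
}

/* Fused (emulated): the product of two floats is exact in double, so the
   only rounding of the product-plus-addend happens at the end. */
float fused_mul_add_emulated(float x, float y, float z)
{
    double exact_product = (double)x * (double)y;
    return (float)(exact_product + (double)z);
}
```

With x = y = 1 + 2^-12 and z = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The unfused version rounds the product to 1 + 2^-11 and returns 0, while the fused version keeps the low-order term and returns 2^-24.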

Table 2-2 describes the numerical behavior of the FMA operation, \( r = (x \times y) + z \), \( r = (x \times y) - z \), \( r = -(x \times y) + z \), \( r = -(x \times y) - z \) for various input values. The input values can be 0, finite non-zero (F in Table 2-2), infinity of either sign (INF in Table 2-2), positive infinity (+INF in Table 2-2), negative infinity (-INF in Table 2-2), or NaN (including QNaN or SNaN). If any one of the input values is a NaN, the result of FMA operation, \( r \), may be a quietized NaN. The result can be either \( Q(x) \), \( Q(y) \), or \( Q(z) \), see Table 2-2. If \( x \) is a NaN, then:

- \( Q(x) = x \) if \( x \) is QNaN or
- \( Q(x) = \) the quietized NaN obtained from \( x \) if \( x \) is SNaN
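
As a sketch of what quietizing means at the bit level, the following C helpers operate on the raw encoding of a single-precision value. They assume the usual IEEE-754 binary32 layout in which a NaN has an all-ones exponent and a non-zero fraction, and the most significant fraction bit distinguishes a QNaN (1) from an SNaN (0). The names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define F32_EXP_MASK  UINT32_C(0x7F800000)
#define F32_FRAC_MASK UINT32_C(0x007FFFFF)
#define F32_QUIET_BIT UINT32_C(0x00400000)

/* A NaN has all exponent bits set and a non-zero fraction. */
int f32_is_nan(uint32_t bits)
{
    return (bits & F32_EXP_MASK) == F32_EXP_MASK &&
           (bits & F32_FRAC_MASK) != 0;
}

/* An SNaN is a NaN whose most significant fraction bit is clear. */
int f32_is_snan(uint32_t bits)
{
    return f32_is_nan(bits) && (bits & F32_QUIET_BIT) == 0;
}

/* Q(x): a QNaN is returned unchanged; an SNaN has its quiet bit set,
   keeping the sign and the remaining fraction bits. */
uint32_t f32_quietize(uint32_t bits)
{
    assert(f32_is_nan(bits));
    return bits | F32_QUIET_BIT;
}
```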

The notations for output values in Table 2-2 are:

- \(+INF\): positive infinity, \(-INF\): negative infinity. When the result depends on a conditional expression, both values are listed in the result column and the condition is described in the comment column.
- QNaNIndefinite represents the QNaN which has the sign bit equal to 1, the most significant bit of the significand (fraction) field equal to 1, and the remaining significand field bits equal to 0.
- The summation or subtraction of 0s or identical values in an FMA operation can lead to the situations shown in Table 2-1.

<table>
<thead>
<tr>
<th>x*y</th>
<th>z</th>
<th>(x*y) + z</th>
<th>(x*y) - z</th>
<th>- (x*y) + z</th>
<th>- (x*y) - z</th>
</tr>
</thead>
<tbody>
<tr>
<td>(+0)</td>
<td>(+0)</td>
<td>+0 in all rounding modes</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>- 0 in all rounding modes</td>
</tr>
<tr>
<td>(+0)</td>
<td>(-0)</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>+0 in all rounding modes</td>
<td>- 0 in all rounding modes</td>
<td>- 0 when rounding down, and +0 otherwise</td>
</tr>
<tr>
<td>(-0)</td>
<td>(+0)</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>- 0 in all rounding modes</td>
<td>+ 0 in all rounding modes</td>
<td>- 0 when rounding down, and +0 otherwise</td>
</tr>
<tr>
<td>(-0)</td>
<td>(-0)</td>
<td>- 0 in all rounding modes</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>+ 0 in all rounding modes</td>
</tr>
<tr>
<td>F</td>
<td>-F</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>2*F</td>
<td>-2*F</td>
<td>- 0 when rounding down, and +0 otherwise</td>
</tr>
<tr>
<td>F</td>
<td>F</td>
<td>2*F</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>- 0 when rounding down, and +0 otherwise</td>
<td>-2*F</td>
</tr>
</tbody>
</table>

Table 2-2. FMA Numeric Behavior

<table>
<thead>
<tr>
<th>x</th>
<th>y</th>
<th>z</th>
<th>\( r = (x \times y) + z \)</th>
<th>\( r = (x \times y) - z \)</th>
<th>\( r = -(x \times y) + z \)</th>
<th>\( r = -(x \times y) - z \)</th>
<th>Comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>NaN</td>
<td>0, F, INF, NaN</td>
<td>0, F, INF, NaN</td>
<td>Q(x)</td>
<td>Q(x)</td>
<td>Q(x)</td>
<td>Q(x)</td>
<td>Signal invalid exception if \( x \) or \( y \) or \( z \) is SNaN</td>
</tr>
<tr>
<td>0, F, INF</td>
<td>NaN</td>
<td>0, F, INF, NaN</td>
<td>Q(y)</td>
<td>Q(y)</td>
<td>Q(y)</td>
<td>Q(y)</td>
<td>Signal invalid exception if \( y \) or \( z \) is SNaN</td>
</tr>
<tr>
<td>0, F, INF</td>
<td>0, F, INF</td>
<td>NaN</td>
<td>Q(z)</td>
<td>Q(z)</td>
<td>Q(z)</td>
<td>Q(z)</td>
<td>Signal invalid exception if \( z \) is SNaN</td>
</tr>
<tr>
<td>INF</td>
<td>F, INF</td>
<td>+INF</td>
<td>+INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>-INF</td>
<td>if \( x \times y \) and \( z \) have the same sign</td>
</tr>
<tr>
<td>INF</td>
<td>F, INF</td>
<td>-INF</td>
<td>-INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>+INF</td>
<td>if \( x \times y \) and \( z \) have the same sign</td>
</tr>
<tr>
<td>INF</td>
<td>F, INF</td>
<td>0, F</td>
<td>+INF</td>
<td>+INF</td>
<td>-INF</td>
<td>-INF</td>
<td>if \( x \) and \( y \) have the same sign</td>
</tr>
<tr>
<td>INF</td>
<td>0</td>
<td>0, F, INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>Signal invalid exception</td>
</tr>
<tr>
<td>0</td>
<td>INF</td>
<td>0, F, INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>Signal invalid exception</td>
</tr>
<tr>
<td>INF</td>
<td>+INF</td>
<td>+INF</td>
<td>+INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>-INF</td>
<td>if \( x \) and \( y \) have the same sign</td>
</tr>
<tr>
<td>INF</td>
<td>-INF</td>
<td>-INF</td>
<td>-INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>+INF</td>
<td>if \( x \) and \( y \) have opposite signs</td>
</tr>
<tr>
<td>F</td>
<td>INF</td>
<td>+INF</td>
<td>+INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>-INF</td>
<td>if \( x \) and \( y \) have the same sign</td>
</tr>
<tr>
<td>F</td>
<td>INF</td>
<td>-INF</td>
<td>-INF</td>
<td>QNaNIndefinite</td>
<td>QNaNIndefinite</td>
<td>+INF</td>
<td>if \( x \) and \( y \) have opposite signs</td>
</tr>
<tr>
<td>F</td>
<td>INF</td>
<td>0, F</td>
<td>+INF</td>
<td>+INF</td>
<td>-INF</td>
<td>-INF</td>
<td>if \( x \times y &gt; 0 \)</td>
</tr>
<tr>
<td>F</td>
<td>INF</td>
<td>0, F</td>
<td>-INF</td>
<td>-INF</td>
<td>+INF</td>
<td>+INF</td>
<td>if \( x \times y &lt; 0 \)</td>
</tr>
<tr>
<td>0, F</td>
<td>0, F</td>
<td>+INF</td>
<td>+INF</td>
<td>-INF</td>
<td>+INF</td>
<td>-INF</td>
<td>if \( z &gt; 0 \)</td>
</tr>
<tr>
<td>0, F</td>
<td>0, F</td>
<td>-INF</td>
<td>-INF</td>
<td>+INF</td>
<td>-INF</td>
<td>+INF</td>
<td>if \( z &lt; 0 \)</td>
</tr>
</tbody>
</table>

If unmasked floating-point exceptions are signaled (invalid operation, denormal operand, overflow, underflow, or inexact result), the result register is left unchanged and a floating-point exception handler is invoked.

2.3.1 FMA Instruction Operand Order and Arithmetic Behavior

FMA instruction mnemonics are defined explicitly with an ordered three-digit sequence, e.g. VFMADD132PD. The value of each digit refers to the ordering of the three source operands as defined by the instruction encoding specification (see Table 4-37):

- 1: The first source operand (also the destination operand) in the syntactical order listed in this specification.
- 2: The second source operand in the syntactical order. This is a YMM/XMM register, encoded using the VEX prefix.
- 3: The third source operand in the syntactical order. The first and third operands are encoded following ModR/M encoding rules.

The ordering of each digit within the mnemonic refers to the floating-point data listed on the right-hand side of the arithmetic equation of each FMA operation (see Table 2-2):

- The first position in the three digit ordering of an FMA mnemonic refers to the first FP data expressed in the arithmetic equation of FMA operation, the multiplicand.
- The second position in the three digit FMA mnemonic refers to the second FP data expressed in the arithmetic equation of FMA operation, the multiplier.
- The third position in the three digit FMA mnemonic refers to the FP data being added/subtracted to the multiplication result.

Note that the non-numerical results of an FMA operation do not follow the mathematically defined commutative property between the multiplicand and the multiplier values (see Table 2-2): when more than one input is a NaN, the NaN that is returned depends on operand position. Consequently, software tools (such as an assembler) may support a complementary set of FMA mnemonics for each FMA instruction for ease of programming, to take advantage of the mathematical property of commutative multiplication. For example, an assembler may optionally support the complementary mnemonic "VFMADD312PD" in addition to the true mnemonic "VFMADD132PD". The assembler will generate the same instruction opcode sequence corresponding to VFMADD132PD. The processor executes VFMADD132PD and reports any NaN conditions based on the definition of VFMADD132PD. Similarly, if the complementary mnemonic VFMADD123PD is supported by an assembler at source level, it must generate the opcode sequence corresponding to VFMADD213PD; the complementary mnemonic VFMADD321PD must produce the opcode sequence defined by VFMADD231PD. In the absence of FMA operations reporting a NaN result, the numerical results of using either mnemonic with an assembler supporting both mnemonics will match the behavior defined in Table 2-2. Support for the complementary FMA mnemonics by software tools is optional.
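
The digit convention can be modeled in a few lines of C. This is a sketch for illustration only: `fmadd_by_digits` is a hypothetical helper, it works on scalar doubles rather than vectors, and it uses an ordinary multiply and add rather than a fused operation because only the operand ordering is being shown.

```c
#include <assert.h>

/* digits is the three-digit part of the mnemonic, e.g. "132".
   op1, op2, op3 are the operands in syntactical order (op1 is also the
   destination in the real instruction). The first digit selects the
   multiplicand, the second the multiplier, the third the addend. */
double fmadd_by_digits(const char *digits, double op1, double op2, double op3)
{
    const double operand[3] = { op1, op2, op3 };
    for (int i = 0; i < 3; i++)
        assert(digits[i] >= '1' && digits[i] <= '3');
    double multiplicand = operand[digits[0] - '1'];
    double multiplier   = operand[digits[1] - '1'];
    double addend       = operand[digits[2] - '1'];
    return multiplicand * multiplier + addend;
}
```

With operands 2, 3 and 5, the digits "132" give 2 × 5 + 3 = 13, "213" give 3 × 2 + 5 = 11, and "231" give 3 × 5 + 2 = 17. The complementary form "312" gives 5 × 2 + 3 = 13, the same numerical result as "132", which is why an assembler can map it to the same opcode.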

2.4 ACCESSING YMM REGISTERS

The lower 128 bits of a YMM register are aliased to the corresponding XMM register. Legacy SSE instructions (i.e. SIMD instructions operating on XMM state but not using the VEX prefix, also referred to as non-VEX encoded SIMD instructions) will not access the upper bits (255:128) of the YMM registers. AVX and FMA instructions with a VEX prefix and a vector length of 128 bits zero the upper 128 bits of the YMM register. See Chapter 2, "Programming Considerations with 128-bit SIMD Instructions" for more details.

Upper bits of YMM registers (255:128) can be read and written by many instructions with a VEX.256 prefix.

XSAVE and XRSTOR may be used to save and restore the upper bits of the YMM registers.
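
The three write behaviors described in this section can be summarized with a small software model. The type `ymm_model` and the `write_*` functions are hypothetical and exist only to illustrate which bytes of the register each class of instruction affects.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one YMM register: bytes 0-15 alias the XMM
   register, bytes 16-31 are bits 255:128. */
typedef struct { uint8_t byte[32]; } ymm_model;

/* Legacy (non-VEX) SSE write: the lower 128 bits are updated and the
   upper 128 bits are not accessed, so they keep their old value. */
void write_legacy_sse(ymm_model *r, const uint8_t value[16])
{
    memcpy(r->byte, value, 16);
}

/* VEX.128 write: the lower 128 bits are updated and the upper 128 bits
   are zeroed. */
void write_vex128(ymm_model *r, const uint8_t value[16])
{
    memcpy(r->byte, value, 16);
    memset(r->byte + 16, 0, 16);
}

/* VEX.256 write: all 256 bits are updated. */
void write_vex256(ymm_model *r, const uint8_t value[32])
{
    memcpy(r->byte, value, 32);
}
```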

2.5 MEMORY ALIGNMENT

Memory alignment requirements on VEX-encoded instructions differ from those on non-VEX-encoded instructions. Memory alignment applies to non-VEX-encoded SIMD instructions in three categories:

- Explicitly-aligned SIMD load and store instructions accessing 16 bytes of memory (e.g. MOVAPD, MOVAPS, MOVDQA, etc.). These instructions always require memory address to be aligned on 16-byte boundary.
- Explicitly-unaligned SIMD load and store instructions accessing 16 bytes or less of data from memory (e.g. MOVUPD, MOVUPS, MOVDQU, MOVQ, MOVD, etc.). These instructions do not require memory address to be aligned on 16-byte boundary.
- The vast majority of arithmetic and data processing instructions in legacy SSE instructions (non-VEX-encoded SIMD instructions) support memory access semantics. When these instructions access 16 bytes of data from memory, the memory address must be aligned on 16-byte boundary.

Most arithmetic and data processing instructions encoded using the VEX prefix and performing memory accesses have more flexible memory alignment requirements than instructions that are encoded without the VEX prefix. Specifically,

- With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. a VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load operation by default. Memory arguments for most instructions with a VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that have explicit memory alignment requirements are listed in Table 2-4.

Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued.

Atomic memory operation in Intel 64 and IA-32 architecture is guaranteed only for a subset of memory operand sizes and alignment scenarios. The list of guaranteed atomic operations is described in Section 7.1.1 of IA-32 Intel® Architecture Software Developer’s Manual, Volume 3A. AVX and FMA instructions do not introduce any new guaranteed atomic memory operations.

AVX and FMA will generate an #AC(0) fault on misaligned 4 or 8-byte memory references in Ring-3 when CR0.AM=1 and EFLAGS.AC=1. 16 and 32-byte memory references will not generate an #AC(0) fault. See Table 2-3 for details.

Certain AVX instructions always require 16- or 32-byte alignment (see the complete list of such instructions in Table 2-4). These instructions will #GP(0) if not aligned to 16-byte boundaries (for 16-byte granularity loads and stores) or 32-byte boundaries (for 32-byte loads and stores).

Table 2-3. Alignment Faulting Conditions when Memory Access is Not Aligned

<table>
<thead>
<tr>
<th>Instruction Set</th>
<th>Instruction Type</th>
<th>EFLAGS.AC==1 &amp;&amp; Ring-3 &amp;&amp; CR0.AM==1 is 0</th>
<th>EFLAGS.AC==1 &amp;&amp; Ring-3 &amp;&amp; CR0.AM==1 is 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>AVX, FMA</td>
<td>16- or 32-byte “explicitly unaligned” loads and stores (see Table 2-5)</td>
<td>no fault</td>
<td>no fault</td>
</tr>
<tr>
<td>AVX, FMA</td>
<td>VEX op YMM, m256</td>
<td>no fault</td>
<td>no fault</td>
</tr>
<tr>
<td>AVX, FMA</td>
<td>VEX op XMM, m128</td>
<td>no fault</td>
<td>no fault</td>
</tr>
<tr>
<td>AVX, FMA</td>
<td>“explicitly aligned” loads and stores (see Table 2-4)</td>
<td>#GP(0)</td>
<td>#GP(0)</td>
</tr>
<tr>
<td>AVX, FMA</td>
<td>2, 4, or 8-byte loads and stores</td>
<td>no fault</td>
<td>#AC(0)</td>
</tr>
<tr>
<td>SSE</td>
<td>16-byte “explicitly unaligned” loads and stores (see Table 2-5)</td>
<td>no fault</td>
<td>no fault</td>
</tr>
<tr>
<td>SSE</td>
<td>op XMM, m128</td>
<td>#GP(0)</td>
<td>#GP(0)</td>
</tr>
<tr>
<td>SSE</td>
<td>“explicitly aligned” loads and stores (see Table 2-4)</td>
<td>#GP(0)</td>
<td>#GP(0)</td>
</tr>
<tr>
<td>SSE</td>
<td>2, 4, or 8-byte loads and stores</td>
<td>no fault</td>
<td>#AC(0)</td>
</tr>
</tbody>
</table>
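
The faulting conditions of Table 2-3 can be expressed as a small decision function. The sketch below is illustrative only: the type and function names (`access_kind_t`, `fault_t`, `alignment_fault`) are hypothetical, and the `ac_enabled` parameter stands for the combined condition EFLAGS.AC==1 && Ring-3 && CR0.AM==1. It models only the alignment-related faults, not the other exception causes described in Section 2.7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { FAULT_NONE, FAULT_GP0, FAULT_AC0 } fault_t;

typedef enum {
    ACCESS_UNALIGNED_OK,     /* VEX op with m128/m256 operand, or an explicitly
                                unaligned move such as (V)MOVUPS (Table 2-5) */
    ACCESS_EXPLICIT_ALIGNED, /* (V)MOVAPS, (V)MOVDQA, ... (Table 2-4) */
    ACCESS_LEGACY_SSE_M128,  /* non-VEX "op XMM, m128" */
    ACCESS_SMALL             /* 2-, 4-, or 8-byte load or store */
} access_kind_t;

/* size must be a nonzero power of two (2, 4, 8, 16 or 32 here). */
static bool is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (uintptr_t)(size - 1)) == 0;
}

/* Alignment-related fault for one memory access, following Table 2-3. */
static fault_t alignment_fault(access_kind_t kind, uintptr_t addr,
                               size_t size, bool ac_enabled)
{
    if (is_aligned(addr, size)) {
        return FAULT_NONE; /* Table 2-3 only covers misaligned accesses */
    }
    switch (kind) {
    case ACCESS_EXPLICIT_ALIGNED:
    case ACCESS_LEGACY_SSE_M128:
        return FAULT_GP0;                           /* regardless of ac_enabled */
    case ACCESS_SMALL:
        return ac_enabled ? FAULT_AC0 : FAULT_NONE; /* #AC(0) only when enabled */
    case ACCESS_UNALIGNED_OK:
    default:
        return FAULT_NONE;
    }
}
```

For example, a 32-byte explicitly aligned store to an address that is only 16-byte aligned yields `FAULT_GP0`, while a VEX-encoded arithmetic instruction reading the same address yields `FAULT_NONE`.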

Table 2-4. Instructions Requiring Explicitly Aligned Memory

<table>
<thead>
<tr>
<th>Require 16-byte alignment</th>
<th>Require 32-byte alignment</th>
</tr>
</thead>
<tbody>
<tr>
<td>(V)MOVDQA xmm, m128</td>
<td>VMOVDQA ymm, m256</td>
</tr>
<tr>
<td>(V)MOVDQA m128, xmm</td>
<td>VMOVDQA m256, ymm</td>
</tr>
<tr>
<td>(V)MOVAPS xmm, m128</td>
<td>VMOVAPS ymm, m256</td>
</tr>
<tr>
<td>(V)MOVAPS m128, xmm</td>
<td>VMOVAPS m256, ymm</td>
</tr>
<tr>
<td>(V)MOVAPD xmm, m128</td>
<td>VMOVAPD ymm, m256</td>
</tr>
<tr>
<td>(V)MOVAPD m128, xmm</td>
<td>VMOVAPD m256, ymm</td>
</tr>
<tr>
<td>(V)MOVNTPS m128, xmm</td>
<td>VMOVNTPS m256, ymm</td>
</tr>
<tr>
<td>(V)MOVNTPD m128, xmm</td>
<td>VMOVNTPD m256, ymm</td>
</tr>
<tr>
<td>(V)MOVNTDQ m128, xmm</td>
<td>VMOVNTDQ m256, ymm</td>
</tr>
<tr>
<td>(V)MOVNTDQA xmm, m128</td>
<td></td>
</tr>
</tbody>
</table>

Table 2-5. Instructions Not Requiring Explicit Memory Alignment

(V)MOVDQU xmm, m128
(V)MOVDQU m128, xmm
(V)MOVUPS xmm, m128
(V)MOVUPS m128, xmm
(V)MOVUPD xmm, m128
(V)MOVUPD m128, xmm
VMOVDQU ymm, m256
VMOVDQU m256, ymm
VMOVUPS ymm, m256
VMOVUPS m256, ymm
VMOVUPD ymm, m256
VMOVUPD m256, ymm

2.6 SIMD FLOATING-POINT EXCEPTIONS

AVX and FMA instructions can generate SIMD floating-point exceptions (#XM) and respond to exception masks in the same way as Legacy SSE instructions. When CR4.OSXMMEXCPT=0, any unmasked FP exception generates an Undefined Opcode exception (#UD).

AVX FP exceptions are created in a similar fashion (differing only in the number of elements) to Legacy SSE and SSE2 instructions capable of generating SIMD floating-point exceptions.

AVX introduces no new arithmetic operations (AVX floating-point instructions are analogues of existing Legacy SSE instructions). FMA introduces new arithmetic operations; detailed FMA numeric behavior is described in Section 2.3.

2.7 INSTRUCTION EXCEPTION SPECIFICATION

To use this reference of instruction exceptions, look at each instruction for a description of the particular exception type of interest. For example, ADDPS contains the entry:

"See Exceptions Type 2"

In this entry, "Type 2" can be looked up in Table 2-6.

The instruction’s corresponding CPUID feature flag can be identified in the fourth column of the Instruction summary table.

Note: In a virtualized environment, #UD is not guaranteed when a CPUID feature flag reads 0 if the underlying hardware supports the feature.

Table 2-6. Exception Class Description

<table>
<thead>
<tr>
<th>Exception Class</th>
<th>Instruction set</th>
<th>Mem arg</th>
<th>Floating-Point Exceptions (#XM)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Type 1</td>
<td>AVX, Legacy SSE</td>
<td>16/32 byte; explicitly aligned</td>
<td>none</td>
</tr>
<tr>
<td>Type 2</td>
<td>AVX, FMA, Legacy SSE</td>
<td>16/32 byte; not explicitly aligned with VEX prefix; explicitly aligned without VEX.</td>
<td>yes</td>
</tr>
<tr>
<td>Type 3</td>
<td>AVX, FMA, Legacy SSE</td>
<td>&lt; 16 byte</td>
<td>yes</td>
</tr>
<tr>
<td>Type 4</td>
<td>AVX, Legacy SSE</td>
<td>16/32 byte; not explicitly aligned with VEX prefix; explicitly aligned without VEX.</td>
<td>no</td>
</tr>
<tr>
<td>Type 5</td>
<td>AVX, Legacy SSE</td>
<td>&lt; 16 byte</td>
<td>no</td>
</tr>
<tr>
<td>Type 6</td>
<td>AVX (no Legacy SSE)</td>
<td>Varies</td>
<td>(At present, none do)</td>
</tr>
<tr>
<td>Type 7</td>
<td>AVX, Legacy SSE</td>
<td>none</td>
<td>none</td>
</tr>
<tr>
<td>Type 8</td>
<td>AVX</td>
<td>none</td>
<td>none</td>
</tr>
<tr>
<td>Type 9</td>
<td>AVX</td>
<td>4 byte</td>
<td>none</td>
</tr>
<tr>
<td>Type 10</td>
<td>AVX, Legacy SSE</td>
<td>16/32 byte; not explicitly aligned</td>
<td>no</td>
</tr>
</tbody>
</table>

See Table 2-7 for lists of instructions in each exception class.

Table 2-7. Instructions in each Exception Class

<table>
<thead>
<tr>
<th>Exception Class</th>
<th>Instruction</th>
</tr>
</thead>
<tbody>
<tr>
<td>Type 1</td>
<td>(V)MOVAPD, (V)MOVAPS, (V)MOVDQA, (V)MOVNTDQ, (V)MOVNTDQA, (V)MOVNTPD, (V)MOVNTPS</td>
</tr>
<tr>
<td>Type 2</td>
<td>(V)ADDPD, (V)ADDPS, (V)ADDSUBPD, (V)ADDSUBPS, (V)CMPPD, (V)CMPPS, (V)CVTDQ2PS, (V)CVTPD2DQ, (V)CVTPD2PS, (V)CVTPS2DQ, (V)CVTTPD2DQ, (V)CVTTPS2DQ, (V)DIVPD, (V)DIVPS, (V)DPPD*, (V)DPPS*, VFMADD132PD, VFMADD213PD, VFMADD231PD, VFMADD132PS, VFMADD213PS, VFMADD231PS, VFMADDSUB132PD, VFMADDSUB213PD, VFMADDSUB231PD, VFMADDSUB132PS, VFMADDSUB213PS, VFMADDSUB231PS, VFMSUBADD132PD, VFMSUBADD213PD, VFMSUBADD231PD, VFMSUBADD132PS, VFMSUBADD213PS, VFMSUBADD231PS, VFMSUB132PD, VFMSUB213PD, VFMSUB231PD, VFMSUB132PS, VFMSUB213PS, VFMSUB231PS, VFNMADD132PD, VFNMADD213PD, VFNMADD231PD, VFNMADD132PS, VFNMADD213PS, VFNMADD231PS, VFNMSUB132PD, VFNMSUB213PD, VFNMSUB231PD, VFNMSUB132PS, VFNMSUB213PS, VFNMSUB231PS, (V)HADDPD, (V)HADDPS, (V)HSUBPD, (V)HSUBPS, (V)MAXPD, (V)MAXPS, (V)MINPD, (V)MINPS, (V)MULPD, (V)MULPS, (V)ROUNDPD, (V)ROUNDPS, (V)SQRTPD, (V)SQRTPS, (V)SUBPD, (V)SUBPS</td>
</tr>
<tr>
<td>Type 3</td>
<td>(V)ADDSD, (V)ADDSS, (V)CMPSD, (V)CMPSS, (V)COMISD, (V)COMISS, (V)CVTPS2PD, (V)CVTSD2SI, (V)CVTSD2SS, (V)CVTSI2SD, (V)CVTSI2SS, (V)CVTSS2SD, (V)CVTSS2SI, (V)CVTTSD2SI, (V)CVTTSS2SI, (V)DIVSD, (V)DIVSS, VFMADD132SD, VFMADD213SD, VFMADD231SD, VFMADD132SS, VFMADD213SS, VFMADD231SS, VFMSUB132SD, VFMSUB213SD, VFMSUB231SD, VFMSUB132SS, VFMSUB213SS, VFMSUB231SS, VFNMADD132SD, VFNMADD213SD, VFNMADD231SD, VFNMADD132SS, VFNMADD213SS, VFNMADD231SS, VFNMSUB132SD, VFNMSUB213SD, VFNMSUB231SD, VFNMSUB132SS, VFNMSUB213SS, VFNMSUB231SS, (V)MAXSD, (V)MAXSS, (V)MINSD, (V)MINSS, (V)MULSD, (V)MULSS, (V)ROUNDSD, (V)ROUNDSS, (V)SQRTSD, (V)SQRTSS, (V)SUBSD, (V)SUBSS, (V)UCOMISD, (V)UCOMISS</td>
</tr>
<tr>
<td>Type 4</td>
<td>(V)AESDEC, (V)AESDECLAST, (V)AESENC, (V)AESENCLAST, (V)AESIMC, (V)AESKEYGENASSIST, (V)ANDPD, (V)ANDPS, (V)ANDNPD, (V)ANDNPS, (V)BLENDPD, (V)BLENDPS, VBLENDVPD, VBLENDVPS, (V)MASKMOVDQU, (V)PTEST, VTESTPS, VTESTPD, (V)MOVSHDUP, (V)MOVSLDUP, (V)MPSADBW, (V)ORPD, (V)ORPS, (V)PABSB, (V)PABSW, (V)PABSD, (V)PACKSSWB, (V)PACKSSDW, (V)PACKUSWB, (V)PACKUSDW, (V)PADDB, (V)PADDW, (V)PADDD, (V)PADDQ, (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW, (V)PALIGNR, (V)PAND, (V)PANDN, (V)PAVGB, (V)PAVGW, (V)PBLENDVB, (V)PBLENDW, (V)PCMP(E/I)STRI/M, (V)PCMPEQB, (V)PCMPEQW, (V)PCMPEQD, (V)PCMPEQQ, (V)PCMPGTB, (V)PCMPGTW, (V)PCMPGTD, (V)PCMPGTQ, (V)PCLMULQDQ, (V)PHADDW, (V)PHADDD, (V)PHADDSW, (V)PHMINPOSUW, (V)PHSUBD, (V)PHSUBW, (V)PHSUBSW, (V)PMADDWD, (V)PMADDUBSW, (V)PMAXSB, (V)PMAXSW, (V)PMAXSD, (V)PMAXUB, (V)PMAXUW, (V)PMAXUD, (V)PMINSB, (V)PMINSW, (V)PMINSD, (V)PMINUB, (V)PMINUW, (V)PMINUD, (V)PMULHUW, (V)PMULHRSW, (V)PMULHW, (V)PMULLW, (V)PMULLD, (V)PMULUDQ, (V)PMULDQ, (V)POR, (V)PSADBW, (V)PSHUFB, (V)PSHUFD, (V)PSHUFHW, (V)PSHUFLW, (V)PSIGNB, (V)PSIGNW, (V)PSIGND, (V)PSLLW, (V)PSLLD, (V)PSLLQ, (V)PSRAW, (V)PSRAD, (V)PSRLW, (V)PSRLD, (V)PSRLQ, (V)PSUBB, (V)PSUBW, (V)PSUBD, (V)PSUBQ, (V)PSUBSB, (V)PSUBSW, (V)PUNPCKHBW, (V)PUNPCKHWD, (V)PUNPCKHDQ, (V)PUNPCKHQDQ, (V)PUNPCKLBW, (V)PUNPCKLWD, (V)PUNPCKLDQ, (V)PUNPCKLQDQ, (V)PXOR, (V)RCPPS, (V)RSQRTPS, (V)SHUFPD, (V)SHUFPS, (V)UNPCKHPD, (V)UNPCKHPS, (V)UNPCKLPD, (V)UNPCKLPS, (V)XORPD, (V)XORPS</td>
</tr>
<tr>
<td>Type 5</td>
<td>(V)CVTDQ2PD, (V)EXTRACTPS, (V)INSERTPS, (V)MOVD, (V)MOVQ, (V)MOVDDUP, (V)MOVLPD, (V)MOVLPS, (V)MOVHPD, (V)MOVHPS, (V)MOVSD, (V)MOVSS, (V)PEXTRB, (V)PEXTRD, (V)PEXTRW, (V)PEXTRQ, (V)PINSRB, (V)PINSRD, (V)PINSRW, (V)PINSRQ, (V)RCPSS, (V)RSQRTSS, (V)PMOVSX/ZX</td>
</tr>
<tr>
<td>Type 6</td>
<td>VEXTRACTF128, VPERMILPD, VPERMILPS, VPERM2F128, VBROADCASTSS, VBROADCASTSD, VBROADCASTF128, VINSERTF128, VMASKMOVPS**, VMASKMOVPD**</td>
</tr>
<tr>
<td>Type 7</td>
<td>(V)MOVLHPS, (V)MOVHLPS, (V)MOVMSKPD, (V)MOVMSKPS, (V)PMOVMSKB, (V)PSLLDQ, (V)PSRLDQ, (V)PSLLW, (V)PSLLD, (V)PSLLQ, (V)PSRAW, (V)PSRAD, (V)PSRLW, (V)PSRLD, (V)PSRLQ</td>
</tr>
<tr>
<td>Type 8</td>
<td>VZEROALL, VZEROUPPER</td>
</tr>
<tr>
<td>Type 9</td>
<td>VLDMXCSR*, VSTMXCSR</td>
</tr>
<tr>
<td>Type 10</td>
<td>(V)LDDQU, (V)MOVDQU*, (V)MOVUPD*, (V)MOVUPS*</td>
</tr>
</tbody>
</table>

(*) - Additional exception restrictions are present; see the instruction description for details.

(**) - Alignment check reporting behaves the same with mask bits of less than all 1s as with mask bits of all 1s, i.e. no alignment checks are performed.

Table 2-7 classifies exception behaviors for AVX instructions. Within each class of exception conditions listed in Table 2-9 through Table 2-15, certain subsets of AVX instructions may be subject to a #UD exception depending on the encoded value of the VEX.L field. Table 2-8 lists the AVX instructions that may raise #UD if encoded with an incorrect value in the VEX.L field.

Table 2-8. #UD Exception and VEX.L Field Encoding

<table>
<thead>
<tr>
<th>Exception Class</th>
<th>#UD If VEX.L = 0</th>
<th>#UD If VEX.L = 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Type 1</td>
<td></td>
<td>VMOVNTDQA</td>
</tr>
<tr>
<td>Type 2</td>
<td></td>
<td>VDPPD</td>
</tr>
<tr>
<td>Type 3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Type 5</td>
<td></td>
<td>VEXTRACTPS, VINSERTPS, VMOVD, VMOVQ, VMOVLPD, VMOVLPS, VMOVHPD, VMOVHPS, VPEXTRB, VPEXTRD, VPEXTRW, VPEXTRQ, VPINSRB, VPINSRD, VPINSRW, VPINSRQ, VPMOVSX/ZX</td>
</tr>
<tr>
<td>Type 6</td>
<td>VEXTRACTF128, VPERM2F128, VBROADCASTSD, VBROADCASTF128, VINSERTF128</td>
<td></td>
</tr>
<tr>
<td>Type 7</td>
<td></td>
<td>VMOVLHPS, VMOVHLPS, VPMOVMSKB, VPSLLDQ, VPSRLDQ, VPSLLW, VPSLLD, VPSLLQ, VPSRAW, VPSRAD, VPSRLW, VPSRLD, VPSRLQ</td>
</tr>
<tr>
<td>Type 9</td>
<td></td>
<td>VLDMXCSR, VSTMXCSR</td>
</tr>
</tbody>
</table>

#### 2.7.1 Exceptions Type 1 (Aligned memory reference)

Table 2-9. Type 1 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>VEX.256: Memory operand is not 32-byte aligned. VEX.128: Memory operand is not 16-byte aligned.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE: Memory operand is not 16-byte aligned</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
</tbody>
</table>

#### 2.7.2 Exceptions Type 2 (>=16 Byte Memory Reference, Unaligned with VEX prefix)

Table 2-10. Type 2 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>VEX prefix:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If XFEATURE_ENABLED_MASK[2:1] != '11b'.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE instruction:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR0.EM[bit 2] = 1.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3] = 1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE: Memory operand is not 16-byte aligned</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td>X</td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>For a page fault</td>
</tr>
<tr>
<td>SIMD Floating-point Exception, #XM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1</td>
</tr>
</tbody>
</table>

#### 2.7.3 Exceptions Type 3 (<16 Byte memory argument)

Table 2-11. Type 3 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix:</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18]=0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>Legacy SSE instruction:</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If CR0.EM[bit 2] = 1.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If any corresponding CPUID feature flag is ‘0’</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>For a page fault</td>
</tr>
<tr>
<td>Alignment Check #AC(0)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.</td>
</tr>
<tr>
<td>SIMD Floating-point Exception, #XM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If an unmasked SIMD floating-point exception and CR4.OSXMMEXCPT[bit 10] = 1</td>
</tr>
</tbody>
</table>

#### 2.7.4 Exceptions Type 4 (>=16 Byte, mem arg no alignment with VEX prefix, no floating-point exceptions)

Table 2-12. Type 4 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE: Memory operand is not 16-byte aligned</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
<tr>
<td>Page Fault, #PF(fault-code)</td>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>For a page fault</td>
</tr>
</tbody>
</table>

#### 2.7.5 Exceptions Type 5 (<16 Byte mem arg and no FP exceptions)

Table 2-13. Type 5 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>VEX prefix:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18]=0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE instruction:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR0.EM[bit 2] = 1.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If any corresponding CPUID feature flag is ‘0’</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td></td>
<td></td>
<td>X</td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>For a page fault</td>
</tr>
<tr>
<td>Alignment Check #AC(0)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.</td>
</tr>
</tbody>
</table>

#### 2.7.6 Exceptions Type 6 (VEX-Encoded Instructions Without Legacy SSE Analogues)

Note: At present, the AVX instructions in this category do not generate floating-point exceptions.

Table 2-14. Type 6 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If XFEATURE_ENABLED_MASK[2:1] != '11b'.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18]=0.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td></td>
<td></td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>For a page fault</td>
</tr>
<tr>
<td>Alignment Check #AC(0)</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>For 4 or 8 byte memory references if alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.</td>
</tr>
</tbody>
</table>

#### 2.7.7 Exceptions Type 7 (No FP exceptions, no memory arg)

Table 2-15. Type 7 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>X</td>
<td>VEX prefix:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td>X</td>
<td>Legacy SSE instruction:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR0.EM[bit 2] = 1.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If CR0.TS[bit 3] = 1</td>
</tr>
</tbody>
</table>

#### 2.7.8 Exceptions Type 8 (AVX and no memory argument)

Table 2-16. Type 8 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>Always in Real or Virtual 80x86 mode</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR4.OSXSAVE[bit 18] = 0.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CPUID.01H.ECX.AVX[bit 28] = 0.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If VEX.vvvv != 1111B.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If CR0.TS[bit 3] = 1</td>
</tr>
</tbody>
</table>

#### 2.7.9 Exception Type 9 (AVX)

Table 2-17. Type 9 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Always in Real or Virtual 80x86 mode</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If CR4.OSXSAVE[bit 18]=0. If CR0.EM[bit 2] = 1. If CPUID.01H.ECX.AVX[bit 28]=0</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If VEX.L = 1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td>X</td>
<td></td>
<td>X</td>
<td>X</td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td>X</td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>For a page fault</td>
</tr>
<tr>
<td>Alignment Check #AC(0)</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td></td>
<td>If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3.</td>
</tr>
</tbody>
</table>

#### 2.7.10 Exception Type 10 (>=16 Byte mem arg no alignment, no floating-point exceptions)

Table 2-18. Type 10 Class Exception Conditions

<table>
<thead>
<tr>
<th>Exception</th>
<th>Real</th>
<th>Virtual 80x86</th>
<th>Protected and Compatibility</th>
<th>64-bit</th>
<th>Cause of Exception</th>
</tr>
</thead>
<tbody>
<tr>
<td>Invalid Opcode, #UD</td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>VEX prefix</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>VEX prefix: If CR4.OSXSAVE[bit 18]=0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE instruction: If CR0.EM[bit 2] = 1. If CR4.OSFXSR[bit 9] = 0.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If preceded by a LOCK prefix (F0H)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>X</td>
<td>If any REX, F2, F3, or 66 prefixes precede a VEX prefix</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If any corresponding CPUID feature flag is '0'</td>
</tr>
<tr>
<td>Device Not Available, #NM</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>If CR0.TS[bit 3]=1</td>
</tr>
<tr>
<td>Stack, SS(0)</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal address in the SS segment</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If a memory address referencing the SS segment is in a non-canonical form</td>
</tr>
<tr>
<td>Alignment Check #AC(0)</td>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>Legacy SSE: If alignment checking is enabled and an unaligned memory reference is made while the current privilege level is 3</td>
</tr>
<tr>
<td>General Protection, #GP(0)</td>
<td></td>
<td></td>
<td>X</td>
<td></td>
<td>For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>X</td>
<td>If the memory address is in a non-canonical form.</td>
</tr>
<tr>
<td></td>
<td>X</td>
<td>X</td>
<td></td>
<td></td>
<td>If any part of the operand lies outside the effective address space from 0 to FFFFH</td>
</tr>
<tr>
<td>Page Fault #PF(fault-code)</td>
<td></td>
<td>X</td>
<td>X</td>
<td>X</td>
<td>For a page fault</td>
</tr>
</tbody>
</table>

2.8 PROGRAMMING CONSIDERATIONS WITH 128-BIT SIMD INSTRUCTIONS

VEX-encoded SIMD instructions generally operate on the 256-bit YMM register state. In contrast, non-VEX-encoded instructions (e.g. from SSE to AES) that operate on XMM registers access only the lower 128 bits of the YMM registers. Processors that support both 256-bit VEX-encoded instructions and legacy 128-bit SIMD instructions have internal state to manage the upper and lower halves of the YMM register state. Functionally, VEX-encoded SIMD instructions can be intermixed with Legacy SSE instructions (non-VEX-encoded SIMD instructions operating on XMM registers). However, intermixing VEX-encoded SIMD instructions (AVX, FMA) with Legacy SSE instructions that operate only on the XMM register state incurs a performance penalty.

The general programming considerations to realize optimal performance are the following:

• Minimize transition delays and partial register stalls on YMM register accesses: intermixed 256-bit, 128-bit or scalar SIMD instructions that are encoded with VEX prefixes incur no transition delay due to internal state management. Sequences of Legacy SSE instructions (including SSE2 and subsequent generations of non-VEX-encoded SIMD extensions) that are not intermixed with VEX-encoded SIMD instructions are not subject to transition delays.

• When an application must employ AVX and/or FMA along with Legacy SSE code, it should minimize the number of transitions between VEX-encoded instructions and legacy, non-VEX-encoded SSE code. Section 2.8.1 provides recommendations for software to minimize the impact of transitions between VEX-encoded code and Legacy SSE code.

2.8.1 Clearing Upper YMM State Between AVX and Legacy SSE Instructions

There is no transition penalty if an application clears the upper bits of all YMM registers (set to ‘0’) via VZEROUPPER or VZEROALL before transitioning between AVX instructions and legacy SSE instructions. Note: clearing the upper state via sequences of XORPS or loading ‘0’ values individually may be useful for breaking dependencies, but will not avoid state transition penalties.

Example 1: an application using 256-bit AVX instructions makes calls to a library written using legacy SSE instructions. This would encounter a delay upon executing the first legacy SSE instruction in that library and then (after exiting the library) upon executing the first AVX instruction. To eliminate both of these delays, the user should execute the instruction VZEROUPPER prior to entering the legacy library and (after exiting the library) before executing in a 256-bit AVX code path.

Example 2: a library using 256-bit AVX instructions is intended to support other applications that use legacy SSE instructions. Such a library function should execute VZEROUPPER prior to executing other VEX-encoded instructions. The library function should also issue VZEROUPPER at the end of the function before it returns to the calling application. This prevents the calling application from experiencing a delay when it starts to execute legacy SSE code.

2.8.2 Using AVX 128-bit Instructions Instead of Legacy SSE instructions

Applications using AVX and FMA should migrate legacy 128-bit SIMD instructions to their 128-bit AVX equivalents. AVX supplies the full complement of 128-bit SIMD instructions except for AES and PCLMULQDQ.

2.8.3 Unaligned Memory Access and Buffer Size Management

The majority of AVX instructions support loading 16/32 bytes from memory without alignment restrictions. (A number of non-VEX-encoded SIMD instructions also don’t require 16-byte address alignment, e.g. MOVDQU, MOVUPS, MOVUPD, LDDQU, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM.) A buffer size management issue related to unaligned SIMD memory access is discussed here.

The size requirements for memory buffer allocation should consider unaligned SIMD memory semantics and application usage. Frequently a caller function may pass an address pointer in conjunction with a length parameter. From the caller's perspective, the length parameter usually corresponds to the limit of the allocated memory buffer range, or it may correspond to an application-specific configuration parameter that has only an indirect relationship with the valid buffer size.

For certain types of application usage, it may be desirable to distinguish between the valid buffer range limit and other application-specific parameters related to memory access patterns; examples of the latter are stride distance, frame dimensions, etc. There may be situations in which a callee wishes to load 16 bytes of data, with parts of the 16 bytes lying outside the valid memory buffer region, to take advantage of the efficiency of SIMD load bandwidth and then discard the invalid data elements outside the buffer boundary. An example of this is video processing of frames whose dimensions are not a multiple of 16 bytes.

To support an added margin of safety in situations where buffer size allocation and iterative pointer advancement occur across modules of different software visibility, the standard programming practice of the caller allocating buffer size based on non-SIMD processing requirements should consider an added padding size to support newer SIMD extensions that offer more lax alignment restrictions. The extra padding space can prevent the rare occurrence of the access rights violation described below:

- A present page in the linear address space being used by ring 3 code is followed by a page owned by ring 0 code.
- A caller routine allocates a memory buffer without adding extra pad space and passes the buffer address to a callee routine.
- A callee routine implements an iterative processing algorithm by advancing an address pointer relative to the buffer address using SIMD instructions with unaligned 16/32 byte load semantics.
- The callee routine may choose to load 16/32 bytes near the buffer boundary with the intent to discard invalid data outside the data buffer allocated by the caller.
- If the valid data buffer extends to the end of the present page, unaligned 16/32 byte loads near the end of the present page may spill over to the subsequent ring-0 page and cause a #GP.

As a general rule, the minimal padding size should be the width of the SIMD register that might be used in conjunction with unaligned SIMD memory access.
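The padding rule can be illustrated with a small sketch. It assumes a callee that walks the buffer with full-width loads at a stride equal to the register width; the function names `overrun_bytes` and `padded_allocation` are hypothetical and only serve this illustration.

```python
def overrun_bytes(payload_bytes: int, simd_width_bytes: int) -> int:
    """Bytes read past the payload by a loop of full-width loads.

    Assumes loads start at offsets 0, W, 2W, ... and continue while the
    offset is still inside the payload (W = simd_width_bytes).
    """
    if payload_bytes <= 0:
        return 0
    full_loads = -(-payload_bytes // simd_width_bytes)  # ceiling division
    return full_loads * simd_width_bytes - payload_bytes


def padded_allocation(payload_bytes: int, simd_width_bytes: int = 32) -> int:
    """Allocation size following the rule: pad by one SIMD register width."""
    return payload_bytes + simd_width_bytes


if __name__ == "__main__":
    for width in (16, 32):
        spill = overrun_bytes(100, width)
        print(f"width={width}: overrun={spill}, allocate={padded_allocation(100, width)}")
```

Because the overrun is always smaller than the register width, padding by one full register width keeps every such load inside the allocation.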

2.9 CPUID INSTRUCTION
**CPUID—CPU Identification**

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>64-Bit Mode</th>
<th>Compat/ Leg Mode</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F A2</td>
<td>CPUID</td>
<td>Valid</td>
<td>Valid</td>
<td>Returns processor identification and feature information to the EAX, EBX, ECX, and EDX registers, as determined by input entered in EAX (in some cases, ECX as well).</td>
</tr>
</tbody>
</table>

**Description**

The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can set and clear this flag, the processor executing the procedure supports the CPUID instruction. This instruction operates the same in non-64-bit modes and 64-bit mode.

CPUID returns processor identification and feature information in the EAX, EBX, ECX, and EDX registers. The instruction’s output is dependent on the contents of the EAX register upon execution (in some cases, ECX as well). For example, the following pseudocode loads EAX with 00H and causes CPUID to return a Maximum Return Value and the Vendor Identification String in the appropriate registers:

```
MOV EAX, 00H
CPUID
```

Table 2-19 shows information returned, depending on the initial value loaded into the EAX register. Table 2-20 shows the maximum CPUID input value recognized for each family of IA-32 processors on which CPUID is implemented.

Two types of information are returned: basic and extended function information. If the value entered for CPUID.EAX is invalid for a particular processor, the data for the highest basic information leaf is returned. For example, using the Intel Core 2 Duo E6850 processor, the following is true:

```
CPUID.EAX = 05H (* Returns MONITOR/MWAIT leaf. *)
CPUID.EAX = 0AH (* Returns Architectural Performance Monitoring leaf. *)
CPUID.EAX = 0BH (* INVALID: Returns the same information as CPUID.EAX = 0AH. *)
CPUID.EAX = 80000008H (* Returns virtual/physical address size data. *)
CPUID.EAX = 8000000AH (* INVALID: Returns same information as CPUID.EAX = 0AH. *)
```

When CPUID returns the highest basic leaf information as a result of an invalid input EAX value, any dependence on input ECX value in the basic leaf is honored.
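The fallback rule for invalid input values can be modeled as a small function. This is a sketch of the rule as stated above, not processor code; `effective_leaf` is a hypothetical name, and the maximum basic and extended leaf values are passed in as parameters (0AH and 80000008H in the Core 2 Duo E6850 example).

```python
def effective_leaf(requested: int, max_basic: int, max_extended: int) -> int:
    """Return the leaf whose data CPUID reports for a requested EAX value.

    Valid basic leaves are 0..max_basic; valid extended leaves are
    80000000H..max_extended. Any other input falls back to max_basic.
    """
    if 0 <= requested <= max_basic:
        return requested
    if 0x80000000 <= requested <= max_extended:
        return requested
    return max_basic  # invalid input: highest basic leaf is reported


if __name__ == "__main__":
    for leaf in (0x05, 0x0A, 0x0B, 0x80000008, 0x8000000A):
        print(f"{leaf:08X}H -> {effective_leaf(leaf, 0x0A, 0x80000008):08X}H")
```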

CPUID can be executed at any privilege level to serialize instruction execution. Serializing instruction execution guarantees that any modifications to flags, registers, and memory for previous instructions are completed before the next instruction is fetched and executed.

---

1. On Intel 64 processors, CPUID clears the high 32 bits of the RAX/RBX/RCX/RDX registers in all modes.

**See also:**

**Table 2-19. Information Returned by CPUID Instruction**

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>0H</strong></td>
<td>Basic CPUID Information</td>
</tr>
<tr>
<td>EAX</td>
<td>Maximum Input Value for Basic CPUID Information (see Table 2-20)</td>
</tr>
<tr>
<td>EBX</td>
<td>“Genu”</td>
</tr>
<tr>
<td>ECX</td>
<td>“ntel”</td>
</tr>
<tr>
<td>EDX</td>
<td>“ineI”</td>
</tr>
<tr>
<td><strong>01H</strong></td>
<td></td>
</tr>
<tr>
<td>EAX</td>
<td>Version Information: Type, Family, Model, and Stepping ID (see Figure 2-2)</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 7-0: Brand Index</td>
</tr>
<tr>
<td></td>
<td>Bits 15-8: CLFLUSH line size (Value × 8 = cache line size in bytes)</td>
</tr>
<tr>
<td></td>
<td>Bits 23-16: Maximum number of addressable IDs for logical processors in this physical package*</td>
</tr>
<tr>
<td></td>
<td>Bits 31-24: Initial APIC ID</td>
</tr>
<tr>
<td>ECX</td>
<td>Feature Information (see Figure 2-3 and Table 2-22)</td>
</tr>
<tr>
<td>EDX</td>
<td>Feature Information (see Figure 2-4 and Table 2-23)</td>
</tr>
<tr>
<td><strong>NOTES:</strong></td>
<td>* The nearest power-of-2 integer that is not smaller than EBX[23:16] is the maximum number of unique initial APIC IDs reserved for addressing different logical processors in a physical package.</td>
</tr>
<tr>
<td><strong>02H</strong></td>
<td>Cache and TLB Information (see Table 2-24)</td>
</tr>
<tr>
<td><strong>03H</strong></td>
<td></td>
</tr>
<tr>
<td>EAX</td>
<td>Reserved.</td>
</tr>
<tr>
<td>EBX</td>
<td>Reserved.</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 00-31 of 96 bit processor serial number. (Available in Pentium III processor only; otherwise, the value in this register is reserved.)</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 32-63 of 96 bit processor serial number. (Available in Pentium III processor only; otherwise, the value in this register is reserved.)</td>
</tr>
</tbody>
</table>
Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NOTES:</strong></td>
<td>Processor serial number (PSN) is not supported in the Pentium 4 processor or later. On all models, use the PSN flag (returned using CPUID) to check for PSN support before accessing the feature.</td>
</tr>
<tr>
<td></td>
<td>See AP-485, <em>Intel Processor Identification and the CPUID Instruction</em> (Order Number 241618) for more information on PSN.</td>
</tr>
<tr>
<td></td>
<td>CPUID leaves &gt; 3 and &lt; 80000000 are visible only when IA32_MISC_ENABLES.BOOT_NT4[bit 22] = 0 (default).</td>
</tr>
</tbody>
</table>

**Deterministic Cache Parameters Leaf**

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>04H</strong></td>
<td>Leaf 04H output depends on the initial value in ECX.</td>
</tr>
<tr>
<td>EAX</td>
<td>Bits 4-0: Cache Type Field</td>
</tr>
<tr>
<td></td>
<td>0 = Null - No more caches</td>
</tr>
<tr>
<td></td>
<td>1 = Data Cache</td>
</tr>
<tr>
<td></td>
<td>2 = Instruction Cache</td>
</tr>
<tr>
<td></td>
<td>3 = Unified Cache</td>
</tr>
<tr>
<td></td>
<td>4-31 = Reserved</td>
</tr>
<tr>
<td></td>
<td>Bits 7-5: Cache Level (starts at 1)</td>
</tr>
<tr>
<td></td>
<td>Bit 8: Self Initializing cache level (does not need SW initialization)</td>
</tr>
<tr>
<td></td>
<td>Bit 9: Fully Associative cache</td>
</tr>
<tr>
<td></td>
<td>Bits 13-10: Reserved</td>
</tr>
<tr>
<td></td>
<td>Bits 25-14: Maximum number of addressable IDs for logical processors sharing this cache*, **</td>
</tr>
<tr>
<td></td>
<td>Bits 31-26: Maximum number of addressable IDs for processor cores in the physical package*, ***, ****</td>
</tr>
<tr>
<td>EBX</td>
<td>Bits 11-00: L = System Coherency Line Size*</td>
</tr>
<tr>
<td></td>
<td>Bits 21-12: P = Physical Line partitions*</td>
</tr>
<tr>
<td></td>
<td>Bits 31-22: W = Ways of associativity*</td>
</tr>
<tr>
<td>ECX</td>
<td>Bits 31-00: S = Number of Sets*</td>
</tr>
</tbody>
</table>
### Table 2-19. Information Returned by CPUID Instruction (Continued)

| Initial EAX Value | Register | Information Provided about the Processor |
|---|---|---|
|  | EDX | Bit 0: Write-Back Invalidate/Invalidate (WBINVD/INVD behavior on lower level caches) |
|  |  | 0 = WBINVD/INVD from threads sharing this cache acts upon lower level caches for threads sharing this cache. |
|  |  | 1 = WBINVD/INVD is not guaranteed to act upon lower level caches of non-originating threads sharing this cache. |
|  |  | Bit 1: Cache Inclusiveness |
|  |  | 0 = Cache is not inclusive of lower cache levels. |
|  |  | 1 = Cache is inclusive of lower cache levels. |
|  |  | Bits 31-02: Reserved = 0 |

**NOTES:**

* Add one to the return value to get the result.

** The nearest power-of-2 integer that is not smaller than (1 + EAX[25:14]) is the number of unique initial APIC IDs reserved for addressing different logical processors sharing this cache.

*** The nearest power-of-2 integer that is not smaller than (1 + EAX[31:26]) is the number of unique Core IDs reserved for addressing different processor cores in a physical package. Core ID is a subset of bits of the initial APIC ID.

**** The returned value is constant for valid initial values in ECX. Valid ECX values start from 0.
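As an illustration of how the leaf 04H fields combine, the following sketch decodes EAX, EBX and ECX values that are assumed to have been read already. The helper names `bits` and `decode_leaf4` are hypothetical, and the size computation (product of ways, partitions, line size and sets, each after adding one) is stated here as an assumption based on the "add one" note above rather than on text in this section.

```python
CACHE_TYPES = {0: "null", 1: "data", 2: "instruction", 3: "unified"}


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_leaf4(eax: int, ebx: int, ecx: int) -> dict:
    """Decode one sub-leaf of CPUID leaf 04H from raw register values."""
    line_size = bits(ebx, 11, 0) + 1    # L, note *: add one
    partitions = bits(ebx, 21, 12) + 1  # P
    ways = bits(ebx, 31, 22) + 1        # W
    sets = bits(ecx, 31, 0) + 1         # S
    return {
        "type": CACHE_TYPES.get(bits(eax, 4, 0), "reserved"),
        "level": bits(eax, 7, 5),
        "self_initializing": bool(bits(eax, 8, 8)),
        "fully_associative": bool(bits(eax, 9, 9)),
        "line_size": line_size,
        "partitions": partitions,
        "ways": ways,
        "sets": sets,
        # Assumed formula: total size is the product of the four fields.
        "size_bytes": ways * partitions * line_size * sets,
    }


if __name__ == "__main__":
    # Illustrative register values for an 8-way, 64-byte-line, 64-set data cache.
    print(decode_leaf4(0x121, 0x01C0003F, 0x3F))
```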

### MONITOR/MWAIT Leaf

| Initial EAX Value | Register | Information Provided about the Processor |
|---|---|---|
| **5H** | EAX | Bits 15-00: Smallest monitor-line size in bytes (default is processor's monitor granularity) |
|  |  | Bits 31-16: Reserved = 0 |
|  | EBX | Bits 15-00: Largest monitor-line size in bytes (default is processor's monitor granularity) |
|  |  | Bits 31-16: Reserved = 0 |
|  | ECX | Bit 00: Enumeration of Monitor-Mwait extensions (beyond EAX and EBX registers) supported |
|  |  | Bit 01: Supports treating interrupts as break-event for MWAIT, even when interrupts disabled |
|  |  | Bits 31-02: Reserved |
|  | EDX | Bits 03-00: Number of C0* sub C-states supported using MWAIT |
|  |  | Bits 07-04: Number of C1* sub C-states supported using MWAIT |
|  |  | Bits 11-08: Number of C2* sub C-states supported using MWAIT |
|  |  | Bits 15-12: Number of C3* sub C-states supported using MWAIT |
|  |  | Bits 19-16: Number of C4* sub C-states supported using MWAIT |
|  |  | Bits 31-20: Reserved = 0 |
|  | **NOTE:** | * The definition of C0 through C4 states for MWAIT extension are processor-specific C-states, not ACPI C-states. |
| **6H** | EAX | Bit 00: Digital temperature sensor is supported if set |
|  |  | Bit 01: Intel Dynamic Acceleration Enabled |
|  |  | Bits 31-02: Reserved |
|  | EBX | Bits 03-00: Number of Interrupt Thresholds in Digital Thermal Sensor |
|  |  | Bits 31-04: Reserved |
|  | ECX | Bit 00: Hardware Coordination Feedback Capability (Presence of MCNT and ACNT MSRs). The capability to provide a measure of delivered processor performance (since last reset of the counters), as a percentage of expected processor performance at frequency specified in CPUID Brand String |
|  |  | Bits 31-01: Reserved = 0 |
|  | EDX | Reserved = 0 |
| **09H** | EAX | Value of bits [31:0] of IA32_PLATFORM_DCA_CAP MSR (address 1F8H) |
|  | EBX | Reserved |
|  | ECX | Reserved |
|  | EDX | Reserved |
| **0AH** | EAX | Bits 07-00: Version ID of architectural performance monitoring |
|  |  | Bits 15-08: Number of general-purpose performance monitoring counters per logical processor |
|  |  | Bits 23-16: Bit width of general-purpose performance monitoring counters |
|  |  | Bits 31-24: Length of EBX bit vector to enumerate architectural performance monitoring events |

### Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>EBX</td>
<td>Bit 0: Core cycle event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 1: Instruction retired event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 2: Reference cycles event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 3: Last-level cache reference event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 4: Last-level cache misses event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 5: Branch instruction retired event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bit 6: Branch mispredict retired event not available if 1</td>
</tr>
<tr>
<td></td>
<td>Bits 31-07: Reserved = 0</td>
</tr>
<tr>
<td>ECX</td>
<td>Reserved = 0</td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 04-00: Number of fixed-function performance counters (if Version ID &gt; 1)</td>
</tr>
<tr>
<td></td>
<td>Bits 12-05: Bit width of fixed-function performance counters (if Version ID &gt; 1)</td>
</tr>
<tr>
<td></td>
<td>Bits 31-13: Reserved = 0</td>
</tr>
</tbody>
</table>
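The leaf 0AH fields can be decoded as in the following sketch, which takes already-read register values as input. The names `decode_leaf_0ah` and `EVENT_NAMES` are hypothetical; the event order follows the EBX bit assignments listed in the table, and the sample register values in the usage section are illustrative only.

```python
EVENT_NAMES = [
    "core cycle",
    "instruction retired",
    "reference cycles",
    "last-level cache reference",
    "last-level cache misses",
    "branch instruction retired",
    "branch mispredict retired",
]


def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_leaf_0ah(eax: int, ebx: int, edx: int) -> dict:
    """Decode architectural performance monitoring data from leaf 0AH."""
    version = bits(eax, 7, 0)
    vector_len = bits(eax, 31, 24)
    # An EBX bit set to 1 means the event is NOT available.
    available = [
        name
        for i, name in enumerate(EVENT_NAMES[:vector_len])
        if not (ebx >> i) & 1
    ]
    info = {
        "version": version,
        "gp_counters": bits(eax, 15, 8),
        "gp_counter_width": bits(eax, 23, 16),
        "available_events": available,
    }
    if version > 1:  # the fixed-function fields are defined only then
        info["fixed_counters"] = bits(edx, 4, 0)
        info["fixed_counter_width"] = bits(edx, 12, 5)
    return info


if __name__ == "__main__":
    print(decode_leaf_0ah(0x07300403, 0x00000000, 0x00000603))
```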

**Extended Topology Enumeration Leaf**

| Initial EAX Value | Register | Information Provided about the Processor |
|---|---|---|
| **0BH** | **NOTES:** | Most of Leaf 0BH output depends on the initial value in ECX. |
|  |  | EDX output does not vary with the initial value in ECX. |
|  |  | ECX[7:0] output always reflects the initial value in ECX. |
|  |  | All other output values for an invalid initial value in ECX are 0. |
|  |  | This leaf exists if EBX[15:0] contains a non-zero value. |
|  | EAX | Bits 4-0: Number of bits to shift right on x2APIC ID to get a unique topology ID of the next level type*. All logical processors with the same next level ID share current level. |
|  |  | Bits 31-5: Reserved. |
|  | EBX | Bits 15-00: Number of logical processors at this level type. The number reflects configuration as shipped by Intel**. |
|  |  | Bits 31-16: Reserved. |
|  | ECX | Bits 07-00: Level number. Same value as the ECX input. |
|  |  | Bits 15-08: Level type***. |
|  |  | Bits 31-16: Reserved. |
|  | EDX | Bits 31-0: x2APIC ID of the current logical processor. |
### Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>NOTES:</strong></td>
<td>* Software should use this field (EAX[4:0]) to enumerate processor topology of the system.</td>
</tr>
<tr>
<td><strong>NOTES:</strong></td>
<td>** Software must not use EBX[15:0] to enumerate processor topology of the system. This value in this field (EBX[15:0]) is only intended for display/diagnostic purposes. The actual number of logical processors available to BIOS/OS/Applications may be different from the value of EBX[15:0], depending on software and platform hardware configurations.</td>
</tr>
<tr>
<td><strong>NOTES:</strong></td>
<td>*** The value of the “level type” field is not related to level numbers in any way, higher “level type” values do not mean higher levels. Level type field has the following encoding: 0: invalid 1: SMT 2: Core 3-255: Reserved</td>
</tr>
</tbody>
</table>
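The use of the EAX[4:0] shift counts can be sketched as follows. The function `split_x2apic_id` is hypothetical; it assumes a two-level enumeration in which `smt_shift` is EAX[4:0] reported for the SMT level and `core_shift` is EAX[4:0] reported for the Core level, both measured from bit 0 of the x2APIC ID.

```python
def split_x2apic_id(x2apic_id: int, smt_shift: int, core_shift: int) -> tuple:
    """Split an x2APIC ID into (smt_id, core_id, package_id).

    smt_shift:  bits to shift right to reach the next level above SMT.
    core_shift: bits to shift right to reach the next level above Core.
    Both values are assumed to come from EAX[4:0] of leaf 0BH.
    """
    if core_shift < smt_shift:
        raise ValueError("core_shift must not be smaller than smt_shift")
    smt_id = x2apic_id & ((1 << smt_shift) - 1)
    core_id = (x2apic_id >> smt_shift) & ((1 << (core_shift - smt_shift)) - 1)
    package_id = x2apic_id >> core_shift
    return smt_id, core_id, package_id


if __name__ == "__main__":
    # Illustrative values: 1 SMT bit, 3 core bits.
    for apic_id in (0, 1, 22, 23):
        print(apic_id, split_x2apic_id(apic_id, smt_shift=1, core_shift=4))
```

Logical processors that produce the same `package_id` share the package, and those that also produce the same `core_id` share a core, which matches the "same next level ID share current level" wording in the table.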

**Processor Extended State Enumeration Main Leaf (EAX = 0DH, ECX = 0)**

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>0DH</td>
<td>Leaf 0DH main leaf (ECX = 0).</td>
</tr>
<tr>
<td>EAX</td>
<td></td>
</tr>
<tr>
<td>EBX</td>
<td></td>
</tr>
<tr>
<td>ECX</td>
<td></td>
</tr>
<tr>
<td>EDX</td>
<td>Bits 31-0: Reports the valid bit fields of the upper 32 bits of the XFEATURE_ENABLED_MASK register. If a bit is 0, the corresponding bit field in XFEATURE_ENABLED_MASK is reserved.</td>
</tr>
</tbody>
</table>

**Processor Extended State Enumeration Sub-leaf (EAX = 0DH, ECX = 1)**

Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>EAX Reserved</td>
</tr>
<tr>
<td></td>
<td>EBX Reserved</td>
</tr>
<tr>
<td></td>
<td>ECX Reserved</td>
</tr>
<tr>
<td></td>
<td>EDX Reserved</td>
</tr>
<tr>
<td>0DH</td>
<td>Leaf 0DH output depends on the initial value in ECX. If ECX contains an invalid sub leaf index, EAX/EBX/ECX/EDX return 0.</td>
</tr>
<tr>
<td></td>
<td>EAX Bits 31-0: The size in bytes of the save area for an extended state associated with a valid sub-leaf index, n. Each valid sub-leaf index maps to a valid bit in XFEATURE_ENABLED_MASK starting at bit position 2. This field reports 0 if the sub-leaf index, n, is invalid*.</td>
</tr>
<tr>
<td></td>
<td>EBX Bits 31-0: The offset in bytes of the save area from the beginning of the XSAVE/XRSTOR area. This field reports 0 if the sub-leaf index, n, is invalid*.</td>
</tr>
<tr>
<td></td>
<td>ECX This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved.</td>
</tr>
<tr>
<td></td>
<td>EDX This field reports 0 if the sub-leaf index, n, is invalid*; otherwise it is reserved.</td>
</tr>
<tr>
<td></td>
<td>Extended Function CPUID Information</td>
</tr>
<tr>
<td>80000000H</td>
<td>EAX Maximum Input Value for Extended Function CPUID Information (see Table 2-20).</td>
</tr>
<tr>
<td></td>
<td>EBX Reserved</td>
</tr>
<tr>
<td></td>
<td>ECX Reserved</td>
</tr>
<tr>
<td></td>
<td>EDX Reserved</td>
</tr>
<tr>
<td>80000001H</td>
<td>EAX Extended Processor Signature and Feature Bits.</td>
</tr>
<tr>
<td></td>
<td>EBX Reserved</td>
</tr>
<tr>
<td></td>
<td>ECX Bit 0: LAHF/SAHF available in 64-bit mode</td>
</tr>
<tr>
<td></td>
<td>Bits 31-1 Reserved</td>
</tr>
</tbody>
</table>
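For the leaf 0DH sub-leaves in the table above, the mapping between sub-leaf indices and XFEATURE_ENABLED_MASK bits can be sketched as below. The function names are hypothetical, and the size and offset used in the usage example (256 and 576 bytes) are illustrative values that a sub-leaf might report, not values taken from this section.

```python
def valid_subleaf_indices(xfeature_mask: int) -> list:
    """Sub-leaf indices n >= 2 whose bit is set in a 64-bit feature mask.

    Each valid sub-leaf index maps to a bit in XFEATURE_ENABLED_MASK,
    starting at bit position 2.
    """
    return [n for n in range(2, 64) if (xfeature_mask >> n) & 1]


def save_area_end(size_bytes: int, offset_bytes: int) -> int:
    """End offset of one component inside the XSAVE/XRSTOR area.

    size_bytes is the sub-leaf's size field and offset_bytes its offset
    field; a reported size of 0 marks an invalid sub-leaf.
    """
    if size_bytes == 0:
        raise ValueError("sub-leaf index is invalid (size reported as 0)")
    return offset_bytes + size_bytes


if __name__ == "__main__":
    mask = 0b111  # illustrative mask with bits 0, 1 and 2 set
    for n in valid_subleaf_indices(mask):
        print(f"sub-leaf {n}: component ends at byte {save_area_end(256, 576)}")
```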
Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>EDX</td>
<td>Bits 10-0: Reserved</td>
</tr>
<tr>
<td></td>
<td>Bit 11: SYSCALL/SYSRET available (when in 64-bit mode)</td>
</tr>
<tr>
<td></td>
<td>Bits 19-12: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>Bit 20: Execute Disable Bit available</td>
</tr>
<tr>
<td></td>
<td>Bits 28-21: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>Bit 29: Intel® 64 Architecture available if 1</td>
</tr>
<tr>
<td></td>
<td>Bits 31-30: Reserved = 0</td>
</tr>
<tr>
<td>80000002H</td>
<td>EAX: Processor Brand String</td>
</tr>
<tr>
<td></td>
<td>EBX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>ECX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>EDX: Processor Brand String Continued</td>
</tr>
<tr>
<td>80000003H</td>
<td>EAX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>EBX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>ECX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>EDX: Processor Brand String Continued</td>
</tr>
<tr>
<td>80000004H</td>
<td>EAX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>EBX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>ECX: Processor Brand String Continued</td>
</tr>
<tr>
<td></td>
<td>EDX: Processor Brand String Continued</td>
</tr>
<tr>
<td>80000005H</td>
<td>EAX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EDX: Reserved = 0</td>
</tr>
<tr>
<td>80000006H</td>
<td>EAX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Bits 7-0: Cache Line size in bytes</td>
</tr>
<tr>
<td></td>
<td>Bits 15-12: L2 Associativity field *</td>
</tr>
<tr>
<td></td>
<td>Bits 31-16: Cache size in 1K units</td>
</tr>
<tr>
<td></td>
<td>EDX: Reserved = 0</td>
</tr>
</tbody>
</table>
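Leaves 80000002H through 80000004H together return twelve registers of brand string data. The sketch below assembles them into text, assuming each register holds four ASCII characters with the first character in the least significant byte (the same packing as the vendor identification string) and that the string is null-terminated. `brand_string` is a hypothetical helper operating on already-read values.

```python
import struct


def brand_string(regs) -> str:
    """Assemble the processor brand string from 12 register values.

    regs must be ordered EAX, EBX, ECX, EDX for leaf 80000002H, then the
    same four registers for 80000003H and 80000004H.
    """
    if len(regs) != 12:
        raise ValueError("expected 12 register values")
    raw = struct.pack("<12I", *regs)      # little-endian, 4 chars each
    text = raw.split(b"\x00", 1)[0]       # stop at the null terminator
    return text.decode("ascii").strip()


if __name__ == "__main__":
    # Round-trip an illustrative string through the register layout.
    sample = b"Example Processor Name".ljust(48, b"\x00")
    print(brand_string(struct.unpack("<12I", sample)))
```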

**NOTES:**
- * L2 associativity field encodings:
  - 00H - Disabled
  - 01H - Direct mapped
  - 02H - 2-way
  - 04H - 4-way
  - 06H - 8-way
  - 08H - 16-way
  - 0FH - Fully associative
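The ECX fields of leaf 80000006H and the associativity encodings listed above can be decoded as in this sketch. `decode_l2_info` is a hypothetical helper, and the sample register value is illustrative.

```python
L2_ASSOCIATIVITY = {
    0x0: "Disabled",
    0x1: "Direct mapped",
    0x2: "2-way",
    0x4: "4-way",
    0x6: "8-way",
    0x8: "16-way",
    0xF: "Fully associative",
}


def decode_l2_info(ecx: int) -> dict:
    """Decode CPUID.80000006H:ECX from a raw register value."""
    assoc_code = (ecx >> 12) & 0xF
    return {
        "line_size_bytes": ecx & 0xFF,                 # bits 7-0
        "associativity": L2_ASSOCIATIVITY.get(assoc_code, "Unknown encoding"),
        "size_kbytes": (ecx >> 16) & 0xFFFF,           # bits 31-16, 1K units
    }


if __name__ == "__main__":
    # Illustrative value: 256 KByte, 8-way, 64-byte lines.
    print(decode_l2_info(0x01006040))
```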

Table 2-19. Information Returned by CPUID Instruction (Continued)

<table>
<thead>
<tr>
<th>Initial EAX Value</th>
<th>Information Provided about the Processor</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000007H</td>
<td>EAX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EDX: Reserved = 0</td>
</tr>
<tr>
<td>80000008H</td>
<td>EAX: Virtual/Physical Address size</td>
</tr>
<tr>
<td></td>
<td>Bits 7-0: #Physical Address Bits*</td>
</tr>
<tr>
<td></td>
<td>Bits 15-8: #Virtual Address Bits</td>
</tr>
<tr>
<td></td>
<td>Bits 31-16: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EBX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>ECX: Reserved = 0</td>
</tr>
<tr>
<td></td>
<td>EDX: Reserved = 0</td>
</tr>
</tbody>
</table>

NOTES:
* If CPUID.80000008H:EAX[7:0] is supported, the maximum physical address number supported should come from this field.
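A minimal sketch of reading the two address-size fields from an already-read EAX value; `decode_address_sizes` is a hypothetical name and the sample value is illustrative.

```python
def decode_address_sizes(eax: int) -> tuple:
    """Return (physical_bits, virtual_bits) from CPUID.80000008H:EAX."""
    physical_bits = eax & 0xFF          # bits 7-0
    virtual_bits = (eax >> 8) & 0xFF    # bits 15-8
    return physical_bits, virtual_bits


if __name__ == "__main__":
    phys, virt = decode_address_sizes(0x00003024)
    print(f"physical: {phys} bits, virtual: {virt} bits")
```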

INPUT EAX = 0: Returns CPUID’s Highest Value for Basic Processor Information and the Vendor Identification String

When CPUID executes with EAX set to 0, the processor returns the highest value the CPUID recognizes for returning basic processor information. The value is returned in the EAX register (see Table 2-20) and is processor specific.

A vendor identification string is also returned in EBX, EDX, and ECX. For Intel processors, the string is “GenuineIntel” and is expressed:

- EBX ← 756e6547h (* “Genu”, with G in the low 8 bits of EBX (BL) *)
- EDX ← 49656e69h (* “ineI”, with i in the low 8 bits of EDX (DL) *)
- ECX ← 6c65746eh (* “ntel”, with n in the low 8 bits of ECX (CL) *)
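The register order matters when reassembling the string: the characters are read from EBX, then EDX, then ECX, lowest byte first. A short sketch using the values listed above (`vendor_string` is a hypothetical helper):

```python
import struct


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Assemble the 12-character vendor ID from leaf 0 register values.

    Note the order EBX, EDX, ECX; each register is unpacked starting
    from its least significant byte.
    """
    return struct.pack("<3I", ebx, edx, ecx).decode("ascii")


if __name__ == "__main__":
    print(vendor_string(0x756E6547, 0x49656E69, 0x6C65746E))
```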

INPUT EAX = 80000000H: Returns CPUID’s Highest Value for Extended Processor Information

When CPUID executes with EAX set to 80000000H, the processor returns the highest value the processor recognizes for returning extended processor information. The value is returned in the EAX register (see Table 2-20) and is processor specific.

IA32_BIOS_SIGN_ID Returns Microcode Update Signature

For processors that support the microcode update facility, the IA32_BIOS_SIGN_ID MSR is loaded with the update signature whenever CPUID executes. The signature is returned in the upper DWORD. For details, see Chapter 9 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A.

INPUT EAX = 1: Returns Model, Family, Stepping Information

When CPUID executes with EAX set to 1, version information is returned in EAX (see Figure 2-2). For example: model, family, and processor type for the Intel Xeon processor 5100 series is as follows:

- Model — 1111B
- Family — 0101B
- Processor Type — 00B

See Table 2-21 for available processor type values. Stepping IDs are provided as needed.

![Figure 2-2. Version Information Returned by CPUID in EAX](image)

**Table 2-21. Processor Type Field**

<table>
<thead>
<tr>
<th>Type</th>
<th>Encoding</th>
</tr>
</thead>
<tbody>
<tr>
<td>Original OEM Processor</td>
<td>00B</td>
</tr>
<tr>
<td>Intel OverDrive Processor</td>
<td>01B</td>
</tr>
<tr>
<td>Dual processor (not applicable to Intel486 processors)</td>
<td>10B</td>
</tr>
<tr>
<td>Intel reserved</td>
<td>11B</td>
</tr>
</tbody>
</table>

**NOTE**

See AP-485, *Intel Processor Identification and the CPUID Instruction* (Order Number 241618) and Chapter 14 in the *Intel® 64 and IA-32*

The Extended Family ID needs to be examined only when the Family ID is 0FH. Integrate the fields into a display using the following rule:

```plaintext
IF Family_ID ≠ 0FH
    THEN Displayed_Family = Family_ID;
    ELSE Displayed_Family = Extended_Family_ID + Family_ID;
    (* Right justify and zero-extend 4-bit field. *)
FI;
(* Show Displayed_Family as HEX field. *)
```

The Extended Model ID needs to be examined only when the Family ID is 06H or 0FH. Integrate the field into a display using the following rule:

```plaintext
IF (Family_ID = 06H or Family_ID = 0FH)
    THEN Displayed_Model = (Extended_Model_ID << 4) + Model_ID;
    (* Right justify and zero-extend 4-bit field; display Model_ID as HEX field.*)
    ELSE Displayed_Model = Model_ID;
FI;
(* Show Displayed_Model as HEX field. *)
```
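The two display rules above can be combined into one decoder. This sketch assumes the usual field positions within EAX (stepping in bits 3-0, model in 7-4, family in 11-8, processor type in 13-12, extended model in 19-16, extended family in 27-20); Figure 2-2 is the authoritative layout, and `decode_version` is a hypothetical helper name.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_version(eax: int) -> dict:
    """Decode CPUID.01H:EAX and apply the family/model display rules."""
    family_id = bits(eax, 11, 8)
    model_id = bits(eax, 7, 4)
    ext_family = bits(eax, 27, 20)   # assumed field position
    ext_model = bits(eax, 19, 16)    # assumed field position

    # Extended Family ID is examined only when Family ID is 0FH.
    if family_id != 0x0F:
        displayed_family = family_id
    else:
        displayed_family = ext_family + family_id

    # Extended Model ID is examined only when Family ID is 06H or 0FH.
    if family_id in (0x06, 0x0F):
        displayed_model = (ext_model << 4) + model_id
    else:
        displayed_model = model_id

    return {
        "stepping": bits(eax, 3, 0),
        "processor_type": bits(eax, 13, 12),
        "displayed_family": displayed_family,
        "displayed_model": displayed_model,
    }


if __name__ == "__main__":
    for sig in (0x000006F6, 0x000106A5, 0x00100F42):
        d = decode_version(sig)
        print(f"{sig:08X}H: family {d['displayed_family']:X}H, "
              f"model {d['displayed_model']:X}H, stepping {d['stepping']:X}H")
```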

**INPUT EAX = 1: Returns Additional Information in EBX**

When CPUID executes with EAX set to 1, additional information is returned to the EBX register:

- Brand index (low byte of EBX) — this number provides an entry into a brand string table that contains brand strings for IA-32 processors. More information about this field is provided later in this section.
- CLFLUSH instruction cache line size (second byte of EBX) — this number indicates the size of the cache line flushed with CLFLUSH instruction in 8-byte increments. This field was introduced in the Pentium 4 processor.
- Local APIC ID (high byte of EBX) — this number is the 8-bit ID that is assigned to the local APIC on the processor during power up. This field was introduced in the Pentium 4 processor.
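The three EBX fields described above, together with the logical processor ID count from Table 2-19, can be unpacked as in this sketch. `decode_leaf1_ebx` is a hypothetical helper and the sample value is illustrative.

```python
def decode_leaf1_ebx(ebx: int) -> dict:
    """Decode CPUID.01H:EBX from a raw register value."""
    return {
        "brand_index": ebx & 0xFF,                         # bits 7-0
        # The field counts 8-byte units, so multiply by 8 for bytes.
        "clflush_line_bytes": ((ebx >> 8) & 0xFF) * 8,     # bits 15-8
        "max_logical_ids": (ebx >> 16) & 0xFF,             # bits 23-16
        "initial_apic_id": (ebx >> 24) & 0xFF,             # bits 31-24
    }


if __name__ == "__main__":
    print(decode_leaf1_ebx(0x03100800))
```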

**INPUT EAX = 1: Returns Feature Information in ECX and EDX**

When CPUID executes with EAX set to 1, feature information is returned in ECX and EDX.

- Figure 2-3 and Table 2-22 show encodings for ECX.
- Figure 2-4 and Table 2-23 show encodings for EDX.

For all feature flags, a 1 indicates that the feature is supported. Use Intel to properly interpret feature flags.

**NOTE**

Software must confirm that a processor feature is present using feature flags returned by CPUID prior to using the feature. Software should not depend on future offerings retaining all features.

![Figure 2-3. Feature Information Returned in the ECX Register](image_url)

### Table 2-22. Feature Information Returned in the ECX Register

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>SSE3</td>
<td>Streaming SIMD Extensions 3 (SSE3). A value of 1 indicates the processor supports this technology.</td>
</tr>
<tr>
<td>1</td>
<td>PCLMULQDQ</td>
<td>A value of 1 indicates the processor supports PCLMULQDQ instruction</td>
</tr>
<tr>
<td>2</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
</tbody>
</table>
Table 2-22. Feature Information Returned in the ECX Register (Continued)

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>3</td>
<td>MONITOR</td>
<td>MONITOR/MWAIT. A value of 1 indicates the processor supports this feature.</td>
</tr>
<tr>
<td>4</td>
<td>DS-CPL</td>
<td>CPL Qualified Debug Store. A value of 1 indicates the processor supports the extensions to the Debug Store feature to allow for branch message storage qualified by CPL.</td>
</tr>
<tr>
<td>5</td>
<td>VMX</td>
<td>Virtual Machine Extensions. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>6</td>
<td>SMX</td>
<td>Safer Mode Extensions. A value of 1 indicates that the processor supports this technology. See Chapter 6, “Safer Mode Extensions Reference”.</td>
</tr>
<tr>
<td>7</td>
<td>EST</td>
<td>Enhanced Intel SpeedStep® technology. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>8</td>
<td>TM2</td>
<td>Thermal Monitor 2. A value of 1 indicates that the processor supports this technology.</td>
</tr>
<tr>
<td>9</td>
<td>SSSE3</td>
<td>A value of 1 indicates the presence of the Supplemental Streaming SIMD Extensions 3 (SSSE3). A value of 0 indicates the instruction extensions are not present in the processor.</td>
</tr>
<tr>
<td>10</td>
<td>CNXT-ID</td>
<td>L1 Context ID. A value of 1 indicates the L1 data cache mode can be set to either adaptive mode or shared mode. A value of 0 indicates this feature is not supported. See definition of the IA32_MISC_ENABLES MSR Bit 24 (L1 Data Cache Context Mode) for details.</td>
</tr>
<tr>
<td>11</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>12</td>
<td>FMA</td>
<td>A value of 1 indicates the processor supports FMA extensions using YMM state.</td>
</tr>
<tr>
<td>13</td>
<td>CMPXCHG16B</td>
<td>CMPXCHG16B Available. A value of 1 indicates that the feature is available.</td>
</tr>
<tr>
<td>14</td>
<td>xTPR Update Control</td>
<td>xTPR Update Control. A value of 1 indicates that the processor supports changing IA32_MISC_ENABLES[bit 23].</td>
</tr>
<tr>
<td>15</td>
<td>PDCM</td>
<td>Perfmon and Debug Capability. A value of 1 indicates the processor supports the performance and debug feature indication MSR IA32_PERF_CAPABILITIES.</td>
</tr>
<tr>
<td>16-17</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>18</td>
<td>DCA</td>
<td>A value of 1 indicates the processor supports the ability to prefetch data from a memory mapped device.</td>
</tr>
<tr>
<td>19</td>
<td>SSE4.1</td>
<td>A value of 1 indicates that the processor supports SSE4.1.</td>
</tr>
<tr>
<td>20</td>
<td>SSE4.2</td>
<td>A value of 1 indicates that the processor supports SSE4.2.</td>
</tr>
<tr>
<td>21</td>
<td>x2APIC</td>
<td>A value of 1 indicates that the processor supports x2APIC feature.</td>
</tr>
<tr>
<td>22</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>23</td>
<td>POPCNT</td>
<td>A value of 1 indicates that the processor supports the POPCNT instruction.</td>
</tr>
<tr>
<td>24</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>25</td>
<td>AES</td>
<td>A value of 1 indicates that the processor supports the AES instructions.</td>
</tr>
<tr>
<td>26</td>
<td>XSAVE</td>
<td>A value of 1 indicates that the processor supports the XFEATURE_ENABLED_MASK register and the XSAVE/XRSTOR/XSETBV/XGETBV instructions to manage processor extended states.</td>
</tr>
<tr>
<td>27</td>
<td>OSXSAVE</td>
<td>A value of 1 indicates that the OS has enabled support for using the XGETBV/XSETBV instructions to query processor extended states.</td>
</tr>
<tr>
<td>28</td>
<td>AVX</td>
<td>A value of 1 indicates that the processor supports AVX instructions operating on 256-bit YMM state, and three-operand encoding of 256-bit and 128-bit SIMD instructions.</td>
</tr>
<tr>
<td>31 - 29</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
</tbody>
</table>
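
As an illustration of how software might consume these bits, the following Python sketch decodes a raw ECX value returned by CPUID with EAX = 1 into the mnemonics listed for bits 9 through 28 in Table 2-22. The dictionary and function names are illustrative only and are not part of any Intel-supplied interface. Obtaining the raw register value is outside the scope of the sketch.

```python
# Mnemonics for CPUID.1:ECX bits 9 through 28, copied from Table 2-22.
# Reserved bits are omitted.
ECX_FEATURES = {
    9: "SSSE3",
    10: "CNXT-ID",
    12: "FMA",
    13: "CMPXCHG16B",
    14: "xTPR Update Control",
    15: "PDCM",
    18: "DCA",
    19: "SSE4.1",
    20: "SSE4.2",
    21: "x2APIC",
    23: "POPCNT",
    25: "AES",
    26: "XSAVE",
    27: "OSXSAVE",
    28: "AVX",
}


def decode_ecx_features(ecx):
    """Return the mnemonics whose bit is set in the raw ECX value."""
    return [name for bit, name in sorted(ECX_FEATURES.items()) if (ecx >> bit) & 1]


def avx_reported_and_os_enabled(ecx):
    """Return True if both OSXSAVE (bit 27) and AVX (bit 28) are set.

    This covers only the CPUID part of AVX detection. A complete check
    would also read the extended-state mask with XGETBV, which this
    sketch does not model.
    """
    mask = (1 << 27) | (1 << 28)
    return (ecx & mask) == mask
```

Keeping the bit-to-mnemonic mapping in one table makes it straightforward to extend the sketch to the remaining bits of the register.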
Figure 2-4. Feature Information Returned in the EDX Register

Table 2-23. More on Feature Information Returned in the EDX Register

<table>
<thead>
<tr>
<th>Bit #</th>
<th>Mnemonic</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>FPU</td>
<td>Floating-point Unit On-Chip. The processor contains an x87 FPU.</td>
</tr>
<tr>
<td>1</td>
<td>VME</td>
<td>Virtual 8086 Mode Enhancements. Virtual 8086 mode enhancements, including CR4.VME for controlling the feature, CR4.PVI for protected mode virtual interrupts, software interrupt indirection, expansion of the TSS with the software indirection bitmap, and EFLAGS.VIF and EFLAGS.VIP flags.</td>
</tr>
<tr>
<td>2</td>
<td>DE</td>
<td>Debugging Extensions. Support for I/O breakpoints, including CR4.DE for controlling the feature, and optional trapping of accesses to DR4 and DR5.</td>
</tr>
<tr>
<td>3</td>
<td>PSE</td>
<td>Page Size Extension. Large pages of size 4 MByte are supported, including CR4.PSE for controlling the feature, the defined dirty bit in PDE (Page Directory Entries), optional reserved bit trapping in CR3, PDEs, and PTEs.</td>
</tr>
<tr>
<td>4</td>
<td>TSC</td>
<td>Time Stamp Counter. The RDTSC instruction is supported, including CR4.TSD for controlling privilege.</td>
</tr>
<tr>
<td>5</td>
<td>MSR</td>
<td>Model Specific Registers RDMSR and WRMSR Instructions. The RDMSR and WRMSR instructions are supported. Some of the MSRs are implementation dependent.</td>
</tr>
<tr>
<td>6</td>
<td>PAE</td>
<td>Physical Address Extension. Physical addresses greater than 32 bits are supported: extended page table entry formats and an extra level in the page translation tables are defined, and 2-MByte pages are supported instead of 4-MByte pages if the PAE bit is 1. The actual number of address bits beyond 32 is not defined, and is implementation specific.</td>
</tr>
<tr>
<td>7</td>
<td>MCE</td>
<td>Machine Check Exception. Exception 18 is defined for Machine Checks, including CR4.MCE for controlling the feature. This feature does not define the model-specific implementations of machine-check error logging, reporting, and processor shutdowns. Machine Check exception handlers may have to depend on processor version to do model specific processing of the exception, or test for the presence of the Machine Check feature.</td>
</tr>
<tr>
<td>8</td>
<td>CX8</td>
<td>CMPXCHG8B Instruction. The compare-and-exchange 8 bytes (64 bits) instruction is supported (implicitly locked and atomic).</td>
</tr>
<tr>
<td>9</td>
<td>APIC</td>
<td>APIC On-Chip. The processor contains an Advanced Programmable Interrupt Controller (APIC), responding to memory mapped commands in the physical address range FEE00000H to FEE00FFFH (by default; some processors permit the APIC to be relocated).</td>
</tr>
<tr>
<td>10</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>11</td>
<td>SEP</td>
<td>SYSENTER and SYSEXIT Instructions. The SYSENTER and SYSEXIT and associated MSRs are supported.</td>
</tr>
<tr>
<td>12</td>
<td>MTRR</td>
<td>Memory Type Range Registers. MTRRs are supported. The MTRRcap MSR contains feature bits that describe what memory types are supported, how many variable MTRRs are supported, and whether fixed MTRRs are supported.</td>
</tr>
<tr>
<td>13</td>
<td>PGE</td>
<td><strong>PTE Global Bit.</strong> The global bit in page directory entries (PDEs) and page table entries (PTEs) is supported, indicating TLB entries that are common to different processes and need not be flushed. The CR4.PGE bit controls this feature.</td>
</tr>
<tr>
<td>14</td>
<td>MCA</td>
<td><strong>Machine Check Architecture.</strong> The Machine Check Architecture, which provides a compatible mechanism for error reporting in P6 family, Pentium 4, Intel Xeon processors, and future processors, is supported. The MCG_CAP MSR contains feature bits describing how many banks of error reporting MSRs are supported.</td>
</tr>
<tr>
<td>15</td>
<td>CMOV</td>
<td><strong>Conditional Move Instructions.</strong> The conditional move instruction CMOV is supported. In addition, if x87 FPU is present as indicated by the CPUID.FPU feature bit, then the FCOMI and FCMOV instructions are supported.</td>
</tr>
<tr>
<td>16</td>
<td>PAT</td>
<td><strong>Page Attribute Table.</strong> Page Attribute Table is supported. This feature augments the Memory Type Range Registers (MTRRs), allowing an operating system to specify attributes of memory on a 4K granularity through a linear address.</td>
</tr>
<tr>
<td>17</td>
<td>PSE-36</td>
<td><strong>36-Bit Page Size Extension.</strong> Extended 4-MByte pages that are capable of addressing physical memory beyond 4 GBytes are supported. This feature indicates that the upper four bits of the physical address of the 4-MByte page are encoded by bits 13-16 of the page directory entry.</td>
</tr>
<tr>
<td>18</td>
<td>PSN</td>
<td><strong>Processor Serial Number.</strong> The processor supports the 96-bit processor identification number feature and the feature is enabled.</td>
</tr>
<tr>
<td>19</td>
<td>CLFSH</td>
<td><strong>CLFLUSH Instruction.</strong> CLFLUSH Instruction is supported.</td>
</tr>
<tr>
<td>20</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>21</td>
<td>DS</td>
<td><strong>Debug Store.</strong> The processor supports the ability to write debug information into a memory resident buffer. This feature is used by the branch trace store (BTS) and precise event-based sampling (PEBS) facilities (see Chapter 18, “Debugging and Performance Monitoring,” in the <em>Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B</em>).</td>
</tr>
<tr>
<td>22</td>
<td>ACPI</td>
<td><strong>Thermal Monitor and Software Controlled Clock Facilities.</strong> The processor implements internal MSRs that allow processor temperature to be monitored and processor performance to be modulated in predefined duty cycles under software control.</td>
</tr>
<tr>
<td>23</td>
<td>MMX</td>
<td><strong>Intel MMX Technology.</strong> The processor supports the Intel MMX technology.</td>
</tr>
<tr>
<td>24</td>
<td>FXSR</td>
<td><strong>FXSAVE and FXRSTOR Instructions.</strong> The FXSAVE and FXRSTOR instructions are supported for fast save and restore of the floating-point context. Presence of this bit also indicates that CR4.OSFXSR is available for an operating system to indicate that it supports the FXSAVE and FXRSTOR instructions.</td>
</tr>
<tr>
<td>25</td>
<td>SSE</td>
<td>SSE. The processor supports the SSE extensions.</td>
</tr>
<tr>
<td>26</td>
<td>SSE2</td>
<td>SSE2. The processor supports the SSE2 extensions.</td>
</tr>
<tr>
<td>27</td>
<td>SS</td>
<td>Self Snoop. The processor supports the management of conflicting memory types by performing a snoop of its own cache structure for transactions issued to the bus.</td>
</tr>
<tr>
<td>28</td>
<td>HTT</td>
<td>Multi-Threading. The physical processor package is capable of supporting more than one logical processor.</td>
</tr>
<tr>
<td>29</td>
<td>TM</td>
<td>Thermal Monitor. The processor implements the thermal monitor automatic thermal control circuitry (TCC).</td>
</tr>
<tr>
<td>30</td>
<td>Reserved</td>
<td>Reserved</td>
</tr>
<tr>
<td>31</td>
<td>PBE</td>
<td>Pending Break Enable. The processor supports the use of the FERR#/PBE# pin when the processor is in the stop-clock state (STPCLK# is asserted) to signal the processor that an interrupt is pending and that the processor should return to normal operation to handle the interrupt. Bit 10 (PBE enable) in the IA32_MISC_ENABLE MSR enables this capability.</td>
</tr>
</tbody>
</table>

INPUT EAX = 2: Cache and TLB Information Returned in EAX, EBX, ECX, EDX

When CPUID executes with EAX set to 2, the processor returns information about the processor’s internal caches and TLBs in the EAX, EBX, ECX, and EDX registers.

The encoding is as follows:

- The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction must be executed with an input value of 2 to get a complete description of the processor’s caches and TLBs. The first member of the family of Pentium 4 processors will return a 1.
- The most significant bit (bit 31) of each register indicates whether the register contains valid information (set to 0) or is reserved (set to 1).
- If a register contains valid information, the information is contained in 1 byte descriptors. Table 2-24 shows the encoding of these descriptors. Note that the order of descriptors in the EAX, EBX, ECX, and EDX registers is not defined; that is, specific bytes are not designated to contain descriptors for specific cache or TLB types. The descriptors may appear in any order.

Table 2-24. Encoding of Cache and TLB Descriptors

<table>
<thead>
<tr>
<th>Descriptor Value</th>
<th>Cache or TLB Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>00H</td>
<td>Null descriptor</td>
</tr>
<tr>
<td>01H</td>
<td>Instruction TLB: 4 KByte pages, 4-way set associative, 32 entries</td>
</tr>
<tr>
<td>02H</td>
<td>Instruction TLB: 4 MByte pages, 4-way set associative, 2 entries</td>
</tr>
<tr>
<td>03H</td>
<td>Data TLB: 4 KByte pages, 4-way set associative, 64 entries</td>
</tr>
<tr>
<td>04H</td>
<td>Data TLB: 4 MByte pages, 4-way set associative, 8 entries</td>
</tr>
<tr>
<td>05H</td>
<td>Data TLB1: 4 MByte pages, 4-way set associative, 32 entries</td>
</tr>
<tr>
<td>06H</td>
<td>1st-level instruction cache: 8 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>08H</td>
<td>1st-level instruction cache: 16 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>0AH</td>
<td>1st-level data cache: 8 KBytes, 2-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>0BH</td>
<td>Instruction TLB: 4 MByte pages, 4-way set associative, 4 entries</td>
</tr>
<tr>
<td>0CH</td>
<td>1st-level data cache: 16 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>22H</td>
<td>3rd-level cache: 512 KBytes, 4-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>23H</td>
<td>3rd-level cache: 1 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>25H</td>
<td>3rd-level cache: 2 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>29H</td>
<td>3rd-level cache: 4 MBytes, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>2CH</td>
<td>1st-level data cache: 32 KBytes, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>30H</td>
<td>1st-level instruction cache: 32 KBytes, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>40H</td>
<td>No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache</td>
</tr>
<tr>
<td>41H</td>
<td>2nd-level cache: 128 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>42H</td>
<td>2nd-level cache: 256 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>43H</td>
<td>2nd-level cache: 512 KBytes, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>44H</td>
<td>2nd-level cache: 1 MByte, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>45H</td>
<td>2nd-level cache: 2 MByte, 4-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>46H</td>
<td>3rd-level cache: 4 MByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>47H</td>
<td>3rd-level cache: 8 MByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>49H</td>
<td>3rd-level cache: 4MB, 16-way set associative, 64-byte line size (Intel Xeon processor MP, Family 0FH, Model 06H); 2nd-level cache: 4 MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4AH</td>
<td>3rd-level cache: 6MByte, 12-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4BH</td>
<td>3rd-level cache: 8MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4CH</td>
<td>3rd-level cache: 12MByte, 12-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4DH</td>
<td>3rd-level cache: 16MByte, 16-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>4EH</td>
<td>2nd-level cache: 6MByte, 24-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>50H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 64 entries</td>
</tr>
<tr>
<td>51H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 128 entries</td>
</tr>
<tr>
<td>52H</td>
<td>Instruction TLB: 4 KByte and 2-MByte or 4-MByte pages, 256 entries</td>
</tr>
<tr>
<td>56H</td>
<td>Data TLB0: 4 MByte pages, 4-way set associative, 16 entries</td>
</tr>
<tr>
<td>57H</td>
<td>Data TLB0: 4 KByte pages, 4-way associative, 16 entries</td>
</tr>
<tr>
<td>5BH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 64 entries</td>
</tr>
<tr>
<td>5CH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 128 entries</td>
</tr>
<tr>
<td>5DH</td>
<td>Data TLB: 4 KByte and 4 MByte pages, 256 entries</td>
</tr>
<tr>
<td>60H</td>
<td>1st-level data cache: 16 KByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>66H</td>
<td>1st-level data cache: 8 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>67H</td>
<td>1st-level data cache: 16 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>68H</td>
<td>1st-level data cache: 32 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>70H</td>
<td>Trace cache: 12 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>71H</td>
<td>Trace cache: 16 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>72H</td>
<td>Trace cache: 32 K-μop, 8-way set associative</td>
</tr>
<tr>
<td>78H</td>
<td>2nd-level cache: 1 MByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>79H</td>
<td>2nd-level cache: 128 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7AH</td>
<td>2nd-level cache: 256 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7BH</td>
<td>2nd-level cache: 512 KByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7CH</td>
<td>2nd-level cache: 1 MByte, 8-way set associative, 64 byte line size, 2 lines per sector</td>
</tr>
<tr>
<td>7DH</td>
<td>2nd-level cache: 2 MByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>7FH</td>
<td>2nd-level cache: 512 KByte, 2-way set associative, 64-byte line size</td>
</tr>
<tr>
<td>82H</td>
<td>2nd-level cache: 256 KByte, 8-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>83H</td>
<td>2nd-level cache: 512 KByte, 8-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>84H</td>
<td>2nd-level cache: 1 MByte, 8-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>85H</td>
<td>2nd-level cache: 2 MByte, 8-way set associative, 32 byte line size</td>
</tr>
<tr>
<td>86H</td>
<td>2nd-level cache: 512 KByte, 4-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>87H</td>
<td>2nd-level cache: 1 MByte, 8-way set associative, 64 byte line size</td>
</tr>
<tr>
<td>B0H</td>
<td>Instruction TLB: 4 KByte pages, 4-way set associative, 128 entries</td>
</tr>
</tbody>
</table>

Example 2-1. Example of Cache and TLB Interpretation

The first member of the family of Pentium 4 processors returns the following information about caches and TLBs when the CPUID executes with an input value of 2:

EAX  66 5B 50 01H
EBX  0H
ECX  0H
EDX  00 7A 70 00H

Which means:

- The least-significant byte (byte 0) of register EAX is set to 01H. This indicates that CPUID needs to be executed once with an input value of 2 to retrieve complete information about caches and TLBs.
- The most-significant bit of all four registers (EAX, EBX, ECX, and EDX) is set to 0, indicating that each register contains valid 1-byte descriptors.
- Bytes 1, 2, and 3 of register EAX indicate that the processor has:
  - 50H - a 64-entry instruction TLB, for mapping 4-KByte and 2-MByte or 4-MByte pages.
  - 5BH - a 64-entry data TLB, for mapping 4-KByte and 4-MByte pages.
  - 66H - an 8-KByte 1st level data cache, 4-way set associative, with a 64-Byte cache line size.
- The descriptors in registers EBX and ECX are valid, but contain NULL descriptors.
- Bytes 0, 1, 2, and 3 of register EDX indicate that the processor has:
  - 00H - NULL descriptor.
  - 70H - Trace cache: 12 K-μop, 8-way set associative.
  - 7AH - a 256-KByte 2nd level cache, 8-way set associative, with a sectored, 64-byte cache line size.
  - 00H - NULL descriptor.
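
The interpretation steps in Example 2-1 can be sketched in Python as follows. This is a minimal illustration rather than a complete decoder: the function name is hypothetical, the lookup table holds only the five descriptors that appear in the example, and the register values are the ones quoted above.

```python
# A few descriptors, with wording taken from Example 2-1.
DESCRIPTORS = {
    0x50: "Instruction TLB: 4-KByte and 2-MByte or 4-MByte pages, 64 entries",
    0x5B: "Data TLB: 4-KByte and 4-MByte pages, 64 entries",
    0x66: "1st-level data cache: 8 KByte, 4-way set associative, 64-byte line size",
    0x70: "Trace cache: 12 K-μop, 8-way set associative",
    0x7A: "2nd-level cache: 256 KByte, 8-way set associative, sectored, 64-byte line size",
}


def cache_descriptors(eax, ebx, ecx, edx):
    """Return (iteration_count, descriptor_bytes) for one CPUID.2 result."""
    iterations = eax & 0xFF  # AL: how many times CPUID.2 must be executed
    found = []
    for index, reg in enumerate((eax, ebx, ecx, edx)):
        if reg & 0x80000000:  # bit 31 set: register is reserved
            continue
        for byte_pos in range(4):
            if index == 0 and byte_pos == 0:
                continue  # AL is the count, not a descriptor
            value = (reg >> (8 * byte_pos)) & 0xFF
            if value != 0x00:  # skip NULL descriptors
                found.append(value)
    return iterations, found


count, found = cache_descriptors(0x665B5001, 0x0, 0x0, 0x007A7000)
for value in found:
    print(f"{value:02X}H - {DESCRIPTORS.get(value, 'not in this sketch')}")
```

Because the order of descriptors within the registers is not defined, the sketch treats the result as an unordered collection and does not attach meaning to byte positions other than AL.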

Table 2-24. Encoding of Cache and TLB Descriptors (Continued)

<table>
<thead>
<tr>
<th>Descriptor Value</th>
<th>Cache or TLB Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>B1H</td>
<td>Instruction TLB: 2M pages, 4-way, 8 entries or 4M pages, 4-way, 4 entries</td>
</tr>
<tr>
<td>B3H</td>
<td>Data TLB: 4 KByte pages, 4-way set associative, 128 entries</td>
</tr>
<tr>
<td>B4H</td>
<td>Data TLB1: 4 KByte pages, 4-way associative, 256 entries</td>
</tr>
<tr>
<td>F0H</td>
<td>64-Byte prefetching</td>
</tr>
<tr>
<td>F1H</td>
<td>128-Byte prefetching</td>
</tr>
</tbody>
</table>

INPUT EAX = 4: Returns Deterministic Cache Parameters for Each Level

When CPUID executes with EAX set to 4 and ECX contains an index value, the processor returns encoded data that describe a set of deterministic cache parameters (for the cache level associated with the input in ECX). Valid index values start from 0.

Software can enumerate the deterministic cache parameters for each level of the cache hierarchy by starting with an index value of 0 and incrementing it until the cache type field of the returned parameters is 0. The architecturally defined fields reported by deterministic cache parameters are documented in Table 2-19.

The CPUID leaf 4 also reports data that can be used to derive the topology of processor cores in a physical package. This information is constant for all valid index values. Software can query the raw data reported by executing CPUID with EAX=4 and ECX=0 and use it as part of the topology enumeration algorithm described in Chapter 7, "Multiple-Processor Management," in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
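
The enumeration loop for leaf 4 might look like the following Python sketch. The `cpuid_leaf4` callable is hypothetical and stands in for whatever mechanism returns the four registers. The field positions in the comments are assumptions about the deterministic cache parameters leaf and should be checked against Table 2-19, which is not reproduced in this section. The register values in `example_leaf4` are made up for illustration.

```python
def enumerate_caches(cpuid_leaf4, max_index=16):
    """Collect cache parameters until the cache type field reads 0.

    cpuid_leaf4(index) is a caller-supplied, hypothetical function returning
    the (EAX, EBX, ECX, EDX) tuple for CPUID with EAX = 4 and ECX = index.
    max_index is only a safety bound for this sketch.
    """
    caches = []
    for index in range(max_index):
        eax, ebx, ecx, _edx = cpuid_leaf4(index)
        cache_type = eax & 0x1F                   # assumed EAX[4:0]
        if cache_type == 0:                       # no more caches
            break
        line_size = (ebx & 0xFFF) + 1             # assumed EBX[11:0] + 1
        partitions = ((ebx >> 12) & 0x3FF) + 1    # assumed EBX[21:12] + 1
        ways = ((ebx >> 22) & 0x3FF) + 1          # assumed EBX[31:22] + 1
        sets = ecx + 1                            # assumed ECX + 1
        caches.append({
            "index": index,
            "type": cache_type,
            "level": (eax >> 5) & 0x7,            # assumed EAX[7:5]
            "size_bytes": ways * partitions * line_size * sets,
        })
    return caches


def example_leaf4(index):
    """Made-up register values describing a single 32-KByte, 8-way cache."""
    if index == 0:
        return ((1 << 5) | 1, (7 << 22) | 63, 63, 0)
    return (0, 0, 0, 0)


caches = enumerate_caches(example_leaf4)
```

Passing the register source in as a parameter keeps the decoding logic independent of how CPUID is actually executed on a given platform.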

INPUT EAX = 5: Returns MONITOR and MWAIT Features

When CPUID executes with EAX set to 5, the processor returns information about features available to MONITOR/MWAIT instructions. The MONITOR instruction is used for address-range monitoring in conjunction with MWAIT instruction. The MWAIT instruction optionally provides additional extensions for advanced power management. See Table 2-19.

INPUT EAX = 6: Returns Thermal and Power Management Features

When CPUID executes with EAX set to 6, the processor returns information about thermal and power management features. See Table 2-19.

INPUT EAX = 9: Returns Direct Cache Access Information

When CPUID executes with EAX set to 9, the processor returns information about Direct Cache Access capabilities. See Table 2-19.

INPUT EAX = 10: Returns Architectural Performance Monitoring Features

When CPUID executes with EAX set to 10, the processor returns information about support for architectural performance monitoring capabilities. Architectural performance monitoring is supported if the version ID is greater than 0. See Table 2-19.

For each version of architectural performance monitoring capability, software must enumerate this leaf to discover the programming facilities and the architectural performance events available in the processor. The details are described in Chapter 18, “Debugging and Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

INPUT EAX = 11: Returns Extended Topology Information

When CPUID executes with EAX set to 11, the processor returns information about extended topology enumeration data. Software must detect the presence of CPUID leaf 0BH by verifying (a) the highest leaf index supported by CPUID is >= 0BH, and (b) CPUID.0BH:EBX[15:0] reports a non-zero value.
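
The two-part presence check for leaf 0BH can be expressed as a small predicate. The Python sketch below assumes the caller has already obtained the highest basic leaf (EAX from CPUID with EAX = 0) and the EBX value returned for leaf 0BH. The function and parameter names are illustrative.

```python
def extended_topology_supported(max_basic_leaf, leaf_0b_ebx):
    """Apply conditions (a) and (b) for the presence of CPUID leaf 0BH.

    max_basic_leaf: EAX returned by CPUID with EAX = 0.
    leaf_0b_ebx:    EBX returned by CPUID with EAX = 0BH; it is ignored
                    when condition (a) fails.
    """
    if max_basic_leaf < 0x0B:              # (a) highest leaf index >= 0BH
        return False
    return (leaf_0b_ebx & 0xFFFF) != 0     # (b) EBX[15:0] is non-zero
```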

INPUT EAX = 13: Returns Processor Extended States Enumeration Information

When CPUID executes with EAX set to 13 and ECX = 0, the processor returns information about the bit-vector representation of all processor state extensions that are supported in the processor and storage size requirements of the XSAVE/XRSTOR area. See Table 2-19.

When CPUID executes with EAX set to 13 and ECX = n (n > 1 and less than the number of non-zero bits in CPUID.(EAX=0DH, ECX= 0H).EAX and CPUID.(EAX=0DH, ECX= 0H).EDX), the processor returns information about the size and offset of each processor extended state save area within the XSAVE/XRSTOR area. See Table 2-19.

METHODS FOR RETURNING BRANDING INFORMATION

Use the following techniques to access branding information:

1. Processor brand string method; this method also returns the processor’s maximum operating frequency.
2. Processor brand index method; this method uses a software-supplied brand string table.

These two methods are discussed in the following sections. For methods that are available in early processors, see Section: “Identification of Earlier IA-32 Processors” in Chapter 14 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1.

The Processor Brand String Method

Figure 2-5 describes the algorithm used for detection of the brand string. Processor brand identification software should execute this algorithm on all Intel 64 and IA-32 processors.

This method (introduced with Pentium 4 processors) returns an ASCII brand identification string and the maximum operating frequency of the processor to the EAX, EBX, ECX, and EDX registers.

How Brand Strings Work

To use the brand string method, execute CPUID with EAX input of 80000002H through 80000004H. For each input value, CPUID returns 16 ASCII characters using EAX, EBX, ECX, and EDX. The returned string will be NULL-terminated.

Table 2-25 shows the brand string that is returned by the first processor in the Pentium 4 processor family.

**Table 2-25. Processor Brand String Returned with Pentium 4 Processor**

<table>
<thead>
<tr>
<th>EAX Input Value</th>
<th>Return Values</th>
<th>ASCII Equivalent</th>
</tr>
</thead>
<tbody>
<tr>
<td>80000002H</td>
<td>EAX = 20202020H</td>
<td>“    ”</td>
</tr>
<tr>
<td></td>
<td>EBX = 20202020H</td>
<td>“    ”</td>
</tr>
<tr>
<td></td>
<td>ECX = 20202020H</td>
<td>“    ”</td>
</tr>
<tr>
<td></td>
<td>EDX = 6E492020H</td>
<td>“nI  ”</td>
</tr>
<tr>
<td>80000003H</td>
<td>EAX = 286C6574H</td>
<td>“(let”</td>
</tr>
<tr>
<td></td>
<td>EBX = 50202952H</td>
<td>“P )R”</td>
</tr>
<tr>
<td></td>
<td>ECX = 69746E65H</td>
<td>“itne”</td>
</tr>
<tr>
<td></td>
<td>EDX = 52286D75H</td>
<td>“R(mu”</td>
</tr>
<tr>
<td>80000004H</td>
<td>EAX = 20342029H</td>
<td>“ 4 )”</td>
</tr>
<tr>
<td></td>
<td>EBX = 20555043H</td>
<td>“ UPC”</td>
</tr>
<tr>
<td></td>
<td>ECX = 30303531H</td>
<td>“0051”</td>
</tr>
<tr>
<td></td>
<td>EDX = 007A484DH</td>
<td>“\0zHM”</td>
</tr>
</tbody>
</table>

Figure 2-5. Determination of Support for the Processor Brand String
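
To show how the register values in Table 2-25 combine into a readable string, the following Python sketch packs each register in little-endian byte order, so that the least-significant byte of EAX supplies the first character, and stops at the NULL terminator. The function and constant names are illustrative, and the input values are the ones listed in the table.

```python
import struct

# Register values from Table 2-25, in the order EAX, EBX, ECX, EDX
# for inputs 80000002H, 80000003H, and 80000004H.
TABLE_2_25 = [
    (0x20202020, 0x20202020, 0x20202020, 0x6E492020),
    (0x286C6574, 0x50202952, 0x69746E65, 0x52286D75),
    (0x20342029, 0x20555043, 0x30303531, 0x007A484D),
]


def brand_string(leaves):
    """Concatenate 3 x 4 registers into the NULL-terminated brand string."""
    # "<4I" packs four unsigned 32-bit values, least-significant byte first.
    raw = b"".join(struct.pack("<4I", *regs) for regs in leaves)
    # Keep everything before the first NULL byte.
    return raw.split(b"\x00", 1)[0].decode("ascii")


print(repr(brand_string(TABLE_2_25)))
```

The leading spaces are part of the returned string; software that displays the brand string may wish to strip them.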

#### Extracting the Maximum Processor Frequency from Brand Strings

Figure 2-6 provides an algorithm which software can use to extract the maximum processor operating frequency from the processor brand string.
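
The figure itself is not reproduced in this text, so the following Python sketch is only one plausible way to perform the extraction and should not be read as the algorithm of Figure 2-6. It assumes the frequency appears at the end of the brand string as a decimal number followed by MHz, GHz, or THz. The function name is hypothetical.

```python
import re

# Assumed unit suffixes and their multipliers in hertz.
_MULTIPLIERS = {"MHz": 10**6, "GHz": 10**9, "THz": 10**12}


def max_frequency_hz(brand):
    """Return the frequency named at the end of the brand string, in Hz.

    Returns None when the string does not end with a recognized
    "<number><unit>" pattern.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*(MHz|GHz|THz)\s*$", brand)
    if match is None:
        return None
    digits, unit = match.groups()
    return round(float(digits) * _MULTIPLIERS[unit])
```

For the Pentium 4 brand string ending in "1500MHz", this sketch yields 1,500,000,000 Hz.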

**NOTE**

When a frequency is given in a brand string, it is the maximum qualified frequency of the processor, not the frequency at which the processor is currently running.

The Processor Brand Index Method

The brand index method (introduced with Pentium® III Xeon® processors) provides an entry point into a brand identification table that is maintained in memory by system software and is accessible from system- and user-level code. In this table, each brand index is associated with an ASCII brand identification string that identifies the official Intel family and model number of a processor.

When CPUID executes with EAX set to 1, the processor returns a brand index to the low byte in EBX. Software can then use this index to locate the brand identification string for the processor in the brand identification table. The first entry (brand index 0) in this table is reserved, allowing for backward compatibility with processors that do not support the brand identification feature. Starting with processor signature family ID = 0FH, model = 03H, the brand index method is no longer supported. Use the brand string method instead.

Table 2-26 shows brand indices that have identification strings associated with them.

Table 2-26. Mapping of Brand Indices and Intel 64 and IA-32 Processor Brand Strings

<table>
<thead>
<tr>
<th>Brand Index</th>
<th>Brand String</th>
</tr>
</thead>
<tbody>
<tr>
<td>00H</td>
<td>This processor does not support the brand identification feature</td>
</tr>
<tr>
<td>01H</td>
<td>Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>02H</td>
<td>Intel(R) Pentium(R) III processor¹</td>
</tr>
<tr>
<td>03H</td>
<td>Intel(R) Pentium(R) III Xeon(R) processor; If processor signature = 000006B1h, then Intel(R) Celeron(R) processor</td>
</tr>
<tr>
<td>04H</td>
<td>Intel(R) Pentium(R) III processor</td>
</tr>
<tr>
<td>06H</td>
<td>Mobile Intel(R) Pentium(R) III processor-M</td>
</tr>
<tr>
<td>07H</td>
<td>Mobile Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>08H</td>
<td>Intel(R) Pentium(R) 4 processor</td>
</tr>
<tr>
<td>09H</td>
<td>Intel(R) Pentium(R) 4 processor</td>
</tr>
<tr>
<td>0AH</td>
<td>Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>0BH</td>
<td>Intel(R) Xeon(R) processor; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor MP</td>
</tr>
<tr>
<td>0CH</td>
<td>Intel(R) Xeon(R) processor MP</td>
</tr>
<tr>
<td>0EH</td>
<td>Mobile Intel(R) Pentium(R) 4 processor-M; If processor signature = 00000F13h, then Intel(R) Xeon(R) processor</td>
</tr>
<tr>
<td>0FH</td>
<td>Mobile Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>11H</td>
<td>Mobile Genuine Intel(R) processor</td>
</tr>
<tr>
<td>12H</td>
<td>Intel(R) Celeron(R) M processor</td>
</tr>
<tr>
<td>13H</td>
<td>Mobile Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>14H</td>
<td>Intel(R) Celeron(R) processor</td>
</tr>
<tr>
<td>15H</td>
<td>Mobile Genuine Intel(R) processor</td>
</tr>
<tr>
<td>16H</td>
<td>Intel(R) Pentium(R) M processor</td>
</tr>
<tr>
<td>17H</td>
<td>Mobile Intel(R) Celeron(R) processor¹</td>
</tr>
<tr>
<td>18H - 0FFH</td>
<td>RESERVED</td>
</tr>
</tbody>
</table>

NOTES:
1. Indicates versions of these processors that were introduced after the Pentium III.

**IA-32 Architecture Compatibility**

CPUID is not supported in early models of the Intel486 processor or in any IA-32 processor earlier than the Intel486 processor.

**Operation**

IA32_BIOS_SIGN_ID MSR ← Update with installed microcode revision number;

CASE (EAX) OF

  EAX = 0:
  EAX ← Highest basic function input value understood by CPUID;
  EBX ← Vendor identification string;
  EDX ← Vendor identification string;
  ECX ← Vendor identification string;
  BREAK;

  EAX = 1H:
  EAX[3:0] ← Stepping ID;
  EAX[7:4] ← Model;
  EAX[11:8] ← Family;
  EAX[13:12] ← Processor type;
  EAX[15:14] ← Reserved;
  EAX[19:16] ← Extended Model;
  EAX[27:20] ← Extended Family;
  EAX[31:28] ← Reserved;
  EBX[7:0] ← Brand Index; (* Reserved if the value is zero. *)
  EBX[15:8] ← CLFLUSH Line Size;
  EBX[23:16] ← Reserved; (* Number of threads enabled = 2 if MT enable fuse set. *)
  EBX[31:24] ← Initial APIC ID;
  ECX ← Feature flags; (* See Figure 2-3. *)
  EDX ← Feature flags; (* See Figure 2-4. *)
  BREAK;

  EAX = 2H:
  EAX ← Cache and TLB information;
  EBX ← Cache and TLB information;
  ECX ← Cache and TLB information;
  EDX ← Cache and TLB information;
  BREAK;

  EAX = 3H:
  EAX ← Reserved;
  EBX ← Reserved;
  ECX ← ProcessorSerialNumber[31:0];
  (* Pentium III processors only, otherwise reserved. *)
  EDX ← ProcessorSerialNumber[63:32];
  (* Pentium III processors only, otherwise reserved. *)
  BREAK;
EAX = 4H:
   EAX ← Deterministic Cache Parameters Leaf; (* See Table 2-19. *)
   EBX ← Deterministic Cache Parameters Leaf;
   ECX ← Deterministic Cache Parameters Leaf;
   EDX ← Deterministic Cache Parameters Leaf;
BREAK;
EAX = 5H:
   EAX ← MONITOR/MWAIT Leaf; (* See Table 2-19. *)
   EBX ← MONITOR/MWAIT Leaf;
   ECX ← MONITOR/MWAIT Leaf;
   EDX ← MONITOR/MWAIT Leaf;
BREAK;
EAX = 6H:
   EAX ← Thermal and Power Management Leaf; (* See Table 2-19. *)
   EBX ← Thermal and Power Management Leaf;
   ECX ← Thermal and Power Management Leaf;
   EDX ← Thermal and Power Management Leaf;
BREAK;
EAX = 7H or 8H:
   EAX ← Reserved = 0;
   EBX ← Reserved = 0;
   ECX ← Reserved = 0;
   EDX ← Reserved = 0;
BREAK;
EAX = 9H:
   EAX ← Direct Cache Access Information Leaf; (* See Table 2-19. *)
   EBX ← Direct Cache Access Information Leaf;
   ECX ← Direct Cache Access Information Leaf;
   EDX ← Direct Cache Access Information Leaf;
BREAK;
EAX = AH:
   EAX ← Architectural Performance Monitoring Leaf; (* See Table 2-19. *)
   EBX ← Architectural Performance Monitoring Leaf;
   ECX ← Architectural Performance Monitoring Leaf;
   EDX ← Architectural Performance Monitoring Leaf;
BREAK;
EAX = BH:
   EAX ← Extended Topology Enumeration Leaf; (* See Table 2-19. *)
   EBX ← Extended Topology Enumeration Leaf;
   ECX ← Extended Topology Enumeration Leaf;
   EDX ← Extended Topology Enumeration Leaf;
BREAK;
EAX = CH:
    EAX ← Reserved = 0;
    EBX ← Reserved = 0;
    ECX ← Reserved = 0;
    EDX ← Reserved = 0;
BREAK;
EAX = DH:
    EAX ← Processor Extended State Enumeration Leaf; (* See Table 2-19. *)
    EBX ← Processor Extended State Enumeration Leaf;
    ECX ← Processor Extended State Enumeration Leaf;
    EDX ← Processor Extended State Enumeration Leaf;
BREAK;
EAX = 80000000H:
    EAX ← Highest extended function input value understood by CPUID;
    EBX ← Reserved;
    ECX ← Reserved;
    EDX ← Reserved;
BREAK;
EAX = 80000001H:
    EAX ← Reserved;
    EBX ← Reserved;
    ECX ← Extended Feature Bits (* See Table 2-19.*);
    EDX ← Extended Feature Bits (* See Table 2-19.*);
BREAK;
EAX = 80000002H:
    EAX ← Processor Brand String;
    EBX ← Processor Brand String, continued;
    ECX ← Processor Brand String, continued;
    EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000003H:
    EAX ← Processor Brand String, continued;
    EBX ← Processor Brand String, continued;
    ECX ← Processor Brand String, continued;
    EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000004H:
    EAX ← Processor Brand String, continued;
    EBX ← Processor Brand String, continued;
    ECX ← Processor Brand String, continued;
    EDX ← Processor Brand String, continued;
BREAK;
EAX = 80000005H:
    EAX ← Reserved = 0;
    EBX ← Reserved = 0;
    ECX ← Reserved = 0;
    EDX ← Reserved = 0;
BREAK;

EAX = 80000006H:
    EAX ← Reserved = 0;
    EBX ← Reserved = 0;
    ECX ← Cache information;
    EDX ← Reserved = 0;
BREAK;

EAX = 80000007H:
    EAX ← Reserved = 0;
    EBX ← Reserved = 0;
    ECX ← Reserved = 0;
    EDX ← Reserved = 0;
BREAK;

EAX = 80000008H:
    EAX ← Reserved = 0;
    EBX ← Reserved = 0;
    ECX ← Reserved = 0;
    EDX ← Reserved = 0;
BREAK;

DEFAULT: (* EAX = Value outside of recognized range for CPUID. *)
    (* If the highest basic information leaf data depend on ECX input value, ECX is honored.*)
    EAX ← Reserved; (* Information returned for highest basic information leaf.*)
    EBX ← Reserved; (* Information returned for highest basic information leaf.*)
    ECX ← Reserved; (* Information returned for highest basic information leaf.*)
    EDX ← Reserved; (* Information returned for highest basic information leaf.*)
BREAK;

ESAC;
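
As an illustration of how the three brand-string leaves (80000002H through 80000004H) combine, the following is a minimal sketch. It assumes the register values have already been obtained by some other means; the function name `brand_string` is hypothetical. It relies on each register holding four ASCII characters with the least-significant byte first, in the order EAX, EBX, ECX, EDX.

```python
import struct


def brand_string(leaf_regs):
    """Assemble the brand string from three (EAX, EBX, ECX, EDX) tuples.

    leaf_regs holds the values returned for EAX = 80000002H, 80000003H
    and 80000004H, in that order (16 bytes per leaf, 48 bytes in total).
    """
    raw = b"".join(struct.pack("<4I", *regs) for regs in leaf_regs)
    # The string is NUL-terminated and may carry padding spaces.
    return raw.split(b"\x00", 1)[0].decode("ascii").strip()
```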

Flags Affected
None.

Exceptions (All Operating Modes)
#UD If the LOCK prefix is used.
In earlier IA-32 processors that do not support the CPUID instruction, execution of the instruction results in an invalid opcode (#UD) exception being generated.

This chapter describes the operating system programming considerations for AVX. The AES extension and PCLMULQDQ instruction follow the same system software requirements for XMM state support and SIMD floating-point exception support as SSE2, SSE3, SSSE3, and SSE4 (see Chapter 12 of *IA-32 Intel Architecture Software Developer's Manual, Volume 3A*).

The AVX and FMA extensions operate on 256-bit YMM registers and require the operating system to support processor extended state management using the XSAVE/XRSTOR instructions. VAESDEC/VAESDECLAST/VAESENC/VAESENCLAST/VAESIMC/VAESKEYGENASSIST/VPCLMULQDQ follow the same system programming requirements as AVX and FMA instructions operating on YMM states.

The basic requirements for an operating system using XSAVE/XRSTOR to manage processor extended states for current and future Intel Architecture processors can be found in Chapter 12 of *IA-32 Intel Architecture Software Developer's Manual, Volume 3A*. This chapter covers the additional requirements for an OS to support YMM state.

### 3.1 YMM State, VEX Prefix and Supported Operating Modes

AVX and FMA instructions comprise 256-bit and 128-bit instructions that operate on YMM states via VEX prefix encoding. SIMD instructions operating on XMM states (i.e. not accessing the upper 128 bits of YMM) generally do not use the VEX prefix.

For processors that support YMM states, the YMM state exists in all operating modes. However, the available interfaces to access YMM states may vary in different modes. The processor’s support for instruction extensions that employ VEX prefix encoding is independent of the processor’s support for YMM state.

Instructions requiring VEX prefix encoding are generally supported in 64-bit mode, 32-bit modes, and 16-bit protected mode. They are not supported in Real mode, Virtual-8086 mode, or upon entry into SMM.

Note that bits 255:128 of YMM register state are maintained across transitions into and out of these modes. Because the XSAVE/XRSTOR instructions can operate in all operating modes, software in any operating mode can modify the processor's YMM register state by executing XRSTOR. XRSTOR updates the YMM registers from the state information stored in the XSAVE/XRSTOR area residing in memory.

### 3.2 YMM STATE MANAGEMENT

Operating systems must use the XSAVE/XRSTOR instructions for YMM state management. The XSAVE/XRSTOR instructions also provide a flexible and efficient interface to manage XMM/MXCSR states and x87 FPU states in conjunction with new processor extended states.

An OS must enable its YMM state management to support AVX and FMA extensions. Otherwise, an attempt to execute an instruction in the AVX or FMA extensions (including the enhanced 128-bit SIMD instructions that use VEX encoding) will cause a #UD exception.

### 3.2.1 Detection of YMM State Support

Hardware support for new processor extended state is detected through the main leaf of CPUID function 0DH, with index ECX = 0. The return value in EDX:EAX of CPUID.(EAX=0DH, ECX=0) is a 64-bit wide bit vector of the processor state components that the hardware supports: bit 0 of EAX corresponds to x87 FPU state, CPUID.(EAX=0DH, ECX=0):EAX[1] corresponds to SSE state (XMM registers and MXCSR), and CPUID.(EAX=0DH, ECX=0):EAX[2] corresponds to YMM state.
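
The bit vector can be decoded as in the following minimal sketch. It assumes the EAX and EDX values have already been read from CPUID.(EAX=0DH, ECX=0); the function and table names are hypothetical. The YMM check also requires the SSE bit, since the lower 128 bits of a YMM register alias an XMM register.

```python
# Bit positions in the EDX:EAX vector of CPUID.(EAX=0DH, ECX=0).
STATE_COMPONENTS = {0: "x87", 1: "SSE", 2: "YMM"}


def supported_state_components(eax, edx=0):
    """Return the names of the state components flagged in EDX:EAX."""
    vector = ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
    return [name for bit, name in sorted(STATE_COMPONENTS.items())
            if (vector >> bit) & 1]


def ymm_state_supported(eax, edx=0):
    """True when both the SSE bit (1) and the YMM bit (2) are reported."""
    names = supported_state_components(eax, edx)
    return "SSE" in names and "YMM" in names
```

For example, a hypothetical return value of EAX = 7H reports all three components, while EAX = 3H reports x87 and SSE state only.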

### 3.2.2 Enabling of YMM State

An OS can enable YMM state support with the following steps:

- Verify the processor supports XSAVE/XRSTOR/XSETBV/XGETBV instructions and the XFEATURE_ENABLED_MASK register by checking CPUID.1:ECX.XSAVE[bit 26]=1.
- Verify the processor supports YMM state (i.e. bit 2 of XFEATURE_ENABLED_MASK is valid) by checking CPUID.(EAX=0DH, ECX=0):EAX.YMM[bit 2]=1. The OS should also verify CPUID.(EAX=0DH, ECX=0):EAX.SSE[bit 1]=1, because the lower 128 bits of a YMM register are aliased to an XMM register.
- Determine the buffer size requirement for the XSAVE area that will be used by XSAVE/XRSTOR (see CPUID instruction in Section 2.9).
- Set CR4.OSXSAVE[bit 18]=1 to enable the use of XSETBV/XGETBV instructions to write/read the XFEATURE_ENABLED_MASK register.
- Supply an appropriate mask via EDX:EAX to execute XSETBV to enable the processor state components that the OS wishes to manage using the XSAVE/XRSTOR instructions. To enable x87 FPU, SSE and YMM state management using XSAVE/XRSTOR, the enable mask is EDX=0H, EAX=7H (the individual bits of XFEATURE_ENABLED_MASK are listed in Table 3-27).

To enable YMM state, the OS must use EDX:EAX[2:1] = 11B when executing XSETBV. An attempt to execute XSETBV with EDX:EAX[2:1] = 10B causes a #GP(0) exception.
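
The mask rules above can be checked before XSETBV is executed. The following is a minimal sketch with a hypothetical function name, not a description of what the processor does internally. The 10B rule comes from the text above; the checks that bit 0 is set and that no unreported component is enabled are assumptions drawn from the general XSETBV rules.

```python
X87, SSE, YMM = 1 << 0, 1 << 1, 1 << 2


def check_xsetbv_mask(mask, supported):
    """Return None if the mask looks acceptable, else a reason string.

    mask:      64-bit EDX:EAX value the OS intends to pass to XSETBV.
    supported: 64-bit EDX:EAX value from CPUID.(EAX=0DH, ECX=0).
    """
    if mask & ~supported:
        return "mask enables a state component the processor does not report"
    if not (mask & X87):
        return "bit 0 (x87) must be 1"
    if (mask & YMM) and not (mask & SSE):
        return "EDX:EAX[2:1] = 10B: YMM state requires SSE state"
    return None
```

With a processor that reports 7H, the mask 7H passes, while 5H (YMM without SSE) is rejected.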

### Table 3-27. XFEATURE_ENABLED_MASK and Processor State Components

<table>
<thead>
<tr>
<th>Bit</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 - x87</td>
<td>If set, the processor supports x87 FPU state management via XSAVE/XRSTOR. This bit must be 1 if CPUID.01H:ECX.XSAVE[26] = 1.</td>
</tr>
<tr>
<td>1 - SSE</td>
<td>If set, the processor supports SSE state (XMM and MXCSR) management via XSAVE/XRSTOR. This bit must be set to '1' to enable AVX.</td>
</tr>
<tr>
<td>2 - YMM</td>
<td>If set, the processor supports YMM state (upper 128 bits of YMM registers) management via XSAVE. This bit must be set to '1' to enable AVX and FMA.</td>
</tr>
</tbody>
</table>

### 3.2.3 Enabling of SIMD Floating-Point Exception Support

AVX and FMA instructions may generate SIMD floating-point exceptions. An OS must enable SIMD floating-point exception support by setting CR4.OSXMMEXCPT[bit 10]=1.

The CR4 settings that affect AVX and FMA enabling are listed in Table 3-28.

### Table 3-28. CR4 bits for AVX New Instructions technology support

<table>
<thead>
<tr>
<th>Bit</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>CR4.OSXSAVE[bit 18]</td>
<td>If set, the OS supports use of the XSETBV/XGETBV instructions to access the XFEATURE_ENABLED_MASK register and of XSAVE/XRSTOR to manage processor extended state. Must be set to '1' to enable AVX and FMA.</td>
</tr>
<tr>
<td>CR4.OSXMMEXCPT[bit 10]</td>
<td>Must be set to 1 to enable SIMD floating-point exceptions. This applies to AVX, FMA operating on YMM states, and legacy 128-bit SIMD floating-point instructions operating on XMM states.</td>
</tr>
<tr>
<td>CR4.OSFXSR[bit 9]</td>
<td>Ignored by AVX and FMA instructions operating on YMM states. Must be set to 1 to enable SIMD instructions operating on XMM state.</td>
</tr>
</tbody>
</table>
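
The conditions in Tables 3-27 and 3-28 can be combined into a single predicate. This is a minimal sketch that assumes the CR4 and XFEATURE_ENABLED_MASK values are already available as integers; the function names are hypothetical, and processor support reported by CPUID is not modeled.

```python
CR4_OSFXSR = 1 << 9       # ignored by AVX and FMA operating on YMM state
CR4_OSXMMEXCPT = 1 << 10  # SIMD floating-point exception support
CR4_OSXSAVE = 1 << 18     # XSETBV/XGETBV and XSAVE/XRSTOR support


def avx_enabled(cr4, xfeature_enabled_mask):
    """True if the OS has set the bits that AVX and FMA require."""
    osxsave = bool(cr4 & CR4_OSXSAVE)
    # Both the SSE bit (1) and the YMM bit (2) must be set.
    sse_and_ymm = (xfeature_enabled_mask & 0b110) == 0b110
    return osxsave and sse_and_ymm


def simd_fp_exceptions_enabled(cr4):
    """True if CR4.OSXMMEXCPT is set."""
    return bool(cr4 & CR4_OSXMMEXCPT)
```

Note that CR4.OSFXSR does not appear in `avx_enabled`, matching its "ignored" status in Table 3-28.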

### 3.2.4 The Layout of XSAVE Area

The OS must determine the buffer size requirement by querying CPUID with EAX=0DH, ECX=0. If the OS wishes to enable all processor extended state components in the XFEATURE_ENABLED_MASK, it can allocate the buffer size according to CPUID.(EAX=0DH, ECX=0):ECX.

After the memory buffer for XSAVE is allocated, the entire buffer must be cleared to zero prior to use by XSAVE.

For processors that support SSE and YMM states, the XSAVE area layout is listed in Table 3-29. The register fields of the first 512 bytes of the XSAVE area are identical to those of the FXSAVE/FXRSTOR area.

<table>
<thead>
<tr>
<th>Save Areas</th>
<th>Offset (Byte)</th>
<th>Size (Bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>FPU/SSE SaveArea</td>
<td>0</td>
<td>512</td>
</tr>
<tr>
<td>Header</td>
<td>512</td>
<td>64</td>
</tr>
<tr>
<td>Ext_Save_Area_2</td>
<td>CPUID.(EAX=0DH, ECX=2):EBX</td>
<td>CPUID.(EAX=0DH, ECX=2):EAX</td>
</tr>
</tbody>
</table>

The format of the header is as follows (see Table 3-30):

<table>
<thead>
<tr>
<th>15:8</th>
<th>7:0</th>
<th>Byte Offset from Header</th>
<th>Byte Offset from XSAVE Area</th>
</tr>
</thead>
<tbody>
<tr>
<td>Reserved (Must be zero)</td>
<td>XSTATE_BV</td>
<td>0</td>
<td>512</td>
</tr>
<tr>
<td>Reserved</td>
<td>Reserved (Must be zero)</td>
<td>16</td>
<td>528</td>
</tr>
<tr>
<td>Reserved</td>
<td>Reserved</td>
<td>32</td>
<td>544</td>
</tr>
<tr>
<td>Reserved</td>
<td>Reserved</td>
<td>48</td>
<td>560</td>
</tr>
</tbody>
</table>

The Ext_Save_Area[YMM] holds the upper 128 bits of each of the 16 YMM registers, as shown in Table 3-31.

Note in general that the layout of the XSAVE/XRSTOR save area is fixed and may contain non-contiguous individual save areas (Ext_Save_Area_X). The XSAVE/XRSTOR area is not compacted if some processor extended state features are not saved or are not supported by the processor and/or by system software.
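
The fixed layout makes offsets simple to compute. The sketch below uses hypothetical function names and assumes that the upper halves of YMM0 through YMM15 are stored consecutively, 16 bytes each, from the start of Ext_Save_Area_2. The offset and size arguments stand for the values CPUID.(EAX=0DH, ECX=2) returns in EBX and EAX.

```python
LEGACY_AREA_SIZE = 512  # FPU/SSE save area
HEADER_SIZE = 64        # XSAVE header


def ymm_hi_offset(n, ext_area_offset):
    """Byte offset of bits 255:128 of YMMn within the XSAVE area."""
    if not 0 <= n <= 15:
        raise ValueError("YMM register index must be 0..15")
    if ext_area_offset < LEGACY_AREA_SIZE + HEADER_SIZE:
        raise ValueError("extended area would overlap the header")
    # Assumption: YMM0_H..YMM15_H are consecutive, 16 bytes each.
    return ext_area_offset + 16 * n


def xsave_area_size(ext_area_offset, ext_area_size):
    """Minimum buffer size covering the legacy area, header and YMM area."""
    return max(LEGACY_AREA_SIZE + HEADER_SIZE, ext_area_offset + ext_area_size)
```

With an assumed offset of 576 (512 + 64) and an assumed size of 256 (16 registers of 16 bytes), the buffer must be at least 832 bytes.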

### 3.2.5 XSAVE/XRSTOR Interaction with YMM State and MXCSR

The processor's actions on the MXCSR, XMM, and YMM registers as a result of executing XRSTOR are listed in Table 3-32 (both bit 1 and bit 2 of the XFEATURE_ENABLED_MASK register are presumed to be 1). The XMM registers may be initialized by the processor (see XRSTOR operation in Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B). When the MXCSR register is updated from memory, reserved bit checking is enforced. The saving/restoring of MXCSR is bound to both the SSE state and YMM state.

### Table 3-32. XRSTOR Action on MXCSR, XMM Registers, YMM Registers

<table>
<thead>
<tr>
<th colspan="2">EDX:EAX</th>
<th colspan="2">XSTATE_BV</th>
<th rowspan="2">MXCSR</th>
<th rowspan="2">YMM_H Registers</th>
<th rowspan="2">XMM Registers</th>
</tr>
<tr>
<th>Bit 2</th>
<th>Bit 1</th>
<th>Bit 2</th>
<th>Bit 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>X</td>
<td>X</td>
<td>None</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>X</td>
<td>0</td>
<td>Load/Check</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>X</td>
<td>1</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>X</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>X</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>Load/Check</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>Load/Check</td>
</tr>
</tbody>
</table>
SYSTEM PROGRAMMING MODEL

The processor-supplied init values for each processor state component used by XRSTOR are listed in Table 3-33.

### Table 3-33. Processor Supplied Init Values XRSTOR May Use

<table>
<thead>
<tr>
<th>Processor State Component</th>
<th>Processor Supplied Register Values</th>
</tr>
</thead>
<tbody>
<tr>
<td>x87 FPU State</td>
<td>FCW ← 037FH; FTW ← 0FFFFH; FSW ← 0H; FPU CS ← 0H; FPU DS ← 0H; FPU IP ← 0H; FPU DP ← 0; ST0-ST7 ← 0;</td>
</tr>
<tr>
<td>SSE State<sup>1</sup></td>
<td>If 64-bit Mode: XMM0-XMM15 ← 0H; Else XMM0-XMM7 ← 0H</td>
</tr>
<tr>
<td>YMM State<sup>1</sup></td>
<td>If 64-bit Mode: YMM0_H-YMM15_H ← 0H; Else YMM0_H-YMM7_H ← 0H</td>
</tr>
</tbody>
</table>

NOTES:
1. MXCSR state is not updated by processor supplied values. MXCSR state can only be updated by XRSTOR from state information stored in XSAVE/XRSTOR area.
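
The XRSTOR behavior described in this section can be modeled as a small decision function. This is a sketch with a hypothetical function name. The MXCSR rule follows the statement that MXCSR is bound to both the SSE and YMM states. The per-component "load" versus "init" rule is an assumption based on the general XRSTOR definition: a component that is requested and enabled is loaded from memory when its XSTATE_BV bit is 1, and set to the processor-supplied init value otherwise.

```python
def xrstor_actions(requested, enabled, xstate_bv):
    """Model XRSTOR's action on MXCSR, the XMM and the YMM_H registers.

    requested: EDX:EAX mask passed to XRSTOR.
    enabled:   XFEATURE_ENABLED_MASK.
    xstate_bv: XSTATE_BV field from the XSAVE header.
    """
    active = requested & enabled

    def component(bit):
        if not (active >> bit) & 1:
            return "none"
        return "load" if (xstate_bv >> bit) & 1 else "init"

    return {
        # MXCSR is loaded and checked if either SSE or YMM is restored.
        "MXCSR": "load/check" if active & 0b110 else "none",
        "XMM": component(1),
        "YMM_H": component(2),
    }
```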

The action of XSAVE is listed in Table 3-34.

### Table 3-34. XSAVE Action on MXCSR, XMM, YMM Register

<table>
<thead>
<tr>
<th colspan="2">EDX:EAX</th>
<th colspan="2">XFEATURE_ENABLED_MASK</th>
<th rowspan="2">MXCSR</th>
<th rowspan="2">YMM_H Registers</th>
<th rowspan="2">XMM Registers</th>
</tr>
<tr>
<th>Bit 2</th>
<th>Bit 1</th>
<th>Bit 2</th>
<th>Bit 1</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>X</td>
<td>X</td>
<td>None</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>X</td>
<td>1</td>
<td>Store</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>X</td>
<td>0</td>
<td>None</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>0</td>
<td>X</td>
<td>None</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>Store</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>None</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>Store</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>Store</td>
</tr>
</tbody>
</table>

### 3.3 RESET BEHAVIOR

At processor reset:
- YMM0-15 bits[255:0] are set to zero.
- XFEATURE_ENABLED_MASK[2:1] is set to zero, XFEATURE_ENABLED_MASK[0] is set to 1.
- CR4.OSXSAVE[bit 18] (and its mirror CPUID.1.ECX.OSXSAVE[bit 27]) is set to 0.

### 3.4 EMULATION

Setting the CR0.EM bit to 1 provides a technique to emulate Legacy SSE floating-point instruction sets in software. This technique is not supported with AVX or FMA instructions.

If an operating system wishes to emulate AVX instructions, it should set XFEATURE_ENABLED_MASK[2:1] to zero. This will cause AVX instructions to #UD. The operating system can emulate FMA in the same way as AVX instructions.

### 3.5 WRITING AVX FLOATING-POINT EXCEPTION HANDLERS

AVX and FMA floating-point exceptions are handled in an entirely analogous way to Legacy SSE floating-point exceptions. To handle unmasked SIMD floating-point exceptions, the operating system or executive must provide an exception handler. The section titled "SSE and SSE2 SIMD Floating-Point Exceptions" in Chapter 11, "Programming with Streaming SIMD Extensions 2 (SSE2)," of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 1, describes the SIMD floating-point exception classes and gives suggestions for writing an exception handler to handle them.

To indicate that the operating system provides a handler for SIMD floating-point exceptions (#XM), the CR4.OSXMMEXCPT flag (bit 10) must be set.
AVX and FMA instructions are encoded using a more efficient format than previous
instruction extensions in the Intel 64 and IA-32 architecture. The improved encoding
format makes use of a new prefix referred to as "VEX". The VEX prefix may be two or
three bytes long, depending on the instruction semantics. Despite the length of the
VEX prefix, the instruction encoding format using VEX addresses two important
issues: (a) instruction encoding is inefficient because of SIMD prefixes and some
fields of the REX prefix, and (b) both SIMD prefixes and the REX prefix increase
instruction byte length. This chapter describes the instruction encoding format using
VEX.

### 4.1 INSTRUCTION FORMATS

Legacy instruction set extensions in IA-32 architecture employ one or more
"single-purpose" bytes as an "escape opcode", or a required SIMD prefix (66H, F2H,
F3H), to expand the processing capability of the instruction set. Intel 64 architecture
uses the REX prefix to expand the encoding of register access in instruction operands.
Both SIMD prefixes and the REX prefix carry the side effect that they can cause the
length of an instruction to increase significantly. Legacy Intel 64 and IA-32 instruction
sets are limited to supporting instruction syntax of only two operands that can be
encoded to access registers (and only one can access a memory address).

Instruction encoding using VEX prefix provides several advantages:
- Instruction syntax support for three operands and up to four operands when
  necessary. For example, the third source register used by VBLENDVPD is encoded
  using bits 7:4 of the immediate byte.
- Encoding support for vector lengths of 128 bits (using XMM registers) and 256 bits
  (using YMM registers).
- Encoding support for instruction syntax of non-destructive source operands.
- Elimination of the escape opcode byte (0FH) and SIMD prefix byte (66H, F2H, F3H)
  via a compact bit field representation within the VEX prefix.
- Elimination of the need to use the REX prefix to encode the extended half of the
  general-purpose register sets (R8-R15) for direct register access, memory
  addressing, or accessing XMM8-XMM15 (including YMM8-YMM15).
- Flexible and more compact bit fields are provided in the VEX prefix to retain the
  full functionality provided by the REX prefix. REX.W, REX.X, REX.B functionalities
  are provided in the three-byte VEX prefix only, because only a subset of SIMD
  instructions need them.
- Extensibility for future instruction extensions without significant instruction
  length increase.
INSTRUCTION FORMAT

Figure 4-7 shows the Intel 64 instruction encoding format with VEX prefix support. Legacy instructions without a VEX prefix are fully supported and unchanged. The use of a VEX prefix in an Intel 64 instruction is optional, but a VEX prefix is required for Intel 64 instructions that operate on YMM registers or support three- and four-operand syntax. The VEX prefix is not a constant-valued, "single-purpose" byte like 0FH, 66H, F2H, F3H in legacy SSE instructions. The VEX prefix provides substantially richer capability than the REX prefix.

<table>
<thead>
<tr>
<th># Bytes</th>
<th>2,3</th>
<th>1</th>
<th>1</th>
<th>0,1</th>
<th>0,1,2,4</th>
<th>0,1</th>
</tr>
</thead>
<tbody>
<tr>
<td>[Prefixes]</td>
<td>[VEX]</td>
<td>OPCODE</td>
<td>ModR/M</td>
<td>[SIB]</td>
<td>[DISP]</td>
<td>[IMM]</td>
</tr>
</tbody>
</table>

Figure 4-7. Instruction Encoding Format with VEX Prefix

### 4.1.1 VEX and the LOCK prefix
Any VEX-encoded instruction with a LOCK prefix preceding VEX will #UD.

### 4.1.2 VEX and the 66H, F2H, and F3H prefixes
Any VEX-encoded instruction with a 66H, F2H, or F3H prefix preceding VEX will #UD.

### 4.1.3 VEX and the REX prefix
Any VEX-encoded instruction with a REX prefix preceding VEX will #UD.

### 4.1.4 The VEX Prefix
The VEX prefix is encoded in either the two-byte form (the first byte must be C5H) or the three-byte form (the first byte must be C4H). The two-byte VEX is used mainly for 128-bit, scalar, and the most common 256-bit AVX instructions, while the three-byte VEX provides a compact replacement of REX and 3-byte opcode instructions (including AVX and FMA instructions). Beyond the first byte, the VEX prefix consists of a number of bit fields providing specific capabilities; they are shown in Figure 4-8.

The bit fields of the VEX prefix can be summarized by their functional purposes:
- Non-destructive source register encoding (applicable to three and four operand syntax): This is the first source operand in the instruction syntax. It is represented by the notation VEX.vvvv. This field is encoded using 1's complement form (inverted form), i.e. XMM0/YMM0/R0 is encoded as 1111B, XMM15/YMM15/R15 is encoded as 0000B.

- Vector length encoding: This 1-bit field is represented by the notation VEX.L. L=0 means the vector length is 128 bits wide, L=1 means a 256-bit vector. The value of this field is written as VEX.128 or VEX.256 in this document to distinguish it from encoded values of other VEX bit fields.

- REX prefix functionality: Full REX prefix functionality is provided in the three-byte form of the VEX prefix. However, the VEX bit fields providing REX functionality are encoded using 1’s complement form, i.e. XMM0/YMM0/R0 is encoded as 1111B, XMM15/YMM15/R15 is encoded as 0000B.
  - The two-byte form of the VEX prefix only provides the equivalent functionality of REX.R, using 1’s complement encoding. This is represented as VEX.R.
  - The three-byte form of the VEX prefix provides REX.R, REX.X, REX.B functionality using 1’s complement encoding and three dedicated bit fields represented as VEX.R, VEX.X, VEX.B.
  - The three-byte form of the VEX prefix provides the functionality of REX.W only to specific instructions that need to override the default 32-bit operand size for a general-purpose register to 64-bit size in 64-bit mode. For those applicable instructions, the VEX.W field provides the same functionality as REX.W. The VEX.W field can provide completely different functionality for other instructions. Consequently, the use of the REX prefix with VEX-encoded instructions is not allowed. However, the intent of the REX prefix for expanding the register set is reserved for future instruction set extensions using the VEX prefix encoding format.

- Compaction of SIMD prefix: Legacy SSE instructions effectively use SIMD prefixes (66H, F2H, F3H) as an opcode extension field. VEX prefix encoding allows the functional capability of such legacy SSE instructions (operating on XMM registers, bits 255:128 of the corresponding YMM unmodified) to be encoded using the VEX.pp field without the presence of any SIMD prefix. The VEX-encoded 128-bit instruction will zero out bits 255:128 of the destination register. VEX-encoded instructions may have a 128-bit or 256-bit vector length.

- Compaction of two-byte and three-byte opcode: More recently introduced legacy SSE instructions employ two-byte and three-byte opcodes. The one or two leading bytes are 0FH, and 0FH 3AH or 0FH 38H. The one-byte escape (0FH) and two-byte escape (0FH 3AH, 0FH 38H) can also be interpreted as an opcode extension field. The VEX.mmmmm field provides compaction to allow many legacy instructions to be encoded without the constant byte sequence 0FH, 0FH 3AH, or 0FH 38H. These VEX-encoded instructions may have a 128-bit or 256-bit vector length.

The VEX prefix must be the last prefix and must immediately precede the opcode bytes; it follows any other prefixes. If a VEX prefix is present, a REX prefix is not supported.

The 3-byte VEX leaves room for future expansion with 3 reserved bits. REX and the 66h/F2h/F3h prefixes are reclaimed for future use.

The VEX prefix has a two-byte form and a three-byte form. If an instruction syntax can be encoded using the two-byte form, it can also be encoded using the three-byte form of VEX. The latter increases the length of the instruction by one byte. This may be helpful in some situations for code alignment.
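
To make the two forms concrete, the following is a minimal sketch of a VEX prefix encoder. The function name and parameter names are hypothetical. The bit positions of L (bit 2), pp (bits 1:0) and mmmmm (bits 4:0 of byte 1), and the pp and mmmmm code values, are taken from the standard VEX layout and are assumptions as far as this text is concerned. R, X, B and vvvv are passed in natural form and inverted by the encoder.

```python
# SIMD-prefix and escape-byte compaction codes of the standard VEX layout.
PP = {None: 0b00, 0x66: 0b01, 0xF3: 0b10, 0xF2: 0b11}
MMMMM = {(0x0F,): 0b00001, (0x0F, 0x38): 0b00010, (0x0F, 0x3A): 0b00011}


def _inverted(bit):
    """Return the 1's complement of a single bit."""
    return (~bit) & 1


def vex_prefix(r=0, x=0, b=0, w=0, vvvv=0, l=0,
               simd_prefix=None, escape=(0x0F,), force_three_byte=False):
    """Build the VEX prefix bytes, choosing the two-byte form when possible."""
    pp = PP[simd_prefix]
    mmmmm = MMMMM[tuple(escape)]
    # Shared low seven bits: inverted vvvv, then L, then pp.
    low = ((~vvvv & 0xF) << 3) | ((l & 1) << 2) | pp
    # X, B, W and any escape other than 0FH exist only in the three-byte form.
    needs_three = bool(x or b or w) or mmmmm != 0b00001
    if not (needs_three or force_three_byte):
        return bytes([0xC5, (_inverted(r) << 7) | low])
    byte1 = (_inverted(r) << 7) | (_inverted(x) << 6) | (_inverted(b) << 5) | mmmmm
    byte2 = ((w & 1) << 7) | low
    return bytes([0xC4, byte1, byte2])
```

For VADDPS ymm0, ymm1, ymm2 (opcode 58H, ModR/M byte C2H), `vex_prefix(vvvv=1, l=1)` yields C5 F4, giving the four-byte encoding C5 F4 58 C2. Passing `force_three_byte=True` yields C4 E1 74 for the same instruction, one byte longer.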

The VEX prefix supports 256-bit versions of floating-point SSE, SSE2, SSE3, and SSE4 instructions. Some additional support for 128-bit vector integer instructions is provided in Table A-1 of Appendix A. Note, certain new instruction functionality can only be encoded with the VEX prefix (See Appendix A, Table A-2).

The VEX prefix will #UD on any instruction containing MMX register sources or destinations.

Figure 4-8. VEX bitfields

The following subsections describe the various fields in two or three-byte VEX prefix:

### 4.1.4.1 VEX Byte 0, bits[7:0]

VEX Byte 0, bits [7:0] must contain the value 11000101b (C5h) or 11000100b (C4h). The 3-byte VEX uses the C4h first byte, while the 2-byte VEX uses the C5h first byte.

### 4.1.4.2 VEX Byte 1, bit [7] - ‘R’

VEX Byte 1, bit [7] contains a bit analogous to a bit-inverted REX.R. In protected and compatibility modes the bit must be set to ‘1’; otherwise the instruction is LES or LDS. This bit is present in both 2- and 3-byte VEX prefixes. The usage of WRXB bits for legacy instructions is explained in detail in Section 2.2.1.2 of Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A. This bit is stored in bit-inverted format.

### 4.1.4.3 3-byte VEX byte 1, bit[6] - ‘X’

Bit[6] of the 3-byte VEX byte 1 encodes a bit analogous to a bit-inverted REX.X. It is an extension of the SIB Index field in 64-bit modes. In 32-bit modes, this bit must be set to ‘1’; otherwise the instruction is LES or LDS. This bit is available only in the 3-byte VEX prefix. This bit is stored in bit-inverted format.

### 4.1.4.4 3-byte VEX byte 1, bit[5] - ‘B’

Bit[5] of the 3-byte VEX byte 1 encodes a bit analogous to a bit-inverted REX.B. In 64-bit modes, it is an extension of the ModR/M r/m field, or the SIB base field. In 32-bit modes, this bit is ignored. This bit is available only in the 3-byte VEX prefix. This bit is stored in bit-inverted format.

### 4.1.4.5 3-byte VEX byte 2, bit[7] - ‘W’

Bit[7] of the 3-byte VEX byte 2 is represented by the notation VEX.W. It can provide the following functions, depending on the specific opcode.

- For AVX instructions that have equivalent legacy SSE instructions, if REX.W has a meaning in the legacy SSE instruction, VEX.W has the same meaning in the corresponding AVX equivalent form. In 32-bit modes, VEX.W must be set to “0”; otherwise the AVX form will #UD.
- For AVX instructions that have equivalent legacy SSE instructions, if REX.W is a don’t-care in the legacy SSE instruction, VEX.W is ignored in the corresponding AVX equivalent form irrespective of mode.
- For new AVX instructions where VEX.W has no defined function, it is reserved as zero, and setting it to other than zero will cause the instruction to #UD.

### 4.1.4.6 2-byte VEX Byte 1, bits [6:3] and 3-byte VEX Byte 2, bits [6:3] - ‘vvvv’, the Source or Dest Register Specifier

In 32-bit mode the VEX first byte C4 and C5 alias onto the LES and LDS instructions. To maintain compatibility with existing programs the VEX 2nd byte, bits [7:6] must be 11b. To achieve this, the VEX payload bits are selected to place only inverted, 64-bit valid fields (extended register selectors) in these upper bits.
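
A decoder can apply this rule as in the following minimal sketch. The function name is hypothetical, and the 64-bit branch assumes that LES and LDS are not valid instructions in 64-bit mode, so C4H and C5H always introduce a VEX prefix there.

```python
def is_vex_prefix(byte0, byte1, mode64):
    """Decide whether byte0/byte1 start a VEX prefix or LES/LDS.

    byte0:  candidate first byte (C4H or C5H for VEX).
    byte1:  the byte that follows it.
    mode64: True when decoding in 64-bit mode.
    """
    if byte0 not in (0xC4, 0xC5):
        return False
    if mode64:
        # Assumption: LES/LDS do not exist in 64-bit mode.
        return True
    # Outside 64-bit mode, bits [7:6] = 11b select VEX; otherwise the
    # byte is the ModR/M byte of an LES or LDS instruction.
    return (byte1 & 0xC0) == 0xC0
```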

The 2-byte VEX Byte 1, bits [6:3] and the 3-byte VEX Byte 2, bits [6:3] encode a field (shorthand VEX.vvvv). For instructions with two or more source registers and an XMM, YMM, or memory destination, this field encodes the first source register specifier, stored in inverted (1’s complement) form.

VEX.vvvv is not used by instructions with one source (except certain shifts, see below) or by instructions with no XMM, YMM, or memory destination. If an instruction does not use VEX.vvvv, the field must be set to 1111b; otherwise the instruction will #UD.

In 64-bit mode all 4 bits may be used. See Table 4-35 for the encoding of the XMM or YMM registers. In 32-bit and 16-bit modes bit 6 must be 1 (if bit 6 is not 1, the 2-byte VEX version will generate an LDS instruction and the 3-byte VEX version will ignore this bit).

### Table 4-35. VEX.vvvv to register name mapping

<table>
<thead>
<tr>
<th>VEX.vvvv</th>
<th>Dest Register</th>
<th>Valid in Legacy/Compatibility 32-bit modes?</th>
</tr>
</thead>
<tbody>
<tr>
<td>1111B</td>
<td>XMM0/YMM0</td>
<td>Valid</td>
</tr>
<tr>
<td>1110B</td>
<td>XMM1/YMM1</td>
<td>Valid</td>
</tr>
<tr>
<td>1101B</td>
<td>XMM2/YMM2</td>
<td>Valid</td>
</tr>
<tr>
<td>1100B</td>
<td>XMM3/YMM3</td>
<td>Valid</td>
</tr>
<tr>
<td>1011B</td>
<td>XMM4/YMM4</td>
<td>Valid</td>
</tr>
<tr>
<td>1010B</td>
<td>XMM5/YMM5</td>
<td>Valid</td>
</tr>
<tr>
<td>1001B</td>
<td>XMM6/YMM6</td>
<td>Valid</td>
</tr>
<tr>
<td>1000B</td>
<td>XMM7/YMM7</td>
<td>Valid</td>
</tr>
<tr>
<td>0111B</td>
<td>XMM8/YMM8</td>
<td>Invalid</td>
</tr>
<tr>
<td>0110B</td>
<td>XMM9/YMM9</td>
<td>Invalid</td>
</tr>
<tr>
<td>0101B</td>
<td>XMM10/YMM10</td>
<td>Invalid</td>
</tr>
<tr>
<td>0100B</td>
<td>XMM11/YMM11</td>
<td>Invalid</td>
</tr>
<tr>
<td>0011B</td>
<td>XMM12/YMM12</td>
<td>Invalid</td>
</tr>
<tr>
<td>0010B</td>
<td>XMM13/YMM13</td>
<td>Invalid</td>
</tr>
<tr>
<td>0001B</td>
<td>XMM14/YMM14</td>
<td>Invalid</td>
</tr>
<tr>
<td>0000B</td>
<td>XMM15/YMM15</td>
<td>Invalid</td>
</tr>
</tbody>
</table>

The VEX.vvvv field is encoded in bit inverted format for accessing a register operand.
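
The mapping in Table 4-35 is a 4-bit 1's complement, which the following minimal sketch reproduces. The function names are hypothetical.

```python
def encode_vvvv(reg):
    """Return the VEX.vvvv field for XMM/YMM register number reg (0..15)."""
    if not 0 <= reg <= 15:
        raise ValueError("register number must be 0..15")
    return ~reg & 0xF


def decode_vvvv(field):
    """Return the register number encoded by a 4-bit VEX.vvvv field."""
    return ~field & 0xF


def vvvv_valid_outside_64bit(field):
    """Only encodings with the top bit set (registers 0..7) are valid
    in legacy and compatibility 32-bit modes."""
    return bool(field & 0b1000)
```

For example, register 7 encodes as 1000B, the last value marked Valid in the table, and register 8 encodes as 0111B, the first value marked Invalid.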

### 4.1.5 Instruction Operand Encoding and VEX.vvvv, ModR/M

VEX-encoded instructions support three-operand and four-operand instruction syntax. Some VEX-encoded instructions have syntax with fewer than three operands, e.g. VEX-encoded packed shift instructions support one source operand and one destination operand.

The roles of VEX.vvvv, the reg field of the ModR/M byte (ModR/M.reg), and the r/m field of the ModR/M byte (ModR/M.r/m) in encoding destination and source operands vary with the type of instruction syntax.

The role of VEX.vvvv can be summarized as three situations:

- VEX.vvvv encodes the first source register operand, specified in inverted (1’s complement) form, and is valid for instructions with two or more source operands (see Table 4-37).
- VEX.vvvv encodes the destination register operand, specified in 1’s complement form, for certain vector shifts. The instructions where VEX.vvvv is used as a destination are listed in Table 4-36. The notation in the “Opcode” column in Table 4-36 is described in detail in Section 5.1.1.
- VEX.vvvv does not encode any operand; the field is reserved and should contain 1111b.

### Table 4-36. Instructions with a VEX.vvvv destination

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction mnemonic</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.ND.128.66.0F 73 /7 ib</td>
<td>VPSLLDQ xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 73 /3 ib</td>
<td>VPSRLDQ xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 71 /2 ib</td>
<td>VPSRLW xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 72 /2 ib</td>
<td>VPSRLD xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 73 /2 ib</td>
<td>VPSRLQ xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 71 /4 ib</td>
<td>VPSRAW xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 72 /4 ib</td>
<td>VPSRAD xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 71 /6 ib</td>
<td>VPSLLW xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 72 /6 ib</td>
<td>VPSLLD xmm1, xmm2, imm8</td>
</tr>
<tr>
<td>VEX.ND.128.66.0F 73 /6 ib</td>
<td>VPSLLQ xmm1, xmm2, imm8</td>
</tr>
</tbody>
</table>

The role of ModR/M.r/m field can be summarized to two situations:
- ModR/M.r/m encodes the instruction operand that references a memory address.
- For some instructions that do not support memory addressing semantics, ModR/M.r/m encodes either the destination register operand or a source register operand.

The role of ModR/M.reg field can be summarized to two situations:
- ModR/M.reg encodes either the destination register operand or a source register operand.
- For some instructions, ModR/M.reg is treated as an opcode extension and not used to encode any instruction operand.

For instruction syntax that supports four operands, VEX.vvvv, ModR/M.r/m, and ModR/M.reg encode three of the four operands. The remaining operand is encoded in bits 7:4 of the immediate byte:
- Imm8[7:4] encodes the third source register operand.
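
Extracting the register number from the immediate byte is a shift and a mask, as in this minimal sketch. The function names are hypothetical, and the sketch assumes the register number is stored in natural (non-inverted) form with the low four bits of the immediate left at zero.

```python
def third_source_register(imm8):
    """Return the register number held in bits 7:4 of the immediate byte."""
    return (imm8 >> 4) & 0xF


def imm8_for_register(reg):
    """Build an immediate byte that selects register reg (0..15)."""
    if not 0 <= reg <= 15:
        raise ValueError("register number must be 0..15")
    # Assumption: bits 3:0 are left as zero.
    return reg << 4
```

For example, an immediate byte of 30H selects XMM3 or YMM3 as the third source register.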

Table 4-37 lists each type of instruction syntax and the instruction operand encoding rule for VEX.vvvv, ModR/M.r/m, ModR/M.reg, and Imm8[7:4]. The “Instruction type” column lists the relationship of the destination operand, the number and types of source operands. The encoding of each operand type to VEX.vvvv, ModR/M.r/m, ModR/M.reg, and Imm8[7:4] is shown in the right-hand column.

### Table 4-37. Interpreting VEX.vvvv, reg_field, and rm_field

<table>
<thead>
<tr>
<th>Instruction Type</th>
<th>Behavior</th>
<th>How arguments feed the operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>xmm/ymm := op(reg, reg/mem)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op (VEX.vvvv, ModR/M.r/m)</td>
</tr>
<tr>
<td>xmm/ymm := op(reg, reg)</td>
<td>All Fields Used</td>
<td>ModR/M.r/m := op (VEX.vvvv, ModR/M.reg)</td>
</tr>
<tr>
<td>xmm/ymm := op (reg, reg/mem, reg)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op (VEX.vvvv, ModR/M.r/m, imm8[7:4])</td>
</tr>
<tr>
<td>reg0 := op132 ((reg0, reg2/mem), reg1)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op132 ((ModR/M.reg, ModR/M.r/m), VEX.vvvv)</td>
</tr>
<tr>
<td>reg0 := op213 ((reg1, reg0), reg2/mem)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op213 ((VEX.vvvv, ModR/M.reg), ModR/M.r/m)</td>
</tr>
<tr>
<td>reg0 := op231 ((reg1, reg2/mem), reg0)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op231 ((VEX.vvvv, ModR/M.r/m), ModR/M.reg)</td>
</tr>
<tr>
<td>xmm/ymm := op(reg/mem)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>ModR/M.reg := op (ModR/M.r/m)</td>
</tr>
<tr>
<td>xmm/ymm := op(xmm/ymm)</td>
<td>reg_field used for opcode extension</td>
<td>VEX.vvvv := op(ModR/M.r/m)</td>
</tr>
<tr>
<td>r32/r64 := op(reg/mem)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>ModR/M.reg := op (ModR/M.r/m)</td>
</tr>
<tr>
<td>implicit(eflags/r32) := op (reg, reg/mem)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>implicit(eflags/r32) := op(ModR/M.reg, ModR/M.r/m)</td>
</tr>
<tr>
<td>xmm/ymm/mem := op(reg)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>ModR/M.r/m := op(ModR/M.reg)</td>
</tr>
<tr>
<td>r32/r64/mem := op(reg)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>ModR/M.r/m := op (ModR/M.reg)</td>
</tr>
<tr>
<td>mem := op(reg)</td>
<td>VEX.vvvv must be 1111b, otherwise instruction will #UD</td>
<td>ModR/M.r/m := op (ModR/M.reg)</td>
</tr>
<tr>
<td>mem := op(reg, reg)</td>
<td>All Fields used</td>
<td>ModR/M.r/m := op (VEX.vvvv, ModR/M.reg)</td>
</tr>
<tr>
<td>xmm/ymm := op(reg, mem)</td>
<td>All Fields Used</td>
<td>ModR/M.reg := op (VEX.vvvv, ModR/M.r/m)</td>
</tr>
</tbody>
</table>

**Note 1:** VBLENDVPD/VBLENDVPS/VPBLENDVB.

**Note 2:** The instruction VPEXTRW r32, xmm1, imm (VEX.128.66.0F C5 /r ib) encodes the destination operand in ModR/M.reg.

**Note 3:** VMASKMOVPS/PD store form: VEX.vvvv holds the mask register, reg_field the source register, and rm_field the memory operand.

**Note 4:** VMASKMOVPS/PD load form: VEX.vvvv holds the mask register, rm_field the memory operand, and reg_field the destination register.
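
The three FMA-style rows of the table differ only in which operands are multiplied and which is added. The following sketch illustrates that ordering, assuming `op` is a multiply-add and using the table's `reg0`, `reg1`, `reg2` names as plain Python floats. The function names are hypothetical, and ordinary Python arithmetic rounds twice, so this shows operand order only and not the single rounding of a real fused operation.

```python
# Illustrative model of the op132 / op213 / op231 operand orders.
# reg0 = destination (ModR/M.reg), reg1 = VEX.vvvv, reg2 = ModR/M.r/m.
# Assumption: "op" is a multiply-add; the digits name which operands
# are multiplied (first two) and which is added (third).


def op132(reg0: float, reg1: float, reg2: float) -> float:
    """reg0 := (reg0 * reg2) + reg1"""
    return reg0 * reg2 + reg1


def op213(reg0: float, reg1: float, reg2: float) -> float:
    """reg0 := (reg1 * reg0) + reg2"""
    return reg1 * reg0 + reg2


def op231(reg0: float, reg1: float, reg2: float) -> float:
    """reg0 := (reg1 * reg2) + reg0"""
    return reg1 * reg2 + reg0


if __name__ == "__main__":
    reg0, reg1, reg2 = 2.0, 3.0, 5.0
    for form in (op132, op213, op231):
        print(form.__name__, form(reg0, reg1, reg2))
```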

4.1.5.1  3-byte VEX byte 1, bits[4:0] - “m-mmmm”

Bits[4:0] of the 3-byte VEX byte 1 encode an implied leading opcode byte (0F, 0F 38, or 0F 3A). Several bits are reserved for future use and will #UD unless 0.

<table>
<thead>
<tr>
<th>VEX.m-mmmm</th>
<th>Implied Leading Opcode Bytes</th>
</tr>
</thead>
<tbody>
<tr>
<td>00000B</td>
<td>Reserved</td>
</tr>
<tr>
<td>00001B</td>
<td>0F</td>
</tr>
<tr>
<td>00010B</td>
<td>0F 38</td>
</tr>
<tr>
<td>00011B</td>
<td>0F 3A</td>
</tr>
<tr>
<td>00100-11111B</td>
<td>Reserved</td>
</tr>
<tr>
<td>(2-byte VEX)</td>
<td>0F</td>
</tr>
</tbody>
</table>

VEX.m-mmmm is only available on the 3-byte VEX. The 2-byte VEX implies a leading 0Fh opcode byte.

4.1.5.2  2-byte VEX byte 1, bit[2], and 3-byte VEX byte 2, bit [2]- “L”

The vector length field, VEX.L, is encoded in bit[2] of either the second byte of 2-byte VEX, or the third byte of 3-byte VEX. If “VEX.L = 1”, it indicates 256-bit vector operation. “VEX.L = 0” indicates scalar and 128-bit vector operations.

The instruction VZEROUPPER is a special case that is encoded with VEX.L = 0, although its operation zeroes bits 255:128 of all YMM registers accessible in the current operating mode.

See the following table.

Table 4-39. VEX.L interpretation

<table>
<thead>
<tr>
<th>VEX.L</th>
<th>Vector Length</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>128-bit (or 32/64-bit scalar)</td>
</tr>
<tr>
<td>1</td>
<td>256-bit</td>
</tr>
</tbody>
</table>

4.1.5.3 2-byte VEX byte 1, bits[1:0], and 3-byte VEX byte 2, bits [1:0]- “pp”
Up to one implied prefix is encoded by bits[1:0] of either the 2-byte VEX byte 1 or the 3-byte VEX byte 2. The prefix behaves as if it was encoded prior to VEX, but after all other encoded prefixes.
See the following table.

Table 4-40. VEX.pp interpretation

<table>
<thead>
<tr>
<th>pp</th>
<th>Implies this prefix after other prefixes but before VEX</th>
</tr>
</thead>
<tbody>
<tr>
<td>00B</td>
<td>None</td>
</tr>
<tr>
<td>01B</td>
<td>66</td>
</tr>
<tr>
<td>10B</td>
<td>F3</td>
</tr>
<tr>
<td>11B</td>
<td>F2</td>
</tr>
</tbody>
</table>
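
The m-mmmm, L, and pp fields described above can be pulled out of a raw prefix with a few shifts and masks. The sketch below is a minimal illustration, not a full decoder: `decode_vex` and `VexFields` are hypothetical names, the VEX.R/X/B bits are ignored, and the positions of VEX.W (bit 7) and VEX.vvvv (bits 6:3) in the last prefix byte are taken from the prefix layout defined in Section 4.1.4 rather than from the tables in this subsection.

```python
from typing import NamedTuple, Optional

# Implied leading opcode bytes selected by VEX.m-mmmm (3-byte VEX only).
OPCODE_MAPS = {0b00001: "0F", 0b00010: "0F 38", 0b00011: "0F 3A"}
# Implied SIMD prefix selected by VEX.pp.
IMPLIED_PREFIX = {0b00: None, 0b01: "66", 0b10: "F3", 0b11: "F2"}


class VexFields(NamedTuple):
    opcode_map: str
    vector_bits: int
    implied_prefix: Optional[str]
    vvvv_register: int  # register number after undoing the 1's complement
    w: int


def decode_vex(prefix: bytes) -> VexFields:
    """Decode a 2-byte (C5H) or 3-byte (C4H) VEX prefix."""
    if len(prefix) >= 2 and prefix[0] == 0xC5:
        # The 2-byte form implies the 0F map and behaves like W0.
        mmmmm, w, last = 0b00001, 0, prefix[1]
    elif len(prefix) >= 3 and prefix[0] == 0xC4:
        mmmmm, w, last = prefix[1] & 0x1F, prefix[2] >> 7, prefix[2]
    else:
        raise ValueError("not a VEX prefix")
    if mmmmm not in OPCODE_MAPS:
        raise ValueError("reserved m-mmmm encoding (would #UD)")
    return VexFields(
        opcode_map=OPCODE_MAPS[mmmmm],
        vector_bits=256 if (last >> 2) & 1 else 128,
        implied_prefix=IMPLIED_PREFIX[last & 0b11],
        vvvv_register=(~(last >> 3)) & 0xF,
        w=w,
    )


if __name__ == "__main__":
    # C4 E2 7D is the prefix of an instruction listed as VEX.256.66.0F38.
    print(decode_vex(bytes([0xC4, 0xE2, 0x7D])))
    # C5 F8 is a 2-byte prefix with VEX.L = 0 and no implied prefix.
    print(decode_vex(bytes([0xC5, 0xF8])))
```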

4.1.6 The Opcode Byte
One (and only one) opcode byte follows the 2-byte or 3-byte VEX prefix. Legal opcodes are specified in Appendix B, in color. Any instruction that uses an illegal opcode will #UD.

4.1.7 The MODRM, SIB, and Displacement Bytes
The encodings are unchanged but the interpretation of reg_field or rm_field differs (see above).

4.1.8 The Third Source Operand (Immediate Byte)
VEX-encoded instructions can support a four-operand instruction syntax. VBLENDVPD, VBLENDVPS, and VPBLENDVB use imm8[7:4] to encode one of the source registers.

4.1.9 AVX Instructions and the Upper 128-bits of YMM registers
If an instruction with a destination XMM register is encoded with a VEX prefix, the processor zeroes the upper 128 bits of the equivalent YMM register. Legacy SSE instructions without VEX preserve the upper 128-bits.
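
As a rough illustration of this rule, the sketch below models a YMM register as a 256-bit Python integer and shows what happens to bits 255:128 when a 128-bit result is written. The helper name `write_xmm_result` and the integer model are assumptions made for the example only.

```python
MASK128 = (1 << 128) - 1  # selects bits 127:0


def write_xmm_result(old_ymm: int, result128: int, vex_encoded: bool) -> int:
    """Return the new 256-bit register value after a 128-bit write."""
    low = result128 & MASK128
    if vex_encoded:
        # VEX-encoded: upper 128 bits of the YMM register are zeroed.
        return low
    # Legacy SSE (no VEX): upper 128 bits are preserved.
    return (old_ymm & ~MASK128) | low


if __name__ == "__main__":
    old = (0xAAAA << 128) | 0x1111
    print(hex(write_xmm_result(old, 0x2222, vex_encoded=True)))
    print(hex(write_xmm_result(old, 0x2222, vex_encoded=False)))
```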

4.1.10 AVX Instruction Length
The AVX and FMA instructions described in this document (including VEX and ignoring other prefixes) do not exceed 11 bytes in length, but may increase in the future. The maximum length of an Intel 64 and IA-32 instruction remains 15 bytes.
Instructions that are described in this document follow the general documentation convention established in *Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A* and 2B. Additional notations and conventions adopted in this document are listed in Section 5.1. Section 5.2 covers supplemental information that applies to a specific subset of instructions.

5.1 INTERPRETING INSTRUCTION REFERENCE PAGES

This section describes the format of information contained in the instruction reference pages in this chapter. It explains notational conventions and abbreviations used in these sections that are outside of those conventions described in Section 3.1 of *the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A*.

5.1.1 Instruction Format

The following is an example of the format used for each instruction description in this chapter. The table below provides an example summary table:
INSTRUCTION SET REFERENCE

VBROADCASTF128- Broadcast 128 Bits of Floating-Point Values (THIS IS AN EXAMPLE)

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.256.66.0F38 1A /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1</td>
</tr>
<tr>
<td>VBROADCASTF128 ymm1, m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

5.1.2 Opcode Column in the Instruction Summary Table

For notation and conventions applicable to instructions that do not use VEX prefix, consult Section 3.1 of the *Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A*.

In the Instruction Summary Table, the Opcode column presents each instruction encoded using the VEX prefix in the following form (including the ModR/M byte and the immediate byte, if applicable):

VEX.[NDS].[128,256].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] [/ib,/is4]

- **VEX:** indicates the presence of the VEX prefix is required. The VEX prefix can be encoded using the three-byte form (the first byte is C4H), or using the two-byte form (the first byte is C5H). The two-byte form of VEX only applies to those instructions that do not require the following fields to be encoded: VEX.mmmmm, VEX.W, VEX.X, VEX.B. Refer to Section 4.1.4 for more detail on the VEX prefix.

The encoding of various sub-fields of the VEX prefix is described using the following notations:

- **NDS, NDD, DDS:** specifies that VEX.vvvv field is valid for the encoding of a register operand:
  - VEX.NDS: VEX.vvvv encodes the first source register in an instruction syntax where the content of source registers will be preserved.
  - VEX.NDD: VEX.vvvv encodes the destination register that cannot be encoded by ModR/M:reg field.
  - VEX.DDS: VEX.vvvv encodes the second source register in a three-operand instruction syntax where the content of first source register will be overwritten by the result.
  - If none of NDS, NDD, and DDS is present, VEX.vvvv must be 1111b (i.e. VEX.vvvv does not encode an operand). The VEX.vvvv field can be encoded using either the 2-byte or 3-byte form of the VEX prefix.
- **128,256**: The VEX.L field can be 0 (denoted by VEX.128) or 1 (denoted by VEX.256). The VEX.L field can be encoded using either the 2-byte or 3-byte form of the VEX prefix. The presence of the notation VEX.256 or VEX.128 in the opcode column should be interpreted as follows:
  - If VEX.256 is present in the opcode column: The semantics of the instruction must be encoded with VEX.L = 1. An attempt to encode this instruction with VEX.L = 0 can result in one of two situations: (a) if a VEX.128 version is defined, the processor will behave according to the defined VEX.128 behavior; (b) a #UD occurs if there is no VEX.128 version defined.
  - If VEX.128 is present in the opcode column but there is no VEX.256 version defined for the same opcode byte: Three situations apply: (a) For VEX-encoded, 128-bit SIMD integer instructions, software must encode the instruction with VEX.L = 0. The processor will treat the opcode byte encoded with VEX.L = 1 by causing a #UD exception; (b) For VEX-encoded, 128-bit packed floating-point instructions, software must encode the instruction with VEX.L = 0. The processor will treat the opcode byte encoded with VEX.L = 1 by causing a #UD exception (e.g., VMOVLP); (c) For VEX-encoded, scalar, SIMD floating-point instructions, software should encode the instruction with VEX.L = 0 to ensure software compatibility with future processor generations. Scalar SIMD floating-point instructions can be distinguished by the mnemonic of the instruction. Generally, the last two letters of the instruction mnemonic are "SS" or "SD", or "SI" for SIMD floating-point conversion instructions; the VBROADCASTSx instructions are unique cases.

- **66,F2,F3**: The presence or absence of these values maps to the VEX.pp field encodings. If absent, this corresponds to VEX.pp=00B. If present, the corresponding VEX.pp value affects the “opcode” byte in the same way as a SIMD prefix (66H, F2H or F3H) affects the ensuing opcode byte. Thus a non-zero encoding of VEX.pp may be considered as an implied 66H/F2H/F3H prefix. The VEX.pp field may be encoded using either the 2-byte or 3-byte form of the VEX prefix.

- **0F,0F3A,0F38**: The presence maps to a valid encoding of the VEX.mmmmm field. Only three encoded values of VEX.mmmmm are defined as valid, corresponding to the escape byte sequences 0FH, 0F3AH and 0F38H. The effect of a valid VEX.mmmmm encoding on the ensuing opcode byte is the same as the effect of the corresponding escape byte sequence on the ensuing opcode byte for non-VEX encoded instructions. Thus a valid encoding of VEX.mmmmm may be considered as an implied escape byte sequence of either 0FH, 0F3AH or 0F38H. The VEX.mmmmm field must be encoded using the 3-byte form of the VEX prefix.

- **0F,0F3A,0F38 and 2-byte/3-byte VEX**: The presence of 0F3A and 0F38 in the opcode column implies that the opcode can only be encoded by the three-byte form of VEX. The presence of 0F in the opcode column does not preclude the opcode from being encoded by the two-byte form of VEX if the semantics of the opcode do not require any subfield of VEX not present in the two-byte form of the VEX prefix.

- **W0**: VEX.W=0.
- **W1**: VEX.W=1.

The presence of W0/W1 in the opcode column applies to two situations: (a) it is treated as an extended opcode bit, (b) the instruction semantics support an operand size promotion to 64-bit of a general-purpose register operand or a 32-bit memory operand. The presence of W1 in the opcode column implies the opcode must be encoded using the 3-byte form of the VEX prefix. The presence of W0 in the opcode column does not preclude the opcode from being encoded using the C5H form of the VEX prefix, if the semantics of the opcode do not require other VEX subfields not present in the two-byte form of the VEX prefix. If neither W0 nor W1 is present, the instruction may be encoded using either the two-byte form (if the opcode semantics do not require VEX subfields not present in the two-byte form of VEX) or the three-byte form of VEX. Encoding an instruction using the two-byte form of VEX is equivalent to W0. See Section 4.1.4 for the subfield definitions within VEX.

- **opcode**: Instruction opcode.
- **/is4**: An 8-bit immediate byte is present containing a source register specifier in imm[7:4] and instruction-specific payload in imm[3:0].
- **imz2**: Part of the is4 immediate byte providing control functions that apply to two-source permute instructions.
- In general, the encodings of the VEX.R, VEX.X, and VEX.B fields are not shown explicitly in the opcode column. The encoding scheme of the VEX.R, VEX.X, and VEX.B fields must follow the rules defined in Section 4.1.4.
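
To make the opcode-column notation concrete, the following sketch splits an entry such as `VEX.NDS.128.66.0F 58 /r` into the subfields described above. `parse_opcode_column` and its dictionary keys are hypothetical names chosen for this example. The `forces_3byte_vex` flag only reflects the 0F38/0F3A and W1 rules stated in this section; an instruction may still need the three-byte form for other reasons (for example, when VEX.X or VEX.B is required).

```python
def parse_opcode_column(text: str) -> dict:
    """Split 'VEX.[NDS].[128,256].[66,F2,F3].0F/0F3A/0F38.[W0,W1] opcode [/r] ...'."""
    vex, opcode, *suffixes = text.split()
    parts = vex.split(".")
    if parts[0] != "VEX":
        raise ValueError("not a VEX-encoded opcode entry")
    info = {
        "vvvv_role": None,   # NDS, NDD, DDS, or None (vvvv must be 1111b)
        "length": None,      # 128 or 256 (VEX.L = 0 or 1)
        "pp": None,          # implied 66/F2/F3 prefix, or None
        "map": None,         # implied escape bytes: 0F, 0F38, 0F3A
        "w": None,           # 0, 1, or None when W is not specified
        "opcode": int(opcode, 16),
        "suffixes": suffixes,  # e.g. ['/r'], ['/r', '/is4']
    }
    for part in parts[1:]:
        if part in ("NDS", "NDD", "DDS"):
            info["vvvv_role"] = part
        elif part in ("128", "256"):
            info["length"] = int(part)
        elif part in ("66", "F2", "F3"):
            info["pp"] = part
        elif part in ("0F", "0F38", "0F3A"):
            info["map"] = part
        elif part in ("W0", "W1"):
            info["w"] = int(part[1])
        else:
            raise ValueError(f"unknown VEX subfield: {part}")
    # 0F38/0F3A and W1 can only be expressed with the three-byte form.
    info["forces_3byte_vex"] = info["map"] in ("0F38", "0F3A") or info["w"] == 1
    return info


if __name__ == "__main__":
    print(parse_opcode_column("VEX.256.66.0F38 1A /r"))
    print(parse_opcode_column("VEX.NDS.128.66.0F 58 /r"))
```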

### 5.1.3 Instruction Column in the Instruction Summary Table

<additions to the eponymous PRM section>

- **ymm** - A YMM register. The 256-bit YMM registers are: YMM0 through YMM7; YMM8 through YMM15 are available in 64-bit mode.
- **m256** - A 32-byte operand in memory. This nomenclature is used only with AVX and FMA instructions.
- **ymm/m256** - A YMM register or 256-bit memory operand.
- **<YMM0>** - Indicates use of the YMM0 register as an implicit argument.
- **SRC1** - Denotes the first source operand in the instruction syntax of an instruction encoded with the VEX prefix and having two or more source operands.
- **SRC2** - Denotes the second source operand in the instruction syntax of an instruction encoded with the VEX prefix and having two or more source operands.
- **SRC3** - Denotes the third source operand in the instruction syntax of an instruction encoded with the VEX prefix and having three source operands.
- **SRC** - The source in an AVX single-source instruction or the source in a Legacy SSE instruction.
- **DST** - The destination in an AVX instruction. In Legacy SSE instructions, it can be either the destination, the first source, or both. This field is encoded by reg_field.

### 5.1.4 64/32 bit Mode Support column in the Instruction Summary Table

The "64/32 bit Mode Support" column in the Instruction Summary table indicates whether an opcode sequence is supported in (a) 64-bit mode or (b) the Compatibility mode and other IA-32 modes that apply in conjunction with the CPUID feature flag associated with specific instruction extensions.

The 64-bit mode support is to the left of the ‘slash’ and has the following notation:
- **V** — Supported.
- **I** — Not supported.
- **N.E.** — Indicates an instruction syntax is not encodable in 64-bit mode (it may represent part of a sequence of valid instructions in other modes).
- **N.P.** — Indicates the REX prefix does not affect the legacy instruction in 64-bit-mode.
- **N.I.** — Indicates the opcode is treated as a new instruction in 64-bit mode.
- **N.S.** — Indicates an instruction syntax that requires an address override prefix in 64-bit mode and is not supported. Using an address override prefix in 64-bit mode may result in model-specific execution behavior.

The compatibility/Legacy mode support is to the right of the ‘slash’ and has the following notation:
- **V** — Supported.
- **I** — Not supported.
- **N.E.** — Indicates an Intel 64 instruction mnemonics/syntax that is not encodable; the opcode sequence is not applicable as an individual instruction in compatibility mode or IA-32 mode. The opcode may represent a valid sequence of legacy IA-32 instructions.

### 5.1.5 CPUID Support column in the Instruction Summary Table

The CPUID Feature Flag column holds abbreviated CPUID feature flags (e.g., the appropriate bit in CPUID.1.ECX or CPUID.1.EDX for SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2/AVX support) that indicate processor support for the instruction. If the corresponding flag is '0', the instruction will #UD.

5.2 AES TRANSFORMATIONS AND DATA STRUCTURE

5.2.1 Little-Endian Architecture and Big-Endian Specification (FIPS 197)

The FIPS 197 document defines the Advanced Encryption Standard (AES) and includes a set of test vectors that cover all of the steps in the algorithm and can be used for testing and debugging.

The following observation is important for using the AES instructions offered in Intel 64 Architecture: FIPS 197 text convention is to write hex strings with the low-memory byte on the left and the high-memory byte on the right. Intel’s convention is the reverse. It is similar to the difference between Big Endian and Little Endian notations.

In other words, a 128-bit vector in the FIPS document, when read from left to right, is encoded as \([7:0, 15:8, 23:16, 31:24, \ldots 127:120]\). Note that inside the byte, the encoding is \([7:0]\), so the first bit from the left is the most significant bit. In practice, the test vectors are written in hexadecimal notation, where pairs of hexadecimal digits define the different bytes. To translate the FIPS 197 notation to an Intel 64 architecture compatible ("Little Endian") format, each test vector needs to be byte-reflected to \([127:120, \ldots 31:24, 23:16, 15:8, 7:0]\).

Example A:
FIPS Test vector: \(0x000102030405060708090a0b0c0d0e0f\)
Intel AES Hardware: \(0x0f0e0d0c0b0a09080706050403020100\)

It should be pointed out that the only thing at issue is a textual convention, and programmers do not need to perform byte-reversal in their code, when using the AES instructions.
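
A short sketch of the textual translation in Example A follows. The function names are hypothetical. The second helper illustrates why no byte-reversal is needed in code: when the FIPS bytes are stored in memory in the order written and then read as a little-endian 128-bit value, the result is already the Intel-notation number.

```python
def fips_to_intel_text(fips_hex: str) -> str:
    """Byte-reflect a FIPS 197 hex string into Intel (little-endian) notation."""
    return bytes.fromhex(fips_hex)[::-1].hex()


def fips_bytes_as_register_value(fips_hex: str) -> int:
    """Interpret the FIPS byte sequence, stored in memory order, as a 128-bit value."""
    return int.from_bytes(bytes.fromhex(fips_hex), "little")


if __name__ == "__main__":
    vector = "000102030405060708090a0b0c0d0e0f"
    print("0x" + fips_to_intel_text(vector))
    print(hex(fips_bytes_as_register_value(vector)))
```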

5.2.1.1 AES Data Structure in Intel 64 Architecture

The AES instructions that are defined in this document operate on one or two 128-bit source operands: State and Round Key. From the architectural point of view, the state is input in an xmm register and the Round Key is input either in an xmm register or a 128-bit memory location.

In the AES algorithm, the state (128 bits) can be viewed as four 32-bit doublewords ("Word"s in AES terminology): \(X3, X2, X1, X0\).

The state may also be viewed as a set of 16 bytes. The 16 bytes can also be viewed as a 4x4 matrix of bytes where \(S(i, j)\) with \(i, j = 0, 1, 2, 3\) compose the 32-bit "word"s as follows:

\[
\begin{align*}
X0 &= S(3, 0) S(2, 0) S(1, 0) S(0, 0) \\
X1 &= S(3, 1) S(2, 1) S(1, 1) S(0, 1) \\
X2 &= S(3, 2) S(2, 2) S(1, 2) S(0, 2) \\
X3 &= S(3, 3) S(2, 3) S(1, 3) S(0, 3)
\end{align*}
\]

The following tables, Table 5-1 through Table 5-4, illustrate various representations of a 128-bit state.

**Table 5-1. Byte and 32-bit Word Representation of a 128-bit State**

<table>
<thead>
<tr>
<th>Byte #</th>
<th>15</th>
<th>14</th>
<th>13</th>
<th>12</th>
<th>11</th>
<th>10</th>
<th>9</th>
<th>8</th>
<th>7</th>
<th>6</th>
<th>5</th>
<th>4</th>
<th>3</th>
<th>2</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bit Position</td>
<td>127:120</td>
<td>119:112</td>
<td>111:104</td>
<td>103:96</td>
<td>95:88</td>
<td>87:80</td>
<td>79:72</td>
<td>71:64</td>
<td>63:56</td>
<td>55:48</td>
<td>47:40</td>
<td>39:32</td>
<td>31:24</td>
<td>23:16</td>
<td>15:8</td>
<td>7:0</td>
</tr>
<tr>
<td>State Word</td>
<td colspan="4">X3</td>
<td colspan="4">X2</td>
<td colspan="4">X1</td>
<td colspan="4">X0</td>
</tr>
<tr>
<td>State Byte</td>
<td>P</td>
<td>O</td>
<td>N</td>
<td>M</td>
<td>L</td>
<td>K</td>
<td>J</td>
<td>I</td>
<td>H</td>
<td>G</td>
<td>F</td>
<td>E</td>
<td>D</td>
<td>C</td>
<td>B</td>
<td>A</td>
</tr>
</tbody>
</table>

**Table 5-2. Matrix Representation of a 128-bit State**

<table>
<tbody>
<tr>
<td>A</td>
<td>E</td>
<td>I</td>
<td>M</td>
<td>S(0, 0)</td>
<td>S(0, 1)</td>
<td>S(0, 2)</td>
<td>S(0, 3)</td>
</tr>
<tr>
<td>B</td>
<td>F</td>
<td>J</td>
<td>N</td>
<td>S(1, 0)</td>
<td>S(1, 1)</td>
<td>S(1, 2)</td>
<td>S(1, 3)</td>
</tr>
<tr>
<td>C</td>
<td>G</td>
<td>K</td>
<td>O</td>
<td>S(2, 0)</td>
<td>S(2, 1)</td>
<td>S(2, 2)</td>
<td>S(2, 3)</td>
</tr>
<tr>
<td>D</td>
<td>H</td>
<td>L</td>
<td>P</td>
<td>S(3, 0)</td>
<td>S(3, 1)</td>
<td>S(3, 2)</td>
<td>S(3, 3)</td>
</tr>
</tbody>
</table>

Example:

FIPS vector: d4 bf 5d 30 e0 b4 52 ae b8 41 11 f1 1e 27 98 e5

This vector has the "least significant" byte d4 and the "most significant" byte e5 (written in Big Endian format in the FIPS document). When it is translated to IA notations, the encoding is:

**Table 5-3. Little Endian Representation of a 128-bit State**

<table>
<thead>
<tr>
<th>Byte #</th>
<th>15</th>
<th>14</th>
<th>13</th>
<th>12</th>
<th>11</th>
<th>10</th>
<th>9</th>
<th>8</th>
<th>7</th>
<th>6</th>
<th>5</th>
<th>4</th>
<th>3</th>
<th>2</th>
<th>1</th>
<th>0</th>
</tr>
</thead>
<tbody>
<tr>
<td>State Byte</td>
<td>P</td>
<td>O</td>
<td>N</td>
<td>M</td>
<td>L</td>
<td>K</td>
<td>J</td>
<td>I</td>
<td>H</td>
<td>G</td>
<td>F</td>
<td>E</td>
<td>D</td>
<td>C</td>
<td>B</td>
<td>A</td>
</tr>
<tr>
<td>State Value</td>
<td>e5</td>
<td>98</td>
<td>27</td>
<td>1e</td>
<td>f1</td>
<td>11</td>
<td>41</td>
<td>b8</td>
<td>ae</td>
<td>52</td>
<td>b4</td>
<td>e0</td>
<td>30</td>
<td>5d</td>
<td>bf</td>
<td>d4</td>
</tr>
</tbody>
</table>

**Table 5-4. Little Endian Representation of a 4x4 Byte Matrix**

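<table>
<tbody>
<tr>
<td>A</td>
<td>E</td>
<td>I</td>
<td>M</td>
<td>d4</td>
<td>e0</td>
<td>b8</td>
<td>1e</td>
</tr>
<tr>
<td>B</td>
<td>F</td>
<td>J</td>
<td>N</td>
<td>bf</td>
<td>b4</td>
<td>41</td>
<td>27</td>
</tr>
<tr>
<td>C</td>
<td>G</td>
<td>K</td>
<td>O</td>
<td>5d</td>
<td>52</td>
<td>11</td>
<td>98</td>
</tr>
<tr>
<td>D</td>
<td>H</td>
<td>L</td>
<td>P</td>
<td>30</td>
<td>ae</td>
<td>f1</td>
<td>e5</td>
</tr>
</tbody>
</table>
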
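The byte, word, and matrix views of the state can be cross-checked with a few lines of code. The sketch below assumes the 16 state bytes are held in a Python `bytes` object in memory order (byte 0 first); `state_matrix` and `state_words` are hypothetical helper names.

```python
def state_matrix(block: bytes) -> list:
    """Return the 4x4 matrix where S(i, j) is byte number i + 4*j of the state."""
    return [[block[i + 4 * j] for j in range(4)] for i in range(4)]


def state_words(block: bytes) -> list:
    """Return [X0, X1, X2, X3], each word read in little-endian byte order."""
    return [int.from_bytes(block[4 * j:4 * j + 4], "little") for j in range(4)]


if __name__ == "__main__":
    # The FIPS example vector, stored with its leftmost byte at the lowest address.
    block = bytes.fromhex("d4bf5d30e0b452aeb84111f11e2798e5")
    for row in state_matrix(block):
        print(" ".join(f"{value:02x}" for value in row))
    print([f"{word:08x}" for word in state_words(block)])
```
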
5.2.2 AES Transformations and Functions

The following functions and transformations are used in the algorithmic descriptions of AES instruction extensions AESDEC, AESDECLAST, AESENC, AESENCLAST, AESIMC, AESKEYGENASSIST.

Note that these transformations are expressed here in a Little Endian format (and not as in the FIPS 197 document).

- **MixColumns():** A byte-oriented 4x4 matrix transformation on the matrix representation of a 128-bit AES state. Each 4x1 column vector of the AES state is multiplied by a 4x4 matrix defined in FIPS 197. The columns are considered polynomials with coefficients in the Finite Field that is used in the definition of FIPS 197, the operations ("multiplication" and "addition") are in that Finite Field, and the polynomials are reduced modulo \( x^4+1 \).

The MixColumns() transformation defines the relationship between each byte of the result state, represented as \( S'(i, j) \) of a 4x4 matrix (see Section 5.2.1), as a function of input state bytes, \( S(i, j) \), as follows:

\[
\begin{align*}
S'(0, j) &= FF\_MUL(02H, S(0, j)) \ XOR \ FF\_MUL(03H, S(1, j)) \ XOR \ S(2, j) \ XOR \ S(3, j) \\
S'(1, j) &= S(0, j) \ XOR \ FF\_MUL(02H, S(1, j)) \ XOR \ FF\_MUL(03H, S(2, j)) \ XOR \ S(3, j) \\
S'(2, j) &= S(0, j) \ XOR \ S(1, j) \ XOR \ FF\_MUL(02H, S(2, j)) \ XOR \ FF\_MUL(03H, S(3, j)) \\
S'(3, j) &= FF\_MUL(03H, S(0, j)) \ XOR \ S(1, j) \ XOR \ S(2, j) \ XOR \ FF\_MUL(02H, S(3, j))
\end{align*}
\]

where \( j = 0, 1, 2, 3 \). \( FF\_MUL(\text{Byte1}, \text{Byte2}) \) denotes the result of multiplying two elements (represented by Byte1 and Byte2) in the Finite Field representation that defines AES. The result produced by \( FF\_MUL(\text{Byte1}, \text{Byte2}) \) is an element in the Finite Field (represented as a byte). A Finite Field is a field with a finite number of elements, and when this number can be represented as a power of 2 (\( 2^n \)), its elements can be represented as the set of \( 2^n \) binary strings of length n. AES uses a finite field with n=8 (having 256 elements). With this representation, "addition" of two elements in that field is a bit-wise XOR of their binary-string representation, producing another element in the field. Multiplication of two elements in that field is defined using an irreducible polynomial (for AES, this polynomial is \( m(x) = x^8 + x^4 + x^3 + x + 1 \)). In this Finite Field representation, the bit value of bit position k of a byte represents the coefficient of the term of degree k of a polynomial, e.g., 1010_1101B (ADH) is represented by the polynomial \( x^7 + x^5 + x^3 + x^2 + 1 \). The byte value result of multiplication of two elements is obtained by carry-less multiplication of the two corresponding polynomials, followed by reduction modulo the polynomial \( m(x) \), where the remainder is calculated using operations defined in the field. For example, \( FF\_MUL(57H, 83H) = C1H \), because the carry-less polynomial multiplication of the polynomials represented by 57H and 83H produces \( x^{13} + x^{11} + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1 \), and the remainder modulo \( m(x) \) is \( x^7 + x^6 + 1 \).
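
The FF_MUL definition and the MixColumns() equations can be sketched directly from the description above: a carry-less multiply followed by reduction modulo \( m(x) \) (11BH as a bit pattern). The names `ff_mul` and `mix_column` are chosen for this example. The column used in the demonstration is the first column of the FIPS example state from Section 5.2.1, and the expected MixColumns output for it comes from the FIPS 197 worked example.

```python
AES_POLY = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def ff_mul(byte1: int, byte2: int) -> int:
    """Multiply two elements of the AES finite field."""
    # Carry-less multiplication: XOR shifted copies of byte1.
    product = 0
    for bit in range(8):
        if (byte2 >> bit) & 1:
            product ^= byte1 << bit
    # Reduce modulo m(x), clearing bits 14 down to 8.
    for bit in range(14, 7, -1):
        if (product >> bit) & 1:
            product ^= AES_POLY << (bit - 8)
    return product


def mix_column(column: list) -> list:
    """Apply the MixColumns() equations to one column [S(0,j), ..., S(3,j)]."""
    s0, s1, s2, s3 = column
    return [
        ff_mul(0x02, s0) ^ ff_mul(0x03, s1) ^ s2 ^ s3,
        s0 ^ ff_mul(0x02, s1) ^ ff_mul(0x03, s2) ^ s3,
        s0 ^ s1 ^ ff_mul(0x02, s2) ^ ff_mul(0x03, s3),
        ff_mul(0x03, s0) ^ s1 ^ s2 ^ ff_mul(0x02, s3),
    ]


if __name__ == "__main__":
    print(f"{ff_mul(0x57, 0x83):02X}")
    print([f"{b:02x}" for b in mix_column([0xD4, 0xBF, 0x5D, 0x30])])
```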

- **RotWord()**: performs a byte-wise cyclic permutation (rotate right in little-endian byte order) on a 32-bit AES word.

The output word \( X'[j] \) of RotWord(\( X[j] \)), where \( X[j] \) represents the four bytes of column \( j \), \( S(i, j) \), in descending order, is:

\[
X[j] = (S(3, j), S(2, j), S(1, j), S(0, j)); \quad X'[j] = (S'(3, j), S'(2, j), S'(1, j), S'(0, j)) \leftarrow (S(0, j), S(3, j), S(2, j), S(1, j))
\]
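
On a 32-bit integer that holds the word with \( S(0, j) \) in its least significant byte, this permutation is a rotate right by eight bits. A minimal sketch (the name `rot_word` is chosen for this example):

```python
def rot_word(x: int) -> int:
    """Rotate a 32-bit little-endian AES word right by one byte."""
    return ((x >> 8) | (x << 24)) & 0xFFFFFFFF


if __name__ == "__main__":
    # Bytes S(0..3, j) = 09, cf, 4f, 3c give the little-endian word 3c4fcf09H.
    print(f"{rot_word(0x3C4FCF09):08x}")
```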

- **ShiftRows()**: A byte-oriented matrix transformation that processes the matrix representation of a 16-byte AES state by cyclically shifting the last three rows of the state to the left by different offsets, see Table 5-5.

**Table 5-5. The ShiftRows Transformation**

<table>
<thead>
<tr>
<th>Matrix Representation of Input State</th>
<th>Output of ShiftRows</th>
</tr>
</thead>
<tbody>
<tr>
<td>A E I M</td>
<td>A E I M</td>
</tr>
<tr>
<td>B F J N</td>
<td>F J N B</td>
</tr>
<tr>
<td>C G K O</td>
<td>K O C G</td>
</tr>
<tr>
<td>D H L P</td>
<td>P D H L</td>
</tr>
</tbody>
</table>
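
Table 5-5 can be reproduced with a short index calculation. The sketch assumes the state is a 16-byte sequence in memory order, so that matrix element \( S(i, j) \) is byte number \( i + 4j \); `shift_rows` is a name chosen for this example.

```python
def shift_rows(state: bytes) -> bytes:
    """Cyclically shift row i of the state matrix left by i positions."""
    out = bytearray(16)
    for row in range(4):
        for col in range(4):
            # Output column `col` of row `row` comes from input column col + row.
            out[row + 4 * col] = state[row + 4 * ((col + row) % 4)]
    return bytes(out)


if __name__ == "__main__":
    shifted = shift_rows(b"ABCDEFGHIJKLMNOP")
    # Print the result as matrix rows, matching the right-hand column of Table 5-5.
    for row in range(4):
        print(" ".join(chr(shifted[row + 4 * col]) for col in range(4)))
```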

- **SubBytes()**: A byte-oriented transformation that processes the 128-bit AES state by applying a non-linear substitution table (S-BOX) on each byte of the state.

The SubBytes() function defines the relationship between each byte of the result state \( S'(i, j) \) as a function of input state byte \( S(i, j) \), by

\[
S'(i, j) \leftarrow S\text{-Box} \ (S(i, j)[7:4], S(i, j)[3:0])
\]

where S-BOX( \( S[7:4], S[3:0] \)) represents a look-up operation on a 16x16 table to return a byte value, see Table 5-6.

- **SubWord()**: Produces an output AES word (four bytes) from the four bytes of an input word using a non-linear substitution table (S-BOX).

\[
X'[j] = ( S'(3, j), S'(2, j), S'(1, j), S'(0, j) ) \leftarrow ( S\text{-Box}(S(3, j)), S\text{-Box}(S(2, j)), S\text{-Box}(S(1, j)), S\text{-Box}(S(0, j)) )
\]

- **InvMixColumns()**: The inverse transformation of MixColumns().

The InvMixColumns() transformation defines the relationship between each byte of the result state S'(i, j) as a function of input state bytes, S(i, j), by

\[
\begin{align*}
S'(0, j) &\leftarrow FF\_MUL(0eH, S(0, j)) \ XOR \ FF\_MUL(0bH, S(1, j)) \ XOR \ FF\_MUL(0dH, S(2, j)) \ XOR \ FF\_MUL(09H, S(3, j)) \\
S'(1, j) &\leftarrow FF\_MUL(09H, S(0, j)) \ XOR \ FF\_MUL(0eH, S(1, j)) \ XOR \ FF\_MUL(0bH, S(2, j)) \ XOR \ FF\_MUL(0dH, S(3, j)) \\
S'(2, j) &\leftarrow FF\_MUL(0dH, S(0, j)) \ XOR \ FF\_MUL(09H, S(1, j)) \ XOR \ FF\_MUL(0eH, S(2, j)) \ XOR \ FF\_MUL(0bH, S(3, j)) \\
S'(3, j) &\leftarrow FF\_MUL(0bH, S(0, j)) \ XOR \ FF\_MUL(0dH, S(1, j)) \ XOR \ FF\_MUL(09H, S(2, j)) \ XOR \ FF\_MUL(0eH, S(3, j))
\end{align*}
\]

where \( j = 0, 1, 2, 3 \).

### Table 5-6. Look-up Table Associated with S-Box Transformation

<table>
<thead>
<tr>
<th>S[7:4]</th>
<th>S-Box output for S[3:0] = 0</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>63</td>
</tr>
<tr>
<td>1</td>
<td>ca</td>
</tr>
<tr>
<td>2</td>
<td>b7</td>
</tr>
<tr>
<td>3</td>
<td>04</td>
</tr>
<tr>
<td>4</td>
<td>09</td>
</tr>
<tr>
<td>5</td>
<td>53</td>
</tr>
<tr>
<td>6</td>
<td>d0</td>
</tr>
<tr>
<td>7</td>
<td>51</td>
</tr>
<tr>
<td>8</td>
<td>cd</td>
</tr>
<tr>
<td>9</td>
<td>60</td>
</tr>
<tr>
<td>a</td>
<td>e0</td>
</tr>
<tr>
<td>b</td>
<td>e7</td>
</tr>
<tr>
<td>c</td>
<td>ba</td>
</tr>
<tr>
<td>d</td>
<td>70</td>
</tr>
<tr>
<td>e</td>
<td>e1</td>
</tr>
<tr>
<td>f</td>
<td>8c</td>
</tr>
</tbody>
</table>
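
The InvMixColumns() coefficients form a matrix whose rows are rotations of (0eH, 0bH, 0dH, 09H). The sketch below, with names chosen for this example, applies them to one column and recovers the FIPS example column d4 bf 5d 30 from its MixColumns() image 04 66 81 e5 (a pair taken from the FIPS 197 worked example).

```python
def ff_mul(a: int, b: int) -> int:
    """Multiply in the AES finite field (shift-and-add with reduction by 11BH)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# First row of the InvMixColumns matrix; row i is this row rotated right by i.
INV_ROW = (0x0E, 0x0B, 0x0D, 0x09)


def inv_mix_column(column: list) -> list:
    """Apply the InvMixColumns() equations to one column [S(0,j), ..., S(3,j)]."""
    out = []
    for i in range(4):
        value = 0
        for k in range(4):
            value ^= ff_mul(INV_ROW[(k - i) % 4], column[k])
        out.append(value)
    return out


if __name__ == "__main__":
    print([f"{b:02x}" for b in inv_mix_column([0x04, 0x66, 0x81, 0xE5])])
```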

- **InvShiftRows()**: The inverse transformation of ShiftRows(). InvShiftRows() transforms the matrix representation of a 16-byte AES state by cyclically shifting the last three rows of the state to the right by different offsets, see Table 5-7.

**Table 5-7. The InvShiftRows Transformation**

<table>
<thead>
<tr>
<th>Matrix Representation of Input State</th>
<th>Output of InvShiftRows</th>
</tr>
</thead>
<tbody>
<tr>
<td>A E I M</td>
<td>A E I M</td>
</tr>
<tr>
<td>B F J N</td>
<td>N B F J</td>
</tr>
<tr>
<td>C G K O</td>
<td>K O C G</td>
</tr>
<tr>
<td>D H L P</td>
<td>H L P D</td>
</tr>
</tbody>
</table>

- **InvSubBytes()**: The inverse transformation of SubBytes(). The InvSubBytes() transformation defines the relationship between each byte of the result state \( S'(i, j) \) as a function of input state byte \( S(i, j) \), by

\[ S'(i, j) \leftarrow InvS\text{-}Box \left( S(i, j)[7:4], S(i, j)[3:0] \right) \]

where InvS-Box(S[7:4], S[3:0]) represents a look-up operation on a 16x16 table to return a byte value, see Table 5-8.

5.3 SUMMARY OF TERMS

- **“Legacy SSE”:** Refers to SSE, SSE2, SSE3, SSSE3, SSE4, and any future instruction sets referencing XMM registers and encoded without a VEX prefix.

- **XGETBV, XSETBV, XSAVE, XRSTOR** are defined in the *IA-32 Intel Architecture Software Developer’s Manual, Volume 3A* and the *Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B*.

- **VEX:** refers to a two-byte or three-byte prefix. AVX and FMA instructions are encoded using a VEX prefix.

- **VEX.vvvv**. The VEX bitfield specifying a source or destination register (in 1’s complement form).

- **rm_field**: shorthand for the ModR/M r/m field and any REX.B

- **reg_field**: shorthand for the ModR/M reg field and any REX.R

---

### Table 5-8. Look-up Table Associated with InvS-Box Transformation

<table>
<thead>
<tr>
<th>S[3:0]</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>a</th>
<th>b</th>
<th>c</th>
<th>d</th>
<th>e</th>
<th>f</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>52</td>
<td>09</td>
<td>6a</td>
<td>d5</td>
<td>30</td>
<td>36</td>
<td>a5</td>
<td>38</td>
<td>bf</td>
<td>40</td>
<td>a3</td>
<td>9e</td>
<td>81</td>
<td>f3</td>
<td>d7</td>
<td>fb</td>
</tr>
<tr>
<td>1</td>
<td>7c</td>
<td>e3</td>
<td>39</td>
<td>82</td>
<td>9b</td>
<td>2f</td>
<td>ff</td>
<td>87</td>
<td>34</td>
<td>8e</td>
<td>43</td>
<td>44</td>
<td>c4</td>
<td>de</td>
<td>e9</td>
<td>cb</td>
</tr>
<tr>
<td>2</td>
<td>54</td>
<td>7b</td>
<td>94</td>
<td>32</td>
<td>a6</td>
<td>c2</td>
<td>23</td>
<td>3d</td>
<td>ee</td>
<td>4c</td>
<td>95</td>
<td>0b</td>
<td>42</td>
<td>fa</td>
<td>c3</td>
<td>4e</td>
</tr>
<tr>
<td>3</td>
<td>08</td>
<td>2e</td>
<td>a1</td>
<td>66</td>
<td>28</td>
<td>d9</td>
<td>24</td>
<td>b2</td>
<td>76</td>
<td>5b</td>
<td>a2</td>
<td>49</td>
<td>6d</td>
<td>8b</td>
<td>d1</td>
<td>25</td>
</tr>
<tr>
<td>4</td>
<td>72</td>
<td>f8</td>
<td>f6</td>
<td>64</td>
<td>86</td>
<td>68</td>
<td>98</td>
<td>16</td>
<td>d4</td>
<td>a4</td>
<td>5c</td>
<td>cc</td>
<td>5d</td>
<td>65</td>
<td>b6</td>
<td>92</td>
</tr>
<tr>
<td>5</td>
<td>6c</td>
<td>70</td>
<td>48</td>
<td>50</td>
<td>fd</td>
<td>ed</td>
<td>b9</td>
<td>da</td>
<td>5e</td>
<td>15</td>
<td>46</td>
<td>57</td>
<td>a7</td>
<td>8d</td>
<td>9d</td>
<td>84</td>
</tr>
<tr>
<td>6</td>
<td>90</td>
<td>d8</td>
<td>ab</td>
<td>00</td>
<td>8c</td>
<td>bc</td>
<td>d3</td>
<td>0a</td>
<td>f7</td>
<td>e4</td>
<td>58</td>
<td>05</td>
<td>b8</td>
<td>b3</td>
<td>45</td>
<td>06</td>
</tr>
<tr>
<td>7</td>
<td>d0</td>
<td>2c</td>
<td>1e</td>
<td>8f</td>
<td>ca</td>
<td>3f</td>
<td>0f</td>
<td>02</td>
<td>c1</td>
<td>af</td>
<td>bd</td>
<td>03</td>
<td>01</td>
<td>13</td>
<td>8a</td>
<td>6b</td>
</tr>
<tr>
<td>8</td>
<td>3a</td>
<td>91</td>
<td>11</td>
<td>41</td>
<td>4f</td>
<td>67</td>
<td>dc</td>
<td>ea</td>
<td>97</td>
<td>f2</td>
<td>cf</td>
<td>ce</td>
<td>f0</td>
<td>b4</td>
<td>e6</td>
<td>73</td>
</tr>
<tr>
<td>9</td>
<td>96</td>
<td>ac</td>
<td>74</td>
<td>22</td>
<td>e7</td>
<td>ad</td>
<td>35</td>
<td>85</td>
<td>e2</td>
<td>f9</td>
<td>37</td>
<td>e8</td>
<td>1c</td>
<td>75</td>
<td>df</td>
<td>6e</td>
</tr>
<tr>
<td>a</td>
<td>47</td>
<td>f1</td>
<td>1a</td>
<td>71</td>
<td>1d</td>
<td>29</td>
<td>c5</td>
<td>89</td>
<td>6f</td>
<td>b7</td>
<td>62</td>
<td>0e</td>
<td>aa</td>
<td>18</td>
<td>be</td>
<td>1b</td>
</tr>
<tr>
<td>b</td>
<td>fc</td>
<td>56</td>
<td>3e</td>
<td>4b</td>
<td>c6</td>
<td>d2</td>
<td>79</td>
<td>20</td>
<td>9a</td>
<td>db</td>
<td>c0</td>
<td>fe</td>
<td>78</td>
<td>cd</td>
<td>5a</td>
<td>f4</td>
</tr>
<tr>
<td>c</td>
<td>1f</td>
<td>dd</td>
<td>a8</td>
<td>33</td>
<td>88</td>
<td>07</td>
<td>c7</td>
<td>31</td>
<td>b1</td>
<td>12</td>
<td>10</td>
<td>59</td>
<td>27</td>
<td>80</td>
<td>ec</td>
<td>5f</td>
</tr>
<tr>
<td>d</td>
<td>60</td>
<td>51</td>
<td>7f</td>
<td>a9</td>
<td>19</td>
<td>b5</td>
<td>4a</td>
<td>0d</td>
<td>2d</td>
<td>e5</td>
<td>7a</td>
<td>9f</td>
<td>93</td>
<td>c9</td>
<td>9c</td>
<td>ef</td>
</tr>
<tr>
<td>e</td>
<td>a0</td>
<td>e0</td>
<td>3b</td>
<td>4d</td>
<td>ae</td>
<td>2a</td>
<td>f5</td>
<td>b0</td>
<td>c8</td>
<td>eb</td>
<td>bb</td>
<td>3c</td>
<td>83</td>
<td>53</td>
<td>99</td>
<td>61</td>
</tr>
<tr>
<td>f</td>
<td>17</td>
<td>2b</td>
<td>04</td>
<td>7e</td>
<td>ba</td>
<td>77</td>
<td>d6</td>
<td>26</td>
<td>e1</td>
<td>69</td>
<td>14</td>
<td>63</td>
<td>55</td>
<td>21</td>
<td>0c</td>
<td>7d</td>
</tr>
</tbody>
</table>
5.4 INSTRUCTION SET REFERENCE

<only instructions modified by AVX are included>

ADDPD - Add Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 58 /r ADDPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed double-precision floating-point values from xmm2/mem to xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 58 /r VADDPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed double-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 58 /r VADDPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed double-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD add of the two or four packed double-precision floating-point values from the first source operand to the second source operand, and stores the packed double-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
VADDPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] + SRC2[63:0]
DEST[127:64] ← SRC1[127:64] + SRC2[127:64]
DEST[191:128] ← SRC1[191:128] + SRC2[191:128]
DEST[255:192] ← SRC1[255:192] + SRC2[255:192]

VADDPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] + SRC2[63:0]
DEST[127:64] ← SRC1[127:64] + SRC2[127:64]
DEST[255:128] ← 0

ADDPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] + SRC[63:0]
DEST[127:64] ← DEST[127:64] + SRC[127:64]
DEST[255:128] (Unmodified)
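
The three encodings differ mainly in how they treat bits 255:128 of the destination register. The following C sketch models that behavior in portable scalar code rather than with intrinsics. The `ymm_t` type and the function names `vaddpd_256`, `vaddpd_128`, and `addpd_sse` are illustrative assumptions and do not come from any Intel header.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a YMM register: lane[0] holds bits 63:0,
   lane[3] holds bits 255:192. */
typedef struct { double lane[4]; } ymm_t;

/* VADDPD ymm1, ymm2, ymm3/m256: all four lanes are added. */
static void vaddpd_256(ymm_t *dest, const ymm_t *src1, const ymm_t *src2)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    for (int i = 0; i < 4; i++)
        dest->lane[i] = src1->lane[i] + src2->lane[i];
}

/* VADDPD xmm1, xmm2, xmm3/m128: two lanes are added and
   bits 255:128 of the destination are zeroed. */
static void vaddpd_128(ymm_t *dest, const ymm_t *src1, const ymm_t *src2)
{
    assert(dest != NULL && src1 != NULL && src2 != NULL);
    dest->lane[0] = src1->lane[0] + src2->lane[0];
    dest->lane[1] = src1->lane[1] + src2->lane[1];
    dest->lane[2] = 0.0;
    dest->lane[3] = 0.0;
}

/* ADDPD xmm1, xmm2/m128: the destination is also the first source,
   and bits 255:128 keep whatever they held before. */
static void addpd_sse(ymm_t *dest, const ymm_t *src)
{
    assert(dest != NULL && src != NULL);
    dest->lane[0] += src->lane[0];
    dest->lane[1] += src->lane[1];
}
```

In this model, calling `vaddpd_128` on a destination whose upper lanes hold stale data clears them, while `addpd_sse` leaves them in place.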

Intel C/C++ Compiler Intrinsic Equivalent
VADDPD __m256d _mm256_add_pd (__m256d a, __m256d b);
ADDPD __m128d _mm_add_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
ADDPS- Add Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 58 /r ADDPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Add packed single-precision floating-point values from xmm2/mem to xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 58 /r VADDPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed single-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 58 /r VADDPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed single-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD add of the four or eight packed single-precision floating-point values from the first source operand to the second source operand, and stores the packed single-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Operation**

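**VADDPS (VEX.256 encoded version)**

DEST[31:0] ← SRC1[31:0] + SRC2[31:0]
DEST[63:32] ← SRC1[63:32] + SRC2[63:32]
DEST[95:64] ← SRC1[95:64] + SRC2[95:64]
DEST[127:96] ← SRC1[127:96] + SRC2[127:96]
DEST[159:128] ← SRC1[159:128] + SRC2[159:128]
DEST[191:160] ← SRC1[191:160] + SRC2[191:160]
DEST[223:192] ← SRC1[223:192] + SRC2[223:192]
DEST[255:224] ← SRC1[255:224] + SRC2[255:224]

**VADDPS (VEX.128 encoded version)**
DEST[31:0] ← SRC1[31:0] + SRC2[31:0]
DEST[63:32] ← SRC1[63:32] + SRC2[63:32]
DEST[95:64] ← SRC1[95:64] + SRC2[95:64]
DEST[127:96] ← SRC1[127:96] + SRC2[127:96]
DEST[255:128] ← 0

**ADDPS (128-bit Legacy SSE version)**
DEST[31:0] ← DEST[31:0] + SRC[31:0]
DEST[63:32] ← DEST[63:32] + SRC[63:32]
DEST[95:64] ← DEST[95:64] + SRC[95:64]
DEST[127:96] ← DEST[127:96] + SRC[127:96]
DEST[255:128] (Unmodified)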

**Intel C/C++ Compiler Intrinsic Equivalent**

```
VADDPS __m256 _mm256_add_ps (__m256 a, __m256 b);
ADDPS __m128 _mm_add_ps (__m128 a, __m128 b);
```

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 2
### ADDSD- Add Scalar Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 58 /r ADDSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add the low double-precision floating-point value from xmm2/mem to xmm1 and store the result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 58 /r VADDSD xmm1,xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Add the low double-precision floating-point value from xmm3/mem to xmm2 and store the result in xmm1</td>
</tr>
</tbody>
</table>

#### Description

Adds the low double-precision floating-point values from the second source operand and the first source operand and stores the double-precision floating-point result in the destination operand.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VADDSD is encoded with VEX.L=0. Encoding VADDSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

#### Operation

**VADDSD (VEX.128 encoded version)**

DEST[63:0] ← SRC1[63:0] + SRC2[63:0]

DEST[127:64] ← SRC1[127:64]

DEST[255:128] ← 0

**ADDSD (128-bit Legacy SSE version)**

DEST[63:0] ← DEST[63:0] + SRC[63:0]

DEST[255:64] (Unmodified)
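
For the scalar forms, the point that is easy to miss is where bits 127:64 of the result come from: the VEX encoding copies them from the first source, whereas the legacy encoding leaves the destination's own upper bits alone. A minimal C sketch of that difference follows. The `ymm_t` type and the names `vaddsd` and `addsd_sse` are hypothetical, and the second source is passed as a plain `double` to stand in for either an XMM low lane or an m64 operand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a YMM register: lane[0] holds bits 63:0. */
typedef struct { double lane[4]; } ymm_t;

/* VADDSD xmm1, xmm2, xmm3/m64 */
static void vaddsd(ymm_t *dest, const ymm_t *src1, double src2_low)
{
    assert(dest != NULL && src1 != NULL);
    /* Read both values first so that dest may alias src1. */
    double sum = src1->lane[0] + src2_low;
    double pass_through = src1->lane[1];
    dest->lane[0] = sum;           /* DEST[63:0]    <- SRC1[63:0] + SRC2[63:0] */
    dest->lane[1] = pass_through;  /* DEST[127:64]  <- SRC1[127:64] */
    dest->lane[2] = 0.0;           /* DEST[255:128] <- 0 */
    dest->lane[3] = 0.0;
}

/* ADDSD xmm1, xmm2/m64 */
static void addsd_sse(ymm_t *dest, double src_low)
{
    assert(dest != NULL);
    dest->lane[0] += src_low;      /* DEST[255:64] is left unmodified */
}
```
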
Intel C/C++ Compiler Intrinsic Equivalent
ADDSD __m128d _mm_add_sd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3

ADDSS- Add Scalar Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 58 /r ADDSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Add the low single-precision floating-point value from xmm2/mem to xmm1 and store the result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 58 /r VADDSS xmm1,xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Add the low single-precision floating-point value from xmm3/mem to xmm2 and store the result in xmm1</td>
</tr>
</tbody>
</table>

### Description

Adds the low single-precision floating-point values from the second source operand and the first source operand, and stores the single-precision floating-point result in the destination operand.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VADDSS is encoded with VEX.L=0. Encoding VADDSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

### Operation

**VADDSS DEST, SRC1, SRC2 (VEX.128 encoded version)**

- DEST[31:0] ← SRC1[31:0] + SRC2[31:0]
- DEST[127:32] ← SRC1[127:32]
- DEST[255:128] ← 0

**ADDSS DEST, SRC (128-bit Legacy SSE version)**

- DEST[31:0] ← DEST[31:0] + SRC[31:0]
- DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
ADDSS __m128 _mm_add_ss (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3

ADDSUBPD- Packed Double FP Add/Subtract

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F D0 /r ADDSUBPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Add/subtract double-precision floating-point values from xmm2/m128 to xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D0 /r VADDSUBPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add/subtract packed double-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F D0 /r VADDSUBPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Add / subtract packed double-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

Description

Adds odd-numbered double-precision floating-point values of the first source operand (second operand) with the corresponding double-precision floating-point values from the second source operand (third operand); stores the result in the odd-numbered values of the destination operand (first operand). Subtracts the even-numbered double-precision floating-point values from the second source operand from the corresponding double-precision floating-point values in the first source operand; stores the result into the even-numbered values of the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

**VADDSUBPD (VEX.256 encoded version)**
DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] + SRC2[127:64]
DEST[191:128] ← SRC1[191:128] - SRC2[191:128]
DEST[255:192] ← SRC1[255:192] + SRC2[255:192]

**VADDSUBPD (VEX.128 encoded version)**
DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] + SRC2[127:64]
DEST[255:128] ← 0

**ADDSUBPD (128-bit Legacy SSE version)**
DEST[63:0] ← DEST[63:0] - SRC[63:0]
DEST[127:64] ← DEST[127:64] + SRC[127:64]
DEST[255:128] (Unmodified)
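
The alternating pattern above (subtract in even-numbered lanes, add in odd-numbered lanes, with lane 0 at bits 63:0) can be sketched in plain C as follows. The function name `addsub_pd` and its array-based signature are assumptions made for illustration; the sketch covers only the arithmetic lanes and not the handling of bits 255:128 in the 128-bit forms.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of (V)ADDSUBPD. lanes is 2 for the 128-bit forms and
   4 for the VEX.256 form; element 0 corresponds to bits 63:0. */
static void addsub_pd(double *dest, const double *src1,
                      const double *src2, size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; i++) {
        if (i % 2 == 0)
            dest[i] = src1[i] - src2[i];   /* even lane: subtract */
        else
            dest[i] = src1[i] + src2[i];   /* odd lane: add */
    }
}
```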

**Intel C/C++ Compiler Intrinsic Equivalent**

VADDSUBPD __m256d _mm256_addsub_pd (__m256d a, __m256d b);
ADDSUBPD __m128d _mm_addsub_pd (__m128d a, __m128d b);

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 2

ADDSUBPS- Packed Single FP Add/Subtract

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F D0 /r ADDSUBPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Add/subtract single-precision floating-point values from xmm2/m128 to xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F D0 /r VADDSUBPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add/subtract single-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.F2.0F D0 /r VADDSUBPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Add / subtract single-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

Description
Adds odd-numbered single-precision floating-point values of the first source operand (second operand) with the corresponding single-precision floating-point values from the second source operand (third operand); stores the result in the odd-numbered values of the destination operand (first operand). Subtracts the even-numbered single-precision floating-point values from the second source operand from the corresponding single-precision floating-point values in the first source operand; stores the result into the even-numbered values of the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
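**VADDSUBPS (VEX.256 encoded version)**
DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] + SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] + SRC2[127:96]
DEST[159:128] ← SRC1[159:128] - SRC2[159:128]
DEST[191:160] ← SRC1[191:160] + SRC2[191:160]
DEST[223:192] ← SRC1[223:192] - SRC2[223:192]
DEST[255:224] ← SRC1[255:224] + SRC2[255:224]

**VADDSUBPS (VEX.128 encoded version)**
DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] + SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] + SRC2[127:96]
DEST[255:128] ← 0

**ADDSUBPS (128-bit Legacy SSE version)**
DEST[31:0] ← DEST[31:0] - SRC[31:0]
DEST[63:32] ← DEST[63:32] + SRC[63:32]
DEST[95:64] ← DEST[95:64] - SRC[95:64]
DEST[127:96] ← DEST[127:96] + SRC[127:96]
DEST[255:128] (Unmodified)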

**Intel C/C++ Compiler Intrinsic Equivalent**
VADDSUBPS __m256 _mm256_addsub_ps (__m256 a, __m256 b);
ADDSUBPS __m128 _mm_addsub_ps (__m128 a, __m128 b);

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 2
AESENC/AESENCLAST - Perform One Round of an AES Encryption Flow

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 DC /r AESENC xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AES</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from xmm1 with a 128-bit round key from xmm2/m128.</td>
</tr>
<tr>
<td>66 0F 38 DD /r AESENCLAST xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AES</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from xmm1 with a 128-bit round key from xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DC /r VAESENC xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>Both AES and AVX flags</td>
<td>Perform one round of an AES encryption flow, operating on a 128-bit data (state) from xmm2 with a 128-bit round key from the xmm3/m128; store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DD /r VAESENCLAST xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>Both AES and AVX flags</td>
<td>Perform the last round of an AES encryption flow, operating on a 128-bit data (state) from xmm2 with a 128-bit round key from the xmm3/m128; store the result in xmm1.</td>
</tr>
</tbody>
</table>

Description

These instructions perform a single round of an AES encryption flow using a round key from the second source operand, operating on 128-bit data (state) from the first source operand, and store the result in the destination operand.

Use the AESENC instruction for all but the last encryption round. For the last encryption round, use the AESENCLAST instruction.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second source operand can be an XMM register or a 128-bit memory location. Bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The first source operand and the destination operand are the same and must be an XMM register. The second source operand can be an XMM register or a 128-bit memory location. Bits (255:128) of the corresponding YMM destination register remain unchanged.

**Operation**

**VAESENC**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
STATE ← MixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] ← 0

**AESENC**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
STATE ← MixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] (Unmodified)

**VAESENCLAST**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] ← 0

**AESENCLAST**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← ShiftRows( STATE );
STATE ← SubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] (Unmodified)
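
As a rough software model of the round described above, the following C sketch applies ShiftRows, SubBytes, optionally MixColumns, and the round-key XOR to a 16-byte state. Several things here are assumptions rather than statements from this document: byte 0 of each array is taken to hold bits 7:0, the state is laid out column by column as in FIPS 197, the S-box is computed from its FIPS 197 definition instead of a lookup table, and all function names are hypothetical. It models only DEST[127:0].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry)
            a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

/* S-box from its definition: field inverse (x^254), then the affine map. */
static uint8_t sub_byte(uint8_t x)
{
    uint8_t inv = 1;
    for (int i = 0; i < 254; i++)
        inv = gf_mul(inv, x);              /* 0 stays 0 */
    uint8_t s = inv;
    for (int i = 1; i <= 4; i++)
        s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
    return (uint8_t)(s ^ 0x63);
}

/* Row r of the state is rotated left by r columns; index = 4*col + row. */
static void shift_rows(uint8_t s[16])
{
    uint8_t t[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[4 * c + r] = s[4 * ((c + r) % 4) + r];
    memcpy(s, t, 16);
}

static void mix_columns(uint8_t s[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *a = s + 4 * c;
        uint8_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
        a[0] = (uint8_t)(gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3);
        a[1] = (uint8_t)(a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3);
        a[2] = (uint8_t)(a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3));
        a[3] = (uint8_t)(gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2));
    }
}

/* last == 0 models (V)AESENC; last != 0 models (V)AESENCLAST,
   which omits MixColumns. */
static void aes_enc_round(uint8_t dest[16], const uint8_t src1[16],
                          const uint8_t round_key[16], int last)
{
    uint8_t state[16];
    assert(dest != NULL && src1 != NULL && round_key != NULL);
    memcpy(state, src1, 16);
    shift_rows(state);
    for (int i = 0; i < 16; i++)
        state[i] = sub_byte(state[i]);
    if (!last)
        mix_columns(state);
    for (int i = 0; i < 16; i++)
        dest[i] = (uint8_t)(state[i] ^ round_key[i]);
}
```

One way to sanity-check such a model is to feed it the round-1 state and round key from the worked cipher example in FIPS 197 Appendix B and compare against the state listed there at the start of round 2.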

**Intel C/C++ Compiler Intrinsic Equivalent**

- (V)AESENC __m128i _mm_aesenc_si128 (__m128i, __m128i)
- (V)AESENCLAST __m128i _mm_aesenclast_si128 (__m128i, __m128i)

SIMD Floating-Point Exceptions
none

Other Exceptions
See Exceptions Type 4
AESDEC/AESDECLAST - Perform One Round of an AES Decryption Flow

Description

These instructions perform a single round of the AES decryption flow using the Equivalent Inverse Cipher, with the round key from the second source operand, operating on a 128-bit data (state) from the first source operand, and store the result in the destination operand.

Use the AESDEC instruction for all but the last decryption round. For the last decryption round, use the AESDECLAST instruction.
VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second source operand can be an XMM register or a 128-bit memory location. Bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The first source operand and the destination operand are the same and must be an XMM register. The second source operand can be an XMM register or a 128-bit memory location. Bits (255:128) of the corresponding YMM destination register remain unchanged.

**Operation**

**VAESDEC**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
STATE ← InvMixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] ← 0

**AESDEC**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
STATE ← InvMixColumns( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] (Unmodified)

**VAESDECLAST**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] ← 0

**AESDECLAST**

STATE ← SRC1;
RoundKey ← SRC2;
STATE ← InvShiftRows( STATE );
STATE ← InvSubBytes( STATE );
DEST[127:0] ← STATE XOR RoundKey;
DEST[255:128] (Unmodified)
Intel C/C++ Compiler Intrinsic Equivalent

(V)AESDEC __m128i _mm_aesdec_si128 (__m128i, __m128i)
(V)AESDECLAST __m128i _mm_aesdeclast_si128 (__m128i, __m128i)

SIMD Floating-Point Exceptions
none

Other Exceptions
See Exceptions Type 4
AESIMC- Perform the AES InvMixColumn Transformation

**Description**

Perform the InvMixColumns transformation on the source operand and store the result in the destination operand. The destination operand is an XMM register. The source operand can be an XMM register or a 128-bit memory location.

Note the AESIMC instruction should be applied to the expanded AES round keys (except for the first and last round key) in order to prepare them for decryption using the "Equivalent Inverse Cipher" (defined in FIPS 197).

**VEX.128 encoded version:** Bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

**Operation**

**VAESIMC**

DEST[127:0] ← InvMixColumns( SRC );
DEST[255:128] ← 0;

**AESIMC**

DEST[127:0] ← InvMixColumns( SRC );
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

(V)AESIMC __m128i _mm_aesimc_si128 (__m128i)

**SIMD Floating-Point Exceptions**

None
Other Exceptions
See Exceptions Type 4
AESKEYGENASSIST - AES Round Key Generation Assist

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A DF /r ib AESKEYGENASSIST xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AES</td>
<td>Assist in AES round key generation using an 8-bit round constant (RCON) specified in the immediate byte, operating on 128 bits of data specified in xmm2/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A DF /r ib VAESKEYGENASSIST xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>Both AES and AVX flags</td>
<td>Assist in AES round key generation using an 8-bit round constant (RCON) specified in the immediate byte, operating on 128 bits of data specified in xmm2/m128, and store the result in xmm1.</td>
</tr>
</tbody>
</table>

Description

Assists in expanding the AES cipher key by computing steps towards generating a round key for encryption, using 128-bit data specified in the source operand and an 8-bit round constant specified as an immediate, and stores the result in the destination operand.

The destination operand is an XMM register. The source operand can be an XMM register or a 128-bit memory location.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

Operation

**VAESKEYGENASSIST**

X3[31:0] ← SRC [127: 96];
X2[31:0] ← SRC [95: 64];
X1[31:0] ← SRC [63: 32];
X0[31:0] ← SRC [31: 0];
RCON[31:0] ← ZeroExtend(Imm8[7:0]);
DEST[31:0] ← SubWord(X1);
DEST[63:32] ← RotWord(SubWord(X1)) XOR RCON;
DEST[95:64] ← SubWord(X3);
DEST[127:96] ← RotWord(SubWord(X3)) XOR RCON;
DEST[255:128] ← 0;

**AESKEYGENASSIST**

X3[31:0] ← SRC [127: 96];
X2[31:0] ← SRC [95: 64];
X1[31:0] ← SRC [63: 32];
X0[31:0] ← SRC [31: 0];
RCON[31:0] ← ZeroExtend(Imm8[7:0]);
DEST[31:0] ← SubWord(X1);
DEST[63:32] ← RotWord( SubWord(X1) ) XOR RCON;
DEST[95:64] ← SubWord(X3);
DEST[127:96] ← RotWord( SubWord(X3) ) XOR RCON;
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
(V)AESKEYGENASSIST __m128i _mm_aeskeygenassist_si128 (__m128i, const int)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4
ANDPD- Bitwise Logical AND of Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 54 /r ANDPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the bitwise logical AND of packed double-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 54 /r VANDPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND of packed double-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 54 /r VANDPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND of packed double-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical AND of the two or four packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

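VANDPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE AND SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE AND SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE AND SRC2[255:192]

VANDPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE AND SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE AND SRC2[127:64]
DEST[255:128] ← 0

ANDPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] BITWISE AND SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE AND SRC[127:64]
DEST[255:128] (Unmodified)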

Intel C/C++ Compiler Intrinsic Equivalent
VANDPD __m256d _mm256_and_pd (__m256d a, __m256d b);
ANDPD __m128d _mm_and_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4

ANDPS- Bitwise Logical AND of Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 54 /r ANDPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the bitwise logical AND of packed single-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 54 /r VANDPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND of packed single-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 54 /r VANDPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND of packed single-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Performs a bitwise logical AND of the four or eight packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
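**VANDPS (VEX.256 encoded version)**
DEST[31:0] ← SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE AND SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE AND SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE AND SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE AND SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE AND SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE AND SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE AND SRC2[255:224]

**VANDPS (VEX.128 encoded version)**
DEST[31:0] ← SRC1[31:0] BITWISE AND SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE AND SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE AND SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE AND SRC2[127:96]
DEST[255:128] ← 0

**ANDPS (128-bit Legacy SSE version)**
DEST[31:0] ← DEST[31:0] BITWISE AND SRC[31:0]
DEST[63:32] ← DEST[63:32] BITWISE AND SRC[63:32]
DEST[95:64] ← DEST[95:64] BITWISE AND SRC[95:64]
DEST[127:96] ← DEST[127:96] BITWISE AND SRC[127:96]
DEST[255:128] (Unmodified)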

Intel C/C++ Compiler Intrinsic Equivalent

VANDPS __m256 _mm256_and_ps (__m256 a, __m256 b);
ANDPS __m128 _mm_and_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4
ANDNPD- Bitwise Logical AND NOT of Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 55 /r ANDNPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the bitwise logical AND NOT of packed double-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 55 /r VANDNPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND NOT of packed double-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 55 /r VANDNPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND NOT of packed double-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical AND NOT of the two or four packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

**VANDNPD (VEX.256 encoded version)**
DEST[63:0] ← (NOT(SRC1[63:0])) BITWISE AND SRC2[63:0]
DEST[127:64] ← (NOT(SRC1[127:64])) BITWISE AND SRC2[127:64]
DEST[191:128] ← (NOT(SRC1[191:128])) BITWISE AND SRC2[191:128]
DEST[255:192] ← (NOT(SRC1[255:192])) BITWISE AND SRC2[255:192]

**VANDNPD (VEX.128 encoded version)**
DEST[63:0] ← (NOT(SRC1[63:0])) BITWISE AND SRC2[63:0]
DEST[127:64] ← (NOT(SRC1[127:64])) BITWISE AND SRC2[127:64]
DEST[255:128] ← 0

**ANDNPD (128-bit Legacy SSE version)**
DEST[63:0] ← (NOT(DEST[63:0])) BITWISE AND SRC[63:0]
DEST[127:64] ← (NOT(DEST[127:64])) BITWISE AND SRC[127:64]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

VANDNPD __m256d _mm256_andnot_pd (__m256d a, __m256d b);
ANDNPD __m128d _mm_andnot_pd (__m128d a, __m128d b);

**SIMD Floating-Point Exceptions**
None

**Other Exceptions**
See Exceptions Type 4

ANDNPS- Bitwise Logical AND NOT of Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 55 /r ANDNPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the bitwise logical AND NOT of packed single-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 55 /r VANDNPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND NOT of packed single-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 55 /r VANDNPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical AND NOT of packed single-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical AND NOT of the four or eight packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

VANDNPS (VEX.256 encoded version)

DEST[31:0] ← (NOT(SRC1[31:0])) BITWISE AND SRC2[31:0]
DEST[63:32] ← (NOT(SRC1[63:32])) BITWISE AND SRC2[63:32]
DEST[95:64] ← (NOT(SRC1[95:64])) BITWISE AND SRC2[95:64]
DEST[127:96] ← (NOT(SRC1[127:96])) BITWISE AND SRC2[127:96]
DEST[159:128] ← (NOT(SRC1[159:128])) BITWISE AND SRC2[159:128]
DEST[191:160] ← (NOT(SRC1[191:160])) BITWISE AND SRC2[191:160]
DEST[223:192] ← (NOT(SRC1[223:192])) BITWISE AND SRC2[223:192]
DEST[255:224] ← (NOT(SRC1[255:224])) BITWISE AND SRC2[255:224]

VANDNPS (VEX.128 encoded version)
DEST[31:0] ← (NOT(SRC1[31:0])) BITWISE AND SRC2[31:0]
DEST[63:32] ← (NOT(SRC1[63:32])) BITWISE AND SRC2[63:32]
DEST[95:64] ← (NOT(SRC1[95:64])) BITWISE AND SRC2[95:64]
DEST[127:96] ← (NOT(SRC1[127:96])) BITWISE AND SRC2[127:96]
DEST[255:128] ← 0

ANDNPS (128-bit Legacy SSE version)
DEST[31:0] ← (NOT(DEST[31:0])) BITWISE AND SRC[31:0]
DEST[63:32] ← (NOT(DEST[63:32])) BITWISE AND SRC[63:32]
DEST[95:64] ← (NOT(DEST[95:64])) BITWISE AND SRC[95:64]
DEST[127:96] ← (NOT(DEST[127:96])) BITWISE AND SRC[127:96]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VANDNPS __m256 _mm256_andnot_ps (__m256 a, __m256 b);
ANDNPS __m128 _mm_andnot_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4

BLENDPD- Blend Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0D /r ib</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Select packed double-precision floating-point Values from xmm1 and xmm2/m128 from mask in imm8</td>
</tr>
<tr>
<td>BLENDPD xmm1, xmm2/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0D /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Select packed double-precision floating-point Values from xmm2 and xmm3/m128 from mask in imm8</td>
</tr>
<tr>
<td>VBLENDPD xmm1, xmm2, xmm3/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 0D /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Select packed double-precision floating-point Values from ymm2 and ymm3/m256 from mask in imm8 and store the values in ymm1</td>
</tr>
<tr>
<td>VBLENDPD ymm1, ymm2, ymm3/m256, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Double-precision floating-point values from the second source operand (third operand) are conditionally merged with values from the first source operand (second operand) and written to the destination operand (first operand). The immediate bits [3:0] determine whether the corresponding double-precision floating-point value in the destination is copied from the second source or first source (only bits [1:0] are used by the 128-bit forms). If a bit in the mask, corresponding to a double-precision element, is "1", then the double-precision floating-point value in the second source operand is copied, else the value in the first source operand is copied.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.
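
The immediate-controlled selection described above can be sketched as a scalar loop. This is a minimal model under stated assumptions, not the instruction's implementation: `blend_pd_model` and its parameters are hypothetical names, and `lanes` is 2 for the 128-bit forms and 4 for the VEX.256 form.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of the immediate blend: bit i of imm8 selects the source
   for lane i (0 = first source, 1 = second source). */
void blend_pd_model(double *dest, const double *src1, const double *src2,
                    unsigned imm8, size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; i++) {
        /* bit clear: keep first source; bit set: take second source */
        dest[i] = ((imm8 >> i) & 1u) ? src2[i] : src1[i];
    }
}
```

For example, with four lanes and an immediate of 0101b, lanes 0 and 2 come from the second source and lanes 1 and 3 from the first source.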

Operation

**VBLENDPD (VEX.256 encoded version)**

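IF (IMM8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC2[63:0] FI

IF (IMM8[1] = 0) THEN DEST[127:64] ← SRC1[127:64]
ELSE DEST[127:64] ← SRC2[127:64] FI

IF (IMM8[2] = 0) THEN DEST[191:128] ← SRC1[191:128]
ELSE DEST[191:128] ← SRC2[191:128] FI

IF (IMM8[3] = 0) THEN DEST[255:192] ← SRC1[255:192]
ELSE DEST[255:192] ← SRC2[255:192] FI
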
**VBLENDPD (VEX.128 encoded version)**

IF (IMM8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC2[63:0] FI

IF (IMM8[1] = 0) THEN DEST[127:64] ← SRC1[127:64]
ELSE DEST[127:64] ← SRC2[127:64] FI

DEST[255:128] ← 0

**BLENDPD (128-bit Legacy SSE version)**

IF (IMM8[0] = 0) THEN DEST[63:0] ← DEST[63:0]
ELSE DEST[63:0] ← SRC[63:0] FI

IF (IMM8[1] = 0) THEN DEST[127:64] ← DEST[127:64]
ELSE DEST[127:64] ← SRC[127:64] FI

DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VBLENDPD __m256d _mm256_blend_pd (__m256d a, __m256d b, const int mask);

BLENDPD __m128d _mm_blend_pd (__m128d a, __m128d b, const int mask);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4

BLENDPS- Blend Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0C /r ib</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Select packed single-precision floating-point values from xmm1 and xmm2/m128 from mask in imm8</td>
</tr>
<tr>
<td>BLENDPS xmm1, xmm2/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0C /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Select packed single-precision floating-point values from xmm2 and xmm3/m128 from mask in imm8</td>
</tr>
<tr>
<td>VBLENDPS xmm1, xmm2, xmm3/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 0C /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Select packed single-precision floating-point values from ymm2 and ymm3/m256 from mask in imm8</td>
</tr>
<tr>
<td>VBLENDPS ymm1, ymm2, ymm3/m256, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Single-precision floating-point values from the second source operand (third operand) are conditionally merged with values from the first source operand (second operand) and written to the destination operand (first operand). The immediate bits [7:0] determine whether the corresponding single-precision floating-point value in the destination is copied from the second source or first source (only bits [3:0] are used by the 128-bit forms). If a bit in the mask, corresponding to a single-precision element, is “1”, then the single-precision floating-point value in the second source operand is copied, else the value in the first source operand is copied.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Operation**

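**VBLENDPS (VEX.256 encoded version)**

IF (IMM8[0] = 0) THEN DEST[31:0] ← SRC1[31:0]
ELSE DEST[31:0] ← SRC2[31:0] FI

IF (IMM8[1] = 0) THEN DEST[63:32] ← SRC1[63:32]
ELSE DEST[63:32] ← SRC2[63:32] FI

IF (IMM8[2] = 0) THEN DEST[95:64] ← SRC1[95:64]
ELSE DEST[95:64] ← SRC2[95:64] FI

IF (IMM8[3] = 0) THEN DEST[127:96] ← SRC1[127:96]
ELSE DEST[127:96] ← SRC2[127:96] FI

IF (IMM8[4] = 0) THEN DEST[159:128] ← SRC1[159:128]
ELSE DEST[159:128] ← SRC2[159:128] FI

IF (IMM8[5] = 0) THEN DEST[191:160] ← SRC1[191:160]
ELSE DEST[191:160] ← SRC2[191:160] FI

IF (IMM8[6] = 0) THEN DEST[223:192] ← SRC1[223:192]
ELSE DEST[223:192] ← SRC2[223:192] FI

IF (IMM8[7] = 0) THEN DEST[255:224] ← SRC1[255:224]
ELSE DEST[255:224] ← SRC2[255:224] FI

**VBLENDPS (VEX.128 encoded version)**

IF (IMM8[0] = 0) THEN DEST[31:0] ← SRC1[31:0]
ELSE DEST[31:0] ← SRC2[31:0] FI

IF (IMM8[1] = 0) THEN DEST[63:32] ← SRC1[63:32]
ELSE DEST[63:32] ← SRC2[63:32] FI

IF (IMM8[2] = 0) THEN DEST[95:64] ← SRC1[95:64]
ELSE DEST[95:64] ← SRC2[95:64] FI

IF (IMM8[3] = 0) THEN DEST[127:96] ← SRC1[127:96]
ELSE DEST[127:96] ← SRC2[127:96] FI

DEST[255:128] ← 0

**BLENDPS (128-bit Legacy SSE version)**

IF (IMM8[0] = 0) THEN DEST[31:0] ← DEST[31:0]
ELSE DEST[31:0] ← SRC[31:0] FI

IF (IMM8[1] = 0) THEN DEST[63:32] ← DEST[63:32]
ELSE DEST[63:32] ← SRC[63:32] FI

IF (IMM8[2] = 0) THEN DEST[95:64] ← DEST[95:64]
ELSE DEST[95:64] ← SRC[95:64] FI

IF (IMM8[3] = 0) THEN DEST[127:96] ← DEST[127:96]
ELSE DEST[127:96] ← SRC[127:96] FI

DEST[255:128] (Unmodified)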

Intel C/C++ Compiler Intrinsic Equivalent

VBLENDPS __m256 _mm256_blend_ps (__m256 a, __m256 b, const int mask);
BLENDPS __m128 _mm_blend_ps (__m128 a, __m128 b, const int mask);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4

BLENDVPD- Blend Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 15 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Conditionally copy double-precision floating-point values from xmm1 or xmm2/m128 to xmm1, based on mask bits in the implicit mask operand, XMM0.</td>
</tr>
<tr>
<td>BLENDVPD xmm1, xmm2/m128, &lt;XMM0&gt;</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 4B /r /is4</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally copy double-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the mask operand, xmm4</td>
</tr>
<tr>
<td>VBLENDVPD xmm1, xmm2, xmm3/m128, xmm4</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 4B /r /is4</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally copy double-precision floating-point values from ymm2 or ymm3/m256 to ymm1, based on mask bits in the mask operand, ymm4</td>
</tr>
<tr>
<td>VBLENDVPD ymm1, ymm2, ymm3/m256, ymm4</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Conditionally copies each quadword (double-precision floating-point) data element from either the second source operand or the first source operand, depending on the mask bits defined in the mask register operand. The mask bits are the most significant bit in each quadword element of the mask register.

Each quadword element of the destination operand is copied from:

- the corresponding quadword element in the second source operand, if the mask bit is "1"; or
- the corresponding quadword element in the first source operand, if the mask bit is "0".

The register assignment of the implicit mask operand for BLENDVPD is defined to be the architectural register XMM0.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged. The mask register operand is implicitly defined to be the architectural register XMM0. An attempt to execute BLENDVPD with a VEX prefix will cause #UD.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second source operand is an XMM register or a 128-bit memory location. The mask operand is the third source register, and is encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is ignored. The upper bits (255:128) of the corresponding YMM register (destination register) are zeroed. VEX.W must be 0; otherwise, the instruction will #UD.

VEX.256 encoded version: The first source operand and destination operand are YMM registers. The second source operand can be a YMM register or a 256-bit memory location. The mask operand is the third source register, and is encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is ignored. VEX.W must be 0; otherwise, the instruction will #UD.

VBLENDVPD permits the mask to be any XMM or YMM register. In contrast, BLENDVPD treats XMM0 implicitly as the mask and does not support non-destructive destination operation.
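
The register-controlled selection can be sketched in scalar C on raw 64-bit lanes. This is an illustrative model only: `blendv_pd_model` and its parameters are hypothetical names, and `lanes` is 2 for the 128-bit forms and 4 for the VEX.256 form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the variable blend: only bit 63 of each mask lane is
   consulted; the remaining mask bits are ignored. */
void blendv_pd_model(uint64_t *dest, const uint64_t *src1,
                     const uint64_t *src2, const uint64_t *mask,
                     size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; i++) {
        /* mask bit 0: first source; mask bit 1: second source */
        dest[i] = (mask[i] >> 63) ? src2[i] : src1[i];
    }
}
```

Because only the most significant bit of each element matters, the mask does not have to be all 1s or all 0s; any value whose sign bit carries the desired selection works.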

Operation

VBLENDVPD (VEX.256 encoded version)

MASK ← SRC3
IF (MASK[63] = 0) THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC2[63:0] FI
IF (MASK[127] = 0) THEN DEST[127:64] ← SRC1[127:64]
ELSE DEST[127:64] ← SRC2[127:64] FI
IF (MASK[191] = 0) THEN DEST[191:128] ← SRC1[191:128]
ELSE DEST[191:128] ← SRC2[191:128] FI
IF (MASK[255] = 0) THEN DEST[255:192] ← SRC1[255:192]
ELSE DEST[255:192] ← SRC2[255:192] FI

VBLENDVPD (VEX.128 encoded version)

MASK ← SRC3
IF (MASK[63] = 0) THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC2[63:0] FI
IF (MASK[127] = 0) THEN DEST[127:64] ← SRC1[127:64]
ELSE DEST[127:64] ← SRC2[127:64] FI
DEST[255:128] ← 0

BLENDVPD (128-bit Legacy SSE version)

MASK ← XMM0
IF (MASK[63] = 0) THEN DEST[63:0] ← DEST[63:0]
ELSE DEST[63:0] ← SRC[63:0] FI
IF (MASK[127] = 0) THEN DEST[127:64] ← DEST[127:64]
ELSE DEST[127:64] ← SRC[127:64] FI
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VBLENDVPD __m256d _mm256_blendv_pd (__m256d a, __m256d b, __m256d mask);
VBLENDVPD __m128d _mm_blendv_pd (__m128d a, __m128d b, __m128d mask);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.W = 1.

**BLENDVPS- Blend Packed Single Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 14 /r BLENDVPS xmm1, xmm2/m128, &lt;XMM0&gt;</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Conditionally copy single-precision floating-point values from xmm1 or xmm2/m128 to xmm1, based on mask bits in the implicit mask operand, XMM0.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 4A /r /is4 VBLENDVPS xmm1, xmm2, xmm3/m128, xmm4</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally copy single-precision floating-point values from xmm2 or xmm3/m128 to xmm1, based on mask bits in the specified mask operand, xmm4</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 4A /r /is4 VBLENDVPS ymm1, ymm2, ymm3/m256, ymm4</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally copy single-precision floating-point values from ymm2 or ymm3/m256 to ymm1, based on mask bits in the specified mask register, ymm4</td>
</tr>
</tbody>
</table>

**Description**

Conditionally copies each dword (single-precision floating-point) data element from either the second source operand or the first source operand, depending on the mask bits defined in the mask register operand. The mask bits are the most significant bit in each dword element of the mask register.

Each dword element of the destination operand is copied from:

- the corresponding dword element in the second source operand, if the mask bit is "1"; or
- the corresponding dword element in the first source operand, if the mask bit is "0".

The register assignment of the implicit mask operand for BLENDVPS is defined to be the architectural register XMM0.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged. The mask register operand is implicitly defined to be the architectural register XMM0. An attempt to execute BLENDVPS with a VEX prefix will cause #UD.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second source operand is an XMM register or a 128-bit memory location. The mask operand is the third source register, and is encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is ignored. The upper bits (255:128) of the corresponding YMM register (destination register) are zeroed. VEX.W must be 0; otherwise, the instruction will #UD.

VEX.256 encoded version: The first source operand and destination operand are YMM registers. The second source operand can be a YMM register or a 256-bit memory location. The mask operand is the third source register, and is encoded in bits[7:4] of the immediate byte (imm8). Bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is ignored. VEX.W must be 0; otherwise, the instruction will #UD.

VBLENDVPS permits the mask to be any XMM or YMM register. In contrast, BLENDVPS treats XMM0 implicitly as the mask and does not support non-destructive destination operation.

**Operation**

**VBLENDVPS (VEX.256 encoded version)**

MASK ← SRC3
IF (MASK[31] = 0) THEN DEST[31:0] ← SRC1[31:0]
ELSE DEST[31:0] ← SRC2[31:0] FI
IF (MASK[63] = 0) THEN DEST[63:32] ← SRC1[63:32]
ELSE DEST[63:32] ← SRC2[63:32] FI
IF (MASK[95] = 0) THEN DEST[95:64] ← SRC1[95:64]
ELSE DEST[95:64] ← SRC2[95:64] FI
IF (MASK[127] = 0) THEN DEST[127:96] ← SRC1[127:96]
ELSE DEST[127:96] ← SRC2[127:96] FI
IF (MASK[159] = 0) THEN DEST[159:128] ← SRC1[159:128]
ELSE DEST[159:128] ← SRC2[159:128] FI
IF (MASK[191] = 0) THEN DEST[191:160] ← SRC1[191:160]
ELSE DEST[191:160] ← SRC2[191:160] FI
IF (MASK[223] = 0) THEN DEST[223:192] ← SRC1[223:192]
ELSE DEST[223:192] ← SRC2[223:192] FI
IF (MASK[255] = 0) THEN DEST[255:224] ← SRC1[255:224]
ELSE DEST[255:224] ← SRC2[255:224] FI

**VBLENDVPS (VEX.128 encoded version)**

MASK ← SRC3
IF (MASK[31] = 0) THEN DEST[31:0] ← SRC1[31:0]
ELSE DEST[31:0] ← SRC2[31:0] FI
IF (MASK[63] = 0) THEN DEST[63:32] ← SRC1[63:32]
ELSE DEST[63:32] ← SRC2[63:32] FI
IF (MASK[95] = 0) THEN DEST[95:64] ← SRC1[95:64]
ELSE DEST[95:64] ← SRC2[95:64] FI
IF (MASK[127] = 0) THEN DEST[127:96] ← SRC1[127:96]
ELSE DEST[127:96] ← SRC2[127:96] FI
DEST[255:128] ← 0

**BLENDVPS (128-bit Legacy SSE version)**

MASK ← XMM0
IF (MASK[31] = 0) THEN DEST[31:0] ← DEST[31:0]
ELSE DEST[31:0] ← SRC[31:0] FI
IF (MASK[63] = 0) THEN DEST[63:32] ← DEST[63:32]
ELSE DEST[63:32] ← SRC[63:32] FI
IF (MASK[95] = 0) THEN DEST[95:64] ← DEST[95:64]
ELSE DEST[95:64] ← SRC[95:64] FI
IF (MASK[127] = 0) THEN DEST[127:96] ← DEST[127:96]
ELSE DEST[127:96] ← SRC[127:96] FI
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VBLENDVPS __m256 _mm256_blendv_ps (__m256 a, __m256 b, __m256 mask);
VBLENDVPS __m128 _mm_blendv_ps (__m128 a, __m128 b, __m128 mask);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.W = 1.

VBROADCAST- Load with Broadcast

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.128.66.0F38 18 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Broadcast single-precision floating-point element in mem to four locations in xmm1</td>
</tr>
<tr>
<td>VBROADCASTSS xmm1, m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.256.66.0F38 18 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Broadcast single-precision floating-point element in mem to eight locations in ymm1</td>
</tr>
<tr>
<td>VBROADCASTSS ymm1, m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.256.66.0F38 19 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Broadcast double-precision floating-point element in mem to four locations in ymm1</td>
</tr>
<tr>
<td>VBROADCASTSD ymm1, m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.256.66.0F38 1A /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Broadcast 128 bits of floating-point data in mem to low and high 128-bits in ymm1</td>
</tr>
<tr>
<td>VBROADCASTF128 ymm1, m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Load floating-point values from the source operand (second operand) and broadcast to all elements of the destination operand (first operand).

The destination operand is a YMM register (an XMM register for the 128-bit version of VBROADCASTSS). The source operand is either a 32-bit, 64-bit, or 128-bit memory location. Register source encodings are reserved and will #UD.

VBROADCASTSD and VBROADCASTF128 are only supported as 256-bit wide versions. VBROADCASTSS is supported in both 128-bit and 256-bit wide versions.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

An attempt to execute VBROADCASTSD or VBROADCASTF128 encoded with VEX.L = 0 will cause a #UD exception.
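
As a rough scalar picture of the load-with-broadcast behavior, the sketch below reads one single-precision element from memory and replicates it into every lane. It models VBROADCASTSS only; `broadcast_ss_model` and its parameters are hypothetical names, and `lanes` is 4 for the 128-bit version and 8 for the VEX.256 version.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of VBROADCASTSS: one 32-bit element is loaded from memory
   and copied into every lane of the destination. */
void broadcast_ss_model(float *dest, const float *mem, size_t lanes)
{
    assert(lanes == 4 || lanes == 8);
    const float temp = *mem;          /* single load of the source element */
    for (size_t i = 0; i < lanes; i++) {
        dest[i] = temp;
    }
}
```

VBROADCASTSD and VBROADCASTF128 follow the same pattern with a 64-bit or 128-bit source element.
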

Figure 5-1. VBROADCASTSS Operation (VEX.256 encoded version)

Figure 5-2. VBROADCASTSS Operation (128-bit version)

Operation

**VBROADCASTSS (128 bit version)**

temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[255:128] ← 0

VBROADCASTSS (VEX.256 encoded version)

temp ← SRC[31:0]
DEST[31:0] ← temp
DEST[63:32] ← temp
DEST[95:64] ← temp
DEST[127:96] ← temp
DEST[159:128] ← temp
DEST[191:160] ← temp
DEST[223:192] ← temp
DEST[255:224] ← temp

VBROADCASTSD (VEX.256 encoded version)

temp ← SRC[63:0]
DEST[63:0] ← temp
DEST[127:64] ← temp
DEST[191:128] ← temp
DEST[255:192] ← temp

VBROADCASTF128

temp ← SRC[127:0]
DEST[127:0] ← temp
DEST[255:128] ← temp

Intel C/C++ Compiler Intrinsic Equivalent

VBROADCASTSS __m128 _mm_broadcast_ss(float *a);
VBROADCASTSS __m256 _mm256_broadcast_ss(float *a);
VBROADCASTSD __m256d _mm256_broadcast_sd(double *a);
VBROADCASTF128 __m256 _mm256_broadcast_ps(__m128 *a);
VBROADCASTF128 __m256d _mm256_broadcast_pd(__m128d *a);

SIMD Floating-Point Exceptions

None

Other Exceptions
See Exceptions Type 6; additionally

#UD If VEX.L = 0 for VBROADCASTSD.
 If VEX.L = 0 for VBROADCASTF128.

## CMPPD- Compare Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F C2 /r ib</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed double-precision floating-point values in xmm2/m128 and xmm1 using bits 2:0 of imm8 as a comparison predicate</td>
</tr>
<tr>
<td>CMPPD xmm1, xmm2/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F C2 /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed double-precision floating-point values in xmm3/m128 and xmm2 using bits 4:0 of imm8 as a comparison predicate</td>
</tr>
<tr>
<td>VCMPPD xmm1, xmm2, xmm3/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F C2 /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed double-precision floating-point values in ymm3/m256 and ymm2 using bits 4:0 of imm8 as a comparison predicate</td>
</tr>
<tr>
<td>VCMPPD ymm1, ymm2, ymm3/m256, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Description

Performs a SIMD compare of the packed double-precision floating-point values in the second source operand and the first source operand and returns the results of the comparison to the destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each pair of packed values in the two source operands. The result of each comparison is a quadword mask of all 1s (comparison true) or all 0s (comparison false).

**VEX.256 encoded version:** The first source operand (second operand) is a YMM register. The second source operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first operand) is a YMM register. Four comparisons are performed with results written to the destination operand.

**128-bit Legacy SSE version:** The first source and destination operand (first operand) is an XMM register. The second source operand (second operand) can be an XMM register or a 128-bit memory location. Bits (255:128) of the corresponding YMM destination register remain unchanged. Two comparisons are performed with results written to bits 127:0 of the destination operand.

**VEX.128 encoded version:** The first source operand (second operand) is an XMM register. The second source operand (third operand) can be an XMM register or a 128-bit memory location. Bits (255:128) of the destination YMM register are zeroed. Two comparisons are performed with results written to bits 127:0 of the destination operand.

The comparison predicate operand is an 8-bit immediate:

- For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see Table 5-9). Bits 5 through 7 of the immediate are reserved.
- For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see the first 8 rows of Table 5-9). Bits 3 through 7 of the immediate are reserved.
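
To make the predicate encoding concrete, the following C sketch evaluates one lane for the eight predicates selected by bits 2:0 of the immediate (the first 8 rows of Table 5-9) and returns the quadword mask. It is a scalar model under stated assumptions: `cmppd_lane_model` is a hypothetical name, exception signaling (#IA on QNaN) is not modeled, and the additional predicates reachable through bits 4:0 in VEX encodings are omitted.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Scalar model of one CMPPD lane for the predicates in imm8 bits 2:0.
   Returns all 1s if the predicate holds, otherwise all 0s. */
uint64_t cmppd_lane_model(double a, double b, unsigned imm8)
{
    const int unordered = isnan(a) || isnan(b);
    int result = 0;

    switch (imm8 & 0x7u) {
    case 0x0: result = (a == b);    break;  /* EQ:    false if unordered */
    case 0x1: result = (a < b);     break;  /* LT:    false if unordered */
    case 0x2: result = (a <= b);    break;  /* LE:    false if unordered */
    case 0x3: result = unordered;   break;  /* UNORD */
    case 0x4: result = (a != b);    break;  /* NEQ:   true if unordered  */
    case 0x5: result = !(a < b);    break;  /* NLT:   true if unordered  */
    case 0x6: result = !(a <= b);   break;  /* NLE:   true if unordered  */
    case 0x7: result = !unordered;  break;  /* ORD */
    default:  assert(0);            break;  /* unreachable: 3-bit field  */
    }
    return result ? UINT64_MAX : 0;
}
```

The C relational operators already return false when either input is a NaN, which is why the negated forms (NLT, NLE) come out true for unordered inputs.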

### Table 5-9. Comparison Predicate for CMPPD and CMPPS Instructions

<table>
<thead>
<tr>
<th rowspan="2">Predicate</th>
<th rowspan="2">imm8 Value</th>
<th rowspan="2">Description</th>
<th colspan="4">Result: A Is 1st Operand, B Is 2nd Operand</th>
<th rowspan="2">Signals #IA on QNAN</th>
</tr>
<tr>
<th>A &gt; B</th>
<th>A &lt; B</th>
<th>A = B</th>
<th>Unordered<sup>1</sup></th>
</tr>
</thead>
<tbody>
<tr>
<td>EQ_OQ (EQ)</td>
<td>0H</td>
<td>Equal (ordered, non-signaling)</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>LT_OS (LT)</td>
<td>1H</td>
<td>Less-than (ordered, signaling)</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>LE_OS (LE)</td>
<td>2H</td>
<td>Less-than-or-equal (ordered, signaling)</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>UNORD_Q (UNORD)</td>
<td>3H</td>
<td>Unordered (non-signaling)</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>NEQ_UQ (NEQ)</td>
<td>4H</td>
<td>Not-equal (unordered, non-signaling)</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>NLT_US (NLT)</td>
<td>5H</td>
<td>Not-less-than (unordered, signaling)</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>Yes</td>
</tr>
<tr>
<td>NLE_US (NLE)</td>
<td>6H</td>
<td>Not-less-than-or-equal (unordered, signaling)</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>Yes</td>
</tr>
<tr>
<td>ORD_Q (ORD)</td>
<td>7H</td>
<td>Ordered (non-signaling)</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>EQ_UQ</td>
<td>8H</td>
<td>Equal (unordered, non-signaling)</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>NGE_US (NGE)</td>
<td>9H</td>
<td>Not-greater-than-or-equal (unordered, signaling)</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>Yes</td>
</tr>
</tbody>
</table>
**Table 5-9. Comparison Predicate for CMPPD and CMPPS Instructions (Continued)**

<table>
<thead>
<tr>
<th rowspan="2">Predicate</th>
<th rowspan="2">imm8 Value</th>
<th rowspan="2">Description</th>
<th colspan="4">Result: A Is 1st Operand, B Is 2nd Operand</th>
<th rowspan="2">Signals #IA on QNAN</th>
</tr>
<tr>
<th>A &gt; B</th>
<th>A &lt; B</th>
<th>A = B</th>
<th>Unordered<sup>1</sup></th>
</tr>
</thead>
<tbody>
<tr>
<td>NGT_US (NGT)</td>
<td>AH</td>
<td>Not-greater-than (unordered, signaling)</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>Yes</td>
</tr>
<tr>
<td>FALSE_OQ (FALSE)</td>
<td>BH</td>
<td>False (ordered, non-signaling)</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>NEQ_OQ</td>
<td>CH</td>
<td>Not-equal (ordered, non-signaling)</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>GE_OS (GE)</td>
<td>DH</td>
<td>Greater-than-or-equal (ordered, signaling)</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>GT_OS (GT)</td>
<td>EH</td>
<td>Greater-than (ordered, signaling)</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>TRUE_UQ (TRUE)</td>
<td>FH</td>
<td>True (unordered, non-signaling)</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>EQ_OS</td>
<td>10H</td>
<td>Equal (ordered, signaling)</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>LT_OQ</td>
<td>11H</td>
<td>Less-than (ordered, non-signaling)</td>
<td>False</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>LE_OQ</td>
<td>12H</td>
<td>Less-than-or-equal (ordered, non-signaling)</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>No</td>
</tr>
<tr>
<td>UNORD_S</td>
<td>13H</td>
<td>Unordered (signaling)</td>
<td>False</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>Yes</td>
</tr>
<tr>
<td>NEQ_US</td>
<td>14H</td>
<td>Not-equal (unordered, signaling)</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>Yes</td>
</tr>
<tr>
<td>NLT_UQ</td>
<td>15H</td>
<td>Not-less-than (unordered, non-signaling)</td>
<td>True</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>NLE_UQ</td>
<td>16H</td>
<td>Not-less-than-or-equal (unordered, non-signaling)</td>
<td>True</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>No</td>
</tr>
<tr>
<td>ORD_S</td>
<td>17H</td>
<td>Ordered (signaling)</td>
<td>True</td>
<td>True</td>
<td>True</td>
<td>False</td>
<td>Yes</td>
</tr>
<tr>
<td>EQ_US</td>
<td>18H</td>
<td>Equal (unordered, signaling)</td>
<td>False</td>
<td>False</td>
<td>True</td>
<td>True</td>
<td>Yes</td>
</tr>
</tbody>
</table>

Table 5-9. Comparison Predicate for CMPPD and CMPPS Instructions (Continued)

<table>
<thead>
<tr>
<th>Predicate</th>
<th>imm8 Value</th>
<th>Description</th>
<th>Result: A Is 1st Operand, B Is 2nd Operand (A &gt; B, A &lt; B, A = B, Unordered<sup>1</sup>)</th>
<th>Signals #IA on QNaN</th>
</tr>
</thead>
<tbody>
<tr>
<td>NGE_UQ</td>
<td>19H</td>
<td>Not-greater-than-or-equal (unordered, nonsignaling)</td>
<td>False True False True</td>
<td>No</td>
</tr>
<tr>
<td>NGT_UQ</td>
<td>1AH</td>
<td>Not-greater-than (unordered, nonsignaling)</td>
<td>False True True True</td>
<td>No</td>
</tr>
<tr>
<td>FALSE_OS</td>
<td>1BH</td>
<td>False (ordered, signaling)</td>
<td>False False False False</td>
<td>Yes</td>
</tr>
<tr>
<td>NEQ_OS</td>
<td>1CH</td>
<td>Not-equal (ordered, signaling)</td>
<td>True True False False</td>
<td>Yes</td>
</tr>
<tr>
<td>GE_OQ</td>
<td>1DH</td>
<td>Greater-than-or-equal (ordered, nonsignaling)</td>
<td>True False True False</td>
<td>No</td>
</tr>
<tr>
<td>GT_OQ</td>
<td>1EH</td>
<td>Greater-than (ordered, nonsignaling)</td>
<td>True False False False</td>
<td>No</td>
</tr>
<tr>
<td>TRUE_US</td>
<td>1FH</td>
<td>True (unordered, signaling)</td>
<td>True True True True</td>
<td>Yes</td>
</tr>
</tbody>
</table>

NOTES:
1. If either operand A or B is a NaN.

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the ordered relationship is true when neither source operand is a NaN.
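The two relationships can be illustrated with a minimal sketch. Python is used here only for illustration; `is_unordered` and `is_ordered` are hypothetical helper names and not part of any Intel interface.

```python
import math

def is_unordered(a: float, b: float) -> bool:
    """True when at least one of the two operands is a NaN."""
    return math.isnan(a) or math.isnan(b)

def is_ordered(a: float, b: float) -> bool:
    """True when neither operand is a NaN."""
    return not is_unordered(a, b)

nan = float("nan")
print(is_unordered(1.0, nan), is_ordered(1.0, 2.0))
```

Exactly one of the two relationships holds for any pair of operands, which is why each predicate in Table 5-9 is labeled either ordered or unordered.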

A subsequent computational instruction that uses the mask result in the destination operand as an input operand will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask of all 1s corresponds to a QNaN.
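The claim about the two mask values can be checked by reinterpreting the bit patterns as IEEE 754 double-precision values. The sketch below assumes the standard binary64 layout (1 sign bit, 11 exponent bits, 52 fraction bits); the helper names are hypothetical.

```python
import math
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def is_quiet_nan(bits: int) -> bool:
    """Exponent all 1s, nonzero fraction, and the fraction MSB (bit 51) set."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and fraction != 0 and bool(bits & (1 << 51))

ALL_ONES = 0xFFFFFFFFFFFFFFFF   # mask written for "comparison true"
ALL_ZEROS = 0x0000000000000000  # mask written for "comparison false"

true_mask = bits_to_double(ALL_ONES)
false_mask = bits_to_double(ALL_ZEROS)
```

Because the all-1s pattern has bit 51 set, it decodes as a quiet NaN rather than a signaling NaN, so using it as an arithmetic input does not by itself raise an invalid-operation exception.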

Note that processors with “CPUID.1H:ECX.AVX = 0” do not implement the “greater-than”, “greater-than-or-equal”, “not-greater-than”, and “not-greater-than-or-equal” predicates. These comparisons can be made either by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison) or by using software emulation. When using software emulation, the program must swap the operands (copying registers when necessary to protect the data that will now be in the destination), and then perform the compare using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7 (Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) under the heading Emulation.
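The two approaches differ when an operand is a NaN. The following sketch models only the true/false result of each relation on scalar values (exception signaling and register copying are not modeled); the function names are hypothetical.

```python
import math

def lt_os(a: float, b: float) -> bool:
    """Predicate 1 (less-than): false when the operands are unordered."""
    return not (math.isnan(a) or math.isnan(b)) and a < b

def nle_us(a: float, b: float) -> bool:
    """Predicate 6 (not-less-than-or-equal): true when unordered."""
    return not (a <= b)

def gt_by_swap(a: float, b: float) -> bool:
    """Greater-than emulated by swapping the operands of less-than."""
    return lt_os(b, a)

def gt_by_inverse(a: float, b: float) -> bool:
    """Greater-than approximated by the inverse relation (NaN gives true)."""
    return nle_us(a, b)

nan = float("nan")
print(gt_by_swap(2.0, 1.0), gt_by_inverse(2.0, 1.0))  # both true
print(gt_by_swap(nan, 1.0), gt_by_inverse(nan, 1.0))  # differ on NaN
```

The operand swap reproduces an ordered greater-than exactly, while the inverse relation returns true for unordered inputs, so the choice depends on how NaN operands should be treated.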

Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand CMPPD instruction, for processors with “CPUID.1H:ECX.AVX = 0”. See Table 5-10. Compilers should treat reserved imm8 values as illegal syntax.

Table 5-10. Pseudo-Op and CMPPD Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>CMPPD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMPEQPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 0</td>
</tr>
<tr>
<td>CMPLTPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 1</td>
</tr>
<tr>
<td>CMPLEPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 2</td>
</tr>
<tr>
<td>CMPUNORDPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 3</td>
</tr>
<tr>
<td>CMPNEQPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 4</td>
</tr>
<tr>
<td>CMPNLTPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 5</td>
</tr>
<tr>
<td>CMPNLEPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 6</td>
</tr>
<tr>
<td>CMPORDPD xmm1, xmm2</td>
<td>CMPPD xmm1, xmm2, 7</td>
</tr>
</tbody>
</table>
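As a rough illustration of how an assembler front end might expand these pseudo-ops, the sketch below rewrites a two-operand pseudo-op into the three-operand CMPPD form. The function `expand_pseudo_op` and the list name are hypothetical; the predicate order follows imm8 values 0 through 7.

```python
# Predicate names in imm8 order 0..7 for the non-VEX encoding.
LEGACY_PREDICATES = ["EQ", "LT", "LE", "UNORD", "NEQ", "NLT", "NLE", "ORD"]

def expand_pseudo_op(line: str) -> str:
    """Rewrite 'CMP<pred>PD xmm1, xmm2' as 'CMPPD xmm1, xmm2, <imm8>'."""
    mnemonic, operands = line.split(None, 1)
    if not (mnemonic.startswith("CMP") and mnemonic.endswith("PD")):
        raise ValueError(f"not a CMPPD pseudo-op: {mnemonic}")
    name = mnemonic[3:-2]
    if name not in LEGACY_PREDICATES:
        # Greater-than relations need more than one instruction without AVX,
        # so they are rejected here rather than expanded.
        raise ValueError(f"no single-instruction expansion for {mnemonic}")
    return f"CMPPD {operands.strip()}, {LEGACY_PREDICATES.index(name)}"

print(expand_pseudo_op("CMPNLEPD xmm1, xmm2"))
```

Rejecting unknown names mirrors the guidance that reserved imm8 values be treated as illegal syntax.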

The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less-than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)

Processors with “CPUID.1H:ECX.AVX = 1” implement the full complement of 32 predicates shown in Table 5-9; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPPD instruction. See Table 5-11, where reg1, reg2, and reg3 represent either XMM registers or YMM registers. Compilers should treat reserved imm8 values as illegal syntax. Alternatively, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic interface.

Table 5-11. Pseudo-Op and VCMPPD Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPPD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0</td>
</tr>
<tr>
<td>VCMPLTPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1</td>
</tr>
<tr>
<td>VCMPLEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 2</td>
</tr>
<tr>
<td>VCMPUNORDPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 3</td>
</tr>
<tr>
<td>VCMPNEQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 4</td>
</tr>
<tr>
<td>VCMPNLTPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 5</td>
</tr>
<tr>
<td>VCMPNLEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 6</td>
</tr>
<tr>
<td>VCMPORDPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 7</td>
</tr>
</tbody>
</table>

Table 5-11. Pseudo-Op and VCMPPD Implementation (Continued)

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPPD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQ_UQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 8</td>
</tr>
<tr>
<td>VCMPNGEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 9</td>
</tr>
<tr>
<td>VCMPNGTPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0AH</td>
</tr>
<tr>
<td>VCMPFALSEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0BH</td>
</tr>
<tr>
<td>VCMPNEQ_OQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0CH</td>
</tr>
<tr>
<td>VCMPGEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0DH</td>
</tr>
<tr>
<td>VCMPGTPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0EH</td>
</tr>
<tr>
<td>VCMPTRUEPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 0FH</td>
</tr>
<tr>
<td>VCMPEQ_OSPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 10H</td>
</tr>
<tr>
<td>VCMPLT_OQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 11H</td>
</tr>
<tr>
<td>VCMPLE_OQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 12H</td>
</tr>
<tr>
<td>VCMPUNORD_SPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 13H</td>
</tr>
<tr>
<td>VCMPNEQ_USPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 14H</td>
</tr>
<tr>
<td>VCMPNLT_UQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 15H</td>
</tr>
<tr>
<td>VCMPNLE_UQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 16H</td>
</tr>
<tr>
<td>VCMPORD_SPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 17H</td>
</tr>
<tr>
<td>VCMPEQ_USPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 18H</td>
</tr>
<tr>
<td>VCMPNGE_UQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 19H</td>
</tr>
<tr>
<td>VCMPNGT_UQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1AH</td>
</tr>
<tr>
<td>VCMPFALSE_OSPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1BH</td>
</tr>
<tr>
<td>VCMPNEQ_OSPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1CH</td>
</tr>
<tr>
<td>VCMPGE_OQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1DH</td>
</tr>
<tr>
<td>VCMPGT_OQPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1EH</td>
</tr>
<tr>
<td>VCMPTRUE_USPD reg1, reg2, reg3</td>
<td>VCMPPD reg1, reg2, reg3, 1FH</td>
</tr>
</tbody>
</table>

Operation

CASE (COMPARISON PREDICATE) OF
0: OP3 ← EQ_OQ; OP5 ← EQ_OQ;
1: OP3 ← LT_OS; OP5 ← LT_OS;
2: OP3 ← LE_OS; OP5 ← LE_OS;
3: OP3 ← UNORD_Q; OP5 ← UNORD_Q;
4: OP3 ← NEQ_UQ; OP5 ← NEQ_UQ;
5: OP3 ← NLT_US; OP5 ← NLT_US;
6: OP3 ← NLE_US; OP5 ← NLE_US;
7: OP3 ← ORD_Q; OP5 ← ORD_Q;
8: OP5 ← EQ_UQ;
9: OP5 ← NGE_US;
10: OP5 ← NGT_US;
11: OP5 ← FALSE_OQ;
12: OP5 ← NEQ_OQ;
13: OP5 ← GE_OS;
14: OP5 ← GT_OS;
15: OP5 ← TRUE_UQ;
16: OP5 ← EQ_OS;
17: OP5 ← LT_OQ;
18: OP5 ← LE_OQ;
19: OP5 ← UNORD_S;
20: OP5 ← NEQ_US;
21: OP5 ← NLT_UQ;
22: OP5 ← NLE_UQ;
23: OP5 ← ORD_S;
24: OP5 ← EQ_US;
25: OP5 ← NGE_UQ;
26: OP5 ← NGT_UQ;
27: OP5 ← FALSE_OS;
28: OP5 ← NEQ_OS;
29: OP5 ← GE_OQ;
30: OP5 ← GT_OQ;
31: OP5 ← TRUE_US;
DEFAULT: Reserved;

ESAC;

VCMPPD (VEX.256 encoded version)
CMP0 ← SRC1[63:0] OP5 SRC2[63:0];
CMP1 ← SRC1[127:64] OP5 SRC2[127:64];
CMP2 ← SRC1[191:128] OP5 SRC2[191:128];
CMP3 ← SRC1[255:192] OP5 SRC2[255:192];
IF CMP0 = TRUE
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0000000000000000H; FI;
IF CMP1 = TRUE
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0000000000000000H; FI;
IF CMP2 = TRUE
    THEN DEST[191:128] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[191:128] ← 0000000000000000H; FI;
IF CMP3 = TRUE
    THEN DEST[255:192] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[255:192] ← 0000000000000000H; FI;

VCMPPD (VEX.128 encoded version)
CMP0 ← SRC1[63:0] OP5 SRC2[63:0];
CMP1 ← SRC1[127:64] OP5 SRC2[127:64];
IF CMP0 = TRUE
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0000000000000000H; FI;
IF CMP1 = TRUE
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0000000000000000H; FI;
DEST[255:128] ← 0

CMPPD (128-bit Legacy SSE version)
CMP0 ← SRC1[63:0] OP3 SRC2[63:0];
CMP1 ← SRC1[127:64] OP3 SRC2[127:64];
IF CMP0 = TRUE
    THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[63:0] ← 0000000000000000H; FI;
IF CMP1 = TRUE
    THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
    ELSE DEST[127:64] ← 0000000000000000H; FI;
DEST[255:128] (Unmodified)
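The VEX.256 operation can be modeled in software roughly as follows. This is a sketch under stated assumptions, not the hardware definition: lanes are represented as Python floats, exception signaling is not modeled, and `RESULTS`, `compare`, and `vcmppd_256` are hypothetical names. It relies on an observation from Table 5-9: predicates 10H through 1FH produce the same true/false results as 00H through 0FH and differ only in whether a QNaN operand signals.

```python
import math

# Result columns of Table 5-9 in the order (A > B, A < B, A = B, Unordered),
# indexed by the low four bits of the predicate.
RESULTS = [
    "FFTF", "FTFF", "FTTF", "FFFT", "TTFT", "TFTT", "TFFT", "TTTF",
    "FFTT", "FTFT", "FTTT", "FFFF", "TTFF", "TFTF", "TFFF", "TTTT",
]

def compare(a: float, b: float, imm8: int) -> bool:
    """Evaluate one comparison predicate (imm8 bits 4:0) on a pair of values."""
    row = RESULTS[imm8 & 0x0F]
    if math.isnan(a) or math.isnan(b):
        column = 3
    elif a > b:
        column = 0
    elif a < b:
        column = 1
    else:
        column = 2
    return row[column] == "T"

def vcmppd_256(src1, src2, imm8):
    """Return the four 64-bit result masks, lowest lane first."""
    return [0xFFFFFFFFFFFFFFFF if compare(a, b, imm8) else 0
            for a, b in zip(src1, src2)]

nan = float("nan")
masks = vcmppd_256([1.0, 2.0, 3.0, nan], [1.0, 3.0, 2.0, 0.0], 0x1E)  # GT_OQ
print([hex(m) for m in masks])
```

With predicate 1EH (GT_OQ), only the lane where the first source is strictly greater yields an all-1s mask; the NaN lane yields 0 because the predicate is ordered.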

Intel C/C++ Compiler Intrinsic Equivalent

VCMPPD __m256d _mm256_cmp_pd(__m256d a, __m256d b, const int imm)
VCMPPD __m128d _mm_cmp_pd(__m128d a, __m128d b, const int imm)

SIMD Floating-Point Exceptions
Invalid if SNaN operand; Invalid if QNaN operand and the predicate signals on QNaN as listed in Table 5-9.
Denormal

Other Exceptions
See Exceptions Type 2

CMPPS- Compare Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F C2 /r ib<br>CMPPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE</td>
<td>Compare packed single-precision floating-point values in xmm2/m128 and xmm1 using bits 2:0 of imm8 as a comparison predicate.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F C2 /r ib<br>VCMPPS xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed single-precision floating-point values in xmm3/m128 and xmm2 using bits 4:0 of imm8 as a comparison predicate.</td>
</tr>
<tr>
<td>VEX.NDS.256.0F C2 /r ib<br>VCMPPS ymm1, ymm2, ymm3/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed single-precision floating-point values in ymm3/m256 and ymm2 using bits 4:0 of imm8 as a comparison predicate.</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD compare of the packed single-precision floating-point values in the second source operand and the first source operand and returns the results of the comparison to the destination operand. The comparison predicate operand (immediate byte) specifies the type of comparison performed on each of the pairs of packed values. The result of each comparison is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).

VEX.256 encoded version: The first source operand (second operand) is a YMM register. The second source operand (third operand) can be a YMM register or a 256-bit memory location. The destination operand (first operand) is a YMM register. Eight comparisons are performed with results written to the destination operand.

128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The second source operand (second operand) can be an XMM register or 128-bit memory location. Bits (255:128) of the corresponding YMM destination register remain unchanged. Four comparisons are performed with results written to bits 127:0 of the destination operand.

VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source operand (third operand) can be an XMM register or a 128-bit memory location. Bits (255:128) of the destination YMM register are zeroed. Four comparisons are performed with results written to bits 127:0 of the destination operand.
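The difference between the legacy SSE and VEX.128 forms in their treatment of the upper half of the destination YMM register can be sketched by modeling the register as a 256-bit integer. The function names are hypothetical and the sketch covers only the write-back step, not the comparison itself.

```python
MASK128 = (1 << 128) - 1

def write_legacy_sse(ymm_dest: int, result128: int) -> int:
    """Legacy SSE form: bits 255:128 of the YMM register are left unchanged."""
    return (ymm_dest & ~MASK128) | (result128 & MASK128)

def write_vex128(ymm_dest: int, result128: int) -> int:
    """VEX.128 form: bits 255:128 of the YMM register are zeroed."""
    return result128 & MASK128

old_ymm = (0xABCD << 128) | 0x1234   # previous register contents
result = 0xFFFFFFFF                  # lane 0 compared true, lanes 1-3 false
print(hex(write_legacy_sse(old_ymm, result)))
print(hex(write_vex128(old_ymm, result)))
```

Code that mixes the two encodings therefore cannot assume the upper 128 bits hold the same value after a compare.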

The comparison predicate operand is an 8-bit immediate:

- For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see Table 5-9). Bits 5 through 7 of the immediate are reserved.
- For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see the first 8 rows of Table 5-9). Bits 3 through 7 of the immediate are reserved.
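The two immediate layouts above can be sketched as a small decoder. Rejecting immediates with reserved bits set is an assumption modeled on assembler-side checking (treating reserved imm8 values as illegal syntax); it does not describe processor behavior. The name `decode_predicate` is hypothetical.

```python
def decode_predicate(imm8: int, vex_encoded: bool) -> int:
    """Extract the comparison predicate field from an 8-bit immediate."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in 8 bits")
    # VEX encodings use bits 4:0; non-VEX encodings use bits 2:0.
    field_mask = 0x1F if vex_encoded else 0x07
    if imm8 & ~field_mask & 0xFF:
        raise ValueError("reserved imm8 bits are set")
    return imm8 & field_mask

print(decode_predicate(0x1E, vex_encoded=True))   # predicate 30
print(decode_predicate(0x07, vex_encoded=False))  # predicate 7
```

A value such as 08H is a valid predicate only for the VEX-encoded forms; for the non-VEX forms it sets a reserved bit.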

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the ordered relationship is true when neither source operand is a NaN.

A subsequent computational instruction that uses the mask result in the destination operand as an input operand will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask of all 1s corresponds to a QNaN.

Note that processors with “CPUID.1H:ECX.AVX = 0” do not implement the “greater-than”, “greater-than-or-equal”, “not-greater-than”, and “not-greater-than-or-equal” predicates. These comparisons can be made either by using the inverse relationship (that is, use the “not-less-than-or-equal” to make a “greater-than” comparison) or by using software emulation. When using software emulation, the program must swap the operands (copying registers when necessary to protect the data that will now be in the destination), and then perform the compare using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7 (Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) under the heading Emulation.

Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand CMPPS instruction, for processors with "CPUID.1H:ECX.AVX = 0". See Table 5-12. Compilers should treat reserved imm8 values as illegal syntax.

Table 5-12. Pseudo-Op and CMPPS Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>CMPPS Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMPEQPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 0</td>
</tr>
<tr>
<td>CMPLTPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 1</td>
</tr>
<tr>
<td>CMPLEPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 2</td>
</tr>
<tr>
<td>CMPUNORDPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 3</td>
</tr>
<tr>
<td>CMPNEQPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 4</td>
</tr>
<tr>
<td>CMPNLTPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 5</td>
</tr>
<tr>
<td>CMPNLEPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 6</td>
</tr>
<tr>
<td>CMPORDPS xmm1, xmm2</td>
<td>CMPPS xmm1, xmm2, 7</td>
</tr>
</tbody>
</table>

The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less-than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)

Processors with "CPUID.1H:ECX.AVX = 1" implement the full complement of 32 predicates shown in Table 5-13; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPPS instruction. See Table 5-13, where reg1, reg2, and reg3 represent either XMM registers or YMM registers. Compilers should treat reserved imm8 values as illegal syntax. Alternatively, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic interface.

### Table 5-13. Pseudo-Op and VCMPPS Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPPS Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0</td>
</tr>
<tr>
<td>VCMPLTPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1</td>
</tr>
<tr>
<td>VCMPLEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 2</td>
</tr>
<tr>
<td>VCMPUNORDPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 3</td>
</tr>
<tr>
<td>VCMPNEQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 4</td>
</tr>
<tr>
<td>VCMPNLTPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 5</td>
</tr>
<tr>
<td>VCMPNLEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 6</td>
</tr>
<tr>
<td>VCMPORDPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 7</td>
</tr>
<tr>
<td>VCMPEQ_UQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 8</td>
</tr>
<tr>
<td>VCMPNGEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 9</td>
</tr>
<tr>
<td>VCMPNGTPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0AH</td>
</tr>
<tr>
<td>VCMPFALSEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0BH</td>
</tr>
<tr>
<td>VCMPNEQ_OQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0CH</td>
</tr>
<tr>
<td>VCMPGEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0DH</td>
</tr>
<tr>
<td>VCMPGTPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0EH</td>
</tr>
<tr>
<td>VCMPTRUEPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 0FH</td>
</tr>
</tbody>
</table>

### Table 5-13. Pseudo-Op and VCMPPS Implementation (Continued)

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPPS Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQ_OSPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 10H</td>
</tr>
<tr>
<td>VCMPLT_OQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 11H</td>
</tr>
<tr>
<td>VCMPLE_OQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 12H</td>
</tr>
<tr>
<td>VCMPUNORD_SPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 13H</td>
</tr>
<tr>
<td>VCMPNEQ_USPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 14H</td>
</tr>
<tr>
<td>VCMPNLT_UQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 15H</td>
</tr>
<tr>
<td>VCMPNLE_UQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 16H</td>
</tr>
<tr>
<td>VCMPORD_SPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 17H</td>
</tr>
<tr>
<td>VCMPEQ_USPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 18H</td>
</tr>
<tr>
<td>VCMPNGE_UQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 19H</td>
</tr>
<tr>
<td>VCMPNGT_UQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1AH</td>
</tr>
<tr>
<td>VCMPFALSE_OSPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1BH</td>
</tr>
<tr>
<td>VCMPNEQ_OSPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1CH</td>
</tr>
<tr>
<td>VCMPGE_OQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1DH</td>
</tr>
<tr>
<td>VCMPGT_OQPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1EH</td>
</tr>
<tr>
<td>VCMPTRUE_USPS reg1, reg2, reg3</td>
<td>VCMPPS reg1, reg2, reg3, 1FH</td>
</tr>
</tbody>
</table>

### Operation

CASE (COMPARISON PREDICATE) OF

0: OP3 ← EQ_OQ; OP5 ← EQ_OQ;
1: OP3 ← LT_OS; OP5 ← LT_OS;
2: OP3 ← LE_OS; OP5 ← LE_OS;
3: OP3 ← UNORD_Q; OP5 ← UNORD_Q;
4: OP3 ← NEQ_UQ; OP5 ← NEQ_UQ;
5: OP3 ← NLT_US; OP5 ← NLT_US;
6: OP3 ← NLE_US; OP5 ← NLE_US;
7: OP3 ← ORD_Q; OP5 ← ORD_Q;
8: OP5 ← EQ_UQ;
9: OP5 ← NGE_US;
10: OP5 ← NGT_US;
11: OP5 ← FALSE_OQ;
12: OP5 ← NEQ_OQ;
13: OP5 ← GE_OS;
14: OP5 ← GT_OS;
15: OP5 ← TRUE_UQ;
16: OP5 ← EQ_OS;
17: OP5 ← LT_OQ;
18: OP5 ← LE_OQ;
19: OP5 ← UNORD_S;
20: OP5 ← NEQ_US;
21: OP5 ← NLT_UQ;
22: OP5 ← NLE_UQ;
23: OP5 ← ORD_S;
24: OP5 ← EQ_US;
25: OP5 ← NGE_UQ;
26: OP5 ← NGT_UQ;
27: OP5 ← FALSE_OS;
28: OP5 ← NEQ_OS;
29: OP5 ← GE_OQ;
30: OP5 ← GT_OQ;
31: OP5 ← TRUE_US;
DEFAULT: Reserved

ESAC;

VCMPPS (VEX.256 encoded version)
CMP0 ← SRC1[31:0] OP5 SRC2[31:0];
CMP1 ← SRC1[63:32] OP5 SRC2[63:32];
CMP2 ← SRC1[95:64] OP5 SRC2[95:64];
CMP3 ← SRC1[127:96] OP5 SRC2[127:96];
CMP4 ← SRC1[159:128] OP5 SRC2[159:128];
CMP5 ← SRC1[191:160] OP5 SRC2[191:160];
CMP6 ← SRC1[223:192] OP5 SRC2[223:192];
CMP7 ← SRC1[255:224] OP5 SRC2[255:224];

IF CMP0 = TRUE
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 00000000H; FI;
IF CMP1 = TRUE
    THEN DEST[63:32] ← FFFFFFFFH;
    ELSE DEST[63:32] ← 00000000H; FI;
IF CMP2 = TRUE
    THEN DEST[95:64] ← FFFFFFFFH;
    ELSE DEST[95:64] ← 00000000H; FI;
IF CMP3 = TRUE
    THEN DEST[127:96] ← FFFFFFFFH;
    ELSE DEST[127:96] ← 00000000H; FI;
IF CMP4 = TRUE
    THEN DEST[159:128] ← FFFFFFFFH;
    ELSE DEST[159:128] ← 00000000H; FI;
IF CMP5 = TRUE
    THEN DEST[191:160] ← FFFFFFFFH;
    ELSE DEST[191:160] ← 00000000H; FI;
IF CMP6 = TRUE
    THEN DEST[223:192] ← FFFFFFFFH;
    ELSE DEST[223:192] ← 00000000H; FI;
IF CMP7 = TRUE
    THEN DEST[255:224] ← FFFFFFFFH;
    ELSE DEST[255:224] ← 00000000H; FI;

VCMPPS (VEX.128 encoded version)
CMP0 ← SRC1[31:0] OP5 SRC2[31:0];
CMP1 ← SRC1[63:32] OP5 SRC2[63:32];
CMP2 ← SRC1[95:64] OP5 SRC2[95:64];
CMP3 ← SRC1[127:96] OP5 SRC2[127:96];
IF CMP0 = TRUE
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 00000000H; FI;
IF CMP1 = TRUE
    THEN DEST[63:32] ← FFFFFFFFH;
    ELSE DEST[63:32] ← 00000000H; FI;
IF CMP2 = TRUE
    THEN DEST[95:64] ← FFFFFFFFH;
    ELSE DEST[95:64] ← 00000000H; FI;
IF CMP3 = TRUE
    THEN DEST[127:96] ← FFFFFFFFH;
    ELSE DEST[127:96] ← 00000000H; FI;
DEST[255:128] ← 0

CMPPS (128-bit Legacy SSE version)
CMP0 ← SRC1[31:0] OP3 SRC2[31:0];
CMP1 ← SRC1[63:32] OP3 SRC2[63:32];
CMP2 ← SRC1[95:64] OP3 SRC2[95:64];
CMP3 ← SRC1[127:96] OP3 SRC2[127:96];
IF CMP0 = TRUE
    THEN DEST[31:0] ← FFFFFFFFH;
    ELSE DEST[31:0] ← 00000000H; FI;
IF CMP1 = TRUE
    THEN DEST[63:32] ← FFFFFFFFH;
    ELSE DEST[63:32] ← 00000000H; FI;
IF CMP2 = TRUE
    THEN DEST[95:64] ← FFFFFFFFH;
    ELSE DEST[95:64] ← 00000000H; FI;
IF CMP3 = TRUE
    THEN DEST[127:96] ← FFFFFFFFH;
    ELSE DEST[127:96] ← 00000000H; FI;
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VCMPPS __m256 _mm256_cmp_ps(__m256 a, __m256 b, const int imm)
VCMPPS __m128 _mm_cmp_ps(__m128 a, __m128 b, const int imm)

SIMD Floating-Point Exceptions
Invalid if SNaN operand; Invalid if QNaN operand and the predicate signals on QNaN as listed in Table 5-9.
Denormal

Other Exceptions
See Exceptions Type 2

CMPSD- Compare Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F C2 /r ib<br>CMPSD xmm1, xmm2/m64, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare low double-precision floating-point value in xmm2/m64 and xmm1 using bits 2:0 of imm8 as comparison predicate.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F C2 /r ib<br>VCMPSD xmm1, xmm2, xmm3/m64, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low double-precision floating-point value in xmm3/m64 and xmm2 using bits 4:0 of imm8 as comparison predicate.</td>
</tr>
</tbody>
</table>

Description

Compares the low double-precision floating-point values in the second source operand and the first source operand and returns the results of the comparison to the destination operand. The comparison predicate operand (immediate operand) specifies the type of comparison performed. The comparison result is a quadword mask of all 1s (comparison true) or all 0s (comparison false).

128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The second source operand (second operand) can be an XMM register or 64-bit memory location. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source operand (third operand) can be an XMM register or a 64-bit memory location. The result is stored in the low quadword of the destination operand; the high quadword is filled with the contents of the high quadword of the first source operand. Bits (255:128) of the destination YMM register are zeroed.
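The way the VEX.128 scalar form assembles its destination can be sketched by modeling the XMM sources as 128-bit integers. The sketch fixes the predicate to less-than for brevity and does not model exceptions; `vcmpsd_lt` and the helpers are hypothetical names.

```python
import struct

MASK64 = (1 << 64) - 1

def double_to_bits(value: float) -> int:
    """Return the 64-bit IEEE 754 pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def bits_to_double(bits: int) -> float:
    """Interpret the low 64 bits of an integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits & MASK64))[0]

def vcmpsd_lt(src1: int, src2: int) -> int:
    """Model VCMPSD with the less-than predicate; returns the 256-bit result."""
    a = bits_to_double(src1)
    b = bits_to_double(src2)
    low = MASK64 if a < b else 0       # quadword mask in bits 63:0
    high = src1 & (MASK64 << 64)       # bits 127:64 copied from first source
    return high | low                  # bits 255:128 remain zero

src1 = (0xDEADBEEF << 64) | double_to_bits(1.0)
src2 = (0x12345678 << 64) | double_to_bits(2.0)
print(hex(vcmpsd_lt(src1, src2)))
```

Only the low quadword carries the comparison result; the high quadword of the second source never reaches the destination.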

The comparison predicate operand is an 8-bit immediate:

- For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see Table 5-9). Bits 5 through 7 of the immediate are reserved.
- For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see the first 8 rows of Table 5-9). Bits 3 through 7 of the immediate are reserved.

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the ordered relationship is true when neither source operand is a NaN.

A subsequent computational instruction that uses the mask result in the destination operand as an input operand will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask of all 1s corresponds to a QNaN.

Note that processors with "CPUID.1H:ECX.AVX = 0" do not implement the "greater-than", "greater-than-or-equal", "not-greater-than", and "not-greater-than-or-equal" predicates. These comparisons can be made either by using the inverse relationship (that is, use the "not-less-than-or-equal" to make a "greater-than" comparison) or by using software emulation. When using software emulation, the program must swap the operands (copying registers when necessary to protect the data that will now be in the destination), and then perform the compare using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7 (Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) under the heading Emulation.

Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand CMPSD instruction, for processors with "CPUID.1H:ECX.AVX = 0". See Table 5-14. Compilers should treat reserved imm8 values as illegal syntax.

Table 5-14. Pseudo-Op and CMPSD Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>CMPSD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMPEQSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 0</td>
</tr>
<tr>
<td>CMPLTSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 1</td>
</tr>
<tr>
<td>CMPLESD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 2</td>
</tr>
<tr>
<td>CMPUNORDSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 3</td>
</tr>
<tr>
<td>CMPNEQSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 4</td>
</tr>
<tr>
<td>CMPNLTSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 5</td>
</tr>
<tr>
<td>CMPNLESD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 6</td>
</tr>
<tr>
<td>CMPORDSD xmm1, xmm2</td>
<td>CMPSD xmm1, xmm2, 7</td>
</tr>
</tbody>
</table>

The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less-than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)

Processors with "CPUID.1H:ECX.AVX = 1" implement the full complement of 32 predicates shown in Table 5-15; software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPSD instruction. See Table 5-15, where reg1, reg2, and reg3 represent either XMM registers or YMM registers. Compilers should treat reserved imm8 values as illegal syntax. Alternatively, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic interface.

### Table 5-15. Pseudo-Op and VCMPSD Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPSD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0</td>
</tr>
<tr>
<td>VCMPLTSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1</td>
</tr>
<tr>
<td>VCMPLESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 2</td>
</tr>
<tr>
<td>VCMPUNORDSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 3</td>
</tr>
<tr>
<td>VCMPNEQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 4</td>
</tr>
<tr>
<td>VCMPNLTSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 5</td>
</tr>
<tr>
<td>VCMPNLESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 6</td>
</tr>
<tr>
<td>VCMPORDSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 7</td>
</tr>
<tr>
<td>VCMPEQ_UQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 8</td>
</tr>
<tr>
<td>VCMPNGESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 9</td>
</tr>
<tr>
<td>VCMPNGTSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0AH</td>
</tr>
<tr>
<td>VCMPFALSESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0BH</td>
</tr>
<tr>
<td>VCMPNEQ_OQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0CH</td>
</tr>
<tr>
<td>VCMPGESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0DH</td>
</tr>
<tr>
<td>VCMPGTSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0EH</td>
</tr>
<tr>
<td>VCMPTRUESD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 0FH</td>
</tr>
<tr>
<td>VCMPEQ_OSSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 10H</td>
</tr>
<tr>
<td>VCMPLT_OQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 11H</td>
</tr>
<tr>
<td>VCMPLE_OQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 12H</td>
</tr>
<tr>
<td>VCMPUNORD_SSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 13H</td>
</tr>
<tr>
<td>VCMPNEQ_USSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 14H</td>
</tr>
<tr>
<td>VCMPNLT_UQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 15H</td>
</tr>
<tr>
<td>VCMPNLE_UQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 16H</td>
</tr>
<tr>
<td>VCMPORD_SSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 17H</td>
</tr>
<tr>
<td>VCMPEQ_USSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 18H</td>
</tr>
<tr>
<td>VCMPNGE_UQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 19H</td>
</tr>
<tr>
<td>VCMPNGT_UQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1AH</td>
</tr>
<tr>
<td>VCMPFALSE_OSSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1BH</td>
</tr>
<tr>
<td>VCMPNEQ_OSSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1CH</td>
</tr>
<tr>
<td>VCMPGE_OQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1DH</td>
</tr>
</tbody>
</table>

Software should ensure VCMPSD is encoded with VEX.L=0. Encoding VCMPSD with VEX.L=1 may result in unpredictable behavior across different processor generations.

Operation

CASE (COMPARISON PREDICATE) OF

0: OP3 ← EQ_OQ; OP5 ← EQ_OQ;
1: OP3 ← LT_OS; OP5 ← LT_OS;
2: OP3 ← LE_OS; OP5 ← LE_OS;
3: OP3 ← UNORD_Q; OP5 ← UNORD_Q;
4: OP3 ← NEQ_UQ; OP5 ← NEQ_UQ;
5: OP3 ← NLT_US; OP5 ← NLT_US;
6: OP3 ← NLE_US; OP5 ← NLE_US;
7: OP3 ← ORD_Q; OP5 ← ORD_Q;
8: OP5 ← EQ_UQ;
9: OP5 ← NGE_US;
10: OP5 ← NGT_US;
11: OP5 ← FALSE_OQ;
12: OP5 ← NEQ_OQ;
13: OP5 ← GE_OS;
14: OP5 ← GT_OS;
15: OP5 ← TRUE_UQ;
16: OP5 ← EQ_OS;
17: OP5 ← LT_OQ;
18: OP5 ← LE_OQ;
19: OP5 ← UNORD_S;
20: OP5 ← NEQ_US;
21: OP5 ← NLT_UQ;
22: OP5 ← NLE_UQ;
23: OP5 ← ORD_S;
24: OP5 ← EQ_US;
25: OP5 ← NGE_UQ;
26: OP5 ← NGT_UQ;
27: OP5 ← FALSE_OS;
28: OP5 ← NEQ_OS;
29: OP5 ← GE_OQ;
30: OP5 ← GT_OQ;
31: OP5 ← TRUE_US;

DEFAULT: Reserved
ESAC;

Table 5-15. Pseudo-Op and VCMPSD Implementation (Continued)

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPSD Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPGT_OQSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1EH</td>
</tr>
<tr>
<td>VCMPTRUE_USSD reg1, reg2, reg3</td>
<td>VCMPSD reg1, reg2, reg3, 1FH</td>
</tr>
</tbody>
</table>

**CMPSD (128-bit Legacy SSE version)**

CMP0 ← DEST[63:0] OP3 SRC[63:0];
IF CMP0 = TRUE
THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] ← 0000000000000000H; FI;
DEST[255:64] (Unmodified)

**VCMPSD (VEX.128 encoded version)**

CMP0 ← SRC1[63:0] OP5 SRC2[63:0];
IF CMP0 = TRUE
THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] ← 0000000000000000H; FI;
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VCMPSD __m128d _mm_cmp_sd(__m128d a, __m128d b, const int imm)

**SIMD Floating-Point Exceptions**

Invalid if SNaN operand; Invalid if QNaN operand and the predicate signals on QNaN as listed in Table 5-9.
Denormal

**Other Exceptions**

See Exceptions Type 3

CMPSS- Compare Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 OF C2 /r ib</td>
<td>V/V</td>
<td>SSE</td>
<td>Compare low single precision floating-point value in xmm2/m32 and xmm1 using bits 2:0 of imm8 as comparison predicate.</td>
</tr>
<tr>
<td>CMPSS xmm1, xmm2/m32, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F C2 /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low single precision floating-point value in xmm3/m32 and xmm2 using bits 4:0 of imm8 as comparison predicate.</td>
</tr>
<tr>
<td>VCMPSS xmm1, xmm2, xmm3/m32, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Compares the low single-precision floating-point values in the second source operand and the first source operand and returns the results of the comparison to the destination operand. The comparison predicate operand (immediate operand) specifies the type of comparison performed. The comparison result is a doubleword mask of all 1s (comparison true) or all 0s (comparison false).

128-bit Legacy SSE version: The first source and destination operand (first operand) is an XMM register. The second source operand (second operand) can be an XMM register or 32-bit memory location. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The first source operand (second operand) is an XMM register. The second source operand (third operand) can be an XMM register or a 32-bit memory location. The result is stored in the low 32 bits of the destination operand; bits 127:32 of the destination operand are copied from the first source operand. Bits (255:128) of the destination YMM register are zeroed.

The comparison predicate operand is an 8-bit immediate:

- For instructions encoded using the VEX prefix, bits 4:0 define the type of comparison to be performed (see Table 5-9). Bits 5 through 7 of the immediate are reserved.
- For instruction encodings that do not use VEX prefix, bits 2:0 define the type of comparison to be made (see the first 8 rows of Table 5-9). Bits 3 through 7 of the immediate are reserved.
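The immediate layout in the two bullets above can be sketched as a small C check. This is an illustrative sketch only: the names `cmp_imm8_is_defined`, `cmp_predicate_index`, and the `vex_encoded` parameter are hypothetical, and a real assembler may diagnose reserved values differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when imm8 sets no reserved bits.
 * VEX-encoded forms use bits 4:0 (32 predicates); legacy SSE forms
 * use bits 2:0 (8 predicates). */
int cmp_imm8_is_defined(uint8_t imm8, int vex_encoded)
{
    uint8_t predicate_mask = vex_encoded ? 0x1F : 0x07;
    return (imm8 & (uint8_t)~predicate_mask) == 0;
}

/* Hypothetical helper: extracts the predicate number (row of Table 5-9)
 * from an immediate that has already been validated. */
unsigned cmp_predicate_index(uint8_t imm8, int vex_encoded)
{
    assert(cmp_imm8_is_defined(imm8, vex_encoded));
    return imm8 & (vex_encoded ? 0x1Fu : 0x07u);
}
```

For example, an immediate of 1EH is a defined predicate only for the VEX-encoded form; for the legacy form it sets reserved bits 3 and 4.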

The unordered relationship is true when at least one of the two source operands being compared is a NaN; the ordered relationship is true when neither source operand is a NaN.

A subsequent computational instruction that uses the mask result in the destination operand as an input operand will not generate an exception, because a mask of all 0s corresponds to a floating-point value of +0.0 and a mask of all 1s corresponds to a QNaN.
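One common way to consume such a mask (an assumption here, not something this section prescribes) is a branchless select built from bitwise AND, AND-NOT, and OR. The C sketch below models that idea on a single 32-bit lane; the function name `select_by_mask` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branchless select on one lane: returns a when mask is all 1s
 * (comparison true) and b when mask is all 0s (comparison false). */
float select_by_mask(uint32_t mask, float a, float b)
{
    uint32_t a_bits, b_bits, r_bits;
    float result;

    /* A compare instruction only ever produces these two mask values. */
    assert(mask == 0u || mask == 0xFFFFFFFFu);

    memcpy(&a_bits, &a, sizeof a_bits);   /* reinterpret float bits safely */
    memcpy(&b_bits, &b, sizeof b_bits);
    r_bits = (a_bits & mask) | (b_bits & ~mask);
    memcpy(&result, &r_bits, sizeof result);
    return result;
}
```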

Note that processors with "CPUID.1H:ECX.AVX =0" do not implement the "greater-than", "greater-than-or-equal", "not-greater-than", and "not-greater-than-or-equal" relation predicates. These comparisons can be made either by using the inverse relationship (that is, use the "not-less-than-or-equal" to make a "greater-than" comparison) or by using software emulation. When using software emulation, the program must swap the operands (copying registers when necessary to protect the data that will now be in the destination), and then perform the compare using a different predicate. The predicate to be used for these emulations is listed in the first 8 rows of Table 3-7 (Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) under the heading Emulation.
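The two approaches described above can be sketched in plain C. This is a scalar model written for illustration, not processor behavior copied from the manual: the names `cmpss_model`, `cmpgt_by_swap`, and `cmpgt_by_inverse` are hypothetical, and exception signaling is not modeled. The sketch also shows that the two approaches agree for ordered inputs but differ when an operand is a NaN, because the "not-less-than-or-equal" predicate is true for unordered operands.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical scalar model of the eight legacy CMPSS predicates.
 * Returns the doubleword mask the instruction would write. */
uint32_t cmpss_model(float dest, float src, int imm3)
{
    int unordered = isnan(dest) || isnan(src);
    int result;

    assert(imm3 >= 0 && imm3 <= 7);
    switch (imm3) {
    case 0:  result = !unordered && dest == src;   break; /* EQ    */
    case 1:  result = !unordered && dest <  src;   break; /* LT    */
    case 2:  result = !unordered && dest <= src;   break; /* LE    */
    case 3:  result = unordered;                   break; /* UNORD */
    case 4:  result = unordered || dest != src;    break; /* NEQ   */
    case 5:  result = unordered || !(dest <  src); break; /* NLT   */
    case 6:  result = unordered || !(dest <= src); break; /* NLE   */
    default: result = !unordered;                  break; /* ORD   */
    }
    return result ? 0xFFFFFFFFu : 0u;
}

/* a > b emulated by swapping the operands and using less-than. */
uint32_t cmpgt_by_swap(float a, float b)
{
    return cmpss_model(b, a, 1);
}

/* a > b approximated by the inverse relation not-less-than-or-equal. */
uint32_t cmpgt_by_inverse(float a, float b)
{
    return cmpss_model(a, b, 6);
}
```

With ordered inputs both helpers return the same mask. With a NaN input, `cmpgt_by_swap` returns all 0s while `cmpgt_by_inverse` returns all 1s, so the choice matters when NaNs can occur.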

Compilers and assemblers may implement the following two-operand pseudo-ops in addition to the three-operand CMPSS instruction for processors with "CPUID.1H:ECX.AVX =0". See Table 5-16. Compilers should treat reserved imm8 values as illegal syntax.

Table 5-16. Pseudo-Op and CMPSS Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>CMPSS Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMPEQSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 0</td>
</tr>
<tr>
<td>CMPLTSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 1</td>
</tr>
<tr>
<td>CMPLESS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 2</td>
</tr>
<tr>
<td>CMPUNORDSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 3</td>
</tr>
<tr>
<td>CMPNEQSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 4</td>
</tr>
<tr>
<td>CMPNLTSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 5</td>
</tr>
<tr>
<td>CMPNLESS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 6</td>
</tr>
<tr>
<td>CMPORDSS xmm1, xmm2</td>
<td>CMPSS xmm1, xmm2, 7</td>
</tr>
</tbody>
</table>

The greater-than relations that the processor does not implement require more than one instruction to emulate in software and therefore should not be implemented as pseudo-ops. (For these, the programmer should reverse the operands of the corresponding less than relations and use move instructions to ensure that the mask is moved to the correct destination register and that the source operand is left intact.)

Processors with "CPUID.1H:ECX.AVX =1" implement the full complement of 32 predicates shown in Table 5-15, so software emulation is no longer needed. Compilers and assemblers may implement the following three-operand pseudo-ops in addition to the four-operand VCMPSS instruction. See Table 5-17, where the notations reg1, reg2, and reg3 represent either XMM registers or YMM registers. Compilers should treat reserved imm8 values as illegal syntax. Alternatively, intrinsics can map the pseudo-ops to pre-defined constants to support a simpler intrinsic interface.

### Table 5-17. Pseudo-Op and VCMPSS Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op</th>
<th>VCMPSS Implementation</th>
</tr>
</thead>
<tbody>
<tr>
<td>VCMPEQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0</td>
</tr>
<tr>
<td>VCMPLTSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1</td>
</tr>
<tr>
<td>VCMPLESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 2</td>
</tr>
<tr>
<td>VCMPUNORDSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 3</td>
</tr>
<tr>
<td>VCMPNEQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 4</td>
</tr>
<tr>
<td>VCMPNLTSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 5</td>
</tr>
<tr>
<td>VCMPNLESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 6</td>
</tr>
<tr>
<td>VCMPORDSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 7</td>
</tr>
<tr>
<td>VCMPEQ_UQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 8</td>
</tr>
<tr>
<td>VCMPNGESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 9</td>
</tr>
<tr>
<td>VCMPNGTSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0AH</td>
</tr>
<tr>
<td>VCMPFALSESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0BH</td>
</tr>
<tr>
<td>VCMPNEQ_OQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0CH</td>
</tr>
<tr>
<td>VCMPGESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0DH</td>
</tr>
<tr>
<td>VCMPGTSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0EH</td>
</tr>
<tr>
<td>VCMPTRUESS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 0FH</td>
</tr>
<tr>
<td>VCMPEQ_OSSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 10H</td>
</tr>
<tr>
<td>VCMPLT_OQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 11H</td>
</tr>
<tr>
<td>VCMPLE_OQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 12H</td>
</tr>
<tr>
<td>VCMPUNORD_SSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 13H</td>
</tr>
<tr>
<td>VCMPNEQ_USSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 14H</td>
</tr>
<tr>
<td>VCMPNLT_UQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 15H</td>
</tr>
<tr>
<td>VCMPNLE_UQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 16H</td>
</tr>
<tr>
<td>VCMPORD_SSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 17H</td>
</tr>
<tr>
<td>VCMPEQ_USSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 18H</td>
</tr>
<tr>
<td>VCMPNGE_UQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 19H</td>
</tr>
<tr>
<td>VCMPNGT_UQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1AH</td>
</tr>
<tr>
<td>VCMPFALSE_OSSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1BH</td>
</tr>
<tr>
<td>VCMPNEQ_OSSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1CH</td>
</tr>
<tr>
<td>VCMPGE_OQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1DH</td>
</tr>
<tr>
<td>VCMPGT_OQSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1EH</td>
</tr>
<tr>
<td>VCMPTRUE_USSS reg1, reg2, reg3</td>
<td>VCMPSS reg1, reg2, reg3, 1FH</td>
</tr>
</tbody>
</table>

Software should ensure VCMPSS is encoded with VEX.L=0. Encoding VCMPSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

**Operation**

CASE (COMPARISON PREDICATE) OF

- 0: OP3 ← EQ_OQ; OP5 ← EQ_OQ;
- 1: OP3 ← LT_OS; OP5 ← LT_OS;
- 2: OP3 ← LE_OS; OP5 ← LE_OS;
- 3: OP3 ← UNORD_Q; OP5 ← UNORD_Q;
- 4: OP3 ← NEQ_UQ; OP5 ← NEQ_UQ;
- 5: OP3 ← NLT_US; OP5 ← NLT_US;
- 6: OP3 ← NLE_US; OP5 ← NLE_US;
- 7: OP3 ← ORD_Q; OP5 ← ORD_Q;
- 8: OP5 ← EQ_UQ;
- 9: OP5 ← NGE_US;
- 10: OP5 ← NGT_US;
- 11: OP5 ← FALSE_OQ;
- 12: OP5 ← NEQ_OQ;
- 13: OP5 ← GE_OS;
- 14: OP5 ← GT_OS;
- 15: OP5 ← TRUE_UQ;
- 16: OP5 ← EQ_OS;
- 17: OP5 ← LT_OQ;
- 18: OP5 ← LE_OQ;
- 19: OP5 ← UNORD_S;
- 20: OP5 ← NEQ_US;
- 21: OP5 ← NLT_UQ;
- 22: OP5 ← NLE_UQ;
- 23: OP5 ← ORD_S;
- 24: OP5 ← EQ_US;
- 25: OP5 ← NGE_UQ;
- 26: OP5 ← NGT_UQ;
- 27: OP5 ← FALSE_OS;
- 28: OP5 ← NEQ_OS;
- 29: OP5 ← GE_OQ;
- 30: OP5 ← GT_OQ;
- 31: OP5 ← TRUE_US;
DEFAULT: Reserved

ESAC;

**CMPSS (128-bit Legacy SSE version)**

CMP0 ← DEST[31:0] OP3 SRC[31:0];
IF CMP0 = TRUE
THEN DEST[31:0] ← FFFFFFFFH;
ELSE DEST[31:0] ← 00000000H; FI;
DEST[255:32] (Unmodified)

**VCMPSS (VEX.128 encoded version)**

CMP0 ← SRC1[31:0] OP5 SRC2[31:0];
IF CMP0 = TRUE
THEN DEST[31:0] ← FFFFFFFFH;
ELSE DEST[31:0] ← 00000000H; FI;
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VCMPSS __m128 _mm_cmp_ss(__m128 a, __m128 b, const int imm)

**SIMD Floating-Point Exceptions**

Invalid if SNaN operand, Invalid if QNaN and predicate as listed in Table 5-9,
Denormal.

**Other Exceptions**

See Exceptions Type 3

COMISD- Compare Scalar Ordered Double-Precision Floating-Point Values and Set EFLAGS

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 2F /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>COMISD xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F 2F /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low double precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>VCOMISD xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Compares the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 64 bit memory location. The COMISD instruction differs from the UCOMISD instruction in that it signals a SIMD floating-point invalid operation exception (#I) when a source operand is either a QNaN or SNaN. The UCOMISD instruction signals an invalid numeric exception only if a source operand is an SNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCOMISD is encoded with VEX.L=0. Encoding VCOMISD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

COMISD (all versions)

RESULT ← OrderedCompare(DEST[63:0] <> SRC[63:0]) {

(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED: ZF,PF,CF ← 111;
    GREATER_THAN: ZF,PF,CF ← 000;
    LESS_THAN: ZF,PF,CF ← 001;
    EQUAL: ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0; }
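
The flag assignments in the pseudo-code above can be modeled in plain C as follows. This is a sketch under stated assumptions: the function names `comisd_flags_model` and `flags_indicate_above` are hypothetical, the three flags are packed into one integer as ZF (bit 2), PF (bit 1), CF (bit 0) to mirror the "ZF,PF,CF" notation, and exception signaling is not modeled.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical model of the COMISD flag results.
 * Returns ZF,PF,CF packed as a 3-bit value: ZF = bit 2, PF = bit 1, CF = bit 0. */
unsigned comisd_flags_model(double a, double b)
{
    if (isnan(a) || isnan(b))
        return 7u;          /* UNORDERED:    ZF,PF,CF = 111 */
    if (a > b)
        return 0u;          /* GREATER_THAN: ZF,PF,CF = 000 */
    if (a < b)
        return 1u;          /* LESS_THAN:    ZF,PF,CF = 001 */
    return 4u;              /* EQUAL:        ZF,PF,CF = 100 */
}

/* The "above" branch condition (CF = 0 and ZF = 0) holds only for the
 * greater-than result, so an unordered compare does not satisfy it. */
int flags_indicate_above(unsigned zf_pf_cf)
{
    assert(zf_pf_cf <= 7u);
    return (zf_pf_cf & 5u) == 0u;   /* mask keeps ZF (bit 2) and CF (bit 0) */
}
```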

**Intel C/C++ Compiler Intrinsic Equivalent**

```c
int _mm_comieq_sd (__m128d a, __m128d b)
int _mm_comilt_sd (__m128d a, __m128d b)
int _mm_comile_sd (__m128d a, __m128d b)
int _mm_comigt_sd (__m128d a, __m128d b)
int _mm_comige_sd (__m128d a, __m128d b)
int _mm_comineq_sd (__m128d a, __m128d b)
```

**SIMD Floating-Point Exceptions**

Invalid (if SNaN or QNaN operands), Denormal.

**Other Exceptions**

See Exceptions Type 3; additionally

#UD If VEX.vvvv != 1111B.
COMISS- Compare Scalar Ordered Single-Precision Floating-Point Values and Set EFLAGS

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 2F /r</td>
<td>V/V</td>
<td>SSE</td>
<td>Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>COMISS xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.0F 2F /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low single precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>VCOMISS xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Compares the single-precision floating-point values in the low doublewords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 32 bit memory location.

The COMISS instruction differs from the UCOMISS instruction in that it signals a SIMD floating-point invalid operation exception (#I) when a source operand is either a QNaN or SNaN. The UCOMISS instruction signals an invalid numeric exception only if a source operand is an SNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCOMISS is encoded with VEX.L=0. Encoding VCOMISS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

COMISS (all versions)

RESULT ← OrderedCompare(DEST[31:0] <> SRC[31:0]) {

(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED: ZF,PF,CF ← 111;
    GREATER_THAN: ZF,PF,CF ← 000;
    LESS_THAN: ZF,PF,CF ← 001;
    EQUAL: ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0; }

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_comieq_ss (__m128 a, __m128 b)
int _mm_comilt_ss (__m128 a, __m128 b)
int _mm_comile_ss (__m128 a, __m128 b)
int _mm_comigt_ss (__m128 a, __m128 b)
int _mm_comige_ss (__m128 a, __m128 b)
int _mm_comineq_ss (__m128 a, __m128 b)

SIMD Floating-Point Exceptions
Invalid (if SNaN or QNaN operands), Denormal.

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.
CVTDQ2PD- Convert Packed Doubleword Integers to Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F E6 /r CVTDQ2PD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert two packed signed doubleword integers from xmm2/mem to two packed double-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F E6 /r VCVTDQ2PD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert two packed signed doubleword integers from xmm2/mem to two packed double-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.256.F3.0F E6 /r VCVTDQ2PD ymm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed signed doubleword integers from xmm2/mem to four packed double-precision floating-point values in ymm1</td>
</tr>
</tbody>
</table>

Description

Converts two or four packed signed doubleword integers in the source operand (second operand) to two or four packed double-precision floating-point values in the destination operand (first operand).

VEX.256 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.
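
The difference between the three encodings lies mainly in how bits 255:128 of the destination are treated. The C sketch below models a YMM register as four double-precision lanes to make that visible. The type `ymm_model` and the three function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 256-bit register holding four doubles;
 * lane[0] is bits 63:0 and lane[3] is bits 255:192. */
typedef struct { double lane[4]; } ymm_model;

/* Legacy SSE CVTDQ2PD: writes bits 127:0, leaves bits 255:128 unmodified. */
void cvtdq2pd_legacy_model(ymm_model *dest, const int32_t src[2])
{
    assert(dest != NULL && src != NULL);
    dest->lane[0] = (double)src[0];
    dest->lane[1] = (double)src[1];
}

/* VEX.128 VCVTDQ2PD: writes bits 127:0 and zeroes bits 255:128. */
void vcvtdq2pd_128_model(ymm_model *dest, const int32_t src[2])
{
    assert(dest != NULL && src != NULL);
    dest->lane[0] = (double)src[0];
    dest->lane[1] = (double)src[1];
    dest->lane[2] = 0.0;
    dest->lane[3] = 0.0;
}

/* VEX.256 VCVTDQ2PD: converts four doublewords and writes all 256 bits. */
void vcvtdq2pd_256_model(ymm_model *dest, const int32_t src[4])
{
    int i;

    assert(dest != NULL && src != NULL);
    for (i = 0; i < 4; i++)
        dest->lane[i] = (double)src[i];
}
```

Every 32-bit signed integer is exactly representable in double precision, which is consistent with this instruction listing no SIMD floating-point exceptions.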

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
Operation

**VCVTDQ2PD (VEX.256 encoded version)**

- DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
- DEST[127:64] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
- DEST[191:128] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[95:64])
- DEST[255:192] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[127:96])

**VCVTDQ2PD (VEX.128 encoded version)**

- DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
- DEST[127:64] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
- DEST[255:128] ← 0

**CVTDQ2PD (128-bit Legacy SSE version)**

- DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0])
- DEST[127:64] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:32])
- DEST[255:128] (unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

- `VCVTDQ2PD __m256d _mm256_cvtepi32_pd (__m128i src)`
- `CVTDQ2PD __m128d _mm_cvtepi32_pd (__m128i src)`

**Other Exceptions**

- See Exceptions Type 5; additionally
  - #UD If VEX.vvvv != 1111B.
CVTDQ2PS- Convert Packed Doubleword Integers to Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5B /r CVTDQ2PS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert four packed signed doubleword integers from xmm2/mem to four packed single-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 5B /r VCVTDQ2PS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed signed doubleword integers from xmm2/mem to four packed single-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.256.0F 5B /r VCVTDQ2PS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert eight packed signed doubleword integers from ymm2/mem to eight packed single-precision floating-point values in ymm1</td>
</tr>
</tbody>
</table>

Description

Converts four or eight packed signed doubleword integers in the source operand to four or eight packed single-precision floating-point values in the destination operand.

VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VCVTDQ2PS (VEX.256 encoded version)
DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[159:128] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[159:128])
DEST[191:160] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[191:160])
DEST[223:192] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[223:192])
DEST[255:224] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[255:224])

VCVTDQ2PS (VEX.128 encoded version)
DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[255:128] ← 0

CVTDQ2PS (128-bit Legacy SSE version)
DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0])
DEST[63:32] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:32])
DEST[95:64] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[95:64])
DEST[127:96] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[127:96])
DEST[255:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VCVTDQ2PS __m256 _mm256_cvtepi32_ps (__m256i src)
CVTDQ2PS __m128 _mm_cvtepi32_ps (__m128i src)

SIMD Floating-Point Exceptions

Precision
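
The Precision exception is possible because single precision carries a 24-bit significand, so a doubleword integer whose magnitude needs more than 24 significant bits cannot be converted exactly. The C sketch below models one lane of the conversion and reports whether the result was inexact. The function name `cvtdq2ps_lane_model` and its `inexact` output parameter are hypothetical; the sketch relies on the compiler's own int-to-float conversion and does not read or modify MXCSR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-lane model: converts a doubleword integer to single
 * precision and sets *inexact to 1 when the value had to be rounded. */
float cvtdq2ps_lane_model(int32_t v, int *inexact)
{
    /* volatile forces the value to be rounded to single precision
     * even on targets that evaluate in a wider format. */
    volatile float f = (float)v;

    assert(inexact != NULL);
    /* double represents every int32_t and every float exactly,
     * so this comparison detects any rounding. */
    *inexact = ((double)f != (double)v);
    return f;
}
```

For example, 16777216 (2^24) converts exactly, while 16777217 does not, which corresponds to the case in which the instruction would report a Precision exception.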

Other Exceptions

See Exceptions Type 2; additionally

#UD If VEX.vvvv != 1111B.

CVTPD2DQ- Convert Packed Double-Precision Floating-point values to Packed Doubleword Integers

<table>
<thead>
<tr>
<th>Opcode Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F E6 /r CVTPD2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.F2.0F E6 /r VCVTPD2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1</td>
</tr>
<tr>
<td>VEX.256.F2.0F E6 /r VCVTPD2DQ xmm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed double-precision floating-point values in ymm2/mem to four signed doubleword integers in xmm1</td>
</tr>
</tbody>
</table>

Description

Converts two or four packed double-precision floating-point values in the source operand (second operand) to two or four packed signed doubleword integers in the destination operand (first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.
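
The rounding and indefinite-value behavior described above can be sketched for a single lane in plain C. The sketch makes two labeled assumptions: it models only the default MXCSR rounding mode (round to nearest, ties to even), and it treats NaN and out-of-range values in either direction as producing the indefinite integer. The function name `cvtpd2dq_lane_model` is hypothetical, and exception flags are not modeled.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical one-lane model of double -> signed doubleword conversion
 * under round-to-nearest-even (the default MXCSR rounding mode). */
int32_t cvtpd2dq_lane_model(double x)
{
    int64_t t;
    double frac;

    /* Values that would round outside [-2^31, 2^31 - 1], and NaN,
     * yield the indefinite integer value 80000000H. */
    if (isnan(x) || x < -2147483648.5 || x >= 2147483647.5)
        return INT32_MIN;

    t = (int64_t)x;             /* truncates toward zero */
    frac = x - (double)t;       /* exact for inputs in this range */

    /* Round to nearest; on a tie move only if t is odd, so the result is even. */
    if (frac > 0.5 || (frac == 0.5 && (t % 2) != 0))
        t++;
    else if (frac < -0.5 || (frac == -0.5 && (t % 2) != 0))
        t--;

    assert(t >= INT32_MIN && t <= INT32_MAX);
    return (int32_t)t;
}
```

Under this model 2.5 converts to 2 and 3.5 converts to 4, while 3.0e9 and NaN both return 80000000H.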

VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:64) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Figure 5-6. VCVTPD2DQ (VEX.256 encoded version)

Operation

**VCVTPD2DQ (VEX.256 encoded version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
- DEST[95:64] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[191:128])
- DEST[127:96] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[255:192])
- DEST[255:128] $\leftarrow$ 0

**VCVTPD2DQ (VEX.128 encoded version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
- DEST[255:64] $\leftarrow$ 0

**CVTPD2DQ (128-bit Legacy SSE version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_Floating_Point_To_Integer(SRC[127:64])
- DEST[127:64] $\leftarrow$ 0
- DEST[255:128] (unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

- VCVTPD2DQ __m128i _mm256_cvtpd_epi32 (__m256d src)
- CVTPD2DQ __m128i _mm_cvtpd_epi32 (__m128d src)

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
CVTPD2PS- Convert Packed Double-Precision Floating-point values to Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5A /r CVTPD2PS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two single-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 5A /r VCVTPD2PS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two single-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 5A /r VCVTPD2PS xmm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed double-precision floating-point values in ymm2/mem to four single-precision floating-point values in xmm1</td>
</tr>
</tbody>
</table>

Description

Converts two or four packed double-precision floating-point values in the source operand (second operand) to two or four packed single-precision floating-point values in the destination operand (first operand).

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:64) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. Bits[127:64] of the destination XMM register are zeroed. However, the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

**Operation**

**VCVTPD2PS (VEX.256 encoded version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
- DEST[95:64] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[191:128])
- DEST[127:96] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[255:192])
- DEST[255:128] $\leftarrow$ 0

**VCVTPD2PS (VEX.128 encoded version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
- DEST[255:64] $\leftarrow$ 0

**CVTPD2PS (128-bit Legacy SSE version)**

- DEST[31:0] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0])
- DEST[63:32] $\leftarrow$ Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[127:64])
- DEST[127:64] $\leftarrow$ 0
- DEST[255:128] (unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

- VCVTPD2PS __m128 _mm256_cvtpd_ps (__m256d a)
- CVTPD2PS __m128 _mm_cvtpd_ps (__m128d a)

SIMD Floating-Point Exceptions
Invalid, Precision, Underflow, Overflow, Denormal

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
**CVTPS2DQ- Convert Packed Single Precision Floating-Point Values to Packed Signed Doubleword Integer Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5B /r CVTPS2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 5B /r VCVTPS2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 5B /r VCVTPS2DQ ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert eight packed single precision floating-point values from ymm2/mem to eight packed signed doubleword values in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Converts four or eight packed single-precision floating-point values in the source operand to four or eight signed doubleword integers in the destination operand.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

**VEX.256 encoded version:** The source operand is a YMM register or 256-bit memory location. The destination operand is a YMM register.

**VEX.128 encoded version:** The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation

VCVTPS2DQ (VEX.256 encoded version)
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[159:128] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[159:128])
DEST[191:160] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[191:160])
DEST[223:192] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[223:192])
DEST[255:224] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[255:224])

VCVTPS2DQ (VEX.128 encoded version)
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[255:128] ← 0

CVTPS2DQ (128-bit Legacy SSE version)
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[127:96])
DEST[255:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VCVTPS2DQ __m256i _mm256_cvtps_epi32 (__m256 a)
CVTPS2DQ __m128i _mm_cvtps_epi32 (__m128 a)

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

See Exceptions Type 2; additionally

#UD If VEX.vvvv != 1111B.

**CVTPS2PD - Convert Packed Single-Precision Floating-Point Values to Packed Double-Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>OPCODE/INSTRUCTION</th>
<th>64/32 bit MODE SUPPORT</th>
<th>CPUID FEATURE FLAG</th>
<th>DESCRIPTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5A /r CVTPS2PD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert two packed single-precision floating-point values in xmm2/mem to two packed double-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 5A /r VCVTPS2PD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert two packed single-precision floating-point values in xmm2/mem to two packed double-precision floating-point values in xmm1</td>
</tr>
<tr>
<td>VEX.256.0F 5A /r VCVTPS2PD ymm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed single-precision floating-point values in xmm2/mem to four packed double-precision floating-point values in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Converts two or four packed single-precision floating-point values in the source operand (second operand) to two or four packed double-precision floating-point values in the destination operand (first operand).
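
Because every single-precision value is exactly representable in double precision, the conversion described above never rounds. The following sketch shows this with the SSE2 intrinsic `_mm_cvtps_pd`, which this section lists for the 128-bit form. It assumes an x86 target with SSE2; the helper names are hypothetical, and the 16-byte load is a convenience of the sketch (the instruction itself reads only the low two floats).

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvtps_pd */

/* Widen the two low floats of in[] to doubles (CVTPS2PD, 128-bit form). */
static void cvtps2pd2(const float in[4], double out[2])
{
    _mm_storeu_pd(out, _mm_cvtps_pd(_mm_loadu_ps(in)));
}

/* Hypothetical self-check of the widening behavior. */
static void cvtps2pd2_selftest(void)
{
    const float in[4] = { 0.1f, -2.5f, 99.0f, 99.0f };
    double out[2];

    cvtps2pd2(in, out);
    assert(out[0] == (double)in[0]);  /* widening is exact ...              */
    assert(out[0] != 0.1);            /* ... so the float's error is kept   */
    assert(out[1] == -2.5);           /* elements 2 and 3 are not converted */
}
```

The second assertion is the practical point: widening does not recover precision that was lost when the value was first stored as a float.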

**VEX.256 encoded version:** The source operand is an XMM register or 128-bit memory location. The destination operand is a YMM register.

**VEX.128 encoded version:** The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

**128-bit Legacy SSE version:** The source operand is an XMM register or 64-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Note:** In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise instructions will #UD.

Figure 5-8. CVTPS2PD (VEX.256 encoded version)

**Operation**

**VCVTPS2PD (VEX.256 encoded version)**

DEST[63:0] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[191:128] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[95:64])
DEST[255:192] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[127:96])

**VCVTPS2PD (VEX.128 encoded version)**

DEST[63:0] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[255:128] ← 0

**CVTPS2PD (128-bit Legacy SSE version)**

DEST[63:0] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0])
DEST[127:64] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[63:32])
DEST[255:128] (unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

VCVTPS2PD __m256d _mm256_cvtps_pd (__m128 a)
CVTPS2PD __m128d _mm_cvtps_pd (__m128 a)

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.
CVTSD2SI- Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 2D /r CVTSD2SI r32, xmm1/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.</td>
</tr>
<tr>
<td>F2 REX.W 0F 2D /r CVTSD2SI r64, xmm1/m64</td>
<td>V/N.E.</td>
<td>SSE2</td>
<td>Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.</td>
</tr>
<tr>
<td>VEX.128.F2.0F.W0 2D /r VCVTSD2SI r32, xmm1/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one double precision floating-point value from xmm1/m64 to one signed doubleword integer r32.</td>
</tr>
<tr>
<td>VEX.128.F2.0F.W1 2D /r VCVTSD2SI r64, xmm1/m64</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one double precision floating-point value from xmm1/m64 to one signed quadword integer sign-extended into r64.</td>
</tr>
</tbody>
</table>

Description

Converts a double-precision floating-point value in the source operand (second operand) to a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the destination operand (first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

Legacy SSE instructions: Use of the REX.W prefix promotes the instruction to 64-bit operation. See the summary chart at the beginning of this section for encoding data and limits.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCVTSD2SI is encoded with VEX.L=0. Encoding VCVTSD2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

(V)CVTSD2SI
IF 64-Bit Mode and OperandSize = 64
THEN
    DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
ELSE
    DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_Integer(SRC[63:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_cvtsd_si32(__m128d a)

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

See Exceptions Type 3; additionally
#UD If VEX.vvvv ≠ 1111B.
CVTSD2SS- Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value

Description
Converts a double-precision floating-point value in the second source operand to a single-precision floating-point value in the destination operand.

When the second source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register. The result is stored in the low doubleword of the destination operand, and the upper 3 doublewords are copied from the upper 3 doublewords of the first source operand. When the conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VCVTSD2SS is encoded with VEX.L=0. Encoding VCVTSD2SS with VEX.L=1 may encounter unpredictable behavior across different processor generations.
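
The merge behavior described above (converted value in the low doubleword, upper three doublewords taken from the first source) can be sketched with the SSE2 intrinsic `_mm_cvtsd_ss` listed for this instruction. The sketch assumes an x86 target with SSE2 and the default MXCSR rounding mode (round to nearest); `cvtsd2ss_merge` and its self-check are hypothetical helper names.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvtsd_ss */

/* out[0] = (float)b_low; out[1..3] = a[1..3] (CVTSD2SS merge semantics). */
static void cvtsd2ss_merge(const float a[4], double b_low, float out[4])
{
    __m128 r = _mm_cvtsd_ss(_mm_loadu_ps(a), _mm_set_sd(b_low));
    _mm_storeu_ps(out, r);
}

/* Hypothetical self-check; assumes default MXCSR (round to nearest). */
static void cvtsd2ss_selftest(void)
{
    const float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    const double d = 1.0 + 1.0 / 1073741824.0;  /* 1 + 2^-30: not a float */
    float out[4];

    cvtsd2ss_merge(a, d, out);
    assert(out[0] == 1.0f);  /* inexact: rounded to the nearest float */
    assert(out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);
}
```

The input 1 + 2^-30 lies far closer to 1.0 than to the next representable float, so the inexact result rounds down to 1.0f while the upper elements pass through unchanged.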

Operation

VCVTSD2SS (VEX.128 encoded version)
DEST[31:0] ← Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC2[63:0]);
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

CVTSD2SS (128-bit Legacy SSE version)
DEST[31:0] ← Convert_Double_Precision_To_Single_Precision_Floating_Point(SRC[63:0]);
(* DEST[255:32] Unmodified *)

Intel C/C++ Compiler Intrinsic Equivalent

CVTSD2SS __m128 _mm_cvtsd_ss (__m128 a, __m128d b)

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
CVTSI2SD- Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>OPCODE/INSTRUCTION</th>
<th>64/32 BIT MODE SUPPORT</th>
<th>CPUID FEATURE FLAG</th>
<th>DESCRIPTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>CVTSI2SD xmm1, r32/m32</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert one signed doubleword integer from r32/m32 to one double-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>CVTSI2SD xmm1, r/m64</td>
<td>V/N.E.</td>
<td>SSE2</td>
<td>Convert one signed quadword integer from r/m64 to one double-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VCVTSI2SD xmm1, xmm2, r/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one signed doubleword integer from r/m32 to one double-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VCVTSI2SD xmm1, xmm2, r/m64</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one signed quadword integer from r/m64 to one double-precision floating-point value in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the second source operand to a double-precision floating-point value in the destination operand. The result is stored in the low quadword of the destination operand, and the high quadword is left unchanged. When conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

Legacy SSE instructions: Use of the REX.W prefix promotes the instruction to 64-bit operands. See the summary chart at the beginning of this section for encoding data and limits.

The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VCVTSI2SD is encoded with VEX.L=0. Encoding VCVTSI2SD with VEX.L=1 may encounter unpredictable behavior across different processor generations.
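
A short sketch of the doubleword form, using the SSE2 intrinsic `_mm_cvtsi32_sd` listed for this instruction. It assumes an x86 target with SSE2; the helper names are hypothetical. Since double precision carries a 53-bit significand, every 32-bit integer converts exactly, so only the quadword form can produce an inexact (rounded) result.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvtsi32_sd */
#include <stdint.h>

/* out[0] = (double)n; out[1] = a[1] (CVTSI2SD with a 32-bit source). */
static void cvtsi2sd_merge(const double a[2], int32_t n, double out[2])
{
    _mm_storeu_pd(out, _mm_cvtsi32_sd(_mm_loadu_pd(a), n));
}

/* Hypothetical self-check of the exact conversion and the merge. */
static void cvtsi2sd_selftest(void)
{
    const double a[2] = { 1.5, 7.25 };
    double out[2];

    cvtsi2sd_merge(a, INT32_MIN, out);
    assert(out[0] == -2147483648.0);  /* exact, no rounding needed  */
    assert(out[1] == 7.25);           /* high quadword passes through */
}
```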

Operation

VCVTSI2SD
IF 64-Bit Mode And OperandSize = 64
THEN
    DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC2[63:0]);
ELSE
    DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

CVTSI2SD
IF 64-Bit Mode And OperandSize = 64
THEN
    DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[63:0]);
ELSE
    DEST[63:0] ← Convert_Integer_To_Double_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[255:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

CVTSI2SD __m128d _mm_cvtsi32_sd (__m128d a, int b)

SIMD Floating-Point Exceptions

Precision

Other Exceptions
See Exceptions Type 3

CVTSI2SS- Convert Doubleword Integer to Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 2A /r CVTSI2SS xmm1, r/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Convert one signed doubleword integer from r/m32 to one single-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>F3 REX.W 0F 2A /r CVTSI2SS xmm1, r/m64</td>
<td>V/N.E.</td>
<td>SSE</td>
<td>Convert one signed quadword integer from r/m64 to one single-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F.W0 2A /r VCVTSI2SS xmm1, xmm2, r/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one signed doubleword integer from r/m32 to one single-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F.W1 2A /r VCVTSI2SS xmm1, xmm2, r/m64</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one signed quadword integer from r/m64 to one single-precision floating-point value in xmm1.</td>
</tr>
</tbody>
</table>

Description

Converts a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the source operand (second operand) to a single-precision floating-point value in the destination operand (first operand). The source operand can be a general-purpose register or a memory location. The destination operand is an XMM register. The result is stored in the low doubleword of the destination operand, and the upper three doublewords are left unchanged. When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register.

Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operands. See the summary chart at the beginning of this section for encoding data and limits.

The second source operand can be a general-purpose register or a 32/64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VCVTSI2SS is encoded with VEX.L=0. Encoding VCVTSI2SS with VEX.L=1 may encounter unpredictable behavior across different processor generations.
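
Unlike the double-precision case, a 32-bit integer does not always fit in the 24-bit significand of a single-precision value, so this conversion can be inexact. The sketch below uses the SSE intrinsic `_mm_cvtsi32_ss` listed for this instruction. It assumes an x86 target with SSE and the default MXCSR rounding mode (round to nearest even); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE: _mm_cvtsi32_ss */

/* out[0] = (float)n; out[1..3] = a[1..3] (CVTSI2SS merge semantics). */
static void cvtsi2ss_merge(const float a[4], int32_t n, float out[4])
{
    _mm_storeu_ps(out, _mm_cvtsi32_ss(_mm_loadu_ps(a), n));
}

/* Hypothetical self-check; assumes default MXCSR (round to nearest even). */
static void cvtsi2ss_selftest(void)
{
    const float a[4] = { 9.0f, 2.0f, 3.0f, 4.0f };
    float out[4];

    cvtsi2ss_merge(a, 16777217, out);   /* 2^24 + 1 is not a float */
    assert(out[0] == 16777216.0f);      /* tie rounds to even significand */
    assert(out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f);
}
```

Values up to 2^24 in magnitude convert exactly; above that, only some integers are representable and the rest are rounded.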

Operation

VCVTSI2SS (VEX.128 encoded version)
IF 64-Bit Mode And OperandSize = 64
THEN
    DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC2[63:0]);
ELSE
    DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC2[31:0]);
FI;
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

CVTSI2SS (128-bit Legacy SSE version)
IF 64-Bit Mode And OperandSize = 64
THEN
    DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[63:0]);
ELSE
    DEST[31:0] ← Convert_Integer_To_Single_Precision_Floating_Point(SRC[31:0]);
FI;
DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

CVTSI2SS __m128 _mm_cvtsi32_ss (__m128 a, int b)

SIMD Floating-Point Exceptions

Precision

Other Exceptions

See Exceptions Type 3
CVTSS2SD- Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5A /r CVTSS2SD xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert one single-precision floating-point value in xmm2/m32 to one double-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 5A /r VCVTSS2SD xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one single-precision floating-point value in xmm3/m32 to one double-precision floating-point value and merge with high bits of xmm2.</td>
</tr>
</tbody>
</table>

**Description**

Converts a single-precision floating-point value in the second source operand to a double-precision floating-point value in the destination operand. When the second source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register. The result is stored in the low quadword of the destination operand, and the high quadword is copied from the first source operand.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VCVTSS2SD is encoded with VEX.L=0. Encoding VCVTSS2SD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

**Operation**

**VCVTSS2SD (VEX.128 encoded version)**

DEST[63:0] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC2[31:0])

DEST[127:64] ← SRC1[127:64]

DEST[255:128] ← 0

CVTSS2SD (128-bit Legacy SSE version)
DEST[63:0] ← Convert_Single_Precision_To_Double_Precision_Floating_Point(SRC[31:0]);
DEST[255:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
CVTSS2SD __m128d _mm_cvtss_sd (__m128d a, __m128 b)

SIMD Floating-Point Exceptions
Invalid, Denormal

Other Exceptions
See Exceptions Type 3
CVTSS2SI - Convert Scalar Single-Precision Floating-Point Value to Doubleword Integer

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 2D /r CVTSS2SI r32, xmm1/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.</td>
</tr>
<tr>
<td>F3 REX.W 0F 2D /r CVTSS2SI r64, xmm1/m32</td>
<td>V/N.E.</td>
<td>SSE</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64.</td>
</tr>
<tr>
<td>VEX.128.F3.0F.W0 2D /r VCVTSS2SI r32, xmm1/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32.</td>
</tr>
<tr>
<td>VEX.128.F3.0F.W1 2D /r VCVTSS2SI r64, xmm1/m32</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64.</td>
</tr>
</tbody>
</table>

Description

Converts a single-precision floating-point value in the source operand (second operand) to a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the destination operand (first operand). The source operand can be an XMM register or a memory location. The destination operand is a general-purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.

When a conversion is inexact, the value returned is rounded according to the rounding control bits in the MXCSR register. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operands. See the summary chart at the beginning of this section for encoding data and limits.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCVTSS2SI is encoded with VEX.L=0. Encoding VCVTSS2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
CVTSS2SI
IF 64-bit Mode and OperandSize = 64
THEN
  DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
ELSE
  DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer(SRC[31:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_cvtss_si32(__m128 a)

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.
CVTTPD2DQ- Convert with Truncation Packed Double-Precision Floating-point values to Packed Doubleword Integers

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E6 /r CVTTPD2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation</td>
</tr>
<tr>
<td>VEX.128.66.0F E6 /r VCVTTPD2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert two packed double-precision floating-point values in xmm2/mem to two signed doubleword integers in xmm1 using truncation</td>
</tr>
<tr>
<td>VEX.256.66.0F E6 /r VCVTTPD2DQ xmm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed double-precision floating-point values in ymm2/mem to four signed doubleword integers in xmm1 using truncation</td>
</tr>
</tbody>
</table>

Description

Converts two or four packed double-precision floating-point values in the source operand (second operand) to two or four packed signed doubleword integers in the destination operand (first operand).

When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.
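
To contrast truncation with MXCSR-controlled rounding, the following sketch converts the same pair of doubles with the SSE2 intrinsics `_mm_cvtpd_epi32` (CVTPD2DQ, rounds) and `_mm_cvttpd_epi32` (CVTTPD2DQ, truncates). It assumes an x86 target with SSE2 and the default MXCSR rounding mode (round to nearest); the helper names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvtpd_epi32, _mm_cvttpd_epi32 */
#include <stdint.h>

/* Convert two doubles both ways; each result vector holds four dwords. */
static void cvtpd_round_and_truncate(const double in[2],
                                     int32_t rounded[4],
                                     int32_t truncated[4])
{
    __m128d v = _mm_loadu_pd(in);
    _mm_storeu_si128((__m128i *)rounded,   _mm_cvtpd_epi32(v));   /* CVTPD2DQ  */
    _mm_storeu_si128((__m128i *)truncated, _mm_cvttpd_epi32(v));  /* CVTTPD2DQ */
}

/* Hypothetical self-check; assumes default MXCSR (round to nearest). */
static void cvtpd_selftest(void)
{
    const double in[2] = { 2.7, -2.7 };
    int32_t r[4], t[4];

    cvtpd_round_and_truncate(in, r, t);
    assert(r[0] == 3 && r[1] == -3);   /* rounded to nearest       */
    assert(t[0] == 2 && t[1] == -2);   /* truncated toward zero    */
    assert(t[2] == 0 && t[3] == 0);    /* bits 127:64 are zeroed   */
    assert(r[2] == 0 && r[3] == 0);
}
```

The truncating form gives the same result regardless of the rounding-control bits, which is why it matches the semantics of a C cast from `double` to `int` for in-range values.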

VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:64) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. Bits (127:64) of the destination are zeroed, and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Figure 5-9. VCVTTPD2DQ (VEX.256 encoded version)

Operation

VCVTTPD2DQ (VEX.256 encoded version)
DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[95:64] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[191:128])
DEST[127:96] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[255:192])
DEST[255:128] ← 0

VCVTTPD2DQ (VEX.128 encoded version)
DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[255:64] ← 0

CVTTPD2DQ (128-bit Legacy SSE version)
DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0])
DEST[63:32] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[127:64])
DEST[127:64] ← 0
DEST[255:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPD2DQ __m128i _mm256_cvttpd_epi32 (__m256d src)
CVTTPD2DQ __m128i _mm_cvttpd_epi32 (__m128d src)

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
CVTTPS2DQ- Convert with Truncation Packed Single-Precision Floating-Point Values to Packed Signed Doubleword Integer Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5B /r CVTTPS2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation</td>
</tr>
<tr>
<td>VEX.128.F3.0F 5B /r VCVTTPS2DQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert four packed single precision floating-point values from xmm2/mem to four packed signed doubleword values in xmm1 using truncation</td>
</tr>
<tr>
<td>VEX.256.F3.0F 5B /r VCVTTPS2DQ ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert eight packed single precision floating-point values from ymm2/mem to eight packed signed doubleword values in ymm1 using truncation</td>
</tr>
</tbody>
</table>

**Description**

Converts four or eight packed single-precision floating-point values in the source operand to four or eight signed doubleword integers in the destination operand.

When a conversion is inexact, a truncated (round toward zero) value is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.
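
The following sketch exercises the truncation and indefinite-value rules with the SSE2 intrinsic `_mm_cvttps_epi32`. It assumes an x86 target with SSE2 and a masked invalid exception (the MXCSR default); the helper names are hypothetical. The last input shows a result below the minimum signed doubleword; that case is not spelled out in the text above and is included here as an assumption drawn from the general masked invalid-operation response.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvttps_epi32 */
#include <stdint.h>

/* Convert four packed floats to signed doublewords with truncation. */
static void cvttps2dq4(const float in[4], int32_t out[4])
{
    __m128i r = _mm_cvttps_epi32(_mm_loadu_ps(in));
    _mm_storeu_si128((__m128i *)out, r);
}

/* Hypothetical self-check; assumes the invalid exception is masked. */
static void cvttps2dq4_selftest(void)
{
    const float in[4] = { 1.9f, -1.9f, 2147483648.0f, -3.0e9f };
    int32_t out[4];

    cvttps2dq4(in, out);
    assert(out[0] == 1);           /* fraction discarded            */
    assert(out[1] == -1);          /* toward zero, not toward -inf  */
    assert(out[2] == INT32_MIN);   /* 2^31 is one past the maximum  */
    assert(out[3] == INT32_MIN);   /* below the minimum: 80000000H  */
}
```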

VEX.256 encoded version: The source operand is a YMM register or 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

**VCVTTPS2DQ (VEX.256 encoded version)**
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[159:128] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[159:128])
DEST[191:160] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[191:160])
DEST[223:192] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[223:192])
DEST[255:224] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[255:224])

**VCVTTPS2DQ (VEX.128 encoded version)**
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[255:128] ← 0

**CVTTPS2DQ (128-bit Legacy SSE version)**
DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0])
DEST[63:32] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[63:32])
DEST[95:64] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[95:64])
DEST[127:96] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[127:96])
DEST[255:128] (unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VCVTTPS2DQ __m256i _mm256_cvttps_epi32 (__m256 a)
CVTTPS2DQ __m128i _mm_cvttps_epi32 (__m128 a)

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.

CVTTSD2SI- Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Doubleword Integer

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 2C /r CVTTSD2SI r32, xmm1/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer in r32 using truncation.</td>
</tr>
<tr>
<td>F2 REX.W 0F 2C /r CVTTSD2SI r64, xmm1/m64</td>
<td>V/N.E.</td>
<td>SSE2</td>
<td>Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer in r64 using truncation.</td>
</tr>
<tr>
<td>VEX.128.F2.0F.W0 2C /r VCVTTSD2SI r32, xmm1/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one double-precision floating-point value from xmm1/m64 to one signed doubleword integer in r32 using truncation.</td>
</tr>
<tr>
<td>VEX.128.F2.0F.W1 2C /r VCVTTSD2SI r64, xmm1/m64</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one double-precision floating-point value from xmm1/m64 to one signed quadword integer in r64 using truncation.</td>
</tr>
</tbody>
</table>

Description

Converts a double-precision floating-point value in the source operand (second operand) to a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the destination operand (first operand). The source operand can be an XMM register or a 64-bit memory location. The destination operand is a general purpose register. When the source operand is an XMM register, the double-precision floating-point value is contained in the low quadword of the register.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised. If this exception is masked, the indefinite integer value (80000000H) is returned.

Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See the summary chart at the beginning of this section for encoding data and limits.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.
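
As a brief illustration of how this instruction differs from its rounding counterpart CVTSD2SI, the sketch below wraps the SSE2 intrinsics `_mm_cvtsd_si32` and `_mm_cvttsd_si32`. It assumes an x86 target with SSE2 and the default MXCSR state (round to nearest even, invalid exception masked); `sd_round` and `sd_truncate` are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: _mm_cvtsd_si32, _mm_cvttsd_si32 */
#include <stdint.h>

/* CVTSD2SI: rounds according to MXCSR.RC. */
static int32_t sd_round(double d)
{
    return _mm_cvtsd_si32(_mm_set_sd(d));
}

/* CVTTSD2SI: always truncates toward zero. */
static int32_t sd_truncate(double d)
{
    return _mm_cvttsd_si32(_mm_set_sd(d));
}

/* Hypothetical self-check; assumes default MXCSR. */
static void sd_convert_selftest(void)
{
    assert(sd_round(2.5) == 2);              /* tie rounds to even   */
    assert(sd_round(3.5) == 4);
    assert(sd_round(-3.5) == -4);
    assert(sd_truncate(3.5) == 3);           /* fraction discarded   */
    assert(sd_truncate(-3.5) == -3);
    assert(sd_truncate(1.0e10) == INT32_MIN);/* indefinite 80000000H */
}
```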

Software should ensure VCVTTSD2SI is encoded with VEX.L=0. Encoding VCVTTSD2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

(V)CVTTSD2SI

IF 64-Bit Mode and OperandSize = 64
THEN
  DEST[63:0] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
ELSE
  DEST[31:0] ← Convert_Double_Precision_Floating_Point_To_Integer_Truncate(SRC[63:0]);
FI;

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_cvttsd_si32(__m128d a)

SIMD Floating-Point Exceptions

Invalid, Precision

Other Exceptions

See Exceptions Type 3; additionally

#UD If VEX.vvvv != 1111B.

CVTTSS2SI- Convert with Truncation Scalar Single-Precision Floating-Point Value to Doubleword Integer

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 2C /r CVTTSS2SI r32, xmm1/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32 using truncation.</td>
</tr>
<tr>
<td>F3 REX.W 0F 2C /r CVTTSS2SI r64, xmm1/m32</td>
<td>V/N.E.</td>
<td>SSE</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64 using truncation.</td>
</tr>
<tr>
<td>VEX.128.F3.0F.W0 2C /r VCVTTSS2SI r32, xmm1/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed doubleword integer in r32 using truncation.</td>
</tr>
<tr>
<td>VEX.128.F3.0F.W1 2C /r VCVTTSS2SI r64, xmm1/m32</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Convert one single-precision floating-point value from xmm1/m32 to one signed quadword integer in r64 using truncation.</td>
</tr>
</tbody>
</table>

Description

Converts a single-precision floating-point value in the source operand (second operand) to a signed doubleword integer (or signed quadword integer if operand size is 64 bits) in the destination operand (first operand). The source operand can be an XMM register or a 32-bit memory location. The destination operand is a general purpose register. When the source operand is an XMM register, the single-precision floating-point value is contained in the low doubleword of the register.

When a conversion is inexact, a truncated (round toward zero) result is returned. If a converted result is larger than the maximum signed doubleword integer, the floating point invalid exception is raised. If this exception is masked, the indefinite integer value (80000000H) is returned.
Legacy SSE instructions: In 64-bit mode, use of the REX.W prefix promotes the instruction to 64-bit operation. See the summary chart at the beginning of this section for encoding data and limits.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VCVTTSS2SI is encoded with VEX.L=0. Encoding VCVTTSS2SI with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

CVTTSS2SI
IF 64-Bit Mode and OperandSize = 64
THEN
  DEST[63:0] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
ELSE
  DEST[31:0] ← Convert_Single_Precision_Floating_Point_To_Integer_Truncate(SRC[31:0]);
FI;
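
The 64-bit branch of the operation above (REX.W or VEX.W1) might be modeled as follows. This is a sketch under stated assumptions: `cvttss2si64_model` is a hypothetical name, the invalid exception is assumed masked, and the 64-bit indefinite value is assumed to be 8000000000000000H, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the 64-bit form of CVTTSS2SI with the
   invalid exception masked; not an Intel intrinsic. */
static int64_t cvttss2si64_model(float src)
{
    /* 2^63 is exactly representable as a float. NaN and values outside
       [-2^63, 2^63) cannot be converted. */
    if (src != src || src >= 9223372036854775808.0f
                   || src < -9223372036854775808.0f)
        return INT64_MIN;   /* assumed 64-bit indefinite, 8000000000000000H */
    return (int64_t)src;    /* a C cast truncates toward zero */
}
```

A value such as 4294967296.0f (2^32) overflows the doubleword form but converts exactly in this quadword model.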

Intel C/C++ Compiler Intrinsic Equivalent
int _mm_cvttss_si32(__m128 a)

SIMD Floating-Point Exceptions
Invalid, Precision

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.

DIVPD- Divide Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5E /r DIVPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Divide packed double-precision floating-point values in xmm1 by packed double-precision floating-point values in xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 5E /r VDIVPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide packed double-precision floating-point values in xmm2 by packed double-precision floating-point values in xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 5E /r VDIVPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide packed double-precision floating-point values in ymm2 by packed double-precision floating-point values in ymm3/mem</td>
</tr>
</tbody>
</table>

**Description**

Performs an SIMD divide of the two or four packed double-precision floating-point values in the first source operand by the two or four packed double-precision floating-point values in the second source operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Operation**

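VDIVPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] / SRC2[63:0]
DEST[127:64] ← SRC1[127:64] / SRC2[127:64]
DEST[191:128] ← SRC1[191:128] / SRC2[191:128]
DEST[255:192] ← SRC1[255:192] / SRC2[255:192]

VDIVPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] / SRC2[63:0]
DEST[127:64] ← SRC1[127:64] / SRC2[127:64]
DEST[255:128] ← 0

DIVPD (128-bit Legacy SSE version)
DEST[63:0] ← SRC1[63:0] / SRC2[63:0]
DEST[127:64] ← SRC1[127:64] / SRC2[127:64]
DEST[255:128] (Unmodified)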
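
As a rough illustration of the VEX-encoded forms, the element-wise divide and the zeroing of the upper bits can be modeled in plain C. The function name `vdivpd_model` and the array representation of a YMM register (four doubles, element 0 in bits 63:0) are assumptions made for this sketch; exceptions and rounding control are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of VDIVPD: lanes is 2 for VEX.128 and 4 for VEX.256.
   A YMM register is represented as four doubles, lowest element first. */
static void vdivpd_model(double dest[4], const double src1[4],
                         const double src2[4], size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < 4; i++) {
        /* Active lanes divide element-wise; VEX.128 zeroes bits 255:128. */
        dest[i] = (i < lanes) ? src1[i] / src2[i] : 0.0;
    }
}
```

The legacy SSE form differs only in that the upper elements would be left unmodified instead of zeroed.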

Intel C/C++ Compiler Intrinsic Equivalent
VDIVPD __m256d _mm256_div_pd (__m256d a, __m256d b);
DIVPD __m128d _mm_div_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal

Other Exceptions
See Exceptions Type 2
DIVPS- Divide Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5E /r DIVPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Divide packed single-precision floating-point values in xmm1 by packed single-precision floating-point values in xmm2/m128</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 5E /r VDIVPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide packed single-precision floating-point values in xmm2 by packed single-precision floating-point values in xmm3/m128</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 5E /r VDIVPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide packed single-precision floating-point values in ymm2 by packed single-precision floating-point values in ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs an SIMD divide of the four or eight packed single-precision floating-point values in the first source operand by the four or eight packed single-precision floating-point values in the second source operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

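VDIVPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[31:0] / SRC2[31:0]
DEST[63:32] ← SRC1[63:32] / SRC2[63:32]
DEST[95:64] ← SRC1[95:64] / SRC2[95:64]
DEST[127:96] ← SRC1[127:96] / SRC2[127:96]
DEST[159:128] ← SRC1[159:128] / SRC2[159:128]
DEST[191:160] ← SRC1[191:160] / SRC2[191:160]
DEST[223:192] ← SRC1[223:192] / SRC2[223:192]
DEST[255:224] ← SRC1[255:224] / SRC2[255:224]

VDIVPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] / SRC2[31:0]
DEST[63:32] ← SRC1[63:32] / SRC2[63:32]
DEST[95:64] ← SRC1[95:64] / SRC2[95:64]
DEST[127:96] ← SRC1[127:96] / SRC2[127:96]
DEST[255:128] ← 0

DIVPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] / SRC2[31:0]
DEST[63:32] ← SRC1[63:32] / SRC2[63:32]
DEST[95:64] ← SRC1[95:64] / SRC2[95:64]
DEST[127:96] ← SRC1[127:96] / SRC2[127:96]
DEST[255:128] (Unmodified)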

Intel C/C++ Compiler Intrinsic Equivalent
VDIVPS __m256 _mm256_div_ps (__m256 a, __m256 b);
DIVPS __m128 _mm_div_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal

Other Exceptions
See Exceptions Type 2
DIVSD- Divide Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 5E /r DIVSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Divide low double-precision floating-point value in xmm1 by low double-precision floating-point value in xmm2/mem64.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 5E /r VDIVSD xmm1, xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide low double-precision floating-point value in xmm2 by low double-precision floating-point value in xmm3/mem64.</td>
</tr>
</tbody>
</table>

Description

Divides the low double-precision floating-point value in the first source operand by the low double-precision floating-point value in the second source operand, and stores the double-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers. The high quadword of the destination operand is copied from the high quadword of the first source operand. See Chapter 11 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an overview of a scalar double-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VDIVSD is encoded with VEX.L=0. Encoding VDIVSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VDIVSD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] / SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

DIVSD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] / SRC[63:0]
DEST[255:64] (Unmodified)
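
The VEX.128 form above can be pictured with a small scalar model: only the low element is divided, and the high quadword passes through from the first source. The type `xmm_pd` and the function `vdivsd_model` are hypothetical names introduced for this sketch, and exception behavior is not modeled.

```c
#include <assert.h>

/* Hypothetical view of an XMM register as two doubles:
   lo holds bits 63:0 and hi holds bits 127:64. */
typedef struct {
    double lo;
    double hi;
} xmm_pd;

/* Hypothetical model of VDIVSD xmm1, xmm2, xmm3/m64. */
static xmm_pd vdivsd_model(xmm_pd src1, xmm_pd src2)
{
    xmm_pd dest;
    dest.lo = src1.lo / src2.lo;  /* DEST[63:0]   <- SRC1[63:0] / SRC2[63:0] */
    dest.hi = src1.hi;            /* DEST[127:64] <- SRC1[127:64]            */
    return dest;
}
```

Passing the same value as both `src1` and the destination reproduces the legacy DIVSD behavior for bits 127:0.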

**Intel C/C++ Compiler Intrinsic Equivalent**

DIVSD __m128d _mm_div_sd (__m128d a, __m128d b)

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal

**Other Exceptions**
See Exceptions Type 3
DIVSS- Divide Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5E /r DIVSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Divide low single-precision floating-point value in xmm1 by low single-precision floating-point value in xmm2/m32.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 5E /r VDIVSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Divide low single-precision floating-point value in xmm2 by low single-precision floating-point value in xmm3/m32.</td>
</tr>
</tbody>
</table>

Description

Divides the low single-precision floating-point value in the first source operand by the low single-precision floating-point value in the second source operand, and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers. The three high-order doublewords of the destination are copied from the same dwords of the first source operand. See Chapter 10 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an overview of a scalar single-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VDIVSS is encoded with VEX.L=0. Encoding VDIVSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

**VDIVSS (VEX.128 encoded version)**

DEST[31:0] ← SRC1[31:0] / SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

**DIVSS (128-bit Legacy SSE version)**

DEST[31:0] ← DEST[31:0] / SRC[31:0]
DEST[255:32] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

DIVSS __m128 _mm_div_ss(__m128 a, __m128 b)

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Divide-by-Zero, Precision, Denormal

**Other Exceptions**

See Exceptions Type 3
DPPD- Dot Product of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 41 /r ib</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Selectively multiply packed DP floating-point values from xmm1 with packed DP floating-point values from xmm2, add and selectively store the packed DP floating-point values to xmm1</td>
</tr>
<tr>
<td>DPPD xmm1, xmm2/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 41 /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Selectively multiply packed DP floating-point values from xmm2 with packed DP floating-point values from xmm3, add and selectively store the packed DP floating-point values to xmm1</td>
</tr>
<tr>
<td>VDPPD xmm1,xmm2, xmm3/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Conditionally multiplies the packed double precision floating-point values in the destination operand (first operand) with the packed double-precision floating-point values in the source (second operand) depending on a mask extracted from bits 4-5 of the immediate operand. Each of the two resulting double-precision values is summed and this sum is conditionally broadcast to each of 2 positions in the destination operand if the corresponding bit of the mask selected from bits 0-1 of the immediate operand is "1". If the corresponding low bit 0-1 of the mask is zero, the destination is set to zero. DPPD follows the NaN forwarding rules stated in the Software Developer’s Manual, vol. 1, table 4.7. These rules do not cover horizontal prioritization of NaNs. Horizontal propagation of NaNs to the destination and the positioning of those NaNs in the destination is implementation dependent. NaNs on the input sources or computationally generated NaNs will have at least one NaN propagated to the destination.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

If VDPPD is encoded with VEX.L=1, an attempt to execute the instruction will cause an #UD exception.

Operation

**DP_Primitive (SRC1, SRC2)**

IF (imm8[4] == 1) THEN Temp1[63:0] ← SRC1[63:0] * SRC2[63:0];
ELSE Temp1[63:0] ← +0.0;
IF (imm8[5] == 1) THEN Temp1[127:64] ← SRC1[127:64] * SRC2[127:64];
ELSE Temp1[127:64] ← +0.0;
Temp2[63:0] ← Temp1[63:0] + Temp1[127:64];
IF (imm8[0] == 1) THEN DEST[63:0] ← Temp2[63:0];
ELSE DEST[63:0] ← +0.0;
IF (imm8[1] == 1) THEN DEST[127:64] ← Temp2[63:0];
ELSE DEST[127:64] ← +0.0;

**VDPPD (VEX.128 encoded version)**

DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← 0

**DPPD (128-bit Legacy SSE version)**

DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] (Unmodified)
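
The role of the two immediate fields in the primitive above may be easier to follow in C. The following is a minimal sketch, assuming an XMM register is viewed as two doubles with element 0 in bits 63:0; `dppd_model` is a hypothetical name, and NaN forwarding and exception behavior are not modeled.

```c
#include <assert.h>

/* Hypothetical model of the DPPD dot-product primitive. */
static void dppd_model(double dest[2], const double src1[2],
                       const double src2[2], unsigned imm8)
{
    /* imm8[5:4] select which products enter the sum. */
    double t0 = (imm8 & 0x10u) ? src1[0] * src2[0] : 0.0;
    double t1 = (imm8 & 0x20u) ? src1[1] * src2[1] : 0.0;
    double sum = t0 + t1;

    /* imm8[1:0] select which destination elements receive the sum. */
    dest[0] = (imm8 & 0x01u) ? sum : 0.0;
    dest[1] = (imm8 & 0x02u) ? sum : 0.0;
}
```

With imm8 = 31H both products are summed and only the low element receives the result; with imm8 = 13H only the low product is used and it is broadcast to both elements.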

**Intel C/C++ Compiler Intrinsic Equivalent**

DPPD __m128d _mm_dp_pd (__m128d a, __m128d b, const int mask);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

Exceptions are determined separately for each add and multiply operation.
Unmasked exceptions will leave the destination untouched.

**Other Exceptions**

See Exceptions Type 2; additionally

#UD If VEX.L=1
**DPPS- Dot Product of Packed Single-Precision Floating-Point Values**

**Description**

Multiplies the packed single-precision floating-point values in the first source operand (second operand) with the packed single-precision floating-point values in the second source operand (third operand). Each of the four resulting single-precision values is conditionally summed depending on a mask extracted from the high 4 bits of the immediate operand. This sum is broadcast to each of 4 positions in the destination operand (first operand) if the corresponding bit of the mask selected from the low 4 bits of the immediate operand is "1". If the corresponding low bit 0-3 of the mask is zero, the destination is set to zero.

The process is replicated for the high elements of the destination YMM.

DPPS follows the NaN forwarding rules stated in the Software Developer's Manual, vol. 1, table 4.7. These rules do not cover horizontal prioritization of NaNs. Horizontal propagation of NaNs to the destination and the positioning of those NaNs in the destination is implementation dependent. NaNs on the input sources or computationally generated NaNs will have at least one NaN propagated to the destination.

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 40 /r ib DPPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Multiply packed SP floating point values from xmm1 with packed SP floating point values from xmm2/mem, selectively add and store to xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 40 /r ib VDPPS xmm1,xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed SP floating point values from xmm2 with packed SP floating point values from xmm3/mem, selectively add and store to xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 40 /r ib VDPPS ymm1, ymm2, ymm3/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed single-precision floating-point values from ymm2 with packed SP floating point values from ymm3/mem, selectively add pairs of elements and store to ymm1</td>
</tr>
</tbody>
</table>

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

**DP_Primitive (SRC1, SRC2)**

IF (imm8[4] == 1) THEN Temp1[31:0] ← SRC1[31:0] * SRC2[31:0];
ELSE Temp1[31:0] ← +0.0;
IF (imm8[5] == 1) THEN Temp1[63:32] ← SRC1[63:32] * SRC2[63:32];
ELSE Temp1[63:32] ← +0.0;
IF (imm8[6] == 1) THEN Temp1[95:64] ← SRC1[95:64] * SRC2[95:64];
ELSE Temp1[95:64] ← +0.0;
IF (imm8[7] == 1) THEN Temp1[127:96] ← SRC1[127:96] * SRC2[127:96];
ELSE Temp1[127:96] ← +0.0;

Temp2[31:0] ← Temp1[31:0] + Temp1[63:32];
Temp3[31:0] ← Temp1[95:64] + Temp1[127:96];
Temp4[31:0] ← Temp2[31:0] + Temp3[31:0];

IF (imm8[0] == 1) THEN DEST[31:0] ← Temp4[31:0];
ELSE DEST[31:0] ← +0.0;
IF (imm8[1] == 1) THEN DEST[63:32] ← Temp4[31:0];
ELSE DEST[63:32] ← +0.0;
IF (imm8[2] == 1) THEN DEST[95:64] ← Temp4[31:0];
ELSE DEST[95:64] ← +0.0;
IF (imm8[3] == 1) THEN DEST[127:96] ← Temp4[31:0];
ELSE DEST[127:96] ← +0.0;

**VDPPS (VEX.256 encoded version)**

DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← DP_Primitive(SRC1[255:128], SRC2[255:128]);

**VDPPS (VEX.128 encoded version)**

DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] ← 0

**DPPS (128-bit Legacy SSE version)**

DEST[127:0] ← DP_Primitive(SRC1[127:0], SRC2[127:0]);
DEST[255:128] (Unmodified)
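
A scalar C sketch can make the mask handling and the per-128-bit-lane replication concrete. It assumes, following the description above, that the high 4 bits of imm8 gate the four products, the low 4 bits gate the four destination elements, and the 256-bit form applies the same primitive independently to each 128-bit half. The names `dp_primitive_ps` and `vdpps256_model` are hypothetical, and NaN forwarding and exceptions are not modeled.

```c
#include <assert.h>

/* Hypothetical model of the single-precision dot-product primitive on one
   128-bit lane (four floats, element 0 in bits 31:0). */
static void dp_primitive_ps(float dest[4], const float src1[4],
                            const float src2[4], unsigned imm8)
{
    float t[4];
    for (int i = 0; i < 4; i++)            /* imm8[7:4]: product select */
        t[i] = (imm8 & (0x10u << i)) ? src1[i] * src2[i] : 0.0f;

    /* Same pairing as Temp2, Temp3 and Temp4 in the pseudocode. */
    float sum = (t[0] + t[1]) + (t[2] + t[3]);

    for (int i = 0; i < 4; i++)            /* imm8[3:0]: broadcast select */
        dest[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}

/* Hypothetical model of the VEX.256 form: one primitive per 128-bit lane. */
static void vdpps256_model(float dest[8], const float src1[8],
                           const float src2[8], unsigned imm8)
{
    dp_primitive_ps(dest, src1, src2, imm8);
    dp_primitive_ps(dest + 4, src1 + 4, src2 + 4, imm8);
}
```

Note that in this model the two lanes never mix: a full 8-element dot product would need an extra add of the two lane results.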

**Intel C/C++ Compiler Intrinsic Equivalent**
VDPPS __m256 _mm256_dp_ps ( __m256 a, __m256 b, const int mask);
(V)DPPS __m128 _mm_dp_ps ( __m128 a, __m128 b, const int mask);

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Precision, Denormal
Exceptions are determined separately for each add and multiply operation.
Unmasked exceptions will leave the destination untouched.

**Other Exceptions**
See Exceptions Type 2
VEXTRACTF128- Extract packed floating-point values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEXTRACTF128 xmm1/m128, ymm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract 128 bits of packed floating-point values from ymm2 and store results in xmm1/mem</td>
</tr>
</tbody>
</table>

Description

Extracts 128 bits of packed floating-point values from the source operand (second operand), at the 128-bit offset selected by imm8[0], into the destination operand (first operand). The destination may be either an XMM register or a 128-bit memory location.

VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

The high 7 bits of the immediate are ignored.

If VEXTRACTF128 is encoded with VEX.L=0, an attempt to execute the instruction will cause an #UD exception.

Operation

**VEXTRACTF128 (memory destination form)**

CASE (imm8[0]) OF

0: DEST[127:0] ← SRC1[127:0]
1: DEST[127:0] ← SRC1[255:128]

ESAC.

**VEXTRACTF128 (register destination form)**

CASE (imm8[0]) OF

0: DEST[127:0] ← SRC1[127:0]
1: DEST[127:0] ← SRC1[255:128]

ESAC.

DEST[255:128] ← 0
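
The selection performed by imm8[0] can be sketched in C as a copy of one half of the source. In this illustrative model a YMM register is represented as eight floats with element 0 in bits 31:0; `vextractf128_model` is a hypothetical name, and the zeroing of bits 255:128 in the register destination form is not represented because the destination array is only 128 bits wide.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of VEXTRACTF128: copy the low or high 128 bits of
   src (eight floats) to dest (four floats). */
static void vextractf128_model(float dest[4], const float src[8],
                               unsigned imm8)
{
    /* Only imm8[0] is examined; the high 7 bits are ignored. */
    const float *half = src + 4 * (imm8 & 1u);
    memcpy(dest, half, 4 * sizeof(float));
}
```

Because only bit 0 matters, an immediate of FEH selects the low half just as 00H does.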

Intel C/C++ Compiler Intrinsic Equivalent

VEXTRACTF128 __m128 _mm256_extractf128_ps (__m256 a, int offset);

VEXTRACTF128 __m128d _mm256_extractf128_pd (__m256d a, int offset);

VEXTRACTF128 __m128i _mm256_extractf128_si256 (__m256i a, int offset);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6; additionally
#UD IF VEX.L = 0
EXTRACTPS- Extract packed floating-point values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 17 /r ib</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Extract one single-precision floating-point value from xmm1 at the offset specified by imm8 and store the result in reg or m32. Zero extend the results in 64-bit register if applicable.</td>
</tr>
<tr>
<td>EXTRACTPS reg/m32, xmm1, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F3A 17 /r ib</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract one single-precision floating-point value from xmm1 at the offset specified by imm8 and store the result in reg or m32. Zero extend the results in 64-bit register if applicable.</td>
</tr>
<tr>
<td>VEXTRACTPS r/m32, xmm1, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Extracts a single-precision floating-point value from the source operand (second operand) at the 32-bit offset specified from imm8. Immediate bits higher than the most significant offset for the vector length are ignored.

The extracted single-precision floating-point value is stored in the low 32-bits of the destination operand.

In 64-bit mode, destination register operand has default operand size of 64 bits. The upper 32-bits of the register are filled with zero. REX.W is ignored.

VEX.128 encoded version: When VEX.128.66.0F3A.W1 17 form is used in 64-bit mode with a general purpose register (GPR) as a destination operand, the packed single quantity is zero extended to 64 bits. VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

128-bit Legacy SSE version: When a REX.W prefix is used in 64-bit mode with a general purpose register (GPR) as a destination operand, the packed single quantity is zero extended to 64 bits.

The source register is an XMM register. Imm8[1:0] determine the starting DWORD offset from which to extract the 32-bit floating-point value.

If VEXTRACTPS is encoded with VEX.L=1, an attempt to execute the instruction will cause an #UD exception.

Operation

**VEXTRACTPS (VEX.128 encoded version)**
SRC_OFFSET ← IMM8[1:0]
IF (64-Bit Mode and DEST is register)
   DEST[31:0] ← (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
   DEST[63:32] ← 0
ELSE
   DEST[31:0] ← (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
FI

**EXTRACTPS (128-bit Legacy SSE version)**
SRC_OFFSET ← IMM8[1:0]
IF (64-Bit Mode and DEST is register)
   DEST[31:0] ← (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
   DEST[63:32] ← 0
ELSE
   DEST[31:0] ← (SRC[127:0] >> (SRC_OFFSET*32)) AND 0FFFFFFFFh
FI
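
The shift-and-mask in the operation above amounts to picking one 32-bit element and returning its raw bit pattern. A minimal C sketch follows, assuming the XMM source is viewed as four floats with element 0 in bits 31:0 and that `float` is a 32-bit IEEE 754 type; `extractps_model` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of (V)EXTRACTPS with a register destination: returns
   the raw bits of the selected element. Assigning the result to a 64-bit
   variable zero extends it, as described for 64-bit mode. */
static uint32_t extractps_model(const float src[4], unsigned imm8)
{
    uint32_t bits;
    assert(sizeof(float) == sizeof(uint32_t));
    /* Only imm8[1:0] select the element; higher bits are ignored. */
    memcpy(&bits, &src[imm8 & 3u], sizeof bits);
    return bits;
}
```

The value is moved as a bit pattern and is not converted to an integer: extracting 1.0f yields 3F800000H, not 1.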

Intel C/C++ Compiler Intrinsic Equivalent

EXTRACTPS _mm_extractmem_ps (float *dest, __m128 a, const int nidx);

EXTRACTPS int _mm_extract_ps (__m128 a, const int nidx);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 5; additionally

#UD IF VEX.L = 1
HADDPD - Add Horizontal Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 7C /r HADDPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Horizontal add packed double-precision floating-point values from xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 7C /r VHADDPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal add packed double-precision floating-point values from xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 7C /r VHADDPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal add packed double-precision floating-point values from ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Adds pairs of adjacent double-precision floating-point values in the first source operand and second source operand and stores results in the destination.

![Figure 5-10. VHADDPD operation](image)

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

VHADDPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[127:64] + SRC1[63:0]
DEST[127:64] ← SRC2[127:64] + SRC2[63:0]
DEST[191:128] ← SRC1[255:192] + SRC1[191:128]
DEST[255:192] ← SRC2[255:192] + SRC2[191:128]

VHADDPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[127:64] + SRC1[63:0]
DEST[127:64] ← SRC2[127:64] + SRC2[63:0]
DEST[255:128] ← 0

HADDPD (128-bit Legacy SSE version)
DEST[63:0] ← SRC1[127:64] + SRC1[63:0]
DEST[127:64] ← SRC2[127:64] + SRC2[63:0]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VHADDPD __m256d _mm256_hadd_pd (__m256d a, __m256d b);
HADDPD __m128d _mm_hadd_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
HADDPS- Add Horizontal Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 7C /r</td>
<td>V/V</td>
<td>SSE3</td>
<td>Horizontal add packed single-precision floating-point values from xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>HADDPS xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 7C /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal add packed single-precision floating-point values from xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VHADDPS xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.256.F2.0F 7C /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal add packed single-precision floating-point values from ymm2 and ymm3/mem</td>
</tr>
<tr>
<td>VHADDPS ymm1, ymm2, ymm3/m256</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Description**

Adds pairs of adjacent single-precision floating-point values in the first source operand and second source operand and stores results in the destination.

**Figure 5-11. VHADDPS operation**

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

VHADDPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[63:32] + SRC1[31:0]
DEST[63:32] ← SRC1[127:96] + SRC1[95:64]
DEST[95:64] ← SRC2[63:32] + SRC2[31:0]
DEST[127:96] ← SRC2[127:96] + SRC2[95:64]
DEST[159:128] ← SRC1[191:160] + SRC1[159:128]
DEST[191:160] ← SRC1[255:224] + SRC1[223:192]
DEST[223:192] ← SRC2[191:160] + SRC2[159:128]
DEST[255:224] ← SRC2[255:224] + SRC2[223:192]

VHADDPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[63:32] + SRC1[31:0]
DEST[63:32] ← SRC1[127:96] + SRC1[95:64]
DEST[95:64] ← SRC2[63:32] + SRC2[31:0]
DEST[127:96] ← SRC2[127:96] + SRC2[95:64]
DEST[255:128] ← 0

HADDPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[63:32] + SRC1[31:0]
DEST[63:32] ← SRC1[127:96] + SRC1[95:64]
DEST[95:64] ← SRC2[63:32] + SRC2[31:0]
DEST[127:96] ← SRC2[127:96] + SRC2[95:64]
DEST[255:128] (Unmodified)
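
The interleaving of results from the two sources in the VEX.256 form is easy to misread, so a scalar C model may help. This sketch represents a YMM register as eight floats with element 0 in bits 31:0; `vhaddps256_model` is a hypothetical name, and exceptions are not modeled.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of VHADDPS ymm1, ymm2, ymm3/m256. Each 128-bit lane
   holds two pair sums from src1 followed by two pair sums from src2. */
static void vhaddps256_model(float dest[8], const float src1[8],
                             const float src2[8])
{
    float r[8];
    for (int lane = 0; lane < 8; lane += 4) {
        r[lane + 0] = src1[lane + 1] + src1[lane + 0];
        r[lane + 1] = src1[lane + 3] + src1[lane + 2];
        r[lane + 2] = src2[lane + 1] + src2[lane + 0];
        r[lane + 3] = src2[lane + 3] + src2[lane + 2];
    }
    /* Copy at the end so dest may alias either source. */
    memcpy(dest, r, sizeof r);
}
```

The result is not a concatenation of all sums from the first source followed by all sums from the second: the grouping is per 128-bit lane.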

Intel C/C++ Compiler Intrinsic Equivalent

VHADDPS __m256 _mm256_hadd_ps (__m256 a, __m256 b);
HADDPS __m128 _mm_hadd_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal
Other Exceptions
See Exceptions Type 2

HSUBPD- Subtract Horizontal Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 7D /r HSUBPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Horizontal subtract packed double-precision floating-point values from xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 7D /r VHSUBPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal subtract packed double-precision floating-point values from xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 7D /r VHSUBPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal subtract packed double-precision floating-point values from ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Subtracts pairs of adjacent double-precision floating-point values in the first source operand and second source operand and stores results in the destination.

![Figure 5-12. VHSUBPD operation](image)

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

**VHSUBPD (VEX.256 encoded version)**

- DEST[63:0] ← SRC1[63:0] - SRC1[127:64]
- DEST[127:64] ← SRC2[63:0] - SRC2[127:64]
- DEST[191:128] ← SRC1[191:128] - SRC1[255:192]
- DEST[255:192] ← SRC2[191:128] - SRC2[255:192]

**VHSUBPD (VEX.128 encoded version)**

- DEST[63:0] ← SRC1[63:0] - SRC1[127:64]
- DEST[127:64] ← SRC2[63:0] - SRC2[127:64]
- DEST[255:128] ← 0

**HSUBPD (128-bit Legacy SSE version)**

- DEST[63:0] ← SRC1[63:0] - SRC1[127:64]
- DEST[127:64] ← SRC2[63:0] - SRC2[127:64]
- DEST[255:128] (Unmodified)
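As an informal illustration of the element ordering in the Operation section, the following Python sketch models the horizontal subtract on plain lists, with index 0 standing for the lowest element. The function name `hsubpd_model` and the list representation are assumptions made for this sketch; it does not model exceptions, rounding control, or the treatment of bits 255:128 in the 128-bit forms.

```python
def hsubpd_model(src1, src2):
    """Model (V)HSUBPD on lists of doubles; element 0 is bits 63:0.

    Accepts 2 elements (128-bit forms) or 4 elements (VEX.256 form).
    Each 128-bit lane is processed independently.
    """
    if len(src1) != len(src2) or len(src1) not in (2, 4):
        raise ValueError("operands must both hold 2 or 4 doubles")
    dest = []
    for lane in range(0, len(src1), 2):
        # Low result of the lane comes from SRC1, high result from SRC2.
        dest.append(src1[lane] - src1[lane + 1])
        dest.append(src2[lane] - src2[lane + 1])
    return dest


low = hsubpd_model([5.0, 2.0], [10.0, 4.0])
wide = hsubpd_model([5.0, 2.0, 9.0, 1.0], [10.0, 4.0, 7.0, 7.0])
```

Note that each result is "lower element minus upper element" of a pair, so the operation is not symmetric in the two elements of a pair.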

**Intel C/C++ Compiler Intrinsic Equivalent**

- VHSUBPD __m256d _mm256_hsub_pd (__m256d a, __m256d b);
- HSUBPD __m128d _mm_hsub_pd (__m128d a, __m128d b);

**SIMD Floating-Point Exceptions**

- Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

- See Exceptions Type 2

HSUBPS- Subtract Horizontal Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 7D /r HSUBPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Horizontal subtract packed single-precision floating-point values from xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 7D /r VHSUBPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal subtract packed single-precision floating-point values from xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.F2.0F 7D /r VHSUBPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Horizontal subtract packed single-precision floating-point values from ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Subtracts pairs of adjacent single-precision floating-point values in the first source operand and in the second source operand, and stores the results in the destination.

Figure 5-13. VHSUBPS operation

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

VHSUBPS (VEX.256 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC1[63:32]
DEST[63:32] ← SRC1[95:64] - SRC1[127:96]
DEST[95:64] ← SRC2[31:0] - SRC2[63:32]
DEST[127:96] ← SRC2[95:64] - SRC2[127:96]
DEST[159:128] ← SRC1[159:128] - SRC1[191:160]
DEST[191:160] ← SRC1[223:192] - SRC1[255:224]
DEST[223:192] ← SRC2[159:128] - SRC2[191:160]
DEST[255:224] ← SRC2[223:192] - SRC2[255:224]

VHSUBPS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] - SRC1[63:32]
DEST[63:32] ← SRC1[95:64] - SRC1[127:96]
DEST[95:64] ← SRC2[31:0] - SRC2[63:32]
DEST[127:96] ← SRC2[95:64] - SRC2[127:96]
DEST[255:128] ← 0

HSUBPS (128-bit Legacy SSE version)

DEST[31:0] ← SRC1[31:0] - SRC1[63:32]
DEST[63:32] ← SRC1[95:64] - SRC1[127:96]
DEST[95:64] ← SRC2[31:0] - SRC2[63:32]
DEST[127:96] ← SRC2[95:64] - SRC2[127:96]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VHSUBPS __m256 _mm256_hsub_ps (__m256 a, __m256 b);
HSUBPS __m128 _mm_hsub_ps (__m128 a, __m128 b);
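The single-precision form interleaves results differently from a plain element-wise operation: within each 128-bit lane, the two low results come from the first source and the two high results from the second. The Python sketch below models only that ordering; `hsubps_model` and the list layout (index 0 is bits 31:0) are assumptions of this sketch, and Python floats are used in place of true single-precision arithmetic.

```python
def hsubps_model(src1, src2):
    """Model (V)HSUBPS on lists of floats; element 0 is bits 31:0.

    Accepts 4 elements (128-bit forms) or 8 elements (VEX.256 form).
    """
    if len(src1) != len(src2) or len(src1) not in (4, 8):
        raise ValueError("operands must both hold 4 or 8 floats")
    dest = []
    for lane in range(0, len(src1), 4):
        a = src1[lane:lane + 4]
        b = src2[lane:lane + 4]
        # Two differences from SRC1, then two differences from SRC2.
        dest += [a[0] - a[1], a[2] - a[3], b[0] - b[1], b[2] - b[3]]
    return dest


narrow = hsubps_model([8.0, 1.0, 6.0, 2.0], [4.0, 4.0, 1.0, 3.0])
wide = hsubps_model(
    [8.0, 1.0, 6.0, 2.0, 10.0, 5.0, 3.0, 3.0],
    [4.0, 4.0, 1.0, 3.0, 2.0, 1.0, 0.0, 4.0],
)
```

In the 8-element case the upper lane never mixes with the lower lane, which is why the first four results of `wide` equal `narrow`.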

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
VINSERTF128- Insert packed floating-point values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F3A 18 /r ib VINSERTF128 ymm1, ymm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Insert 128 bits of packed floating-point values from xmm3/mem and the remaining values from ymm2 into ymm1</td>
</tr>
</tbody>
</table>

Description
Performs an insertion of 128 bits of packed floating-point values from the second source operand (third operand) into the destination operand (first operand) at a 128-bit offset selected by imm8[0]. The remaining portions of the destination are written by the corresponding fields of the first source operand (second operand). The second source operand can be either an XMM register or a 128-bit memory location. The high 7 bits of the immediate are ignored.

Operation
VINSERTF128
TEMP[255:0] ← SRC1[255:0]
CASE (imm8[0]) OF
  0: TEMP[127:0] ← SRC2[127:0]
  1: TEMP[255:128] ← SRC2[127:0]
ESAC
DEST ← TEMP

Intel C/C++ Compiler Intrinsic Equivalent
VINSERTF128 __m256 _mm256_insertf128_ps (__m256 a, __m128 b, int offset);
VINSERTF128 __m256d _mm256_insertf128_pd (__m256d a, __m128d b, int offset);
VINSERTF128 __m256i _mm256_insertf128_si256 (__m256i a, __m128i b, int offset);
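The insertion described in this entry can be sketched in Python on raw byte strings, which sidesteps any question of element type. This is a minimal model under stated assumptions: `vinsertf128_model` is a hypothetical helper, byte 0 of each operand stands for bits 7:0, and only bit 0 of the immediate is examined, as the description states.

```python
def vinsertf128_model(src1, src2, imm8):
    """Model VINSERTF128: src1 is 32 bytes, src2 is 16 bytes."""
    if len(src1) != 32 or len(src2) != 16:
        raise ValueError("expected a 32-byte and a 16-byte operand")
    temp = bytearray(src1)          # TEMP starts as a copy of SRC1
    if imm8 & 1:
        temp[16:32] = src2          # replace bits 255:128
    else:
        temp[0:16] = src2           # replace bits 127:0
    return bytes(temp)


base = bytes(range(32))
patch = bytes([0xAA]) * 16
upper = vinsertf128_model(base, patch, 0x01)
lower = vinsertf128_model(base, patch, 0xFE)  # high 7 bits are ignored
```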

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6
## INSERTPS- Insert Scalar Single Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 21 /r ib INSERTPS xmm1, xmm2/m32, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Insert a single precision floating point value selected by imm8 from xmm2/m32 into xmm1 at the specified destination element specified by imm8 and zero out destination elements in xmm1 as indicated in imm8.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 21 /r ib VINSERTPS xmm1, xmm2, xmm3/m32, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Insert a single precision floating point value selected by imm8 from xmm3/m32 and merge into xmm2 at the specified destination element specified by imm8 and zero out destination elements in xmm1 as indicated in imm8.</td>
</tr>
</tbody>
</table>

**Description**

**register source form**
Select a single precision floating-point element from second source as indicated by Count_S bits of the immediate operand and insert it into the first source at the location indicated by the Count_D bits of the immediate operand. Store in the destination and zero out destination elements based on the ZMask bits of the immediate operand.

**memory source form**
Load a floating-point element from a 32-bit memory location and insert it into the first source at the location indicated by the Count_D bits of the immediate operand. Store in the destination and zero out destination elements based on the ZMask bits of the immediate operand.
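The three immediate fields can be illustrated with a small Python model. It is a sketch under stated assumptions: `insertps_model` is a hypothetical name, operands are 4-element lists with index 0 as bits 31:0, and the field positions (Count_S in imm8[7:6], Count_D in imm8[5:4], ZMask in imm8[3:0]) are taken from the Operation section of this entry. For the memory form the caller passes the loaded value as a one-element list.

```python
def insertps_model(src1, src2, imm8, src2_is_register=True):
    """Model (V)INSERTPS on lists of floats; element 0 is bits 31:0."""
    # Count_S is forced to 0 for the memory source form.
    count_s = ((imm8 >> 6) & 0x3) if src2_is_register else 0
    count_d = (imm8 >> 4) & 0x3
    zmask = imm8 & 0xF
    tmp2 = list(src1)                 # start from the first source
    tmp2[count_d] = src2[count_s]     # insert the selected element
    # Zeroing is applied last, so it can also clear the inserted element.
    return [0.0 if (zmask >> i) & 1 else tmp2[i] for i in range(4)]


# Take element 2 of src2, place it in element 1, and zero element 3.
imm8 = (2 << 6) | (1 << 4) | 0b1000
from_reg = insertps_model([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], imm8)
# Memory form: imm8[7:6] is ignored and the single loaded value is used.
from_mem = insertps_model([1.0, 2.0, 3.0, 4.0], [99.0], 3 << 6,
                          src2_is_register=False)
```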

128-bit Legacy SSE version: The first source register is an XMM register. The second source operand is either an XMM register or a 32-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.
VEX.128 encoded version: The destination and first source register are XMM registers. The second source operand is either an XMM register or a 32-bit memory location. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

If VINSERTPS is encoded with VEX.L = 1, an attempt to execute the instruction causes a #UD exception.

**Operation**

**VINSERTPS (VEX.128 encoded version)**

IF (SRC == REG) THEN COUNT_S ← imm8[7:6]
ELSE COUNT_S ← 0
COUNT_D ← imm8[5:4]
ZMASK ← imm8[3:0]
CASE (COUNT_S) OF
  0: TMP ← SRC2[31:0]
  1: TMP ← SRC2[63:32]
  2: TMP ← SRC2[95:64]
  3: TMP ← SRC2[127:96]
ESAC;

CASE (COUNT_D) OF
  0: TMP2[31:0] ← TMP
      TMP2[127:32] ← SRC1[127:32]
  1: TMP2[63:32] ← TMP
      TMP2[31:0] ← SRC1[31:0]
      TMP2[127:64] ← SRC1[127:64]
  2: TMP2[95:64] ← TMP
      TMP2[63:0] ← SRC1[63:0]
      TMP2[127:96] ← SRC1[127:96]
  3: TMP2[127:96] ← TMP
      TMP2[95:0] ← SRC1[95:0]
ESAC;

IF (ZMASK[0] == 1) THEN DEST[31:0] ← 00000000H
ELSE DEST[31:0] ← TMP2[31:0]
IF (ZMASK[1] == 1) THEN DEST[63:32] ← 00000000H
ELSE DEST[63:32] ← TMP2[63:32]
IF (ZMASK[2] == 1) THEN DEST[95:64] ← 00000000H
ELSE DEST[95:64] ← TMP2[95:64]
IF (ZMASK[3] == 1) THEN DEST[127:96] ← 00000000H
ELSE DEST[127:96] ← TMP2[127:96]
DEST[255:128] ← 0

INSERTPS (128-bit Legacy SSE version)
IF (SRC == REG) THEN COUNT_S ← imm8[7:6]
ELSE COUNT_S ← 0
COUNT_D ← imm8[5:4]
ZMASK ← imm8[3:0]
CASE (COUNT_S) OF
  0: TMP ← SRC[31:0]
  1: TMP ← SRC[63:32]
  2: TMP ← SRC[95:64]
  3: TMP ← SRC[127:96]
ESAC;

CASE (COUNT_D) OF
  0: TMP2[31:0] ← TMP
      TMP2[127:32] ← DEST[127:32]
  1: TMP2[63:32] ← TMP
      TMP2[31:0] ← DEST[31:0]
      TMP2[127:64] ← DEST[127:64]
  2: TMP2[95:64] ← TMP
      TMP2[63:0] ← DEST[63:0]
      TMP2[127:96] ← DEST[127:96]
  3: TMP2[127:96] ← TMP
      TMP2[95:0] ← DEST[95:0]
ESAC;

IF (ZMASK[0] == 1) THEN DEST[31:0] ← 00000000H
ELSE DEST[31:0] ← TMP2[31:0]
IF (ZMASK[1] == 1) THEN DEST[63:32] ← 00000000H
ELSE DEST[63:32] ← TMP2[63:32]
IF (ZMASK[2] == 1) THEN DEST[95:64] ← 00000000H
ELSE DEST[95:64] ← TMP2[95:64]
IF (ZMASK[3] == 1) THEN DEST[127:96] ← 00000000H
ELSE DEST[127:96] ← TMP2[127:96]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
INSERTPS __m128 _mm_insert_ps(__m128 dst, __m128 src, const int idx);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5

LDDQU- Move Unaligned Integer

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F F0 /r LDDQU xmm1, m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Load unaligned packed integer values from mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.F2.0F F0 /r VLDDQU xmm1, m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Load unaligned packed integer values from mem to xmm1</td>
</tr>
<tr>
<td>VEX.256.F2.0F F0 /r VLDDQU ymm1, m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Load unaligned packed integer values from mem to ymm1</td>
</tr>
</tbody>
</table>

Description
The instruction is functionally similar to VMOVDQU YMM, m256 for loading from memory. That is: 32 bytes of data starting at an address specified by the source memory operand (second operand) are fetched from memory and placed in a destination register (first operand). The source operand need not be aligned on a 32-byte boundary. Up to 64 bytes may be loaded from memory; this is implementation dependent.

This instruction may improve performance relative to VMOVDQU if the source operand crosses a cache line boundary. In situations that require the data loaded by VLDDQU be modified and stored to the same location, use VMOVDQU or VMOVDQA instead of VLDDQU. To move double quadwords to or from memory locations that are known to be aligned on 32-byte boundaries, use the VMOVDQA instruction.

Implementation Notes
• If the source is aligned to a 32-byte boundary, based on the implementation, the 32 bytes may be loaded more than once. For that reason, the usage of VLDDQU should be avoided when using uncached or write-combining (WC) memory regions. For uncached or WC memory regions, keep using VMOVDQU.
• This instruction is a replacement for VMOVDQU (load) in situations where cache line splits significantly affect performance. It should not be used in situations where store-load forwarding is performance critical. If performance of store-load forwarding is critical to the application, use VMOVDQA store-load pairs when data is 256-bit aligned or VMOVDQU store-load pairs when data is 256-bit unaligned.
• If the memory address is not aligned on 32-byte boundary, some implementations may load up to 64 bytes and return 32 bytes in the destination. Some processor implementations may issue multiple loads to access the appropriate 32 bytes. Developers of multi-threaded or multi-processor software should be aware that on these processors the loads will be performed in a non-atomic way.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

**Operation**

**VLDDQU (VEX.256 encoded version)**

DEST[255:0] ← SRC[255:0]

**VLDDQU (VEX.128 encoded version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

**LDDQU (128-bit Legacy SSE version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

LDDQU __m128i _mm_lddqu_si128 (__m128i * p);
VLDDQU __m256i _mm256_lddqu_si256 (__m256i * p);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 10;

Note: VEX-encoded instructions do not report #AC; treatment of #AC may vary if the instruction is not encoded with a VEX prefix.
VLDMXCSR—Load MXCSR Register

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.128.0F AE /2 VLDMXCSR m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Load MXCSR register from m32.</td>
</tr>
</tbody>
</table>

**Description**

Loads the source operand into the MXCSR control/status register. The source operand is a 32-bit memory location.

The VLDMXCSR instruction is typically used in conjunction with the VSTMXCSR instruction by software that uses instruction set extensions operating on the YMM state.

The default MXCSR value at reset is 1F80H.

If a VLDMXCSR instruction clears a SIMD floating-point exception mask bit and sets the corresponding exception flag bit, a SIMD floating-point exception will not be immediately generated. The exception will be generated only upon the execution of the next instruction that meets both conditions below:

- the instruction must operate on an XMM or YMM register operand,
- the instruction causes that particular SIMD floating-point exception to be reported.
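To make the reset value and the flag/mask relationship concrete, the following Python sketch decodes an MXCSR image. It assumes the MXCSR bit layout documented in the Intel 64 and IA-32 Architectures Software Developer's Manual (exception flags in bits 5:0, exception masks in bits 12:7, rounding control in bits 14:13); the names `EXCEPTIONS`, `flagged_and_unmasked` and `rounding_control` are hypothetical and exist only for this illustration.

```python
MXCSR_RESET = 0x1F80

# (name, flag bit, mask bit) for the six SIMD floating-point exceptions.
EXCEPTIONS = [
    ("invalid", 0, 7),
    ("denormal", 1, 8),
    ("divide-by-zero", 2, 9),
    ("overflow", 3, 10),
    ("underflow", 4, 11),
    ("precision", 5, 12),
]


def flagged_and_unmasked(mxcsr):
    """List exceptions whose flag is set while their mask is clear."""
    return [
        name
        for name, flag_bit, mask_bit in EXCEPTIONS
        if ((mxcsr >> flag_bit) & 1) and not ((mxcsr >> mask_bit) & 1)
    ]


def rounding_control(mxcsr):
    """Return the 2-bit rounding control field (0 is round to nearest)."""
    return (mxcsr >> 13) & 0x3


# An image that unmasks overflow (clears bit 10) and sets its flag (bit 3).
image = (MXCSR_RESET & ~(1 << 10)) | (1 << 3)
at_reset = flagged_and_unmasked(MXCSR_RESET)
after_load = flagged_and_unmasked(image)
```

At reset all six exceptions are masked and no flag is set, so `at_reset` is empty. Loading `image` leaves overflow flagged and unmasked; per the text above, no exception is raised by the load itself.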

This instruction’s operation is the same in non-64-bit modes and 64-bit mode. If VLDMXCSR is encoded with VEX.L = 1, an attempt to execute the instruction causes a #UD exception.

**Operation**

MXCSR ← m32;

**C/C++ Compiler Intrinsic Equivalent**

_mm_setcsr(unsigned int i)

**SIMD Floating-Point Exceptions**

None.

**Other Exceptions**

See Exceptions Type 9; additionally

#GP For an attempt to set reserved bits in MXCSR
MASKMOVDQU- Store Selected Bytes of Double Quadword with NT Hint

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F F7 /r MASKMOVDQU xmm1, xmm2</td>
<td>V/V</td>
<td>SSE2</td>
<td>Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI</td>
</tr>
<tr>
<td>VEX.128.66.0F F7 /r VMASKMOVDQU xmm1, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Selectively write bytes from xmm1 to memory location using the byte mask in xmm2. The default memory location is specified by DS:DI/EDI/RDI</td>
</tr>
</tbody>
</table>

Description

Stores selected bytes from the source operand (first operand) into a 128-bit memory location. The mask operand (second operand) selects which bytes from the source operand are written to memory. The source and mask operands are XMM registers. The location of the first byte of the memory location is specified by DI/EDI/RDI and DS registers. The memory location does not need to be aligned on a natural boundary. (The size of the store address depends on the address-size attribute.)

The most significant bit in each byte of the mask operand determines whether the corresponding byte in the source operand is written to the corresponding byte location in memory: 0 indicates no write and 1 indicates write.
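The byte-selection rule can be modelled in a few lines of Python. This is only a sketch of the data movement: `maskmovdqu_model` is a hypothetical helper, memory is a `bytearray`, `address` plays the role of the DS:DI/EDI/RDI pointer, and the non-temporal hint, ordering, and fault behavior are not modelled.

```python
def maskmovdqu_model(memory, address, src, mask):
    """Write src[i] to memory[address + i] where bit 7 of mask[i] is set."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("source and mask must both be 16 bytes")
    for i in range(16):
        if mask[i] & 0x80:            # only the most significant bit counts
            memory[address + i] = src[i]
        # otherwise the memory byte is left unchanged


memory = bytearray(b"." * 20)
src = bytes(range(65, 81))                    # b"ABCDEFGHIJKLMNOP"
mask = bytes([0x80, 0x7F, 0xFF, 0x00] * 4)    # select bytes 0, 2, 4, ...
maskmovdqu_model(memory, 2, src, mask)
```

The mask byte 0x7F selects nothing even though seven of its bits are set, while 0x80 and 0xFF both select their byte.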

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the IA-32 Intel® Architecture Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVDQU instructions if multiple processors might use different memory types to read/write the destination memory locations.

Behavior with a mask of all 0s is as follows:

- No data will be written to memory.
- Signaling of breakpoints (code or data) is not guaranteed; different processor implementations may signal or not signal these breakpoints.
- Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).
- If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.

Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s.

The MASKMOVDQU instruction can be used to improve performance of algorithms that need to merge data on a byte-by-byte basis. MASKMOVDQU should not cause a read for ownership; doing so generates unnecessary bandwidth since data is to be written directly using the bytemask without allocating old data prior to the store.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

If VMASKMOVDQU is encoded with VEX.L = 1, an attempt to execute the instruction causes a #UD exception.

Operation

**MASKMOVDQU**

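IF (MASK[7] = 1)
 THEN DEST[DS:DI/EDI/RDI] ← SRC[7:0] ELSE (* Memory location unchanged *); FI;
IF (MASK[15] = 1)
 THEN DEST[DS:DI/EDI/RDI + 1] ← SRC[15:8] ELSE (* Memory location unchanged *); FI;
(* Repeat operation for 3rd through 14th bytes in source operand *)
IF (MASK[127] = 1)
 THEN DEST[DS:DI/EDI/RDI + 15] ← SRC[127:120] ELSE (* Memory location unchanged *); FI;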

Intel C/C++ Compiler Intrinsic Equivalent

void _mm_maskmoveu_si128(__m128i d, __m128i n, char * p)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L= 1.

 If VEX.vvvv != 1111B.
### VMASKMOV- Conditional SIMD Packed Loads and Stores

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.128.66.0F38 2C /r VMASKMOVPS xmm1, xmm2, m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally load packed single-precision values from m128 using mask in xmm2 and store in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2C /r VMASKMOVPS ymm1, ymm2, m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally load packed single-precision values from m256 using mask in ymm2 and store in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2D /r VMASKMOVPD xmm1, xmm2, m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally load packed double-precision values from m128 using mask in xmm2 and store in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2D /r VMASKMOVPD ymm1, ymm2, m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally load packed double-precision values from m256 using mask in ymm2 and store in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2E /r VMASKMOVPS m128, xmm1, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally store packed single-precision values from xmm2 using mask in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2E /r VMASKMOVPS m256, ymm1, ymm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally store packed single-precision values from ymm2 using mask in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2F /r VMASKMOVPD m128, xmm1, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally store packed double-precision values from xmm2 using mask in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2F /r VMASKMOVPD m256, ymm1, ymm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Conditionally store packed double-precision values from ymm2 using mask in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Conditionally moves packed data elements from the second source operand into the corresponding data element of the destination operand, depending on the mask bits associated with each data element. The mask bits are specified in the first source operand.

The mask bit for each data element is the most significant bit of that element in the first source operand. If a mask bit is 1, the corresponding data element is copied from the second source operand to the destination operand. If the mask bit is 0, the corresponding data element is set to zero in the load form of these instructions, and unmodified in the store form.

The second source operand is a memory address for the load form of these instructions. The destination operand is a memory address for the store form of these instructions. The other operands are both XMM registers (for VEX.128 version) or YMM registers (for VEX.256 version).

Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero.
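The load and store forms, including the rule that masked-off elements are never accessed, can be sketched in Python as follows. The helpers `maskload_model` and `maskstore_model` are hypothetical; memory is a Python list of elements, the mask is a list of signed integers where a negative value stands for "most significant bit set", and an `IndexError` plays the role of a memory fault.

```python
def maskload_model(memory, base, mask):
    """Load one element per mask entry; unselected elements become 0.0."""
    dest = []
    for i in range(len(mask)):
        if mask[i] < 0:
            dest.append(memory[base + i])   # only selected elements are read
        else:
            dest.append(0.0)                # memory is not touched at all
    return dest


def maskstore_model(memory, base, mask, src):
    """Store src[i] for each selected element; others stay unmodified."""
    for i in range(len(mask)):
        if mask[i] < 0:
            memory[base + i] = src[i]


memory = [1.0, 2.0, 3.0]
# Element 3 would lie past the end of memory, but its mask entry is clear,
# so the model never reads it and no IndexError is raised.
loaded = maskload_model(memory, 0, [-1, 0, -1, 0])
maskstore_model(memory, 0, [0, -1, 0, 0], [9.0, 8.0, 7.0, 6.0])
```

Selecting the fourth element in the same call would raise `IndexError` in this model, mirroring the statement that faults arise only from accesses the mask actually requires.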

Unlike previous MASKMOV instructions (MASKMOVQ and MASKMOVDQU), a nontemporal hint is not applied to these instructions.

Instruction behavior on alignment check reporting with mask bits of less than all 1s is the same as with mask bits of all 1s.

VMASKMOV should not be used to access memory mapped I/O as the ordering of the individual loads or stores it does is implementation specific.

In cases where mask bits indicate data should not be loaded or stored, paging A and D bits will be set in an implementation-dependent way. However, A and D bits are always set for pages where data is actually loaded/stored.

Note: for load forms, the first source (the mask) is encoded in VEX.vvvv; the second source is encoded in rm_field, and the destination register is encoded in reg_field.

Note: for store forms, the first source (the mask) is encoded in VEX.vvvv; the second source register is encoded in reg_field, and the destination memory location is encoded in rm_field.

**Operation**

**VMASKMOVPS - 256-bit load**

DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[159:128] ← IF (SRC1[159]) Load_32(mem + 16) ELSE 0
DEST[191:160] ← IF (SRC1[191]) Load_32(mem + 20) ELSE 0
DEST[223:192] ← IF (SRC1[223]) Load_32(mem + 24) ELSE 0
DEST[255:224] ← IF (SRC1[255]) Load_32(mem + 28) ELSE 0

**VMASKMOVPS - 128-bit load**

DEST[31:0] ← IF (SRC1[31]) Load_32(mem) ELSE 0
DEST[63:32] ← IF (SRC1[63]) Load_32(mem + 4) ELSE 0
DEST[95:64] ← IF (SRC1[95]) Load_32(mem + 8) ELSE 0
DEST[127:96] ← IF (SRC1[127]) Load_32(mem + 12) ELSE 0
DEST[255:128] ← 0

VMASKMOVPD - 256-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[191:128] ← IF (SRC1[191]) Load_64(mem + 16) ELSE 0
DEST[255:192] ← IF (SRC1[255]) Load_64(mem + 24) ELSE 0

VMASKMOVPD - 128-bit load
DEST[63:0] ← IF (SRC1[63]) Load_64(mem) ELSE 0
DEST[127:64] ← IF (SRC1[127]) Load_64(mem + 8) ELSE 0
DEST[255:128] ← 0

VMASKMOVPS - 256-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]
IF (SRC1[159]) DEST[159:128] ← SRC2[159:128]
IF (SRC1[191]) DEST[191:160] ← SRC2[191:160]
IF (SRC1[223]) DEST[223:192] ← SRC2[223:192]
IF (SRC1[255]) DEST[255:224] ← SRC2[255:224]

VMASKMOVPS - 128-bit store
IF (SRC1[31]) DEST[31:0] ← SRC2[31:0]
IF (SRC1[63]) DEST[63:32] ← SRC2[63:32]
IF (SRC1[95]) DEST[95:64] ← SRC2[95:64]
IF (SRC1[127]) DEST[127:96] ← SRC2[127:96]

VMASKMOVPD - 256-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]
IF (SRC1[191]) DEST[191:128] ← SRC2[191:128]
IF (SRC1[255]) DEST[255:192] ← SRC2[255:192]

VMASKMOVPD - 128-bit store
IF (SRC1[63]) DEST[63:0] ← SRC2[63:0]
IF (SRC1[127]) DEST[127:64] ← SRC2[127:64]

Intel C/C++ Compiler Intrinsic Equivalent

__m256 _mm256_maskload_ps(float const *a, __m256i mask);
void _mm256_maskstore_ps(float *a, __m256i mask, __m256 b);
__m256d _mm256_maskload_pd(double *a, __m256i mask);
void _mm256_maskstore_pd(double *a, __m256i mask, __m256d b);
__m128 _mm_maskload_ps(float const *a, __m128i mask);
void _mm_maskstore_ps(float *a, __m128i mask, __m128 b);
__m128d _mm_maskload_pd(double *a, __m128i mask);
void _mm_maskstore_pd(double *a, __m128i mask, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6 (No #AC reported for any mask bit combinations)
MAXPD- Maximum of Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5F /r MAXPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the maximum double-precision floating-point values between xmm1 and xmm2/m128</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 5F /r VMAXPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum double-precision floating-point values between xmm2 and xmm3/m128</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 5F /r VMAXPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum packed double-precision floating-point values between ymm2 and ymm3/m256</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MAXPD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}
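The special cases in the MAX pseudocode all resolve to "return the second operand", which a single comparison reproduces. The Python sketch below shows this per-element behavior; `max_model` is a hypothetical name, and the model does not distinguish SNaN from QNaN or signal any exception, since Python floats expose neither.

```python
import math


def max_model(src1, src2):
    """Per-element MAX as in the pseudocode: ties and NaNs yield src2."""
    # src1 > src2 is False when either value is NaN and when both are
    # zeros of any sign, so every special case falls through to src2.
    return src1 if src1 > src2 else src2


both_zero = max_model(0.0, -0.0)         # second operand, i.e. -0.0
nan_first = max_model(math.nan, 1.0)     # second operand, i.e. 1.0
nan_second = max_model(1.0, math.nan)    # second operand, i.e. NaN
ordinary = max_model(3.0, 2.0)
```

One consequence is that the operation is not commutative: swapping the operands changes the result whenever exactly one of them is a NaN or the two are zeros of opposite sign.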

VMAXPD (VEX.256 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MAX(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MAX(SRC1[255:192], SRC2[255:192])

VMAXPD (VEX.128 encoded version)
DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MAX(SRC1[127:64], SRC2[127:64])
DEST[255:128] ← 0

MAXPD (128-bit Legacy SSE version)
DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[127:64] ← MAX(DEST[127:64], SRC[127:64])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VMAXPD __m256d _mm256_max_pd (__m256d a, __m256d b);
(V)MAXPD __m128d _mm_max_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Invalid (including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 2
**MAXPS- Maximum of Packed Single Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5F /r MAXPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the maximum single-precision floating-point values between xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 5F /r VMAXPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum single-precision floating-point values between xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 5F /r VMAXPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum single-precision floating-point values between ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the maximum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MAXPS can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.
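One possible reading of the "comparison followed by AND, ANDN and OR" sequence is sketched below in Python on 32-bit patterns, one element at a time. The particular choice of an unordered self-compare of the first operand to build the mask is an assumption of this sketch, not a sequence given in this document, and the helper names (`f32_bits`, `bits_f32`, `max_element`, `nan_propagating_max`) are hypothetical.

```python
import math
import struct


def f32_bits(x):
    """Return the IEEE-754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Return the float whose single-precision bit pattern is b."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def max_element(src1, src2):
    """Plain MAX behavior: the second operand wins on NaN or tie."""
    return src1 if src1 > src2 else src2


def nan_propagating_max(src1, src2):
    """Return a NaN if either operand is a NaN, else the plain maximum."""
    # Comparison: all-ones mask when src1 is unordered with itself (a NaN).
    mask = 0xFFFFFFFF if src1 != src1 else 0
    plain = f32_bits(max_element(src1, src2))
    kept_nan = f32_bits(src1) & mask              # AND
    kept_max = plain & ~mask & 0xFFFFFFFF         # ANDN
    return bits_f32(kept_nan | kept_max)          # OR


first_nan = nan_propagating_max(math.nan, 1.0)
second_nan = nan_propagating_max(1.0, math.nan)
no_nan = nan_propagating_max(2.0, 3.0)
```

A NaN in the second operand is already returned by the plain maximum, so the mask only needs to cover the case where the first operand is the NaN.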

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMAXPS (VEX.256 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MAX(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MAX(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MAX(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MAX(SRC1[255:224], SRC2[255:224])

VMAXPS (VEX.128 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MAX(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MAX(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MAX(SRC1[127:96], SRC2[127:96])
DEST[255:128] ← 0

MAXPS (128-bit Legacy SSE version)
DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])
DEST[63:32] ← MAX(DEST[63:32], SRC[63:32])
DEST[95:64] ← MAX(DEST[95:64], SRC[95:64])
DEST[127:96] ← MAX(DEST[127:96], SRC[127:96])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VMAXPS __m256 _mm256_max_ps (__m256 a, __m256 b);
MAXPS __m128 _mm_max_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
Invalid (including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 2

MAXSD- Return Maximum Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 5F /r MAXSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the maximum scalar double-precision floating-point value between xmm2/mem64 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 5F /r VMAXSD xmm1, xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum scalar double-precision floating-point value between xmm3/mem64 and xmm2.</td>
</tr>
</tbody>
</table>

Description

Compares the low double-precision floating-point values in the first source operand and the second source operand, and returns the maximum value to the low quadword of the destination operand. When the second source operand is a memory operand, only 64 bits are accessed. The high quadword of the destination operand is copied from the same bits of the first source operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN of either source operand be returned, the action of MAXSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VMAXSD is encoded with VEX.L=0. Encoding VMAXSD with VEX.L=1 may result in unpredictable behavior across different processor generations.
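
The difference between the two encodings lies in which destination bits are rewritten. The sketch below models that difference in plain C under stated assumptions: `ymm_lanes64` is a hypothetical stand-in for a 256-bit register (four 64-bit lanes, `q[0]` holding bits 63:0), and all function names are illustrative rather than part of any Intel API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 256-bit register model: q[0] is bits 63:0, q[3] is bits 255:192. */
typedef struct { uint64_t q[4]; } ymm_lanes64;

/* Reinterpret a 64-bit lane as a double and back. */
double lane_to_double(uint64_t bits) { double d; memcpy(&d, &bits, sizeof d); return d; }
uint64_t double_to_lane(double d) { uint64_t bits; memcpy(&bits, &d, sizeof bits); return bits; }

/* Low-lane maximum: SRC2 is returned unless SRC1 > SRC2. */
uint64_t max_low(uint64_t src1, uint64_t src2)
{
    return (lane_to_double(src1) > lane_to_double(src2)) ? src1 : src2;
}

/* Model of VMAXSD xmm1, xmm2, xmm3/m64 (VEX.128): every destination bit is written. */
ymm_lanes64 vmaxsd_model(ymm_lanes64 src1, uint64_t src2_low)
{
    ymm_lanes64 dest;
    dest.q[0] = max_low(src1.q[0], src2_low);  /* bits 63:0    */
    dest.q[1] = src1.q[1];                     /* bits 127:64  */
    dest.q[2] = 0;                             /* bits 255:128 */
    dest.q[3] = 0;
    return dest;
}

/* Model of MAXSD xmm1, xmm2/m64 (legacy SSE): only bits 63:0 change. */
void maxsd_model(ymm_lanes64 *dest, uint64_t src_low)
{
    dest->q[0] = max_low(dest->q[0], src_low);
}
```

In this model the legacy form leaves bits 255:64 of the destination untouched, while the VEX form merges bits 127:64 from the first source and clears the upper half.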

**Operation**

MAX(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

**VMAXSD (VEX.128 encoded version)**

DEST[63:0] ← MAX(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**MAXSD (128-bit Legacy SSE version)**

DEST[63:0] ← MAX(DEST[63:0], SRC[63:0])
DEST[255:64] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

MAXSD __m128d _mm_max_sd(__m128d a, __m128d b)

**SIMD Floating-Point Exceptions**

Invalid (Including QNaN Source Operand), Denormal

**Other Exceptions**

See Exceptions Type 3

MAXSS- Return Maximum Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5F /r MAXSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the maximum scalar single-precision floating-point value between xmm2/mem32 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 5F /r VMAXSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the maximum scalar single-precision floating-point value between xmm3/mem32 and xmm2.</td>
</tr>
</tbody>
</table>

Description

Compares the low single-precision floating-point values in the first source operand and the second source operand, and returns the maximum value to the low double-word of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN from either source operand be returned, the action of MAXSS can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VMAXSS is encoded with VEX.L=0. Encoding VMAXSS with VEX.L=1 may result in unpredictable behavior across different processor generations.

Operation

MAX(SRC1, SRC2)
{
   IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
   ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
   ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
   ELSE IF (SRC1 > SRC2) THEN DEST ← SRC1;
   ELSE DEST ← SRC2;
   FI;
}

VMAXSS (VEX.128 encoded version)
DEST[31:0] ← MAX(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

MAXSS (128-bit Legacy SSE version)
DEST[31:0] ← MAX(DEST[31:0], SRC[31:0])
DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
MAXSS __m128 _mm_max_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
Invalid (Including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 3

MINPD- Minimum of Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5D /r MINPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the minimum double-precision floating-point values between xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 5D /r VMINPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum double-precision floating-point values between xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 5D /r VMINPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum packed double-precision floating-point values between ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD compare of the packed double-precision floating-point values in the first source operand and the second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MINPD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is the same as the first source XMM register, and the upper bits (255:128) of the corresponding YMM register destination are unmodified.
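
Because the second operand is forwarded whenever the comparison does not favor the first, the result can depend on operand order. The sketch below illustrates this with a plain C model of the packed operation; `min_lane`, `minpd_model` and `min_order_matters` are hypothetical names chosen for illustration, and the lane count is passed explicitly (2 for the 128-bit forms, 4 for the VEX.256 form) as a modeling assumption.

```c
#include <stddef.h>
#include <string.h>

/* One lane of MIN(SRC1, SRC2): SRC2 is returned unless SRC1 < SRC2. */
double min_lane(double src1, double src2)
{
    return (src1 < src2) ? src1 : src2;
}

/* Packed model: applies min_lane independently to each pair of lanes. */
void minpd_model(double *dest, const double *src1, const double *src2, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dest[i] = min_lane(src1[i], src2[i]);
}

/* Returns 1 if swapping the operands changes the bit pattern of the result. */
int min_order_matters(double a, double b)
{
    double r1 = min_lane(a, b);
    double r2 = min_lane(b, a);
    return memcmp(&r1, &r2, sizeof r1) != 0;
}
```

In this model, zeros of opposite sign are one input pair for which the two operand orders give different bit patterns; a pair with exactly one NaN behaves the same way.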
Operation

MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINPD (VEX.256 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[191:128] ← MIN(SRC1[191:128], SRC2[191:128])
DEST[255:192] ← MIN(SRC1[255:192], SRC2[255:192])

VMINPD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[255:128] ← 0

MINPD (128-bit Legacy SSE version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← MIN(SRC1[127:64], SRC2[127:64])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VMINPD __m256d _mm256_min_pd (__m256d a, __m256d b);
MINPD __m128d _mm_min_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Invalid (including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 2

MINPS- Minimum of Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5D /r MINPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the minimum single-precision floating-point values between xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 5D /r VMINPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum single-precision floating-point values between xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 5D /r VMINPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum packed single-precision floating-point values between ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

### Description
Performs a SIMD compare of the packed single-precision floating-point values in the first source operand and the second source operand and returns the minimum value for each pair of values to the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned. If a value in the second operand is an SNaN, that SNaN is forwarded unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second operand (source operand), either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second operand) be returned, the action of MINPS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN and OR.

**VEX.256 encoded version:** The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

**VEX.128 encoded version:** The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

**128-bit Legacy SSE version:** The second source can be an XMM register or a 128-bit memory location. The destination is the same as the first source XMM register, and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINPS (VEX.256 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[159:128] ← MIN(SRC1[159:128], SRC2[159:128])
DEST[191:160] ← MIN(SRC1[191:160], SRC2[191:160])
DEST[223:192] ← MIN(SRC1[223:192], SRC2[223:192])
DEST[255:224] ← MIN(SRC1[255:224], SRC2[255:224])

VMINPS (VEX.128 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[255:128] ← 0

MINPS (128-bit Legacy SSE version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[63:32] ← MIN(SRC1[63:32], SRC2[63:32])
DEST[95:64] ← MIN(SRC1[95:64], SRC2[95:64])
DEST[127:96] ← MIN(SRC1[127:96], SRC2[127:96])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VMINPS __m256 _mm256_min_ps (__m256 a, __m256 b);
MINPS __m128 _mm_min_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
Invalid (including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 2

### MINSD- Return Minimum Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 5D /r MINSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the minimum scalar double precision floating-point value between xmm2/mem64 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 5D /r VMINSD xmm1, xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum scalar double precision floating-point value between xmm3/mem64 and xmm2.</td>
</tr>
</tbody>
</table>

**Description**

Compares the low double-precision floating-point values in the first source operand and the second source operand, and returns the minimum value to the low quadword of the destination operand. When the second source operand is a memory operand, only 64 bits are accessed. The high quadword of the destination operand is copied from the same bits in the first source operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN source operand (from either the first or second source) be returned, the action of MINSD can be emulated using a sequence of instructions, such as, a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VMINSD is encoded with VEX.L=0. Encoding VMINSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
MIN(SRC1, SRC2)
{
    IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
    ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
    ELSE DEST ← SRC2;
    FI;
}

VMINSD (VEX.128 encoded version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

MINSD (128-bit Legacy SSE version)
DEST[63:0] ← MIN(SRC1[63:0], SRC2[63:0])
DEST[255:64] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
MINSD __m128d _mm_min_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions
Invalid (including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 3

MINSS- Return Minimum Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5D /r MINSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the minimum scalar single precision floating-point value between xmm2/mem32 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 5D /r VMINSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the minimum scalar single precision floating-point value between xmm3/mem32 and xmm2.</td>
</tr>
</tbody>
</table>

Description

Compares the low single-precision floating-point values in the first source operand and the second source operand and returns the minimum value to the low double-word of the destination operand.

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second operand is an SNaN, that SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).

If only one value is a NaN (SNaN or QNaN) for this instruction, the second source operand, either a NaN or a valid floating-point value, is written to the result. If instead of this behavior, it is required that the NaN in either source operand be returned, the action of MINSS can be emulated using a sequence of instructions, such as a comparison followed by AND, ANDN and OR.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VMINSS is encoded with VEX.L=0. Encoding VMINSS with VEX.L=1 may result in unpredictable behavior across different processor generations.

Operation

MIN(SRC1, SRC2)
{ 
  IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST ← SRC2;
  ELSE IF (SRC1 = SNaN) THEN DEST ← SRC2; FI;
  ELSE IF (SRC2 = SNaN) THEN DEST ← SRC2; FI;
  ELSE IF (SRC1 < SRC2) THEN DEST ← SRC1;
  ELSE DEST ← SRC2;
  FI;
}

VMINSS (VEX.128 encoded version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

MINSS (128-bit Legacy SSE version)
DEST[31:0] ← MIN(SRC1[31:0], SRC2[31:0])
DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
MINSS __m128 _mm_min_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
Invalid (Including QNaN Source Operand), Denormal

Other Exceptions
See Exceptions Type 3

MOVAPD- Move Aligned Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 28 /r MOVAPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move aligned packed double-precision floating-point values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>66 0F 29 /r MOVAPD xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.66.0F 28 /r VMOVAPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed double-precision floating-point values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 29 /r VMOVAPD xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed double-precision floating-point values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.66.0F 28 /r VMOVAPD ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed double-precision floating-point values from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 29 /r VMOVAPD ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed double-precision floating-point values from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Moves 2 or 4 double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM or YMM register from a 128-bit or 256-bit memory location, to store the contents of an XMM or YMM register into a 128-bit or 256-bit memory location, or to move data between two XMM or two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version) or 32-byte (VEX.256 encoded version) boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

VEX.256 encoded version:
Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit versions:
Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move double-precision floating-point values to and from unaligned memory locations, use the VMOVUPD instruction.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.
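
Since a misaligned memory operand raises #GP, software may want to verify alignment before choosing the aligned form. The following is a minimal sketch in portable C, assuming the boundary is a power of two (16 or 32 here); `is_aligned` and `align_up` are hypothetical helper names, not part of any Intel-provided interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p lies on the given boundary: 16 bytes for the 128-bit
   forms, 32 bytes for the VEX.256 form. */
int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p % boundary) == 0;
}

/* Rounds p up to the next multiple of boundary.
   Assumes boundary is a power of two and that the underlying buffer
   has at least boundary - 1 spare bytes. */
void *align_up(void *p, size_t boundary)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)boundary - 1;
    return (void *)((addr + mask) & ~mask);
}
```

A caller could over-allocate a byte buffer by 31 bytes, pass it through `align_up(buf, 32)`, and use the result as a 256-bit memory operand; when `is_aligned` returns 0, the unaligned move is the safer choice.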

Operation

VMOVAPD (VEX.256 encoded version)
DEST[255:0] ← SRC[255:0]

VMOVAPD (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

MOVAPD (128-bit load- and register-copy-form Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

(V)MOVAPD (128-bit store-form version)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVAPD __m256d _mm256_load_pd (double const * p);
VMOVAPD void _mm256_store_pd(double * p, __m256d a);
MOVAPD __m128d _mm_load_pd (double const * p);
MOVAPD void _mm_store_pd(double * p, __m128d a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 1; additionally
#UD If VEX.vvvv != 1111B.

MOVAPS - Move Aligned Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 28 /r MOVAPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Move aligned packed single-precision floating-point values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>0F 29 /r MOVAPS xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.0F 28 /r VMOVAPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed single-precision floating-point values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 29 /r VMOVAPS xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed single-precision floating-point values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.0F 28 /r VMOVAPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed single-precision floating-point values from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.0F 29 /r VMOVAPS ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed single-precision floating-point values from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Moves 4 or 8 single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM or YMM register from a 128-bit or 256-bit memory location, to store the contents of an XMM or YMM register into a 128-bit or 256-bit memory location, or to move data between two XMM or two YMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version) or 32-byte (VEX.256 encoded version) boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

VEX.256 encoded version:
Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

128-bit versions:
Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers. When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move single-precision floating-point values to and from unaligned memory locations, use the VMOVUPS instruction.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Operation
VMOVAPS (VEX.256 encoded version)
DEST[255:0] ← SRC[255:0]

VMOVAPS (VEX.128 encoded version)
DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

MOVAPS (128-bit load- and register-copy- form Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

(V)MOVAPS (128-bit store form)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent
VMOVAPS __m256 _mm256_load_ps (float const * p);
VMOVAPS void _mm256_store_ps(float * p, __m256 a);
MOVAPS __m128 _mm_load_ps (float const * p);
MOVAPS void _mm_store_ps(float * p, __m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 1; additionally
#UD If VEX.vvvv != 1111B.

MOVD/MOVQ - Move Doubleword and Quadword

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 6E /r MOVD xmm1, r32/m32</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move doubleword from r/m32 to xmm1</td>
</tr>
<tr>
<td>66 REX.W 0F 6E /r MOVQ xmm1, r64/m64</td>
<td>V/N.E.</td>
<td>SSE2</td>
<td>Move quadword from r/m64 to xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F.W0 6E /r VMOVD xmm1, r32/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Move doubleword from r/m32 to xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F.W1 6E /r VMOVQ xmm1, r64/m64</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Move quadword from r/m64 to xmm1</td>
</tr>
<tr>
<td>66 0F 7E /r MOVD r32/m32, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move doubleword from xmm1 register to r/m32</td>
</tr>
<tr>
<td>66 REX.W 0F 7E /r MOVQ r64/m64, xmm1</td>
<td>V/N.E.</td>
<td>SSE2</td>
<td>Move quadword from xmm1 register to r/m64</td>
</tr>
<tr>
<td>VEX.128.66.0F.W0 7E /r VMOVD r32/m32, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move doubleword from xmm1 register to r/m32</td>
</tr>
<tr>
<td>VEX.128.66.0F.W1 7E /r VMOVQ r64/m64, xmm1</td>
<td>V/N.E.</td>
<td>AVX</td>
<td>Move quadword from xmm1 register to r/m64</td>
</tr>
</tbody>
</table>

Description

MOVD/Q with XMM destination:
Moves a dword integer from the source operand and stores it in the low 32 bits of the destination XMM register. The upper bits of the destination are zeroed. The source operand can be a 32-bit register or 32-bit memory location. A REX.W prefix promotes this to copy qword integers.
128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

MOVD/Q with r32/m32 or r64/m64 destination:
Stores 32 (64) bits from the low bits of the source XMM register.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
An attempt to execute VMOVD or VMOVQ encoded with VEX.L = 1 will cause a #UD exception.
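
The zeroing behavior of the doubleword forms can be sketched in plain C. This is an illustrative model only: `ymm_lanes32` is a hypothetical stand-in for a 256-bit register (eight 32-bit lanes, `d[0]` holding bits 31:0), and the function names are assumptions made for the example.

```c
#include <stdint.h>

/* Hypothetical 256-bit register model: d[0] is bits 31:0, d[7] is bits 255:224. */
typedef struct { uint32_t d[8]; } ymm_lanes32;

/* Model of MOVD xmm1, r32/m32 (legacy SSE):
   bits 127:32 are zeroed, bits 255:128 are left unmodified. */
void movd_model(ymm_lanes32 *dest, uint32_t src)
{
    dest->d[0] = src;
    dest->d[1] = 0;
    dest->d[2] = 0;
    dest->d[3] = 0;
}

/* Model of VMOVD xmm1, r32/m32 (VEX.128): bits 255:32 are zeroed. */
void vmovd_model(ymm_lanes32 *dest, uint32_t src)
{
    dest->d[0] = src;
    for (int i = 1; i < 8; i++)
        dest->d[i] = 0;
}

/* Model of MOVD r32/m32, xmm1: the low doubleword of the source is stored. */
uint32_t movd_store_model(const ymm_lanes32 *src)
{
    return src->d[0];
}
```

The quadword forms follow the same pattern with 64-bit lanes.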

Operation
MOVD (Legacy SSE version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[127:32] ← 0H
DEST[255:128] (Unmodified)

VMOVD (VEX-encoded version when destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[255:32] ← 0H

MOVQ (Legacy SSE version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0H
DEST[255:128] (Unmodified)

VMOVQ (VEX-encoded version when destination is an XMM register)
DEST[63:0] ← SRC[63:0]
DEST[255:64] ← 0H

MOVD / VMOVD (when destination is not an XMM register)
DEST[31:0] ← SRC[31:0]

MOVQ / VMOVQ (when destination is not an XMM register)
DEST[63:0] ← SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent
MOVD __m128i _mm_cvtsi32_si128(int a)
MOVD int _mm_cvtsi128_si32(__m128i a)
MOVQ __m128i _mm_cvtsi64_si128(__int64 a)
MOVQ __int64 _mm_cvtsi128_si64(__m128i a)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

MOVQ- Move Quadword

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 7E /r</td>
<td>MOVQ xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move quadword from xmm2/m64 to xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F 7E /r</td>
<td>VMOVQ xmm1, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Move quadword from xmm2 to xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F 7E /r</td>
<td>VMOVQ xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Load quadword from m64 to xmm1</td>
</tr>
<tr>
<td>66 0F D6 /r</td>
<td>MOVQ xmm1/m64, xmm2</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move quadword from xmm2 register to xmm1/m64</td>
</tr>
<tr>
<td>VEX.128.66.0F D6 /r</td>
<td>VMOVQ xmm1/m64, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Move quadword from xmm2 register to xmm1/m64</td>
</tr>
</tbody>
</table>

Description

Copies a quadword from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be an XMM register or a 64-bit memory location. This instruction can be used to move data between two XMM registers or between an XMM register and a 64-bit memory location. The instruction cannot be used to transfer data between memory locations.

When the source operand is an XMM register, the low quadword is moved; when the destination operand is an XMM register, the quadword is stored to the low quadword of the register, and the high quadword is cleared to all 0s.

Note: In VEX.128.66.0F D6 instruction version, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Note: In VEX.128.F3.0F 7E version, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

If VMOVQ is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

Operation

MOVQ (F3 0F 7E and 66 0F D6) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0
DEST[255:128] (Unmodified)

VMOVQ (VEX.128.F3.0F 7E) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[255:64] ← 0

VMOVQ (VEX.128.66.0F D6) with XMM register source and destination:
DEST[63:0] ← SRC[63:0]
DEST[255:64] ← 0

MOVQ (7E) with memory source:
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0000000000000000H
DEST[255:128] (Unmodified)

VMOVQ (7E) with memory source:
DEST[63:0] ← SRC[63:0]
DEST[255:64] ← 0000000000000000H

MOVQ (D6) with memory dest:
DEST[63:0] ← SRC[63:0]

VMOVQ (D6) with memory dest:
DEST[63:0] ← SRC[63:0]

Intel C/C++ Compiler Intrinsic Equivalent
MOVQ __m128i _mm_move_epi64(__m128i a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally

#UD If VEX.L = 1.
If VEX.vvvv != 1111B.
MOVDDUP- Replicate Double FP Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 12 /r MOVDDUP xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE3</td>
<td>Move double-precision floating-point values from xmm2/mem and duplicate into xmm1</td>
</tr>
<tr>
<td>VEX.128.F2.0F 12 /r VMOVDDUP xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Move double-precision floating-point values from xmm2/mem and duplicate into xmm1</td>
</tr>
<tr>
<td>VEX.256.F2.0F 12 /r VMOVDDUP ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move even index double-precision floating-point values from ymm2/mem and duplicate each element into ymm1</td>
</tr>
</tbody>
</table>

Description

VEX.256 encoded version:
Duplicates even-indexed double-precision floating-point values from the source operand (second operand).

128-bit versions:
Duplicates a single double-precision floating-point value into the destination.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation

VMOVDDUP (VEX.256 encoded version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[191:128] ← SRC[191:128]
DEST[255:192] ← SRC[191:128]

VMOVDDUP (VEX.128 encoded version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[255:128] ← 0

MOVDDUP (128-bit Legacy SSE version)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← SRC[63:0]
DEST[255:128] (Unmodified)
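
As an illustration of the even-indexed duplication described above, the sketch below models the VEX-encoded forms in portable C. It is an assumption-laden software model rather than the hardware behavior itself: `ymm_q` and the `model_*` names are hypothetical, and each element is carried as a raw 64-bit pattern.

```c
#include <stdint.h>

/* Hypothetical model of a YMM register as four qwords:
   q[0] is bits 63:0, q[3] is bits 255:192. */
typedef struct { uint64_t q[4]; } ymm_q;

/* VEX.256 VMOVDDUP: each even-indexed element (0 and 2) is
   duplicated into the odd-indexed slot above it. */
static ymm_q model_vmovddup_256(ymm_q src) {
    ymm_q dest;
    dest.q[0] = src.q[0];
    dest.q[1] = src.q[0];
    dest.q[2] = src.q[2];
    dest.q[3] = src.q[2];
    return dest;
}

/* VEX.128 VMOVDDUP: element 0 is duplicated and bits 255:128 are zeroed. */
static ymm_q model_vmovddup_128(ymm_q src) {
    ymm_q dest;
    dest.q[0] = src.q[0];
    dest.q[1] = src.q[0];
    dest.q[2] = 0;
    dest.q[3] = 0;
    return dest;
}
```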

Intel C/C++ Compiler Intrinsic Equivalent

VMOVDDUP __m256d _mm256_movedup_pd (__m256d a);
MOVDDUP __m128d _mm_movedup_pd (__m128d a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.vvvv != 1111B.

MOVDQA- Move Aligned Packed Integer Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 6F /r MOVDQA xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move aligned packed integer values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>66 0F 7F /r MOVDQA xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move aligned packed integer values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.66.0F 6F /r VMOVDQA xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed integer values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 7F /r VMOVDQA xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed integer values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.66.0F 6F /r VMOVDQA ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed integer values from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 7F /r VMOVDQA ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move aligned packed integer values from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

VEX.256 encoded version:

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 32-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

128-bit versions:
Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated. To move integer data to and from unaligned memory locations, use the VMOVDQU instruction.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.
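
Software that cannot guarantee operand alignment may want to test an address before choosing the aligned form. The following is a minimal sketch of such a check, assuming a flat address space in which a pointer converts to `uintptr_t` as its numeric address; `is_aligned` and `movdqa_operand_ok` are hypothetical helper names, not part of any Intel API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p lies on a multiple of `boundary` bytes.
   `boundary` must be a power of two. */
static bool is_aligned(const void *p, size_t boundary) {
    return ((uintptr_t)p & (boundary - 1)) == 0;
}

/* True if a memory operand at p meets the alignment required by
   (V)MOVDQA: 32 bytes for the VEX.256 form, 16 bytes for the
   128-bit forms. A false result means the instruction would raise #GP. */
static bool movdqa_operand_ok(const void *p, bool vex256) {
    return is_aligned(p, vex256 ? 32 : 16);
}
```

When the check fails, the unaligned move instruction is the usual fallback.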

**Operation**

**VMOVDQA (VEX.256 encoded version)**

DEST[255:0] ← SRC[255:0]

**VMOVDQA (VEX.128 encoded version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

**MOVDQA (128-bit load- and register-form Legacy SSE version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

**(V)MOVDQA (128-bit store forms)**

DEST[127:0] ← SRC[127:0]

**Intel C/C++ Compiler Intrinsic Equivalent**

VMOVDQA __m256i _mm256_load_si256 (__m256i * p);

VMOVDQA void _mm256_store_si256(__m256i *p, __m256i a);

MOVDQA __m128i _mm_load_si128 (__m128i * p);

MOVDQA void _mm_store_si128(__m128i *p, __m128i a);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 1; additionally

#UD If VEX.vvvv != 1111B.

MOVDQU- Move Unaligned Packed Integer Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 6F /r MOVDQU xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move unaligned packed integer values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>F3 0F 7F /r MOVDQU xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move unaligned packed integer values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.F3.0F 6F /r VMOVDQU xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed integer values from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F 7F /r VMOVDQU xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed integer values from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.F3.0F 6F /r VMOVDQU ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed integer values from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.F3.0F 7F /r VMOVDQU ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed integer values from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

**VEX.256 encoded version:**

Moves 256 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

**128-bit versions:**

Moves 128 bits of packed integer values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.
**128-bit Legacy SSE version**: Bits (255:128) of the corresponding YMM destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on any boundary without causing a general-protection exception (#GP) to be generated.

**VEX.128 encoded version**: Bits (255:128) of the destination YMM register are zeroed.

**Operation**

**VMOVDQU (VEX.256 encoded version)**

DEST[255:0] ← SRC[255:0]

**VMOVDQU (VEX.128 encoded version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

**MOVDQU load and register copy (128-bit Legacy SSE version)**

DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

**(V)MOVDQU 128-bit store-form versions**

DEST[127:0] ← SRC[127:0]

**Intel C/C++ Compiler Intrinsic Equivalent**

VMOVDQU __m256i _mm256_loadu_si256 (__m256i * p);
VMOVDQU void _mm256_storeu_si256(__m256i *p, __m256i a);
MOVDQU __m128i _mm_loadu_si128 (__m128i * p);
MOVDQU void _mm_storeu_si128(__m128i *p, __m128i a);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 10; additionally

#UD If VEX.vvvv != 1111B.

Note: VEX-encoded instructions do not report #AC; treatment of #AC may vary for instructions not encoded with a VEX prefix.
MOVHLPS - Move Packed Single-Precision Floating-Point Values High to Low

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 12 /r MOVHLPS xmm1, xmm2</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from high quadword of xmm2 to low quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 12 /r VMOVHLPS xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge two packed single-precision floating-point values from high quadword of xmm3 and low quadword of xmm2.</td>
</tr>
</tbody>
</table>

**Description**

This instruction cannot be used for memory to register moves.

**128-bit two-argument form:**

Moves two packed single-precision floating-point values from the high quadword of the second XMM argument (second operand) to the low quadword of the first XMM register (first argument). The high quadword of the destination operand is left unchanged. The upper 128 bits of the corresponding YMM destination register are unmodified.

**128-bit three-argument form**

Moves two packed single-precision floating-point values from the high quadword of the third XMM argument (third operand) to the low quadword of the destination (first operand). Copies the high quadword from the second XMM argument (second operand) to the high quadword of the destination (first operand). The upper 128 bits of the destination YMM register are zeroed.

If VMOVHLPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

**Operation**

MOVHLPS (128-bit two-argument form)

DEST[63:0] ← SRC[127:64]

DEST[255:64] (Unmodified)

VMOVHLPS (128-bit three-argument form)
DEST[63:0] ← SRC2[127:64]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0
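
The two forms differ in where the high quadword of the result comes from. The sketch below is a hedged software model of the pseudocode above; the `ymm_q` type and the `model_*` names are hypothetical, and each quadword stands for a pair of single-precision values carried as raw bits.

```c
#include <stdint.h>

/* Hypothetical model of a YMM register as four qwords:
   q[0] is bits 63:0, q[3] is bits 255:192. */
typedef struct { uint64_t q[4]; } ymm_q;

/* MOVHLPS xmm1, xmm2 (two-argument form): only the low quadword
   of the destination changes; bits 255:64 are unmodified. */
static void model_movhlps(ymm_q *dest, const ymm_q *src) {
    dest->q[0] = src->q[1];
}

/* VMOVHLPS xmm1, xmm2, xmm3 (three-argument form): low quadword
   from the high quadword of src2 (xmm3), high quadword from
   src1 (xmm2), bits 255:128 zeroed. */
static ymm_q model_vmovhlps(ymm_q src1, ymm_q src2) {
    ymm_q dest;
    dest.q[0] = src2.q[1];
    dest.q[1] = src1.q[1];
    dest.q[2] = 0;
    dest.q[3] = 0;
    return dest;
}
```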

Intel C/C++ Compiler Intrinsic Equivalent
MOVHLPS __m128 _mm_movehl_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally
#UD If VEX.L = 1
MOVHPD- Move High Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 16 /r MOVHPD xmm1, m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move double-precision floating-point values from m64 to high quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 16 /r VMOVHPD xmm2, xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge double-precision floating-point value from m64 and the low quadword of xmm1.</td>
</tr>
<tr>
<td>66 0F 17/r MOVHPD m64, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move double-precision floating-point values from high quadword of xmm1 to m64.</td>
</tr>
<tr>
<td>VEX.128.66.0F 17 /r VMOVHPD m64, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move double-precision floating-point values from high quadword of xmm1 to m64.</td>
</tr>
</tbody>
</table>

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:
Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the high 64-bits of the destination XMM register. The lower 64-bits of the XMM register are preserved. The upper 128-bits of the corresponding YMM destination register are preserved.

VEX.128 encoded load:
 Loads a double-precision floating-point value from the source 64-bit memory operand (third operand) and stores it in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from second XMM register (second operand) are stored in the lower 64-bits of the destination. The upper 128-bits of the destination YMM register are zeroed.

128-bit store:
Stores a double-precision floating-point value from the high 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVHPD (store) (VEX.128.66.0F 17 /r) is legal and has the same behavior as the existing 66 0F 17 store. For VMOVHPD (store) (VEX.128.66.0F 17 /r) instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.
If VMOVHPD is encoded with VEX.L = 1, an attempt to execute the instruction encoded with VEX.L = 1 will cause an #UD exception.

**Operation**

**MOVHPD (128-bit Legacy SSE load)**
- DEST[63:0] (Unmodified)
- DEST[127:64] ← SRC[63:0]
- DEST[255:128] (Unmodified)

**VMOVHPD (VEX.128 encoded load)**
- DEST[63:0] ← SRC1[63:0]
- DEST[127:64] ← SRC2[63:0]
- DEST[255:128] ← 0

**VMOVHPD (store)**
- DEST[63:0] ← SRC[127:64]

**Intel C/C++ Compiler Intrinsic Equivalent**
- MOVHPD __m128d _mm_loadh_pd (__m128d a, double *p)
- MOVHPD void _mm_storeh_pd (double *p, __m128d a)

**SIMD Floating-Point Exceptions**
- None

**Other Exceptions**
- See Exceptions Type 5; additionally
  - #UD If VEX.L = 1.

MOVHPS- Move High Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 16 /r MOVHPS xmm1, m64</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from m64 to high quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 16 /r VMOVHPS xmm2, xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge two packed single-precision floating-point values from m64 and the low quadword of xmm1.</td>
</tr>
<tr>
<td>0F 17/r MOVHPS m64, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from high quadword of xmm1 to m64.</td>
</tr>
<tr>
<td>VEX.128.0F 17/r VMOVHPS m64, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move two packed single-precision floating-point values from high quadword of xmm1 to m64.</td>
</tr>
</tbody>
</table>

Description

This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:

Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the high 64-bits of the destination XMM register. The lower 64-bits of the XMM register are preserved. The upper 128-bits of the corresponding YMM destination register are preserved.

VEX.128 encoded load:

Loads two packed single-precision floating-point values from the source 64-bit memory operand (third operand) and stores them in the upper 64-bits of the destination XMM register (first operand). The low 64-bits from second XMM register (second operand) are stored in the lower 64-bits of the destination. The upper 128-bits of the destination YMM register are zeroed.

128-bit store:

Stores two packed single-precision floating-point values from the high 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).

Note: VMOVHPS (store) (VEX.128.0F 17 /r) is legal and has the same behavior as the existing 0F 17 store. For the VMOVHPS (store) (VEX.128.0F 17 /r) instruction version, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

If VMOVHPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

**Operation**

**MOVHPS (128-bit Legacy SSE load)**

DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[255:128] (Unmodified)

**VMOVHPS (VEX.128 encoded load)**

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[255:128] ← 0

**VMOVHPS (store)**

DEST[63:0] ← SRC[127:64]

**Intel C/C++ Compiler Intrinsic Equivalent**

MOVHPS __m128 _mm_loadh_pi (__m128 a, __m64 *p)
MOVHPS void _mm_storeh_pi (__m64 *p, __m128 a)

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 5; additionally

#UD If VEX.L = 1.

MOVLHPS - Move Packed Single-Precision Floating-Point Values Low to High

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 16 /r MOVLHPS xmm1, xmm2</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from low quadword of xmm2 to high quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 16 /r VMOVLHPS xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge two packed single-precision floating-point values from low quadword of xmm3 and low quadword of xmm2.</td>
</tr>
</tbody>
</table>

Description

This instruction cannot be used for memory to register moves.

**128-bit two-argument form:**

Moves two packed single-precision floating-point values from the low quadword of the second XMM argument (second operand) to the high quadword of the first XMM register (first argument). The low quadword of the destination operand is left unchanged. The upper 128 bits of the corresponding YMM destination register are unmodified.

**128-bit three-argument form**

Moves two packed single-precision floating-point values from the low quadword of the third XMM argument (third operand) to the high quadword of the destination (first operand). Copies the low quadword from the second XMM argument (second operand) to the low quadword of the destination (first operand). The upper 128-bits of the destination YMM register are zeroed.

If VMOVLHPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

Operation

**MOVLHPS (128-bit two-argument form)**

DEST[63:0] (Unmodified)
DEST[127:64] ← SRC[63:0]
DEST[255:128] (Unmodified)

**VMOVLHPS (128-bit three-argument form)**

DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
MOVLHPS __m128 _mm_movelh_ps(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally
#UD If VEX.L = 1.

MOVLPD- Move Low Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 12 /r MOVLPD xmm1, m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move double-precision floating-point values from m64 to low quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 12 /r VMOVLPD xmm2, xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge double-precision floating-point value from m64 and the high quadword of xmm1.</td>
</tr>
<tr>
<td>66 0F 13/r MOVLPD m64, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move double-precision floating-point values from low quadword of xmm1 to m64.</td>
</tr>
<tr>
<td>VEX.128.66.0F 13/r VMOVLPD m64, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move double-precision floating-point values from low quadword of xmm1 to m64.</td>
</tr>
</tbody>
</table>

Description
This instruction cannot be used for register to register or memory to memory moves.

128-bit Legacy SSE load:
Moves a double-precision floating-point value from the source 64-bit memory operand and stores it in the low 64-bits of the destination XMM register. The upper 64-bits of the XMM register are preserved. The upper 128-bits of the corresponding YMM destination register are preserved.

VEX.128 encoded load:
Loads a double-precision floating-point value from the source 64-bit memory operand (third operand), merges it with the upper 64-bits of the first source XMM register (second operand), and stores it in the low 128-bits of the destination XMM register (first operand). The upper 128-bits of the destination YMM register are zeroed.

128-bit store:
Stores a double-precision floating-point value from the low 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).
Note: VMOVLPD (store) (VEX.128.66.0F 13 /r) is legal and has the same behavior as the existing 66 0F 13 store. For VMOVLPD (store) (VEX.128.66.0F 13 /r) instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.
If VMOVLPD is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

**Operation**

**MOVLPD (128-bit Legacy SSE load)**
DEST[63:0] ← SRC[63:0]
DEST[255:64] (Unmodified)

**VMOVLPD (VEX.128 encoded load)**
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**VMOVLPD (store)**
DEST[63:0] ← SRC[63:0]
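
The load forms can be sketched in portable C as follows. This is an illustrative software model of the pseudocode above under stated assumptions, not the instruction itself: `ymm_pd` and the `model_*` names are hypothetical, and the register is represented as four `double` elements with `d[0]` as bits 63:0.

```c
/* Hypothetical model of a YMM register holding four doubles:
   d[0] is bits 63:0, d[3] is bits 255:192. */
typedef struct { double d[4]; } ymm_pd;

/* Legacy SSE MOVLPD xmm1, m64: only the low element is replaced;
   bits 255:64 are unmodified. */
static void model_movlpd_legacy(ymm_pd *dest, const double *m64) {
    dest->d[0] = *m64;
}

/* VEX.128 VMOVLPD xmm2, xmm1, m64: low element from memory, high
   element of the XMM half from src1, bits 255:128 zeroed. */
static ymm_pd model_vmovlpd(ymm_pd src1, const double *m64) {
    ymm_pd dest;
    dest.d[0] = *m64;
    dest.d[1] = src1.d[1];
    dest.d[2] = 0.0;
    dest.d[3] = 0.0;
    return dest;
}
```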

**Intel C/C++ Compiler Intrinsic Equivalent**
MOVLPD __m128d _mm_loadl_pd (__m128d a, double *p)
MOVLPD void _mm_storel_pd (double *p, __m128d a)

**SIMD Floating-Point Exceptions**
None

**Other Exceptions**
See Exceptions Type 5; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

MOVLPS- Move Low Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 12 /r MOVLPS xmm1, m64</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from m64 to low quadword of xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 12 /r VMOVLPS xmm2, xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge two packed single-precision floating-point values from m64 and the high quadword of xmm1.</td>
</tr>
<tr>
<td>0F 13 /r MOVLPS m64, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move two packed single-precision floating-point values from low quadword of xmm1 to m64.</td>
</tr>
<tr>
<td>VEX.128.0F 13 /r VMOVLPS m64, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move two packed single-precision floating-point values from low quadword of xmm1 to m64.</td>
</tr>
</tbody>
</table>

Description

This instruction cannot be used for register to register or memory to memory moves.

**128-bit Legacy SSE load:**
Moves two packed single-precision floating-point values from the source 64-bit memory operand and stores them in the low 64-bits of the destination XMM register. The upper 64-bits of the XMM register are preserved. The upper 128-bits of the corresponding YMM destination register are preserved.

**VEX.128 encoded load:**
Loads two packed single-precision floating-point values from the source 64-bit memory operand (third operand), merges them with the upper 64-bits of the first source XMM register (second operand), and stores them in the low 128-bits of the destination XMM register (first operand). The upper 128-bits of the destination YMM register are zeroed.

**128-bit store:**
Stores two packed single-precision floating-point values from the low 64-bits of the XMM register source (second operand) to the 64-bit memory location (first operand).
Note: VMOVLPS (store) (VEX.128.0F 13 /r) is legal and has the same behavior as the existing 0F 13 store. For the VMOVLPS (store) (VEX.128.0F 13 /r) instruction version, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

If VMOVLPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

Operation

**MOVLPS (128-bit Legacy SSE load)**

DEST[63:0] ← SRC[63:0]
DEST[255:64] (Unmodified)

**VMOVLPS (VEX.128 encoded load)**

DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**VMOVLPS (store)**

DEST[63:0] ← SRC[63:0]

**Intel C/C++ Compiler Intrinsic Equivalent**

MOVLPS __m128 _mm_loadl_pi (__m128 a, __m64 *p)
MOVLPS void _mm_storel_pi (__m64 *p, __m128 a)

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 5; additionally

#UD If VEX.L = 1.
If VEX.vvvv != 1111B.
MOVMSKPD- Extract Double-Precision Floating-Point Sign mask

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 50 /r MOVMSKPD reg, xmm2</td>
<td>V/V</td>
<td>SSE2</td>
<td>Extract 2-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
<tr>
<td>VEX.128.66.0F 50 /r VMOVMSKPD reg, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract 2-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
<tr>
<td>VEX.256.66.0F 50 /r VMOVMSKPD reg, ymm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract 4-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
</tbody>
</table>

Description
Extracts the sign bits from the packed double-precision floating-point values in the source operand (second operand), formats them into a 2- or 4-bit mask, and stores the mask in the destination operand (first operand). The source operand is an XMM or YMM register, and the destination operand is a general-purpose register. The mask is stored in the 2 or 4 low-order bits of the destination operand. The upper bits of the destination operand beyond the mask are filled with zeros.

In 64-bit mode, the default operand size of the destination register is 64 bit.

VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general purpose register.

128-bit versions: The source operand is an XMM register. The destination operand is a general purpose register.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VMOVMSKPD (VEX.256 encoded version)
DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
DEST[2] ← SRC[191]
DEST[3] ← SRC[255]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;
FI

(V)MOVMSKPD (128-bit versions)

DEST[0] ← SRC[63]
DEST[1] ← SRC[127]
IF DEST = r32
    THEN DEST[31:2] ← 0;
    ELSE DEST[63:2] ← 0;
FI
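
The mask construction can be illustrated with a small portable C model. This is a sketch only: `sign_bit` and `model_movmskpd` are hypothetical names, and the source register is passed as an array of `double` elements, assumed to be IEEE 754 binary64, with index 0 as the lowest element.

```c
#include <stdint.h>
#include <string.h>

/* Returns bit 63 of the IEEE 754 encoding of x. Reading the raw
   bits keeps the sign of -0.0 and of NaNs, as the instruction does. */
static unsigned sign_bit(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (unsigned)(bits >> 63);
}

/* lanes = 2 models the 128-bit forms; lanes = 4 models VEX.256.
   Bit i of the result is the sign bit of element i; all higher
   bits of the result are zero. */
static int model_movmskpd(const double *src, int lanes) {
    int mask = 0;
    for (int i = 0; i < lanes; i++)
        mask |= (int)sign_bit(src[i]) << i;
    return mask;
}
```

For the elements {-1.0, 2.0, -0.0, 3.0}, the model yields 5 with four lanes and 1 with two lanes.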

Intel C/C++ Compiler Intrinsic Equivalent

int _mm256_movemask_pd(__m256d a)
int _mm_movemask_pd(__m128d a)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 7; additionally

#UD    If VEX.vvvv != 1111B.
MOVMSKPS - Extract Single-Precision Floating-Point Sign mask

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 50 /r MOVMSKPS reg, xmm2</td>
<td>V/V</td>
<td>SSE</td>
<td>Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
<tr>
<td>VEX.128.0F 50 /r VMOVMSKPS reg, xmm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract 4-bit sign mask from xmm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
<tr>
<td>VEX.256.0F 50 /r VMOVMSKPS reg, ymm2</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract 8-bit sign mask from ymm2 and store in reg. The upper bits of r32 or r64 are zeroed.</td>
</tr>
</tbody>
</table>

Description
Extracts the sign bits from the packed single-precision floating-point values in the source operand (second operand), formats them into a 4- or 8-bit mask, and stores the mask in the destination operand (first operand). The source operand is an XMM or YMM register, and the destination operand is a general-purpose register. The mask is stored in the 4 or 8 low-order bits of the destination operand. The upper bits of the destination operand beyond the mask are filled with zeros.

In 64-bit mode, the default operand size of the destination register is 64 bit.
VEX.256 encoded version: The source operand is a YMM register. The destination operand is a general-purpose register.
128-bit versions: The source operand is an XMM register. The destination operand is a general purpose register.
Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation
VMOVMSKPS (VEX.256 encoded version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
DEST[4] ← SRC[159]
DEST[5] ← SRC[191]
DEST[6] ← SRC[223]
DEST[7] ← SRC[255]
IF DEST = r32
    THEN DEST[31:8] ← 0;
    ELSE DEST[63:8] ← 0;
FI

(V)MOVMSKPS (128-bit version)
DEST[0] ← SRC[31]
DEST[1] ← SRC[63]
DEST[2] ← SRC[95]
DEST[3] ← SRC[127]
IF DEST = r32
    THEN DEST[31:4] ← 0;
    ELSE DEST[63:4] ← 0;
FI
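
The zero-extension into a 64-bit destination can be seen in the following sketch, which works directly on raw 32-bit element encodings. It is a hypothetical software model of the pseudocode above; `model_movmskps` is an illustrative name, and `src[i]` is assumed to hold the bit pattern of single-precision element i.

```c
#include <stdint.h>

/* lanes = 4 models the 128-bit forms; lanes = 8 models VEX.256.
   The 64-bit return value stands for an r64 destination. */
static uint64_t model_movmskps(const uint32_t *src, int lanes) {
    uint64_t dest = 0;  /* bits beyond the mask remain zero */
    for (int i = 0; i < lanes; i++)
        dest |= (uint64_t)(src[i] >> 31) << i;  /* DEST[i] <- SRC[32*i + 31] */
    return dest;
}
```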

Intel C/C++ Compiler Intrinsic Equivalent
int _mm256_movemask_ps(__m256 a)
int _mm_movemask_ps(__m128 a)
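
The sign-mask extraction in the Operation section can be sketched as a portable scalar model. This is an illustration only, not the processor implementation: the helper name `movmskps_model` is hypothetical, and the sketch assumes `float` uses the 32-bit IEEE-754 encoding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scalar model of (V)MOVMSKPS.
 * lanes - packed single-precision values, lowest element first
 * count - 4 for the 128-bit forms, 8 for the VEX.256 form
 * Returns the sign bits packed into the low `count` bits; every
 * upper bit of the 64-bit result is zero. */
static uint64_t movmskps_model(const float *lanes, int count)
{
    assert(count == 4 || count == 8);
    uint64_t mask = 0;
    for (int i = 0; i < count; i++) {
        uint32_t bits;
        memcpy(&bits, &lanes[i], sizeof bits); /* raw encoding (assumes 32-bit float) */
        mask |= (uint64_t)(bits >> 31) << i;   /* DEST[i] <- SRC[32*i + 31] */
    }
    return mask;
}
```

Because only the sign bit is inspected, a negative zero sets its mask bit just as any other negative value does.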

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally
#UD                If VEX.vvvv != 1111B.

MOVNTDQ- Store Packed Integers Using Non-Temporal Hint

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E7 /r MOVNTDQ m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move packed integer values in xmm1 to m128 using non-temporal hint</td>
</tr>
<tr>
<td>VEX.128.66.0F E7 /r VMOVNTDQ m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed integer values in xmm1 to m128 using non-temporal hint</td>
</tr>
<tr>
<td>VEX.256.66.0F E7 /r VMOVNTDQ m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed integer values in ymm1 to m256 using non-temporal hint</td>
</tr>
</tbody>
</table>

Description

Moves the packed integers in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register or YMM register, which is assumed to contain integer data (packed bytes, words, doublewords, or quadwords). The destination operand is a 128-bit or 256-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version) or 32-byte (VEX.256 encoded version) boundary otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: In VEX-128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.

Ref. # 319433-005

Operation
MOVNTDQ
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent
VMOVNTDQ void _mm256_stream_si256 (__m256i * p, __m256i a);
MOVNTDQ void _mm_stream_si128 (__m128i * p, __m128i a);
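
Since a misaligned memory operand raises #GP, software may want to check an address before issuing a non-temporal store. The following is a minimal sketch of such a check; the helper name `nt_store_aligned` is hypothetical and is not part of any Intel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero if `p` meets the alignment that
 * the non-temporal store forms require: 16 bytes for the 128-bit forms,
 * 32 bytes for the VEX.256 form. */
static int nt_store_aligned(const void *p, size_t operand_bytes)
{
    assert(operand_bytes == 16 || operand_bytes == 32);
    return ((uintptr_t)p % operand_bytes) == 0;
}
```

A caller could use the result to choose between the streaming store and an ordinary unaligned store for a given buffer.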

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type1.SSE2; additionally
#UD If VEX.vvvv != 1111B.

MOVNTDQA- Load Double Quadword Non-Temporal Aligned Hint

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 2A /r MOVNTDQA xmm1, m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 2A /r VMOVNTDQA xmm1, m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move double quadword from m128 to xmm1 using non-temporal hint if WC memory type.</td>
</tr>
</tbody>
</table>

Description

MOVNTDQA loads a double quadword from the source operand (second operand) to the destination operand (first operand) using a non-temporal hint if the memory source is WC (write combining) memory type. For WC memory type, the non-temporal hint may be implemented by loading a temporary internal buffer with the equivalent of an aligned cache line without filling this data to the cache. Any memory-type aliased lines in the cache will be snooped and flushed. Subsequent MOVNTDQA reads to unread portions of the WC cache line will receive data from the temporary internal buffer if data is available. The temporary internal buffer may be flushed by the processor at any time for any reason, for example:

• A load operation other than a MOVNTDQA which references memory already resident in a temporary internal buffer.
• A non-WC reference to memory already resident in a temporary internal buffer.
• Interleaving of reads and writes to a single temporary internal buffer.
• Repeated (V)MOVNTDQA loads of a particular 16-byte item in a streaming line.
• Certain micro-architectural conditions including resource shortages, detection of a mis-speculation condition, and various fault conditions

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being read can override the non-temporal hint, if the memory address specified for the non-temporal read is not a WC memory region. Information on non-temporal reads and writes can be found in "Caching of Temporal vs. Non-Temporal Data" in Chapter 10 in the Intel® 64 and IA-32 Architecture Software Developer’s Manual, Volume 3A.
Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with a MFENCE instruction should be used in conjunction with MOVNTDQA instructions if multiple processors might use different memory types for the referenced memory locations or to synchronize reads of a processor with writes by other agents in the system. A processor’s implementation of the streaming load hint does not override the effective memory type, but the implementation of the hint is processor dependent. For example, a processor implementation may choose to ignore the hint and process the instruction as a normal MOVDQA for any memory type. Alternatively, another implementation may optimize cache reads generated by MOVNTDQA on WB memory type to reduce cache evictions. The 128-bit (V)MOVNTDQA addresses must be 16-byte aligned or the instruction will cause a #GP.

Note: In VEX-128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.

**Operation**

MOVNTDQA (128-bit Legacy SSE form)

DEST ← SRC
DEST[255:128] (Unmodified)

VMOVNTDQA (VEX.128 encoded form)

DEST ← SRC
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

MOVNTDQA __m128i _mm_stream_load_si128(__m128i *p);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type1.SSE4.1; additionally

#UD If VEX.vvvv != 1111B.
If VEX.L = 1.
MOVNTPD- Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 2B /r MOVNTPD m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move packed double-precision values in xmm1 to m128 using non-temporal hint</td>
</tr>
<tr>
<td>VEX.128.66.0F 2B /r VMOVNTPD m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed double-precision values in xmm1 to m128 using non-temporal hint</td>
</tr>
<tr>
<td>VEX.256.66.0F 2B /r VMOVNTPD m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed double-precision values in ymm1 to m256 using non-temporal hint</td>
</tr>
</tbody>
</table>

**Description**

Moves the packed double-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register or YMM register, which is assumed to contain packed double-precision floating-point data. The destination operand is a 128-bit or 256-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version) or 32-byte (VEX.256 encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: In VEX-128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.
Operation

**MOVNTPD**

DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent

VMOVNTPD void _mm256_stream_pd (double * p, __m256d a);

MOVNTPD void _mm_stream_pd (double * p, __m128d a);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type1.SSE2; additionally

#UD If VEX.vvvv != 1111B.

MOVNTPS- Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 2B /r MOVNTPS m128, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move packed single-precision values xmm1 to mem using non-temporal hint</td>
</tr>
<tr>
<td>VEX.128.0F 2B /r VMOVNTPS m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed single-precision values xmm1 to mem using non-temporal hint</td>
</tr>
<tr>
<td>VEX.256.0F 2B /r VMOVNTPS m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move packed single-precision values ymm1 to mem using non-temporal hint</td>
</tr>
</tbody>
</table>

Description

Moves the packed single-precision floating-point values in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to prevent caching of the data during the write to memory. The source operand is an XMM register or YMM register, which is assumed to contain packed single-precision floating-point data. The destination operand is a 128-bit or 256-bit memory location. The memory operand must be aligned on a 16-byte (128-bit version) or 32-byte (VEX.256 encoded version) boundary; otherwise a general-protection exception (#GP) will be generated.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPS instructions if multiple processors might use different memory types to read/write the destination memory locations.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.
Operation
MOVNTPS
DEST ← SRC

Intel C/C++ Compiler Intrinsic Equivalent
MOVNTPS void _mm_stream_ps (float * p, __m128 a);
VMOVNTPS void _mm256_stream_ps (float * p, __m256 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type1.SSE; additionally
#UD If VEX.vvvv != 1111B.
MOVSD - Move or Merge Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 10 /r MOVSD xmm1, xmm2</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move or Merge scalar double-precision floating-point value from xmm2 to xmm1 register</td>
</tr>
<tr>
<td>F2 0F 10 /r MOVSD xmm1, m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Load scalar double-precision floating-point value from m64 to xmm1 register</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 10 /r VMOVSD xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge scalar double-precision floating-point value from xmm2 and xmm3 to xmm1 register</td>
</tr>
<tr>
<td>VEX.128.F2.0F 10 /r VMOVSD xmm1, m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Load scalar double-precision floating-point value from m64 to xmm1 register</td>
</tr>
<tr>
<td>F2 0F 11 /r MOVSD xmm2/m64, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move scalar double-precision floating-point value from xmm1 register to xmm2/m64</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 11 /r VMOVSD xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge scalar double-precision floating-point value from xmm2 and xmm3 registers to xmm1</td>
</tr>
<tr>
<td>VEX.128.F2.0F 11 /r VMOVSD m64, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move scalar double-precision floating-point value from xmm1 register to m64</td>
</tr>
</tbody>
</table>

Description

Moves a scalar double-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 64-bit memory locations. This instruction can be used to move a double-precision floating-point value to and from the low quadword of an XMM register and a 64-bit memory location, or to move a double-precision floating-point value between the low quadwords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

When the source and destination operands are XMM registers, the high quadword of the destination operand remains unchanged. When the source operand is a memory location and the destination operand is an XMM register, the high quadword of the destination operand is cleared to all 0s.

Note: For the "VMOVSD m64, xmm1" (memory store form) instruction version, VEX.vvvv is reserved and must be 1111b, otherwise instruction will #UD.
Note: For the "VMOVSD xmm1, m64" (memory load form) instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.
Software should ensure VMOVSD is encoded with VEX.L=0. Encoding VMOVSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
MOVSD (128-bit Legacy SSE version: MOVSD XMM1, XMM2)
DEST[63:0] ← SRC[63:0]
DEST[255:64] (Unmodified)

VMOVSD (VEX.NDS.128.F2.0F 11 /r: VMOVSD xmm1, xmm2, xmm3)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

VMOVSD (VEX.NDS.128.F2.0F 10 /r: VMOVSD xmm1, xmm2, xmm3)
DEST[63:0] ← SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

VMOVSD (VEX.128.F2.0F 10 /r: VMOVSD xmm1, m64)
DEST[63:0] ← SRC[63:0]
DEST[255:64] ← 0

MOVSD/VMOVSD (128-bit versions: MOVSD m64, xmm1 or VMOVSD m64, xmm1)
DEST[63:0] ← SRC[63:0]

MOVSD (128-bit Legacy SSE version: MOVSD xmm1, m64)
DEST[63:0] ← SRC[63:0]
DEST[127:64] ← 0
DEST[255:128] (Unmodified)
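
The register-destination forms listed above differ only in how they treat the bits above the low quadword. A scalar model makes the differences easy to compare. This is a sketch under stated assumptions: the type `ymm_model` and the function names are hypothetical and only mirror the pseudo-code; they are not compiler intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a YMM register as four quadwords:
 * q[0] = bits 63:0, q[1] = bits 127:64, q[2] and q[3] = bits 255:128. */
typedef struct { uint64_t q[4]; } ymm_model;

/* Legacy SSE MOVSD xmm1, xmm2: only the low quadword changes. */
static void movsd_legacy_reg(ymm_model *dest, const ymm_model *src)
{
    dest->q[0] = src->q[0];
}

/* Legacy SSE MOVSD xmm1, m64: bits 127:64 cleared, bits 255:128 kept. */
static void movsd_legacy_load(ymm_model *dest, uint64_t m64)
{
    dest->q[0] = m64;
    dest->q[1] = 0;
}

/* VMOVSD xmm1, xmm2, xmm3: low quadword from SRC2, bits 127:64 from
 * SRC1, bits 255:128 zeroed. */
static void vmovsd_reg(ymm_model *dest, const ymm_model *src1,
                       const ymm_model *src2)
{
    uint64_t lo = src2->q[0];   /* read inputs first: dest may alias a source */
    uint64_t hi = src1->q[1];
    dest->q[0] = lo;
    dest->q[1] = hi;
    dest->q[2] = 0;
    dest->q[3] = 0;
}

/* VMOVSD xmm1, m64: bits 255:64 zeroed. */
static void vmovsd_load(ymm_model *dest, uint64_t m64)
{
    dest->q[0] = m64;
    dest->q[1] = 0;
    dest->q[2] = 0;
    dest->q[3] = 0;
}
```

In this model the legacy forms never write q[2] or q[3], while both VEX forms always clear them.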

Intel C/C++ Compiler Intrinsic Equivalent
MOVSD __m128d _mm_load_sd (double *p)
MOVSD void _mm_store_sd (double *p, __m128d a)
MOVSD __m128d _mm_move_sd ( __m128d a, __m128d b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.vvvv != 1111B.
## MOVSHDUP - Replicate Single FP Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
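<tr>
<td>F3 0F 16 /r MOVSHDUP xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F 16 /r VMOVSHDUP xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move odd index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1</td>
</tr>
<tr>
<td>VEX.256.F3.0F 16 /r VMOVSHDUP ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move odd index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1</td>
</tr>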
</tbody>
</table>

### Description

Duplicates odd-indexed single-precision floating-point values from the source operand (second operand). See Figure 5-15. The source operand is an XMM or YMM register or 128 or 256-bit memory location and the destination operand is an XMM or YMM register.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Figure 5-15. MOVSHDUP Operation

Operation

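VMOVSHDUP (VEX.256 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[159:128] ← SRC[191:160]
DEST[191:160] ← SRC[191:160]
DEST[223:192] ← SRC[255:224]
DEST[255:224] ← SRC[255:224]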

VMOVSHDUP (VEX.128 encoded version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[255:128] ← 0

MOVSHDUP (128-bit Legacy SSE version)
DEST[31:0] ← SRC[63:32]
DEST[63:32] ← SRC[63:32]
DEST[95:64] ← SRC[127:96]
DEST[127:96] ← SRC[127:96]
DEST[255:128] (Unmodified)
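
The duplication of odd-indexed elements described for this instruction can be sketched as a short scalar loop. The function name `movshdup_model` is hypothetical, and the model works on raw 32-bit lane encodings rather than on real vector registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)MOVSHDUP on raw 32-bit lane encodings.
 * count is 4 for the 128-bit forms or 8 for the VEX.256 form. Each
 * odd-indexed source element is written to its own lane and to the
 * even-indexed lane directly below it. */
static void movshdup_model(uint32_t *dest, const uint32_t *src, int count)
{
    assert(count == 4 || count == 8);
    for (int i = 0; i < count; i += 2) {
        uint32_t odd = src[i + 1];  /* read first so dest may alias src */
        dest[i] = odd;
        dest[i + 1] = odd;
    }
}
```

The model covers only the lanes that the instruction writes from the source; the zeroing or preservation of bits 255:128 in the 128-bit forms is left to the caller.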

Intel C/C++ Compiler Intrinsic Equivalent
VMOVSHDUP __m256 _mm256_movehdup_ps (__m256 a);
MOVSHDUP __m128 _mm_movehdup_ps (__m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.

MOVSLDUP- Replicate Single FP Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 12 /r MOVSLDUP xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE3</td>
<td>Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1</td>
</tr>
<tr>
<td>VEX.128.F3.0F 12 /r VMOVSLDUP xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move even index single-precision floating-point values from xmm2/mem and duplicate each element into xmm1</td>
</tr>
<tr>
<td>VEX.256.F3.0F 12 /r VMOVSLDUP ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move even index single-precision floating-point values from ymm2/mem and duplicate each element into ymm1</td>
</tr>
</tbody>
</table>

Description

Duplicates even-indexed single-precision floating-point values from the source operand (second operand). See Figure 5-16. The source operand is an XMM or YMM register or 128 or 256-bit memory location and the destination operand is an XMM or YMM register.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation

VMOVSLDUP (VEX.256 encoded version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[159:128] ← SRC[159:128]
DEST[191:160] ← SRC[159:128]
DEST[223:192] ← SRC[223:192]
DEST[255:224] ← SRC[223:192]

VMOVSLDUP (VEX.128 encoded version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[255:128] ← 0

MOVSLDUP (128-bit Legacy SSE version)
DEST[31:0] ← SRC[31:0]
DEST[63:32] ← SRC[31:0]
DEST[95:64] ← SRC[95:64]
DEST[127:96] ← SRC[95:64]
DEST[255:128] (Unmodified)
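
The even-indexed duplication can be modeled the same way as the pseudo-code above. This is an illustrative sketch only; `movsldup_model` is a hypothetical name, and lanes are represented by their raw 32-bit encodings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of (V)MOVSLDUP on raw 32-bit lane encodings.
 * count is 4 for the 128-bit forms or 8 for the VEX.256 form. Each
 * even-indexed source element is written to its own lane and to the
 * odd-indexed lane directly above it. */
static void movsldup_model(uint32_t *dest, const uint32_t *src, int count)
{
    assert(count == 4 || count == 8);
    for (int i = 0; i < count; i += 2) {
        uint32_t even = src[i];     /* read first so dest may alias src */
        dest[i] = even;
        dest[i + 1] = even;
    }
}
```

Calling the function with the same pointer for `dest` and `src` models the in-place case where the source and destination registers are the same.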

Intel C/C++ Compiler Intrinsic Equivalent
VMOVSLDUP __m256 _mm256_moveldup_ps (__m256 a);
MOVSLDUP __m128 _mm_moveldup_ps (__m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.vvvv != 1111B.
MOVSS- Move or Merge Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 10 /r MOVSS xmm1, xmm2</td>
<td>V/V</td>
<td>SSE</td>
<td>Merge scalar single-precision floating-point value from xmm2 to xmm1 register</td>
</tr>
<tr>
<td>F3 0F 10 /r MOVSS xmm1, m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Load scalar single-precision floating-point value from m32 to xmm1 register</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 10 /r VMOVSS xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register</td>
</tr>
<tr>
<td>VEX.128.F3.0F 10 /r VMOVSS xmm1, m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Load scalar single-precision floating-point value from m32 to xmm1 register</td>
</tr>
<tr>
<td>F3 0F 11 /r MOVSS xmm2/m32, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move scalar single-precision floating-point value from xmm1 register to xmm2/m32</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 11 /r VMOVSS xmm1, xmm2, xmm3</td>
<td>V/V</td>
<td>AVX</td>
<td>Move scalar single-precision floating-point value from xmm2 and xmm3 to xmm1 register</td>
</tr>
<tr>
<td>VEX.128.F3.0F 11 /r VMOVSS m32, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move scalar single-precision floating-point value from xmm1 register to m32</td>
</tr>
</tbody>
</table>

Description

Moves a scalar single-precision floating-point value from the source operand (second operand) to the destination operand (first operand). The source and destination operands can be XMM registers or 32-bit memory locations. This instruction can be used to move a single-precision floating-point value to and from the low doubleword of an XMM register and a 32-bit memory location, or to move a single-precision floating-point value between the low doublewords of two XMM registers. The instruction cannot be used to transfer data between memory locations.

When the source and destination operands are XMM registers, the three high-order doublewords of the destination operand remain unchanged. When the source operand is a memory location and the destination operand is an XMM register, the three high-order doublewords of the destination operand are cleared to all 0s.

Note: For the "VMOVSS m32, xmm1" (memory store form) instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.
Note: For the "VMOVSS xmm1, m32" (memory load form) instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.
Software should ensure VMOVSS is encoded with VEX.L=0. Encoding VMOVSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
MOVSS (Legacy SSE version when the source and destination operands are both XMM registers)
DEST[31:0] ← SRC[31:0]
DEST[255:32] (Unmodified)

VMOVSS (VEX.NDS.128.F3.0F 11 /r where the destination is an XMM register)
DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

VMOVSS (VEX.NDS.128.F3.0F 10 /r where the source and destination are XMM registers)
DEST[31:0] ← SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

VMOVSS (VEX.128.F3.0F 10 /r when the source operand is memory and the destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[255:32] ← 0

MOVSS/VMOVSS (when the source operand is an XMM register and the destination is memory)
DEST[31:0] ← SRC[31:0]

MOVSS (Legacy SSE version when the source operand is memory and the destination is an XMM register)
DEST[31:0] ← SRC[31:0]
DEST[127:32] ← 0
DEST[255:128] (Unmodified)
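
The forms above that write an XMM register can be compared side by side with a scalar model. This is a sketch, not a definitive implementation: `ymm_model32` and the function names are hypothetical. For the three-operand register form the sketch assumes, in line with the "merge" wording in the instruction table and by analogy with VMOVSD, that bits 127:32 come from the first source operand.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a YMM register as eight doublewords:
 * d[0] = bits 31:0 ... d[7] = bits 255:224. */
typedef struct { uint32_t d[8]; } ymm_model32;

/* Legacy SSE MOVSS xmm1, xmm2: only the low doubleword changes. */
static void movss_legacy_reg(ymm_model32 *dest, const ymm_model32 *src)
{
    dest->d[0] = src->d[0];
}

/* Legacy SSE MOVSS xmm1, m32: bits 127:32 cleared, bits 255:128 kept. */
static void movss_legacy_load(ymm_model32 *dest, uint32_t m32)
{
    dest->d[0] = m32;
    for (int i = 1; i < 4; i++)
        dest->d[i] = 0;
}

/* VMOVSS xmm1, xmm2, xmm3: low doubleword from SRC2, bits 127:32 from
 * SRC1 (assumption stated above), bits 255:128 zeroed. */
static void vmovss_reg(ymm_model32 *dest, const ymm_model32 *src1,
                       const ymm_model32 *src2)
{
    uint32_t lo = src2->d[0];   /* read inputs first: dest may alias a source */
    uint32_t up[3] = { src1->d[1], src1->d[2], src1->d[3] };
    dest->d[0] = lo;
    for (int i = 1; i < 4; i++)
        dest->d[i] = up[i - 1];
    for (int i = 4; i < 8; i++)
        dest->d[i] = 0;
}

/* VMOVSS xmm1, m32: bits 255:32 zeroed. */
static void vmovss_load(ymm_model32 *dest, uint32_t m32)
{
    dest->d[0] = m32;
    for (int i = 1; i < 8; i++)
        dest->d[i] = 0;
}
```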

Intel C/C++ Compiler Intrinsic Equivalent
MOVSS __m128 _mm_load_ss(float * p)
MOVSS void _mm_store_ss(float * p, __m128 a)
MOVSS __m128 _mm_move_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.vvvv != 1111B.
MOVUPD- Move Unaligned Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 10 /r MOVUPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move unaligned packed double-precision floating-point from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>66 0F 11 /r MOVUPD xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.66.0F 10 /r VMOVUPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed double-precision floating-point from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 11 /r VMOVUPD xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed double-precision floating-point from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.66.0F 10 /r VMOVUPD ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed double-precision floating-point from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 11 /r VMOVUPD ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed double-precision floating-point from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

**VEX.256 encoded version:**
Moves 256 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

**128-bit versions:**
Moves 128 bits of packed double-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

**128-bit Legacy SSE version**: Bits (255:128) of the corresponding YMM destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte boundary without causing a general-protection exception (#GP) to be generated.

**VEX.128 encoded version**: Bits (255:128) of the destination YMM register are zeroed.

**Operation**

**VMOVUPD (VEX.256 encoded version)**

DEST[255:0] ← SRC[255:0]

**VMOVUPD (VEX.128 encoded version)**

DEST[127:0] ← SRC[127:0]

DEST[255:128] ← 0

**MOVUPD (128-bit load and register-copy form Legacy SSE version)**

DEST[127:0] ← SRC[127:0]

DEST[255:128] (Unmodified)

**(V)MOVUPD (128-bit store form)**

DEST[127:0] ← SRC[127:0]
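
The load forms above can be sketched with a byte-level model in which `memcpy` stands in for the unaligned memory access. The type `ymm_bytes` and the function names are hypothetical; the sketch only shows how each form treats bits 255:128 of the destination register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a YMM register as 32 bytes, b[0] = bits 7:0. */
typedef struct { unsigned char b[32]; } ymm_bytes;

/* Legacy SSE MOVUPD xmm1, m128: bits 255:128 of the register are kept. */
static void movupd_legacy_load(ymm_bytes *dest, const void *mem)
{
    memcpy(dest->b, mem, 16);   /* memcpy has no alignment requirement */
}

/* VMOVUPD xmm1, m128 (VEX.128): bits 255:128 are zeroed. */
static void vmovupd128_load(ymm_bytes *dest, const void *mem)
{
    memcpy(dest->b, mem, 16);
    memset(dest->b + 16, 0, 16);
}

/* VMOVUPD ymm1, m256 (VEX.256): all 256 bits are loaded. */
static void vmovupd256_load(ymm_bytes *dest, const void *mem)
{
    memcpy(dest->b, mem, 32);
}
```

Any byte address is acceptable as `mem` in this model, which mirrors the absence of an alignment requirement on the memory operand.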

**Intel C/C++ Compiler Intrinsic Equivalent**

VMOVUPD __m256d _mm256_loadu_pd (double const * p);

VMOVUPD void _mm256_storeu_pd (double * p, __m256d a);

MOVUPD __m128d _mm_loadu_pd (double const * p);

MOVUPD void _mm_storeu_pd (double * p, __m128d a);

**SIMD Floating-Point Exceptions**

None
Other Exceptions

See Exceptions Type 10; additionally

#UD       If VEX.vvvv ≠ 1111B.

Note: VEX-encoded instructions do not report #AC; treatment of #AC may vary if the instruction is not encoded with a VEX prefix.
MOVUPS- Move Unaligned Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 10 /r MOVUPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Move unaligned packed single-precision floating-point from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>0F 11 /r MOVUPS xmm2/m128, xmm1</td>
<td>V/V</td>
<td>SSE</td>
<td>Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.128.0F 10 /r VMOVUPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed single-precision floating-point from xmm2/mem to xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 11 /r VMOVUPS xmm2/m128, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed single-precision floating-point from xmm1 to xmm2/mem</td>
</tr>
<tr>
<td>VEX.256.0F 10 /r VMOVUPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed single-precision floating-point from ymm2/mem to ymm1</td>
</tr>
<tr>
<td>VEX.256.0F 11 /r VMOVUPS ymm2/m256, ymm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move unaligned packed single-precision floating-point from ymm1 to ymm2/mem</td>
</tr>
</tbody>
</table>

Description

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

**VEX.256 encoded version:**

Moves 256 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load a YMM register from a 256-bit memory location, to store the contents of a YMM register into a 256-bit memory location, or to move data between two YMM registers.

**128-bit versions:**

Moves 128 bits of packed single-precision floating-point values from the source operand (second operand) to the destination operand (first operand). This instruction can be used to load an XMM register from a 128-bit memory location, to store the contents of an XMM register into a 128-bit memory location, or to move data between two XMM registers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

When the source or destination operand is a memory operand, the operand may be unaligned on a 16-byte boundary without causing a general-protection exception (#GP) to be generated.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Operation

VMOVUPS (VEX.256 encoded version)
DEST[255:0] ← SRC[255:0]

VMOVUPS (VEX.128 encoded load-form)
DEST[127:0] ← SRC[127:0]
DEST[255:128] ← 0

MOVUPS (128-bit load and register-copy form Legacy SSE version)
DEST[127:0] ← SRC[127:0]
DEST[255:128] (Unmodified)

(V)MOVUPS (128-bit store form)
DEST[127:0] ← SRC[127:0]

Intel C/C++ Compiler Intrinsic Equivalent

VMOVUPS __m256 _mm256_loadu_ps (float const * p);
VMOVUPS void _mm256_storeu_ps (float * p, __m256 a);
MOVUPS __m128 _mm_loadu_ps (float const * p);
MOVUPS void _mm_storeu_ps (float * p, __m128 a);

SIMD Floating-Point Exceptions

None
Other Exceptions

See Exceptions Type 10; additionally

#UD If VEX.vvvv != 1111B.

Note: VEX-encoded instructions do not report #AC; treatment of #AC may vary if the instruction is not encoded with a VEX prefix.

**MPSADBW - Multiple Sum of Absolute Differences**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 42 /r ib MPSADBW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm1 and xmm2/m128 and writes the results in xmm1. Starting offsets within xmm1 and xmm2/m128 are determined by imm8.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 42 /r ib VMPSADBW xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Sums absolute 8-bit integer difference of adjacent groups of 4 byte integers in xmm2 and xmm3/m128 and writes the results in xmm1. Starting offsets within xmm2 and xmm3/m128 are determined by imm8.</td>
</tr>
</tbody>
</table>

**Description**

MPSADBW sums the absolute difference of 4 unsigned bytes selected by immediate bits 0-1 from the second source with sequential groups of 4 unsigned bytes in the first source operand. The source bytes from the first source operand start at an offset determined by bit 2 of the immediate. The operation is repeated 8 times, each time using the same second source input but selecting the group of 4 bytes starting at the next higher byte in the first source. Each 16-bit sum is written to dest.

The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: The first source and destination are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

If VMPSADBW is encoded with VEX.L = 1, an attempt to execute the instruction will cause an #UD exception.

**Operation**

VMPSADBW (VEX.128 encoded version)

SRC2_OFFSET ← imm8[1:0]*32
SRC1_OFFSET ← imm8[2]*32
SRC1_BYTE0 ← SRC1[SRC1_OFFSET+7:SRC1_OFFSET]
SRC1_BYTE1 ← SRC1[SRC1_OFFSET+15:SRC1_OFFSET+8]
SRC1_BYTE2 ← SRC1[SRC1_OFFSET+23:SRC1_OFFSET+16]
SRC1_BYTE3 ← SRC1[SRC1_OFFSET+31:SRC1_OFFSET+24]
SRC1_BYTE4 ← SRC1[SRC1_OFFSET+39:SRC1_OFFSET+32]
SRC1_BYTE5 ← SRC1[SRC1_OFFSET+47:SRC1_OFFSET+40]
SRC1_BYTE6 ← SRC1[SRC1_OFFSET+55:SRC1_OFFSET+48]
SRC1_BYTE7 ← SRC1[SRC1_OFFSET+63:SRC1_OFFSET+56]
SRC1_BYTE8 ← SRC1[SRC1_OFFSET+71:SRC1_OFFSET+64]
SRC1_BYTE9 ← SRC1[SRC1_OFFSET+79:SRC1_OFFSET+72]
SRC1_BYTE10 ← SRC1[SRC1_OFFSET+87:SRC1_OFFSET+80]

SRC2_BYTE0 ← SRC2[SRC2_OFFSET+7:SRC2_OFFSET]
SRC2_BYTE1 ← SRC2[SRC2_OFFSET+15:SRC2_OFFSET+8]
SRC2_BYTE2 ← SRC2[SRC2_OFFSET+23:SRC2_OFFSET+16]
SRC2_BYTE3 ← SRC2[SRC2_OFFSET+31:SRC2_OFFSET+24]

TEMP0 ← ABS(SRC1_BYTE0 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE1 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE2 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE3 - SRC2_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE1 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE2 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE3 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE4 - SRC2_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE2 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE3 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE4 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE5 - SRC2_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE3 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE4 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE5 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE6 - SRC2_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE4 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE5 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE6 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE7 - SRC2_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE5 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE6 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE7 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE8 - SRC2_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE6 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE7 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE8 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE9 - SRC2_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(SRC1_BYTE7 - SRC2_BYTE0)
TEMP1 ← ABS(SRC1_BYTE8 - SRC2_BYTE1)
TEMP2 ← ABS(SRC1_BYTE9 - SRC2_BYTE2)
TEMP3 ← ABS(SRC1_BYTE10 - SRC2_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

DEST[255:128] ← 0

MPSADBW (128-bit Legacy SSE version)
SRC_OFFSET ← imm8[1:0]*32
DEST_OFFSET ← imm8[2]*32
DEST_BYTE0 ← DEST[DEST_OFFSET+7:DEST_OFFSET]
DEST_BYTE1 ← DEST[DEST_OFFSET+15:DEST_OFFSET+8]
DEST_BYTE2 ← DEST[DEST_OFFSET+23:DEST_OFFSET+16]
DEST_BYTE3 ← DEST[DEST_OFFSET+31:DEST_OFFSET+24]
DEST_BYTE4 ← DEST[DEST_OFFSET+39:DEST_OFFSET+32]
DEST_BYTE5 ← DEST[DEST_OFFSET+47:DEST_OFFSET+40]
DEST_BYTE6 ← DEST[DEST_OFFSET+55:DEST_OFFSET+48]
DEST_BYTE7 ← DEST[DEST_OFFSET+63:DEST_OFFSET+56]
DEST_BYTE8 ← DEST[DEST_OFFSET+71:DEST_OFFSET+64]
DEST_BYTE9 ← DEST[DEST_OFFSET+79:DEST_OFFSET+72]
DEST_BYTE10 ← DEST[DEST_OFFSET+87:DEST_OFFSET+80]

SRC_BYTE0 ← SRC[SRC_OFFSET+7:SRC_OFFSET]
SRC_BYTE1 ← SRC[SRC_OFFSET+15:SRC_OFFSET+8]
SRC_BYTE2 ← SRC[SRC_OFFSET+23:SRC_OFFSET+16]
SRC_BYTE3 ← SRC[SRC_OFFSET+31:SRC_OFFSET+24]

TEMP0 ← ABS(DEST_BYTE0 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE1 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE2 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE3 - SRC_BYTE3)
DEST[15:0] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE1 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE2 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE3 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE4 - SRC_BYTE3)
DEST[31:16] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE2 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE3 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE4 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE5 - SRC_BYTE3)
DEST[47:32] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE3 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE4 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE5 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE6 - SRC_BYTE3)
DEST[63:48] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE4 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE5 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE6 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE7 - SRC_BYTE3)
DEST[79:64] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE5 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE6 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE7 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE8 - SRC_BYTE3)
DEST[95:80] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
TEMP0 ← ABS(DEST_BYTE6 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE7 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE8 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE9 - SRC_BYTE3)
DEST[111:96] ← TEMP0 + TEMP1 + TEMP2 + TEMP3

TEMP0 ← ABS(DEST_BYTE7 - SRC_BYTE0)
TEMP1 ← ABS(DEST_BYTE8 - SRC_BYTE1)
TEMP2 ← ABS(DEST_BYTE9 - SRC_BYTE2)
TEMP3 ← ABS(DEST_BYTE10 - SRC_BYTE3)
DEST[127:112] ← TEMP0 + TEMP1 + TEMP2 + TEMP3
DEST[255:128] (Unmodified)
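
The unrolled pseudocode above amounts to eight overlapping 4-byte sums of absolute differences. As a rough illustration, the loop below is a scalar C sketch of the 128-bit computation. The function name `mpsadbw_model` and the array-based operand layout (byte 0 = bits 7:0) are assumptions made for this example; the sketch does not model the upper 128 bits of the YMM register.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of the 128-bit (V)MPSADBW data path (hypothetical helper).
   src1, src2: 16 unsigned bytes each, byte 0 = bits 7:0.
   imm8[1:0] selects the 4-byte group in src2; imm8[2] selects a starting
   offset of 0 or 4 bytes in src1.
   dest: eight 16-bit sums, word 0 = bits 15:0. */
static void mpsadbw_model(uint16_t dest[8], const uint8_t src1[16],
                          const uint8_t src2[16], unsigned imm8)
{
    unsigned src2_off = (imm8 & 3u) * 4u;        /* imm8[1:0]*32 bits, in bytes */
    unsigned src1_off = ((imm8 >> 2) & 1u) * 4u; /* imm8[2]*32 bits, in bytes   */

    for (unsigned i = 0; i < 8; i++) {           /* one 16-bit result per step  */
        unsigned sum = 0;
        for (unsigned j = 0; j < 4; j++) {
            int diff = (int)src1[src1_off + i + j] - (int)src2[src2_off + j];
            sum += (unsigned)abs(diff);
        }
        dest[i] = (uint16_t)sum;                 /* at most 4*255, fits 16 bits */
    }
}
```

The highest byte read from the first source is at offset 4 + 7 + 3 = 14, which matches SRC1_BYTE10 (relative to SRC1_OFFSET) in the pseudocode.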

**Intel C/C++ Compiler Intrinsic Equivalent**

MPSADBW __m128i _mm_mpsadbw_epu8 (__m128i s1, __m128i s2, const int mask);

**SIMD Floating-Point Exceptions**

None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1
MULPD - Multiply Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 59 /r MULPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply packed double-precision floating-point values from xmm2/mem to xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 59 /r VMULPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed double-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 59 /r VMULPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed double-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD multiply of the two or four packed double-precision floating-point values in the first source operand by the corresponding values in the second source operand, and stores the packed double-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is the same XMM register as the first source, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

Operation

VMULPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[191:128] ← SRC1[191:128] * SRC2[191:128]
DEST[255:192] ← SRC1[255:192] * SRC2[255:192]

VMULPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64] * SRC2[127:64]
DEST[255:128] ← 0

MULPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[127:64] ← DEST[127:64] * SRC[127:64]
DEST[255:128] (Unmodified)
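
The three operation blocks above can be compared side by side in a short scalar C sketch. The `ymm_pd` type and the function names are hypothetical and exist only for this illustration; the model works on values and does not cover MXCSR rounding control or exception reporting.

```c
#include <stddef.h>

/* Hypothetical model of a YMM register as four doubles.
   Elements 0..1 stand for bits 127:0 (the XMM register). */
typedef struct { double d[4]; } ymm_pd;

/* VMULPD, VEX.256 form: four independent products. */
static void vmulpd_256(ymm_pd *dest, const ymm_pd *src1, const ymm_pd *src2)
{
    for (size_t i = 0; i < 4; i++)
        dest->d[i] = src1->d[i] * src2->d[i];
}

/* VMULPD, VEX.128 form: two products, bits 255:128 zeroed. */
static void vmulpd_128(ymm_pd *dest, const ymm_pd *src1, const ymm_pd *src2)
{
    for (size_t i = 0; i < 2; i++)
        dest->d[i] = src1->d[i] * src2->d[i];
    dest->d[2] = 0.0;
    dest->d[3] = 0.0;
}

/* MULPD, Legacy SSE form: dest doubles as the first source,
   bits 255:128 are left unmodified. */
static void mulpd_legacy(ymm_pd *dest, const ymm_pd *src)
{
    for (size_t i = 0; i < 2; i++)
        dest->d[i] *= src->d[i];
}
```

The non-destructive three-operand VEX forms leave both sources intact, whereas the legacy form overwrites its first source.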

Intel C/C++ Compiler Intrinsic Equivalent
VMULPD __m256d _mm256_mul_pd (__m256d a, __m256d b);
MULPD __m128d _mm_mul_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
MULPS - Multiply Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 59 /r MULPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Multiply packed single-precision floating-point values from xmm2/mem to xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 59 /r VMULPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed single-precision floating-point values from xmm3/mem to xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 59 /r VMULPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed single-precision floating-point values from ymm3/mem to ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD multiply of the four or eight packed single-precision floating-point values in the first source operand by the corresponding values in the second source operand, and stores the packed single-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is the same XMM register as the first source, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

Operation
VMULPS (VEX.256 encoded version)
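DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[159:128] ← SRC1[159:128] * SRC2[159:128]
DEST[191:160] ← SRC1[191:160] * SRC2[191:160]
DEST[223:192] ← SRC1[223:192] * SRC2[223:192]
DEST[255:224] ← SRC1[255:224] * SRC2[255:224]

VMULPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[255:128] ← 0

MULPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[63:32] ← SRC1[63:32] * SRC2[63:32]
DEST[95:64] ← SRC1[95:64] * SRC2[95:64]
DEST[127:96] ← SRC1[127:96] * SRC2[127:96]
DEST[255:128] (Unmodified)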

Intel C/C++ Compiler Intrinsic Equivalent
VMULPS __m256 _mm256_mul_ps (__m256 a, __m256 b);
MULPS __m128 _mm_mul_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
MULSD - Multiply Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 59 /r MULSD xmm1,xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply the low double-precision floating-point value in xmm2/mem64 by low double precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 59 /r VMULSD xmm1,xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the low double-precision floating-point value in xmm3/mem64 by low double precision floating-point value in xmm2.</td>
</tr>
</tbody>
</table>

Description

Multiplies the low double-precision floating-point value in the second source operand by the low double-precision floating-point value in the first source operand, and stores the double-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source operand and the destination operands are XMM registers. The high quadword of the destination operand is copied from the high bits of the first source operand. See Figure 11-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a scalar double-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VMULSD is encoded with VEX.L=0. Encoding VMULSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VMULSD (VEX.128 encoded version)

DEST[63:0] ← SRC1[63:0] * SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

MULSD (128-bit Legacy SSE version)

DEST[63:0] ← DEST[63:0] * SRC[63:0]
DEST[255:64] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

MULSD __m128d _mm_mul_sd (__m128d a, __m128d b)

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

See Exceptions Type 3
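
For the scalar forms, only the low element is multiplied; the rest of the destination is either copied from the first source (VEX.128) or left alone (Legacy SSE). The sketch below models the MULSD and VMULSD operations in scalar C under stated assumptions: `ymm_pd`, `vmulsd_model`, and `mulsd_legacy_model` are hypothetical names, the second source is passed as a plain `double` standing for its low 64 bits, and exception behavior is not modelled.

```c
/* Hypothetical model of a YMM register as four doubles.
   Element 0 is bits 63:0, element 1 is bits 127:64. */
typedef struct { double d[4]; } ymm_pd;

/* VMULSD xmm1, xmm2, xmm3/m64 (VEX.128 encoded version). */
static void vmulsd_model(ymm_pd *dest, const ymm_pd *src1, double src2_low)
{
    double lo = src1->d[0] * src2_low;  /* read sources before writing dest */
    double hi = src1->d[1];             /* so dest may be the same as src1  */
    dest->d[0] = lo;                    /* DEST[63:0]                       */
    dest->d[1] = hi;                    /* DEST[127:64] <- SRC1[127:64]     */
    dest->d[2] = 0.0;                   /* DEST[255:128] <- 0               */
    dest->d[3] = 0.0;
}

/* MULSD xmm1, xmm2/m64 (128-bit Legacy SSE version). */
static void mulsd_legacy_model(ymm_pd *dest, double src_low)
{
    dest->d[0] *= src_low;              /* DEST[255:64] unmodified */
}
```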
MULSS - Multiply Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 59 /r MULSS xmm1,xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Multiply the low single-precision floating-point value in xmm2/mem by the low single-precision floating-point value in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 59 /r VMULSS xmm1,xmm2,xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the low single-precision floating-point value in xmm3/mem by the low single-precision floating-point value in xmm2.</td>
</tr>
</tbody>
</table>

Description

Multiplies the low single-precision floating-point value from the second source operand by the low single-precision floating-point value in the first source operand, and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source operand and the destination operands are XMM registers. The three high-order doublewords of the destination operand remain unchanged. See Figure 10-6 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a scalar single-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VMULSS is encoded with VEX.L=0. Encoding VMULSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VMULSS (VEX.128 encoded version)

DEST[31:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

MULSS (128-bit Legacy SSE version)
DEST[31:0] ← DEST[31:0] * SRC[31:0]
DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
MULSS __m128 _mm_mul_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
Underflow, Overflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
ORPD - Bitwise Logical OR of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 56/r ORPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the bitwise logical OR of packed double-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 56 /r VORPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical OR of packed double-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 56 /r VORPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical OR of packed double-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical OR of the two or four packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is the same XMM register as the first source, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

If VORPD is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

Operation

VORPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE OR SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE OR SRC2[255:192]

VORPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] BITWISE OR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE OR SRC2[127:64]
DEST[255:128] ← 0

ORPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] BITWISE OR SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE OR SRC[127:64]
DEST[255:128] (Unmodified)
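
Because the OR acts on raw bit patterns rather than numeric values, a common use is setting the sign bit of every element. The following scalar C sketch shows one lane of the operation and that use case. It assumes `double` is IEEE 754 binary64 and that the compiler preserves signed zeros; `or_pd_lane` and `negative_abs` are hypothetical helper names, not intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* One 64-bit lane of (V)ORPD: OR the bit patterns of two doubles. */
static double or_pd_lane(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);   /* reinterpret bits without aliasing issues */
    memcpy(&ub, &b, sizeof ub);
    ua |= ub;
    memcpy(&a, &ua, sizeof a);
    return a;
}

/* ORing with -0.0 (only the sign bit set) forces the sign bit on,
   giving -|x| for any finite x. */
static double negative_abs(double x)
{
    return or_pd_lane(x, -0.0);
}
```

No floating-point exceptions arise from the OR itself, which is consistent with the "None" entry under SIMD Floating-Point Exceptions.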

Intel C/C++ Compiler Intrinsic Equivalent
VORPD __m256d _mm256_or_pd (__m256d a, __m256d b);
ORPD __m128d _mm_or_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
ORPS - Bitwise Logical OR of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 56 /r ORPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the bitwise logical OR of packed single-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 56 /r VORPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical OR of packed single-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 56 /r VORPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical OR of packed single-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Performs a bitwise logical OR of the four or eight packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the destination YMM register are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is the same XMM register as the first source, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

If VORPS is encoded with VEX.L= 1, an attempt to execute the instruction encoded with VEX.L= 1 will cause an #UD exception.

Operation
VORPS (VEX.256 encoded version)
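DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE OR SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE OR SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE OR SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE OR SRC2[255:224]

VORPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[255:128] ← 0

ORPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] BITWISE OR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE OR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE OR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE OR SRC2[127:96]
DEST[255:128] (Unmodified)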

Intel C/C++ Compiler Intrinsic Equivalent
VORPS __m256 _mm256_or_ps (__m256 a, __m256 b);
ORPS __m128 _mm_or_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4

PABSB/PABSW/PABSD - Packed Absolute Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 1C /r PABSB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
<tr>
<td>66 0F 38 1D /r PABSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
<tr>
<td>66 0F 38 1E /r PABSD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 1C /r VPABSB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compute the absolute value of bytes in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 1D /r VPABSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compute the absolute value of 16-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 1E /r VPABSD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compute the absolute value of 32-bit integers in xmm2/m128 and store UNSIGNED result in xmm1.</td>
</tr>
</tbody>
</table>

Description

PABSB/W/D computes the absolute value of each data element of the source operand and stores the UNSIGNED results in the destination operand. PABSB operates on signed bytes, PABSW operates on signed 16-bit words, and PABSD operates on signed 32-bit integers. The source is an XMM register or a 128-bit memory location. The destination operand is an XMM register.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0; otherwise instructions will #UD.

Operation
BYTE_ABS(SRC)
{
    DEST[7:0] ← ABS(SRC[7:0])
    ... repeat operation for 2nd through 15th bytes
    DEST[127:120] ← ABS(SRC[127:120])
}

WORD_ABS(SRC)
{
    DEST[15:0] ← ABS(SRC[15:0])
    ... repeat operation for 2nd through 7th 16-bit words
    DEST[127:112] ← ABS(SRC[127:112])
}

DWORD_ABS(SRC)
{
    DEST[31:0] ← ABS(SRC[31:0])
    DEST[63:32] ← ABS(SRC[63:32])
    DEST[95:64] ← ABS(SRC[95:64])
    DEST[127:96] ← ABS(SRC[127:96])
}

VPABSB (VEX.128 encoded version)
DEST[127:0] ← BYTE_ABS(SRC)
DEST[255:128] ← 0

PABSB (128-bit Legacy SSE version)
DEST[127:0] ← BYTE_ABS(SRC)
DEST[255:128] (Unmodified)

VPABSW (VEX.128 encoded version)
DEST[127:0] ← WORD_ABS(SRC)
DEST[255:128] ← 0

PABSW (128-bit Legacy SSE version)
DEST[127:0] ← WORD_ABS(SRC)
DEST[255:128] (Unmodified)

VPABSD (VEX.128 encoded version)
DEST[127:0] ← DWORD_ABS(SRC)
DEST[255:128] ← 0

PABSD (128-bit Legacy SSE version)
DEST[127:0] ← DWORD_ABS(SRC)
DEST[255:128] (Unmodified)
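
The emphasis on an UNSIGNED result matters for the most negative input: the absolute value of -128 does not fit in a signed byte, but it does fit in an unsigned one. The scalar C sketch below illustrates the byte case only; `pabsb_lane` and `pabsb_model` are hypothetical names chosen for this example.

```c
#include <stdint.h>

/* One lane of BYTE_ABS: absolute value of a signed byte, stored unsigned. */
static uint8_t pabsb_lane(int8_t x)
{
    int v = x;                        /* widen first so negating -128 is safe */
    return (uint8_t)(v < 0 ? -v : v); /* -128 becomes 128 (80H)               */
}

/* BYTE_ABS over all 16 bytes of a 128-bit operand. */
static void pabsb_model(uint8_t dest[16], const int8_t src[16])
{
    for (int i = 0; i < 16; i++)
        dest[i] = pabsb_lane(src[i]);
}
```

Reading the 80H result back as a signed byte would give -128 again, which is why the destination is described as unsigned.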

Intel C/C++ Compiler Intrinsic Equivalent
PABSB __m128i _mm_abs_epi8 (__m128i a)
PABSW __m128i _mm_abs_epi16 (__m128i a)
PABSD __m128i _mm_abs_epi32 (__m128i a)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.
## PACKSSWB/PACKSSDW - Pack with Signed Saturation

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 63 /r PACKSSWB xmm1,xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Converts 8 packed signed word integers from xmm1 and from xmm2/m128 into 16 packed signed byte integers in xmm1 using signed saturation.</td>
</tr>
<tr>
<td>66 0F 6B /r PACKSSDW xmm1,xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Converts 4 packed signed doubleword integers from xmm1 and from xmm2/m128 into 8 packed signed word integers in xmm1 using signed saturation.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 63 /r VPACKSSWB xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Converts 8 packed signed word integers from xmm2 and from xmm3/m128 into 16 packed signed byte integers in xmm1 using signed saturation.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 6B /r VPACKSSDW xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Converts 4 packed signed doubleword integers from xmm2 and from xmm3/m128 into 8 packed signed word integers in xmm1 using signed saturation.</td>
</tr>
</tbody>
</table>

### Description

Converts packed signed word integers into packed signed byte integers (PACKSSWB) or converts packed signed doubleword integers into packed signed word integers (PACKSSDW), using saturation to handle overflow conditions. See Figure 5-17 for an example of the packing operation.

The PACKSSWB instruction converts 8 signed word integers from the first source operand and 8 signed word integers from the second source operand into 16 signed byte integers and stores the result in the destination operand. If a signed word integer value is beyond the range of a signed byte integer (that is, greater than 7FH for a positive integer or less than 80H for a negative integer), the saturated signed byte integer value of 7FH or 80H, respectively, is stored in the destination.

The PACKSSDW instruction packs 4 signed doublewords from the first source operand and 4 signed doublewords from the second source operand into 8 signed words in the destination operand (see Figure 5-17).

If a signed doubleword integer value is beyond the range of a signed word (that is, greater than 7FFFH for a positive integer or less than 8000H for a negative integer), the saturated signed word integer value of 7FFFH or 8000H, respectively, is stored into the destination.

When operating on 128-bit operands, the first source and destination operands are XMM registers, and the second source operand can be either an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Figure 5-17. PACKSSDW Instruction Operation using 64-bit Operands

Operation

SATURATING_PACK_WB(SRC1, SRC2)
DEST[7:0] ← SaturateSignedWordToSignedByte (SRC1[15:0])
DEST[15:8] ← SaturateSignedWordToSignedByte (SRC1[31:16])
DEST[23:16] ← SaturateSignedWordToSignedByte (SRC1[47:32])
DEST[31:24] ← SaturateSignedWordToSignedByte (SRC1[63:48])
DEST[39:32] ← SaturateSignedWordToSignedByte (SRC1[79:64])
DEST[47:40] ← SaturateSignedWordToSignedByte (SRC1[95:80])
DEST[55:48] ← SaturateSignedWordToSignedByte (SRC1[111:96])
DEST[63:56] ← SaturateSignedWordToSignedByte (SRC1[127:112])
DEST[71:64] ← SaturateSignedWordToSignedByte (SRC2[15:0])
DEST[79:72] ← SaturateSignedWordToSignedByte (SRC2[31:16])
DEST[87:80] ← SaturateSignedWordToSignedByte (SRC2[47:32])
DEST[95:88] ← SaturateSignedWordToSignedByte (SRC2[63:48])
DEST[103:96] ← SaturateSignedWordToSignedByte (SRC2[79:64])
DEST[111:104] ← SaturateSignedWordToSignedByte (SRC2[95:80])
DEST[119:112] ← SaturateSignedWordToSignedByte (SRC2[111:96])
DEST[127:120] ← SaturateSignedWordToSignedByte (SRC2[127:112])

SATURATING_PACK_DW(SRC1, SRC2)
DEST[15:0] ← SaturateSignedDwordToSignedWord (SRC1[31:0])
DEST[31:16] ← SaturateSignedDwordToSignedWord (SRC1[63:32])
DEST[47:32] ← SaturateSignedDwordToSignedWord (SRC1[95:64])
DEST[63:48] ← SaturateSignedDwordToSignedWord (SRC1[127:96])
DEST[79:64] ← SaturateSignedDwordToSignedWord (SRC2[31:0])
DEST[95:80] ← SaturateSignedDwordToSignedWord (SRC2[63:32])
DEST[111:96] ← SaturateSignedDwordToSignedWord (SRC2[95:64])
DEST[127:112] ← SaturateSignedDwordToSignedWord (SRC2[127:96])

PACKSSDW
DEST[127:0] ← SATURATING_PACK_DW(DEST, SRC)
DEST[255:128] (Unmodified)

VPACKSSDW
DEST[127:0] ← SATURATING_PACK_DW(SRC1, SRC2)
DEST[255:128] ← 0

PACKSSWB
DEST[127:0] ← SATURATING_PACK_WB(DEST, SRC)
DEST[255:128] (Unmodified)

VPACKSSWB
DEST[127:0] ← SATURATING_PACK_WB(SRC1, SRC2)
DEST[255:128] ← 0
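
As an illustration of SATURATING_PACK_WB, the scalar C sketch below clamps each signed word to the signed byte range and lays the results out as the pseudocode does: the first source fills the low eight bytes and the second source fills the high eight. The names `saturate_word_to_sbyte` and `packsswb_model` are hypothetical, and the upper 128 bits of the YMM register are not modelled.

```c
#include <stdint.h>

/* SaturateSignedWordToSignedByte: clamp to [-128, 127], i.e. 80H..7FH. */
static int8_t saturate_word_to_sbyte(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;   /* 7FH */
    if (w < INT8_MIN) return INT8_MIN;   /* 80H */
    return (int8_t)w;
}

/* SATURATING_PACK_WB(SRC1, SRC2) on arrays; element 0 is the lowest lane.
   dest must not overlap src1 or src2 in this sketch. */
static void packsswb_model(int8_t dest[16],
                           const int16_t src1[8], const int16_t src2[8])
{
    for (int i = 0; i < 8; i++) {
        dest[i]     = saturate_word_to_sbyte(src1[i]);  /* DEST[63:0]   */
        dest[i + 8] = saturate_word_to_sbyte(src2[i]);  /* DEST[127:64] */
    }
}
```

The doubleword-to-word variant follows the same pattern with four elements per source and limits of 7FFFH and 8000H.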

Intel C/C++ Compiler Intrinsic Equivalent
PACKSSWB __m128i _mm_packs_epi16(__m128i m1, __m128i m2)
PACKSSDW __m128i _mm_packs_epi32(__m128i m1, __m128i m2)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PACKUSWB/PACKUSDW - Pack with Unsigned Saturation

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 67 /r PACKUSWB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Converts 8 signed word integers from xmm1 and 8 signed word integers from xmm2/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.</td>
</tr>
<tr>
<td>66 0F 38 2B /r PACKUSDW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Convert 4 packed signed doubleword integers from xmm1 and 4 packed signed doubleword integers from xmm2/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 67 /r VPACKUSWB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Converts 8 signed word integers from xmm2 and 8 signed word integers from xmm3/m128 into 16 unsigned byte integers in xmm1 using unsigned saturation.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2B /r VPACKUSDW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Convert 4 packed signed doubleword integers from xmm2 and 4 packed signed doubleword integers from xmm3/m128 into 8 packed unsigned word integers in xmm1 using unsigned saturation.</td>
</tr>
</tbody>
</table>

Description

PACKUSWB:
Converts 8 signed word integers from the first source operand and 8 signed word integers from the second source operand into 16 unsigned byte integers and stores the result in the destination operand. (See Figure 5-17 for an example of the packing operation.) If a signed word integer value is beyond the range of an unsigned byte integer (that is, greater than FFH or less than 00H), the saturated unsigned byte integer value of FFH or 00H, respectively, is stored in the destination.

The first source operand and destination operand must be XMM registers, and the second source operand can be either an XMM register or a 128-bit memory location.

PACKUSDW:
Converts packed signed doubleword integers into packed unsigned word integers, using unsigned saturation to handle overflow conditions. If the signed doubleword value is beyond the range of an unsigned word (that is, greater than FFFFH or less than 0000H), the saturated unsigned word integer value of FFFFH or 0000H, respectively, is stored in the destination.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
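
The saturating pack described above can be modelled in scalar C. This is a minimal sketch for checking expected results, not the hardware implementation; the function names `saturate_i16_to_u8` and `packus_wb_model` are hypothetical, and array index 0 is assumed to hold the least significant element.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value into the unsigned 8-bit range [0, 255]. */
uint8_t saturate_i16_to_u8(int16_t v)
{
    if (v < 0)   return 0;      /* below 00H saturates to 00H */
    if (v > 255) return 255;    /* above FFH saturates to FFH */
    return (uint8_t)v;
}

/* Scalar model of the 128-bit word-to-byte pack: the 8 words of src1
   fill result bytes 0..7, the 8 words of src2 fill result bytes 8..15. */
void packus_wb_model(uint8_t dst[16], const int16_t src1[8], const int16_t src2[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i]     = saturate_i16_to_u8(src1[i]);
        dst[i + 8] = saturate_i16_to_u8(src2[i]);
    }
}
```

The doubleword-to-word form follows the same pattern with the clamp range widened to [0, 65535].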

Operation
SaturateSignedWordToUnsignedByte(SRC)
{
   TMP ← SRC < 0 ? 0 : SRC
   return SRC > FFH ? FFH : TMP
}

SaturateSignedDWordToUnsignedWord(SRC)
{
   TMP ← SRC < 0 ? 0 : SRC
   return SRC > FFFFH ? FFFFH : TMP
}

UNSIGNED_SATURATING_PACK_DW(SRC1, SRC2)
DEST[15:0] ← SaturateSignedDWordToUnsignedWord(SRC1[31:0])
DEST[31:16] ← SaturateSignedDWordToUnsignedWord(SRC1[63:32])
DEST[47:32] ← SaturateSignedDWordToUnsignedWord(SRC1[95:64])
DEST[63:48] ← SaturateSignedDWordToUnsignedWord(SRC1[127:96])
DEST[79:64] ← SaturateSignedDWordToUnsignedWord(SRC2[31:0])
DEST[95:80] ← SaturateSignedDWordToUnsignedWord(SRC2[63:32])
DEST[111:96] ← SaturateSignedDWordToUnsignedWord(SRC2[95:64])
DEST[127:112] ← SaturateSignedDWordToUnsignedWord(SRC2[127:96])

UNSIGNED_SATURATING_PACK_WB(SRC1, SRC2)
DEST[7:0] ← SaturateSignedWordToUnsignedByte (SRC1[15:0])
DEST[15:8] ← SaturateSignedWordToUnsignedByte (SRC1[31:16])
DEST[23:16] ← SaturateSignedWordToUnsignedByte (SRC1[47:32])
DEST[31:24] ← SaturateSignedWordToUnsignedByte (SRC1[63:48])
DEST[39:32] ← SaturateSignedWordToUnsignedByte (SRC1[79:64])
DEST[47:40] ← SaturateSignedWordToUnsignedByte (SRC1[95:80])
DEST[55:48] ← SaturateSignedWordToUnsignedByte (SRC1[111:96])
DEST[63:56] ← SaturateSignedWordToUnsignedByte (SRC1[127:112])
DEST[71:64] ← SaturateSignedWordToUnsignedByte (SRC2[15:0])
DEST[79:72] ← SaturateSignedWordToUnsignedByte (SRC2[31:16])
DEST[87:80] ← SaturateSignedWordToUnsignedByte (SRC2[47:32])
DEST[95:88] ← SaturateSignedWordToUnsignedByte (SRC2[63:48])
DEST[103:96] ← SaturateSignedWordToUnsignedByte (SRC2[79:64])
DEST[111:104] ← SaturateSignedWordToUnsignedByte (SRC2[95:80])
DEST[119:112] ← SaturateSignedWordToUnsignedByte (SRC2[111:96])
DEST[127:120] ← SaturateSignedWordToUnsignedByte (SRC2[127:112])

VPACKUSWB (VEX.128 encoded version)
DEST[127:0] ← UNSIGNED_SATURATING_PACK_WB(SRC1, SRC2)
DEST[255:128] ← 0

VPACKUSDW (VEX.128 encoded version)
DEST[127:0] ← UNSIGNED_SATURATING_PACK_DW(SRC1, SRC2)
DEST[255:128] ← 0

PACKUSWB (128-bit Legacy SSE version)
DEST[127:0] ← UNSIGNED_SATURATING_PACK_WB(DEST, SRC)
DEST[255:128] (Unmodified)

PACKUSDW (128-bit Legacy SSE version)
DEST[127:0] ← UNSIGNED_SATURATING_PACK_DW(DEST, SRC)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PACKUSDW __m128i _mm_packus_epi32(__m128i m1, __m128i m2);
PACKUSWB __m128i _mm_packus_epi16(__m128i m1, __m128i m2)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
# PADDB/PADDW/PADDD/PADDQ - Add Packed Integers

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F FC /r PADDB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed byte integers from xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>66 0F FD /r PADDW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed word integers from xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>66 0F FE /r PADDD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed doubleword integers from xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>66 0F D4 /r PADDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed quadword integers from xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F FC /r VPADDB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed byte integers from xmm3/m128 and xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F FD /r VPADDW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed word integers from xmm3/m128 and xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F FE /r VPADDD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed doubleword integers from xmm3/m128 and xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D4 /r VPADDQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed quadword integers from xmm3/m128 and xmm2.</td>
</tr>
</tbody>
</table>

**Description**

Adds the packed byte, word, doubleword, or quadword integers in the first source operand to the second source operand and stores the result in the destination operand. The second source operand is an XMM register or a 128-bit memory location. The first source operand and destination operand are XMM registers. When a result is too large to be represented in the 8-, 16-, 32-, or 64-bit integer (overflow), the result is wrapped around and the low bits are written to the destination element (that is, the carry is ignored).

Note that these instructions can operate on either unsigned or signed (two’s complement notation) integers; however, they do not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected overflow conditions, software must control the ranges of the values operated on.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
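
Because no flag reports a wrapped lane, any range check has to happen in software. The sketch below models the byte form in scalar C and adds one possible pre-check for unsigned data; both function names are hypothetical and the check is only one way to "control the ranges", assumed here for illustration.

```c
#include <stdint.h>

/* Element-wise byte add with wraparound, modelling one 128-bit PADDB. */
void paddb_model(uint8_t dst[16], const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);   /* carry out of bit 7 is discarded */
}

/* Hypothetical software-side check for unsigned bytes:
   returns 1 if any lane of a + b would exceed 255, else 0. */
int paddb_would_wrap_unsigned(const uint8_t a[16], const uint8_t b[16])
{
    for (int i = 0; i < 16; i++)
        if ((unsigned)a[i] + (unsigned)b[i] > 255u)
            return 1;
    return 0;
}
```

The same bit pattern results whether the lanes are read as signed or unsigned, which is why a single instruction serves both interpretations.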

**Operation**

VPADDB (VEX.128 encoded version)

DEST[7:0] ← SRC1[7:0]+SRC2[7:0]
DEST[15:8] ← SRC1[15:8]+SRC2[15:8]
DEST[23:16] ← SRC1[23:16]+SRC2[23:16]
DEST[31:24] ← SRC1[31:24]+SRC2[31:24]
DEST[39:32] ← SRC1[39:32]+SRC2[39:32]
DEST[47:40] ← SRC1[47:40]+SRC2[47:40]
DEST[55:48] ← SRC1[55:48]+SRC2[55:48]
DEST[63:56] ← SRC1[63:56]+SRC2[63:56]
DEST[71:64] ← SRC1[71:64]+SRC2[71:64]
DEST[79:72] ← SRC1[79:72]+SRC2[79:72]
DEST[87:80] ← SRC1[87:80]+SRC2[87:80]
DEST[95:88] ← SRC1[95:88]+SRC2[95:88]
DEST[103:96] ← SRC1[103:96]+SRC2[103:96]
DEST[111:104] ← SRC1[111:104]+SRC2[111:104]
DEST[119:112] ← SRC1[119:112]+SRC2[119:112]
DEST[127:120] ← SRC1[127:120]+SRC2[127:120]
DEST[255:128] ← 0

PADDB (128-bit Legacy SSE version)

DEST[7:0] ← DEST[7:0]+SRC[7:0]
DEST[15:8] ← DEST[15:8]+SRC[15:8]
DEST[23:16] ← DEST[23:16]+SRC[23:16]
DEST[31:24] ← DEST[31:24]+SRC[31:24]
DEST[39:32] ← DEST[39:32]+SRC[39:32]
DEST[47:40] ← DEST[47:40]+SRC[47:40]
DEST[55:48] ← DEST[55:48]+SRC[55:48]
DEST[63:56] ← DEST[63:56]+SRC[63:56]
DEST[71:64] ← DEST[71:64]+SRC[71:64]
DEST[79:72] ← DEST[79:72]+SRC[79:72]
DEST[87:80] ← DEST[87:80]+SRC[87:80]
DEST[95:88] ← DEST[95:88]+SRC[95:88]
DEST[103:96] ← DEST[103:96]+SRC[103:96]
DEST[111:104] ← DEST[111:104]+SRC[111:104]
DEST[119:112] ← DEST[119:112]+SRC[119:112]
DEST[127:120] ← DEST[127:120]+SRC[127:120]
DEST[255:128] (Unmodified)

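VPADDW (VEX.128 encoded version)
DEST[15:0] ← SRC1[15:0]+SRC2[15:0]
DEST[31:16] ← SRC1[31:16]+SRC2[31:16]
DEST[47:32] ← SRC1[47:32]+SRC2[47:32]
DEST[63:48] ← SRC1[63:48]+SRC2[63:48]
DEST[79:64] ← SRC1[79:64]+SRC2[79:64]
DEST[95:80] ← SRC1[95:80]+SRC2[95:80]
DEST[111:96] ← SRC1[111:96]+SRC2[111:96]
DEST[127:112] ← SRC1[127:112]+SRC2[127:112]
DEST[255:128] ← 0

PADDW (128-bit Legacy SSE version)
DEST[15:0] ← DEST[15:0]+SRC[15:0]
DEST[31:16] ← DEST[31:16]+SRC[31:16]
DEST[47:32] ← DEST[47:32]+SRC[47:32]
DEST[63:48] ← DEST[63:48]+SRC[63:48]
DEST[79:64] ← DEST[79:64]+SRC[79:64]
DEST[95:80] ← DEST[95:80]+SRC[95:80]
DEST[111:96] ← DEST[111:96]+SRC[111:96]
DEST[127:112] ← DEST[127:112]+SRC[127:112]
DEST[255:128] (Unmodified)

VPADDD (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0]+SRC2[31:0]
DEST[63:32] ← SRC1[63:32]+SRC2[63:32]
DEST[95:64] ← SRC1[95:64]+SRC2[95:64]
DEST[127:96] ← SRC1[127:96]+SRC2[127:96]
DEST[255:128] ← 0

PADDD (128-bit Legacy SSE version)
DEST[31:0] ← DEST[31:0]+SRC[31:0]
DEST[63:32] ← DEST[63:32]+SRC[63:32]
DEST[95:64] ← DEST[95:64]+SRC[95:64]
DEST[127:96] ← DEST[127:96]+SRC[127:96]
DEST[255:128] (Unmodified)

VPADDQ (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0]+SRC2[63:0]
DEST[127:64] ← SRC1[127:64]+SRC2[127:64]
DEST[255:128] ← 0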

**PADDQ (128-bit Legacy SSE version)**
DEST[63:0] ← DEST[63:0]+SRC[63:0]
DEST[127:64] ← DEST[127:64]+SRC[127:64]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PADDB __m128i _mm_add_epi8 (__m128i a, __m128i b)
PADDW __m128i _mm_add_epi16 (__m128i a, __m128i b)
PADDD __m128i _mm_add_epi32 (__m128i a, __m128i b)
PADDQ __m128i _mm_add_epi64 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PADDSB/PADDSW - Add Packed Signed Integers with Signed Saturation

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F EC /r PADDSB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed signed byte integers from xmm2/m128 and xmm1 and saturate the results.</td>
</tr>
<tr>
<td>66 0F ED /r PADDSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed signed word integers from xmm2/m128 and xmm1 and saturate the results.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F EC /r VPADDSB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed signed byte integers from xmm3/m128 and xmm2 and saturate the results.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F ED /r VPADDSW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed signed word integers from xmm3/m128 and xmm2 and saturate the results.</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD add of the packed signed integers from the second source operand and the first source operand and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation.

Overflow is handled with signed saturation, as described in the following paragraphs. The second source operand can be either an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers.

The PADDSB instruction adds packed signed byte integers. When an individual byte result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H, respectively, is written to the destination operand.

The PADDSW instruction adds packed signed word integers. When an individual word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or 8000H, respectively, is written to the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
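
The per-element behavior of signed saturation can be sketched in scalar C as follows. The helper names are hypothetical; the point is only that the sum is formed at full precision first and then clamped, so the result never wraps.

```c
#include <stdint.h>

/* One byte lane of a signed saturating add. */
int8_t adds_i8_model(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;             /* exact: int is wider than 8 bits */
    if (sum > INT8_MAX) return INT8_MAX;   /* clamp to 7FH */
    if (sum < INT8_MIN) return INT8_MIN;   /* clamp to 80H */
    return (int8_t)sum;
}

/* One word lane of a signed saturating add. */
int16_t adds_i16_model(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX; /* clamp to 7FFFH */
    if (sum < INT16_MIN) return INT16_MIN; /* clamp to 8000H */
    return (int16_t)sum;
}
```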

Operation

**VPADDSB**

DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToSignedByte (SRC1[127:120] + SRC2[127:120]);
DEST[255:128] ← 0

**PADDSB**

DEST[7:0] ← SaturateToSignedByte (DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToSignedByte (DEST[127:120] + SRC[127:120]);
DEST[255:128] (Unmodified)

**VPADDSW**

DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToSignedWord (SRC1[127:112] + SRC2[127:112]);
DEST[255:128] ← 0

**PADDSW**

DEST[15:0] ← SaturateToSignedWord (DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToSignedWord (DEST[127:112] + SRC[127:112]);
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PADDSB __m128i _mm_adds_epi8 ( __m128i a, __m128i b)

PADDSW __m128i _mm_adds_epi16 ( __m128i a, __m128i b)

SIMD Floating-Point Exceptions

none

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
PADDUSB/PADDUSW - Add Packed Unsigned Integers with Unsigned Saturation

Description

Performs a SIMD add of the packed unsigned integers from the second source operand and the first source operand and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as described in the following paragraphs.

The first source operand and the destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location.

The PADDUSB instruction adds packed unsigned byte integers. When an individual byte result is beyond the range of an unsigned byte integer (that is, greater than FFH), the saturated value of FFH is written to the destination operand. The PADDUSW instruction adds packed unsigned word integers. When an individual word result is beyond the range of an unsigned word integer (that is, greater than FFFFH), the saturated value of FFFFH is written to the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
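
For unsigned saturation only the upper bound can be crossed, since a sum of two unsigned values is never negative. A minimal scalar sketch of one lane, with hypothetical helper names:

```c
#include <stdint.h>

/* One byte lane of an unsigned saturating add. */
uint8_t adds_u8_model(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;   /* at most 510, no overflow */
    return (uint8_t)(sum > 0xFFu ? 0xFFu : sum);
}

/* One word lane of an unsigned saturating add. */
uint16_t adds_u16_model(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* at most 131070 */
    return (uint16_t)(sum > 0xFFFFu ? 0xFFFFu : sum);
}
```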

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F DC /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed unsigned byte integers from xmm2/m128 and xmm1 and saturate the results.</td>
</tr>
<tr>
<td>PADDUSB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F DD /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Add packed unsigned word integers from xmm2/m128 to xmm1 and saturate the results.</td>
</tr>
<tr>
<td>PADDUSW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F DC /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed unsigned byte integers from xmm3/m128 and xmm2 and saturate the results.</td>
</tr>
<tr>
<td>VPADDUSB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F DD /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Add packed unsigned word integers from xmm3/m128 to xmm2 and saturate the results.</td>
</tr>
<tr>
<td>VPADDUSW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Operation

VPADDUSB

DEST[7:0] ← SaturateToUnsignedByte (SRC1[7:0] + SRC2[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToUnsignedByte (SRC1[127:120] + SRC2[127:120]);
DEST[255:128] ← 0

PADDUSB

DEST[7:0] ← SaturateToUnsignedByte (DEST[7:0] + SRC[7:0]);
(* Repeat add operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToUnsignedByte (DEST[127:120] + SRC[127:120]);
DEST[255:128] (Unmodified)

VPADDUSW

DEST[15:0] ← SaturateToUnsignedWord (SRC1[15:0] + SRC2[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToUnsignedWord (SRC1[127:112] + SRC2[127:112]);
DEST[255:128] ← 0

PADDUSW

DEST[15:0] ← SaturateToUnsignedWord (DEST[15:0] + SRC[15:0]);
(* Repeat add operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToUnsignedWord (DEST[127:112] + SRC[127:112]);
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PADDUSB __m128i _mm_adds_epu8 (__m128i a, __m128i b)
PADDUSW __m128i _mm_adds_epu16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions

none

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
**PALIGNR - Byte Align**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0F /r ib PALIGNR xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Concatenate destination and source operands, extract byte-aligned result shifted to the right by the constant value in imm8, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0F /r ib VPALIGNR xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Concatenate xmm2 and xmm3/m128, extract byte-aligned result shifted to the right by the constant value in imm8, and store the result in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

PALIGNR concatenates the first source operand and the second source operand into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination. The first source and destination operands are XMM registers. The second source operand can be an XMM register or a 128-bit memory location. The immediate value is considered unsigned. Immediate shift counts larger than 32 for 128-bit operands produce a zero result.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
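
The concatenate-shift-extract sequence can be sketched as a scalar byte-array model. The function name `palignr_model` is hypothetical, and array index 0 is assumed to be the least significant byte, so the first source operand occupies the upper half of the 32-byte composite.

```c
#include <stdint.h>
#include <string.h>

/* dst = low 16 bytes of ((src1:src2) >> (imm8 * 8)). */
void palignr_model(uint8_t dst[16], const uint8_t src1[16],
                   const uint8_t src2[16], unsigned imm8)
{
    uint8_t composite[32];
    uint8_t out[16];

    memcpy(composite, src2, 16);         /* low half: second source */
    memcpy(composite + 16, src1, 16);    /* high half: first source */

    for (unsigned i = 0; i < 16; i++) {
        unsigned j = i + imm8;           /* byte-granular right shift */
        out[i] = (j < 32) ? composite[j] : 0;   /* zeros shift in from the top */
    }
    memcpy(dst, out, 16);   /* temporary lets dst alias src1, as in the legacy form */
}
```

With this model, an immediate of 16 returns the first source unchanged, and any immediate of 32 or more returns all zeros.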

**Operation**

**PALIGNR**

temp1[255:0] ← CONCATENATE(DEST, SRC) >> (imm8*8)
DEST[127:0] ← temp1[127:0]
DEST[255:128] (Unmodified)

**VPALIGNR**

temp1[255:0] ← CONCATENATE(SRC1, SRC2) >> (imm8*8)
DEST[127:0] ← temp1[127:0]
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
PALIGNR __m128i _mm_alignr_epi8 (__m128i a, __m128i b, int n)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PAND - Logical AND

**Description**
Performs a bitwise logical AND operation on the second source operand and the first source operand and stores the result in the destination operand. The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands can be XMM registers. Each bit of the result is set to 1 if the corresponding bits of the first and second operands are 1; otherwise, it is set to 0.

**128-bit Legacy SSE version:** Bits (255:128) of the corresponding YMM destination register remain unchanged.

**VEX.128 encoded version:** Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

**VPAND (VEX.128 encoded version)**

DEST ← SRC1 AND SRC2
DEST[255:128] ← 0

**PAND (128-bit Legacy SSE version)**

DEST ← DEST AND SRC
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PAND __m128i _mm_and_si128 ( __m128i a, __m128i b)

**SIMD Floating-Point Exceptions**
none


Other Exceptions
See Exceptions Type 4; additionally
#UD              If VEX.L = 1.
PANDN - Logical AND NOT

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F DF /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Bitwise AND NOT of xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>PANDN xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F DF /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Bitwise AND NOT of xmm3/m128 and xmm2.</td>
</tr>
<tr>
<td>VPANDN xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description
Performs a bitwise logical NOT operation on the first source operand, then performs a bitwise logical AND with the second source operand, and stores the result in the destination operand. The second source operand is an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Each bit of the result is set to 1 if the corresponding bit in the first operand is 0 and the corresponding bit in the second operand is 1; otherwise, it is set to 0.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
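
A common source of confusion is which operand gets inverted. The scalar sketch below models one 64-bit slice of the operation; the name `pandn_model` is hypothetical, and a full 128-bit register would simply apply it to both halves.

```c
#include <stdint.h>

/* One 64-bit slice of AND NOT: only the FIRST operand is inverted. */
uint64_t pandn_model(uint64_t src1, uint64_t src2)
{
    return ~src1 & src2;
}
```

A typical use is clearing selected bits: passing a mask as the first operand removes exactly the masked bits from the second operand.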

Operation
VPANDN (VEX.128 encoded version)
DEST ← NOT(SRC1) AND SRC2
DEST[255:128] ← 0

PANDN (128-bit Legacy SSE version)
DEST ← NOT(DEST) AND SRC
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PANDN __m128i _mm_andnot_si128 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
none

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PAVGB/PAVGW - Average Packed Integers

**Description**

Performs a SIMD average of the packed unsigned integers from the second source operand and the first source operand and stores the results in the destination operand. For each corresponding pair of data elements in the first and second source operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position. The destination and first source operands are XMM registers. The second source operand is an XMM register or a 128-bit memory location.

The PAVGB instruction operates on packed unsigned bytes and the PAVGW instruction operates on packed unsigned words.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
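
The add, add-one, shift-right sequence amounts to an average that rounds halves upward. A scalar sketch of one lane follows, with hypothetical helper names; the sum is widened first so the intermediate carry bit is not lost.

```c
#include <stdint.h>

/* One byte lane of a rounding unsigned average. */
uint8_t avg_u8_model(uint8_t a, uint8_t b)
{
    /* The temporary sum needs 9 bits, so compute it in a wider type. */
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* One word lane of a rounding unsigned average. */
uint16_t avg_u16_model(uint16_t a, uint16_t b)
{
    /* The temporary sum needs 17 bits. */
    return (uint16_t)(((uint32_t)a + (uint32_t)b + 1u) >> 1);
}
```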

**Operation**

**VPAVGB (VEX.128 encoded version)**

DEST[7:0] ← (SRC1[7:0] + SRC2[7:0] + 1) >> 1;

(* Repeat operation performed for bytes 2 through 15 *)

DEST[127:120] ← (SRC1[127:120] + SRC2[127:120] + 1) >> 1

DEST[255:128] ← 0

---

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E0 /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Average packed unsigned byte integers from xmm2/m128 and xmm1 with rounding.</td>
</tr>
<tr>
<td>PAVGB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F E3 /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Average packed unsigned word integers from xmm2/m128 and xmm1 with rounding.</td>
</tr>
<tr>
<td>PAVGW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E0 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Average packed unsigned byte integers from xmm3/m128 and xmm2 with rounding.</td>
</tr>
<tr>
<td>VPAVGB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E3 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Average packed unsigned word integers from xmm3/m128 and xmm2 with rounding.</td>
</tr>
<tr>
<td>VPAVGW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

PAVGB (128-bit Legacy SSE version)
DEST[7:0] ← (SRC[7:0] + DEST[7:0] + 1) >> 1;
(* Repeat operation performed for bytes 2 through 15 *)
DEST[127:120] ← (SRC[127:120] + DEST[127:120] + 1) >> 1
DEST[255:128] (Unmodified)

VPAVGW (VEX.128 encoded version)
DEST[15:0] ← (SRC1[15:0] + SRC2[15:0] + 1) >> 1;
(* Repeat operation performed for 16-bit words 2 through 7 *)
DEST[127:112] ← (SRC1[127:112] + SRC2[127:112] + 1) >> 1
DEST[255:128] ← 0

PAVGW (128-bit Legacy SSE version)
DEST[15:0] ← (SRC[15:0] + DEST[15:0] + 1) >> 1;
(* Repeat operation performed for 16-bit words 2 through 7 *)
DEST[127:112] ← (SRC[127:112] + DEST[127:112] + 1) >> 1
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PAVGB __m128i _mm_avg_epu8 ( __m128i a, __m128i b)
PAVGW __m128i _mm_avg_epu16 ( __m128i a, __m128i b)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
**PBLENDVB - Variable Blend Packed Bytes**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 10 /r PBLENDVB xmm1, xmm2/m128, &lt;XMM0&gt;</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Select byte values from xmm1 and xmm2/m128 using mask bits in the implicit mask register, XMM0, and store the values into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 4C /r /is4 VPBLENDVB xmm1, xmm2, xmm3/m128, xmm4</td>
<td>V/V</td>
<td>AVX</td>
<td>Select byte values from xmm2 and xmm3/m128 using mask bits in the specified mask register, xmm4, and store the values into xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Conditionally copies byte elements from the second source operand and the first source operand depending on mask bits defined in the mask register operand. The mask bits are the most significant bit in each byte element of the mask register.

Each byte element of the destination operand is copied from:

- the corresponding byte element in the second source operand, if the mask bit is "1"; or
- the corresponding byte element in the first source operand, if the mask bit is "0".

The register assignment of the implicit third operand is defined to be the architectural register XMM0.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged. The mask register operand is implicitly defined to be the architectural register XMM0. An attempt to execute PBLENDVB with a VEX prefix will cause #UD.

VEX.128 encoded version: The first source operand and the destination operand are XMM registers. The second source operand is an XMM register or 128-bit memory location. The mask operand is the third source register, and encoded in bits[7:4] of the immediate byte(imm8). The bits[3:0] of imm8 are ignored. In 32-bit mode, imm8[7] is ignored. The upper bits (255:128) of the corresponding YMM register (destination register) are zeroed. VEX.L must be 0, otherwise the instruction will #UD. VEX.W must be 0, otherwise, the instruction will #UD.

VPBLENDVB permits the mask to be any XMM or YMM register. In contrast, PBLENDVB treats XMM0 implicitly as the mask and does not support non-destructive destination operation. An attempt to execute PBLENDVB encoded with a VEX prefix will cause a #UD exception.
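
The selection rule can be sketched as a scalar loop over the 16 byte lanes. The name `pblendvb_model` is hypothetical; only bit 7 of each mask byte is examined, so the low seven bits of a mask byte have no effect.

```c
#include <stdint.h>

/* Per-byte select: the mask byte's most significant bit picks src2 (1) or src1 (0). */
void pblendvb_model(uint8_t dst[16], const uint8_t src1[16],
                    const uint8_t src2[16], const uint8_t mask[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = (mask[i] & 0x80) ? src2[i] : src1[i];
}
```

Passing the same array for `dst` and `src1` mirrors the legacy form, where the first source is also the destination.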

Operation

VPBLENDVB (VEX.128 encoded version)

MASK ← SRC3
IF (MASK[7] == 1) THEN DEST[7:0] ← SRC2[7:0];
ELSE DEST[7:0] ← SRC1[7:0];
IF (MASK[15] == 1) THEN DEST[15:8] ← SRC2[15:8];
ELSE DEST[15:8] ← SRC1[15:8];
IF (MASK[23] == 1) THEN DEST[23:16] ← SRC2[23:16];
ELSE DEST[23:16] ← SRC1[23:16];
IF (MASK[31] == 1) THEN DEST[31:24] ← SRC2[31:24];
ELSE DEST[31:24] ← SRC1[31:24];
IF (MASK[39] == 1) THEN DEST[39:32] ← SRC2[39:32];
ELSE DEST[39:32] ← SRC1[39:32];
IF (MASK[47] == 1) THEN DEST[47:40] ← SRC2[47:40];
ELSE DEST[47:40] ← SRC1[47:40];
IF (MASK[55] == 1) THEN DEST[55:48] ← SRC2[55:48];
ELSE DEST[55:48] ← SRC1[55:48];
IF (MASK[63] == 1) THEN DEST[63:56] ← SRC2[63:56];
ELSE DEST[63:56] ← SRC1[63:56];
IF (MASK[71] == 1) THEN DEST[71:64] ← SRC2[71:64];
ELSE DEST[71:64] ← SRC1[71:64];
IF (MASK[79] == 1) THEN DEST[79:72] ← SRC2[79:72];
ELSE DEST[79:72] ← SRC1[79:72];
IF (MASK[87] == 1) THEN DEST[87:80] ← SRC2[87:80];
ELSE DEST[87:80] ← SRC1[87:80];
IF (MASK[95] == 1) THEN DEST[95:88] ← SRC2[95:88];
ELSE DEST[95:88] ← SRC1[95:88];
IF (MASK[103] == 1) THEN DEST[103:96] ← SRC2[103:96];
ELSE DEST[103:96] ← SRC1[103:96];
IF (MASK[111] == 1) THEN DEST[111:104] ← SRC2[111:104];
ELSE DEST[111:104] ← SRC1[111:104];
IF (MASK[119] == 1) THEN DEST[119:112] ← SRC2[119:112];
ELSE DEST[119:112] ← SRC1[119:112];
IF (MASK[127] == 1) THEN DEST[127:120] ← SRC2[127:120];
ELSE DEST[127:120] ← SRC1[127:120];
DEST[255:128] ← 0

PBLENDVB (128-bit Legacy SSE version)

MASK ← XMM0
IF (MASK[7] == 1) THEN DEST[7:0] ← SRC[7:0];
ELSE DEST[7:0] ← DEST[7:0];
IF (MASK[15] == 1) THEN DEST[15:8] ← SRC[15:8];
ELSE DEST[15:8] ← DEST[15:8];
IF (MASK[23] == 1) THEN DEST[23:16] ← SRC[23:16];
ELSE DEST[23:16] ← DEST[23:16];
IF (MASK[31] == 1) THEN DEST[31:24] ← SRC[31:24];
ELSE DEST[31:24] ← DEST[31:24];
IF (MASK[39] == 1) THEN DEST[39:32] ← SRC[39:32];
ELSE DEST[39:32] ← DEST[39:32];
IF (MASK[47] == 1) THEN DEST[47:40] ← SRC[47:40];
ELSE DEST[47:40] ← DEST[47:40];
IF (MASK[55] == 1) THEN DEST[55:48] ← SRC[55:48];
ELSE DEST[55:48] ← DEST[55:48];
IF (MASK[63] == 1) THEN DEST[63:56] ← SRC[63:56];
ELSE DEST[63:56] ← DEST[63:56];
IF (MASK[71] == 1) THEN DEST[71:64] ← SRC[71:64];
ELSE DEST[71:64] ← DEST[71:64];
IF (MASK[79] == 1) THEN DEST[79:72] ← SRC[79:72];
ELSE DEST[79:72] ← DEST[79:72];
IF (MASK[87] == 1) THEN DEST[87:80] ← SRC[87:80];
ELSE DEST[87:80] ← DEST[87:80];
IF (MASK[95] == 1) THEN DEST[95:88] ← SRC[95:88];
ELSE DEST[95:88] ← DEST[95:88];
IF (MASK[103] == 1) THEN DEST[103:96] ← SRC[103:96];
ELSE DEST[103:96] ← DEST[103:96];
IF (MASK[111] == 1) THEN DEST[111:104] ← SRC[111:104];
ELSE DEST[111:104] ← DEST[111:104];
IF (MASK[119] == 1) THEN DEST[119:112] ← SRC[119:112];
ELSE DEST[119:112] ← DEST[119:112];
IF (MASK[127] == 1) THEN DEST[127:120] ← SRC[127:120];
ELSE DEST[127:120] ← DEST[127:120];
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PBLENDVB __m128i _mm_blendv_epi8 (__m128i v1, __m128i v2, __m128i mask);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
    If VEX.W = 1.

PBLENDW - Blend Packed Words

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0E /r ib PBLENDW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Select words from xmm1 and xmm2/m128 from mask specified in imm8 and store the values into xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0E /r ib VPBLENDW xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Select words from xmm2 and xmm3/m128 from mask specified in imm8 and store the values into xmm1</td>
</tr>
</tbody>
</table>

Description

Words from the source operand (second operand) are conditionally written to the destination operand (first operand) depending on bits in the immediate operand (third operand). The immediate bits (bits 7:0) form a mask that determines whether the corresponding word in the destination is copied from the source. If a bit in the mask, corresponding to a word, is "1", then the word is copied, else the word is unchanged.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
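
The immediate-driven word selection can be sketched in scalar C. The name `pblendw_model` is hypothetical; bit i of the immediate controls word i, with word 0 being the least significant.

```c
#include <stdint.h>

/* Per-word select: bit i of imm8 picks src2 (1) or src1 (0) for word i. */
void pblendw_model(uint16_t dst[8], const uint16_t src1[8],
                   const uint16_t src2[8], uint8_t imm8)
{
    for (int i = 0; i < 8; i++)
        dst[i] = ((imm8 >> i) & 1) ? src2[i] : src1[i];
}
```

In the legacy form `src1` and `dst` are the same register, which is why a 0 bit leaves the destination word unchanged.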

Operation

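VPBLENDW (VEX.128 encoded version)

IF (imm8[0] == 1) THEN DEST[15:0] ← SRC2[15:0]
ELSE DEST[15:0] ← SRC1[15:0]
IF (imm8[1] == 1) THEN DEST[31:16] ← SRC2[31:16]
ELSE DEST[31:16] ← SRC1[31:16]
IF (imm8[2] == 1) THEN DEST[47:32] ← SRC2[47:32]
ELSE DEST[47:32] ← SRC1[47:32]
IF (imm8[3] == 1) THEN DEST[63:48] ← SRC2[63:48]
ELSE DEST[63:48] ← SRC1[63:48]
IF (imm8[4] == 1) THEN DEST[79:64] ← SRC2[79:64]
ELSE DEST[79:64] ← SRC1[79:64]
IF (imm8[5] == 1) THEN DEST[95:80] ← SRC2[95:80]
ELSE DEST[95:80] ← SRC1[95:80]
IF (imm8[6] == 1) THEN DEST[111:96] ← SRC2[111:96]
ELSE DEST[111:96] ← SRC1[111:96]
IF (imm8[7] == 1) THEN DEST[127:112] ← SRC2[127:112]
ELSE DEST[127:112] ← SRC1[127:112]
DEST[255:128] ← 0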

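PBLENDW (128-bit Legacy SSE version)
IF (imm8[0] == 1) THEN DEST[15:0] ← SRC[15:0]
ELSE DEST[15:0] ← DEST[15:0]
IF (imm8[1] == 1) THEN DEST[31:16] ← SRC[31:16]
ELSE DEST[31:16] ← DEST[31:16]
IF (imm8[2] == 1) THEN DEST[47:32] ← SRC[47:32]
ELSE DEST[47:32] ← DEST[47:32]
IF (imm8[3] == 1) THEN DEST[63:48] ← SRC[63:48]
ELSE DEST[63:48] ← DEST[63:48]
IF (imm8[4] == 1) THEN DEST[79:64] ← SRC[79:64]
ELSE DEST[79:64] ← DEST[79:64]
IF (imm8[5] == 1) THEN DEST[95:80] ← SRC[95:80]
ELSE DEST[95:80] ← DEST[95:80]
IF (imm8[6] == 1) THEN DEST[111:96] ← SRC[111:96]
ELSE DEST[111:96] ← DEST[111:96]
IF (imm8[7] == 1) THEN DEST[127:112] ← SRC[127:112]
ELSE DEST[127:112] ← DEST[127:112]
DEST[255:128] (Unmodified)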

Intel C/C++ Compiler Intrinsic Equivalent
PBLENDW __m128i _mm_blend_epi16 (__m128i v1, __m128i v2, const int mask)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PCLMULQDQ - Carry-Less Multiplication Quadword

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 44 /r ib</td>
<td>V/V</td>
<td>CLMUL</td>
<td>Carry-less multiplication of one quadword of xmm1 by one quadword of xmm2/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm1 and xmm2/m128 should be used.</td>
</tr>
<tr>
<td>PCLMULQDQ xmm1, xmm2/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 44 /r ib</td>
<td>V/V</td>
<td>Both CLMUL and AVX flags</td>
<td>Carry-less multiplication of one quadword of xmm2 by one quadword of xmm3/m128, stores the 128-bit result in xmm1. The immediate is used to determine which quadwords of xmm2 and xmm3/m128 should be used.</td>
</tr>
<tr>
<td>VPCLMULQDQ xmm1, xmm2, xmm3/m128, imm8</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Performs a carry-less multiplication of two quadwords, selected from the first source and second source operand according to the value of the immediate byte. Bits 4 and 0 are used to select which 64-bit half of each operand to use according to Table 5-18; other bits of the immediate byte are ignored.

Table 5-18. PCLMULQDQ Quadword Selection of Immediate Byte

<table>
<thead>
<tr>
<th>Imm[4]</th>
<th>Imm[0]</th>
<th>PCLMULQDQ Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>0</td>
<td>CL_MUL( SRC2[63:0], SRC1[63:0] )</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
<td>CL_MUL( SRC2[63:0], SRC1[127:64] )</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
<td>CL_MUL( SRC2[127:64], SRC1[63:0] )</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
<td>CL_MUL( SRC2[127:64], SRC1[127:64] )</td>
</tr>
</tbody>
</table>

NOTES:

1. SRC2 denotes the second source operand, which can be a register or memory; SRC1 denotes the first source and destination operand.

The first source operand and the destination operand are the same and must be an XMM register. The second source operand can be an XMM register or a 128-bit memory location. Bits (255:128) of the corresponding YMM destination register remain unchanged. This describes the legacy SSE encoding; in the VEX.128 encoded version the destination (xmm1) is distinct from the first source (xmm2), and bits (255:128) of the destination YMM register are zeroed.

Compilers and assemblers may implement the following pseudo-op syntax to simplify programming and emit the required encoding for imm8.

Table 5-19. Pseudo-Op and PCLMULQDQ Implementation

<table>
<thead>
<tr>
<th>Pseudo-Op (legacy SSE)</th>
<th>Pseudo-Op (VEX)</th>
<th>Imm8 Encoding</th>
</tr>
</thead>
<tbody>
<tr>
<td>PCLMULLQLQDQ (xmm1, xmm2)</td>
<td>VPCLMULLQLQDQ (xmm1, xmm2, xmm3)</td>
<td>0000_0000B</td>
</tr>
<tr>
<td>PCLMULHQLQDQ (xmm1, xmm2)</td>
<td>VPCLMULHQLQDQ (xmm1, xmm2, xmm3)</td>
<td>0000_0001B</td>
</tr>
<tr>
<td>PCLMULLQHQDQ (xmm1, xmm2)</td>
<td>VPCLMULLQHQDQ (xmm1, xmm2, xmm3)</td>
<td>0001_0000B</td>
</tr>
<tr>
<td>PCLMULHQHQDQ (xmm1, xmm2)</td>
<td>VPCLMULHQHQDQ (xmm1, xmm2, xmm3)</td>
<td>0001_0001B</td>
</tr>
</tbody>
</table>

**Operation**

**VPCLMULQDQ**

```c
IF (Imm8[0] = 0 )
    THEN
        TEMP1 ← SRC1[63:0];
    ELSE
        TEMP1 ← SRC1[127:64];
    FI
IF (Imm8[4] = 0 )
    THEN
        TEMP2 ← SRC2[63:0];
    ELSE
        TEMP2 ← SRC2[127:64];
    FI
For i = 0 to 63 {
    TmpB[i] ← (TEMP1[0] and TEMP2[i]);
    For j = 1 to i {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i - j])
    }
    DEST[i] ← TmpB[i];
}
For i = 64 to 126 {
    TmpB[i] ← 0;
    For j = i - 63 to 63 {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i - j])
    }
    DEST[i] ← TmpB[i];
}
DEST[255:127] ← 0;
```

**PCLMULQDQ**

```c
IF (Imm8[0] = 0 )
    THEN
        TEMP1 ← SRC1[63:0];
    ELSE
        TEMP1 ← SRC1[127:64];
    FI
IF (Imm8[4] = 0 )
    THEN
        TEMP2 ← SRC2[63:0];
    ELSE
        TEMP2 ← SRC2[127:64];
    FI
For i = 0 to 63 {
    TmpB[i] ← (TEMP1[0] and TEMP2[i]);
    For j = 1 to i {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i - j])
    }
    DEST[i] ← TmpB[i];
}
For i = 64 to 126 {
    TmpB[i] ← 0;
    For j = i - 63 to 63 {
        TmpB[i] ← TmpB[i] xor (TEMP1[j] and TEMP2[i - j])
    }
    DEST[i] ← TmpB[i];
}
DEST[127] ← 0;
DEST[255:128] (Unmodified)
```

**Intel C/C++ Compiler Intrinsic Equivalent**

```c
(V)PCLMULQDQ __m128i _mm_clmulepi64_si128 (__m128i, __m128i, const int)
```
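
The quadword selection and the carry-less product can be sketched as a scalar C model that does not use the intrinsic. This is an illustration under stated assumptions: the type `u128_model` and the functions `clmul64_model` and `pclmulqdq_model` are hypothetical names, and the 128-bit operands are represented as two 64-bit halves.

```c
#include <stdint.h>

/* Hypothetical 128-bit value: lo = bits 63:0, hi = bits 127:64. */
typedef struct { uint64_t lo, hi; } u128_model;

/* Carry-less product of two 64-bit values: partial products are
   combined with XOR instead of addition, so no carries propagate.
   The result occupies at most bits 126:0; bit 127 is always 0. */
u128_model clmul64_model(uint64_t a, uint64_t b)
{
    u128_model r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1u) {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

/* imm8[0] selects the half of src1 and imm8[4] the half of src2,
   as in Table 5-18; all other immediate bits are ignored. */
u128_model pclmulqdq_model(u128_model src1, u128_model src2, uint8_t imm8)
{
    uint64_t temp1 = (imm8 & 0x01) ? src1.hi : src1.lo;
    uint64_t temp2 = (imm8 & 0x10) ? src2.hi : src2.lo;
    return clmul64_model(temp1, temp2);
}
```

As a small check of the arithmetic, the carry-less product of 3 and 3 is 5, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.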

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 4

PCMPESTRI - Packed Compare Explicit Length Strings, Return Index

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 61 /r ib PCMPESTRI xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_2</td>
<td>Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 61 /r ib VPCMPESTRI xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Perform a packed comparison of string data with explicit lengths, generating an index, and storing the result in ECX</td>
</tr>
</tbody>
</table>

Description

The instruction compares data from two strings based on the control encoded in the imm8 byte (as described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A) generating an index stored to ECX. Each string is represented by two values. The first value is an XMM (or possibly m128 for the second operand) which contains the elements of the string (character data). The second value is stored in EAX (for xmm1) or EDX (for xmm2/m128) and represents the number of Bytes/Words which are valid for the respective xmm/m128 data. The length of each input is interpreted as being the absolute-value of the value in EAX (EDX). The absolute-value computation saturates to 16 (for bytes) and 8 (for words), based on the value of imm8[bit3] when the value in EAX (EDX) is greater than 16 (8) or less than -16 (-8).

At this point the comparisons and aggregation described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2A are performed and the index of the first (or last, according to imm8[6]) set bit of IntRes2 is returned in ECX. If no bits are set in IntRes2, ECX is set to 16 (8).

Note that the Arithmetic Flags are written in a non-standard manner to supply the most relevant information.

CFlag – Reset if IntRes2 is equal to zero, set otherwise
ZFlag – Set if absolute-value of EDX is < 16 (8), reset otherwise
SFlag – Set if absolute-value of EAX is < 16 (8), reset otherwise
OFlag – IntRes2[0]
AFlag – Reset
PFlag – Reset
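
The explicit-length handling and the ZFlag/SFlag conditions described above can be sketched in C. This is an illustrative model only; the function names are hypothetical, and the element limit (16 for bytes, 8 for words) is passed in as a parameter rather than decoded from imm8.

```c
#include <stdint.h>

/* Hypothetical model: effective element count of one explicit-length
   operand. reg is the signed value held in EAX or EDX; max_elems is
   16 for byte elements or 8 for word elements. */
int explicit_length_model(int32_t reg, int max_elems)
{
    int64_t len = reg;        /* widen so negating INT32_MIN is safe */
    if (len < 0)
        len = -len;           /* absolute value */
    if (len > max_elems)
        len = max_elems;      /* saturate to 16 (8) */
    return (int)len;
}

/* ZFlag (computed from EDX) and SFlag (computed from EAX) are set when
   the absolute value of the length is below the element limit. */
int short_operand_flag_model(int32_t reg, int max_elems)
{
    return explicit_length_model(reg, max_elems) < max_elems;
}
```

For example, a length register holding -100 with byte elements yields an effective length of 16, and the corresponding flag is reset.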

Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation
See PCMPESTRI Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2B.

Intel C/C++ Compiler Intrinsic Equivalent
int _mm_cmpestri (__m128i a, int la, __m128i b, int lb, const int mode);

Intel C/C++ Compiler Intrinsic Equivalent for reading EFLAG Results
int _mm_cmpestria (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestric (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrio (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestris (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestriz (__m128i a, int la, __m128i b, int lb, const int mode);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

**PCMPESTRM - Packed Compare Explicit Length Strings, Return Mask**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 60 /r ib PCMPESTRM xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_2</td>
<td>Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 60 /r ib VPCMPESTRM xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Perform a packed comparison of string data with explicit lengths, generating a mask, and storing the result in XMM0</td>
</tr>
</tbody>
</table>

**Description**

The instruction compares data from two strings based on the control encoded in the imm8 byte (as described in Section 3.1.2 of *Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A*) generating a mask stored to XMM0. Each string is represented by two values. The first value is an XMM (or possibly m128 for the second operand) which contains the elements of the string (character data). The second value is stored in EAX (for xmm1) or EDX (for xmm2/m128) and represents the number of Bytes/Words which are valid for the respective xmm/m128 data. The length of each input is interpreted as being the absolute-value of the value in EAX (EDX). The absolute-value computation saturates to 16 (for bytes) and 8 (for words), based on the value of imm8[bit3] when the value in EAX (EDX) is greater than 16 (8) or less than -16 (-8).

At this point the comparisons and aggregation described in Section 3.1.2 of *Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A* are performed. As defined by imm8[6], IntRes2 is then either stored to the least significant bits of XMM0 (zero extended to 128 bits) or expanded into a byte/word-mask and then stored to XMM0.
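
The two result formats can be sketched in C as follows. This is a model for illustration, not the instruction definition: the function name `xmm0_mask_model` is hypothetical, XMM0 is represented as 16 bytes with byte 0 holding bits 7:0, and the `expand` and `word_elems` parameters stand in for the imm8 fields that select the output format and the element size.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the value written to XMM0.
   expand = 0: IntRes2 is stored in the least significant bits and
               zero-extended to 128 bits (bit mask).
   expand = 1: each bit of IntRes2 is widened to an all-ones or
               all-zeros byte or word (byte/word mask). */
void xmm0_mask_model(uint8_t out[16], uint16_t intres2,
                     int expand, int word_elems)
{
    memset(out, 0, 16);
    if (!expand) {
        out[0] = (uint8_t)(intres2 & 0xFF);
        out[1] = (uint8_t)(intres2 >> 8);
        return;
    }
    int n = word_elems ? 8 : 16;    /* number of elements */
    int size = word_elems ? 2 : 1;  /* bytes per element */
    for (int i = 0; i < n; i++) {
        if ((intres2 >> i) & 1u)
            memset(out + i * size, 0xFF, (size_t)size);
    }
}
```

With IntRes2 = 0005H, the bit-mask form stores 05H in byte 0, while the expanded byte form sets bytes 0 and 2 to FFH.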

Note that the Arithmetic Flags are written in a non-standard manner to supply the most relevant information:

- **CFlag** – Reset if IntRes2 is equal to zero, set otherwise
- **ZFlag** – Set if absolute-value of EDX is < 16 (8), reset otherwise
- **SFlag** – Set if absolute-value of EAX is < 16 (8), reset otherwise
- **OFlag** – IntRes2[0]
- **AFlag** – Reset
- **PFlag** – Reset

Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation

See *PCMPESTRM Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*.

Intel C/C++ Compiler Intrinsic Equivalent

```c
__m128i _mm_cmpestrm (__m128i a, int la, __m128i b, int lb, const int mode);
```

Intel C/C++ Compiler Intrinsic Equivalent for reading EFLAG Results

```c
int _mm_cmpestrma (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrmc (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrmo (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrms (__m128i a, int la, __m128i b, int lb, const int mode);
int _mm_cmpestrmz (__m128i a, int la, __m128i b, int lb, const int mode);
```

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD  If VEX.L = 1.
    If VEX.vvvv != 1111B.

PCMPISTRI - Packed Compare Implicit Length Strings, Return Index

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 63 /r ib PCMPISTRI xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_2</td>
<td>Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 63 /r ib VPCMPISTRI xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Perform a packed comparison of string data with implicit lengths, generating an index, and storing the result in ECX.</td>
</tr>
</tbody>
</table>

Description

The instruction compares data from two strings based on the control encoded in the imm8 byte (as described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) generating an index stored to ECX. Each string is represented by a single value. The value is an XMM (or possibly m128 for the second operand) which contains the elements of the string (character data). Each input byte/word is augmented with a valid/invalid tag. A byte/word is considered valid only if it has a lower index than the least significant null byte/word. (The least significant null byte/word is also considered invalid.) At this point the comparisons and aggregation described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A are performed and the index of the first (or last, according to imm8[6]) set bit of IntRes2 is returned in ECX. If no bits are set in IntRes2, ECX is set to 16 (8).

Note that the Arithmetic Flags are written in a non-standard manner to supply the most relevant information.

CFlag – Reset if IntRes2 is equal to zero, set otherwise
ZFlag – Set if any byte/word of xmm2/mem128 is null, reset otherwise
SFlag – Set if any byte/word of xmm1 is null, reset otherwise
OFlag – IntRes2[0]
AFlag – Reset
PFlag – Reset
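
The implicit-length rule and the index selection described above can be sketched in C for byte elements. The function names are hypothetical, and the `msb_index` parameter stands in for the imm8[6] choice between the first and the last set bit of IntRes2; this is a model for illustration rather than a definition of the instruction.

```c
#include <stdint.h>

/* Hypothetical model: number of valid byte elements in an implicit-length
   string. Elements are valid only below the least significant null byte;
   the null byte itself and everything above it are invalid. */
int implicit_length_bytes_model(const uint8_t s[16])
{
    int i = 0;
    while (i < 16 && s[i] != 0)
        i++;
    return i;
}

/* Hypothetical model of the index returned in ECX. n is the element
   count (16 for bytes, 8 for words). Returns the lowest set bit of
   IntRes2 when msb_index is 0, the highest when it is 1, or n when
   no bit is set. */
int istri_index_model(uint16_t intres2, int msb_index, int n)
{
    if (msb_index) {
        for (int i = n - 1; i >= 0; i--)
            if ((intres2 >> i) & 1u)
                return i;
    } else {
        for (int i = 0; i < n; i++)
            if ((intres2 >> i) & 1u)
                return i;
    }
    return n;
}
```

For IntRes2 = 0024H (bits 2 and 5 set), the model returns 2 for the first set bit and 5 for the last.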

Note: In VEX.128 encoded version, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation
See PCMPISTRI Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2B.

Intel C/C++ Compiler Intrinsic Equivalent

```c
int _mm_cmpistri (__m128i a, __m128i b, const int mode);
```

Intel C/C++ Compiler Intrinsic Equivalent for reading EFLAG Results

```c
int _mm_cmpistria (__m128i a, __m128i b, const int mode);
int _mm_cmpistric (__m128i a, __m128i b, const int mode);
int _mm_cmpistrio (__m128i a, __m128i b, const int mode);
int _mm_cmpistris (__m128i a, __m128i b, const int mode);
int _mm_cmpistriz (__m128i a, __m128i b, const int mode);
```

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

PCMPISTRM - Packed Compare Implicit Length Strings, Return Mask

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 62 /r ib PCMPISTRM xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_2</td>
<td>Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 62 /r ib VPCMPISTRM xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Perform a packed comparison of string data with implicit lengths, generating a mask, and storing the result in XMM0</td>
</tr>
</tbody>
</table>

Description

The instruction compares data from two strings based on the control encoded in the imm8 byte (as described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A) generating a mask stored to XMM0. Each string is represented by a single value. The value is an XMM (or possibly m128 for the second operand) which contains the elements of the string (character data). Each input byte/word is augmented with a valid/invalid tag. A byte/word is considered valid only if it has a lower index than the least significant null byte/word. (The least significant null byte/word is also considered invalid.)

At this point the comparisons and aggregation described in Section 3.1.2 of Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2A are performed. As defined by imm8[6], IntRes2 is then either stored to the least significant bits of XMM0 (zero extended to 128 bits) or expanded into a byte/word-mask and then stored to XMM0.

Note that the Arithmetic Flags are written in a non-standard manner to supply the most relevant information.

CFlag – Reset if IntRes2 is equal to zero, set otherwise
ZFlag – Set if any byte/word of xmm2/mem128 is null, reset otherwise
SFlag – Set if any byte/word of xmm1 is null, reset otherwise
OFlag – IntRes2[0]
AFlag – Reset
PFlag – Reset

Note: In VEX.128 encoded version, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation

See *PCMPISTRM Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*.

Intel C/C++ Compiler Intrinsic Equivalent

```c
__m128i _mm_cmpistrm (__m128i a, __m128i b, const int mode)
```

Intel C/C++ Compiler Intrinsic Equivalent for reading EFLAG Results

```c
int _mm_cmpistrma (__m128i a, __m128i b, const int mode);
int _mm_cmpistrmc (__m128i a, __m128i b, const int mode);
int _mm_cmpistrmo (__m128i a, __m128i b, const int mode);
int _mm_cmpistrms (__m128i a, __m128i b, const int mode);
int _mm_cmpistrmz (__m128i a, __m128i b, const int mode);
```

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

**PCMPEQB/PCMPEQW/PCMPEQD/PCMPEQQ- Compare Packed Integers for Equality**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 74 /r PCMPEQB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed bytes in xmm2/m128 and xmm1 for equality.</td>
</tr>
<tr>
<td>66 0F 75 /r PCMPEQW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed words in xmm2/m128 and xmm1 for equality.</td>
</tr>
<tr>
<td>66 0F 76 /r PCMPEQD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed doublewords in xmm2/m128 and xmm1 for equality.</td>
</tr>
<tr>
<td>66 0F 38 29 /r PCMPEQQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed quadwords in xmm2/m128 and xmm1 for equality.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 74 /r VPCMPEQB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed bytes in xmm3/m128 and xmm2 for equality.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 75 /r VPCMPEQW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed words in xmm3/m128 and xmm2 for equality.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 76 /r VPCMPEQD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed doublewords in xmm3/m128 and xmm2 for equality.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 29 /r VPCMPEQQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed quadwords in xmm3/m128 and xmm2 for equality.</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD compare for equality of the packed bytes, words, doublewords, or quadwords in the first source operand and the second source operand. If a pair of data elements is equal, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s. The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers.

The PCMPEQB instruction compares the corresponding bytes in the first and second source operands; the PCMPEQW instruction compares the corresponding words; the PCMPEQD instruction compares the corresponding doublewords; and the PCMPEQQ instruction compares the corresponding quadwords. In the legacy SSE forms, the first source operand is also the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

**COMPARE_BYTES_EQUAL (SRC1, SRC2)**

IF SRC1[7:0] = SRC2[7:0]
THEN DEST[7:0] ← FFH;
ELSE DEST[7:0] ← 0; FI;

(* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)

IF SRC1[127:120] = SRC2[127:120]
THEN DEST[127:120] ← FFH;
ELSE DEST[127:120] ← 0; FI;

**COMPARE_WORDS_EQUAL (SRC1, SRC2)**

IF SRC1[15:0] = SRC2[15:0]
THEN DEST[15:0] ← FFFFH;
ELSE DEST[15:0] ← 0; FI;

(* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)

IF SRC1[127:112] = SRC2[127:112]
THEN DEST[127:112] ← FFFFH;
ELSE DEST[127:112] ← 0; FI;

**COMPARE_DWORDS_EQUAL (SRC1, SRC2)**

IF SRC1[31:0] = SRC2[31:0]
THEN DEST[31:0] ← FFFFFFFFH;
ELSE DEST[31:0] ← 0; FI;

(* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)

IF SRC1[127:96] = SRC2[127:96]
THEN DEST[127:96] ← FFFFFFFFH;
ELSE DEST[127:96] ← 0; FI;

**COMPARE_QWORDS_EQUAL (SRC1, SRC2)**
IF SRC1[63:0] = SRC2[63:0]
THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
ELSE DEST[63:0] ← 0; FI;
IF SRC1[127:64] = SRC2[127:64]
THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
ELSE DEST[127:64] ← 0; FI;

VPCMPEQB (VEX.128 encoded version)
DEST[127:0] ← COMPARE_BYTES_EQUAL(SRC1,SRC2)
DEST[255:128] ← 0

PCMPEQB (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_BYTES_EQUAL(Dest,Src)
DEST[255:128] (Unmodified)

VPCMPEQW (VEX.128 encoded version)
DEST[127:0] ← COMPARE_WORDS_EQUAL(SRC1,SRC2)
DEST[255:128] ← 0

PCMPEQW (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_WORDS_EQUAL(Dest,Src)
DEST[255:128] (Unmodified)

VPCMPEQD (VEX.128 encoded version)
DEST[127:0] ← COMPARE_DWORDS_EQUAL(SRC1,SRC2)
DEST[255:128] ← 0

PCMPEQD (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_DWORDS_EQUAL(Dest,Src)
DEST[255:128] (Unmodified)

VPCMPEQQ (VEX.128 encoded version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(SRC1,SRC2)
DEST[255:128] ← 0

PCMPEQQ (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_QWORDS_EQUAL(Dest,Src)
DEST[255:128] (Unmodified)
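
The byte form of the comparison can be sketched as a scalar C model; the word, doubleword, and quadword forms differ only in element width. The function name `compare_bytes_equal_model` and the array representation of the operands are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hypothetical model of COMPARE_BYTES_EQUAL: each destination byte is
   set to FFH when the corresponding source bytes are equal and to 00H
   otherwise. Index 0 corresponds to bits 7:0. */
void compare_bytes_equal_model(uint8_t dest[16],
                               const uint8_t src1[16],
                               const uint8_t src2[16])
{
    for (int i = 0; i < 16; i++)
        dest[i] = (src1[i] == src2[i]) ? 0xFF : 0x00;
}
```

Because each result element is all 1s or all 0s, the output can be used directly as a mask for subsequent logical operations.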

Intel C/C++ Compiler Intrinsic Equivalent
PCMPEQB __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b)
PCMPEQW __m128i _mm_cmpeq_epi16 (__m128i a, __m128i b)
PCMPEQD __m128i _mm_cmpeq_epi32 (__m128i a, __m128i b)
PCMPEQQ __m128i _mm_cmpeq_epi64 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

**PCMPGTB/PCMPGTW/PCMPGTD/PCMPGTQ - Compare Packed Integers for Greater Than**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 64 /r PCMPGTB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed signed byte integers in xmm1 and xmm2/m128 for greater than.</td>
</tr>
<tr>
<td>66 0F 65 /r PCMPGTW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed signed word integers in xmm1 and xmm2/m128 for greater than.</td>
</tr>
<tr>
<td>66 0F 66 /r PCMPGTD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed signed doubleword integers in xmm1 and xmm2/m128 for greater than.</td>
</tr>
<tr>
<td>66 0F 38 37 /r PCMPGTQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_2</td>
<td>Compare packed signed qwords in xmm1 and xmm2/m128 for greater than.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 64 /r VPCMPGTB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed byte integers in xmm2 and xmm3/m128 for greater than.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 65 /r VPCMPGTW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed word integers in xmm2 and xmm3/m128 for greater than.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 66 /r VPCMPGTD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed doubleword integers in xmm2 and xmm3/m128 for greater than.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 37 /r VPCMPGTQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed qwords in xmm2 and xmm3/m128 for greater than.</td>
</tr>
</tbody>
</table>

**Description**
Performs a SIMD signed compare for the greater value of the packed byte, word, doubleword, or quadword integers in the first source operand and the second source operand. If a data element in the first source operand is greater than the corresponding data element in the second source operand, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s. The second source operand can be an XMM register or a 128-bit memory location. The first source operand and destination operand are XMM registers.

The PCMPGTB instruction compares the corresponding signed byte integers in the first and second source operands; the PCMPGTW instruction compares the corresponding signed word integers in the first and second source operands; the PCMPGTD instruction compares the corresponding signed doubleword integers in the first and second source operands, and the PCMPGTQ instruction compares the corresponding signed qword integers in the first and second source operands.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Operation

COMPARE_BYTES_GREATER (SRC1, SRC2)
  IF SRC1[7:0] > SRC2[7:0]
  THEN DEST[7:0] ← FFH;
  ELSE DEST[7:0] ← 0; FI;
  (* Continue comparison of 2nd through 15th bytes in SRC1 and SRC2 *)
  IF SRC1[127:120] > SRC2[127:120]
  THEN DEST[127:120] ← FFH;
  ELSE DEST[127:120] ← 0; FI;

COMPARE_WORDS_GREATER (SRC1, SRC2)
  IF SRC1[15:0] > SRC2[15:0]
  THEN DEST[15:0] ← FFFFH;
  ELSE DEST[15:0] ← 0; FI;
  (* Continue comparison of 2nd through 7th 16-bit words in SRC1 and SRC2 *)
  IF SRC1[127:112] > SRC2[127:112]
  THEN DEST[127:112] ← FFFFH;
  ELSE DEST[127:112] ← 0; FI;

COMPARE_DWORDS_GREATER (SRC1, SRC2)
  IF SRC1[31:0] > SRC2[31:0]
  THEN DEST[31:0] ← FFFFFFFFH;
  ELSE DEST[31:0] ← 0; FI;
  (* Continue comparison of 2nd through 3rd 32-bit dwords in SRC1 and SRC2 *)
  IF SRC1[127:96] > SRC2[127:96]
  THEN DEST[127:96] ← FFFFFFFFH;
  ELSE DEST[127:96] ← 0; FI;

COMPARE_QWORDS_GREATER (SRC1, SRC2)
  IF SRC1[63:0] > SRC2[63:0]
  THEN DEST[63:0] ← FFFFFFFFFFFFFFFFH;
  ELSE DEST[63:0] ← 0; FI;
  IF SRC1[127:64] > SRC2[127:64]
  THEN DEST[127:64] ← FFFFFFFFFFFFFFFFH;
  ELSE DEST[127:64] ← 0; FI;

VPCMPGTB (VEX.128 encoded version)
DEST[127:0] ← COMPARE_BYTES_GREATER(SRC1,SRC2)
DEST[255:128] ← 0

PCMPGTB (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_BYTES_GREATER(DEST,SRC)
DEST[255:128] (Unmodified)

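VPCMPGTW (VEX.128 encoded version)
DEST[127:0] ← COMPARE_WORDS_GREATER(SRC1,SRC2)
DEST[255:128] ← 0

PCMPGTW (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_WORDS_GREATER(DEST,SRC)
DEST[255:128] (Unmodified)

VPCMPGTD (VEX.128 encoded version)
DEST[127:0] ← COMPARE_DWORDS_GREATER(SRC1,SRC2)
DEST[255:128] ← 0

PCMPGTD (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_DWORDS_GREATER(DEST,SRC)
DEST[255:128] (Unmodified)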

VPCMPGTQ (VEX.128 encoded version)
DEST[127:0] ← COMPARE_QWORDS_GREATER(SRC1,SRC2)
DEST[255:128] ← 0

PCMPGTQ (128-bit Legacy SSE version)
DEST[127:0] ← COMPARE_QWORDS_GREATER(DEST,SRC)
DEST[255:128] (Unmodified)
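
The signed nature of the comparison is easy to overlook, so a scalar C model of the word form may help. The function names are hypothetical and the operands are represented as arrays of raw 16-bit patterns; this is an illustration, not the instruction definition.

```c
#include <stdint.h>

/* Interpret a raw 16-bit pattern as a two's complement signed value. */
int32_t as_signed16_model(uint16_t v)
{
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Hypothetical model of COMPARE_WORDS_GREATER: each destination word is
   FFFFH when the signed value in src1 is greater than the signed value
   in src2, and 0000H otherwise (including when the two are equal). */
void compare_words_greater_model(uint16_t dest[8],
                                 const uint16_t src1[8],
                                 const uint16_t src2[8])
{
    for (int i = 0; i < 8; i++) {
        int32_t a = as_signed16_model(src1[i]);
        int32_t b = as_signed16_model(src2[i]);
        dest[i] = (a > b) ? 0xFFFF : 0x0000;
    }
}
```

For instance, the pattern FFFFH is treated as -1, so it does not compare greater than 0000H even though it would in an unsigned comparison.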

Intel C/C++ Compiler Intrinsic Equivalent
PCMPGTB __m128i _mm_cmpgt_epi8 (__m128i a, __m128i b)
PCMPGTW __m128i _mm_cmpgt_epi16 (__m128i a, __m128i b)
PCMPGTD __m128i _mm_cmpgt_epi32 (__m128i a, __m128i b)
PCMPGTQ __m128i _mm_cmpgt_epi64 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

#### VPERMILPD- Permute Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.128.66.0F38 0D /r VPERMILPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute double-precision floating-point values in xmm2 using controls from xmm3/mem and store result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 0D /r VPERMILPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute double-precision floating-point values in ymm2 using controls from ymm3/mem and store result in ymm1</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 05 /r ib VPERMILPD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute double-precision floating-point values in xmm2/mem using controls from imm8</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 05 /r ib VPERMILPD ymm1, ymm2/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute double-precision floating-point values in ymm2/mem using controls from imm8</td>
</tr>
</tbody>
</table>

### Description

*(variable control version)*

Permute double-precision floating-point values in the first source operand (second operand) using 8-bit control fields in the low bytes of the second source operand (third operand) and store results in the destination operand (first operand). For the 256-bit version, the first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register; the 128-bit version uses XMM registers and a 128-bit memory location.

There is one control byte per destination double-precision element. Each control byte is aligned with the low 8 bits of the corresponding double-precision destination element. Each control byte contains a 1-bit select field (see Figure 5-19) that determines which of the source elements are selected. Source elements are restricted to lie in the same source 128-bit region as the destination.

*(immediate control version)*

Permute double-precision floating-point values in the first source operand (second operand) using 1-bit control fields in the low bits of the 8-bit immediate and store results in the destination operand (first operand). The 128-bit version uses two control fields (imm8[1:0]); the 256-bit version uses four (imm8[3:0]), one per destination element. For the 256-bit version, the source operand is a YMM register or 256-bit memory location and the destination operand is a YMM register.

Note: For the VEX.128.66.0F3A 05 instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.

Note: For the VEX.256.66.0F3A 05 instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.

Operation

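**VPERMILPD (256-bit immediate version)**

IF (imm8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] ← SRC1[127:64]
IF (imm8[2] = 0) THEN DEST[191:128] ← SRC1[191:128]
IF (imm8[2] = 1) THEN DEST[191:128] ← SRC1[255:192]
IF (imm8[3] = 0) THEN DEST[255:192] ← SRC1[191:128]
IF (imm8[3] = 1) THEN DEST[255:192] ← SRC1[255:192]

**VPERMILPD (128-bit immediate version)**

IF (imm8[0] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (imm8[0] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (imm8[1] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (imm8[1] = 1) THEN DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**VPERMILPD (256-bit variable version)**

IF (SRC2[1] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] ← SRC1[127:64]
IF (SRC2[129] = 0) THEN DEST[191:128] ← SRC1[191:128]
IF (SRC2[129] = 1) THEN DEST[191:128] ← SRC1[255:192]
IF (SRC2[193] = 0) THEN DEST[255:192] ← SRC1[191:128]
IF (SRC2[193] = 1) THEN DEST[255:192] ← SRC1[255:192]

**VPERMILPD (128-bit variable version)**

IF (SRC2[1] = 0) THEN DEST[63:0] ← SRC1[63:0]
IF (SRC2[1] = 1) THEN DEST[63:0] ← SRC1[127:64]
IF (SRC2[65] = 0) THEN DEST[127:64] ← SRC1[63:0]
IF (SRC2[65] = 1) THEN DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent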

VPERMILPD __m128d _mm_permute_pd (__m128d a, int control)
VPERMILPD __m256d _mm256_permute_pd (__m256d a, int control)
VPERMILPD __m128d _mm_permutevar_pd (__m128d a, __m128i control);
VPERMILPD __m256d _mm256_permutevar_pd (__m256d a, __m256i control);
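
The in-lane selection performed by the 256-bit forms can be sketched as a scalar C model. The function names are hypothetical, the operands are represented as arrays of four doubles (elements 0-1 form the low 128-bit lane, elements 2-3 the high lane), and `dst` is assumed not to alias `src`.

```c
#include <stdint.h>

/* Hypothetical model of the 256-bit immediate form: bit i of imm8
   selects, for destination element i, the low (0) or high (1) element
   of the 128-bit lane that element i belongs to. */
void permilpd256_imm_model(double dst[4], const double src[4], unsigned imm8)
{
    for (int i = 0; i < 4; i++) {
        int lane_base = i & ~1;          /* 0 for the low lane, 2 for the high lane */
        int sel = (int)((imm8 >> i) & 1u);
        dst[i] = src[lane_base + sel];
    }
}

/* Hypothetical model of the 256-bit variable form: the select field is
   bit 1 of each 64-bit control element (SRC2[1], SRC2[65], ...). */
void permilpd256_var_model(double dst[4], const double src[4],
                           const uint64_t ctrl[4])
{
    for (int i = 0; i < 4; i++) {
        int lane_base = i & ~1;
        int sel = (int)((ctrl[i] >> 1) & 1u);
        dst[i] = src[lane_base + sel];
    }
}
```

With imm8 = 5H, the model swaps the two elements within each lane; an element is never taken from the other 128-bit lane.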

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6

## VPERMILPS - Permute Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.128.66.0F38 0C /r VPERMILPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute single-precision floating-point values in xmm2 using controls from xmm3/mem and store result in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 04 /r ib VPERMILPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute single-precision floating-point values in xmm2/mem using controls from imm8 and store result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 0C /r VPERMILPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute single-precision floating-point values in ymm2 using controls from ymm3/mem and store result in ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 04 /r ib VPERMILPS ymm1, ymm2/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute single-precision floating-point values in ymm2/mem using controls from imm8 and store result in ymm1</td>
</tr>
</tbody>
</table>

### Description

*(variable control version)*

Permute single-precision floating-point values in the first source operand (second operand) using 8-bit control fields in the low bytes of the corresponding elements of the shuffle control (third operand) and store results in the destination operand (first operand). For the 256-bit version, the first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register; the 128-bit version uses XMM registers and a 128-bit memory location.

There is one control byte per destination single-precision element. Each control byte is aligned with the low 8 bits of the corresponding single-precision destination element. Each control byte contains a 2-bit select field (see Figure 5-21) that determines which of the source elements are selected. Source elements are restricted to lie in the same source 128-bit region as the destination.

*(immediate control version)*

Permute single-precision floating-point values in the first source operand (second operand) using four 2-bit control fields in the 8-bit immediate and store results in the destination operand (first operand). For the 256-bit version, the source operand is a YMM register or 256-bit memory location and the destination operand is a YMM register; the same four control fields are applied to both 128-bit regions. This is similar to a wider version of PSHUFD, just operating on single-precision floating-point values.

Figure 5-20. VPERMILPS Operation

Figure 5-21. VPERMILPS Shuffle Control

Note: For the VEX.128.66.0F3A 04 instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.

Note: For the VEX.256.66.0F3A 04 instruction version, VEX.vvvv is reserved and must be 1111b otherwise instruction will #UD.

Operation

Select4(SRC, control) {
CASE (control[1:0]) OF
  0: TMP ← SRC[31:0];
  1: TMP ← SRC[63:32];
  2: TMP ← SRC[95:64];
  3: TMP ← SRC[127:96];
ESAC;
RETURN TMP
}

VPERMILPS (256-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
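DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC1[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC1[255:128], imm8[7:6]);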

VPERMILPS (128-bit immediate version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC1[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC1[127:0], imm8[7:6]);
DEST[255:128] ← 0

VPERMILPS (256-bit variable version)
DEST[31:0] ← Select4(SRC1[127:0], SRC2[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], SRC2[33:32]);
DEST[95:64] ← Select4(SRC1[127:0], SRC2[65:64]);
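DEST[127:96] ← Select4(SRC1[127:0], SRC2[97:96]);
DEST[159:128] ← Select4(SRC1[255:128], SRC2[129:128]);
DEST[191:160] ← Select4(SRC1[255:128], SRC2[161:160]);
DEST[223:192] ← Select4(SRC1[255:128], SRC2[193:192]);
DEST[255:224] ← Select4(SRC1[255:128], SRC2[225:224]);

VPERMILPS (128-bit variable version)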
DEST[31:0] ← Select4(SRC1[127:0], SRC2[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], SRC2[33:32]);
DEST[95:64] ← Select4(SRC1[127:0], SRC2[65:64]);
DEST[127:96] ← Select4(SRC1[127:0], SRC2[97:96]);
DEST[255:128] ← 0
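
The 256-bit immediate form can be sketched in scalar C as follows. This is an illustrative model only, not how the instruction is implemented: the names `select4` and `vpermilps256_imm_model` are hypothetical, and each single-precision element is carried as a raw 32-bit pattern, since the permute never interprets the value.

```c
#include <stdint.h>

/* Model of Select4: pick one of the four 32-bit elements of a 128-bit lane. */
static uint32_t select4(const uint32_t lane[4], unsigned control)
{
    return lane[control & 3u];
}

/* Hypothetical model of VPERMILPS ymm1, ymm2/m256, imm8.
   src and dst hold eight 32-bit elements; element 0 is bits 31:0. */
static void vpermilps256_imm_model(uint32_t dst[8], const uint32_t src[8],
                                   uint8_t imm8)
{
    uint32_t tmp[8];
    for (int i = 0; i < 8; i++) {
        /* Source elements never cross the 128-bit lane of the destination. */
        const uint32_t *lane = src + (i / 4) * 4;
        /* Both lanes reuse the same four 2-bit fields of imm8. */
        unsigned control = (unsigned)(imm8 >> (2 * (i % 4))) & 3u;
        tmp[i] = select4(lane, control);
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];   /* written last so dst may alias src */
}
```

With `imm8 = 0x1B` (fields 3, 2, 1, 0 from low to high) the model reverses the four elements inside each 128-bit lane independently.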

Intel C/C++ Compiler Intrinsic Equivalent
VPERMILPS __m128 _mm_permute_ps (__m128 a, int control);
VPERMILPS __m256 _mm256_permute_ps (__m256 a, int control);
VPERMILPS __m128 _mm_permutevar_ps (__m128 a, __m128i control);
VPERMILPS __m256 _mm256_permutevar_ps (__m256 a, __m256i control);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 6

VPERM2F128- Permute Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F3A 06 /r ib VPERM2F128 ymm1, ymm2, ymm3/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1</td>
</tr>
</tbody>
</table>

Description

Permute 128 bit floating-point-containing fields from the first source operand (second operand) and second source operand (third operand) using bits in the 8-bit immediate and store results in the destination operand (first operand). The first source operand is a YMM register, the second source operand is a YMM register or a 256-bit memory location, and the destination operand is a YMM register.

Figure 5-22. VPERM2F128 Operation

Imm8[1:0] select the source for the first destination 128-bit field, imm8[5:4] select the source for the second destination field. If imm8[3] is set, the low 128-bit field is zeroed. If imm8[7] is set, the high 128-bit field is zeroed.
VEX.L must be 1, otherwise the instruction will #UD.

Operation

**VPERM2F128**

CASE IMM8[1:0] of
0: DEST[127:0] ← SRC1[127:0]
1: DEST[127:0] ← SRC1[255:128]
2: DEST[127:0] ← SRC2[127:0]
3: DEST[127:0] ← SRC2[255:128]
ESAC

CASE IMM8[5:4] of
0: DEST[255:128] ← SRC1[127:0]
1: DEST[255:128] ← SRC1[255:128]
2: DEST[255:128] ← SRC2[127:0]
3: DEST[255:128] ← SRC2[255:128]
ESAC

IF (imm8[3])
DEST[127:0] ← 0
FI

IF (imm8[7])
DEST[255:128] ← 0
FI
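
A scalar C sketch of the field selection and zeroing may help when choosing an immediate. It assumes the usual selector encoding (0 = SRC1 low, 1 = SRC1 high, 2 = SRC2 low, 3 = SRC2 high); the helper names `select_field` and `vperm2f128_model` are hypothetical, and each 256-bit operand is modeled as eight 32-bit words.

```c
#include <stdint.h>
#include <string.h>

/* v[0..3] models bits 127:0 of an operand, v[4..7] models bits 255:128. */
static void select_field(uint32_t out[4], const uint32_t src1[8],
                         const uint32_t src2[8], unsigned sel)
{
    const uint32_t *src = (sel & 2u) ? src2 : src1;   /* bit 1: which operand */
    /* bit 0: which 128-bit half of that operand */
    memcpy(out, src + ((sel & 1u) ? 4 : 0), 4 * sizeof(uint32_t));
}

/* Hypothetical model of VPERM2F128 ymm1, ymm2, ymm3/m256, imm8. */
static void vperm2f128_model(uint32_t dst[8], const uint32_t src1[8],
                             const uint32_t src2[8], uint8_t imm8)
{
    uint32_t tmp[8];
    select_field(tmp,     src1, src2, imm8 & 3u);           /* imm8[1:0] */
    select_field(tmp + 4, src1, src2, (imm8 >> 4) & 3u);    /* imm8[5:4] */
    if (imm8 & 0x08)                                        /* imm8[3] */
        memset(tmp, 0, 4 * sizeof(uint32_t));
    if (imm8 & 0x80)                                        /* imm8[7] */
        memset(tmp + 4, 0, 4 * sizeof(uint32_t));
    memcpy(dst, tmp, sizeof tmp);   /* written last so dst may alias a source */
}
```

In this model, `imm8 = 0x21` places the high field of the first source in the low half of the result and the low field of the second source in the high half. The zeroing bits take priority over the selectors because they are applied afterwards.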

**Intel C/C++ Compiler Intrinsic Equivalent**

VPERM2F128 __m256 _mm256_permute2f128_ps (__m256 a, __m256 b, int control)
VPERM2F128 __m256d _mm256_permute2f128_pd (__m256d a, __m256d b, int control)
VPERM2F128 __m256i _mm256_permute2f128_si256 (__m256i a, __m256i b, int control)

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 6; additionally

#UD If VEX.L = 0.
PEXTRB/PEXTRW/PEXTRD/PEXTRQ- Extract Integer

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 14 /r ib PEXTRB reg/m8, xmm2, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
<tr>
<td>66 0F C5 /r ib PEXTRW reg, xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
<tr>
<td>66 0F 3A 15 /r ib PEXTRW reg/m16, xmm2, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
<tr>
<td>66 0F 3A 16 /r ib PEXTRD r32/m32, xmm2, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.</td>
</tr>
<tr>
<td>66 REX.W 0F 3A 16 /r ib PEXTRQ r64/m64, xmm2, imm8</td>
<td>N.E./V</td>
<td>SSE4_1</td>
<td>Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 14 /r ib VPEXTRB reg/m8, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract a byte integer value from xmm2 at the source byte offset specified by imm8 into reg or m8. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
<tr>
<td>VEX.128.66.0F C5 /r ib VPEXTRW reg, xmm1, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract the word specified by imm8 from xmm1 and move it to reg, bits 15:0. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.128.66.0F3A 15 /r ib VPEXTRW reg/m16, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract a word integer value from xmm2 at the source word offset specified by imm8 into reg or m16. The upper bits of r64/r32 are filled with zeros.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A.W0 16 /r ib VPEXTRD r32/m32, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Extract a dword integer value from xmm2 at the source dword offset specified by imm8 into r32/m32.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A.W1 16 /r ib VPEXTRQ r64/m64, xmm2, imm8</td>
<td>N.E./V</td>
<td>AVX</td>
<td>Extract a qword integer value from xmm2 at the source qword offset specified by imm8 into r64/m64.</td>
</tr>
</tbody>
</table>

**Description**

Extract a byte/word/dword/qword integer value from the source XMM register at a byte/word/dword/qword offset determined from imm8[3:0]. The destination can be a register or byte/word/dword/qword memory location. If the destination is a register, the upper bits of the register are zero extended.

In 64-bit mode, if the destination operand is a register, default operand size is 64 bits. The bits above the least significant dword/word/byte data element are filled with zeros.

Note: In VEX.128 encoded versions, VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

(V)PEXTRD/(V)PEXTRQ

IF (64-Bit Mode and 64-bit dest operand)
THEN
   Src_Offset ← Imm8[0]
   r64/m64 ← (Src >> Src_Offset * 64)
ELSE
   Src_Offset ← Imm8[1:0]
   r32/m32 ← ((Src >> Src_Offset * 32) AND 0FFFFFFFFh);
FI

(V)PEXTRW (dest=m16)
SRC_Offset ← Imm8[2:0]
Mem16 ← (Src >> Src_Offset*16)

(V)PEXTRW ( dest=reg)
IF (64-Bit Mode )
THEN
    SRC_Offset ← Imm8[2:0]
    DEST[15:0] ← ((Src >> Src_Offset*16) AND 0FFFFh)
    DEST[63:16] ← ZERO_FILL;
ELSE
    SRC_Offset ← Imm8[2:0]
    DEST[15:0] ← ((Src >> Src_Offset*16) AND 0FFFFh)
    DEST[31:16] ← ZERO_FILL;
FI

(V)PEXTRB ( dest=m8)
SRC_Offset ← Imm8[3:0]
Mem8 ← (Src >> Src_Offset*8)

(V)PEXTRB ( dest=reg)
IF (64-Bit Mode )
THEN
    SRC_Offset ← Imm8[3:0]
    DEST[7:0] ← ((Src >> Src_Offset*8) AND 0FFh)
    DEST[63:8] ← ZERO_FILL;
ELSE
    SRC_Offset ← Imm8[3:0];
    DEST[7:0] ← ((Src >> Src_Offset*8) AND 0FFh);
    DEST[31:8] ← ZERO_FILL;
FI
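
The register-destination forms reduce to an index mask followed by a zero-extending move. The sketch below models (V)PEXTRW and (V)PEXTRB in scalar C; the function names are hypothetical, and the XMM source is represented as an array whose element 0 holds the least significant bits.

```c
#include <stdint.h>

/* Hypothetical model of (V)PEXTRW with a register destination in 64-bit mode.
   src[0] models bits 15:0 of the XMM source. */
static uint64_t pextrw_reg_model(const uint16_t src[8], uint8_t imm8)
{
    unsigned offset = imm8 & 7u;     /* only imm8[2:0] selects the word */
    return (uint64_t)src[offset];    /* widening zero-fills bits 63:16 */
}

/* Hypothetical model of (V)PEXTRB with a register destination in 64-bit mode.
   src[0] models bits 7:0 of the XMM source. */
static uint64_t pextrb_reg_model(const uint8_t src[16], uint8_t imm8)
{
    unsigned offset = imm8 & 15u;    /* only imm8[3:0] selects the byte */
    return (uint64_t)src[offset];    /* widening zero-fills bits 63:8 */
}
```

Immediate bits above the select field are ignored in this model, so `imm8 = 0x0A` and `imm8 = 0x02` extract the same word.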

Intel C/C++ Compiler Intrinsic Equivalent
PEXTRB int _mm_extract_epi8 (__m128i src, const int ndx);
PEXTRW int _mm_extract_epi16 (__m128i src, int ndx);
PEXTRD int _mm_extract_epi32 (__m128i src, const int ndx);
PEXTRQ __int64 _mm_extract_epi64 (__m128i src, const int ndx);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

PHADDW/PHADDD - Packed Horizontal Add

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 01 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Add 16-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>PHADDW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 02 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Add 32-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>PHADDD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 01 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Add 16-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>VPHADDW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 02 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Add 32-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>VPHADDD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

PHADDW adds two adjacent 16-bit signed integers horizontally from the second source operand and the first source operand and packs the 16-bit signed results to the destination operand. PHADDD adds two adjacent 32-bit signed integers horizontally from the second source operand and the first source operand and packs the 32-bit signed results to the destination operand. The first source and destination operands are XMM registers. The second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Operation

**VPHADDW (VEX.128 encoded version)**
DEST[15:0] ← SRC1[31:16] + SRC1[15:0]
DEST[31:16] ← SRC1[63:48] + SRC1[47:32]
DEST[47:32] ← SRC1[95:80] + SRC1[79:64]
DEST[63:48] ← SRC1[127:112] + SRC1[111:96]
DEST[79:64] ← SRC2[31:16] + SRC2[15:0]
DEST[95:80] ← SRC2[63:48] + SRC2[47:32]
DEST[111:96] ← SRC2[95:80] + SRC2[79:64]
DEST[127:112] ← SRC2[127:112] + SRC2[111:96]
DEST[255:128] ← 0

**VPHADDD (VEX.128 encoded version)**
DEST[31:0] ← SRC1[63:32] + SRC1[31:0]
DEST[63:32] ← SRC1[127:96] + SRC1[95:64]
DEST[95:64] ← SRC2[63:32] + SRC2[31:0]
DEST[127:96] ← SRC2[127:96] + SRC2[95:64]
DEST[255:128] ← 0

**PHADDW (128-bit Legacy SSE version)**
DEST[15:0] ← DEST[31:16] + DEST[15:0]
DEST[31:16] ← DEST[63:48] + DEST[47:32]
DEST[47:32] ← DEST[95:80] + DEST[79:64]
DEST[63:48] ← DEST[127:112] + DEST[111:96]
DEST[79:64] ← SRC[31:16] + SRC[15:0]
DEST[95:80] ← SRC[63:48] + SRC[47:32]
DEST[111:96] ← SRC[95:80] + SRC[79:64]
DEST[127:112] ← SRC[127:112] + SRC[111:96]
DEST[255:128] (Unmodified)

**PHADDD (128-bit Legacy SSE version)**
DEST[31:0] ← DEST[63:32] + DEST[31:0]
DEST[63:32] ← DEST[127:96] + DEST[95:64]
DEST[95:64] ← SRC[63:32] + SRC[31:0]
DEST[127:96] ← SRC[127:96] + SRC[95:64]
DEST[255:128] (Unmodified)
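
The word form can be modeled in a few lines of scalar C. This is a sketch under stated assumptions: `phaddw_model` is a hypothetical name, words are carried as raw 16-bit patterns, and sums wrap on overflow because this instruction does not saturate.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PHADDW. Two's-complement addition yields
   the same bit pattern for signed and unsigned operands, so uint16_t is
   used to keep the wrap-around well defined in C. */
static void phaddw_model(uint16_t dst[8], const uint16_t src1[8],
                         const uint16_t src2[8])
{
    uint16_t tmp[8];
    for (int i = 0; i < 4; i++) {
        /* Low four results come from adjacent pairs of the first source. */
        tmp[i]     = (uint16_t)(src1[2 * i] + src1[2 * i + 1]);
        /* High four results come from adjacent pairs of the second source. */
        tmp[i + 4] = (uint16_t)(src2[2 * i] + src2[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];   /* written last so dst may alias src1 (legacy form) */
}
```

Passing the same array as `dst` and `src1` mirrors the legacy SSE form, where the destination register is also the first source.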

**Intel C/C++ Compiler Intrinsic Equivalent**

PHADDW __m128i _mm_hadd_epi16 (__m128i a, __m128i b)

PHADDD __m128i _mm_hadd_epi32 (__m128i a, __m128i b)

**SIMD Floating-Point Exceptions**
None

**Other Exceptions**
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PHADDSW - Packed Horizontal Add with Saturation

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 03 /r PHADDSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Add 16-bit signed integers horizontally, pack saturated integers to xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 03 /r VPHADDSW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Add 16-bit signed integers horizontally, pack saturated integers to xmm1.</td>
</tr>
</tbody>
</table>

Description

PHADDSW adds two adjacent signed 16-bit integers horizontally from the second source and first source operands and saturates the signed results; packs the signed, saturated 16-bit results to the destination operand. The first source and destination operands are XMM registers. The second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Operation

**VPHADDSW (VEX.128 encoded version)**

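DEST[15:0] = SaturateToSignedWord(SRC1[31:16] + SRC1[15:0])
DEST[31:16] = SaturateToSignedWord(SRC1[63:48] + SRC1[47:32])
DEST[47:32] = SaturateToSignedWord(SRC1[95:80] + SRC1[79:64])
DEST[63:48] = SaturateToSignedWord(SRC1[127:112] + SRC1[111:96])
DEST[79:64] = SaturateToSignedWord(SRC2[31:16] + SRC2[15:0])
DEST[95:80] = SaturateToSignedWord(SRC2[63:48] + SRC2[47:32])
DEST[111:96] = SaturateToSignedWord(SRC2[95:80] + SRC2[79:64])
DEST[127:112] = SaturateToSignedWord(SRC2[127:112] + SRC2[111:96])
DEST[255:128] ← 0

**PHADDSW (128-bit Legacy SSE version)**

DEST[15:0] = SaturateToSignedWord(DEST[31:16] + DEST[15:0])
DEST[31:16] = SaturateToSignedWord(DEST[63:48] + DEST[47:32])
DEST[47:32] = SaturateToSignedWord(DEST[95:80] + DEST[79:64])
DEST[63:48] = SaturateToSignedWord(DEST[127:112] + DEST[111:96])
DEST[79:64] = SaturateToSignedWord(SRC[31:16] + SRC[15:0])
DEST[95:80] = SaturateToSignedWord(SRC[63:48] + SRC[47:32])
DEST[111:96] = SaturateToSignedWord(SRC[95:80] + SRC[79:64])
DEST[127:112] = SaturateToSignedWord(SRC[127:112] + SRC[111:96])
DEST[255:128] (Unmodified)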
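
The saturating behavior is easiest to see in a scalar model. The sketch below is illustrative only; `saturate_to_signed_word` mirrors the pseudocode helper of the same name, and `phaddsw_model` is a hypothetical function name.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate to the signed 16-bit range. */
static int16_t saturate_to_signed_word(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical scalar model of (V)PHADDSW. */
static void phaddsw_model(int16_t dst[8], const int16_t src1[8],
                          const int16_t src2[8])
{
    int16_t tmp[8];
    for (int i = 0; i < 4; i++) {
        /* Sums are formed in 32 bits, so no intermediate overflow occurs. */
        tmp[i]     = saturate_to_signed_word((int32_t)src1[2 * i] + src1[2 * i + 1]);
        tmp[i + 4] = saturate_to_signed_word((int32_t)src2[2 * i] + src2[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];   /* written last so dst may alias src1 (legacy form) */
}
```

In this model a pair such as (32767, 1) yields 32767 and a pair such as (-32768, -1) yields -32768, where the non-saturating PHADDW would wrap around.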

Intel C/C++ Compiler Intrinsic Equivalent

PHADDSW __m128i _mm_hadds_epi16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
PHMINPOSUW - Horizontal Minimum and Position

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 41 /r PHMINPOSUW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 41 /r VPHMINPOSUW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Find the minimum unsigned word in xmm2/m128 and place its value in the low word of xmm1 and its index in the second-lowest word of xmm1</td>
</tr>
</tbody>
</table>

**Description**

Determine the minimum unsigned word value in the source operand and place the unsigned word in the low word (bits 0-15) of the destination operand. The word index of the minimum value is stored in bits 16-18 of the destination operand. The remaining upper bits of the destination are set to zero.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

VPHMINPOSUW (VEX.128 encoded version)

INDEX ← 0
MIN ← SRC[15:0]
IF (SRC[31:16] < MIN) THEN INDEX ← 1; MIN ← SRC[31:16]
IF (SRC[47:32] < MIN) THEN INDEX ← 2; MIN ← SRC[47:32]
* Repeat operation for words 3 through 6
IF (SRC[127:112] < MIN) THEN INDEX ← 7; MIN ← SRC[127:112]
DEST[15:0] ← MIN
DEST[18:16] ← INDEX
DEST[127:19] ← 0000000000000000000000000000H
DEST[255:128] ← 0

PHMINPOSUW (128-bit Legacy SSE version)

INDEX ← 0
MIN ← SRC[15:0]
IF (SRC[31:16] < MIN) THEN INDEX ← 1; MIN ← SRC[31:16]
IF (SRC[47:32] < MIN) THEN INDEX ← 2; MIN ← SRC[47:32]
* Repeat operation for words 3 through 6
IF (SRC[127:112] < MIN) THEN INDEX ← 7; MIN ← SRC[127:112]
DEST[15:0] ← MIN
DEST[18:16] ← INDEX
DEST[127:19] ← 0000000000000000000000000000H
DEST[255:128] (Unmodified)
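
The unrolled comparisons are equivalent to a simple linear scan. The following scalar C sketch, with the hypothetical name `phminposuw_model`, returns the low 32 bits of the destination; all higher destination bits up to bit 127 are zero.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PHMINPOSUW.
   src[0] models bits 15:0 of the source operand.
   Result layout: bits 15:0 = minimum value, bits 18:16 = its word index. */
static uint32_t phminposuw_model(const uint16_t src[8])
{
    unsigned index = 0;
    uint16_t min = src[0];
    for (unsigned i = 1; i < 8; i++) {
        if (src[i] < min) {   /* strict compare: a tie keeps the lower index */
            min = src[i];
            index = i;
        }
    }
    return ((uint32_t)index << 16) | min;
}
```

Because the comparison is strict, a minimum value that appears more than once reports the lowest index at which it occurs.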

Intel C/C++ Compiler Intrinsic Equivalent

PHMINPOSUW __m128i _mm_minpos_epu16( __m128i packed_words)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally

#UD If VEX.L = 1.
If VEX.vvvv != 1111B.

**PHSUBW/PHSUBD - Packed Horizontal Subtract**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 05 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Subtract 16-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>PHSUBW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 06 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Subtract 32-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>PHSUBD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 05 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract 16-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>VPHSUBW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 06 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract 32-bit signed integers horizontally, pack to xmm1.</td>
</tr>
<tr>
<td>VPHSUBD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Description**

PHSUBW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the second source operand and destination operands, and packs the signed 16-bit results to the destination operand. PHSUBD performs horizontal subtraction on each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least significant doubleword of each pair, and packs the signed 32-bit result to the destination operand.

The first source and destination operands are XMM registers. The second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

**VPHSUBW (VEX.128 encoded version)**

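DEST[15:0] ← SRC1[15:0] - SRC1[31:16]
DEST[31:16] ← SRC1[47:32] - SRC1[63:48]
DEST[47:32] ← SRC1[79:64] - SRC1[95:80]
DEST[63:48] ← SRC1[111:96] - SRC1[127:112]
DEST[79:64] ← SRC2[15:0] - SRC2[31:16]
DEST[95:80] ← SRC2[47:32] - SRC2[63:48]
DEST[111:96] ← SRC2[79:64] - SRC2[95:80]
DEST[127:112] ← SRC2[111:96] - SRC2[127:112]
DEST[255:128] ← 0

VPHSUBD (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] - SRC1[63:32]
DEST[63:32] ← SRC1[95:64] - SRC1[127:96]
DEST[95:64] ← SRC2[31:0] - SRC2[63:32]
DEST[127:96] ← SRC2[95:64] - SRC2[127:96]
DEST[255:128] ← 0

PHSUBW (128-bit Legacy SSE version)
DEST[15:0] ← DEST[15:0] - DEST[31:16]
DEST[31:16] ← DEST[47:32] - DEST[63:48]
DEST[47:32] ← DEST[79:64] - DEST[95:80]
DEST[63:48] ← DEST[111:96] - DEST[127:112]
DEST[79:64] ← SRC[15:0] - SRC[31:16]
DEST[95:80] ← SRC[47:32] - SRC[63:48]
DEST[111:96] ← SRC[79:64] - SRC[95:80]
DEST[127:112] ← SRC[111:96] - SRC[127:112]
DEST[255:128] (Unmodified)

PHSUBD (128-bit Legacy SSE version)
DEST[31:0] ← DEST[31:0] - DEST[63:32]
DEST[63:32] ← DEST[95:64] - DEST[127:96]
DEST[95:64] ← SRC[31:0] - SRC[63:32]
DEST[127:96] ← SRC[95:64] - SRC[127:96]
DEST[255:128] (Unmodified)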
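
The operand order within each pair is easy to get backwards: the higher element is subtracted from the lower one. A minimal scalar C sketch of the doubleword form follows, using the hypothetical name `phsubd_model` and raw 32-bit patterns so that wrap-around is well defined.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PHSUBD. Element 0 models bits 31:0.
   Unsigned arithmetic gives the same bit pattern as signed
   two's-complement subtraction and wraps on overflow. */
static void phsubd_model(uint32_t dst[4], const uint32_t src1[4],
                         const uint32_t src2[4])
{
    uint32_t tmp[4];
    tmp[0] = src1[0] - src1[1];   /* even element minus odd element */
    tmp[1] = src1[2] - src1[3];
    tmp[2] = src2[0] - src2[1];
    tmp[3] = src2[2] - src2[3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];          /* written last so dst may alias src1 */
}
```

For example, the pair (5, 9) produces 5 - 9, whose bit pattern is FFFFFFFCH, rather than 9 - 5.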

Intel C/C++ Compiler Intrinsic Equivalent

PHSUBW __m128i _mm_hsub_epi16 (__m128i a, __m128i b)

PHSUBD __m128i _mm_hsub_epi32 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions

None

Other Exceptions
See Exceptions Type 4; additionally

#UD If VEX.L = 1.

**PHSUBSW - Packed Horizontal Subtract with Saturation**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 07 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Subtract 16-bit signed integer horizontally, pack saturated integers to xmm1</td>
</tr>
<tr>
<td>PHSUBSW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 07 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract 16-bit signed integer horizontally, pack saturated integers to xmm1</td>
</tr>
<tr>
<td>VPHSUBSW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Description**

PHSUBSW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the second source and first source operands. The signed, saturated 16-bit results are packed to the destination operand. The destination and first source operand are XMM registers. The second operand can be an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

**VPHSUBSW (VEX.128 encoded version)**

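DEST[15:0] = SaturateToSignedWord(SRC1[15:0] - SRC1[31:16])
DEST[31:16] = SaturateToSignedWord(SRC1[47:32] - SRC1[63:48])
DEST[47:32] = SaturateToSignedWord(SRC1[79:64] - SRC1[95:80])
DEST[63:48] = SaturateToSignedWord(SRC1[111:96] - SRC1[127:112])
DEST[79:64] = SaturateToSignedWord(SRC2[15:0] - SRC2[31:16])
DEST[95:80] = SaturateToSignedWord(SRC2[47:32] - SRC2[63:48])
DEST[111:96] = SaturateToSignedWord(SRC2[79:64] - SRC2[95:80])
DEST[127:112] = SaturateToSignedWord(SRC2[111:96] - SRC2[127:112])
DEST[255:128] ← 0

**PHSUBSW (128-bit Legacy SSE version)**

DEST[15:0] = SaturateToSignedWord(DEST[15:0] - DEST[31:16])
DEST[31:16] = SaturateToSignedWord(DEST[47:32] - DEST[63:48])
DEST[47:32] = SaturateToSignedWord(DEST[79:64] - DEST[95:80])
DEST[63:48] = SaturateToSignedWord(DEST[111:96] - DEST[127:112])
DEST[79:64] = SaturateToSignedWord(SRC[15:0] - SRC[31:16])
DEST[95:80] = SaturateToSignedWord(SRC[47:32] - SRC[63:48])
DEST[111:96] = SaturateToSignedWord(SRC[79:64] - SRC[95:80])
DEST[127:112] = SaturateToSignedWord(SRC[111:96] - SRC[127:112])
DEST[255:128] (Unmodified)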

Intel C/C++ Compiler Intrinsic Equivalent

PHSUBSW __m128i _mm_hsubs_epi16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.

PINSRB/PINSRW/PINSRD/PINSRQ- Insert Integer

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 20 /r ib PINSRB xmm1, r32/m8, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Insert a byte integer value from r32/m8 into xmm1 at the byte offset in imm8</td>
</tr>
<tr>
<td>66 0F C4 /r ib PINSRW xmm1, r32/m16, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Insert a word integer value from r32/m16 into xmm1 at the word offset in imm8</td>
</tr>
<tr>
<td>66 0F 3A 22 /r ib PINSRD xmm1, r32/m32, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Insert a dword integer value from r32/m32 into xmm1 at the dword offset in imm8</td>
</tr>
<tr>
<td>66 REX.W 0F 3A 22 /r ib PINSRQ xmm1, r64/m64, imm8</td>
<td>N.E./V</td>
<td>SSE4_1</td>
<td>Insert a qword integer value from r64/m64 into xmm1 at the qword offset in imm8</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 20 /r ib VPINSRB xmm1, xmm2, r32/m8, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Merge a byte integer value from r32/m8 and rest from xmm2 into xmm1 at the byte offset in imm8</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F C4 /r ib VPINSRW xmm1, xmm2, r32/m16, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Insert a word integer value from r32/m16 and rest from xmm2 into xmm1 at the word offset in imm8</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A.W0 22 /r ib VPINSRD xmm1, xmm2, r32/m32, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Insert a dword integer value from r32/m32 and rest from xmm2 into xmm1 at the dword offset in imm8</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A.W1 22 /r ib VPINSRQ xmm1, xmm2, r64/m64, imm8</td>
<td>N.E./V</td>
<td>AVX</td>
<td>Insert a qword integer value from r64/m64 and rest from xmm2 into xmm1 at the qword offset in imm8</td>
</tr>
</tbody>
</table>

Description

Copies a byte/word/dword/qword from the second source operand and inserts it into the destination operand at the byte/word/dword/qword offset specified by the immediate operand (third operand). The other bytes/words/dwords/qwords in the destination register are copied from the first source operand. The element select is specified by the 4/3/2/1 least-significant bits of the immediate, respectively.

The first source and destination operands are XMM registers. The second source operand is an r32 register or an 8-, 16-, 32-, or 64-bit memory location. For PINSRW, REX.W causes the source to be an r64 instead of an r32. REX.W distinguishes between PINSRD and PINSRQ (PINSRQ is not encodable in 32-bit modes).

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

```plaintext
write_q_element(position, val, src)
{
    TEMP ← SRC
    CASE (position)
    0: TEMP[63:0] ← val
    1: TEMP[127:64] ← val
    ESAC
    return TEMP
}
```

```plaintext
write_d_element(position, val, src)
{
    TEMP ← SRC
    CASE (position)
    0: TEMP[31:0] ← val
    1: TEMP[63:32] ← val
    2: TEMP[95:64] ← val
    3: TEMP[127:96] ← val
    ESAC
    return TEMP
}
```

```plaintext
write_w_element(position, val, src)
{
    TEMP ← SRC
    CASE (position)
    0: TEMP[15:0] ← val
    1: TEMP[31:16] ← val
    2: TEMP[47:32] ← val
    3: TEMP[63:48] ← val
    4: TEMP[79:64] ← val
    5: TEMP[95:80] ← val
    6: TEMP[111:96] ← val
    7: TEMP[127:112] ← val
    ESAC
    return TEMP
}
```

write_b_element(position, val, src)
{
TEMP ← SRC
CASE (position)
0: TEMP[7:0] ← val
1: TEMP[15:8] ← val
2: TEMP[23:16] ← val
3: TEMP[31:24] ← val
4: TEMP[39:32] ← val
5: TEMP[47:40] ← val
6: TEMP[55:48] ← val
7: TEMP[63:56] ← val
8: TEMP[71:64] ← val
9: TEMP[79:72] ← val
10: TEMP[87:80] ← val
11: TEMP[95:88] ← val
12: TEMP[103:96] ← val
13: TEMP[111:104] ← val
14: TEMP[119:112] ← val
15: TEMP[127:120] ← val
ESAC
return TEMP
}

VPINSRQ (VEX.128 encoded version)
SEL ← imm8[0]
DEST[127:0] ← write_q_element(SEL, SRC2, SRC1)
DEST[255:128] ← 0

VPINSRD (VEX.128 encoded version)
SEL ← imm8[1:0]
DEST[127:0] ← write_d_element(SEL, SRC2, SRC1)
DEST[255:128] ← 0

VPINSRW (VEX.128 encoded version)
SEL ← imm8[2:0]
DEST[127:0] ← write_w_element(SEL, SRC2, SRC1)
DEST[255:128] ← 0

VPINSRB (VEX.128 encoded version)
SEL ← imm8[3:0]
DEST[127:0] ← write_b_element(SEL, SRC2, SRC1)
DEST[255:128] ← 0

PINSRQ (Legacy SSE version)
SEL ← imm8[0]
DEST[127:0] ← write_q_element(SEL, SRC, DEST)
DEST[255:128] (Unmodified)

PINSRD (Legacy SSE version)
SEL ← imm8[1:0]
DEST[127:0] ← write_d_element(SEL, SRC, DEST)
DEST[255:128] (Unmodified)

PINSRW (Legacy SSE version)
SEL ← imm8[2:0]
DEST[127:0] ← write_w_element(SEL, SRC, DEST)
DEST[255:128] (Unmodified)

PINSRB (Legacy SSE version)
SEL ← imm8[3:0]
DEST[127:0] ← write_b_element(SEL, SRC, DEST)
DEST[255:128] (Unmodified)
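
The difference between the VEX and legacy forms is only where the untouched elements come from. The scalar C sketch below models the dword case; `pinsrd_model` is a hypothetical name, and element 0 of each array models bits 31:0.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of (V)PINSRD.
   VEX form:    pass distinct dst and src1 arrays.
   Legacy form: pass the same array as dst and src1. */
static void pinsrd_model(uint32_t dst[4], const uint32_t src1[4],
                         uint32_t val, uint8_t imm8)
{
    uint32_t tmp[4];
    memcpy(tmp, src1, sizeof tmp);   /* start from the first source */
    tmp[imm8 & 3u] = val;            /* SEL is imm8[1:0]; other bits ignored */
    memcpy(dst, tmp, sizeof tmp);
}
```

The byte, word, and qword forms follow the same pattern in this model with element counts of 16, 8, and 2 and select masks of 15, 7, and 1.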

Intel C/C++ Compiler Intrinsic Equivalent

PINSRB __m128i _mm_insert_epi8 (__m128i s1, int s2, const int ndx);
PINSRW __m128i _mm_insert_epi16 ( __m128i a, int b, int imm)
PINSRD __m128i _mm_insert_epi32 ( __m128i s2, int s, const int ndx);
PINSRQ __m128i _mm_insert_epi64(__m128i s2, __int64 s, const int ndx);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.L = 1.
PMADDWD- Multiply and Add Packed Integers

**Description**
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second source operand, producing temporary signed doubleword results. The adjacent doubleword results are then summed and stored in the destination operand. For example, the corresponding low-order words (15:0) and (31:16) in the second source and first source operands are multiplied by one another and the doubleword results are added together and stored in the low doubleword of the destination register (31:0). The same operation is performed on the other pairs of adjacent words. The second source operand is an XMM register or a 128-bit memory location.

The first source and destination operands are XMM registers. The PMADDWD instruction wraps around only in one situation: when the 2 pairs of words being operated on in a group are all 8000H. In this case, the result wraps around to 80000000H.

**128-bit Legacy SSE version:** Bits (255:128) of the corresponding YMM destination register remain unchanged.

**VEX.128 encoded version:** Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

**Operation**

**VPMADDWD (VEX.128 encoded version)**

DEST[31:0] ← (SRC1[15:0] * SRC2[15:0]) + (SRC1[31:16] * SRC2[31:16])
DEST[63:32] ← (SRC1[47:32] * SRC2[47:32]) + (SRC1[63:48] * SRC2[63:48])
DEST[95:64] ← (SRC1[79:64] * SRC2[79:64]) + (SRC1[95:80] * SRC2[95:80])
DEST[127:96] ← (SRC1[111:96] * SRC2[111:96]) + (SRC1[127:112] * SRC2[127:112])
DEST[255:128] ← 0

PMADDWD (128-bit Legacy SSE version)
DEST[31:0] ← (DEST[15:0] * SRC[15:0]) + (DEST[31:16] * SRC[31:16])
DEST[63:32] ← (DEST[47:32] * SRC[47:32]) + (DEST[63:48] * SRC[63:48])
DEST[95:64] ← (DEST[79:64] * SRC[79:64]) + (DEST[95:80] * SRC[95:80])
DEST[127:96] ← (DEST[111:96] * SRC[111:96]) + (DEST[127:112] * SRC[127:112])
DEST[255:128] (Unmodified)
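
The single wrap-around case can be checked with a scalar model. In the sketch below, `pmaddwd_model` is a hypothetical name; results are stored as raw 32-bit patterns and the final addition is done in unsigned arithmetic, which is an implementation choice of this example so that the wrap is well defined in C.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMADDWD.
   a[0] and b[0] model bits 15:0 of the two sources. */
static void pmaddwd_model(uint32_t dst[4], const int16_t a[8],
                          const int16_t b[8])
{
    for (int i = 0; i < 4; i++) {
        /* Each signed 16x16 product fits in 32 bits. */
        int32_t lo = (int32_t)a[2 * i]     * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Only 8000H*8000H + 8000H*8000H exceeds the signed range. */
        dst[i] = (uint32_t)lo + (uint32_t)hi;
    }
}
```

When all four words of a group are 8000H, each product is 40000000H and the sum is 80000000H, which matches the wrap-around value given in the description.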

Intel C/C++ Compiler Intrinsic Equivalent
PMADDWD __m128i _mm_madd_epi16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PMADDUBSW- Multiply and Add Packed Integers

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 04 /r</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.</td>
</tr>
<tr>
<td>PMADDUBSW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 04 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply signed and unsigned bytes, add horizontal pair of signed words, pack saturated signed-words to xmm1.</td>
</tr>
<tr>
<td>VPMADDUBSW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

PMADDUBSW multiplies vertically each unsigned byte of the first source operand with the corresponding signed byte of the second source operand, producing intermediate signed 16-bit integers. Each adjacent pair of signed words is added and the saturated result is packed to the destination operand. For example, the lowest-order bytes (bits 7:0) in the first source and second source operands are multiplied and the intermediate signed word result is added with the corresponding intermediate result from the 2nd lowest-order bytes (bits 15:8) of the operands; the sign-saturated result is stored in the lowest word of the destination register (15:0). The same operation is performed on the other pairs of adjacent bytes. The second source operand can be an XMM register or 128-bit memory location. The first source operand and destination operands are XMM registers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Operation

VPMADDUBSW (VEX.128 encoded version)

DEST[15:0] ← SaturateToSignedWord(SRC2[15:8] * SRC1[15:8] + SRC2[7:0] * SRC1[7:0])
// Repeat operation for 2nd through 7th word
DEST[127:112] ← SaturateToSignedWord(SRC2[127:120] * SRC1[127:120] + SRC2[119:112] * SRC1[119:112])
DEST[255:128] ← 0

PMADDUBSW (128-bit Legacy SSE version)

DEST[15:0] ← SaturateToSignedWord(SRC[15:8] * DEST[15:8] + SRC[7:0] * DEST[7:0])
// Repeat operation for 2nd through 7th word
DEST[127:112] ← SaturateToSignedWord(SRC[127:120] * DEST[127:120] + SRC[119:112] * DEST[119:112])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PMADDUBSW __m128i _mm_maddubs_epi16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PMAXSB/PMAXSW/PMAXSD - Maximum of Packed Signed Integers

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 3C /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed signed byte integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXSB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F EE /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed signed word integers in xmm2/m128 and xmm1 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXSW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 3D /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed signed dword integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXSD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3C /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>VPMAXSB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F EE /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed word integers in xmm3/m128 and xmm2 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>VPMAXSW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3D /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed dword integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>VPMAXSD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description
Performs a SIMD compare of the packed signed byte, word, or dword integers in the second source operand and the first source operand and returns the maximum value for each pair of integers to the destination operand. The first source and destination operands are XMM registers; the second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
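The per-element signed compare described above can be sketched in scalar C. This is an illustrative model only: `pmaxsb_model` is a hypothetical name, it covers the byte form (the word and dword forms differ only in element width and count), and the upper YMM bits are not modelled.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMAXSB:
   per-element maximum of 16 signed bytes. */
void pmaxsb_model(int8_t dest[16],
                  const int8_t src1[16],
                  const int8_t src2[16])
{
    for (int i = 0; i < 16; i++) {
        /* Signed comparison: 0xFF is -1 and therefore less than 0x01. */
        dest[i] = (src1[i] > src2[i]) ? src1[i] : src2[i];
    }
}
```

Because the comparison is signed, a byte with bit pattern 0xFF (-1) loses to 0x01 (+1).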

**Operation**

**PMAXSB (128-bit Legacy SSE version)**

IF DEST[7:0] > SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
ELSE
    DEST[7:0] ← SRC[7:0]; FI;

(* Repeat operation for 2nd through 15th bytes in source and destination operands *)

IF DEST[127:120] > SRC[127:120] THEN
    DEST[127:120] ← DEST[127:120];
ELSE
    DEST[127:120] ← SRC[127:120]; FI;

DEST[255:128] (Unmodified)

**VPMAXSB (VEX.128 encoded version)**

IF SRC1[7:0] > SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;

(* Repeat operation for 2nd through 15th bytes in source and destination operands *)

IF SRC1[127:120] > SRC2[127:120] THEN
    DEST[127:120] ← SRC1[127:120];
ELSE
    DEST[127:120] ← SRC2[127:120]; FI;

DEST[255:128] ← 0

**PMAXSW (128-bit Legacy SSE version)**

IF DEST[15:0] > SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;

(* Repeat operation for 2nd through 7th words in source and destination operands *)

IF DEST[127:112] > SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;

DEST[255:128] (Unmodified)

**VPMAXSW (VEX.128 encoded version)**

IF SRC1[15:0] > SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
   DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] > SRC2[127:112] THEN
   DEST[127:112] ← SRC1[127:112];
ELSE
   DEST[127:112] ← SRC2[127:112]; FI;
DEST[255:128] ← 0

PMAXSD (128-bit Legacy SSE version)
IF DEST[31:0] > SRC[31:0] THEN
   DEST[31:0] ← DEST[31:0];
ELSE
   DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] > SRC[127:96] THEN
   DEST[127:96] ← DEST[127:96];
ELSE
   DEST[127:96] ← SRC[127:96]; FI;
DEST[255:128] (Unmodified)

VPMAXSD (VEX.128 encoded version)
IF SRC1[31:0] > SRC2[31:0] THEN
   DEST[31:0] ← SRC1[31:0];
ELSE
   DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] > SRC2[127:96] THEN
   DEST[127:96] ← SRC1[127:96];
ELSE
   DEST[127:96] ← SRC2[127:96]; FI;
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
PMAXSB __m128i _mm_max_epi8 (__m128i a, __m128i b);
PMAXSW __m128i _mm_max_epi16 (__m128i a, __m128i b)
PMAXSD __m128i _mm_max_epi32 (__m128i a, __m128i b);

SIMD Floating-Point Exceptions
None
Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
### PMAXUB/PMAXUW/PMAXUD - Maximum of Packed Unsigned Integers

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F DE /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed unsigned byte integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXUB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 3E /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed unsigned word integers in xmm2/m128 and xmm1 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXUW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 3F /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed unsigned dword integers in xmm1 and xmm2/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>PMAXUD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F DE /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>VPMAXUB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3E /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned word integers in xmm3/m128 and xmm2 and store maximum packed values in xmm1.</td>
</tr>
<tr>
<td>VPMAXUW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3F /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned dword integers in xmm2 and xmm3/m128 and store packed maximum values in xmm1.</td>
</tr>
<tr>
<td>VPMAXUD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

### Description

Performs a SIMD compare of the packed unsigned byte, word, or dword integers in the second source operand and the first source operand and returns the maximum value for each pair of integers to the destination operand. The first source and destination operands are XMM registers; the second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
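The unsigned variant differs from the signed one only in how each element's bits are interpreted. The following scalar sketch models the word form; `pmaxuw_model` is a hypothetical name, and the upper YMM bits are not modelled.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMAXUW:
   per-element maximum of 8 unsigned words. */
void pmaxuw_model(uint16_t dest[8],
                  const uint16_t src1[8],
                  const uint16_t src2[8])
{
    for (int i = 0; i < 8; i++) {
        /* Unsigned comparison: 0x8000 (32768) is greater than 0x7FFF. */
        dest[i] = (src1[i] > src2[i]) ? src1[i] : src2[i];
    }
}
```

For the pair 0x8000 and 0x7FFF this model selects 0x8000, whereas a signed word compare would treat 0x8000 as -32768 and select 0x7FFF.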

Operation

**PMAXUB (128-bit Legacy SSE version)**

IF DEST[7:0] > SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
ELSE
    DEST[7:0] ← SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] > SRC[127:120] THEN
    DEST[127:120] ← DEST[127:120];
ELSE
    DEST[127:120] ← SRC[127:120]; FI;
DEST[255:128] (Unmodified)

**VPMAXUB (VEX.128 encoded version)**

IF SRC1[7:0] > SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] > SRC2[127:120] THEN
    DEST[127:120] ← SRC1[127:120];
ELSE
    DEST[127:120] ← SRC2[127:120]; FI;
DEST[255:128] ← 0

**PMAXUW (128-bit Legacy SSE version)**

IF DEST[15:0] > SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] > SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[255:128] (Unmodified)

**VPMAXUW (VEX.128 encoded version)**

IF SRC1[15:0] > SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] > SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;
DEST[255:128] ← 0

PMAXUD (128-bit Legacy SSE version)
IF DEST[31:0] > SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] > SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;
DEST[255:128] (Unmodified)

VPMAXUD (VEX.128 encoded version)
IF SRC1[31:0] > SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] > SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
PMAXUB __m128i _mm_max_epu8 (__m128i a, __m128i b);
PMAXUW __m128i _mm_max_epu16 (__m128i a, __m128i b);
PMAXUD __m128i _mm_max_epu32 (__m128i a, __m128i b);

SIMD Floating-Point Exceptions
None
Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PMINSB/PMINSW/PMINSD - Minimum of Packed Signed Integers

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 38 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed signed byte integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINSB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F EA /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed signed word integers in xmm2/m128 and xmm1 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINSW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 39 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed signed dword integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINSD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 38 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINSB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F EA /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed word integers in xmm3/m128 and xmm2 and return packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINSW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 39 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed signed dword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINSD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description
Performs a SIMD compare of the packed signed byte, word, or dword integers in the second source operand and the first source operand and returns the minimum value for each pair of integers to the destination operand. The first source and destination operands are XMM registers; the second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
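As a worked illustration of the dword form, the sketch below models the signed minimum across four 32-bit elements. `pminsd_model` is a hypothetical name chosen for this example, and the upper YMM bits are not modelled.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMINSD:
   per-element minimum of 4 signed dwords. */
void pminsd_model(int32_t dest[4],
                  const int32_t src1[4],
                  const int32_t src2[4])
{
    for (int i = 0; i < 4; i++) {
        dest[i] = (src1[i] < src2[i]) ? src1[i] : src2[i];
    }
}
```

Passing a vector of identical limits as one operand caps every element of the other operand at that limit, which is one common use of a packed minimum.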

Operation

**PMINSB (128-bit Legacy SSE version)**

```
IF DEST[7:0] < SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
ELSE
    DEST[7:0] ← SRC[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF DEST[127:120] < SRC[127:120] THEN
    DEST[127:120] ← DEST[127:120];
ELSE
    DEST[127:120] ← SRC[127:120]; FI;
DEST[255:128] (Unmodified)
```

**VPMINSB (VEX.128 encoded version)**

```
IF SRC1[7:0] < SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
(* Repeat operation for 2nd through 15th bytes in source and destination operands *)
IF SRC1[127:120] < SRC2[127:120] THEN
    DEST[127:120] ← SRC1[127:120];
ELSE
    DEST[127:120] ← SRC2[127:120]; FI;
DEST[255:128] ← 0
```

**PMINSW (128-bit Legacy SSE version)**

```
IF DEST[15:0] < SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
ELSE
    DEST[15:0] ← SRC[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF DEST[127:112] < SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[255:128] (Unmodified)
```

**VPMINSW (VEX.128 encoded version)**

```
IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;
(* Repeat operation for 2nd through 7th words in source and destination operands *)
IF SRC1[127:112] < SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;
DEST[255:128] ← 0
```

PMINSD (128-bit Legacy SSE version)
IF DEST[31:0] < SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF DEST[127:96] < SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;
DEST[255:128] (Unmodified)

VPMINSD (VEX.128 encoded version)
IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;
(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)
IF SRC1[127:96] < SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
PMINSB __m128i _mm_min_epi8 (__m128i a, __m128i b);
PMINSW __m128i _mm_min_epi16 (__m128i a, __m128i b)
PMINSD __m128i _mm_min_epi32 (__m128i a, __m128i b);

SIMD Floating-Point Exceptions
None
Other Exceptions

See Exceptions Type 4; additionally

#UD         If VEX.L = 1.

PMINUB/PMINUW/PMINUD - Minimum of Packed Unsigned Integers

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F DA /r</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare packed unsigned byte integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINUB xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 3A /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed unsigned word integers in xmm2/m128 and xmm1 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINUW xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0F 38 3B /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Compare packed unsigned dword integers in xmm1 and xmm2/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>PMINUD xmm1, xmm2/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F DA /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned byte integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINUB xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3A /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned word integers in xmm3/m128 and xmm2 and return packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINUW xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 3B /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare packed unsigned dword integers in xmm2 and xmm3/m128 and store packed minimum values in xmm1.</td>
</tr>
<tr>
<td>VPMINUD xmm1, xmm2, xmm3/m128</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Performs a SIMD compare of the packed unsigned byte, word, or dword integers in the second source operand and the first source operand and returns the minimum value for each pair of integers to the destination operand. The first source and destination operands are XMM registers; the second source operand is an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
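A packed unsigned minimum can also serve as a building block for an unsigned less-than-or-equal test, using the identity a <= b if and only if min(a, b) = a. The scalar sketch below illustrates that idea; it is an assumption-laden example rather than something this reference prescribes, and both `pminub_model` and `unsigned_le_mask` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMINUB:
   per-element minimum of 16 unsigned bytes. */
void pminub_model(uint8_t dest[16],
                  const uint8_t src1[16],
                  const uint8_t src2[16])
{
    for (int i = 0; i < 16; i++) {
        dest[i] = (src1[i] < src2[i]) ? src1[i] : src2[i];
    }
}

/* mask[i] = 0xFF where a[i] <= b[i] (unsigned), else 0x00.
   Relies on the identity: a <= b  iff  min(a, b) == a. */
void unsigned_le_mask(uint8_t mask[16],
                      const uint8_t a[16],
                      const uint8_t b[16])
{
    uint8_t m[16];
    pminub_model(m, a, b);
    for (int i = 0; i < 16; i++) {
        mask[i] = (m[i] == a[i]) ? 0xFF : 0x00;
    }
}
```

In vector code the equality step would typically be a packed byte compare for equality applied to the minimum and the first operand.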

Operation

PMINUB (128-bit Legacy SSE version)
PMINUB instruction for 128-bit operands:
  IF DEST[7:0] < SRC[7:0] THEN
    DEST[7:0] ← DEST[7:0];
  ELSE
    DEST[7:0] ← SRC[7:0]; FI;
  (* Repeat operation for 2nd through 15th bytes in source and destination operands *)
  IF DEST[127:120] < SRC[127:120] THEN
    DEST[127:120] ← DEST[127:120];
  ELSE
    DEST[127:120] ← SRC[127:120]; FI;
DEST[255:128] (Unmodified)

VPMINUB (VEX.128 encoded version)
VPMINUB instruction for 128-bit operands:
  IF SRC1[7:0] < SRC2[7:0] THEN
    DEST[7:0] ← SRC1[7:0];
  ELSE
    DEST[7:0] ← SRC2[7:0]; FI;
  (* Repeat operation for 2nd through 15th bytes in source and destination operands *)
  IF SRC1[127:120] < SRC2[127:120] THEN
    DEST[127:120] ← SRC1[127:120];
  ELSE
    DEST[127:120] ← SRC2[127:120]; FI;
DEST[255:128] ← 0

PMINUW (128-bit Legacy SSE version)
PMINUW instruction for 128-bit operands:
  IF DEST[15:0] < SRC[15:0] THEN
    DEST[15:0] ← DEST[15:0];
  ELSE
    DEST[15:0] ← SRC[15:0]; FI;
  (* Repeat operation for 2nd through 7th words in source and destination operands *)
  IF DEST[127:112] < SRC[127:112] THEN
    DEST[127:112] ← DEST[127:112];
  ELSE
    DEST[127:112] ← SRC[127:112]; FI;
DEST[255:128] (Unmodified)

VPMINUW (VEX.128 encoded version)
VPMINUW instruction for 128-bit operands:

IF SRC1[15:0] < SRC2[15:0] THEN
    DEST[15:0] ← SRC1[15:0];
ELSE
    DEST[15:0] ← SRC2[15:0]; FI;

(* Repeat operation for 2nd through 7th words in source and destination operands *)

IF SRC1[127:112] < SRC2[127:112] THEN
    DEST[127:112] ← SRC1[127:112];
ELSE
    DEST[127:112] ← SRC2[127:112]; FI;

DEST[255:128] ← 0

PMINUD (128-bit Legacy SSE version)
PMINUD instruction for 128-bit operands:

IF DEST[31:0] < SRC[31:0] THEN
    DEST[31:0] ← DEST[31:0];
ELSE
    DEST[31:0] ← SRC[31:0]; FI;

(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)

IF DEST[127:96] < SRC[127:96] THEN
    DEST[127:96] ← DEST[127:96];
ELSE
    DEST[127:96] ← SRC[127:96]; FI;

DEST[255:128] (Unmodified)

VPMINUD (VEX.128 encoded version)
VPMINUD instruction for 128-bit operands:

IF SRC1[31:0] < SRC2[31:0] THEN
    DEST[31:0] ← SRC1[31:0];
ELSE
    DEST[31:0] ← SRC2[31:0]; FI;

(* Repeat operation for 2nd through 3rd dwords in source and destination operands *)

IF SRC1[127:96] < SRC2[127:96] THEN
    DEST[127:96] ← SRC1[127:96];
ELSE
    DEST[127:96] ← SRC2[127:96]; FI;

DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

PMINUB __m128i _mm_min_epu8 (__m128i a, __m128i b)
PMINUW __m128i _mm_min_epu16 ( __m128i a, __m128i b);
PMINUD __m128i _mm_min_epu32 ( __m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PMOVMSKB - Move Byte Mask

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F D7 /r PMOVMSKB reg, xmm1</td>
<td>V/V</td>
<td>SSE2</td>
<td>Move a byte mask of xmm1 to reg. The upper bits of r32 or r64 are filled with zeros.</td>
</tr>
<tr>
<td>VEX.128.66.0F D7 /r VPMOVMSKB reg, xmm1</td>
<td>V/V</td>
<td>AVX</td>
<td>Move a byte mask of xmm1 to reg. The upper bits of r32 or r64 are filled with zeros.</td>
</tr>
</tbody>
</table>

Description

Creates a mask made up of the most significant bit of each byte of the source operand (second operand) and stores the result in the low word of the destination operand (first operand). The source operand is an XMM register; the destination operand is a general-purpose register. The byte mask is 16 bits wide.

In 64-bit mode, the default operand size of the destination operand is 64 bits. The upper bits above bit 15 are filled with zeros. REX.W is ignored.

VEX.128 encodings are valid but identical in function. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.
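The mask construction can be sketched in scalar C as follows. This is an illustrative model under stated assumptions: `pmovmskb_model` is a hypothetical name, the source is passed as an array of 16 bytes with element 0 holding bits 7:0, and the result is returned as a 32-bit value whose upper 16 bits are zero.

```c
#include <stdint.h>

/* Hypothetical scalar model of (V)PMOVMSKB with a 128-bit source:
   bit i of the result is the most significant bit of byte i. */
uint32_t pmovmskb_model(const uint8_t src[16])
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= (uint32_t)(src[i] >> 7) << i;
    }
    return mask; /* bits 31:16 remain zero */
}
```

A typical use is to follow a packed compare with this instruction and then test or scan the resulting integer mask.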

Operation

(V)PMOVMSKB instruction with 128-bit source operand and r32:

r32[0] ← SRC[7];
r32[1] ← SRC[15];
(* Repeat operation for bytes 2 through 14 *)
r32[15] ← SRC[127];
r32[31:16] ← ZERO_FILL;

(V)PMOVMSKB instruction with 128-bit source operand and r64:

r64[0] ← SRC[7];
r64[1] ← SRC[15];
(* Repeat operation for bytes 2 through 14 *)
r64[15] ← SRC[127];
r64[63:16] ← ZERO_FILL;
Intel C/C++ Compiler Intrinsic Equivalent

PMOVMSKB int _mm_movemask_epi8 ( __m128i a)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally

#UD            If VEX.L = 1.
              If VEX.vvvv != 1111B.
## PMOVSX - Packed Move with Sign Extend

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0f 38 20 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXBW xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0f 38 21 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXBD xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0f 38 22 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXBQ xmm1, xmm2/m16</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0f 38 23 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXWD xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0f 38 24 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXWQ xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>66 0f 38 25 /r</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>PMOVSXDQ xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 20 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXBW xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 21 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXBD xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 22 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXBQ xmm1, xmm2/m16</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 23 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXWD xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 24 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXWQ xmm1, xmm2/m32</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>VEX.128.66.0F38 25 /r</td>
<td>V/V</td>
<td>AVX</td>
<td>Sign extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VPMOVSXDQ xmm1, xmm2/m64</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Description

Packed byte, word, or dword integers in the low bytes of the source operand (second operand) are sign extended to word, dword, or quadword integers and stored as packed signed integers in the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation

Packed_Sign_Extend_BYTE_to_WORD
DEST[15:0] ← SignExtend(SRC[7:0]);
DEST[31:16] ← SignExtend(SRC[15:8]);
DEST[47:32] ← SignExtend(SRC[23:16]);
DEST[63:48] ← SignExtend(SRC[31:24]);
DEST[79:64] ← SignExtend(SRC[39:32]);
DEST[95:80] ← SignExtend(SRC[47:40]);
DEST[111:96] ← SignExtend(SRC[55:48]);
DEST[127:112] ← SignExtend(SRC[63:56]);

Packed_Sign_Extend_BYTE_to_DWORD
DEST[31:0] ← SignExtend(SRC[7:0]);
DEST[63:32] ← SignExtend(SRC[15:8]);
DEST[95:64] ← SignExtend(SRC[23:16]);
DEST[127:96] ← SignExtend(SRC[31:24]);

Packed_Sign_Extend_BYTE_to_QWORD
DEST[63:0] ← SignExtend(SRC[7:0]);
DEST[127:64] ← SignExtend(SRC[15:8]);

Packed_Sign_Extend_WORD_to_DWORD
DEST[31:0] ← SignExtend(SRC[15:0]);
DEST[63:32] ← SignExtend(SRC[31:16]);
DEST[95:64] ← SignExtend(SRC[47:32]);
DEST[127:96] ← SignExtend(SRC[63:48]);

Packed_Sign_Extend_WORD_to_QWORD
DEST[63:0] ← SignExtend(SRC[15:0]);
DEST[127:64] ← SignExtend(SRC[31:16]);

Packed_Sign_Extend_DWORD_to_QWORD
DEST[63:0] ← SignExtend(SRC[31:0]);
DEST[127:64] ← SignExtend(SRC[63:32]);
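
The helper operations above map directly onto C integer conversions, since converting a narrower signed type to a wider one sign extends the value. The sketch below models the byte-to-dword case; `pmovsxbd_model` is a hypothetical name, only the low 4 source bytes are read (matching the m32 operand form), and the upper YMM bits are not modelled.

```c
#include <stdint.h>

/* Hypothetical scalar model of Packed_Sign_Extend_BYTE_to_DWORD:
   the low 4 signed bytes of the source become 4 signed dwords. */
void pmovsxbd_model(int32_t dest[4], const int8_t src[4])
{
    for (int i = 0; i < 4; i++) {
        /* int8_t to int32_t conversion replicates the sign bit. */
        dest[i] = (int32_t)src[i];
    }
}
```

A source byte of 0xFF (-1) therefore becomes 0xFFFFFFFF in the destination dword, preserving the value -1.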

VPMOVSXBW
Packed_Sign_Extend_BYTE_to_WORD() 
DEST[255:128] ← 0

VPMOVSXBD
Packed_Sign_Extend_BYTE_to_DWORD() 
DEST[255:128] ← 0

VPMOVSXBQ
Packed_Sign_Extend_BYTE_to_QWORD()
DEST[255:128] ← 0

VPMOVSXWD
Packed_Sign_Extend_WORD_to_DWORD() 
DEST[255:128] ← 0

**VPMOVSXWQ**  
Packed_Sign_Extend_WORD_to_QWORD()  
DEST[255:128] ← 0

**VPMOVSXDQ**  
Packed_Sign_Extend_DWORD_to_QWORD()  
DEST[255:128] ← 0

**PMOVSXBW**  
Packed_Sign_Extend_BYTE_to_WORD()  
DEST[255:128] (Unmodified)

**PMOVSXBD**  
Packed_Sign_Extend_BYTE_to_DWORD()  
DEST[255:128] (Unmodified)

**PMOVSXBQ**  
Packed_Sign_Extend_BYTE_to_QWORD()  
DEST[255:128] (Unmodified)

**PMOVSXWD**  
Packed_Sign_Extend_WORD_to_DWORD()  
DEST[255:128] (Unmodified)

**PMOVSXWQ**  
Packed_Sign_Extend_WORD_to_QWORD()  
DEST[255:128] (Unmodified)

**PMOVSXDQ**  
Packed_Sign_Extend_DWORD_to_QWORD()  
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PMOVSXBW __m128i _mm_cvtepi8_epi16 (__m128i a);  
PMOVSXBD __m128i _mm_cvtepi8_epi32 (__m128i a);  
PMOVSXBQ __m128i _mm_cvtepi8_epi64 (__m128i a);  
PMOVSXWD __m128i _mm_cvtepi16_epi32 (__m128i a);  
PMOVSXWQ __m128i _mm_cvtepi16_epi64 (__m128i a);  
PMOVSXDQ __m128i _mm_cvtepi32_epi64 (__m128i a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD If VEX.L = 1.
If VEX.vvvv != 1111B.
### PMOVZX - Packed Move with Zero Extend

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0f 38 30 /r PMOVZXBW xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1</td>
</tr>
<tr>
<td>66 0f 38 31 /r PMOVZXBD xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>66 0f 38 32 /r PMOVZXBQ xmm1, xmm2/m16</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>66 0f 38 33 /r PMOVZXWD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>66 0f 38 34 /r PMOVZXWQ xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>66 0f 38 35 /r PMOVZXDQ xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 30 /r VPMOVZXBW xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 8 packed 8-bit integers in the low 8 bytes of xmm2/m64 to 8 packed 16-bit integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 31 /r VPMOVZXBD xmm1, xmm2/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 4 packed 8-bit integers in the low 4 bytes of xmm2/m32 to 4 packed 32-bit integers in xmm1</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.128.66.0F38 32 /r VPMOVZXBQ xmm1, xmm2/m16</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 2 packed 8-bit integers in the low 2 bytes of xmm2/m16 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 33 /r VPMOVZXWD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 4 packed 16-bit integers in the low 8 bytes of xmm2/m64 to 4 packed 32-bit integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 34 /r VPMOVZXWQ xmm1, xmm2/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 2 packed 16-bit integers in the low 4 bytes of xmm2/m32 to 2 packed 64-bit integers in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F38 35 /r VPMOVZXDQ xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero extend 2 packed 32-bit integers in the low 8 bytes of xmm2/m64 to 2 packed 64-bit integers in xmm1</td>
</tr>
</tbody>
</table>

Description

Packed byte, word, or dword integers in the low bytes of the source operand (second operand) are zero extended to word, dword, or quadword integers and stored as packed integers in the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.

Operation

Packed_Zero_Extend_BYTE_to_WORD
DEST[15:0] ← ZeroExtend(SRC[7:0]);
DEST[31:16] ← ZeroExtend(SRC[15:8]);
DEST[47:32] ← ZeroExtend(SRC[23:16]);
DEST[63:48] ← ZeroExtend(SRC[31:24]);
DEST[79:64] ← ZeroExtend(SRC[39:32]);
DEST[95:80] ← ZeroExtend(SRC[47:40]);
DEST[111:96] ← ZeroExtend(SRC[55:48]);
DEST[127:112] ← ZeroExtend(SRC[63:56]);

Packed_Zero_Extend_BYTE_to_DWORD
DEST[31:0] ← ZeroExtend(SRC[7:0]);
DEST[63:32] ← ZeroExtend(SRC[15:8]);
DEST[95:64] ← ZeroExtend(SRC[23:16]);
DEST[127:96] ← ZeroExtend(SRC[31:24]);

Packed_Zero_Extend_BYTE_to_QWORD
DEST[63:0] ← ZeroExtend(SRC[7:0]);
DEST[127:64] ← ZeroExtend(SRC[15:8]);

Packed_Zero_Extend_WORD_to_DWORD
DEST[31:0] ← ZeroExtend(SRC[15:0]);
DEST[63:32] ← ZeroExtend(SRC[31:16]);
DEST[95:64] ← ZeroExtend(SRC[47:32]);
DEST[127:96] ← ZeroExtend(SRC[63:48]);

Packed_Zero_Extend_WORD_to_QWORD
DEST[63:0] ← ZeroExtend(SRC[15:0]);
DEST[127:64] ← ZeroExtend(SRC[31:16]);

Packed_Zero_Extend_DWORD_to_QWORD
DEST[63:0] ← ZeroExtend(SRC[31:0]);
DEST[127:64] ← ZeroExtend(SRC[63:32]);

VPMOVZXBW
Packed_Zero_Extend_BYTE_to_WORD()
DEST[255:128] ← 0

VPMOVZXBD
Packed_Zero_Extend_BYTE_to_DWORD()
DEST[255:128] ← 0

VPMOVZXBQ
Packed_Zero_Extend_BYTE_to_QWORD()
DEST[255:128] ← 0

VPMOVZXWD
Packed_Zero_Extend_WORD_to_DWORD()
DEST[255:128] ← 0

VPMOVZXWQ
Packed_Zero_Extend_WORD_to_QWORD()
DEST[255:128] ← 0

VPMOVZXDQ
Packed_Zero_Extend_DWORD_to_QWORD()
DEST[255:128] ← 0

PMOVZXBW
Packed_Zero_Extend_BYTE_to_WORD()
DEST[255:128] (Unmodified)

PMOVZXBD
Packed_Zero_Extend_BYTE_to_DWORD()
DEST[255:128] (Unmodified)

PMOVZXBQ
Packed_Zero_Extend_BYTE_to_QWORD()
DEST[255:128] (Unmodified)

PMOVZXWD
Packed_Zero_Extend_WORD_to_DWORD()
DEST[255:128] (Unmodified)

PMOVZXWQ
Packed_Zero_Extend_WORD_to_QWORD()
DEST[255:128] (Unmodified)

PMOVZXDQ
Packed_Zero_Extend_DWORD_to_QWORD()
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PMOVZXBW __m128i _mm_cvtepu8_epi16 (__m128i a);
PMOVZXBD __m128i _mm_cvtepu8_epi32 (__m128i a);
PMOVZXBQ __m128i _mm_cvtepu8_epi64 (__m128i a);
PMOVZXWD __m128i __mm_cvtepu16_epi32 (__m128i a);
PMOVZXWQ __m128i __mm_cvtepu16_epi64 (__m128i a);
PMOVZXDQ __m128i __mm_cvtepu32_epi64 (__m128i a);
SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5; additionally
#UD
    If VEX.L = 1.
    If VEX.vvvv != 1111B.

PMULHUW - Multiply Packed Unsigned Integers and Store High Result

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E4 /r PMULHUW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply the packed unsigned word integers in xmm1 and xmm2/m128, and store the high 16 bits of the results in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E4 /r VPMULHUW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the packed unsigned word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1.</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD unsigned multiply of the packed unsigned word integers in the first source operand and the second source operand, and stores the high 16 bits of each 32-bit intermediate result in the destination operand.

The second source operand is an XMM register or a 128-bit memory location. The destination operand and first source operand are XMM registers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
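
The per-element behavior described above can be sketched in portable scalar C. The names `mulhi_u16` and `pmulhuw_model`, and the use of arrays in place of XMM registers, are assumptions for illustration only.

```c
#include <stdint.h>

/* Hypothetical scalar model of one PMULHUW lane: widen both unsigned
   words to 32 bits, multiply, and keep bits 31:16 of the product. */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    uint32_t product = (uint32_t)a * (uint32_t)b;
    return (uint16_t)(product >> 16);
}

/* Apply the lane operation to all eight words of a 128-bit operand. */
void pmulhuw_model(const uint16_t src1[8], const uint16_t src2[8],
                   uint16_t dst[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = mulhi_u16(src1[i], src2[i]);
    }
}
```

For example, 0xFFFF times 0xFFFF is 0xFFFE0001, so the stored high word is 0xFFFE.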

Operation
VPMULHUW (VEX.128 encoded version)
TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0]
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[255:128] ← 0

**PMULHUW (128-bit Legacy SSE version)**
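TEMP0[31:0] ← DEST[15:0] * SRC[15:0]
TEMP1[31:0] ← DEST[31:16] * SRC[31:16]
TEMP2[31:0] ← DEST[47:32] * SRC[47:32]
TEMP3[31:0] ← DEST[63:48] * SRC[63:48]
TEMP4[31:0] ← DEST[79:64] * SRC[79:64]
TEMP5[31:0] ← DEST[95:80] * SRC[95:80]
TEMP6[31:0] ← DEST[111:96] * SRC[111:96]
TEMP7[31:0] ← DEST[127:112] * SRC[127:112]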
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PMULHUW __m128i _mm_mulhi_epu16 (__m128i a, __m128i b)

**SIMD Floating-Point Exceptions**
None

**Other Exceptions**
See Exceptions Type 4; additionally

#UD If VEX.L = 1.
### PMULHRSW - Multiply Packed Signed Integers with Round and Shift

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 0B /r PMULHRSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 0B /r VPMULHRSW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply 16-bit signed words, scale and round signed doublewords, pack high 16 bits to xmm1.</td>
</tr>
</tbody>
</table>

**Description**

PMULHRSW multiplies vertically each signed 16-bit integer from the first source operand with the corresponding signed 16-bit integer of the second source operand, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed to the destination operand. The first source and destination operands are XMM registers. The second source operand is an XMM register or 128-bit memory location.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
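
The round-and-shift sequence described above can be followed step by step in a scalar C sketch of a single element. The helper names `asr32` and `mulhrs_lane` are hypothetical, and the result is returned as a raw 16-bit pattern so that the sketch does not depend on implementation-defined integer conversions.

```c
#include <stdint.h>

/* Arithmetic (sign-propagating) right shift, written so that it does not
   rely on implementation-defined behavior for negative operands. */
static int32_t asr32(int32_t x, unsigned n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Hypothetical scalar model of one PMULHRSW lane. Returns the raw 16-bit
   result pattern, that is, bits 16:1 of ((a * b) >> 14) + 1. */
uint16_t mulhrs_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* signed 32-bit product */
    int32_t temp = asr32(product, 14) + 1;       /* keep 18 bits, add rounding bit */
    return (uint16_t)((uint32_t)temp >> 1);      /* select bits 16:1 */
}
```

Read as Q15 fixed point, 0x4000 (0.5) times 0x4000 (0.5) yields 0x2000 (0.25). The one case that does not fit is -32768 times -32768, which produces the pattern 0x8000.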

**Operation**

**VPMULHRSW (VEX.128 encoded version)**

temp0[31:0] ← INT32 ((SRC1[15:0] * SRC2[15:0]) >> 14) + 1
temp1[31:0] ← INT32 ((SRC1[31:16] * SRC2[31:16]) >> 14) + 1
temp2[31:0] ← INT32 ((SRC1[47:32] * SRC2[47:32]) >> 14) + 1
temp3[31:0] ← INT32 ((SRC1[63:48] * SRC2[63:48]) >> 14) + 1
temp4[31:0] ← INT32 ((SRC1[79:64] * SRC2[79:64]) >> 14) + 1
temp5[31:0] ← INT32 ((SRC1[95:80] * SRC2[95:80]) >> 14) + 1
temp6[31:0] ← INT32 ((SRC1[111:96] * SRC2[111:96]) >> 14) + 1
temp7[31:0] ← INT32 ((SRC1[127:112] * SRC2[127:112]) >> 14) + 1
DEST[15:0] ← temp0[16:1]
DEST[31:16] ← temp1[16:1]
DEST[47:32] ← temp2[16:1]
DEST[63:48] ← temp3[16:1]
DEST[79:64] ← temp4[16:1]
DEST[95:80] ← temp5[16:1]
DEST[111:96] ← temp6[16:1]
DEST[127:112] ← temp7[16:1]
DEST[255:128] ← 0

PMULHRSW (128-bit Legacy SSE version)

temp0[31:0] ← INT32 ((DEST[15:0] * SRC[15:0]) >> 14) + 1
temp1[31:0] ← INT32 ((DEST[31:16] * SRC[31:16]) >> 14) + 1
temp2[31:0] ← INT32 ((DEST[47:32] * SRC[47:32]) >> 14) + 1
temp3[31:0] ← INT32 ((DEST[63:48] * SRC[63:48]) >> 14) + 1
temp4[31:0] ← INT32 ((DEST[79:64] * SRC[79:64]) >> 14) + 1
temp5[31:0] ← INT32 ((DEST[95:80] * SRC[95:80]) >> 14) + 1
temp6[31:0] ← INT32 ((DEST[111:96] * SRC[111:96]) >> 14) + 1
temp7[31:0] ← INT32 ((DEST[127:112] * SRC[127:112]) >> 14) + 1
DEST[15:0] ← temp0[16:1]
DEST[31:16] ← temp1[16:1]
DEST[47:32] ← temp2[16:1]
DEST[63:48] ← temp3[16:1]
DEST[79:64] ← temp4[16:1]
DEST[95:80] ← temp5[16:1]
DEST[111:96] ← temp6[16:1]
DEST[127:112] ← temp7[16:1]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

```
PMULHRSW __m128i _mm_mulhrs_epi16 (__m128i a, __m128i b)
```

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD         If VEX.L = 1.

PMULHW - Multiply Packed Integers and Store High Result

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E5 /r PMULHW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the high 16 bits of the results in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E5 /r VPMULHW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the high 16 bits of the results in xmm1.</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD signed multiply of the packed signed word integers in the first source operand and the second source operand, and stores the high 16 bits of each intermediate 32-bit result in the destination operand. The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
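
A minimal scalar sketch of one signed element follows. The name `mulhi_s16_bits` is an assumption for this example, and the result is returned as a raw 16-bit pattern to keep the code free of implementation-defined conversions.

```c
#include <stdint.h>

/* Hypothetical scalar model of one PMULHW lane: signed 16x16 -> 32-bit
   multiply, returning bits 31:16 of the product as a raw 16-bit pattern. */
uint16_t mulhi_s16_bits(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    /* Converting to uint32_t is well defined (modulo 2^32) and keeps the
       two's-complement bit pattern of the signed product. */
    return (uint16_t)((uint32_t)product >> 16);
}
```

The signed interpretation matters for the high half: the word pattern 0xFFFF is -1 here, so (-1) times (-1) gives a high word of 0, whereas the unsigned PMULHUW result for the same bit patterns is 0xFFFE.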

Operation
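VPMULHW (VEX.128 encoded version)
TEMP0[31:0] ← SRC1[15:0] * SRC2[15:0] (*Signed Multiplication*)
TEMP1[31:0] ← SRC1[31:16] * SRC2[31:16]
TEMP2[31:0] ← SRC1[47:32] * SRC2[47:32]
TEMP3[31:0] ← SRC1[63:48] * SRC2[63:48]
TEMP4[31:0] ← SRC1[79:64] * SRC2[79:64]
TEMP5[31:0] ← SRC1[95:80] * SRC2[95:80]
TEMP6[31:0] ← SRC1[111:96] * SRC2[111:96]
TEMP7[31:0] ← SRC1[127:112] * SRC2[127:112]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[255:128] ← 0

PMULHW (128-bit Legacy SSE version)
TEMP0[31:0] ← DEST[15:0] * SRC[15:0] (*Signed Multiplication*)
TEMP1[31:0] ← DEST[31:16] * SRC[31:16]
TEMP2[31:0] ← DEST[47:32] * SRC[47:32]
TEMP3[31:0] ← DEST[63:48] * SRC[63:48]
TEMP4[31:0] ← DEST[79:64] * SRC[79:64]
TEMP5[31:0] ← DEST[95:80] * SRC[95:80]
TEMP6[31:0] ← DEST[111:96] * SRC[111:96]
TEMP7[31:0] ← DEST[127:112] * SRC[127:112]
DEST[15:0] ← TEMP0[31:16]
DEST[31:16] ← TEMP1[31:16]
DEST[47:32] ← TEMP2[31:16]
DEST[63:48] ← TEMP3[31:16]
DEST[79:64] ← TEMP4[31:16]
DEST[95:80] ← TEMP5[31:16]
DEST[111:96] ← TEMP6[31:16]
DEST[127:112] ← TEMP7[31:16]
DEST[255:128] (Unmodified)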

Intel C/C++ Compiler Intrinsic Equivalent

PMULHW __m128i _mm_mulhi_epi16 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.

**PMULLW/PMULLD - Multiply Packed Integers and Store Low Result**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F D5 /r PMULLW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply the packed signed word integers in xmm1 and xmm2/m128, and store the low 16 bits of the results in xmm1.</td>
</tr>
<tr>
<td>66 0F 38 40 /r PMULLD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Multiply the packed dword signed integers in xmm1 and xmm2/m128 and store the low 32 bits of each product in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D5 /r VPMULLW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the packed signed word integers in xmm2 and xmm3/m128, and store the low 16 bits of the results in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 40 /r VPMULLD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply the packed dword signed integers in xmm2 and xmm3/m128 and store the low 32 bits of each product in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD signed multiply of the packed signed word (dword) integers in the first source operand and the second source operand and stores the low 16(32) bits of each intermediate 32-bit(64-bit) result in the destination operand. (Figure 4-4 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B, shows this operation when using 64-bit operands.) The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
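
The low-half selection described above can be sketched per element in scalar C. The names `mullo_16` and `mullo_32` are hypothetical. The sketch uses unsigned types because the low half of a two's-complement product has the same bit pattern whether the operands are read as signed or unsigned.

```c
#include <stdint.h>

/* Hypothetical scalar model of one PMULLW lane: low 16 bits of the product.
   The operands are widened first so the multiply cannot overflow int. */
uint16_t mullo_16(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * (uint32_t)b);
}

/* Hypothetical scalar model of one PMULLD lane: low 32 bits of the
   64-bit intermediate product. */
uint32_t mullo_32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}
```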

**Operation**

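**VPMULLD (VEX.128 encoded version)**

Temp0[63:0] ← SRC1[31:0] * SRC2[31:0]
Temp1[63:0] ← SRC1[63:32] * SRC2[63:32]
Temp2[63:0] ← SRC1[95:64] * SRC2[95:64]
Temp3[63:0] ← SRC1[127:96] * SRC2[127:96]
DEST[31:0] ← Temp0[31:0]
DEST[63:32] ← Temp1[31:0]
DEST[95:64] ← Temp2[31:0]
DEST[127:96] ← Temp3[31:0]
DEST[255:128] ← 0

PMULLD (128-bit Legacy SSE version)
Temp0[63:0] ← DEST[31:0] * SRC[31:0]
Temp1[63:0] ← DEST[63:32] * SRC[63:32]
Temp2[63:0] ← DEST[95:64] * SRC[95:64]
Temp3[63:0] ← DEST[127:96] * SRC[127:96]
DEST[31:0] ← Temp0[31:0]
DEST[63:32] ← Temp1[31:0]
DEST[95:64] ← Temp2[31:0]
DEST[127:96] ← Temp3[31:0]
DEST[255:128] (Unmodified)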

VPMULLW (VEX.128 encoded version)
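Temp0[31:0] ← SRC1[15:0] * SRC2[15:0]
Temp1[31:0] ← SRC1[31:16] * SRC2[31:16]
Temp2[31:0] ← SRC1[47:32] * SRC2[47:32]
Temp3[31:0] ← SRC1[63:48] * SRC2[63:48]
Temp4[31:0] ← SRC1[79:64] * SRC2[79:64]
Temp5[31:0] ← SRC1[95:80] * SRC2[95:80]
Temp6[31:0] ← SRC1[111:96] * SRC2[111:96]
Temp7[31:0] ← SRC1[127:112] * SRC2[127:112]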
DEST[15:0] ← Temp0[15:0]
DEST[31:16] ← Temp1[15:0]
DEST[47:32] ← Temp2[15:0]
DEST[63:48] ← Temp3[15:0]
DEST[79:64] ← Temp4[15:0]
DEST[95:80] ← Temp5[15:0]
DEST[111:96] ← Temp6[15:0]
DEST[127:112] ← Temp7[15:0]
DEST[255:128] ← 0

PMULLW (128-bit Legacy SSE version)
Temp0[31:0] ← DEST[15:0] * SRC[15:0]
Temp1[31:0] ← DEST[31:16] * SRC[31:16]
Temp2[31:0] ← DEST[47:32] * SRC[47:32]
Temp3[31:0] ← DEST[63:48] * SRC[63:48]
Temp4[31:0] ← DEST[79:64] * SRC[79:64]
Temp5[31:0] ← DEST[95:80] * SRC[95:80]
Temp6[31:0] ← DEST[111:96] * SRC[111:96]
Temp7[31:0] ← DEST[127:112] * SRC[127:112]
DEST[15:0] ← Temp0[15:0]
DEST[31:16] ← Temp1[15:0]
DEST[47:32] ← Temp2[15:0]
DEST[63:48] ← Temp3[15:0]
DEST[79:64] ← Temp4[15:0]
DEST[95:80] ← Temp5[15:0]
DEST[111:96] ← Temp6[15:0]
DEST[127:112] ← Temp7[15:0]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PMULLW __m128i _mm_mullo_epi16 (__m128i a, __m128i b);
PMULLD __m128i _mm_mullo_epi32 (__m128i a, __m128i b);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

PMULUDQ - Multiply Packed Unsigned Doubleword Integers

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F F4 /r PMULUDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Multiply packed unsigned doubleword integers in xmm1 by packed unsigned doubleword integers in xmm2/m128, and store the quadword results in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F4 /r VPMULUDQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Multiply packed unsigned doubleword integers in xmm2 by packed unsigned doubleword integers in xmm3/m128, and store the quadword results in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Multiplies the first source operand by the second source operand and stores the result in the destination operand. The second source operand is two packed unsigned doubleword integers stored in the first (low) and third doublewords of an XMM register or a 128-bit memory location. The first source operand is two packed unsigned doubleword integers stored in the first and third doublewords of an XMM register. The destination contains two packed unsigned quadword integers stored in an XMM register. When a quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination element (that is, the carry is ignored).

For 128-bit memory operands, 128 bits are fetched from memory, but only the first and third doublewords are used in the computation.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
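
The element selection described above (first and third doublewords in, two quadwords out) can be sketched in scalar C. The name `pmuludq_model` and the array layout, with index 0 as the lowest doubleword, are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar model of PMULUDQ: only doublewords 0 and 2 of each
   source take part; each pair produces a full 64-bit unsigned product. */
void pmuludq_model(const uint32_t src1[4], const uint32_t src2[4],
                   uint64_t dst[2])
{
    dst[0] = (uint64_t)src1[0] * (uint64_t)src2[0];
    dst[1] = (uint64_t)src1[2] * (uint64_t)src2[2];
    /* src1[1], src1[3], src2[1], src2[3] are read but never used. */
}
```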

**Operation**

**VPMULUDQ (VEX.128 encoded version)**

DEST[63:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:64] ← SRC1[95:64] * SRC2[95:64]
DEST[255:128] ← 0

**PMULUDQ (128-bit Legacy SSE version)**

DEST[63:0] ← DEST[31:0] * SRC[31:0]
DEST[127:64] ← DEST[95:64] * SRC[95:64]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PMULUDQ __m128i _mm_mul_epu32 ( __m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PMULDQ - Multiply Packed Doubleword Integers

Description
Multiplies the first source operand by the second source operand and stores the result in the destination operand. The second source operand is two packed signed doubleword integers stored in the first (low) and third doublewords of an XMM register or a 128-bit memory location. The first source operand is two packed signed doubleword integers stored in the first and third doublewords of an XMM register. The destination contains two packed signed quadword integers stored in an XMM register. When a quadword result is too large to be represented in 64 bits (overflow), the result is wrapped around and the low 64 bits are written to the destination element (that is, the carry is ignored).

For 128-bit memory operands, 128 bits are fetched from memory, but only the first and third doublewords are used in the computation.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
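
A signed counterpart can be sketched the same way in scalar C. The name `pmuldq_model` and the array layout (index 0 is the lowest doubleword) are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical scalar model of PMULDQ: doublewords 0 and 2 of each source
   are sign-extended to 64 bits and multiplied. */
void pmuldq_model(const int32_t src1[4], const int32_t src2[4],
                  int64_t dst[2])
{
    dst[0] = (int64_t)src1[0] * (int64_t)src2[0];
    dst[1] = (int64_t)src1[2] * (int64_t)src2[2];
}
```

The largest magnitude this can produce is INT32_MIN times INT32_MIN, which equals 2^62 and still fits in a signed quadword.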

Operation
VPMULDQ (VEX.128 encoded version)
DEST[63:0] ← SRC1[31:0] * SRC2[31:0]
DEST[127:64] ← SRC1[95:64] * SRC2[95:64]
DEST[255:128] ← 0

PMULDQ (128-bit Legacy SSE version)
DEST[63:0] ← DEST[31:0] * SRC[31:0]
DEST[127:64] ← DEST[95:64] * SRC[95:64]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PMULDQ __m128i _mm_mul_epi32( __m128i a, __m128i b);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.
POR - Bitwise Logical Or

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F EB /r POR xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Bitwise OR of xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F EB /r VPOR xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Bitwise OR of xmm2 and xmm3/m128.</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical OR operation on the second source operand and the first source operand and stores the result in the destination operand. The second source operand is an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Each bit of the result is set to 1 if either or both of the corresponding bits of the first and second operands are 1; otherwise, it is set to 0.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.

Operation

VPOR (VEX.128 encoded version)
DEST ← SRC1 OR SRC2
DEST[255:128] ← 0

POR (128-bit Legacy SSE version)
DEST ← DEST OR SRC
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

POR __m128i _mm_or_si128 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

Ref. # 319433-005

PSADBW - Compute Sum of Absolute Differences

Description
Computes the absolute value of the difference of packed groups of 8 unsigned byte integers from the second source operand and from the first source operand. The first 8 differences are summed to produce an unsigned word integer that is stored in the low word of the destination; the second 8 differences are summed to produce an unsigned word integer in bits 79:64 of the destination. The remaining words of the destination are set to 0.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
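
The two-group summation described above can be sketched in scalar C. The name `psadbw_model` and the representation of the destination as eight 16-bit words (word 0 lowest) are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSADBW: each group of 8 bytes yields one
   sum of absolute differences, stored in word 0 (low group) or word 4
   (high group). The other destination words are cleared. */
void psadbw_model(const uint8_t src1[16], const uint8_t src2[16],
                  uint16_t dst[8])
{
    for (int half = 0; half < 2; half++) {
        uint16_t sum = 0;   /* at most 8 * 255 = 2040, so 16 bits suffice */
        for (int i = 0; i < 8; i++) {
            int a = src1[half * 8 + i];
            int b = src2[half * 8 + i];
            sum = (uint16_t)(sum + (a > b ? a - b : b - a));
        }
        dst[half * 4] = sum;
        dst[half * 4 + 1] = 0;
        dst[half * 4 + 2] = 0;
        dst[half * 4 + 3] = 0;
    }
}
```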

Operation
VPSADBW (VEX.128 encoded version)
TEMP0 ← ABS(SRC1[7:0] - SRC2[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 ← ABS(SRC1[127:120] - SRC2[127:120])
DEST[15:0] ← SUM(TEMP0:TEMP7)
DEST[63:16] ← 000000000000H
DEST[79:64] ← SUM(TEMP8:TEMP15)
DEST[127:80] ← 000000000000H
DEST[255:128] ← 0

PSADBW (128-bit Legacy SSE version)
TEMP0 ← ABS(DEST[7:0] - SRC[7:0])
(* Repeat operation for bytes 1 through 14 *)
TEMP15 ← ABS(DEST[127:120] - SRC[127:120])
DEST[15:0] ← SUM(TEMP0:TEMP7)
DEST[63:16] ← 000000000000H
DEST[79:64] ← SUM(TEMP8:TEMP15)
DEST[127:80] ← 000000000000H
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PSADBW __m128i _mm_sad_epu8(__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PSHUFB - Packed Shuffle Bytes

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 00 /r PSHUFB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Shuffles bytes in xmm1 according to contents of xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 00 /r VPSHUFB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffles bytes in xmm2 according to contents of xmm3/m128.</td>
</tr>
</tbody>
</table>

Description

Shuffles bytes in the first source operand according to the shuffle control mask in the second source operand. The instruction permutes byte data in the first source operand, leaving the shuffle mask unaffected. If the most significant bit (bit[7]) of each byte of the shuffle control mask is set, then constant zero is written in the result byte. Each byte element in the shuffle control mask provides an index field to select the byte element in the first source operand. The index field is defined as the least significant 4 bits of each byte element of the shuffle control mask. The first source and destination operands are XMM registers. The second source operand is either an XMM register or a 128-bit memory location.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: The destination operand is the first operand, the first source operand is the second operand, the second source operand is the third operand. Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise the instruction will #UD.
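
The control-mask rules described above (bit 7 zeroes the result byte, the low 4 bits select a source byte) can be sketched in scalar C. The name `pshufb_model` is hypothetical, and the sketch assumes `dst` does not alias `src1`; an in-place model of the legacy form would need to copy the source first.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSHUFB. For each result byte i:
   - if bit 7 of mask[i] is set, the result byte is 0;
   - otherwise bits 3:0 of mask[i] index a byte of src1.
   Bits 6:4 of the mask byte are ignored. */
void pshufb_model(const uint8_t src1[16], const uint8_t mask[16],
                  uint8_t dst[16])
{
    for (int i = 0; i < 16; i++) {
        if (mask[i] & 0x80) {
            dst[i] = 0;
        } else {
            dst[i] = src1[mask[i] & 0x0F];
        }
    }
}
```

A mask of 15, 14, ..., 0 reverses the byte order of the source; a mask of all 0x80 clears the destination.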

Operation

VPSHUFB (VEX.128 encoded version)

for i = 0 to 15 {
    if (SRC2[(i * 8)+7] == 1 ) then
        DEST[(i*8)+7..(i*8)+0] ← 0;
    else
        index[3..0] ← SRC2[(i*8)+3 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] ← SRC1[(index*8+7)..(index*8+0)];
    endif
}

DEST[255:128] ← 0

PSHUFB (128-bit Legacy SSE version)
for i = 0 to 15 {
    if (SRC[(i * 8)+7] == 1 ) then
        DEST[(i*8)+7..(i*8)+0] ← 0;
    else
        index[3..0] ← SRC[(i*8)+3 .. (i*8)+0];
        DEST[(i*8)+7..(i*8)+0] ← DEST[(index*8+7)..(index*8+0)];
    endif
}
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PSHUFB __m128i _mm_shuffle_epi8(__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PSHUFD - Shuffle Packed Doublewords

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 70 /r ib PSHUFD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F 70 /r ib VPSHUFD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle the doublewords in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
</tbody>
</table>

Description

Copies doublewords from source operand and inserts them in the destination operand at the locations selected with the immediate control operand. Figure 5-23 shows the operation of the PSHUFD instruction and the encoding of the order operand. Each 2-bit field in the order operand selects the contents of one doubleword location in the destination operand. For example, bits 0 and 1 of the order operand select the contents of doubleword 0 of the destination operand. The encoding of bits 0 and 1 of the order operand (see the field encoding in Figure 5-23) determines which doubleword from the source operand will be copied to doubleword 0 of the destination operand.

![Figure 5-23. PSHUFD Instruction Operation](image)

The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The order operand is an 8-bit immediate. Note that this instruction permits a doubleword in the source operand to be copied to more than one doubleword location in the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.
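
The way each 2-bit field of the order operand selects a source doubleword can be sketched in scalar C. The name `pshufd_model` and the array layout (index 0 is the lowest doubleword) are assumptions for this example, and `dst` is assumed not to alias `src`.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSHUFD: bits (2i+1):(2i) of the order
   operand pick which source doubleword is copied to destination
   doubleword i. */
void pshufd_model(const uint32_t src[4], uint8_t order, uint32_t dst[4])
{
    for (int i = 0; i < 4; i++) {
        unsigned sel = (order >> (2 * i)) & 0x3;
        dst[i] = src[sel];
    }
}
```

With this encoding, an order operand of 0x1B reverses the four doublewords, 0xE4 leaves them in place, and 0x00 replicates doubleword 0 into every position.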

Operation

VPSHUFD (VEX.128 encoded version)
DEST[31:0] ← (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] ← (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] ← (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] ← (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[255:128] ← 0

PSHUFD (128-bit Legacy SSE version)
DEST[31:0] ← (SRC >> (ORDER[1:0] * 32))[31:0];
DEST[63:32] ← (SRC >> (ORDER[3:2] * 32))[31:0];
DEST[95:64] ← (SRC >> (ORDER[5:4] * 32))[31:0];
DEST[127:96] ← (SRC >> (ORDER[7:6] * 32))[31:0];
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PSHUFD __m128i _mm_shuffle_epi32(__m128i a, int n)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally
#UD  If VEX.L = 1.
        If VEX.vvvv != 1111B.
PSHUFHW - Shuffle Packed High Words

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 70 /r ib PSHUFHW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.F3.0F 70 /r ib VPSHUFHW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle the high words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
</tbody>
</table>

Description

Copies words from the high quadword of the source operand and inserts them in the high quadword of the destination operand at word locations selected with the immediate operand. This operation is similar to the operation used by the PSHUFD instruction, which is illustrated in Figure 4-7 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B. For the PSHUFHW instruction, each 2-bit field in the immediate operand selects the contents of one word location in the high quadword of the destination operand. The binary encodings of the immediate operand fields select words (0, 1, 2 or 3) from the high quadword of the source operand to be copied to the destination operand. The low quadword of the source operand is copied to the low quadword of the destination operand.

The destination operand is an XMM register. The source operand can be an XMM register or a 128-bit memory location. Note that this instruction permits a word in the high quadword of the source operand to be copied to more than one word location in the high quadword of the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.
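
A scalar C sketch of the high-quadword shuffle described above follows. The name `pshufhw_model` and the array layout (words 0 to 3 form the low quadword, words 4 to 7 the high quadword) are assumptions for illustration, and `dst` is assumed not to alias `src`.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSHUFHW: the low quadword is copied
   unchanged; each 2-bit field of imm selects one of the four words of
   the high quadword for the corresponding high destination word. */
void pshufhw_model(const uint16_t src[8], uint8_t imm, uint16_t dst[8])
{
    for (int i = 0; i < 4; i++) {
        dst[i] = src[i];
        dst[4 + i] = src[4 + ((imm >> (2 * i)) & 0x3)];
    }
}
```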

Operation

VPSHUFHW (VEX.128 encoded version)

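DEST[63:0] ← SRC1[63:0]
DEST[79:64] ← (SRC1 >> (imm[1:0] * 16))[79:64]
DEST[95:80] ← (SRC1 >> (imm[3:2] * 16))[79:64]
DEST[111:96] ← (SRC1 >> (imm[5:4] * 16))[79:64]
DEST[127:112] ← (SRC1 >> (imm[7:6] * 16))[79:64]
DEST[255:128] ← 0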

PSHUFHW (128-bit Legacy SSE version)
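DEST[63:0] ← SRC[63:0]
DEST[79:64] ← (SRC >> (imm[1:0] * 16))[79:64]
DEST[95:80] ← (SRC >> (imm[3:2] * 16))[79:64]
DEST[111:96] ← (SRC >> (imm[5:4] * 16))[79:64]
DEST[127:112] ← (SRC >> (imm[7:6] * 16))[79:64]
DEST[255:128] (Unmodified)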

Intel C/C++ Compiler Intrinsic Equivalent

PSHUFHW __m128i _mm_shufflehi_epi16(__m128i a, int n)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD     If VEX.L = 1.
        If VEX.vvvv != 1111B.

PSHUFLW - Shuffle Packed Low Words

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 70 /r ib PSHUFLW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.F2.0F 70 /r ib VPSHUFLW xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle the low words in xmm2/m128 based on the encoding in imm8 and store the result in xmm1.</td>
</tr>
</tbody>
</table>

Description

Copies words from the low quadword of the source operand and inserts them in the low quadword of the destination operand at word locations selected with the immediate operand. This operation is similar to the operation used by the PSHUFD instruction, which is illustrated in Figure 5-23. For the PSHUFLW instruction, each 2-bit field in the immediate operand selects the contents of one word location in the low quadword of the destination operand. The binary encodings of the immediate operand fields select words (0, 1, 2 or 3) from the low quadword of the source operand to be copied to the destination operand. The high quadword of the source operand is copied to the high quadword of the destination operand.

The destination operand is an XMM register. The source operand can be an XMM register or a 128-bit memory location. Note that this instruction permits a word in the low quadword of the source operand to be copied to more than one word location in the low quadword of the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv is reserved and must be 1111b, VEX.L must be 0, otherwise the instruction will #UD.
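
The note that one source word may be copied to several destination words can be seen in a short scalar C sketch. The name `pshuflw_model` and the array layout (words 0 to 3 form the low quadword) are assumptions for this example, and `dst` is assumed not to alias `src`.

```c
#include <stdint.h>

/* Hypothetical scalar model of PSHUFLW: each 2-bit field of imm selects
   one of the four low source words; the high quadword is copied as is. */
void pshuflw_model(const uint16_t src[8], uint8_t imm, uint16_t dst[8])
{
    for (int i = 0; i < 4; i++) {
        dst[i] = src[(imm >> (2 * i)) & 0x3];
        dst[4 + i] = src[4 + i];
    }
}
```

An immediate of 0x55 sets every 2-bit field to 1, so word 1 of the source is replicated into all four low words.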

Operation

VPSHUFLW (VEX.128 encoded version)

DEST[15:0] ← (SRC1 >> (imm[1:0] * 16))[15:0]
DEST[31:16] ← (SRC1 >> (imm[3:2] * 16))[15:0]
DEST[47:32] ← (SRC1 >> (imm[5:4] * 16))[15:0]
DEST[63:48] ← (SRC1 >> (imm[7:6] * 16))[15:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

PSHUFLW (128-bit Legacy SSE version)
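DEST[15:0] ← (SRC >> (imm[1:0] * 16))[15:0]
DEST[31:16] ← (SRC >> (imm[3:2] * 16))[15:0]
DEST[47:32] ← (SRC >> (imm[5:4] * 16))[15:0]
DEST[63:48] ← (SRC >> (imm[7:6] * 16))[15:0]
DEST[127:64] ← SRC[127:64]
DEST[255:128] (Unmodified)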

Intel C/C++ Compiler Intrinsic Equivalent
PSHUFLW __m128i _mm_shufflelo_epi16(__m128i a, int n)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
    If VEX.vvvv != 1111B.
PSIGNB/PSIGNW/PSIGND - Packed SIGN

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 08 /r PSIGNB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Negate/zero/preserve packed byte integers in xmm1 depending on the corresponding sign in xmm2/m128.</td>
</tr>
<tr>
<td>66 0F 38 09 /r PSIGNW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Negate/zero/preserve packed word integers in xmm1 depending on the corresponding sign in xmm2/m128.</td>
</tr>
<tr>
<td>66 0F 38 0A /r PSIGND xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSSE3</td>
<td>Negate/zero/preserve packed doubleword integers in xmm1 depending on the corresponding sign in xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 08 /r VPSIGNB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Negate/zero/preserve packed byte integers in xmm2 depending on the corresponding sign in xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 09 /r VPSIGNW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Negate/zero/preserve packed word integers in xmm2 depending on the corresponding sign in xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 0A /r VPSIGND xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Negate/zero/preserve packed doubleword integers in xmm2 depending on the corresponding sign in xmm3/m128.</td>
</tr>
</tbody>
</table>

Description

PSIGNB/PSIGNW/PSIGND negates each data element of the first source operand if the signed integer value of the corresponding data element in the second source operand is less than zero. If the signed integer value of a data element in the second source operand is positive, the corresponding data element in the first source operand is unchanged. If a data element in the second source operand is zero, the corresponding data element in the first source operand is set to zero.

PSIGNB operates on signed bytes. PSIGNW operates on 16-bit signed words. PSIGND operates on signed 32-bit integers.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.

Operation

BYTE_SIGN(SRC1, SRC2)

if (SRC2[7:0] < 0)
    DEST[7:0] ← Neg(SRC1[7:0])
else if (SRC2[7:0] == 0)
    DEST[7:0] ← 0
else if (SRC2[7:0] > 0)
    DEST[7:0] ← SRC1[7:0]

Repeat operation for 2nd through 15th bytes

if (SRC2[127:120] < 0)
    DEST[127:120] ← Neg(SRC1[127:120])
else if (SRC2[127:120] == 0)
    DEST[127:120] ← 0
else if (SRC2[127:120] > 0)
    DEST[127:120] ← SRC1[127:120]

WORD_SIGN(SRC1, SRC2)

if (SRC2[15:0] < 0)
    DEST[15:0] ← Neg(SRC1[15:0])
else if (SRC2[15:0] == 0)
    DEST[15:0] ← 0
else if (SRC2[15:0] > 0)
    DEST[15:0] ← SRC1[15:0]

Repeat operation for 2nd through 7th words

if (SRC2[127:112] < 0)
    DEST[127:112] ← Neg(SRC1[127:112])
else if (SRC2[127:112] == 0)
    DEST[127:112] ← 0
else if (SRC2[127:112] > 0)
    DEST[127:112] ← SRC1[127:112]

DWORD_SIGN(SRC1, SRC2)

if (SRC2[31:0] < 0)
    DEST[31:0] ← Neg(SRC1[31:0])
else if (SRC2[31:0] == 0)
    DEST[31:0] ← 0
else if (SRC2[31:0] > 0)
    DEST[31:0] ← SRC1[31:0]

Repeat operation for 2nd through 3rd doublewords

if (SRC2[127:96] < 0)
    DEST[127:96] ← Neg(SRC1[127:96])
else if (SRC2[127:96] == 0)
    DEST[127:96] ← 0
else if (SRC2[127:96] > 0)
    DEST[127:96] ← SRC1[127:96]

VPSIGNB (VEX.128 encoded version)
DEST[127:0] ← BYTE_SIGN(SRC1, SRC2)
DEST[255:128] ← 0

PSIGNB (128-bit Legacy SSE version)
DEST[127:0] ← BYTE_SIGN(DEST, SRC)
DEST[255:128] (Unmodified)

VPSIGNW (VEX.128 encoded version)
DEST[127:0] ← WORD_SIGN(SRC1, SRC2)
DEST[255:128] ← 0

PSIGNW (128-bit Legacy SSE version)
DEST[127:0] ← WORD_SIGN(DEST, SRC)
DEST[255:128] (Unmodified)

VPSIGND (VEX.128 encoded version)
DEST[127:0] ← DWORD_SIGN(SRC1, SRC2)
DEST[255:128] ← 0

PSIGND (128-bit Legacy SSE version)
DEST[127:0] ← DWORD_SIGN(DEST, SRC)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PSIGNB __m128i _mm_sign_epi8 (__m128i a, __m128i b)
PSIGNW __m128i _mm_sign_epi16 (__m128i a, __m128i b)
PSIGND __m128i _mm_sign_epi32 (__m128i a, __m128i b)
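
The BYTE_SIGN pseudocode can be sketched as a scalar C loop. This is an illustrative model only: the name `byte_sign_model` is hypothetical, and the sketch assumes that Neg() is two's-complement negation, so the most negative byte (-128) negates to itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of BYTE_SIGN(SRC1, SRC2) over sixteen byte lanes. */
static void byte_sign_model(int8_t dst[16], const int8_t src1[16],
                            const int8_t src2[16])
{
    int8_t tmp[16];
    assert(dst != NULL && src1 != NULL && src2 != NULL);
    for (int i = 0; i < 16; i++) {
        if (src2[i] < 0) {
            /* Negate in unsigned arithmetic so -128 wraps back to -128.
             * The final conversion to int8_t is implementation-defined
             * before C23; mainstream compilers wrap modulo 256. */
            tmp[i] = (int8_t)(uint8_t)(0u - (uint8_t)src1[i]);
        } else if (src2[i] == 0) {
            tmp[i] = 0;
        } else {
            tmp[i] = src1[i];
        }
    }
    memcpy(dst, tmp, sizeof tmp);   /* tmp permits dst == src1 (legacy form) */
}
```

The word and doubleword variants follow the same pattern with 16-bit and 32-bit lanes.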

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PSLLDQ - Byte Shift Left

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 73 /7 ib PSLLDQ xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift xmm1 left by imm8 bytes while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 73 /7 ib VPSLLDQ xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift xmm2 left by imm8 bytes while shifting in 0s and store result in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Shifts the source operand to the left by the number of bytes specified in the count operand. The empty low-order bytes are cleared (set to all 0s). If the value specified by the count operand is greater than 15, the destination operand is set to all 0s. The source and destination operands are XMM registers. The count operand is an 8-bit immediate.

128-bit Legacy SSE version: The source and destination operands are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv encodes the destination register, and VEX.B + ModRM.r/m encodes the source register. VEX.L must be 0, otherwise instructions will #UD.

**Operation**

**VPSLLDQ (VEX.128 encoded version)**

TEMP ← COUNT
IF (TEMP > 15) THEN TEMP ← 16; FI
DEST ← SRC << (TEMP * 8)
DEST[255:128] ← 0

**PSLLDQ (128-bit Legacy SSE version)**

TEMP ← COUNT
IF (TEMP > 15) THEN TEMP ← 16; FI
DEST ← DEST << (TEMP * 8)
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PSLLDQ __m128i _mm_slli_si128 (__m128i a, int imm)
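
As a rough illustration of the byte shift, the following scalar C sketch models the low 128 bits as sixteen bytes. The name `pslldq_model` and the convention that index 0 is the least significant byte are assumptions of this sketch, not part of the instruction definition.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of PSLLDQ: shift 16 bytes left by `count` bytes.
 * Index 0 is the least significant byte (bits 7:0). */
static void pslldq_model(uint8_t dst[16], const uint8_t src[16], uint8_t count)
{
    uint8_t tmp[16] = {0};                     /* vacated low bytes stay 0 */
    assert(dst != NULL && src != NULL);
    unsigned n = count > 15 ? 16u : count;     /* IF (TEMP > 15) TEMP <- 16 */
    for (unsigned i = n; i < 16; i++) {
        tmp[i] = src[i - n];                   /* bytes move toward the high end */
    }
    memcpy(dst, tmp, sizeof tmp);              /* tmp permits dst == src */
}
```

Any count above 15 leaves `tmp` untouched, so the result is all 0s, matching the description.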

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally
#UD If VEX.L = 1.
### PSRLDQ - Byte Shift Right

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 73 /3 ib PSRLDQ xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift xmm1 right by imm8 bytes while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 73 /3 ib VPSRLDQ xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift xmm2 right by imm8 bytes while shifting in 0s and store result in xmm1.</td>
</tr>
</tbody>
</table>

**Description**

Shifts the source operand to the right by the number of bytes specified in the count operand. The empty high-order bytes are cleared (set to all 0s). If the value specified by the count operand is greater than 15, the destination operand is set to all 0s.

The source and destination operands are XMM registers. The count operand is an 8-bit immediate.

128-bit Legacy SSE version: The source and destination operands are the same. Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.vvvv encodes the destination register, and VEX.B + ModRM.r/m encodes the source register. VEX.L must be 0, otherwise instructions will #UD.

**Operation**

**VPSRLDQ (VEX.128 encoded version)**

TEMP ← COUNT
IF (TEMP > 15) THEN TEMP ← 16; FI
DEST ← SRC >> (TEMP * 8)
DEST[255:128] ← 0

**PSRLDQ (128-bit Legacy SSE version)**

TEMP ← COUNT
IF (TEMP > 15) THEN TEMP ← 16; FI
DEST ← DEST >> (TEMP * 8)
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PSRLDQ __m128i _mm_srli_si128 (__m128i a, int imm)
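
The right byte shift mirrors the left one. A minimal scalar C sketch, under the same assumptions as before (hypothetical name `psrldq_model`, index 0 is the least significant byte, low 128 bits only):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of PSRLDQ: shift 16 bytes right by `count` bytes. */
static void psrldq_model(uint8_t dst[16], const uint8_t src[16], uint8_t count)
{
    uint8_t tmp[16] = {0};                     /* vacated high bytes stay 0 */
    assert(dst != NULL && src != NULL);
    unsigned n = count > 15 ? 16u : count;     /* IF (TEMP > 15) TEMP <- 16 */
    for (unsigned i = 0; i + n < 16; i++) {
        tmp[i] = src[i + n];                   /* bytes move toward the low end */
    }
    memcpy(dst, tmp, sizeof tmp);              /* tmp permits dst == src */
}
```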

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 7; additionally
#UD If VEX.L = 1.
## PSLLW/PSLLD/PSLLQ - Bit Shift Left

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F F1 /r PSLLW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 left by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 71 /6 ib PSLLW xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 left by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F F2 /r PSLLD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 left by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 72 /6 ib PSLLD xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 left by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F F3 /r PSLLQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift quadwords in xmm1 left by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 73 /6 ib PSLLQ xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift quadwords in xmm1 left by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F1 /r VPSLLW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 left by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 71 /6 ib VPSLLW xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 left by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F2 /r VPSLLD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 72 /6 ib VPSLLD xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 left by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F3 /r VPSLLQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift quadwords in xmm2 left by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 73 /6 ib VPSLLQ xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift quadwords in xmm2 left by imm8 while shifting in 0s.</td>
</tr>
</tbody>
</table>

Description

Shifts the bits in the individual data elements (words, doublewords, or quadwords) in the first source operand to the left by the number of bits specified in the count operand. As the bits in the data elements are shifted left, the empty low-order bits are cleared (set to 0). If the value specified by the count operand is greater than 15 (for words), 31 (for doublewords), or 63 (for quadwords), then the destination operand is set to all 0s.

The destination and first source operands are XMM registers. The count operand can be an XMM register, a 128-bit memory location, or an 8-bit immediate. Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.

The PSLLW instruction shifts each of the words in the first source operand to the left by the number of bits specified in the count operand; the PSLLD instruction shifts each of the doublewords in the first source operand; and the PSLLQ instruction shifts the quadword (or quadwords) in the first source operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. For shifts with an immediate count (VEX.128.66.0F 71-73 /6), VEX.vvvv encodes the destination register, and VEX.B + ModRM.r/m encodes the source register. VEX.L must be 0, otherwise instructions will #UD. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

Operation

LOGICAL_LEFT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
    DEST[127:0] ← 00000000000000000000000000000000H
ELSE
    DEST[15:0] ← ZeroExtend(SRC[15:0] << COUNT);
    (* Repeat shift operation for 2nd through 7th words *)
    DEST[127:112] ← ZeroExtend(SRC[127:112] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 31)
THEN 
    DEST[127:0] ← 00000000000000000000000000000000H
ELSE
    DEST[31:0] ← ZeroExtend(SRC[31:0] << COUNT);
    (* Repeat shift operation for 2nd through 3rd doublewords *)
    DEST[127:96] ← ZeroExtend(SRC[127:96] << COUNT);
FI;

LOGICAL_LEFT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 63)
THEN 
    DEST[127:0] ← 00000000000000000000000000000000H
ELSE
    DEST[63:0] ← ZeroExtend(SRC[63:0] << COUNT);
    DEST[127:64] ← ZeroExtend(SRC[127:64] << COUNT);
FI;

VPSLLW (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_WORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSLLW (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_WORDS(SRC1, imm8)
DEST[255:128] ← 0

PSLLW (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSLLW (xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_WORDS(DEST, imm8)
DEST[255:128] (Unmodified)

VPSLLD (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_DWORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSLLD (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_DWORDS(SRC1, imm8)
DEST[255:128] ← 0

PSLLD (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_DWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSLLD (xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_DWORDS(DEST, imm8)
DEST[255:128] (Unmodified)

VPSLLQ (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_QWORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSLLQ (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_QWORDS(SRC1, imm8)
DEST[255:128] ← 0

PSLLQ (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_QWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSLLQ (xmm, imm8)
DEST[127:0] ← LOGICAL_LEFT_SHIFT_QWORDS(DEST, imm8)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
PSLLW __m128i _mm_slli_epi16 (__m128i m, int count)
PSLLW __m128i _mm_sll_epi16 (__m128i m, __m128i count)
PSLLD __m128i _mm_slli_epi32 (__m128i m, int count)
PSLLD __m128i _mm_sll_epi32 (__m128i m, __m128i count)
PSLLQ __m128i _mm_slli_epi64 (__m128i m, int count)
PSLLQ __m128i _mm_sll_epi64 (__m128i m, __m128i count)
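
The count handling in LOGICAL_LEFT_SHIFT_WORDS can be sketched in scalar C. The name `psllw_model` is hypothetical, and the count is passed as the 64-bit value COUNT_SRC[63:0] that the pseudocode extracts; treat this as an illustration rather than a reference implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of LOGICAL_LEFT_SHIFT_WORDS over eight words.
 * `count` stands for COUNT_SRC[63:0]. */
static void psllw_model(uint16_t dst[8], const uint16_t src[8], uint64_t count)
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 8; i++) {
        if (count > 15) {
            dst[i] = 0;   /* whole destination becomes 0 */
        } else {
            /* Widen first, then keep only the low 16 bits of the result. */
            dst[i] = (uint16_t)((uint32_t)src[i] << count);
        }
    }
}
```

Because the full 64-bit count is compared, a count such as 2^40 zeroes the result even though its low byte is 0.
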
SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4 and 7 for non-VEX-encoded instructions.
#UD If VEX.L = 1.

PSRAW/PSRAD - Bit Shift Arithmetic Right

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E1 /r PSRAW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 right by amount specified in xmm2/m128 while shifting in sign bits.</td>
</tr>
<tr>
<td>66 0F 71 /4 ib PSRAW xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 right by imm8 while shifting in sign bits.</td>
</tr>
<tr>
<td>66 0F E2 /r PSRAD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 right by amount specified in xmm2/m128 while shifting in sign bits.</td>
</tr>
<tr>
<td>66 0F 72 /4 ib PSRAD xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 right by imm8 while shifting in sign bits.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E1 /r VPSRAW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 71 /4 ib VPSRAW xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 right by imm8 while shifting in sign bits.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E2 /r VPSRAD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in sign bits.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 72 /4 ib VPSRAD xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 right by imm8 while shifting in sign bits.</td>
</tr>
</tbody>
</table>

Description
Shifts the bits in the individual data elements (words or doublewords) in the first source operand to the right by the number of bits specified in the count operand. As the bits in the data elements are shifted right, the empty high-order bits are filled with the initial value of the sign bit of the data element. If the value specified by the count operand is greater than 15 (for words) or 31 (for doublewords), each destination data element is filled with the initial value of the sign bit of the element.

The destination and first source operands are XMM registers. The count operand can be an XMM register, a 128-bit memory location, or an 8-bit immediate. Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.

The PSRAW instruction shifts each of the words in the first source operand to the right by the number of bits specified in the count operand; the PSRAD instruction shifts each of the doublewords in the first source operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. For shifts with an immediate count (VEX.128.66.0F 71-72 /4), VEX.vvvv encodes the destination register, and VEX.B + ModRM.r/m encodes the source register. VEX.L must be 0, otherwise instructions will #UD. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

**Operation**

**ARITHMETIC_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)**

COUNT ← COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
    COUNT ← 16;
FI;
DEST[15:0] ← SignExtend(SRC[15:0] >> COUNT);
(* Repeat shift operation for 2nd through 7th words *)
DEST[127:112] ← SignExtend(SRC[127:112] >> COUNT);

**ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)**

COUNT ← COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
    COUNT ← 32;
FI;
DEST[31:0] ← SignExtend(SRC[31:0] >> COUNT);
(* Repeat shift operation for 2nd through 3rd doublewords *)
DEST[127:96] ← SignExtend(SRC[127:96] >> COUNT);

**VPSRAW (xmm, xmm, xmm/m128)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[255:128] ← 0

**VPSRAW (xmm, xmm, imm8)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[255:128] ← 0

**PSRAW (xmm, xmm/m128)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

**PSRAW (xmm, imm8)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[255:128] (Unmodified)

**VPSRAD (xmm, xmm, xmm/m128)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[255:128] ← 0

**VPSRAD (xmm, xmm, imm8)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[255:128] ← 0

**PSRAD (xmm, xmm/m128)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

**PSRAD (xmm, imm8)**
DEST[127:0] ← ARITHMETIC_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PSRAW __m128i _mm_srai_epi16 (__m128i m, int count)
PSRAW __m128i _mm_sra_epi16 (__m128i m, __m128i count)
PSRAD __m128i _mm_srai_epi32 (__m128i m, int count)
PSRAD __m128i _mm_sra_epi32 (__m128i m, __m128i count)
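
The clamp-then-shift behavior of ARITHMETIC_RIGHT_SHIFT_WORDS can be illustrated with a scalar C sketch. The name `psraw_model` is hypothetical. Because right-shifting a negative signed value is implementation-defined in C, the sketch shifts the complement of negative inputs instead; that is a portability choice of this example, not something the instruction definition prescribes.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of ARITHMETIC_RIGHT_SHIFT_WORDS over eight words.
 * `count` stands for COUNT_SRC[63:0]. */
static void psraw_model(int16_t dst[8], const int16_t src[8], uint64_t count)
{
    assert(dst != NULL && src != NULL);
    unsigned n = count > 15 ? 16u : (unsigned)count;   /* COUNT <- 16 if > 15 */
    for (int i = 0; i < 8; i++) {
        int32_t wide = src[i];          /* sign-extend so a shift by 16 is valid */
        /* For negative values, ~wide is non-negative; shifting it and
         * complementing again fills the vacated bits with 1s (sign bits). */
        int32_t shifted = wide < 0 ? ~(~wide >> n) : (wide >> n);
        dst[i] = (int16_t)shifted;
    }
}
```

With a count above 15, every lane collapses to 0 for non-negative inputs and to -1 (all sign bits) for negative inputs.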

SIMD Floating-Point Exceptions

None
Other Exceptions
See Exceptions Type 4 and 7 for non-VEX-encoded instructions.

#UD If VEX.L = 1.
### PSRLW/PSRLD/PSRLQ - Shift Packed Data Right Logical

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F D1 /r PSRLW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 71 /2 ib PSRLW xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift words in xmm1 right by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F D2 /r PSRLD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 72 /2 ib PSRLD xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift doublewords in xmm1 right by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F D3 /r PSRLQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift quadwords in xmm1 right by amount specified in xmm2/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>66 0F 73 /2 ib PSRLQ xmm1, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shift quadwords in xmm1 right by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D1 /r VPSRLW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 71 /2 ib VPSRLW xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift words in xmm2 right by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D2 /r VPSRLD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 72 /2 ib VPSRLD xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift doublewords in xmm2 right by imm8 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D3 /r VPSRLQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift quadwords in xmm2 right by amount specified in xmm3/m128 while shifting in 0s.</td>
</tr>
<tr>
<td>VEX.NDD.128.66.0F 73 /2 ib VPSRLQ xmm1, xmm2, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shift quadwords in xmm2 right by imm8 while shifting in 0s.</td>
</tr>
</tbody>
</table>

Ref. # 319433-005
Description

Shifts the bits in the individual data elements (words, doublewords, or quadword) in the first source operand to the right by the number of bits specified in the count operand. As the bits in the data elements are shifted right, the empty high-order bits are cleared (set to 0). If the value specified by the count operand is greater than 15 (for words), 31 (for doublewords), or 63 (for a quadword), then the destination operand is set to all 0s.

The destination and first source operands are XMM registers. The count operand can be an XMM register, a 128-bit memory location, or an 8-bit immediate. Note that only the low 64 bits of a 128-bit count operand are checked to compute the count.

The PSRLW instruction shifts each of the words in the first source operand to the right by the number of bits specified in the count operand; the PSRLD instruction shifts each of the doublewords in the first source operand; and the PSRLQ instruction shifts the quadword (or quadwords) in the first source operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. For shifts with an immediate count (VEX.128.66.0F 71-73 /2), VEX.vvvv encodes the destination register, and VEX.B + ModRM.r/m encodes the source register. VEX.L must be 0, otherwise instructions will #UD. If the count operand is a memory address, 128 bits are loaded but the upper 64 bits are ignored.

Operation

LOGICAL_RIGHT_SHIFT_WORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 15)
THEN
   DEST[127:0] ← 00000000000000000000000000000000H
ELSE
   DEST[15:0] ← ZeroExtend(SRC[15:0] >> COUNT);
   (* Repeat shift operation for 2nd through 7th words *)
   DEST[127:112] ← ZeroExtend(SRC[127:112] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_DWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 31)
THEN
   DEST[127:0] ← 00000000000000000000000000000000H
ELSE
   DEST[31:0] ← ZeroExtend(SRC[31:0] >> COUNT);
   (* Repeat shift operation for 2nd through 3rd doublewords *)
   DEST[127:96] ← ZeroExtend(SRC[127:96] >> COUNT);
FI;

LOGICAL_RIGHT_SHIFT_QWORDS(SRC, COUNT_SRC)
COUNT ← COUNT_SRC[63:0];
IF (COUNT > 63)
THEN
   DEST[127:0] ← 00000000000000000000000000000000H
ELSE
   DEST[63:0] ← ZeroExtend(SRC[63:0] >> COUNT);
   DEST[127:64] ← ZeroExtend(SRC[127:64] >> COUNT);
FI;

VPSRLW (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_WORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSRLW (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_WORDS(SRC1, imm8)
DEST[255:128] ← 0

PSRLW (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSRLW (xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_WORDS(DEST, imm8)
DEST[255:128] (Unmodified)

VPSRLD (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSRLD (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_DWORDS(SRC1, imm8)
DEST[255:128] ← 0

PSRLD (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_DWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSRLD (xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_DWORDS(DEST, imm8)
DEST[255:128] (Unmodified)

VPSRLQ (xmm, xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, SRC2)
DEST[255:128] ← 0

VPSRLQ (xmm, xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_QWORDS(SRC1, imm8)
DEST[255:128] ← 0

PSRLQ (xmm, xmm/m128)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_QWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

PSRLQ (xmm, imm8)
DEST[127:0] ← LOGICAL_RIGHT_SHIFT_QWORDS(DEST, imm8)
DEST[255:128] (Unmodified)

*Intel C/C++ Compiler Intrinsic Equivalent*

PSRLW __m128i _mm_srli_epi16 (__m128i m, int count)
PSRLW __m128i _mm_srl_epi16 (__m128i m, __m128i count)
PSRLD __m128i _mm_srli_epi32 (__m128i m, int count)
PSRLD __m128i _mm_srl_epi32 (__m128i m, __m128i count)
PSRLQ __m128i _mm_srli_epi64 (__m128i m, int count)
PSRLQ __m128i _mm_srl_epi64 (__m128i m, __m128i count)
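
For comparison with the arithmetic shifts, here is a scalar C sketch of LOGICAL_RIGHT_SHIFT_DWORDS. The name `psrld_model` is hypothetical and the sketch covers only the low 128 bits; the vacated high-order bits are always 0 regardless of the sign bit.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical scalar model of LOGICAL_RIGHT_SHIFT_DWORDS over four doublewords.
 * `count` stands for COUNT_SRC[63:0]. */
static void psrld_model(uint32_t dst[4], const uint32_t src[4], uint64_t count)
{
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 4; i++) {
        /* Counts above 31 zero the destination; otherwise shift in 0s. */
        dst[i] = count > 31 ? 0u : (src[i] >> (unsigned)count);
    }
}
```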

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4 and 7 for non-VEX-encoded instructions.
#UD If VEX.L = 1.
PTEST - Packed Bit Test

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 17 /r PTEST xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Set ZF and CF depending on bitwise AND and ANDN of sources</td>
</tr>
<tr>
<td>VEX.128.66.0F38 17 /r VPTEST xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on bitwise AND and ANDN of sources</td>
</tr>
<tr>
<td>VEX.256.66.0F38 17 /r VPTEST ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on bitwise AND and ANDN of sources</td>
</tr>
<tr>
<td>VEX.128.66.0F38 0E /r VTESTPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources</td>
</tr>
<tr>
<td>VEX.256.66.0F38 0E /r VTESTPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on sign bit AND and ANDN of packed single-precision floating-point sources</td>
</tr>
<tr>
<td>VEX.128.66.0F38 0F /r VTESTPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources</td>
</tr>
<tr>
<td>VEX.256.66.0F38 0F /r VTESTPD ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Set ZF and CF depending on sign bit AND and ANDN of packed double-precision floating-point sources</td>
</tr>
</tbody>
</table>

Description

PTEST and VPTEST set the ZF flag if the bitwise AND of the first source operand (first operand) and the second source operand (second operand) is all 0s. They set the CF flag if the bitwise AND of the second source operand and the logical NOT of the first source operand is all 0s. The Operation section refers to the first source operand as DEST and the second as SRC.

VTESTPS performs a bitwise comparison of all the sign bits of the packed single-precision elements in the first source operand and the corresponding sign bits in the second source operand. If the AND of the source sign bits with the dest sign bits produces all zeros, ZF is set; otherwise ZF is cleared. If the AND of the source sign bits with the inverted dest sign bits produces all zeros, CF is set; otherwise CF is cleared.

VTESTPD performs a bitwise comparison of all the sign bits of the packed double-precision elements in the first source operand and the corresponding sign bits in the second source operand. If the AND of the source sign bits with the dest sign bits produces all zeros, ZF is set; otherwise ZF is cleared. If the AND of the source sign bits with the inverted dest sign bits produces all zeros, CF is set; otherwise CF is cleared.

The first source register is specified by the ModR/M reg field.

VEX.256 encoded version: The first source register is a YMM register. The second source register can be a YMM register or a 256-bit memory location. The destination register is not modified.

128-bit version: The first source register is an XMM register. The second source register can be an XMM register or a 128-bit memory location. The destination register is not modified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VPTEST (VEX.256 encoded version)
IF (SRC[255:0] BITWISE AND DEST[255:0] == 0) THEN ZF ← 1;
ELSE ZF ← 0;
IF (SRC[255:0] BITWISE AND NOT DEST[255:0] == 0) THEN CF ← 1;
ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;

PTEST (128-bit versions)
IF (SRC[127:0] BITWISE AND DEST[127:0] == 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
IF (SRC[127:0] BITWISE AND NOT DEST[127:0] == 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;

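VTESTPS (VEX.256 encoded version)
TEMP[255:0] ← SRC[255:0] AND DEST[255:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = TEMP[159] = TEMP[191] = TEMP[223] = TEMP[255] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[255:0] ← SRC[255:0] AND NOT DEST[255:0]
IF (TEMP[31] = TEMP[63] = TEMP[95] = TEMP[127] = TEMP[159] = TEMP[191] = TEMP[223] = TEMP[255] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;

VTESTPD (VEX.256 encoded version)
TEMP[255:0] ← SRC[255:0] AND DEST[255:0]
IF (TEMP[63] = TEMP[127] = TEMP[191] = TEMP[255] = 0)
    THEN ZF ← 1;
    ELSE ZF ← 0;
TEMP[255:0] ← SRC[255:0] AND NOT DEST[255:0]
IF (TEMP[63] = TEMP[127] = TEMP[191] = TEMP[255] = 0)
    THEN CF ← 1;
    ELSE CF ← 0;
DEST (unmodified)
AF ← OF ← PF ← SF ← 0;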

**Intel C/C++ Compiler Intrinsic Equivalent**

VPTEST

int _mm256_testz_si256 (__m256i s1, __m256i s2);
int _mm256_testc_si256 (__m256i s1, __m256i s2);
int _mm256_testnzc_si256 (__m256i s1, __m256i s2);
int _mm_testz_si128 (__m128i s1, __m128i s2);
int _mm_testc_si128 (__m128i s1, __m128i s2);
int _mm_testnzc_si128 (__m128i s1, __m128i s2);

VTESTPS

int _mm256_testz_ps (__m256 s1, __m256 s2);
int _mm256_testc_ps (__m256 s1, __m256 s2);
int _mm256_testnzc_ps (__m256 s1, __m256 s2);
int _mm_testz_ps (__m128 s1, __m128 s2);
int _mm_testc_ps (__m128 s1, __m128 s2);
int _mm_testnzc_ps (__m128 s1, __m128 s2);

VTESTPD

int _mm256_testz_pd (__m256d s1, __m256d s2);
int _mm256_testc_pd (__m256d s1, __m256d s2);
int _mm256_testnzc_pd (__m256d s1, __m256d s2);
int _mm_testz_pd (__m128d s1, __m128d s2);
int _mm_testc_pd (__m128d s1, __m128d s2);
int _mm_testnzc_pd (__m128d s1, __m128d s2);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.vvvv != 1111B.

**PSUBB/PSUBW/PSUBD/PSUBQ -Packed Integer Subtract**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F F8 /r PSUBB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed byte integers in xmm2/m128 from xmm1.</td>
</tr>
<tr>
<td>66 0F F9 /r PSUBW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed word integers in xmm2/m128 from xmm1.</td>
</tr>
<tr>
<td>66 0F FA /r PSUBD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed doubleword integers in xmm2/m128 from xmm1.</td>
</tr>
<tr>
<td>66 0F FB /r PSUBQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed quadword integers in xmm2/m128 from xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F8 /r VPSUBB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed byte integers in xmm3/m128 from xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F F9 /r VPSUBW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed word integers in xmm3/m128 from xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F FA /r VPSUBD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed doubleword integers in xmm3/m128 from xmm2.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F FB /r VPSUBQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed quadword integers in xmm3/m128 from xmm2.</td>
</tr>
</tbody>
</table>

**Description**

Subtracts the packed byte, word, doubleword, or quadword integers in the second source operand from the first source operand and stores the result in the destination operand. The second source operand is an XMM register or a 128-bit memory location. The first source operand and destination operands are XMM registers. When a result is too large to be represented in the 8/16/32/64-bit integer (overflow), the result is wrapped around and the low bits are written to the destination element (that is, the carry is ignored).

Note that these instructions can operate on either unsigned or signed (two’s complement notation) integers; however, they do not set bits in the EFLAGS register to indicate overflow and/or a carry. To prevent undetected overflow conditions, software must control the ranges of the values operated on.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.
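
The wraparound behavior described above can be illustrated with a scalar model of the byte form. This is a minimal sketch in portable C, not the instruction or its intrinsic; `psubb_lane` and `psubb_model` are hypothetical names, and array index 0 is assumed to hold bits 7:0 of the operand.

```c
#include <stdint.h>

/* Scalar model of one byte lane: only the low 8 bits of the
   difference are kept, so the borrow is discarded. */
uint8_t psubb_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a - b);
}

/* Apply the lane model to all 16 bytes of a 128-bit operand. */
void psubb_model(uint8_t dst[16], const uint8_t src1[16], const uint8_t src2[16])
{
    for (int i = 0; i < 16; i++)
        dst[i] = psubb_lane(src1[i], src2[i]);
}
```

For example, `psubb_lane(5, 10)` yields 251 (FBH), which is also the two's complement encoding of -5. The same bit pattern is produced whether the inputs are interpreted as signed or unsigned, which is why no flag reports the overflow.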

### Operation

**VPSUBB (VEX.128 encoded version)**

DEST[7:0] ← SRC1[7:0]-SRC2[7:0]
DEST[15:8] ← SRC1[15:8]-SRC2[15:8]
DEST[23:16] ← SRC1[23:16]-SRC2[23:16]
DEST[31:24] ← SRC1[31:24]-SRC2[31:24]
DEST[39:32] ← SRC1[39:32]-SRC2[39:32]
DEST[47:40] ← SRC1[47:40]-SRC2[47:40]
DEST[55:48] ← SRC1[55:48]-SRC2[55:48]
DEST[63:56] ← SRC1[63:56]-SRC2[63:56]
DEST[71:64] ← SRC1[71:64]-SRC2[71:64]
DEST[79:72] ← SRC1[79:72]-SRC2[79:72]
DEST[87:80] ← SRC1[87:80]-SRC2[87:80]
DEST[95:88] ← SRC1[95:88]-SRC2[95:88]
DEST[103:96] ← SRC1[103:96]-SRC2[103:96]
DEST[111:104] ← SRC1[111:104]-SRC2[111:104]
DEST[119:112] ← SRC1[119:112]-SRC2[119:112]
DEST[127:120] ← SRC1[127:120]-SRC2[127:120]
DEST[255:128] ← 0

**PSUBB (128-bit Legacy SSE version)**

DEST[7:0] ← DEST[7:0]-SRC[7:0]
DEST[15:8] ← DEST[15:8]-SRC[15:8]
DEST[23:16] ← DEST[23:16]-SRC[23:16]
DEST[31:24] ← DEST[31:24]-SRC[31:24]
DEST[39:32] ← DEST[39:32]-SRC[39:32]
DEST[47:40] ← DEST[47:40]-SRC[47:40]
DEST[55:48] ← DEST[55:48]-SRC[55:48]
DEST[63:56] ← DEST[63:56]-SRC[63:56]
DEST[71:64] ← DEST[71:64]-SRC[71:64]
DEST[79:72] ← DEST[79:72]-SRC[79:72]
DEST[87:80] ← DEST[87:80]-SRC[87:80]
DEST[95:88] ← DEST[95:88]-SRC[95:88]
DEST[103:96] ← DEST[103:96]-SRC[103:96]
DEST[111:104] ← DEST[111:104]-SRC[111:104]
DEST[119:112] ← DEST[119:112]-SRC[119:112]
DEST[127:120] ← DEST[127:120]-SRC[127:120]
DEST[255:128] (Unmodified)

VPSUBW (VEX.128 encoded version)
DEST[15:0] ← SRC1[15:0]-SRC2[15:0]
DEST[31:16] ← SRC1[31:16]-SRC2[31:16]
DEST[47:32] ← SRC1[47:32]-SRC2[47:32]
DEST[63:48] ← SRC1[63:48]-SRC2[63:48]
DEST[79:64] ← SRC1[79:64]-SRC2[79:64]
DEST[95:80] ← SRC1[95:80]-SRC2[95:80]
DEST[111:96] ← SRC1[111:96]-SRC2[111:96]
DEST[127:112] ← SRC1[127:112]-SRC2[127:112]
DEST[255:128] ← 0

VPSUBD (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0]-SRC2[31:0]
DEST[63:32] ← SRC1[63:32]-SRC2[63:32]
DEST[95:64] ← SRC1[95:64]-SRC2[95:64]
DEST[127:96] ← SRC1[127:96]-SRC2[127:96]
DEST[255:128] ← 0

VPSUBQ (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0]-SRC2[63:0]
DEST[127:64] ← SRC1[127:64]-SRC2[127:64]
DEST[255:128] ← 0

PSUBQ (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0]-SRC[63:0]
DEST[127:64] ← DEST[127:64]-SRC[127:64]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PSUBB __m128i _mm_sub_epi8 (__m128i a, __m128i b)
PSUBW __m128i _mm_sub_epi16 (__m128i a, __m128i b)
PSUBD __m128i _mm_sub_epi32 (__m128i a, __m128i b)
PSUBQ __m128i _mm_sub_epi64(__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.

**PSUBSB/PSUBSW - Subtract Packed Signed Integers with Signed Saturation**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F E8 /r PSUBSB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed signed byte integers in xmm2/m128 from packed signed byte integers in xmm1 and saturate results.</td>
</tr>
<tr>
<td>66 0F E9 /r PSUBSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed signed word integers in xmm2/m128 from packed signed word integers in xmm1 and saturate results.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E8 /r VPSUBSB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed signed byte integers in xmm3/m128 from packed signed byte integers in xmm2 and saturate results.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F E9 /r VPSUBSW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed signed word integers in xmm3/m128 from packed signed word integers in xmm2 and saturate results.</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD subtract of the packed signed integers of the second source operand from the packed signed integers of the first source operand, and stores the packed integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with signed saturation, as described in the following paragraphs.

The first source and destination operands are XMM registers and the second source operand is either an XMM register or a 128-bit memory location.

The PSUBSB instruction subtracts packed signed byte integers. When an individual byte result is beyond the range of a signed byte integer (that is, greater than 7FH or less than 80H), the saturated value of 7FH or 80H, respectively, is written to the destination operand.

The PSUBSW instruction subtracts packed signed word integers. When an individual word result is beyond the range of a signed word integer (that is, greater than 7FFFH or less than 8000H), the saturated value of 7FFFH or 8000H, respectively, is written to the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.
VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.
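
The signed saturation rule described above can be sketched per element in portable C. This is a scalar model under stated assumptions, not the instruction or its intrinsic; `sat_sub_i8` and `sat_sub_i16` are hypothetical helper names that play the role of SaturateToSignedByte and SaturateToSignedWord applied to a difference.

```c
#include <stdint.h>

/* Signed byte subtract with saturation: compute in a wider type,
   then clamp to the range 80H..7FH (-128..127). */
int8_t sat_sub_i8(int8_t a, int8_t b)
{
    int r = (int)a - (int)b;
    if (r > INT8_MAX) return INT8_MAX;
    if (r < INT8_MIN) return INT8_MIN;
    return (int8_t)r;
}

/* Signed word subtract with saturation: clamp to 8000H..7FFFH. */
int16_t sat_sub_i16(int16_t a, int16_t b)
{
    int32_t r = (int32_t)a - (int32_t)b;
    if (r > INT16_MAX) return INT16_MAX;
    if (r < INT16_MIN) return INT16_MIN;
    return (int16_t)r;
}
```

For example, `sat_sub_i8(100, -100)` returns 127 rather than the wrapped value that PSUBB would produce for the same bit patterns.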

Operation

**VPSUBSB**
DEST[7:0] ← SaturateToSignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToSignedByte (SRC1[127:120] - SRC2[127:120]);
DEST[255:128] ← 0

**PSUBSB**
DEST[7:0] ← SaturateToSignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToSignedByte (DEST[127:120] - SRC[127:120]);
DEST[255:128] (Unmodified)

**VPSUBSW**
DEST[15:0] ← SaturateToSignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToSignedWord (SRC1[127:112] - SRC2[127:112]);
DEST[255:128] ← 0

**PSUBSW**
DEST[15:0] ← SaturateToSignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToSignedWord (DEST[127:112] - SRC[127:112]);
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

PSUBSB __m128i _mm_subs_epi8(__m128i m1, __m128i m2)
PSUBSW __m128i _mm_subs_epi16(__m128i m1, __m128i m2)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally
#UD if VEX.L = 1.

PSUBUSB/PSUBUSW - Subtract Packed Unsigned Integers with Unsigned Saturation

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F D8 /r PSUBUSB xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed unsigned byte integers in xmm2/m128 from packed unsigned byte integers in xmm1 and saturate result.</td>
</tr>
<tr>
<td>66 0F D9 /r PSUBUSW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed unsigned word integers in xmm2/m128 from packed unsigned word integers in xmm1 and saturate result.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D8 /r VPSUBUSB xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed unsigned byte integers in xmm3/m128 from packed unsigned byte integers in xmm2 and saturate result.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F D9 /r VPSUBUSW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed unsigned word integers in xmm3/m128 from packed unsigned word integers in xmm2 and saturate result.</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD subtract of the packed unsigned integers of the second source operand from the packed unsigned integers of the first source operand and stores the packed unsigned integer results in the destination operand. See Figure 9-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD operation. Overflow is handled with unsigned saturation, as described in the following paragraphs.

The first source and destination operands are XMM registers. The second source operand can be either an XMM register or a 128-bit memory location.

The PSUBUSB instruction subtracts packed unsigned byte integers. When an individual byte result is less than zero, the saturated value of 00H is written to the destination operand.

The PSUBUSW instruction subtracts packed unsigned word integers. When an individual word result is less than zero, the saturated value of 0000H is written to the destination operand.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.
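
Unsigned saturation on subtraction only ever clamps at zero, which the following scalar sketch in portable C makes explicit. The function names are hypothetical and model a single element, not the instruction itself. The `abs_diff_u8` helper shows one commonly used idiom built on this behavior (it is an illustration added here, not something this reference prescribes): because at most one of the two saturated differences is nonzero, OR-ing them gives the absolute difference.

```c
#include <stdint.h>

/* Unsigned byte subtract with saturation: a negative result becomes 00H. */
uint8_t sat_sub_u8(uint8_t a, uint8_t b)
{
    return (a > b) ? (uint8_t)(a - b) : 0;
}

/* Unsigned word subtract with saturation: a negative result becomes 0000H. */
uint16_t sat_sub_u16(uint16_t a, uint16_t b)
{
    return (a > b) ? (uint16_t)(a - b) : 0;
}

/* Illustrative idiom: |a - b| from two saturated subtractions. */
uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sat_sub_u8(a, b) | sat_sub_u8(b, a));
}
```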

**Operation**

**VPSUBUSB**
DEST[7:0] ← SaturateToUnsignedByte (SRC1[7:0] - SRC2[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToUnsignedByte (SRC1[127:120] - SRC2[127:120]);
DEST[255:128] ← 0

**PSUBUSB**
DEST[7:0] ← SaturateToUnsignedByte (DEST[7:0] - SRC[7:0]);
(* Repeat subtract operation for 2nd through 15th bytes *)
DEST[127:120] ← SaturateToUnsignedByte (DEST[127:120] - SRC[127:120]);
DEST[255:128] (Unmodified)

**VPSUBUSW**
DEST[15:0] ← SaturateToUnsignedWord (SRC1[15:0] - SRC2[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToUnsignedWord (SRC1[127:112] - SRC2[127:112]);
DEST[255:128] ← 0

**PSUBUSW**
DEST[15:0] ← SaturateToUnsignedWord (DEST[15:0] - SRC[15:0]);
(* Repeat subtract operation for 2nd through 7th words *)
DEST[127:112] ← SaturateToUnsignedWord (DEST[127:112] - SRC[127:112]);
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

PSUBUSB __m128i _mm_subs_epu8(__m128i m1, __m128i m2)

PSUBUSW __m128i _mm_subs_epu16(__m128i m1, __m128i m2)

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 4; additionally

#UD if VEX.L = 1.

**PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ/PUNPCKHQDQ - Unpack High Data**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 68/r PUNPCKHBW xmm1,xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave high-order bytes from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 69/r PUNPCKHWD xmm1,xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave high-order words from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 6A/r PUNPCKHDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave high-order doublewords from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 6D/r PUNPCKHQDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave high-order quadword from xmm1 and xmm2/m128 into xmm1 register.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 68/r VPUNPCKHBW xmm1,xmm2,xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave high-order bytes from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 69/r VPUNPCKHWD xmm1,xmm2,xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave high-order words from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 6A/r VPUNPCKHDQ xmm1, xmm2,xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave high-order doublewords from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 6D/r VPUNPCKHQDQ xmm1, xmm2,xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave high-order quadword from xmm2 and xmm3/m128 into xmm1 register.</td>
</tr>
</tbody>
</table>

**Description**

Unpacks and interleaves the high-order data elements (bytes, words, doublewords, and quadwords) of the first source operand and second source operand into the destination operand. (Figure 5-24 shows the unpack operation for bytes in 64-bit operands.) The low-order data elements are ignored.

128-bit Legacy SSE version: The first source operand and the destination operand are the same.

The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

The PUNPCKHBW instruction interleaves the high-order bytes of the source and destination operands, the PUNPCKHWD instruction interleaves the high-order words of the source and destination operands, the PUNPCKHDQ instruction interleaves the high order doubleword (or doublewords) of the source and destination operands, and the PUNPCKHQDQ instruction interleaves the high-order quadwords of the source and destination operands.

128-bit Legacy SSE versions: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded versions: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.

Operation

INTERLEAVE_HIGH_BYTES (SRC1, SRC2)
DEST[7:0] ← SRC1[71:64]
DEST[15:8] ← SRC2[71:64]
DEST[23:16] ← SRC1[79:72]
DEST[31:24] ← SRC2[79:72]
DEST[39:32] ← SRC1[87:80]
DEST[47:40] ← SRC2[87:80]
DEST[55:48] ← SRC1[95:88]
DEST[63:56] ← SRC2[95:88]
DEST[71:64] ← SRC1[103:96]
DEST[79:72] ← SRC2[103:96]
DEST[87:80] ← SRC1[111:104]
DEST[95:88] ← SRC2[111:104]
DEST[103:96] ← SRC1[119:112]
DEST[111:104] ← SRC2[119:112]
DEST[119:112] ← SRC1[127:120]
DEST[127:120] ← SRC2[127:120]

INTERLEAVE_HIGH_WORDS (SRC1, SRC2)
DEST[15:0] ← SRC1[79:64]
DEST[31:16] ← SRC2[79:64]
DEST[47:32] ← SRC1[95:80]
DEST[63:48] ← SRC2[95:80]
DEST[79:64] ← SRC1[111:96]
DEST[95:80] ← SRC2[111:96]
DEST[111:96] ← SRC1[127:112]
DEST[127:112] ← SRC2[127:112]

INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]

INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
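
The four INTERLEAVE_HIGH_* helpers above follow one pattern that differs only in element size. A minimal sketch of that pattern in portable C is shown below; `unpack_high_model` is a hypothetical name, the operands are modelled as 16-byte arrays with index 0 holding bits 7:0, and the `elem` parameter (1, 2, 4, or 8 bytes) is an assumption of this sketch rather than anything encoded in the instructions.

```c
#include <stdint.h>
#include <string.h>

/* Interleave the elements of the high 64 bits of src1 and src2.
   elem is the element size in bytes: 1, 2, 4, or 8. */
void unpack_high_model(uint8_t dst[16], const uint8_t src1[16],
                       const uint8_t src2[16], size_t elem)
{
    uint8_t tmp[16];      /* scratch buffer so dst may alias src1 (legacy form) */
    size_t n = 8 / elem;  /* elements taken from each source's high half */

    for (size_t i = 0; i < n; i++) {
        /* Even result slots come from SRC1, odd slots from SRC2. */
        memcpy(tmp + (2 * i) * elem,     src1 + 8 + i * elem, elem);
        memcpy(tmp + (2 * i + 1) * elem, src2 + 8 + i * elem, elem);
    }
    memcpy(dst, tmp, 16);
}
```

Calling the function with `elem` equal to 1 reproduces the byte mapping of INTERLEAVE_HIGH_BYTES, and `elem` equal to 8 reproduces INTERLEAVE_HIGH_QWORDS. Passing the same array as `dst` and `src1` models the two-operand legacy form.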

PUNPCKHBW
DEST[127:0] ← INTERLEAVE_HIGH_BYTES(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHBW
DEST[127:0] ← INTERLEAVE_HIGH_BYTES(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKHWD
DEST[127:0] ← INTERLEAVE_HIGH_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHWD
DEST[127:0] ← INTERLEAVE_HIGH_WORDS(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKHDQ
DEST[127:0] ← INTERLEAVE_HIGH_DWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHDQ
DEST[127:0] ← INTERLEAVE_HIGH_DWORDS(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKHQDQ
DEST[127:0] ← INTERLEAVE_HIGH_QWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKHQDQ
DEST[127:0] ← INTERLEAVE_HIGH_QWORDS(SRC1, SRC2)
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent
PUNPCKHBW __m128i _mm_unpackhi_epi8(__m128i m1, __m128i m2)
PUNPCKHWD __m128i _mm_unpackhi_epi16(__m128i m1, __m128i m2)
PUNPCKHDQ __m128i _mm_unpackhi_epi32(__m128i m1, __m128i m2)
PUNPCKHQDQ __m128i _mm_unpackhi_epi64 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
### PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ - Unpack Low Data

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 60 /r PUNPCKLBW xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave low-order bytes from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 61 /r PUNPCKLWD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave low-order words from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 62 /r PUNPCKLDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave low-order doublewords from xmm1 and xmm2/m128 into xmm1.</td>
</tr>
<tr>
<td>66 0F 6C /r PUNPCKLQDQ xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Interleave low-order quadword from xmm1 and xmm2/m128 into xmm1 register.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 60 /r VPUNPCKLBW xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave low-order bytes from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 61 /r VPUNPCKLWD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave low-order words from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 62 /r VPUNPCKLDQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave low-order doublewords from xmm2 and xmm3/m128 into xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 6C /r VPUNPCKLQDQ xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Interleave low-order quadword from xmm2 and xmm3/m128 into xmm1 register.</td>
</tr>
</tbody>
</table>

**Description**

Unpacks and interleaves the low-order data elements (bytes, words, doublewords, and quadwords) of the first source operand and second source operand into the destination operand. (Figure 5-25 shows the unpack operation for bytes in 64-bit operands.) The high-order data elements are ignored.

128-bit Legacy SSE version: The first source operand and the destination operand are the same.

![Figure 5-25. PUNPCKLBW Instruction Operation using 64-bit Operands](image)

The second source operand can be an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. When the source data comes from a 128-bit memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to a 16-byte boundary and normal segment checking will still be enforced.

The PUNPCKLBW instruction interleaves the low-order bytes of the source and destination operands, the PUNPCKLWD instruction interleaves the low-order words of the source and destination operands, the PUNPCKLDQ instruction interleaves the low order doubleword (or doublewords) of the source and destination operands, and the PUNPCKLQDQ instruction interleaves the low-order quadwords of the source and destination operands.

128-bit Legacy SSE versions: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded versions: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.

Operation

INTERLEAVE_BYTES (SRC1, SRC2)
DEST[7:0] ← SRC1[7:0]
DEST[15:8] ← SRC2[7:0]
DEST[23:16] ← SRC1[15:8]
DEST[31:24] ← SRC2[15:8]
DEST[39:32] ← SRC1[23:16]
DEST[47:40] ← SRC2[23:16]
DEST[55:48] ← SRC1[31:24]
DEST[63:56] ← SRC2[31:24]
DEST[71:64] ← SRC1[39:32]
DEST[79:72] ← SRC2[39:32]
DEST[87:80] ← SRC1[47:40]
DEST[95:88] ← SRC2[47:40]
DEST[103:96] ← SRC1[55:48]
DEST[111:104] ← SRC2[55:48]
DEST[119:112] ← SRC1[63:56]
DEST[127:120] ← SRC2[63:56]

INTERLEAVE_WORDS (SRC1, SRC2)
DEST[15:0] ← SRC1[15:0]
DEST[31:16] ← SRC2[15:0]
DEST[47:32] ← SRC1[31:16]
DEST[63:48] ← SRC2[31:16]
DEST[79:64] ← SRC1[47:32]
DEST[95:80] ← SRC2[47:32]
DEST[111:96] ← SRC1[63:48]
DEST[127:112] ← SRC2[63:48]

INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]

INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
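
One consequence of the INTERLEAVE_BYTES mapping is worth spelling out: when the second source is all zeros, each low-order byte of the first source ends up as the low byte of a 16-bit word whose high byte is zero, so the operation zero-extends eight bytes to eight words. The scalar sketch below models only that special case in portable C; `widen_low_bytes_model` is a hypothetical name, and index 0 is assumed to hold the least significant element.

```c
#include <stdint.h>

/* Model of INTERLEAVE_BYTES(SRC1, SRC2) with SRC2 = 0:
   the low 8 bytes of src are zero-extended into 8 words. */
void widen_low_bytes_model(uint16_t dst[8], const uint8_t src[16])
{
    for (int i = 0; i < 8; i++) {
        uint8_t lo = src[i];  /* byte taken from SRC1 */
        uint8_t hi = 0;       /* byte taken from the all-zero SRC2 */
        dst[i] = (uint16_t)(lo | (hi << 8));
    }
}
```

The upper eight bytes of `src` are ignored, matching the statement that the high-order data elements do not contribute to the result.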

PUNPCKLBW
DEST[127:0] ← INTERLEAVE_BYTES(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKLBW
DEST[127:0] ← INTERLEAVE_BYTES(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKLWD
DEST[127:0] ← INTERLEAVE_WORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKLWD
DEST[127:0] ← INTERLEAVE_WORDS(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKLDQ
DEST[127:0] ← INTERLEAVE_DWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKLDQ
DEST[127:0] ← INTERLEAVE_DWORDS(SRC1, SRC2)
DEST[255:128] ← 0

PUNPCKLQDQ
DEST[127:0] ← INTERLEAVE_QWORDS(DEST, SRC)
DEST[255:128] (Unmodified)

VPUNPCKLQDQ
DEST[127:0] ← INTERLEAVE_QWORDS(SRC1, SRC2)
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

PUNPCKLBW __m128i _mm_unpacklo_epi8 (__m128i m1, __m128i m2)
PUNPCKLWD __m128i _mm_unpacklo_epi16 (__m128i m1, __m128i m2)
PUNPCKLDQ __m128i _mm_unpacklo_epi32 (__m128i m1, __m128i m2)
PUNPCKLQDQ __m128i _mm_unpacklo_epi64 (__m128i m1, __m128i m2)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.L = 1.
PXOR - Exclusive Or

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F EF /r PXOR xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Bitwise XOR of xmm2/m128 and xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F EF /r VPXOR xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Bitwise XOR of xmm3/m128 and xmm2.</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical XOR operation on the second source operand and the first source operand and stores the result in the destination operand. The second source operand is an XMM register or a 128-bit memory location. The first source and destination operands are XMM registers. Each bit of the result is 1 if the corresponding bits of the two source operands differ, and 0 if they are the same.

128-bit Legacy SSE version: Bits (255:128) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. VEX.L must be 0, otherwise instructions will #UD.

Operation

**VPXOR (VEX.128 encoded version)**

DEST ← SRC1 XOR SRC2
DEST[255:128] ← 0

**PXOR (128-bit Legacy SSE version)**

DEST ← DEST XOR SRC
DEST[255:128] (Unmodified)
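
As a small illustration of the operation above, the following portable C sketch models the low 128 bits of the XOR. The type `xor128_model` and the function `pxor_model` are hypothetical names used only for this example. Two properties follow directly from the bitwise definition: XOR-ing a value with itself yields zero, and XOR-ing twice with the same operand restores the original value.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit operand as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } xor128_model;

/* Bitwise XOR of the low 128 bits of two operands. */
xor128_model pxor_model(xor128_model a, xor128_model b)
{
    xor128_model r = { a.lo ^ b.lo, a.hi ^ b.hi };
    return r;
}
```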

Intel C/C++ Compiler Intrinsic Equivalent

PXOR __m128i _mm_xor_si128 (__m128i a, __m128i b)

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4; additionally

#UD If VEX.L = 1.

RCPPS - Compute Approximate Reciprocals of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>OPCODE/INSTRUCTION</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>DESCRIPTION</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 53 /r RCPPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes the approximate reciprocals of packed single-precision values in xmm2/mem and stores the results in xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 53 /r VRCPPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocals of packed single-precision values in xmm2/mem and stores the results in xmm1</td>
</tr>
<tr>
<td>VEX.256.0F 53 /r VRCPPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocals of packed single-precision values in ymm2/mem and stores the results in ymm1</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD computation of the approximate reciprocals of the four or eight packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand. See Figure 10-5 in the IA-32 Intel Architecture Software Developer’s Manual, Volume 1 for an illustration of a SIMD single-precision floating-point operation.

The relative error for this approximation is:

$$|\text{Relative Error}| < 1.5 \cdot 2^{-12}$$

The RCPPS instruction is not affected by the rounding control bits in the MXCSR register.
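
Because the stated bound leaves roughly 12 bits of precision, software that needs more accuracy commonly refines the estimate with a Newton-Raphson step. That refinement is not part of the instruction and is not described in this reference; the sketch below is only an illustration of the arithmetic. It uses `double` for clarity, and `model_estimate` is a hypothetical stand-in for the hardware result that simply injects a chosen relative error.

```c
/* One Newton-Raphson step for 1/a starting from the estimate x0.
   If x0 = (1 + e)/a, the result is (1 - e*e)/a, so the relative
   error is squared. */
double refine_reciprocal(double a, double x0)
{
    return x0 * (2.0 - a * x0);
}

/* Hypothetical stand-in for an approximate reciprocal: the exact
   value scaled by (1 + rel_err). Not the hardware algorithm. */
double model_estimate(double a, double rel_err)
{
    return (1.0 / a) * (1.0 + rel_err);
}
```

With a starting error at the documented bound of 1.5 · 2^-12 (about 3.7e-4), one step leaves a relative error of about 1.3e-7 in this model.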

When a source value is a 0.0, an Inf of the sign of the source value is returned. A denormal source value is treated as a 0.0 (of the same sign).

Tiny results are always flushed to 0.0, with the sign of the operand:

- The result is guaranteed not to be tiny for inputs that are not greater than \((2^{125}) \cdot (2 - 3 \cdot 2^{-10})\) in absolute value.
- The result is guaranteed to be flushed to 0 for values greater than \((2^{126}) \cdot (1 + 3 \cdot 2^{-11})\) in absolute value.
- Input values in between this range may or may not produce tiny results, depending on the implementation.

When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.

VEX.256 encoded version: The source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VRCPPS (VEX.256 encoded version)
DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[159:128] ← APPROXIMATE(1/SRC[159:128])
DEST[191:160] ← APPROXIMATE(1/SRC[191:160])
DEST[223:192] ← APPROXIMATE(1/SRC[223:192])
DEST[255:224] ← APPROXIMATE(1/SRC[255:224])

VRCPPS (VEX.128 encoded version)
DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[255:128] ← 0

RCPPS (128-bit Legacy SSE version)
DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[63:32] ← APPROXIMATE(1/SRC[63:32])
DEST[95:64] ← APPROXIMATE(1/SRC[95:64])
DEST[127:96] ← APPROXIMATE(1/SRC[127:96])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VRCPPS __m256 _mm256_rcp_ps (__m256 a);
RCPPS __m128 _mm_rcp_ps (__m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.vvvv != 1111B.
RCPSS - Compute Reciprocal of Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 53 /r RCPSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm2/m32 and stores the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 53 /r VRCPSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocal of the scalar single-precision floating-point value in xmm3/m32 and stores the result in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].</td>
</tr>
</tbody>
</table>

Description

Computes an approximate reciprocal of the low single-precision floating-point value in the second source operand and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source operand and the destination operand are XMM registers. The three high-order doublewords of the destination operand are copied from the same bits of the first source operand. See Figure 10-6 in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, for an illustration of a scalar single-precision floating-point operation.

The relative error for this approximation is:

$$|\text{Relative Error}| < 1.5 \cdot 2^{-12}$$

The RCPSS instruction is not affected by the rounding control bits in the MXCSR register. When a source value is a 0.0, an Inf of the sign of the source value is returned.

A denormal source value is treated as a 0.0 (of the same sign).
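
The zero and denormal rules above can be summarized in a short scalar sketch. This is a model in portable C under stated assumptions: `rcp_special_case` is a hypothetical helper, it covers only the two input classes just described, and it deliberately leaves NaN inputs and the tiny-result rules out of scope.

```c
#include <math.h>

/* Returns 1 and writes *out when x is one of the special inputs
   described above; returns 0 for all other inputs. */
int rcp_special_case(float x, float *out)
{
    int cls = fpclassify(x);

    /* A denormal input is treated as a zero of the same sign, and a
       zero input yields an infinity of the same sign. */
    if (cls == FP_ZERO || cls == FP_SUBNORMAL) {
        *out = signbit(x) ? -INFINITY : INFINITY;
        return 1;
    }
    return 0;
}
```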

Tiny results are always flushed to 0.0, with the sign of the operand:

- The result is guaranteed not to be tiny for inputs that are not greater than $$(2^{125}) \cdot (2-3 \cdot 2^{-10})$$ in absolute value.
- The result is guaranteed to be flushed to 0 for values greater than $$(2^{126}) \cdot (1+3 \cdot 2^{-11})$$ in absolute value.
- Input values in between this range may or may not produce tiny results, depending on the implementation.

When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VRCPSS is encoded with VEX.L=0. Encoding VRCPSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
VRCPSS (VEX.128 encoded version)
DEST[31:0] ← APPROXIMATE(1/SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

RCPSS (128-bit Legacy SSE version)
DEST[31:0] ← APPROXIMATE(1/SRC[31:0])
DEST[255:32] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
RCPSS __m128 _mm_rcp_ss(__m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5

RSQRTPS - Compute Approximate Reciprocals of Square Roots of Packed Single-Precision Floating-point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 52 /r RSQRTPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes the approximate reciprocals of the square roots of packed single-precision values in xmm2/mem and stores the results in xmm1.</td>
</tr>
<tr>
<td>VEX.128.0F 52 /r VRSQRTPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocals of the square roots of packed single-precision values in xmm2/mem and stores the results in xmm1.</td>
</tr>
<tr>
<td>VEX.256.0F 52 /r VRSQRTPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocals of the square roots of packed single-precision values in ymm2/mem and stores the results in ymm1.</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD computation of the approximate reciprocals of the square roots of the four or eight packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand. See Figure 10-5 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a SIMD single-precision floating-point operation.

The relative error for this approximation is:

$$|\text{Relative Error}| < 1.5 \cdot 2^{-12}$$

The RSQRTPS instruction is not affected by the rounding control bits in the MXCSR register.

When a source value is a 0.0, an Inf of the sign of the source value is returned.

A denormal source value is treated as a 0.0 (of the same sign).

When a source value is a negative value (other than 0.0), a floating-point indefinite is returned.

When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.
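The special-case rules above depend only on the sign, exponent, and fraction fields of each single-precision element, so they can be summarized as a small classifier. The sketch below is an illustration in portable C, not the processor's implementation; the enum and function names are hypothetical. It assumes the IEEE 754 binary32 layout and reports which rule applies rather than computing the approximation itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
typedef enum {
    RSQRT_APPROX,      /* ordinary input: the approximation is returned */
    RSQRT_POS_INF,     /* +0.0 or positive denormal source              */
    RSQRT_NEG_INF,     /* -0.0 or negative denormal source              */
    RSQRT_INDEFINITE,  /* negative source other than 0.0                */
    RSQRT_QNAN         /* SNaN or QNaN source                           */
} rsqrt_case;

/* Classify one binary32 bit pattern according to the rules in the text. */
static rsqrt_case rsqrt_classify_bits(uint32_t bits)
{
    uint32_t sign     = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x7FFFFFu;

    if (exponent == 0xFFu && fraction != 0)
        return RSQRT_QNAN;
    if (exponent == 0)  /* zero or denormal: treated as 0.0 of the same sign */
        return sign ? RSQRT_NEG_INF : RSQRT_POS_INF;
    if (sign)
        return RSQRT_INDEFINITE;
    return RSQRT_APPROX;
}

/* Convenience wrapper that takes a float instead of raw bits. */
static rsqrt_case rsqrt_classify(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    return rsqrt_classify_bits(bits);
}
```

The denormal check comes before the sign check on purpose: a negative denormal is first treated as -0.0 and therefore falls under the zero rule, not the negative-value rule.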

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Operation

VRSQRTPS (VEX.256 encoded version)
DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[159:128] ← APPROXIMATE(1/SQRT(SRC[159:128]))
DEST[191:160] ← APPROXIMATE(1/SQRT(SRC[191:160]))
DEST[223:192] ← APPROXIMATE(1/SQRT(SRC[223:192]))
DEST[255:224] ← APPROXIMATE(1/SQRT(SRC[255:224]))

VRSQRTPS (VEX.128 encoded version)
DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[255:128] ← 0

RSQRTPS (128-bit Legacy SSE version)
DEST[31:0] ← APPROXIMATE(1/SQRT(SRC[31:0]))
DEST[63:32] ← APPROXIMATE(1/SQRT(SRC[63:32]))
DEST[95:64] ← APPROXIMATE(1/SQRT(SRC[95:64]))
DEST[127:96] ← APPROXIMATE(1/SQRT(SRC[127:96]))
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

RSQRTPS __m256 _mm256_rsqrt_ps (__m256 a);
RSQRTPS __m128 _mm_rsqrt_ps (__m128 a);

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 4; additionally
#UD If VEX.vvvv != 1111B.
RSQRTSS - Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 52 /r RSQRTSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes the approximate reciprocal of the square root of the low single precision floating-point value in xmm2/m32 and stores the results in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 52 /r VRSQRTSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes the approximate reciprocal of the square root of the low single precision floating-point value in xmm3/m32 and stores the results in xmm1. Also, upper single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].</td>
</tr>
</tbody>
</table>

Description

Computes an approximate reciprocal of the square root of the low single-precision floating-point value in the second source operand and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers. The three high-order doublewords of the destination operand are copied from the same bits of the first source operand. See Figure 10-6 in the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a scalar single-precision floating-point operation. The relative error for this approximation is:

$$|\text{Relative Error}| < 1.5 \cdot 2^{-12}$$

The RSQRTSS instruction is not affected by the rounding control bits in the MXCSR register.

When a source value is a 0.0, an Inf of the sign of the source value is returned.

A denormal source value is treated as a 0.0 (of the same sign).

When a source value is a negative value (other than 0.0), a floating-point indefinite is returned.

When a source value is an SNaN or QNaN, the SNaN is converted to a QNaN or the source QNaN is returned.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VRSQRTSS is encoded with VEX.L=0. Encoding VRSQRTSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
VRSQRTSS (VEX.128 encoded version)
DEST[31:0] ← APPROXIMATE(1/SQRT(SRC2[31:0]))
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

RSQRTSS (128-bit Legacy SSE version)
DEST[31:0] ← APPROXIMATE(1/SQRT(SRC2[31:0]))
DEST[255:32] (Unmodified)
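The difference between the two encodings lies in what happens to the bits above the low element. As a rough illustration, the sketch below models a YMM register as eight 32-bit lanes in portable C and applies each merge rule to an already computed low result. The type `ymm_model` and both function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* A YMM register viewed as eight 32-bit lanes; lane 0 holds bits 31:0. */
typedef struct { float lane[8]; } ymm_model;

/* VEX.128 scalar form: lane 0 takes the result, lanes 1..3 are copied
   from SRC1, and lanes 4..7 (bits 255:128) are zeroed. */
static void vex_scalar_merge(ymm_model *dest, const ymm_model *src1,
                             float low_result)
{
    assert(dest != NULL && src1 != NULL);
    ymm_model out;
    out.lane[0] = low_result;
    for (int i = 1; i < 4; i++)
        out.lane[i] = src1->lane[i];
    for (int i = 4; i < 8; i++)
        out.lane[i] = 0.0f;
    *dest = out;  /* written last so that dest may alias src1 */
}

/* Legacy SSE scalar form: only bits 31:0 change. */
static void sse_scalar_merge(ymm_model *dest, float low_result)
{
    assert(dest != NULL);
    dest->lane[0] = low_result;
}
```

In the VEX form the previous contents of the destination never reach the result, whereas the legacy form preserves every lane except lane 0.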

Intel C/C++ Compiler Intrinsic Equivalent
RSQRTSS __m128 _mm_rsqrt_ss(__m128 a)

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 5
ROUNDPD- Round Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 09 /r ib ROUNDPD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Round packed double-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 09 /r ib VROUNDPD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round packed double-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8.</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 09 /r ib VROUNDPD ymm1, ymm2/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round packed double-precision floating-point values in ymm2/m256 and place the result in ymm1. The rounding mode is determined by imm8.</td>
</tr>
</tbody>
</table>

**Description**

Round the two or four double-precision floating-point values in the source operand (second operand) by the rounding mode specified in the immediate operand (third operand) and place the result in the destination operand (first operand). The rounding process rounds the input to an integral value and returns the result as a double-precision floating-point value.

The immediate operand specifies control fields for the rounding operation. Three bit fields are defined, as shown in Figure 5-26: bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (Figure 5-26 lists the encoded values for the rounding-mode field).

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it is converted to a QNaN. If DAZ is set to '1', denormals are converted to zero before rounding.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

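Figure 5-26. Control fields of the immediate byte (imm8)

<table>
<thead>
<tr>
<th>Bits</th>
<th>Field</th>
<th>Encoding</th>
</tr>
</thead>
<tbody>
<tr>
<td>7:4</td>
<td>Reserved</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>P (precision exception control)</td>
<td>0: normal behavior; 1: Inexact (Precision) field is not updated and no precision exception will be taken if unmasked</td>
</tr>
<tr>
<td>2</td>
<td>Source of rounding-mode control</td>
<td>0: imm8[1:0]; 1: MXCSR.RC</td>
</tr>
<tr>
<td>1:0</td>
<td>RC (rounding mode)</td>
<td>00: Nearest; 01: Down / toward -INF; 10: Up / toward +INF; 11: Truncate</td>
</tr>
</tbody>
</table>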
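To make the field layout concrete, the following sketch decodes an imm8 value into the three fields described above and picks the rounding direction that would apply. It is plain C with no intrinsics; the struct `round_control` and both function names are hypothetical. The MXCSR.RC value is passed in as a parameter because the sketch does not read the real register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of the immediate byte. */
typedef struct {
    bool     suppress_precision; /* imm8[3]: 1 = precision exception not reported */
    bool     use_mxcsr_rc;       /* imm8[2]: 1 = take rounding mode from MXCSR.RC */
    unsigned rc;                 /* imm8[1:0]: 0 nearest, 1 down, 2 up, 3 truncate */
} round_control;

/* Split imm8 into its fields; bits 7:4 are reserved and ignored here. */
static round_control decode_round_imm8(uint8_t imm8)
{
    round_control c;
    c.suppress_precision = ((imm8 >> 3) & 1u) != 0;
    c.use_mxcsr_rc       = ((imm8 >> 2) & 1u) != 0;
    c.rc                 = imm8 & 3u;
    return c;
}

/* Rounding direction actually applied, given the current MXCSR.RC value. */
static unsigned effective_rounding(round_control c, unsigned mxcsr_rc)
{
    assert(mxcsr_rc < 4);
    return c.use_mxcsr_rc ? mxcsr_rc : c.rc;
}
```

For example, an immediate of 0x0B decodes to "truncate, precision exception suppressed", while 0x04 defers entirely to MXCSR.RC.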

**Operation**

```
RoundToNearestIntegralValue(value, control) {
    if (control[2] == 1)
        rounding_direction ← MXCSR:RC
    else
        rounding_direction ← control[1:0]

    case (rounding_direction)
        00: dest ← round_to_nearest_even_integer(value)
        01: dest ← round_to_equal_or_smaller_integer(value)
        10: dest ← round_to_equal_or_larger_integer(value)
        11: dest ← round_to_nearest_smallest_magnitude_integer(value)
    esac

    if (control[3] == 0) {
        if (value != dest)
            set_precision()
    }
    return(dest)
}
```
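The pseudocode above can be approximated in portable C for experimentation. The sketch below is a simplified model, not the hardware algorithm: the function name `round_integral_model` is hypothetical, the MXCSR.RC value is supplied by the caller, and the precision flag is reported through an optional `bool` pointer. It only rounds values with magnitude below 2^52 (larger doubles are already integral), and it does not model SNaN quieting, DAZ, or the sign of a zero result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of RoundToNearestIntegralValue for doubles.
   control  : low four bits of imm8
   mxcsr_rc : rounding mode that MXCSR.RC would hold (0..3)
   inexact  : optional; set to true where the pseudocode calls set_precision() */
static double round_integral_model(double value, unsigned control,
                                   unsigned mxcsr_rc, bool *inexact)
{
    assert(control < 16 && mxcsr_rc < 4);
    unsigned dir = (control & 4u) ? mxcsr_rc : (control & 3u);
    double mag = value < 0.0 ? -value : value;
    double dest = value;

    if (mag < 4503599627370496.0) {             /* 2^52; false for NaN and Inf */
        double t  = (double)(long long)value;   /* truncate toward zero */
        double fl = (t > value) ? t - 1.0 : t;  /* largest integer <= value */
        double ce = (t < value) ? t + 1.0 : t;  /* smallest integer >= value */
        switch (dir) {
        case 0: {                               /* nearest, ties to even */
            double frac = value - fl;
            bool odd = ((long long)fl & 1) != 0;
            dest = (frac > 0.5 || (frac == 0.5 && odd)) ? ce : fl;
            break;
        }
        case 1:  dest = fl; break;              /* toward -INF */
        case 2:  dest = ce; break;              /* toward +INF */
        default: dest = t;  break;              /* truncate */
        }
        if ((control & 8u) == 0 && inexact != NULL && value != dest)
            *inexact = true;
    }
    return dest;
}
```

With `control` set to 0, the input 2.5 rounds to 2.0 and 3.5 rounds to 4.0, which shows the ties-to-even rule; setting bit 3 leaves the `inexact` flag untouched for the same inputs.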

VROUNDPD (VEX.256 encoded version)
DEST[63:0] ← RoundToNearestIntegralValue(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToNearestIntegralValue(SRC[127:64], ROUND_CONTROL)
DEST[191:128] ← RoundToNearestIntegralValue(SRC[191:128], ROUND_CONTROL)
DEST[255:192] ← RoundToNearestIntegralValue(SRC[255:192], ROUND_CONTROL)

VROUNDPD (VEX.128 encoded version)
DEST[63:0] ← RoundToNearestIntegralValue(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToNearestIntegralValue(SRC[127:64], ROUND_CONTROL)
DEST[255:128] ← 0

ROUNDPD (128-bit Legacy SSE version)
DEST[63:0] ← RoundToNearestIntegralValue(SRC[63:0], ROUND_CONTROL)
DEST[127:64] ← RoundToNearestIntegralValue(SRC[127:64], ROUND_CONTROL)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

__m256d _mm256_round_pd(__m256d s1, int iRoundMode);
__m256d _mm256_floor_pd(__m256d s1);
__m256d _mm256_ceil_pd(__m256d s1);
__m128d _mm_round_pd(__m128d s1, int iRoundMode);
__m128d _mm_floor_pd(__m128d s1);
__m128d _mm_ceil_pd(__m128d s1);

SIMD Floating-Point Exceptions
Precision, Invalid

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
## ROUNDPS- Round Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 08 /r ib ROUNDPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Round packed single-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 08 /r ib VROUNDPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round packed single-precision floating-point values in xmm2/m128 and place the result in xmm1. The rounding mode is determined by imm8</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 08 /r ib VROUNDPS ymm1, ymm2/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round packed single-precision floating-point values in ymm2/m256 and place the result in ymm1. The rounding mode is determined by imm8</td>
</tr>
</tbody>
</table>

### Description

Round the four or eight single-precision floating-point values in the source operand (second operand) by the rounding mode specified in the immediate operand (third operand) and place the result in the destination operand (first operand). The rounding process rounds the input to an integral value and returns the result as a single-precision floating-point value.

The immediate operand specifies control fields for the rounding operation. Three bit fields are defined, as shown in Figure 5-26: bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (Figure 5-26 lists the encoded values for the rounding-mode field).

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it is converted to a QNaN. If DAZ is set to '1', denormals are converted to zero before rounding.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation
(see ROUNDPD for definition of RoundToNearestIntegralValue)

VROUNDPS (VEX.256 encoded version)
DEST[31:0] ← RoundToNearestIntegralValue(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToNearestIntegralValue(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToNearestIntegralValue(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToNearestIntegralValue(SRC[127:96], ROUND_CONTROL)
DEST[159:128] ← RoundToNearestIntegralValue(SRC[159:128], ROUND_CONTROL)
DEST[191:160] ← RoundToNearestIntegralValue(SRC[191:160], ROUND_CONTROL)
DEST[223:192] ← RoundToNearestIntegralValue(SRC[223:192], ROUND_CONTROL)
DEST[255:224] ← RoundToNearestIntegralValue(SRC[255:224], ROUND_CONTROL)

VROUNDPS (VEX.128 encoded version)
DEST[31:0] ← RoundToNearestIntegralValue(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToNearestIntegralValue(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToNearestIntegralValue(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToNearestIntegralValue(SRC[127:96], ROUND_CONTROL)
DEST[255:128] ← 0

ROUNDPS (128-bit Legacy SSE version)
DEST[31:0] ← RoundToNearestIntegralValue(SRC[31:0], ROUND_CONTROL)
DEST[63:32] ← RoundToNearestIntegralValue(SRC[63:32], ROUND_CONTROL)
DEST[95:64] ← RoundToNearestIntegralValue(SRC[95:64], ROUND_CONTROL)
DEST[127:96] ← RoundToNearestIntegralValue(SRC[127:96], ROUND_CONTROL)
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

__m256 _mm256_round_ps(__m256 s1, int iRoundMode);
__m256 _mm256_floor_ps(__m256 s1);
__m256 _mm256_ceil_ps(__m256 s1);
__m128 _mm_round_ps(__m128 s1, int iRoundMode);
__m128 _mm_floor_ps(__m128 s1);
__m128 _mm_ceil_ps(__m128 s1);

**SIMD Floating-Point Exceptions**
**Precision, Invalid**

**Other Exceptions**
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
**ROUNDSD - Round Scalar Double-Precision Value**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0B /r ib ROUNDSD xmm1, xmm2/m64, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Round the low packed double precision floating-point value in xmm2/m64 and place the result in xmm1. The rounding mode is determined by imm8.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0B /r ib VROUNDSD xmm1, xmm2, xmm3/m64, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round the low packed double precision floating-point value in xmm3/m64 and place the result in xmm1. The rounding mode is determined by imm8. Upper packed double precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].</td>
</tr>
</tbody>
</table>

**Description**

Round the double-precision floating-point value in the low quadword of the second source operand by the rounding mode specified in the immediate operand and place the result in the destination operand. The rounding process rounds the lowest double-precision floating-point input to an integral value and returns the result as a double-precision floating-point value in the lowest position. The upper double-precision floating-point value of the destination is copied from the first source operand; in the legacy SSE version the first source operand is the destination itself, so that value is retained.

The immediate operand specifies control fields for the rounding operation. Three bit fields are defined, as shown in Figure 5-26: bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (Figure 5-26 lists the encoded values for the rounding-mode field).

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN then it will be converted to a QNaN. If DAZ is set to ‘1’ then denormals will be converted to zero before rounding.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VROUNDSD is encoded with VEX.L=0. Encoding VROUNDSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

**VROUNDSD (VEX.128 encoded version)**

DEST[63:0] ← RoundToNearestIntegralValue(SRC2[63:0], ROUND_CONTROL)
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

**ROUNDSD (128-bit Legacy SSE version)**

DEST[63:0] ← RoundToNearestIntegralValue(SRC[63:0], ROUND_CONTROL)
DEST[255:64] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

ROUNDSD __m128d _mm_round_sd(__m128d dst, __m128d s1, int iRoundMode);
__m128d _mm_floor_sd(__m128d dst, __m128d s1);
__m128d _mm_ceil_sd(__m128d dst, __m128d s1);

**SIMD Floating-Point Exceptions**

Invalid (signaled only if SRC = SNaN), Precision (signaled only if imm[3] == '0'; if imm[3] == '1', then the Precision Mask in the MXCSR is ignored.)

Note that Denormal is not signaled by ROUNDSD.

**Other Exceptions**

See Exceptions Type 3

ROUNDSS - Round Scalar Single-Precision Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 3A 0A /r ib ROUNDSS xmm1, xmm2/m32, imm8</td>
<td>V/V</td>
<td>SSE4_1</td>
<td>Round the low packed single precision floating-point value in xmm2/m32 and place the result in xmm1. The rounding mode is determined by imm8.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F3A 0A /r ib VROUNDSS xmm1, xmm2, xmm3/m32, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Round the low packed single precision floating-point value in xmm3/m32 and place the result in xmm1. The rounding mode is determined by imm8. Also, upper packed single precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].</td>
</tr>
</tbody>
</table>

Description

Round the single precision floating-point value in the second source operand by the rounding mode specified in the immediate operand and place the result in the destination operand. The rounding process rounds the lowest single precision floating-point input to an integral value and returns the result as a single precision floating-point value in the lowest position. The upper three single precision floating-point values in the destination are copied from the first source operand.

The immediate operand specifies control fields for the rounding operation. Three bit fields are defined, as shown in Figure 5-26: bit 3 of the immediate byte controls processor behavior for a precision exception, bit 2 selects the source of rounding-mode control, and bits 1:0 specify a non-sticky rounding-mode value (Figure 5-26 lists the encoded values for the rounding-mode field).

The Precision Floating-Point Exception is signaled according to the immediate operand. If any source operand is an SNaN, it is converted to a QNaN. If DAZ is set to '1', denormals are converted to zero before rounding.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VROUNDSS is encoded with VEX.L=0. Encoding VROUNDSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

**VROUNDSS (VEX.128 encoded version)**
DEST[31:0] ← RoundToNearestIntegralValue(SRC2[31:0], ROUND_CONTROL)
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

**ROUNDSS (128-bit Legacy SSE version)**
DEST[31:0] ← RoundToNearestIntegralValue(SRC[31:0], ROUND_CONTROL)
DEST[255:32] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

```
ROUNDSS __m128 _mm_round_ss(__m128 dst, __m128 s1, int iRoundMode);
__m128 _mm_floor_ss(__m128 dst, __m128 s1);
__m128 _mm_ceil_ss(__m128 dst, __m128 s1);
```

**SIMD Floating-Point Exceptions**

Invalid (signaled only if SRC = SNaN), Precision (signaled only if imm[3] == '0'; if imm[3] == '1', then the Precision Mask in the MXCSR is ignored.)

Note that Denormal is not signaled by ROUNDSS.

**Other Exceptions**
See Exceptions Type 3
SHUFPD - Shuffle Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
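<tr>
<td>66 0F C6 /r ib SHUFPD xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE2</td>
<td>Shuffle packed double-precision floating-point values selected by imm8 from xmm1 and xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F C6 /r ib VSHUFPD xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle packed double-precision floating-point values selected by imm8 from xmm2 and xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F C6 /r ib VSHUFPD ymm1, ymm2, ymm3/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle packed double-precision floating-point values selected by imm8 from ymm2 and ymm3/m256.</td>
</tr>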
</tbody>
</table>

Description

Moves either of the two packed double-precision floating-point values from each double quadword in the first source operand (second operand) into the low quadword of each double quadword of the destination operand (first operand); moves either of the two packed double-precision floating-point values from the second source operand (third operand) into the high quadword of each double quadword of the destination operand (see Figure 5-27). The immediate determines which values are moved to the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The destination operand is the same XMM register as the first source operand, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

Figure 5-27. VSHUFPD Operation

**Operation**

**VSHUFPD (VEX.256 encoded version)**

IF IMM0[0] = 0
THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC1[127:64] FI;

IF IMM0[1] = 0
THEN DEST[127:64] ← SRC2[63:0]
ELSE DEST[127:64] ← SRC2[127:64] FI;

IF IMM0[2] = 0
THEN DEST[191:128] ← SRC1[191:128]
ELSE DEST[191:128] ← SRC1[255:192] FI;

IF IMM0[3] = 0
THEN DEST[255:192] ← SRC2[191:128]
ELSE DEST[255:192] ← SRC2[255:192] FI;

**VSHUFPD (VEX.128 encoded version)**

IF IMM0[0] = 0
THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC1[127:64] FI;

IF IMM0[1] = 0
THEN DEST[127:64] ← SRC2[63:0]
ELSE DEST[127:64] ← SRC2[127:64] FI;

DEST[255:128] ← 0

**SHUFPD (128-bit Legacy SSE version)**

IF IMM0[0] = 0
THEN DEST[63:0] ← SRC1[63:0]
ELSE DEST[63:0] ← SRC1[127:64] FI;

IF IMM0[1] = 0
THEN DEST[127:64] ← SRC2[63:0]
ELSE DEST[127:64] ← SRC2[127:64] FI;

DEST[255:128] (Unmodified)
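The selection logic is easy to check with a scalar model. The sketch below is a portable C illustration with hypothetical function names; it treats each 128-bit lane as a pair of doubles. It assumes that imm8[2] and imm8[3] control the upper 128-bit lane in the same way that imm8[0] and imm8[1] control the lower one, following the per-lane behavior given in the Description.

```c
#include <assert.h>

/* One 128-bit lane: out[0] is chosen from a, out[1] from b.
   `sel` holds the two immediate bits that belong to this lane. */
static void shufpd_lane(const double a[2], const double b[2],
                        unsigned sel, double out[2])
{
    assert(sel < 4);
    double lo = a[sel & 1u];
    double hi = b[(sel >> 1) & 1u];
    out[0] = lo;
    out[1] = hi;
}

/* 256-bit form: the same rule applied independently to each lane. */
static void vshufpd256_model(const double a[4], const double b[4],
                             unsigned imm8, double out[4])
{
    double tmp[4];
    shufpd_lane(a,     b,     imm8 & 3u,        tmp);
    shufpd_lane(a + 2, b + 2, (imm8 >> 2) & 3u, tmp + 2);
    for (int i = 0; i < 4; i++)
        out[i] = tmp[i];  /* copy last so out may alias a or b */
}
```

With `a = {1, 2, 3, 4}`, `b = {5, 6, 7, 8}` and an immediate of 0x5, the model produces `{2, 5, 4, 7}`: the high element of `a` and the low element of `b` within each lane.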

Intel C/C++ Compiler Intrinsic Equivalent

VSHUFPD __m256d _mm256_shuffle_pd (__m256d a, __m256d b, const int select);
SHUFPD __m128d _mm_shuffle_pd (__m128d a, __m128d b, const int select);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4

SHUFPS - Shuffle Packed Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F C6 /r ib SHUFPS xmm1, xmm2/m128, imm8</td>
<td>V/V</td>
<td>SSE</td>
<td>Shuffle Packed single-precision floating-point values selected by imm8 from xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F C6 /r ib VSHUFPS xmm1, xmm2, xmm3/m128, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle Packed single-precision floating-point values selected by imm8 from xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F C6 /r ib VSHUFPS ymm1, ymm2, ymm3/m256, imm8</td>
<td>V/V</td>
<td>AVX</td>
<td>Shuffle Packed single-precision floating-point values selected by imm8 from ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Moves two of the four packed single-precision floating-point values from each double qword of the first source operand (second operand) into the low quadword of each double qword of the destination operand (first operand); moves two of the four packed single-precision floating-point values from each double qword of the second source operand (third operand) into the high quadword of each double qword of the destination operand (see Figure 5-28). The selector operand (imm8) determines which values are moved to the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The second source operand can be an XMM register or a 128-bit memory location. The destination operand is the same XMM register as the first source operand, and the upper bits (255:128) of the corresponding YMM destination register are unmodified.

Figure 5-28. VSHUFPS Operation

Operation
Select4(SRC, control) {
CASE (control[1:0]) OF
  0: TMP ← SRC[31:0];
  1: TMP ← SRC[63:32];
  2: TMP ← SRC[95:64];
  3: TMP ← SRC[127:96];
ESAC;
RETURN TMP
}

VSHUFPS (VEX.256 encoded version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC2[127:0], imm8[7:6]);
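DEST[159:128] ← Select4(SRC1[255:128], imm8[1:0]);
DEST[191:160] ← Select4(SRC1[255:128], imm8[3:2]);
DEST[223:192] ← Select4(SRC2[255:128], imm8[5:4]);
DEST[255:224] ← Select4(SRC2[255:128], imm8[7:6]);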

VSHUFPS (VEX.128 encoded version)
DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC2[127:0], imm8[7:6]);
DEST[255:128] ← 0

SHUFPS (128-bit Legacy SSE version)

DEST[31:0] ← Select4(SRC1[127:0], imm8[1:0]);
DEST[63:32] ← Select4(SRC1[127:0], imm8[3:2]);
DEST[95:64] ← Select4(SRC2[127:0], imm8[5:4]);
DEST[127:96] ← Select4(SRC2[127:0], imm8[7:6]);
DEST[255:128] (Unmodified)
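As a cross-check of the Select4 pseudocode, the sketch below models the 128-bit shuffle on plain float arrays in portable C. The function names `select4` and `shufps128_model` are hypothetical stand-ins for the pseudocode, not library functions, and the handling of bits 255:128 is left out.

```c
#include <assert.h>

/* Select4 from the pseudocode: pick one of four 32-bit elements. */
static float select4(const float src[4], unsigned control)
{
    assert(control < 4);
    return src[control];
}

/* 128-bit shuffle: the two low result elements come from src1,
   the two high result elements come from src2. */
static void shufps128_model(const float src1[4], const float src2[4],
                            unsigned imm8, float dest[4])
{
    float tmp[4];
    tmp[0] = select4(src1, imm8 & 3u);
    tmp[1] = select4(src1, (imm8 >> 2) & 3u);
    tmp[2] = select4(src2, (imm8 >> 4) & 3u);
    tmp[3] = select4(src2, (imm8 >> 6) & 3u);
    for (int i = 0; i < 4; i++)
        dest[i] = tmp[i];  /* copy last so dest may alias src1, as in the legacy form */
}
```

An immediate of 0xE4 selects elements 0 and 1 of `src1` and elements 2 and 3 of `src2` in order, while 0x1B reverses the choice within each source.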

**Intel C/C++ Compiler Intrinsic Equivalent**

VSHUFPS __m256 _mm256_shuffle_ps (__m256 a, __m256 b, const int select);
SHUFPS __m128 _mm_shuffle_ps (__m128 a, __m128 b, const int select);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 4
SQRTPD- Square Root of Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 51/r SQRTPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Computes Square Roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F 51/r VSQRTPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes Square Roots of the packed double-precision floating-point values in xmm2/m128 and stores the result in xmm1</td>
</tr>
<tr>
<td>VEX.256.66.0F 51/r VSQRTPD ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes Square Roots of the packed double-precision floating-point values in ymm2/m256 and stores the result in ymm1</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD computation of the square roots of the two or four packed double-precision floating-point values in the source operand (second operand) and stores the packed double-precision floating-point results in the destination operand.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are zeroed.

128-bit Legacy SSE version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM destination register are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b otherwise instructions will #UD.

Operation

VSQRTPD (VEX.256 encoded version)
DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[191:128] ← SQRT(SRC[191:128])
DEST[255:192] ← SQRT(SRC[255:192])

VSQRTPD (VEX.128 encoded version)
DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[255:128] ← 0

SQRTPD (128-bit Legacy SSE version)
DEST[63:0] ← SQRT(SRC[63:0])
DEST[127:64] ← SQRT(SRC[127:64])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
SQRTPD __m256d _mm256_sqrt_pd (__m256d a);
SQRTPD __m128d _mm_sqrt_pd (__m128d a);

SIMD Floating-Point Exceptions
Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.
**SQRTPS- Square Root of Single-Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Support</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 51 /r SQRTPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes Square Roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1</td>
</tr>
<tr>
<td>VEX.128.0F 51 /r VSQRTPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes Square Roots of the packed single-precision floating-point values in xmm2/m128 and stores the result in xmm1</td>
</tr>
<tr>
<td>VEX.256.0F 51 /r VSQRTPS ymm1, ymm2/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes Square Roots of the packed single-precision floating-point values in ymm2/m256 and stores the result in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD computation of the square roots of the four or eight packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand.

VEX.256 encoded version: The source operand is a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are unmodified.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

**Operation**

**VSQRTPS (VEX.256 encoded version)**

DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[159:128] ← SQRT(SRC[159:128])
DEST[191:160] ← SQRT(SRC[191:160])
DEST[223:192] ← SQRT(SRC[223:192])
DEST[255:224] ← SQRT(SRC[255:224])

VSQRTPS (VEX.128 encoded version)
DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[255:128] ← 0

SQRTPS (128-bit Legacy SSE version)
DEST[31:0] ← SQRT(SRC[31:0])
DEST[63:32] ← SQRT(SRC[63:32])
DEST[95:64] ← SQRT(SRC[95:64])
DEST[127:96] ← SQRT(SRC[127:96])
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent
VSQRTPS __m256 _mm256_sqrt_ps (__m256 a);
SQRTPS __m128 _mm_sqrt_ps (__m128 a);

SIMD Floating-Point Exceptions
Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2; additionally
#UD If VEX.vvvv != 1111B.

SQRTSD - Compute Square Root of Scalar Double-Precision Floating-Point Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 51 /r SQRTSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Computes square root of the low double-precision floating-point value in xmm2/m64 and stores the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 51 /r VSQRTSD xmm1, xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes square root of the low double-precision floating-point value in xmm3/m64 and stores the result in xmm1. Also, the upper double-precision floating-point value (bits[127:64]) from xmm2 is copied to xmm1[127:64].</td>
</tr>
</tbody>
</table>

Description
Computes the square root of the low double-precision floating-point value in the second source operand and stores the double-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers. The high quadword of the destination operand remains unchanged. See Figure 11-4 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a scalar double-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VSQRTSD is encoded with VEX.L=0. Encoding VSQRTSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation
VSQRTSD (VEX.128 encoded version)
DEST[63:0] ← SQRT(SRC2[63:0])
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

SQRTSD (128-bit Legacy SSE version)
DEST[63:0] ← SQRT(SRC[63:0])
DEST[255:64] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

SQRTSD __m128d _mm_sqrt_sd (__m128d a, __m128d b)

**SIMD Floating-Point Exceptions**
Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 3

SQRTSS - Compute Square Root of Scalar Single-Precision Value

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 51 /r SQRTSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Computes square root of the low single-precision floating-point value in xmm2/m32 and stores the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 51 /r VSQRTSS xmm1, xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Computes square root of the low single-precision floating-point value in xmm3/m32 and stores the result in xmm1. Also, the upper single-precision floating-point values (bits[127:32]) from xmm2 are copied to xmm1[127:32].</td>
</tr>
</tbody>
</table>

Description
Computes the square root of the low single-precision floating-point value in the second source operand and stores the single-precision floating-point result in the destination operand. The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers. The three high-order doublewords of the destination operand remain unchanged. See Figure 10-6 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, for an illustration of a scalar single-precision floating-point operation.

128-bit Legacy SSE version: The first source operand and the destination operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (255:128) of the destination YMM register are zeroed. Software should ensure VSQRTSS is encoded with VEX.L=0. Encoding VSQRTSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VSQRTSS (VEX.128 encoded version)
DEST[31:0] ← SQRT(SRC2[31:0])
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

SQRTSS (128-bit Legacy SSE version)
DEST[31:0] ← SQRT(SRC2[31:0])
DEST[255:32] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**
SQRTSS __m128 _mm_sqrt_ss(__m128 a)

**SIMD Floating-Point Exceptions**
Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 3

VSTMXCSR—Store MXCSR Register State

Stores the contents of the MXCSR control and status register to the destination operand. The destination operand is a 32-bit memory location. The reserved bits in the MXCSR register are stored as 0s.

This instruction’s operation is the same in non-64-bit modes and 64-bit mode. VEX.L must be 0, otherwise instructions will #UD.
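To illustrate what the stored doubleword contains, the sketch below decodes a 32-bit MXCSR image into its fields. The bit positions follow the MXCSR layout documented in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, rather than anything stated in this section. The type `mxcsr_fields` and the function `decode_mxcsr` are hypothetical names introduced only for this example, and the code works on a plain integer instead of reading the register.

```c
#include <stdint.h>

/* Hypothetical container for the architecturally defined MXCSR fields. */
typedef struct {
    unsigned exception_flags;   /* bits 5:0   IE, DE, ZE, OE, UE, PE */
    unsigned daz;               /* bit 6      denormals-are-zeros    */
    unsigned exception_masks;   /* bits 12:7  IM, DM, ZM, OM, UM, PM */
    unsigned rounding_control;  /* bits 14:13 RC                     */
    unsigned ftz;               /* bit 15     flush-to-zero          */
} mxcsr_fields;

/* Split a 32-bit MXCSR image (as written to m32) into its fields. */
mxcsr_fields decode_mxcsr(uint32_t m32)
{
    mxcsr_fields f;
    f.exception_flags  = m32 & 0x3Fu;
    f.daz              = (m32 >> 6) & 1u;
    f.exception_masks  = (m32 >> 7) & 0x3Fu;
    f.rounding_control = (m32 >> 13) & 3u;
    f.ftz              = (m32 >> 15) & 1u;
    return f;
}
```

For example, the default image 1F80H decodes to all six exception masks set, no exception flags pending, and rounding control 00B (round to nearest).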

Operation

m32 ← MXCSR;

Intel C/C++ Compiler Intrinsic Equivalent

unsigned int _mm_getcsr(void)

SIMD Floating-Point Exceptions

None.

Other Exceptions

See Exceptions Type 9

### SUBPD- Subtract Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 5C /r SUBPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract packed double-precision floating-point values in xmm2/mem from xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 5C /r VSUBPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed double-precision floating-point values in xmm3/mem from xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 5C /r VSUBPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed double-precision floating-point values in ymm3/mem from ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

### Description

Performs a SIMD subtract of the two or four packed double-precision floating-point values of the second source operand from the first source operand, and stores the packed double-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

### Operation

**VSUBPD (VEX.256 encoded version)**

DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] - SRC2[127:64]
DEST[191:128] ← SRC1[191:128] - SRC2[191:128]
DEST[255:192] ← SRC1[255:192] - SRC2[255:192]

VSUBPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64] - SRC2[127:64]
DEST[255:128] ← 0

SUBPD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] - SRC[63:0]
DEST[127:64] ← DEST[127:64] - SRC[127:64]
DEST[255:128] (Unmodified)
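
The three encodings differ mainly in how they treat bits 255:128 of the destination. The following is a minimal portable C sketch of that difference, not an implementation that uses the instructions themselves. The type `ymm_f64` and the `model_*` function names are hypothetical and exist only for this illustration; a YMM register is modeled as four double lanes, with lane 0 holding bits 63:0.

```c
/* Hypothetical model of a 256-bit YMM register as four 64-bit lanes. */
typedef struct { double lane[4]; } ymm_f64;

/* VSUBPD ymm1, ymm2, ymm3/m256: all four lanes are subtracted. */
ymm_f64 model_vsubpd_256(ymm_f64 src1, ymm_f64 src2)
{
    ymm_f64 dest;
    for (int i = 0; i < 4; i++)
        dest.lane[i] = src1.lane[i] - src2.lane[i];
    return dest;
}

/* VSUBPD xmm1, xmm2, xmm3/m128: low two lanes subtracted, bits 255:128 zeroed. */
ymm_f64 model_vsubpd_128(ymm_f64 src1, ymm_f64 src2)
{
    ymm_f64 dest;
    dest.lane[0] = src1.lane[0] - src2.lane[0];
    dest.lane[1] = src1.lane[1] - src2.lane[1];
    dest.lane[2] = 0.0;
    dest.lane[3] = 0.0;
    return dest;
}

/* SUBPD xmm1, xmm2/m128: destination is also the first source;
   bits 255:128 of the destination are left unmodified. */
ymm_f64 model_subpd_legacy(ymm_f64 dest, ymm_f64 src)
{
    dest.lane[0] = dest.lane[0] - src.lane[0];
    dest.lane[1] = dest.lane[1] - src.lane[1];
    return dest;  /* lanes 2 and 3 keep their previous contents */
}
```

The legacy form is modeled as taking the old destination by value because the result depends on what the upper half of the register held before the instruction executed.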

Intel C/C++ Compiler Intrinsic Equivalent
VSUBPD __m256d _mm256_sub_pd (__m256d a, __m256d b);
SUBPD __m128d _mm_sub_pd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
**SUBPS- Subtract Packed Single Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 5C /r SUBPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Subtract packed single-precision floating-point values in xmm2/mem from xmm1 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 5C /r VSUBPS xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed single-precision floating-point values in xmm3/mem from xmm2 and stores result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 5C /r VSUBPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract packed single-precision floating-point values in ymm3/mem from ymm2 and stores result in ymm1</td>
</tr>
</tbody>
</table>

**Description**

Performs a SIMD subtract of the four or eight packed single-precision floating-point values in the second source operand from the first source operand, and stores the packed single-precision floating-point results in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Operation**

**VSUBPS (VEX.256 encoded version)**

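DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[159:128] ← SRC1[159:128] - SRC2[159:128]
DEST[191:160] ← SRC1[191:160] - SRC2[191:160]
DEST[223:192] ← SRC1[223:192] - SRC2[223:192]
DEST[255:224] ← SRC1[255:224] - SRC2[255:224]

VSUBPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[255:128] ← 0

SUBPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[63:32] ← SRC1[63:32] - SRC2[63:32]
DEST[95:64] ← SRC1[95:64] - SRC2[95:64]
DEST[127:96] ← SRC1[127:96] - SRC2[127:96]
DEST[255:128] (Unmodified)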

Intel C/C++ Compiler Intrinsic Equivalent

VSUBPS __m256 _mm256_sub_ps (__m256 a, __m256 b);
SUBPS __m128 _mm_sub_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions  
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions  
See Exceptions Type 2
SUBSD- Subtract Scalar Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F2 0F 5C /r SUBSD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Subtract the low double-precision floating-point value in xmm2/mem from xmm1 and store the result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.F2.0F 5C /r VSUBSD xmm1, xmm2, xmm3/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract the low double-precision floating-point value in xmm3/mem from xmm2 and store the result in xmm1</td>
</tr>
</tbody>
</table>

Description

Subtracts the low double-precision floating-point value in the second source operand from the low double-precision floating-point value in the first source operand and stores the double-precision floating-point result in the destination operand.

The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:64) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:64) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VSUBSD is encoded with VEX.L=0. Encoding VSUBSD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VSUBSD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0] - SRC2[63:0]
DEST[127:64] ← SRC1[127:64]
DEST[255:128] ← 0

SUBSD (128-bit Legacy SSE version)
DEST[63:0] ← DEST[63:0] - SRC[63:0]
DEST[255:64] (Unmodified)
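
As a rough illustration of the merge behavior described above, the sketch below models both scalar forms in portable C. It assumes the same hypothetical four-lane `ymm_f64` view of a YMM register (lane 0 = bits 63:0); the names are invented for this example and the code does not execute the instructions themselves.

```c
/* Hypothetical model of a 256-bit YMM register as four 64-bit lanes. */
typedef struct { double lane[4]; } ymm_f64;

/* VSUBSD xmm1, xmm2, xmm3/m64:
   lane 0 is the difference, lane 1 comes from the first source,
   and bits 255:128 are zeroed. */
ymm_f64 model_vsubsd(ymm_f64 src1, double src2_low)
{
    ymm_f64 dest;
    dest.lane[0] = src1.lane[0] - src2_low;
    dest.lane[1] = src1.lane[1];
    dest.lane[2] = 0.0;
    dest.lane[3] = 0.0;
    return dest;
}

/* SUBSD xmm1, xmm2/m64:
   only lane 0 changes; bits 255:64 of the destination are unmodified. */
ymm_f64 model_subsd_legacy(ymm_f64 dest, double src_low)
{
    dest.lane[0] = dest.lane[0] - src_low;
    return dest;
}
```

The second operand is passed as a single double because only its low 64 bits participate, which is also why a 64-bit memory operand is sufficient.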

Intel C/C++ Compiler Intrinsic Equivalent
SUBSD __m128d _mm_sub_sd (__m128d a, __m128d b);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
SUBSS- Subtract Scalar Single Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>F3 0F 5C /r SUBSS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Subtract the low single-precision floating-point value in xmm2/mem from xmm1 and store the result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.128.F3.0F 5C /r VSUBSS xmm1,xmm2, xmm3/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Subtract the low single-precision floating-point value in xmm3/mem from xmm2 and store the result in xmm1</td>
</tr>
</tbody>
</table>

Description

Subtracts the low single-precision floating-point value in the second source operand from the low single-precision floating-point value in the first source operand and stores the single-precision floating-point result in the destination operand.

The second source operand can be an XMM register or a 32-bit memory location. The first source and destination operands are XMM registers.

128-bit Legacy SSE version: The destination and first source operand are the same. Bits (255:32) of the corresponding YMM destination register remain unchanged.

VEX.128 encoded version: Bits (127:32) of the XMM register destination are copied from corresponding bits in the first source operand. Bits (255:128) of the destination YMM register are zeroed.

Software should ensure VSUBSS is encoded with VEX.L=0. Encoding VSUBSS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

Operation

VSUBSS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0] - SRC2[31:0]
DEST[127:32] ← SRC1[127:32]
DEST[255:128] ← 0

SUBSS (128-bit Legacy SSE version)
DEST[31:0] ← DEST[31:0] - SRC[31:0]
DEST[255:32] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

SUBSS __m128 _mm_sub_ss (__m128 a, __m128 b);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

See Exceptions Type 3
**UCOMISD - Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS**

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 2E /r UCOMISD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>SSE2</td>
<td>Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>VEX.128.66.0F 2E /r VUCOMISD xmm1, xmm2/m64</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low double-precision floating-point values in xmm1 and xmm2/mem64 and set the EFLAGS flags accordingly.</td>
</tr>
</tbody>
</table>

**Description**

Performs an unordered compare of the double-precision floating-point values in the low quadwords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 64 bit memory location.

The UCOMISD instruction differs from the COMISD instruction in that it signals a SIMD floating-point invalid operation exception (#I) only when a source operand is an SNaN. The COMISD instruction signals an invalid numeric exception if a source operand is either an SNaN or a QNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VUCOMISD is encoded with VEX.L=0. Encoding VUCOMISD with VEX.L=1 may encounter unpredictable behavior across different processor generations.

**Operation**

**UCOMISD (all versions)**

RESULT ← UnorderedCompare(DEST[63:0] <> SRC[63:0]) {
(* Set EFLAGS *) CASE (RESULT) OF
   UNORDERED: ZF,PF,CF ← 111;
   GREATER_THAN: ZF,PF,CF ← 000;
   LESS_THAN: ZF,PF,CF ← 001;
   EQUAL: ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0; }
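
The flag assignment above can be sketched in portable C as follows. This is only a model of the resulting EFLAGS bits; it does not signal the invalid-operation exception for SNaN operands, and the `ucomi_flags` type and `model_ucomisd` function are hypothetical names used for illustration. A NaN is detected with the self-comparison `x != x`, which assumes IEEE 754 semantics.

```c
/* Hypothetical container for the six EFLAGS bits written by the compare. */
typedef struct { int zf, pf, cf, of, sf, af; } ucomi_flags;

/* Model of the flag result of UCOMISD dest, src (low quadwords only). */
ucomi_flags model_ucomisd(double dest, double src)
{
    ucomi_flags f = {0, 0, 0, 0, 0, 0};   /* OF, SF, AF are always 0 */

    if (dest != dest || src != src) {     /* either operand is a NaN */
        f.zf = 1; f.pf = 1; f.cf = 1;     /* UNORDERED: 111 */
    } else if (dest > src) {
        /* GREATER_THAN: 000, nothing to set */
    } else if (dest < src) {
        f.cf = 1;                         /* LESS_THAN: 001 */
    } else {
        f.zf = 1;                         /* EQUAL: 100 */
    }
    return f;
}
```

Because the unordered case sets ZF, PF, and CF together, code that inspects only ZF or only CF cannot tell an unordered result from an equal or less-than result; PF is the flag that separates them.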

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_ucomieq_sd(__m128d a, __m128d b)
int _mm_ucomilt_sd(__m128d a, __m128d b)
int _mm_ucomile_sd(__m128d a, __m128d b)
int _mm_ucomigt_sd(__m128d a, __m128d b)
int _mm_ucomige_sd(__m128d a, __m128d b)
int _mm_ucomineq_sd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions
Invalid (if SNaN operands), Denormal

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.
UCOMISS - Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 2E /r UCOMISS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>SSE</td>
<td>Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.</td>
</tr>
<tr>
<td>VEX.128.0F 2E /r VUCOMISS xmm1, xmm2/m32</td>
<td>V/V</td>
<td>AVX</td>
<td>Compare low single-precision floating-point values in xmm1 and xmm2/mem32 and set the EFLAGS flags accordingly.</td>
</tr>
</tbody>
</table>

**Description**

Performs an unordered compare of the single-precision floating-point values in the low doublewords of operand 1 (first operand) and operand 2 (second operand), and sets the ZF, PF, and CF flags in the EFLAGS register according to the result (unordered, greater than, less than, or equal). The OF, SF and AF flags in the EFLAGS register are set to 0. The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

Operand 1 is an XMM register; operand 2 can be an XMM register or a 32 bit memory location.

The UCOMISS instruction differs from the COMISS instruction in that it signals a SIMD floating-point invalid operation exception (#I) only if a source operand is an SNaN. The COMISS instruction signals an invalid numeric exception when a source operand is either a QNaN or SNaN.

The EFLAGS register is not updated if an unmasked SIMD floating-point exception is generated.

Note: In VEX-encoded versions, VEX.vvvv is reserved and must be 1111b, otherwise instructions will #UD.

Software should ensure VUCOMISS is encoded with VEX.L=0. Encoding VUCOMISS with VEX.L=1 may encounter unpredictable behavior across different processor generations.

**Operation**

**UCOMISS (all versions)**

RESULT ← UnorderedCompare(DEST[31:0] <> SRC[31:0])
(* Set EFLAGS *) CASE (RESULT) OF
    UNORDERED: ZF,PF,CF ← 111;
    GREATER_THAN: ZF,PF,CF ← 000;
    LESS_THAN: ZF,PF,CF ← 001;
    EQUAL: ZF,PF,CF ← 100;
ESAC;
OF, AF, SF ← 0;

Intel C/C++ Compiler Intrinsic Equivalent

int _mm_ucomieq_ss(__m128 a, __m128 b)
int _mm_ucomilt_ss(__m128 a, __m128 b)
int _mm_ucomile_ss(__m128 a, __m128 b)
int _mm_ucomigt_ss(__m128 a, __m128 b)
int _mm_ucomige_ss(__m128 a, __m128 b)
int _mm_ucomineq_ss(__m128 a, __m128 b)

SIMD Floating-Point Exceptions
Invalid (if SNaN Operands), Denormal

Other Exceptions
See Exceptions Type 3; additionally
#UD If VEX.vvvv != 1111B.
### UNPCKHPD- Unpack and Interleave High Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 15 /r UNPCKHPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Unpacks and Interleaves double-precision floating-point values from high quadwords of xmm1 and xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 15 /r VUNPCKHPD xmm1, xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves double precision floating-point values from high quadwords of xmm2 and xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 15 /r VUNPCKHPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves double precision floating-point values from high quadwords of ymm2 and ymm3/m256.</td>
</tr>
</tbody>
</table>

**Description**

Performs an interleaved unpack of the high double-precision floating-point values from the first source operand and the second source operand. See Figure 4-15 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.

**128-bit versions**

When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

**Operation**

VUNPCKHPD (VEX.256 encoded version)

DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[191:128] ← SRC1[255:192]
DEST[255:192] ← SRC2[255:192]

**VUNPCKHPD (VEX.128 encoded version)**
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[255:128] ← 0

**UNPCKHPD (128-bit Legacy SSE version)**
DEST[63:0] ← SRC1[127:64]
DEST[127:64] ← SRC2[127:64]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

VUNPCKHPD __m256d _mm256_unpackhi_pd(__m256d a, __m256d b)
UNPCKHPD __m128d _mm_unpackhi_pd(__m128d a, __m128d b)

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**
See Exceptions Type 4
### UNPCKHPS- Unpack and Interleave High Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 15 /r UNPCKHPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm1 and xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 15 /r VUNPCKHPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves single-precision floating-point values from high quadwords of xmm2 and xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 15 /r VUNPCKHPS ymm1,ymm2,ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves single-precision floating-point values from high quadwords of ymm2 and ymm3/m256.</td>
</tr>
</tbody>
</table>

**Description**

Performs an interleaved unpack of the high single-precision floating-point values from the first source operand and the second source operand.

When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.

Figure 5-29. VUNPCKHPS Operation

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

VUNPCKHPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[159:128] ← SRC1[223:192]
DEST[191:160] ← SRC2[223:192]
DEST[223:192] ← SRC1[255:224]
DEST[255:224] ← SRC2[255:224]

VUNPCKHPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[255:128] ← 0

UNPCKHPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[95:64]
DEST[63:32] ← SRC2[95:64]
DEST[95:64] ← SRC1[127:96]
DEST[127:96] ← SRC2[127:96]
DEST[255:128] (Unmodified)
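
For the 256-bit form, the interleave is applied to each 128-bit half independently rather than across the whole register. The sketch below models this in portable C under the assumption that a YMM register is viewed as eight float lanes (lane 0 = bits 31:0); `ymm_f32` and `model_vunpckhps_256` are hypothetical names for illustration only.

```c
/* Hypothetical model of a 256-bit YMM register as eight 32-bit lanes. */
typedef struct { float lane[8]; } ymm_f32;

/* VUNPCKHPS ymm1, ymm2, ymm3/m256: within each 128-bit half,
   interleave the two high single-precision values of src1 and src2. */
ymm_f32 model_vunpckhps_256(ymm_f32 src1, ymm_f32 src2)
{
    ymm_f32 dest;
    for (int half = 0; half < 2; half++) {
        int base = 4 * half;              /* first lane of this 128-bit half */
        dest.lane[base + 0] = src1.lane[base + 2];
        dest.lane[base + 1] = src2.lane[base + 2];
        dest.lane[base + 2] = src1.lane[base + 3];
        dest.lane[base + 3] = src2.lane[base + 3];
    }
    return dest;
}
```

With src1 lanes numbered 0 to 7 and src2 lanes numbered 10 to 17, the result lanes are 2, 12, 3, 13 in the low half and 6, 16, 7, 17 in the high half.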

Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKHPS __m256 _mm256_unpackhi_ps (__m256 a, __m256 b);
UNPCKHPS __m128 _mm_unpackhi_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4
UNPCKLPD- Unpack and Interleave Low Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 14 /r UNPCKLPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm1 and xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 14 /r VUNPCKLPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves double-precision floating-point values from low quadwords of xmm2 and xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 14 /r VUNPCKLPD ymm1,ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves double-precision floating-point values from low quadwords of ymm2 and ymm3/m256.</td>
</tr>
</tbody>
</table>

Description
Performs an interleaved unpack of the low double-precision floating-point values from the first source operand and the second source operand.

128-bit versions:
When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
VUNPCKLPD (VEX.256 encoded version)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[191:128] ← SRC1[191:128]
DEST[255:192] ← SRC2[191:128]

VUNPCKLPD (VEX.128 encoded version)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[255:128] ← 0

UNPCKLPD (128-bit Legacy SSE version)
DEST[63:0] ← SRC1[63:0]
DEST[127:64] ← SRC2[63:0]
DEST[255:128] (Unmodified)
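
A common use of the low and high unpack pair is to transpose 2x2 blocks of doubles within each 128-bit half. The sketch below models the 256-bit forms of both operations in portable C to make the lane movement concrete. It is an illustration under the assumption that a YMM register is viewed as four double lanes (lane 0 = bits 63:0); the type and function names are hypothetical.

```c
/* Hypothetical model of a 256-bit YMM register as four 64-bit lanes. */
typedef struct { double lane[4]; } ymm_f64;

/* VUNPCKLPD ymm1, ymm2, ymm3/m256: low quadword of each 128-bit half. */
ymm_f64 model_vunpcklpd_256(ymm_f64 src1, ymm_f64 src2)
{
    ymm_f64 dest;
    dest.lane[0] = src1.lane[0];
    dest.lane[1] = src2.lane[0];
    dest.lane[2] = src1.lane[2];
    dest.lane[3] = src2.lane[2];
    return dest;
}

/* VUNPCKHPD ymm1, ymm2, ymm3/m256: high quadword of each 128-bit half. */
ymm_f64 model_vunpckhpd_256(ymm_f64 src1, ymm_f64 src2)
{
    ymm_f64 dest;
    dest.lane[0] = src1.lane[1];
    dest.lane[1] = src2.lane[1];
    dest.lane[2] = src1.lane[3];
    dest.lane[3] = src2.lane[3];
    return dest;
}
```

If src1 holds one matrix row pair and src2 the next, the low unpack collects the even-indexed elements of each half and the high unpack collects the odd-indexed ones.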

Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKLPD __m256d _mm256_unpacklo_pd(__m256d a, __m256d b)
UNPCKLPD __m128d _mm_unpacklo_pd(__m128d a, __m128d b)

SIMD Floating-Point Exceptions

None

Other Exceptions
See Exceptions Type 4
**UNPCKLPS- Unpack and Interleave Low Packed Single-Precision Floating-Point Values**

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 14 /r UNPCKLPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm1 and xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 14 /r VUNPCKLPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves single-precision floating-point values from low quadwords of xmm2 and xmm3/m128.</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 14 /r VUNPCKLPS ymm1,ymm2,ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Unpacks and Interleaves single-precision floating-point values from low quadwords of ymm2 and ymm3/m256.</td>
</tr>
</tbody>
</table>

**Description**

Performs an interleaved unpack of the low single-precision floating-point values from the first source operand and the second source operand.

When unpacking from a memory operand, an implementation may fetch only the appropriate 64 bits; however, alignment to 16-byte boundary and normal segment checking will still be enforced.
Figure 5-30. VUNPCKLPS Operation

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
VUNPCKLPS (VEX.256 encoded version)
DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]
DEST[159:128] ← SRC1[159:128]
DEST[191:160] ← SRC2[159:128]
DEST[223:192] ← SRC1[191:160]
DEST[255:224] ← SRC2[191:160]

VUNPCKLPS (VEX.128 encoded version)
DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]
DEST[255:128] ← 0

UNPCKLPS (128-bit Legacy SSE version)
DEST[31:0] ← SRC1[31:0]
DEST[63:32] ← SRC2[31:0]
DEST[95:64] ← SRC1[63:32]
DEST[127:96] ← SRC2[63:32]
DEST[255:128] (Unmodified)

Intel C/C++ Compiler Intrinsic Equivalent

VUNPCKLPS __m256 _mm256_unpacklo_ps (__m256 a, __m256 b);
UNPCKLPS __m128 _mm_unpacklo_ps (__m128 a, __m128 b);

SIMD Floating-Point Exceptions

None

Other Exceptions

See Exceptions Type 4

XORPD- Bitwise Logical XOR of Packed Double Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 57/r XORPD xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE2</td>
<td>Return the bitwise logical XOR of packed double-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F 57 /r VXORPD xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical XOR of packed double-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F 57 /r VXORPD ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical XOR of packed double-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description

Performs a bitwise logical XOR of the two or four packed double-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation

**VXORPD (VEX.256 encoded version)**
DEST[63:0] ← SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[191:128] ← SRC1[191:128] BITWISE XOR SRC2[191:128]
DEST[255:192] ← SRC1[255:192] BITWISE XOR SRC2[255:192]


**VXORPD (VEX.128 encoded version)**
DEST[63:0] ← SRC1[63:0] BITWISE XOR SRC2[63:0]
DEST[127:64] ← SRC1[127:64] BITWISE XOR SRC2[127:64]
DEST[255:128] ← 0

**XORPD (128-bit Legacy SSE version)**
DEST[63:0] ← DEST[63:0] BITWISE XOR SRC[63:0]
DEST[127:64] ← DEST[127:64] BITWISE XOR SRC[127:64]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

VXORPD __m256d _mm256_xor_pd (__m256d a, __m256d b);
XORPD __m128d _mm_xor_pd (__m128d a, __m128d b);

**SIMD Floating-Point Exceptions**
None

**Other Exceptions**
See Exceptions Type 4
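
Because the XOR acts on raw bit patterns, a common use is sign manipulation. The sketch below models one 256-bit VXORPD as four independent 64-bit XORs. It assumes IEEE 754 binary64 `double`, and the names `xor_f64` and `vxorpd256_model` are hypothetical helpers introduced for this example.

```c
#include <stdint.h>
#include <string.h>

/* XOR the bit patterns of two doubles (one 64-bit element of XORPD).
 * memcpy is used to reinterpret bits without violating aliasing rules. */
static double xor_f64(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ua ^= ub;
    memcpy(&a, &ua, sizeof a);
    return a;
}

/* Scalar model of VXORPD (VEX.256 encoded version): four 64-bit XORs. */
void vxorpd256_model(const double src1[4], const double src2[4], double dest[4])
{
    for (int i = 0; i < 4; ++i)
        dest[i] = xor_f64(src1[i], src2[i]);
}
```

Under these assumptions, XORing with a vector of `-0.0` (only the sign bit set) negates every element, and XORing a vector with itself yields all-zero bits, which is why the instruction is often used to clear a register.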

XORPS - Bitwise Logical XOR of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>0F 57 /r XORPS xmm1, xmm2/m128</td>
<td>V/V</td>
<td>SSE</td>
<td>Return the bitwise logical XOR of packed single-precision floating-point values in xmm1 and xmm2/mem</td>
</tr>
<tr>
<td>VEX.NDS.128.0F 57 /r VXORPS xmm1,xmm2, xmm3/m128</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical XOR of packed single-precision floating-point values in xmm2 and xmm3/mem</td>
</tr>
<tr>
<td>VEX.NDS.256.0F 57 /r VXORPS ymm1, ymm2, ymm3/m256</td>
<td>V/V</td>
<td>AVX</td>
<td>Return the bitwise logical XOR of packed single-precision floating-point values in ymm2 and ymm3/mem</td>
</tr>
</tbody>
</table>

Description
Performs a bitwise logical XOR of the four or eight packed single-precision floating-point values from the first source operand and the second source operand, and stores the result in the destination operand.

VEX.256 encoded version: The first source operand is a YMM register. The second source operand can be a YMM register or a 256-bit memory location. The destination operand is a YMM register.

VEX.128 encoded version: The first source operand is an XMM register. The second source operand is an XMM register or a 128-bit memory location. The destination operand is an XMM register. The upper bits (255:128) of the corresponding YMM register destination are zeroed.

128-bit Legacy SSE version: The second source can be an XMM register or a 128-bit memory location. The destination is not distinct from the first source XMM register, and the upper bits (255:128) of the corresponding YMM register destination are unmodified.

Operation
**VXORPS (VEX.256 encoded version)**
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[159:128] ← SRC1[159:128] BITWISE XOR SRC2[159:128]
DEST[191:160] ← SRC1[191:160] BITWISE XOR SRC2[191:160]
DEST[223:192] ← SRC1[223:192] BITWISE XOR SRC2[223:192]
DEST[255:224] ← SRC1[255:224] BITWISE XOR SRC2[255:224]

**VXORPS (VEX.128 encoded version)**
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[255:128] ← 0

**XORPS (128-bit Legacy SSE version)**
DEST[31:0] ← SRC1[31:0] BITWISE XOR SRC2[31:0]
DEST[63:32] ← SRC1[63:32] BITWISE XOR SRC2[63:32]
DEST[95:64] ← SRC1[95:64] BITWISE XOR SRC2[95:64]
DEST[127:96] ← SRC1[127:96] BITWISE XOR SRC2[127:96]
DEST[255:128] (Unmodified)

**Intel C/C++ Compiler Intrinsic Equivalent**

VXORPS __m256 _mm256_xor_ps (__m256 a, __m256 b);
XORPS __m128 _mm_xor_ps (__m128 a, __m128 b);

**SIMD Floating-Point Exceptions**

None

**Other Exceptions**

See Exceptions Type 4
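
The practical difference between the VEX.128 and legacy SSE encodings is what happens to bits 255:128 of the destination. The following sketch contrasts the two on a simplified register model. The type `ymm_t` and both function names are assumptions made for this illustration; the elements are treated as raw 32-bit patterns.

```c
#include <stdint.h>

/* Hypothetical model of one YMM register as eight 32-bit elements;
 * e[0..3] are bits 127:0 (the XMM part), e[4..7] are bits 255:128. */
typedef struct { uint32_t e[8]; } ymm_t;

/* VXORPS xmm1, xmm2, xmm3/m128 (VEX.128): three-operand form,
 * upper 128 bits of the destination are zeroed. */
void vxorps128_model(ymm_t *dest, const ymm_t *src1, const ymm_t *src2)
{
    for (int i = 0; i < 4; ++i)
        dest->e[i] = src1->e[i] ^ src2->e[i];
    for (int i = 4; i < 8; ++i)
        dest->e[i] = 0;
}

/* XORPS xmm1, xmm2/m128 (legacy SSE): destination is also the first
 * source, upper 128 bits of the destination are left unmodified. */
void xorps_legacy_model(ymm_t *dest, const ymm_t *src)
{
    for (int i = 0; i < 4; ++i)
        dest->e[i] ^= src->e[i];
}
```

In this model, running the legacy form on a register whose upper half holds live data preserves that data, while the VEX.128 form clears it.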

VZEROALL - Zero All YMM Registers

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.256.0F 77 VZEROALL</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero all YMM registers</td>
</tr>
</tbody>
</table>

Description

The instruction zeros the contents of all YMM registers, including their lower halves (the XMM registers).

Note: VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD.

In compatibility and legacy 32-bit modes, only the lower 8 registers are modified.

Operation

**VZEROALL (VEX.256 encoded version)**

IF (64-bit mode)

YMM0[255:0] ← 0
YMM1[255:0] ← 0
YMM2[255:0] ← 0
YMM3[255:0] ← 0
YMM4[255:0] ← 0
YMM5[255:0] ← 0
YMM6[255:0] ← 0
YMM7[255:0] ← 0
YMM8[255:0] ← 0
YMM9[255:0] ← 0
YMM10[255:0] ← 0
YMM11[255:0] ← 0
YMM12[255:0] ← 0
YMM13[255:0] ← 0
YMM14[255:0] ← 0
YMM15[255:0] ← 0

ELSE

YMM0[255:0] ← 0
YMM1[255:0] ← 0
YMM2[255:0] ← 0
YMM3[255:0] ← 0
YMM4[255:0] ← 0
YMM5[255:0] ← 0
YMM6[255:0] ← 0
YMM7[255:0] ← 0
YMM8-15: Unmodified
FI

Intel C/C++ Compiler Intrinsic Equivalent
VZEROALL _mm256_zeroall()

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 8
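
The mode dependence above can be summarized in a few lines. This is a minimal sketch over a hypothetical register-file model (`regfile_t`, `vzeroall_model`, and the `mode64` flag are names invented here); it does not execute the instruction itself.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { YMM_COUNT = 16, YMM_BYTES = 32 };

/* Hypothetical model of the YMM register file: 16 registers of 256 bits. */
typedef struct { uint8_t ymm[YMM_COUNT][YMM_BYTES]; } regfile_t;

/* Model of VZEROALL: clears all 256 bits of YMM0-15 in 64-bit mode,
 * and of YMM0-7 only in compatibility and legacy 32-bit modes. */
void vzeroall_model(regfile_t *rf, bool mode64)
{
    size_t count = mode64 ? 16 : 8;
    memset(rf->ymm, 0, count * YMM_BYTES);
}
```

Outside 64-bit mode, YMM8-15 retain whatever they held before, as the ELSE branch of the pseudocode states.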

VZEROUPPER - Zero Upper Bits of YMM Registers

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.128.0F 77 VZEROUPPER</td>
<td>V/V</td>
<td>AVX</td>
<td>Zero upper 128 bits of all YMM registers</td>
</tr>
</tbody>
</table>

Description

The instruction zeros the upper 128 bits of all YMM registers. The lower 128 bits of the registers (the corresponding XMM registers) are unmodified. This instruction is recommended when transitioning between AVX and legacy SSE code; it eliminates performance penalties caused by false dependencies.

Note: VEX.vvvv is reserved and must be 1111b; otherwise the instruction will #UD. In compatibility and legacy 32-bit modes, only the lower 8 registers are modified.

Operation

VZEROUPPER

IF (64-bit mode)

YMM0[255:128] ← 0
YMM1[255:128] ← 0
YMM2[255:128] ← 0
YMM3[255:128] ← 0
YMM4[255:128] ← 0
YMM5[255:128] ← 0
YMM6[255:128] ← 0
YMM7[255:128] ← 0
YMM8[255:128] ← 0
YMM9[255:128] ← 0
YMM10[255:128] ← 0
YMM11[255:128] ← 0
YMM12[255:128] ← 0
YMM13[255:128] ← 0
YMM14[255:128] ← 0
YMM15[255:128] ← 0

ELSE

YMM0[255:128] ← 0
YMM1[255:128] ← 0
YMM2[255:128] ← 0
YMM3[255:128] ← 0
YMM4[255:128] ← 0
YMM5[255:128] ← 0
YMM6[255:128] ← 0
YMM7[255:128] ← 0
YMM8-15: unmodified
FI

Intel C/C++ Compiler Intrinsic Equivalent
VZEROUPPER _mm256_zeroupper()

SIMD Floating-Point Exceptions
None

Other Exceptions
See Exceptions Type 8
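
To make the "upper bits only" behavior concrete, the sketch below clears bytes 16 through 31 of each modeled register and leaves bytes 0 through 15 (the XMM part) alone. The function name `vzeroupper_model`, the byte-array register layout, and the `mode64` flag are assumptions for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Model of VZEROUPPER on a register file of 16 registers x 32 bytes.
 * Byte 0 holds bits 7:0, so bytes 16..31 hold bits 255:128. */
void vzeroupper_model(uint8_t ymm[16][32], bool mode64)
{
    int count = mode64 ? 16 : 8;  /* YMM8-15 untouched outside 64-bit mode */
    for (int r = 0; r < count; ++r)
        memset(&ymm[r][16], 0, 16);  /* clear only the upper 128 bits */
}
```

In real code the equivalent step is typically a call to `_mm256_zeroupper()` placed after the last 256-bit AVX instruction and before any legacy SSE code runs.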
6.1 FMA INSTRUCTION SET REFERENCE

This section describes FMA instructions in detail. Conventions and notations of instruction format can be found in Section 5.1.

VFMADD132PD/VFMADD213PD/VFMADD231PD - Fused Multiply-Add of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 98 /r VFMADD132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A8 /r VFMADD213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B8 /r VFMADD231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 98 /r VFMADD132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A8 /r VFMADD213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B8 /r VFMADD231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, add to ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

Performs a set of SIMD multiply-add computations on packed double-precision floating-point values using three source operands and writes the multiply-add results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMADD132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADD231PD: Multiplies the two or four packed double-precision floating-point values from the second source to the two or four packed double-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, “FMA Instruction Operand Order and Arithmetic Behavior”.

**Operation**

In the operations below, the “*” and “+” symbols represent multiplication and addition, respectively, with infinite precision inputs and outputs (no rounding).

**VFMADD132PD DEST, SRC2, SRC3**

IF (VEX.128) THEN
   MAXVL = 2
ELSEIF (VEX.256)
   MAXVL = 4
FI

For i = 0 to MAXVL-1 {
   n = 64*i;
   DEST[n+63:n] ← RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] + SRC2[n+63:n])
}
IF (VEX.128) THEN
   DEST[255:128] ← 0
FI

**VFMADD213PD DEST, SRC2, SRC3**

IF (VEX.128) THEN
   MAXVL = 2
ELSEIF (VEX.256)
   MAXVL = 4
FI

For i = 0 to MAXVL-1 {
   n = 64*i;
   DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] + SRC3[n+63:n])
}
IF (VEX.128) THEN
   DEST[255:128] ← 0
FI

**VFMADD231PD DEST, SRC2, SRC3**

IF (VEX.128) THEN
   MAXVL = 2
ELSEIF (VEX.256)
   MAXVL = 4
FI

For i = 0 to MAXVL-1 {
   n = 64*i;
   DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] + DEST[n+63:n])
}
IF (VEX.128) THEN
   DEST[255:128] ← 0
FI

Intel C/C++ Compiler Intrinsic Equivalent

VFMADD132PD __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c);
VFMADD213PD __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c);
VFMADD231PD __m128d _mm_fmadd_pd (__m128d a, __m128d b, __m128d c);
VFMADD132PD __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c);
VFMADD213PD __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c);
VFMADD231PD __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

See Exceptions Type 2
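
The digits in each mnemonic name which operands are multiplied and which is added: 132 means operand 1 times operand 3 plus operand 2, and so on. The sketch below models only that operand selection. The enum `fma_form_t` and the function `vfmadd_pd_model` are hypothetical names, and the plain `*` and `+` used here do not guarantee the single rounding of the real instruction; C's `fma()` from `<math.h>` would be the closer stand-in for the arithmetic.

```c
#include <assert.h>

/* Which two operands are multiplied and which one is added. */
typedef enum { FMA132, FMA213, FMA231 } fma_form_t;

/* Model of VFMADDxxxPD operand order. Arrays hold four doubles
 * (bits 255:0); lanes is 2 for VEX.128 and 4 for VEX.256.
 * Illustrates operand order only, not fused rounding. */
void vfmadd_pd_model(fma_form_t form, int lanes,
                     double dest[4], const double src2[4], const double src3[4])
{
    assert(lanes == 2 || lanes == 4);
    for (int i = 0; i < lanes; ++i) {
        double d = dest[i];  /* DEST is also the first source operand */
        switch (form) {
        case FMA132: dest[i] = d * src3[i] + src2[i];       break;
        case FMA213: dest[i] = src2[i] * d + src3[i];       break;
        case FMA231: dest[i] = src2[i] * src3[i] + d;       break;
        }
    }
    /* VEX.128: DEST[255:128] <- 0 (loop is empty when lanes == 4). */
    for (int i = lanes; i < 4; ++i)
        dest[i] = 0.0;
}
```

Having three forms lets a compiler or programmer choose which of the three inputs is overwritten by the result, and which one may come from memory (always the third operand).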

VFMADD132PS/VFMADD213PS/VFMADD231PS - Fused Multiply-Add of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>64/32 bit Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 98 /r VFMADD132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A8 /r VFMADD213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B8 /r VFMADD231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 98 /r VFMADD132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A8 /r VFMADD213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B8 /r VFMADD231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, add to ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

Performs a set of SIMD multiply-add computations on packed single-precision floating-point values using three source operands and writes the multiply-add results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMADD132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the four or eight packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMADD213PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand, adds the infinite precision intermediate result to the four or eight packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMADD231PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand, adds the infinite precision intermediate result to the four or eight packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, “FMA Instruction Operand Order and Arithmetic Behavior”.

Operation
In the operations below, the “*” and “+” symbols represent multiplication and addition, respectively, with infinite precision inputs and outputs (no rounding).

**VFMADD132PS DEST, SRC2, SRC3**

IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI

For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMADD213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] + SRC3[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMADD231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

Intel C/C++ Compiler Intrinsic Equivalent
VFMADD132PS __m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c);
VFMADD213PS __m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c);
VFMADD231PS __m128 _mm_fmadd_ps (__m128 a, __m128 b, __m128 c);
VFMADD132PS __m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c);
VFMADD213PS __m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c);
VFMADD231PS __m256 _mm256_fmadd_ps (__m256 a, __m256 b, __m256 c);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

See Exceptions Type 2

VFMADD132SD/VFMADD213SD/VFMADD231SD - Fused Multiply-Add of Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 99 /r VFMADD132SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A9 /r VFMADD213SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm1, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B9 /r VFMADD231SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD multiply-add computation on the low packed double-precision floating-point values using three source operands and writes the multiply-add result in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMADD132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low packed double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMADD213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand, adds the infinite precision intermediate result to the low packed double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMADD231SD: Multiplies the low packed double-precision floating-point value from the second source to the low packed double-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low packed double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 64-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, “FMA Instruction Operand Order and Arithmetic Behavior”.

**Operation**
In the operations below, the “*” and “+” symbols represent multiplication and addition, respectively, with infinite precision inputs and outputs (no rounding).

**VFMADD132SD DEST, SRC2, SRC3**

DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

**VFMADD213SD DEST, SRC2, SRC3**

DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

**VFMADD231SD DEST, SRC2, SRC3**

DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VFMADD132SD __m128d _mm_fmadd_sd (__m128d a, __m128d b, __m128d c);
VFMADD213SD __m128d _mm_fmadd_sd (__m128d a, __m128d b, __m128d c);
VFMADD231SD __m128d _mm_fmadd_sd (__m128d a, __m128d b, __m128d c);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

See Exceptions Type 3

VFMADD132SS/VFMADD213SS/VFMADD231SS - Fused Multiply-Add of Scalar Single-Precision Floating-Point Values

**Description**

Performs a SIMD multiply-add computation on the low packed single-precision floating-point values using three source operands and writes the multiply-add result in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

**VFMADD132SS**: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low packed single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

**VFMADD213SS**: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand, adds the infinite precision intermediate result to the low packed single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

**VFMADD231SS**: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand, adds the infinite precision intermediate result to the low packed single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 32-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, “FMA Instruction Operand Order and Arithmetic Behavior”.

Operation
In the operations below, the “*” and “+” symbols represent multiplication and addition, respectively, with infinite precision inputs and outputs (no rounding).

**VFMADD132SS DEST, SRC2, SRC3**
DEST[31:0] ← RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] + SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**VFMADD213SS DEST, SRC2, SRC3**
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] + SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**VFMADD231SS DEST, SRC2, SRC3**
DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] + DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VFMADD132SS __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c);
VFMADD213SS __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c);
VFMADD231SS __m128 _mm_fmadd_ss (__m128 a, __m128 b, __m128 c);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
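
The phrase "adds the infinite precision intermediate result ... performs rounding" means the product is not rounded before the addition. The sketch below contrasts that with a separate multiply and add for one pair of single-precision inputs. Both function names are hypothetical. The "fused" stand-in simply evaluates in `double`, which happens to be exact for the inputs discussed here; it is not a general single-rounding FMA and is not how the hardware computes the result. Round-to-nearest-even and IEEE 754 `float`/`double` are assumed.

```c
/* Unfused: the product is rounded to single precision, then added.
 * The volatile store forces the rounded product to be materialized so
 * the compiler cannot contract the two operations into one. */
float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Stand-in for the fused result, computed in double precision.
 * The product of two floats is always exact in double, but the sum may
 * round, so this matches a true FMA only when the sum is also exact. */
float fused_via_double(float a, float b, float c)
{
    double exact = (double)a * (double)b + (double)c;
    return (float)exact;
}
```

For a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding that product to single precision drops the 2^-24 term, so `mul_then_add` returns 0, whereas `fused_via_double` returns 2^-24, the value a fused multiply-add would deliver.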

VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD - Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 96 /r VFMADDSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, add/subtract elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A6 /r VFMADDSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, add/subtract elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B6 /r VFMADDSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add/subtract elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 96 /r VFMADDSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, add/subtract elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A6 /r VFMADDSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, add/subtract elements in ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B6 /r VFMADDSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, add/subtract elements in ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

**Description**

VFMADDSUB132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADDSUB213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMADDSUB231PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd double-precision floating-point elements and subtracts the even double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFMADDSUB132PD DEST, SRC2, SRC3

IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] + SRC2[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] - SRC2[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] + SRC2[255:192])
FI

VFMADDSUB213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] + SRC3[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] - SRC3[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] + SRC3[255:192])
FI

VFMADDSUB231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] + DEST[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] - DEST[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] + DEST[255:192])
FI
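
As an informal cross-check of the lane pattern described for VFMADDSUB213PD, the following C sketch models the operation on plain arrays. The function name `model_fmaddsub213_pd` and its parameters are hypothetical and are not part of any Intel API. The sketch uses separate multiply and add operations, so unlike the instruction it may round twice; the two agree only when the product is exactly representable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of VFMADDSUB213PD (not an Intel API).
 * Each lane computes src2 * dest, then subtracts src3 in even lanes
 * and adds src3 in odd lanes. lanes is 2 (VEX.128) or 4 (VEX.256).
 * Assumption: plain '*', '+' and '-' stand in for the fused operation,
 * so this model may round twice where the hardware rounds once. */
static void model_fmaddsub213_pd(double *dest, const double *src2,
                                 const double *src3, size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; ++i) {
        double product = src2[i] * dest[i];
        dest[i] = (i % 2 == 0) ? product - src3[i]   /* even lane: subtract */
                               : product + src3[i];  /* odd lane: add */
    }
}
```

Calling the model with `dest = {1, 2, 3, 4}`, `src2 = {2, 2, 2, 2}` and `src3 = {10, 10, 10, 10}` for four lanes leaves `{-8, 14, -4, 18}` in `dest`.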

Intel C/C++ Compiler Intrinsic Equivalent

VFMADDSUB132PD __m128d _mm_fmaddsub_pd (__m128d a, __m128d b, __m128d c);
VFMADDSUB213PD __m128d _mm_fmaddsub_pd (__m128d a, __m128d b, __m128d c);
VFMADDSUB231PD __m128d _mm_fmaddsub_pd (__m128d a, __m128d b, __m128d c);
VFMADDSUB132PD __m256d _mm256_fmaddsub_pd (__m256d a, __m256d b, __m256d c);
VFMADDSUB213PD __m256d _mm256_fmaddsub_pd (__m256d a, __m256d b, __m256d c);
VFMADDSUB231PD __m256d _mm256_fmaddsub_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
### VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS - Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 96 /r VFMADDSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, add/subtract elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A6 /r VFMADDSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, add/subtract elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B6 /r VFMADDSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, add/subtract elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 96 /r VFMADDSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, add/subtract elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A6 /r VFMADDSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, add/subtract elements in ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B6 /r VFMADDSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, add/subtract elements in ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

VFMADDSUB132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMADDSUB213PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMADDSUB231PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, adds the odd single-precision floating-point elements and subtracts the even single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.
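
The zeroing of the upper half by the VEX.128 form can be pictured with a small C sketch. The type `ymm_model` and the helper `write_vex128_result` are hypothetical names introduced only for illustration; the sketch assumes a YMM register is viewed as eight single-precision lanes and that an all-zero bit pattern represents 0.0f.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of a 256-bit YMM register as eight floats. */
typedef struct { float lane[8]; } ymm_model;

/* Hypothetical helper: stores a 128-bit result (four floats) in the
 * low half of dest and clears bits 255:128, mirroring what the text
 * says about the VEX.128 encoded version. */
static void write_vex128_result(ymm_model *dest, const float result[4])
{
    assert(dest != NULL && result != NULL);
    memcpy(dest->lane, result, 4 * sizeof(float));   /* bits 127:0 */
    memset(&dest->lane[4], 0, 4 * sizeof(float));    /* bits 255:128 */
}
```

Whatever the upper four lanes held before the call, they read as zero afterwards, so a later 256-bit operation on the same register does not see stale data.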

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFMADDSUB132PS DEST, SRC2, SRC3

IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI

For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] + SRC2[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMADDSUB213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
   MAXVL = 2
ELSEIF (VEX.256)  
   MAXVL = 4
FI
For i = 0 to MAXVL-1{
   n = 64*i;
   DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n])
   DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] + SRC3[n+63:n+32])
}
IF (VEX.128) THEN
   DEST[255:128] ← 0
FI

VFMADDSUB231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
   MAXVL = 2
ELSEIF (VEX.256)  
   MAXVL = 4
FI
For i = 0 to MAXVL-1{
   n = 64*i;
   DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n])
   DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] + DEST[n+63:n+32])
}
IF (VEX.128) THEN
   DEST[255:128] ← 0
FI
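
The three operand-order forms differ only in which operands are multiplied and which is added or subtracted. The C sketch below puts them side by side for the single-precision case. The enum `fma_form` and the function `model_fmaddsub_ps` are hypothetical names, not Intel intrinsics, and the arithmetic is unfused, so results match the instruction only when no intermediate rounding occurs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selector for the three operand-order forms. */
enum fma_form { FORM_132, FORM_213, FORM_231 };

/* Hypothetical scalar model of VFMADDSUB{132,213,231}PS.
 * lanes is 4 (VEX.128) or 8 (VEX.256).
 * Assumption: unfused arithmetic stands in for the fused operation. */
static void model_fmaddsub_ps(enum fma_form form, float *dest,
                              const float *src2, const float *src3,
                              size_t lanes)
{
    assert(lanes == 4 || lanes == 8);
    for (size_t i = 0; i < lanes; ++i) {
        float x, y, z;  /* each lane computes x*y - z (even) or x*y + z (odd) */
        switch (form) {
        case FORM_132: x = dest[i]; y = src3[i]; z = src2[i]; break;
        case FORM_213: x = src2[i]; y = dest[i]; z = src3[i]; break;
        default:       x = src2[i]; y = src3[i]; z = dest[i]; break; /* FORM_231 */
        }
        dest[i] = (i % 2 == 0) ? x * y - z : x * y + z;
    }
}
```

The digits in the mnemonic name the operand positions: in the 132 form, operands 1 and 3 are multiplied and operand 2 is the addend; in 213, operands 2 and 1 are multiplied and operand 3 is the addend; in 231, operands 2 and 3 are multiplied and operand 1 is the addend.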

Intel C/C++ Compiler Intrinsic Equivalent
VFMADDSUB132PS __m128 _mm_fmaddsub_ps (__m128 a, __m128 b, __m128 c);
VFMADDSUB213PS __m128 _mm_fmaddsub_ps (__m128 a, __m128 b, __m128 c);
VFMADDSUB231PS __m128 _mm_fmaddsub_ps (__m128 a, __m128 b, __m128 c);
VFMADDSUB132PS __m256 _mm256_fmaddsub_ps (__m256 a, __m256 b, __m256 c);
VFMADDSUB213PS __m256 _mm256_fmaddsub_ps (__m256 a, __m256 b, __m256 c);
VFMADDSUB231PS __m256 _mm256_fmaddsub_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
### VFMSUBADD132PD/VFMSUBADD213PD/VFMSUBADD231PD - Fused Multiply-Alternating Subtract/Add of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 97 /r VFMSUBADD132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, subtract/add elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A7 /r VFMSUBADD213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, subtract/add elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B7 /r VFMSUBADD231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, subtract/add elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 97 /r VFMSUBADD132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, subtract/add elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A7 /r VFMSUBADD213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, subtract/add elements in ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B7 /r VFMSUBADD231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, subtract/add elements in ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

**Description**

**VFMSUBADD132PD**: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VFMSUBADD213PD**: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VFMSUBADD231PD**: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd double-precision floating-point elements and adds the even double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VEX.256 encoded version**: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

**VEX.128 encoded version**: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

**VFMSUBADD132PD DEST, SRC2, SRC3**

IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] + SRC2[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(DEST[127:64]*SRC3[127:64] - SRC2[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(DEST[191:128]*SRC3[191:128] + SRC2[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(DEST[255:192]*SRC3[255:192] - SRC2[255:192])
FI

VFMSUBADD213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] + SRC3[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*DEST[127:64] - SRC3[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*DEST[191:128] + SRC3[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*DEST[255:192] - SRC3[255:192])
FI

VFMSUBADD231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
    DEST[255:128] ← 0
ELSEIF (VEX.256)
    DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] + DEST[63:0])
    DEST[127:64] ← RoundFPControl_MXCSR(SRC2[127:64]*SRC3[127:64] - DEST[127:64])
    DEST[191:128] ← RoundFPControl_MXCSR(SRC2[191:128]*SRC3[191:128] + DEST[191:128])
    DEST[255:192] ← RoundFPControl_MXCSR(SRC2[255:192]*SRC3[255:192] - DEST[255:192])
FI
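
For comparison with the add/subtract instructions, here is a minimal C sketch of the 231 form of the subtract/add pattern, in which even lanes add and odd lanes subtract. The function `model_fmsubadd231_pd` is a hypothetical name used for illustration only, and the unfused arithmetic is an assumption that holds only while products are exactly representable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of VFMSUBADD231PD (not an Intel API).
 * Each lane computes src2 * src3, then adds dest in even lanes and
 * subtracts dest in odd lanes. lanes is 2 (VEX.128) or 4 (VEX.256).
 * Assumption: unfused arithmetic stands in for the fused operation. */
static void model_fmsubadd231_pd(double *dest, const double *src2,
                                 const double *src3, size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; ++i) {
        double product = src2[i] * src3[i];
        dest[i] = (i % 2 == 0) ? product + dest[i]   /* even lane: add */
                               : product - dest[i];  /* odd lane: subtract */
    }
}
```

With `dest = {1, 1}`, `src2 = {3, 3}` and `src3 = {4, 4}` the two-lane call leaves `{13, 11}` in `dest`.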

Intel C/C++ Compiler Intrinsic Equivalent

VFMSUBADD132PD __m128d _mm_fmsubadd_pd (__m128d a, __m128d b, __m128d c);
VFMSUBADD213PD __m128d _mm_fmsubadd_pd (__m128d a, __m128d b, __m128d c);
VFMSUBADD231PD __m128d _mm_fmsubadd_pd (__m128d a, __m128d b, __m128d c);
VFMSUBADD132PD __m256d _mm256_fmsubadd_pd (__m256d a, __m256d b, __m256d c);
VFMSUBADD213PD __m256d _mm256_fmsubadd_pd (__m256d a, __m256d b, __m256d c);
VFMSUBADD231PD __m256d _mm256_fmsubadd_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
### VFMSUBADD132PS/VFMSUBADD213PS/VFMSUBADD231PS - Fused Multiply-Alternating Subtract/Add of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 97 /r VFMSUBADD132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, subtract/add elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A7 /r VFMSUBADD213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, subtract/add elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B7 /r VFMSUBADD231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, subtract/add elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 97 /r VFMSUBADD132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, subtract/add elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A7 /r VFMSUBADD213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, subtract/add elements in ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B7 /r VFMSUBADD231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, subtract/add elements in ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

VFMSUBADD132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMSUBADD213PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand. From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMSUBADD231PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the odd single-precision floating-point elements and adds the even single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

**VFMSUBADD132PS DEST, SRC2, SRC3**

IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI

For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] + SRC2[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(DEST[n+63:n+32]*SRC3[n+63:n+32] - SRC2[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMSUBADD213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] + SRC3[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*DEST[n+63:n+32] - SRC3[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMSUBADD231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] + DEST[n+31:n])
    DEST[n+63:n+32] ← RoundFPControl_MXCSR(SRC2[n+63:n+32]*SRC3[n+63:n+32] - DEST[n+63:n+32])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI
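
The single-precision pseudocode walks the register in 64-bit steps, so each iteration covers one even/odd pair of elements. The C sketch below mirrors that pairing for the 132 form, following the Description (even elements add, odd elements subtract). The function `model_fmsubadd132_ps` and the parameter `pairs` are hypothetical names; `pairs` plays the role of MAXVL, and the arithmetic is unfused by assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of VFMSUBADD132PS (not an Intel API).
 * pairs is 2 (VEX.128) or 4 (VEX.256), matching MAXVL in the
 * pseudocode; each pair is one 64-bit chunk holding two floats.
 * Assumption: unfused arithmetic stands in for the fused operation. */
static void model_fmsubadd132_ps(float *dest, const float *src2,
                                 const float *src3, size_t pairs)
{
    assert(pairs == 2 || pairs == 4);
    for (size_t i = 0; i < pairs; ++i) {
        size_t lo = 2 * i;      /* element at bits n+31:n    */
        size_t hi = 2 * i + 1;  /* element at bits n+63:n+32 */
        dest[lo] = dest[lo] * src3[lo] + src2[lo];  /* even element: add */
        dest[hi] = dest[hi] * src3[hi] - src2[hi];  /* odd element: subtract */
    }
}
```

With `dest = {1, 2, 3, 4}`, `src3 = {2, 2, 2, 2}` and `src2 = {1, 1, 1, 1}`, a two-pair call leaves `{3, 3, 7, 7}` in `dest`.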

Intel C/C++ Compiler Intrinsic Equivalent
VFMSUBADD132PS __m128 _mm_fmsubadd_ps (__m128 a, __m128 b, __m128 c);
VFMSUBADD213PS __m128 _mm_fmsubadd_ps (__m128 a, __m128 b, __m128 c);
VFMSUBADD231PS __m128 _mm_fmsubadd_ps (__m128 a, __m128 b, __m128 c);
VFMSUBADD132PS __m256 _mm256_fmsubadd_ps (__m256 a, __m256 b, __m256 c);
VFMSUBADD213PS __m256 _mm256_fmsubadd_ps (__m256 a, __m256 b, __m256 c);
VFMSUBADD231PS __m256 _mm256_fmsubadd_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
### VFMSUB132PD/VFMSUB213PD/VFMSUB231PD - Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9A /r VFMSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AA /r VFMSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BA /r VFMSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9A /r VFMSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AA /r VFMSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BA /r VFMSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, subtract ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

Performs a set of SIMD multiply-subtract computations on packed double-precision floating-point values using three source operands and writes the multiply-subtract results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMSUB132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMSUB213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. From the infinite precision intermediate result, subtracts the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFMSUB231PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, “FMA Instruction Operand Order and Arithmetic Behavior”.

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding)

VFMSUB132PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(DEST[n+63:n]*SRC3[n+63:n] - SRC2[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMSUB213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*DEST[n+63:n] - SRC3[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFMSUB231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR(SRC2[n+63:n]*SRC3[n+63:n] - DEST[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI
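
The per-lane behavior that the Description gives for VFMSUB132PD can be sketched in C as follows. The function `model_fmsub132_pd` is a hypothetical name, not an Intel intrinsic, and the sketch assumes that unfused arithmetic is an acceptable stand-in, which is exact only when the product needs no rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of VFMSUB132PD (not an Intel API).
 * Every lane computes dest * src3 - src2. lanes is 2 (VEX.128)
 * or 4 (VEX.256), matching MAXVL in the pseudocode.
 * Assumption: unfused arithmetic stands in for the fused operation. */
static void model_fmsub132_pd(double *dest, const double *src2,
                              const double *src3, size_t lanes)
{
    assert(lanes == 2 || lanes == 4);
    for (size_t i = 0; i < lanes; ++i) {
        dest[i] = dest[i] * src3[i] - src2[i];
    }
}
```

Unlike the alternating instructions, every lane applies the same subtraction, so `dest = {1.5, 2, 3, 4}` with `src3 = {2, 2, 2, 2}` and `src2 = {1, 1, 1, 1}` becomes `{2, 3, 5, 7}`.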

Intel C/C++ Compiler Intrinsic Equivalent

VFMSUB132PD __m128d _mm_fmsub_pd (__m128d a, __m128d b, __m128d c);
VFMSUB213PD __m128d _mm_fmsub_pd (__m128d a, __m128d b, __m128d c);
VFMSUB231PD __m128d _mm_fmsub_pd (__m128d a, __m128d b, __m128d c);
VFMSUB132PD __m256d _mm256_fmsub_pd (__m256d a, __m256d b, __m256d c);
VFMSUB213PD __m256d _mm256_fmsub_pd (__m256d a, __m256d b, __m256d c);
VFMSUB231PD __m256d _mm256_fmsub_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
### VFMSUB132PS/VFMSUB213PS/VFMSUB231PS - Fused Multiply-Subtract of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9A /r VFMSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AA /r VFMSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BA /r VFMSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 9A /r VFMSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AA /r VFMSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BA /r VFMSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, subtract ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

Performs a set of SIMD multiply-subtract computations on packed single-precision floating-point values using three source operands and writes the multiply-subtract results in the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMSUB132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the four or eight packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMSUB213PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand. From the infinite precision intermediate result, subtracts the four or eight packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFMSUB231PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand. From the infinite precision intermediate result, subtracts the four or eight packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFMSUB132PS DEST, SRC2, SRC3
IF (VEX.128) THEN
  MAXVL = 4
ELSEIF (VEX.256)
  MAXVL = 8
FI
For i = 0 to MAXVL-1 {
  n = 32*i;
  DEST[n+31:n] ← RoundFPControl_MXCSR(DEST[n+31:n]*SRC3[n+31:n] - SRC2[n+31:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

VFMSUB213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
  MAXVL = 4
ELSEIF (VEX.256)
  MAXVL = 8
FI
For i = 0 to MAXVL-1 {
  n = 32*i;
  DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*DEST[n+31:n] - SRC3[n+31:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

VFMSUB231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
  MAXVL = 4
ELSEIF (VEX.256)
  MAXVL = 8
FI
For i = 0 to MAXVL-1 {
  n = 32*i;
  DEST[n+31:n] ← RoundFPControl_MXCSR(SRC2[n+31:n]*SRC3[n+31:n] - DEST[n+31:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

Intel C/C++ Compiler Intrinsic Equivalent
VFMSUB132PS __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c);
VFMSUB213PS __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c);
VFMSUB231PS __m128 _mm_fmsub_ps (__m128 a, __m128 b, __m128 c);
VFMSUB132PS __m256 _mm256_fmsub_ps (__m256 a, __m256 b, __m256 c);
VFMSUB213PS __m256 _mm256_fmsub_ps (__m256 a, __m256 b, __m256 c);
VFMSUB231PS __m256 _mm256_fmsub_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions

See Exceptions Type 2

VFMSUB132SD/VFMSUB213SD/VFMSUB231SD - Fused Multiply-Subtract of Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9B /r VFMSUB132SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AB /r VFMSUB213SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm1, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BB /r VFMSUB231SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

Description

Performs a SIMD multiply-subtract computation on the low packed double-precision floating-point values using three source operands and writes the multiply-subtract result to the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMSUB132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMSUB213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFMSUB231SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 64-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFMSUB132SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(DEST[63:0]*SRC3[63:0] - SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFMSUB213SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*DEST[63:0] - SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFMSUB231SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(SRC2[63:0]*SRC3[63:0] - DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VFMSUB132SD __m128d _mm_fmsub_sd (__m128d a, __m128d b, __m128d c);
VFMSUB213SD __m128d _mm_fmsub_sd (__m128d a, __m128d b, __m128d c);
VFMSUB231SD __m128d _mm_fmsub_sd (__m128d a, __m128d b, __m128d c);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**

See Exceptions Type 3

VFMSUB132SS/VFMSUB213SS/VFMSUB231SS - Fused Multiply-Subtract of Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9B /r VFMSUB132SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AB /r VFMSUB213SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm1, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BB /r VFMSUB231SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

Description
Performs a SIMD multiply-subtract computation on the low packed single-precision floating-point values using three source operands and writes the multiply-subtract result to the destination operand. The destination operand is also the first source operand. The second operand must be a SIMD register. The third source operand can be a SIMD register or a memory location.

VFMSUB132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFMSUB213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFMSUB231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand. From the infinite precision intermediate result, subtracts the low packed single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 32-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

**VFMSUB132SS DEST, SRC2, SRC3**

DEST[31:0] ← RoundFPControl_MXCSR(DEST[31:0]*SRC3[31:0] - SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**VFMSUB213SS DEST, SRC2, SRC3**

DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*DEST[31:0] - SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**VFMSUB231SS DEST, SRC2, SRC3**

DEST[31:0] ← RoundFPControl_MXCSR(SRC2[31:0]*SRC3[31:0] - DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VFMSUB132SS __m128 _mm_fmsub_ss (__m128 a, __m128 b, __m128 c);
VFMSUB213SS __m128 _mm_fmsub_ss (__m128 a, __m128 b, __m128 c);
VFMSUB231SS __m128 _mm_fmsub_ss (__m128 a, __m128 b, __m128 c);

**SIMD Floating-Point Exceptions**

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3

VFNMADD132PD/VFNMADD213PD/VFNMADD231PD - Fused Negative Multiply-Add of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9C /r VFNMADD132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AC /r VFNMADD213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the multiplication result and add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BC /r VFNMADD231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9C /r VFNMADD132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AC /r VFNMADD213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, negate the multiplication result and add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BC /r VFNMADD231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and add to ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description
VFNMADD132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VFNMADD213PD**: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand, adds the negated infinite precision intermediate result to the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VFNMADD231PD**: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

**VEX.256 encoded version**: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

**VEX.128 encoded version**: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*”, “+” and “-” symbols represent multiplication, addition and negation with infinite precision inputs and outputs (no rounding).

**VFNMADD132PD DEST, SRC2, SRC3**

IF (VEX.128) THEN
  MAXVL = 2
ELSEIF (VEX.256)
  MAXVL = 4
FI

For i = 0 to MAXVL-1 {
  n = 64*i;
  DEST[n+63:n] ← RoundFPControl_MXCSR(-(DEST[n+63:n]*SRC3[n+63:n]) + SRC2[n+63:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

VFNMADD213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
  MAXVL = 2
ELSEIF (VEX.256)
  MAXVL = 4
FI
For i = 0 to MAXVL-1 {
  n = 64*i;
  DEST[n+63:n] ← RoundFPControl_MXCSR(-(SRC2[n+63:n]*DEST[n+63:n]) + SRC3[n+63:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

VFNMADD231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
  MAXVL = 2
ELSEIF (VEX.256)
  MAXVL = 4
FI
For i = 0 to MAXVL-1 {
  n = 64*i;
  DEST[n+63:n] ← RoundFPControl_MXCSR(-(SRC2[n+63:n]*SRC3[n+63:n]) + DEST[n+63:n])
}
IF (VEX.128) THEN
  DEST[255:128] ← 0
FI

Intel C/C++ Compiler Intrinsic Equivalent
VFNMADD132PD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADD213PD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADD231PD __m128d _mm_fnmadd_pd (__m128d a, __m128d b, __m128d c);
VFNMADD132PD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);
VFNMADD213PD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);
VFNMADD231PD __m256d _mm256_fnmadd_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2

VFNMADD132PS/VFNMADD213PS/VFNMADD231PS - Fused Negative Multiply-Add of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9C /r VFNMADD132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AC /r VFNMADD213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, negate the multiplication result and add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BC /r VFNMADD231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 9C /r VFNMADD132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AC /r VFNMADD213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, negate the multiplication result and add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BC /r VFNMADD231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and add to ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description
VFNMADD132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the four or eight packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

**VFNMADD213PS:** Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand, adds the negated infinite precision intermediate result to the four or eight packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

**VFNMADD231PS:** Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand, adds the negated infinite precision intermediate result to the four or eight packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

**VEX.256 encoded version:** The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

**VEX.128 encoded version:** The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**

In the operations below, “*”, “+” and “-” symbols represent multiplication, addition and negation with infinite precision inputs and outputs (no rounding).

**VFNMADD132PS DEST, SRC2, SRC3**

IF (VEX.128) THEN
  MAXVL = 4
ELSEIF (VEX.256)
  MAXVL = 8
FI

For i = 0 to MAXVL-1 {
  n = 32*i;
  DEST[n+31:n] ← RoundFPControl_MXCSR(-(DEST[n+31:n]*SRC3[n+31:n]) + SRC2[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMADD213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(-(SRC2[n+31:n]*DEST[n+31:n]) + SRC3[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMADD231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR(-(SRC2[n+31:n]*SRC3[n+31:n]) + DEST[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

Intel C/C++ Compiler Intrinsic Equivalent

VFNMADD132PS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADD213PS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADD231PS __m128 _mm_fnmadd_ps (__m128 a, __m128 b, __m128 c);
VFNMADD132PS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);
VFNMADD213PS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);
VFNMADD231PS __m256 _mm256_fnmadd_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2

VFNMADD132SD/VFNMADD213SD/VFNMADD231SD - Fused Negative Multiply-Add of Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9D /r VFNMADD132SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm2/mem, negate the multiplication result and add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AD /r VFNMADD213SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm1, negate the multiplication result and add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BD /r VFNMADD231SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

**Description**

VFNMADD132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMADD213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMADD231SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is a XMM register and encoded in reg_field. The second source operand is a XMM register and encoded in VEX.vvvv. The third source operand is a XMM register or a 64-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

**Operation**
In the operations below, “*”, “+” and “-” symbols represent multiplication, addition and negation with infinite precision inputs and outputs (no rounding).

VFNMADD132SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (DEST[63:0]*SRC3[63:0]) + SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFNMADD213SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (SRC2[63:0]*DEST[63:0]) + SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFNMADD231SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (SRC2[63:0]*SRC3[63:0]) + DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

**Intel C/C++ Compiler Intrinsic Equivalent**

VFNMADD132SD __m128d _mm_fnmadd_sd (__m128d a, __m128d b, __m128d c);
VFNMADD213SD __m128d _mm_fnmadd_sd (__m128d a, __m128d b, __m128d c);
VFNMADD231SD __m128d _mm_fnmadd_sd (__m128d a, __m128d b, __m128d c);

**SIMD Floating-Point Exceptions**
Overflow, Underflow, Invalid, Precision, Denormal

**Other Exceptions**
See Exceptions Type 3

VFNMADD132SS/VFNMADD213SS/VFNMADD231SS - Fused Negative Multiply-Add of Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9D /r VFNMADD132SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm2/mem, negate the multiplication result and add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AD /r VFNMADD213SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm1, negate the multiplication result and add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BD /r VFNMADD231SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

Description

VFNMADD132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMADD213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMADD231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand, adds the negated infinite precision intermediate result to the low packed single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 32-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, “*” and “+” symbols represent multiplication and addition with infinite precision inputs and outputs (no rounding).

VFNMADD132SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(- (DEST[31:0]*SRC3[31:0]) + SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

VFNMADD213SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(- (SRC2[31:0]*DEST[31:0]) + SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

VFNMADD231SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(- (SRC2[31:0]*SRC3[31:0]) + DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0
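
The three forms above differ only in which operands are multiplied and which one is added. As a minimal sketch (a software model, not Intel code), the Python below evaluates the exact, unrounded expression for each form with `fractions.Fraction`. The helper names are hypothetical, and the sample values are chosen to be exactly representable in single precision, so the final rounding step would leave them unchanged.

```python
from fractions import Fraction

def fnmadd_exact(mul_a, mul_b, addend):
    """Return -(mul_a * mul_b) + addend as an exact rational (no rounding)."""
    return -(Fraction(mul_a) * Fraction(mul_b)) + Fraction(addend)

# Hypothetical helpers: each takes the low elements of DEST, SRC2 and SRC3
# and returns the value that would be written to DEST[31:0].
def vfnmadd132ss(dest, src2, src3):
    return fnmadd_exact(dest, src3, src2)   # -(DEST*SRC3) + SRC2

def vfnmadd213ss(dest, src2, src3):
    return fnmadd_exact(src2, dest, src3)   # -(SRC2*DEST) + SRC3

def vfnmadd231ss(dest, src2, src3):
    return fnmadd_exact(src2, src3, dest)   # -(SRC2*SRC3) + DEST

if __name__ == "__main__":
    dest, src2, src3 = 2.0, 3.0, 5.0
    print(float(vfnmadd132ss(dest, src2, src3)))  # -(2*5) + 3
    print(float(vfnmadd213ss(dest, src2, src3)))  # -(3*2) + 5
    print(float(vfnmadd231ss(dest, src2, src3)))  # -(3*5) + 2
```

With identical register contents the three forms generally give different results, so the choice of form depends on which register holds which value.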

Intel C/C++ Compiler Intrinsic Equivalent

VFNMADD132SS __m128 _mm_fnmadd_ss (__m128 a, __m128 b, __m128 c);
VFNMADD213SS __m128 _mm_fnmadd_ss (__m128 a, __m128 b, __m128 c);
VFNMADD231SS __m128 _mm_fnmadd_ss (__m128 a, __m128 b, __m128 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
VFNMSUB132PD/VFNMSUB213PD/VFNMSUB231PD - Fused Negative Multiply-Subtract of Packed Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9E /r VFNMSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AE /r VFNMSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm1, negate the multiplication result and subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BE /r VFNMSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9E /r VFNMSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AE /r VFNMSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm1, negate the multiplication result and subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BE /r VFNMSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and subtract ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

**Description**

VFNMSUB132PD: Multiplies the two or four packed double-precision floating-point values from the first source operand to the two or four packed double-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the two or four packed double-precision floating-point values in the second source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFNMSUB213PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the first source operand. From negated infinite precision intermediate results, subtracts the two or four packed double-precision floating-point values in the third source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VFNMSUB231PD: Multiplies the two or four packed double-precision floating-point values from the second source operand to the two or four packed double-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the two or four packed double-precision floating-point values in the first source operand, performs rounding and stores the resulting two or four packed double-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFNMSUB132PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR( - (DEST[n+63:n]*SRC3[n+63:n]) - SRC2[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMSUB213PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR( - (SRC2[n+63:n]*DEST[n+63:n]) - SRC3[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMSUB231PD DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 2
ELSEIF (VEX.256)
    MAXVL = 4
FI
For i = 0 to MAXVL-1 {
    n = 64*i;
    DEST[n+63:n] ← RoundFPControl_MXCSR( - (SRC2[n+63:n]*SRC3[n+63:n]) - DEST[n+63:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI
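
The element loop and the upper-half zeroing for the VEX.128 encoding can be modeled in a few lines. The sketch below is a software model of the 132 form only, under stated assumptions: a register is a hypothetical list of four doubles (index 0 holds bits 63:0), inputs are finite, and rounding is round-to-nearest-even (the default MXCSR mode). Signed zeros, NaNs, infinities and exceptions are not modeled.

```python
from fractions import Fraction

def fnmsub_once_rounded(a, b, c):
    """Compute -(a*b) - c exactly, then round once to double precision.

    float(Fraction) rounds to nearest, ties to even; inputs must be finite.
    """
    return float(-(Fraction(a) * Fraction(b)) - Fraction(c))

def vfnmsub132pd(dest, src2, src3, vex128):
    """Model of the 132 form on registers given as lists of four doubles.

    Index 0 is bits 63:0; indices 2 and 3 are bits 255:128.
    """
    maxvl = 2 if vex128 else 4
    result = list(dest)
    for i in range(maxvl):
        # -(DEST * SRC3) - SRC2, one rounding per element
        result[i] = fnmsub_once_rounded(dest[i], src3[i], src2[i])
    if vex128:
        result[2] = result[3] = 0.0      # DEST[255:128] <- 0
    return result

if __name__ == "__main__":
    dest = [1.0, 2.0, 3.0, 4.0]
    src2 = [0.5, 0.5, 0.5, 0.5]
    src3 = [2.0, 2.0, 2.0, 2.0]
    print(vfnmsub132pd(dest, src2, src3, vex128=True))
    print(vfnmsub132pd(dest, src2, src3, vex128=False))
```

The 213 and 231 forms follow by permuting the arguments passed to `fnmsub_once_rounded` in the same way as the pseudocode.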

Intel C/C++ Compiler Intrinsic Equivalent

VFNMSUB132PD __m128d _mm_fnmsub_pd (__m128d a, __m128d b, __m128d c);
VFNMSUB213PD __m128d _mm_fnmsub_pd (__m128d a, __m128d b, __m128d c);
VFNMSUB231PD __m128d _mm_fnmsub_pd (__m128d a, __m128d b, __m128d c);
VFNMSUB132PD __m256d _mm256_fnmsub_pd (__m256d a, __m256d b, __m256d c);
VFNMSUB213PD __m256d _mm256_fnmsub_pd (__m256d a, __m256d b, __m256d c);
VFNMSUB231PD __m256d _mm256_fnmsub_pd (__m256d a, __m256d b, __m256d c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
VFNMSUB132PS/VFNMSUB213PS/VFNMSUB231PS - Fused Negative Multiply-Subtract of Packed Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9E /r VFNMSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AE /r VFNMSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm1, negate the multiplication result and subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BE /r VFNMSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 9E /r VFNMSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AE /r VFNMSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm1, negate the multiplication result and subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BE /r VFNMSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and subtract ymm0 and put result in ymm0.</td>
</tr>
</tbody>
</table>

Description

VFNMSUB132PS: Multiplies the four or eight packed single-precision floating-point values from the first source operand to the four or eight packed single-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the four or eight packed single-precision floating-point values in the second source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFNMSUB213PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the first source operand. From negated infinite precision intermediate results, subtracts the four or eight packed single-precision floating-point values in the third source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VFNMSUB231PS: Multiplies the four or eight packed single-precision floating-point values from the second source operand to the four or eight packed single-precision floating-point values in the third source operand. From negated infinite precision intermediate results, subtracts the four or eight packed single-precision floating-point values in the first source operand, performs rounding and stores the resulting four or eight packed single-precision floating-point values to the destination operand (first source operand).

VEX.256 encoded version: The destination operand (also first source operand) is a YMM register and encoded in reg_field. The second source operand is a YMM register and encoded in VEX.vvvv. The third source operand is a YMM register or a 256-bit memory location and encoded in rm_field.

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 128-bit memory location and encoded in rm_field. The upper 128 bits of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, “*” and “-” symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFNMSUB132PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR( - (DEST[n+31:n]*SRC3[n+31:n]) - SRC2[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMSUB213PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR( - (SRC2[n+31:n]*DEST[n+31:n]) - SRC3[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI

VFNMSUB231PS DEST, SRC2, SRC3
IF (VEX.128) THEN
    MAXVL = 4
ELSEIF (VEX.256)
    MAXVL = 8
FI
For i = 0 to MAXVL-1 {
    n = 32*i;
    DEST[n+31:n] ← RoundFPControl_MXCSR( - (SRC2[n+31:n]*SRC3[n+31:n]) - DEST[n+31:n])
}
IF (VEX.128) THEN
    DEST[255:128] ← 0
FI
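
Because the product is kept at infinite precision and rounded only once, a fused operation can return a different value than a multiply followed by a separate subtract. The sketch below illustrates this for one single-precision element. It is a software model under stated assumptions: `round_nearest_even` is a hypothetical helper that rounds an exact rational to a 24-bit significand with an unbounded exponent range, so only round-to-nearest-even is covered and overflow, underflow, denormals, NaNs and signed zeros are ignored.

```python
from fractions import Fraction

SINGLE_PRECISION_BITS = 24  # significand width of IEEE 754 binary32

def round_nearest_even(x, bits=SINGLE_PRECISION_BITS):
    """Round the exact rational x to `bits` significant bits, ties to even.

    Simplification: the exponent range is unbounded, so overflow, underflow
    and denormals are not modelled.
    """
    if x == 0:
        return Fraction(0)
    mag = abs(x)
    # Pick e so that 2**(bits-1) <= mag / 2**e < 2**bits.
    e = mag.numerator.bit_length() - mag.denominator.bit_length() - bits
    while mag / Fraction(2) ** e >= 2 ** bits:
        e += 1
    while mag / Fraction(2) ** e < 2 ** (bits - 1):
        e -= 1
    quotient = round(mag / Fraction(2) ** e)   # Fraction rounds half to even
    value = quotient * Fraction(2) ** e
    return value if x > 0 else -value

def fused_fnmsub(a, b, c):
    """-(a*b) - c with a single rounding, as in the Operation pseudocode."""
    return float(round_nearest_even(-(Fraction(a) * Fraction(b)) - Fraction(c)))

def separate_fnmsub(a, b, c):
    """Same expression, but the product is rounded before the subtraction."""
    product = round_nearest_even(Fraction(a) * Fraction(b))
    return float(round_nearest_even(-product - Fraction(c)))

# Inputs exactly representable in single precision.
a = 1.0 + 2.0 ** -13
b = 1.0 - 2.0 ** -13
c = -1.0

if __name__ == "__main__":
    print(fused_fnmsub(a, b, c))      # keeps the 2**-26 term of the product
    print(separate_fnmsub(a, b, c))   # the product rounds to 1.0 first
```

Here the exact product is 1 - 2^-26, which is not representable in single precision. The separately rounded version loses that term before the subtraction, while the fused version preserves it.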
Intel C/C++ Compiler Intrinsic Equivalent

VFNMSUB132PS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c);
VFNMSUB213PS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c);
VFNMSUB231PS __m128 _mm_fnmsub_ps (__m128 a, __m128 b, __m128 c);
VFNMSUB132PS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c);
VFNMSUB213PS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c);
VFNMSUB231PS __m256 _mm256_fnmsub_ps (__m256 a, __m256 b, __m256 c);

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 2
VFNMSUB132SD/VFNMSUB213SD/VFNMSUB231SD - Fused Negative Multiply-Subtract of Scalar Double-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9F /r VFNMSUB132SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AF /r VFNMSUB213SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm0 and xmm1, negate the multiplication result and subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BF /r VFNMSUB231SD xmm0, xmm1, xmm2/m64</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar double-precision floating-point value from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

Description

VFNMSUB132SD: Multiplies the low packed double-precision floating-point value from the first source operand to the low packed double-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the second source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMSUB213SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the first source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the third source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VFNMSUB231SD: Multiplies the low packed double-precision floating-point value from the second source operand to the low packed double-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low double-precision floating-point value in the first source operand, performs rounding and stores the resulting packed double-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 64-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, "*" and "-" symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFNMSUB132SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (DEST[63:0]*SRC3[63:0]) - SRC2[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFNMSUB213SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (SRC2[63:0]*DEST[63:0]) - SRC3[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0

VFNMSUB231SD DEST, SRC2, SRC3
DEST[63:0] ← RoundFPControl_MXCSR(- (SRC2[63:0]*SRC3[63:0]) - DEST[63:0])
DEST[127:64] ← DEST[127:64]
DEST[255:128] ← 0
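
The scalar forms write three distinct regions of the destination: the low element receives the result, bits 127:64 keep their previous value, and bits 255:128 are cleared. A minimal sketch of that behavior for the 213 form follows. It assumes a hypothetical register layout of four doubles (index 0 is bits 63:0), finite inputs and round-to-nearest-even; it is a software model, not the instruction itself.

```python
from fractions import Fraction

def vfnmsub213sd(dest, src2, src3):
    """Model of VFNMSUB213SD on registers given as lists of four doubles.

    Index 0 is bits 63:0, index 1 is bits 127:64, indices 2-3 are bits 255:128.
    Assumes finite inputs and round-to-nearest-even.
    """
    # -(SRC2 * DEST) - SRC3 on the low element, rounded once to double
    low = float(-(Fraction(src2[0]) * Fraction(dest[0])) - Fraction(src3[0]))
    # DEST[127:64] is preserved, DEST[255:128] is zeroed
    return [low, dest[1], 0.0, 0.0]

if __name__ == "__main__":
    dest = [3.0, 7.0, 8.0, 9.0]
    src2 = [2.0, 100.0, 100.0, 100.0]
    src3 = [1.0, 200.0, 200.0, 200.0]
    print(vfnmsub213sd(dest, src2, src3))
```

Only element 0 of `src2` and `src3` is read; their upper elements have no effect on the result.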

Intel C/C++ Compiler Intrinsic Equivalent

VFNMSUB132SD __m128d _mm_fnmsub_sd (__m128d a, __m128d b, __m128d c);
VFNMSUB213SD __m128d _mm_fnmsub_sd (__m128d a, __m128d b, __m128d c);
VFNMSUB231SD __m128d _mm_fnmsub_sd (__m128d a, __m128d b, __m128d c);

SIMD Floating-Point Exceptions

Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
VFNMSUB132SS/VFNMSUB213SS/VFNMSUB231SS - Fused Negative Multiply-Subtract of Scalar Single-Precision Floating-Point Values

<table>
<thead>
<tr>
<th>Opcode/ Instruction</th>
<th>Mode Support</th>
<th>CPUID Feature Flag</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9F /r VFNMSUB132SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AF /r VFNMSUB213SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm0 and xmm1, negate the multiplication result and subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BF /r VFNMSUB231SS xmm0, xmm1, xmm2/m32</td>
<td>V/V</td>
<td>FMA</td>
<td>Multiply scalar single-precision floating-point value from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0 and put result in xmm0.</td>
</tr>
</tbody>
</table>

**Description**

VFNMSUB132SS: Multiplies the low packed single-precision floating-point value from the first source operand to the low packed single-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the second source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMSUB213SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the first source operand. From negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the third source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VFNMSUB231SS: Multiplies the low packed single-precision floating-point value from the second source operand to the low packed single-precision floating-point value in the third source operand. From negated infinite precision intermediate result, subtracts the low single-precision floating-point value in the first source operand, performs rounding and stores the resulting packed single-precision floating-point value to the destination operand (first source operand).

VEX.128 encoded version: The destination operand (also first source operand) is an XMM register and encoded in reg_field. The second source operand is an XMM register and encoded in VEX.vvvv. The third source operand is an XMM register or a 32-bit memory location and encoded in rm_field. The upper bits ([255:128]) of the YMM destination register are zeroed.

Compiler tools may optionally support a complementary mnemonic for each instruction mnemonic listed in the opcode/instruction column of the summary table. The behavior of the complementary mnemonic in situations involving NaNs is governed by the definition of the instruction mnemonic defined in the opcode/instruction column. See also Section 2.3.1, "FMA Instruction Operand Order and Arithmetic Behavior".

Operation
In the operations below, "*" and "-" symbols represent multiplication and subtraction with infinite precision inputs and outputs (no rounding).

VFNMSUB132SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(-(DEST[31:0]*SRC3[31:0]) - SRC2[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

VFNMSUB213SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*DEST[31:0]) - SRC3[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

VFNMSUB231SS DEST, SRC2, SRC3
DEST[31:0] ← RoundFPControl_MXCSR(-(SRC2[31:0]*SRC3[31:0]) - DEST[31:0])
DEST[127:32] ← DEST[127:32]
DEST[255:128] ← 0

Intel C/C++ Compiler Intrinsic Equivalent

VFNMSUB132SS __m128 _mm_fnmsub_ss (__m128 a, __m128 b, __m128 c);
VFNMSUB213SS __m128 _mm_fnmsub_ss (__m128 a, __m128 b, __m128 c);
VFNMSUB231SS __m128 _mm_fnmsub_ss (__m128 a, __m128 b, __m128 c);
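
All three mnemonics are listed against the same intrinsic because each evaluates -(a*b) - c on the low element; they differ in which input's register is overwritten by the result. The sketch below checks this on exact rationals. The `form132`, `form213` and `form231` helpers are hypothetical names that model only the low element. How a given compiler chooses among the forms is not covered by this document and is not modeled here.

```python
from fractions import Fraction

def neg_mul_sub(a, b, c):
    """Exact value of -(a*b) - c, the low-element expression of _mm_fnmsub_ss."""
    return -(Fraction(a) * Fraction(b)) - Fraction(c)

# Each form returns the new low element of DEST.
def form132(dest, src2, src3):
    return neg_mul_sub(dest, src3, src2)   # -(DEST*SRC3) - SRC2

def form213(dest, src2, src3):
    return neg_mul_sub(src2, dest, src3)   # -(SRC2*DEST) - SRC3

def form231(dest, src2, src3):
    return neg_mul_sub(src2, src3, dest)   # -(SRC2*SRC3) - DEST

a, b, c = 1.5, -2.0, 0.25
# The same (a, b, c), with a, b or c placed in the destination register.
results = [
    form132(dest=a, src2=c, src3=b),   # overwrites the register holding a
    form213(dest=b, src2=a, src3=c),   # overwrites the register holding b
    form231(dest=c, src2=a, src3=b),   # overwrites the register holding c
]

if __name__ == "__main__":
    print([float(r) for r in results])
```

Each arrangement yields the same value, which is why the operand order of the mnemonic, rather than the intrinsic signature, determines where the inputs must reside.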

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal

Other Exceptions
See Exceptions Type 3
Most SSE/SSE2/SSE3/SSSE3/SSE4 instructions have been promoted to support VEX.128 encodings which, for non-memory-store versions, implies support for zeroing the upper bits of YMM registers. Table A-1 summarizes the promotion status of existing instructions. The "VEX.256" column indicates whether the 256-bit vector form of the instruction, using the VEX.256 prefix encoding, is supported. The "VEX.128" column indicates whether the instruction using the VEX.128 prefix encoding is supported.
## Table A-1. Promoted SSE/SSE2/SSE3/SSSE3/SSE4 Instructions

<table>
<thead>
<tr>
<th>VEX.256 Encoding</th>
<th>VEX.128 Encoding</th>
<th>group</th>
<th>Instruction</th>
<th>If No, Reason?</th>
</tr>
</thead>
<tbody>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 1X</td>
<td>MOVUPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVUPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVSD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVLPS</td>
<td>Note 1</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVLPD</td>
<td>Note 1</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVHLPS</td>
<td>Redundant with VPERMILPS</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVDDUP</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVSLDUP</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>UNPCKLPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>UNPCKLPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>UNPCKLPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVHPS</td>
<td>Note 1</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVHPD</td>
<td>Note 1</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVHLPS</td>
<td>Redundant with VPERMILPS</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVAPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVSHDUP</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVAPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTPI2PS</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSI2SS</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTPI2PD</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSI2SD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVNTPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVNTPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTTPS2PI</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTTSS2SI</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTTPD2PI</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTTSD2SI</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTPS2PI</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSS2SI</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td></td>
<td>CVTPD2PI</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSD2SI</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>UCOMISS</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>UCOMISD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>COMISS</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>COMISD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 5X</td>
<td>MOVMSKPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVMSKPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>SQRTPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>SQRTSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>SQRTPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>SQRTSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>RSQRTPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>RSQRTSS</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>RCPSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ANDPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ANDPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ANDNPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ANDNPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ORPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ORPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>XORPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>XORPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ADDPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>ADDSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ADDPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>ADDSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MULPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MULSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MULPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MULSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTPS2PD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSS2SD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTPD2PS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>CVTSD2SS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTDQ2PS</td>
<td></td>
</tr>
</tbody>
</table>
## INSTRUCTION SUMMARY

<table>
<thead>
<tr>
<th>VEX.256 Encoding</th>
<th>VEX.128 Encoding</th>
<th>group</th>
<th>Instruction</th>
<th>If No, Reason?</th>
</tr>
</thead>
<tbody>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTPS2DQ</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTTPS2DQ</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>SUBPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>SUBSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>SUBPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>SUBSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MINPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MINSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MINPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MINSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>DIVPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>DIVSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>DIVPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>DIVSD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MAXPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MAXSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MAXPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MAXSD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 6X</td>
<td>PUNPCKLBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKLWD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKLDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PACKSSWB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPGTB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPGTD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PACKUSWB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKHBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKHWD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKHDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PACKSSDW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKLQDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PUNPCKHQDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVQ</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVD</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVDQA</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>MOVQ2DQ</td>
<td>MMX</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>MOVDQU</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>PSHUFD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>PSHUFHW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>PSHUFLW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>PCMPEQW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>PCMPEQD</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>HADDPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>HSUBPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F 7X</td>
<td>HSUBPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F AX</td>
<td>LDMXCSR</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F AX</td>
<td>STMXCSR</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>CMPSS</td>
<td>scalar</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>CMPPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>CMPSD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>PINSRW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>PEXTRW</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>SHUFPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F CX</td>
<td>SHUFPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>ADDSUBPD</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>ADDSUBPS</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PSRLW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PSRLD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PSRLQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PADDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PMULLW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td>YY 0F DX</td>
<td>MOVQ2DQ</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td>YY 0F DX</td>
<td>MOVDQ2Q</td>
<td>MMX</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>YY 0F DX</td>
<td>PMOVMSKB</td>
<td>VI</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>VEX.256 Encoding</th>
<th>VEX.128 Encoding</th>
<th>group</th>
<th>Instruction</th>
<th>If No, Reason?</th>
</tr>
</thead>
<tbody>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBUSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBUSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINUB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PAND</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDUSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDUSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXUB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINUSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>POR</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDUSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXUSW</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTPD2DQ</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTTPD2DQ</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>CVTDQ2PD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVNTDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>POR</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PXOR</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>YY 0F FX</td>
<td>LDDQU</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSLLW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSLLD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSLLQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMULUDQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMADDWD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSADBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MASKMOVDQU</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBB</td>
<td>VI</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>VEX.256 Encoding</th>
<th>VEX.128 Encoding</th>
<th>group</th>
<th>Instruction</th>
<th>If No, Reason?</th>
</tr>
</thead>
<tbody>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSUBQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PADDD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>SSSE3</td>
<td>PHADDW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PHADDSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PHADDD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PHSUBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PHSUBSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSHUFB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMULHRSW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSIGNB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSIGNW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PSIGND</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PABS</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PABSD</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td>SSE4.1</td>
<td>BLENDPS</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>BLENDPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>BLENDVPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>BLENDVPD</td>
<td>Note 2</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>DPPD</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>DPPS</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>EXTRACTPS</td>
<td>Note 3</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>INSERTPS</td>
<td>Note 3</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MOVNTDQA</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>MPSADBW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PACKUSDW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PBLENDVB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PBLENDW</td>
<td>VI</td>
</tr>
</tbody>
</table>

Note 2: BLENDVPS is not available in the VEX.256 encoding.

Note 3: EXTRACTPS is not available in the VEX.256 encoding.
<table>
<thead>
<tr>
<th>VEX.256 Encoding</th>
<th>VEX.128 Encoding</th>
<th>group</th>
<th>Instruction</th>
<th>If No, Reason?</th>
</tr>
</thead>
<tbody>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPEQQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PEXTRD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PEXTRQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PEXTRB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PEXTRW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PHMINPOSUW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PINSRB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PINSRD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PINSRQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXSD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXUD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMAXUW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINSB</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINSD</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMINUW</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMOVZXxxx</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PMOVSXxxx</td>
<td>VI</td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>PTEST</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ROUNDPS</td>
<td></td>
</tr>
<tr>
<td>yes</td>
<td>yes</td>
<td></td>
<td>ROUNDPD</td>
<td></td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>ROUNDSD</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>ROUNDSS</td>
<td>scalar</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td>SSE4.2</td>
<td>PCMPGTQ</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td>SSE4.2</td>
<td>CRC32c</td>
<td>integer</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPESTRI</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPESTRM</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPISTRI</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>yes</td>
<td></td>
<td>PCMPISTRM</td>
<td>VI</td>
</tr>
<tr>
<td>no</td>
<td>no</td>
<td>SSE4.2</td>
<td>POPCNT</td>
<td>integer</td>
</tr>
</tbody>
</table>

Description of Column “If No, Reason?”

**MMX**: Instructions referencing MMX registers do not support VEX.

**Scalar**: Scalar instructions are not promoted to 256-bit.

**Integer**: Integer instructions are not promoted.

**VI**: “Vector Integer” instructions are not promoted to 256-bit.

**Note 1**: MOVLPD/PS and MOVHPD/PS are not promoted to 256-bit. Because the existing instructions have no natural 256-bit extension, the equivalent functionality is provided by the VINSERTF128 and VEXTRACTF128 instructions.

**Note 2**: BLENDVPD and BLENDVPS are superseded by the more flexible VBLENDVPD and VBLENDVPS.

**Note 3**: It is expected that using 128-bit INSERTPS followed by a VINSERTF128 would be better than promoting INSERTPS to 256-bit (for example).
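
The composition described in Note 1 and Note 3 can be sketched with a small software model. This is an illustration only, not a definition of the instructions: a YMM register is modeled as a Python list of eight single-precision elements, INSERTPS is modeled for the register-source form, and the helper name `insert_into_upper_lane` is hypothetical.

```python
# Illustrative model only. A YMM register is a list of 8 floats: elements 0-3
# form the low 128-bit lane, elements 4-7 the high 128-bit lane.

def insertps(dst, src, imm8):
    """Model of 128-bit INSERTPS with a register source operand."""
    count_s = (imm8 >> 6) & 0x3   # imm8[7:6]: source element to read
    count_d = (imm8 >> 4) & 0x3   # imm8[5:4]: destination element to write
    zmask = imm8 & 0xF            # imm8[3:0]: destination elements forced to zero
    out = list(dst)
    out[count_d] = src[count_s]
    return [0.0 if (zmask >> i) & 1 else v for i, v in enumerate(out)]

def vextractf128(ymm, imm8):
    """Return the low (imm8[0] = 0) or high (imm8[0] = 1) 128-bit lane."""
    half = imm8 & 1
    return ymm[4 * half:4 * half + 4]

def vinsertf128(ymm, xmm, imm8):
    """Return a copy of ymm with the lane selected by imm8[0] replaced by xmm."""
    half = imm8 & 1
    out = list(ymm)
    out[4 * half:4 * half + 4] = xmm
    return out

def insert_into_upper_lane(ymm, src, imm8):
    """Hypothetical helper: apply a 128-bit INSERTPS to the high lane of ymm."""
    upper = vextractf128(ymm, 1)
    upper = insertps(upper, src, imm8)
    return vinsertf128(ymm, upper, 1)
```

In this model the 256-bit effect is obtained from three 128-bit-oriented steps, which is the kind of sequence the notes suggest in place of a promoted 256-bit INSERTPS.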

### Table A-2. AVX, FMA and AES New Instructions

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>66 0F 38 DE /r</td>
<td>AESDEC xmm1, xmm2/m128</td>
<td>Perform 1 round of AES decryption of xmm1 using the 128-bit round key from the xmm2/m128.</td>
</tr>
<tr>
<td>66 0F 38 DF /r</td>
<td>AESDECLAST xmm1, xmm2/m128</td>
<td>Perform the last round of AES decryption of xmm1 using the 128 bit round key from xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DE /r</td>
<td>VAESDEC xmm1, xmm2, xmm3/m128</td>
<td>Perform 1 round of AES decryption of xmm2 using the 128-bit round key from xmm3/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DF /r</td>
<td>VAESDECLAST xmm1, xmm2, xmm3/m128</td>
<td>Perform the last round of AES decryption of xmm2 using the 128-bit round key from xmm3/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>66 0F 38 DC /r</td>
<td>AESENC xmm1, xmm2/m128</td>
<td>Perform 1 round of AES encryption of xmm1 using the 128-bit round key from the xmm2/m128.</td>
</tr>
<tr>
<td>66 0F 38 DD /r</td>
<td>AESENCLAST xmm1, xmm2/m128</td>
<td>Perform the last round of AES encryption of xmm1 using the 128 bit round key from xmm2/m128.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DC /r</td>
<td>VAESENC xmm1, xmm2, xmm3/m128</td>
<td>Perform 1 round of AES encryption of xmm2 using the 128-bit round key from xmm3/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 DD /r</td>
<td>VAESENCLAST xmm1, xmm2, xmm3/m128</td>
<td>Perform the last round of AES encryption of xmm2 using the 128-bit round key from xmm3/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>66 0F 38 DB /r</td>
<td>AESIMC xmm1, xmm2/m128</td>
<td>Perform the InvMixColumn operation using xmm2/mem and store result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 DB /r</td>
<td>VAESIMC xmm1, xmm2/m128</td>
<td>Perform the InvMixColumn operation using xmm2/mem and store result in xmm1.</td>
</tr>
<tr>
<td>66 0F 3A DF /r ib</td>
<td>AESKEYGENASSIST xmm1, xmm2/m128, imm8</td>
<td>Assist in AES round key generation using an immediate round control byte and a key specified in xmm2/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F3A DF /r ib</td>
<td>VAESKEYGENASSIST xmm1, xmm2/m128, imm8</td>
<td>Assist in AES round key generation using an immediate round control byte and a key specified in xmm2/m128, and store the result in xmm1.</td>
</tr>
<tr>
<td>VEX.256.66.0F38 1A /r</td>
<td>VBROADCASTF128 ymm1, m128</td>
<td>Broadcast 128-bit floating-point values in mem to low and high 128-bits in ymm1.</td>
</tr>
<tr>
<td>VEX.256.66.0F38 19 /r</td>
<td>VBROADCASTSD ymm1, m64</td>
<td>Broadcast double-precision floating-point element in mem to four locations in ymm1.</td>
</tr>
<tr>
<td>VEX.256.66.0F38 18 /r</td>
<td>VBROADCASTSS ymm1, m32</td>
<td>Broadcast single-precision floating-point element in mem to eight locations in ymm1.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 18 /r</td>
<td>VBROADCASTSS xmm1, m32</td>
<td>Broadcast single-precision floating-point element in mem to four locations in xmm1.</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 19 /r ib</td>
<td>VEXTRACTF128 xmm1/m128, ymm2, imm8</td>
<td>Extract 128 bits of packed floating-point values from ymm2 and store the results in xmm1/mem.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 98 /r</td>
<td>VFMADD132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A8 /r</td>
<td>VFMADD213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B8 /r</td>
<td>VFMADD231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 98 /r</td>
<td>VFMADD132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A8 /r</td>
<td>VFMADD213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B8 /r</td>
<td>VFMADD231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, add to ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 98 /r</td>
<td>VFMADD132PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A8 /r</td>
<td>VFMADD213PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm0, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B8 /r</td>
<td>VFMADD231PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 98 /r</td>
<td>VFMADD132PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, add to ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A8 /r</td>
<td>VFMADD213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, add to ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B8 /r</td>
<td>VFMADD231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, add to ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 99 /r</td>
<td>VFMADD132SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A9 /r</td>
<td>VFMADD213SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm0, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B9 /r</td>
<td>VFMADD231SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 99 /r</td>
<td>VFMADD132SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm0 and xmm2/mem, add to xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A9 /r</td>
<td>VFMADD213SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm0, add to xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B9 /r</td>
<td>VFMADD231SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm2/mem, add to xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 96 /r</td>
<td>VFMADDSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, add/subtract elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A6 /r</td>
<td>VFMADDSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, add/subtract elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B6 /r</td>
<td>VFMADDSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, add/subtract elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 96 /r</td>
<td>VFMADDSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, add/subtract elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A6 /r</td>
<td>VFMADDSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, add/subtract elements in ymm2/mem and put result in ymm0.</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B6 /r</td>
<td>VFMADDSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, add/subtract elements in ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 96 /r</td>
<td>VFMADDSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, add/subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A6 /r</td>
<td>VFMADDSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm0, add/subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B6 /r</td>
<td>VFMADDSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, add/subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 96 /r</td>
<td>VFMADDSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, add/subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A6 /r</td>
<td>VFMADDSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, add/subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B6 /r</td>
<td>VFMADDSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, add/subtract ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 97 /r</td>
<td>VFMSUBADD132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, subtract/add elements in xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 A7 /r</td>
<td>VFMSUBADD213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, subtract/add elements in xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 B7 /r</td>
<td>VFMSUBADD231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, subtract/add elements in xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 97 /r</td>
<td>VFMSUBADD132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, subtract/add elements in ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 A7 /r</td>
<td>VFMSUBADD213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, subtract/add elements in ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 B7 /r</td>
<td>VFMSUBADD231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, subtract/add elements in ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 97 /r</td>
<td>VFMSUBADD132PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, subtract/add xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 A7 /r</td>
<td>VFMSUBADD213PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm0, subtract/add xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 B7 /r</td>
<td>VFMSUBADD231PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, subtract/add xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 97 /r</td>
<td>VFMSUBADD132PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, subtract/add ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 A7 /r</td>
<td>VFMSUBADD213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, subtract/add ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 B7 /r</td>
<td>VFMSUBADD231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, subtract/add ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9A /r</td>
<td>VFMSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AA /r</td>
<td>VFMSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BA /r</td>
<td>VFMSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9A /r</td>
<td>VFMSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AA /r</td>
<td>VFMSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BA /r</td>
<td>VFMSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, subtract ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9A /r</td>
<td>VFMSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AA /r</td>
<td>VFMSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm0, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BA /r</td>
<td>VFMSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 9A /r</td>
<td>VFMSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, subtract ymm1 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AA /r</td>
<td>VFMSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, subtract ymm2/mem and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BA /r</td>
<td>VFMSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, subtract ymm0 and put result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9B /r</td>
<td>VFMSUB132SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AB /r</td>
<td>VFMSUB213SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm0, subtract xmm2/mem and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BB /r</td>
<td>VFMSUB231SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, subtract xmm0 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9B /r</td>
<td>VFMSUB132SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm0 and xmm2/mem, subtract xmm1 and put result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AB /r</td>
<td>VFMSUB213SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm0, subtract xmm2/mem and put result in xmm0.</td>
</tr>
</tbody>
</table>
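
The "add/subtract" and "subtract/add" wording in the VFMADDSUB and VFMSUBADD rows refers to alternating behavior across elements. A minimal sketch of the 132 forms follows, assuming vectors are modeled as Python lists with element 0 as the lowest element. The function names are hypothetical, and the fused rounding is imitated with exact `Fraction` arithmetic for finite inputs only.

```python
from fractions import Fraction

def _fused(a, b, c):
    """a*b + c computed exactly and rounded once (finite inputs only)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def vfmaddsub132(op1, op2, op3):
    """Even-indexed elements: op1*op3 - op2. Odd-indexed elements: op1*op3 + op2."""
    return [
        _fused(a, c, -b if i % 2 == 0 else b)
        for i, (a, b, c) in enumerate(zip(op1, op2, op3))
    ]

def vfmsubadd132(op1, op2, op3):
    """Even-indexed elements: op1*op3 + op2. Odd-indexed elements: op1*op3 - op2."""
    return [
        _fused(a, c, b if i % 2 == 0 else -b)
        for i, (a, b, c) in enumerate(zip(op1, op2, op3))
    ]
```

The 213 and 231 forms differ only in which operands are multiplied and which is added or subtracted; the even/odd alternation is the same.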

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9C /r</td>
<td>VFNMADD132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and add to xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AC /r</td>
<td>VFNMADD213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, negate the multiplication result and add to xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BC /r</td>
<td>VFNMADD231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and add to xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9C /r</td>
<td>VFNMADD132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and add to ymm1. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AC /r</td>
<td>VFNMADD213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, negate the multiplication result and add to ymm2/mem. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BC /r</td>
<td>VFNMADD231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and add to ymm0. Put the result in ymm0.</td>
</tr>
</tbody>
</table>
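
The negated forms follow the same 132/213/231 operand-order convention as the other FMA rows of Table A-2: the product is negated before the third value is added (VFNMADD) or subtracted (VFNMSUB). The sketch below models one element. Function names are hypothetical, and only finite inputs are handled; signed zeros, infinities and NaNs are outside the model.

```python
from fractions import Fraction

def _fused_neg(a, b, c):
    """-(a*b) + c computed exactly and rounded once (finite inputs only)."""
    return float(-(Fraction(a) * Fraction(b)) + Fraction(c))

# op1 is the destination operand, op2 the second source, op3 the third source.
def vfnmadd132(op1, op2, op3):
    return _fused_neg(op1, op3, op2)    # -(op1 * op3) + op2

def vfnmadd213(op1, op2, op3):
    return _fused_neg(op2, op1, op3)    # -(op2 * op1) + op3

def vfnmadd231(op1, op2, op3):
    return _fused_neg(op2, op3, op1)    # -(op2 * op3) + op1

def vfnmsub132(op1, op2, op3):
    return _fused_neg(op1, op3, -op2)   # -(op1 * op3) - op2
```

For op1 = 2.0, op2 = 3.0 and op3 = 4.0 the three VFNMADD forms give -5.0, -2.0 and -10.0, and `vfnmsub132` gives -11.0.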

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AC /r</td>
<td>VFNMADD213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, negate the multiplication result and add to ymm2/mem. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BC /r</td>
<td>VFNMADD231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and add to ymm0. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9D /r</td>
<td>VFNMADD132SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm0 and xmm2/mem, negate the multiplication result and add to xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AD /r</td>
<td>VFNMADD213SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm0, negate the multiplication result and add to xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BD /r</td>
<td>VFNMADD231SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, negate the multiplication result and add to xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9D /r</td>
<td>VFNMADD132SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm0 and xmm2/mem, negate the multiplication result and add to xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AD /r</td>
<td>VFNMADD213SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm0, negate the multiplication result and add to xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BD /r</td>
<td>VFNMADD231SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm2/mem, negate the multiplication result and add to xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9E /r</td>
<td>VFNMSUB132PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AE /r</td>
<td>VFNMSUB213PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm0, negate the multiplication result and subtract xmm2/mem. Put the result in xmm0.</td>
</tr>
</tbody>
</table>
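
The three-digit suffix in each mnemonic (132, 213, 231) names which operands are multiplied and which one is added. As a rough illustration, the following C sketch models a single lane of the VFNMADD forms, using the operand roles given in the descriptions above. The function names are hypothetical, and the model uses ordinary C arithmetic, so it does not reproduce the single rounding step of a fused operation.

```c
/* Hypothetical one-lane model of the VFNMADD operand orderings.
 *   a = first operand  (xmm0/ymm0, also the destination)
 *   b = second operand (xmm1/ymm1)
 *   c = third operand  (xmm2/ymm2 or memory)
 * Assumption: plain C arithmetic is used here, so the product may be
 * rounded before the add; the hardware instruction rounds only once. */

/* 132 form: multiply operands 1 and 3, negate, add operand 2. */
double fnmadd132(double a, double b, double c) { return -(a * c) + b; }

/* 213 form: multiply operands 2 and 1, negate, add operand 3. */
double fnmadd213(double a, double b, double c) { return -(b * a) + c; }

/* 231 form: multiply operands 2 and 3, negate, add operand 1. */
double fnmadd231(double a, double b, double c) { return -(b * c) + a; }
```

With a = 2, b = 3, c = 5 the three forms give -7, -1 and -13 respectively, which makes the difference between the orderings easy to see. The VFNMSUB forms follow the same pattern with the final operand subtracted instead of added.
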

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BE /r</td>
<td>VFNMSUB231PD xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed double-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 9E /r</td>
<td>VFNMSUB132PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and subtract ymm1. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 AE /r</td>
<td>VFNMSUB213PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm0, negate the multiplication result and subtract ymm2/mem. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W1 BE /r</td>
<td>VFNMSUB231PD ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed double-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and subtract ymm0. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9E /r</td>
<td>VFNMSUB132PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AE /r</td>
<td>VFNMSUB213PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm0, negate the multiplication result and subtract xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BE /r</td>
<td>VFNMSUB231PS xmm0, xmm1, xmm2/m128</td>
<td>Multiply packed single-precision floating-point values from xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 9E /r</td>
<td>VFNMSUB132PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm0 and ymm2/mem, negate the multiplication result and subtract ymm1. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 AE /r</td>
<td>VFNMSUB213PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm0, negate the multiplication result and subtract ymm2/mem. Put the result in ymm0.</td>
</tr>
<tr>
<td>VEX.DDS.256.66.0F38.W0 BE /r</td>
<td>VFNMSUB231PS ymm0, ymm1, ymm2/m256</td>
<td>Multiply packed single-precision floating-point values from ymm1 and ymm2/mem, negate the multiplication result and subtract ymm0. Put the result in ymm0.</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.DDS.128.66.0F38.W1 9F /r</td>
<td>VFNMSUB132SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 AF /r</td>
<td>VFNMSUB213SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm0, negate the multiplication result and subtract xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W1 BF /r</td>
<td>VFNMSUB231SD xmm0, xmm1, xmm2/m64</td>
<td>Multiply scalar double-precision floating-point value in xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 9F /r</td>
<td>VFNMSUB132SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm0 and xmm2/mem, negate the multiplication result and subtract xmm1. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 AF /r</td>
<td>VFNMSUB213SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm0, negate the multiplication result and subtract xmm2/mem. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.DDS.128.66.0F38.W0 BF /r</td>
<td>VFNMSUB231SS xmm0, xmm1, xmm2/m32</td>
<td>Multiply scalar single-precision floating-point value in xmm1 and xmm2/mem, negate the multiplication result and subtract xmm0. Put the result in xmm0.</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 18 /r ib</td>
<td>VINSERTF128 ymm1, ymm2, xmm3/m128, imm8</td>
<td>Insert 128-bits of packed floating-point values from xmm3/mem and the remaining values from ymm2 into ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2C /r</td>
<td>VMASKMOVPS xmm1, xmm2, m128</td>
<td>Load packed single-precision values from mem using mask in xmm2 and store in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2C /r</td>
<td>VMASKMOVPS ymm1, ymm2, m256</td>
<td>Load packed single-precision values from mem using mask in ymm2 and store in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2D /r</td>
<td>VMASKMOVPD xmm1, xmm2, m128</td>
<td>Load packed double-precision values from mem using mask in xmm2 and store in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2D /r</td>
<td>VMASKMOVPD ymm1, ymm2, m256</td>
<td>Load packed double-precision values from mem using mask in ymm2 and store in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2E /r</td>
<td>VMASKMOVPS m128, xmm1, xmm2</td>
<td>Store packed single-precision values from xmm2 using mask in xmm1</td>
</tr>
</tbody>
</table>
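
A minimal C sketch of the masked load and store behavior summarized above, for illustration only. It assumes that only the most significant bit of each mask element is consulted, that masked-off elements of a load are written as zero, and that masked-off elements of a store leave memory unchanged. These details come from the general definition of the VMASKMOV instructions rather than from this table, and the function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of a VMASKMOVPS-style load of n elements
 * (n = 4 for the 128-bit form, n = 8 for the 256-bit form). */
void maskmov_load_ps(float *dst, const uint32_t *mask,
                     const float *mem, int n)
{
    for (int i = 0; i < n; i++) {
        /* Bit 31 of each mask element selects the lane. Memory is read
         * only for selected lanes; other lanes are assumed to be zeroed. */
        dst[i] = (mask[i] >> 31) ? mem[i] : 0.0f;
    }
}

/* Hypothetical model of a VMASKMOVPS-style store of n elements. */
void maskmov_store_ps(float *mem, const uint32_t *mask,
                      const float *src, int n)
{
    for (int i = 0; i < n; i++) {
        /* Unselected lanes are assumed to leave memory untouched. */
        if (mask[i] >> 31)
            mem[i] = src[i];
    }
}
```

Note that a mask element such as 0x7FFFFFFF does not select its lane in this model, because only the sign bit matters.
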

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.NDS.256.66.0F38 2E /r</td>
<td>VMASKMOVPS m256, ymm1, ymm2</td>
<td>Store packed single-precision values from ymm2 using mask in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 2F /r</td>
<td>VMASKMOVPD m128, xmm1, xmm2</td>
<td>Store packed double-precision values from xmm2 using mask in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 2F /r</td>
<td>VMASKMOVPD m256, ymm1, ymm2</td>
<td>Store packed double-precision values from ymm2 using mask in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 0D /r</td>
<td>VPERMILPD xmm1, xmm2, xmm3/m128</td>
<td>Permute Double-Precision Floating-Point values in xmm2 using controls from xmm3/mem and store result in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 05 /r ib</td>
<td>VPERMILPD xmm1, xmm2/m128, imm8</td>
<td>Permute Double-Precision Floating-Point values in xmm2/mem using controls from imm8 and store result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 0D /r</td>
<td>VPERMILPD ymm1, ymm2, ymm3/m256</td>
<td>Permute Double-Precision Floating-Point values in ymm2 using controls from ymm3/mem and store result in ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 05 /r ib</td>
<td>VPERMILPD ymm1, ymm2/m256, imm8</td>
<td>Permute Double-Precision Floating-Point values in ymm2/mem using controls from imm8 and store result in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.128.66.0F38 0C /r</td>
<td>VPERMILPS xmm1, xmm2, xmm3/m128</td>
<td>Permute Single-Precision Floating-Point values in xmm2 using controls from xmm3/mem and store result in xmm1</td>
</tr>
<tr>
<td>VEX.128.66.0F3A 04 /r ib</td>
<td>VPERMILPS xmm1, xmm2/m128, imm8</td>
<td>Permute Single-Precision Floating-Point values in xmm2/mem using controls from imm8 and store result in xmm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F38 0C /r</td>
<td>VPERMILPS ymm1, ymm2, ymm3/m256</td>
<td>Permute Single-Precision Floating-Point values in ymm2 using controls from ymm3/mem and store result in ymm1</td>
</tr>
<tr>
<td>VEX.256.66.0F3A 04 /r ib</td>
<td>VPERMILPS ymm1, ymm2/m256, imm8</td>
<td>Permute Single-Precision Floating-Point values in ymm2/mem using controls from imm8 and store result in ymm1</td>
</tr>
<tr>
<td>VEX.NDS.256.66.0F3A 06 /r ib</td>
<td>VPERM2F128 ymm1, ymm2, ymm3/m256, imm8</td>
<td>Permute 128-bit floating-point fields in ymm2 and ymm3/mem using controls from imm8 and store result in ymm1</td>
</tr>
<tr>
<td>66 0F 3A 44 /r ib</td>
<td>PCLMULQDQ xmm1, xmm2/m128, imm8</td>
<td>Carry-less multiplication of a pair of quad-word selected by an immediate byte from xmm2/m128 and xmm1, stores the 128-bit result in xmm1.</td>
</tr>
<tr>
<td>VEX.256.66.0F38 0E /r</td>
<td>VTESTPS ymm1, ymm2/m256</td>
<td>Set ZF if ymm2/mem AND ymm1 result is all 0s in packed single-precision sign bits. Set CF if ymm2/mem AND NOT ymm1 result is all 0s in packed single-precision sign bits.</td>
</tr>
</tbody>
</table>
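
The carry-less multiplication performed by PCLMULQDQ can be sketched in portable C as a shift-and-XOR loop. This is only an illustrative model: the type and function names are hypothetical, and the selector convention (imm8 bit 0 picks the quadword of the first operand, imm8 bit 4 picks the quadword of the second operand) is an assumption taken from the instruction's general definition, not from the table above.

```c
#include <stdint.h>

/* 128-bit result held as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Carry-less (polynomial over GF(2)) product of two 64-bit values:
 * partial products are combined with XOR instead of addition. */
u128 clmul64(uint64_t a, uint64_t b)
{
    u128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0)
                r.hi ^= a >> (64 - i);  /* bits shifted out of the low half */
        }
    }
    return r;
}

/* x1 and x2 hold the low ([0]) and high ([1]) quadwords of each operand.
 * Assumed selector bits: imm8[0] for x1, imm8[4] for x2. */
u128 pclmulqdq_model(const uint64_t x1[2], const uint64_t x2[2],
                     unsigned imm8)
{
    return clmul64(x1[imm8 & 1], x2[(imm8 >> 4) & 1]);
}
```

For example, the carry-less product of 3 and 3 is 5 (the polynomial (x + 1) squared is x² + 1), whereas the ordinary integer product is 9.
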

<table>
<thead>
<tr>
<th>Opcode</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>VEX.256.66.0F38 0F /r</td>
<td>VTESTPD ymm1, ymm2/m256</td>
<td>Set ZF if ymm2/mem AND ymm1 result is all 0s in packed double-precision sign bits. Set CF if ymm2/mem AND NOT ymm1 result is all 0s in packed double-precision sign bits.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 0E /r</td>
<td>VTESTPS xmm1, xmm2/m128</td>
<td>Set ZF if xmm2/mem AND xmm1 result is all 0s in packed single-precision sign bits. Set CF if xmm2/mem AND NOT xmm1 result is all 0s in packed single-precision sign bits.</td>
</tr>
<tr>
<td>VEX.128.66.0F38 0F /r</td>
<td>VTESTPD xmm1, xmm2/m128</td>
<td>Set ZF if xmm2/mem AND xmm1 result is all 0s in packed double-precision sign bits. Set CF if xmm2/mem AND NOT xmm1 result is all 0s in packed double-precision sign bits.</td>
</tr>
<tr>
<td>VEX.256.0F 77</td>
<td>VZEROALL</td>
<td>Zero all YMM registers</td>
</tr>
<tr>
<td>VEX.128.0F 77</td>
<td>VZEROUPPER</td>
<td>Zero upper 128 bits of all YMM registers</td>
</tr>
</tbody>
</table>
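
The flag behavior described for VTESTPS can be sketched as follows. This is a minimal model with hypothetical names: `a` stands for the first operand (xmm1/ymm1), `b` for the second operand (xmm2/ymm2 or memory), and each element is represented by its raw 32-bit pattern so that the sign bit is bit 31.

```c
#include <stdint.h>

/* Hypothetical container for the two flags the instruction sets. */
typedef struct { int zf, cf; } test_flags;

/* Model of VTESTPS over n 32-bit elements
 * (n = 4 for the 128-bit form, n = 8 for the 256-bit form). */
test_flags vtestps_model(const uint32_t *a, const uint32_t *b, int n)
{
    uint32_t and_signs = 0;   /* OR of the sign bits of (b AND a)     */
    uint32_t andn_signs = 0;  /* OR of the sign bits of (b AND NOT a) */
    for (int i = 0; i < n; i++) {
        and_signs  |= (b[i] & a[i]) >> 31;
        andn_signs |= (b[i] & ~a[i]) >> 31;
    }
    test_flags f;
    f.zf = (and_signs == 0);   /* ZF set when no sign bit survives the AND     */
    f.cf = (andn_signs == 0);  /* CF set when no sign bit survives the AND NOT */
    return f;
}
```

Only sign bits contribute: two inputs whose elements are all 0x7FFFFFFF produce ZF = 1 and CF = 1 in this model even though the full bitwise AND is nonzero. VTESTPD would follow the same pattern with 64-bit elements and bit 63.
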

**APPENDIX B**

### INSTRUCTION OPCODE MAP

- GREEN cells are existing instructions promoted to VEX.128
- BLUE cells are existing instructions promoted to VEX.256 and VEX.128
- RED cells are AVX and FMA new instructions
- YELLOW cells are Non-VEX encoded new instructions

<table>
<thead>
<tr>
<th>0F</th>
<th>66 0F</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>movlps Vq, Mq; movlps Mq, Vq</td>
</tr>
<tr>
<td>B</td>
<td>cmovae/nc/nb Gv, Ev</td>
</tr>
<tr>
<td>C</td>
<td>cmova/nbe Gv, Ev</td>
</tr>
<tr>
<td>D</td>
<td>cmovns Gv, Ev</td>
</tr>
<tr>
<td>E</td>
<td>cmovnp/po Gv, Ev</td>
</tr>
<tr>
<td>F</td>
<td>cmovnl/ge Gv, Ev</td>
</tr>
</tbody>
</table>

Ref. # 319433-005  

<table>
<thead>
<tr>
<th>66 0F</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>

F2 0F

<table>
<thead>
<tr>
<th>F2 0F</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>

F3 0F

<table>
<thead>
<tr>
<th>F3 0F</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>


#### 66 0F 38


#### Instruction Map

- **movss Vss, Wss**
- **movss Wss, Vss**
- **movsldup Vq, Wq**
- **movshdup Vq, Wq**
- **cvtsi2ss Vss, Ed/q**
- **cvttss2si Gd, Wss**
- **cvtss2si Gd/q, Wss**
- **sqrtss Vss, Wss**
- **rsqrtss Vss, Wss**
- **rcpss Vss, Wss**
- **addss Vss, Wss**
- **mulss Vss, Wss**
- **cvtss2sd Vss, Wss**
- **cvttps2dq Vdq, Wps**
- **subss Vss, Wss**
- **minss Vss, Wss**
- **divss Vss, Wss**
- **movdqu Vdq, Wdq**
- **pshufhw Vdq, Wdq, Ib**
- **movq Vq, Wq**
- **movdqu Wdq, Vdq**
- **cmpss Vss, Wss, Ib**
- **movq2dq Vdq, Nq**
- **cvtdq2pd Vpd, Wdq**
- **pshufb Vdq, Wdq**
- **phaddw Vdq, Wdq**
- **phaddd Vdq, Wdq**
- **phaddsw Vdq, Wdq**
- **pmaddubsw Vdq, Wdq**
- **phsubw Vdq, Wdq**
- **phsubd Vdq, Wdq**
- **phsubsw Vdq, Wdq**
- **psignb Vdq, Wdq**
- **psignw Vdq, Wdq**
- **psignd Vdq, Wdq**
- **pmulhrsw Vdq, Wdq**
- **vpermilps vpermilpd vtestps vtestpd**
- **pblendvb blendvps blendvpd ptest**
- **vbroadcastss vbroadcastsd vbroadcastf128**
- **pabsb Vdq, Wdq**
- **pabsw Vdq, Wdq**
- **pabsd Vdq, Wdq**
- **pmovsxbw pmovsxbd pmovsxbq pmovsxwd pmovsxwq pmovsxdq pmuldq pcmpeqq movntdqa packusdw**
- **vmaskmovps (ld) vmaskmovpd (ld) vmaskmovps (st) vmaskmovpd (st)**
- **pmovzxbw pmovzxbd pmovzxbq pmovzxwd pmovzxwq pmovzxdq**
- **pcmpgtq Vdq, Wdq pminsb pminsd pminuw pminud pmaxsb pmaxsd pmaxuw pmaxud**
- **pmulld**
- **phminposuw**

- **aesimc aesenc aesenclast aesdec aesdeclast**


#### 0F 38

<table>
<thead>
<tr>
<th>Opcode</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
<tbody>
<tr>
<td>Pshufb</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Phaddw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Phaddw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pmaddubsw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Phsubw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Phsubw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pmulhrs</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pabsb</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pabsw</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Pabsd</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

#### F2 0F 38

<table>
<thead>
<tr>
<th>Opcode</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>

#### 66 0F 3A

<table>
<thead>
<tr>
<th>Opcode</th>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>


#### 0F 3A

<table>
<thead>
<tr>
<th>0</th>
<th>1</th>
<th>2</th>
<th>3</th>
<th>4</th>
<th>5</th>
<th>6</th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>A</th>
<th>B</th>
<th>C</th>
<th>D</th>
<th>E</th>
<th>F</th>
</tr>
</thead>
</table>

CVTPS2PD- Convert Packed Single Precision Floating-point values to Packed Double Precision Floating-Point Values ................................................................. 5-100
CVTSD2SI- Convert Scalar Double-Precision Floating-Point Value to Doubleword Integer ................................................................. 5-103
CVTSD2SS- Convert Scalar Double-Precision Floating-Point Value to Scalar Single-Precision Floating-Point Value ................................................................. 5-105
CVTSI2SD- Convert Doubleword Integer to Scalar Double-Precision Floating-Point Value ................................................................. 5-107
CVTSI2SS- Convert Doubleword Integer to Scalar Single-Precision Floating-Point Value ................................................................. 5-109
CVTSS2SD- Convert Scalar Single-Precision Floating-Point Value to Scalar Double-Precision Floating-Point Value ................................................................. 5-111
CVTSS2SI- Convert Scalar Single-Precision Floating-Point Value to Doubleword Integer ................................................................. 5-113
CVTTSD2SI- Convert with Truncation Scalar Double-Precision Floating-Point Value to Signed Doubleword Integer ................................................................. 5-115
CVTTSS2SI- Convert with Truncation Scalar Single-Precision Floating-Point Value to Doubleword Integer ................................................................. 5-117
CVTTPS2DQ- Convert with Truncation Packed Single Precision Floating-Point Values to Packed Signed Doubleword Integer Values ................................................................. 5-118
CVTTPD2DQ- Convert with Truncation Packed Double-Precision Floating-point values to Packed Doubleword Integers ................................................................. 5-120
CVTTPS2DQ- Convert with Truncation Packed Single Precision Floating-Point Values to Packed Signed Doubleword Integer Values ................................................................. 5-122

DIVPD- Divide Packed Double-Precision Floating-Point Values ................................................................. 5-124
DIVPS- Divide Packed Single-Precision Floating-Point Values ................................................................. 5-126
DIVSD- Divide Scalar Double-Precision Floating-Point Values ................................................................. 5-128
DIVSS- Divide Scalar Single-Precision Floating-Point Values ................................................................. 5-130
DPPD- Dot Product of Packed Double-Precision Floating-Point Values ................................................................. 5-132
DPPS- Dot Product of Packed Single-Precision Floating-Point Values ................................................................. 5-134

EXTRACTPS- Extract packed floating-point values ................................................................. 5-139

Feature information, processor ................................................................. 2-31
FMA operation ................................................................. 2-46, 2-7
FXRSTOR instruction ................................................................. 2-49
FXSAVE instruction ................................................................. 2-49

HADDPD- Add Horizontal Double Precision Floating-Point Values ................................................................. 5-141
HADDPS- Add Horizontal Single Precision Floating-Point Values ................................................................. 5-143
HSUBPD- Subtract Horizontal Double Precision Floating-Point Values ................................................................. 5-146
HSUBPS- Subtract Horizontal Single Precision Floating-Point Values ................................................................. 5-148
Hyper-Threading Technology ................................................................. 2-50

IA-32e mode ................................................................. 2-39
INSERTPS- Insert Scalar Single Precision Floating-Point Value ................................................................. 5-152

L1 Context ID ................................................................. 2-45
LDDQU- Move Unaligned Integer ................................................................. 5-156
LDMXCSR instruction ................................................................. 5-158
Machine check architecture

CPUID flag ......................................................... 2-49
description ......................................................... 2-49

MASKMOVDQU- Store Selected Bytes of Double Quadword with NT Hint ...................... 5-159
MAXPD- Maximum of Packed Double Precision Floating-Point Values ......................... 5-166
MAXSD- Return Maximum Scalar Double-Precision Floating-Point Value .................... 5-167
MAXPS- Maximum of Packed Single Precision Floating-Point Values ......................... 5-170
MAXSS- Return Maximum Scalar Single-Precision Floating-Point Value .................... 5-172
MINPD- Minimum of Packed Double Precision Floating-Point Values ......................... 5-174
MINPS- Minimum of Packed Single Precision Floating-Point Values ......................... 5-176
MINSD- Return Minimum Scalar Double-Precision Floating-Point Value .................... 5-179
MINSS- Return Minimum Scalar Single-Precision Floating-Point Value .................... 5-181

MMX instructions

CPUID flag for technology ......................................... 2-49
Model & family information ......................................... 2-54
MONITOR instruction

CPUID flag ......................................................... 2-45
feature data ......................................................... 2-54

MOVAPD- Move Aligned Packed Double-Precision Floating-Point Values ..................... 5-183
MOVAPS- Move Aligned Packed Single-Precision Floating-Point Values ...................... 5-186
MOVDDUP- Replicate Double FP Values ................................ 5-194
MOVDQU- Move Unaligned Integer Values ................................ 5-198
MOVD/MOVQ- Move Doubleword and Quadword ....................................................... 5-189
MOVHLPS- Move Packed Single-Precision Floating-Point Values High to Low ................. 5-200
MOVHPS- Move High Packed Single-Precision Floating-Point Values ......................... 5-204
MOVLHPS- Move Packed Single-Precision Floating-Point Values Low to High .............. 5-200
MOVLPS- Move Low Packed Single-Precision Floating-Point Values ......................... 5-210
MOVMSKPD- Extract Double-Precision Floating-Point Sign mask ............................. 5-212
MOVMSKPS- Extract Single-Precision Floating-Point Sign mask .............................. 5-214
MOVNTDQ- Store Packed Integers Using Non-Temporal Hint .................................. 5-216
MOVNTDQA- Load Double Quadword Non-Temporal Aligned Hint .............................. 5-218
MOVNTPD- Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint 5-220
MOVNTPS- Store Packed Single-Precision Floating-Point Values Using Non-Temporal Hint 5-222
MOVQ- Move Quadword .............................................. 5-192
MOVSD- Move or Merge Scalar Double-Precision Floating-Point Values ..................... 5-224
MOVSHDUP- Replicate Single FP Values ................................ 5-227
MOVSLDUP- Replicate Single FP Values ................................ 5-227
MOVSS- Move or Merge Scalar Single-Precision Floating-Point Value ................. 5-233
MOVUPD- Move Unaligned Packed Double-Precision Floating-Point Values ................. 5-236
MOVUPS- Move Unaligned Packed Single-Precision Floating-Point Values ................. 5-239
MULPD- Multiply Packed Double Precision Floating-Point Values ............................ 5-247
MULPS- Multiply Packed Single Precision Floating-Point Values ............................ 5-249
MULSD- Multiply Scalar Double-Precision Floating-Point Values ............................ 5-251
MULSS- Multiply Scalar Single-Precision Floating-Point Values ............................ 5-253

MWAIT instruction

CPUID flag ......................................................... 2-45
feature data ......................................................... 2-54

ORPD- Bitwise Logical OR of Packed Double Precision Floating-Point Values .............. 5-255
ORPS- Bitwise Logical OR of Packed Single Precision Floating-Point Values ............... 5-257

PABSB/PABSW/PABSD - Packed Absolute Value ............................................. 5-259
PACKSSWB/PACKSSDW- Pack with Signed Saturation ........................................... 5-262
PACKUSWB/PACKUSDW - Pack with Unsigned Saturation .......................... 5-266
PADDB/PADDW/PADDD/PADDQ - Add Packed Integers ........................................ 5-269
PADDSB/PADDSW - Add Packed Signed Integers with Signed Saturation .... 5-273
PADDUSB/PADDUSW - Add Packed Unsigned Integers with Unsigned Saturation .... 5-275
PALIGNR - Byte Align .............................................................................. 5-277
PAND - Logical AND .................................................................................. 5-279
PANDN - Logical AND NOT ................................................................. 5-281
PAVGB/PAVGW - Average Packed Integers ....................................................... 5-283
PBLENDVB - Variable Blend Packed Bytes ............................................. 5-285
PBLENDW - Blend Packed Words ......................................................... 5-286
PCMPGTB/PCMPGTW/PCMPGTD/PCMPGTQ - Compare Packed Integers for Greater Than 5-306
PCMPISTRM - Packed Compare Implicit Length Strings, Return Mask .......... 5-290
PCMPEQB/PCMPEQW/PCMPEQD/PCMPEQQ - Compare Packed Integers for Equality 5-302
PCMPESTRI - Packed Compare Explicit Length Strings, Return Index ............ 5-294
PCMPESTRM - Packed Compare Explicit Length Strings, Return Mask .......... 5-296
PADDUSB/PADDUSW - Add Packed Unsigned Integers with Unsigned Saturation 5-308
PANDUSB/PANDUSW - Add Packed Logically ANDed Integers with Unsigned Saturation 5-310
PMAXSB/PMAXSW/PMAXSD - Maximum of Packed Signed Integers .............. 5-342
PMAXUB/PMAXUW/PMAXUD - Maximum of Packed Unsigned Integers ........... 5-346
PMINSB/PMINSW/PMINSD - Minimum of Packed Signed Integers ............... 5-354
PMINUB/PMINUW/PMINUD - Minimum of Packed Unsigned Integers .......... 5-356
PMOVMSKB - Move Byte Mask ............................................................. 5-358
PMOVSX - Packed Move with Sign Extend .................................................. 5-359
PMOVZX - Packed Move with Zero Extend ............................................... 5-360
PMULUDQ - Multiply Packed Unsigned Doubleword Integers ...................................... 5-381
PMULHRSW - Multiply Packed Unsigned Integers with Round and Scale ........ 5-372
PMULHUW - Multiply Packed Unsigned Integers and Store High Result ........ 5-370
PMULHW - Multiply Packed Integers and Store High Result 5-370
PMULLW/PMULLD - Multiply Packed Integers and Store Low Result ............ 5-376
PMULDQ - Multiply Packed Signed Doubleword Integers ......................... 5-379
POR - Bitwise Logical OR ......................................................................... 5-383
PSADBW - Compute Sum of Absolute Differences .................................... 5-385
PSHUFB - Packed Shuffle Bytes ............................................................ 5-387
PSHUFD - Shuffle Packed Doublewords .................................................. 5-389
PSHUFHW - Shuffle Packed High Words ..................................................... 5-391
PSHUFLW - Shuffle Packed Low Words ..................................................... 5-393
PSIGNB/PSIGNW/PSIGND - Packed SIGN .................................................. 5-395
PSLLDQ - Byte Shift Left ............................................................................ 5-399
PSLLW/PSLLD/PSLLQ - Bit Shift Left ...................................................... 5-403
PSRAW/PSRAD - Bit Shift Arithmetic Right ............................................ 5-408
PSRLDQ - Byte Shift Right .......................................................................... 5-401
PSRLW/PSRLD/PSRLQ - Shift Packed Data Right Logical .......................... 5-412
PSUBB/PSUBW/PSUBD/PSUBQ - Packed Integer Subtract ...................... 5-421
PSUBSB/PSUBSW - Subtract Packed Signed Integers with Signed Saturation .... 5-425
PSUBUSB/PSUBUSW - Subtract Packed Unsigned Integers with Unsigned Saturation 5-427
PTEST - Packed Bit Test ............................................................................ 5-417
PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ/PUNPCKHQDQ - Unpack High Data ........ 5-429
PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ/PUNPCKLQDQ - Unpack Low Data ........ 5-433
PXOR - Exclusive Or ................................................................................ 5-437
RCPPS- Compute Approximate Reciprocals of Packed Single-Precision Floating-Point Values 5-439
RCPSS - Compute Reciprocal of Scalar Single-Precision Floating-Point Value 5-442
RDMSR instruction 2-48
ROUNDPD- Round Packed Double-Precision Floating-Point Values 5-449
ROUNDPS- Round Packed Single-Precision Floating-Point Values 5-453
ROUNDSD - Round Scalar Double-Precision Value 5-456
ROUNDSS - Round Scalar Single-Precision Value 5-458
RSQRTPS - Compute Approximate Reciprocals of Square Roots of Packed Single-Precision Floating-Point Values 5-444
RSQRTSS - Compute Reciprocal of Square Root of Scalar Single-Precision Floating-Point Value 5-447

S
Self Snoop 2-50
SHUFPD - Shuffle Packed Double Precision Floating-Point Values 5-460
SHUFPS - Shuffle Packed Single Precision Floating-Point Values 5-463
SIMD floating-point exceptions, unmasking, effects of 5-158
SpeedStep technology 2-45
SQRTPD- Square Root of Double-Precision Floating-Point Values 5-466
SQRTPS - Square Root of Single-Precision Floating-Point Values 5-468
SQRTSD - Compute Square Root of Scalar Double-Precision Floating-Point Value 5-470
SQRTSS - Compute Square Root of Scalar Single-Precision Floating-Point Value 5-472
SSE extensions
CPUID flag 2-50
SSE2 extensions
CPUID flag 2-50
SSE3
CPUID flag 2-44
SSE3 extensions
CPUID flag 2-44
SSSE3 extensions
CPUID flag 2-45
Stepping information 2-54
STMXCSR instruction 5-474
STMXCSR- Store MXCSR Register State 5-466
SUBPD- Subtract Packed Double Precision Floating-Point Values 5-474
SUBPS- Subtract Packed Single Precision Floating-Point Values 5-477
SUBSD- Subtract Scalar Double Precision Floating-Point Values 5-479
SUBSS- Subtract Scalar Single Precision Floating-Point Values 5-481
SYSENTER instruction
CPUID flag 2-48
SYSEXIT instruction
CPUID flag 2-48

T
Thermal Monitor
CPUID flag 2-50
Thermal Monitor 2 2-45
CPUID flag 2-45
Time Stamp Counter 2-48

U
UCOMISD - Unordered Compare Scalar Double-Precision Floating-Point Values and Set EFLAGS 5-483
UCOMISS - Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS 5-485
UNPCKHPD- Unpack and Interleave High Packed Double-Precision Floating-Point Values 5-487

Ref. # 319433-005

V
VBROADCAST- Load with Broadcast
Version information, processor
VEX
VEX.B
VEX.L
VEX.mmmmm
VEX.pp
VEX.R
VEX.vvvv
VEX.W
VEX.X
VFMADD132PD/VFMADD213PD/VFMADD231PD - Fused Multiply-Add of Packed Double-Precision Floating-Point Values
VFMADD132PS/VFMADD213PS/VFMADD231PS - Fused Multiply-Add of Packed Single-Precision Floating-Point Values
VFMADD132SD/VFMADD213SD/VFMADD231SD - Fused Multiply-Add of Scalar Double-Precision Floating-Point Values
VFMADD132SS/VFMADD213SS/VFMADD231SS - Fused Multiply-Add of Scalar Single-Precision Floating-Point Values
VFMADDSUB132PD/VFMADDSUB213PD/VFMADDSUB231PD - Fused Multiply-Alternating Add/Subtract of Packed Double-Precision Floating-Point Values
VFMADDSUB132PS/VFMADDSUB213PS/VFMADDSUB231PS - Fused Multiply-Alternating Add/Subtract of Packed Single-Precision Floating-Point Values
VFMSUB132PD/VFMSUB213PD/VFMSUB231PD - Fused Multiply-Subtract of Packed Double-Precision Floating-Point Values
VFMSUB132SS/VFMSUB213SS/VFMSUB231SS - Fused Multiply-Subtract of Scalar Single-Precision Floating-Point Values
VZEROUPPER- Zero Upper Bits of YMM Registers
XFEATURE_ENABLED_MASK ........................................... 2-2
XORPD- Bitwise Logical XOR of Packed Double Precision Floating-Point Values .......... 5-497
XORPS- Bitwise Logical XOR of Packed Single Precision Floating-Point Values .......... 5-499
XRSTOR .......................................................... 1-2, 2-2, 2-55, 3-1, 5-12
XSAVE .............................................................. 1-2, 2-2, 2-3, 2-4, 2-5, 2-11, 2-46, 2-55, 3-1, 5-12