Bugs in Intrinsics Guide


I'm confused by the operation of _mm512_i32extscatter_epi32.

If you use conv=_MM_DOWNCONV_EPI32_NONE and hint=_MM_HINT_NONE, this intrinsic should be equivalent to _mm512_i32scatter_epi32.

For example, when using _MM_DOWNCONV_EPI32_UINT8, take j=15: then i=480 and n=120, and addr[127:120] := UInt32ToUInt8(v1[511:480]). Are we really using 128-bit-wide addresses? The operation of _mm512_i32scatter_epi32 makes a lot more sense; see below.

Can someone please explain how the operation of _mm512_i32extscatter_epi32 should be read?

Regards, Henk-Jan.

---
void _mm512_i32extscatter_epi32 (void * mv, __m512i index, __m512i v1, _MM_DOWNCONV_EPI32_ENUM conv, int scale, int hint)
Operation:

FOR j := 0 to 15
    addr := MEM[mv + index[j] * scale]
    i := j*32
    CASE conv OF 
        _MM_DOWNCONV_EPI32_NONE: 
            addr[i+31:i] := v1[i+31:i]
        _MM_DOWNCONV_EPI32_UINT8: 
            n := j*8 
            addr[n+7:n] := UInt32ToUInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_SINT8:
            n := j*8
            addr[n+7:n] := SInt32ToSInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_UINT16:
            n := j*16 
            addr[n+15:n] := UInt32ToUInt16(v1[i+31:i]) 
        _MM_DOWNCONV_EPI32_SINT16: 
            n := j*16 
            addr[n+15:n] := SInt32ToSInt16(v1[n+15:n]) 
    ESAC 
ENDFOR 

---
void _mm512_i32scatter_epi32 (void* base_addr, __m512i vindex, __m512i a, int scale)
Operation: 

FOR j := 0 to 15
    i := j*32 
    MEM[base_addr + SignExtend(vindex[i+31:i])*scale] := a[i+31:i] 
ENDFOR
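
For what it's worth, here is how I would read the _MM_DOWNCONV_EPI32_UINT8 case, written as a plain scalar loop in C (just my interpretation: each element gets its own byte-sized store to its own address, rather than writes to bit ranges of one 128-bit-wide address):

#include <stdint.h>

/* Scalar sketch of _mm512_i32extscatter_epi32 with conv = _MM_DOWNCONV_EPI32_UINT8
   and hint = _MM_HINT_NONE. This is only my reading of the pseudo-code. */
static void i32extscatter_epi32_u8(void *mv, const int32_t index[16],
                                   const uint32_t v1[16], int scale)
{
    for (int j = 0; j < 16; j++) {
        uint8_t *addr = (uint8_t *)mv + (int64_t)index[j] * scale;  /* per-element address */
        *addr = (uint8_t)v1[j];  /* UInt32ToUInt8: keep the low 8 bits */
    }
}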

 


When selecting the AVX512_4FMAPS instruction set, one intrinsic is missing: _mm512_4fmadd_ps

That intrinsic is in the tool; it is just missing from the filtered list AVX512_4FMAPS.

 


The entire set of instructions that use masks as input, such as kmov, kshift, kand and so on, is absent from the guide. I had to use these several times, and their absence makes it harder.


Quote:

Eden S. (Intel) wrote:

The entire set of instructions that use masks as input, such as kmov, kshift, kand and so on, is absent from the guide.

The guide does describe some intrinsics such as _mm512_kmov, _mm512_kand, etc., but they mostly deal with 16-bit masks and are not broken out into a separate category, which would be useful for searching.
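
For example, the 16-bit mask intrinsics that are documented can be used like this (a minimal sketch; I'm only assuming the _mm512_k* spellings that the guide already describes):

#include <immintrin.h>

/* Count elements that compare equal in (a, b) AND in (c, d), combining the two
   compare masks with the KANDW-style mask intrinsic. */
int count_common_equal(__m512i a, __m512i b, __m512i c, __m512i d)
{
    __mmask16 k1 = _mm512_cmpeq_epi32_mask(a, b);
    __mmask16 k2 = _mm512_cmpeq_epi32_mask(c, d);
    __mmask16 k  = _mm512_kand(k1, k2);
    return _mm_popcnt_u32((unsigned int)k);   /* number of set mask bits */
}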

 


In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information about this intrinsic name either.
What does this mean - do we have such an operation or not?
 


Quote:

Zvi Danovich (Intel) wrote:

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number: 325462-063US, July 2017) doesn't have any information about this intrinsic name either.
What does this mean - do we have such an operation or not?

gcc 7.2 does have this intrinsic. The intrinsic translates into not just one instruction but a sequence of them, which is why it's not present in the SDM. Also, it is only available in 64-bit mode. My guess is that the MSVC version you use lacks that intrinsic, or you're compiling for 32-bit mode.
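
If you need a drop-in replacement on toolchains that lack _mm256_insert_epi64, one portable (if not optimal) workaround is to go through memory; a minimal sketch, with the helper name being my own:

#include <immintrin.h>
#include <stdint.h>

/* Fallback for compilers without _mm256_insert_epi64: spill, patch one lane, reload.
   'index' must be 0..3. Works wherever AVX is available, also in 32-bit mode. */
static __m256i insert_epi64_fallback(__m256i a, int64_t value, int index)
{
    int64_t tmp[4];
    _mm256_storeu_si256((__m256i *)tmp, a);
    tmp[index & 3] = value;
    return _mm256_loadu_si256((const __m256i *)tmp);
}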

 


Do all the _mask_.._mask operations have switched mask names in their descriptions? In these, there is an input mask k1 and a result mask (the unnamed return value, called k in the Description and Operation). The text says they "store the results in mask vector k1 using zeromask k", which seems to me the opposite of their actual roles. For example:

__mmask8 _mm_mask_cmpeq_epi32_mask (__mmask8 k1, __m128i a, __m128i b)
...

Description

Compare packed 32-bit integers in a and b for equality, and store the results in mask vector k1 using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

Operation

FOR j := 0 to 3
  i := j*32
  IF k1[j]
    k[j] := ( a[i+31:i] == b[i+31:i] ) ? 1 : 0
  ELSE
    k[j] := 0
  FI
ENDFOR
k[MAX:4] := 0
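
In actual use the roles are as the Operation shows: k1 is the input writemask and the returned value is the result mask. A minimal sketch:

#include <immintrin.h>

/* k1 is the input writemask; the returned mask holds the comparison results,
   zeroed wherever k1 has a 0 bit - the opposite of what the Description text says. */
__mmask8 equal_in_low_two_lanes(__m128i a, __m128i b)
{
    __mmask8 k1 = 0x3;                           /* only consider elements 0 and 1 */
    return _mm_mask_cmpeq_epi32_mask(k1, a, b);  /* bits 2..7 of the result are 0 */
}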

There's a bug either in ICC or the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0". The comments in the test case expect this behavior. However, the compiler translates it to a no-op. It is certainly correct that the vmovaps instruction has the zeroing behavior, though. But as the Compiler Explorer example shows, not every call to _mm_mask_mov_ps leads to a vmovaps instruction in the resulting binary.

This issue needs to be resolved for all variants of *_mask_mov_*.


@Matthias Kretz

Your link doesn't show the problem. I've forced the compiler to generate code here: https://godbolt.org/g/7kgQtM. Note that gcc also produces similar code. I tend to think that the compilers do not guarantee zeroing the upper bits of the vector register, leaving them "undefined". I think what you want to do is this: https://godbolt.org/g/5gvTVn.

 


@andysem No, it seems you misunderstood the problem. What I want is https://godbolt.org/g/YCB3iD.

Well, to be clear: I don't want _mm_mask_mov_ps to always expand to vmovaps. I believe the documentation in the Intrinsics Guide is incorrect, which is why I posted it on this thread and not as a compiler bug.


@Matthias Kretz

I think I understood you correctly. The vmovaps instruction is generated in the second link I gave, on Intel compiler. Gcc fails to recognize that the zero masking can be optimized to a simple "vmovaps %xmm1, %xmm1" and instead generates "vmovaps %zmm0, %zmm0{%k1}{z}", but that is a matter of QoI. The effect is still the same.

My main point is that I believe when you cast to a smaller vector (e.g. zmm to xmm), perform operations on that smaller part and then cast back, you shouldn't assume any particular contents in the higher bits. The compiler does not guarantee which CPU instructions it generates from intrinsics because it can use more context from the surrounding code and optimize better. For example, the vectorizer could optimize your loop and still use the full-width vectors. The Intrinsics Guide only gives a rough idea of which instructions might be involved, but you shouldn't expect it to be followed literally.

If you want particular content in the upper bits of a vector register, you should write the code so that it does operate on those bits and fill them accordingly. In your case it means operating on __m512 and not __m128. Or you could use inline assembler, of course, but that often precludes the compiler from doing optimizations that would otherwise be possible.
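
For example, something along these lines keeps the zeroing of the upper bits explicit in the source instead of relying on which instruction the compiler happens to emit (a sketch, not necessarily the fastest formulation):

#include <immintrin.h>

/* Place a 128-bit value into the low lane of a zeroed 512-bit vector, so the
   contents of bits 511:128 are defined by the source code itself. */
static __m512 widen_with_zeroed_upper(__m128 x)
{
    return _mm512_insertf32x4(_mm512_setzero_ps(), x, 0);
}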

 


Quote:

andysem wrote:
I think I understood you correctly. 

That's a bold statement to make. You know better what I wanted to do than I do. :-)

Let me try again. I want to report that the documentation in the Intel Intrinsics Guide does not match the behavior ICC (and GCC) produces. That's all I wanted to do here in this forum. I was not looking for a solution to a problem you're trying to guess that I have. ;-)

I have solved my problem already. While solving it, I noticed the last line in the _mm_mask_mov_ps pseudo-code and thought it looked only 99% correct, so I wrote an example that breaks it. And that's how this post happened.


I'm just saying that your original example is not correct, in my opinion. And that compiler behavior wrt. instruction choice for pretty much any intrinsic can be different from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

 


Quote:

andysem wrote:
I'm just saying that your original example is not correct, in my opinion.

I believe ICC (and GCC) optimize correctly in the Compiler Explorer example I provided. IIUC, you believe the same, i.e. you believe my code is wrong. My point all along is that the code is not supposed to achieve anything other than contradict the documentation - not to achieve the effect I wrote in the comments of the code. I didn't want to speak for Intel's actual intent when they designed the intrinsics, which is why I left it open for them to decide whether ICC is at fault here.

Quote:

andysem wrote:
And that compiler behavior wrt. instruction choice for pretty much any intrinsic can be different from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

The Intrinsics Guide documents a specific "Operation" for the _mm_mask_mov_ps intrinsic. I believe what the Guide should say is that this "Operation" is what vmovaps does, i.e. the logical "Operation" of _mm_mask_mov_ps is the "Operation" of vmovaps modulo the zeroing of the high bits, because the compiler should be free to replace the logical operation on the low bits with something equivalent that may not zero the high bits.


Hi, I apologize if this was already requested, but is it possible to add information about the VPDPBUSD instruction? The information is already available in the ISA extensions manual: Intel® Architecture Instruction Set Extensions and Future Features Programming Reference

Intel C/C++ Compiler Intrinsic Equivalent
 

VPDPBUSD __m128i _mm_dpbusd_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_mask_dpbusd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSD __m128i _mm_maskz_dpbusd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSD __m256i _mm256_dpbusd_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_mask_dpbusd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSD __m256i _mm256_maskz_dpbusd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSD __m512i _mm512_dpbusd_epi32(__m512i, __m512i, __m512i);
VPDPBUSD __m512i _mm512_mask_dpbusd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSD __m512i _mm512_maskz_dpbusd_epi32(__mmask16, __m512i, __m512i, __m512i);
VPDPBUSDS __m128i _mm_dpbusds_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_mask_dpbusds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSDS __m128i _mm_maskz_dpbusds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSDS __m256i _mm256_dpbusds_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_mask_dpbusds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_maskz_dpbusds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSDS __m512i _mm512_dpbusds_epi32(__m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_mask_dpbusds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_maskz_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);

Thank you in advance.


Question about _BitScanForward, and its friends:

The operations described in Intrinsics Guide:

unsigned char _BitScanForward (unsigned __int32* index, unsigned __int32 mask)
tmp := 0
IF mask = 0
	dst := 0
ELSE
	DO WHILE ((tmp < 32) AND mask[tmp] = 0)
		tmp := tmp + 1
		index := tmp
		dst := 1
	OD
FI

Although not very clear, it seems to me that if mask==0, then *index will be left unchanged.

In the newest Intel® C++ Compiler 18.0 Developer Guide and Reference (also in versions 15.0 through 17.0), it's described very clearly:

unsigned char _BitScanForward(unsigned __int32 *p, unsigned __int32 b);
Sets *p to the bit index of the least significant set bit of b or leaves it unchanged if b is zero. The function returns a non-zero result when b is non-zero and returns zero when b is zero.

So this behavior is well defined (according to the documents).

However, when I compile the following test code (using icc 18.0, option -O2, for brevity):

unsigned __int32 trailing_zeros(unsigned __int32 x)
{
	unsigned __int32 index = 32;
	_BitScanForward(&index, x);
	return index;
}

It's compiled into:

trailing_zeros(unsigned int):
  bsf eax, edi
  ret

Where is the initial value 32? If x==0, then eax is undefined, but it's returned directly!
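
A defensive pattern that does not rely on *index being preserved is to branch on the documented return value (a sketch, using the same prototype as above):

unsigned __int32 trailing_zeros_safe(unsigned __int32 x)
{
	unsigned __int32 index;
	if (_BitScanForward(&index, x))	/* only trust 'index' when a set bit was found */
		return index;
	return 32;			/* x == 0: define the result ourselves */
}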


I'm also very curious why _BitScanForward is documented like this.

As far as I know, the BSF instruction is still documented as "returns an undefined value when the source is 0", so why was _BitScanForward made different?


Shift clarification:

When describing shift operations, it would help to clarify the shift amount by using the word "byte" or "bit", as in the documentation at https://software.intel.com/en-us/node/524238.

For example:

__m128i _mm_srli_epi32 (__m128i a, int imm8)

...

Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.

Thank you and good day.


Quote:

Groarke, Philippe wrote:

__m128i _mm_srli_epi32 (__m128i a, int imm8)

...

Shift packed 32-bit integers in a right by imm8 bytes while shifting in zeros, and store the results in dst.

 _mm_srli_epi32 operates on bits. By default, shift and rotate operations work in terms of bits. _mm_srli_si128/_mm_slli_si128 and equivalents for larger vectors are exceptions.
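
To illustrate the difference (a small sketch):

#include <immintrin.h>

/* _mm_srli_epi32 shifts each 32-bit element right by a BIT count;
   _mm_srli_si128 shifts the whole 128-bit register right by a BYTE count. */
void shift_examples(__m128i v, __m128i *out_bits, __m128i *out_bytes)
{
    *out_bits  = _mm_srli_epi32(v, 4);  /* each dword shifted right by 4 bits */
    *out_bytes = _mm_srli_si128(v, 4);  /* whole vector shifted right by 4 bytes (32 bits) */
}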

 


Wrong return type in the `rdtsc` intrinsic

I've found an issue with the description of the `rdtsc` intrinsic.

This instruction is used to read the value of the CPU's timestamp counter, a 64-bit monotonically-increasing unsigned integer. However, the docs say the return type is a signed `__int64`. Both GCC and Clang properly expose this intrinsic as returning an unsigned long long, not a signed one. I believe this should be fixed.
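
For reference, this is how the intrinsic is typically declared and used on GCC/Clang (a small sketch; the point is only that the type is unsigned):

#include <x86intrin.h>

/* __rdtsc() returns unsigned long long on GCC/Clang, matching the unsigned
   64-bit time-stamp counter; the difference below wraps around cleanly. */
unsigned long long cycles_elapsed(void (*work)(void))
{
    unsigned long long start = __rdtsc();
    work();
    return __rdtsc() - start;
}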


Hi,

The AVX512-VNNI instruction vpdpbusd doesn't show up in the guide, as far as I can see.


The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?


Quote:

Kearney, Jim wrote:

The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?

Instructions that operate on 128 or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors only require AVX-512VBMI.
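
In practice a runtime dispatcher therefore has to check both feature bits before touching the 128/256-bit forms; a sketch, assuming GCC-style __builtin_cpu_supports and target attributes:

#include <immintrin.h>

__attribute__((target("avx512vbmi,avx512vl")))
static __m256i permute_bytes_256(__m256i idx, __m256i a)
{
    return _mm256_permutexvar_epi8(idx, a);   /* needs AVX512VBMI *and* AVX512VL */
}

static int can_use_vbmi_256(void)
{
    return __builtin_cpu_supports("avx512vbmi") && __builtin_cpu_supports("avx512vl");
}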


Quote:

andysem wrote:

Quote:

Kearney, Jim wrote:

 

The Guide shows various permute*epi8 intrinsics as available in "AVX512_VBMI + AVX512VL", but aren't they only in VBMI?

Instructions that operate on 128 or 256-bit vectors require AVX-512VL in addition to AVX-512VBMI. 512-bit vectors only require AVX-512VBMI.

Sorry, I phrased that poorly.

I meant, the presentation in the Guide acts as though "+" means "or".  If one enables only AVX-512VL under Technologies, the _mm_ and _mm256_ variants are shown as available.  I would expect to have to check AVX-512_VBMI as well ("and").

 


Hi,

Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements? The documentation seems to suggest that _mm_load_si128 does have a 16-byte alignment requirement, while _mm_loadu_si128 does not have any requirements. It doesn't mention the alignment requirements for _mm_loadl_epi64 as far as I can tell.

Thanks!


Quote:

Armstrong, Brian wrote:

Does _mm_loadl_epi64 (movq xmm, m64) have alignment requirements?

SDM Volume 1, Section 4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords summarizes memory alignment requirements for most instructions. In particular, it says that words, doublewords and quadwords need not be aligned.
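
So loading a quadword from an arbitrary byte offset is legal; a small sketch:

#include <immintrin.h>
#include <stdint.h>

/* movq xmm, m64 has no alignment requirement, so an odd byte offset is fine
   (it may just be slower when the load crosses a cache-line boundary). */
__m128i load_q_from_offset(const uint8_t *buf)
{
    return _mm_loadl_epi64((const __m128i *)(buf + 3));
}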

 


I don't think the currently defined intrinsics for SGX (_encls_u32, _enclu_u32, _enclv_u32) are appropriate. The leafs are so different that they should really be considered different instructions. No other instructions that have defined intrinsics have this issue.

  • Different leafs do completely different things
  • Different leafs have completely different operands
  • Different leafs may have different hardware support (in terms of CPU features)

The operation is of course documented in the SDM, but I've summarized all the leafs here https://github.com/fortanix/rust-sgx/issues/15#issuecomment-447738899


For _mm256_testc_si256 and other similar intrinsics, the Intrinsics Guide lists the corresponding instruction as vptest and CPUID flags as AVX. vptest is an AVX2 instruction. I'm assuming, for AVX targets the compiler should translate the intrinsic to either vtestps or vtestpd instruction, but those instructions are not listed.


For _mm_cvtpd_ps it would be useful to say in the description and pseudo-code that the upper half of the resulting register is filled with zero. For _mm_cvtps_pd it would be nice to mention in the description that it converts only the lower 2 elements of the vector.

 

It's been a while since the last update of the Intrinsics Guide; when will there be a new update?

 


The description of tpause has a few problems

  1. the asm syntax is weird, with spurious commas
  2. there is no statement about what the result of the intrinsic is. The implementation shows that it is rflags.cf, but you should say that explicitly.

That's good feedback, thank you James, I'll update accordingly.

andysem, I try to batch several updates together and coordinate with any new intrinsics being announced; I'm still looking for the right date to do a release.


The _rdtsc intrinsic has a return type of __int64 (signed integer), which is incorrect, since the value is read from the CPU's timestamp counter, an unsigned integer.

Existing C compilers (such as gcc) already return an unsigned long long type.

This is an issue for Rust developers when implementing Intel's intrinsics, because we use the intrinsic guide as a reference for the return type and parameters of the intrinsics. 


Intrinsics Guide Data Version: 3.4.4

It looks as if, for the two non-masked operations _mm512_popcnt_epi8() and _mm512_popcnt_epi16(), the POPCNT() operator is missing from the pseudo-code.


Re-wording proposal for _mm512_bitshuffle_epi64_mask()

Per 64-bit element in b and its 8 associated 8-bit elements in c: gather 8 bits from the 64-bit element of b at the bit positions given by the 8 8-bit elements of c, and store the result in the associated byte of mask k. There are 8 such operations, one per 64-bit lane, done in parallel, each producing one byte of k.
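
Expressed as scalar code, the proposed wording corresponds to something like this (my own sketch of the semantics, not the Guide's pseudo-code):

#include <stdint.h>

/* Scalar model of VPSHUFBITQMB / _mm512_bitshuffle_epi64_mask:
   b = 8 qwords, c = 64 control bytes, result = 64-bit mask. */
static uint64_t bitshuffle_epi64_mask(const uint64_t b[8], const uint8_t c[64])
{
    uint64_t k = 0;
    for (int lane = 0; lane < 8; lane++) {
        for (int byte = 0; byte < 8; byte++) {
            int pos = c[lane * 8 + byte] & 0x3F;   /* low 6 bits select a bit of b.qword[lane] */
            uint64_t bit = (b[lane] >> pos) & 1;
            k |= bit << (lane * 8 + byte);         /* one result bit per control byte */
        }
    }
    return k;
}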


The pseudo-code of all the _mm512_dpbusd_epi32() variants looks incorrect.

The for-loop is defined with j, but operator[] is using expressions in i.

I suspect that the expressions should read

tmp2 := a.byte[4*j+1] * b.byte[4*j+1]  // i.e. everything indexed with j, and for b it should also be 4*j


For the two non-masked operations _mm512_popcnt_epi32() and _mm512_popcnt_epi64(), the POPCNT() operator is missing in the pseudo-code.


Hi, I found a mistake in the Operation description of every addsub, fmaddsub, subadd and fmsubadd instruction.

The IF-statement inside the loop reads IF (j % 1 == 0), but it should be IF (j % 2 == 0).
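
For example, _mm_addsub_ps behaves like this scalar model, which is what the (j % 2 == 0) test expresses (a sketch):

/* Scalar model of ADDSUBPS: even-indexed lanes subtract, odd-indexed lanes add. */
static void addsub_ps(const float a[4], const float b[4], float dst[4])
{
    for (int j = 0; j < 4; j++) {
        if (j % 2 == 0)
            dst[j] = a[j] - b[j];   /* even index: subtract */
        else
            dst[j] = a[j] + b[j];   /* odd index: add */
    }
}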


Hi,

There is a small typo in _mm256_hsub_ps: it says "Horizontally add adjacent pairs [...]" where it should really say "subtract".

Thanks!


Descriptions of _mm_set_epi8(), _mm256_set_epi8(), _mm512_set_epi8(), and _mm512_set_epi16() all say "reverse order", which is incorrect.


The pseudo-code of all nine *_dpbusd_epi32() intrinsics still looks incorrect, since the 4* factor for operand b is missing.

tmp1 := a.byte[4*j] * b.byte[4*j]
tmp2 := a.byte[4*j+1] * b.byte[4*j+1]
tmp3 := a.byte[4*j+2] * b.byte[4*j+2]
tmp4 := a.byte[4*j+3] * b.byte[4*j+3]
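
As a scalar reference, the corrected indexing corresponds to the following (my own sketch; per VPDPBUSD, a is treated as unsigned bytes and b as signed bytes, with no saturation):

#include <stdint.h>

/* Scalar model of one 128-bit VPDPBUSD: 4 dword accumulators, each adding
   four unsigned(a) x signed(b) byte products taken from the SAME byte positions. */
static void dpbusd_epi32(const uint8_t a[16], const int8_t b[16],
                         const int32_t src[4], int32_t dst[4])
{
    for (int j = 0; j < 4; j++) {
        int32_t sum = src[j];
        for (int i = 0; i < 4; i++)
            sum += (int32_t)a[4 * j + i] * (int32_t)b[4 * j + i];
        dst[j] = sum;
    }
}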

 


There seems to be a description error for all the VNNI intrinsics:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=255...

For example:

__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)

The description is as below, but the b.byte[...] indices do not correspond to the related a.byte indices. E.g., tmp1 should be a.byte[4*j] * b.byte[4*j] instead of a.byte[4*j] * b.byte[j].

FOR j := 0 to 3
	tmp1 := a.byte[4*j] * b.byte[j]
	tmp2 := a.byte[4*j+1] * b.byte[j+1]
	tmp3 := a.byte[4*j+2] * b.byte[j+2]
	tmp4 := a.byte[4*j+3] * b.byte[j+3]
	dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
ENDFOR
dst[MAX:128] := 0

The same error happens for all other VNNI intrinsic descriptions.


This issue applies to:  _mm_xor_epi32, _mm256_xor_epi32, _mm_xor_epi64, and _mm256_xor_epi64

The intrinsics guide shows these functions as generating vpord/vporq instead of vpxord/vpxorq.

The wrong instruction appears both on the right side of the function name in the summary view and in the internal detailed descriptions.


The description of the _ktestc_maskXX instructions in the online Intrinsics Guide disagrees with the Intel Architecture Software Developer Manual.

The Intrinsics Guide says that the function returns true if the NAND of the operands is all ones.

The Architecture Manual states that the 'CF' flag is lit if the result of the NAND operation is all zeros.

Presumably the Architecture Manual is correct, since a NAND producing an all-ones result is meaningless.
