Bugs in Intrinsics Guide


I'm confused by the operation of _mm512_i32extscatter_epi32.

If you use conv=_MM_DOWNCONV_EPI32_NONE and hint=_MM_HINT_NONE, this intrinsic should be equivalent to _mm512_i32scatter_epi32.

For example, with _MM_DOWNCONV_EPI32_UINT8, take j=15: then i=480 and n=120, and addr[127:120] := UInt32ToUInt8(v1[511:480]). Are we really using 128-bit-wide addresses? The operation of _mm512_i32scatter_epi32 makes a lot more sense. See below.

Can someone please explain how the operation of _mm512_i32extscatter_epi32 should be read?

Regards Henk-Jan.

---
void _mm512_i32extscatter_epi32 (void * mv, __m512i index, __m512i v1, _MM_DOWNCONV_EPI32_ENUM conv, int scale, int hint)
Operation:

FOR j := 0 to 15
    addr := MEM[mv + index[j] * scale]
    i := j*32
    CASE conv OF 
        _MM_DOWNCONV_EPI32_NONE: 
            addr[i+31:i] := v1[i+31:i]
        _MM_DOWNCONV_EPI32_UINT8: 
            n := j*8 
            addr[n+7:n] := UInt32ToUInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_SINT8:
            n := j*8
            addr[n+7:n] := SInt32ToSInt8(v1[i+31:i])
        _MM_DOWNCONV_EPI32_UINT16:
            n := j*16 
            addr[n+15:n] := UInt32ToUInt16(v1[i+31:i]) 
        _MM_DOWNCONV_EPI32_SINT16: 
            n := j*16 
            addr[n+15:n] := SInt32ToSInt16(v1[n+15:n]) 
    ESAC 
ENDFOR 

---
void _mm512_i32scatter_epi32 (void* base_addr, __m512i vindex, __m512i a, int scale)
Operation: 

FOR j := 0 to 15
    i := j*32 
    MEM[base_addr + SignExtend(vindex[i+31:i])*scale] := a[i+31:i] 
ENDFOR
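For comparison, the _mm512_i32scatter_epi32 operation above can be emulated in plain C. This is only a sketch of the documented semantics (the helper name is made up), not the intrinsic itself:

```c
#include <stdint.h>
#include <string.h>

/* Scalar sketch of _mm512_i32scatter_epi32 semantics: 16 dword
   elements are scattered to base_addr + SignExtend(index[j]) * scale. */
static void scatter_epi32_emul(void *base_addr, const int32_t index[16],
                               const int32_t a[16], int scale)
{
    for (int j = 0; j < 16; j++) {
        /* Sign-extend the 32-bit index, then scale in bytes. */
        int64_t off = (int64_t)index[j] * scale;
        memcpy((char *)base_addr + off, &a[j], sizeof(int32_t));
    }
}
```

Note that each index is sign-extended before scaling, matching the SignExtend(vindex[i+31:i]) step in the pseudocode.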

 

Selecting AVX512_4FMAPS instruction set, one intrinsic is missing:  _mm512_4fmadd_ps

That intrinsic is in the tool; it is just missing from the filtered AVX512_4FMAPS list.

 

The entire set of instructions that use masks as input, such as kmov, kshift, kand and so on, is absent from the guide. I had to use these several times, and their absence makes that harder.

Quote:

Eden S. (Intel) wrote:

The entire set of instructions that use masks as input, such as kmov, kshift, kand and so on, is absent from the guide.

The guide does describe some intrinsics like _mm512_kmov, _mm512_kand, etc., but they mostly deal with 16-bit masks and are not broken out into a separate category, which would be useful for searching.
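For anyone searching in the meantime, the 16-bit mask intrinsics boil down to ordinary bitwise operations. A scalar sketch of the _mm512_kand and _mm512_knot semantics (helper names are made up):

```c
#include <stdint.h>

typedef uint16_t mmask16_t; /* stand-in for __mmask16 */

/* _mm512_kand semantics: bitwise AND of two 16-bit masks. */
static mmask16_t kand_emul(mmask16_t a, mmask16_t b)
{
    return (mmask16_t)(a & b);
}

/* _mm512_knot semantics: bitwise NOT of a 16-bit mask. */
static mmask16_t knot_emul(mmask16_t a)
{
    return (mmask16_t)~a;
}
```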

 

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
The Intel® 64 and IA-32 Architectures Software Developer's Manual, Combined Volumes 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number 325462-063US, July 2017), doesn't have any information about this intrinsic name either.
What does this mean: do we have such an operation or not?
 

Quote:

Zvi Danovich (Intel) wrote:

In this guide I see the description of __m256i _mm256_insert_epi64 (__m256i a, __int64 i, const int index).

The MS compiler (Visual Studio 2015) doesn't recognize such an intrinsic.
The Intel® 64 and IA-32 Architectures Software Developer's Manual, Combined Volumes 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D and 4 (Order Number 325462-063US, July 2017), doesn't have any information about this intrinsic name either.
What does this mean: do we have such an operation or not?

gcc 7.2 does have this intrinsic. It translates into a sequence of instructions rather than a single one, which is why it's not present in the SDM. It is also only available in 64-bit mode. My guess is that the MSVC version you use lacks the intrinsic, or you're compiling for 32-bit mode.
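To illustrate why no single instruction is needed: the effect of _mm256_insert_epi64 is just a 64-bit lane replacement, which can be sketched in scalar C (hypothetical helper, treating the vector as an array of four lanes):

```c
#include <stdint.h>

/* Scalar sketch of _mm256_insert_epi64 semantics: replace one of
   the four 64-bit lanes, selected by index (0..3). */
static void insert_epi64_emul(int64_t v[4], int64_t x, int index)
{
    v[index & 3] = x; /* index is a compile-time constant in the real intrinsic */
}
```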

 

Do all the _mask_.._mask operations have swapped mask names in their descriptions? In these, there is an input mask k1 and a result mask (the unnamed return value, called k in the Description and Operation). It is written that they "store the results in mask vector k1 using zeromask k", which seems to me to be the opposite of their actual roles. For example:

__mmask8 _mm_mask_cmpeq_epi32_mask (__mmask8 k1, __m128i a, __m128i b)

...

Description

Compare packed 32-bit integers in a and b for equality, and store the results in mask vector k1 using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

Operation

FOR j := 0 to 3
    i := j*32
    IF k1[j]
        k[j] := ( a[i+31:i] == b[i+31:i] ) ? 1 : 0
    ELSE
        k[j] := 0
    FI
ENDFOR
k[MAX:4] := 0
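Reading the pseudocode literally, k1 gates the comparison and the unnamed return value k receives the results; a scalar sketch of that reading (hypothetical helper):

```c
#include <stdint.h>

/* Scalar sketch of _mm_mask_cmpeq_epi32_mask semantics as written in
   the Operation: k1 is the input gate; the returned mask k holds the
   comparison results, zeroed where k1 is clear. */
static uint8_t mask_cmpeq_epi32_emul(uint8_t k1, const int32_t a[4],
                                     const int32_t b[4])
{
    uint8_t k = 0;
    for (int j = 0; j < 4; j++) {
        if ((k1 >> j) & 1) {
            if (a[j] == b[j])
                k |= (uint8_t)(1u << j);
        }
        /* else: k[j] stays 0 (zeroed, not copied from k1) */
    }
    return k;
}
```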

There's a bug either in ICC or in the documentation. Consider https://godbolt.org/g/LYJjM2. The documentation for _mm_mask_mov_ps says "dst[MAX:128] := 0", and the comments in the test case expect this behavior. However, the compiler translates the call to a no-op. It is certainly true that the vmovaps instruction has the zeroing behavior; but as the Compiler Explorer example shows, not every call to _mm_mask_mov_ps leads to a vmovaps instruction in the resulting binary.

This issue needs to be resolved for all variants of *_mask_mov_*.

@Matthias Kretz

Your link doesn't show the problem. I've forced the compiler to generate code here: https://godbolt.org/g/7kgQtM. Note that gcc produces similar code. I tend to think the compilers do not guarantee zeroing the upper bits of the vector register, leaving them "undefined". I think what you want to do is this: https://godbolt.org/g/5gvTVn.

 

@andysem No, it seems you misunderstood the problem. What I want is https://godbolt.org/g/YCB3iD .

Well, to be clear. I don't want _mm_mask_mov_ps to always expand to vmovaps. I believe the documentation in the intrinsics guide is incorrect. Which is why I posted it on this thread and not as a compiler bug.

@Matthias Kretz

I think I understood you correctly. The vmovaps instruction is generated in the second link I gave, with the Intel compiler. Gcc fails to recognize that the zero masking can be optimized to a simple "vmovaps %xmm1, %xmm1" and instead generates "vmovaps %zmm0, %zmm0{%k1}{z}", but that is a matter of QoI. The effect is still the same.

My main point is that I believe when you cast to a smaller vector (e.g. zmm to xmm), perform operations on that smaller part and then cast back, you shouldn't assume any particular contents in the higher bits. The compiler does not guarantee which CPU instructions it generates from intrinsics because it can use more context from the surrounding code and optimize better. For example, the vectorizer could optimize your loop and still use the full-width vectors. The Intrinsics Guide only gives a rough idea of which instructions might be involved, but you shouldn't expect it to be followed literally.

If you want particular content in the upper bits of a vector register, you should write the code so that it does operate on those bits and fill them accordingly. In your case that means operating on __m512 and not __m128. Or you could use inline assembler, of course, but that often precludes the compiler from doing optimizations that would otherwise be possible.

 

Quote:

andysem wrote:
I think I understood you correctly. 

That's a bold statement to make. You know better than I do what I wanted to do. :-)

Let me try again. I want to report that the documentation in the Intel Intrinsics Guide does not match the behavior ICC (and GCC) produces. That's all I wanted to do here in this forum. I was not looking for a solution to a problem you're trying to guess that I have. ;-)

I have already solved my problem. While solving it, I noticed the last line in the _mm_mask_mov_ps pseudocode and thought it looked only 99% correct, so I wrote an example that breaks it. That's how this post happened.

I'm just saying that your original example is not correct, in my opinion, and that compiler behavior with respect to instruction choice for pretty much any intrinsic can differ from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

 

Quote:

andysem wrote:
I'm just saying that your original example is not correct, in my opinion.

I believe ICC (and GCC) optimize correctly in the Compiler Explorer example I provided, and IIUC you believe the same, i.e. you believe my code is wrong. My point all along is that the code is not supposed to achieve anything other than contradict the documentation; it was never meant to achieve the effect I wrote in its comments. I didn't want to speak for Intel's actual intent when they designed the intrinsics, which is why I left it open for them to decide whether they believe ICC is at fault here.

Quote:

andysem wrote:
And that compiler behavior with respect to instruction choice for pretty much any intrinsic can differ from what is said in the Intrinsics Guide. I don't see that as an error in the Intrinsics Guide or in the compiler. That's all.

The Intrinsics Guide documents a specific "Operation" for the _mm_mask_mov_ps intrinsic. I believe what the Guide should say is that this "Operation" is what vmovaps does, i.e. the logical "Operation" of _mm_mask_mov_ps is the "Operation" of vmovaps minus the zeroing of the high bits, because the compiler should be free to replace the logical operation on the low bits with something equivalent that may not zero the high bits.
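To make the disputed step concrete: taken literally on a wider register viewed as 16 floats, the documented Operation of _mm_mask_mov_ps would behave like the following scalar sketch (hypothetical helper; the final loop is the "dst[MAX:128] := 0" step that compilers do not actually guarantee):

```c
/* Scalar sketch of the documented _mm_mask_mov_ps operation, with the
   destination modeled as a 512-bit register viewed as 16 floats.
   Lanes 0..3 merge a into src under mask k; lanes 4..15 are the
   documented (disputed) "dst[MAX:128] := 0" step. */
static void mask_mov_ps_documented(float dst[16], const float src[4],
                                   const float a[4], unsigned k)
{
    for (int j = 0; j < 4; j++)
        dst[j] = ((k >> j) & 1) ? a[j] : src[j]; /* merge masking */
    for (int j = 4; j < 16; j++)
        dst[j] = 0.0f; /* the zeroing step under dispute */
}
```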

Hi, I apologize if this was already requested, but is it possible to add information about the VPDPBUSD instruction? You can already find it in the ISA extensions manual: Intel® Architecture Instruction Set Extensions and Future Features Programming Reference.

Intel C/C++ Compiler Intrinsic Equivalent
 

VPDPBUSD __m128i _mm_dpbusd_epi32(__m128i, __m128i, __m128i);
VPDPBUSD __m128i _mm_mask_dpbusd_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSD __m128i _mm_maskz_dpbusd_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSD __m256i _mm256_dpbusd_epi32(__m256i, __m256i, __m256i);
VPDPBUSD __m256i _mm256_mask_dpbusd_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSD __m256i _mm256_maskz_dpbusd_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSD __m512i _mm512_dpbusd_epi32(__m512i, __m512i, __m512i);
VPDPBUSD __m512i _mm512_mask_dpbusd_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSD __m512i _mm512_maskz_dpbusd_epi32(__mmask16, __m512i, __m512i, __m512i);
VPDPBUSDS __m128i _mm_dpbusds_epi32(__m128i, __m128i, __m128i);
VPDPBUSDS __m128i _mm_mask_dpbusds_epi32(__m128i, __mmask8, __m128i, __m128i);
VPDPBUSDS __m128i _mm_maskz_dpbusds_epi32(__mmask8, __m128i, __m128i, __m128i);
VPDPBUSDS __m256i _mm256_dpbusds_epi32(__m256i, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_mask_dpbusds_epi32(__m256i, __mmask8, __m256i, __m256i);
VPDPBUSDS __m256i _mm256_maskz_dpbusds_epi32(__mmask8, __m256i, __m256i, __m256i);
VPDPBUSDS __m512i _mm512_dpbusds_epi32(__m512i, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_mask_dpbusds_epi32(__m512i, __mmask16, __m512i, __m512i);
VPDPBUSDS __m512i _mm512_maskz_dpbusds_epi32(__mmask16, __m512i, __m512i, __m512i);

Thank you in advance.
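For context, VPDPBUSD multiplies groups of four unsigned bytes by four signed bytes and accumulates each group's sum into a 32-bit lane. A scalar sketch of the non-saturating _mm_dpbusd_epi32 semantics (hypothetical helper, based on the ISA extensions reference; the saturating VPDPBUSDS variant additionally clamps the accumulation):

```c
#include <stdint.h>

/* Scalar sketch of _mm_dpbusd_epi32 semantics: for each of the 4
   dword lanes, sum four products u8(a) * s8(b) and add the sum to
   the 32-bit accumulator in src (non-saturating). */
static void dpbusd_epi32_emul(int32_t src[4], const uint8_t a[16],
                              const int8_t b[16])
{
    for (int j = 0; j < 4; j++) {
        int32_t sum = 0;
        for (int i = 0; i < 4; i++)
            sum += (int32_t)a[4 * j + i] * (int32_t)b[4 * j + i];
        src[j] += sum;
    }
}
```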

Question about _BitScanForward, and its friends:

The operations described in Intrinsics Guide:

unsigned char _BitScanForward (unsigned __int32* index, unsigned __int32 mask)
tmp := 0
IF mask = 0
	dst := 0
ELSE
	DO WHILE ((tmp < 32) AND mask[tmp] = 0)
		tmp := tmp + 1
		index := tmp
		dst := 1
	OD
FI

Although not very clear, it seems to me that if mask==0, then *index will be left unchanged.

In the newest Intel® C++ Compiler 18.0 Developer Guide and Reference (and also in versions 15.0 through 17.0), it's described very clearly:

unsigned char _BitScanForward(unsigned __int32 *p, unsigned __int32 b);
Sets *p to the bit index of the least significant set bit of b or leaves it unchanged if b is zero. The function returns a non-zero result when b is non-zero and returns zero when b is zero.

So this behavior is well defined (according to the documents).

However, when I compile the following test code (icc 18.0, option -O2):

unsigned __int32 trailing_zeros(unsigned __int32 x)
{
	unsigned __int32 index = 32;
	_BitScanForward(&index, x);
	return index;
}

It's compiled into:

trailing_zeros(unsigned int):
  bsf eax, edi
  ret

Where is the initial value 32? If x==0, then eax is undefined, but it's returned directly!

I'm also very curious why _BitScanForward is documented like this.

As far as I know, the BSF instruction is still documented to return an undefined value when the source is 0, so why was _BitScanForward made different?
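For reference, the behavior the Developer Guide documents can be written portably as follows (a sketch with a made-up helper name, not the intrinsic):

```c
#include <stdint.h>

/* Scalar sketch of the documented _BitScanForward behavior:
   *index is set to the position of the least significant set bit,
   and left untouched when mask == 0. Returns nonzero iff mask != 0. */
static unsigned char bitscanforward_emul(uint32_t *index, uint32_t mask)
{
    if (mask == 0)
        return 0; /* *index deliberately unchanged */
    uint32_t tmp = 0;
    while (((mask >> tmp) & 1) == 0)
        tmp++;
    *index = tmp;
    return 1;
}
```

With this reading, the trailing_zeros example above would have to return 32 for x==0, which is exactly what the icc output fails to do.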
