Bugs in Intrinsics Guide


Hi,

I think there is an error in the description of the algorithm of the intrinsics '_mm512_*_extpackstorelo_*' (or maybe I'm missing something):

The condition

IF (storeAddr % 64) == 0 BREAK

should be something like

IF ((addr + storeOffset * downSize) % 64) == 0 BREAK

Otherwise, the first aligned element (hi) will be written by the 'lo' intrinsic and it shouldn't according to my understanding.
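To make the proposed condition concrete, here is a minimal scalar sketch. It is only a model under assumptions: `lo_store_count` is a hypothetical helper (not part of any Intel header), it ignores write masks, and it assumes the boundary check runs right after each element is stored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model (not an Intel API): number of elements the
 * 'lo' pack-store writes when the loop stops as soon as
 * (addr + storeOffset * downSize) reaches a 64-byte boundary. */
size_t lo_store_count(uintptr_t addr, size_t down_size, size_t n_elems)
{
    assert(down_size > 0);
    size_t store_offset = 0;
    for (size_t j = 0; j < n_elems; j++) {
        /* element j would be stored at addr + store_offset * down_size */
        store_offset++;
        if ((addr + store_offset * down_size) % 64 == 0)
            break; /* the next element starts a new line: leave it to 'hi' */
    }
    return store_offset;
}
```

With 4-byte elements and `addr % 64 == 60`, this model writes exactly one element in the 'lo' part and leaves the first aligned element to the 'hi' intrinsic.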

 

Please, let me know if I'm wrong.

Thanks.

Barcelona Supercomputing Center

There appear to be a number of issues with KNC intrinsics, including several missing intrinsics (specifically when the name matches an AVX-512 intrinsic), and intrinsics that should be cross listed as both AVX-512 and KNC but are only listed under AVX-512. I am in the process of reviewing all KNC intrinsics and will release an update that should resolve all these issues shortly.

The function _mm512_fmadd233_epi32 is listed in the Intrinsics guide as a = b*c. I guess that is also a typo.

 

btw I really like the Intrinsics Guide! Would it be possible to add a button for choosing the data type (integer, floating point), like in the software "Intel Intrinsics Guide - v.3.01"?

Another idea for improvement would be to add an "advanced search", e.g. a search for functions with a specific output data type (int, double and so forth). That search option would have saved me a lot of time.

I've just updated the Intrinsics Guide (v3.1.5). This should resolve all the KNC issues, as well as the issue with fmadd233 and extpackstorelo.

http://software.intel.com/sites/landingpage/IntrinsicsGuide/

_mm_sub_epi16 intrinsic is documented to correspond to phsubw instruction, while it should be psubw. The timing data is also given for phsubw instead of psubw.

No compiler version info. For example, _mm_erfcinv_ps appeared in ICC 14.

I've resolved the issue with _mm_sub_epi16, the update should appear soon. I've also added the new intrinsics for xsavec, xsaves, and xrstors.

Great tool, some shortcomings

  1. _mm_xor_si128() says "bitwisw OR"
  2. All commands with "abs" may add information about behaviour for the value -2^(N-1) with N being bitwidth of corresponding epi type
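For item 2, a small scalar sketch of the edge case, shown for N = 8. The function name is made up for illustration; it models one lane of a packed 8-bit absolute value and returns the raw bit pattern so that no implementation-defined conversion is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 8-bit lane of a packed absolute-value operation.
 * The result is |x| reduced modulo 2^8, returned as a bit pattern.
 * For x = -128 the pattern is 0x80: +128 does not fit in a signed
 * 8-bit lane, so read as signed the lane still holds -128. */
uint8_t abs_lane_epi8_bits(int8_t x)
{
    uint8_t u = (uint8_t)x;
    return (x < 0) ? (uint8_t)(0u - u) : u; /* negate modulo 2^8 */
}

/* Illustrative self-check of the edge case. */
void abs_lane_selfcheck(void)
{
    assert(abs_lane_epi8_bits(INT8_MIN) == 0x80u);
    assert(abs_lane_epi8_bits(-5) == 5u);
}
```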

Hello,

I currently use data version 3.1.6 very actively and had trouble with compiling the four intrinsics *_bslli_si128() and *_bsrli_si128(). With gcc, they only compile when I remove the b. I do not (yet) use Intel compiler, but the SW developer manual also lists those four intrinsics without b.

Intel C/C++ Compiler Intrinsic Equivalent

(V)PSLLDQ: __m128i _mm_slli_si128 ( __m128i a, int imm)

VPSLLDQ: __m256i _mm256_slli_si256 ( __m256i a, const int imm)

Intel C/C++ Compiler Intrinsic Equivalents

(V)PSRLDQ: __m128i _mm_srli_si128 ( __m128i a, int imm)

VPSRLDQ: __m256i _mm256_srli_si256 ( __m256i a, const int imm)

Please, specify that _mm_madd_epi16 and _mm256_madd_epi16 perform signed multiplication.

Quote:

Stefan M. wrote:

Hello,

I currently use data version 3.1.6 very actively and had trouble with compiling the four intrinsics *_bslli_si128() and *_bsrli_si128(). With gcc, they only compile when I remove the b. I do not (yet) use Intel compiler, but the SW developer manual also lists those four intrinsics without b.

Intel C/C++ Compiler Intrinsic Equivalent

(V)PSLLDQ: __m128i _mm_slli_si128 ( __m128i a, int imm)

VPSLLDQ: __m256i _mm256_slli_si256 ( __m256i a, const int imm)

Intel C/C++ Compiler Intrinsic Equivalents

(V)PSRLDQ: __m128i _mm_srli_si128 ( __m128i a, int imm)

VPSRLDQ: __m256i _mm256_srli_si256 ( __m256i a, const int imm)

You can use either name; both perform the same operation, although the "b" names may not be supported by GCC at this point.
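Whichever name is used, the operation is a byte-granular shift of the whole 128-bit value. A scalar sketch of the left shift (PSLLDQ), assuming byte 0 is the least significant byte; the function name is illustrative only:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of PSLLDQ: shift a 16-byte value left by `imm` bytes,
 * filling with zeros. Only imm[7:0] is considered, and any count above
 * 15 clears the result. */
void byte_shift_left_128(uint8_t dst[16], const uint8_t src[16], int imm)
{
    assert(dst && src);
    unsigned count = (unsigned)imm & 0xFFu;
    uint8_t tmp[16] = {0};
    if (count > 15)
        count = 16;
    for (unsigned i = count; i < 16; i++)
        tmp[i] = src[i - count];
    memcpy(dst, tmp, 16); /* tmp buffer keeps dst == src safe */
}
```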

 

http://software.intel.com/sites/landingpage/IntrinsicsGuide/ works for me now. That said there were site outages a few days ago (the forum was completely inaccessible for a day or two for me), maybe the problems are still happening from time to time.

Quote:

andysem wrote:

http://software.intel.com/sites/landingpage/IntrinsicsGuide/ works for me now. That said there were site outages a few days ago (the forum was completely inaccessible for a day or two for me), maybe the problems are still happening from time to time.

Works for me also.

Sorry about that, there were some server changes that caused some intermittent issues, but it should be working fine now.

description for __m128i _mm_sad_epu8 (__m128i a, __m128i b) is not correct,  

 

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
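For comparison, a scalar sketch of what PSADBW computes for 128-bit operands: two sums, not four. The function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of PSADBW on 128-bit operands: for each 8-byte half,
 * sum the absolute differences of the unsigned bytes. Each sum is at
 * most 8 * 255 = 2040, so it fits in the low 16 bits of its 64-bit
 * element and the upper 48 bits are zero. */
void sad_epu8_model(uint64_t dst[2], const uint8_t a[16], const uint8_t b[16])
{
    assert(dst && a && b);
    for (int half = 0; half < 2; half++) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; i++) {
            int d = (int)a[half * 8 + i] - (int)b[half * 8 + i];
            sum += (uint64_t)abs(d);
        }
        dst[half] = sum;
    }
}
```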

User Jeremias M. wrote here: https://software.intel.com/en-us/forums/topic/516476#comment-1791398 regarding an issue with filtering results for KNC only: the search returns _mm512_mask_set1_epi32 as a valid intrinsic for KNC. That is currently incorrect. It may become true in a future release, as discussed in the cited thread.

Thank you for your feedback. I've updated the Intrinsics Guide to resolve the issues with _mm_sad_epu8 and _mm512_mask_set1_epi32, as well as a few other issues with KNC intrinsics.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Hi,

I was using the function _mm512_mask_reduce_gmax_pd, and when I looked up the corresponding integer functions in the guide, they appeared only under AVX-512.

So I checked the zmmintrin.h header and saw that the functions are implemented there. Then I tested some of them (_mm512_mask_reduce_max_epi32 (__mmask16 k, __m512i a) and _mm512_reduce_max_epi32 (__m512i a)), and they worked.

I believe the functions below may have been made for KNC too.

int _mm512_reduce_max_epi32 (__m512i a)

__int64 _mm512_reduce_max_epi64 (__m512i a)

unsigned int _mm512_reduce_max_epu32 (__m512i a)

unsigned __int64 _mm512_reduce_max_epu64 (__m512i a)

double _mm512_reduce_max_pd (__m512d a)

float _mm512_reduce_max_ps (__m512 a)

 

You are correct, all the _reduce_ intrinsics are supported on KNC. I've updated the Intrinsics Guide to resolve this issue.

_mm_test_all_ones intrinsic has multiple different timing values for the same CPUs.

The Intel intrinsics guide page doesn't load for me or loads really slow (about a minute or so). It shows the intrinsics categories on the left and "Loading" in the center and hangs this way. I'm using Firefox 32.0.3 on Linux.

On a related note, will there be an offline standalone release? Browser version is not always convenient for me.

 

I find the opening screen of the guide to be very unreadable. It would be much more readable if only the function name were used at the top level instead of the full function prototypes. Using the prototypes just creates a lot of visual noise that obscures the function names. Since the prototype is easily visible when a function is displayed, IMHO, the extra click needed to see the prototype is outweighed by the improved readability.

It would be helpful if the description of the intrinsics also had a link to the corresponding instruction's description in the Intel Processor Instruction Set manual, so we can easily get the dirty details on the generated instruction.

Quote:

Glenn D. wrote:

It would be much more readable if only the function name were used at the top level instead of the full function prototypes.

I disagree. The prototype is useful for me because I often don't remember the exact signature or arguments of the intrinsic, and all I have to do is just type it in the search field.

 

Quote:

andysem wrote:

Please, specify that _mm_madd_epi16 and _mm256_madd_epi16 perform signed multiplication.

Was this forgotten? This information is still missing in 3.3.1.

 

Quote:

andysem wrote:

Was this forgotten? This information is still missing in 3.3.1.

I guess so, I'll be sure to include this in the next update.

Hi.

There are invalid names of constants in Operations in _mm512_{,mask_}extload_*.
(according to zmmintrin.h)

_MM_BROADCAST1X16 should be _MM_BROADCAST_1X16.
_MM_BROADCAST4X16 should be _MM_BROADCAST_4X16.
_MM_BROADCAST1X8 should be _MM_BROADCAST_1X8.
_MM_BROADCAST4X8 should be _MM_BROADCAST_4X8.

Regards,
Sugizaki.

Please, mention in the description that _mm_maskmoveu_si128 and _mm_maskmove_si64 generate non-temporal memory stores.

Thanks guys, I've made these corrections.

There is a series of errors in the Intrinsics Guide in the descriptions of intrinsics that map to instructions with an immediate operand.

Operands of the imm8 type (8-bit) are declared as int (32-bit) intrinsic arguments, so I'd advise always using a notation such as imm[7:0] in the Intrinsics Guide.

For example, the description of _mm256_blend_epi16 at the moment makes some users think that they can use a 16-bit mask

(see https://software.intel.com/en-us/forums/topic/537849).
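A scalar sketch of why a 16-bit mask does not work for the 256-bit 16-bit blend: the same 8 mask bits are reused for both 128-bit halves. The function name is illustrative, and the model simply ignores bits above imm[7:0]; a real compiler may instead reject an out-of-range immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 256-bit blend of sixteen 16-bit elements controlled
 * by an 8-bit immediate: element j takes b[j] when bit (j % 8) of the
 * immediate is set, otherwise a[j]. Assumes dst does not alias a or b. */
void blend_epi16_256_model(uint16_t dst[16], const uint16_t a[16],
                           const uint16_t b[16], int imm8)
{
    assert(dst && a && b);
    unsigned mask = (unsigned)imm8 & 0xFFu; /* only imm[7:0] matters here */
    for (int j = 0; j < 16; j++)
        dst[j] = ((mask >> (j % 8)) & 1u) ? b[j] : a[j];
}
```

In this model a mask of 0x01 selects elements 0 and 8 from b, and there is no mask value that selects element 0 without also selecting element 8.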

Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

Quote:

Patrick Konsor (Intel) wrote:
I have updated the documentation around immediate parameters to clarify this better.

The description for _mm256_blend_epi16 looks the same as before in the online Intrinsics Guide. I suppose that your changes aren't published yet, right?

In v3.3.3, _mm_madd_epi16 claims (in the description and operation sections) to saturate the result of the addition. But the description of PMADDWD in the Software Developer's Manual doesn't say any saturation occurs, and actually says it will wrap (when the 16-bit inputs are all 0x8000, which I think is the only case where saturation/wrapping could possibly matter). Some test code confirms that it does wrap, not saturate, so it looks like a bug in the Intrinsics Guide.

(Same applies to all the other intrinsics for PMADDWD/VPMADDWD.)
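The wrapping case can be written down as a scalar model of a single 32-bit result lane. This is a sketch with a made-up function name; it returns the raw bit pattern and does the final add in unsigned arithmetic so the overflow is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 32-bit result lane of PMADDWD: two signed
 * 16x16 -> 32-bit products, added with wraparound (no saturation). */
uint32_t madd_lane_bits(int16_t a0, int16_t b0, int16_t a1, int16_t b1)
{
    int32_t p0 = (int32_t)a0 * (int32_t)b0; /* signed multiply */
    int32_t p1 = (int32_t)a1 * (int32_t)b1;
    /* Unsigned addition wraps modulo 2^32 instead of overflowing. */
    return (uint32_t)p0 + (uint32_t)p1;
}

/* Illustrative self-check: all four inputs 0x8000 give 2^30 + 2^30. */
void madd_lane_selfcheck(void)
{
    assert(madd_lane_bits(INT16_MIN, INT16_MIN, INT16_MIN, INT16_MIN)
           == 0x80000000u);
}
```

A saturating add would have produced 0x7FFFFFFF for the same inputs.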

Thanks, I've corrected that as well.

Both the 8-bit immediate and pmaddwd issues should be corrected in data version 3.3.4. I don't believe it's live yet, sometimes it takes the web ops people a little while to publish the changes.

Quote:

Patrick Konsor (Intel) wrote:
Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

now I see the changes online, neat!

It looks like _mm512_set1_pd is only marked as AVX512F although it is available since IMCI.

Hi,

I was trying to use the function _mm512_set1_epi32 in KNC, but I received the following error:

On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32
offload error: cannot load library to the device 0 (error code 20)
On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32"

I believe that this function is only available in AVX-512.

On the other hand, the function _mm512_set1_epi32 is available in KNC.

 

Thanks.

The documentation for _mm_hsub_ps is wrong: the order of the operands in each pair-wise subtraction is reversed.

Given SSE registers A and B (elements listed from highest to lowest),

hsub(A, B) = [B[2] - B[3], B[0] - B[1], A[2] - A[3], A[0] - A[1]]
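The same ordering as a scalar sketch, with index 0 as the lowest element (so the array order is the reverse of the list above). The function name is illustrative only.

```c
#include <assert.h>

/* Scalar model of a horizontal subtract of packed floats: in each
 * adjacent pair, the higher element is subtracted from the lower one.
 * The two results from a fill the low half of dst, those from b the
 * high half. */
void hsub_ps_model(float dst[4], const float a[4], const float b[4])
{
    assert(dst && a && b);
    float r[4];
    r[0] = a[0] - a[1];
    r[1] = a[2] - a[3];
    r[2] = b[0] - b[1];
    r[3] = b[2] - b[3];
    for (int i = 0; i < 4; i++)
        dst[i] = r[i]; /* copy last so dst may alias a or b */
}
```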

Thanks, I've corrected the floating-point hsub intrinsics.

The documentation for the SHA-1 instructions is wrong in several places.

Several times, the shift operation (<<) is written where rotate (<<<) is supposed to be. Such as in _mm_sha1rnds4_epu32:

A[1] := f(B, C, D) + (A << 5) + W[0] + K;
B[1] := A;
C[1] := B << 30;
D[1] := C;
E[1] := D;

FOR i = 1 to 3
    A[i+1] := f(B[i], C[i], D[i]) + (A[i] << 5) + W[i] + E[i] + K;
    B[i+1] := A[i];
    C[i+1] := B[i] << 30;
    D[i+1] := C[i];
    E[i+1] := D[i];
ENDFOR;

All of those << should be <<<.
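The difference matters because a plain shift discards the bits that a rotate brings back in at the bottom. A minimal sketch of a 32-bit left rotate in C (`rotl32` is a name chosen here, not something from the Guide):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is taken modulo 32).
 * The masking keeps both shift counts in 0..31, so n == 0 is safe. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Illustrative self-check: the top bit wraps around under rotation
 * but is lost under a plain shift. */
void rotl32_selfcheck(void)
{
    assert(rotl32(0x80000000u, 5) == 0x00000010u);
    assert((uint32_t)(0x80000000u << 5) == 0u);
}
```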

Thank you, I will correct these.

Could someone please check on the following?  It looks like too big of an error to not have been already mentioned and fixed, but in the current version 3.3.8 of the Intrinsics Guide...

I believe there is an error in the description for _mm512_srlv_epi32 (and similar intrinsics). The "operation" says that the shift is based on count[i+4:i] but according to what I see in the instruction extension manual, and behavior of SDE (KNL and SKX in particular) and a test program on KNC, it looks like the shift is based on the whole value of the count field.  It appears that a shift of >31 results in the bits being shifted off the end (result field is zero), not a shift by zero (result field unchanged).  The instruction extension manual probably also needs an update as the EVEX description is not specific as to what happens if a shift is greater than 32 (it is specific for VEX.128, VEX.256, etc...)

I believe there are also errors in the operation sections for other sizes (_mm256_srlv_epi32) and other related instructions (_mm512_sllv_epi32).
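The behavior described above, written as a scalar model of one 32-bit lane. The function names are illustrative; the point is that the whole count is compared against 31 rather than being masked to its low 5 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of a variable logical right shift:
 * a count above 31 shifts every bit out and yields zero. */
uint32_t srlv_lane_epi32(uint32_t a, uint32_t count)
{
    return (count < 32u) ? (a >> count) : 0u;
}

/* Same model for the variable logical left shift. */
uint32_t sllv_lane_epi32(uint32_t a, uint32_t count)
{
    return (count < 32u) ? (a << count) : 0u;
}

/* Illustrative self-check: a count of 33 gives 0, whereas masking the
 * count to 5 bits would have shifted by 1 instead. */
void srlv_lane_selfcheck(void)
{
    assert(srlv_lane_epi32(0xFFFFFFFFu, 33u) == 0u);
    assert(sllv_lane_epi32(0xFFFFFFFFu, 33u) == 0u);
}
```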

 

Thanks for reporting this, I have corrected this information.

 

Hi,

Wouldn't it be clearer in the Intrinsics Guide documentation if a "const" were added for the immediate value of the shuffle functions (SSE)? For example:

__m128 _mm_shuffle_ps (__m128 a, __m128 b, const unsigned int imm8)

instead of

__m128 _mm_shuffle_ps (__m128 a, __m128 b, unsigned int imm8)

Thank you

It wouldn't change anything from the user's perspective. Parameter constness has no effect on the caller. Even if this is just a documentation change, it doesn't actually communicate the fact that an immediate constant (or, more generally, a constant expression) is required.

 

There are a few bugs in the description of the ADX intrinsics.

1. _addcarry_* intrinsics generate the classic adc instruction, not adcx. As such, it affects both CF and OF and uses CF for carry. The adc instruction has been available long before ADX extension (no cpuid feature required).

2. _addcarryx_* can generate either adcx or adox, at compiler's choice. As such it uses either CF or OF for carry. These intrinsics require ADX cpuid feature.

3. _subborrow_* indeed generates sbb, which is the counterpart of adc. This instruction is also available in classic IA-32 and does not require ADX cpuid flag.

4. SDM defines _addcarry_* and _subborrow_* intrinsics for different integer sizes, from 8 to 64 bit, and Intrinsics Guide only describes 32 and 64-bit ones.
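For readers unfamiliar with the add-with-carry semantics discussed in points 1 to 3, here is a scalar sketch of the 32-bit case. The function is a model with a made-up name, not the compiler intrinsic, and it says nothing about which instruction (adc, adcx or adox) a compiler emits.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 32-bit add with carry-in and carry-out:
 * *out = a + b + (c_in != 0), and the return value is the carry. */
unsigned char addcarry_u32_model(unsigned char c_in, uint32_t a,
                                 uint32_t b, uint32_t *out)
{
    assert(out);
    uint64_t wide = (uint64_t)a + (uint64_t)b + (c_in ? 1u : 0u);
    *out = (uint32_t)wide;
    return (unsigned char)(wide >> 32); /* 0 or 1 */
}
```

Chaining two calls, feeding the carry from the low halves into the high halves, adds two 64-bit numbers held as pairs of 32-bit words.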

 

Thanks, I will make those corrections.

On point 4, I believe the SDM is in error here, the compiler only defines 32 and 64-bit versions; I will confirm this internally.

> On point 4, I believe the SDM is in error here, the compiler only defines 32 and 64-bit versions; I will confirm this internally.

Since the instruction does support 8 and 16-bit operands, it might be a good idea to add those intrinsics to the compiler if not already there.

 

Is _mm256_s[rl]li_si256 misnamed? In the other versions, _epi16, _epi32 and _epi64 all seem to indicate the "pocket". It seems to me that they should be _mm256_s[rl]li_si128 (or maybe epi128?), since the shift "pocket" is 128 bits and not 256. This would make them consistent with the convention and have the intrinsic name reflect the actual operation. (I also note that _mm256_bs[rl]li_epi128 seems to be the same underlying instruction, and it is _epi128.)
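A scalar sketch of the 128-bit "pocket" behavior for the left shift: each 16-byte half of the 256-bit value is shifted on its own, and no byte crosses from the low half into the high half. The function name is illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the 256-bit VPSLLDQ form: shift each 128-bit lane
 * left by `imm8` bytes independently, filling with zeros. Byte 0 is
 * the least significant byte; counts above 15 clear both lanes. */
void byte_shift_left_256_lanes(uint8_t dst[32], const uint8_t src[32], int imm8)
{
    assert(dst && src);
    unsigned count = (unsigned)imm8 & 0xFFu;
    uint8_t tmp[32] = {0};
    if (count > 15)
        count = 16;
    for (int lane = 0; lane < 2; lane++)
        for (unsigned i = count; i < 16; i++)
            tmp[lane * 16 + i] = src[lane * 16 + i - count];
    memcpy(dst, tmp, 32);
}
```

With a shift of one byte, byte 16 of the result is zero rather than the old byte 15, which is what a true 256-bit shift would have produced.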
