Bugs in Intrinsics Guide

Hi,

I think there is an error in the description of the algorithm of the intrinsics '_mm512_*_extpackstorelo_*' (or maybe I'm missing something):

The condition

IF (storeAddr % 64) == 0 BREAK

should be something like

IF ((addr + storeOffset * downSize) % 64) == 0 BREAK

Otherwise, the first aligned element (hi) will be written by the 'lo' intrinsic and it shouldn't according to my understanding.

 

Please, let me know if I'm wrong.

Thanks.

Barcelona Supercomputing Center

There appear to be a number of issues with KNC intrinsics, including several missing intrinsics (specifically when the name matches an AVX-512 intrinsic), and intrinsics that should be cross listed as both AVX-512 and KNC but are only listed under AVX-512. I am in the process of reviewing all KNC intrinsics and will release an update that should resolve all these issues shortly.

The function _mm512_fmadd233_epi32 is listed in the Intrinsics guide as a = b*c. I guess that is also a typo.

 

By the way, I really like the Intrinsics Guide! Would it be possible to add a button for choosing the data type (integer, floating point), like in the standalone "Intel Intrinsics Guide" software, v3.01?

Another idea for improvement would be to add an "advanced search", e.g. searching for functions with a specific output data type (int, double, and so forth). That search option would have saved me a lot of time.

I've just updated the Intrinsics Guide (v3.1.5). This should resolve all the KNC issues, as well as the issue with fmadd233 and extpackstorelo.

http://software.intel.com/sites/landingpage/IntrinsicsGuide/

_mm_sub_epi16 intrinsic is documented to correspond to phsubw instruction, while it should be psubw. The timing data is also given for phsubw instead of psubw.
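To make the difference concrete, here is a minimal check (values are arbitrary; _mm_hsub_epi16 is the intrinsic that actually maps to phsubw):

#include <emmintrin.h>   /* SSE2: _mm_sub_epi16 (psubw) */
#include <tmmintrin.h>   /* SSSE3: _mm_hsub_epi16 (phsubw) */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_setr_epi16(10, 20, 30, 40, 50, 60, 70, 80);
    __m128i b = _mm_setr_epi16( 1,  2,  3,  4,  5,  6,  7,  8);
    short s[8], h[8];
    _mm_storeu_si128((__m128i *)s, _mm_sub_epi16(a, b));   /* element-wise: a[i] - b[i] */
    _mm_storeu_si128((__m128i *)h, _mm_hsub_epi16(a, b));  /* horizontal: a[0]-a[1], a[2]-a[3], ..., b[6]-b[7] */
    printf("psubw : %d %d\n", s[0], s[1]);   /* prints 9 18 */
    printf("phsubw: %d %d\n", h[0], h[1]);   /* prints -10 -10 */
    return 0;
}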

There is no compiler version info; for example, _mm_erfcinv_ps appeared in ICC 14.

I've resolved the issue with _mm_sub_epi16, the update should appear soon. I've also added the new intrinsics for xsavec, xsaves, and xrstors.

Great tool, some shortcomings

  1. _mm_xor_si128() says "bitwisw OR"
  2. The documentation for all "abs" intrinsics could add information about the behaviour for the value -2^(N-1), with N being the bit width of the corresponding epi type
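For example, a quick sketch of the 8-bit corner case (pabsb leaves -2^7 unchanged, i.e. there is no saturation):

#include <tmmintrin.h>   /* SSSE3: _mm_abs_epi8 (pabsb) */
#include <stdio.h>

int main(void)
{
    signed char out[16];
    __m128i r = _mm_abs_epi8(_mm_set1_epi8(-128));   /* abs(-2^7) in every byte */
    _mm_storeu_si128((__m128i *)out, r);
    printf("%d\n", out[0]);   /* prints -128: the result is 0x80 again, not 0x7F */
    return 0;
}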

Hello,

I currently use data version 3.1.6 very actively and had trouble with compiling the four intrinsics *_bslli_si128() and *_bsrli_si128(). With gcc, they only compile when I remove the b. I do not (yet) use Intel compiler, but the SW developer manual also lists those four intrinsics without b.

Intel C/C++ Compiler Intrinsic Equivalent

(V)PSLLDQ: __m128i _mm_slli_si128 ( __m128i a, int imm)

VPSLLDQ: __m256i _mm256_slli_si256 ( __m256i a, const int imm)

Intel C/C++ Compiler Intrinsic Equivalents

(V)PSRLDQ: __m128i _mm_srli_si128 ( __m128i a, int imm)

VPSRLDQ: __m256i _mm256_srli_si256 ( __m256i a, const int imm)

Please, specify that _mm_madd_epi16 and _mm256_madd_epi16 perform signed multiplication.
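A minimal illustration of why the signedness matters (0xFFFF is -1 as a signed word, 65535 if it were treated as unsigned):

#include <emmintrin.h>   /* SSE2: _mm_madd_epi16 (pmaddwd) */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi16(-1);   /* 0xFFFF in every word */
    __m128i b = _mm_set1_epi16(2);
    int out[4];
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(a, b));
    printf("%d\n", out[0]);   /* prints -4 = (-1)*2 + (-1)*2, i.e. signed multiplication */
    return 0;
}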

Quote:

Stefan M. wrote:

I currently use data version 3.1.6 very actively and had trouble with compiling the four intrinsics *_bslli_si128() and *_bsrli_si128(). With gcc, they only compile when I remove the b. I do not (yet) use Intel compiler, but the SW developer manual also lists those four intrinsics without b.

You can use either name, they perform the same functionality, although the "b" names may not be supported by GCC at this point.
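For example, both spellings below compile to the same pslldq; the preprocessor check is only an illustrative sketch for toolchains that lack the "b" alias:

#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
#if defined(__INTEL_COMPILER)
    __m128i x = _mm_bslli_si128(a, 4);   /* byte shift left by 4, newer name */
#else
    __m128i x = _mm_slli_si128(a, 4);    /* same pslldq instruction, older name */
#endif
    unsigned char out[16];
    _mm_storeu_si128((__m128i *)out, x);
    printf("%u %u %u\n", out[3], out[4], out[5]);   /* prints 0 1 2: low 4 bytes zeroed, rest shifted up */
    return 0;
}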

 

http://software.intel.com/sites/landingpage/IntrinsicsGuide/ works for me now. That said there were site outages a few days ago (the forum was completely inaccessible for a day or two for me), maybe the problems are still happening from time to time.

Quote:

andysem wrote:

http://software.intel.com/sites/landingpage/IntrinsicsGuide/ works for me now. That said there were site outages a few days ago (the forum was completely inaccessible for a day or two for me), maybe the problems are still happening from time to time.

Works for me also.

Sorry about that, there were some server changes that caused some intermittent issues, but it should be working fine now.

The description for __m128i _mm_sad_epu8 (__m128i a, __m128i b) is not correct:

 

Description

Compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce four unsigned two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst.
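For anyone double-checking the corrected wording, a small test (psadbw produces one 16-bit sum per 64-bit half, stored in its low 16 bits):

#include <emmintrin.h>   /* SSE2: _mm_sad_epu8 (psadbw) */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi8(10);
    __m128i b = _mm_set1_epi8(3);
    unsigned long long out[2];
    _mm_storeu_si128((__m128i *)out, _mm_sad_epu8(a, b));
    printf("%llu %llu\n", out[0], out[1]);   /* prints 56 56: |10-3| summed over 8 bytes per half */
    return 0;
}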

User Jeremias M. wrote here: https://software.intel.com/en-us/forums/topic/516476#comment-1791398 regarding an issue when filtering results only for KNC: the search returns _mm512_mask_set1_epi32 as a valid intrinsic for KNC. That is not currently correct. It may become true in a future release, as discussed in the cited thread.

Thank you for your feedback. I've updated the Intrinsics Guide to resolve the issues with _mm_sad_epu8 and _mm512_mask_set1_epi32, as well as a few other issues with KNC intrinsics.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Hi,

I was using the function _mm512_mask_reduce_gmax_pd, and when I checked the guide for the corresponding integer functions, they appeared only under the AVX-512 instructions.

So I checked the zmmintrin.h header and saw the functions implemented there. Then I tested some of them (_mm512_mask_reduce_max_epi32(__mmask16 k, __m512i a), _mm512_reduce_max_epi32(__m512i a)), and they worked.

I believe the functions below should be available for KNC too (a quick check follows the list):

int _mm512_reduce_max_epi32 (__m512i a)

__int64 _mm512_reduce_max_epi64 (__m512i a)

unsigned int _mm512_reduce_max_epu32 (__m512i a)

unsigned __int64 _mm512_reduce_max_epu64 (__m512i a)

double _mm512_reduce_max_pd (__m512d a)

float _mm512_reduce_max_ps (__m512 a)
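As a quick sanity check, a sketch of the 32-bit variants (assuming _mm512_set_epi32 is available on the same targets; its first argument is element 15):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m512i v = _mm512_set_epi32(16, 15, 14, 13, 12, 11, 10, 9,
                                  8,  7,  6,  5,  4,  3,  2, 1);
    printf("%d\n", _mm512_reduce_max_epi32(v));              /* prints 16: maximum over all 16 lanes */
    printf("%d\n", _mm512_mask_reduce_max_epi32(0x00FF, v)); /* prints 8: maximum over the low 8 lanes only */
    return 0;
}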

 

You are correct, all the _reduce_ intrinsics are supported on KNC. I've updated the Intrinsics Guide to resolve this issue.

_mm_test_all_ones intrinsic has multiple different timing values for the same CPUs.

The Intel Intrinsics Guide page doesn't load for me, or loads really slowly (about a minute or so). It shows the intrinsics categories on the left and "Loading" in the center and hangs that way. I'm using Firefox 32.0.3 on Linux.

On a related note, will there be an offline standalone release? The browser version is not always convenient for me.

 

I find the opening screen of the guide very hard to read. It would be much more readable if only the function names were shown at the top level instead of the full function prototypes. Using the prototypes just creates a lot of visual noise that obscures the function names. Since the prototype is easily visible once a function is expanded, IMHO the extra click needed to see it is outweighed by the improved readability.

It would be helpful if the description of the intrinsics also had a link to the corresponding instruction's description in the Intel Processor Instruction Set manual, so we can easily get the dirty details on the generated instruction.

Quote:

Glenn D. wrote:

It would be much more readable if only the function name were used at the top level instead of the full function prototypes.

I disagree. The prototype is useful for me because I often don't remember the exact signature or arguments of the intrinsic, and all I have to do is just type it in the search field.

 

Quote:

andysem wrote:

Please, specify that _mm_madd_epi16 and _mm256_madd_epi16 perform signed multiplication.

Was this forgotten? This information is still missing in 3.3.1.

 

Quote:

andysem wrote:

Was this forgotten? This information is still missing in 3.3.1.

I guess so, I'll be sure to include this in the next update.

Hi.

There are invalid constant names in the Operation sections of _mm512_{,mask_}extload_*
(according to zmmintrin.h):

_MM_BROADCAST1X16 should be _MM_BROADCAST_1X16.
_MM_BROADCAST4X16 should be _MM_BROADCAST_4X16.
_MM_BROADCAST1X8 should be _MM_BROADCAST_1X8.
_MM_BROADCAST4X8 should be _MM_BROADCAST_4X8.

Regards,
Sugizaki.
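For reference, a hypothetical sketch of how these constants are used with the KNC extended load (signature recalled from zmmintrin.h, so please double-check against the header):

#include <immintrin.h>

/* Load 4 consecutive floats from p and broadcast them across all 16 lanes.
   _MM_UPCONV_PS_NONE = no up-conversion, _MM_HINT_NONE = no non-temporal hint. */
__m512 load_broadcast_4x16(const float *p)
{
    return _mm512_extload_ps(p, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NONE);
}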

Please, mention in the description that _mm_maskmoveu_si128 and _mm_maskmove_si64 generate non-temporal memory stores.
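For context, a short sketch of why that matters in practice: because the selected bytes bypass the cache, a store fence is advisable before another agent reads the buffer.

#include <emmintrin.h>   /* SSE2: _mm_maskmoveu_si128, _mm_sfence */

/* Store only the bytes of 'data' whose corresponding byte in 'mask' has its
   most significant bit set. The store is non-temporal (uncached). */
void masked_store(char *dst, __m128i data, __m128i mask)
{
    _mm_maskmoveu_si128(data, mask, dst);
    _mm_sfence();   /* make the non-temporal store globally visible */
}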

Thanks guys, I've made these corrections.

There is a series of errors in the Intrinsics Guide in the descriptions of intrinsics that map to instructions with an immediate operand.

Operands of the imm8 type (8-bit) are declared as int (32-bit) intrinsic arguments, so I would advise always using a notation such as imm[7:0] in the Intrinsics Guide.

For example, the description of _mm256_blend_epi16 currently makes some users think that they can use a 16-bit mask

(see https://software.intel.com/en-us/forums/topic/537849).
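To make the confusion concrete: the control really is 8 bits, and each bit selects one 16-bit element within each 128-bit lane, so the same pattern is applied to both lanes. A quick sketch:

#include <immintrin.h>   /* AVX2: _mm256_blend_epi16 */
#include <stdio.h>

int main(void)
{
    __m256i a = _mm256_set1_epi16(0);
    __m256i b = _mm256_set1_epi16(1);
    /* imm8 = 0x01: bit 0 selects element 0 of EACH 128-bit lane from b;
       only imm[7:0] is used, so a "16-bit mask" like 0x0101 is not meaningful here. */
    __m256i r = _mm256_blend_epi16(a, b, 0x01);
    short out[16];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%d %d %d %d\n", out[0], out[1], out[8], out[9]);   /* prints 1 0 1 0 */
    return 0;
}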

Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

Quote:

Patrick Konsor (Intel) wrote:
I have updated the documentation around immediate parameters to clarify this better.

the description for _mm256_blend_epi16 looks the same as before in the online Intrinsics Guide; I suppose your changes aren't published yet, right?

In v3.3.3, _mm_madd_epi16 claims (in the description and operation sections) to saturate the result of the addition. But the description of PMADDWD in the Software Developer's Manual doesn't say any saturation occurs, and actually says it will wrap (when the 16-bit inputs are all 0x8000, which I think is the only case where saturation/wrapping could possibly matter). Some test code confirms that it does wrap, not saturate, so it looks like a bug in the Intrinsics Guide.

(Same applies to all the other intrinsics for PMADDWD/VPMADDWD.)
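A test along these lines confirms the wrap (all inputs 0x8000, the single case where the 32-bit result is not representable):

#include <emmintrin.h>   /* SSE2: _mm_madd_epi16 (pmaddwd) */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set1_epi16(-32768);   /* 0x8000 in every word */
    int out[4];
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(a, a));
    /* (-32768)*(-32768) + (-32768)*(-32768) = 0x80000000: wrapped, not saturated to 0x7FFFFFFF */
    printf("0x%08X\n", (unsigned)out[0]);
    return 0;
}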

Thanks, I've corrected that as well.

Both the 8-bit immediate and pmaddwd issues should be corrected in data version 3.3.4. I don't believe it's live yet, sometimes it takes the web ops people a little while to publish the changes.

Quote:

Patrick Konsor (Intel) wrote:
Thanks for reporting this issue. I have updated the documentation around immediate parameters to clarify this better.

now I see the changes online, neat!

It looks like _mm512_set1_pd is only marked as AVX512F, although it has been available since IMCI.

Hi,

I was trying to use the function _mm512_set1_epi32 on KNC, but I received the following error:

On the remote process, dlopen() failed. The error message sent back from the sink is /var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32
offload error: cannot load library to the device 0 (error code 20)
On the sink, dlopen() returned NULL. The result of dlerror() is "/var/volatile/tmp/coi_procs/1/4087/load_lib/icpcoutMmwX7Q: undefined symbol: _mm512_maskz_sllv_epi32"

I believe that this function is only available in AVX-512.

 

On the other hand, the function _mm512_set1_epi32 is available in KNC.

 

Thanks.

Documentation for _mm_hsub_ps is wrong; the order of the operands in each pair-wise subtraction is reversed.

Given SSE registers A and B:

hsub(A, B) = [B[2] - B[3], B[0] - B[1], A[2] - A[3], A[0] - A[1]]
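A quick check that matches the ordering above (first element of each pair minus the second):

#include <pmmintrin.h>   /* SSE3: _mm_hsub_ps (hsubps) */
#include <stdio.h>

int main(void)
{
    __m128 a = _mm_setr_ps(1.0f, 10.0f, 2.0f, 20.0f);
    __m128 b = _mm_setr_ps(3.0f, 30.0f, 4.0f, 40.0f);
    float out[4];
    _mm_storeu_ps(out, _mm_hsub_ps(a, b));
    /* prints -9 -18 -27 -36, i.e. A[0]-A[1], A[2]-A[3], B[0]-B[1], B[2]-B[3] */
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}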

Thanks, I've corrected the floating-point hsub intrinsics.

The documentation for the SHA-1 instructions is wrong in several places.

Several times, the shift operation (<<) is written where a rotate (<<<) is intended, such as in _mm_sha1rnds4_epu32:

A[1] := f(B, C, D) + (A << 5) + W[0] + K;
B[1] := A;
C[1] := B << 30;
D[1] := C;
E[1] := D;

FOR i = 1 to 3
    A[i+1] := f(B[i], C[i], D[i]) + (A[i] << 5) + W[i] + E[i] + K;
    B[i+1] := A[i];
    C[i+1] := B[i] << 30;
    D[i+1] := C[i];
    E[i+1] := D[i];
ENDFOR;

All of those << should be <<<.
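In C terms the difference is the following (rotl32 is a hypothetical helper, written out only to show what <<< means):

#include <stdint.h>

/* Rotate left: bits shifted out at the top re-enter at the bottom (valid for 1 <= n <= 31). */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32u - n));
}

/* So the operation section should effectively read:
     A[1] = f(B, C, D) + rotl32(A, 5) + W[0] + K;    // <<< 5, not << 5
     C[1] = rotl32(B, 30);                           // <<< 30, not << 30 */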

Thank you, I will correct these.
