Bugs in Intrinsics Guide

_mm256_bslli_epi128 was added as a synonym of _mm256_slli_si256 because of the misleading name of the latter.

Missing mask ops (kand, kmov, knot, etc...) for other than __mmask16?

I think (can never really be sure I've checked every last corner of the system ;^)) that the Intrinsics Guide (and the Parallel Studio XE 2016 beta) are missing mask operations for any size other than __mmask16.  I did find 8-, 32-, and 64-bit versions described in the instruction set extensions manual (https://software.intel.com/en-us/intel-architecture-instruction-set-exte..., chapter 6), so I believe they really are supposed to exist.

Without these, doing conditional stuff on anything other than 32-bit "pockets" is really tough.

 

There are 8-, 32-, and 64-bit instructions for mask operations, but the compiler only defines intrinsics for the 16-bit versions.

The word from the compiler folks is that they're not going to provide intrinsics for 8/32/64. If you just use the __mmask8/32/64 types, the compiler will try to implement things as actual k-instructions.  I've played around with it a bit and it mostly works.  In many cases the compiler will keep everything in k-registers; in other cases it starts moving things back and forth (unnecessarily) between k-registers and general-purpose registers.

Not all k-instructions have C/C++ operator equivalents, kandn for example. You could try to emulate it with the ~ and & operators and hope the compiler is smart enough to recognize the pattern, but I would prefer to have an intrinsic for that.

8/32/64-bit k-registers are essential for AVX-512BW/DQ; they should be properly supported in compilers, IMHO.
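
In the meantime, a sketch of how the emulation might look for an 8-bit mask (the helper name is mine, and whether this actually compiles to a single kandnb is entirely up to the compiler):

#include <immintrin.h>

/* Sketch: emulate kandn (dst = (NOT a) AND b) on an 8-bit mask with plain
   C operators; the compiler may or may not map this to a kandnb instruction. */
static inline __mmask8 kandn_mask8(__mmask8 a, __mmask8 b)
{
    return (__mmask8)(~a & b);
}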

 

The _mm_fmaddsub_ps and _mm_fmsubadd_ps functions have the same content in the operation section yet different descriptions. Is that what it is supposed to be?

Thanks.

Arthur,

You're correct; it looks like the fmsubadd operations were incorrect. I have corrected them, and the update should appear shortly.
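
For reference, the intended per-lane behavior of the two intrinsics, as a scalar C sketch (the helper name and the loop model are mine, not the guide's pseudocode):

/* Scalar model of the intended semantics (a sketch, not the actual implementation):
     _mm_fmaddsub_ps: even lanes compute a*b - c, odd lanes compute a*b + c
     _mm_fmsubadd_ps: even lanes compute a*b + c, odd lanes compute a*b - c */
static inline void fmaddsub_ps_model(const float a[4], const float b[4],
                                     const float c[4], float dst[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (i % 2 == 0) ? a[i] * b[i] - c[i]   /* even lane: subtract */
                              : a[i] * b[i] + c[i];  /* odd lane: add */
}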

The intrinsics guide description of _mm_testc_si128 says it NOTs the 2nd operand, but that is backwards.  Assuming the intrinsic takes its args in the same order as the PTEST instruction, the docs are wrong and the actual behaviour is correct.

This was discovered during discussion at http://stackoverflow.com/questions/32072169/could-i-compare-to-zero-regi...

The Intel insn ref manual says:

(V)PTEST (128-bit version)

	IF (SRC[127:0] BITWISE AND DEST[127:0] = 0)
	    THEN ZF := 1;
	    ELSE ZF := 0;
	IF (SRC[127:0] BITWISE AND NOT DEST[127:0] = 0)
	    THEN CF := 1;
	    ELSE CF := 0;

Where DEST is the first argument.  It is correct.

 

The intrinsics guide description for int _mm_testc_si128 (__m128i a, __m128i b) says:

    IF (a[127:0] AND NOT b[127:0] == 0)
        CF := 1
    ELSE
        CF := 0
    FI

This is the reverse of the instruction generated from the intrinsic.  I guess the doc writers got mixed up by the ref manual putting NOT DEST second.  The other ptest intrinsics (like _mm_test_mix_ones_zeros) have the same bug in their descriptions.
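
In practice, the easiest way to read _mm_testc_si128(a, b) is: it returns 1 exactly when every bit set in b is also set in a, because CF = ((NOT a) AND b) == 0. A small sketch, with a helper name of my own choosing:

#include <smmintrin.h>   /* SSE4.1 */

/* Sketch: returns 1 when every bit set in 'required' is also set in 'value',
   since CF = ((NOT value) AND required) == 0. */
static inline int contains_all_bits(__m128i value, __m128i required)
{
    return _mm_testc_si128(value, required);
}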

When are you planning to add instruction latencies for Broadwell? Thanks!

In the description of the _mm_multishift_epi64_epi8 intrinsic there is this line:

tmp8[k] := b[q+((ctrl+k) & 63)]

However, k is not mentioned anywhere else. I believe it should be l.

The _mm_test_all_ones intrinsic description is not accurate. It says it performs a complement (i.e. XOR) of the xmm and 0xFFFFFFFF (which is a 32-bit constant, presumably). The actual behavior is different and amounts to returning 1 if the provided xmm contains 1 in all bits and 0 otherwise.

I think the following (minor) errors exist in the guide (3.3.11):

1) _mm512_sad_epu8: says it produces "four unsigned 16-bit integers".  I think that should say "eight unsigned 16-bit integers".

2) _mm256_mpsadbw_epu8: i := imm8[2]*32 should be a_offset:=imm8[2]*32

Let me say up front that I can't imagine doing low level code without the Intrinsic Guide, but there are a few enhancements I'd like to see (not in any particular order):

1) more complete information on latency and throughput

2) some sort of separate file containing the intrinsic, instruction family (AVX, SSE4.1, AVX-512VBMI, etc...) and the latency and throughput.  If I had such information I could more easily take my source code and annotate it with the information for each intrinsic I use; thus helping me keep track of architecture dependencies and performance.

3) a version of the guide that I could download and use locally when I don't have good network connection.

4) some sort of relational search or maybe regex search (i.e. __m512i and add, ^__m512i[.]*_add_).  Mostly I just want to be able to narrow down the results.  Examples are when I only care about a particular size operation (__m256 adds) or searching on information that exists in the intrinsic name text in the description (__mm512i permutes that work across lanes using 32-bit integers).

Just some thoughts.

Thanks for reporting these issues. I've resolved several of them, and the new update should be posted shortly.

I'll also get started on a larger update to add some additional features that have been requested, both publicly and internally.

Patrick, the new description of _mm_test_all_ones is still incorrect. The pseudocode contains:

IF (a[127:0] AND NOT 0xFFFFFFFF == 0)

First, instead of 0xFFFFFFFF, which is a 32-bit constant, tmp should be used. Second, this condition will always return true regardless of the value of a. The correct condition should be:

IF (tmp[127:0] AND NOT a[127:0] == 0)

 

Hi, intrinsic functions for vmovsd are listed under AVX-512, although it is actually an AVX instruction. Best, André

The description of _mm512_fmadd233_ps on KNC seems to be wrong. It does not match the description in the EAS.

 

Descriptions for _mm_test_all_zeros, _mm_test_mix_ones_zeros, and all testc intrinsics are still bad (they are not technically wrong, but they are badly misleading.)

As far as I can tell, they all assume that "a AND NOT b"  means "(~a) & b", because that's the way e.g. _mm_andnot_si128 works. But that's not a natural reading of the phrase and that's not how 99% of people would understand it. Probably best to spell this out explicitly in each case.

This also relates to andysem's comment above. "a AND NOT 0xFFFFFFFF" always evaluates to zero under natural interpretation of "AND NOT". Instead _mm_test_all_ones computes "IF ((NOT a) AND 0xFFFFFFFF == 0)" (or, in other words, simply IF ((NOT a) == 0) ).
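
For what it's worth, _mm_test_all_ones is commonly implemented in terms of testc along the following lines (a sketch; whether any particular compiler's header uses exactly this form is an assumption), which also makes the intended semantics obvious:

#include <smmintrin.h>   /* SSE4.1 */

/* Sketch of the usual definition: _mm_cmpeq_epi32(a, a) produces an all-ones
   vector, and testc then checks ((NOT a) AND all-ones) == 0, i.e. a is all-ones. */
static inline int test_all_ones_model(__m128i a)
{
    return _mm_testc_si128(a, _mm_cmpeq_epi32(a, a));
}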

Thanks all for the feedback. I've submitted an update to resolve these issues.

These intrinsics repeat two times:

      _mm_loadu_si16
      _mm_loadu_si32
      _mm_loadu_si64
      _mm_storeu_si16
      _mm_storeu_si32
      _mm_storeu_si64

One copy is missing CPUID flags and some details differ (e.g. machine instruction for _mm_*_si16). Maybe the intention was to have two versions depending on CPUID flags, as is the case with for example _mm_prefetch or _mm512_cmplt_epi32_mask.

Best Regards,

Vaclav

P.S. By the way - big thanks for this guide!! It is far better than anything else I've seen so far.

 

Yes, the intention is that these intrinsics can work on SSE-supporting systems using SSE instructions, but they will also work on non-SSE systems; it's up to the compiler how they are interpreted and what instructions are emitted.

All these intrinsics involve movd or movq to move the data to an xmm register, and SSE2 is required for that. I guess you could also use movss and reduce the requirement to SSE, but the requirement is still there. How can these intrinsics be implemented without SSE when their purpose is to initialize an xmm register?
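
As an illustration, one plausible way to implement _mm_loadu_si32 (a sketch, not necessarily what any particular compiler ships) still ends up initializing an xmm register, hence the SSE2 dependency in practice:

#include <emmintrin.h>   /* SSE2 */
#include <string.h>

/* Sketch: unaligned 32-bit load into the low dword of an xmm register,
   with the upper bits zeroed. This typically compiles to a movd load. */
static inline __m128i loadu_si32_model(const void *mem_addr)
{
    int tmp;
    memcpy(&tmp, mem_addr, sizeof(tmp));
    return _mm_cvtsi32_si128(tmp);
}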

Anyway, I think duplicating intrinsics is not the correct choice.

In this intrinsic:

__m128i _mm_mpsadbw_epu8 (__m128i a, __m128i b, const int imm8)

CPUID Flags: SSE4.1

 

.

.

.

In this section

tmp[i+15:i] := ABS(a[k+7:k] - b[l+7:l]) + ABS(a[k+15:k+8] - b[l+15:l+8]) + ABS(a[k+23:k+16] - b[l+23:l+16]) + ABS(a[k+31:k+24] - b[l+31:l+24])

...

I think it should be tmp[i*2+15:i*2]. Am I wrong?

I found the issue behind comment #88 (reported 01/20/2015) is still present in the Intrinsics Guide 3.3.14 (1/12/2016).

For each F16C intrinsic, the timing info contains duplicated entries for different CPU architectures - with and without throughput.

 

Version 3.3.14 (currently live on the site):

The vpermi2w / vpermt2w / vpermw intrinsics are categorized as "misc", not "swizzle".  The other element-sizes of permi/t2 and vpermb/d/q are correctly categorized as shuffles.

e.g.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_...

Hi,

First of all I would like to thank you for this great tool. I often use it in my HPC class at university because it can help my students to understand what is going on.

But I am curious, are there any efforts going on to add latencies and throughputs for newer processor generations like Broadwell or Skylake?

I'm asking because I have the impression that the latencies for VSQRTPD and VDIVPD have dramatically changed in the past and I would really like to know what their current values are in modern hardware.

The latencies and throughputs for most instructions are included in Appendix C of the "Intel 64 and IA-32 Architectures Optimization Reference Manual" (document 248966, currently available at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia...).

Using this data, I recently posted some graphs of the relative throughput of scalar, 128-bit, and 256-bit VDIVPS and VDIVPD instructions for Core2, Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake (client) at https://software.intel.com/en-us/forums/intel-isa-extensions/topic/62336...  

"Dr. Bandwidth"

Thank you for the links, John!

This definitely reinforces my suspicion that Intel has really tuned its division and square-root instructions over the past generations.

 

I just discovered this great tool!

I have two feature requests:

1. List the category (used by the filter) in the detailed description of each item. "swizzle" vs "convert" vs "miscellaneous" can be tricky. If these were discoverable (other than by trying all of the checkboxes), then users could limit results to "ones like this result".

2. Add additional filters for integer vs. floating point.  Even better would be filter on various characteristics of input and output: width of packed value, signed/unsigned, etc.

 

There is a typo in the __m128i _mm_madd_epi16 and __m256i _mm256_madd_epi16 intrinsics operation description.

st[i+31:i] should be dst[i+31:i], of course.

This description talks about a “dst” operand which isn’t in the formal argument list, so something is wrong somewhere…

 

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)

Synopsis

__m512i _mm512_mask_mullox_epi64 (__m512i src, __mmask8 k, __m512i a, __m512i b)
#include "immintrin.h"
CPUID Flags: AVX512F

Description

Multiplies elements in packed 64-bit integer vectors a and b together, storing the lower 64 bits of the result in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

Operation

FOR j := 0 to 7
	i := j*64
	IF k[j]
		dst[i+63:i] := a[i+63:i] * b[i+63:i]
	ELSE
		dst[i+63:i] := src[i+63:i]
	FI
ENDFOR
dst[MAX:512] := 0

 

Hi,
I think I have found some "bugs" in the current online version (3.3.14) of the guide:

  • __m128 _mm_mask_i64gather_ps (__m128 src, float const* base_addr, __m128i vindex, __m128 mask, const int scale)
  • __m128 _mm_i64gather_ps (float const* base_addr, __m128i vindex, const int scale) :

    • Instruction: vgatherqps xmm, vm32x, xmm

      • vm32x should be vm64x
    • dst[i+31:i] := MEM[base_addr + SignExtend(vindex[i+63:i])*scale]

      • vindex[i+63:i] should be vindex[m+63:m]
  • __m128d _mm_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
  • __m128d _mm_mask_i32gather_pd (__m128d src, double const* base_addr, __m128i vindex, __m128d mask, const int scale)
  • __m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)
    • Instruction: vgatherdpd xmm, vm64x, xmm

      • vm64x should be vm32x

Anyway, many thanks for this useful tool.

I think there is an error in the _mm256_shuffle_epi8 intrinsic operation. Currently it is:

dst[128+i+7:i] := a[128+index*8+7:128+index*8]

but I think it should be:

dst[128+i+7:128+i] := a[128+index*8+7:128+index*8]

For the _mm512_shuffle_epi8 intrinsic, I am not sure I understand the pseudocode correctly:

FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[3:0] := b[i+3:i]
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0

It seems like only the first 128 bits of a can be shuffled?

First of all - thanks so much for this guide, I have found it to be invaluable!

I think I found a small error in version 3.3.14 for _mm_sqrt_sd. The guide claims that:

__m128d _mm_sqrt_sd (__m128d a, __m128d b)

computes the sqrt of the lower double from a and copies the lower double from b to the upper double of the result. However, it actually seems to be the opposite (the lower double from a is copied, and the sqrt of the lower double from b is computed). I am using clang on OS X. I don't have access to Windows or ICC, but for what it's worth, the MSDN documentation at https://msdn.microsoft.com/en-us/library/1994h1ay(v=vs.90).aspx seems to agree with me.
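
A minimal repro along these lines (a sketch; the values are only chosen to make the two lanes easy to tell apart):

#include <emmintrin.h>   /* SSE2 */
#include <stdio.h>

int main(void)
{
    __m128d a = _mm_set_pd(2.0, 3.0);    /* a = { lo: 3.0,  hi: 2.0 } */
    __m128d b = _mm_set_pd(5.0, 16.0);   /* b = { lo: 16.0, hi: 5.0 } */
    double r[2];
    _mm_storeu_pd(r, _mm_sqrt_sd(a, b));
    printf("lo = %g, hi = %g\n", r[0], r[1]);  /* which lane holds sqrt(16.0) = 4? */
    return 0;
}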

Cheers,

Serge

Thanks for the feedback, most of this will be addressed in the next release.

1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

2. This will be resolved in the next release.

3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.

4. This will be resolved in the next release.

5. This will be resolved in the next release.

I have not added any additional latency and throughput data yet, but I may get to this soon.

Hi,

The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the F for float. Looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?

-Harry

Quote:

Harry V. (Intel) wrote:

The description of _mm256_extractf128_si256 says "(composed of integer data)", which seems confusing given the F for float. Looks like _mm256_extracti128_si256 is the correct one for integer data, or am I missing something?

There are two instructions: vextractf128 and vextracti128. The former is part of AVX and is generated by _mm256_extractf128_*, while the latter was only added in AVX2 and is generated by _mm256_extracti128_si256. The effect of both instructions is the same, and _mm256_extractf128_si256 is a convenient wrapper that allows interaction between __m256i and __m128i even on systems lacking AVX2.
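
A small illustration (a sketch; the helper name is mine) of why the f-variant is handy for integer data on AVX-only targets:

#include <immintrin.h>

/* Sketch: pull the upper 128 bits out of a 256-bit integer vector.
   vextractf128 needs only AVX, so this works even without AVX2. */
static inline __m128i upper_half(__m256i v)
{
    return _mm256_extractf128_si256(v, 1);
}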

 

By the way, are there any updates planned to the Intrinsics Guide? There were a number of bug reports and performance info for Skylake is still missing.

 

Thanks for the feedback. I've posted an update that addresses all the reported issues. This does not include performance info for Skylake, although I may add that in the future.

Each of the _mm_storeu_si16/si32/si64 intrinsics are listed twice, some of them having slightly different instructions.

I have posted an update that includes updated latency/throughput data. This removes data for pre-Sandy Bridge architectures and adds Broadwell, Skylake, and Knights Landing.

Thank you Patrick, although I think the removal of Sandy Bridge and Nehalem is a bit premature. Those CPUs are still relevant.

I believe that the "_MM_CMPINT_NEQ" constant listed in various integer comparison operations should read _MM_CMPINT_NE. (At least this is what GCC, Clang, etc. implement)

The guide has a significant mislabelling of throughput in all intrinsics that list it. Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput. This is consistently misreported throughout the guide.

For example, the guide reports Skylake having a lower throughput for pmulhuw than Haswell or Broadwell. It's the opposite: Skylake's throughput is higher than the older architectures'. This mislabelling is repeated for about 100 other intrinsics.

Reporting reciprocal throughput is a good idea, since those values can be compared more easily to latency clocks. But the labels in the whole guide must be updated to say "reciprocal throughput." I was even reorganizing my AVX code to minimize use of these apparently lower-throughput instructions!

Luckily I noticed the mismatch with Agner Fog's independent tables.

Specifically, when the guide gives a throughput value, it is actually reporting reciprocal throughput.

This is a common convention in most x86-related materials; I suppose it's for historical reasons. See the "Intel® 64 and IA-32 Architectures Software Developer's Manual", for example: in quite a few places it uses the "n-cycle throughput" wording, which actually means reciprocal throughput. Here's another resource: http://x86.renejeschke.de/html/file_module_x86_id_244.html. It also uses the term throughput to describe the clock cycles.

I agree that the term is confusing and probably poorly chosen in the beginning. But for anyone familiar with the domain it should be understandable.
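
As a concrete illustration (the numbers here are made up for the example, not taken from the guide): a listed "throughput" of 0.5 means a new instance of the instruction can start every 0.5 cycles, i.e. 1/0.5 = 2 instructions per cycle. So a smaller number in the guide is actually the faster case, which is the opposite of what the plain word "throughput" suggests.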

 

_mm512_stream_si512 should have "Instruction: vmovntdqa m512, zmm" not "Instruction: vmovntdqa zmm, m512".  The destination is first and the source is second.  This instruction stores to memory from a register.

Edit: it should instead be  "Instruction: vmovntdq m512, zmm".  VMOVNTDQA is a load and VMOVNTDQ is a store.

http://jeffhammond.github.io/

_mm_countbits_64 description says input is 32 bits, but should say 64 bits.

The description of _xbegin in the intrinsics guide does not say what the result of the intrinsic is.

I believe that it returns -1 if execution is continuing inside a transaction, and the value of EAX on an abort, but none of that information is given here, and it should be!
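
For reference, the usual usage pattern looks roughly like this (a sketch; the function name is mine, and the fallback path is whatever lock-based code you would otherwise run):

#include <immintrin.h>   /* requires RTM support enabled at compile time */

/* Sketch of a typical RTM pattern: _xbegin() returns _XBEGIN_STARTED (all bits
   set, i.e. the unsigned equivalent of -1) when the transaction starts; after
   an abort, execution resumes here with an abort status in the return value. */
static void increment_transactionally(int *counter)
{
    unsigned int status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        ++*counter;      /* transactional path */
        _xend();
    } else {
        /* fallback path: e.g. take a lock and do the update non-transactionally */
    }
}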
