Bugs in Intrinsics Guide

andysem

Hi,

I've found a few bugs in the Intel Intrinsics Guide 2.7 (I'm using the Linux version):

1. When the window is maximized, the search field is stretched vertically while still being a one-line edit box. It should probably keep its single-line height.

2. __m256 _mm256_undefined_si256 () should return __m256i.

3. In some instruction descriptions, such as _mm_adds_epi8, the operation is described in terms of SignedSaturate, while e.g. _mm256_adds_epi16 is described with SaturateToSignedWord. This applies to operations with unsigned saturation as well. Also, the vector elements are described differently. A more consistent description would be nice.

4. _mm_alignr_epi8 has two descriptions.

5. I'm not sure the _mm_ceil_pd signature and description are correct. They say the intrinsic returns a vector of single-precision floats. Shouldn't it be double-precision? (See the test snippet at the end of this post.)

I didn't read through all the instructions, so there may be more issues. I'll post if I find anything else.

PS: This is not a bug per se, but some instructions are missing the Latency & Throughput information. This mostly relates to newer instructions, but the info is still useful and I hope it will be added.
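For reference, here is a minimal test I put together (my own snippet, not text from the guide) showing the types I would expect for items 2 and 5; it compiles for me with AVX/SSE4.1 enabled (e.g. -mavx):

#include <immintrin.h>

int main(void)
{
    __m256i u = _mm256_undefined_si256();        /* item 2: the result is an integer vector, __m256i, not __m256 */
    __m128d c = _mm_ceil_pd(_mm_set1_pd(1.25));  /* item 5: takes and returns __m128d; each element becomes 2.0 */
    (void)u; (void)c;
    return 0;
}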

Sergey Kostrov

Thanks for the feedback!

It would be nice to also report these errors on the online documentation HTML pages where you found the issues or problems. As far as I know there is a special button for providing feedback.

>>...
>>PS: This is not a bug per se but some instructions are missing the Latency & Throughput information. This mostly relates to
>>newer instructions...

This is a known issue and has been raised several times over the last couple of months. Unfortunately, even some older instructions are missing this data.

Best regards,
Sergey

Patrick Konsor (Intel)

Thanks for the feedback; most of this will be addressed in the next release.

1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

2. This will be resolved in the next release.

3. All the descriptions and operations have been updated for the next release, so they should now be much more consistent.

4. This will be resolved in the next release.

5. This will be resolved in the next release.

I have not added any additional latency and throughput data yet, but I may get to this soon.

Sergey Kostrov

>>...I have not added any additional latency and throughput data yet, but I may get to this soon.

Thanks for the update and please keep everybody informed!

andysem

@Sergey Kostrov

> It would be nice to also report these errors on the online documentation HTML pages where you found the issues or problems. As far as I know there is a special button for providing feedback.

I don't quite understand which pages you mean. Could you provide a link?

@Patrick Konsor

> 1. I'm not able to replicate this issue with maximizing the window on Linux. What distro are you using? What version of Java?

I'm seeing this on Kubuntu 12.04 and 12.10, both x86-64, KDE 4.9.5, with dual monitors attached. I'm using Oracle Java:

java version "1.7.0_11"
Java(TM) SE Runtime Environment (build 1.7.0_11-b21)
Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)

I've attached a screenshot to illustrate the problem.

Attachment: snapshot1.png (240.46 KB)

andysem

A new pack of bugs:

1. _mm_cvtss_f32 is described as being equivalent to the cvtss2si instruction. I suppose the intrinsic should not generate any instructions if the compiler uses SSE for math calculations, or it should simply store the value to some memory location or general-purpose register. But it should not convert the float value to an integer (a small test is at the end of this post).

2. _mm_cvtsi32_si128 is said in the description to extend the upper bits of the operand, but it should fill them with zeros.
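To illustrate the behaviour I expect, here is a small test of my own (not text from the guide):

#include <emmintrin.h>

int main(void)
{
    /* 1: _mm_cvtss_f32 just reads the low float element; no conversion to integer is involved */
    float f = _mm_cvtss_f32(_mm_set_ss(1.5f));   /* f == 1.5f */

    /* 2: _mm_cvtsi32_si128 places the integer in the low element and fills the upper bits with zeros */
    __m128i v = _mm_cvtsi32_si128(-1);           /* v == { 0xFFFFFFFF, 0, 0, 0 } */

    (void)f; (void)v;
    return 0;
}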

Patrick Konsor (Intel)

Version 2.8 has been released:
http://software.intel.com/en-us/articles/intel-intrinsics-guide

Note that this release does include additional latency and throughput data.

Regarding the two new issues:
1. You're correct: cvtss2si is the wrong instruction. movss is the official instruction, although you'll often see different instructions depending on context. This will be resolved in the next release.
2. This issue was already resolved in v2.8.

Are you still seeing the issue with the search box expanding on Linux with v2.8? 

andysem

Thanks for the updated release.

Yes, the problem with the search box is still present. I must say it wasn't present before 2.7 (I think that version introduced some interface changes; aside from the search field, the fonts also changed, I believe).

Sergey Kostrov

>>Version 2.8 has been released:
>>software.intel.com/en-us/articles/intel-intrinsics-guide
>>
>>Note that this release does include additional latency and throughput data.

Thank you, Patrick.

Christian M.

This release is great!

Now there are latency and throughput data for Ivy Bridge, too!

I've been waiting for this for quite some time. One always had to look through the really big manuals to find that sort of information.

andysem

One additional bug: the _mm_max_epu32 signature contains three arguments: __m128i _mm_max_epu32 (__m128i a, __m128i b, __m128i b). I believe the last one should be removed.

Sergey Kostrov

Yes, that is correct, and here is the declaration from the smmintrin.h header file (Intel version):
...
extern __m128i __ICL_INTRINCC _mm_max_epu32( __m128i, __m128i );
...
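And a quick usage sketch of my own showing that the comparison is unsigned:

#include <smmintrin.h>   /* SSE4.1 */

/* unsigned max: 0xFFFFFFFF compares greater than 1, so every element of the result is 0xFFFFFFFF */
__m128i max_epu32_example(void)
{
    return _mm_max_epu32(_mm_set1_epi32(-1), _mm_set1_epi32(1));
}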

andysem

__int _mm256_movemask_epi8 (__m256i a)

Please remove the leading underscores in the return type.

Patrick Konsor (Intel)

Thanks, this issue will be fixed in the next release.

Sergey Kostrov

>>...__int _mm256_movemask_epi8 (__m256i a)

Here is the declaration from the immintrin.h header file (Intel version):
...
/*
* Returns a 32-bit mask made up of the most significant bit of each byte
* of the 256-bit vector source operand.
*/
extern int __ICL_INTRINCC _mm256_movemask_epi8(__m256i);
...
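And a small usage example of my own confirming the plain int return type (requires AVX2, e.g. -mavx2):

#include <immintrin.h>

/* every byte has its sign bit set, so all 32 mask bits are set: the int result is 0xFFFFFFFF, i.e. -1 */
int movemask_example(void)
{
    return _mm256_movemask_epi8(_mm256_set1_epi8(-1));
}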

andysem

The description of the _mm256_shuffle_epi8 intrinsic makes it look like it acts across lanes. Its formal algorithm doesn't clarify this either, because the index value is bounded to [0..15] and is not adjusted for the second lane (which would result in lane 0 of a being distributed to both lanes of the result).
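For what it's worth, a small test like the following (my own code, built with AVX2 enabled) demonstrates the per-lane behaviour I would expect from vpshufb: with an all-zero control, each 128-bit lane is filled with its own byte 0, so the low lane becomes all 0 and the high lane all 16.

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i a = _mm256_setr_epi8( 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
                                 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
    __m256i idx = _mm256_setzero_si256();          /* every control byte selects index 0 */
    __m256i r = _mm256_shuffle_epi8(a, idx);

    unsigned char out[32];
    _mm256_storeu_si256((__m256i *)out, r);
    printf("%d %d\n", out[0], out[16]);            /* prints "0 16": each lane shuffles only within itself */
    return 0;
}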

andysem

Just noted that 2.8.1 has been released. Thanks for the update.

The _mm256_shuffle_epi8 description is still confusing. And the original issue with the search bar is still not fixed either. I forgot to mention that the problem shows up not only with a maximized window, but also with a normal window taller than a certain size. I suppose the field size is ok when the window height is less than or equal to the total height of all the widgets, and when it exceeds that, the search field is stretched instead of leaving unused space at the bottom. Is there any estimate for the fix?

Sergey Kostrov

>>Just noted that 2.8.1 has been released...

Here is a link to download the recently released Intel Intrinsics Guide for Windows, version 2.8.1:

software.intel.com/sites/default/files/Intel_Intrinsics_Guide-windows-v2.8.1.zip

Patrick Konsor (Intel)

You're correct about _mm256_shuffle_epi8: it is not a cross-lane operation. I will fix the description and operation in the next release. Regarding the search bar issue, I have not been able to reproduce this on Ubuntu.

andysem

> Regarding the search bar issue, I have not been able to reproduce this on Ubuntu.

Hmm, I can reproduce it on all 3 of my systems, with Nvidia and AMD graphics and different drivers, on Kubuntu from 12.04 to 13.04. I'm using Oracle Java 1.7.

I have quite large displays though: 2560x1440 on two of my machines and 1920x1200 on another laptop. I'm not sure a 1920x1080 display is tall enough for the problem to manifest itself, since at that height the window is filled with widgets. If you don't have access to a bigger display, you can try attaching a second display arranged below your main one and stretching the window vertically. Or you can do the same with a single display if you move the window to the lower part of the screen (so that it goes partially below the edge) and then resize it vertically by dragging its top edge upwards.

andysem

I can see version 3.0.1 has been released. It seems the problem with the search field has been resolved, thanks!

Some AVX-512 intrinsics include latency/throughput information for CPUs that do not support the corresponding instructions. For example, _mm512_add_epi32 and similar intrinsics have this data for 06_3C CPUs, which I believe are Haswell. The data is also given for 06_45/46, but I don't know which CPUs those are. _mm512_maskz_cvtepi32_ps has latency/throughput for 06_2A (Sandy Bridge) and later CPUs. There are other intrinsics with this problem as well.
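(For reference, I read those 06_xx identifiers as DisplayFamily_DisplayModel values from CPUID leaf 1; a simplified sketch of my own, ignoring the extended-family adjustment that only matters for family 0x0F:)

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    unsigned int family = (eax >> 8) & 0xF;
    unsigned int model  = (eax >> 4) & 0xF;
    if (family == 0x6 || family == 0xF)            /* the extended-model bits apply only to these families */
        model |= ((eax >> 16) & 0xF) << 4;
    printf("%02X_%02X\n", family, model);          /* e.g. 06_3C on a Haswell system */
    return 0;
}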

Patrick Konsor (Intel)

The latency and throughput data is for instructions, not intrinsics, so if an instruction exists on an earlier architecture then that data will be shown for all intrinsics that share that instruction. I'll look into improving this.

06_45/46 are additional Haswell models.

andysem

But the latency/throughput can differ depending on the instruction operand width, can't it? And these intrinsics operate on zmm registers, which are not available on Haswell and earlier architectures. BTW, the latency/throughput data for AVX-512 instructions is presented for ymm operands, not zmm.

Patrick Konsor (Intel)

We do not include latency/throughput data for unreleased micro-architectures, so none of the latency/throughput data is currently applicable for AVX-512 intrinsics (or anything newer than AVX2/FMA). In the future we will add this data, and it will be marked with zmm (where appropriate) operands. Any latency/throughput data shown for an AVX-512 intrinsic is referring to the 128-bit or 256-bit version of the instruction that corresponds to that intrinsic. I will look into limiting the latency/throughput data that is shown to supported architectures.

Sergey Kostrov

>>... In the future we will add this data, and it will be marked with zmm (where appropriate) operands. Any latency/throughput
>>data shown for an AVX-512 intrinsic...

Please also ask the Intel C++ compiler team to update the comment in the zmmintrin.h header file that mentions '...512-bit compiler intrinsics...'. As you can see it doesn't have the AVX prefix, and ideally it should read:

'...AVX 512-bit compiler intrinsics...'

Igor Levicki

What about latency and throughput for intrinsics that map to more than one instruction at the compiler's discretion (depending on the /arch switch)?

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
Patrick Konsor (Intel)

Well, that would depend on the specific instructions the compiler chose. You can look up the latency/throughput of specific instructions in the Optimization Manual: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia...
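To illustrate the point, the very same intrinsic can be emitted as either the legacy SSE encoding or the VEX encoding depending on the target switches (e.g. /arch:SSE2 vs /arch:AVX, or -msse2 vs -mavx), and the two forms may have separate timing entries. A generic sketch, not tied to any particular entry in the guide:

#include <emmintrin.h>

/* With /arch:SSE2 (or -msse2) this typically becomes:   addps  xmm0, xmm1
 * With /arch:AVX  (or -mavx)  it becomes the VEX form:  vaddps xmm0, xmm0, xmm1 */
__m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}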

andysem

Another bug: the monitor/mwait instructions are said to be detectable via the SSE3 CPUID flag. This is not correct; there is a dedicated flag (ECX bit 3 of CPUID function 1) for these instructions.
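A minimal detection sketch of my own (using the cpuid.h helper shipped with GCC/Clang) showing that the two features have separate bits:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        int sse3    = (ecx >> 0) & 1;   /* CPUID.1:ECX.SSE3[bit 0] */
        int monitor = (ecx >> 3) & 1;   /* CPUID.1:ECX.MONITOR[bit 3] - monitor/mwait support */
        printf("SSE3: %d, MONITOR/MWAIT: %d\n", sse3, monitor);
    }
    return 0;
}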

Patrick Konsor (Intel)

You are correct. This will be corrected in the next release.

Sergey Kostrov

Patrick,

I wonder if a web link to the latest version of the Intel Intrinsics Guide could be provided, for example in some Sticky Post? Thanks in advance.

Patrick Konsor (Intel)

The latest version is always available here:

http://software.intel.com/en-us/articles/intel-intrinsics-guide

Filippo Bistaffa

Am I missing something, or do "_mm_bsrli_si128" and "_mm_srli_si128" have the same description? What's the difference between them?

Patrick Konsor (Intel)

These intrinsics are identical; they are just two different names for the exact same functionality.
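For example, both of the following produce the same psrldq/vpsrldq byte shift (my illustration; note the shift count has to be a compile-time constant, and as noted further down the thread, some non-Intel headers may not declare _mm_bsrli_si128):

#include <emmintrin.h>

/* both shift the whole 128-bit value right by 4 bytes */
__m128i shift_a(__m128i v) { return _mm_srli_si128(v, 4); }
__m128i shift_b(__m128i v) { return _mm_bsrli_si128(v, 4); }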

Sergey Kostrov

This is a short follow-up.

>>The latest version is always available here:
>>
>>http://software.intel.com/en-us/articles/intel-intrinsics-guide

Patrick, why don't you add the link to:

Sticky Thread Forum Topic: Links to instruction documentation
Web-link: http://software.intel.com/en-us/forums/topic/285900

Sergey Kostrov

>>..."_mm_bsrli_si128" and "_mm_srli_si128" have the same description?

_mm_bsrli_si128 - by the way, I don't see that function in Microsoft's version of the emmintrin.h header file, and I don't see it in intrin.h either

_mm_srli_si128 - shifts right

Sergey Kostrov

Please also take a look at:

Intel® 64 and IA-32 Architectures
Software Developer’s Manual
Volume 2 (2A, 2B & 2C):
Instruction Set Reference, A-Z

Order Number: 325383-047US
June 2013

Page: 886

Patrick Konsor (Intel)

Quote:

Sergey Kostrov wrote:

>>..."_mm_bsrli_si128" and "_mm_srli_si128" have the same description?

_mm_bsrli_si128 - by the way, I don't see that function in Microsoft's version of the emmintrin.h header file, and I don't see it in intrin.h either

_mm_srli_si128 - shifts right

I believe the bsrli intrinsic was recently added to the Intel compiler headers.

Sergey Kostrov

Patrick, the question was:

...What's the difference between them?...

If this is a new function, please provide more details. Thanks.

andysem

Sergey, I think Patrick has already stated that the two intrinsics are equivalent.

Sergey Kostrov

>>...I think Patrick has already stated that the two intrinsics are equivalent...

I don't see the logic in creating another intrinsic function that does exactly the same processing under a different name.

andysem

In the _mm256_permute2x128_si256 description it says:

dst[255:128] := SELECT4(b[255:0], b[255:0], imm[7:4])

 

I believe the first argument to SELECT4 should be a[255:0].
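For what it's worth, here is how I use the intrinsic (my own example); it only gives the expected result if the upper half can also select from a:

#include <immintrin.h>

/* imm 0x13: selector field [3:0] = 3 picks the high lane of b for dst[127:0],
 * and field [7:4] = 1 picks the high lane of a for dst[255:128] - which only works
 * if the first SELECT4 argument for the upper half is a[255:0], not b[255:0] */
__m256i swap_high_lanes(__m256i a, __m256i b)
{
    return _mm256_permute2x128_si256(a, b, 0x13);
}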

 

iliyapolak

Yep it seems so.

Patrick Konsor (Intel)

Yes, you are correct. This will be resolved in the next release, which will be out later this month.

andysem

I see there is a major update to the Intrinsics Guide. Nice job, thanks!

There is an error in the tooltip that pops up when I hover the pointer over the non-VEX instructions. Regardless of the instruction, the tooltip always says:

This intrinsic may generate the VEX-encoded instruction vpunpcklwd. If the instruction is not VEX encoded, punpcklwd may cause performance penalties if mixed with 256-bit or 512-bit instructions.

I suppose the text should either be more generic or mention the actual corresponding instructions.
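(As I understand it, the penalty the tooltip refers to is the usual cost of mixing legacy-encoded SSE code with dirty upper YMM state; a rough illustration of my own, not text from the guide:)

#include <immintrin.h>

void avx_then_legacy(float *dst, const float *a, const float *b)
{
    __m256 x = _mm256_loadu_ps(a);
    __m256 y = _mm256_loadu_ps(b);
    _mm256_storeu_ps(dst, _mm256_add_ps(x, y));

    _mm256_zeroupper();               /* clear the upper YMM halves before any non-VEX (legacy SSE) code runs */

    __m128 z = _mm_loadu_ps(a + 8);   /* if this is emitted with the legacy encoding, the vzeroupper above
                                         avoids the SSE/AVX transition penalty the tooltip warns about */
    _mm_storeu_ps(dst + 8, z);
}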

BTW: Is there a downloadable (standalone) version of the guide?

Patrick Konsor (Intel)

That is indeed an issue. The fix is on its way up right now and should be visible soon.

The standalone version is not available at the moment, but hopefully we'll have it ready early next year.

bronxzv

I just noticed a few errors in the Haswell throughput numbers for these instructions:

VBLENDVPS/PD: should be 2 instead of 1

VMULPD/PS: should be 0.5 instead of 1

andysem

When I select AVX-512F in the filters, the SVML intrinsics are also listed. This doesn't happen when I select AVX-512 though.

andysem

Some intrinsics are missing timing information that was present in the 3.0.1 (the last standalone) version. For example, _mm_alignr_epi8 and _mm_alignr_pi8.

Is there any news on the standalone version?

Patrick Konsor (Intel)

Regarding the VBLENDVPD/PS and VMULPD/PS throughputs on Haswell, you're correct, those will be updated momentarily.

Regarding SVML intrinsics showing under AVX-512F, that was resolved a while ago, but may not have been universally visible. It should be visible soon.

Regarding the missing performance data for _mm_alignr_epi8 and _mm_alignr_pi8: the omission of _mm_alignr_epi8 was indeed a mistake, and the data will be added shortly. In the process of validating all the intrinsics' performance data across multiple sources, some of the data was identified as invalid and removed, which was the case for _mm_alignr_pi8.

Diego Caballero

Hi,

_mm512_permutevar_epi32 / _mm512_mask_permutevar_epi32 and _mm512_alignr_epi32 are missing for KNC in the latest version.

Is there any plan to include latency and throughput for the KNC ISA?

Thank you!

Barcelona Supercomputing Center
Kevin Davis (Intel)

User Patrick S. reported other mistakes here: http://software.intel.com/en-us/forums/topic/500971#comment-1779043. He wrote:

Quote:

Patrick S. wrote: I have also found some mistakes:

in the Intrinsics Guide:

http://software.intel.com/sites/landingpage/IntrinsicsGuide/

the instruction _mm512_alignr_epi32 is not listed under "KNC". It is only listed under AVX-512, but KNC supports the alignr instruction.
 
The same for:

_mm512_mask_alignr_epi32/epi64
_mm512_load_ps/pd
_mm512_store_ps/pd
_mm512_fmadd_ps/pd
_mm512_fnmadd_ps/pd
_mm512_fmsub_ps/pd
_mm512_fnmsub_ps/pd

Also, all cast instructions like _mm512_castpd_ps are not listed under "KNC".

I guess there are a lot more mistakes, but these are the ones I remember.
