Blog on Haswell NI (has some pictures)

I posted a blog with some additional information on HSW-NI (nothing new vs. what's in the spec, but condensed for your pleasure) at .

I'm encouraged by the initial feedback from the team. If we didn't get the instruction you always wanted in this batch, please do make yourself heard....


Support for gather, FMA and 256-bit integer operations is absolutely awesome! I've always believed that homogeneous architectures are more powerful since their serial performance fights off Amdahl's Law. So merging GPU-like technology into Haswell's cores is a brilliant move. It seems it will offer a tenfold increase in computing power over Nehalem, while also greatly extending the flexibility. I can't wait to get my hands on this hardware and create some applications the world has never seen before...
Some people seem to think the gather instructions will be micro-coded though. While that would still be better than no gather instructions at all, it would be a bit of a shame. The fact that Sandy Bridge has two load ports and Larrabee is supposedly capable of fetching any number of elements from each cache line makes me hopeful though that a nice balance between throughput and implementation cost can be achieved...
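
For readers unfamiliar with gather, here is a minimal scalar sketch in C of what an 8-element, 32-bit gather computes. The function name gather8_i32 is hypothetical, and the sketch ignores masking and fault behavior. It only illustrates the semantics and says nothing about how Haswell implements the instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of an 8 x 32-bit gather:
   dst[i] = base[idx[i]] for each of the 8 lanes. */
static void gather8_i32(int32_t dst[8], const int32_t *base,
                        const int32_t idx[8])
{
    for (size_t i = 0; i < 8; ++i)
        dst[i] = base[idx[i]];
}
```

Whether the hardware handles the eight loads in microcode or shares accesses to the same cache line is exactly the open question raised above.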

I also believe that subsequent architectures should further increase performance/Watt by executing 1024-bit instructions on 256-bit execution units, using a single uop. This seems feasible by implementing 1024-bit registers as four 256-bit physical registers, and sequentially feeding them into the pipelines (please correct me if I'm wrong). This would compensate for the power consumption of out-of-order execution, allowing high numbers of cores and high clock frequencies for desktop parts, and also unencumbered homogeneous mobile architectures.

Hello Mark!

OK, here come my wishes and ideas.

I had already posted some proposals in this forum:

Here are other ones in no particular order (numbers for reference):

  1. "subx" analogous to the new andn/sarx/... bmi commands featuring 2 sources and an independent target.
  2. mulx with 4 definable registers, why is there a dependency on edx?
  3. Universal bitwise blending for sse and avx registers, like AMD's vpcmov to replace the often used combination pand/pandn/por (or the float equivalents). It would come in handy for general purpose registers too.
  4. "packwb" shall work analogously to packuswb/packsswb but without saturation. Also for other sizes.
  5. pack/punpck for avx without slices, i.e. natural extension from mm=>xmm=>ymm (cross-lane).
  6. palignr, pslldq, psrldq for avx registers without slices.
  7. Horizontal ops for avx registers without slices.
  8. Enhanced pshufb for sse and avx which can fetch from 2 sources and without slices, like AMD's vpperm.
  9. Unsigned average also for dwords and larger (sse/avx).
  10. Signed average (sse/avx).
  11. "psadwd" analogous to psadbw: Sum of absolute differences also for other sizes.
  12. Absolute differences without summing, signed and unsigned.
  13. Load/store a variable number of bytes of sse/avx registers, enabling e.g. to store (the least significant) 3 bytes. The number of bytes to be moved shall come from a register (e.g. cl) or be a constant. This should be faster than the complicated to use maskmovdqu or a series of pextr commands.
  14. pcmp variants which shall compare unsigned values, e.g. like AMD's vpcomu.
  15. sse/avx (packed singles and doubles) command for linear interpolation / blending: a*f+b*(1-f) [=(a-b)*f+b].
  16. Ditto for bytes/words/..., e.g. for unsigned bytes: (a*f+b*(256-f))>>8 [=(((a-b)*f)>>8)+b]
  17. Bit permutations, see below.
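
As an illustration of item 3, the pand/pandn/por combination computes a bitwise select. Below is a minimal scalar sketch in C. The names bitselect64 and bitselect64_xor are hypothetical, and a 64-bit integer stands in for a vector register.

```c
#include <stdint.h>

/* Bitwise select: take bits of a where mask is 1, bits of b where mask is 0.
   This is the scalar equivalent of the pand + pandn + por combination. */
static uint64_t bitselect64(uint64_t mask, uint64_t a, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}

/* Equivalent formulation using xor instead of or. */
static uint64_t bitselect64_xor(uint64_t mask, uint64_t a, uint64_t b)
{
    return b ^ ((a ^ b) & mask);
}
```

A dedicated blend instruction would do this in one operation with all three inputs left intact.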

Bit permutations:
I have written some documentation and emulation software (free C and Pascal source), as well as a proposal for x86 bit permutation instructions, on my web page.
The GRP instruction would be very useful but also very expensive.
Other special permutations would also be nice to have, such as rotating an 8*8 bit matrix by 90 or 270 degrees, or a diagonal flip.
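
To make the 8*8 bit matrix operations concrete, here is a simple reference sketch in C. It is not taken from the web page mentioned above. The function names and the row-major bit layout are assumptions chosen for illustration, and the loops favor clarity over speed.

```c
#include <stdint.h>

/* Assumed layout: the matrix is stored row by row in a 64-bit word,
   so bit (8*r + c) holds element (r, c). */

/* Diagonal flip (transpose): element (r, c) moves to (c, r). */
static uint64_t transpose8x8(uint64_t m)
{
    uint64_t out = 0;
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            if ((m >> (8 * r + c)) & 1)
                out |= (uint64_t)1 << (8 * c + r);
    return out;
}

/* Rotation by 90 degrees: element (r, c) moves to (c, 7 - r).
   This is clockwise if row 0 is drawn at the top and column 0 at the left. */
static uint64_t rotate90_8x8(uint64_t m)
{
    uint64_t out = 0;
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            if ((m >> (8 * r + c)) & 1)
                out |= (uint64_t)1 << (8 * c + (7 - r));
    return out;
}

/* Rotation by 270 degrees, expressed as three 90-degree rotations. */
static uint64_t rotate270_8x8(uint64_t m)
{
    return rotate90_8x8(rotate90_8x8(rotate90_8x8(m)));
}
```

Each function touches all 64 bits individually, which is the cost a dedicated instruction would remove.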

I don't see the necessity for some of the new commands, like blsi, blsmsk, blsr. Most other bmi commands are also easily emulated.
However pdep and pext are great - why are they not mentioned on this blog and why don't they (also) act on sse/avx registers?
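
The contrast between the two groups of instructions can be sketched in portable C. The function names below simply mirror the mnemonics and are not an official API. The three bls* operations are single expressions, while pext and pdep need a loop over the set bits of the mask.

```c
#include <stdint.h>

/* One-line emulations of the simple BMI operations. */
static uint64_t blsi(uint64_t x)   { return x & (0 - x); } /* isolate lowest set bit */
static uint64_t blsmsk(uint64_t x) { return x ^ (x - 1); } /* mask up to and including lowest set bit */
static uint64_t blsr(uint64_t x)   { return x & (x - 1); } /* clear lowest set bit */

/* pext: collect the bits of src selected by mask into the low bits of the result. */
static uint64_t pext(uint64_t src, uint64_t mask)
{
    uint64_t res = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (0 - mask);
        if (src & lowest)
            res |= bit;
        mask &= mask - 1; /* drop the mask bit just handled */
    }
    return res;
}

/* pdep: scatter the low bits of src to the positions selected by mask. */
static uint64_t pdep(uint64_t src, uint64_t mask)
{
    uint64_t res = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (0 - mask);
        if (src & bit)
            res |= lowest;
        mask &= mask - 1;
    }
    return res;
}
```

The loops run once per set mask bit, which is why hardware support for pdep and pext is far more valuable than for the one-liners.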

Best regards
Jasper Neumann

Quoting Mark Buxton (Intel)

If we didn't get the instruction you always wanted in this batch, please do make yourself heard....

- Better support for bytes: pmullb, pmulhb, psllb, psrlb. See for some motivation.

- An unsigned variant of pmaddubsw
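
For context, byte shifts are commonly emulated today with a wider shift followed by a mask that removes the bits crossing byte boundaries. The C sketch below shows the idea on a 64-bit word holding eight bytes. The names psllb_swar and psrlb_swar are hypothetical, and the shift count n is assumed to be in the range 0..7.

```c
#include <stdint.h>

/* Shift each of the 8 bytes in x left by n bits (n in 0..7).
   Bits that spill into the neighboring byte are masked off. */
static uint64_t psllb_swar(uint64_t x, unsigned n)
{
    uint64_t keep = 0x0101010101010101ULL * (uint8_t)(0xFFu << n);
    return (x << n) & keep;
}

/* Shift each of the 8 bytes in x right by n bits (n in 0..7), unsigned. */
static uint64_t psrlb_swar(uint64_t x, unsigned n)
{
    uint64_t keep = 0x0101010101010101ULL * (uint8_t)(0xFFu >> n);
    return (x >> n) & keep;
}
```

Native byte shifts would save the mask constant and the extra and operation.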

It seems that nobody has really new ideas or wants to discuss my lengthy list on #2 - or it might be that the mentioned blog could have been a better place to post the list - is it?

The SSE4.2 string commands are very powerful but often difficult to use, especially when the string memory is near the end of an allocation block so that an AV might occur if too much of the string is fetched - one must fill an SSE register in order to use the commands.
As a workaround, there should be a command which can load from memory and ignore the potential AV (but possibly setting flags or a counter), and there should be a command which can easily store a short string from an SSE register with a length given by a GPR ("maskmovdqu" is not the best command one can imagine for this case).
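
Until such a command exists, one commonly used software workaround (not something proposed in this post) is to test whether a 16-byte load would cross a page boundary, since a load that stays within the page of its first valid byte cannot raise an AV. The C sketch below assumes 4 KiB pages. The names PAGE_SIZE_ASSUMED and load16_stays_in_page are hypothetical.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u /* assumption: 4 KiB pages */

/* Returns 1 if the 16 bytes starting at addr lie within a single page,
   0 if the load would cross into the next page. */
static int load16_stays_in_page(uintptr_t addr)
{
    return (addr & (PAGE_SIZE_ASSUMED - 1)) <= PAGE_SIZE_ASSUMED - 16;
}
```

When the check fails, the caller has to fall back to a slower byte-wise path, which is the overhead a fault-ignoring load would avoid.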

Talking of short strings:
I often have RGB pixels occupying 3 bytes. It would help if there were commands available which can store e.g. the lower 3 bytes from a GPR. Stuff like "mov [edx],ax; shr eax,16; mov [edx+2],al" is not very nice and also trashes eax. However it most often does not matter if I read 4 bytes when I only need 3...
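
For comparison, the same three-byte store can be written in C as follows. The helper name store_rgb24 is hypothetical, and how a compiler translates it is not guaranteed.

```c
#include <stdint.h>

/* Store the low 3 bytes of v at dst, least significant byte first,
   without touching dst[3]. */
static void store_rgb24(uint8_t *dst, uint32_t v)
{
    dst[0] = (uint8_t)v;
    dst[1] = (uint8_t)(v >> 8);
    dst[2] = (uint8_t)(v >> 16);
}
```

The source value is left unchanged here, but that only hides the register shuffling in the compiler output.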

Another slightly off-topic comment:
I heartily wish the string commands "rep stosb", "rep movsb", "rep cmpsb" and "rep scasb" (yes, especially the byte operations) were so fast and smart that all the many replacement routines would be outperformed and thus made obsolete. See e.g. Agner Fog's subroutine library for examples of such routines and the effort needed to choose the right one.
Is there any reason for the more or less poor performance of the mentioned commands?
A variant of "rep scasb" (and other sizes as well) which looks for conditions other than "equal" and "not equal" would also be useful.

Hello Mark,

I have an idea for several bit manipulation instructions that would operate on XMM/YMM integer registers.

1. ROL/ROR XMM/YMM, IMM8/GPR -- rotate whole XMM/YMM register left or right by a number of bits set in IMM8 or GPR (i.e. bits going out go in on the other side).

2. ROLB/W/D/Q / RORB/W/D/Q XMM/YMM, XMM/YMM/MEM128/MEM256 -- rotate B/W/D/Q elements left or right by the number of bits specified in corresponding XMM/YMM/MEM128/MEM256 elements (i.e. variable B/W/D/Q rotate, independent bit count for each element).

3. RCL/RCR XMM/YMM, IMM8/GPR -- rotate whole XMM/YMM register left or right by a number of bits set in IMM8 or GPR through carry flag. It should be possible to shift in the bit (which was shifted out last to carry flag) to another register.

4. RCLB/W/D/Q / RCRB/W/D/Q XMM/YMM, XMM/YMM/MEM128/MEM256 -- rotate B/W/D/Q elements left or right by the number of bits specified in corresponding XMM/YMM/MEM128/MEM256 elements through carry flag. This one would be tricky -- you would obviously need to have 32 carry flags for each YMM register but it would enable mind-boggling bit transformations.

5. BRVS XMM/YMM, GPR -- reverse bit order in XMM/YMM starting with bit index specified in GPR[15:8] and ending with bit count (or index) specified in GPR[7:0].
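
To illustrate proposal 1, here is a scalar C sketch of a full-width 128-bit rotate left, modeled on two 64-bit halves. The type u128 and the function rol128 are hypothetical names. A real instruction would operate on an XMM register directly.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit register: two 64-bit halves. */
typedef struct {
    uint64_t lo; /* bits 0..63 */
    uint64_t hi; /* bits 64..127 */
} u128;

/* Rotate the whole 128-bit value left by n bits; bits leaving the top
   re-enter at the bottom. The count is taken modulo 128. */
static u128 rol128(u128 v, unsigned n)
{
    n &= 127;
    if (n >= 64) { /* rotating by 64 swaps the halves */
        uint64_t t = v.lo;
        v.lo = v.hi;
        v.hi = t;
        n -= 64;
    }
    if (n == 0)
        return v;

    u128 r;
    r.hi = (v.hi << n) | (v.lo >> (64 - n));
    r.lo = (v.lo << n) | (v.hi >> (64 - n));
    return r;
}
```

A rotate right by n is the same as a rotate left by 128 - n, so one helper covers both directions.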

Igor Levicki

It seems that a lot of arithmetic operations lack AVX (VEX) support, such as (v)padd[s/u]s[b/w], (v)p[add/sub][b/w/d/q], (v)pmaddubsw, etc. So there may not be much to gain from rewriting SSE2/SSE4 code to AVX for traditional video encoding/decoding applications.

I don't see a single one from your list missing in AVX2. By the way, they are already available with VEX.128 encoding in AVX.
