Haswell New Instruction Descriptions Now Available!

By Mark Buxton (Intel) on June 13, 2011 at 8:52 am

Intel just released public details on the next generation of the x86 architecture. Arriving first in our 2013 Intel microarchitecture codename “Haswell”, the new instructions accelerate a broad category of applications and usage models. Download the full Intel® Advanced Vector Extensions Programming Reference (319433-011).

These build upon the instructions coming in Intel® microarchitecture code name Ivy Bridge, including the digital random number generator and half-float (float16) accelerators, and they extend the Intel® Advanced Vector Extensions (Intel® AVX) that launched in 2011.

The instructions fit into the following categories:

AVX2 - Integer data types expanded to 256-bit SIMD. AVX2’s integer support is particularly useful for processing visual data commonly encountered in consumer imaging and video processing workloads. With Haswell, we have both Intel® Advanced Vector Extensions (Intel® AVX) for floating point, and AVX2 for integer data types.

Bit manipulation instructions are useful for compressed databases, hashing, large-number arithmetic, and a variety of general-purpose codes.
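As a rough illustration of what this category covers, here is a scalar Python model of four operations from the referenced specification: BLSI, BLSR, PEXT and PDEP. The function names and the 64-bit operand width are assumptions made for the sketch; it models the results only, not how the hardware computes them.

```python
MASK64 = (1 << 64) - 1  # assumed operand width for this sketch


def blsi(x):
    """Isolate the lowest set bit of x (models BLSI: x & -x)."""
    return x & -x & MASK64


def blsr(x):
    """Clear the lowest set bit of x (models BLSR: x & (x - 1))."""
    return x & (x - 1) & MASK64


def pext(src, mask):
    """Pack the bits of src selected by mask into the low bits of the result."""
    result, out_bit = 0, 0
    while mask:
        lowest = mask & -mask          # next selected bit position
        if src & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # drop that position from the mask
    return result


def pdep(src, mask):
    """Scatter the low bits of src to the bit positions selected by mask."""
    result, in_bit = 0, 0
    while mask:
        lowest = mask & -mask
        if src & (1 << in_bit):
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result
```

For example, pext(value, 0xF0) pulls a 4-bit field out of bits 4..7, and pdep puts it back. This kind of field packing is common in compressed data formats and hashing.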

Gather - Useful for vectorizing codes with nonadjacent data elements. Haswell gathers are masked for safety (like the conditional loads and stores introduced in Intel® AVX), which favors their use in codes with clipping or other conditionals.
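The masking behavior can be sketched with a small scalar model in Python. The function and parameter names below are hypothetical, and the list stands in for memory. The point it illustrates is that a masked-off lane is never read, so an out-of-range index in that lane is harmless and the lane keeps its previous value.

```python
def masked_gather(memory, indices, mask, passthrough):
    """Scalar model of a masked gather.

    For each lane: if the mask is set, load memory[index]; otherwise keep
    the old destination value and do not touch memory at all.
    """
    return [
        memory[i] if m else old          # memory[i] is evaluated only when m is true
        for i, m, old in zip(indices, mask, passthrough)
    ]


def clipped_gather(memory, indices, fill):
    """Usage example: gather with clipping, using `fill` for out-of-range indices."""
    mask = [0 <= i < len(memory) for i in indices]
    return masked_gather(memory, indices, mask, [fill] * len(indices))
```

With memory [10, 20, 30, 40] and indices [3, 99, 0, 1], clipped_gather loads lanes 0, 2 and 3 and leaves the fill value in lane 1 instead of faulting.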

Any-to-Any permutes – incredibly useful shuffling operations. Haswell adds support for DWORD and QWORD granularity permutes across an entire 256-bit register.
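A minimal Python model of a DWORD-granularity any-to-any permute, assuming a 256-bit register viewed as eight 32-bit elements (the function name is made up for this sketch). Each destination element may pick any of the eight source elements, including ones in the other 128-bit half, and only the low 3 bits of each index are used.

```python
def permute_dwords(src, idx):
    """Scalar model of an 8-element any-to-any permute: dst[i] = src[idx[i] & 7]."""
    if len(src) != 8 or len(idx) != 8:
        raise ValueError("expected eight 32-bit elements and eight indices")
    return [src[i & 7] for i in idx]
```

Reversing the register is permute_dwords(v, [7, 6, 5, 4, 3, 2, 1, 0]), and broadcasting element 5 is permute_dwords(v, [5] * 8).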

Vector-Vector Shifts: We added shifts whose shift counts come from a vector, so each element can be shifted by its own amount. These are critical in vectorizing loops with variable shifts.
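To make the per-element behavior concrete, here is a scalar Python model for 32-bit elements. The function names are hypothetical, and the sketch assumes unsigned elements and non-negative counts. A count of 32 or more produces 0 for the logical shifts, so no undefined behavior has to be special-cased in the loop being vectorized.

```python
M32 = 0xFFFFFFFF  # 32-bit element mask


def shift_left_variable(values, counts):
    """Per-element logical left shift: element i is shifted by counts[i]."""
    return [(v << c) & M32 if c < 32 else 0 for v, c in zip(values, counts)]


def shift_right_logical_variable(values, counts):
    """Per-element logical right shift: element i is shifted by counts[i]."""
    return [(v & M32) >> c if c < 32 else 0 for v, c in zip(values, counts)]
```

A loop such as out[i] = x[i] << n[i] maps directly onto one such operation per vector of elements.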

Floating Point Multiply Accumulate – Our floating-point multiply-accumulate instructions significantly increase peak flops and provide improved precision, which further benefits transcendental mathematics. They are broadly usable in high performance computing, professional quality imaging, and face detection. They operate on scalar, 128-bit packed, and 256-bit packed single- and double-precision data types. [These instructions were described previously, in the initial Intel® AVX specification].
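The precision benefit comes from rounding once instead of twice. The Python sketch below models that for finite double-precision inputs by computing a*b + c exactly with rational arithmetic and rounding only at the end. It is an illustration of the numerical effect with made-up function names, not a description of the hardware.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Model of a fused a*b + c: exact product and sum, one final rounding."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Ordinary multiply then add: the product is rounded before the add."""
    return a * b + c


eps = 2.0 ** -52
a, b, c = 1.0 + eps, 1.0 - eps, -1.0
# The exact product is 1 - 2**-104, which rounds to 1.0 as a double.
unfused = separate_multiply_add(a, b, c)   # the small term is lost
fused = fused_multiply_add(a, b, c)        # the small term survives
```

Here the separate version returns 0.0, while the fused model returns -2**-104, the exact answer. Error terms of this kind are what polynomial evaluations in transcendental functions depend on.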

The vector instructions build upon the expanded (256-bit) register state added in Intel® AVX, and as such are supported by any operating system that supports Intel® AVX.
For developers, please note that the instructions span multiple CPUID leaves. You should be careful to check all applicable bits before using these instructions.
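As a sketch of what checking all applicable bits can look like, the Python function below decodes feature flags from raw register values. Python cannot execute CPUID itself, so the three inputs are assumed to have been read elsewhere (for example in C or assembly): ECX from CPUID leaf 1, EBX from CPUID leaf 7 (subleaf 0), and XCR0 from XGETBV. The function and parameter names are hypothetical.

```python
def usable_features(leaf1_ecx, leaf7_ebx, xcr0):
    """Decode which instruction groups are safe to use.

    Assumes the caller already confirmed that CPUID leaf 7 exists
    (maximum basic leaf >= 7) and only reads XCR0 when OSXSAVE is set.
    """
    def bit(reg, n):
        return bool((reg >> n) & 1)

    # The OS must have enabled XSAVE and must save both XMM and YMM state.
    os_saves_ymm = bit(leaf1_ecx, 27) and (xcr0 & 0x6) == 0x6
    avx = os_saves_ymm and bit(leaf1_ecx, 28)
    return {
        "AVX": avx,
        "FMA": avx and bit(leaf1_ecx, 12),    # leaf 1, ECX bit 12
        "AVX2": avx and bit(leaf7_ebx, 5),    # leaf 7, EBX bit 5
        "BMI1": bit(leaf7_ebx, 3),            # leaf 7, EBX bit 3
        "BMI2": bit(leaf7_ebx, 8),            # leaf 7, EBX bit 8
    }
```

Note that FMA and AVX2 are reported in different leaves, and both also depend on the operating system saving YMM state, so a single-bit test is not enough.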
Please check out the specification and stay tuned for supporting tools over the next couple of months.

Mark Buxton
Software Engineer
Intel Corporation

Categories: Graphics & Media, Parallel Programming, Performance and Optimization

Comments (19)

June 13, 2011 4:09 PM PDT


Ajax
wah interesting


I wish I had more than one brain, one is simply not enough to follow all that development :P
June 14, 2011 3:51 AM PDT

c0d1f1ed
This is downright revolutionary. Finally the SIMD instruction set will feature a vector equivalent of every (relevant) scalar instruction. This allows parallelizing lots of code loops, in theory offering up to an eightfold increase in throughput computing performance. GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture.
June 15, 2011 12:21 AM PDT


MythBuster
"GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture."

I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU).
June 20, 2011 12:52 AM PDT


Wladik
I am looking for commands that can speed-up the contour dot algorithm of edge detection that is described in my open-source project http://outliner.codeplex.com/
It would be great to have commands that help statistics calculations - find average, average squares, average cubes, average 4-th power, correlation between the two rows of numbers, etc.
June 20, 2011 8:05 PM PDT


Geoff Langdale
Can we get 'order of magnitude' indications on scatter/gather latency and throughput? I know this is a long way from silicon, but it would be really helpful if we could know ahead of time whether scatter/gather are internally equivalent to a big bunch of load/store uops or whether they'll happen more aggressively parallel somehow. Either way it's exciting but it would be nice to know.

It would be interesting to know this information even for Knights Corner / Larrabee.
June 26, 2011 2:53 PM PDT

Nick Black
It's good to see FMA in this iteration. Finally combining that with the 256-bit AVX YMM registers will be tasty.
June 27, 2011 4:18 AM PDT


Michael
will there be an AVX2 emulation header available? date?
June 30, 2011 3:14 PM PDT

amdn
Do cryptographic applications need a high data rate source of random numbers? They unquestionably need high quality non-deterministic, hardware generated random numbers, but is there a need for these at a high data rate?

One application that could use high quality pseudo-random numbers is Monte Carlo simulations. According to Wikipedia "Many of the most useful techniques use deterministic, pseudorandom sequences, making it easy to test and re-run simulations."

The RDRAND instruction will be useful in cryptography because of its non-determinism and high quality, but maybe not for its high data rate.
The RDRAND instruction will be useful in Monte Carlo simulations because of its high quality and high data rate, but sadly the non-determinism will not satisfy the need to, as Wikipedia put it, "make it easy to test and re-run simulations."

It is tantalizing though, to read that the hardware implementation of RDRAND actually has a deterministic PRNG as one of the components, but it gets frequently reset non-deterministically.

I suspect the high performance computing community will use a software PRNG for Monte Carlo simulations during the debugging stages of the software, at least, and may not take advantage of RDRAND even in production.
July 1, 2011 2:52 AM PDT


Phillip Wayne
@amdn: If *everything* is not pushed down at a high data rate, that means that whatever is not will slow down whatever is, and (weakest link) things will run at the slowest rate of a mix of instructions. Standing alone, a high data rate may not be needed for crypto, but I would hate to see it slow everything down just because no one could predict that someone would be running crypto and a graphics program at the same time. Two programs at once?? Who would want to do that :)
July 1, 2011 3:11 AM PDT

sirrida
It seems that the BMI commands BEXTR, PDEP and PEXT which are described in the referenced document 319433-011 are missing.
Are they about to be dropped from the proposed instruction set?
In my opinion PDEP and PEXT are by far the most usable BMI commands - all other BMI commands are easily replaceable.

I would like to see example applications where the motivation for some commands is illuminated, e.g. where can I use these lowest bit operations?
July 1, 2011 6:37 AM PDT


Q-FP
A search through the < Intel's Advanced Vector Extensions Programming Reference #319433-011 from JUNE 2011> reveals nothing.
Not a single reference to Quad-Precision.
All the vectorized quads are of single or double precision.
Basically, it means Newtonian and/or Maxwellian simulations. That's it.
Intel participated in development of quads in the IEEE 754R.
The standard is meanwhile 3 years old and will be 5 or so in 2013.
Some engineers within Intel understand the quads' importance.
Look: <http://www.intel.com/standards/floatingpoint.pdf>.
Let's be sure, quad-precision is not a mass market.
It is however the enabling technology. Of importance to mankind.
Is Intel's silence on the subject only a case of its deeply nurtured modesty?
July 1, 2011 10:50 AM PDT

Narayan
Happy day ! Good wave programming. Thanks.
July 3, 2011 11:19 AM PDT

amdn
Phillip, you are right and I agree with you. Reading my post again I realize that my main point wasn't in the first paragraph, my mistake. My main point is that if we are going to get a high data rate random number source it would be most useful for it to (optionally) be pseudo-random (deterministic sequence given a seed).

There are several high performance computing applications where having a source of high quality random numbers at a high data rate would be useful, and RDRAND can certainly be used for that, but it would be better, for those applications, if the process that generates those random numbers was deterministic (for testing and debugging). It appears that the implementation of RDRAND already has a fast PRNG internally... but only the non-deterministic output is available with the RDRAND instruction.

Maybe one can use the AES cryptography instructions or the CRC32 instruction to implement a fast PRNG. AES decryption in CBC mode with 128 bit keys takes about 20 cycles to generate a 16-byte block, that's 10 cycles to generate 8 bytes and more cycles if you then want to use those bytes to produce a double precision number in the range 0.0 to 1.0. Ideally I would like to see an SSE/AVX instruction that loads a vector register with packed float16, float32, or float64 pseudo-random values each in the range 0.0 to 1.0, with 1 cycle throughput and 1 cycle latency. In order for it to be deterministic it would have to be architected like the CRC32 instruction, where the state (seed) is in architected registers that are saved/restored on a context switch.
July 3, 2011 10:57 PM PDT

c0d1f1ed
"I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU)."

Nothing is preventing the CPU from becoming a throughput computing device itself. AVX2 brings us FMA and gather, two features which used to be exclusive to the GPU! That only leaves competitive power efficiency. That can be achieved with AVX-1024 executed as a single uop on the existing 256-bit units. By sequencing the execution over 4 cycles, you get the benefits of in-order execution (plus latency hiding), without losing any of the out-of-order execution benefits for legacy workloads.

Heterogeneous computing has no future because the communication overhead does not scale. It's better to keep things local and execute parallel computations in AVX2 units capable of breaking up 1024-bit instructions. We're only a tiny step away from that.
July 5, 2011 3:50 AM PDT


Software development
Thank you for useful information and informative review, well done.
August 9, 2011 12:21 AM PDT


Madis Kalme
3-operand GPR instructions are interesting. I can imagine the benefits. Shame they came so late. Now it takes a lot of time for them to be implemented in CPU and then developers to make use of them and then be able to use in software (because it takes time for people to buy new stuff).

Error in document: 319433-011.pdf documents an instruction MPSADBW with AVX2, but comment mentions xmm2,xmm3 and m128 still (there should be ymm2,ymm3 and m256 respectively).
September 14, 2011 3:13 AM PDT


Deepak
"I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU)."

Nothing is preventing the CPU from becoming a throughput computing device itself. AVX2 brings us FMA and gather, two features which used to be exclusive to the GPU! That only leaves competitive power efficiency. That can be achieved with AVX-1024 executed as a single uop on the existing 256-bit units. By sequencing the execution over 4 cycles, you get the benefits of in-order execution (plus latency hiding), without losing any of the out-of-order execution benefits for legacy workloads.

Heterogeneous computing has no future because the communication overhead does not scale. It's better to keep things local and execute parallel computations in AVX2 units capable of breaking up 1024-bit instructions. We're only a tiny step away from that.
September 19, 2011 10:48 PM PDT

Igor Levicki
The only thing I want to know is when I will be able to buy one?
April 22, 2012 5:33 AM PDT


Lary
Nice, good works. ???? Predictive 3D matrix manipulation ?????
