Haswell New Instruction Descriptions Now Available!

Intel just released public details on the next generation of the x86 architecture. Arriving first in our 2013 Intel® microarchitecture code name "Haswell", the new instructions accelerate a broad range of applications and usage models. Download the full Intel® Advanced Vector Extensions Programming Reference (319433-011).

These build upon the instructions coming in Intel® microarchitecture code name Ivy Bridge, such as the digital random number generator and the half-float (float16) conversion accelerators, and extend the Intel® Advanced Vector Extensions (Intel® AVX) that launched in 2011.

The instructions fit into the following categories:

AVX2 – integer data types expanded to 256-bit SIMD. AVX2's integer support is particularly useful for processing the visual data commonly encountered in consumer imaging and video processing workloads. With Haswell, we have both Intel® Advanced Vector Extensions (Intel® AVX) for floating point and AVX2 for integer data types.

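As a minimal sketch (assuming a compiler that exposes the AVX2 intrinsics via immintrin.h; the function name and the multiple-of-8 length are assumptions for brevity), a 256-bit integer loop might look like:

    #include <immintrin.h>

    /* Minimal sketch: add two arrays of 32-bit integers, eight lanes at a
       time. Assumes n is a multiple of 8; a real loop would also handle
       the remainder. */
    void add_i32(const int *a, const int *b, int *dst, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
            __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
            _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi32(va, vb));
        }
    }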

Bit Manipulation Instructions – useful for compressed databases, hashing, large-number arithmetic, and a variety of general-purpose code.

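One common idiom these instructions serve is walking the set bits of a mask. A sketch, assuming the BMI1 intrinsics and a hypothetical process() callback standing in for real work:

    #include <immintrin.h>

    /* Sketch: visit the index of every set bit in a 64-bit mask.
       process() is a hypothetical callback standing in for real work. */
    void for_each_set_bit(unsigned long long mask, void (*process)(unsigned))
    {
        while (mask) {
            process((unsigned)_tzcnt_u64(mask)); /* TZCNT: index of lowest set bit */
            mask = _blsr_u64(mask);              /* BLSR: clear lowest set bit */
        }
    }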

Gather – useful for vectorizing code with nonadjacent data elements. Haswell gathers are masked for safety (like the conditional loads and stores introduced in Intel® AVX), which favors their use in code with clipping or other conditionals.

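A hedged sketch of a masked gather (the function name and fallback-value convention are assumptions; lanes whose mask sign bit is clear keep the fallback value and their loads are never issued, which is what makes clipped access safe):

    #include <immintrin.h>

    /* Sketch: fetch table[idx[i]] for each lane whose mask sign bit is set;
       masked-off lanes keep the fallback value and are not loaded at all.
       Scale is 4 because the elements are 4-byte floats. */
    __m256 gather_masked(const float *table, __m256i idx,
                         __m256 mask, __m256 fallback)
    {
        return _mm256_mask_i32gather_ps(fallback, table, idx, mask, 4);
    }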

Any-to-Any Permutes – incredibly useful shuffle operations. Haswell adds support for DWORD- and QWORD-granularity permutes across an entire 256-bit register.

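For instance, a DWORD-granularity permute can reverse all eight lanes of a 256-bit register in a single operation, which earlier in-lane shuffles could not do. A minimal sketch using the VPERMPS intrinsic (the function name is an assumption):

    #include <immintrin.h>

    /* Sketch: reverse all eight DWORD lanes of a 256-bit register with a
       single any-to-any permute (VPERMPS); each destination lane may take
       its value from any source lane. */
    __m256 reverse_lanes(__m256 v)
    {
        const __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
        return _mm256_permutevar8x32_ps(v, idx);
    }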

Vector-Vector Shifts – shifts where each element's shift count comes from another vector register. These are critical in vectorizing loops with variable shifts.

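A minimal sketch (the function name is an assumption): each 32-bit lane is shifted by its own count, where previously every lane of a vector shift had to share a single count:

    #include <immintrin.h>

    /* Sketch: shift each 32-bit lane left by its own count (VPSLLVD). */
    __m256i shift_each(__m256i values, __m256i counts)
    {
        return _mm256_sllv_epi32(values, counts);
    }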

Floating-Point Multiply-Accumulate – our fused multiply-add significantly increases peak flops and provides improved precision, which further improves transcendental mathematics. These instructions are broadly usable in high-performance computing, professional-quality imaging, and face detection. They operate on scalar, 128-bit packed single- and double-precision, and 256-bit packed single- and double-precision data types. [These instructions were described previously, in the initial Intel® AVX specification.]

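A sketch of the FMA3 intrinsic form (assuming a compiler with FMA support enabled; the single rounding step at the end is where the extra precision comes from):

    #include <immintrin.h>

    /* Sketch: d = a*b + c, fused, with one rounding step at the end. */
    __m256 madd(__m256 a, __m256 b, __m256 c)
    {
        return _mm256_fmadd_ps(a, b, c);
    }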

The vector instructions build upon the expanded (256-bit) register state added in Intel® AVX, and as such are supported by any operating system that supports Intel® AVX.
For developers, please note that the instructions span multiple CPUID leaves. You should be careful to check all applicable bits before using these instructions; a sketch of such a check follows.
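A hedged sketch of that check, assuming GCC/Clang's <cpuid.h> helpers (other compilers expose CPUID differently): AVX2 is reported in leaf 7 subleaf 0, while FMA, AVX, and OSXSAVE are in leaf 1, and XGETBV confirms the OS actually saves YMM state:

    #include <cpuid.h>    /* GCC/Clang helper header; other compilers differ */
    #include <stdint.h>

    /* Sketch: AVX2 lives in CPUID leaf 7 subleaf 0 (EBX bit 5); FMA, AVX,
       and OSXSAVE live in leaf 1 (ECX bits 12, 28, 27); XGETBV confirms
       the OS saves XMM and YMM state on context switches. */
    static int cpu_supports_avx2_fma(void)
    {
        unsigned a, b, c, d;

        if (!__get_cpuid(1, &a, &b, &c, &d))
            return 0;
        if (!(c & (1u << 27)) || !(c & (1u << 28)) || !(c & (1u << 12)))
            return 0;                     /* OSXSAVE, AVX, FMA */

        uint32_t xcr0_lo, xcr0_hi;
        __asm__ ("xgetbv" : "=a"(xcr0_lo), "=d"(xcr0_hi) : "c"(0));
        if ((xcr0_lo & 0x6) != 0x6)       /* XMM and YMM state enabled */
            return 0;

        if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
            return 0;
        return (b & (1u << 5)) != 0;      /* AVX2 */
    }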
Please check out the specification and stay tuned for supporting tools over the next couple of months.

Mark Buxton
Software Engineer
Intel Corporation

For more complete information about compiler optimizations, see our Optimization Notice.

Comments

anonymous:

Nice, good work. Predictive 3D matrix manipulation?

Igor Levicki:

The only thing I want to know is: when will I be able to buy one?

anonymous:

3-operand GPR instructions are interesting. I can imagine the benefits. Shame they came so late: it will take a long time for them to be implemented in CPUs, then for developers to make use of them, and then for software to be able to rely on them (because it takes time for people to buy new hardware).

Error in document: 319433-011.pdf documents the MPSADBW instruction for AVX2, but its description still mentions xmm2, xmm3 and m128 (it should be ymm2, ymm3 and m256, respectively).

anonymous:

Thank you for the useful information and the informative review. Well done.

c0d1f1ed:

"I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU)."

Nothing is preventing the CPU from becoming a throughput computing device itself. AVX2 brings us FMA and gather, two features which used to be exclusive to the GPU! That only leaves competitive power efficiency. That can be achieved with AVX-1024 executed as a single uop on the existing 256-bit units. By sequencing the execution over 4 cycles, you get the benefits of in-order execution (plus latency hiding), without losing any of the out-of-order execution benefits for legacy workloads.

Heterogeneous computing has no future because the communication overhead does not scale. It's better to keep things local and execute parallel computations in AVX2 units capable of breaking up 1024-bit instructions. We're only a tiny step away from that.

amdn:

Phillip, you are right and I agree with you. Reading my post again I realize that my main point wasn't in the first paragraph; my mistake. My main point is that if we are going to get a high-data-rate random number source, it would be most useful for it to (optionally) be pseudo-random (a deterministic sequence given a seed).

There are several high-performance computing applications where a source of high-quality random numbers at a high data rate would be useful, and RDRAND can certainly be used for that, but it would be better, for those applications, if the process that generates those random numbers were deterministic (for testing and debugging). It appears that the implementation of RDRAND already has a fast PRNG internally... but only the non-deterministic output is available with the RDRAND instruction.

Maybe one can use the AES cryptography instructions or the CRC32 instruction to implement a fast PRNG. AES decryption in CBC mode with 128-bit keys takes about 20 cycles to generate a 16-byte block; that's 10 cycles to generate 8 bytes, and more cycles if you then want to use those bytes to produce a double-precision number in the range 0.0 to 1.0.

Ideally I would like to see an SSE/AVX instruction that loads a vector register with packed float16, float32, or float64 pseudo-random values, each in the range 0.0 to 1.0, with 1-cycle throughput and 1-cycle latency. In order for it to be deterministic it would have to be architected like the CRC32 instruction, where the state (seed) is kept in architected registers that are saved/restored on a context switch.
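As a rough illustration of the AES-based approach suggested above (assuming the AES-NI intrinsics; the seeding constant, two-round reduction, and function names are arbitrary placeholders, not a vetted generator):

    #include <immintrin.h>
    #include <stdint.h>

    /* Rough sketch of a deterministic counter-mode PRNG built on AES-NI:
       encrypt an incrementing counter with a couple of AES rounds to get
       a reproducible pseudo-random stream from a seed. Illustrative only. */
    static __m128i prng_ctr, prng_key;

    static void prng_seed(uint64_t seed)
    {
        prng_ctr = _mm_set_epi64x(0, (long long)seed);
        prng_key = _mm_set_epi64x((long long)(seed ^ 0x9E3779B97F4A7C15ULL),
                                  (long long)seed);
    }

    static __m128i prng_next16(void) /* 16 pseudo-random bytes per call */
    {
        prng_ctr = _mm_add_epi64(prng_ctr, _mm_set_epi64x(0, 1));
        __m128i x = _mm_xor_si128(prng_ctr, prng_key);  /* whiten counter */
        x = _mm_aesenc_si128(x, prng_key);              /* AES round 1 */
        x = _mm_aesenc_si128(x, prng_key);              /* AES round 2 */
        return x;
    }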

anonymous:

A search through the Intel® Advanced Vector Extensions Programming Reference (#319433-011, June 2011) reveals nothing: not a single reference to quad precision. All the vectorized formats are single or double precision. Basically, that means Newtonian and/or Maxwellian simulations; that's it.

Intel participated in the development of quads in IEEE 754R. The standard is meanwhile three years old and will be five in 2013. Some engineers within Intel understand the importance of quads; look at <http://www.intel.com/standards/floatingpoint.pdf>.

Let's be clear: quad precision is not a mass market. It is, however, an enabling technology, of importance to mankind. Is Intel's silence on the subject only a case of its deeply nurtured modesty?

sirrida:

It seems that the BMI instructions BEXTR, PDEP, and PEXT, which are described in the referenced document 319433-011, are missing from the list above. Are they about to be dropped from the proposed instruction set? In my opinion PDEP and PEXT are by far the most useful BMI instructions; all the other BMI instructions are easily replaceable.

I would like to see example applications where the motivation for some of these instructions is illustrated, e.g. where can I use these lowest-bit operations?
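One well-known answer, as a hedged sketch (assuming BMI2's PEXT via _pext_u64): decoding a 2-D Morton (Z-order) code collapses to one instruction per axis:

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch: decode a 2-D Morton (Z-order) code with PEXT by pulling the
       even bits into x and the odd bits into y, one instruction per axis
       instead of a long shift-and-mask cascade. */
    void morton_decode(uint64_t code, uint32_t *x, uint32_t *y)
    {
        *x = (uint32_t)_pext_u64(code, 0x5555555555555555ULL); /* even bits */
        *y = (uint32_t)_pext_u64(code, 0xAAAAAAAAAAAAAAAAULL); /* odd bits */
    }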


