Haswell New Instruction Descriptions Now Available!

Intel just released public details on the next generation of the x86 architecture. Arriving first in our 2013 Intel® microarchitecture code name "Haswell", the new instructions accelerate a broad range of applications and usage models. Download the full Intel® Advanced Vector Extensions Programming Reference (319433-011).

These instructions build upon those coming in Intel® microarchitecture code name Ivy Bridge, including the digital random number generator and the half-float (float16) conversion accelerators, and extend the Intel® Advanced Vector Extensions (Intel® AVX) that launched in 2011.

The instructions fit into the following categories:

AVX2 - Integer data types expanded to 256-bit SIMD. AVX2’s integer support is particularly useful for processing visual data commonly encountered in consumer imaging and video processing workloads. With Haswell, we have both Intel® Advanced Vector Extensions (Intel® AVX) for floating point, and AVX2 for integer data types.
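To make the lane-wise model concrete, here is a scalar sketch of what a 256-bit integer add (in the style of VPADDD) computes across its eight 32-bit lanes. The function name and layout are illustrative only; on Haswell hardware all eight additions happen in one instruction.

```c
#include <stdint.h>

/* Scalar model of a VPADDD-style 256-bit integer add: eight
   independent 32-bit additions, each wrapping mod 2^32.
   Illustrative only; AVX2 does all eight lanes in one instruction. */
void vpaddd_model(const uint32_t a[8], const uint32_t b[8], uint32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = a[i] + b[i];   /* each lane is independent */
}
```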


Bit manipulation instructions – useful for compressed databases, hashing, large-number arithmetic, and a variety of general-purpose codes.
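As one example of what these instructions do, here is a scalar sketch of the semantics of BMI2's parallel bit extract (PEXT): bits of the source selected by a mask are packed into the low bits of the result. The loop and function name are illustrative; on Haswell this is a single instruction.

```c
#include <stdint.h>

/* Scalar model of BMI2 PEXT (parallel bit extract): source bits at
   positions where mask has a 1 are packed contiguously into the low
   bits of the result. Illustrative sketch, not the real intrinsic. */
uint32_t pext32_model(uint32_t src, uint32_t mask) {
    uint32_t result = 0;
    int k = 0;                        /* next output bit position */
    for (int i = 0; i < 32; i++) {
        if (mask & (1u << i)) {
            if (src & (1u << i))
                result |= 1u << k;
            k++;
        }
    }
    return result;
}
```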


Gather – useful for vectorizing codes with nonadjacent data elements. Haswell gathers are masked for safety (like the conditional loads and stores introduced in Intel® AVX), which favors their use in codes with clipping or other conditionals.
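The masking behavior can be sketched in scalar C: only lanes whose mask element has its sign bit set are loaded, so an index in a masked-off lane is never dereferenced. This is a model of VPGATHERDD-style semantics, with illustrative names, not the actual intrinsic.

```c
#include <stdint.h>

/* Scalar model of a masked DWORD gather (VPGATHERDD-style): lanes
   whose mask element is negative (sign bit set) load base[idx[i]];
   masked-off lanes keep their previous value and their index is
   never dereferenced, which is the safety property described above. */
void gather_model(uint32_t dst[8], const uint32_t *base,
                  const int32_t idx[8], const int32_t mask[8]) {
    for (int i = 0; i < 8; i++)
        if (mask[i] < 0)              /* lane active */
            dst[i] = base[idx[i]];
}
```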


Any-to-Any permutes – incredibly useful shuffling operations. Haswell adds support for DWORD and QWORD granularity permutes across an entire 256-bit register.
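The "any-to-any" property means each output lane can select any input lane. A scalar sketch of VPERMD-style DWORD permute semantics (names illustrative):

```c
#include <stdint.h>

/* Scalar model of an any-to-any DWORD permute (VPERMD-style): each
   output lane picks any of the eight input lanes, so one operation
   can shuffle across the full 256-bit register. */
void vpermd_model(const uint32_t src[8], const uint32_t sel[8], uint32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = src[sel[i] & 7];   /* low 3 bits of each selector are used */
}
```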


Vector-Vector Shifts – shift instructions where the shift count for each element comes from another vector. These are critical for vectorizing loops with variable shift amounts.
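A scalar sketch of the per-lane semantics, in the style of VPSLLVD (names illustrative): each lane shifts by its own count, and counts of 32 or more zero the lane, matching the documented AVX2 behavior.

```c
#include <stdint.h>

/* Scalar model of a variable vector shift (VPSLLVD-style): each
   32-bit lane is shifted left by its own per-lane count; counts of
   32 or more produce zero, as AVX2 documents. */
void vpsllvd_model(const uint32_t a[8], const uint32_t cnt[8], uint32_t out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (cnt[i] < 32) ? (a[i] << cnt[i]) : 0;
}
```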


Floating Point Multiply Accumulate – our floating-point multiply-accumulate instructions significantly increase peak flops and provide improved precision, which further helps transcendental mathematics. They are broadly usable in high-performance computing, professional-quality imaging, and face detection. They operate on scalar, 128-bit packed single- and double-precision, and 256-bit packed single- and double-precision data types. [These instructions were described previously, in the initial Intel® AVX specification.]


The vector instructions build upon the expanded (256-bit) register state added in Intel® AVX, and as such are supported by any operating system that supports Intel® AVX.
For developers, please note that the instructions span multiple CPUID leaves. You should be careful to check all applicable feature bits before using these instructions.
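With GCC or Clang, one convenient way to perform those run-time checks is `__builtin_cpu_supports`, which consults the relevant CPUID bits for you. A sketch, assuming a GCC-compatible compiler on x86; the function name is mine, and which features you test should match the instructions you actually use:

```c
/* Run-time feature check with GCC/Clang's __builtin_cpu_supports.
   AVX2, BMI2, and FMA are reported in different CPUID leaves, so
   each must be tested individually before using the instructions. */
int have_haswell_simd(void) {
    return __builtin_cpu_supports("avx2")
        && __builtin_cpu_supports("bmi2")
        && __builtin_cpu_supports("fma");
}
```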
Please check out the specification and stay tuned for supporting tools over the next couple of months.

Mark Buxton
Software Engineer
Intel Corporation



anonymous:

@amdn: If *everything* is not pushed down at a high data rate, then whatever is not will slow down everything else, and (weakest link) things will run at the rate of the slowest instruction in the mix. Standing alone, a high data rate may not be needed for crypto, but I would hate to see it slow everything down just because no one could predict that someone would run crypto and a graphics program at the same time. Two programs at once?? Who would want to do that :)

amdn:

Do cryptographic applications need a high-data-rate source of random numbers? They unquestionably need high-quality, non-deterministic, hardware-generated random numbers, but is there a need for these at a high data rate?

One application that could use high quality pseudo-random numbers is Monte Carlo simulations. According to Wikipedia "Many of the most useful techniques use deterministic, pseudorandom sequences, making it easy to test and re-run simulations."

The RDRAND instruction will be useful in cryptography because of its non-determinism and high quality, but maybe not for its high data rate.
The RDRAND instruction will be useful in Monte Carlo simulations because of its high quality and high data rate, but sadly the non-determinism will not satisfy the need to, as Wikipedia put it, "make it easy to test and re-run simulations."

It is tantalizing though, to read that the hardware implementation of RDRAND actually has a deterministic PRNG as one of the components, but it gets frequently reset non-deterministically.

I suspect the high performance computing community will use a software PRNG for Monte Carlo simulations during the debugging stages of the software, at least, and may not take advantage of RDRAND even in production.
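The reproducibility pattern this comment describes, a deterministic software PRNG with a fixed seed during debugging, can be sketched with any small generator; xorshift32 is a well-known example. Everything here is illustrative: in production the seed could instead come from a hardware source such as RDRAND, trading reproducibility for non-determinism.

```c
#include <stdint.h>

/* xorshift32: a tiny deterministic PRNG. Seeding it with a fixed
   constant makes a Monte Carlo run exactly reproducible; seeding it
   from hardware entropy (e.g. RDRAND) would not be. State must be
   nonzero. Illustrative sketch, not a cryptographic generator. */
uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}
```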

anonymous:

Will there be an AVX2 emulation header available? Any date?

Nick Black:

It's good to see FMA in this iteration. Finally combining that with the 256-bit AVX YMM registers will be tasty.

anonymous:

Can we get 'order of magnitude' indications on scatter/gather latency and throughput? I know this is a long way from silicon, but it would be really helpful if we could know ahead of time whether scatter/gather are internally equivalent to a big bunch of load/store uops or whether they'll happen more aggressively parallel somehow. Either way it's exciting but it would be nice to know.

It would be interesting to know this information even for Knights Corner / Larrabee.

anonymous:

I am looking for instructions that can speed up the contour-dot algorithm for edge detection that is described in my open-source project http://outliner.codeplex.com/
It would be great to have instructions that help with statistics calculations: finding the average, the averages of squares, cubes, and fourth powers, the correlation between two rows of numbers, etc.

anonymous:

"GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture."

I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU).

c0d1f1ed:

This is downright revolutionary. Finally the SIMD instruction set will feature a vector equivalent of every (relevant) scalar instruction. This allows parallelizing lots of code loops, in theory offering up to an eightfold increase in throughput computing performance. GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture.

anonymous:

Wah, interesting.

I wish I had more than one brain; one is simply not enough to follow all this development :P

