Haswell New Instruction Descriptions Now Available!
By Mark Buxton (Intel) (2 posts) on June 13, 2011 at 8:52 am
Intel just released public details on the next generation of the x86 architecture. Arriving first in our 2013 Intel® microarchitecture code name “Haswell”, the new instructions accelerate a broad category of applications and usage models. Download the full Intel® Advanced Vector Extensions Programming Reference (319433-011).
These build upon the instructions coming in Intel® microarchitecture code name Ivy Bridge, including the digital random number generator and half-float (float16) accelerators, and extend the Intel® Advanced Vector Extensions (Intel® AVX) that launched in 2011.
The instructions fit into the following categories:
AVX2: Integer data types expanded to 256-bit SIMD. AVX2’s integer support is particularly useful for processing visual data commonly encountered in consumer imaging and video processing workloads. With Haswell, we have both Intel® Advanced Vector Extensions (Intel® AVX) for floating point and AVX2 for integer data types.
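
As a rough illustration of what 256-bit integer SIMD means for imaging code, the sketch below is a scalar C model of a single unsigned saturating byte add across 32 lanes, the operation VPADDUSB performs on a YMM register. It is not an intrinsic or a real API; the name `paddusb256_model` is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 8-bit lanes in one 256-bit register. */
enum { AVX2_BYTES = 32 };

/* Scalar model (hypothetical helper, not an intrinsic) of a 256-bit
   unsigned saturating byte add: 32 independent lanes, each clamped
   at 255 instead of wrapping around. */
void paddusb256_model(uint8_t dst[AVX2_BYTES],
                      const uint8_t a[AVX2_BYTES],
                      const uint8_t b[AVX2_BYTES])
{
    for (size_t i = 0; i < AVX2_BYTES; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Brightening 8-bit pixels is a typical use: adding a constant to every channel must clamp at white rather than wrap to black, and with 256-bit registers 32 such pixels are handled per instruction instead of 16.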

Bit manipulation instructions: Useful for compressed databases, hashing, large-number arithmetic, and a variety of general-purpose codes.
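
To give a feel for where the lowest-set-bit operations help, here is a small portable C sketch that walks the set bits of a bitmap. The three `*_model` helpers are hypothetical names; they mirror the arithmetic identities behind BLSI, BLSR and TZCNT rather than calling the instructions themselves.

```c
#include <stdint.h>

/* Portable equivalents of three bit-manipulation operations
   (hypothetical helper names, written as plain C). */

/* BLSI: isolate the lowest set bit. */
uint32_t blsi_model(uint32_t x) { return x & (0u - x); }

/* BLSR: clear the lowest set bit. */
uint32_t blsr_model(uint32_t x) { return x & (x - 1u); }

/* TZCNT: count trailing zero bits; returns 32 when x is 0. */
uint32_t tzcnt_model(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0) return 32;
    while ((x & 1u) == 0) { x >>= 1; ++n; }
    return n;
}

/* Store the positions of the set bits of `bitmap` in out[], lowest
   first, and return how many there were. */
int set_bit_positions(uint32_t bitmap, int out[32])
{
    int count = 0;
    while (bitmap != 0) {
        out[count++] = (int)tzcnt_model(blsi_model(bitmap));
        bitmap = blsr_model(bitmap);
    }
    return count;
}
```

Loops of this shape show up when scanning occupancy bitmaps or compressed index structures; each helper corresponds to one instruction on hardware that has it.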

Gather: Useful for vectorizing codes with nonadjacent data elements. Haswell gathers are masked for safety (like the conditional loads and stores introduced in Intel® AVX), which favors their use in codes with clipping or other conditionals.
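
The masking behavior is easier to see in code. The following is a scalar C model, under the assumption of eight 32-bit elements with 32-bit indices (the VPGATHERDD form on a YMM register); `gather_dd_model` is a hypothetical name, not an intrinsic. Lanes whose mask element has its top bit clear never touch memory and keep their old destination value, and the mask is cleared as elements complete.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Eight 32-bit lanes in one 256-bit register. */
enum { GATHER_LANES = 8 };

/* Scalar model (hypothetical helper) of a masked 32-bit gather.
   For each lane whose mask element has its most significant bit set,
   load 4 bytes from base + index * scale. Masked-off lanes are not
   read from memory and keep their previous dst value. The whole mask
   ends up zero, as it does when the instruction completes. */
void gather_dd_model(int32_t dst[GATHER_LANES], const void *base,
                     const int32_t index[GATHER_LANES],
                     uint32_t mask[GATHER_LANES], int scale)
{
    const unsigned char *p = (const unsigned char *)base;
    for (int i = 0; i < GATHER_LANES; ++i) {
        if (mask[i] & 0x80000000u) {
            memcpy(&dst[i], p + (ptrdiff_t)index[i] * scale, sizeof dst[i]);
        }
        mask[i] = 0;
    }
}
```

In clipped code, the mask would typically come from a vector comparison of the indices against the valid range, so out-of-range lanes are simply skipped.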

Any-to-any permutes: Incredibly useful shuffling operations. Haswell adds support for DWORD and QWORD granularity permutes across an entire 256-bit register.
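
A minimal scalar sketch of a DWORD-granularity any-to-any permute (the VPERMD form), assuming that only the low three bits of each index select the source element. `permd_model` is a hypothetical name used for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model (hypothetical helper) of a full-width DWORD permute:
   every destination element may take any of the eight source
   elements, selected by the low 3 bits of the matching index. */
void permd_model(uint32_t dst[8], const uint32_t src[8],
                 const uint32_t idx[8])
{
    uint32_t tmp[8];                 /* temporary so dst may alias src */
    for (int i = 0; i < 8; ++i) {
        tmp[i] = src[idx[i] & 7u];
    }
    memcpy(dst, tmp, sizeof tmp);
}
```

With indices 7 down to 0 this reverses the register, and with all indices equal it broadcasts one element; both cross the 128-bit boundary that constrains most in-lane shuffles.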

Vector-vector shifts: We added shifts whose per-element shift counts come from a vector register. These are critical in vectorizing loops with variable shifts.
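
The loop below is a scalar C model of a per-element logical left shift on eight 32-bit lanes (the VPSLLVD form). The name `sllv_model` is hypothetical. Note the explicit handling of counts of 32 or more, which produce zero in the vector instruction but would be undefined behavior for a plain C shift.

```c
#include <stdint.h>

/* Scalar model (hypothetical helper) of a variable logical left
   shift: each 32-bit lane is shifted by its own count, and counts of
   32 or more yield 0. */
void sllv_model(uint32_t dst[8], const uint32_t a[8],
                const uint32_t count[8])
{
    for (int i = 0; i < 8; ++i) {
        dst[i] = count[i] < 32u ? a[i] << count[i] : 0u;
    }
}
```

Without such an instruction, a loop like `out[i] = in[i] << n[i]` cannot be vectorized directly, because the older packed shifts apply one count to every lane.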

Floating-point multiply-accumulate: Our floating-point multiply-accumulate instructions significantly increase peak flops and provide improved precision to further improve transcendental mathematics. They are broadly usable in high performance computing, professional quality imaging, and face detection. They operate on scalar, 128-bit packed, and 256-bit packed single- and double-precision data types. [These instructions were described previously, in the initial Intel® AVX specification.]
The vector instructions build upon the expanded (256-bit) register state added in Intel® AVX, and as such are supported by any operating system that supports Intel® AVX.
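
A small worked example of the precision point, assuming IEEE 754 double precision with round-to-nearest; here rn(·) denotes rounding to the nearest double, and the values are chosen purely for illustration. A fused multiply-accumulate rounds once, so it keeps a low-order term that a separate multiply followed by an add has already discarded.

```latex
\begin{aligned}
a &= 1 + 2^{-27}, \qquad c = -\left(1 + 2^{-26}\right), \qquad
a \cdot a = 1 + 2^{-26} + 2^{-54} \\[4pt]
\text{multiply, then add:}\quad
&\operatorname{rn}\bigl(\operatorname{rn}(a \cdot a) + c\bigr)
  = \operatorname{rn}\bigl((1 + 2^{-26}) + c\bigr) = 0 \\[4pt]
\text{fused multiply-add:}\quad
&\operatorname{rn}(a \cdot a + c) = 2^{-54}
\end{aligned}
```

The term 2^{-54} is a quarter of the spacing between doubles near 1, so the separate multiply rounds it away, while the fused form delivers it exactly.
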
For developers, please note that the instructions span multiple CPUID leaves. You should be careful to check all applicable bits before using these instructions.
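
As a sketch of what "check all applicable bits" can look like, the C function below decodes feature flags from raw CPUID and XGETBV register values that the caller has already read. The struct and function names are hypothetical, and how the registers are obtained is compiler-specific and left out. The bit positions follow the published CPUID definitions as I understand them and should be verified against the current programming reference.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw register values supplied by the caller (hypothetical struct).
   Pass 0 for a leaf the processor does not report. */
struct cpu_regs {
    uint32_t leaf1_ecx;  /* CPUID.(EAX=01H):ECX                     */
    uint32_t leaf7_ebx;  /* CPUID.(EAX=07H, ECX=0):EBX              */
    uint32_t ext1_ecx;   /* CPUID.(EAX=80000001H):ECX               */
    uint64_t xcr0;       /* XGETBV(0); meaningful only with OSXSAVE */
};

struct haswell_features {
    bool avx2, fma, bmi1, bmi2, lzcnt;
};

static bool bit(uint64_t value, int n) { return (value >> n) & 1u; }

/* Decode the flags. The 256-bit vector features additionally require
   that the OS has enabled XMM and YMM state in XCR0 (bits 1 and 2). */
struct haswell_features decode_features(const struct cpu_regs *r)
{
    struct haswell_features f = { false, false, false, false, false };
    bool osxsave     = bit(r->leaf1_ecx, 27);
    bool avx         = bit(r->leaf1_ecx, 28);
    bool ymm_enabled = osxsave && (r->xcr0 & 0x6u) == 0x6u;
    bool avx_usable  = avx && ymm_enabled;

    f.fma   = avx_usable && bit(r->leaf1_ecx, 12); /* leaf 1 */
    f.avx2  = avx_usable && bit(r->leaf7_ebx, 5);  /* leaf 7 */
    f.bmi1  = bit(r->leaf7_ebx, 3);                /* leaf 7 */
    f.bmi2  = bit(r->leaf7_ebx, 8);                /* leaf 7 */
    f.lzcnt = bit(r->ext1_ecx, 5);                 /* extended leaf */
    return f;
}
```

Keeping the decoding separate from the register reads makes the logic easy to unit-test, and it shows why a single bit is not enough: FMA, AVX2 and LZCNT each live in a different leaf.
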
Please check out the specification and stay tuned for supporting tools over the next couple of months.
Mark Buxton
Software Engineer
Intel Corporation
Categories: Graphics & Media, Parallel Programming, Performance and Optimization
Comments (19)
June 14, 2011 3:51 AM PDT
c0d1f1ed
This is downright revolutionary. Finally the SIMD instruction set will feature a vector equivalent of every (relevant) scalar instruction. This allows parallelizing lots of code loops, in theory offering up to an eightfold increase in throughput computing performance. GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture.
June 15, 2011 12:21 AM PDT
MythBuster
"GPGPU is history, as Haswell will prove that mainstream CPUs can combine the power of GPUs into a superior homogeneous architecture." I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU).
June 20, 2011 12:52 AM PDT
Wladik
I am looking for commands that can speed up the contour dot algorithm of edge detection that is described in my open-source project (http://outliner.codeplex.com/). It would be great to have commands that help with statistics calculations: finding the average, the average of squares, cubes, and 4th powers, the correlation between two rows of numbers, etc.
June 20, 2011 8:05 PM PDT
Geoff Langdale
Can we get 'order of magnitude' indications on scatter/gather latency and throughput? I know this is a long way from silicon, but it would be really helpful if we could know ahead of time whether scatter/gather are internally equivalent to a big bunch of load/store uops or whether they'll happen more aggressively parallel somehow. Either way it's exciting but it would be nice to know. It would be interesting to know this information even for Knights Corner / Larrabee.
June 26, 2011 2:53 PM PDT
Nick Black
It's good to see FMA in this iteration. Finally combining that with the 256-bit AVX YMM registers will be tasty.
June 27, 2011 4:18 AM PDT
Michael
Will there be an AVX2 emulation header available? If so, when?
June 30, 2011 3:14 PM PDT
amdn
Do cryptographic applications need a high-data-rate source of random numbers? They unquestionably need high-quality, non-deterministic, hardware-generated random numbers, but is there a need for these at a high data rate? One application that could use high-quality pseudo-random numbers is Monte Carlo simulation. According to Wikipedia, "Many of the most useful techniques use deterministic, pseudorandom sequences, making it easy to test and re-run simulations." The RDRAND instruction will be useful in cryptography because of its non-determinism and high quality, but maybe not for its high data rate. The RDRAND instruction will be useful in Monte Carlo simulations because of its high quality and high data rate, but sadly the non-determinism will not satisfy the need to, as Wikipedia put it, "make it easy to test and re-run simulations." It is tantalizing, though, to read that the hardware implementation of RDRAND actually has a deterministic PRNG as one of its components, but it gets frequently reset non-deterministically. I suspect the high performance computing community will use a software PRNG for Monte Carlo simulations during the debugging stages of the software, at least, and may not take advantage of RDRAND even in production.
July 1, 2011 2:52 AM PDT
Phillip Wayne
@amdn: If *everything* is not pushed down at a high data rate, that means that whatever is not will slow down whatever is, and (weakest link) things will run at the slowest rate of a mix of instructions. Standing alone, a high data rate may not be needed for crypto, but I would hate to see it slow everything down just because no one could predict that someone would be running crypto and a graphics program at the same time. Two programs at once?? Who would want to do that :)
July 1, 2011 3:11 AM PDT
sirrida
It seems that the BMI commands BEXTR, PDEP and PEXT, which are described in the referenced document 319433-011, are missing. Are they about to be dropped from the proposed instruction set? In my opinion PDEP and PEXT are by far the most usable BMI commands; all other BMI commands are easily replaceable. I would like to see example applications that illuminate the motivation for some of the commands, e.g. where can I use these lowest-bit operations?
July 1, 2011 6:37 AM PDT
Q-FP
A search through Intel's Advanced Vector Extensions Programming Reference (#319433-011, June 2011) reveals nothing. Not a single reference to quad precision. All the vectorized quads are of single or double precision. Basically, it means Newtonian and/or Maxwellian simulations. That's it. Intel participated in the development of quads in IEEE 754R. The standard is now 3 years old and will be 5 in 2013. Some engineers within Intel understand the quads' importance. Look: <http://www.intel.com/standards/floatingpoint.pdf>. To be sure, quad precision is not a mass market. It is, however, an enabling technology, of importance to mankind. Is Intel's silence on the subject only a case of its deeply nurtured modesty?
July 1, 2011 10:50 AM PDT
Narayan
Happy day! Good wave programming. Thanks.
July 3, 2011 11:19 AM PDT
amdn
Phillip, you are right and I agree with you. Reading my post again, I realize that my main point wasn't in the first paragraph; my mistake. My main point is that if we are going to get a high-data-rate random number source, it would be most useful for it to (optionally) be pseudo-random (a deterministic sequence given a seed). There are several high performance computing applications where having a source of high-quality random numbers at a high data rate would be useful, and RDRAND can certainly be used for that, but it would be better, for those applications, if the process that generates those random numbers were deterministic (for testing and debugging). It appears that the implementation of RDRAND already has a fast PRNG internally... but only the non-deterministic output is available with the RDRAND instruction. Maybe one can use the AES cryptography instructions or the CRC32 instruction to implement a fast PRNG. AES decryption in CBC mode with 128-bit keys takes about 20 cycles to generate a 16-byte block; that's 10 cycles to generate 8 bytes, and more cycles if you then want to use those bytes to produce a double-precision number in the range 0.0 to 1.0. Ideally I would like to see an SSE/AVX instruction that loads a vector register with packed float16, float32, or float64 pseudo-random values, each in the range 0.0 to 1.0, with 1-cycle throughput and 1-cycle latency. In order for it to be deterministic it would have to be architected like the CRC32 instruction, where the state (seed) is in architected registers that are saved/restored on a context switch.
July 3, 2011 10:57 PM PDT
c0d1f1ed
"I really doubt that. Heterogeneous IS the way to go if you want throughput computing done right (thanks to GPGPU)." Nothing is preventing the CPU from becoming a throughput computing device itself. AVX2 brings us FMA and gather, two features which used to be exclusive to the GPU! That only leaves competitive power efficiency. That can be achieved with AVX-1024 executed as a single uop on the existing 256-bit units. By sequencing the execution over 4 cycles, you get the benefits of in-order execution (plus latency hiding), without losing any of the out-of-order execution benefits for legacy workloads. Heterogeneous computing has no future because the communication overhead does not scale. It's better to keep things local and execute parallel computations in AVX2 units capable of breaking up 1024-bit instructions. We're only a tiny step away from that.
July 5, 2011 3:50 AM PDT
Software development
Thank you for useful information and informative review, well done.
August 9, 2011 12:21 AM PDT
Madis Kalme
3-operand GPR instructions are interesting. I can imagine the benefits. Shame they came so late. Now it will take a lot of time for them to be implemented in CPUs, then for developers to make use of them, and then for them to be usable in software (because it takes time for people to buy new stuff). Error in document: 319433-011.pdf documents the instruction MPSADBW with AVX2, but the comment still mentions xmm2, xmm3 and m128 (there should be ymm2, ymm3 and m256 respectively).
September 19, 2011 10:48 PM PDT
Igor Levicki
The only thing I want to know is when I will be able to buy one?
April 22, 2012 5:33 AM PDT
Lary
Nice, good works. ???? Predictive 3D matrix manipulation ?????
Ajax
I wish I had more than one brain; one is simply not enough to follow all that development :P