Intel® AVX2 optimization in Intel® MKL

Haswell is the codename for the next-generation x86 processor microarchitecture (a "tock"). This architecture is expected in 2013. Haswell's new instructions accelerate a broad category of applications and usage models. The full instruction descriptions are available in the Intel® Advanced Vector Extensions Programming Reference (319433). The new instruction set builds upon the instructions of the Intel® microarchitecture code-named Ivy Bridge, including the digital random number generator, half-float (float16) accelerators, and an extended set of Intel® Advanced Vector Extensions (Intel® AVX) instructions.
The instructions fit into the following categories:

AVX2 - Integer data types expanded to 256-bit SIMD. AVX2's integer support is particularly useful for processing visual data commonly encountered in consumer imaging and video processing workloads. With Haswell, we have both Intel® Advanced Vector Extensions (Intel® AVX) for floating point data types as well as AVX2 for integer data types.
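As a rough illustration of 256-bit integer SIMD on image data, the sketch below brightens 8-bit pixels with a saturating add, 32 pixels per register. The helper names `brighten` and `brighten_avx2` are hypothetical, and the code assumes a GCC- or Clang-compatible compiler (for the `target` attribute and `__builtin_cpu_supports`); it falls back to scalar code when AVX2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: 32 unsigned 8-bit pixels per 256-bit register. */
__attribute__((target("avx2")))
static size_t brighten_avx2(uint8_t *px, size_t n, uint8_t delta)
{
    const __m256i d = _mm256_set1_epi8((char)delta);
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(px + i));
        v = _mm256_adds_epu8(v, d);          /* saturating add, clamps at 255 */
        _mm256_storeu_si256((__m256i *)(px + i), v);
    }
    return i;                                /* number of pixels processed */
}
#endif

/* Hypothetical helper: brighten n pixels in place, clamping at 255. */
void brighten(uint8_t *px, size_t n, uint8_t delta)
{
    size_t i = 0;
    assert(px != NULL || n == 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        i = brighten_avx2(px, n, delta);
#endif
    for (; i < n; ++i) {                     /* scalar tail and fallback */
        unsigned s = (unsigned)px[i] + delta;
        px[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

Calling `brighten(pixels, count, 100)` adds 100 to every pixel and clamps results at 255 instead of letting them wrap around.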

Bit manipulation instructions are useful for compressed databases, hashes, large number arithmetic, and a variety of general purpose codes.
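One typical use, sketched here under assumptions, is walking the set bits of a bitmap such as a compressed index. The example uses the `_tzcnt_u32` and `_blsr_u32` intrinsics; the helper names are hypothetical, and a GCC- or Clang-compatible compiler is assumed. A portable loop serves as the fallback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* BMI path: TZCNT finds the lowest set bit, BLSR clears it. */
__attribute__((target("bmi")))
static size_t set_bit_positions_bmi(uint32_t mask, unsigned *pos)
{
    size_t k = 0;
    while (mask != 0) {
        pos[k++] = _tzcnt_u32(mask);   /* index of the lowest set bit */
        mask = _blsr_u32(mask);        /* clear the lowest set bit */
    }
    return k;
}
#endif

/* Hypothetical helper: write the indices of the set bits of mask to pos
   (room for 32 entries) in ascending order; return how many were written. */
size_t set_bit_positions(uint32_t mask, unsigned *pos)
{
    size_t k = 0;
    assert(pos != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("bmi"))
        return set_bit_positions_bmi(mask, pos);
#endif
    for (unsigned b = 0; b < 32; ++b)  /* portable fallback */
        if ((mask >> b) & 1u)
            pos[k++] = b;
    return k;
}
```

The BMI loop runs once per set bit rather than once per bit position, which is where the benefit for sparse bitmaps would come from.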

Gather instructions are useful for vectorized code that accesses non-adjacent data elements. Haswell gather operations are mask-based for safety (like the conditional loads and stores introduced in Intel® AVX). This masking makes gathers well suited to code that clips values, clamps at boundaries, or performs similar conditional operations.
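The following sketch shows how a masked gather might implement a bounds-checked table lookup: lanes whose index falls outside the table are masked off, so they are never loaded and instead receive a fallback value. The helper names and the fallback convention are assumptions made for this example, and a GCC- or Clang-compatible compiler is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: look up 8 indices at once; out-of-range lanes are masked off. */
__attribute__((target("avx2")))
static void lookup8_avx2(const int32_t *table, int32_t table_len,
                         const int32_t *idx, int32_t fallback, int32_t *out)
{
    __m256i vi   = _mm256_loadu_si256((const __m256i *)idx);
    __m256i ge0  = _mm256_cmpgt_epi32(vi, _mm256_set1_epi32(-1));        /* idx >= 0  */
    __m256i ltn  = _mm256_cmpgt_epi32(_mm256_set1_epi32(table_len), vi); /* idx < len */
    __m256i mask = _mm256_and_si256(ge0, ltn);
    __m256i src  = _mm256_set1_epi32(fallback);
    /* Masked-off lanes keep the value from src and perform no memory access. */
    __m256i v = _mm256_mask_i32gather_epi32(src, (const int *)table, vi, mask, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
#endif

/* Hypothetical helper: out[i] = table[idx[i]] if idx[i] is in range,
   otherwise fallback. */
void lookup(const int32_t *table, int32_t table_len,
            const int32_t *idx, size_t n, int32_t fallback, int32_t *out)
{
    size_t i = 0;
    assert(table_len >= 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        for (; i + 8 <= n; i += 8)
            lookup8_avx2(table, table_len, idx + i, fallback, out + i);
#endif
    for (; i < n; ++i)                       /* scalar tail and fallback */
        out[i] = (idx[i] >= 0 && idx[i] < table_len) ? table[idx[i]] : fallback;
}
```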

Any-to-Any permutes are very useful shuffle operations. Haswell adds support for DWORD and QWORD granularity and allows permutes across an entire 256-bit register.
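A minimal sketch of a DWORD-granularity permute across the full register, using the `_mm256_permutevar8x32_epi32` intrinsic. The wrapper `permute8` is a hypothetical name, and a GCC- or Clang-compatible compiler is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: output lane i takes input lane (perm[i] & 7), and the source
   lane may come from either 128-bit half of the register. */
__attribute__((target("avx2")))
static void permute8_avx2(const int32_t *in, const int32_t *perm, int32_t *out)
{
    __m256i v   = _mm256_loadu_si256((const __m256i *)in);
    __m256i idx = _mm256_loadu_si256((const __m256i *)perm);
    _mm256_storeu_si256((__m256i *)out, _mm256_permutevar8x32_epi32(v, idx));
}
#endif

/* Hypothetical helper: out[i] = in[perm[i] & 7] for 8 DWORD elements.
   in and out may refer to the same array. */
void permute8(const int32_t *in, const int32_t *perm, int32_t *out)
{
    int32_t tmp[8];
    assert(in != NULL && perm != NULL && out != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        permute8_avx2(in, perm, out);
        return;
    }
#endif
    for (int i = 0; i < 8; ++i)              /* scalar fallback */
        tmp[i] = in[perm[i] & 7];
    for (int i = 0; i < 8; ++i)
        out[i] = tmp[i];
}
```

With an index vector of `{7, 6, 5, 4, 3, 2, 1, 0}` this reverses all eight elements, which 128-bit in-lane shuffles cannot do in a single instruction.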

Vector-Vector Shifts are added to shift vectors where the shift amount of each element is controlled by a second vector. These are critical in vectorized loops with variable shifts.
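A loop with a per-element shift count could be vectorized along the lines of the sketch below, which uses `_mm256_sllv_epi32`. The helper names are hypothetical, and a GCC- or Clang-compatible compiler is assumed. The scalar fallback mirrors the instruction's behavior of producing 0 when a count is 32 or larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: each of the 8 lanes is shifted left by its own count. */
__attribute__((target("avx2")))
static size_t shl_var_avx2(const uint32_t *val, const uint32_t *cnt,
                           size_t n, uint32_t *out)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(val + i));
        __m256i c = _mm256_loadu_si256((const __m256i *)(cnt + i));
        _mm256_storeu_si256((__m256i *)(out + i), _mm256_sllv_epi32(v, c));
    }
    return i;                                /* number of elements processed */
}
#endif

/* Hypothetical helper: out[i] = val[i] << cnt[i], or 0 when cnt[i] >= 32. */
void shl_var(const uint32_t *val, const uint32_t *cnt, size_t n, uint32_t *out)
{
    size_t i = 0;
    assert(n == 0 || (val != NULL && cnt != NULL && out != NULL));
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2"))
        i = shl_var_avx2(val, cnt, n, out);
#endif
    for (; i < n; ++i)                       /* scalar tail and fallback */
        out[i] = cnt[i] < 32u ? val[i] << cnt[i] : 0u;
}
```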

Floating Point Multiply Accumulate - Our floating-point multiply accumulate instructions significantly increase peak flops and improve precision, which in turn benefits transcendental mathematics. These instructions are broadly usable in high performance computing, professional quality imaging, and face detection. They operate on scalars, 128-bit packed single and double precision data types, and 256-bit packed single and double-precision data types. [These instructions were described previously, in the initial Intel® AVX specification].
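To make the multiply-accumulate idea concrete, here is a sketch of y = a*x + y on 256-bit packed doubles using `_mm256_fmadd_pd`, which computes the product and the sum with a single rounding. The helper names are hypothetical, and a GCC- or Clang-compatible compiler is assumed. This is not how Intel MKL implements its kernels; it only illustrates the instruction.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* FMA path: four doubles per 256-bit register, one fused operation each. */
__attribute__((target("avx2,fma")))
static size_t axpy_fma(double a, const double *x, double *y, size_t n)
{
    const __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        _mm256_storeu_pd(y + i, _mm256_fmadd_pd(va, vx, vy)); /* a*x + y */
    }
    return i;                                /* number of elements processed */
}
#endif

/* Hypothetical helper: y[i] = a * x[i] + y[i] for i in [0, n). */
void axpy(double a, const double *x, double *y, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (x != NULL && y != NULL));
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        i = axpy_fma(a, x, y, n);
#endif
    for (; i < n; ++i)                       /* scalar tail and fallback */
        y[i] = a * x[i] + y[i];
}
```

Because the fused path rounds once and the scalar path may round twice, results can differ in the last bit for inputs that are not exactly representable.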


Intel® MKL 11.0 fully supports AVX2; additional optimizations are available in the following functions.

Basic Linear Algebra Subprograms (BLAS)

• xHER2K

Discrete Fourier transform (DFT)

• 1D, power-of-2

• 2D, power-of-2

• 3D, power-of-2

• 1D, non-power-of-2

• 2D, non-power-of-2

• 3D, non-power-of-2

Sparse BLAS

• dcsrmm

• scsrmm

• dcoomm

• scoomm

Vector Statistical Library (VSL)
• MRG32k3a


Intel® AVX optimization in Intel® MKL


Intel® Advanced Vector Extensions Programming Reference
