Intel ISA Extensions

unaligned loads avx-128 vs. -256

I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to compile the SSE intrinsic to AVX-128).

Does this mean that advice to consider AVX-128 will soon be of only historical value? I'm ready to designate my Westmere and corei7 Linux boxes as historic vehicles.
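For readers who want to reproduce the comparison, here is a minimal sketch of the two load widths on the same unaligned data. It is not the code from the post: the function names `sum_loadu_128` and `sum_loadu_256` are made up for illustration, and the `target("avx")` attribute is a GCC/Clang convenience (an assumption about the toolchain) so that the file builds without `-mavx`. Under that attribute `_mm_loadu_ps` is encoded as AVX-128 (`vmovups xmm`), which matches the AVX-128 case described above. Running it still requires a CPU with AVX.

```c
#include <immintrin.h>
#include <stddef.h>

/* Sum n floats using 128-bit unaligned loads (4 floats per load). */
__attribute__((target("avx")))
float sum_loadu_128(const float *x, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i)          /* scalar tail */
        s += x[i];
    return s;
}

/* Sum n floats using 256-bit unaligned loads (8 floats per load). */
__attribute__((target("avx")))
float sum_loadu_256(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float s = 0.0f;
    for (int k = 0; k < 8; ++k)
        s += lanes[k];
    for (; i < n; ++i)          /* scalar tail */
        s += x[i];
    return s;
}
```

Timing both functions on a large buffer whose start is offset by one float from an aligned address is one way to check which width wins on a given CPU. The sketch itself makes no claim about which is faster.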

Are there any books about SIMD (SSE, AVX, and so on) optimization?

Can someone please recommend a few books on program optimization?

I use multithreading and SIMD to improve the performance of my programs.

So far I have learned SIMD from websites and by asking questions online.

Now I want to buy some books to learn from. Are there any books on SIMD? Thanks.

Instruction set extensions programming reference, revision 17

An updated instruction set extensions programming reference, revision 17, has been posted here. 

It includes information about:

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions
  • Intel® Secure Hash Algorithm (Intel® SHA) extensions 
  • Intel® Memory Protection Extensions (Intel® MPX) 

For more information about the technologies:


MOVNTI and alignment for real mode

In the SDM rev. 48, vol. 2A, page 3-546, the description of the real-mode exceptions for the MOVNTI instruction specifies that the instruction can generate

#GP If a memory operand is not aligned on a 16-byte boundary, regardless of segment.

No exceptions are specified for unaligned stores in protected or long mode, except when alignment checking (AC) is enabled. The AMD reference is also silent about unaligned stores. Is this really an irregularity in real mode, or just a typo in the spec?
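The long-mode side of the question can be checked from user space with a small sketch like the one below. It assumes an x86-64 build, where SSE2 is baseline, and the function name `movnti_roundtrip` is made up for illustration. The `_mm_stream_si32` intrinsic normally compiles to `MOVNTI m32, r32`, and the target address is deliberately 4-byte aligned but not 16-byte aligned. This says nothing about real mode, which cannot be tested this way.

```c
#include <emmintrin.h>  /* _mm_stream_si32 (SSE2), also pulls in _mm_sfence */

/* Store `value` with a non-temporal MOVNTI to an address that is
   4 bytes past a 16-byte boundary, then read it back. */
int movnti_roundtrip(int value)
{
    _Alignas(16) int buf[8] = {0};
    int *p = &buf[1];            /* not 16-byte aligned by construction */

    _mm_stream_si32(p, value);   /* non-temporal 32-bit store */
    _mm_sfence();                /* order the NT store before later accesses */
    return *p;
}
```

If a 16-byte alignment requirement applied in 64-bit mode, a call such as `movnti_roundtrip(42)` would fault instead of returning 42.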

Studying Intel TSX Performance: strange results

Dear all,

I have studied Intel TSX performance: its abort cases and a comparison with a spin lock. The study, with references to the source code, is available at .

I see some performance gain for TSX compared with the spin lock. However, I still have a few questions:

AVX-512 is a big step forward - but repeating past mistakes!

AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems and AVX512 is better, so I am quite happy that AVX512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.

Poor Code Gen of FMA3 instructions in SPEC FP 06 using Intel 14.0.0 compiler suite

I have compiled SPEC FP 06 using the Intel 14.0.0 compiler suite. I've observed great performance, but looking at the code gen distributions through SDE, I note that only about 0.1% of the instructions executed are FMA3. When I compiled with Open64 in the past, I noted that 7% of the instructions executed were FMA variants, and performance increased by approximately 5% between compiling without and with FMA3. I'm using the -xCORE-AVX2 compiler flag on my Haswell, but the compiler is not "efficiently leveraging" FMA3.
