Diagnostic 15326: loop was not vectorized: implied FP exception model prevents vectorization
Diagnostic 15326: function was not vectorized: implied FP exception model prevents vectorization
Consider changing compiler flags and/or directives in the source to enable fast FP model and to mask FP exceptions
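As a rough illustration of the advice above, here is a minimal sketch of a loop that is a plausible candidate for this diagnostic. The file name kernel.cpp and the function sqrt_all are hypothetical and are not taken from the original article. The assumption is that the remark appears when a strict or exception-aware FP model is requested (for example -fp-model strict), and that a fast FP model lets the vectorizer proceed. Whether the remark is actually emitted depends on the compiler version and target.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical kernel used only for illustration.
//
// Assumed build lines (Linux* syntax, report switches as in the webinar example):
//   icpc -c -fp-model strict -qopt-report-phase=vec -qopt-report=3 kernel.cpp
//     -> the vectorization report may contain remark 15326 for this loop
//   icpc -c -fp-model fast   -qopt-report-phase=vec -qopt-report=3 kernel.cpp
//     -> the FP exception model no longer stands in the way of vectorization
void sqrt_all(const float* in, float* out, std::size_t n) {
    // Each iteration is independent; only the FP exception semantics
    // requested on the command line can hold the vectorizer back here.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = std::sqrt(in[i]);
    }
}
```

Comparing the two optimization reports for the same source file is a simple way to confirm that the FP model, rather than a dependence or alignment issue, is what blocks vectorization.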
The code samples for the webinar "Further Vectorization Features of the Intel Compiler" given on 4/7/2015 are attached below.
Here are some examples of command lines that may be used to build them. This list is not intended to be complete or to be a tutorial as such, just a guide to things to try. It uses Linux* switch syntax; the Windows* equivalents are very similar.
See the presentation for more detail; the slides and video will be posted separately at a later date.
icpc -c -qopt-report-phase=vec -qopt-report=3 no_stl.cpp
As we all know, AVX2 extends the vector length to 256 bits, compared with 128 bits for SSE4.2. For basic instructions such as packed add, sub, and mul, this gives roughly a 2x performance advantage, since the vectors are twice as wide, but for some instructions the gain is less obvious. This blog is about one such group of instructions: permutations.
Briefly, a set of AVX2 permutation instructions operates on the high and low 128-bit halves of the register separately. These instructions are vpalignr, all vpack instructions, all vpunpck instructions, and vpshufb.
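To make the in-lane behaviour concrete, the following is a scalar sketch that models one of these instructions, vpunpckldq (the instruction behind the _mm256_unpacklo_epi32 intrinsic), using plain arrays instead of real vector registers. The type Vec8 and both function names are hypothetical helpers introduced for illustration only. The second function shows what a full-width interleave would look like, which is not what the AVX2 instruction does.

```cpp
#include <array>
#include <cstdint>

// Models one 256-bit register as eight 32-bit elements (hypothetical helper type).
using Vec8 = std::array<std::uint32_t, 8>;

// Scalar model of 256-bit vpunpckldq: interleave the two low dwords of a and b,
// independently inside each 128-bit lane.
Vec8 unpacklo_epi32_model(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int lane = 0; lane < 2; ++lane) {
        const int base = lane * 4;  // four dwords per 128-bit lane
        r[base + 0] = a[base + 0];
        r[base + 1] = b[base + 0];
        r[base + 2] = a[base + 1];
        r[base + 3] = b[base + 1];
    }
    return r;
}

// For contrast: a full-width interleave of the low four dwords of a and b.
// This is NOT what the AVX2 instruction computes.
Vec8 interleave_full_width(const Vec8& a, const Vec8& b) {
    Vec8 r{};
    for (int i = 0; i < 4; ++i) {
        r[2 * i]     = a[i];
        r[2 * i + 1] = b[i];
    }
    return r;
}
```

With a = {0, 1, 2, 3, 4, 5, 6, 7} and b = {10, 11, 12, 13, 14, 15, 16, 17}, the in-lane model yields {0, 10, 1, 11, 4, 14, 5, 15}, whereas the full-width interleave yields {0, 10, 1, 11, 2, 12, 3, 13}. The upper half of the result differs, which is why code ported directly from 128-bit SSE often needs an extra cross-lane permutation on AVX2.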