Assessing the accelerator buzz: Tips and Tricks for Intel® Compiler vectorization

Here at Intel we have spent much of the last year assessing the rising buzz around GPGPUs and other accelerator cards in the financial services community.  These technologies promise tremendous computing capability, but the performance claims we see are often exaggerated by comparing the best possible accelerator implementation against a poorly optimized version of the software running on the CPU cores.

One of the first things we do in working toward a true top-end performance measurement is to rebuild the code with the Intel® Compiler.  Whether comparing against gcc 3.x or Microsoft Visual C++ builds, we often see a considerable performance gain right out of the box.

Beyond that we typically find further improvements by analyzing the source code hotspots and making adjustments that enable even better code generation from the Intel® Compiler.  The biggest gains often come from finding frequently executed computational loops that are not being vectorized, that is, loops that are not effectively using the SIMD capabilities of the x86 instruction set.  In this entry we’ll look at a couple of examples where a simple tweak to a loop allowed it to vectorize and execute much more quickly.

In the first example, we have a “daxpy” loop that should vectorize using packed SSE3 instructions:

44: void DaxpyArray (double *x, double *y, double a, double *r)
45: {
47:      for (unsigned int i = 0; i < LENGTH; i++)
48:           r[i] = a * x[i] + y[i];
50: }

We’ll compile this with icc 10.1, using the ‘-xP’ switch to target processors supporting SSE3 instructions and add ‘-vec_report3’ to get an explanation for loops that did not vectorize:

$ icc -xP -vec_report3 SSE3_example.C

SSE3_example.C(47): (col. 5) remark: loop was not vectorized: existence of vector dependence.
SSE3_example.C(48): (col. 8) remark: vector dependence: proven FLOW dependence between r line 48, and x line 48.
SSE3_example.C(48): (col. 8) remark: vector dependence: proven ANTI dependence between x line 48, and r line 48.

This loop looks simple enough, so why didn’t it vectorize? The reports can be cryptic, but from experience we know that the compiler can be pretty picky about what it will accept. It requires a very specific ‘for’ loop structure, and in this case the “unsigned int” loop counter is throwing it off course. The simplest fix is to change the type to a regular signed ‘int’, which is fine as long as LENGTH does not exceed the range of a 32-bit signed integer.

47:      for (int i = 0; i < LENGTH; i++)
48:           r[i] = a * x[i] + y[i];

This was worth a 1.6x gain as measured on a 3.0 GHz “Woodcrest” CPU, so a little code change can go a long way. Of course we’d like to see the compiler be less stringent and vectorize with an unsigned loop index. This has already been fixed in the compiler mainline builds and will most likely appear in the 11.0 release. We’re also working on making the vectorization reports more intuitive and useful.
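For readers who want to reproduce this kind of before/after comparison, here is a minimal timing sketch. Our measurement setup isn’t shown in this post, so the array length, repetition count, and the `TimeDaxpy` helper below are illustrative assumptions, not our benchmark code:

```cpp
#include <cassert>
#include <chrono>
#include <vector>

const int LENGTH = 1000000;  // assumed length; the post does not show its value

void DaxpyArray(double *x, double *y, double a, double *r)
{
    for (int i = 0; i < LENGTH; i++)
        r[i] = a * x[i] + y[i];
}

// Time `reps` repetitions of the daxpy loop and return total milliseconds.
// Repeating the loop amortizes timer overhead and gives a more stable number.
double TimeDaxpy(int reps, std::vector<double>& x, std::vector<double>& y,
                 double a, std::vector<double>& r)
{
    auto start = std::chrono::steady_clock::now();
    for (int rep = 0; rep < reps; rep++)
        DaxpyArray(&x[0], &y[0], a, &r[0]);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Building the same harness once with the vectorizing switches and once without (for example, with `-no-vec`-style options disabled) lets you attribute the speedup to vectorization rather than to other optimizations.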

$ icc -xP -vec_report3 SSE3_example.C
SSE3_example.C(47): (col. 5) remark: LOOP WAS VECTORIZED.

A second example involves the use of STL vector container classes. The compiler doesn’t yet know how to vectorize loops referencing data in that fashion. For example, the following:

void DaxpyVector (const vector<double>& x, const vector<double>& y, const double a, vector<double>& r)
{
     for (int i = 0; i < LENGTH; i++)
          r[i] = a * x[i] + y[i];
}

must be recoded this way:
void DaxpyVector (const vector<double>& x, const vector<double>& y, const double a, vector<double>& r)
{
     double *xP, *yP, *rP;

     xP = (double*)&x[0];
     yP = (double*)&y[0];
     rP = &r[0];

     for (int i = 0; i < LENGTH; i++)
          rP[i] = a * xP[i] + yP[i];
}


... reassigning the vectors to double* pointers so the compiler can emit vectorized code. A fix for this issue is a bit more complex and we are not yet sure when the compilers will be able to handle it optimally.
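One caveat with this pointer workaround: taking the address of element zero of an empty vector is undefined behavior, so a defensive version should check for that first. A minimal sketch, where the empty() guard and the const-qualified pointers (which avoid the cast) are our additions for illustration:

```cpp
#include <cassert>
#include <vector>

using std::vector;

const int LENGTH = 8;  // illustrative length; the post does not show LENGTH's value

void DaxpyVector(const vector<double>& x, const vector<double>& y,
                 const double a, vector<double>& r)
{
    // Guard: &x[0] on an empty vector is undefined, so bail out early.
    if (x.empty() || y.empty() || r.empty())
        return;

    const double *xP = &x[0];  // const-correct, so no cast is needed
    const double *yP = &y[0];
    double *rP = &r[0];

    for (int i = 0; i < LENGTH; i++)
        rP[i] = a * xP[i] + yP[i];
}
```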

A few other things that can trip up the vectorizer:

  • use of class member variables in the loop, either as array pointers or loop-count variables. Either reassign them to local stack variables or experiment with the no-aliasing options such as /Oa, /Qansi-alias, /Ow.

  • any manipulation of vector classes, for example size() and resize() calls, even outside the target loop, can disable vectorization because of C++ exception-handling anomalies. Either move them outside the function scope or try using the ‘-fno-exceptions’ switch.
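To illustrate the first point, here is a sketch of hoisting member variables into locals before a hot loop. The `Portfolio` class and its fields are hypothetical, invented for this example:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical class used only to illustrate the member-variable issue.
class Portfolio {
public:
    Portfolio(double *prices, double *values, std::size_t n)
        : prices_(prices), values_(values), count_(n) {}

    // The loop reads the member pointers and count directly; the compiler
    // may have to assume the stores through values_ could change them, and
    // reload them each iteration, which can block vectorization.
    void ScaleSlow(double factor) {
        for (std::size_t i = 0; i < count_; i++)
            values_[i] = factor * prices_[i];
    }

    // Copy the members into local stack variables first; locals are easier
    // for the compiler to prove alias-free, so the loop can vectorize.
    void ScaleFast(double factor) {
        double *p = prices_;
        double *v = values_;
        std::size_t n = count_;
        for (std::size_t i = 0; i < n; i++)
            v[i] = factor * p[i];
    }

private:
    double *prices_;
    double *values_;
    std::size_t count_;
};
```

Both methods compute the same result; checking the vectorization report for each is a quick way to confirm whether the hoisting was needed with your compiler version.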

Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors.  In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.  For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options."  Many library routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other microprocessors.  While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.


Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.  These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations.  Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.  Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.


While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements.  We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.


Notice revision #20101101


anonymous:

Note that the solution here may make your code less portable. I've had code seg fault on some systems when taking the address of the zeroth element when the vector length is zero. It also seems reckless to assume the underlying STL vector holds all values sequentially in memory as a C array would.

Michael Stoner (Intel):

Thanks for the comment. We originally filed that vectorization problem against the 10.0.025 compiler for Linux/Intel64. I rebuilt the code today with a 10.1 Windows 32-bit compiler and found it does vectorize in that environment.

anonymous:

My version 10.1 vectorizes the loop containing the unsigned int:

laptop: icpc -v
Version 10.1

laptop: icpc -c -xP -vec_report vec1.c
vec1.c(4): (col. 3) remark: LOOP WAS VECTORIZED.

laptop: cat vec1.c
const int LENGTH = 1000;
void DaxpyArray (double *x, double *y, double a, double *r)
{
    for (unsigned int i = 0; i < LENGTH; i++)
        r[i] = a * x[i] + y[i];
}
