Vectorization Toolkit


The following toolkit discusses six steps you can follow to increase performance through vectorization in your application.

The 6-Step Process for Vectorizing your Application

Step 1. Measure Baseline Release Build Performance

It is important to work with a release build (and not a debug build) because:

  1. The compiler will optimize your code and may change which loop is in a hotspot.
  2. You need to know what baseline time your application has so that you can determine if vectorization has improved performance.

You should set a goal for performance so you know when you are done (when the release build is fast enough).

A release build is the default in the Intel Compiler. Optimizations are turned off only when you request it explicitly: by doing a DEBUG build on Windows* (or using the -Zi switch), or by using the -O0 switch (or the -g switch) on Linux* or Mac OS X*.
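One way to record a baseline is to time the code of interest directly in the release build. The following is a minimal sketch, not part of the original toolkit: workload() is a hypothetical stand-in for your application's real computation, and measure_baseline_seconds() is an illustrative helper name.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the application's real workload.
double workload(const std::vector<double>& data) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        sum += data[i] * data[i];
    }
    return sum;
}

// Runs workload() `repeats` times and returns the fastest wall-clock time
// in seconds. Taking the minimum reduces noise from other processes.
// Returns -1.0 if repeats is not positive.
double measure_baseline_seconds(const std::vector<double>& data, int repeats,
                                double* result_out) {
    using clock = std::chrono::steady_clock;
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        const double result = workload(data);
        const auto stop = clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
        if (result_out != nullptr) {
            *result_out = result;  // keep the result live so the call is not optimized away
        }
    }
    return best;
}
```

Record the value returned for the unmodified release build, then compare it against the same measurement after each vectorization change.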

Step 2. Determine Hotspots Using Intel® VTune™ Amplifier XE

You can use Intel® VTune™ Amplifier XE, Intel's performance profiler, to find the most time-consuming functions in your application.  The "Hotspots" analysis type is recommended, although "Lightweight Hotspots" would work as well (it will profile the whole system as opposed to just your application). Identifying which areas of your code are taking the most time will allow you to focus your optimization efforts in the areas where performance improvements will have the most effect. Generally you want to focus on only the top few hotspots, or functions taking at least 10% of your application's total time. Make note of the hotspots you want to focus on for the next step.  

Step 3. Determine Loop Candidates Using Intel Compiler Optimization Report

The vectorization section of the optimization report can tell you whether or not each loop in your code was vectorized. Look at the output of the optimization report for the hotspots you determined in Step 2. If there are loops in your hotspots that did not vectorize, check whether they have math, data processing, or string calculations on data in parallel (for instance in an array). If they do, they might benefit from vectorization. Move to step 4 if any candidates are found. 
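To illustrate what a loop candidate looks like, the sketch below contrasts a loop whose iterations are independent with one whose iterations depend on each other. Both function names are hypothetical and were chosen only for this example.

```cpp
#include <cstddef>

// Candidate: each iteration reads and writes only element i, so the
// iterations can be computed side by side in vector lanes.
void scale_and_add(float* y, const float* x, float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Not a direct candidate: iteration i needs the value written by
// iteration i-1 (a loop-carried dependence).
void running_sum(float* y, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        y[i] += y[i - 1];
    }
}
```

If a hotspot loop resembles the first function but the report says it was not vectorized, it is worth carrying forward to Step 4.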

To generate information about compiler vectorization from the optimization report, compile with the options "-qopt-report -qopt-report-phase=loop,vec" on Linux or "/Qopt-report /Qopt-report-phase:loop,vec" on Windows.

Note that the Intel Compiler can be run on just a portion of the code and will be compatible with the native compilers (gcc on Linux and Mac OS and Microsoft* Visual C++ on Windows*).

Example optimization report output:

Begin optimization report for: test_scalar_dep(double *, int)

    Report from: Vector optimizations [vec]

LOOP BEGIN at scalar_dep.cpp(79,1)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between  line 81 and  line 80

Step 4. Get Advice Using Intel® Advisor XE

Use the vectorization analysis capability of Intel Advisor XE to analyze your application's run time behavior and identify the components of the application that will benefit most from vectorization.

Example Non-vectorizing Loop:

for (i=0; i<n; i++) {
    if (A[i] > 0) { b=A[i]; A[i] = 1 / A[i]; }
    if (A[i] > 1) { A[i] += b; }
}

After running an analysis and generating a result, you can view the result through the GUI or you can generate text-based reports as needed. The analysis type used in collection determines which report types can be generated from the result. 

An example report based on the code above:

ID  Loops                                           Self Time  Total Time  Loop Type  Why No Vectorization?                                              Source Location    Module
 4  [loop in test_scalar_dep at scalar_dep.cpp:79]  0.35       0.35        Scalar     vector dependence prevents vectorization                           scalar_dep.cpp:79  default
 3  [loop in main at main.cpp:39]                   0          0.35        Scalar     loop with function call not considered an optimization candidate.  main.cpp:39        default
 2  [loop in __libc_start_main]                     0          0.35        Scalar

Step 5. Implement Vectorization Recommendations

In Step 4, Intel Advisor XE identified that a vector dependence is preventing the Intel compiler from generating vectorized code. Make sure any recommended change affects neither the semantics nor the safety of your loop calculations. One way to check that the loop has no dependencies is to ask whether executing the loop in backwards order would change the results. Another is to imagine the loop's iterations being executed in a scrambled order. If the results would change, your loop has dependencies and vectorization would not be "safe". You may still be able to vectorize by eliminating the dependencies in the loop. Modify your source code to give additional information to the compiler or optimize your loop for better vectorization.
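The backwards-order check can also be tried mechanically on sample data. The sketch below is an illustration under assumptions: the helper order_independent() and the two example loop bodies are hypothetical names, and passing the check on one input suggests, but does not prove, that a loop is free of dependencies.

```cpp
#include <cstddef>
#include <vector>

// Signature for one loop iteration: operates on the array at index i.
using LoopBody = void (*)(std::vector<double>&, std::size_t);

// Example body with no cross-iteration dependence.
void body_independent(std::vector<double>& a, std::size_t i) {
    a[i] = 2.0 * a[i] + 1.0;
}

// Example body that reads the element written by the previous iteration.
void body_dependent(std::vector<double>& a, std::size_t i) {
    if (i > 0) {
        a[i] += a[i - 1];
    }
}

// Runs the body over the input in forward and in backward order and
// reports whether both orders produce identical arrays.
bool order_independent(const std::vector<double>& input, LoopBody body) {
    std::vector<double> forward = input;
    for (std::size_t i = 0; i < forward.size(); ++i) {
        body(forward, i);
    }
    std::vector<double> backward = input;
    for (std::size_t i = backward.size(); i-- > 0;) {
        body(backward, i);
    }
    return forward == backward;
}
```

For the input {1, 2, 3, 4}, body_independent gives the same array in both orders, while body_dependent gives {1, 3, 6, 10} forward and {1, 3, 5, 7} backward, which exposes the dependence.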

Modified Example from Step 4:

for (i=0; i<n; i++) {
    b = A[i];
    if (A[i] > 0) { A[i] = 1 / A[i]; }
    if (A[i] > 1) { A[i] += b; }
}
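Before relying on a rewrite like this, it helps to confirm that both versions of the loop produce the same array. The sketch below wraps the two loops in functions and compares their output. The function names, the double* signature (taken from the report's test_scalar_dep(double *, int)), and the initial value of b are assumptions, since the source shows neither the enclosing function body nor the declaration of b.

```cpp
#include <vector>

// Loop as shown in Step 4: b is assigned only when A[i] > 0, so the
// compiler must assume b may carry a value from an earlier iteration.
void scalar_dep_original(double* A, int n) {
    double b = 0.0;  // assumed initial value; not shown in the source
    for (int i = 0; i < n; i++) {
        if (A[i] > 0) { b = A[i]; A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

// Modified loop from Step 5: b is assigned unconditionally in every
// iteration, so no value of b flows from one iteration to the next.
void scalar_dep_modified(double* A, int n) {
    double b = 0.0;  // assumed initial value; not shown in the source
    for (int i = 0; i < n; i++) {
        b = A[i];
        if (A[i] > 0) { A[i] = 1 / A[i]; }
        if (A[i] > 1) { A[i] += b; }
    }
}

// Runs both versions on copies of the same input and compares the arrays.
bool same_results(const std::vector<double>& input) {
    std::vector<double> original = input;
    std::vector<double> modified = input;
    scalar_dep_original(original.data(), static_cast<int>(original.size()));
    scalar_dep_modified(modified.data(), static_cast<int>(modified.size()));
    return original == modified;
}
```

The two versions agree on the array contents because b is only read when A[i] > 1, which can only happen after A[i] was positive and b was therefore set in the same iteration. The value left in b after the loop may differ, which matters only if later code reads it.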

Step 6. Repeat.

Continue to modify the application code based on the performance analysis, and compare the new performance metrics against your baseline after each change. Repeat this cycle until the results match your performance goals or until no significant hotspots are found.

For more complete information about compiler optimizations, see our Optimization Notice.