Getting Code Ready for Parallel Execution with Intel® Parallel Composer


Developers have the choice among a number of approaches for introducing parallelism into their code. This article provides an overview of the methods available in Intel® Parallel Composer, along with a comparison of their key benefits. While Intel Parallel Composer covers development in C/C++ on Windows* only, many of the methods also apply (with the suitable compiler) to development in Fortran and/or on Linux*.

This article is part of the larger series, "Intel Guide for Developing Multithreaded Applications," which provides guidelines for developing efficient multithreaded applications for Intel® platforms.


While the Intel® Compilers have some ways of automatically detecting and optimizing suitable code constructs for parallel execution (e.g., vectorization and auto-parallelization), most of the methods require code modifications. The pragmas or functions inserted rely on runtime libraries that actually perform the decomposition and schedule execution of the parallel threads, such as Intel® Cilk™ Plus, Array Building Blocks (Intel® ArBB), Threading Building Blocks (Intel® TBB), OpenMP*, and the Win32* API. The main difference among the approaches is the level of control they provide over execution details; in general, the more control provided, the more intrusive the required code changes are.

Intel® Cilk™ Plus

The Intel Cilk Plus language extensions included in the Intel® C++ Compiler add fine-grained task support to C and C++, making it easy to add parallelism to both new and existing software to efficiently exploit multiple processors. Intel Cilk Plus is made up of these main features:

  • Set of keywords (_Cilk_spawn, _Cilk_sync, and _Cilk_for), for expression of task parallelism.
  • Reducers, which eliminate contention for shared variables among tasks by automatically creating views of them for each task and reducing them back to a shared value after task completion.
  • Array notations, which enable data parallelism of whole functions or operations which can then be applied to whole or parts of arrays or scalars.
  • The simd pragma, which lets you express vector parallelism for utilizing hardware SIMD parallelism while writing standard-compliant C/C++ code with an Intel compiler.
  • Elemental functions, which can be invoked either on scalar arguments or on array elements in parallel. You define an elemental function by adding "__declspec(vector)" (on Windows*) or "__attribute__((vector))" (on Linux*) before the function signature.

Cilk keyword/pragma | Description
cilk_spawn (Keyword) | Modifies a function call statement to tell the runtime system that the function may (but is not required to) run in parallel with the caller. The "cilk_spawn" variant requires #include <cilk/cilk.h>
cilk_sync (Keyword) | Indicates that the current function cannot continue past this point until its spawned children have returned. The "cilk_sync" variant requires #include <cilk/cilk.h>
cilk_for (Keyword) | Specifies a loop that permits loop iterations to run in parallel; is a replacement for the normal C/C++ for loop. This statement divides a loop into chunks containing one or more loop iterations. Each chunk is executed serially, and is spawned as a chunk during the execution of the loop. The "cilk_for" variant requires #include <cilk/cilk.h>
cilk grainsize (Pragma) | Specifies the grain (chunk) size for one cilk_for loop
CILK_NWORKERS (Environment variable) | Specifies the number of worker threads

Example: cilk_spawn, cilk_sync

In the example below, cilk_spawn does not spawn anything nor create any new threads. It indicates to the Cilk Plus runtime that a free worker thread can steal the code following the call to fib(n-1) and do it in parallel with the function call.
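The listing for this example is not present in this copy of the article. A minimal sketch of the recursive Fibonacci pattern that the paragraph describes might look like the following. The serial fallback macros are an assumption added here so that the sketch also builds with a compiler that has no Cilk Plus support; they are not part of the original example.

```cpp
// Sketch only: recursive Fibonacci with cilk_spawn / cilk_sync.
// Assumption: the compiler predefines __cilk when Cilk Plus is enabled.
#ifdef __cilk
#include <cilk/cilk.h>
#else
// Serial fallback: the keywords expand to nothing.
#define cilk_spawn
#define cilk_sync
#endif

int fib(int n) {
    if (n < 2) return n;
    // The continuation after this call may be stolen by an idle worker.
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    // Wait for the spawned call before reading x.
    cilk_sync;
    return x + y;
}
```

With the fallback macros the function runs serially and returns the same values, which mirrors the serial semantics that the keywords are designed to preserve.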


Example: cilk_for, Cilk reducer

In the example below, cilk_for causes multiple instances of the loop body to be spawned onto execution cores and executed in parallel. The reducer avoids a data race and allows the reduction operation to proceed without locking.
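The listing for this example is also missing from this copy. As a hedged sketch, a sum over an array with cilk_for and an opadd reducer could be written as follows. The names sum_reducer, result_of, and sum_array are hypothetical, and the serial fallback branch is an assumption added so that the code builds without Cilk Plus.

```cpp
// Sketch only: parallel sum with cilk_for and a reducer.
// Assumption: the compiler predefines __cilk when Cilk Plus is enabled.
#ifdef __cilk
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
typedef cilk::reducer_opadd<long> sum_reducer;
inline long result_of(sum_reducer& r) { return r.get_value(); }
#else
// Serial fallback: a plain loop and a plain accumulator.
#define cilk_for for
typedef long sum_reducer;
inline long result_of(sum_reducer& r) { return r; }
#endif

long sum_array(const int* data, int n) {
    sum_reducer total(0);
    // Each worker updates its own view of total; no lock is needed.
    cilk_for (int i = 0; i < n; ++i) {
        total += data[i];
    }
    // The views are combined before the final value is read.
    return result_of(total);
}
```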


Example: Array Notation

In the example below, the usual subscript syntax in C/C++ is replaced by an array section descriptor to achieve the same result. The difference between using the array notation and the standard C/C++ loop is that here, there is no serial ordering implied. Therefore, the expected code generation strategy from the compiler is vector parallelism, i.e., using the SSE instructions to implement the additions in a SIMD fashion. The compiler generates vector SIMD code for operations on arrays, including all the language built-in operators, such as '+', '*', '&', '&&', etc.
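The listing for this example is missing from this copy as well. A minimal sketch of element-wise addition with an array section might look like this. The function name add_arrays is hypothetical, and the plain loop in the #else branch is an assumption added for compilers that do not accept array notation.

```cpp
// Sketch only: element-wise addition written with array notation.
// Assumption: the compiler predefines __cilk when Cilk Plus is enabled.
void add_arrays(const float* a, const float* b, float* c, int n) {
#ifdef __cilk
    // Array section syntax is [start:length]; no iteration order is implied.
    c[0:n] = a[0:n] + b[0:n];
#else
    // Equivalent serial loop for compilers without array notation.
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
#endif
}
```

Unlike the keywords, the array section cannot be removed with a macro, so a guarded loop is one way to keep the source portable.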


Example: pragma simd

Vectorization using #pragma simd instructs the compiler to enforce vectorization of loops. It is designed to minimize the amount of source code changes needed in order to obtain vectorized code. The simd pragma can be used to vectorize loops that the compiler does not normally auto-vectorize even with the use of vectorization hints such as "pragma vector always" or "pragma ivdep".

char foo(char *A, int n){

  int i;
  char x = 0;

#ifdef SIMD
#pragma simd reduction(+:x)
#endif
#ifdef IVDEP
#pragma ivdep
#endif
  for (i=0; i<n; i++){
    x = x + A[i];
  }
  return x;
}

>icl /c /Qvec-report2 simd.cpp

simd.cpp(12) (col. 3): remark: loop was not vectorized: existence of vector dependence.

>icl /c /Qvec-report2 simd.cpp /DIVDEP

simd.cpp(12) (col. 3): remark: loop was not vectorized: existence of vector dependence.

>icl /c /Qvec-report2 simd.cpp /DSIMD

simd.cpp(12) (col. 3): remark: SIMD LOOP WAS VECTORIZED.

Example: Elemental function

__declspec(vector)
int vfun_add_one(int x) {
  return x+1;
}

>icl /c /Qvec-report2 elementalfunc.cpp

elementalfunc.cpp(3) (col. 1): remark: FUNCTION WAS VECTORIZED.

For more details on Cilk Plus and elemental functions, see the Intel® C++ Compiler XE 12.0 User and Reference Guide and the article titled "Elemental functions: Writing data parallel code in C/C++ using Intel® Cilk Plus".

Intel® Threading Building Blocks (Intel® TBB)

Intel TBB is a library that offers a rich methodology to express parallelism in C++ programs and take advantage of multicore processor performance. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanism for performance and scalability while fitting smoothly into the object-oriented and generic framework of C++. Intel TBB uses a runtime-based programming model and provides developers with generic parallel algorithms based on a template library similar to the standard template library (STL).



#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include <vector>
void foo() {
  tbb::task_scheduler_init init;

  size_t length = 1000000;
  std::vector<float> a(length, 2), b(length, 3), c(length, 0);

  // Each task receives a sub-range r and adds its elements
  tbb::parallel_for(tbb::blocked_range<size_t>(0, length),
    [&](const tbb::blocked_range<size_t> &r){
      for (size_t i=r.begin(); i<r.end(); i++)
        c[i] = a[i] + b[i];
    });
}

The Intel TBB task scheduler performs load balancing automatically, relieving the developer of the responsibility to perform that potentially complex task. When a program is broken into many small tasks, the Intel TBB scheduler assigns those tasks to threads in a way that spreads the work evenly.

Both the Intel C++ Compiler and Intel TBB support the new C++11 lambda functions, which make STL and Intel TBB algorithms much easier to use. In order to use Intel's implementation of lambda expressions, one must compile the code with the /Qstd=c++0x compiler option.

Parallelization using OpenMP*

OpenMP is an industry standard for portable multi-threaded application development. The Intel® C++ Compiler supports the OpenMP C/C++ version 3.0 API specification available at the OpenMP web site. Parallelism with OpenMP is controllable by the user through the use of OpenMP directives. This approach is effective at fine-grain (loop-level) and large-grain (function-level) threading. OpenMP directives provide an easy and powerful way to convert serial applications into parallel applications, enabling potentially big performance gains from parallel execution on multi-core systems. The directives are enabled with the /Qopenmp compiler option and are ignored without it. This characteristic allows building both the serial and parallel versions of the application from the same source code. For shared memory parallel computers, it also allows for simple comparisons between serial and parallel runs.

The following table shows commonly used OpenMP* directives:

Directive | Description
#pragma omp parallel for [clause] ... for-loop | Parallelizes the loop that immediately follows the pragma.
#pragma omp parallel sections [clause] ... { [#pragma omp section structured-block] ... } | Distributes the execution of the different sections among the threads in the parallel team. Each structured block is executed once by one of the threads in the team in the context of its implicit task.
#pragma omp master structured-block | The code contained within the master construct is executed by the master thread in the thread team.
#pragma omp critical [ (name) ] structured-block | Provides mutual exclusion access to the structured-block. Only one critical section is allowed to execute at one time anywhere in the program.
#pragma omp barrier | Used to synchronize the execution of multiple threads within a parallel region. Ensures all the code occurring before the barrier has been completed by all the threads, before any thread can execute any of the code past the barrier directive.
#pragma omp atomic expression-statement | Provides mutual exclusion via hardware synchronization primitives. While a critical section provides mutual exclusion access to a block of code, the atomic directive provides mutual exclusion access to a single assignment statement.
#pragma omp threadprivate (list) | Specifies a list of global variables being replicated, one instance per thread (i.e., each thread works on an individual copy of the variable).
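As an illustration of how two of these directives combine, the following sketch counts negative values in an array with a parallel loop and protects the shared counter with the atomic directive. The function name count_negative is hypothetical. Without OpenMP support enabled, the pragmas are ignored and the function runs serially with the same result.

```cpp
// Sketch only: parallel for combined with atomic.
int count_negative(const float* data, int n) {
    int count = 0;
    // Loop iterations are divided among the threads of the team.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (data[i] < 0.0f) {
            // Only this single update is protected, not the whole loop body.
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

A reduction clause would usually be the more efficient way to express a counter like this; atomic is used here only to show the directive from the table.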


void sp_1a(float a[], float b[], int n) {
  int i;
#pragma omp parallel shared(a,b,n) private(i)
  {
#pragma omp for
    for (i = 0; i < n; i++)
      a[i] = 1.0 / a[i];
#pragma omp single
    a[0] = a[0] * 10;
#pragma omp for nowait
    for (i = 0; i < n; i++)
      b[i] = b[i] / a[i];
  }
}

> icl /c /Qopenmp par2.cpp
par2.cpp(5): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(10): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(3): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

The /Qopenmp-report[n] compiler option, where n is a number between 0 and 2, can be used to control the OpenMP parallelizer's level of diagnostic messages. Use of this option requires the programmer to specify the /Qopenmp option. If n is not specified, the default is /Qopenmp-report1, which displays diagnostic messages indicating loops, regions, and sections successfully parallelized.

Because only directives are inserted into the code, it is possible to make incremental code changes. The ability to make incremental code changes helps programmers maintain serial consistency. When the code is run on one processor, it gives the same result as the unmodified source code. OpenMP is a single source code solution that supports multiple platforms and operating systems. There is also no need to determine the number of cores, because the OpenMP runtime chooses the right number automatically.

OpenMP version 3.0 contains a new task-level parallelism construct that simplifies parallelizing functions, in addition to the loop-level parallelism for which OpenMP is most commonly used. The tasking model allows parallelizing programs with irregular patterns of dynamic data structures or with complicated control structures like recursion that are hard to parallelize efficiently. The task pragmas operate in the context of a parallel region and create explicit tasks. When a task pragma is encountered lexically within a parallel region, the code inside the task block is conceptually queued to be executed by one of the threads executing the parallel region. To preserve sequential semantics, all tasks queued within the parallel region are guaranteed to complete by the end of the parallel region. The programmer is responsible for ensuring either that no dependencies exist or that dependencies are appropriately synchronized between explicit tasks, as well as between code inside and outside explicit tasks.


#pragma omp parallel
#pragma omp single
  for(int i = 0; i < size; i++)
    #pragma omp task
    setQueen (new int[size], 0, i, myid);
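The fragment above depends on setQueen, which is not shown here. As a self-contained illustration of the same tasking pattern applied to recursion, the following sketch computes a Fibonacci number with OpenMP 3.0 tasks. The function names fib_task and fib_parallel are hypothetical. When OpenMP is not enabled, the pragmas are ignored and the code runs serially.

```cpp
// Sketch only: recursion parallelized with OpenMP 3.0 tasks.
long fib_task(int n) {
    if (n < 2) return n;
    long x, y;
    // Each recursive call is queued as an explicit task.
    #pragma omp task shared(x)
    x = fib_task(n - 1);
    #pragma omp task shared(y)
    y = fib_task(n - 2);
    // Synchronize the two child tasks before combining their results.
    #pragma omp taskwait
    return x + y;
}

long fib_parallel(int n) {
    long result = 0;
    #pragma omp parallel
    {
        // One thread creates the root of the task tree; the team executes it.
        #pragma omp single
        result = fib_task(n);
    }
    return result;
}
```

The taskwait is the explicit synchronization that the preceding paragraph makes the programmer responsible for. A practical version would also stop creating tasks below some problem size to limit overhead.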


Win32* Threading API and Pthreads*

In some cases, developers prefer the flexibility of a native threading API. The main advantage of this approach is that the user has more control and power over threading than with the threading abstractions discussed so far in this article. At the same time, however, the amount of code required to implement a given solution is higher, as the programmer must implement all the tedious thread implementation tasks, such as creation, scheduling, synchronization, local storage, load balancing, and destruction, which in the other cases are handled by the runtime system. Moreover, the number of cores available, which influences the correct number of threads to be created, must be determined. That can be a complex undertaking, particularly for platform-independent solutions.


void run_threaded_loop (int num_thr, size_t size, int _queens[])
{
  HANDLE* threads = new HANDLE[num_thr];
  thr_params* params = new thr_params[num_thr];

  for (int i = 0; i < num_thr; ++i)
  {
    // Give each thread equal number of rows
    params[i].start = i * (size / num_thr);
    params[i].end = params[i].start + (size / num_thr);
    params[i].queens = _queens;
    // Pass argument-pointer to a different
    // memory for each thread's parameter to avoid data races
    threads[i] = CreateThread (NULL, 0, run_solve,
      static_cast<void *> (&params[i]), 0, NULL);
  }

  // Join threads: wait until all threads are done
  WaitForMultipleObjects (num_thr, threads, true, INFINITE);

  // Free memory
  delete[] params;
  delete[] threads;
}
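The row assignment in the listing above covers every row only when size is a multiple of num_thr. As one possible refinement, the following sketch computes half-open row ranges that also distribute the remainder. The names row_range and split_rows are hypothetical and not part of the original sample, and the sketch assumes num_thr is greater than zero.

```cpp
#include <cstddef>
#include <vector>

// Sketch only: half-open range [start, end) of rows for one thread.
struct row_range {
    std::size_t start;
    std::size_t end;
};

// Splits size rows across num_thr threads; the first (size % num_thr)
// threads receive one extra row. Assumes num_thr > 0.
std::vector<row_range> split_rows(std::size_t size, int num_thr) {
    std::vector<row_range> ranges;
    const std::size_t count = static_cast<std::size_t>(num_thr);
    const std::size_t base = size / count;
    const std::size_t extra = size % count;
    std::size_t start = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        row_range r = { start, start + len };
        ranges.push_back(r);
        start += len;
    }
    return ranges;
}
```

This kind of bookkeeping is part of the manual load balancing that the higher-level models discussed earlier handle in their runtime.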

Threaded Libraries

Another way to add parallelism to an application is to use threaded libraries such as Intel® Math Kernel Library (Intel® MKL, not part of Intel Parallel Composer) and Intel® Performance Primitives (Intel® IPP). Intel MKL offers highly optimized threaded math routines for maximum performance, using OpenMP for threading. To take advantage of threaded Intel MKL functions, simply set the OMP_NUM_THREADS environment variable to a value greater than one. Intel MKL has internal thresholds to determine whether to perform a computation in parallel or serial, or the programmer can manually set thresholds using the OpenMP API, specifically the omp_set_num_threads function. The online technical notes have some additional information about MKL parallelism (MKL 11.0 for Windows*, Intel® MKL 10.x threading).

Intel IPP is an extensive library of multicore-ready, highly optimized software functions particularly well suited to multimedia data processing and communications applications. Intel IPP uses OpenMP for threading, as well. The online technical notes provide more information about IPP threading and OpenMP support.

The Intel C++ Compiler also provides an implementation of the STL valarray using Intel IPP for data-parallel performance of math and transcendental operations. The C++ valarray template class consists of array operations that support high-performance computing. These operations are designed to take advantage of low-level hardware features such as vectorization. The Intel implementation of valarray provides Intel IPP-optimized versions of several valarray operations through an optimized replacement valarray header file without requiring any source code changes. To optimize valarray loops with Intel Optimized Performance header files, use the /Quse-intel-optimized-headers compiler option.
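As an illustration of the kind of code this applies to, the following sketch uses only the standard valarray interface. The function name scaled_sqrt is hypothetical. Because no Intel-specific calls appear in the source, the same code should build with any standard C++ compiler, and the assumption is that building it with /Quse-intel-optimized-headers would select the optimized header instead.

```cpp
#include <valarray>

// Sketch only: whole-array math expressed with std::valarray.
// sqrt and the scalar multiply both apply element by element.
std::valarray<double> scaled_sqrt(const std::valarray<double>& v, double k) {
    return k * std::sqrt(v);
}
```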


Auto-Parallelization

Auto-parallelization is a feature of the Intel C++ Compiler. In auto-parallelization mode, the compiler automatically detects parallelism inherent in the program. The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel. If data dependencies are present, loop restructuring may be needed for the loops to be auto-parallelized.

In auto-parallelization mode, all parallelization decisions are made by the compiler, and the developer does not have any control over which loops are to be parallelized. Auto-parallelization can be combined with OpenMP to achieve higher performance. When combining OpenMP and auto-parallelization, OpenMP will be used to parallelize loops containing OpenMP directives and auto-parallelization will be used to parallelize non-OpenMP loops. Auto-parallelization is enabled with the /Qparallel compiler option.


#define N 10000
float a[N], b[N], c[N];

void f1() {
  for (int i = 1; i < N; i++)
    c[i] = a[i] + b[i];
}

> icl /c /Qparallel par1.cpp
par1.cpp(5): (col. 4) remark: LOOP WAS AUTO-PARALLELIZED.

By default, the auto-parallelizer reports which loops were successfully auto-parallelized. Using the /Qpar-report[n] option, where n is a number between 0 and 3, the auto-parallelizer can report diagnostic information about auto-parallelized loops and those that did not get auto-parallelized. For example, /Qpar-report3 tells the auto-parallelizer to report diagnostics messages for loops successfully and unsuccessfully auto-parallelized plus information about any proven or assumed dependencies inhibiting auto-parallelization. The diagnostics information helps restructure loops to be auto-parallelized.
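To make the notion of an inhibiting dependency concrete, the sketch below contrasts two loops over hypothetical global arrays. The names running_sum, elementwise_sum, M, and the array names are illustrative only, and what the compiler actually reports for them is not claimed here. The first loop reads the value written by the previous iteration, so its iterations cannot safely run in any order. The second loop has fully independent iterations.

```cpp
#define M 8
float pa[M], pb[M], pc[M];

// Loop-carried dependency: iteration i needs pa[i-1] from iteration i-1.
void running_sum() {
    for (int i = 1; i < M; i++)
        pa[i] = pa[i - 1] + pb[i];
}

// Independent iterations: each pc[i] depends only on element i of the inputs.
void elementwise_sum() {
    for (int i = 0; i < M; i++)
        pc[i] = pa[i] + pb[i];
}
```

A loop of the first kind is the sort of candidate for which a dependency diagnostic and subsequent restructuring would be expected.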


Auto-Vectorization

Vectorization is the technique used to optimize loop performance on Intel® processors. Parallelism defined by the vectorization technique is based on vector-level parallelism (VLP) made possible by the processor's SIMD hardware. The auto-vectorizer in the Intel C++ Compiler automatically detects low-level operations in the program that can be performed in parallel and then converts the sequential code to process 1-, 2-, 4-, 8-, or up to 16-byte data elements in one operation, with extensions up to 32 and 64 bytes in future processors. Loops need to be independent for the compiler to auto-vectorize them. Auto-vectorization can be used in conjunction with the other thread-level parallelization techniques such as auto-parallelization and OpenMP discussed earlier. Most floating-point applications and some integer applications can benefit from vectorization. The default vectorization level is /arch:SSE2, which generates code for Intel® Streaming SIMD Extensions 2 (Intel® SSE2). To enable auto-vectorization for other than the default target, use the /arch (e.g., /arch:SSE4.1) or /Qx (e.g., /QxSSE4.2, /QxHost) compiler options.

The figure below shows the serial execution of the loop iterations on the left without vectorization, where the lower parts of the SIMD registers are not utilized. The vectorized version on the right shows four elements of the A and B arrays added in parallel for each iteration of the loop, utilizing the full width of the SIMD registers.

Figure 1. Loop iterations with and without vectorization.


#define N 10000
float a[N], b[N], c[N];

void f1() {
  for (int i = 1; i < N; i++)
    c[i] = a[i] + b[i];
}

> icl /c /QxSSE4.2 par1.cpp 
par1.cpp(5): (col. 4) remark: LOOP WAS VECTORIZED.

By default, the vectorizer reports which loops got vectorized. Using the /Qvec-report[n] option, where n is a number between 0 and 5, the vectorizer can report diagnostic information about vectorized and non-vectorized loops. For example, the /Qvec-report5 option tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized. The diagnostics information helps restructure the loops to be vectorized.

Tradeoffs Between Different Methods

Various parallelism methods can be categorized in terms of abstraction, control, and simplicity. Intel® TBB, Intel® ArBB, and the API models do not require specific compiler support, but Intel® Cilk™ Plus and OpenMP do. Using Intel Cilk Plus and OpenMP requires a compiler that recognizes the Cilk Plus keywords and array syntax and the OpenMP directives. The API-based models require the programmer to manually map concurrent tasks to threads. There is no explicit parent-child relationship between the threads; all threads are peers. These models give the programmer control over all low-level aspects of thread creation, management, and synchronization. This flexibility is the key advantage of library-based threading methods. The tradeoff is that to obtain this flexibility, significant code modifications and a lot more coding are required. Effort spent on performance tuning will often not scale up or down to different core counts or operating system versions. Concurrent tasks must be encapsulated in functions that can be mapped to threads. The other drawback is that most threading APIs use arcane calling conventions and only accept one argument. Thus, it is often necessary to modify function prototypes and data structures in ways that may break the abstraction of the program design; this style fits better in a C approach than in an object-oriented C++ one.

As a compiler-based threading method, OpenMP provides a high-level interface to the underlying thread libraries. With OpenMP, the programmer uses OpenMP directives to describe parallelism to the compiler. This approach removes much of the complexity of explicit threading methods, because the compiler handles the details. Due to the incremental approach to parallelism, where the serial structure of the application stays intact, there are no significant source code modifications necessary. A non-OpenMP compiler simply ignores the OpenMP directives, leaving the underlying serial code intact.

With OpenMP, however, much of the fine control over threads is lost. Among other things, OpenMP does not give the programmer a way to set thread priorities or perform event-based or inter-process synchronization. OpenMP is a fork-join threading model with an explicit master-worker relationship among threads. These characteristics narrow the range of problems for which OpenMP is suited. In general, OpenMP is best suited to expressing data parallelism, while explicit threading API methods are best suited for functional decomposition. OpenMP is well known for its support for loop structures and C code, but it offers nothing specific for C++. OpenMP version 3.0 supports tasking, which extends OpenMP by adding support for irregular constructs such as while loops and recursive structures. Nevertheless, OpenMP remains reminiscent of plain C and FORTRAN programming, with minimal support for C++.

Intel ArBB provides a generalized vector parallel programming solution that frees application developers from dependencies on particular low-level parallelism mechanisms or hardware architectures. Intel ArBB uses C++ language extensions for compatibility with all standard compilers and IDEs and is not tied to a particular compiler. Use Intel ArBB when:

  • It is natural to express an algorithm in a data parallel form:
    • Operations on arrays
    • Elemental functions that apply in parallel to all elements of an array
  • You want a compiler to determine optimal use of cores, threads, and SIMD execution resources
  • You are interested in a "compile once, run everywhere" deployment model, based on JIT compilation
  • You are interested in deterministic execution and want the Intel ArBB runtime to manage memory space consumption

Intel Cilk Plus supports both data and task parallelism in both the C and C++ languages. With its three simple keywords that provide simple fork-join parallelism, Cilk Plus is the easiest way to introduce parallelism to an existing serial program. It provides the lowest overhead among the parallel models discussed here to invoke parallel threads. The Cilk keywords can be ignored by a non-Cilk compiler simply by using the preprocessor to replace the keywords with their serial equivalents. (A header file is provided for this purpose.) However, the array notation cannot be so easily elided. Unlike OpenMP, which does not compose well with itself (e.g., nested OpenMP) or with other threading models, Intel Cilk Plus composes well with Intel TBB and Intel ArBB without causing thread oversubscription. This allows a programmer to use Cilk Plus for the majority of the code, and use Intel TBB to implement parallelism where other parallel constructs and parallel data structures are needed, such as scoped locks, parallel hash maps, etc.

Intel TBB supports generic scalable parallel programming using standard C++ code like the STL. It does not require special languages or compilers. If one needs a flexible and high-level parallelization approach that fits nicely in an abstract and even generic object-oriented approach, Intel TBB is an excellent choice. Intel TBB uses templates for common parallel iteration patterns and supports scalable data-parallel programming with nested parallelism. In comparison to the API approach, one specifies tasks rather than threads, and the library maps tasks onto threads in an efficient way using the Intel TBB runtime. The Intel TBB scheduler favors a single, automatic divide-and-conquer approach to scheduling. It implements task stealing, which moves tasks from loaded cores to idle ones. In comparison to OpenMP, the generic approach implemented in Intel TBB allows developer-defined parallelism structures that are not limited to built-in types.

The following table compares the different threading techniques available in Intel Parallel Composer:

Explicit Threading APIs
Description: Low-level APIs such as the Win32* Threading API and Pthreads* for low-level multi-threaded programming
Benefits:
  • Maximum control and flexibility
  • Does not need special compiler support
Caveats:
  • Relatively complex code to write, debug, and maintain; very time-consuming
  • All thread management and synchronization done by the programmer

OpenMP* (Enabled by /Qopenmp compiler option)
Description: A specification to support shared-memory parallel programming in C/C++ and Fortran through the use of APIs and compiler directives
Benefits:
  • Potential for large performance gain with relatively little effort
  • Good for rapid prototyping
  • Can be used for C/C++ and Fortran
  • Allows incremental parallelism using compiler directives
  • User control over what code to parallelize
  • Single-source solution for multiple platforms
  • Same code base for both serial and parallel version
Caveats:
  • Not much user control over threads, such as setting thread priorities or performing event-based or inter-process synchronization

Intel® Cilk™ Plus
Description: New keywords for C and C++ (cilk_spawn, cilk_sync, cilk_for), reducers to avoid race conditions, and array notations to take advantage of vectorization
Benefits:
  • Clean syntax that preserves serial organization and semantics of the original program
  • Multicore and SIMD in a single package
  • Composable, easy to reason about
Caveats:
  • Requires compiler support
  • No support for Fortran
  • No fine-grain control over threading

Intel® Threading Building Blocks
Description: Intel's C++ runtime library that simplifies threading for performance by providing parallel algorithms and concurrent data structures that eliminate tedious threading implementation work
Benefits:
  • Does not need special compiler support
  • Uses standard C++ code like STL
  • Automatic thread creation, management, and scheduling
  • Allows expressing parallelism in terms of tasks rather than threads
Caveats:
  • Mostly suited to C++ programs
  • No support for Fortran

Auto-Parallelization (Enabled by /Qparallel compiler option)
Description: A feature of the Intel® C++ Compiler to automatically parallelize loops with no loop-carried dependency in a program
Benefits:
  • Compiler automatically generates multi-threaded code for parallelizable loops
  • Can be used together with other threading techniques
Caveats:
  • Works on loops that the compiler can statically prove are parallelizable through data-dependency and aliasing analysis

Auto-Vectorization (Enabled by /arch: and /Qx options)
Description: Technique used to optimize loop performance through vector-level parallelism on Intel® processors by converting sequential instructions to SIMD instructions that can operate on multiple data elements at once
Benefits:
  • Automatic vector-level parallelism done by the compiler
  • Can be used together with other threading techniques
Caveats:
  • Resulting code may not run on all processors if processor-specific options are used

The article, "Solve the N-Queens problem in parallel," provides hands-on training about applying each of the parallelization techniques discussed in this document to implement a parallel solution to the N-Queens problem, which is a more general version of the Eight Queens Puzzle. Additional examples are provided in the "Samples" folder under the Intel® C++ Compiler installation folder.


Additional Resources

Intel® Developer Zone Parallel Programming Community

General information on Intel Compilers, documentation, White Papers, Knowledge Base

The Software Optimization Cookbook (2nd Edition): High-Performance Recipes for the Intel Architecture

Intel Developer Zone Forums

Additional information on OpenMP, including the complete specification and list of directives

Intel® Threading Building Blocks

Intel Threading Building Blocks for Open Source

James Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, Inc. Sebastopol, CA, 2007.
