An Overview of the Parallelization Implementation Methods in Intel(R) C++ Compilers

Submit New Article

Last Modified On :   June 19, 2009 4:30 PM PDT
Rate
 



Introduction

This document provides an overview of some of the popular techniques and technologies available for parallelizing applications and explains how Intel® C++ Compilers can help with the parallelization effort. The document introduces various parallelization technologies supported by the Intel C++ Compilers without getting into implementation details. The “Parallelizing N-Queens with the Intel® C++ Compiler” Tutorial referenced at the end of the document illustrates the utility and performance of the parallelization techniques discussed in this document.

The goal is to present some of the parallelization approaches that have been successfully used to thread many applications with the information needed for users to make an educated decision on which approach or approaches to use. The user can then drill down into the specific technical information for the selected approach as appropriate, to which this paper provides pointers.



NOTE: This document uses and references compiler options specific to the Intel® C++ Compiler for Windows*.  For equivalent options supported by the Intel® C++ Compiler for Linux* and Mac OS* X please refer to the Intel® C++ Compiler User and Reference Guides.  Except for the Intel® C++ Compiler Language Extensions for Parallelism (__taskcomplete, __task, __par, and __critical) all features discussed in this article equally apply to the Intel® C++ Compilers for Linux* and Mac OS* X.



Overview

For decades in the Giga Hertz era software applications enjoyed free performance boost due to ever increasing processor clock speeds. Faster clock speeds pushed the power envelope to a point where increases in processor power requirements made the trend unsustainable. That was the point where the shift to multicore architectures was made. In the multicore era, microprocessors with multiple processing cores provide the potential for dramatic increases in performance at substantially reduced power requirements. To effectively take advantage of the performance capabilities inherent in multicore processors, existing serial applications need to be parallelized.

Need%20for%20Parallelism%20Graph

 

Parallelization Techniques

  • Parallelization using OpenMP*. OpenMP* is an industry standard for portable multi-threaded application development. The Intel® C++ Compiler supports the OpenMP C/C++ version 3.0 API specification available at the OpenMP web site http://www.openmp.org.  Parallelism with OpenMP is controllable by the user through the use of OpenMP directives. This approach is effective at fine-grain (loop-level) and large-grain (function-level) threading. OpenMP directives provide an easy and powerful way to convert serial applications into parallel applications, enabling potentially big performance gains from parallel execution on multi-core and symmetric multiprocessor systems. The directives are enabled with the /Qopenmp compiler option and will be treated as comments without the compiler option. This allows building both the serial and parallel versions of the application from the same source code. For shared memory parallel computers, this allows for simple comparisons between serial and parallel runs.

 


Commonly used OpenMP* Directives

 

DirectiveDescription
#pragma omp parallel for [clause] … for - loop Parallelizes the loop that immediately follows the parallel for directive. Execution of the parallel loop is controlled through optional clauses such as “private”,“shared”, “schedule”, etc.
#pragma omp parallel sections [clause] …
{
[#pragma omp section structured-block] …
}
Distributes the execution of the different sections among the threads in the parallel team. Each section is executed once, and each thread executes zero or more sections.
#pragma omp master structured-block The code contained within the master construct is executed by the master thread in the thread team.
#pragma omp critical [ (name) ] structured-block Provides mutual exclusion access to the structured-block. Only one critical section is allowed to execute at one time anywhere in the program.
#pragma omp barrier Used to synchronize the execution of multiple threads within a parallel region. Ensures all the code occurring before the barrier has been completed by all the threads, before any thread can execute any of the code past the barrier directive.
#pragma omp atomic expression-statement Provides mutual exclusion via hardware synchronization primitives. While a critical section provides mutual exclusion access to a block of code, the atomic directive provides mutual access to a single assignment statement.
#pragma omp threadprivate (list) Identifies a global variable as being private to each thread.



 


OpenMP* Parallelization

void sp_1a(float a[], float b[], int n) {
int i;
#pragma omp parallel shared(a,b,n) private(i)
{
#pragma omp for
for (i = 0; i < n; i++)

a[i] = 1.0 / a[i];


#pragma omp single
a[0] = a[0] * 10;
#pragma omp for nowait
for (i = 0; i < n; i++)
b[i] = b[i] / a[i];
}
}
icl /c /Qopenmp par1.cpp
par2.cpp(5): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(10): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(3): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

 

The /Qopenmp-report[n] compiler option where “n” is a number between 0 and 2 can be used to control the OpenmMP* parallelizer’s level of diagnostic messages. To use this option, you must also specify the /Qopenmp option. If you do not specify n, the default option is /Qopenmp-report1which displays diagnostic messages indicating loops, regions, and sections successfully parallelized. Because only directives are inserted into the code, it is possible to make incremental code changes. The ability to make incremental code changes helps programmers maintain serial consistency. When the code is run on one processor, it gives the same result as the unmodified source code. OpenMP is a single source code solution that supports multiple platforms and operating systems. There is also no need to determine the number of cores as the OpenMP runtime chooses the right number for you. The latest OpenMP version 3.0, contains a new task-level parallelism construct that simplifies parallelizing functions, in addition to the loop-level parallelism for which OpenMP is most commonly used.

One of the new features in OpenMP 3.0 is the tasking model that allows parallelizing programs with irregular patterns of dynamic data structures or with complicated control structures like recursion that are hard to parallelize efficiently. The task pragmas operate in the context of a parallel region, and create explicit tasks. When a task pragma is encountered lexically within a parallel region, the code inside the task block is conceptually queued to be executed by one of the threads executing the parallel region. To preserve sequential semantics, all tasks queued within the parallel region are guaranteed to complete by the end of the parallel region. The user is responsible for ensuring that no dependencies exist or that dependencies are appropriately synchronized, between explicit tasks, or between code inside and outside explicit tasks.


OpenMP* 3.0 Tasking

#pragma omp parallel
#pragma omp single
{
for(int i=0; i<size; i++) {
#pragma omp task
setQueen(new int[size], 0, i, myid);
}

 

  • Intel® C++ Compiler Language Extensions for Parallelism. The Intel® Compiler uses simple C and C++ language extensions to make parallel programming easier. There are four keywords introduced within this version of the compiler:
    __taskcomplete, __task, __par, and __critical. These are used as statement prefixes.

 


Simple Language Extensions

__par for (i = 0; i < size; i++)
setSize (new int[size], 0, i)

__taskcomplete {
__task sum(500, a, b, c);
__task sum(500, a+500, b+500, c+500)
)

if ( !found )
__critical item_count++;

 

In order for the application to benefit from the parallelism afforded by these keywords, the compiler switch /Qopenmp must be used during compilation. The compiler will link in the appropriate runtime support libraries, and the runtime system will manage the actual degree of parallelism. The parallel extensions utilize the OpenMP 3.0 runtime library, but abstract out the use of the OpenMP pragmas and directives, keeping the code more naturally written in C or C++. The mapping between the parallelism extensions and the OpenMP constructs are as follows:

 

Parallel Extension OpenMP
__par == #pragma omp parallel for
__critical == #pragma omp critical
__taskcomplete S1 ==

#pragma omp parallel
#pragma omp single
{ S1 }

__task S2
== #pragma omp task
{ S2 }

 

  • Threading Building Blocks (TBB).Intel® Threading Building Blocks (TBB) offers a rich methodology to express parallelism in a C++ program. It is a library that helps you take advantage of multi-core processor performance. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanism for performance and scalability. It nicely fits into the object oriented and generic framework of C++. TBB uses a runtime based programming model and provides the user with generic parallel algorithms based on a template library similar to the standard template library (STL).

 


Threading Building Blocks Code Sample

#include “tbb/ParallelFor.h”
#include “tbb/BlockedRange2D.h”
void solve() {
parallel_for(blocked_range<size_t>(0, size, 1), [](const blocked_range<int> &r) {
for (int i = r.begin(); i != r.end(); ++i)
setQueen(new int[size], 0, (int)i);
});
}

 

The TBB task scheduler does the load balancing for you. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right. By breaking your program into many small tasks, the Intel TBB scheduler assigns tasks to threads in a way that spreads out the work evenly.

Both Intel® C++ Compiler and TBB support the new C++0x lambda functions which make STL/TBB algorithms much easier to use. In order to use Intel's implementation of lambda expressions, you need to compile the code with the /Qstd=c++0X compiler option.

  • Win32* Threading API, Pthreads*. In some cases developers would like to choose the flexibility of a native threading API. The main advantage of the API approach is that the user has more control and power over threading than with the threading abstractions discussed earlier. But the amount of code it takes to implement a given solution is higher as the programmer has to implement all the tedious thread implementation tasks such as creation, scheduling, synchronization, local storage, load balancing, and destruction which in the other cases are handled by the runtime system. Also the number of cores which influences the number of created threads has to be determined which may not be so easy if you have a platform-independent solution in mind.

 


Windows* Threading API

void run_threaded_loop (int num_thr, size_t size, int _queens[]) {
HANDLE* threads = new HANDLE[num_thr];
thr_params* params = new thr_params[num_thr];

for (int i = 0; i < num_thr; ++i) {
// Give each thread equal number of rows
params[i].start = i * (size/num_thr);
params[i].end = params[i].start + (size/num_thr);
params[i].queens = _queens;
// Pass argument-pointer to a different
// memory for each thread's parameter to avoid data races
threads[i] = CreateThread (NULL, 0, run_solve,
static_cast<void *> (&params[i]), 0, NULL);
}
// Join threads: wait until all threads are done
WaitForMultipleObjects (num_thr, threads, true, INFINITE);

// Free memory
delete[] params;
delete[] threads;
}

 

  • Threaded Libraries. Another way to add parallelism to an application is to use threaded libraries such as Intel® Math Kernel Library (Intel® MKL) and Intel® Performance Primitives (Intel® IPP). Intel MKL offers highly optimized threaded math routines for maximum performance. MKL uses OpenMP* for threading. To take advantage of threaded MKL functions, simply set OMP_NUM_THREADS environment variable to a value greater than one. MKL does have internal thresholds to determine whether to perform a computation in parallel or serial. You can also apply your own thresholds using the OpenMP API, specifically the omp_set_num_threads function. The online technical notes have some additional information about MKL parallelism on Windows* and Linux*.  IPP is an extensive library of multi-core-ready, highly optimized software functions for multimedia data processing, and communications applications. IPP uses OpenMP for threading, as well. The online technical notes provide more information about IPP threading and OpenMP support. The Intel® C++ Compiler also provides an implementation of the Standard C++ Template Library (STL) valarray using Intel® IPP for data parallel performance of math and transcendental operations. C++ valarray template class consists of array operations that allow for high performance computing. These operations are designed to exploit low level hardware features such as vectorization. The Intel implementation of valarray provides Intel IPP optimized versions of several valarray operations through an optimized replacement valarray header file without requiring any source code changes.  To optimize valarray loops with Intel Optimized Performance header files you must use the /Quse-intel-optimized-headers compiler option.

Additional Parallelization Techniques

  • Auto-Parallelization. Auto-parallelization is a feature of the Intel C++ Compiler.  In Auto-parallelization mode the compiler automatically detects parallelism inherent in the program. The auto-parallelizer analyzes the dataflow of the loops in the application source code and generates multithreaded code for those loops which can safely and efficiently be executed in parallel. If data dependencies are present, loop restructuring may be needed for the loops to be auto parallelized. In Auto-parallelization mode all parallelization decisions are made by the compiler and the user does not have any control over which loops are to be parallelized.  Auto-parallelization can be combined with OpenMP* to achieve higher performance. When combining OpenMP and Auto-parallelization, OpenMP will be used to parallelize loops containing OpenMP directives and Auto-parallelization will be used to parallelize non-OpenMP loops.  Auto-parallelization is enabled with the /Qparallel compiler option.

Auto-parallelization

#define N 10000
float a[N], b[N], c[N];

void f1() {
for (int i = 1; i < N; i++)
c[i] = a[i] + b[i];

}

icl /c /Qparallel par1.cpp
par1.cpp(5): (col. 4) remark: LOOP WAS AUTO-PARALLELIZED.

 

By default the auto-parallelizer reports which loop got successfully auto-parallelized. Using the /Qpar-report[n]option where "n" is a number between 0 and 3 the auto-parallelizer can report diagnostic information about auto-parallelized loops and those that did not get auto-parallelized. For example, the /Qpar-report3 tells the auto-parallelizer to report diagnostics messages for loops successfully and unsuccessfully auto-parallelized plus information about any proven or assumed dependencies inhibiting auto-parallelization. Using the diagnostics information you may be able to restructure the loops so they get auto-parallelized.

  • Auto-Vectorization. Vectorization is the technique used to optimize loop performance on Intel® Core Architecture processors and Intel IA32 processors. Parallelism defined by vectorization technique is based on vector-level parallelism (VLP) made possible by the processor's SIMD hardware. The Auto-vectorizer in the Intel® C++ Compiler automatically detects low-level operations in the program that can be done in parallel, and then converts the sequential code to process 2-, 4-, 8-, or up to 16-byte data elements in one operation with extensions up to 32- and 64-byte in the future processors. Loops need to be independent for the compiler to auto vectorize. Auto-vectorization can be used in conjunction with other thread-level parallelization techniques such as Auto-parallelization and OpenMP discussed earlier. Most Floating-point applications and some integer applications can benefit from Vectorization.  The default vectorization level is /arch:SSE2 which generates code for Intel® Streaming SIMD Extensions 2 (Intel® SSE2). To enable auto-vectorization other than the default target, use /arch (e.g. /arch:SSE4.1) or /Qx (e.g /QxSSE4.2, QxHost) compiler options.


    The figure below shows the serial execution of the loop iterations on the left without Vectorization where the lower parts of the SIMD registers are not utilized. The Vectorized version on the right shows four elements of the A and B arrays are added in parallel for each iteration of the loop utilizing the full width of the SIMD registers.

    DataParallelism.JPG


    Auto-vectorization

    #define N 10000
    float a[N], b[N], c[N];

    void f1() {
    for (int i = 1; i < N; i++)
    c[i] = a[i] + b[i];
    }

    icl /c /QxSSE4.2 par1.cpp
    par1.cpp(5): (col. 4) remark: LOOP WAS VECTORIZED.

     

    By default the vectorizer reports which loop got vectorized. Using the /Qvec-report[n]option where "n" is a number between 0 and 5 the vectorizer can report diagnostic information about vectorized and non-vectorized loops. For example, the /Qvec-report5option tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized. Using the diagnostics information you may be able to restructure the loops so they get vectorized.

     


    Tradeoff between Different Methods

    The parallelism methods OpenMP, TBB, and Win32* Threading API and Pthreads* can be categorized in terms of abstraction, control, and simplicity. TBB and the API models do not require specific compiler support but OpenMP does. The use of OpenMP requires that you compile with a compiler that recognizes OpenMP directives. The API based models require the programmer to manually map concurrent tasks to threads. There is no explicit parent-child relationship between the threads – all threads are peers. These models give the programmer control over all low-level aspects of thread creation, management, and synchronization. This flexibility is the key advantage of library-based threading methods. The price of this flexibility is significant code modifications and a lot more coding. Concurrent tasks must be encapsulated in functions that can be mapped to threads. The other drawback is that most threading API’s use arcane calling conventions and only accept one argument. So it is often necessary to modify function prototypes and data structures that may break the abstraction of the program design which fits better in a C than an Object-oriented C++ approach.

    OpenMP, a compiler-based threading method, provides a high-level interface to the underlying thread libraries. With OpenMP, the programmer uses OpenMP directives to describe parallelism to the compiler. This removes much of the complexity of explicit threading methods because the compiler handles the details. Due to the incremental approach to parallelism where the serial structure of the application stays intact there are no significant source code modifications necessary. A non OpenMP compiler simply ignores the OpenMP directives, leaving the underlying serial code intact. However, much of the fine control over threads is lost. Among other things, OpenMP does not give the programmer a way to set thread priorities or perform event-based or inter-process synchronization. OpenMP is a fork-join threading model with an explicit master-worker relationship among threads. This narrows the range of problems for which OpenMP is suited. In general, OpenMP is best suited for expressing data parallelism while explicit threading API methods are best suited for functional decomposition. OpenMP is well known for its support for loop structures and C code but it offers nothing specific for C++. The OpenMP version 3.0 supports tasking which extends OpenMP by adding support for irregular constructs such as while loops and recursive structures. But still OpenMP remains reminiscent of plain C and FORTRAN programming with minimal support for C++.

    The Threading Building Blocks library supports generic scalable parallel programming using standard C++ code like the Standard Template Library (STL). It does not require special languages or compilers. If one needs a flexible and high level parallelization approach which fits nicely in an abstract and even generic object oriented approach then TBB is the right choice. TBB uses templates for common parallel iteration patterns and supports scalable data parallel programming with nested parallelism. In comparison to the API approach one specifies tasks, not threads, and let the library map tasks onto threads in an efficient way using the TBB runtime. The TBB scheduler favors a single, automatic, divide-and-conquer approach to scheduling. It implements task stealing which moves tasks from loaded core to idle ones. In comparison to OpenMP the generic approach implemented in TBB allows the user to work with user defined parallelism structures which are not limited to built-in types.

    Table 1: Summary of Parallelization Techniques

     

    Method Description Benefits Caveats
    Explicit Threading APIs Low level Application Programming Interfaces such as Win32* Threading API, and Pthreads* for low level multi-threaded programming - Maximum control and flexibility

    - Does not need special compiler support

    - Relatively complex code to write, debug, and maintain – very time-consuming

    - All thread management and synchronization done by the programmer

    OpenMP*

    (Enabled by /Qopenmp compiler option)

    A specification defined by OpenMP.org to support shared-memory parallel programming in C/C++ and Fortran through the use of APIs and compiler directives. - Potential for big performance gain with relatively low effort

    - Good for rapid prototyping

    - Can be used for C/C++, and Fortran

    - Allows incremental parallelism by the use of compiler directives

    - User control over what code to parallelize

    - Single source solution for multiple platforms

    - Same code base for both serial and parallel version

    - Not much user control over threads such as setting thread priorities or performing event-based or inter-process synchronization.

     

    Intel Parallel Extensions (Enabled by /Qopenmp compiler option) Intel® C++ language extensions (__tasktemplate, __task, __par, __critical) to simplify parallel programming - Simple syntax to express parallelism

    - Easier to use syntax than OpenMP for C++ applications

    - Required compiler support

    - No support for Fortran

    - The syntax expands to OpenMP* syntax so the same OpenMP restrictions apply.

    Intel Threading Building Blocks Intel’s C++ runtime library that simplifies threading for performance by providing parallel algorithms and concurrent data structures that eliminate tedious threading implementation work. - Does not need special compiler support

    - Uses standard C++ code like Standard Template Library

    - All thread creation, management, scheduling done for the user

    - Allows expressing parallelism in terms of tasks rather than threads

    - Mostly suited to C++ programs

    - No support for Fortran

    Auto-Vectorization

    (Enabled by /arch: and /Qx options)

    Technique used to optimize loop performance through vector level parallelism on Intel® Processors by converting sequential instructions to single SIMD instructions that can operate on multiple data elements at once. - Automatic vector level parallelism done by the compiler

    - Can be used with other threading techniques

    - Resulting code may not run on all processors if processor specific options are used.

    Auto-Parallelization (Enabled by /Qparallel compiler option) A feature of the Intel® C++ Compiler to automatically parallelize simple loops with no loop-carried dependency in a program - Compiler automatically generates multi-threaded code for parallelizable loops

    - Can be used with other threading techniques

    - Works on loops that compiler can statically prove parallelizable through data-dependency and aliasing analysis

     

    The "Solve the N-Queens problem in parallel" article provides a hands-on training on how to apply each of the parallelization techniques discussed in this document to implement a parallel solution to the N-Queens problem which is a more general version of the Eight Queens Puzzle. Additional examples are provided in the "Samples" folder under the Intel® C++ Compiler installation folder.

     

     

    Technical Support

    Visit our online Community Support User Forums and Knowledge Base to get all of the help you need from our own tools and parallelism experts, and your fellow developers. Go to www.intel.com/software/support to start your search.

     


    References

    Additional information related to the Intel® Parallel Composer is available:

     

    Intel%20Logo

     

    For product and purchase information visit:

    www.intel.com/software/products

    Intel, the Intel logo, Intel. Leap ahead. and Intel. Leap ahead. logo, Pentium, Intel Core, Itanium and IA-64 are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

    *Other names and brands may be claimed as the property of others.

    INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

    Copyright © 2008, Intel Corporation. All Rights Reserved.

     

     

     






  • This article applies to: Intel® C++ Compiler for Linux* Knowledge Base,   Intel® C++ Compiler for Mac OS X* Knowledge Base,   Intel® C++ Compiler for Windows* Knowledge Base,   Intel® Parallel Composer Knowledge Base