| Last Modified On : | June 19, 2009 4:30 PM PDT |
Rate |
|
This document provides an overview of some of the popular techniques and technologies available for parallelizing applications and explains how Intel® C++ Compilers can help with the parallelization effort. The document introduces various parallelization technologies supported by the Intel C++ Compilers without getting into implementation details. The “Parallelizing N-Queens with the Intel® C++ Compiler” Tutorial referenced at the end of the document illustrates the utility and performance of the parallelization techniques discussed in this document.
The goal is to present some of the parallelization approaches that have been successfully used to thread many applications with the information needed for users to make an educated decision on which approach or approaches to use. The user can then drill down into the specific technical information for the selected approach as appropriate, to which this paper provides pointers.
For decades in the Giga Hertz era software applications enjoyed free performance boost due to ever increasing processor clock speeds. Faster clock speeds pushed the power envelope to a point where increases in processor power requirements made the trend unsustainable. That was the point where the shift to multicore architectures was made. In the multicore era, microprocessors with multiple processing cores provide the potential for dramatic increases in performance at substantially reduced power requirements. To effectively take advantage of the performance capabilities inherent in multicore processors, existing serial applications need to be parallelized.
Parallelization Techniques
| Directive | Description | ||
|---|---|---|---|
| #pragma omp parallel for [clause] … for - loop | Parallelizes the loop that immediately follows the parallel for directive. Execution of the parallel loop is controlled through optional clauses such as “private”,“shared”, “schedule”, etc. | ||
| #pragma omp parallel sections [clause] … { [#pragma omp section structured-block] … } |
Distributes the execution of the different sections among the threads in the parallel team. Each section is executed once, and each thread executes zero or more sections. | ||
| #pragma omp master structured-block | The code contained within the master construct is executed by the master thread in the thread team. | ||
| #pragma omp critical [ (name) ] structured-block | Provides mutual exclusion access to the structured-block. Only one critical section is allowed to execute at one time anywhere in the program. | ||
| #pragma omp barrier | Used to synchronize the execution of multiple threads within a parallel region. Ensures all the code occurring before the barrier has been completed by all the threads, before any thread can execute any of the code past the barrier directive. | ||
| #pragma omp atomic expression-statement | Provides mutual exclusion via hardware synchronization primitives. While a critical section provides mutual exclusion access to a block of code, the atomic directive provides mutual access to a single assignment statement. | ||
| #pragma omp threadprivate (list) | Identifies a global variable as being private to each thread. | ||
void sp_1a(float a[], float b[], int n) {
int i;
#pragma omp parallel shared(a,b,n) private(i)
{
#pragma omp for
for (i = 0; i < n; i++)a[i] = 1.0 / a[i];
#pragma omp single
a[0] = a[0] * 10;
#pragma omp for nowait
for (i = 0; i < n; i++)
b[i] = b[i] / a[i];
}
}
icl /c /Qopenmp par1.cpp
par2.cpp(5): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(10): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
par2.cpp(3): (col. 3) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
The /Qopenmp-report[n] compiler option where “n” is a number between 0 and 2 can be used to control the OpenmMP* parallelizer’s level of diagnostic messages. To use this option, you must also specify the /Qopenmp option. If you do not specify n, the default option is /Qopenmp-report1which displays diagnostic messages indicating loops, regions, and sections successfully parallelized. Because only directives are inserted into the code, it is possible to make incremental code changes. The ability to make incremental code changes helps programmers maintain serial consistency. When the code is run on one processor, it gives the same result as the unmodified source code. OpenMP is a single source code solution that supports multiple platforms and operating systems. There is also no need to determine the number of cores as the OpenMP runtime chooses the right number for you. The latest OpenMP version 3.0, contains a new task-level parallelism construct that simplifies parallelizing functions, in addition to the loop-level parallelism for which OpenMP is most commonly used.
One of the new features in OpenMP 3.0 is the tasking model that allows parallelizing programs with irregular patterns of dynamic data structures or with complicated control structures like recursion that are hard to parallelize efficiently. The task pragmas operate in the context of a parallel region, and create explicit tasks. When a task pragma is encountered lexically within a parallel region, the code inside the task block is conceptually queued to be executed by one of the threads executing the parallel region. To preserve sequential semantics, all tasks queued within the parallel region are guaranteed to complete by the end of the parallel region. The user is responsible for ensuring that no dependencies exist or that dependencies are appropriately synchronized, between explicit tasks, or between code inside and outside explicit tasks.
#pragma omp parallel
#pragma omp single
{
for(int i=0; i<size; i++) {
#pragma omp task
setQueen(new int[size], 0, i, myid);
}
__par for (i = 0; i < size; i++)
setSize (new int[size], 0, i)
__taskcomplete {
__task sum(500, a, b, c);
__task sum(500, a+500, b+500, c+500)
)
if ( !found )
__critical item_count++;
In order for the application to benefit from the parallelism afforded by these keywords, the compiler switch /Qopenmp must be used during compilation. The compiler will link in the appropriate runtime support libraries, and the runtime system will manage the actual degree of parallelism. The parallel extensions utilize the OpenMP 3.0 runtime library, but abstract out the use of the OpenMP pragmas and directives, keeping the code more naturally written in C or C++. The mapping between the parallelism extensions and the OpenMP constructs are as follows:
| Parallel Extension | OpenMP | |
|---|---|---|
| __par | == | #pragma omp parallel for |
| __critical | == | #pragma omp critical |
| __taskcomplete S1 | == |
#pragma omp parallel |
| __task S2 |
== | #pragma omp task { S2 } |
#include “tbb/ParallelFor.h”
#include “tbb/BlockedRange2D.h”
void solve() {
parallel_for(blocked_range<size_t>(0, size, 1), [](const blocked_range<int> &r) {
for (int i = r.begin(); i != r.end(); ++i)
setQueen(new int[size], 0, (int)i);
});
}
The TBB task scheduler does the load balancing for you. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right. By breaking your program into many small tasks, the Intel TBB scheduler assigns tasks to threads in a way that spreads out the work evenly.
Both Intel® C++ Compiler and TBB support the new C++0x lambda functions which make STL/TBB algorithms much easier to use. In order to use Intel's implementation of lambda expressions, you need to compile the code with the /Qstd=c++0X compiler option.
void run_threaded_loop (int num_thr, size_t size, int _queens[]) {
HANDLE* threads = new HANDLE[num_thr];
thr_params* params = new thr_params[num_thr];for (int i = 0; i < num_thr; ++i) {
// Give each thread equal number of rows
params[i].start = i * (size/num_thr);
params[i].end = params[i].start + (size/num_thr);
params[i].queens = _queens;
// Pass argument-pointer to a different
// memory for each thread's parameter to avoid data races
threads[i] = CreateThread (NULL, 0, run_solve,
static_cast<void *> (¶ms[i]), 0, NULL);
}
// Join threads: wait until all threads are done
WaitForMultipleObjects (num_thr, threads, true, INFINITE);// Free memory
delete[] params;
delete[] threads;
}
Additional Parallelization Techniques
#define N 10000
float a[N], b[N], c[N];
void f1() {
for (int i = 1; i < N; i++)
c[i] = a[i] + b[i];}
icl /c /Qparallel par1.cpp
par1.cpp(5): (col. 4) remark: LOOP WAS AUTO-PARALLELIZED.
By default the auto-parallelizer reports which loop got successfully auto-parallelized. Using the /Qpar-report[n]option where "n" is a number between 0 and 3 the auto-parallelizer can report diagnostic information about auto-parallelized loops and those that did not get auto-parallelized. For example, the /Qpar-report3 tells the auto-parallelizer to report diagnostics messages for loops successfully and unsuccessfully auto-parallelized plus information about any proven or assumed dependencies inhibiting auto-parallelization. Using the diagnostics information you may be able to restructure the loops so they get auto-parallelized.
#define N 10000
float a[N], b[N], c[N];
void f1() {
for (int i = 1; i < N; i++)
c[i] = a[i] + b[i];
}
icl /c /QxSSE4.2 par1.cpp
par1.cpp(5): (col. 4) remark: LOOP WAS VECTORIZED.
By default the vectorizer reports which loop got vectorized. Using the /Qvec-report[n]option where "n" is a number between 0 and 5 the vectorizer can report diagnostic information about vectorized and non-vectorized loops. For example, the /Qvec-report5option tells the vectorizer to report on non-vectorized loops and the reason why they were not vectorized. Using the diagnostics information you may be able to restructure the loops so they get vectorized.
The parallelism methods OpenMP, TBB, and Win32* Threading API and Pthreads* can be categorized in terms of abstraction, control, and simplicity. TBB and the API models do not require specific compiler support but OpenMP does. The use of OpenMP requires that you compile with a compiler that recognizes OpenMP directives. The API based models require the programmer to manually map concurrent tasks to threads. There is no explicit parent-child relationship between the threads – all threads are peers. These models give the programmer control over all low-level aspects of thread creation, management, and synchronization. This flexibility is the key advantage of library-based threading methods. The price of this flexibility is significant code modifications and a lot more coding. Concurrent tasks must be encapsulated in functions that can be mapped to threads. The other drawback is that most threading API’s use arcane calling conventions and only accept one argument. So it is often necessary to modify function prototypes and data structures that may break the abstraction of the program design which fits better in a C than an Object-oriented C++ approach.
OpenMP, a compiler-based threading method, provides a high-level interface to the underlying thread libraries. With OpenMP, the programmer uses OpenMP directives to describe parallelism to the compiler. This removes much of the complexity of explicit threading methods because the compiler handles the details. Due to the incremental approach to parallelism where the serial structure of the application stays intact there are no significant source code modifications necessary. A non OpenMP compiler simply ignores the OpenMP directives, leaving the underlying serial code intact. However, much of the fine control over threads is lost. Among other things, OpenMP does not give the programmer a way to set thread priorities or perform event-based or inter-process synchronization. OpenMP is a fork-join threading model with an explicit master-worker relationship among threads. This narrows the range of problems for which OpenMP is suited. In general, OpenMP is best suited for expressing data parallelism while explicit threading API methods are best suited for functional decomposition. OpenMP is well known for its support for loop structures and C code but it offers nothing specific for C++. The OpenMP version 3.0 supports tasking which extends OpenMP by adding support for irregular constructs such as while loops and recursive structures. But still OpenMP remains reminiscent of plain C and FORTRAN programming with minimal support for C++.
The Threading Building Blocks library supports generic scalable parallel programming using standard C++ code like the Standard Template Library (STL). It does not require special languages or compilers. If one needs a flexible and high level parallelization approach which fits nicely in an abstract and even generic object oriented approach then TBB is the right choice. TBB uses templates for common parallel iteration patterns and supports scalable data parallel programming with nested parallelism. In comparison to the API approach one specifies tasks, not threads, and let the library map tasks onto threads in an efficient way using the TBB runtime. The TBB scheduler favors a single, automatic, divide-and-conquer approach to scheduling. It implements task stealing which moves tasks from loaded core to idle ones. In comparison to OpenMP the generic approach implemented in TBB allows the user to work with user defined parallelism structures which are not limited to built-in types.
Table 1: Summary of Parallelization Techniques
| Method | Description | Benefits | Caveats |
| Explicit Threading APIs | Low level Application Programming Interfaces such as Win32* Threading API, and Pthreads* for low level multi-threaded programming | - Maximum control and flexibility
- Does not need special compiler support |
- Relatively complex code to write, debug, and maintain – very time-consuming
- All thread management and synchronization done by the programmer |
| OpenMP*
(Enabled by /Qopenmp compiler option) |
A specification defined by OpenMP.org to support shared-memory parallel programming in C/C++ and Fortran through the use of APIs and compiler directives. | - Potential for big performance gain with relatively low effort
- Good for rapid prototyping - Can be used for C/C++, and Fortran - Allows incremental parallelism by the use of compiler directives - User control over what code to parallelize - Single source solution for multiple platforms - Same code base for both serial and parallel version |
- Not much user control over threads such as setting thread priorities or performing event-based or inter-process synchronization.
|
| Intel Parallel Extensions (Enabled by /Qopenmp compiler option) | Intel® C++ language extensions (__tasktemplate, __task, __par, __critical) to simplify parallel programming | - Simple syntax to express parallelism
- Easier to use syntax than OpenMP for C++ applications |
- Required compiler support
- No support for Fortran - The syntax expands to OpenMP* syntax so the same OpenMP restrictions apply. |
| Intel Threading Building Blocks | Intel’s C++ runtime library that simplifies threading for performance by providing parallel algorithms and concurrent data structures that eliminate tedious threading implementation work. | - Does not need special compiler support
- Uses standard C++ code like Standard Template Library - All thread creation, management, scheduling done for the user - Allows expressing parallelism in terms of tasks rather than threads |
- Mostly suited to C++ programs
- No support for Fortran |
| Auto-Vectorization
(Enabled by /arch: and /Qx options) |
Technique used to optimize loop performance through vector level parallelism on Intel® Processors by converting sequential instructions to single SIMD instructions that can operate on multiple data elements at once. | - Automatic vector level parallelism done by the compiler
- Can be used with other threading techniques |
- Resulting code may not run on all processors if processor specific options are used. |
| Auto-Parallelization (Enabled by /Qparallel compiler option) | A feature of the Intel® C++ Compiler to automatically parallelize simple loops with no loop-carried dependency in a program | - Compiler automatically generates multi-threaded code for parallelizable loops
- Can be used with other threading techniques |
- Works on loops that compiler can statically prove parallelizable through data-dependency and aliasing analysis |
The "Solve the N-Queens problem in parallel" article provides a hands-on training on how to apply each of the parallelization techniques discussed in this document to implement a parallel solution to the N-Queens problem which is a more general version of the Eight Queens Puzzle. Additional examples are provided in the "Samples" folder under the Intel® C++ Compiler installation folder.
Visit our online Community Support User Forums and Knowledge Base to get all of the help you need from our own tools and parallelism experts, and your fellow developers. Go to www.intel.com/software/support to start your search.
Additional information related to the Intel® Parallel Composer is available:
For product and purchase information visit:
www.intel.com/software/products
Intel, the Intel logo, Intel. Leap ahead. and Intel. Leap ahead. logo, Pentium, Intel Core, Itanium and IA-64 are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.
Copyright © 2008, Intel Corporation. All Rights Reserved.
| April 28, 2009 7:39 PM PDT
toc21c | good!! :) |
| May 9, 2009 2:07 AM PDT
creativesensee
| This is really too good |

English | 中文 | Русский | Français
Mark Sabahi (Intel)
|
sujeet