User Guide

Glossary

Amdahl's law:
A theoretical formula for predicting the maximum performance benefit of parallelizing an application program. Amdahl's law states that the achievable speedup in execution time is limited by the part of the program that is not parallelized (the part that executes serially). To achieve results close to this potential, overhead must be minimized and all cores must be fully utilized.
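As a rough illustration of the limit described above, the usual statement of Amdahl's law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of cores. The function name and parameters below are hypothetical and chosen only for this sketch; the formula ignores overhead and load imbalance.

```cpp
#include <cassert>

// Predicted speedup under Amdahl's law.
// parallel_fraction: share of serial run time that can run in parallel (0..1).
// cores: number of cores assumed to be fully utilized.
// Hypothetical helper for illustration; overhead is not modeled.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, with 90% of the run time parallelizable and 4 cores, the predicted speedup is about 3.08, and it stays below 10 no matter how many cores are added.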
annotation:
A method of conveying information about proposed parallel execution. In Intel® Advisor, you create annotations by adding macros or function calls. Intel Advisor tools use these annotations to predict parallel execution. For example, the C/C++ ANNOTATE_SITE_BEGIN(sitename) macro identifies where a parallel site begins. Later, to allow this code to execute in parallel, you replace the annotations with the code needed to use a parallel framework. See also parallel framework.
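A minimal sketch of an annotated loop might look like the following. It assumes the Intel Advisor C/C++ header advisor-annotate.h is on the include path; when it is not, the sketch falls back to empty stand-in macros so the code still compiles. The site name, task name, and function name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Use the real annotation macros when the Intel Advisor header is available.
#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#    define HAVE_ADVISOR_ANNOTATIONS 1
#  endif
#endif

// Assumption: without the header, empty stand-ins keep the sketch compilable.
#ifndef HAVE_ADVISOR_ANNOTATIONS
#  define ANNOTATE_SITE_BEGIN(name)
#  define ANNOTATE_SITE_END(...)
#  define ANNOTATE_ITERATION_TASK(name)
#endif

// Marks the loop as a proposed parallel site and each iteration as a task.
// The program still runs serially; the annotations only describe intent.
long sum_of_squares(const std::vector<int>& data) {
    long total = 0;
    ANNOTATE_SITE_BEGIN(sum_site);           // parallel site begins here
    for (std::size_t i = 0; i < data.size(); ++i) {
        ANNOTATE_ITERATION_TASK(square_task); // each iteration is one task
        total += static_cast<long>(data[i]) * data[i];
    }
    ANNOTATE_SITE_END();                      // parallel site ends here
    return total;
}
```

Note that every task in this sketch updates the shared variable total, which is the kind of sharing that would need to be resolved before the loop is actually parallelized.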
atomic operation:
An operation performed by a thread on one or more memory locations that is guaranteed not to be interfered with by other threads. See also synchronization.
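One way to see an atomic operation in practice is a shared counter incremented by several threads. The sketch below uses the C++ standard library type std::atomic rather than any Intel Advisor facility; the function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread increments a shared counter with an atomic read-modify-write,
// so no increment is lost even though the threads run concurrently.
int count_with_atomic(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();  // wait for all threads to finish
    return counter.load();
}
```

With a plain int instead of std::atomic<int>, the same loop would contain a data race and the final count could be lower than expected.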
chunking:
The ability of a parallel framework to aggregate multiple instances of a task into groups for more efficient parallel processing. For tasks that do small amounts of computation over many iterations, task chunking can minimize task overhead. You can also restructure a single loop into an inner and an outer loop (strip-mining).
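The strip-mining restructuring mentioned above can be sketched as follows: the outer loop walks over chunks, and the inner loop processes the elements of one chunk. In a parallel version, each outer iteration would be a candidate task. The struct and function names are hypothetical, and the sketch runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of a strip-mined sum: the total and how many chunks were formed.
struct ChunkedSum {
    long total;
    std::size_t chunks;
};

// Splits one loop over data into an outer loop over chunks and an inner
// loop over the elements of each chunk (strip-mining).
ChunkedSum sum_strip_mined(const std::vector<int>& data, std::size_t chunk_size) {
    assert(chunk_size > 0);
    ChunkedSum result{0, 0};
    for (std::size_t start = 0; start < data.size(); start += chunk_size) {
        const std::size_t stop = std::min(start + chunk_size, data.size());
        long partial = 0;  // private to this chunk
        for (std::size_t i = start; i < stop; ++i) {
            partial += data[i];
        }
        result.total += partial;
        ++result.chunks;
    }
    return result;
}
```

With 10 elements and a chunk size of 4, the outer loop runs 3 times instead of 10, so per-task overhead would be paid 3 times rather than once per element.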
code region:
A subtree of loops/functions in a call tree. Synonym: whole Loopnest.
critical section:
A synchronization construct that allows only one thread to enter its associated code region at a time. Critical sections enforce mutual exclusion on enclosed regions of code. With Intel Advisor, mark critical sections by using ANNOTATE_LOCK_ACQUIRE() and ANNOTATE_LOCK_RELEASE() annotations.
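After annotations are replaced with real parallel code, a critical section is often expressed with a mutex. The following sketch uses std::mutex and std::lock_guard from the C++ standard library as one possible choice, not as the construct any particular parallel framework requires; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Several threads add elements into one shared total. The lock_guard marks
// the critical section: only one thread at a time may update total.
long sum_with_critical_section(const std::vector<int>& data, int num_threads) {
    assert(num_threads > 0);
    long total = 0;
    std::mutex total_lock;
    const std::size_t stride = static_cast<std::size_t>(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < stride; ++t) {
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < data.size(); i += stride) {
                std::lock_guard<std::mutex> guard(total_lock);  // acquire
                total += data[i];
            }  // guard is destroyed here, releasing the lock
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Taking the lock on every element keeps the sketch short but adds overhead; accumulating a private partial sum per thread and locking once at the end would shrink the critical section.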
data race:
Occurs when multiple threads share (read or write) a memory location and the program does not implement controls to manage the sequence of concurrent memory accesses. One thread can then inadvertently overwrite data written by another thread, or otherwise read or write stale data. This can produce execution errors that are difficult to detect and reproduce, such as obtaining different calculated results when the same executable is run on different systems. To prevent data races, you can add data synchronization constructs that restrict shared memory access to one thread at a time, or you can eliminate the sharing.
data parallelism:
Occurs when a single portion of code is paired with multiple portions of data, and each pairing executes as a task. For example, tasks are made by pairing a loop body with each element of an array iterated by the loop, and the tasks execute in parallel. Contrast task parallelism.
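A minimal sketch of data parallelism, assuming plain std::thread workers instead of a parallel framework: the same loop body (squaring a value) is paired with different blocks of one array, and each block is processed by its own thread. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies the same operation to every element, with each thread handling
// a separate contiguous block, so no element is shared between threads.
std::vector<int> squared_in_parallel(std::vector<int> values, int num_threads) {
    assert(num_threads > 0);
    const std::size_t n = values.size();
    const std::size_t k = static_cast<std::size_t>(num_threads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < k; ++t) {
        const std::size_t begin = n * t / k;        // first index of block t
        const std::size_t end = n * (t + 1) / k;    // one past the last index
        workers.emplace_back([&values, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                values[i] *= values[i];
            }
        });
    }
    for (auto& w : workers) w.join();
    return values;
}
```

Because the blocks do not overlap, the threads never write the same element and no lock is needed.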
data set:
A set of data to be used as input or, for an interactive application, the way you interact with the application to cause a portion of it to be executed. Because the Dependencies tool watches each memory access in a parallel site in great detail, the parallel site's code takes much longer to run than usual. To limit the time needed to run a Dependencies analysis, reduce the data (such as the number of loop iterations) and, when using an interactive program, create a very small test case.
deadlock:
A situation where a set of threads have each acquired some locks and are waiting for other locks to be released. All threads in the set are waiting for a lock held by a different thread, and since none can proceed and release their lock(s), they all remain waiting.
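The situation above typically arises when two threads acquire the same two locks in opposite order. One common way to avoid it, sketched here with the C++ standard library function std::lock, is to acquire both locks in a single deadlock-free step. The Account type and function names are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical shared object protected by its own lock.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves money between two accounts. Locking from.m and then to.m one after
// the other could deadlock against an opposite transfer; std::lock acquires
// both locks together without that risk.
void transfer(Account& from, Account& to, long amount) {
    assert(&from != &to);
    std::unique_lock<std::mutex> lock_from(from.m, std::defer_lock);
    std::unique_lock<std::mutex> lock_to(to.m, std::defer_lock);
    std::lock(lock_from, lock_to);
    from.balance -= amount;
    to.balance += amount;
}

// Runs transfers in both directions at once and returns the combined balance.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Another widely used approach is to always acquire locks in one fixed global order.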
dynamic extent:
All code that may possibly be executed by a parallel site or task. For example, a dynamic extent might include a loop, all functions called from the loop, all functions the called functions may in turn call, and so on. Contrast static extent.
false positive:
When viewing the Dependencies Report, a problem reported by the Dependencies tool that is not an actual problem.
framework:
See parallel framework.
head:
A loop or function at the top of a subtree, which contains one or more child loops/functions.
hotspot:
A small code region that consumes much of the program's run time. Hotspots can be identified by a profiler, such as the Intel Advisor Survey tool.
Intel® Threading Building Blocks (Intel® TBB):
A C++ template library for writing programs that take advantage of multiple cores. You can use this library to write scalable programs that specify tasks rather than threads, emphasize data-parallel programming, and take advantage of concurrent collections and parallel algorithms. It is provided both as an Intel® software product, Intel® Threading Building Blocks (Intel® TBB), and as open source. Intel® Threading Building Blocks (Intel® TBB) is one of several parallel frameworks. Abbreviation: Intel TBB.
load balancing:
The equal division of work among cores. If the load is balanced, the cores are busy most of the time.
lock:
A synchronization mechanism that allows one thread to wait until another thread allows it to continue. A lock can be used to synchronize threads accessing a specific memory location. See also synchronization and nested lock.
multi-core:
A processor that combines two or more independent cores. Although each core shares the interconnection to the rest of the system, it executes instructions independently by using its dedicated CPU, architectural state, and interrupt controllers, as well as private and/or shared cache. Most multi-core systems use identical cores. The number of cores determines whether the system is called dual-core (2), quad-core (4), or many-core.
multithreaded processing:
See parallel processing.
mutual exclusion:
A type of locking typically used to prevent actions from occurring at the same time. Abbreviation: mutex. See also synchronization.
nested lock:
A type of lock that can be locked again by a task when the task already owns the lock. Nested locks are convenient when several inter-related functions use the same lock. See also synchronization and lock.
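As a sketch of why nested locks are convenient, the class below has two inter-related member functions that take the same lock, and one calls the other. It uses std::recursive_mutex from the C++ standard library as a stand-in for a nested lock; note that std::recursive_mutex tracks ownership per thread rather than per task. The class and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical container whose member functions share one nested lock.
class Inventory {
public:
    void add(int item) {
        std::lock_guard<std::recursive_mutex> guard(lock_);
        items_.push_back(item);
    }

    // Holds the lock and then calls add(), which locks it again on the same
    // thread. With a non-recursive mutex this second lock would not be safe.
    void add_pair(int first, int second) {
        std::lock_guard<std::recursive_mutex> guard(lock_);
        add(first);
        add(second);
    }

    std::size_t size() {
        std::lock_guard<std::recursive_mutex> guard(lock_);
        return items_.size();
    }

private:
    std::recursive_mutex lock_;
    std::vector<int> items_;
};

// Small usage helper: returns the number of stored items.
std::size_t nested_lock_demo() {
    Inventory inv;
    inv.add_pair(1, 2);
    inv.add(3);
    assert(inv.size() == 3);
    return inv.size();
}
```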
node:
A loop or function.
OpenMP*:
A high-level parallel framework and language extension designed to support shared-memory parallel programming that consists of compiler directives (C/C++ pragmas and Fortran directives), library functions, and environment variables. The OpenMP specification was developed by multiple hardware and software vendors to provide a scalable, portable interface for parallel programming on a variety of platforms. OpenMP is one of several parallel frameworks. See also http://openmp.org.
parallel framework:
A combination of libraries, language features, or other software techniques that enable code for a program to execute in parallel. Examples include OpenMP, Intel® Threading Building Blocks (Intel® TBB), Message Passing Interface (MPI), Intel® Concurrent Collections for C/C++, Microsoft Task Parallel Library* (TPL), and low-level, basic threading APIs, like POSIX* threads (Pthreads). Some parallel frameworks support shared-memory parallel processing, while others, like MPI, support non-shared-memory parallel processing. See also Intel® Threading Building Blocks (Intel® TBB).
parallel processing:
The use of multiple threads during execution of a program. Intel Advisor focuses on parallel processing for shared-memory systems. There are other types of parallel processing, such as processing on clusters or grids and vector processing. Short form: parallelism. See also hotspot and thread.
parallel region:
Offload Advisor term. A code region that starts with a specific parallel framework construct. The supported parallel frameworks are Threading Building Blocks (TBB), Intel® Data Analytics Acceleration Library (Intel® DAAL), OpenMP*, and Data Parallel C++ (DPC++).
parallel site:
A region of code that contains tasks that can execute in parallel. See also annotation.
pipeline:
An approach to organizing task computations that uses both data parallelism and task parallelism, and organizes the computation into stages that run in a predetermined order.
self time:
In the Survey Report window, how much time was spent in a particular function or loop.
site:
See parallel site.
shared-memory parallelism:
See parallel processing.
static extent:
The code between a site's or a task's _BEGIN and _END annotations. A static extent might not be lexically paired; for example, a parallel site may have one _BEGIN point but require multiple independent _END exit points. Contrast with dynamic extent. See also annotation, parallel site, and Task Organization and Annotations.
synchronization:
Coordinating the execution of multiple threads. In some cases, you can provide synchronization within a task by using a private memory location instead of a shared memory location. In other cases, a lock or mutex can be used to restrict access to shared data. See also Data Sharing Problem Types.
task:
A portion of code and its data that can be given to a thread to execute. See also Task Organization and Annotations, Choosing the Tasks, and chunking.
task parallelism:
Occurs when two different portions of the code are made into tasks and execute in parallel. For example, a task is made by pairing a display algorithm with the state to display, another task by pairing a compute-next-state algorithm with the same state, and the two tasks execute in parallel. See also Task Patterns. Contrast data parallelism.
Intel TBB:
See Intel® Threading Building Blocks (Intel® TBB).
thread:
A thread executes instructions within a process. Each process has one or more threads active at a time. Threads share the address space of the process, but have their own stack, program counters, and other registers.
total time:
In the Survey Report window, how much time was spent in a particular function or loop, plus the time spent by anything that entity calls.
vector processing:
A form of parallel processing where multiple data items are packed together in vector registers to allow vector instructions to operate on the packed data with a single instruction. Reducing the number of instructions needed to process the packed vector data minimizes memory use and latency, and provides good locality of reference and data cache utilization. Vector instructions are Single Instruction Multiple Data (SIMD) instructions. Some SIMD vector instructions support large register sizes to accommodate more packed data, such as Intel® Advanced Vector Extensions (Intel® AVX).

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804