Intel® System Studio - Multicore Programming with Intel® Cilk™ Plus

Overview

Intel System Studio provides a variety of signal processing primitives via:

  • Intel® Integrated Performance Primitives (Intel® IPP), and
  • Intel® Math Kernel Library (Intel® MKL).

In addition, it allows developing high-performance, low-latency custom code using:

  • the Intel® C++ Compiler together with Intel® Cilk™ Plus.

Intel Cilk Plus is built into the compiler, so it can be used wherever an efficient threading runtime is needed to extract even small amounts of parallelism. For example, when using a library for signal processing (e.g., Intel IPP), it is possible to exploit multiple cores even when the library does not provide internal multithreading. This article sketches how to introduce multicore parallelism effectively without building it into each of the important algorithms. This is possible by employing a parallel pattern called a pipeline.

Introduction

Intel Cilk Plus, in concert with the compiler, enables forward-scaling task parallelism for C and C++ with a dynamic runtime scheduler that maps an arbitrary number of tasks to a limited set of workers (a pool of threads). This allows for composable designs where multicore parallelism can be added without supervising call chains (nested parallelism), and without oversubscribing resources. Intel Cilk Plus unfolds parallelism according to the number of workers in the pool, rather than consuming resources in an unbounded fashion according to the number of scheduled tasks. Intel Cilk Plus is part of the Intel® C++ Compiler, and no additional steps are required to start using it in C and C++ code.

There are two main usages of multicore that are important for embedded applications:

  1. Singular, long-running, or permanent tasks, including background tasks. This category usually implements event handling, e.g., queuing user interactions.
  2. Multiple tasks that operate on the same data. This kind of data parallelism can still exhibit an embarrassingly parallel structure, but may require synchronization constructs (locks).

In the first category, each task is in a 1:1 relationship with an OS thread, and there is no need to involve Intel Cilk Plus. An OS-specific thread interface (e.g., POSIX threads / Pthreads) can be used. However, C++ libraries provide an easy interface for interacting with OS threads and synchronization primitives in a portable manner; with C++11 in particular, no additional library is needed. Intel® Threading Building Blocks (Intel® TBB) also includes a rich set of primitives, including std::thread for the case where C++11 is not available (TBB_IMPLEMENT_CPP0X). Note that with Intel System Studio, Intel TBB can be built from source.

Case Study and Example

In the second category above, a threading runtime has to map many tasks to a limited pool of worker threads. Let's consider an application where we want to introduce some multicore parallelism, but without introducing it into each of the important algorithms. This is a realistic use case where, for example, a signal processing library is used that does not provide internal multithreading, or whose internal multithreading should not be used because application-level threading is more efficient for smaller data sets and latency-constrained applications, or simply offers better control.

--------     -----------     ---------
| Read | --> | Process | --> | Print |
-------1     2---------3     4--------

The signal processing pipeline consists of three stages: (1) a signal reader that reads/parses values into its own dedicated buffer #1, (2) the actual processing stage that operates out of place on buffers #2 and #3, and (3) a final print stage that reads its own buffer #4 and prints it.

struct { size_t i, n; } stage[] = { { 0, 0 }, { 1, 0 }, { 2, 0 }, { 3, 0 } };
for (size_t i = 0; i <= nsteps && (0 < stage[1].n || 0 == i); ++i) {
  read_signal   (size,       x + stage[0].i, y + stage[0].i, stage[0].n);
  process_signal(stage[1].n, x + stage[1].i, y + stage[1].i,
                             x + stage[2].i, y + stage[2].i);
  print_signal  (stage[3].n, x + stage[3].i, y + stage[3].i, std::cout);

  stage[2].n = stage[1].n;
  std::rotate(stage, stage + 4 - 1, stage + 4); // quad-buffering
}

The previously introduced pipeline now indirectly assigns one of four buffers to each stage. The quad-buffering approach consists of two actual quad-buffers, x and y, for a signal that consists of multiple components. At each cycle of the loop, a stage's destination buffer is rotated by one position within the ring buffer (called "stage") so that it becomes the source of the next stage.

Intel® Cilk™ Plus

for (size_t i = 0; i <= nsteps && (0 < stage[1].n || 0 == i); ++i) {
  cilk_spawn read_signal   (size,       x + stage[0].i, y + stage[0].i, stage[0].n);
  cilk_spawn process_signal(stage[1].n, x + stage[1].i, y + stage[1].i,
                                        x + stage[2].i, y + stage[2].i);
             print_signal  (stage[3].n, x + stage[3].i, y + stage[3].i, std::cout);
  cilk_sync;

  stage[2].n = stage[1].n;
  std::rotate(stage, stage + 4 - 1, stage + 4); // quad-buffering
}

Compared to the previously shown code, only cilk_spawn and cilk_sync have been added. To generate a serial version of a program that uses Intel Cilk Plus (keywords, reducers, etc.), one can compile with the "-cilk-serialize" option; with just cilk_spawn, cilk_sync, and cilk_for, one can simply elide these keywords using the preprocessor. Note that the above multi-buffering approach actually allows calling read_signal, process_signal, and print_signal in any order, which can be of interest given Intel Cilk Plus' continuation-passing style.

Thinking of cilk_spawn as asynchronously launching an invocation explains what runs concurrently before it is synced by cilk_sync. However, the worker thread that executes the first cilk_spawn also executes the spawned function (i.e., read_signal in the previous code fragment); this is in contrast to what a library-based threading runtime is able to achieve. The continuation is then eventually stolen by another worker (i.e., after the sequence point behind read_signal, and hence the next spawn). There are also a number of implicit synchronization points (where cilk_sync can be omitted). These are mostly obvious, but they also complete the definition of the language extension in the presence of exceptions.

/* A */
for (size_t i = 0; i < N; ++i) {
  cilk_spawn little_work(i);
}

/* B */
cilk_for (size_t i = 0; i < N; ++i) {
  little_work(i);
}

/* C */
for (size_t i = 0; i < N; i += G) {
  cilk_spawn work(i, std::min(G, N - i));
}

/* D */
void work(size_t I, size_t N) {
  for (size_t i = I; i < I + N; ++i) little_work(i);
}
In situation A (above), with only little work per induction of i, the keyword cilk_for is introduced in code B not only to amortize the cilk_spawn, but also to employ a launch scheme similar to a binary tree. Intel Cilk Plus allows adjusting the grain size of a cilk_for (#pragma cilk grainsize=<expression>) using a runtime expression. The grain size G in code C accumulates more work in function D. With respect to the launch scheme, however, examples B and C are still not equivalent: splitting the loop range according to a binary tree avoids accumulating the total launch overhead on one core.

There are two notable consequences of Intel Cilk Plus' continuation-passing style: (1) a thread continues with what is locally prepared or "hot in cache," and (2) the instructions of a scope (in the C/C++ sense) may not all be executed by the same thread. A conclusion from #1 is that tuning a sequential program translates to a tuned parallel program in a more straightforward manner. As a consequence of #2, thread-local storage with a lifetime tied to the scope cannot be used with Intel Cilk Plus. To dispel a myth, it should also be said that Intel Cilk Plus uses regular OS threads to perform the work; however, tasks are conceptually very lightweight user-space objects similar to fibers, which is not much different from other threading libraries such as Intel TBB.

Dynamic scheduling employs workers as needed, and the grain size varies the amount of available parallelism in a cilk_for loop. With cilk_spawn, the number of spawned functions directly determines the amount of parallelism that can be exploited, e.g., the number of pipeline stages that can run in parallel. To summarize, setting the number of workers (see the code fragment below) per parallel section is not the way to adjust concurrency.

#include <cilk/cilk_api.h>
#include <cstdlib>

int main(int argc, char* argv[])
{
  const char *const threads = 1 < argc ? argv[1] : 0;
  const int nthreads = 0 != threads ? std::atoi(threads) : -1;

  if (0 <= nthreads) __cilkrts_set_param("nworkers", threads);
  return 0;
}

The above example code illustrates three cases: (1) with no argument given, setting up the number of workers is omitted and eventually deferred to a later point; (2) a "0" argument on the command line instructs the API to initialize with all available workers according to the system and the process' affinity mask; and (3) an explicit number of workers is given. The API takes precedence over the environment variable CILK_NWORKERS, which also exists for convenience. Note that an explicit number of workers is not necessarily truncated to the number of available hardware threads.

Increasing the grain size lowers the number of workers involved in a cilk_for loop. The default (when the pragma is not used) aims to maximize parallelism and takes into account (1) the size of the iteration space as well as (2) the total number of workers (see the code above). A constant grain size removes these two dependencies. However, the order in which the partitions are processed may still vary, and hence lead to non-deterministic results, in particular with floating-point data, where the order of calculations impacts the final result. There are actually many more reasons why the same bit-wise result may not be reproduced on a system or across systems.

Conclusions

The pipeline pattern can only extract a limited amount of parallelism, and the longest-running stage always becomes the bottleneck. In the above example, the pipeline consists of only three stages that can run in parallel, two of which are I/O. The latter can be an additional burden (in terms of scalability) if the I/O functionality uses locks to protect internal state.

More interesting in our example is that successive fork-joins (one per cycle of the pipeline) demand an efficient threading runtime. Of course, one can try to hide the overhead with larger buffer sizes within the parallel pipeline; however, a parallel region that executes longer obviously increases the latency of the application.

Software optimizations including in-core optimizations can save energy. Cache-blocking can avoid unnecessary memory loads. With multicore one can take this idea further with cache-oblivious algorithms. However, multicore parallelism by itself is able to save energy in the following way:

  • 4x the die area of a microprocessor gives 2x the performance in one core, but
  • 4x the performance when the same area is dedicated to 4 cores.

Pollack's rule should be considered alongside the fact that performance may scale linearly with clock frequency, while energy consumption roughly scales with the square of the clock frequency. Amdahl's Law limits the practical use of a system that only delivers performance in the presence of parallelism. However, it is very attractive to prepare signal processing applications to make use of multiple cores, because of the possible energy savings, or to consolidate specialized hardware by loading the system with various additional tasks besides accelerating the signal processing itself.

Case Study

One can find out more by reading the attached case study (2-up), which carries out the above example.

For more complete information about compiler optimizations, see our Optimization Notice.