Intel® Cilk™ Plus Support in Intel® Parallel Amplifier 2011

Intel® Cilk™ Plus is a simple and powerful abstraction for expressing parallelism. It is one of the Intel® Parallel Building Blocks and it is included in Intel® Parallel Composer 2011, which is part of Intel® Parallel Studio 2011. In this initial introduction of Intel® Cilk™ Plus it is important to understand how the analysis features of Intel® Parallel Studio 2011 display results when Intel® Cilk™ Plus is used in your software. This article details the level of support provided by Intel® Parallel Amplifier 2011. Display of analysis results of software using Cilk™ Plus will become more informative in future releases.

Intel® Parallel Amplifier 2011 will analyze Cilk Plus code and provide results. However, information about Cilk Plus code may not be represented in an intuitive way, and some features of Parallel Amplifier, such as source view, may not work properly on Cilk Plus code. Most of the limitations in how results are presented stem from the current implementation of the Cilk Plus abstractions, which does not preserve a clean symbol mapping between the source code and the binary. Although you can expect this to improve as the product matures, in its initial state it causes Cilk Plus functions to be referred to as <unnamed-tag>::operator(), which may be misleading. To assist you in interpreting your results, this article walks you through some examples of how Cilk Plus code is currently represented in Parallel Amplifier 2011 analysis output.

Hotspots Analysis and General Parallel Amplifier Functionality:
When it encounters cilk_for or cilk_spawn statements, the compiler creates lambda (on-the-fly anonymous) functions. For a cilk_for, the compiler encapsulates the body of the for loop in a lambda function so that it can be executed by multiple threads. A cilk_spawn statement results in a “spawn helper” lambda function that enables the passing of parameters to the spawned function. In either case, when Parallel Amplifier results contain samples from these lambda functions, the samples are attributed to the proper module, but the functions are named <unnamed-tag>::operator() in the list of hotspots. Figure 1 shows the results of running Hotspots analysis on some simple Cilk Plus code that goes through a range of integers and counts prime numbers. The code contains one cilk_for loop, wrapped in a function called parallel_count_primes.

Figure 1 - Hotspots results for simple cilk_for

The function called <unnamed-tag>::operator() in these results is the lambda function created by the compiler to execute the body of the cilk_for loop. The Cilk Plus runtime then implements the loop with a “divide and conquer” algorithm that calls the lambda function. The resulting code can be distributed efficiently across multiple workers, but it also produces an unusual call stack.
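To make the shape of that call stack easier to picture, here is a minimal serial sketch in standard C++ (no Cilk Plus keywords, so it builds with any C++11 compiler). It is not the code the compiler or runtime actually generates. The names is_prime and divide_and_conquer_for, the grain value of 16, and the plain counter are assumptions made for illustration; only parallel_count_primes is named in the article. The loop body becomes an unnamed closure type with an operator(), and a recursive driver keeps halving the range before it calls that body.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the article): trial-division primality test.
static bool is_prime(std::uint32_t n) {
    if (n < 2) return false;
    for (std::uint32_t d = 2; static_cast<std::uint64_t>(d) * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Illustrative stand-in for the runtime's recursive loop driver (compare the
// chain of __cilkrts_cilk_for_32 frames in the Top-Down Tree). It halves
// [lo, hi) until a chunk holds at most `grain` iterations, then calls the body.
// A real runtime could give one half to another worker; this sketch runs both
// halves serially on the calling thread.
template <typename Body>
void divide_and_conquer_for(std::uint32_t lo, std::uint32_t hi,
                            std::uint32_t grain, const Body& body) {
    assert(lo <= hi && grain > 0);
    if (hi - lo <= grain) {
        for (std::uint32_t i = lo; i < hi; ++i) body(i);  // body is operator()
        return;
    }
    const std::uint32_t mid = lo + (hi - lo) / 2;
    divide_and_conquer_for(lo, mid, grain, body);
    divide_and_conquer_for(mid, hi, grain, body);
}

// Rough picture of a cilk_for over [lo, hi) that counts primes.
std::uint32_t parallel_count_primes(std::uint32_t lo, std::uint32_t hi) {
    // A plain counter is safe only because this sketch is serial; the
    // article's program holds the count in a reducer.
    std::uint32_t count = 0;
    auto body = [&count](std::uint32_t i) { if (is_prime(i)) ++count; };
    divide_and_conquer_for(lo, hi, /*grain=*/16, body);
    return count;
}
```

Calling parallel_count_primes(0, 100) in this sketch reaches the body through several nested divide_and_conquer_for frames, which mirrors why a profiler sees the operator() both directly under its enclosing function and at the end of a long runtime call chain.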

Figure 2 shows the partially expanded Top-Down Tree for the same result. The code that was run contained a cilk_for inside the function parallel_count_primes, and no other significant function calls. However, instead of seeing all the execution time grouped into the parallel_count_primes function, 39.7% of the execution time is attributed to an operator() within the parallel_count_primes tree, and 59.2% is attributed to a long chain of __cilkrts_cilk_for_32 function calls. This call chain demonstrates how the Cilk Plus run-time code is recursively expanding the work of the operator() function into chunks of work that get distributed to available threads. The Hotspots results show this expansion happening in the Cilk Plus runtime (cilkrts20) code.

Figure 2 – Top-down Tree Results for simple cilk_for

The recursive distribution of tasks is also evidenced by the many call stacks recorded for an <unnamed-tag>::operator() function. Figure 3 shows the Call Stack pane for the simple count primes cilk_for code. The operator function for this code has 71 call stacks, 70 of which are associated with the Cilk Plus run-time code (like the one shown).

Figure 3 – Call Stacks for <unnamed-tag>::operator() function for simple cilk_for

Figure 4 shows the results of running Hotspots analysis on code that uses a cilk_spawn and cilk_sync in its algorithm to recursively find Fibonacci numbers. This time the <unnamed-tag>::operator() functions that show up in the results are the spawn helpers created each time a function is spawned. Like the previous example, these results show some Cilk Plus runtime activity as hotspots – such as the SwitchToFiber function (from kernel32.dll) that is called by the scheduler. The recursion inherent in the Fibonacci algorithm results in chains of fib-> <unnamed-tag>::operator() calls in the call stacks.

Because of the nature of the scheduling system, and the current partial support for Cilk Plus, Hotspots results for code using Cilk Plus may result in time being attributed to multiple hotspots. In this example, the fib, <unnamed-tag>::operator(), and background Cilk Plus functions in the list are all likely related to the same piece of code being executed – the spawned work of the fib function.

Figure 4 – Hotspots results for simple cilk_spawn and cilk_sync
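The fib-> <unnamed-tag>::operator() chains described for this example can be pictured with a serial sketch in standard C++. This is an illustration under stated assumptions, not the compiler's real output: the name spawn_helper and the exact shape of the closure are hypothetical, and the spawn and sync points appear only as comments.

```cpp
#include <cassert>

// Serial stand-in for:
//     x = cilk_spawn fib(n - 1);
//     y = fib(n - 2);
//     cilk_sync;
long fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    long x = 0;
    // The "spawn helper": an unnamed closure type whose operator() receives
    // the argument and stores the result of the spawned call.
    auto spawn_helper = [&x](int arg) { x = fib(arg); };
    spawn_helper(n - 1);       // call stack here: fib -> operator() -> fib -> ...
    long y = fib(n - 2);       // with Cilk Plus, this could run on another worker
    // A cilk_sync would go here; a serial run has nothing to wait for.
    return x + y;
}
```

Each level of recursion adds one fib frame and one helper frame, so a deep call such as fib(20) yields alternating fib and operator() entries in the stack, which matches the pattern the article describes.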

When a project is analyzed that contains more than one Cilk Plus construct (such as 2 separate cilk_for loops), the time spent in the separate constructs may be grouped together into the same <unnamed-tag>::operator() function. Figure 5 shows the results of running Hotspots analysis on a project that contained two separate functions for finding primes using cilk_for, each called once.

Figure 5 – Hotspots results for code containing 2 cilk_for loops, each executed once

The two cilk_for loops were wrapped in two functions called parallel_count_primes and parallel_counter_2. However, the Hotspots results and call stacks do not contain any references to the parallel_counter_2 function: all significant cilk_for execution time was attributed to the <unnamed-tag>::operator() function with the parallel_count_primes function as its caller.

Finally, because time may not be correctly attributed in code that runs multiple Cilk Plus constructs, double-clicking the <unnamed-tag>::operator functions or their call stacks may not open the right place in Source View mode. For example, double-clicking the <unnamed-tag>::operator function in Figure 5 always results in viewing the parallel_count_primes function in the Source View, and never shows the parallel_counter_2 function. It may not be possible to separate which time was spent in which construct.

Although this is not guaranteed, for code containing only one Cilk Plus construct, double-clicking the <unnamed-tag>::operator will usually open the correct source code file and pinpoint the correct function. Double-clicking on a __cilkrts call stack for the <unnamed-tag>::operator may take you to the Cilk Plus run-time instead of the correct place in user code.

Concurrency Analysis:
Concurrency Analysis represents Cilk Plus constructs in the same way as Hotspots Analysis: code will be grouped into one or more <unnamed-tag>::operator() functions, may have call stacks showing Cilk Plus run-time code, may not attribute time within Cilk Plus constructs properly, and may not open source view properly for Cilk Plus code. The concurrency values (poor, OK, ideal, etc.) for Cilk Plus code have generally been correct in limited internal testing, but are not guaranteed to be.

Figure 6 shows the results of Concurrency Analysis on the simple find primes code with one cilk_for (the same code used in Figures 1 and 2). When the code ran, 3 software threads executed: a main thread and two worker threads created by the Cilk Plus runtime. The results correctly show that all three of these threads executed tasks created from the cilk_for construct. In general, the main thread (labeled wmainCRTStartup) will bind to the Cilk Plus runtime and begin running the scheduling code. It may also complete some of the tasks. Worker threads are created by the Cilk Plus runtime (executed by the main thread) and will only execute available spawned work.

Figure 6 – Concurrency results in thread view for simple cilk_for code

Locks and Waits Analysis:
Locks and Waits Analysis shows the synchronization objects in an application and how long the processing cores spent waiting on each, as well as how utilized the cores were during the wait. For a Cilk Plus program, several synchronization objects that are part of the run-time library may show up in the results. These synchronization objects will be labeled as being part of Cilk Plus (under Sync Object Type). Figure 7 shows an example of Locks and Waits analysis results for the simple count primes program with one cilk_for loop. This code also contains one Cilk Plus hyper-object: a reducer used to hold the count of primes found. There are three synchronization objects from within Cilk Plus that are identified as having significant waiting time: one Intel Cilk Plus Scheduler object, one Intel Cilk Plus Completion Semaphore object, and one Intel Cilk Plus Initialization object.

Wait times and utilizations are not guaranteed to be correct for the Cilk Plus constructs and run-time objects, but they might give an idea of where overhead is occurring. Wait times with poor utilization on objects in the Cilk Scheduler may indicate that there is not enough work to keep the Cilk worker threads busy (increase the problem size), that there is too much scheduling overhead (increase the task or grain size, or change the algorithm), or that there is another issue. Double-clicking the synchronization objects in the Cilk Scheduler or the Cilk User Thread may not lead to the proper source code line in Source View.
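As a rough illustration of why a larger grain size can reduce scheduling overhead, the following standard C++ sketch counts how many leaf chunks a recursive halving scheme produces for a given iteration count and grain size. The function name count_leaf_chunks and the simple halving rule are assumptions made for this example; the article does not specify how the Cilk Plus runtime actually sizes its chunks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: split the iteration range in half until a piece holds
// at most `grain` iterations. Each leaf is one unit of work to be scheduled.
std::size_t count_leaf_chunks(std::size_t iterations, std::size_t grain) {
    assert(grain > 0);
    if (iterations <= grain) return 1;
    const std::size_t left = iterations / 2;
    return count_leaf_chunks(left, grain) +
           count_leaf_chunks(iterations - left, grain);
}
```

Under this model, 1000 iterations with a grain of 1 produce 1000 leaf chunks, while a grain of 100 produces only 16. Fewer, larger chunks mean fewer scheduling decisions, at the cost of less flexibility in balancing work across threads.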

Wait times on the three objects in this example are generally not an indication of an issue, regardless of the utilization time shown. Waiting on the Intel Cilk Plus Scheduler object is by design when only one Cilk Plus application is being run and the default number of threads is being used. It occurs because the Cilk Plus run-time creates N worker threads, where N is by default the number of processing cores available, but for a single running application will usually only have N-1 threads executing tasks, plus the main (user) thread. Waiting occurs on the Nth thread that was created but is not doing work. Some waiting on the Cilk Plus Completion Semaphore is also expected – this occurs when the main thread completes its last task before one or more Cilk Plus worker threads complete their last tasks. Once all the Cilk Plus worker threads are done, the main thread will resume and return to the main application. A small wait on the Cilk Plus Initialization object should occur normally as part of the start-up of the run-time.

Figure 7 – Locks and Waits results for simple cilk_for code

Summary and Where to go for Help
As mentioned in the introduction, Parallel Amplifier should analyze projects containing Cilk Plus code without crashing. The information above should give some guidance as to how to interpret the results of analysis on Cilk Plus constructs. For additional help, please post a question on the Intel Parallel Studio forum.