Using Intel® Parallel Advisor 2011 to determine if your Intel® Threading Building Blocks application will scale

I have a new appreciation for the Suitability tool in Intel® Parallel Advisor. Intel Parallel Advisor was created to help us add parallelism to existing serial code, but I’ve discovered another useful, possibly unconventional, use for it: collecting performance and scalability information about an application that is already parallel, information that would be difficult to collect otherwise.

Let me provide some background information about Intel Parallel Advisor: Intel Parallel Advisor allows you to easily model different parallel implementations in your serial code and provides information about how that code would behave if it were actually parallelized. Intel Parallel Advisor’s Suitability tool provides information specific to the performance gains you could expect to see in the parallel version, as well as how well the parallel implementation would scale to higher core counts.

Knowing this type of information before you parallelize your code is extremely valuable. It helps ensure that you don’t waste effort threading code that will not improve performance or will not scale acceptably. Even if you already have a parallelized application, it’s not too late to use Intel Parallel Advisor. You may know how your application performs on a couple of test systems, but in some cases, you can use Intel Parallel Advisor’s Suitability tool to see how your application will scale to other core counts. In this blog, I’ll show you one way to use Intel Parallel Advisor to find the scalability of your parallel application.

For this example I will use the Intel® Threading Building Blocks (Intel® TBB) version of Tachyon included with the Intel® Parallel Studio 2011 samples. This application uses an Intel® TBB parallel_for to render a ray-traced image. Intel TBB breaks the rendering into separate tasks, each of which renders a separate horizontal strip of the image, and dynamically determines how many of these tasks each thread completes.

In order to determine estimated scalability for this parallel version of Tachyon using Intel Parallel Advisor, I take the following steps:

1. Set the number of threads Intel TBB will use to 1 in the task scheduler initialization function:

tbb::task_scheduler_init init (1);

This essentially serializes the application so that Intel Parallel Advisor can analyze the tasks properly. You may not be able to do this if your application requires multiple threads, for example if it uses blocking queues.

2. The Intel TBB parallel_for will partition the iterations of the for loop into tasks based on the policy defined by the “partitioner” argument. In order to determine the scalability of the parallel_for, we need to make sure these chunks are as close to the minimum reasonable task size as possible. The “minimum reasonable task size” is the minimum amount of work a thread needs to perform so that the performance gains of doing the parallel work overcome the overhead of adding an additional thread. Intel Parallel Advisor recommends this to be at least 10 microseconds of work. In Tachyon, I know that each iteration of the loop does much more work than this, so I want each iteration to be its own task. I replace the auto_partitioner with a simple_partitioner which will create a separate task for each iteration.

tbb::parallel_for (tbb::blocked_range<int> (starty, stopy, 1), draw_task (), tbb::simple_partitioner() );

If your loop iterations are smaller than the minimum reasonable task size, you can use the grain size parameter to adjust the number of iterations per task assigned by the simple partitioner.
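
As a rough sketch of how one might pick that grain size and check the resulting task count, consider the plain C++ below. The helper names (grain_for_min_task, split_like_simple_partitioner, count_tasks) are hypothetical and not part of Intel TBB. The first converts a measured per-iteration time into a grain size that meets the 10-microsecond guideline. The second mimics the recursive halving that simple_partitioner applies to a blocked_range, which keeps splitting a range while it holds more iterations than the grain size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not a TBB API): iterations per task needed so that
// one task performs at least min_task_us microseconds of work.
int grain_for_min_task(double iter_us, double min_task_us = 10.0) {
    assert(iter_us > 0.0 && min_task_us > 0.0);
    int grain = static_cast<int>(std::ceil(min_task_us / iter_us));
    return grain < 1 ? 1 : grain;
}

// Hypothetical model of the splitting: a range [begin, end) is halved while
// it holds more than `grain` iterations; each remaining piece is one task.
void split_like_simple_partitioner(int begin, int end, int grain,
                                   std::vector<std::pair<int, int>>& tasks) {
    assert(grain >= 1 && begin <= end);
    if (end - begin > grain) {
        int mid = begin + (end - begin) / 2;
        split_like_simple_partitioner(begin, mid, grain, tasks);
        split_like_simple_partitioner(mid, end, grain, tasks);
    } else if (begin < end) {
        tasks.emplace_back(begin, end);
    }
}

// Convenience wrapper: number of tasks produced for [begin, end).
std::size_t count_tasks(int begin, int end, int grain) {
    std::vector<std::pair<int, int>> tasks;
    split_like_simple_partitioner(begin, end, grain, tasks);
    return tasks.size();
}
```

With a grain size of 1, a 512-row range yields 512 tasks, one per row, which matches the Tachyon setup above. With a grain size of 4, a 10-iteration range yields four tasks of 2 or 3 iterations each, because halving stops as soon as a piece is no larger than the grain size.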

3. Annotate an Intel Parallel Advisor Site around the Intel TBB parallel_for:

ANNOTATE_SITE_BEGIN(MySite1);
tbb::parallel_for (tbb::blocked_range<int> (starty, stopy, 1), draw_task (), tbb::simple_partitioner() );
ANNOTATE_SITE_END(MySite1);

This tells Intel Parallel Advisor that within this site we will be modeling multiple tasks that may run in parallel.

4. Annotate an Intel Parallel Advisor Task around the call operator ( operator() ) of the class that defines the body of the parallel_for:

class draw_task {
public:
    void operator() (const tbb::blocked_range<int> &r) const {
        ANNOTATE_TASK_BEGIN(MyTask1);
        unsigned int serial = 1;
        unsigned int mboxsize = sizeof(unsigned int)*(max_objectid() + 20);
        unsigned int * local_mbox = (unsigned int *) alloca(mboxsize);
        for (int y = r.begin(); y != r.end(); ++y) {
            drawing_area drawing(startx, totaly-y, stopx-startx, 1);
            for (int x = startx; x < stopx; x++) { /* ... render pixel (x, y) ... */ }
            if (!video->next_frame()) {ANNOTATE_TASK_END(MyTask1); return; }
        }
        ANNOTATE_TASK_END(MyTask1);
    }
};

Annotating this task will tell the Suitability tool that this chunk of code will run in parallel. Make sure you have an ANNOTATE_TASK_END at all possible exit points of the function, just like you would if you were annotating serial code. Note that this does introduce some “fuzz” into the calculations because the time added by Intel TBB for creating and dispatching tasks is not accounted for in the annotation. However, as long as your task sizes are relatively large this effect should be minimized. Also, this time will be added to the serial portion of the calculation, so the performance predictions will be a conservative estimate.
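
To see why extra time in the serial portion makes the prediction conservative, consider Amdahl's law as a simplified stand-in for the Suitability tool's model. The tool's actual model is not described in this post, so this is only an illustration, and the function name amdahl_speedup and the sample fractions are assumptions chosen for the example.

```cpp
#include <cassert>

// Amdahl's law: speedup on `cores` cores when `serial_fraction` of the
// single-threaded run time cannot be parallelized.
double amdahl_speedup(double serial_fraction, int cores) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0 && cores >= 1);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores);
}
```

For example, counting 2% of the run as serial instead of 1% lowers the estimated 32-core speedup from about 24.4 to about 19.8. Time that is wrongly attributed to the serial portion therefore pushes the estimate down rather than up.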

5. Add the Intel Parallel Advisor annotation header file to the files that you added annotations to.

#include "advisor-annotate.h"

6. Build a release version and run the Intel Parallel Advisor Suitability tool.

Make sure to select “Intel TBB” from the Threading Model dropdown box. The results will show an estimate of how this Intel TBB version will scale its performance on more cores. The Tachyon Suitability Report is shown in Figure 1:

Figure 1 - Tachyon Suitability Report

The Suitability Report estimates that the Intel TBB parallel_for will scale very well up to at least 32 cores. Be sure to check the “Enable Task Chunking” box since Intel TBB will automatically support chunking. Chunking is a technique in which multiple small tasks are combined into a single larger task to amortize the overhead of task execution. In the case of Tachyon, chunking may not provide a benefit, but when task times are relatively small, chunking can make a big difference. For example, with a different program, the white circles in Figures 2 and 3 represent the change in scalability when chunking is enabled.

Figure 2 - Without Chunking

Figure 3 - With Chunking

Figure 2 shows scalability tapering off beyond 4 cores. Figure 3 shows that, with chunking enabled, this site should scale well all the way up to 32 cores.
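
A back-of-the-envelope model shows why chunking matters for small tasks. It assumes a fixed scheduling overhead per task, which is a simplification and not the Suitability tool's actual model. The function name chunk_efficiency and the numbers used are hypothetical.

```cpp
#include <cassert>

// Fraction of time spent on useful work when `chunk` iterations of
// `iter_us` microseconds each share one fixed `overhead_us` of task overhead.
double chunk_efficiency(double iter_us, double overhead_us, int chunk) {
    assert(iter_us > 0.0 && overhead_us >= 0.0 && chunk >= 1);
    double work = iter_us * chunk;
    return work / (work + overhead_us);
}
```

Under these assumptions, with 1-microsecond iterations and 1 microsecond of overhead per task, running each iteration as its own task is only 50% efficient, while chunks of 99 iterations reach 99%.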

Without Intel Parallel Advisor, you are able to determine the performance gains of your application only on the hardware you have available, and it would be difficult to determine how well your application will scale to higher core counts. Now, with Intel Parallel Advisor, you have techniques to estimate the scalability, even if the test hardware is not available. Remember to change your thread count, partitioner, and grain size back to their original state before you continue with the application.

This example only showed how to use Intel Parallel Advisor with an Intel TBB parallel_for, but many of the considerations and techniques are the same for determining scalability of other parallel applications.

You can find more information about Intel Parallel Advisor here, and talk with other users in our forum here.
