Scheduling Overhead in Intel® Threading Building Blocks (Intel TBB) Apps

This recipe shows how to detect and fix scheduling overhead for an Intel TBB application.

Content expert: Dmitry Prohorov

Scheduling overhead is a typical problem of code threading when fine-grain chunks of work between threads are distributed dynamically. In the case of scheduling overhead, a significant amount of time spent by the scheduler to assign the work for working threads and the time worker threads spend waiting for a new portion of work make parallelism inefficient and, in ultimate cases, a threaded version of a program can be even slower than the sequential one. The majority of Intel TBB constructs use a default auto-partitioner that tailors the number of chunks larger than the default grain size to avoid overhead. If you use a simple partitioner either intentionally or with constructs like parallel_deterministic_reduce, you need to take care of a grain size since a simple partitioner divides the work into chunk sizes up to the default grain size of one iteration. Intel® VTune™ Amplifier helps detect scheduling overhead for Intel TBB applications and provides some advice on increasing the grain size to avoid a slowdown connected with it.

Ingredients

This section lists the hardware and software tools used for the performance analysis scenario.

  • Application: a sample application that calculates the sum of vector elements using Intel TBB parallel_deterministic_reduce template function.

  • Compiler : Intel® Compiler or GNU* compiler with compiler/linker options, for example:

    -I <tbb_install_dir>/include -g -O2 -std=c++11 -o vector-reduce vector-reduce.cpp -L <tbb_install_dir>/lib/intel64/gcc4.7 -ltbb

  • Performance analysis tools: Intel VTune Amplifier 2019: Threading analysis

    Note

    • For trial VTune Amplifier downloads and product support, visit https://software.intel.com/en-us/vtune.

    • All the Cookbook recipes are scalable and can be applied to VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.

  • Operating system: Ubuntu* 16.04 LTS

  • CPU: Intel Xeon® CPU E5-2699 v4 @ 2.20GHz

Create a Baseline

The initial version of the sample code uses parallel_deterministic_reduce with the default grain size (line 17-23):

#include <stdlib.h>
#include "tbb/tbb.h"

static const size_t SIZE = 50*1000*1000;
double v[SIZE];

using namespace tbb;

void VectorInit( double *v, size_t n )
{
    parallel_for( size_t( 0 ), n, size_t( 1 ), [=](size_t i){ v[i] = i * 2; } );
}

double VectorReduction( double *v, size_t n )
{

    return parallel_deterministic_reduce(
        blocked_range<double*>( v, v + n ),
        0.f,
        [](const blocked_range<double*>& r, double value)->double {
            return std::accumulate(r.begin(), r.end(), value);
        },
        std::plus<double>()
    );
}

int main(int argc, char *argv[])
{
    task_scheduler_init( task_scheduler_init::automatic );

    VectorInit( v, SIZE );

    double sum;

    for (int i=0; i<100; i++)
                  sum = VectorReduction( v, SIZE );

    return 0;
}

The vector sum calculation is repeated in the loop in line 35 to make compute work more significant and measurable for statistical analysis.

Running the compiled application takes about 9 seconds. This is a performance baseline that could be used for further optimizations.

Run Threading Analysis

To estimate threading parallelism of your application and time spent on scheduling overhead, run the Concurrency analysis:

  1. Click the New Project button on the toolbar and specify a name for the new project, for example: vector-reduce.

  2. Click Create Project.

    The Configure Analysis window opens.

  3. On the WHERE pane, select the Local Host target system type.

  4. On the WHAT pane, select the Launch Application target type and specify an application for analysis.

  5. On the HOW pane, click the browse button and select Parallelism > Threading analysis.

  6. Click theStart button.

VTune Amplifier launches the application, collects data, finalizes the data collection result resolving symbol information, which is required for successful source analysis.

Note

Since the analysis is based on instrumentation and uses a stack stitching technology, the elapsed time of the instrumented application can be slower than the original application run because of the collection overhead.

Identify Scheduling Overhead

Start your analysis with the Summary view that displays application-level statistics.

The Effective CPU Utilization Histogram shows that, on average, the application utilized only ~3 physical cores out of available 88:

Flagged Overhead Time metric and the Scheduling sub-metric signal a threading inefficiency problem that should be explored:

The Top Hotspots section shows [TBB Dispatch Loop] as the main time-consuming function. The hint on the flag associated with the function informs about scheduling overhead that should be addressed with increasing a grain size of parallel work:

The sample application contains two Intel TBB constructs - parallel_for initialization and parallel_deterministic_reduce - to calculate the sum of vector elements. Discover which Intel TBB construct brings overhead with the help of the Caller/Callee data view. Expanding the CPU Time: Total > Overhead Time columns sorts the grid by the Scheduling column. Find the first row with the Intel TBB parallel construct in the Function list:

This row points to the parallel_deterministic_reduce construct in the VectorReduction function that has the highest scheduling overhead. Try to make work chunks more coarse-grain to eliminate the overhead in parallelization with this construct.

Note

To better visualize the imbalance during the application run, explore the Timeline view. Useful work shows up in green while the black color marks the time waste.

Increase Grain Size of Parallel Work

As mentioned earlier, the fine-grain chunks of work assigned to worker threads, cannot compensate the time spent by the scheduler on the work assignment. The default chunk size for parallel_deterministic_reduce that uses a simple partitioner is 1. This means that worker threads will take just one loop iteration to execute before turning back to the scheduler for a new portion of work. Consider increasing the minimal chunk size to 10000 (line 5 in the snippet below):

double VectorReduction( double *v, size_t n )
{

    return parallel_deterministic_reduce(
        blocked_range<double*>( v, v + n, 10000 ),
        0.f,
        [](const blocked_range<double*>& r, double value)->double {
            return std::accumulate(r.begin(), r.end(), value);
        },
        std::plus<double>()
    );
}

And re-run the Threading analysis:

You see that the elapsed time of the application has significantly reduced, average effective CPU utilization is ~38 logical cores (because the metric counts a warmup phase, compute phase CPU Utilization is closer to 80 cores), and CPU time spent on Intel TBB scheduling or other parallel work arrangement is negligible. An overall speedup of this small code modification is more than 10X in comparison with the original version of the application run without collection time.

Note

To discuss this recipe, visit the VTune Amplifier developer forum.

For more complete information about compiler optimizations, see our Optimization Notice.
Select sticky button color: 
Orange (only for download buttons)