Reduce Boilerplate Code in Parallelized Loops with C++11 Lambda Expressions

The C++11 standard, formerly known as C++0x, introduced lambda expressions, and using them greatly reduces boilerplate code. The Intel® C++ compiler lets us take advantage of lambda expressions when parallelizing loops with Intel® Threading Building Blocks (Intel® TBB).

Over the last year, I have worked on many projects with a very similar scenario: they required performance improvements, and their main hotspots were embarrassingly parallel workloads. It was necessary to parallelize loops and take advantage of vectorization to boost performance. However, the same thing happened in all these projects: developers complained about the boilerplate code required to parallelize a loop with Intel® TBB. I was surprised, because I have been working with C++11 code for many years and, specifically, using C++11 lambdas to make it easier to write code that uses Intel® TBB.

The problem was that many developers moved away from C++ years ago and are now coming back to the language. They follow old practices that are no longer necessary and that make the code difficult to read, understand and maintain. The Intel® C++ compiler has supported a large number of C++11 features, including lambda expressions, since version 11.0.

A lambda expression is a syntactic shortcut for a functor, and therefore, you can use lambda expressions to replace functors. The following line shows the basic syntax and elements for C++11 lambda expressions:

[ captures ] ( parameters ) -> returnTypeDeclaration { lambdaStatements; }

[ captures ]: This starts the lambda expression and represents the capture clause, also known as the lambda introducer. Here, you specify which outside variables are available to the lambda function, and whether they should be captured by value (copying) or by reference. Several capture options are available, and the capture clause makes it easy to identify the start of a lambda expression.

( parameters ): This indicates the optional parameter list, also known as the lambda declarator. If the function takes zero arguments (a parameterless function), you can omit the parameter list, as long as you also omit the return type declaration.

-> returnTypeDeclaration: This indicates the optional return type for the lambda expression.

{ lambdaStatements; }: This is the lambda body. All the statements we write within the lambda body can access the captured variables and the parameters.
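To make the four elements concrete, here is a minimal sketch in standard C++11 that does not require Intel® TBB. It places a hand-written functor next to the lambda that replaces it. The names ScaleBy, scale_with_functor and scale_with_lambda are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hand-written functor: the kind of boilerplate a lambda replaces.
struct ScaleBy {
    int factor;
    explicit ScaleBy(int f) : factor(f) {}
    int operator()(int value) const { return value * factor; }
};

// Multiplies every element by factor using the functor.
std::vector<int> scale_with_functor(std::vector<int> values, int factor) {
    std::transform(values.begin(), values.end(), values.begin(), ScaleBy(factor));
    return values;
}

// Same operation with a lambda written in place.
std::vector<int> scale_with_lambda(std::vector<int> values, int factor) {
    std::transform(values.begin(), values.end(), values.begin(),
        // [factor]     capture clause: factor is captured by value
        // (int value)  parameter list
        // -> int       return type declaration (optional here)
        // { ... }      lambda body
        [factor](int value) -> int { return value * factor; });
    return values;
}
```

Both functions should produce the same result for the same input. The lambda version keeps the logic at the point of use instead of in a separate class.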

An empty capture clause is declared with [] and it means that the lambda captures nothing, i.e., the lambda expression body doesn’t access local variables in the enclosing scope. A parallelized loop usually needs at least one value to know which number it has to process, but that value typically arrives as a parameter of the lambda, not as a captured variable. For this reason, an empty capture clause is often enough for the lambdas used in parallelized loops.

The [=] capture option automatically captures by value (making a copy) any outside variable referenced within the lambda. It is very common to see this option in the lambdas included in parallelized loops. However, it can be replaced with [] whenever the lambda doesn’t need to capture variables.

The [&] capture option automatically captures by reference any outside variable referenced within the lambda.

You can also use a default capture option with exceptions. For example, [=, &number] will capture any referenced variable within the lambda by value (making a copy) except the number variable, which will be captured by reference.
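The capture options described above can be sketched as follows in standard C++11. CaptureResult and demonstrate_captures are hypothetical names used only for this illustration, and the lambdas run sequentially so that the observed values are easy to follow.

```cpp
// Holds the values observed after running each lambda.
struct CaptureResult {
    int seen_by_value;  // what the [=] lambda returned
    int total_after;    // value of total after the [&] lambda ran
    int number_after;   // value of number after the [=, &number] lambda ran
};

CaptureResult demonstrate_captures() {
    int base = 10;
    int total = 0;
    int number = 1;
    int step = 2;

    // [=]: base is copied when the lambda is created.
    auto by_value = [=]() { return base; };
    base = 99;  // this later change is not visible to the lambda's copy

    // [&]: total is accessed by reference, so the lambda can modify it.
    auto by_reference = [&]() { total += 5; };
    by_reference();

    // [=, &number]: step is captured by value, number by reference.
    auto mixed = [=, &number]() { number += step; };
    mixed();

    return CaptureResult{by_value(), total, number};
}
```

In a parallelized loop, by-reference captures deserve extra care, because several threads may access the same variable at the same time.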

Now, consider that we have the following code that declares a function that receives a number argument and runs CPU-intensive code related to this number. In this case, the function declares an empty body but it represents the function that will be launched for each number in the parallelized loop, that is, from 0 to 1023.

#include <iostream>
#include "tbb/tbb.h"

using namespace tbb;

const size_t numbers_to_process = 1024;

void process_number(size_t number) {
    // Function that processes the number received in number
}

The following lines use a C++11 lambda expression to easily write a parallelized loop that calls the process_number function with each number from 0 to numbers_to_process - 1 (1023):

int main()
{
    parallel_for(
        blocked_range<size_t>(0, numbers_to_process),
        [](const blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i) {
                process_number(i);
            }
        });

    // Wait for the user to press Enter before exiting
    std::cin.ignore();

    return 0;
}

 

The following code is the lambda expression passed to parallel_for (tbb::parallel_for):

[](const blocked_range<size_t>& r) {
    for (size_t i = r.begin(); i != r.end(); ++i) {
        process_number(i);
    }
}

The code is very easy to understand. The lambda expression doesn’t capture variables, and therefore, it starts with []. The lambda takes const blocked_range<size_t>& r as a parameter, and its body runs a sequential for loop that uses the received range to establish the first and last values of i and calls the process_number function with i as an argument.

The following lines will also work, and you will probably see many lambda expressions passed to parallel_for that start with [=] even when the lambda doesn’t capture variables. However, you should use the empty capture clause when you don’t need to capture variables.

[=](const blocked_range<size_t>& r) {
    for (size_t i = r.begin(); i != r.end(); ++i) {
        process_number(i);
    }
}
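To see why the lambda body loops from r.begin() to r.end(), it can help to picture the range being split into chunks, with the body called once per chunk. The following sketch imitates that idea sequentially in standard C++11. It is not the Intel® TBB implementation: simple_range and for_each_chunk are hypothetical stand-ins for tbb::blocked_range<size_t> and tbb::parallel_for, and visit_all is an illustrative helper.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: half-open range [first, last).
struct simple_range {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical, sequential stand-in for tbb::parallel_for. It splits
// [begin, end) into chunks of at most chunk_size elements (chunk_size > 0)
// and calls body once per chunk. A parallel library would instead hand
// the chunks to worker threads.
template <typename Body>
void for_each_chunk(std::size_t begin, std::size_t end,
                    std::size_t chunk_size, const Body& body) {
    for (std::size_t start = begin; start < end; start += chunk_size) {
        std::size_t stop = (end - start < chunk_size) ? end : start + chunk_size;
        body(simple_range{start, stop});
    }
}

// Records every index the body visits, using a by-reference capture.
std::vector<std::size_t> visit_all(std::size_t count, std::size_t chunk_size) {
    std::vector<std::size_t> visited;
    for_each_chunk(0, count, chunk_size, [&visited](const simple_range& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            visited.push_back(i);  // safe only because this stand-in is sequential
        }
    });
    return visited;
}
```

Calling visit_all(10, 4) invokes the body three times, with the ranges [0, 4), [4, 8) and [8, 10). With a real parallel loop, the chunk boundaries and the order of execution are decided by the library, so the body should not rely on them.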

When automatic chunking is acceptable and we just need one-dimensional iteration over a consecutive range of integers, we can write even more compact code.

The following lines use a C++11 lambda expression and the Intel® TBB compact form to easily write a parallelized loop that calls the process_number function with each number from 0 to numbers_to_process - 1 (1023).

int main()
{
    parallel_for(size_t(0), numbers_to_process, [](size_t i) {
        process_number(i);
    });

    // Wait for the user to press Enter before exiting
    std::cin.ignore();

    return 0;
}

The following code is the lambda expression passed to parallel_for (tbb::parallel_for):

[](size_t i) { process_number(i); }

The code is very easy to understand. The lambda expression doesn’t capture variables, and therefore, it starts with []. The lambda takes size_t i as a parameter and its body just calls the process_number function with i as an argument.

The following line will also work, and you will probably see many lambda expressions passed to parallel_for that start with [=] even when the lambda doesn’t capture variables. However, you should use the empty capture clause when you don’t need to capture variables.

[=](size_t i) { process_number(i); }

Parallelized loops with Intel® TBB that use C++11 lambdas are as easy to understand as sequential for loops. Whenever we have to write modern code that takes advantage of modern Intel® CPUs, we have to make sure we are also taking advantage of all the features available in Intel® C++ compilers to reduce boilerplate code.

There is an important disclaimer I must mention. Once you start using C++11 lambdas, there is no going back!

For more complete information about compiler optimizations, see the Optimization Notice.