Aggregator: a new Community Preview Feature in Intel® Threading Building Blocks

Intel® Threading Building Blocks (Intel® TBB) 4.0 Update 4 introduces a new Community Preview feature, the aggregator. An internal version of the aggregator has been in use in Intel® TBB for some time, appearing in the flow graph and concurrent priority queue implementations. An aggregator is like a mutex in that it enforces mutually exclusive access to a critical section of program code, and in many cases it can perform better than a mutex. It differs significantly from a mutex in how it works, though, and that has implications for how it performs and how it can be used. It does its magic by aggregating the critical sections from multiple threads into a single critical section executed by a single thread, which can have a significant impact on cache performance.

There are two modes of use for this feature: basic mode and expert mode. Basic mode is straightforward and not much more complex than using a mutex. Expert mode requires some understanding of how the aggregator works, and additional coding, but can enable additional performance improvements. In this blog, I will first illustrate how to use the aggregator in the basic mode. Then I’ll give a brief overview of how the aggregator works, followed by an example of how to use the aggregator in the expert mode. Finally, I’ll examine the performance of the aggregator and suggest approaches to help decide whether or not to use it.

Side-by-side Comparison of Basic Aggregator Usage with Mutex Usage

In this simple example, I’ll compare the usage of a mutex with that of an aggregator to protect push and pop operations on a serial priority queue object of type std::priority_queue. This example uses C++11 features, such as lambdas, but one could use function objects instead. Fair warning: I’m interspersing code snippets below, because this blog format doesn’t allow for side-by-side code comparison. Please don’t try to use both a mutex and an aggregator to protect the same code.

First, declare the priority queue. I'll use a simple integer priority queue here:

typedef int value_type;
// compare_type is a comparison functor defined elsewhere, e.g. std::less<value_type>
typedef std::priority_queue<value_type, std::vector<value_type>, compare_type> pq_t;
pq_t my_pq;

Declare a mutex to protect my_pq:

tbb::spin_mutex my_mutex;

Alternatively, declare an aggregator to protect my_pq:

aggregator my_aggregator;

Declare an element to push/pop from queue:

value_type elem = 42;

Now, push an element on the queue using the mutex:

tbb::spin_mutex::scoped_lock my_lock(my_mutex);  // released when my_lock goes out of scope
my_pq.push(elem);

Or, push the element on the queue using the aggregator and a lambda expression:

my_aggregator.execute( [&my_pq, &elem](){
    my_pq.push(elem);
} );

Pop an element off the queue using the mutex:

bool result = false;
tbb::spin_mutex::scoped_lock my_lock(my_mutex);
if (!my_pq.empty()) {
    result = true;
    elem = my_pq.top();  // copy out the highest-priority element
    my_pq.pop();         // then remove it
}

Pop an element off the queue using the aggregator:

bool result = false;
my_aggregator.execute( [&my_pq, &elem, &result](){
    if (!my_pq.empty()) {
        result = true;
        elem = my_pq.top();  // copy out the highest-priority element
        my_pq.pop();         // then remove it
    }
} );
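
Because the snippets above are interleaved, here is the mutex variant collected in one place. Treat it as a sketch under a few assumptions: it uses std::mutex and std::lock_guard as stand-ins for tbb::spin_mutex and its scoped_lock so that it builds without Intel® TBB, it uses std::less as a stand-in for compare_type, and the helper names locked_push and locked_try_pop are mine, not part of any library.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

typedef int value_type;
// std::less is an assumed stand-in for the post's compare_type.
typedef std::priority_queue<value_type, std::vector<value_type>,
                            std::less<value_type> > pq_t;

pq_t my_pq;
std::mutex my_mutex;  // stand-in for tbb::spin_mutex

// Hypothetical helper: push under the lock.
void locked_push(const value_type& elem) {
    std::lock_guard<std::mutex> my_lock(my_mutex);
    my_pq.push(elem);
}

// Hypothetical helper: pop under the lock; returns false if the queue was empty.
bool locked_try_pop(value_type& elem) {
    std::lock_guard<std::mutex> my_lock(my_mutex);
    if (my_pq.empty()) return false;
    elem = my_pq.top();
    my_pq.pop();
    return true;
}
```

The aggregator variant has the same shape: the body of each helper (minus the lock) becomes the lambda passed to execute.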

How the Aggregator Works

As we see above, using the aggregator in basic mode differs only trivially from using a mutex. However, it clearly works in a different way. To execute a critical section, you pass it to an aggregator via the execute method. When the execute method returns, the critical section has been executed, but how this happened is hidden inside the black box of the aggregator.

Looking at the header file aggregator.h that defines the aggregator, these details become clear. To use the aggregator in expert mode, you should have some familiarity with the header file, and I'll guide you through the most important features in the rest of this blog.

First note that aggregator inherits from a class aggregator_ext that takes a template parameter. Aggregator instantiates that template parameter with a simple handler defined in the header, handler_type = internal::basic_handler. We will discuss this more later.

The execute method of aggregator takes a function body as a parameter and encapsulates that body in a basic_operation object, which inherits from aggregator_operation. Aggregator_operations are sent to the aggregator_ext’s mailbox, where they may accumulate concurrently while they await execution. One thread, the active handler (that is, the first thread to place an aggregator_operation in the empty mailbox), will grab all the operations that have accumulated there, effectively emptying the mailbox. It will then go through all the operations that it grabbed and serially execute the function bodies stored in those objects. The mechanism used to execute function bodies is specified by aggregator_ext’s template parameter, which in the default case is called basic_handler.

This basic_handler is straightforward in its functioning: it is passed the list of aggregator_operations, and it loops through this list and handles each item. It makes use of a few methods on aggregator_operation to do this properly: next is used to traverse to the next operation in the list, start prepares the operation to be handled, and finish is called after the operation is handled to inform the thread waiting on the execution of the operation that the operation is completed. When all operations are handled, the active handler thread can leave the aggregator, since its own call to execute has been satisfied in the process.

The details of the synchronization that make this all possible can be found in aggregator.h. We won’t explain them fully here, because we already have enough information to proceed to use the aggregator in expert mode. It is enough to know that threads hand over critical sections to the aggregator, and one of these threads will execute all the operations serially on behalf of the other threads as a single critical section.
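
To make the mailbox and active handler idea concrete, here is a deliberately simplified sketch written with standard C++ atomics only. It is not the code in aggregator.h, and it makes no attempt to match its performance; the names toy_aggregator, op_node, handler_busy and toy_aggregator_demo are all hypothetical. It only illustrates the protocol described above: every thread drops its operation into a shared list, the thread that finds the list empty becomes the handler, and everyone else waits until the handler marks their operation as done.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <queue>
#include <thread>
#include <vector>

// Simplified illustration only; NOT the implementation in tbb/aggregator.h.
class toy_aggregator {
    struct op_node {
        std::function<void()> body;  // the critical section to run
        op_node* next;               // next operation in the mailbox list
        std::atomic<bool> done;      // set by the handler once body has run
        op_node() : next(nullptr), done(false) {}
    };
    std::atomic<op_node*> mailbox;   // pending operations (a LIFO list)
    std::atomic<bool> handler_busy;  // true while a thread is handling a batch

public:
    toy_aggregator() : mailbox(nullptr), handler_busy(false) {}

    template <typename Body>
    void execute(const Body& b) {
        op_node op;                  // lives on the calling thread's stack
        op.body = b;

        // Drop the operation into the mailbox.
        op_node* head = mailbox.load();
        do {
            op.next = head;
        } while (!mailbox.compare_exchange_weak(head, &op));

        if (head != nullptr) {
            // Mailbox was not empty: another thread will run our body.
            while (!op.done.load()) std::this_thread::yield();
            return;
        }

        // Mailbox was empty: this thread is the active handler.
        // First wait for the previous handler (if any) to finish its batch.
        while (handler_busy.exchange(true)) std::this_thread::yield();

        // Grab everything that has accumulated, emptying the mailbox.
        op_node* list = mailbox.exchange(nullptr);
        while (list) {
            op_node* next = list->next;  // read first: node may vanish once done
            list->body();
            list->done.store(true);
            list = next;
        }
        handler_busy.store(false);
    }
};

// Hypothetical demo: several threads push into one serial priority queue.
int toy_aggregator_demo(int num_threads, int pushes_per_thread) {
    std::priority_queue<int> pq;
    toy_aggregator agg;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&, t] {
            for (int i = 0; i < pushes_per_thread; ++i) {
                int v = t * pushes_per_thread + i;
                agg.execute([&pq, v] { pq.push(v); });
            }
        });
    for (auto& th : threads) th.join();
    return static_cast<int>(pq.size());
}
```

Note that this toy version runs each batch in reverse arrival order, because the mailbox is a simple LIFO list. That is acceptable here because all operations in a batch are concurrent with one another, so any serial order is a valid one.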

Using the Aggregator in Expert Mode

I’ll use the same example as before, allowing threads to safely push and pop to a serial std::priority_queue. The expert mode of aggregator allows the user to pass any sort of data in to the aggregator as an aggregator_operation via the process method (note the different method name – we were using execute in basic mode), along with an aggregating function object that is called by the active handler to perform the serial execution of operations. In this case, I’ll pass data about a push or pop operation to the aggregator via process, and provide a custom function object to perform the operations.

First, create a class derived from aggregator_operation to hold the operation data.

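class op_data : public aggregator_operation {
public:
    value_type* elem;  // value to push, or destination for a popped value
    bool success;      // set by the handler when a pop finds an element
    bool is_push;      // true for push, false for pop
    op_data(value_type* e, bool push=false): elem(e), success(false), is_push(push) {}
};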

Then, create a handler to pass in as the aggregator’s template parameter:

class my_handler_t {
    pq_t *pq;
public:
    my_handler_t() {}
    my_handler_t(pq_t *pq_) : pq(pq_) {}
    void operator()(aggregator_operation* op_list) {
        op_data* tmp;
        while (op_list) {
            tmp = (op_data*)op_list;
            op_list = op_list->next();
            tmp->start();
            // handle tmp here
            if (tmp->is_push) pq->push(*(tmp->elem));
            else {
                if (!pq->empty()) {
                    tmp->success = true;
                    *(tmp->elem) = pq->top();
                    pq->pop();
                }
            }
            // done handling tmp
            tmp->finish();
        }
    }
};

Now, to create an aggregator, use the aggregator_ext type name and pass this handler’s type in as the template parameter, and initialize the handler and pass it in as an argument to the constructor:

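// The handler takes a pointer to the queue. The extra parentheses keep the
// compiler from parsing this line as a function declaration.
aggregator_ext<my_handler_t> my_aggregator((my_handler_t(&my_pq)));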

To perform a push, simply create the op_data node with the push information and pass it to process:

op_data my_push_op(&elem, true);
my_aggregator.process(&my_push_op);

And to perform a pop:

bool result;
op_data my_pop_op(&elem);
my_aggregator.process(&my_pop_op);
result = my_pop_op.success;

When to use Aggregator and why use Expert Mode?

A good way to start is to compare the performance of your code using your current locking mechanism to a version of your code that uses an aggregator instead. In practice, we (developers of TBB) have often found that a mutex is sufficient and outperforms aggregator when contention on the critical region is low. For higher contention, we often find that the use of the aggregator is justified.
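
One way to set up such a comparison is a small timing harness that is parameterized on how the critical section gets executed. The sketch below is only a starting point under stated assumptions: bench_result, mutex_executor and time_contended_pushes are hypothetical names of mine, std::mutex stands in for whatever lock you use today, and the workload (contended pushes on a std::priority_queue) is a placeholder for your own critical section.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical result record: wall-clock time and a sanity-check value.
struct bench_result {
    double seconds;
    std::size_t final_size;
};

// Hypothetical executor: runs a body under a std::mutex.
struct mutex_executor {
    std::mutex m;
    template <typename F>
    void operator()(F f) {
        std::lock_guard<std::mutex> lock(m);
        f();
    }
};

// Times num_threads threads each pushing ops_per_thread values, with every
// push routed through exec so different mechanisms can be swapped in.
template <typename Executor>
bench_result time_contended_pushes(Executor& exec, int num_threads,
                                   int ops_per_thread) {
    std::priority_queue<int> pq;
    std::vector<std::thread> threads;
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&exec, &pq, t, ops_per_thread] {
            for (int i = 0; i < ops_per_thread; ++i) {
                int v = t * ops_per_thread + i;
                exec([&pq, v] { pq.push(v); });
            }
        });
    for (auto& th : threads) th.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    bench_result r = { dt.count(), pq.size() };
    return r;
}
```

To measure the aggregator, you would write a second executor whose operator() forwards the body to my_aggregator.execute, then run both across a range of thread counts. Timings from a toy workload like this one are noisy, so I would repeat each configuration several times before drawing conclusions.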

The aggregator provides most of its performance improvements in hot cache execution of operations on a single thread. (Recall the active handler?) Thus, the more concurrent contention on your critical region, the larger the aggregations will be that are assembled, and the greater the benefits of executing operations with a hot cache on a single thread.

If you do find that the basic aggregator improves your code’s performance, consider moving to the expert level. To begin with, you can simply transform your code as I’ve shown in the expert example above. This should result in better performance over the basic interface. The reason for this is that, in the basic interface, the function object or lambda expression you wish to execute and all the references to data that you want that code to access are stored on the stack of the thread that originated the operation. Referring back to the basic example, this means that for each operation, we look up a different reference to the same priority queue. But, in the expert example above, note that we store just a few data references in the aggregator_operation, and the code to execute the operation and references to the shared data (my_pq) are local to the aggregating functor and only need to be looked up once to handle all the operations in an aggregation. This enhances the hot cache effect by reducing the quantity of non-local stack accesses.

The expert-level usage of aggregator shown above is quite straightforward. However, you are free to handle operations in the aggregating handler in whatever manner you like. Consider the aggregation of operations an opportunity to develop new and interesting serial algorithms. This gives you a unique opportunity to make use of a kind of lookahead capability: you know the set of operations that you need to perform. For example, Intel® TBB’s concurrent_priority_queue handles the operations in two passes, performing some of them and postponing others, because some orderings of operations are more efficient than others. The only rules for processing operations in the aggregating handler are that they should all be handled, and, in some cases, there should be some serial sequence of the operations that achieves the same result (i.e. sequential consistency).
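
As a small illustration of the lookahead idea, here is a sketch of a two-pass handler body written against plain standard C++ types. It is not how concurrent_priority_queue is implemented; it is just one simple policy I made up for the example: perform every push in the batch first, postpone the pops, then serve the pops. The names batch_op and handle_batch_two_pass are hypothetical, and batch_op is a stand-in for the op_data record from the expert example.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical stand-in for op_data; not a TBB type.
struct batch_op {
    int* elem;     // value to push, or destination for a popped value
    bool is_push;  // true for push, false for pop
    bool success;  // set when a pop finds an element
};

// Handles one aggregated batch in two passes.
void handle_batch_two_pass(std::priority_queue<int>& pq,
                           const std::vector<batch_op*>& ops) {
    std::vector<batch_op*> postponed;

    // Pass 1: perform all pushes, postpone all pops.
    for (batch_op* op : ops) {
        if (op->is_push) pq.push(*op->elem);
        else postponed.push_back(op);
    }

    // Pass 2: serve the postponed pops from the updated queue.
    for (batch_op* op : postponed) {
        if (!pq.empty()) {
            op->success = true;
            *op->elem = pq.top();
            pq.pop();
        }
    }
}
```

This reordering respects the rules above: every operation is handled, and the outcome matches the serial sequence "all pushes, then all pops", which is a legal order because the operations in one batch were issued concurrently. A side effect of this particular policy is that a pop arriving in the same batch as a push is less likely to find the queue empty.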

I’d like to hear about your experiences using aggregator, so if you get a chance, give it a try, and let me know how it went! You can comment here, or better yet, start a discussion on the Intel® TBB forum.
