An Introduction to Cilk™ Plus Reducers

If you are new to Cilk™ Plus, you have probably been impressed by how easy it is to turn a serial program into a parallel program. You’ve also realized, though, that adding cilk_spawn or cilk_for to a program doesn’t automatically solve the harder parts of parallel programming: dealing with data races and coordinating the work that is done in parallel.

Fortunately, there is more to Cilk Plus than just the keywords. The Cilk Plus library contains some nice features for solving problems in parallel without even thinking about races, locks, and coordination. In this series I’m going to talk about Cilk reducers, which let you parallelize “accumulator algorithms” efficiently and without data races. This post is just an introduction; in later posts I will explain how and why reducers work, survey the predefined reducers that are available in the Cilk Plus library, and describe how you can write your own reducers when there isn’t a predefined one that solves your problem.

An accumulator algorithm computes a value by updating a variable (the “accumulator variable”) in each step of a computation. You might call this the “Hello, World” of accumulator algorithms:

#include <iostream>
int main()
{
    unsigned long accum = 0;
    for (int i = 0; i != 1000; i++) {
        accum += i*i;
    }
    std::cout << accum << "\n";
}

This looks like it should be pretty easy to parallelize. Just throw in a cilk_for, right?

#include <iostream>
#include <cilk/cilk.h>

int main()
{
    unsigned long accum = 0;
    cilk_for (int i = 0; i != 1000; i++) {
        accum += i*i;
    }
    std::cout << accum << "\n";
}

Of course not! You would end up with multiple occurrences of accum += i*i executing simultaneously. Presto: instant data race.
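To see why, remember that accum += i*i is not one indivisible operation: each worker loads accum, adds into its own temporary, and stores the result back, and two workers can interleave those steps so that one update overwrites the other. Here is a schematic replay of one losing interleaving (illustrative only; tmp1 and tmp2 stand for the workers’ private temporaries, and the values are made up):

unsigned long accum = 5;
unsigned long tmp1 = accum;   // worker 1 reads 5
unsigned long tmp2 = accum;   // worker 2 also reads 5
tmp1 += 3*3;                  // worker 1 computes 14
tmp2 += 4*4;                  // worker 2 computes 21
accum = tmp1;                 // worker 1 writes 14
accum = tmp2;                 // worker 2 writes 21 -- the 3*3 contribution is lost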

Avoiding Conflict

At this point, in traditional parallel programming, you would start thinking about adding locks for the accumulator updates. Cilk Plus has a more elegant solution, though. Replace the accumulator variable with a Cilk Plus reducer, and Cilk Plus will take care of the rest:

#include <iostream>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

int main()
{
    cilk::reducer_opadd<unsigned long> accum(0);
    cilk_for (int i = 0; i != 1000; i++) {
        *accum += i*i;
    }
    std::cout << accum.get_value() << "\n";
}

You just

  • use a “reducer” (cilk::reducer_opadd<unsigned long>) instead of a simple variable for the accumulator variable,
  • treat the reducer like a pointer to the actual variable when you update it (*accum += i*i), and
  • get the final value out of the reducer when the computation is done (accum.get_value()),

and it all just works.

A note about the examples: If you try executing the examples from this post, you will probably find that your parallel program runs slower than the one you started with. Cilk Plus parallelism and reducers are remarkably efficient, but they still have enough overhead to wipe out the advantages of parallelizing ridiculously small tasks. Alternatively, you may find that the programs are so small that nothing actually executes in parallel.

I’ve chosen trivial examples to keep the details from obscuring the issues. We’ll get a couple of realistic programs later in this series.
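If you want to confirm how many worker threads the runtime actually has available (or convince yourself that a tiny loop never spreads across them), you can ask the Cilk Plus runtime. Here is a minimal sketch, using the standard <cilk/cilk_api.h> header; it is not part of the examples above:

#include <iostream>
#include <cilk/cilk_api.h>

int main()
{
    // The number of worker threads is controlled by the CILK_NWORKERS
    // environment variable and defaults to the number of cores.
    std::cout << "workers: " << __cilkrts_get_nworkers() << "\n";
}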

Maintaining Order

That may not have seemed all that impressive. Adding locks isn’t all that big a bother. But coordinating a computation is often more of a problem than avoiding the data races. Consider this example:

#include <string>
#include <iostream>
#include <cilk/cilk.h>

int main()
{
    std::string alphabet;
    cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
        alphabet += letter;
    }
    std::cout << alphabet << "\n";
}

It has the same race problem as the previous example — simultaneous appends to the string variable — but it also has a much worse problem. If you just add locking code to keep the string appends from stepping on each other, you might get output like:

    KPFGLABCUHQRSVWXDMTYZNEIJO

Locks make sure that each update happens correctly, independently of all the other updates, but they don’t do a thing to make sure that the updates happen in the right order. In fact, it seems as though you would have to serialize the tasks to build the output in the right order, which would defeat the point of parallelizing the program.
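For comparison, here is a minimal sketch of what that lock-based version might look like (it uses std::mutex and is not from the original example). It eliminates the race, but the letters are still appended in whatever order the workers happen to reach them:

#include <string>
#include <iostream>
#include <mutex>
#include <cilk/cilk.h>

int main()
{
    std::string alphabet;
    std::mutex alphabet_mutex;  // protects every append to alphabet
    cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
        std::lock_guard<std::mutex> guard(alphabet_mutex);
        alphabet += letter;     // race-free, but the order still depends on scheduling
    }
    std::cout << alphabet << "\n";
}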

Cilk Plus reducers have a remarkable property, though: they are guaranteed to get the same result in a parallel computation as in a serial computation. That is, even in a parallel computation, they will combine all the input values in the same order as in the serial computation!

You can make the same changes that you made to the first example:

#include <string>
#include <iostream>
#include <cilk/cilk.h>
#include <cilk/reducer_string.h>

int main()
{
    cilk::reducer_string alphabet;
    cilk_for(char letter = 'A'; letter <= 'Z'; ++letter) {
        *alphabet += letter;
    }
    std::cout << alphabet.get_value() << "\n";
}

When you run this program, it will always print

    ABCDEFGHIJKLMNOPQRSTUVWXYZ

A More Substantial Example

In the next installment of this series, I’ll pull back the curtain and reveal how reducers can avoid races without locks, and produce serial results from a parallel computation. Meanwhile, here is one last example to show that the power of reducers isn’t limited to loops and arithmetic.

The function filter_tree() is called with a binary tree and a key. It finds all the nodes in the tree whose key field matches the key and returns a list of the value fields of those nodes. The values in the list appear in in-order traversal order (left subtree, then root, then right subtree). The filter_tree() function uses a worker function, filter_and_collect(), that does a recursive walk over the binary tree to construct the list.

I’ll just show the version with reducers this time.

#include <list>
#include <cilk/cilk.h>
#include <cilk/reducer_list.h>

// The tree node structure.
//
template <typename Key, typename Value>
struct TreeNode {
    TreeNode* left_subtree;
    TreeNode* right_subtree;
    Key key;
    Value value;
};

// The worker function. Walk a subtree and add the values 
// of all nodes that match a key to a list reducer.
//
template <typename Key, typename Value>
void filter_and_collect(const TreeNode<Key, Value>* subtree, 
                   const Key& key,
                   cilk::reducer_list_append<Value>& list)
{
    if (!subtree) return;
    cilk_spawn filter_and_collect(subtree->left_subtree, key, list);
    if (subtree->key == key) {
        list->push_back(subtree->value);
    }
    filter_and_collect(subtree->right_subtree, key, list);
    // There is an implicit cilk_sync at the end of the function, which waits
    // for the spawned walk of the left subtree.
}

// The main function. Compute and return a list of the 'value'
// fields of all the nodes in a tree whose 'key' fields match a 
// specified 'key'.
//
template <typename Key, typename Value>
std::list<Value> filter_tree(const TreeNode<Key, Value>* tree, 
                             const Key& key)
{
    cilk::reducer_list_append<Value> list;
    filter_and_collect(tree, key, list);
    return list.get_value();
}
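For example, a caller might drive filter_tree() like this (a hypothetical test program, not part of the original post):

#include <iostream>

int main()
{
    // Build a three-node tree: the left leaf and the root match key 1,
    // the right leaf does not.
    typedef TreeNode<int, char> Node;
    Node left  = { 0, 0, 1, 'a' };
    Node right = { 0, 0, 2, 'b' };
    Node root  = { &left, &right, 1, 'c' };

    std::list<char> matches = filter_tree(&root, 1);
    for (std::list<char>::const_iterator it = matches.begin();
         it != matches.end(); ++it) {
        std::cout << *it << ' ';
    }
    std::cout << "\n";   // prints "a c" -- in-order (left, root, right) order
}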

Summary

Note the common theme throughout these examples.

  1. You replace an accumulator variable with a Cilk Plus reducer of a kind that is appropriate to the algorithm being parallelized.
  2. You parallelize the loop or recursion that does the updates to the accumulator variable with cilk_for or cilk_spawn.
  3. You replace updates of the accumulator variable with updates of a dereference of the reducer.
  4. At the end of the computation, you call the reducer’s get_value() function to retrieve the computed value.
  5. The final value is the same as the value that was computed by the original serial code.
  6. You don’t add any code to avoid data races.
  7. You don’t add any code to preserve the serial order of the computation.

For more information about Intel Cilk Plus, see the Cilk Plus web site. For questions and discussions about Intel Cilk Plus, see the Cilk Plus Forum.
