When Threading Building Blocks Attack!

 This week (25-29 JUL 2011) I'm lecturing at the UPCRC UIUC Summer School on Multicore Programming. I like teaching parallel programming and I live near the UIUC campus, so it's convenient for me. Overall I have a good time and often learn some things from the students. As an example, I found the following oddity about using lambdas within Intel® Threading Building Blocks (TBB).

One of the problems that I provided for the TBB hands-on lab was Prim's Algorithm to compute a Minimum Spanning Tree (MST) on a given undirected, weighted graph. I know this makes a good coding exercise since a TBB solution is given as Example 10-10 in The Art of Concurrency.  Written in 2008, that code uses the imperative method of creating a new class for the parallel_reduce algorithm. Today, we have the possibility of using lambdas instead. But, I'm getting ahead of myself. A quick review of what is involved in Prim's Algorithm is in order.

After choosing some arbitrary node to be in the partial MST, new graph nodes are added, one at a time per iteration of a loop. The first part of the processing within the loop body that adds a node is a determination of that node that is not currently part of the partial MST and is closest (connected by an edge of least weight) to some node that is in the MST. The smallest weight of nodes not part of the MST are held in a separate vector and are updated after a new node is added. Thus, to find the closest edge, a search for the index of the minimum value in the vector is done. This search can be done in parallel and involves a reduction operation.

The fact is that the computation to find the index of the minimum element in a vector is the prototypical example for illustrating how to use the TBB parallel_reduce() algorithm. Two of the students that were working on the problem expected that this example should be easily translatable to using the lambda notation from the serial Prim's Algorithm code. So did I. What we didn't expect are the rigmarole that was required to actually carry out such a simple sounding translation.

To make this blog more worthwhile than me just ranting about the vagaries of TBB, please allow me to take an instructional bent to the remainder. Here is the original serial code of interest that finds the minimum value in the minDist vector, but returns the location (index) of that element in nodeIdx. (Note: a negative value in minDist is used to signal that the node has been previously added to the MST.)

min = FLT_MAX;

   for (j = 1; j < N; j++) {

     if (0 <= minDist[j] && minDist[j] < min) {

       min = minDist[j];

       nodeIdx = j;

     }

   }


 And here is the TBB class I've used to perform this minimum index reduction computation through a call to parallel_reduce():

class NearestNeighbor {

const float *const NNDist; 


public:

  float minDistVal;

  int minDistIndex; 


  void operator()(const blocked_range<int>& r) {

    for(int j = r.begin(); j != r.end(); ++j) {

      if (0 <= NNDist[j] && NNDist[j] < minDistVal) {

        minDistVal = NNDist[j];

        minDistIndex = j;

      }

    }

  } 


  void join( const NearestNeighbor& y ) {

    if (y.minDistVal < minDistVal) {

      minDistVal = y.minDistVal;

      minDistIndex = y.minDistIndex;

    }

  } 


  NearestNeighbor( const float *nnd ) :

    NNDist(nnd), minDistVal(FLT_MAX), minDistIndex(-1) {}

  NearestNeighbor( NearestNeighbor& x, split ) :

    NNDist(x.NNDist), minDistVal(FLT_MAX), minDistIndex(-1) {}

};


From the code above, we can extract the desired index from the minDistIndex member of the NearestNeighbor object. The lambda version of parallel_reduce() returns a value. During my lecture on Threading Building Blocks, I showed an example of parallel_reduce() that utilizes a more traditional summing reduction to compute an approximation of pi through numerical integration.  Using that code as a template, the students transformed those computations and fields and parameters into code that should return the index of the minimum value. And so, we thought that this should work...

nodeIdx = parallel_reduce(

  blocked_range(1, N),

  int(-1),

  [&](blocked_range& r, int local_idx ) -> int

    {

      int minDistVal = FLT_MAX;

      local_idx = -1;

      for (size_t i = r.begin(); i < r.end(); ++i) {

        if (0 <= NNDist[j] && NNDist[j] < minDistVal) {

          minDistVal = NNDist[j];

          local_idx = j;

        }

      }

      return local_idx;

  },

  [&] ( int idx1, int idx2 ) -> int

    {

      if (NNdist[idx1] <= NNdist[idx2])

        return idx1;

      else

        return idx2;

    }

);


Using Intel® C++ Composer XE 2011 within Microsoft Visual Studio 2010, there was an error on the join lambda. The message kept saying that we were only allowed to have a single return in the lambda.

Interesting, but easily fixed by adding a local variable to hold the index to be returned (assigned in the if-then-else statement) and then the single return statement. Unfortunately, that didn't get rid of the error.

After cogitating on the error message, we decided that maybe it was telling us that not only were we restricted to a single return statement in the body of the lambda, but that the only thing that was allowed in the lambda was that single return statement. (Surely that couldn't be right. I mean, look at the body lambda. That got more than one line of code. And the imperative join method uses more than one line of code.) Thus, we tried this version of the join method...

[&]( int idx1, int idx2 ) -> int

  {

    return (NNdist[idx1] <= NNdist[idx2]) ? idx1 : idx2;

  }


And that worked.

Is this "feature" of TBB documented? (I haven't yet found it, if it is.) Was it some odd restriction that is only in force within the C++ Composer XE 2011 version of the library? Or is it just my own ignorance of C++ and the lambda features being added to the C++0X standard?

It does make an odd kind of sense, I suppose. If you've only got two things that can be considered for your reduction operation, it should be an easy matter of picking one or the other or combining the two values into a single value that can be returned. Who would need more than one line to do that? Even so, it was just a weird thing to run up against.

(Disclaimer: I've recreated some of the code segments above from memory and witness statements provided after the fact. The code is provided for illustration purposes only and may not compile or execute correctly without modification.)

Update (04 AUG 11): Modified the return value as pointed out in the first comment.  My memory isn't what it used to be.

Pour de plus amples informations sur les optimisations de compilation, consultez notre Avertissement concernant les optimisations.