Porting OpenMP SPEC benchmarks to TBB.

By Alexey Murashov, Published: 05/08/2008, Last Updated: 05/08/2008

Greetings, everyone! I would like to share my experience in porting OpenMP applications to the Threading Building Blocks library. Last year I ported some of the SPEC OMP 2001 benchmarks to TBB; the suite was chosen because it is well known in the multi-threading world. Three of its benchmarks are written in C: 320.equake, 330.art and 332.ammp. TBB is a C++ library, so the Fortran-based applications in the SPEC OMP suite are not as easily ported.

From OpenMP to TBB - some general tips

OpenMP applications express parallelism with pragma directives, which are used to parallelize loops, provide locks and protect shared data. To port an application to TBB, each OpenMP pragma has to be matched to a TBB class or template. Let's consider this very simple loop from the 320.equake test:

#pragma omp parallel for private(i)
    for (i = 0; i < nodes; i++) {
      w2[j][i] = 0;
    }

OpenMP executes the loop iterations in parallel, and each thread has its own instance of the variable "i". The analogous TBB code is shown below: I encapsulated the loop body in a function object that is passed to the template function tbb::parallel_for, and "i" becomes the loop variable of the inner loop in operator().

class SVMPLoop1Class
{
    int m_j;
public:
    SVMPLoop1Class (int j) : m_j(j) {}
    void operator () ( const tbb::blocked_range<int>& range) const{
        for ( int i = range.begin(); i != range.end(); ++i )
            w2[m_j][i] = 0;
    }
};

SVMPLoop1Class svmp1loop(j);
tbb::parallel_for(tbb::blocked_range<int>(0, nodes), svmp1loop, tbb::auto_partitioner());

TBB adds one more class to the code, so it looks more complicated than the original. However, the tbb::parallel_for call itself looks similar to the original loop. Note that I used tbb::auto_partitioner here to let TBB choose the grain size automatically.
This scheme is common for #pragma omp parallel for directives. If the private clause lists several variables, just turn them into local variables inside the body's operator(). Shared variables can be copied into data members, like the j variable in the example above. The SPEC applications make heavy use of global variables; I found that replacing them with local variables works fine in many cases.
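To make the mapping concrete, here is a minimal sketch of that conversion pattern. It is not taken from the benchmarks: the loop, the ScaleLoopBody class and its members are made up for illustration, and the commented-out usage lines show how such a body would be invoked.

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/task_scheduler_init.h"

// Hypothetical OpenMP original:
//   #pragma omp parallel for private(i, tmp)
//   for (i = 0; i < n; i++) { tmp = data[i] * scale; data[i] = tmp; }
class ScaleLoopBody
{
    double *m_data;  // shared array, captured as a pointer data member
    double  m_scale; // shared scalar, copied by value into a data member
public:
    ScaleLoopBody (double *data, double scale) : m_data(data), m_scale(scale) {}
    void operator () ( const tbb::blocked_range<int>& range ) const {
        for ( int i = range.begin(); i != range.end(); ++i ) {
            double tmp = m_data[i] * m_scale; // "private" variables become locals
            m_data[i] = tmp;
        }
    }
};

// Usage (a tbb::task_scheduler_init object must exist before the first parallel call):
//   tbb::task_scheduler_init init;
//   tbb::parallel_for(tbb::blocked_range<int>(0, n), ScaleLoopBody(data, scale),
//                     tbb::auto_partitioner());
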
Explicit OpenMP locks can be replaced with TBB mutexes, for example tbb::spin_mutex, as in this example from 332.ammp:

OpenMP code:

      omp_set_lock(&(a1->lock));
      ...
      omp_unset_lock(&(a1->lock));

TBB code:

{
      // scoped_lock acquires lock on creation
      // and releases it automatically after leaving the scope
      tbb::spin_mutex::scoped_lock lock (a1->spin_mutex);
      ...
}

This construction is quite simple too.
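One related detail is where the lock itself lives. In the OpenMP version the lock is an omp_lock_t field of the atom structure and has to be initialized with omp_init_lock(); in the TBB port it becomes a tbb::spin_mutex member that is ready to use right after default construction. Here is a hedged sketch of that declaration change - the struct layout and the USE_OPENMP macro are illustrative, not copied from the benchmark source:

#ifdef USE_OPENMP
#include <omp.h>
#else
#include "tbb/spin_mutex.h"
#endif

typedef struct atom_struct {
    double x, y, z;                /* ... other benchmark fields ... */
#ifdef USE_OPENMP
    omp_lock_t lock;               /* needs omp_init_lock() before first use */
#else
    tbb::spin_mutex spin_mutex;    /* usable immediately after default construction */
#endif
} ATOM;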

Resolving thread ID dependency - tbb::parallel_reduce instead of tbb::parallel_for.

While I was able to replace most omp parallel for directives using the approach above, the loop in 320.equake that finds the hypocenter and epicenter turned out to be more complicated. It divides the data between threads, finds the minimum value for each thread, and then finds the global minimum. The thread ID obtained from omp_get_thread_num() is used to keep the per-thread data free of races. Thread IDs don't fit well into the higher-level, task-based programming model used by TBB, but tbb::parallel_reduce suits this case: finding a global minimum from per-thread minima is a classic example of a reduction. Here is the example from the 320.equake benchmark (I skipped some code just to simplify things):

OpenMP code:

#pragma omp parallel private(my_cpu_id,d1,d2,c0)
{
   my_cpu_id=omp_get_thread_num();// thread ID
   bigdist1[my_cpu_id]=1000000.0; // minimum per thread
   temp1[my_cpu_id]=-1; // node for which the minimum was found
#pragma omp for
     for (i = 0; i < ARCHnodes; i++) {
        d1 = distance(...);
        if (d1 < bigdist1[my_cpu_id]) {
           bigdist1[my_cpu_id] = d1;
           temp1[my_cpu_id] = i;
        }
    }
}

// finding out the global minimum
numthreads = omp_get_max_threads();
d1=bigdist1[0];
Src.sourcenode=temp1[0];
for (i=0;i<numthreads;i++) // scan the per-thread minima
{
   if (bigdist1[i] < d1) {
      d1=bigdist1[i];
      Src.sourcenode = temp1[i];
   }
}

TBB code:

struct SearchEpicenterAndHypocenter
{
    double m_bigdist1;
    struct source *m_Src;
    int m_temp1;

    SearchEpicenterAndHypocenter(struct source * Src): m_bigdist1(1000000.0), m_Src(Src), m_temp1(-1) {}

    SearchEpicenterAndHypocenter(SearchEpicenterAndHypocenter &search, tbb::split)
    {
        m_Src = search.m_Src;
        m_bigdist1 =1000000.0;
        m_temp1 = -1;
    }

    void operator () ( const tbb::blocked_range<int>& range)
    {
        for (int i = range.begin(); i != range.end(); ++i)
        {
            // The function finds out the local minimum
            SearchCycle (i, m_temp1, m_bigdist1, m_Src);
        }
    }

    void join (SearchEpicenterAndHypocenter &search)
    {
        if (search.m_bigdist1 < m_bigdist1) {
            m_bigdist1 = search.m_bigdist1;
            m_temp1 = search.m_temp1;
        }
    }
};

SearchEpicenterAndHypocenter searchcenters (&Src);
tbb::parallel_reduce(tbb::blocked_range<int>(0, ARCHnodes), searchcenters, tbb::auto_partitioner());
Src.sourcenode = searchcenters.m_temp1;

As usual, the TBB version creates an additional function object that is passed to the template function tbb::parallel_reduce. Because the reduction step is ordinary user-defined code (the join method), parallel_reduce does not need the thread-ID-based trick used with OpenMP, so the thread ID dependency is resolved naturally.

Resolving thread ID dependency - placing a thread ID dependent buffer inside the tbb::parallel_for and tbb::parallel_reduce body objects

The above approach doesn't resolve all thread ID dependencies. I found it insufficient when I ported the other SPEC benchmarks, 330.art and 332.ammp. Parallel loops in these benchmarks use global buffers that are protected from data races by indexing them with thread IDs. Here is an example of such a buffer from the 330.art test:

f1_neuron **f1_layer; // The buffer protected by thread ID
int numthreads = omp_get_max_threads();
f1_layer = (f1_neuron **)malloc(numthreads * sizeof (f1_neuron*));

#pragma omp parallel for private (i)
   for (i=0;i<numthreads;i++)
       f1_layer[i] = (f1_neuron*) malloc(numf1s * sizeof(f1_neuron));

// After that each thread works with its own part of f1_layer buffer.
o = omp_get_thread_num();

#pragma omp for private (k,m,n, gPassFlag) schedule(dynamic)
    for (ij = 0; ij < ijmx; ij++)
    { 
       j = ((ij/inum) * gStride) + gStartY;
       i = ((ij%inum) * gStride) + gStartX;
       k=0;
       for (m=j;m<(gLheight+j);m++)
         for (n=i;n<(gLwidth+i);n++)
           f1_layer[o][k++].I[0] = cimage[m][n];
       ...
    }

This approach is simple but relies directly on thread IDs, and it doesn't work with TBB because the code has no access to thread IDs. So I had to implement another approach: such buffers can be moved inside the parallel_for and parallel_reduce function objects as data members. Each task then operates on the buffer belonging to its own copy of the body object, so thread safety is preserved. Here is an implementation example from the 332.ammp benchmark (I skipped some code to shorten the example):

//this class is for reduce version of the main loop
class MMFVUpdateLoopReduceClass
{
  ATOM **m_atomall; //This buffer was threadID dependent in OpenMP code
  int m_natoms;     //Number of atoms (other data members are skipped here)
  public:
  MMFVUpdateLoopReduceClass(...)
    {
        //Allocating memory for the buffer in constructor
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
        ...
    }

    //This method is empty; it does not perform an actual reduction
    void join (MMFVUpdateLoopReduceClass &reduceloop) {}

    // Splitting constructor
    MMFVUpdateLoopReduceClass(MMFVUpdateLoopReduceClass &reduceloop, tbb::split)
    {
        m_natoms = reduceloop.m_natoms;
        //The buffer is allocated in the splitting constructor too
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
    }

    void operator () ( const tbb::blocked_range<int>& range) const{
        for ( int ii = range.begin(); ii != range.end(); ++ii )
        {
            // do some work using m_atomall
        }
    }

    ~MMFVUpdateLoopReduceClass()
    {
        free(m_atomall);
    }
};

// The parallel loop
  MMFVUpdateLoopReduceClass loop(...);
  tbb::parallel_reduce(tbb::blocked_range<int>(0, jj), loop, tbb::auto_partitioner());

You probably noticed that tbb::parallel_reduce doesn't actually do any reduction here, so why not simply use parallel_for? The reason is performance. Note that the memory allocation is placed in the constructor; allocation is a heavy operation, so if the constructor is called frequently, performance degrades. I found that TBB makes fewer constructor calls with tbb::parallel_reduce than with tbb::parallel_for, so parallel_reduce with an empty join method is more efficient here.
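If you want to check this behavior on your own machine, one simple experiment is to instrument a dummy body with a counter and run it through both algorithms. The sketch below is only a suggested experiment, not benchmark code; CountingBody and g_copies are made-up names:

#include <cstdio>
#include "tbb/atomic.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/task_scheduler_init.h"

static tbb::atomic<int> g_copies; // counts body copies made by the algorithm

struct CountingBody {
    CountingBody() {}
    CountingBody(const CountingBody&) { ++g_copies; }       // copy constructor, used by parallel_for
    CountingBody(CountingBody&, tbb::split) { ++g_copies; } // splitting constructor, used by parallel_reduce
    void operator () (const tbb::blocked_range<int>&) const {}
    void join (CountingBody&) {}
};

int main() {
    tbb::task_scheduler_init init;
    CountingBody body;

    g_copies = 0;
    tbb::parallel_for(tbb::blocked_range<int>(0, 1000000), body, tbb::auto_partitioner());
    std::printf("parallel_for copies:    %d\n", (int)g_copies);

    g_copies = 0;
    tbb::parallel_reduce(tbb::blocked_range<int>(0, 1000000), body, tbb::auto_partitioner());
    std::printf("parallel_reduce copies: %d\n", (int)g_copies);
    return 0;
}

The exact numbers depend on the machine and the partitioner, but the difference should illustrate why the empty-join parallel_reduce was the cheaper choice for a body with an expensive constructor.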

In the end I even got a slight performance improvement with TBB over OpenMP, although that was not the purpose of this exercise. I hope my experience will help you add TBB support to your applications :-).
