Porting OpenMP SPEC benchmarks to TBB.

Greetings, everyone! I would like to share my experience in porting OpenMP applications to the Threading Building Blocks (TBB) library. Last year I managed to port some of the SPEC OMP 2001 benchmarks to TBB; it is a well-known benchmark suite in the multi-threading world, which is why it was chosen. Three of the benchmarks are written in C: 320.equake, 330.art and 332.ammp. TBB is a C++ library, so the remaining Fortran-based applications in the SPEC OMP suite are not so easily ported.

From OpenMP to TBB - some general tips

Basically, OpenMP applications contain pragma directives that specify parallelism: they parallelize loops, provide locks and protect shared data. To port to TBB, we need to map the OpenMP pragmas onto TBB classes and templates. Let's consider this very simple loop from the 320.equake test:

#pragma omp parallel for private(i)
    for (i = 0; i < nodes; i++) {
      w2[j][i] = 0;
    }

OpenMP executes the loop iterations in parallel; each thread has its own instance of the variable "i". The analogous TBB code is shown below: I encapsulated the loop body in a function object that is passed to the template function tbb::parallel_for. The "i" variable becomes the loop variable of the inner loop in operator().

class SVMPLoop1Class
{
    int m_j;
public:
    SVMPLoop1Class(int j) : m_j(j) {}
    void operator () ( const tbb::blocked_range<int>& range ) const {
        for ( int i = range.begin(); i != range.end(); ++i )
            w2[m_j][i] = 0;
    }
};

SVMPLoop1Class svmp1loop(j);
tbb::parallel_for(tbb::blocked_range<int>(0, nodes), svmp1loop, tbb::auto_partitioner());

TBB adds one more class to the code, so it looks more complicated than the original. However, the tbb::parallel_for call itself looks similar to the original loop. Note that I used tbb::auto_partitioner here to let TBB choose the grain size automatically.
This scheme works for most #pragma omp parallel for directives. If you have several variables in the private clause, just turn them into local variables inside the loop function. Shared variables can be copied into data members, like the j variable in the example above. The SPEC applications make heavy use of global variables; I found that replacing them with local ones works fine in many cases.
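For instance, a loop with several private variables and a couple of shared ones might be ported roughly like this. This is a minimal sketch with made-up names: ScaleLoopBody, a, b, scale, offset and n do not come from the benchmarks, and a and b are assumed to be global arrays in the SPEC style:

//Hypothetical OpenMP loop: "x" and "y" are private, "scale" and "offset" are shared
//
//  #pragma omp parallel for private(i, x, y)
//  for (i = 0; i < n; i++) {
//      x = a[i] * scale;
//      y = x + offset;
//      b[i] = y;
//  }

class ScaleLoopBody
{
    double m_scale;   // shared variables become data members
    double m_offset;
public:
    ScaleLoopBody(double scale, double offset) : m_scale(scale), m_offset(offset) {}
    void operator () ( const tbb::blocked_range<int>& range ) const {
        for ( int i = range.begin(); i != range.end(); ++i ) {
            double x = a[i] * m_scale;   // private variables become locals
            double y = x + m_offset;
            b[i] = y;
        }
    }
};

ScaleLoopBody scaleloop(scale, offset);
tbb::parallel_for(tbb::blocked_range<int>(0, n), scaleloop, tbb::auto_partitioner());
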
Explicit OpenMP locks can be replaced by TBB mutexes, for example tbb::spin_mutex, as in this example from 332.ammp:

OpenMP code:

      omp_set_lock(&(a1->lock));
      ...
      omp_unset_lock(&(a1->lock));

TBB code:

{
      // scoped_lock acquires lock on creation
      // and releases it automatically after leaving the scope
      tbb::spin_mutex::scoped_lock lock (a1->spin_mutex);
      ...
}

This construction is quite simple too.
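The only extra step is declaring the mutex itself. Here is a sketch of what that might look like, assuming a1 points to the benchmark's ATOM structure (the real structure has many more fields, and the exact layout is not shown here):

typedef struct atom {
    // omp_lock_t lock;          // OpenMP version
    tbb::spin_mutex spin_mutex;  // TBB version
    /* ... many other fields ... */
} ATOM;

Unlike omp_lock_t, tbb::spin_mutex needs no omp_init_lock()/omp_destroy_lock() calls; however, its constructor must run, so if ATOM objects are allocated with malloc() rather than new, the mutex has to be constructed explicitly (for example with placement new) before first use.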

Resolving thread ID dependency - tbb::parallel_reduce instead of tbb::parallel_for.

While I was able to replace most omp parallel for directives using the approach above, I found the loop in 320.equake that finds the hypocenter and epicenter to be more complicated. It divides the data among threads, finds the minimum value for each thread, and then finds the global minimum among them. The thread ID obtained from omp_get_thread_num() is used to protect the data from races. Using thread IDs doesn't fit well into the higher-level, task-based programming model used by TBB. I found that tbb::parallel_reduce suits this case: an algorithm that finds a global minimum from per-thread minima is a classic reduction. Here is the example from the 320.equake benchmark (I skipped some code just to simplify things):

OpenMP code:

#pragma omp parallel private(my_cpu_id,d1,d2,c0)
{
   my_cpu_id=omp_get_thread_num();// thread ID
   bigdist1[my_cpu_id]=1000000.0; // minimum per thread
   temp1[my_cpu_id]=-1; // node for which the minimum was found
#pragma omp for
     for (i = 0; i < ARCHnodes; i++) {
        d1 = distance(...);
        if (d1 < bigdist1[my_cpu_id]) {
           bigdist1[my_cpu_id] = d1;
           temp1[my_cpu_id] = i;
        }
    }
}

// finding out the global minimum
numthreads = omp_get_max_threads();
d1=bigdist1[0];
Src.sourcenode=temp1[0];
for (i=0;i<numthreads;i++) // minimum per thread
{
   if (bigdist1[i] < d1) {
      d1=bigdist1[i];
      Src.sourcenode = temp1[i];
   }
}

TBB code:

struct SearchEpicenterAndHypocenter
{
    double m_bigdist1;
    struct source *m_Src;
    int m_temp1;

    SearchEpicenterAndHypocenter(struct source *Src): m_bigdist1(1000000.0), m_Src(Src), m_temp1(-1) {}

    SearchEpicenterAndHypocenter(SearchEpicenterAndHypocenter &search, tbb::split)
    {
        m_Src = search.m_Src;
        m_bigdist1 =1000000.0;
        m_temp1 = -1;
    }

    void operator () ( const tbb::blocked_range<int>& range)
    {
        for (int i = range.begin(); i != range.end(); ++i)
        {
            // The function finds out the local minimum
            SearchCycle (i, m_temp1, m_bigdist1, m_Src);
        }
    }

    void join (SearchEpicenterAndHypocenter &search)
    {
        if (search.m_bigdist1 < m_bigdist1) {
            m_bigdist1 = search.m_bigdist1;
            m_temp1 = search.m_temp1;
        }
    }
};

SearchEpicenterAndHypocenter searchcenters (&Src);
tbb::parallel_reduce(tbb::blocked_range<int>(0, ARCHnodes), searchcenters, tbb::auto_partitioner());
Src.sourcenode = searchcenters.m_temp1;

As usual, the TBB version adds a function object that is passed to the template function tbb::parallel_reduce. Because any user-defined code can be used for the reduction, parallel_reduce does not need the thread-ID-based trick used with OpenMP, so the thread ID dependency is resolved naturally.

Resolving thread ID dependency - placing a thread ID dependent buffer inside the tbb::parallel_for and tbb::parallel_reduce body objects

The approach above won't resolve all thread ID dependencies; I found it doesn't work for the other SPEC benchmarks I ported, 330.art and 332.ammp. Parallel loops in these benchmarks use global buffers that are protected from data races using thread IDs. Here is an example of such a buffer from the 330.art test:

f1_neuron **f1_layer; // The buffer protected by thread ID
int numthreads = omp_get_max_threads();
f1_layer = (f1_neuron **)malloc(numthreads * sizeof (f1_neuron*));

#pragma omp parallel for private (i)
   for (i=0;i<numthreads;i++)
       f1_layer[i] = (f1_neuron*) malloc(numf1s * sizeof(f1_neuron));

// After that each thread works with its own part of f1_layer buffer.
o = omp_get_thread_num();

#pragma omp for private (k,m,n, gPassFlag) schedule(dynamic)
    for (ij = 0; ij < ijmx; ij++)
    { 
       j = ((ij/inum) * gStride) + gStartY;
       i = ((ij%inum) * gStride) +gStartX;
       k=0;
       for (m=j;m<(gLheight+j);m++)
         for (n=i;n<(gLwidth+i);n++)
           f1_layer[o][k++].I[0] = cimage[m][n];
       ...
    }

This approach is simple but tricky, and it doesn't work with TBB because we have no access to thread IDs. So I had to implement another approach: such buffers can be moved inside the parallel_for and parallel_reduce function objects as data members. Each task then operates on the buffer in its own copy of the body object, so thread safety is preserved. Here is an implementation example from the 332.ammp benchmark (I skipped some code to shorten the example):

//this class is for reduce version of the main loop
class MMFVUpdateLoopReduceClass
{
  ATOM **m_atomall; //This buffer was threadID dependent in OpenMP code
  public:
  MMFVUpdateLoopReduceClass(...)
    {
        //Allocating memory for the buffer in constructor
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
        ...
    }

    //This method is empty; it doesn't do any actual reduction
    void join (MMFVUpdateLoopReduceClass &reduceloop) {}

    // Splitting constructor
    MMFVUpdateLoopReduceClass(MMFVUpdateLoopReduceClass &reduceloop, tbb::split)
    {
        //The buffer is allocated in splitting constructor too
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
    }

    void operator () ( const tbb::blocked_range<int>& range) const{
        for ( int ii = range.begin(); ii != range.end(); ++ii )
        {
            // do some work using m_atomall
        }
    }

    ~MMFVUpdateLoopReduceClass()
    {
        free(m_atomall);
    }
};

// The parallel loop
  MMFVUpdateLoopReduceClass loop(...);
  tbb::parallel_reduce(tbb::blocked_range<int>(0, jj), loop, tbb::auto_partitioner());

You probably noticed that tbb::parallel_reduce doesn't do any reduction here. Why not just use parallel_for? The reason is a performance difference. Note that we placed the memory allocation in the constructor. Allocation is a heavy operation, so if the constructor is called frequently, performance degrades. I found that TBB makes fewer body-object constructor calls in tbb::parallel_reduce than in tbb::parallel_for, so parallel_reduce with an empty "join" method is more efficient here.
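For comparison, here is a rough sketch of what a parallel_for body would have to look like (MMFVUpdateLoopForClass and its natoms parameter are invented for illustration). parallel_for copies the body object when it splits the range, so every copy would pay for a malloc, whereas parallel_reduce calls the splitting constructor only when a worker thread actually steals part of the range:

//Hypothetical parallel_for variant, for comparison only
class MMFVUpdateLoopForClass
{
    ATOM **m_atomall;
    int    m_natoms;
  public:
    MMFVUpdateLoopForClass(int natoms): m_natoms(natoms)
    {
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
    }

    //parallel_for uses the copy constructor at every split,
    //so this allocation would run much more often than the
    //splitting constructor in the parallel_reduce version above
    MMFVUpdateLoopForClass(const MMFVUpdateLoopForClass &other): m_natoms(other.m_natoms)
    {
        m_atomall = (ATOM**) malloc( m_natoms * sizeof(ATOM *) );
    }

    void operator () ( const tbb::blocked_range<int>& range) const
    {
        for ( int ii = range.begin(); ii != range.end(); ++ii )
        {
            // do some work using m_atomall
        }
    }

    ~MMFVUpdateLoopForClass()
    {
        free(m_atomall);
    }
};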

I was able to get a slight performance improvement with TBB over OpenMP, even though that was not the purpose of this exercise. Finally, I hope my experience will help you add TBB support to your applications :-).


5 comments


Can you make these TBBified OMP benchmarks available? Perhaps as diffs to the SPEC sources.


Ok. Thank you.

It would be interesting to compare the threading runtimes themselves: the current TBB implementation vs. some veteran OpenMP implementation. For example, on a synthetic task like this (I'm not sure how to port it to OpenMP, if that is possible at all):

class TestTask : public tbb::task
{
    unsigned* const sum_;
    unsigned value_;
    unsigned x_, y_;
    bool is_continuation_;

public:
    TestTask(unsigned value, unsigned* sum)
        : sum_(sum), value_(value), is_continuation_(false)
    {
    }

    tbb::task* execute()
    {
        if( !is_continuation_ )
        {
            if (value_ >= 2)
            {
                tbb::task* a = new( allocate_child() ) TestTask(value_ - 1, &x_);
                tbb::task* b = new( allocate_child() ) TestTask(value_ - 1, &y_);
                recycle_as_continuation();
                is_continuation_ = true;
                set_ref_count(2);
                spawn(*b);
                return a;
            }
            else
            {
                *sum_ = value_;
                return 0;
            }
        }
        else
        {
            *sum_ = x_ + y_;
            return 0;
        }
    }
};

Dmitriy V'jukov


An additional note regarding the performance: the figures Alexey gives in his reply should not be considered representative results for comparing TBB with OpenMP. The original SPEC OMP code had to be refactored in order to port it to TBB; the performance improvement should mostly be attributed to this refactoring rather than to replacing the threading runtime. And of course a different hardware configuration could show completely different results.

Hi,

The hardware was 2x Clovertown 2.6 GHz (8 cores).
Windows Server 2003, Visual Studio 2005 + Intel Compiler 10.0.
So the OpenMP version was from the VS 2005 runtime; AFAIK it is OpenMP 2.0.

The performance difference on the "ref" workload was:

320.equake - flat
330.art - TBB version was 14% faster
332.ammp - TBB version was 6% faster

I tried to make my TBB versions do the same work as the OpenMP versions, although there was some refactoring.

Best Regards,
Alexey

Re: I was able to get a slight performance improvement with TBB over OpenMP...

I am curious what exactly the improvement was, what version of OpenMP, and on what hardware.
Can you please describe it briefly?

