Something weird with enumerable_thread_specific

Something weird with enumerable_thread_specific

Hello,
I've got an issue with enumerable_thread_specific that I do not understand.

I've got a first level of parallelism that does a simple parallel_for, with a limited number of tasks (from 1 to 4), and I force the grain size to be 1 (each task would last about 1 minute).
Each of the tasks work on different memory areas.
For each task, I need some local storage, so I'm using the thread local storage enumerable_thread_specific.
Everything was working fine.
Then I realized that some part of code in the first level of tasks could be parallelized, with a parallel_reduce.
That would create a second level of parallelism.
Problems started to occur from times to times, with the data stored in the local storage.
The documentation says that objects in enumerable_thread_specific are created lazily when local() is called. I only call the local() method in the parallel_for section (1st level), not in the parallel_reduce (2nd level).
I managed to reproduce on a small example.
It simply finds all the max values of 3000 vectors of integers.
The first level of parallelism treats the vectors independantly with a parallel_for. The second level finds the max value in a vector, with a parallel_reduce.
The example is a bit dumb, and the TLS is not really used, but it's just to illustrate the issue.
For about 15 vectors over 3000, I get an incorrect result: the TLS object obtained with local() method, before and after the call to parallel_reduce, has changed.
From what I could see, calls to local() method before and after the parallel_reduce return the same object (same pointer, as expected), but the content has been modified (not expected).
It's like if the same TLS object was shared between several threads, which should not happen.
Following this, the inside of a TLS object seem to be modified by several threads at the same time, generating conflicts.
I'm using TBB 3.0 on linux RedHat Enterprise 4.

Thanks for any help or clues about this!

Here is the code:

#include 
#include 
#include 
#include 
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"

using namespace std;

//===================== struct Cell ===================
struct Cell
{
  Cell( int d) : data(d) {}
  int data;
};

//===================== struct TLS ===================
struct TLS
{
  vector* datas;
  TLS() : datas(0) {}
  TLS( const TLS& tls) : datas(tls.datas) {}
  static void setLocalData( vector* dts);
  static const TLS& getLocal();
  bool operator!=( const TLS& ref) const { return datas!=ref.datas; }
};

//=================== global ==================
const unsigned g_taskNb = 3000;
vector< Cell> g_lists[ g_taskNb];
typedef tbb::enumerable_thread_specific< TLS > _TLS_;
_TLS_ g_tls;

void TLS::setLocalData( vector* dts) { TLS& tls = g_tls.local(); tls.datas = dts; }
const TLS& TLS::getLocal() { return g_tls.local(); }

//===================== struct BestElement ===================
struct BestElement {
  const Cell* best;
  // public method section
  static BestElement find( const vector* datas)
    {
      BestElement found;
      tbb::parallel_reduce( tbb::blocked_range( datas->begin(), datas->end()), found, tbb::auto_partitioner());
      return found;
    }
private:
  BestElement( const BestElement& elt); // no copy constructor
  // internal methods section: for tbb::parallel_reduce usage
public:
  BestElement() : best(0) {}
  BestElement( BestElement& elt, tbb::split) : best(0) {}
  ~BestElement() {}
  typedef vector::const_iterator Iter;
  void operator()( const tbb::blocked_range& r)
    { // simply finds the max value
      Iter i = r.begin(), end = r.end();
      best = &*i;
      while (++i!=end)
        if (i->data > best->data)
          best = &*i;
    }
  void join( BestElement& rhs)
    { if (rhs.best->data > best->data) // finds the max value
        best = rhs.best;
    }
};


//===================== struct ParallelTasks ===================
struct ParallelTasks
{
  ParallelTasks( unsigned taskNb) : m_taskNb( taskNb) {}
  void operator()( const tbb::blocked_range& range) const
    {
      unsigned threadId = range.begin();
      assert( range.size()==1);
      TLS::setLocalData( &g_lists[ threadId]);
      TLS tls1 = TLS::getLocal(); // copy
      assert( &g_lists[ threadId] == tls1.datas);

      BestElement found;
      found = BestElement::find( tls1.datas);
      char buf[512];
      sprintf( buf, "Performing task #%d on list of size %d : found cell %d", threadId, tls1.datas->size(), found.best->data);
      TLS tls2 = TLS::getLocal(); // copy
      if (tls1!=tls2) sprintf( buf, "%s Error Datas(%p vs %p)", buf, tls1.datas, tls2.datas);
      else sprintf( buf, "%s OK", buf);
      cout<( 0, m_taskNb, grainSize), *this, tbb::simple_partitioner());
    }
private:
  unsigned m_taskNb;
};


int main()
{
  tbb::task_scheduler_init tbbInit( tbb::task_scheduler_init::deferred);
  tbbInit.initialize( tbb::task_scheduler_init::automatic);

  // build arrays
  for (unsigned i=0; i& list = g_lists[ i];
    for (unsigned j=0; j<(i+1)*100; ++j)
      list.push_back( j);
  }

  ParallelTasks tasks( g_taskNb);
  tasks.parallel_for();

  tbbInit.terminate();

  return 0;
}

7 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

"It's like if the same TLS object was shared between several threads, which should not happen."
I have not looked at the code yet, but TBB threads may steal work originated by other threads, so you should not assume that the outer-loop chunks remain isolated from each other. The LTS objects may be shared among outer-loop chunks because their threads are.

Do you mean that a task can be started by one thread, interrupted by another one which will finish the execution of the task?
What I understood about task stealing was that each thread has a pool of tasks that it will execute sequentially. If a thread is underloaded, it would steal a task from another thread (given that the task has not started its execution yet). If a task has been started by a thread, it will be finished by the same thread.
I'll try too stay concise but here is what I see:
1/ 2 tasks from the outer-loop have been executed at about the same time (their printf occur just the one after the other on the screen)
2/ when I do a gettid() in the 2 tasks' body, they are the same, which mean that the 2 tasks are executed by the same thread, which means that the 2 tasks are executed sequentially, not in parallel.
3/ for the 2 tasks, a call to local() returns the same reference to a TLS object, which is expected since it's the same thread
4/ but the pointer "datas" of my TLS structure changes within the task' body (after the initial set done by TLS::setLocalData()): when I compare the pointer "datas" before and after the call to the inner-loop (BestElement::find()), it sometimes changes. It's like if my 2 tasks are executed in parallel (but they belong to the same thread), and the pointer "datas" is modified (written) by the 2 tasks at the same time, leading to undeterministic results
5/ if I don't call the inner-loop (BestElement::find()), everything is fine and the error doesn't occur.

Is there something I don't understand about the thread scheduling or the TLS? I'm fairly a newbee in this area.
thanks for your answer, Raf.

Best Reply

"Do you mean that a task can be started by one thread, interrupted by another one which will finish the execution of the task?"
A task executes from start to finish within one thread but with other tasks possibly intervening.

"What
I understood about task stealing was that each thread has a pool of
tasks that it will execute sequentially."
It has a "ready pool" of tasks, currently organised as a deque. Even if you spawn a list of tasks onto a single thread's deque, their execution may still be nested in another thread that steals a few. But parallel_for will cause tasks to be spawned on different deques at different times through the scalability strategy of recursive parallelism, making non-intervening sequential execution even less guaranteed.

"If a thread is underloaded, it
would steal a task from another thread (given that the task has not
started its execution yet). If a task has been started by a thread, it
will be finished by the same thread."
Yes (except for the word "underloaded").

"[...] It's like if my 2 tasks are
executed in parallel (but they belong to the same thread), and the
pointer "datas" is modified (written) by the 2 tasks at the same time,
leading to undeterministic results
5/ if I don't call the inner-loop (BestElement::find()), everything is fine and the error doesn't occur."
I suspect that the thread started executing a task for one chunk (executed part of a range), started on BestElement::find(), stole a task with another chunk, completed that, completed BestElement::find(), and continued on the first chunk, which had its TLS overwritten by the intervening chunk task. Some more tactically placed trace statements may provide the evidence you might need to convince yourself. Maybe what you want is a stack of contexts in a linked list headed by the TLS? I have not studied the code, though, so I cannot offer any higher-level advice.

Hi,

I'm sorry I couldn't get the test code to work (on VS2008 the privatized copy-constructor is a problem; on Linux with gcc I gota segfault at line 88.)

If a thread has no work to do, it will attempt to steal from another thread, but it will take the oldest of the of tasks on that thread's deque. So, for instance, suppose the length of theparallel_for is 2, and there are two threads involved. The first thread will start its iteration of the parallel_for, which starts itsparallel_reduce, and the second thread may do so also. If the first thread completes its work, it looks to steal a task from the other thread. The only tasks on the second thread's task deque are parallel_reduce tasks, so the first thread will steal the oldest such task.

So you will have two threads executing parts ofthe same parallel reduce, and each thread has its own copy in the ETS variable.

In general it is difficult to predict the execution order of tasks, so the results of a "thread-aware" program are going to vary.

I will have to reread your original description of the problem to see if I can figure what you were originally trying to do.

Regards,
Chris

"So, for instance, suppose the length of theparallel_for is 2, and there are two threads involved."
In this case execution of the outermost-level tasks wouldn't overlap, and the LTS wouldn't be overwritten. It's only with more tasks at the outer level that recursive parallelism would conspire with stealing to create overlap situations (one's execution nested within another's).

My impression is that the description and code don't go deeper than the overlap issue just to have a self-contained reproduction of the observed problem (which is great), but this also injects observer bias (which warrants caution). If it is all right to speculate that the ultimate intention is to link parallel_reduce code with parallel_for chunk context, then the answer would be that this is destined to fail, and the suggested solution would be to pass a pointer/reference as a BestElement member variable instead (taking into consideration that its use may have to be synchronised).

Thanks a lot Raf and Christopher for your answers.
I understand a lot better now. I had no idea that "A task executes from start to finish within one thread but with other tasks possibly intervening".
Raf, you pointed exactly at what was the issue. A thread would start a task from the outer-loop, arrive at BestElement::find() (the inner-loop), and start a new task from the outer-loop, before finishing the one it had already started. This creates the variable "datas" to be overwritten and causing the issue when coming back in the original context.
What I need here is a stack of context, as you described, to solve the issue. I tried it and it works fine.
So thanks a lot for your contribution.
As well, the code in the inner-loop does not need (fortunately) access to the TLS.
So no need to pass it as member of BestElement (maybe in the future if needs be).

Thanks again to you both. It solved all my issues.
Matthieu

Leave a Comment

Please sign in to add a comment. Not a member? Join today