Best way to deal with thread specific workspace

Best way to deal with thread specific workspace

Dear TBB experts !

 

I would like to know what is the best solution to the following simple problem. I want to perfom a parallel_for with a work functor that requires some workspace (e.g. an array that should not be accessed concurently.

 

1) The simplest solution is to use an automatic vector inside the () operator of my functor

struct WorkFunctor{
  void operator()(block_range..){
       std::vector<double> workspace(2048);
       ... do some work
   }
};
 
Unfortunately, for small granularity, this solution is slow because of the time consumed for allocating/deallocating the workspace array.
 

2) I may improve the situation be using tbb scalable allocator

struct WorkFunctor{
  void operator()(block_range..){
       std::vector<double,tbb::cache_aligned_allocator<double> > workspace(2048);   

    ... do some work
   }
};

3) I improve a bit the perf by using static size array, be I have to be  very carefull with the

stack size per thread (I have encountered erratic bugs due to this issue).

struct WorkFunctor{
  void operator()(block_range..){
      double workspace[2048];

       ... do some work
   }
};

4) I wonder if the use of thread specific storage is the appropriate solution (tbb::enumerable_thread_specific). I found very few examples for this.
 

Thank you in advance for your help.

27 帖子 / 0 全新
最新文章
如需更全面地了解编译器优化,请参阅优化注意事项

I would go with option 2 for peace of mind and postpone any further attempts at optimisation until I had identified the true needs of the program. Well, I'd like to think that I would. :-)

Maybe the question you should ask is why you have "small granularity"? You should have at least a few good chunks when using the default auto_partitioner. If you're not doing something to sabotage that, maybe the total number of chunks is not that high after all?

(Added 2013-12-02) To be explicit: option 3 will always be fastest, so if it really improves the situation only "a bit", then that's your clue to look elsewhere right there. And even if you improve the rest of the program sufficiently to open up this gap, I doubt that hanging on to the memory using TLS could improve things at all, let alone make a splash, but I could always be mistaken.

I think option 4 is also fine, especially given the way threads are used in TBB. Just make sure you clean up any data that may be present it the thread local storage from the last go.

tbb::enumerable_thread_specific<std::vector<double> > workspaces;

struct WorkFunctor{
  void operator()(block_range..){
      bool exists;
      std::vector<double> &ws=workspaces.local(exists);
      if (exists)
      {
        clear_data_in_workspace(ws);
      }
      else
      {
        build_new_workspace(ws);
      }
     ... do some work
   }
};

//after parallel_for
workspaces.clear();

Thank you very much Raf and Jiri !

I will explore the different tracks to evaluate their relative efficiencies. I will then post the results.

Actually, I consider (maybe erroneously) that this need for a thread specific workspace must be frequently encountered. Hence, I found a bit strange not being able to find the canonical tbb pattern corresponding to this issue. Because of its name, the  "thread_specific" stuff looks promising  but I have found very few examples with this construct. In addition, the enumerable or combinable properties are not required for my problem.

Thank you again for your help.

Laurent

hello,

you can try to use tbb:concurrent_vector instead of std::vector

--Vladimir

About #3: We have no indication that the data left over from a previous use can be reused (I'll assume it's overwritten each time), which means that option 4 won't really help: at best the performance will be closer to 3 (best time), but I understand there's only "a bit" of difference between 2 and 3, so Amdahl's Law has some sobering things to say about that...

About #5: concurrent_vector is good for other things, but never for fixed-size collections (even if they were used concurrently).

 

Ouch, there is more options than I thought...

For Vladimir: I do not understand how a concurrent_vector can help in this context.

For Jiri : As Raf guessed, I do not need to clear the workspace (I use it to perform some 3D permutations like X[i][j][k]=Y[j][k][i])  hence I think I do not need the "exists" parameter.

For Raf : I did not finish my  measurements but I wonder if #5 should not used because I make several calls to the parallel_for running the WorkFunctor Functor. I believe that the local() method of workspaces will build only one std::vector for each thread (during the first parallel_for call). Hence the allocation overhead may become neglictible:

typedef std::vector<double> WorkSpace;
typedef tbb::enumerable_thread_specific<WorkSpace> WorkSpacePool;
struct WorkFunctor{
   WorkSpacePool * wsPoolPtr_;
   WorkFunctor(WorkSpacePool * wsPoolPtr,...):wsPoolPtr_(wsPoolPtr){}
  void operator()(block_range..){

	      WorkSpace & ws=wsPoolPtr_->local();

	      .. do some work with ws

	   }

	};
void aFunction(int ncall=100){
    WorkSpacePool wsPool;
    for (int call=0 ; call<ncall ; call++){
      WorkFunctor wFunctor(&wsPool,...);
      tbb::parallel_for(range,wFunctor);
   }
}

Quote:

laurent.plagne wrote:

I believe that the local() method of workspaces will build only one std::vector for each thread (during the first parallel_for call). Hence the allocation overhead may become neglictible:

Yes, the local() works this way. And given the way threads are used in TBB, this means that consecutive executions of parallel_for will reuse the existing workspaces. So, if your aim is to limit the number of workspace allocations, this would IMHO be a good solution. Without experiments, its not possible to say if it is really worth it. But I think at the very least it should not make things worse compared to other solutions that allocate memory dynamically. You could also consider replacing the standard allocator with TBB's scalable allocator.

I still don't see the point of trying to improve on the scalable allocator. What percentage are we talking about?

Raf, the WorkSpacePool eliminates calls to any allocator (on subsequent calls). No matter how fast you make a scalable allocator, not calling it is always faster.

I was going to suggest using a pointer (NULLed) in TLS, then one-time allocated.

The tbb::enumerable_thread_specific... accomplishes a similar goal.

While this does mean the buffer hangs around...
Well that is the point.

Jim Dempsey

www.quickthreadprogramming.com

I admit that it could be faster, strictly speaking, but we still don't know by how much (necessarily only part of "a bit"), and what's the point of being afraid of the scalable allocator, invisibly used anyway by all the tasks spawned to support that parallel_for() and now also used here with deallocation (optimally) always in the same thread as allocation, in code that's not heavily reused, i.e., where complexity is not redeemed in number of applications? Should there really be a "pattern": in any non-recursive code, which includes knowing that the thread will never steal other similar work (!), whenever you have a use for more memory than will comfortably fit on the stack, be as phobic toward the scalable allocator as you ever were for other allocators and use TLS for every last bit of performance? I hear already in the original posting that in this case the optimum (stack-based allocation) is only "a bit" faster, and later that the memory is used intensively (for permutations). That suggests that allocation cost may well be a red herring and that maybe extra performance should be sought elsewhere, instead, e.g., by simply specifying a grain size anyway (it is also obeyed by the auto_partitioner), or somewhere in the code that is not visible to us.

(2013-12-08 Shortened) And to repeat the most important objection: the code must know that the memory won't be reused by stolen work, which limits composability (no nested TBB algorithms) and requires maintained attention.

Quote:

Raf Schietekat wrote:

And to repeat the most important objection: the code must know that the memory won't be reused by stolen work, which limits composability (no nested TBB algorithms) and requires maintained attention.

You mean when using the TLS containing a vector (with normal or scalable allocator)? Could you elaborate on that?

Hi, thank you again for your help !

I made some time measurements on the following code:

#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"

	#include "tbb/task_scheduler_init.h"

	#include "tbb/cache_aligned_allocator.h"

	#include "tbb/enumerable_thread_specific.h"

	#include <vector>

	#include <iostream>
//#define STACK_BASED

	#define STD_VECTOR

	//#define SCAL_STD_VECTOR
#ifdef STACK_BASED

	template <int BLOCK_SIZE>

	struct WorkSpace{

	  float data_[BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE];

	};

	#endif
#ifdef STD_VECTOR

	template <int BLOCK_SIZE>

	struct WorkSpace{

	  std::vector<float> data_;

	  WorkSpace(void):data_(BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE){}

	};

	#endif
#ifdef SCAL_STD_VECTOR

	template <int BLOCK_SIZE>

	struct WorkSpace{

	  std::vector<float,tbb::cache_aligned_allocator<float> > data_;

	  WorkSpace(void):data_(BLOCK_SIZE*BLOCK_SIZE*BLOCK_SIZE){}

	};

	#endif

	template <int BLOCK_SIZE>

	class MyFunctor{

	private:

	  std::vector<float> & X_;

	  tbb::enumerable_thread_specific<WorkSpace<BLOCK_SIZE> > * wPoolPtr_;

	public:

	  MyFunctor(std::vector<float> & X,

	        tbb::enumerable_thread_specific<WorkSpace<BLOCK_SIZE> > * wPoolPtr):X_(X),

	                                        wPoolPtr_(wPoolPtr)

	  {}
  template <class BR>

	  void operator()(const BR & br) const {
    const int bs=BLOCK_SIZE;

	    const int bs2=bs*BLOCK_SIZE;

	    const int bs3=bs2*BLOCK_SIZE;
#define LOCAL_ALLOC

	#ifdef LOCAL_ALLOC    

	    WorkSpace<bs> wsp;

	#else

	    WorkSpace<bs> & wsp=wPoolPtr_->local();

	#endif
    for (int b=br.begin() ; b!=br.end() ; b++){

	      

	      const int istart=b*bs3;
      float * XShift=&X_[istart];

	      

	      //swap into the block w_[j][k][i]=Xb[i][j][k]

	      for (int k=0 ; k<bs ; k++){

	        for (int j=0 ; j<bs ; j++){

	          for (int i=0 ; i<bs ; i++){

	            wsp.data_[j+bs*k+bs2*i]=XShift[i+bs*j+bs2*k];

	          }

	        }

	      }
      //dummy work

	      for (int I=0 ; I<bs3 ; I++){

	         XShift[I]+=wsp.data_[I];

	      }

	    }

	  }

	};
int main( int argc,  char *argv[] ){

	 

	  const int bs=128;

	  const int nblocks=4;
  const int tn=-1;

	  tbb::task_scheduler_init init(tn);

	 

	  std::vector<float> X(nblocks*bs*bs*bs,1.f);
  tbb::enumerable_thread_specific<WorkSpace<bs> > workspacePool;
  {

	    MyFunctor<bs> mf(X,&workspacePool);

	    

	    tbb::tick_count start=tbb::tick_count::now();

	    

	    for (int i=0 ; i<100 ; i++){

	            tbb::parallel_for(tbb::blocked_range<int>(0,nblocks),mf,tbb::auto_partitioner());

	      //tbb::parallel_for(tbb::blocked_range<int>(0,nblocks),mf,tbb::simple_partitioner());

	    }

	    tbb::tick_count stop=tbb::tick_count::now();

	    double d1=(stop-start).seconds();
    std::cout << "d1="<< d1 <<" s" << std::endl;

	  }

	}

*******************

For bs=64

With automatic allocation

STACK_BASED 0.42 s

STD_VECTOR 0.63 s

SCAL_STD_VECTOR 0.49 s

With TLS

STACK_BASED 0.43 s

STD_VECTOR 0.43 s

SCAL_STD_VECTOR 0.43 s

For bs =128

With automatic allocation

STACK_BASED seg fault if the stack size per thread is not modified

STD_VECTOR 6.65 s

SCAL_STD_VECTOR 5.28

With TLS

STD_VECTOR 4.2 s

SCAL_STD_VECTOR 4.2 s

So Raf is right to say that the scalable allocator give performance that are close to what we obtain with TLS. When the data gets bigger, the stack based workspace becomes problematic and the TLS perfomance gets better than the scalable allocator. In addition, in my code I have several workspace arrays  that I can group in a single Workspace class used in the enumerable_thread_specific pool. Actually I would feel comfortable with the TLS solution but I am a bit afraid of Raf warnings...

I may explain that I have two different goals for my design:

* I must deliver reliable kernels for my physics group and I guess that they could afford a 10% loss in perf.

* I wrote research articles where the CPU peak performance ratio obtained by a kernel has some importance (40% is not as good looking as 50 % ;) )

More essentially I want to learn good patterns from the experienced user !

P.S. Sorry for the ugly indent style of the code (I copy past it from emacs...)

 

 

 

 

 

About #12: If the task, or in this case the parallel_for Body, gets and uses the TLS memory and exits without having recursed to another use of the TLS, all is well. But even though the TLS is invisible from another thread (normal concurrency), the code is still non-reentrant for same-thread recursion, either the form you would expect (but also still have to look out for), or the one you don't expect: if the Body, or code it calls, either written specifically for the Body or reused from a library maintained by somebody else (or perhaps yourself with a different hat and on a different day), internally uses a TBB algorithm, this algorithm may at some point have to wait for some stolen work to finish, and may then steal other work, which may have been spawned by other work stolen from the same parallel_for, which may use the same Body, which uses the same TLS… So, in some convoluted and unexpected way (it wasn't my first thought, either), the Body recurses to itself, and enumerable_thread_specific by itself does not have a stack to save the previous content and restore it later. You could of course write around that, to provide the stack semantics yourself, or encapsulate that in a class with a different interface that automatically provides stack semantics, but you have to do the work and think about it. And the code has to be maintained to keep it crypto-recursion-free or crypto-recursion-safe.

Is that really worth it to save part of "a bit" of difference? Maybe at this point Laurent could quantify "a bit" (absolute and relative difference between option 2 and absolute best option 3): I think it will be far less than the point at which this might be considered a useful pattern, making its use here more like an anti-pattern instead.

(Added) I see he has done so in the meantime... Apparently the code is for reuse, so it does deserve some extra TLC for maximum performance. But how do the numbers change with a larger grain size, starting with 2 perhaps? And with values for work that is representative of real use?

(Added) I also see that it's a rather larger workspace than I had thought, at 2^21 elements, or 8 MB. I don't know much about the scalable allocator's performance here (it used to be that this was just shipped off to plain old malloc()). Another issue is initialisation: isn't the vector then filled with all-zero elements, and how much time does that take? Perhaps you could also try a plain array (still dynamically allocated) to get around that and see whether it makes a difference?

Thanks for the explanation. Clearly, if you let go of the thread, you can get into serious trouble with the TLS. It would be nice to have a Task Local Storage :-).

(Edit) Just to make it clear, I am aware of the fact that "Task Local Storage" does not make sense, because it just means data of the task object... ;-)

Regarding the vector initialization - the values in the vector (using the vector(size_t) constructor) will be default-initialized, which should really set them to zero and (in this case) waste time. As far as I know, there is no way to avoid some initialization with the standard vector. Changing vector for something else (e.g., a dynamically allocated array) could improve the performance, especially for the non-TLS version.

I've been thinking about possible context-local storage, associated with a task group context subtree. CLS<TLS<vector<float>>> could be the right mix here (note that the other way around wouldn't work), if the performance gap cannot be closed otherwise.

Maybe the initialisation can be avoided by overriding allocator::construct()?

Would the context local storage also solve the problem created when the task hands over the thread to the scheduler which then runs a different task from the same parallel_for? It seems to me it would only deal with the composability, assuming that the context is not reused to run the nested algorithm.

So far, the only general solution I can think of is something like TLS<list<pair<bool,vector>>>, which would maintain a list of workspaces per each thread and a bool flag that specifies whether that particular vector is being used. The functor would get the list from TLS and find the first unused vector or add a new one at the end. To make it exception-safe, the flag shoud probably not be set directly by assignment in the code, but rather by something similar to the scoped lock used by TBB mutexes. In the ideal case, the list will only contain one workspace for each thread. It would grow if there is a "collision". So far I couldn't find a "hole" in this approach, but it sure is ugly :-).

Overriding allocator::construct is an interesting idea, but it seems that for example the latest MSVC compiler does not call it for the vector elements, but only invokes in-place new on them. Unfortunately, I currently don't have the latest C++ standard to check what it says about the issue.

Paragraph by paragraph:

You're right, I got a bit carried away there. (I think there's a use for something like CLS, but it needs some more thought.)

A thread-local pool of workspaces might be as you propose, or something that uses the notion that the workspaces are activated as a stack even if they might be used out of order by stolen work. But it's still complicated, so let's hope a slightly coarser granularity and/or avoiding initialisation will make all this unnecessary?

I was looking at a recent draft, but have no idea about the practicality of such an override.

(2013-12-12 Corrected) "was getting"->"got a bit" (right?)

I think it is important to insert an FYI here.

Not all threading paradigms are equivalent with respect to Thread Local Storage (this is not inherently clear).

I know this is a TBB forum, but we programmers tend to port code. Should you port to Intel Cilk+ be aware that the software thread that continues after a parallel region (call/construct) is not necessarily the same thread that initiated the parallel region (a stack swap may be involved). The impact of this is the TLS memory prior to the entry to the parallel region (call/construct) may reside in a different thread context following the parallel region (call/construct). In TBB, OpenMP, pthreads the resume is run on the initiating thread. This is one of those things that might bite you some day. You might want to place an assert in there as well as for __DEBUG__ build insert a serial number (e.g. __rdtsc) into the object (e.g. vector) and a copy into a scope visible variable, do this prior to creating a parallel region, then verify local copy == TLS copy upon resumption (insert a comment as to why you do this, as the next misguided programmer might yank this out thinking it unnecessary).

Jim Dempsey

www.quickthreadprogramming.com

Then you'll probably have to comment other constructs as well. A recursive mutex should be fun to watch (once): lock, do something parallel, lock again… and you're deadlocked.

Moving stacks (e.g., with fibers) opens up a nice selection of minefields to explore, so when porting to such environment you certainly have to be really careful and TLS is an obvious thing to watch. The recursive mutexes are a good example of one issue that is better concealed.

Would rdtsc be really a good serial number here? It seems to me this could fail to detect stack movement in some (extremely rare) cases...

Thank you again for your support !

I confess that I was not able to follow the last posts about TLC (?) flying a little far above my head ;)

As a non TBB expert my conclusions would be :

* The normal-safe pattern to deal with temporary workspace is to use automatic vector variables defined in the scope of the functor operator() together with the scalable tbb allocator. In most cases, the allocation cost should be negligible.

* In order to remove completely the allocation cost, one could consider to use the TLS tbb mechanism but this implies to completely control the context of the parallel loop.

At the present time I am not 100% satisfied by these conclusions for two reasons:

* I still don't understand why the TLS construct is unsafe: maybe it is possible to build a simple example where this construct fails. I recall that the workspace obtained by the .local() method does not need to be specifically bounded to a given thread. The only requirement is that the thread calling the local() function gets a private workspace instance (not accessible by other threads during the functor execution).

* I still wonder why my request seems to be so singular: I find rather natural the need for a structured workspace private to each concurrent thread performing a piece of parallel loop. Since I did not found simple and safe tbb construct to fulfill this need neither than a clear explanation why this need is unjustified in tbb documentation, tutorial or forum, I am afraid that my request is based on a personal fundamental misconception of the tbb programming paradigm, or more generally shared memory programming principles...

 

Have you timed how long it takes to allocate just an array (to avoid initialisation by a vector), and what happens if granularity is increased somewhat?

Maybe with direct knowledge of the application it would be possible to trigger the problem, but it might not be easy, because a thread has to be robbed and then it has to steal from the robber, or a more complicated scenario (no clear opinion at this time about which is the less unlikely). The innermost task assumes it can just overwrite the workspace, but the original task would be sabotaged if that happens. It's not like a cache, or a reduction of a commutative operation, or something else where things may safely happen out of order. Do you want to live in fear that the user of your code uses a nested TBB algorithm and causes this to happen? Or is there a reason why this would be impossible?

This particular problem arises from the desire to second-guess the scalable allocator by declaring a TLS instance in a wider scope (outside the parallel_for instead of inside a Body), and I can't immediately think of another situation where you would need a construct with exactly the properties needed here (apparently shared but requiring isolation), out of the box. But you can use TLS with a pool of workspaces as described, where each Body takes out a new or free workspace, and where all code operating on a "chunk" (subrange executed by a Body) uses the same reference.

Of course, humare erranum est...

Trying to make a simple "conclusion" of the TLS discussion: There is a dangerous block in your code - the block that uses the local workspace obtained from TLS. If you make a dangerous call to TBB (I'll explain later) in the dangerous block, it could damage the local workspace and thus make your results incorrect. To make it worse, the error will most likely be undetected (no exception or assertion triggered) and occur only randomly from time to time. The question is "what is a dangerous call?". It is any call to TBB that would allow the TBB scheduler to spawn a different task on the calling thread. Mainly running a different parallel algorithm or invoking any of the functions that wait for a task(s) to complete (wait_for_all, spawn_and_wait_for_all, spawn_root_and_wait). Naturally, running a function (e.g., from some third party library) that uses TBB internally is also dangerous.

If no dangerous call is made in the dangerous block, you should be OK. As long as nobody changes your code to include a dangerous call...

Using TLS (Thread Local Storage) is sound practice provided:

a) The Task (not thread) that produced the data within the TLS does not change thread context between/during/after the run of the task. On TBB this is the case. Cilk++ this is not necessarily the case. I suggest you insert a #define ... filter that errors the compilation should you compile for threading paradigm that does not meet this requirement. BTW I use TLS a lot, but I am careful.

b) The routine using the TLS is not recursive with the expectation (requirement) that it is not.

Jim Dempsey

www.quickthreadprogramming.com

Thank you,

a) and b) requirements can be meet in my case (only TBB) and the routine is not recursive.

I did not try (yet) to use unitialized vector. In this case I guess that I should use the scalable_malloc from the C tbb interface.

Quote:

laurent.plagne wrote:

the routine is not recursive

If you insist on simple STL (instead of thread-specific pools), make sure that you don't call TBB algorithms, or unknown code, including caller-provided code, that might call TBB algorithms, and put warnings all over the code that you do use to make clear to those who will maintain the code what to avoid.

Quote:

laurent.plagne wrote:

In this case I guess that I should use the scalable_malloc from the C tbb interface.

How about using the STACK_BASED Workspace from #7 as the element type in a vector of 1 element with the scalable allocator, which also takes care of cleaning up? You may also have to provide a default constructor for Workspace that doesn't initialise the array.

BTW, I also noticed that you use (void) in the default constructors, but () is preferred.

Don't forget experimenting with grainsize!

(2013-12-20 Added) I've observed on OS X that even an automatic array of floating-point values seems to get initialised, at first sight anyway. Very inconsistent results with integral types, depending on debug/release and/or size, and with mysteriously recurring non-zero values. Your mileage may vary, and I didn't try the exact suggestion, just various nonsystematic combinations of automatic/new and single/array, but it may be necessary to use scalable_malloc directly after all.

发表评论

登录添加评论。还不是成员?立即加入