Adapting worker threads count during execution

Hi, 

I have an application with multiple parallel regions (with serial portions in between), each with different scalability characteristics (i.e. some scale strongly, others weakly, etc.), and thus a different optimal number of cores. Is it possible in TBBs to dynamically change the number of worker threads for a specific parallel region? Would the creation of multiple task_scheduler_init objects in different scopes (i.e. "{ }" blocks) work? Ideally, I would like to avoid the continuous creation/destruction of worker threads (due to the relatively large overhead), and employ a less disruptive scheme that suspends/resumes workers on demand.

Thanks in advance, 

Nick

>>...Would the creation of multiple task_scheduler_init objects in different scopes (i.e. "{ }" blocks) work?..

I found a statement in the TBB documentation that says: '...A thread may construct multiple task_scheduler_inits...'. However, I didn't try to test this.

Best regards,
Sergey

It is possible to let an application thread use different degrees of parallelism by way of scoped task_scheduler_init instances. Note that only the outermost instance on the stack counts (unless I missed something), and that this may be an implicit instance arising from any use of certain TBB scheduler-related features (things like atomics or passive container operations don't count, but try to avoid relying on implicit initialization anyway). That means you should have one application thread start the TBB work from blocks with their own specific task_scheduler_init instances, and another thread keep the scheduler alive for reuse, including the TBB thread pool; it's probably easiest to do the latter in the main thread and the former in an explicitly created thread. Please let us know whether this works out well for you.
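In outline, the suggestion looks like the skeleton below. It compiles and runs as plain C++ (the TBB-specific calls are shown as comments so it doesn't require linking TBB); names like nthreads1 and the "phase" labels are placeholders for the application's real regions:

    #include <thread>
    #include <cstdio>

    // The explicitly created thread runs each phase in its own scope;
    // in real code each scope would open with its own
    // tbb::task_scheduler_init (e.g. init1(nthreads1)) followed by
    // that phase's tbb::parallel_for.
    void run_phases() {
        {
            // tbb::task_scheduler_init init1(nthreads1);
            std::puts("phase 1");   // tbb::parallel_for(...);
        }
        {
            // tbb::task_scheduler_init init2(nthreads2);
            std::puts("phase 2");   // tbb::parallel_for(...);
        }
    }

    int main() {
        // The main thread would hold a long-lived instance so the TBB
        // thread pool survives between phases:
        // tbb::task_scheduler_init keep_alive;

        std::thread worker(run_phases);  // phases run in an explicit thread
        worker.join();
        return 0;
    }

The point of the split is that the long-lived instance in the main thread keeps the worker pool alive, while each scoped instance in the explicit thread sets the concurrency for its own phase.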

Thank you both for your answers!

It seems that Raf's solution (if I have understood it correctly) works.

If I try to create multiple task_scheduler_init instances within the same (main) thread, the total number of worker threads created is determined by the argument passed to the first instance. E.g., in the following scenario:


    tbb::task_scheduler_init init(nthreads);
    {
        tbb::task_scheduler_init init1(nthreads1);
        //tbb::parallel_for
    }
    {
        tbb::task_scheduler_init init2(nthreads2);
        //tbb::parallel_for
    }
    {
        tbb::task_scheduler_init init3(nthreads3);
        //tbb::parallel_for
    }

each parallel region always executes as if "nthreads" workers were in use, regardless of the arguments passed to the inner instances.

However, by employing some kind of nested parallelism like this:


#pragma omp parallel
{
    #pragma omp sections
    {
        //dummy section -- hopefully sleeps politely until the other section finishes
        #pragma omp section
        {
            tbb::task_scheduler_init init(1);
        }
        //the "useful" section, corresponding to application code
        #pragma omp section
        {
            {
                tbb::task_scheduler_init init1(nthreads1);
                //tbb::parallel_for 1
            }
            {
                tbb::task_scheduler_init init2(nthreads2);
                //tbb::parallel_for 2
            }
            {
                tbb::task_scheduler_init init3(nthreads3);
                //tbb::parallel_for 3
            }
        }
    }
}

each one of the 3 parallel_for's executes with the worker threads it requests within its block.

Anyway, it would be a good feature if a future version of TBBs could support a more elegant way to accomplish this kind of malleability, as OpenMP does with the omp_set_num_threads function.

The "however" code does not meaningfully differ from the code above it: it's just 4 separate blocks, without any task_scheduler_init scope nesting. I'll leave the interpretation of the OpenMP pragma's and their effect on TBB to others, though.

(Added 2012-11-19) Sorry, it seems I missed the point, so please ignore that... I'll use as an excuse that the provided code doesn't do what its author intended: when the first block exits, the task_scheduler_init instance immediately disappears, and after that it doesn't matter whether this section's thread lasts as long as that of the other section or not. I don't know enough about OpenMP to confidently suggest another way to emulate what I've suggested above (using explicit threads).

Why not manage the number of threads per region of your code by managing the number of tasks generated to execute those regions. Example: create a partitioner that places an upper limit on the number of partitions.
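To illustrate the arithmetic only (a standalone sketch, not TBB code; n and k are made-up values): with tbb::simple_partitioner, a blocked_range of size n and grainsize g splits into roughly ceil(n / g) leaf chunks, so choosing g = ceil(n / k) caps the chunk count near k:

    #include <cstdio>

    // Number of leaf chunks a range of size n splits into at grainsize g.
    static long chunks(long n, long g) { return (n + g - 1) / g; }

    int main() {
        const long n = 1000000;            // iteration-space size (illustrative)
        const long ks[] = {2, 4, 8};       // desired caps on concurrent chunks
        for (long k : ks) {
            long g = (n + k - 1) / k;      // grainsize that yields <= k chunks
            std::printf("k=%ld -> grainsize=%ld, chunks=%ld\n", k, g, chunks(n, g));
        }
        return 0;
    }

Capping the chunk count this way caps how many workers can usefully participate in that region, without touching the scheduler itself.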

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

Why not manage the number of threads per region of your code by managing the number of tasks generated to execute those regions. Example: create a partitioner that places an upper limit on the number of partitions.

Jim Dempsey

I would like to maintain the so-called "parallel slackness" property (tasks >> workers) that libraries such as Cilk or TBBs implement, and which subsequently guarantees proper load balancing. I am not sure that the solution you propose would maintain that. But anyway, my applications do not only include parallel skeletons with partitionable iteration spaces (e.g. parallel_for), but also other constructs such as raw tasks.

A different option (hack) is to temporarily remove some threads from the idle thread pool by having those threads wait for an event or condition variable. Once the region you want to run with a diminished thread count completes, you set the event/condition variable. This does result in those threads not being available for other tasks. There are ways to work around that too.

parallel_invoke(
    [](){ WaitForSingleObject(...); },  // remove 1st thread from the pool
    [](){ WaitForSingleObject(...); },  // remove 2nd thread from the pool
    [](){
        parallel_for(...);              // do work with the diminished pool
        SetEvent(...);                  // release the waiting threads
    }
); // parallel_invoke
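TBB and the Windows event API aside, the parking trick itself can be sketched portably with a condition variable (a standalone example; the serial sum merely stands in for the real parallel_for):

    #include <condition_variable>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <numeric>
    #include <cstdio>

    int main() {
        std::mutex m;
        std::condition_variable cv;
        bool released = false;

        // Two "parked" threads stand in for the workers removed from
        // the pool by the WaitForSingleObject() calls above.
        auto park = [&] {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return released; });
        };
        std::thread t1(park), t2(park);

        // The remaining thread does the diminished-parallelism work.
        std::vector<int> v(100);
        std::iota(v.begin(), v.end(), 1);
        long sum = std::accumulate(v.begin(), v.end(), 0L);

        {   // SetEvent() equivalent: release the parked threads.
            std::lock_guard<std::mutex> lk(m);
            released = true;
        }
        cv.notify_all();
        t1.join();
        t2.join();

        std::printf("sum = %ld\n", sum); // 1+2+...+100 = 5050
        return 0;
    }

The same caveat applies as with the event version: the parked threads are simply unavailable until notified.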

Jim Dempsey

www.quickthreadprogramming.com

I'm afraid that blocking a thread may trap some work on the stack, with possibly "unwelcome" results.

Quote:

Raf Schietekat wrote:

I'm afraid that blocking a thread may trap some work on the stack, with possibly "unwelcome" results.

OK. Here is another question: I was browsing the TBB source code to find functionality related to adding/removing workers. I came across the following function in market.h:

//! Request that arena's need in workers should be adjusted.
/** Concurrent invocations are possible only on behalf of different arenas. **/
void adjust_demand ( arena&, int delta );

which seems to end up waking (or launching) "delta" extra workers (unfortunately, as far as I understand, "delta" cannot be negative, which would have made it possible to shrink parallelism as well).
For the case where more parallelism is needed, would it be a "proper" solution to call this library routine from user code, or would it lead to unexpected behaviour?

Unless it is documented, it is liable to change in a new release, especially in areas that have been discussed recently. See my first reaction in this thread for my still-current best suggestion on how to handle this situation.
