I need low-level control over task->thread->core mapping. Is TBB right for me?

I am impressed with many of TBB's features, particularly flow graphs, but am unsure if it is suitable for use in a HPC environment.

My program runs a simulation on a very large dataset, which is first split into N partitions, where N is the number of NUMA nodes on the system. Each partition is allocated on memory associated with a unique NUMA node. Over a series of timesteps, calculations are performed on each partition, which synchronize upon completion of a single step. These calculations are done in a parallel-for fashion, using K threads per partition, where K is the number of cores on the NUMA node (so N*K = number of cores on the system). Each of these K threads has their affinity set to run on the NUMA node closest to the partition they work on.

Is there any way to implement this in TBB? You can use tbb::task_scheduler_observer to set the affinity of threads to a certain NUMA node, and tbb::affinity_partitioner to ensure tasks run on the same thread (and therefore the same NUMA node) across timesteps. The remaining problem is the initial mapping of tasks to threads, so that each task executes on a thread associated with the correct NUMA node. Is there a way to do this? Quoting the reference on task::affinity_id: "Do not assume anything about non-zero values. The mapping of non-zero values to threads is internal to the Intel TBB implementation." The provided way of mapping tasks to threads seems to be more of a hint than a command anyway.

If there is no way to do this in TBB, my question changes: if I were to implement this manually using some other threading library, say boost::thread or OpenMP, would it play nicely with a TBB flow graph? That is, the code to execute a boost::thread/OpenMP update of each partition would live in a node of the graph, with other graph nodes able to execute in parallel with it. Would the TBB tasks and the boost::thread/OpenMP threads stomp on each other with thread context switches?

I am aware this could be accomplished with TBB if a separate process were used for each NUMA node (see: http://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity/), but I would like all of this to be in the same process. Also, NUMA partitioning of the space has enormous performance benefits - I see a 50-70% speedup across all test cases.

Thanks,
Andrew


You could discover affinity_id values by executing task::affinity() from various threads and recording the results to then use with task::set_affinity(), but then you would probably have to take direct control of the top-level loop instead of relying on parallel_for and affinity_partitioner.

A somewhat unrelated question: currently, TBB's scheduler employs random stealing, presumably because this kind of victim selection is fast. This seems a natural fit for "flat" multicore architectures, where all cores are treated as peers. But things change on more complex, hierarchical platforms (e.g. NUMA), with multiple core groupings and different access and communication times between cores. In that case, would it be meaningful to follow an alternative approach to task stealing that is aware of the underlying architecture? For example, a possible scheme would be for a worker thread to first try to steal work from its neighborhood before trying to get work from more distant processors. The motivation behind this question is that I have seen notable performance gains when enabling affinity-aware execution in various TBB codes (in the context of affinity_partitioner), so I was wondering whether this could be generalized to task stealing as well.

Hi n.anastop,
Did you try the affinity partitioner?
--Vladimir

Hi Vladimir,
Yes, I have tried the affinity partitioner, as I mentioned above. But it is suitable only for certain types of parallelism (e.g. parallel_for, parallel_reduce) and offers gains only for certain memory access patterns (i.e. reuse over the same data). So I was thinking of other ways to augment the runtime system with locality awareness that could be applied to all (higher- or lower-level) parallel constructs provided by the library. As far as I understand, the task-stealing mechanism lies at the heart of the runtime system, and I thought it could be a good candidate for adding some architectural awareness. Random stealing is fast, but also "blind" and architecture-agnostic. Would a more judicious selection of victims (i.e., with locality concerns) make a difference in performance?

Andrew,

Affinity and NUMA awareness is built into QuickThread: www.quickthreadprogramming.com

// partition yourArray by rows into NUMA nodes
parallel_for(OneEach_M0$, doRows, 0, nRows, yourArray);
...
void doRows(int rBegin, int rEnd, Array_t& Array)
{
    // for this set of rows (running in the current NUMA node),
    // construct a thread team of cores only within this NUMA node
    parallel_for(OneEach_Within_M0$+L1$, doTile, 0, nCols, rBegin, rEnd, Array);
    // use OneEach_Within_M0$+L0$ if you want a team of (HT) threads within the node
}

You can also use the lambda function format if you wish.

You can contact me at jim at quickthreadprogramming (com domain).

Jim Dempsey

www.quickthreadprogramming.com
