I am impressed with many of TBB's features, particularly flow graphs, but am unsure if it is suitable for use in a HPC environment.
My program runs a simulation on a very large dataset, which is first split into N partitions, where N is the number of NUMA nodes on the system. Each partition is allocated on memory associated with a unique NUMA node. Over a series of timesteps, calculations are performed on each partition, which synchronize upon completion of a single step. These calculations are done in a parallel-for fashion, using K threads per partition, where K is the number of cores on the NUMA node (so N*K = number of cores on the system). Each of these K threads has their affinity set to run on the NUMA node closest to the partition they work on.
Is there any way to implement this in TBB? You can use tbb::task_scheduler_observer to set the affinity of threads to a certain NUMA node, and tbb::affinity_partitioner to ensure tasks are run on the same thread (and therefore NUMA node) across timesteps. The only problem left to be solved is how to map tasks to threads initially such that the task executes on a thread associated with the correct NUMA node. Is there a way to do this? I quote from the reference on task::affinity_id: "Do not assume anything about non-zero values. The mapping of non-zero values tothreads is internal to the Intel TBB implementation." The provided way of mapping tasks to threads seems to be more of a hint than a command anyway.
If there is no way to do this in TBB, my question changes - if I were to implement this manually using some other thread library, say boost::threads or OpenMP, would this play nicely with a TBB flow graph? So the code to execute a boost::thread/OpenMP update of each partition is in a node in the graph, with other graph nodes that could execute parallel to it. Would the TBB tasks and boost::threads/OpenMP stomp on each other with thread context switches?
I am aware this could be accomplished with TBB if a separate process were used for each NUMA node (see:http://software.intel.com/en-us/blogs/2010/12/28/tbb-30-and-processor-affinity/) but would like all of this to be in the same process. Also, NUMA partitioning of the space has enormous performance benefits - I see a 50-70% speedup across all test cases.