TBB - Linux Kernel Support Enhancements?

AJ wrote:

Hey all,

I've been reading a lot about the cache, and multi-core processors. Still quite a bit of reading to go.

I have read that moving a thread from one processor to another will result in loss of performance improvements from the cache. I understand that the new affinity_partitioner is designed to attempt to run tasks on the same core when possible.

How much control does TBB really have over which core the tasks get mapped to during execution?

As I have been reading, it seems to me that TBB could benefit from some kernel-level support. Indeed TBB is still implemented in standard C++, however what's the harm in adding something to the kernel (i.e. Linux) to help performance along?

Are there areas where TBB could be improved with kernel-level support, for instance memory allocation, context-switching, and execution?

These aren't so much questions as just thinking out loud.

Thanks,

AJ

robert-reed (Intel) wrote:

Threading Building Blocks is a thread-level parallelization library. Its primary benefits lie in maximizing concurrency and cache residence in processor-intensive computational loads, ones which run largely without interruption. Within a process time slice, TBB's greedy, non-preemptive scheduling, which steals work only when a thread runs out of it, all serves to maximize cache residence. But a general-purpose process scheduler has many more concerns than that: it must balance user processing with OS kernel services. Moreover, resorting to kernel-level calls can itself limit performance due to the delays of crossing the process boundary (e.g., context switches to and from the kernel).


Besides, TBB is a portable library, running on Windows and Mac OS kernels as well. Targeting a feature in Linux for the specific benefit of TBB could arguably be bad for the maintenance of TBB (extra work to sustain it and to try to propagate it to other OSes), and conversely, anything worth adding to the kernel would most likely be of value to more than just TBB programs, so it should probably already be there.

Alexey Kukanov (Intel) wrote:
> I have read that moving a thread from one processor to another will result in loss of performance improvements from the cache. I understand that the new affinity_partitioner is designed to attempt to run tasks on the same core when possible.
>
> How much control does TBB really have over which core the tasks get mapped to during execution?


The affinity_partitioner gives the TBB scheduler a hint about the preferable worker thread to execute a task. Still, it is just a hint, and another thread could take the task, e.g. by stealing.


While it's true that task-to-core affinity is what matters for cache efficiency, we believe that mapping threads to cores efficiently is the business of the operating system. Modern OS kernels, to the best of our knowledge, are reasonably good at keeping a thread on the same core most of the time. Also, the experience of the OpenMP team at Intel tells us that getting thread-to-core affinity right is hard and very hardware-dependent, and the improvement in general is not as big as one might think.


At the same time we admit that there can be algorithms that benefit from thread-to-core affinity, and users have asked us to provide some means for achieving it. In the recent developer updates of TBB, we introduced a new feature called task_scheduler_observer, a class that receives notifications about threads entering and exiting the TBB scheduler. Those who want to ensure a certain thread-to-core affinity for TBB worker threads can inherit from this class and override its virtual methods to make the affinity settings. But you had better know what you are doing 1), and test for performance on the target platforms.


Alexey


1) I have heard a rumor about the author of a plug-in for Internet Explorer who ran into trouble with multithreading; (s)he decided to avoid dealing with multiple threads and affinitized execution to a single core. As a result, the execution of every IE process in the system got affinitized, including those implicitly used by system services, and the system froze. It's just a rumor so you should not trust it :), but it's something to remember when you start playing with affinity.

AJ wrote:

I'm not at all suggesting that TBB be altered to make Linux kernel-side calls. Instead, I'm curious whether Linux can be altered in some way to maximize the throughput of a TBB-enabled app, based on the properties of TBB's execution behaviour. This is a question of curiosity, not a suggestion :-)

I'm not suggesting breaking TBB's cross-platform support, or even requiring some extra module for Linux support. I'm thinking of what optimizations, or even kernel configurations, could best support TBB applications. In particular, if I used TBB for HPC applications, a TBB program could run for months... any time I could save would have a significant effect on the overall running time.

Again, this is me being curious, not suggesting new features :-)

AJ

robert-reed (Intel) wrote:

Sorry, Adrien. Nothing comes to mind. TBB shines when it can keep an arbitrary number of processing elements busy, without contention, on a really huge, processor-bound problem. The conditions that are best for TBB have more to do with the underlying algorithm than with the overseeing operating system. Increase the process time slice? It might let some TBB problems run longer without an interruption, but it might also worsen load balancing. Avoid process eviction? That's more an oversubscription problem than a kernel-scheduling problem. Affinity? I think Alexey pretty much shot that down in his response. I don't think there's much up this tree to bark at.
