I am testing on a Core i7 2600K here (4 physical cores x 2 logical cores each).
My code (video processing plugin for VirtualDub) is threaded using OpenMP.
- With 8 threads I get lower than single-threaded performance.
- With 4 threads I have 3.98x single-threaded performance.
- With 4 threads I also get some periodic slowdowns (when a thread does not run on the same logical core as before)
It is obvious that HyperThreading is the problem for this particular algorithm.
What is not obvious is how to control execution such that:
- Only 4 threads are used -- I can use omp_set_num_threads(4), but I still need to find out how many cores I have (both physical and logical)
- Threads always execute on the same logical core within the same die -- I can use KMP_AFFINITY, but that is a totally lame way to control it. I want it done from within the application, and I want to avoid having to scan the whole topology in every program I write just to be able to avoid logical cores.
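
To make the two requirements above concrete, here is a rough sketch of what I mean by doing it from within the application. Everything in it is an assumption for illustration: the helper names (physical_core_count, first_sibling_mask, run_on_physical_cores) are made up, the number of logical CPUs per core is passed in by hand (2 on the 2600K) instead of being detected (on Windows that would mean a topology query such as GetLogicalProcessorInformation, not shown), and it assumes Hyper-Threading siblings are numbered adjacently (0,1 = core 0; 2,3 = core 1; ...), which is not guaranteed. Pinning is only done on Windows via SetThreadAffinityMask; elsewhere the sketch just limits the thread count.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: physical cores = logical CPUs / logical CPUs per core.
// logical_per_core is supplied by the caller (assumption: uniform across cores).
inline unsigned physical_core_count(unsigned logical_cpus, unsigned logical_per_core) {
    assert(logical_per_core > 0);
    const unsigned n = logical_cpus / logical_per_core;
    return n > 0 ? n : 1;
}

// Hypothetical helper: affinity mask with only the first logical CPU of the
// physical core assigned to thread_index.
// ASSUMPTION: siblings are numbered adjacently (0,1 = core 0; 2,3 = core 1).
inline std::uint64_t first_sibling_mask(unsigned thread_index, unsigned logical_per_core) {
    const unsigned bit = thread_index * logical_per_core;
    assert(bit < 64 && "sketch handles at most 64 logical CPUs");
    return std::uint64_t{1} << bit;
}

// Runs body(tid) once per physical core. With OpenMP enabled, one thread per
// physical core is requested; without it, body runs once with tid 0.
template <class Body>
void run_on_physical_cores(unsigned logical_cpus, unsigned logical_per_core, Body body) {
#ifdef _OPENMP
    omp_set_num_threads(static_cast<int>(physical_core_count(logical_cpus, logical_per_core)));
#endif
    #pragma omp parallel
    {
#ifdef _OPENMP
        const unsigned tid = static_cast<unsigned>(omp_get_thread_num());
#else
        const unsigned tid = 0;
#endif
#ifdef _WIN32
        // Pin the calling thread to one logical CPU so the OS cannot move it
        // to the sibling. The mask is truncated on 32-bit builds.
        SetThreadAffinityMask(GetCurrentThread(),
                              static_cast<DWORD_PTR>(first_sibling_mask(tid, logical_per_core)));
#endif
        body(tid);
    }
}
```

With 8 logical CPUs and 2 per core, this asks for 4 threads and, under the numbering assumption, the masks for threads 0..3 come out as 0x01, 0x04, 0x10 and 0x40. A call would look like run_on_physical_cores(omp_get_num_procs(), 2, [&](unsigned tid) { /* process a slice of the frame */ }); where the hard-coded 2 is exactly the part I would like the runtime to figure out for me.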
Why doesn't OpenMP provide an API to specify that you want only physical cores, and that you don't want the OS to juggle threads between logical cores on the same die, thus trashing the caches and decreasing power efficiency?
How do the other threading methods (TBB, Cilk) compare to OpenMP in this regard? Do they offer more control or not?