HyperThreading getting in the way of performance

Igor Levicki's picture

I am testing on Core i7 2600K here (4 physical cores x 2 logical cores).

My code (video processing plugin for VirtualDub) is threaded using OpenMP.

- With 8 threads I have lower than single-threaded performance.
- With 4 threads I have 3.98x single-threaded performance.
- With 4 threads I also have some periodic slowdowns (when a thread is not run on the same logical core as before)

It is obvious that HyperThreading is the problem for this particular algorithm.

What is not obvious is how to control execution such that:

- Only 4 threads are used -- I can use omp_set_num_threads(4), but I still need to find out how many cores I have (both physical and logical)

- Threads always execute on the same logical core within the same die -- I can use KMP_AFFINITY, but that is a totally lame way to control it. I want it done from within the application, and I want to avoid having to scan the whole topology in every program I write just to be able to avoid logical cores.
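The two requirements above both reduce to knowing which logical CPUs share a physical core. A minimal sketch of that step, assuming the (package, core) pair for each logical CPU has already been read from the OS (for example from the Linux sysfs files `physical_package_id` and `core_id` under `/sys/devices/system/cpu/cpuN/topology/`, or from `GetLogicalProcessorInformation` on Windows). The `LogicalCpu` struct and the function name are hypothetical, not part of OpenMP or any library:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// One entry per logical CPU: which package (socket) and which core it belongs to.
// Hypothetical helper type; the values are assumed to come from the OS.
struct LogicalCpu {
    int package_id;
    int core_id;
};

// Returns the index of the first logical CPU found on each physical core.
// The size of the result is the number of physical cores.
std::vector<int> one_logical_cpu_per_core(const std::vector<LogicalCpu>& cpus) {
    std::set<std::pair<int, int>> seen;  // (package, core) pairs already covered
    std::vector<int> chosen;
    for (int i = 0; i < static_cast<int>(cpus.size()); ++i) {
        assert(cpus[i].package_id >= 0 && cpus[i].core_id >= 0);
        // insert() reports whether this physical core is new
        if (seen.insert({cpus[i].package_id, cpus[i].core_id}).second) {
            chosen.push_back(i);
        }
    }
    return chosen;
}
```

The size of the returned vector could be passed to omp_set_num_threads(), and the indices could be used to build an affinity mask. How logical CPUs are numbered differs between operating systems, so the input order should not be assumed.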

Why doesn't OpenMP provide an API to specify that you want only physical cores, and that you don't want the OS to juggle threads between logical cores on the same die, thus thrashing the caches and decreasing power efficiency?

What are the other threading methods (TBB, Cilk) like compared to OpenMP in this regard? Are they offering more control or not?

-- Regards, Igor Levicki If you find my post helpful, please rate it and/or select it as a best answer where it applies. Thank you.
jimdempseyatthecove's picture

Igor,

You could experiment with QuickThread (www.quickthreadprogramming.com). I have a newer release that I can email to you (get my email address off the web site). QuickThread includes APIs for thread-to-core pinning and selection. It is relatively easy to use:

parallel_for(OneEach_L1$, Functor, low, high, arg1, arg2[, ...]);

On your Core i7 2600K that would start a thread team of 4 threads, each thread bound to a core (but not necessarily bound to the same HT within that core). If you want to exclude HT migration, that can be done too (I can show you how). Sketch:

start of app
start QuickThread thread pool (qtInit function)
issue parallel_distribute(OneEach_L1$, aDummyFunction)
use API to get bitmap from parallel distribute
use other API to state upon next qtInit use only the above bit map (or .NOT. that bit map)
exit qtInit scope
start new qtInit using thread restriction
(IOW thread pool is subset of all threads - one thread per core)

Once you do that hoop jump, you can put the code in a library you build for your multi-threaded apps.

Jim Dempsey

www.quickthreadprogramming.com
Vladimir Polin (Intel)'s picture

Hi Igor, are you sure that the issue is in logical cores vs threads and you have enough workload to compute? --Vladimir

Igor Levicki's picture

@Jim,

Thanks for the suggestion, but I was hoping that some of the multi-threading packages could already handle that for me in the background. It seems that they all suffer from the same lack of control over core selection and thread migration -- none of them is focused on extracting the best performance from the detected CPU on behalf of the developer. When are we going to have that? Why do we still have to tune workloads manually for different CPUs?

@Vladimir,

I did not profile it with VTune, but the code uses double-precision FP and has a lot of byte-sized memory accesses with a large radius (for example, a 16x16px block iterating through the whole video frame). Since logical cores share the L1 cache and do not have separate FP units, my first guess would be that competing for resources is the reason why it is working slower. I can easily check if the workload is too small by providing either a larger radius or a larger video frame size.

Edit:
I checked, it seems that the workload was not large enough to cover for threading overhead. Regardless, 4% speedup with 8 threads compared to 4 threads is not worth the threading overhead penalty with lower video resolutions and lower radius.

Now I have to figure out the optimal number of threads depending on the workload... damn... any ideas?

-- Regards, Igor Levicki
Vladimir Polin (Intel)'s picture

Well, it is easy to check without VTune by setting a couple of environment variables :)

set KMP_AFFINITY=granularity=fine,scatter
set OMP_NUM_THREADS=4

You'll get 4 OpenMP threads pinned to physical cores and can check whether your workload scales well. But using VTune you'll get information on whether there are bus or synchronization stalls on adjacent blocks. I assume you parallelize at the block level, not the frame level, so I suggest playing with the chunk size to make sure that the L2 cache is used well. I can refer you to the article "Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures", which shows some advantage in using all logical threads. BTW, do you have Turbo Boost enabled? --Vladimir
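One way to make the chunk size easy to experiment with is to pass it in as a parameter rather than hard-coding it. The sketch below assumes the blocks are laid out in a flat array and uses a plain sum as a stand-in for the real per-block work; the function name is hypothetical. If the code is built without OpenMP support, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Processes all blocks with a fixed thread count and an explicit static chunk size.
// The summation is only a placeholder for the real per-block computation.
double sum_with_chunks(const std::vector<double>& blocks, int threads, int chunk) {
    assert(threads > 0 && chunk > 0);
    double total = 0.0;
    const long n = static_cast<long>(blocks.size());
    // schedule(static, chunk): each thread gets runs of `chunk` consecutive blocks
    #pragma omp parallel for num_threads(threads) schedule(static, chunk) reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += blocks[i];
    }
    return total;
}
```

Timing the same frame with several values of `chunk` (for example one block row versus several rows per chunk) would show whether neighbouring blocks benefit from staying on the same core.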

Vladimir Polin (Intel)'s picture
Quoting Igor Levicki: Now I have to figure out the optimal number of threads depending on the workload... damn... any ideas?

For example, you can do a dry run of 1-2 frames for each thread count from 1 to omp_get_max_threads() and select the best time :)

An overhead of 8 frames is no big deal for a 135000-frame movie :)
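The dry-run idea can be sketched as a small selection loop. This is a minimal sketch, not a tuned implementation: the function name and the timing callback are hypothetical, and the callback is assumed to process the trial frames with the given thread count and return the elapsed seconds.

```cpp
#include <cassert>
#include <functional>

// Tries every thread count from 1 to max_threads and returns the fastest one.
// time_with_threads(n) is assumed to run the trial frames with n threads
// and return the elapsed time in seconds.
int best_thread_count(int max_threads,
                      const std::function<double(int)>& time_with_threads) {
    assert(max_threads >= 1);
    int best = 1;
    double best_time = time_with_threads(1);
    for (int n = 2; n <= max_threads; ++n) {
        const double t = time_with_threads(n);
        if (t < best_time) {  // strict '<' keeps the smaller thread count on ties
            best_time = t;
            best = n;
        }
    }
    return best;
}
```

With OpenMP, the callback could call omp_set_num_threads(n), process one or two frames, and measure the time with omp_get_wtime(); max_threads would be omp_get_max_threads(). Timings of only one or two frames are noisy, so repeating each trial may be worthwhile.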

--Vladimir

Igor Levicki's picture

Actually, overhead depends on workload size. If resolution is small and radius is small 8 threads are slower than 1 thread.

I will have to run a loop with variable frame x and y sizes, variable radius and variable thread count to figure out the threshold for disabling/enabling more threads.
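Once such a sweep has produced a threshold, applying it could look like the sketch below. Two things here are assumptions rather than measurements: the work estimate (pixels times the square of the window width 2*radius + 1) and the `min_work_per_thread` value, which would have to come from the sweep itself. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Picks a thread count so that each thread gets at least min_work_per_thread
// units of estimated work. The estimate is an assumption:
// work = width * height * (2 * radius + 1)^2.
int threads_for_workload(int width, int height, int radius,
                         int max_threads, std::int64_t min_work_per_thread) {
    assert(width > 0 && height > 0 && radius >= 0);
    assert(max_threads >= 1 && min_work_per_thread > 0);
    const std::int64_t window = 2LL * radius + 1;
    const std::int64_t work =
        static_cast<std::int64_t>(width) * height * window * window;
    const std::int64_t affordable = work / min_work_per_thread;
    // Clamp the result to the range [1, max_threads]
    return static_cast<int>(std::max<std::int64_t>(
        1, std::min<std::int64_t>(affordable, max_threads)));
}
```

Small frames with a small radius fall back to a single thread, while large workloads are capped at `max_threads`.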

Regarding Turbo Boost, I put the multi for all cores to 45x in BIOS so I am running the CPU at 4.5GHz :)

-- Regards, Igor Levicki
Vladimir Polin (Intel)'s picture

Well, you don't need to evaluate all options; you need to select a reasonable limit :) And regarding 45x: it is impressive, but take into account that cache misses should be relatively more expensive since the memory clock is the same. In other words, you could get more than a 4% gain for 4->8 threads at the default frequency. Or was the memory clock adjusted along with the CPU clock? --Vladimir
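The point about cache misses becoming relatively more expensive can be made concrete with a little arithmetic. If memory latency stays fixed in nanoseconds, the same miss costs more core cycles at a higher core clock. The 60 ns figure used below is purely an illustrative assumption, not a measurement of this system; 3.4 GHz is the stock base clock of the Core i7 2600K.

```cpp
#include <cassert>

// Converts a fixed memory latency in nanoseconds into core clock cycles.
// cycles = nanoseconds * (cycles per nanosecond) = latency_ns * core_ghz
double miss_cost_in_cycles(double latency_ns, double core_ghz) {
    assert(latency_ns >= 0.0 && core_ghz > 0.0);
    return latency_ns * core_ghz;
}
```

Under that assumption a miss costs about 204 cycles at 3.4 GHz and about 270 cycles at 4.5 GHz, roughly a third more cycles stalled per miss. A second hardware thread has more stall time to fill in the overclocked case only if memory itself is not the bottleneck.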

Igor Levicki's picture

Will have to try that :)

Memory is 1600MHz DDR3. What would be an adequate speed for a CPU @ 4.5GHz?

-- Regards, Igor Levicki
Sergey Kostrov's picture
Quoting Igor Levicki:

I am testing on Core i7 2600K here (4 physical cores x 2 logical cores).

My code (video processing plugin for VirtualDub) is threaded using OpenMP.

- With 8 threads I have lower than single-threaded performance.
- With 4 threads I have 3.98x single-threaded performance.
- With 4 threads I also have some periodic slowdowns (when a thread is not run on the same logical core as before)

It is obvious that HyperThreading is the problem for this particular algorithm...

Please take a look at a thread ( Post #6 by Patrick Fay (Intel) ):

http://software.intel.com/en-us/forums/showthread.php?t=103919&o=a&s=lr

It looks like your problem is similar and could be related to the sharing of FPU between different cores.

Best regards,
Sergey

jimdempseyatthecove's picture

I do not believe the problem (.gt. 4x difference) relates to one FPU/SSE/AVX per core as that would result in:

4 threads, 1/core ~= 8 threads, 2/core

The problem is likely due to L1/L2 cache evictions between HT siblings.

Some algorithms can be reworked such that HT siblings can operate with little or no L1/L2 cache evictions.

The MKL matrix multiplication is one example where this appears to have been accomplished.

In QuickThread (www.quickthreadprogramming.com) one could

parallel_distribute(                    // n-way fork
    L1$,                                // to all threads in current core
    [](int iThread, int nThreads) {     // functor run by all threads in team
        switch (iThread)
        {
        case 0: // 1st thread of HT siblings
            {
                parallel_for(
                    OneEach_L1$,        // One thread per core
                    YourFunctionHere,
                    arg1, arg2[,...]);
            }
            break;
        case 1: // second thread of HT siblings
            {
                // you can place non-cache interfering task here
            }
            break;
        case 2: // third thread of HT siblings (MIC)
            {
                // you can place non-cache interfering task here
            }
            break;
        case 3: // fourth thread of HT siblings (MIC)
            {
                // you can place non-cache interfering task here
            }
            break;
        } // switch
    } // end functor
); // end parallel_distribute

If you are not interested in running background tasks, then the above can be simplified to just use the parallel_for(OneEach_L1$,...

Jim Dempsey

www.quickthreadprogramming.com
Igor Levicki's picture

I will do some analysis with VTune to see if I can figure out what the problem is.

-- Regards, Igor Levicki