What will affect the multi-threading performance in 64bit system?

Dear all~

I'm developing a multi-threaded version of our product using pthreads as the threading mechanism. I tested my 32-bit binary on a 32-bit machine with 8 cores, and the performance improvement was almost as expected. However, when I tested the same 32-bit binary on a 64-bit machine with 16 cores, the performance was even worse with 2 threads (the same holds for a 64-bit binary on the 64-bit machine). I then tried OpenMP for that case (32-bit binary on the 64-bit machine), and OpenMP achieved results as good as my 32-bit-binary-on-32-bit-machine benchmark. Now I'm confused: the benchmark suggests that pthreads have much larger overhead on the 64-bit machine!? How does that happen? Is there anything I should keep in mind when developing for a 64-bit machine? Is there any way I can further improve performance on the 64-bit machine? Any suggestion will be appreciated. Thanks a lot...

Best Regards

Yi-Ju


What exactly are your machines? Does the 64-bit machine contain 2 processor packages?
It may be a data-sharing problem that does not show up on the smaller machine. On a single-package machine with a shared LLC, data-sharing costs are significantly lower than on a 2-package machine.

Hi ~

For the 32-bit machine: 2 packages, each with 4 cores (Intel Xeon CPU E5440 @ 2.83GHz).
For the 64-bit machine: 4 packages, each with 4 cores (Quad-Core AMD Opteron Processor 8380).

The compiler used is icc

Is it a data-sharing problem? Will 2 threads use a single package? And if data sharing is the problem, why doesn't OpenMP have this issue? Thanks

Best Regards

There are a lot of differences between the machines: sharing costs, memory bandwidth, cache sizes, etc. Any of these can cause the difference.
As for OpenMP, it's simply a different program with different requirements for memory bandwidth, sharing, caches, and so on. It would be stranger if the two behaved exactly the same.

OpenMP uses a thread-pooling concept whereby the threads in an OpenMP pool stay alive for a short period after a parallel region ends. This period is called the blocking time. The advantage is that, after exiting a parallel region, if your program runs serially for less than this blocking time and then starts a new parallel region, these threads simply resume. The ancillary threads' wait loop is somewhat like a SpinWait, with relatively low overhead. Should the time between parallel regions exceed the blocking time, the ancillary threads suspend themselves on an event/condition variable, but do not exit. This thread pooling may be the reason for the performance difference.

The pthread model is often coded as: serial section, run to a parallel region, start additional threads, wait for the additional threads to complete (often with the serial thread not participating in the parallel region), additional threads terminate, serial thread resumes.

This technique has two problems: the overhead of starting/stopping the thread team, and the cache loaded by the serial thread not necessarily being available to one or more threads of the team.

You can code your pthread model similarly to OpenMP (keep the pool spinning for a blocking time, and fall back to waiting on a condition variable when the blocking time expires). Or simply use OpenMP, which does this for you. (Or consider using TBB, Cilk+, QuickThread, or other tasking-based threading models.)

Jim Dempsey

Currently, I have implemented a pthread pool in which each thread waits (falls asleep) when there is no job to do. So I can imagine the overhead of waking up the threads. But I've tested this sleep/wake-up mechanism against using pthread_create/pthread_join, and the overhead is not heavy in my benchmark (32-bit binary on the 32-bit machine; I haven't benchmarked this on the 64-bit machine yet). So my question remains. Part of the pseudo code is simple, just:

int idx;
for (idx = 0 ; idx < count ; ++idx) {
    doSomething(obj[idx]);
}

I use OpenMP and my pthread pool to parallelize the for loop; the results of both implementations are the same, but the performance improvement is different.

For OpenMP, the 32-bit binary speeds up on both the 32-bit machine and the 64-bit machine, close to what I expected.

For the pthread pool, the 32-bit binary speeds up on the 32-bit machine, just like OpenMP. But it slowed down to 0.6x with 2 threads when executed on the 64-bit machine. (When 2 threads are mentioned, there are actually 3 threads: one master in the serial section, and 2 running the parallel region.)

I think the machines' differences (bandwidth, cache, etc.) are not what makes the difference, since OpenMP obtains good results on the 64-bit machine. That's why I question whether there are any issues pthreads must take care of when executing on a 64-bit machine. Or does OpenMP have some special mechanism different from the thread-pool mechanism we see in most papers?

I have to use pthreads and a thread manager because the application has very different levels of parallelization, and pthreads offer more flexibility for that concern. Thanks

Yi-Ju Ho

> That's why I question whether there are any issues pthreads must take care of when executing on a 64-bit machine.

Nope, there are none. The problems are in applications. Pthreads by themselves are definitely capable of achieving perfect linear speedup. Moreover, pthreads are in themselves more efficient than OpenMP, because OpenMP is pthreads plus something on top.

It's difficult to say anything without knowing what you are doing and how, exactly. The devil is in the details.
To get a definitive answer, you can collect profiles of your 32-bit and 64-bit versions (or of the 64-bit pthread and 64-bit OpenMP versions) and compare them. You will find the places that degrade in the 64-bit pthread version.

I used the Intel thread amplifier (2011 version) to get the profile. Strangely, it showed that the performance degradation is in the function doSomething(). But nothing is different there, and the performance of doSomething() (total elapsed time) is similar in both versions. That made me struggle to figure out what's going wrong with doSomething().

Anyway, I finally found the source of the performance degradation. There is nothing wrong with doSomething(). I noticed that OpenMP has very high CPU utilization (close to 200% while 2 cores are used, according to top). That made me rethink my implementation of the thread pool: I made each thread call sem_wait() directly whenever there were no jobs to do, and each time a job was enqueued into the job pool, I tried to wake up a thread. This sleep/wake-up overhead is not heavy on the 32-bit machine.

After implementing the thread pool as an infinite loop without sem_wait(), the performance improvement on the 64-bit machine is close to expected (by Amdahl's law). I just don't understand, even now, why sem_wait() and sem_post() don't have such heavy overhead on the 32-bit machine, but do on the 64-bit machine.

Yi-Ju

Quoting rudaho

After implementing the thread pool as an infinite loop without sem_wait(), the performance improvement on the 64-bit machine is close to expected (by Amdahl's law). I just don't understand, even now, why sem_wait() and sem_post() don't have such heavy overhead on the 32-bit machine, but do on the 64-bit machine.

Yi-Ju

I believe it has nothing to do with machine bitness. Perhaps you are missing other differences between the setups as well.

Yi-Ju,

Two issues:

a) In OpenMP, the blocking time is spent in a spin-wait using _mm_pause() (the x86 PAUSE instruction). This is similar to your infinite loop (for the duration of the spin-wait), except that _mm_pause() presents less of a load on the memory/cache system. Should the iteration count of the pause/check-for-task loop expire before a new task becomes available, the thread performs the sem_wait() (or condition-variable wait) and sleeps until a task is available. While _mm_pause() may introduce slightly longer latency for the waiting thread, it improves the performance of the other thread(s) (one of which may be generating the very task the waiting thread is waiting for).

b) You have a multi-socket system. Depending on how the scheduler is set up, the 2 (3) threads may be running on the same socket for one method (I suspect OpenMP) and on different sockets for the other method (I suspect your pthread version). When the threads span multiple sockets, sem_wait()/sem_post() may incur additional overhead (e.g. an inter-processor interrupt). The _mm_pause() spin-wait loop, up until the point where it falls back to sleeping, relies only on the cache-coherency system, which has less overhead.

Jim Dempsey

Hi Dempsey~

Thanks a lot for the detailed information about the implementation of OpenMP and how the OS runs, etc. It's hard for me to get this kind of information quickly. I'll try to fine-tune my thread manager later, with the memory/cache behavior and reduced signaling in mind. Thanks a lot...

Best Regards

Yi-Ju

Yi-Ju,

>>Thanks a lot for the detailed information about the implementation of OpenMP and how the OS runs, etc

This was more of an overview than detail.

A third item to add to the list: I have seen differences in performance for some threading APIs between Linux distros/versions, bitness, and platform (Intel vs. AMD), combined with socket distribution. Also with respect to sockets: prior to Intel Nehalem, your Intel dual-socket platform likely had one memory node, whereas the multi-socket AMD system likely has one memory node per socket. With multiple memory nodes, you can configure the memory as interleaved or non-interleaved (NUMA). When configured as NUMA, this can help or hinder performance (and it can be influenced by programming considerations in your application).

Jim Dempsey
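On Linux, the memory policy can also be chosen per process at launch with the numactl tool (assuming it is installed; `./myapp` below is a placeholder for your binary):

```shell
# Interleave the process's pages across all memory nodes
numactl --interleave=all ./myapp

# Or bind both execution and allocation to node 0 (NUMA-aware placement)
numactl --cpunodebind=0 --membind=0 ./myapp
```

Which policy wins depends on the access pattern: interleaving evens out bandwidth, while binding keeps each thread's data local.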
