Better performance with more threads than number of cores?

Hello,

I read a report. The authors parallelized a memory-intensive application with multithreading (pthreads). They tested the performance on a compute node with two 6-core Xeon X5650 processors. They found that performance increases with the number of threads, even when the number of threads (e.g., up to 2000) is greater than the number of physical cores (i.e., 12). The authors attributed this to "memory level parallelism supported by the latest multi-core processors, i.e., Intel Nehalem enables more than 10 outstanding memory requests".

I could not understand this. I know CPUs have load/store buffers to support out-of-order execution, but those are bound to a single thread. From this report, it looks like a memory request can remain in flight across a thread switch. Is that true? If not, how can one explain this kind of result: performance going up even when the thread count exceeds the core count?

Thanks!

--Junchao

Junchao,

A thread context switch has relatively high overhead. You want to avoid it when it is not beneficial. There are situations where it is beneficial:

When a thread waits for I/O
When a thread waits a long time for a mutex or other event (generally because the thread holding the mutex is itself waiting for I/O)

When all threads of an application are fully compute bound, it is inefficient to use more threads than hardware threads.

That said, one could poorly construct a fully compute-bound application using n threads on a system with n hardware threads that performs worse than the same application using k*n threads. The reason is generally that the working set is divided into n partitions in the former case and k*n partitions in the latter; call this n tiles versus k*n tiles. Under those circumstances, the right conclusion is not more threads == better performance, but rather more tiles == better performance. In other words, for a given architecture an application may have a particular "sweet spot" for working-set size. What you want is not more threads, but the same number of threads executing more tasks, each task using the "sweet spot" working-set size. This is often called tasking.

A task is (grossly) functionally equivalent to a thread (each has its own work to do); however, when a thread finishes one task, the switch to the next task has less overhead than having the operating system perform a thread context switch. A purpose-built (light-weight) tasking system can perform a task switch in tens to hundreds of instructions. A heavy-weight tasking system (more functionality) would be about 10x the light-weight cost, and a thread context switch may be 10x the heavy-weight cost (100x the light-weight).

As for what is the best number of threads:

The same number as hardware threads for the compute-bound portions of the application
Plus as many additional non-compute-bound threads as you have pending I/O operations

Example: parallel pipeline

(one reader thread) -> (n worker threads) -> (one writer thread)

On a system with n hardware threads, the thread count would be n+2, where the +2 are reserved for performing I/O; when there is no I/O, they suspend (they do not spin-wait).

Jim Dempsey

www.quickthreadprogramming.com

Assuming you run your platform in hyper-threaded mode (the default), the CPU has partial hardware support for 2 threads per core. In some well-known cases this provides a significant boost over 1 thread per core; such cases include database applications with a high rate of TLB misses. As you hinted, this permits one thread to progress while another waits for a memory request to be fulfilled.
The 56xx behavior is atypically complicated: in each CPU, 2 access paths to the last-level cache are shared by 4 cores, while the remaining 2 cores have their own paths. So it is possible to see cases where those specific 2 cores gain more from running 2 threads each than the others do.

>> I read a report. The authors parallelized a memory-intensive application with multithreading (pthreads). They tested the performance on
>> a compute node with two 6-core Xeon X5650 processors. They found that performance increases with the number of threads, even when the
>> number of threads (e.g., up to 2000) is greater than the number of physical cores (i.e., 12).

That sounds very interesting; could you post a link to that report?

Hi Jim,

>>...Thread context switch is a relatively high overhead...

I will verify how high it is and post my results for single-core, dual-core, and quad-core systems.

Best regards,
Sergey

The report is here: http://www.springerlink.com/content/367632302112t6q7/
It is about breadth-first search. The interesting figure is Fig. 5.

>>...Thread context switch is a relatively high overhead...
A thread context switch consists of the following operations:
1. Saving and reloading the instruction pointer.
2. Saving and reloading the kernel stack pointer.
3. Saving and reloading a pointer to the process's address space.
I suppose those operations could take a hundred or several hundred CPU cycles.
The SYSENTER instruction should speed up the transition from ring 3 to ring 0.

Here is very interesting information about context-switch overhead on various Intel processors (Linux only): http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
