Thread thrashing ?

I have built a 64-bit C program (gcc) using OpenMP. The program computes matrix-related results. It runs well, but I find that once the thread count exceeds 16, the time taken to process the job increases. The simple results below show matrix size in the vertical, number of threads allocated in the horizontal, and processing times as the data values. Note the increase in processing time at 32 threads compared with 16. I imagine this is a result of how the memory is configured with respect to the processors? Similar 32-core SMP machines (SUN) do not show this increase from 16 to 32 threads.

Threads:        2     4     8    16    32
2000x2000     140    73    40    58    62
4000x4000     502   255   133   109   135
6000x6000    1066   568   292   241   297

I am using qsub to submit my batch job and was wondering whether I need to balance the threads per node, i.e. allocate a fixed number of threads per node or something like that.
Anyone care to have a guess as to why the processing time increases with more threads?
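One thing worth trying from the qsub script is pinning the OpenMP threads to specific cores. A minimal sketch, assuming gcc's libgomp runtime; the script layout, core numbering, and the `matrix_prog` binary name are hypothetical:

```shell
#!/bin/sh
# Hypothetical qsub wrapper: keep a 16-thread run on cores 0-15 (two chips)
# rather than letting the scheduler spread threads across all sockets.

OMP_NUM_THREADS=16
export OMP_NUM_THREADS

# GOMP_CPU_AFFINITY is specific to GNU OpenMP (libgomp): thread n is bound
# to the n-th CPU in the list.
GOMP_CPU_AFFINITY="0-15"
export GOMP_CPU_AFFINITY

echo "threads=$OMP_NUM_THREADS affinity=$GOMP_CPU_AFFINITY"
# ./matrix_prog   # hypothetical program name
```

With the threads pinned, each run at a given thread count at least touches the same set of memory controllers every time, which makes timings comparable.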

Regards, Rob


hi Rob,

The system has a NUMA architecture. When you compute on 16 cores (cores 0-15, two chips), each chip sees one local and one remote memory pool. When you compute on 32 cores, each chip sees one local and three remote memory pools, so the average memory access latency goes up.
Details can be found here: 2041-9.html

For better scaling you need to reduce remote-memory accesses, but I'm not sure whether that is possible with OpenMP and a simple matrix multiply.


Thanks for that detail, Vladimir. It makes perfect sense once one looks at the memory configuration in those diagrams. Despite feedback from an earlier posting, I was hoping there was a mistake somewhere and the chips were configured/switched onto a single memory somehow, as in the 48-core system described in the following link:

Any idea whether we can get one to use via the MTL facility?

Regards, Rob
