Poor performance on MTL

Well, this is embarrassing. My program runs fine on the Manycore Testing Lab as long as it's single-threaded. Whenever I enable multithreading, the performance gets worse on the same input. Starting more threads only makes it slower.

At first I thought it was just poor design on my part: too much contention for mutexes or something. However, now I'm testing a simplified design with NO synchronization between threads and it's still happening. The performance is what I'd expect from a single-core processor.

This is on the login node (acana01). Does it have access to the 40 cores? Do I have to do anything special to enable all of them?

- Rick
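
One quick way to see how many cores the OS exposes on a node is to ask it directly. The sketch below assumes Linux with glibc, where `sysconf` accepts `_SC_NPROCESSORS_CONF` and `_SC_NPROCESSORS_ONLN` (widely available extensions, not part of POSIX). The function names are made up for illustration, and the code does not account for any CPU affinity limits that a login shell or batch system might apply to the process.

```c
#include <unistd.h>

/* Processors configured in the system, as reported by the OS.
 * Assumption: _SC_NPROCESSORS_CONF is supported (true on Linux/glibc). */
long configured_cpus(void) {
    return sysconf(_SC_NPROCESSORS_CONF);
}

/* Processors currently online and schedulable.
 * Assumption: _SC_NPROCESSORS_ONLN is supported (true on Linux/glibc). */
long online_cpus(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}
```

Calling both from `main` and printing the results on the login node and again inside a batch job would show whether the two environments report the same core count.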

I have just been on acana01 (Linux SSH login) and was also seeing some very odd timings. Everything seemed to be running much slower than I would have expected. This was both on the login node and via a qsub batch job, although at this point I cannot rule out the code I have written.

Have you tried asking on the MTL forum?


Alright, I posted my question to the MTL forum. Now I'm even more convinced that there's something wrong with the MTL. As you said, the qsub batch jobs are also slow. I downloaded a pthreads sample program (that calculates pi) and it exhibited the same performance curve: more threads = slower.

I'm disappointed that I didn't get a chance to test and tune my program on a multicore system. Now I have to hope that they get the MTL properly configured before judging and that my program will scale up well.

- Rick

What is the complexity of the code run on each thread? If the work done by each thread does not outweigh the cost of communication between processors, then it may take longer to communicate between the processors than for them to do the processing. It sounds like this shouldn't be an issue on the MTL, but without really knowing the structure of the batch nodes, we can't tell for sure.

I agree with that, and cache thrashing (even more so on a quad-package machine) can also make the times worse.

In my case I can run the application in single-threaded, 1-thread, or N-thread mode, and all cases behaved strangely.

I will concede that it's most likely my program design that caused the poor performance. It's virtually lock-free so that's not the problem. In fact, a totally lock-free version also performed poorly with more threads. So it must be cache or memory thrashing. I won't know for sure unless/until I analyze it in a profiler.
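
One common form of cache thrashing between threads is false sharing, where per-thread data that is logically independent ends up on the same cache line. Whether that was the cause here is unknown without a profiler. As a rough sketch only, the code below gives each thread its own counter padded out to an assumed 64-byte line; the names `padded_counter` and `run_counters`, the line size, and the 64-thread cap are all illustrative assumptions, not anything from the program discussed above.

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE 64    /* assumed cache line size; check the actual hardware */
#define MAX_THREADS 64   /* arbitrary cap for this sketch */

/* One counter per thread, padded so neighbouring counters are 64 bytes apart. */
struct padded_counter {
    volatile unsigned long value;
    char pad[CACHE_LINE - sizeof(unsigned long)];
};

static struct padded_counter counters[MAX_THREADS];
static unsigned long iterations;

/* Each thread increments only the counter it was handed. */
static void *count_up(void *arg) {
    struct padded_counter *c = arg;
    for (unsigned long i = 0; i < iterations; i++)
        c->value++;
    return NULL;
}

/* Start nthreads workers, wait for them, and return the sum of their counters.
 * Returns 0 if nthreads is out of range. */
unsigned long run_counters(int nthreads, unsigned long iters) {
    pthread_t tid[MAX_THREADS];
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return 0;
    iterations = iters;
    int started = 0;
    for (; started < nthreads; started++) {
        counters[started].value = 0;
        if (pthread_create(&tid[started], NULL, count_up, &counters[started]) != 0)
            break;
    }
    unsigned long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        total += counters[i].value;
    }
    return total;
}
```

Removing the `pad` member and timing `run_counters` with the same arguments is one way to see whether adjacent counters slow each other down on a particular machine.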

The pi calculator does run faster when multithreaded. I was confused because it only reports total CPU time (approx. NTHREADS * elapsed) but calls it "wall clock time".

You may have won this round but watch out! I will be back for problems 2 and 3. =)

- Rick
