CnC++ on multiple-socket NUMA machines


I had a small window of opportunity to run my CnC++ code on a machine with a large number of processors and cores. Strangely enough, the performance graph was absolutely flat; the number of threads had no effect at all. The exact same code scales normally under the same Linux OS on our single-socket 8-core machine.
The benchmark results from "Performance Evaluation of Concurrent Collections on High-Performance Multicore Computing Systems" do show scaling on NUMA machines, so I was wondering what I could do to make it run properly.

Best Reply

Hi,
We made some experiments with up to 64 cores in shared memory, and with suitable applications we could demonstrate good scalability.

Do you see no speedup at all, or no speedup beyond 8 cores?

I see several possible explanations for your observation:

- Setting the number of threads might have no effect on the actual number of threads used, e.g. it's always using the default number of threads.
- We sometimes see weird things with different versions of TBB. Which version of TBB are you using?
- If setting the number of threads is done through the API, CnC::set_num_threads must happen before instantiating the context (see the sketch after this list).
- The machine is overloaded with other processes.
- The memory bandwidth is the limiting factor. You can use depends, get-counts and lazy tag production to influence that (also illustrated in the sketch below). To give more specific advice I would probably need more details on the program mechanics.
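To make the last two points concrete, here is a rough, hypothetical sketch of how they look in code. The step, tuner, and collection names are invented for illustration, and the exact spelling of the thread-count call and the tuner hooks may vary between CnC releases, so treat it as a starting point rather than a reference:

    #include <cnc/cnc.h>

    struct my_context;

    // Hypothetical step: consumes one item per tag.
    struct my_step
    {
        int execute( const int & tag, my_context & c ) const;
    };

    // Step tuner: "depends" declares the step's inputs up front, so the
    // scheduler launches the step only once its data is available instead
    // of blocking a worker thread on a get().
    struct my_step_tuner : public CnC::step_tuner<>
    {
        template< class dependency_consumer >
        void depends( const int & tag, my_context & c, dependency_consumer & dC ) const;
    };

    // Item tuner: "get_count" tells the runtime how many gets to expect per
    // item, so its memory can be recycled once the last consumer has read it.
    struct my_item_tuner : public CnC::hashmap_tuner
    {
        int get_count( const int & /*tag*/ ) const { return 1; }
    };

    struct my_context : public CnC::context< my_context >
    {
        CnC::step_collection< my_step, my_step_tuner >     m_steps;
        CnC::tag_collection< int >                         m_tags;
        CnC::item_collection< int, double, my_item_tuner > m_items;

        my_context()
            : m_steps( *this, "my_step" ),
              m_tags( *this, "my_tags" ),
              m_items( *this, "my_items" )
        {
            m_tags.prescribes( m_steps, *this );
        }
    };

    template< class dependency_consumer >
    void my_step_tuner::depends( const int & tag, my_context & c, dependency_consumer & dC ) const
    {
        dC.depends( c.m_items, tag );   // step 'tag' needs item 'tag' before it runs
    }

    int my_step::execute( const int & tag, my_context & c ) const
    {
        double x;
        c.m_items.get( tag, x );
        // ... do some work with x ...
        return CnC::CNC_Success;
    }

    int main()
    {
        // Thread count must be set *before* the context is instantiated
        // (name as given above; the exact namespace may differ by release).
        CnC::set_num_threads( 16 );

        my_context ctx;
        for( int t = 0; t < 100; ++t )
        {
            ctx.m_items.put( t, double( t ) );
            ctx.m_tags.put( t );   // tags could also be produced lazily, e.g. by the steps themselves
        }
        ctx.wait();
        return 0;
    }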

We are working on a new release of CnC which should be ready within the next 1-2 months. It comes with a few improvements in thread handling. Not sure if this makes a difference to your application, though.

frank

We are seeing scaling on an 8-core Intel Xeon machine, giving a nice mortar curve up to 8 threads. Beyond that, there is no speedup, which shows hyperthreading has no effect in our case. On the large 40-something core Intel shared-memory NUMA machine, there was no variation in speed at all: benchmark results for 1 to n threads are consistent with just repeating the same single-thread run over and over again. The benchmark boxplot of the threads-vs.-time curve is absolutely flat, with the same percentiles across the board. That leaves the following points:

- Setting the number of threads might have no effect on the actual number of threads used, e.g. it's always using the default number of threads.
- The machine is overloaded with other processes.
- We sometimes see weird things with different versions of TBB. Which version of TBB are you using?

I was relatively sure I was the sole user of the machine, and we were using TBB 3.0.
PS. The program in question is the quantum simulator I have been working on, which should be familiar.
PPS. Great news about the new release! I am really looking forward to it.

Are you saying that the runtime used only one thread no matter how many threads you requested? Did you verify this with "top" or the like, or did you infer it from the performance numbers?

I have checked this with the same source code, under the same OS, but on a different machine: ./my_program & top -bH -d 0.1 -p $! sees a number of threads equal to cores + hyperthreading (a measly 16 in this case). There is still only one process though, which probably led to it being mapped to only one processor on the large NUMA machine.

edit: I've asked around and the above is unlikely; normally threads should be divided among all processors/cores. In any case, much of this is moot, as I don't have access to such a large machine at the moment.
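As a side note, the thread count TBB will use by default can also be checked from inside the program rather than with top. A minimal sketch against the TBB 3.0-era task_scheduler_init API (file layout and output wording are mine):

    #include <cstdio>
    #include "tbb/task_scheduler_init.h"

    int main()
    {
        // Number of worker threads TBB creates by default on this machine,
        // normally one per logical core (16 on the 8-core + HT box above).
        int n = tbb::task_scheduler_init::default_num_threads();
        std::printf( "TBB default number of threads: %d\n", n );
        return 0;
    }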
