Can someone explain how NUMA is not always best for multi-socket systems?

I have a dual-socket Westmere server running 2 six-core X5680 CPUs with NUMA enabled.

I have been reading about the benefits of NUMA when software is designed with NUMA in mind, but I've also been reading that NUMA can harm performance if the software does not specifically take it into account.  How is this possible?  My understanding was that disabling NUMA brings memory access from both CPUs down to the lowest common denominator of remote memory.  For example, with NUMA enabled, local memory access might take 100 nanoseconds and remote memory access at another node might take 200 nanoseconds; with NUMA disabled, all memory access would take 200 nanoseconds.  Am I thinking about this the right way?  Can someone explain some cases where NUMA with non-NUMA-aware software might not work well?


Hello Ekleel,

Here is an example which I've seen before.

Say you have a 4-socket (4 NUMA node) system. The application starts up on node 0, allocates all of its memory from node 0, and then spawns worker threads across all 4 sockets. All the worker threads are memory intensive (they read and write lots of memory), but they are all using memory allocated at startup on node 0. So the threads on 3 of the 4 sockets are all doing remote memory accesses, and only the threads on node 0 are doing local memory accesses. This is a case where disabling NUMA would speed the application up (assuming there is a significant penalty for remote memory access). The QPI links can't deliver memory as fast as the CPU can read local memory (that is, QPI bandwidth is less than local memory bandwidth). In the 'allocate on node 0 and read/write everywhere' case, all the nodes have to get their memory from node 0. This saturates node 0's QPI links. When you disable NUMA (in this case), the overall QPI traffic is usually lower.

One can say this is a poorly threaded application, but I've seen it many times.

Pat

Great example!  That makes a lot of sense!  Are you aware of any way to see memory used per node on Windows Server 2008 R2?  I feel like this is what is happening to my software.

There is probably a simpler way to do it but I have a program that reports it.

The code looks like below... pasting screwed up the formatting.

int do_numa(void)
{
    unsigned __int64 AvailableBytes;
    unsigned __int64 ProcessorMask;
    unsigned char Node;
    unsigned int ui;
    unsigned long HighestNodeNumber = 0;
    int brc, j, cpu;
    HINSTANCE hDLL; // Handle to kernel32.dll
    typedef BOOL (CALLBACK* LPFNDLLFUNC1)(PULONG);
    typedef BOOL (CALLBACK* LPFNDLLFUNC_Mask)(UCHAR, PULONGLONG);
    typedef BOOL (CALLBACK* LPFNDLLFUNC_Mem)(UCHAR, PULONGLONG);
    LPFNDLLFUNC1 lpfnDllFunc1;         // GetNumaHighestNodeNumber
    LPFNDLLFUNC_Mask lpfnDllFunc_Mask; // GetNumaNodeProcessorMask
    LPFNDLLFUNC_Mem lpfnDllFunc_Mem;   // GetNumaAvailableMemoryNode

    hDLL = LoadLibrary("kernel32");
    if (hDLL == NULL)
        return 0;

    lpfnDllFunc1 = (LPFNDLLFUNC1)GetProcAddress(hDLL, "GetNumaHighestNodeNumber");
    if (!lpfnDllFunc1)
    {
        // handle the error
        FreeLibrary(hDLL);
        printf("This Windows OS kernel32.dll doesn't support NUMA routine GetNumaHighestNodeNumber.\nHardware might still be in NUMA mode.\n");
        return 0;
    }
    // call the function
    brc = lpfnDllFunc1(&HighestNodeNumber);
    printf("This Windows OS kernel32.dll has NUMA routine GetNumaHighestNodeNumber.\n");

    lpfnDllFunc_Mask = (LPFNDLLFUNC_Mask)GetProcAddress(hDLL, "GetNumaNodeProcessorMask");
    lpfnDllFunc_Mem = (LPFNDLLFUNC_Mem)GetProcAddress(hDLL, "GetNumaAvailableMemoryNode");
    if (lpfnDllFunc_Mask && lpfnDllFunc_Mem)
    {
        for (ui = 0; ui <= HighestNodeNumber; ui++)
        {
            Node = (unsigned char)ui;
            brc = lpfnDllFunc_Mask(Node, &ProcessorMask);
            brc = lpfnDllFunc_Mem(Node, &AvailableBytes);
            // find the first cpu in this node's processor mask
            cpu = -1;
            for (j = 0; j < sizeof(ProcessorMask) * 8; j++)
            {
                if ((ProcessorMask & (1LL << j)) != 0)
                {
                    cpu = j;
                    break;
                }
            }
            printf("NumaNode= %d Processor= %d AvailableMBytes= %I64d\n",
                   ui, cpu, AvailableBytes / (1024 * 1024));
        }
    }
    FreeLibrary(hDLL);
    return 0;
}

For ease of use, here is a complete version of the above sample. I copied the pasted text back into a tst.c file and it compiles with 'cl tst.c'.
#include <stdio.h>
#include <windows.h>

int do_numa(void)
{
    unsigned __int64 AvailableBytes;
    unsigned __int64 ProcessorMask;
    unsigned char Node;
    unsigned int ui;
    unsigned long HighestNodeNumber = 0;
    int brc, j, cpu;
    HINSTANCE hDLL; // Handle to kernel32.dll
    typedef BOOL (CALLBACK* LPFNDLLFUNC1)(PULONG);
    typedef BOOL (CALLBACK* LPFNDLLFUNC_Mask)(UCHAR, PULONGLONG);
    typedef BOOL (CALLBACK* LPFNDLLFUNC_Mem)(UCHAR, PULONGLONG);
    LPFNDLLFUNC1 lpfnDllFunc1;         // GetNumaHighestNodeNumber
    LPFNDLLFUNC_Mask lpfnDllFunc_Mask; // GetNumaNodeProcessorMask
    LPFNDLLFUNC_Mem lpfnDllFunc_Mem;   // GetNumaAvailableMemoryNode

    hDLL = LoadLibrary("kernel32");
    if (hDLL == NULL)
        return 0;

    lpfnDllFunc1 = (LPFNDLLFUNC1)GetProcAddress(hDLL, "GetNumaHighestNodeNumber");
    if (!lpfnDllFunc1)
    {
        // handle the error
        FreeLibrary(hDLL);
        printf("This Windows OS kernel32.dll doesn't support NUMA routine GetNumaHighestNodeNumber.\nHardware might still be in NUMA mode.\n");
        return 0;
    }
    // call the function
    brc = lpfnDllFunc1(&HighestNodeNumber);
    printf("This Windows OS kernel32.dll has NUMA routine GetNumaHighestNodeNumber.\n");

    lpfnDllFunc_Mask = (LPFNDLLFUNC_Mask)GetProcAddress(hDLL, "GetNumaNodeProcessorMask");
    lpfnDllFunc_Mem = (LPFNDLLFUNC_Mem)GetProcAddress(hDLL, "GetNumaAvailableMemoryNode");
    if (lpfnDllFunc_Mask && lpfnDllFunc_Mem)
    {
        for (ui = 0; ui <= HighestNodeNumber; ui++)
        {
            Node = (unsigned char)ui;
            brc = lpfnDllFunc_Mask(Node, &ProcessorMask);
            brc = lpfnDllFunc_Mem(Node, &AvailableBytes);
            // find the first cpu in this node's processor mask
            cpu = -1;
            for (j = 0; j < sizeof(ProcessorMask) * 8; j++)
            {
                if ((ProcessorMask & (1LL << j)) != 0)
                {
                    cpu = j;
                    break;
                }
            }
            printf("NumaNode= %d Processor= %d AvailableMBytes= %I64d\n",
                   ui, cpu, AvailableBytes / (1024 * 1024));
        }
    }
    FreeLibrary(hDLL);
    return 0;
}

int main(int argc, char **argv)
{
    printf("hi\n");
    do_numa();
    printf("bye\n");
    return 0;
}

Hi hardware_guy,

As Pat has already explained, in some cases there can be observable NUMA-related degradation of memory performance which can affect overall program performance. It is related to the so-called NUMA node distances and the inability to pin a thread to its preferred processor on its local NUMA node. For example, the OS scheduler can reschedule one of the threads to run on a different NUMA node (greater memory distance, hence bigger overhead and latency), so that thread will need to access its cached memory, or memory local to its previous node, remotely, thus saturating the QPI links. There is also overhead related to NUMA itself, because it is implemented as a kind of network with its own protocol packets and checksum checking.

A few links about NUMA performance degradation:

://communities.vmware.com/thread/391284

://docs.google.com/viewer?a=v&q=cache:K06wsPrSIFYJ:cs.nyu.edu/~lerner/spring10/projects/NUMA.pdf

://kevinclosson.wordpress.com/2009/08/14/intel-xeon-5500-nehalem-ep-numa-versus-interleaved-memory-aka-suma-there-is-no-difference-a-forced-confession/

P.S.
I removed the http protocol identifier and disabled the rich-text option because of the anti-spam filter.

Hello Illyapolak,

I'm not really sure what you mean when you say "There is also overhead related to NUMA itself, because it is implemented as a kind of network with its own protocol packets and checksum checking."  In general there is less overhead, higher bandwidth, and lower latency on NUMA systems. The degree to which software can take advantage of this depends on how well the code can implement NUMA-aware strategies. The packets and checksums sound more like the QPI system than the memory system, although server memory will have ECC (but this is not an enable/disable NUMA issue).

I briefly read through the 3 articles. The nyu.edu article looks like well... I wouldn't refer anyone to it. It seems to be written by undergrads who don't know what they are talking about.

The vmware article seems like a "we have a problem, we don't know what the problem is, it might have something to do with NUMA or maybe not" rambling postings.

The only point of kevinclosson's posting seems to be that disabling NUMA may provide good enough performance, and that it just depends. True, it just depends.

Sorry Pat, I sent the wrong set of links. Later I will send more technically accurate links.

>>>I'm not really sure what you mean when you say>>>

Sorry, I meant the underlying implementation. I think that NUMA uses QPI for data transmission between the nodes and the on-board memory controllers.

Patrick,

Your program looks like it worked perfectly.  I'll test it out some more and let you know what I find!

Thanks!

Hello Illyapolak,

Not to nitpick but in the interest of correctness:

I think that NUMA use QPI for data transmission between the nodes and on board memory controllers.

QPI is used for data transmission between processors. This is independent of NUMA. Messages and memory would still have to be sent between processors over QPI even if NUMA is disabled.  And there is only 1 node if NUMA is disabled.

Thanks for explanation.

Quote:

Patrick Fay (Intel) wrote:

QPI is used for data transmission between processors. This is independent of NUMA. Messages and memory would still have to be sent between processors over QPI even if NUMA is disabled.  And there is only 1 node if NUMA is disabled.

Patrick,

Just to clarify, if NUMA is disabled, the QPI is still used if node 0 needs to access a memory stripe that is in node 1's memory bank, right?

Yes, QPI is still used to send memory between the sockets/processors if numa is disabled. But, if numa is disabled, there is only 1 numa node (which contains all the processors and all the memory). For instance, if numa is enabled, on a multisocket box, you can right click on a process in taskmanager, select set affinity for the process, and you will see a list of cpus and their associated numa node number. If numa is disabled, you will see only the list of cpus to which you can pin the process.

Pat

To clarify (I hope) Patrick Fay's comment:  with NUMA disabled there is only 1 NUMA node <em>from the point of view of the operating system</em>.    There is no way to disable the "NUMA" nature of the hardware short of unplugging everything other than the processor and memory for socket 0.  

Another way to think about this is to say that there exist workloads for which the default (NUMA-aware) OS policy for placement of threads and data gives worse performance than what is obtained with random placement of threads and data.  To "disable NUMA" is to remove the OS's awareness of the underlying NUMA hardware, which <em>may</em> result in more random placement of threads and data.  

On the other hand this is not a very reliable approach.  In general it is better to keep the NUMA-awareness in the OS but change the policy -- for example by using round-robin page placement instead of the default "local first touch" policy.   (I don't know how this might be done in Windows, but it can be controlled in several ways on Linux systems, most easily by using the "numactl" driver to launch the program.)

"Dr. Bandwidth"

The non-NUMA option in BIOS setup, if there is one, is likely to be the default.  It usually does something like what John described, alternating cache lines between memory banks, so that placement of threads makes little difference, but a single thread should use the various QPI channels about evenly.  In this case, the NUMA option should improve performance with multiple threads properly placed.

The Intel KMP_AFFINITY options, or the upcoming OpenMP 4.0 ones, should be invoked along with the NUMA BIOS option to optimize performance on Windows as well.  Windows may be more likely to require a larger-than-default KMP_BLOCKTIME to make KMP_AFFINITY effective.  The user-accessible Windows equivalent to numactl or taskset is the ability to set the affinity of individual running threads in Task Manager, which is not really suitable for similar usage.  The Microsoft OpenMP library has had prototypes with an affinity feature, but I don't know whether it made it to release, so you would support affinity by using the Intel library as a replacement (it includes the same internal function calls).

I don't think the question of how this relates to Cilk(tm) Plus has been answered for Windows or Linux.  The work-stealing idea seems to be at odds with dealing with NUMA issues.

Hi Pat,

do you know where I can find any sources related to the NUMA implementation? So far I have been able to find only some info about the NUMA distance tables in the ACPI specification.

Thanks in advance.

Hey Illyapolak,

As far as I know, all of the "NUMA implementation" is done by the BIOS. The OS just deals with whatever NUMA settings the BIOS implements.

You can look at the Linux numactl source code to see what the OS can do regarding NUMA.

numactl has a sort of "allow an app to use NUMA-disabled memory allocation while NUMA is enabled" option. With this option, when memory is allocated, the OS round-robins pages across the nodes. So depending on your memory access details, one could end up with memory accesses which don't seem very "NUMA disabled". I guess this could happen when NUMA is disabled in the BIOS too... but disabling NUMA in the BIOS interleaves memory on a cache-line basis, so it is (IMO) much less likely that the accesses won't be spread evenly among the nodes.

Pat

Hi,
I supposed that the BIOS is somehow involved in NUMA management. On Windows, I think that the SLIT tables could be implemented by the acpi.sys driver, but this is only my theory :)

My question was more related to the hardware implementation, or how it is exposed by the hardware (here I mean some kind of registers) to the low-level software layer.

I looked at this link: http://lxr.free-electrons.com/source/arch/x86/include/asm/numa.h

and I can see that NUMA is partly managed by ACPI on Linux.

I was looking through the numa.h and numa.c sources, and I can see that a NUMA block is simply a memory block with a 'start' and 'end' address (probably) and a 'node id'.

Finally, I have found a lot of useful info about the NUMA implementation.

I was wondering, what is the difference between a system being NUMA and a system having NUMA enabled? Moreover, how can I tell inside a compiler whether the system is NUMA? Does being a NUMA system depend solely on the processor inside the system? That is, if I have processor X, can I tell based on that alone that systems with such processors are NUMA systems?

In the context of the types of systems usually discussed here, a NUMA system is one which has memory banks associated with processor packages, where access by one CPU to memory associated with another requires additional steps.  Many early dual-CPU systems that supported x86_64 were delivered in the non-NUMA BIOS configuration mentioned earlier in this thread, where alternate cache lines were local and remote, so that non-NUMA-aware applications would perform about the same on either CPU even if suspended and resumed, but wouldn't be capable of full memory performance.

Enabling NUMA mode in the BIOS then requires setting affinity for the application to maximize performance by having each thread use primarily local memory.  For example, the OpenMP environment variables OMP_PLACES or OMP_PROC_BIND, or Intel-specific KMP_HW_SUBSET would be useful.  Models such as cilk(tm) plus may exclude such optimization.

The NUMA term also comes into play for those CPUs where groups of hardware threads share cache, which likewise requires thread affinity for full performance.  For example, the MIC KNL has 8 hardware threads, split between 2 cores, attached to each level-2 cache tile.
