Not NUMA???

I am puzzled by the fact that the machine has 4 QPI packages and is not NUMA (Windows reports a single NUMA node). How did you achieve this? Is it some kind of BIOS setting that effectively blends the memory topology and makes a NUMA system look like a UMA system?
Is it intended? I think it's much more beneficial for educational purposes to set the machine up as a NUMA system. Future concurrent hardware is going to be nonuniform. That was the only purpose of my login to MTL - to test some things on a NUMA system... and it turned out that the beast is not NUMA. It's a pity.
Earlier I tried Intel Parallel Universe, but it features Windows Server 2003 without any support for NUMA (the hardware itself is NUMA, though).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

While my knowledge of NUMA is possibly lacking at this time, a quick search revealed a possible reason for the single NUMA node:

http://code.msdn.microsoft.com/64plusLP

The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group.

We have not (I believe) configured any BIOS setting to effectively set UMA mode - I will double-check this.

Quoting Mike Pearce (Intel)

The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group.

As far as I remember, Windows should not create any extra processor groups while the number of logical processors is <= 64, because all processors are addressable by the old DWORD_PTR affinity-mask mechanism. So I expect the system features only 1 processor group. I will re-check this.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting Dmitriy Vyukov

Quoting Mike Pearce (Intel)

The operating system takes physical locality into account when assigning logical processors to groups, for better performance. All of the logical processors in a core, and all of the cores in a physical processor, are assigned to the same group, if possible. Physical processors that are physically close to one another are assigned to the same group.

As far as I remember, Windows should not create any extra processor groups while the number of logical processors is <= 64, because all processors are addressable by the old DWORD_PTR affinity-mask mechanism. So I expect the system features only 1 processor group. I will re-check this.

Yes, there is only 1 NUMA node and 1 processor group. That's strange to me...
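
For reference, a minimal sketch of how one can query these counts via the Win32 API (assuming Windows 7 / Server 2008 R2 or later, which is when GetActiveProcessorGroupCount appeared):

/* Query NUMA node and processor-group counts on Windows. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    /* Node numbers are 0-based, so the count is highestNode + 1. */
    printf("NUMA nodes:        %lu\n", highestNode + 1);
    printf("Processor groups:  %u\n", (unsigned)GetActiveProcessorGroupCount());
    printf("Active processors: %lu\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}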

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

NUMA is usually enabled/disabled in the BIOS.
The batch systems also have HT disabled :(

Both OpenMP and Pthreads, as well as Linux process settings, have a means to exclude one sibling from the HT pair. At least one of the batch systems should have HT enabled, IMHO. I did find that the log-on system does have HT enabled. Therefore, running your test application on this system during idle periods is an easy way to test with HT. See my upcoming article in the Communities: Parallel Programming section scheduled for next Wednesday (Aug. 25).
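
A minimal sketch of the Linux process-settings route, under the assumption (check /sys/devices/system/cpu/cpu*/topology/thread_siblings_list first) that the HT siblings of cores 0..N-1 are enumerated as logical CPUs N..2N-1 on this box:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);  /* logical CPUs online */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    /* Keep only the first half of the logical CPUs: one per physical core
     * under the enumeration assumption above, leaving the HT siblings idle. */
    for (long cpu = 0; cpu < ncpu / 2; ++cpu)
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("Process pinned to %ld of %ld logical CPUs\n", ncpu / 2, ncpu);
    /* ... run the test workload from here ... */
    return 0;
}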

Jim Dempsey

www.quickthreadprogramming.com

Quoting jimdempseyatthecove
NUMA is usually enabled/disabled in the BIOS.

Hi Jim,

how does such a system work with NUMA disabled?

I would expect that all memory is evenly interleaved across all nodes at page granularity, i.e.:

0x00000000 - 0x00000FFF: NODE 0
0x00001000 - 0x00001FFF: NODE 1
0x00002000 - 0x00002FFF: NODE 2
0x00003000 - 0x00003FFF: NODE 3
0x00004000 - 0x00004FFF: NODE 0
...

Is it so?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

People who work on the Linux machine report that it has 4 NUMA nodes.
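
A quick way to confirm that is a minimal sketch with libnuma (assuming the Linux MTL image provides it; link with -lnuma, or simply run "numactl --hardware"):

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    /* numa_max_node() returns the highest node number, hence the +1. */
    printf("NUMA nodes: %d\n", numa_max_node() + 1);
    return 0;
}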

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting Dmitriy Vyukov
People who work on the Linux machine report that it has 4 NUMA nodes.

I've been using the MTL Linux machines. Both the log-in Linux system (with HT) and the batch systems (without HT) appear to have BIOS settings that interleave memory (non-NUMA). The interleaving is likely set so that each sequential cache line (physical address) comes from a rotating sequence of memory nodes (4 nodes on these systems). Set up this way, a single-threaded application can attain higher memory throughput on sequential memory accesses (since the memory fetches can work ahead of reads, or behind for writes).

In the NUMA setup, the physical memory is partitioned into the number of memory nodes available, where each sequential cache line (physical address) accesses the same memory node (up to the partitioning point). Note that the virtual memory system can still alternate physical pages across nodes for sequential virtual address pages (reference to Dmitriy's earlier post). NUMA-aware multi-threaded programs can attain better aggregate performance when set up this way. (Some systems may interleave at pairs of cache-line intervals, or at cache-burst intervals.)

Some of the problems may relate to the Red Hat distro on these systems having an older libnuma.
Also, the BIOS documentation relating to the NUMA settings should be read with caution, because I think the Chinese-English definition of "interleave" may differ from US English. It appears to be backwards on my older Tyan motherboard.

Jim Dempsey

www.quickthreadprogramming.com

Update: I have discovered why the Windows MTL is reporting UMA rather than NUMA - it has to do with the memory (risers) in the system. I have corrected this shortcoming and the Windows MTL should now support NUMA.

Please let me know if this assumption is not correct.

I have also increased the system memory on the Windows MTL from 64GB to 128GB.

Quoting Dmitriy Vyukov
Quoting jimdempseyatthecove
NUMA is usually enabled/disabled in the BIOS.

Hi Jim,

how does such a system work with NUMA disabled?

I would expect that all memory is evenly interleaved across all nodes at page granularity, i.e.:

0x00000000 - 0x00000FFF: NODE 0
0x00001000 - 0x00001FFF: NODE 1
0x00002000 - 0x00002FFF: NODE 2
0x00003000 - 0x00003FFF: NODE 3
0x00004000 - 0x00004FFF: NODE 0
...

Is it so?

The BIOS configuration on the motherboard can be set to enable/disable node interleaving. Interleaving is typically performed at memory bus intervals (128 bits, 256 bits, typically a cache line), but may be done in small multiples of memory bus intervals (motherboard dependent).

Look at page 8 of: http://i.dell.com/sites/content/business/solutions/whitepapers/en/Docume...

When node interleaving is off you will (should) have NUMA enabled. When node interleaving is on, you will have aggregate UMA access (every other memory bus interval comes from a different NUMA node, so long-run sequential accesses have ~UMA latencies).

What Dmitriy has mentioned is a system implementation often found on Linux, where the O/S distributes "contiguous" virtual memory discontiguously amongst the available NUMA nodes on a NUMA-configured system. Using this technique, sections of large arrays are distributed amongst NUMA nodes.
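
A minimal sketch of how an application can request this page-interleaved placement for itself on Linux, assuming libnuma is installed (link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    size_t len = 64 * 1024 * 1024;               /* 64 MB region */
    /* Pages of this region are placed round-robin across all allowed nodes,
     * giving the page-granular distribution described above. */
    void *buf = numa_alloc_interleaved(len);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_interleaved failed\n");
        return 1;
    }
    /* ... use buf: successive pages come from different nodes ... */
    numa_free(buf, len);
    return 0;
}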

A separate technique implemented on several systems is the "first touch" technique, whereby after allocation the core that first touches the memory (typically with a write) decides, at page granularity, which NUMA node the virtual memory page is mapped to. This technique has some advantages and some disadvantages (see the sketch after the lists below).

Advantages:
No explicit NUMA allocation and mapping code is required in the application.
Large virtual memory allocations, when "sliced" in a manner such that the slices are consistently processed by the same thread, will, at page granularity, be mapped to the most effective NUMA node.

Disadvantages:
First use encounters page faults. On a system with a 4KB page size, a 4MB allocation will encounter 1024 page faults on first use (a 4GB allocation, 1024*1024 page faults).
When (libnuma) is not implemented right (consistently), an application's attempt to optimize NUMA allocations may be usurped by "first touch". Example: on a system with a 4KB page size, a 4MB allocation placed on a specific NUMA node (with "first touch" usurping the allocation) will still encounter 1024 page faults on first use. When the allocation is short-lived, "first touch" usurping can be expensive.
When the CRT heap is ambivalent to first touch, small allocations may suffer.
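
Here is the first-touch sketch mentioned above (my own example, assuming an O/S first-touch policy and an OpenMP build; compile with -fopenmp or the equivalent):

#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)   /* 16M doubles = 128 MB, spanning many pages */

int main(void)
{
    double *a = malloc(N * sizeof *a);   /* pages not yet touched or placed */
    if (a == NULL) return 1;

    /* First touch: static scheduling gives each thread a contiguous slice,
     * so the pages of that slice fault in - and are placed - on the node of
     * the thread that will later process them. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = 1.0;

    /* Later processing with the same schedule reuses mostly-local pages. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    free(a);
    return 0;
}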

Jim Dempsey

www.quickthreadprogramming.com
