Knights Landing's MCDRAM Address Mapping


I would like to know how MCDRAM address mapping works on Knights Landing. For a given physical address, how does the hardware decide the MCDRAM row, column, bank, and channel? Is there an MCDRAM architecture specification that describes this procedure?

Thanks in advance.


As far as I know, the mapping inside the MCDRAMs is not documented. The mapping of cache lines to MCDRAM controllers, however, is easy enough to determine using the hardware performance counters.

Bandwidth testing in "Flat-Quadrant" or "Flat-All2All" modes shows large performance drops when accessing arrays separated by a multiple of 64 KiB. This suggests that each of the 8 EDC controllers uses an interleave that produces a bank conflict every 8 KiB, but the details have not been disclosed.

Given measurements from directed benchmarks and knowledge of the size of the MCDRAM (8 banks of 2 GiB each), one can speculate about the lower-level details. Some of these speculations lead to testable hypotheses, but the limited set of EDC performance counter events makes it difficult to disambiguate among the possible implementations.

"Dr. Bandwidth"

>>...Bandwidth testing in "Flat-Quadrant" or "Flat-All2All" modes shows big performance drops when accessing arrays that are
>>separated by a multiple of 64 KiB...

I've experienced a different case: when an application allocates a block of MCDRAM memory greater than the total MCDRAM memory available to the node, there is a huge performance impact. This applies to all Flat and Hybrid MCDRAM modes:

MCDRAM = Flat - Cluster = All2All
MCDRAM = Flat - Cluster = SNC-2
MCDRAM = Flat - Cluster = SNC-4
MCDRAM = Flat - Cluster = Hemisphere
MCDRAM = Flat - Cluster = Quadrant
MCDRAM = Hybrid 50-50 - Cluster = All2All
MCDRAM = Hybrid 50-50 - Cluster = SNC-2
MCDRAM = Hybrid 50-50 - Cluster = SNC-4
MCDRAM = Hybrid 50-50 - Cluster = Hemisphere
MCDRAM = Hybrid 50-50 - Cluster = Quadrant

A complete set of performance numbers for [ MCDRAM = Flat - Cluster = Quadrant ] follows in the next post...

[ KNL Modes: MCDRAM = Flat - Cluster = Quadrant   ]

 [ NUMA Information ]

  [guest@xxxx-xxxx ~]$ numactl --hardware

  available: 2 nodes (0-1)
  node 0 cpus:
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31
   32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
   64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95
   96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
  128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
  160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
  192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
  224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
  node 0 size: 98178 MB
  node 0 free: 95051 MB
  node 1 cpus:
  node 1 size: 16384 MB
  node 1 free: 15934 MB
  node distances:
  node   0   1
    0:  10  31
    1:  31  10

 [ Test 1.1 - hbw_malloc ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_malloc
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 4.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (    608 ms )
   Iteration:  2 - HBW Memory Processed (    611 ms )
   Iteration:  3 - HBW Memory Processed (    610 ms )
   Iteration:  4 - HBW Memory Processed (    610 ms )
  HBW Memory Released
  Processing Completed

 [ Test 1.2 - hbw_malloc ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_malloc
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 8.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (   1231 ms )
   Iteration:  2 - HBW Memory Processed (   1229 ms )
   Iteration:  3 - HBW Memory Processed (   1229 ms )
   Iteration:  4 - HBW Memory Processed (   1227 ms )
  HBW Memory Released
  Processing Completed

 [ Test 1.3 - hbw_malloc ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_malloc
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 16.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed ( 191376 ms )
   Iteration:  2 - HBW Memory Processed ( 194846 ms )
   Iteration:  3 - HBW Memory Processed ( 197098 ms )
   Iteration:  4 - HBW Memory Processed ( 197608 ms )
  HBW Memory Released
  Processing Completed

 [ Test 1.4 - hbw_malloc ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_malloc
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 15.36 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (   2353 ms )
   Iteration:  2 - HBW Memory Processed (   2331 ms )
   Iteration:  3 - HBW Memory Processed (   2347 ms )
   Iteration:  4 - HBW Memory Processed (   2346 ms )
  HBW Memory Released
  Processing Completed

 [ Test 2.1 - hbw_posix_memalign ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_posix_memalign
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 4.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (    614 ms )
   Iteration:  2 - HBW Memory Processed (    614 ms )
   Iteration:  3 - HBW Memory Processed (    607 ms )
   Iteration:  4 - HBW Memory Processed (    608 ms )
  HBW Memory Released
  Processing Completed

 [ Test 2.2 - hbw_posix_memalign ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_posix_memalign
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 8.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (   1229 ms )
   Iteration:  2 - HBW Memory Processed (   1231 ms )
   Iteration:  3 - HBW Memory Processed (   1232 ms )
   Iteration:  4 - HBW Memory Processed (   1231 ms )
  HBW Memory Released
  Processing Completed

 [ Test 2.3 - hbw_posix_memalign ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_posix_memalign
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 16.00 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed ( 191176 ms )
   Iteration:  2 - HBW Memory Processed ( 197176 ms )
   Iteration:  3 - HBW Memory Processed ( 197110 ms )
   Iteration:  4 - HBW Memory Processed ( 199635 ms )
  HBW Memory Released
  Processing Completed

 [ Test 2.4 - hbw_posix_memalign ]

  Processing Started
  KNL Modes                       : MCDRAM = Flat         - Cluster = Quadrant
  HBW Memory Available
  HBW Memory Policy               : HBW_POLICY_BIND
  HBW Memory Allocated by         : hbw_posix_memalign
  HBW Memory Allocation Error Code: 0
  HBW Memory Allocated            : 15.36 GB
  HBW Memory Initialization
  HBW Memory Processing
   Iteration:  1 - HBW Memory Processed (   2351 ms )
   Iteration:  2 - HBW Memory Processed (   2362 ms )
   Iteration:  3 - HBW Memory Processed (   2358 ms )
   Iteration:  4 - HBW Memory Processed (   2365 ms )
  HBW Memory Released
  Processing Completed

 

As you can see:

- The amount of MCDRAM memory available on Node 1 is 15934 MB ( 15.93 GB, as reported by the numactl utility ).

- When the application allocates 15.36 GB of MCDRAM, less than the 15.93 GB available, processing is fast:
...
Iteration: 1 - HBW Memory Processed ( 2353 ms )
Iteration: 2 - HBW Memory Processed ( 2331 ms )
Iteration: 3 - HBW Memory Processed ( 2347 ms )
Iteration: 4 - HBW Memory Processed ( 2346 ms )
...

- When the application allocates 16.00 GB of MCDRAM, more than the 15.93 GB available, processing is very slow:
...
Iteration: 1 - HBW Memory Processed ( 191376 ms )
Iteration: 2 - HBW Memory Processed ( 194846 ms )
Iteration: 3 - HBW Memory Processed ( 197098 ms )
Iteration: 4 - HBW Memory Processed ( 197608 ms )
...

~80x slower !?!?!

Jim Dempsey

Since you are asking for more memory than is available, it is not clear what the system is doing under the covers. Depending on how the system is configured, it might even do something as stupid as swapping pages to/from MCDRAM.

If you intend to use only MCDRAM, you should use an interface that causes the allocation to fail if sufficient MCDRAM is not available. "numactl --membind=1" will manage this without requiring program changes.

"Dr. Bandwidth"

>>...it might even do something as stupid as swapping pages to/from MCDRAM...

I suspect that virtual memory is somehow involved, but I can't prove it at the moment.

>>...~80x slower !?!?!

Let me know if you need a C-language reproducer ( ~240 lines of code ). I spent a lot of time investigating this problem on a KNL server a couple of months ago.

Because the policy is HBW_POLICY_BIND, libnuma is instructed to use only the high-bandwidth NUMA node. When that node is full, there is no choice but to swap out to the file system; the NUMA node with DRAM is unavailable because of BIND.

Try HBW_POLICY_PREFERRED, which allocates in another NUMA node when the high-bandwidth node is full, so DRAM will be used once MCDRAM is exhausted.


Yet another reason why we run all of our compute nodes with swapping completely disabled....

"Dr. Bandwidth"

>>Because the policy is HBW_POLICY_BIND, libnuma is instructed to use only the high bandwidth NUMA node

Gregg, read the docs for the hbw_malloc and hbw_posix_memalign functions. They say:

...
HBW_POLICY_BIND
If insufficient high bandwidth memory from the nearest NUMA node is available to
satisfy a request, the allocated pointer is set to NULL and errno is set to ENOMEM.
If insufficient high bandwidth memory pages are available at fault time
the Out Of Memory ( OOM ) killer is triggered. Note that pages are faulted exclusively
from the high bandwidth NUMA node nearest at time of allocation, not at time of fault.
...

So, it did not set the pointer to NULL, and errno was not set to ENOMEM, when the test application requested more MCDRAM than was actually available.

Also, I reported that problem privately more than 2 months ago.

>>...Yet another reason why we run all of our compute nodes with swapping completely disabled....

I would consider that a workaround, but it doesn't solve the problem of incorrect behavior in the memkind library.

The concept of virtual memory ( VM ) is not new; it has been in use since the days of the DEC VAX/VMS OS. On Windows I use VM a lot, and on a system simulating an embedded platform with 128 MB of memory under Windows 2000 I was able to complete a stress test in which 1.99 GB of memory was allocated by a 32-bit test application. Processing was slower than in-memory-only processing ( no VM used ), but the ratio wasn't the 80x that Jim stressed in his post.

If there is a problem with MCDRAM, the memkind library, or the Linux OS, and Intel engineers ignore it, that doesn't look good.

In my tests the Out Of Memory killer is triggered.

I don't get an error code until I ask for more than 96 GB, which is how much DRAM memory is in the system.

To contact the memkind developers try the mailing list, memkind@lists.01.org


>>...A concept of using Virtual Memory ( VM ) is not new and it is used since times of DEC VAX/VMS OS...

Actually, it is about twenty years older than that ( 1959, on the Atlas machine in Manchester ). https://en.wikipedia.org/wiki/Virtual_memory

From a former memkind developer: handling out-of-memory conditions in the Linux kernel is complicated and depends strongly on system configuration. In the most common scenario, NULL is returned by libnuma only if you allocate more memory than the amount of free physical memory in the system at the time the virtual memory is allocated. The memkind documentation is a bit inaccurate in that matter; it is nontrivial to write a comprehensive explanation that would cover every possible kernel behavior.

Ah, good old VMS -- those were the days. I miss having the OS automatically keep the last 5 versions of a file. ( Relying on GNU Emacs for that now. )

>>In my tests the Out Of Memory killer is triggered.
>>
>>I don't get an error code until I ask for more than 96 GB, which is how much DRAM memory is in the system.

I've been talking about a problem related to MCDRAM memory. Take a look at Post #4, lines 63 to 79, for example.

Did you see what happened in that case? Please let me know if additional explanations are needed.

rp, sorry that we are discussing a problem not related to your subject.

I am also talking about MCDRAM.

I get the Out Of Memory killer when I fault slightly more memory than is currently available on the NUMA node with MCDRAM, using either memkind with the HBW_POLICY_BIND policy or numactl --membind.

The memkind documentation for HBW_POLICY_BIND is oversimplified. Linux of course does deferred page allocation, and the kernel behavior can be quite complex when allocating more memory than is available on a NUMA node. If, for example, an application grabs just enough memory to thrash against the kernel's ~0.5 GB of memory in the MCDRAM, the application can become extremely slow.

And so, for the Intel Xeon Phi x200 processor, it is best practice to use HBW_POLICY_PREFERRED. It could be argued that BIND is there as an option simply because it always has been, but it lacks a practical use for this specific processor.

 

Binding is definitely useful if swapping is disabled, since it prevents you from silently getting pages allocated where you don't want them.  

Failover allocation to the wrong NUMA node can be very bad for performance testing or for multi-node ( synchronous ) production jobs. It is better to have the job fail immediately than to either get misleading performance results or waste time by having many nodes wait on a slow node.

"Dr. Bandwidth"

That is something of a performance-testing-centric view. Most customers would rather have a job finish, even if sub-optimally, than fail.

For performance testing, it is better to check whether memory was actually allocated in a high-bandwidth node, using hbw_verify_memory_region().
