Intel® Memory Latency Checker v3.5 test

Hi Sir,

We have a Purley platform project on which we run the MLC test tool.
Our customer's test criterion is: [ Run the MLC benchmark to measure memory bandwidth - total bandwidth should be at least 90% of the theoretical maximum bandwidth ]
The test result is below.
Is this memory bandwidth result really too low?
What would be a reasonable value for this test?
Thanks.
--------------------SYS config------------------------------------------------
CPU : Intel Skylake 26C 2.6 GHz 205W QN5E LGA x 2 pcs.
Memory : Samsung M393A8K40B22-CWD    Speed: 2666 MHz    Size: 64 GB x 24 pcs.
PCH : C627
---------------------MLC Test result -----------------------------------------                
 [root@localhost ~]# ./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.5
Command line parameters: --bandwidth_matrix 

Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes
Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node          0          1
       0    103029.6    34396.5
       1     34412.8   102846.6

-----------------Test result -----------------------------------------------------
Theoretical max : (2666 MT/s * 6 channels * 8 bytes) / 1024 = 124.97 GB/s
Measured / theoretical : 103.029 / 124.97 = 82.44 % < 90 %
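The arithmetic above can be reproduced in a few lines of Python. (This follows the poster's divisor of 1024; note that MLC itself reports decimal MB/s, so dividing by 1000 instead would give a slightly lower efficiency figure.)

```python
# Reproduce the peak-bandwidth arithmetic above (poster's convention,
# dividing by 1024 even though MLC reports decimal MB/s).
data_rate_mts = 2666        # DDR4-2666 transfers per second (millions)
channels = 6                # memory channels per socket
bytes_per_transfer = 8      # 64-bit DRAM data bus

peak_gbs = data_rate_mts * channels * bytes_per_transfer / 1024
measured_gbs = 103.0296     # local (node 0 -> node 0) bandwidth from MLC

print(f"peak ~ {peak_gbs:.2f} GB/s")                          # ~124.97
print(f"efficiency ~ {100 * measured_gbs / peak_gbs:.2f} %")  # ~82.44
```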


Several DRAM stall conditions are associated with the "shared bus" architecture used by DRAMs. 

It looks like your system has two DIMMs per channel.  This configuration incurs a stall when switching between DIMMs (to allow the bus to settle and to ensure clean separation in time between signals driven from different physical locations on the bus).  

The magnitude of this effect is hard to estimate because Intel provides very little information about how much the memory controller is able to reorder the memory accesses to minimize stalls, and Intel no longer publishes information about the memory controller timing parameters in the processor datasheets.

It is usually possible to reach 90% DRAM bus utilization with an Intel processor, a read-only test pattern, and one DIMM per channel.  Adding the second DIMM to each channel reduces the efficiency because of the "DIMM to DIMM turnaround stalls".  This effect tends to be strongest in the read-only case (since many of the other stall conditions don't occur).   

You may be able to get a small improvement by using MLC options to switch to the AVX-512 instruction set and to switch to using no more than 1 thread per core, but it probably won't increase the value from 82.4% to 90%.

For a mainstream Intel processor, you should be able to get >=80% DRAM bus utilization for most memory access patterns on most memory configurations (excluding the case with one single-rank DIMM per channel), though you may need to fiddle with the instruction set and the number of cores used, and I would not be surprised to see some cases delivering slightly below 80%.

On the Purley platform with one dual-rank DIMM per channel, I usually see the best bandwidth using about 16 cores (one thread per core) and with the AVX-512 instruction set, but this is something you should probably test in your environment.





"Dr. Bandwidth"

Hi Sir,

In our customer's test plan, we also need to run the MLC test tool as a stress test.

Test criteria : Run memory stress tests and look at /var/log/messages and mcelog for any MCE errors. MLC total memory bandwidth should be 85% or more of the theoretical bandwidth.

Test config :

CPU : Intel Skylake 26C 2.6 GHz 205W QN5E LGA x 2 pcs.

Memory : Samsung M393A8K40B22-CWD    Speed: 2666 MHz    Size: 64 GB x 24 pcs.

PCH : C627

The test result only reaches 74%.

Is this result normal?

Thanks.


MLC Test result :

Intel(R) Memory Latency Checker - v3.5

Command line parameters: --bandwidth_matrix -t1000


Using buffer size of 100.000MB/thread for reads and an additional 100.000MB/thread for writes

Measuring Memory Bandwidths between nodes within system

Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)

Using all the threads from each core if Hyper-threading is enabled

Using Read-only traffic type

                Numa node
Numa node          0         1
       0     94760.9   34385.6
       1     34405.2   94896.0

The 94.8 GB/s values are lower than I would expect, but I don't have any SKX nodes with 2 DIMMs per channel for comparison.

It is certainly not implausible that this is the "correct" level of performance for 2 DIMMs per channel -- it all depends on how effectively the memory controller is able to avoid the DIMM-to-DIMM stalls.

As far as I can tell, the specific duration of a DIMM-to-DIMM stall is not published (it is probably tuned by the BIOS during the boot process), but historically these values have been in the range of 4-5 ns.

For DDR4/2666, the major clock is 1.333 GHz, giving a 0.75 ns cycle time. A cache line transfer takes 4 major cycles, or 3.00 ns. So the stall time for switching ranks is greater than the cache line transfer time, meaning that switching DIMMs on every read would give less than 50% performance (e.g., for a 4 ns DIMM-to-DIMM stall with stalls before every read, the sustained bandwidth would be 3/(3+4) = 42.8% of peak).

If the DIMM select was completely random, one would expect a DIMM switch approximately half of the time, so the average overhead would be about half, giving 3/(3+4/2) = 60% of peak.

To get to 74% of peak, the memory controller has to rearrange accesses to eliminate 1/2 of the DIMM-to-DIMM stalls that would occur with random DIMM selection. Is it reasonable to expect more than this? I suspect that this is not a trivial question to answer even for members of the memory controller design team....
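This back-of-the-envelope turnaround model can be sketched in a few lines of Python. The 4 ns stall duration is the assumed (not published) value from the discussion above:

```python
# Sketch of the DIMM-to-DIMM turnaround model described above.
# Assumed: 4 ns stall, DDR4-2666 major clock of 1.333 GHz (0.75 ns
# cycle), 4 major clocks per 64-byte cache-line transfer.
cycle_ns = 1 / 1.333          # ~0.75 ns per major clock
line_ns = 4 * cycle_ns        # ~3.0 ns per cache-line transfer
stall_ns = 4.0                # assumed DIMM-to-DIMM turnaround stall

def efficiency(switch_fraction):
    """Fraction of peak bandwidth when `switch_fraction` of reads
    incur a DIMM-to-DIMM stall."""
    return line_ns / (line_ns + switch_fraction * stall_ns)

print(f"stall on every read          : {efficiency(1.0):.1%}")   # ~42.9%
print(f"random DIMM selection        : {efficiency(0.5):.1%}")   # ~60.0%
print(f"half of random stalls removed: {efficiency(0.25):.1%}")  # ~75.0%
```

The last line matches the ~74% observed: eliminating half of the stalls that random DIMM selection would incur lands right around the measured efficiency.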

Have you run this test on more than one node?  (If they all get the same result, it is probably the right answer....)

On a 24-core Xeon Platinum 8160 with 1 dual-rank DDR4/2666 DIMM per channel, I get the following results with various combinations of the "-X" (use only one thread per core), "-Y" (use AVX-256 instructions), and "-Z" (use AVX-512 instructions -- requires the "mlc_avx512" binary):

log.8160.R.bandwidth_matrix:Numa node         0         1    
log.8160.R.bandwidth_matrix-       0    111596.0    34335.7    
log.8160.R.bandwidth_matrix-       1    34361.5    111694.6    
log.8160.R.bandwidth_matrix.X:        Numa node
log.8160.R.bandwidth_matrix.X:Numa node         0         1    
log.8160.R.bandwidth_matrix.X-       0    114084.7    34384.3    
log.8160.R.bandwidth_matrix.X-       1    34393.6    113458.3    
log.8160.R.bandwidth_matrix.XY:        Numa node
log.8160.R.bandwidth_matrix.XY:Numa node         0         1    
log.8160.R.bandwidth_matrix.XY-       0    112350.5    34404.1    
log.8160.R.bandwidth_matrix.XY-       1    34429.2    112294.0    
log.8160.R.bandwidth_matrix.XZ:        Numa node
log.8160.R.bandwidth_matrix.XZ:Numa node         0         1    
log.8160.R.bandwidth_matrix.XZ-       0    111992.2    34423.9    
log.8160.R.bandwidth_matrix.XZ-       1    34439.2    111760.0    
log.8160.R.bandwidth_matrix.Y:        Numa node
log.8160.R.bandwidth_matrix.Y:Numa node         0         1    
log.8160.R.bandwidth_matrix.Y-       0    111669.5    34370.5    
log.8160.R.bandwidth_matrix.Y-       1    34396.8    110972.0    
log.8160.R.bandwidth_matrix.Z:        Numa node
log.8160.R.bandwidth_matrix.Z:Numa node         0         1    
log.8160.R.bandwidth_matrix.Z-       0    110938.5    34372.2    
log.8160.R.bandwidth_matrix.Z-       1    34394.0    110518.8    

In this (Read-only traffic type) case there is little difference in performance across this set of XYZ options -- the lowest value (110.5 GB/s) is 86.3% of peak and the highest value (114.1 GB/s) is 89.1% of peak.

The differences are slightly larger with mixes of reads and writes, and much larger (~10% range) with non-temporal stores (traffic types W7, W8, W10).
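The efficiency figures quoted above can be checked against the decimal theoretical peak (MLC reports decimal MB/s, and DDR4-2666 is nominally 2666.67 MT/s, giving 128.0 GB/s per socket for 6 channels):

```python
# Check the quoted efficiencies for the Xeon 8160 results above.
data_rate_mts = 8000 / 3       # DDR4-2666 is nominally 2666.67 MT/s
peak_mbs = data_rate_mts * 6 * 8   # 128,000 MB/s per socket (6 channels)

lowest = 110518.8   # Z option, node 1 (from the table above)
highest = 114084.7  # X option, node 0

print(f"lowest : {100 * lowest / peak_mbs:.1f}% of peak")   # ~86.3%
print(f"highest: {100 * highest / peak_mbs:.1f}% of peak")  # ~89.1%
```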

"Dr. Bandwidth"

Hi Sir,

We ran the test on other nodes, and the results are the same.

So it is caused by the CPU memory controller's efficiency, right?

Thanks.


I just noticed that your second set of results was much slower than the first set, but the only thing that appears different is the longer run time ("-t1000" in the second case). It is possible that this test was getting throttled by power or thermal constraints (which is, of course, one of the reasons to run stress tests....).

Two things to consider:

(1) I would double-check the actual DRAM bus frequency before making any final conclusions. The system should be able to run two DIMMs per channel at full speed (DDR4/2666), but it is not impossible that the frequency could be reduced. Recent Linux versions should support the uncore DRAM cycle counter. I used the following command:

perf stat -a -A -e uncore_imc_0/clockticks/ ./mlc --bandwidth_matrix

The results included
CPU0 102,693,084,167 uncore_imc_0/clockticks/
CPU24 102,659,835,631 uncore_imc_0/clockticks/

77.901356954 seconds time elapsed

Dividing the counts by the elapsed time gives 1,318,245,126.7, which is about 1% below the expected value of 1,333,333,333, so I conclude that my DRAMs are actually running at full speed. (This value can't change on a "live" system, so it only needs to be checked once....)
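The division above is easy to reproduce (the count and elapsed time are the values reported by perf in my run):

```python
# Derive the effective IMC clock from the perf counts above.
clockticks = 102_693_084_167   # uncore_imc_0/clockticks/ on CPU0
elapsed_s = 77.901356954       # elapsed time reported by perf
expected_hz = 1_333_333_333    # DDR4-2666 major clock (1.333 GHz)

measured_hz = clockticks / elapsed_s
print(f"measured: {measured_hz:,.0f} Hz "
      f"({100 * measured_hz / expected_hz:.1f}% of nominal)")
```

A result close to the nominal clock (as here, about 1% low) confirms the DIMMs are running at full speed; a reading near 1.2 GHz, for example, would indicate the bus had been downclocked to DDR4/2400.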

(2) The "perf stat" command can also report energy usage for the package and for the DRAMs, but I don't think that it has an interface to read the amount of time that the processor frequency was throttled due to package or DRAM power limitations. This information can be read from the RAPL MSRs using the /dev/cpu/*/msr devices. Details are in Section 14.9.5 of Volume 3 of the Intel Architectures Software Developer's Manual (document 325384-068, November 2018), and very useful example code is available at

"Dr. Bandwidth"
