Intel® Memory Latency Checker v3.3


Vish Viswanathan, Karthik Kumar, Thomas Willhalm, Patrick Lu, Blazej Filipiak



An important factor in determining application performance is the time required for the application to fetch data from the processor’s cache hierarchy and from the memory subsystem. In a multi-socket system where Non-Uniform Memory Access (NUMA) is enabled, local memory latencies and cross-socket memory latencies will vary significantly. Besides latency, bandwidth (b/w) also plays a big role in determining performance. So, measuring these latencies and b/w is important to establish a baseline for the system under test, and for performance analysis.

Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.



Intel® MLC supports both Linux and Windows.


Linux:

  • Copy the mlc binary to any directory on your system
  • Intel® MLC dynamically links to the GNU C library (glibc/libpthread), which must be present on the system
  • Root privileges are required to run this tool, as it modifies the H/W prefetch control MSR to enable/disable prefetchers for latency and b/w measurements. Refer to the readme documentation for running without root privileges
  • The MSR driver (not part of the install package) must be loaded. This can typically be done with the 'modprobe msr' command if it is not already loaded.
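On a typical Linux system, the steps above boil down to a few commands (the install path is illustrative, not required by the tool):

```shell
# Load the MSR driver so MLC can toggle the h/w prefetchers (root required)
modprobe msr

# Place the binary anywhere convenient and make it executable
cp mlc /usr/local/bin/mlc
chmod +x /usr/local/bin/mlc

# Without arguments, MLC runs the full default set of measurements
sudo mlc
```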


Windows:

  • Copy mlc.exe and the mlcdrv.sys driver to the same directory. The mlcdrv.sys driver is used to modify the h/w prefetcher settings

There are two sets of binaries (mlc and mlc_avx512). mlc_avx512 is compiled with a newer tool chain to support Intel® AVX-512 instructions; the other binary, mlc, supports SSE2 and AVX2 instructions. The mlc_avx512 binary is a superset of the mlc binary in that it supports SSE2/AVX2 as well, so mlc_avx512 can also be run on processors without AVX-512 support. By default, AVX-512 instructions are not used, whether or not the processor supports them, unless the -Z argument is specified. We recommend starting with mlc_avx512; if your system does not have a newer version of glibc, you can fall back to the mlc binary.


HW Prefetcher Control

It is challenging to accurately measure memory latencies on modern Intel processors as they have sophisticated h/w prefetchers. Intel® MLC automatically disables these prefetchers while measuring the latencies and restores them to their previous state on completion. The prefetcher control is exposed through an MSR (Model Specific Register), and MSR access requires root-level permission. So, Intel® MLC needs to be run as 'root' on Linux. On Windows, we have provided a signed driver that is used for this MSR access. If Intel® MLC can't be run with root permissions, please consult the readme.pdf that can be found in the download package.
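As a concrete illustration (not something you need to do by hand): on many recent Intel cores the prefetchers are controlled through MSR 0x1A4, whose low four bits gate the L2 and DCU prefetchers; the address and bit layout are documented separately by Intel and can differ across products. With the msr-tools package, the setting can be inspected and changed manually, which is essentially what Intel® MLC automates and undoes on exit:

```shell
# Read the prefetcher-control MSR on cpu 0 (requires root and the msr module)
rdmsr -p 0 0x1a4

# Set the low four bits to disable the L2/DCU prefetchers, then restore.
# MLC does the equivalent around its latency runs and restores the
# original value when it completes.
wrmsr -p 0 0x1a4 0xf
wrmsr -p 0 0x1a4 0x0
```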


What does the tool measure

When the tool is launched without any argument, it automatically identifies the system topology and measures the following four types of information. A screen shot is shown for each.

1. A matrix of idle memory latencies for requests originating from each of the sockets and addressed to each of the available sockets

2. Peak memory b/w measured (assuming all accesses are to local memory) for requests with varying amounts of reads and writes 

3. A matrix of memory b/w values for requests originating from each of the sockets and addressed to each of the available sockets

4. Latencies at different b/w points

It also measures cache-to-cache data transfer latencies.

Intel® MLC also provides command line arguments for fine grained control over latencies and b/w that are measured.

Here are some of the things that are possible with command line arguments:

  • Measure latencies for requests addressed to a specific memory controller from a specific core
  • Measure cache latencies
  • Measure b/w from a subset of the cores/sockets
  • Measure b/w for different read/write ratios
  • Measure latencies for random address patterns instead of sequential
  • Change stride size for latency measurements
  • Measure cache-to-cache data transfer latencies


How does it work

One of the main features of Intel® MLC is measuring how latency changes as b/w demand increases. To facilitate this, it creates several threads, where the number of threads matches the number of logical CPUs minus 1. These threads are used to generate the load (henceforth, these threads will be referred to as load-generation threads). The primary purpose of the load-generation threads is to generate as many memory references as possible.

While the system is loaded like this, the remaining one CPU (that is not being used for load generation) runs a thread that is used to measure the latency. This thread is known as the latency thread and issues dependent reads. Basically, this thread traverses an array of pointers where each pointer is pointing to the next one, thereby creating a dependency in reads. The average time taken for each of these reads provides the latency. Depending on the load generated by the load-generation threads, this latency will vary.

Every few seconds the load-generation threads automatically throttle the load generated by injecting delays, thus measuring the latency under various load conditions. Please refer to the readme file in the package that you download for more details.


Command line arguments

Launching Intel® MLC without any parameters measures several things as stated earlier. However, with command line arguments, each of the following specific measurements can be performed individually:

mlc --latency_matrix

      prints a matrix of local and cross-socket memory latencies

mlc --bandwidth_matrix

      prints a matrix of local and cross-socket memory b/w

mlc --peak_bandwidth

      prints peak memory b/w for various read-write ratios with all local accesses

mlc --idle_latency            

      prints the idle memory latency of the platform

mlc --loaded_latency          

      prints the loaded memory latency of the platform

mlc --c2c_latency

      prints the cache-to-cache transfer latencies of the platform

mlc -e          

     do not modify prefetcher settings

There are more options for each of the commands above. Those are documented in more detail in the readme file included in the download.
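For example, a quick baseline of a new system might run a few of the invocations above in sequence (only flags shown here are used; exact output varies by version):

```shell
# Idle latency and peak b/w establish the baseline
sudo ./mlc --idle_latency
sudo ./mlc --peak_bandwidth

# Local vs. cross-socket matrices, then the latency-under-load sweep
sudo ./mlc --latency_matrix
sudo ./mlc --loaded_latency

# Repeat idle latency without touching the prefetchers for comparison;
# the numbers will then include prefetch effects
sudo ./mlc -e --idle_latency
```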


Change Log

Version 1.0

  • Initial release 

Version 2.0

  • Support for b/w and loaded latencies added

Version 2.1

  • Launch 'spinner' threads on remote node for measuring better remote memory b/w
  • Automatically disable numa balancing support (if present) to measure accurate remote memory latencies

Version 2.2

  • Fixed a bug in topology detection where certain kernels numbered the CPUs differently. In those cases, consecutive CPU numbers were assigned to the same physical core (e.g., cpus 0 and 1 both on physical core 0)

Version 2.3

  • Support for Windows O/S
  • Support for single socket (E3 processor line)
  • Support for turning off automatic prefetcher control

Version 3.0

  • Support for client processors like Haswell and Skylake
  • Allocate memory based on NUMA topology. This allows Intel® MLC to measure latencies on all the NUMA nodes of a processor like Haswell that supports the Cluster-on-Die configuration, where there are 4 NUMA nodes on a 2-socket system. Latencies can also be measured to NUMA nodes that have only memory resources and no compute resources
  • Support for measuring latencies and bandwidth to persistent memory
  • Options to use 256-bit and 512-bit loads and stores in generating bandwidth traffic
  • Support for measuring cache-to-cache data transfer latencies
  • Control several parameters like read/write ratios, size of buffer allocated, numa node to allocate memory etc on a per-thread basis

Version 3.1

  • Support for Skylake Server

Version 3.1a

  • Fixed an issue where MLC failed on some guest VMs

Version 3.3

  • Several fixes for measuring latencies and b/w on Skylake server



Both Linux and Windows versions of Intel® MLC are included in the download.

The downloads are available under the Basic Proprietary Commercial License.


Colin:


Is MLC only for CPU memory performance? Or, for Skylake can it also profile on-chip Intel Processor Graphics (e.g. HD 530, P530) memory subsystem?

If not, what is best/recommended way to determine peak achievable memory bandwidth values for the Intel Processor Graphics (GPU), and evaluate cache performance?

Is the best/only option VTune Amplifier XE (2017)?  Thank you, Colin

Karthik Kumar (Intel):

@Andrea P. can you please share how you are emulating persistent memory (as a numa node or as /dev/pmem?)... also how are you pointing MLC to use this emulated persistent memory for measurements?

Andrea P.:

FYI, when I emulate persistent memory (pmem) on my Linux machine, mlc exits immediately without running any tests. When I attach gdb to check what crashes, I get this:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/".
Intel(R) Memory Latency Checker - v3.1a
Command line parameters: --loaded_latency

Program received signal SIGSEGV, Segmentation fault.
0x000000000040dbf9 in ?? ()
(gdb) bt
#0  0x000000000040dbf9 in ?? ()
#1  0x0000000000401b7a in ?? ()
#2  0x00007ffff7811a40 in __libc_start_main (main=0x4018b0, argc=2,
    argv=0x7fffffffe5f8, init=<optimized out>, fini=<optimized out>,
    rtld_fini=<optimized out>, stack_end=0x7fffffffe5e8) at libc-start.c:289
#3  0x0000000000403749 in ?? ()


Karthik Kumar (Intel):

@Kin C: could you please share what version of Linux you are using... can you give us the output of "mlc -v" and "numactl --hardware"? Thanks

Kin C.:


mlc is Exiting on our server.  I'm using mlcv3.1a.tgz.

Intel(R) Memory Latency Checker - v3.1a


With strace, the exit appears to be caused by a SIGSEGV -- here's a snippet:

open("/sys/devices/system/node/node3/cpumap", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa7b20f000
read(3, "00000000,00000000,00000000,00000"..., 4096) = 252
close(3)                                = 0
munmap(0x7ffa7b20f000, 4096)            = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
rt_sigaction(SIGSEGV, {SIG_IGN, [SEGV], SA_RESTORER|SA_RESTART, 0x7ffa7aa4b670}, {0x40caa0, [SEGV], SA_RESTORER|SA_RESTART, 0x7ffa7aa4b670}, 8) = 0
open("/proc/sys/kernel/numa_balancing_scan_delay_ms", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa7b20f000
lseek(3, 0, SEEK_SET)                   = 0
write(3, "1000", 4)                     = 4
close(3)                                = 0
munmap(0x7ffa7b20f000, 4096)            = 0
write(2, "\nExiting...\n\n\n", 14

Please let me know your suggestion.



Frank D.:

I've tried it on the same infrastructure running a VM with CentOS 7, same problem.  Exiting...

Frank D.:

Thanks Vish,

Unfortunately, the 3.1a did not solve the "exiting" problem.

I've tried mlc.exe and mlc_avx512.exe on a Windows 2012 R2 on an Ivy Bridge 1650 and a Broadwell 2630

Both running vSphere 6.0 update 2. Broadwell 2630 in a cluster running EVC Ivy Bridge baseline. Ivy Bridge machine running without EVC baseline in other clusters.


Vish Viswanathan (Intel):

@Frank D,

Please download the new version (MLC 3.1a) just released. We have a fix for an issue similar to what you have reported. Hope that works for you too.



Frank D.:

Hi, similar to @miky, MLC prints the "Exiting" statement when running on a Windows 2012 server inside a VM within a vSphere 6.0 (update 2) environment.

I've tried multiple vCPU configurations (Wide and Narrow) and checked the BIOS settings ensuring all virtualization features are enabled. Is there any setting I'm overlooking? By reviewing the other comments it appears people are successful running MLC inside a VM. Are these vSphere-based VMs?


The system I'm running contains two E5-2630-v4 processors. Unlike the previous messages related to the exit statement, MLC does not throw any errors (such as unsupported CPU); it just states "Exiting" and returns you to the command prompt. Executing mlc with an invalid argument displays the help content; however, when running, for example, mlc --latency_matrix, it just states "Exiting..."

I'm using Intel Memory Checker v3.1


Jaeyong C.:

Hi, I'm a System x L2 engineer at IBM Korea.

I need an official answer from Intel.

Please help us.

[CPU Spec]
- Intel(R) Xeon(R) CPU E7-8890 v2 @ 2.80GHz

- When customers use the MLC tool, they sometimes get results with abnormally low latency & bandwidth values. This could not be fixed by a reboot; it could only be fixed by reseating.

1. Does the bandwidth reported by the MLC tool correspond to the physical bandwidth?

2. If yes, would Intel accept an MLC tool result as evidence of a hardware problem?

3. What kinds of factors can affect bandwidth?

