Table of Contents
- H/W prefetcher control
- What does the tool measure
- How does it work
- Command line arguments
- Change Log
Vish Viswanathan, Karthik Kumar, Thomas Willhalm
An important factor in determining application performance is the time the application takes to fetch data from the processor’s cache hierarchy and from the memory subsystem. In a multi-socket system where Non-Uniform Memory Access (NUMA) is enabled, local and cross-socket memory latencies differ significantly. Besides latency, bandwidth (b/w) also plays a big role in determining performance. Measuring these latencies and b/w is therefore important, both to establish a baseline for the system under test and for performance analysis.
Intel® Memory Latency Checker (Intel® MLC) is a tool used to measure memory latencies and b/w, and how they change with increasing load on the system. It also provides several options for more fine-grained investigation where b/w and latencies from a specific set of cores to caches or memory can be measured as well.
Intel® MLC supports both Linux and Windows.
- Copy the mlc binary to any directory on your system
- Intel® MLC dynamically links to the GNU C library (glibc/lpthread), which must be present on the system
- Root privileges are required to run this tool, because it modifies the H/W prefetch control MSR to enable/disable prefetchers for latency and b/w measurements
- The MSR driver (not part of the download) must be loaded. If it is not already present, this can typically be done with the 'modprobe msr' command.
- Copy mlc.exe and mlcdrv.sys driver to the same directory. The mlcdrv.sys driver is used to modify the h/w prefetcher settings
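The Linux prerequisites listed above can be checked before launching the tool. The following is a minimal sketch, not part of Intel® MLC: the helper names `msr_device_path` and `check_linux_prerequisites` are hypothetical, and the check assumes the Linux msr driver exposes one device node per CPU under /dev/cpu/<n>/msr.

```python
import os


def msr_device_path(cpu: int = 0) -> str:
    """Path of the device node the Linux msr driver creates for one CPU."""
    return f"/dev/cpu/{cpu}/msr"


def check_linux_prerequisites(cpu: int = 0) -> list[str]:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # MSR access needs root; os.geteuid is only available on Unix-like systems.
    if not hasattr(os, "geteuid") or os.geteuid() != 0:
        problems.append("not running as root")
    # The device node is missing when the msr driver is not loaded.
    if not os.path.exists(msr_device_path(cpu)):
        problems.append("msr driver not loaded (try 'modprobe msr')")
    return problems


if __name__ == "__main__":
    issues = check_linux_prerequisites()
    for issue in issues:
        print("problem:", issue)
    if not issues:
        print("prerequisites look satisfied")
```

An empty result only indicates that the process is root and the device node exists; it does not verify that the processor supports the prefetcher control MSR.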
It is challenging to accurately measure memory latencies on modern Intel processors because they have sophisticated h/w prefetchers. Intel® MLC automatically disables these prefetchers while measuring latencies and restores them to their previous state on completion. The prefetcher control is exposed through an MSR (https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors), and MSR access requires root-level permission. Intel® MLC therefore needs to be run as ‘root’ on Linux. On Windows, we have provided a signed driver that is used for this MSR access. If Intel® MLC can’t be run with root permissions, please consult the readme.pdf that can be found in the download package.
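To make the prefetcher control concrete, here is a hedged sketch of reading and decoding such an MSR on Linux. It is not how Intel® MLC is implemented. The MSR address 0x1A4 and the four bit assignments are assumptions taken from the disclosure article linked above, which applies only to some Intel processors; in that layout a set bit means the prefetcher is disabled. The function names are hypothetical.

```python
import os
import struct

# Assumption: address and bit layout as described in Intel's disclosure
# article for some processors; other processors may differ.
PREFETCH_MSR = 0x1A4
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU hardware prefetcher",
    3: "DCU IP prefetcher",
}


def decode_prefetchers(value: int) -> dict[str, bool]:
    """Map each prefetcher name to True if enabled (bit clear), False if disabled."""
    return {
        name: not ((value >> bit) & 1)
        for bit, name in PREFETCHER_BITS.items()
    }


def read_msr(cpu: int, address: int) -> int:
    """Read one 64-bit MSR through the Linux msr driver (requires root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR address.
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    try:
        state = decode_prefetchers(read_msr(0, PREFETCH_MSR))
    except OSError as exc:
        print("could not read MSR:", exc)
    else:
        for name, enabled in state.items():
            print(f"{name}: {'enabled' if enabled else 'disabled'}")
```

The sketch only reads the register. Writing it changes system-wide behaviour, which is why the tool restores the previous state when it finishes.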
When the tool is launched without any argument, it automatically identifies the system topology and measures the following four types of information. A screen shot is shown for each.
1. A matrix of idle memory latencies for requests originating from each of the sockets and addressed to each of the available sockets
2. Peak memory b/w measured (assuming all accesses are to local memory) for requests with varying amounts of reads and writes
3. A matrix of memory b/w values for requests originating from each of the sockets and addressed to each of the available sockets
4. Latencies at different b/w points
Intel® MLC also provides command line arguments for fine-grained control over which latencies and b/w are measured.
Here are some of the things that are possible with command line arguments:
- Measure latencies for requests addressed to a specific memory controller from a specific core
- Measure cache latencies
- Measure b/w from a subset of the cores/sockets
- Measure b/w for different read/write ratios
- Measure latencies for random address patterns instead of sequential
- Change stride size for latency measurements
One of the main features of Intel® MLC is measuring how latency changes as b/w demand increases. To facilitate this, it creates as many threads as there are logical CPUs minus 1. These threads, henceforth referred to as load-generation threads, exist to generate as many memory references as possible. While the system is loaded like this, the one remaining CPU (not used for load generation) runs a thread that measures the latency. This thread, known as the latency thread, issues dependent reads: it traverses an array of pointers in which each pointer points to the next one, so each read cannot start until the previous one completes. The average time taken per read gives the latency, which varies with the load generated by the load-generation threads. Every few seconds, the load-generation threads automatically throttle the load they generate by injecting delays, so latency is measured under a range of load conditions.
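The dependent-read idea used by the latency thread can be sketched as a pointer chase. This is an illustration of the technique only, not Intel® MLC's code: all function names are hypothetical, array indices stand in for pointers, and the chain is randomized with Sattolo's algorithm (a choice made for this sketch) so that it forms one cycle visiting every element. Python interpreter overhead dominates the timing, so the printed number is not a hardware memory latency.

```python
import random
import time


def build_chain(n: int, seed: int = 0) -> list[int]:
    """Build a random single-cycle permutation (Sattolo's algorithm).

    chain[i] holds the index of the element to visit after element i.
    """
    rng = random.Random(seed)
    chain = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def chase(chain: list[int], hops: int, start: int = 0) -> int:
    """Follow the chain; each read depends on the result of the previous one."""
    idx = start
    for _ in range(hops):
        idx = chain[idx]
    return idx


def average_hop_time_ns(chain: list[int], hops: int) -> float:
    """Average wall-clock time per dependent read, in nanoseconds."""
    t0 = time.perf_counter_ns()
    chase(chain, hops)
    return (time.perf_counter_ns() - t0) / hops


if __name__ == "__main__":
    chain = build_chain(1_000_000)
    print(f"average time per hop: {average_hop_time_ns(chain, 1_000_000):.1f} ns")
```

Because the next index is unknown until the current read returns, the reads cannot be overlapped, which is what makes the average time per hop a latency rather than a throughput figure.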
Launching Intel® MLC without any parameters measures several things as stated earlier. However, with command line arguments, each of the following specific actions can be performed in sequence:
- Prints a matrix of local and cross-socket memory latencies
- Prints a matrix of local and cross-socket memory b/w
- Prints peak memory b/w for various read-write ratios with all local accesses
- Prints the idle memory latency of the platform
- Prints the loaded memory latency of the platform
- Does not modify prefetcher settings
There are more options for each of the commands above. They are documented in more detail in the readme file included in the download package.
- Initial release
- Support for b/w and loaded latencies added
- Launch 'spinner' threads on the remote node to better measure remote memory b/w
- Automatically disable NUMA balancing support (if present) to measure accurate remote memory latencies
- Fixed a bug in topology detection where certain kernels numbered the CPUs differently. In those cases, consecutive CPU numbers were assigned to the same physical core (for example, CPUs 0 and 1 are both on physical core 0)
- Support for Windows O/S
- Support for single socket (E3 processor line)
- Support for turning off automatic prefetcher control
Both Linux and Windows versions of Intel® MLC are included in the download.