Monitoring Integrated Memory Controller Requests in the 2nd, 3rd, 4th, 5th, 6th generation Intel® Core™ processors

Authors: Roman Dementiev and Angela D. Schmid

Dear Software Tuning, Performance Optimization & Platform Monitoring community,

The recent and upcoming 2nd, 3rd, 4th, 5th, and 6th generation Intel® Core™ processors (previously codenamed Sandy Bridge, Ivy Bridge, Haswell, Broadwell, and Skylake) expose model-specific counters that allow monitoring of requests to DRAM.

The counters employ circuitry residing in the memory controller and monitor transaction requests coming from various sources, e.g. the processor cores, the graphics engine, or other I/O agents. The monitoring interface uses memory-mapped I/O reads from physical memory at the offsets specified in Table 1. Memory traffic metrics can be derived as follows (a brief sketch of the calculation follows the list):

  • Data read from DRAM in number of bytes:   UNC_IMC_DRAM_DATA_READS*64
  • Data written to DRAM in number of bytes:   UNC_IMC_DRAM_DATA_WRITES*64
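For illustration, here is a minimal C++ sketch of that calculation. The sample counter values are made up; on real hardware they would be read from the MMIO offsets listed in Table 1, and the counter width is assumed to be 32 bits here.

#include <cstdint>
#include <cstdio>

int main()
{
    // Hypothetical samples of UNC_IMC_DRAM_DATA_READS/WRITES taken before and
    // after a workload; real values come from the MMIO offsets in Table 1.
    const uint32_t reads_before  = 100000,  reads_after  = 1700000;
    const uint32_t writes_before =  50000,  writes_after =  850000;

    // Unsigned subtraction also handles a possible counter wrap-around.
    const uint32_t read_delta  = reads_after  - reads_before;
    const uint32_t write_delta = writes_after - writes_before;

    // Each counted RdCAS/WrCAS corresponds to one 64-byte cache-line transfer.
    const uint64_t bytes_read    = static_cast<uint64_t>(read_delta)  * 64;
    const uint64_t bytes_written = static_cast<uint64_t>(write_delta) * 64;

    std::printf("Data read from DRAM:  %llu bytes\n",
                static_cast<unsigned long long>(bytes_read));
    std::printf("Data written to DRAM: %llu bytes\n",
                static_cast<unsigned long long>(bytes_written));
    return 0;
}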

Users and developers may take advantage of Intel tools, such as Intel® Performance Counter Monitor (PCM), to easily access the counters or derived memory performance metrics.

Table 1. Addresses of DRAM Counters.

The DRAM counters below are model specific, meaning they may change or may not be supported in future processors. The BAR is available (in PCI configuration space) at Bus 0; Device 0; Function 0; Offset 048H; a sketch of reading it on Linux follows the table.

  • UNC_IMC_DRAM_GT_REQUESTS (BAR + 0x5040): Counts every read/write request entering the Memory Controller to DRAM (sum of all channels) from the GT engine. Each partial write request counts as a request incrementing this counter; however, same-cache-line partial write requests are combined into a single 64-byte data transfer from DRAM. Therefore, multiplying the number of requests by 64 bytes will lead to an inaccurate GT memory bandwidth; the inaccuracy is proportional to the number of same-cache-line partial writes combined.
  • UNC_IMC_DRAM_IA_REQUESTS (BAR + 0x5044): Counts every read/write request (demand and HW prefetch) entering the Memory Controller to DRAM (sum of all channels) from IA. Each partial write request counts as a request incrementing this counter; however, same-cache-line partial write requests are combined into a single 64-byte data transfer from DRAM. Therefore, multiplying the number of requests by 64 bytes will lead to an inaccurate IA memory bandwidth; the inaccuracy is proportional to the number of same-cache-line partial writes combined.
  • UNC_IMC_DRAM_IO_REQUESTS (BAR + 0x5048): Counts every read/write request entering the Memory Controller to DRAM (sum of all channels) from all IO sources (e.g. PCIe, Display Engine, USB audio, etc.). Each partial write request counts as a request incrementing this counter; however, same-cache-line partial write requests are combined into a single 64-byte data transfer from DRAM. Therefore, multiplying the number of requests by 64 bytes will lead to an inaccurate IO memory bandwidth; the inaccuracy is proportional to the number of same-cache-line partial writes combined.
  • UNC_IMC_DRAM_DATA_READS (BAR + 0x5050): Counts every read (RdCAS) issued by the Memory Controller to DRAM (sum of all channels). All requests result in 64-byte data transfers from DRAM. Use this counter for accurate memory bandwidth calculations.
  • UNC_IMC_DRAM_DATA_WRITES (BAR + 0x5054): Counts every write (WrCAS) issued by the Memory Controller to DRAM (sum of all channels). All requests result in 64-byte data transfers to DRAM. Use this counter for accurate memory bandwidth calculations.
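As a rough illustration of locating the counters manually on Linux, the following sketch reads the BAR from the Bus 0 / Device 0 / Function 0 configuration space at offset 0x48 through sysfs. It is only a sketch: error handling is minimal, root privileges are required, and the exact layout of the control bits in the low part of the register is documented in the processor datasheet (they are simply masked off here).

#include <cstdint>
#include <cstdio>

int main()
{
    // PCI configuration space of bus 0, device 0, function 0 via Linux sysfs.
    // Reading beyond the first 64 bytes of config space requires root.
    FILE *cfg = std::fopen("/sys/bus/pci/devices/0000:00:00.0/config", "rb");
    if (!cfg) { std::perror("open PCI config space"); return 1; }

    uint64_t bar = 0;
    if (std::fseek(cfg, 0x48, SEEK_SET) != 0 ||
        std::fread(&bar, sizeof(bar), 1, cfg) != 1) {
        std::perror("read BAR");
        std::fclose(cfg);
        return 1;
    }
    std::fclose(cfg);

    // Mask off the low control/enable bits (page-aligning the address) to get
    // the physical base of the memory-mapped counter region.
    bar &= ~uint64_t(0xFFF);
    std::printf("iMC BAR = 0x%llx\n", static_cast<unsigned long long>(bar));
    return 0;
}

Equivalently, "sudo setpci -s 00:00.0 48.L" prints the low 32 bits of the same register from the command line.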

 

Regards,
Roman Dementiev
Staff Application Engineer
Intel Corporation

Angela D. Schmid
Performance Engineer
Intel Corporation


9 comments

Zeke W.:

I want to measure the memory read/write bytes on an Intel i7-5960X. I still have no idea how to do this with PCM 2.11. Does it mean I should modify these lines in cpucounters.cpp:

void ServerPCICFGUncore::program()
{
    uint32 MCCntConfig[4] = {
        MC_CH_PCI_PMON_CTL_EVENT(0x04) + MC_CH_PCI_PMON_CTL_UMASK(3),  
        MC_CH_PCI_PMON_CTL_EVENT(0x04) + MC_CH_PCI_PMON_CTL_UMASK(12),    
    };

to 

    uint32 MCCntConfig[4] = {
        MC_CH_PCI_PMON_CTL_EVENT(0x50) + MC_CH_PCI_PMON_CTL_UMASK(0x50),  
        MC_CH_PCI_PMON_CTL_EVENT(0x50) + MC_CH_PCI_PMON_CTL_UMASK(0x54),    
    };

Thanks. 

 

Qi Z.:

Hi,

I am really interested in the statement "I was able to write a simple program that opens /dev/mem, mmaps the BAR, reads the DRAM_DATA_READS and DRAM_DATA_WRITES values, and prints them to stdout."

But I am not an expert on the PCI configuration space, and I don't know how to get the value of the BAR. Can anybody provide more details about that? Some sample code would be a great help.

Best.

-----------------------------------------------------------------------------------------------------------

The BAR is available (in PCI configuration space) at Bus 0; Device 0; Function 0; Offset 048H.

Does it mean that I can get the value of the BAR by typing "sudo setpci -s 00:00.0 48.L"?

Thomas R.:

Hi,

I'm not a VTune user, but I suspect that the limit of 4 comes from the core-local general-purpose counter registers (4 with SMT enabled, 8 without SMT). Since the memory controllers are in the Uncore, there should be a way to program them alongside the core-local counters, but I don't know how this is done in VTune.

Writing your own tool is tedious. If you are in the Linux world, you can try LIKWID, PAPI, or perf. On the Windows side, I would try Intel PCM to measure the memory events. You may have to change the event that is commonly measured on the memory controllers to the event you want.

Ming L.:

Hi Thomas,

Thanks for your reply. I have used VTune to obtain these counters. It seems that I can only read at most 4 counters at the same time. Is this a hardware limit, or should I write my own tool to read the values?

Thomas R.:

(1) The iMC performance counters are not MSRs (model-specific registers). They are located in the PCI address space and refer to a socket, not a core. Almost all units in the Uncore refer to a socket. If you are interested in per-core memory counts, you have to use the OFFCORE_RESPONSE events.

(2) According to the Uncore performance monitoring manual for Intel Haswell, you can have up to 16 banks, but one memory module can consist of 1, 2, or 4 banks. How many modules are supported depends on your SKU. The memory controller counts the number of accesses to a specific rank and bank; it does not care about different memory DIMMs.

Ming L.:

Thanks, very useful. I have two questions:

(1) I think these counters are per-core, right?

(2) As shown in VTune, these counters (e.g. UNC_M_RD_CAS_RANK0.BANK1) can show the detailed number of accesses per rank and bank. How many DIMMs connect to one memory controller? Can the memory controller differentiate between different memory DIMMs?

 

Mohammad Hossein H.:

Hi,

I have a general question and would appreciate it if anybody can help out. Is there any way I can gather the memory references (physical memory addresses) using these counters?

Thanks,

Mohammad

Alun E.:

IntelPerformanceCounterMonitorV2.5.1/client_bw.h:

#define PCM_CLIENT_IMC_BAR_OFFSET (0x0048)
#define PCM_CLIENT_IMC_DRAM_DATA_READS (0x5050)
#define PCM_CLIENT_IMC_DRAM_DATA_WRITES (0x5054)
#define PCM_CLIENT_IMC_MMAP_SIZE (0x6000)

McCalpin, John:

Thanks! This sort of information is very helpful for cases where I need to build my own tools.
I was able to write a simple program that opens /dev/mem, mmaps the BAR, reads the DRAM_DATA_READS and DRAM_DATA_WRITES values, and prints them to stdout.
Tests with the STREAM benchmark show that the results are accurate, almost always within 0.8% of the expected values. For large array sizes, DRAM read values are typically elevated another ~2% due to TLB reloads that miss in the cache. This is a reasonable value, and certainly consistent with prior experience -- though random page coloring makes it difficult to predict in detail for 4KiB pages.
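A minimal sketch of the approach described above (this is not the original program; the BAR value below is a placeholder to be replaced with the value read from PCI configuration space, and the offsets match Table 1 and client_bw.h):

#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    // Placeholder: substitute the BAR value obtained from PCI config space
    // (Bus 0; Device 0; Function 0; Offset 0x48).
    const uint64_t bar      = 0xfed10000ULL;
    const size_t   map_size = 0x6000;        // PCM_CLIENT_IMC_MMAP_SIZE

    // Map the counter region through /dev/mem; requires root and a kernel
    // that allows /dev/mem access to this physical range.
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { std::perror("open /dev/mem"); return 1; }

    void *base = mmap(nullptr, map_size, PROT_READ, MAP_SHARED, fd,
                      static_cast<off_t>(bar));
    if (base == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    volatile const uint32_t *regs =
        static_cast<volatile const uint32_t *>(base);
    const uint32_t reads  = regs[0x5050 / 4];  // UNC_IMC_DRAM_DATA_READS
    const uint32_t writes = regs[0x5054 / 4];  // UNC_IMC_DRAM_DATA_WRITES

    std::printf("DRAM_DATA_READS  = %u (%llu bytes)\n", reads,
                static_cast<unsigned long long>(reads) * 64);
    std::printf("DRAM_DATA_WRITES = %u (%llu bytes)\n", writes,
                static_cast<unsigned long long>(writes) * 64);

    munmap(base, map_size);
    close(fd);
    return 0;
}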

