KNL - CHA addresses

Hello all,

My question is quite simple: is it possible to know which addresses are hashed to a certain CHA in KNL? I have been looking for the function but have not been able to find it, so I guess it is undocumented.

Best regards,
Marcos HV

is it possible to know what addresses are hashed to a certain CHA in KNL? 

The answer is "No". That hash is not documented anywhere and is not available even to those of us inside Intel :-)

Note, too, that it is potentially different on every KNL, since which physical CHAs are available will depend on which tiles have not been fused out as a result of manufacturing flaws. Therefore the only approach that seems sane if you're trying to place specific data to optimize communication between pairs of tiles is to run a small benchmark to determine the latency for communication of different cache-lines and then use a good one.  However, my measurements on KNL showed the difference to be small anyway, and therefore the added complexity (and startup time cost of doing the measurement) didn't seem worthwhile.

I noticed that on KNL the MSR interfaces for all 38 CHAs are readable, but that they are reordered so that the CHAs corresponding to inactive tiles are moved to the end of the list.  At least that is what it looks like based on my performance counter measurements.  We have Xeon Phi 7250 processors with 34 active tiles, and no matter which core APIC IDs are disabled, the four CHAs that give anomalously low counts are always numbers 34, 35, 36, and 37.

It should be possible to design a microbenchmark suite that allows one to figure out where the CHAs are physically located, but it would have to be run on every KNL chip.   I started working on it, but got bored and dropped the project....

I would still like to know how the boxes are laid out on the various chips -- it is fairly pointless to provide mesh traffic counters if you have no idea where any of the mesh links are connected....

"Dr. Bandwidth"

Quote:

McCalpin, John wrote:

I noticed that on KNL the MSR interfaces for all 38 CHAs are readable, but that they are reordered so that the CHAs corresponding to inactive tiles are moved to the end of the list.  At least that is what it looks like based on my performance counter measurements.  We have Xeon Phi 7250 processors with 34 active tiles, and no matter which core APIC IDs are disabled, the four CHAs that give anomalously low counts are always numbers 34, 35, 36, and 37.

Would it be possible to measure the number of accesses to a given CHA by using the MSR interfaces programmatically? I have read in https://software.intel.com/es-es/forums/software-tuning-performance-optimization-platform-monitoring/topic/700119 that you were discussing how to measure some uncore events. What would the code for measuring (if possible) the number of accesses to a given CHA look like? I have almost no experience with perf.

Thanks for your time

It is possible to measure all sorts of stuff at the CHAs -- there are 45 pages of performance counter events for the CHA/CMS block described in Chapters 5-6 of the Xeon Phi Processor Performance Monitoring Reference Manual -- Volume 2: Events (document 334480-002, March 2017).

Figuring out what those events mean is, unfortunately, left as an exercise to the reader.

There are plenty of clues in the documents, but it takes a lot of work to turn those clues into a useful level of understanding....

The query "number of accesses to a given CHA" is not sufficiently precise to be directly answered.  The mesh network that connects the CHAs has four "rings":

  1. AD ring -- read/write requests from the tiles and snoops from the memory controllers
  2. AK ring -- ACKs between tiles and memory controllers, and snoop responses from cores to CHAs
  3. BL ring -- data transfer (cache lines)
  4. IV ring -- snoop requests from CHAs to cores

If you want to know which CHA is responsible for handling coherence for a specific physical address, then you probably want to look at an event like RxR_INSERTS.IRQ.  The IRQ is the queue in the CHA that receives incoming requests from cores for the CHA to process.

My methodology for KNL would be very similar to what I did on Haswell (Xeon E5 v3), which I describe at https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-ar...
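
As a rough sketch of the last step of that kind of experiment (not the code from the linked post): suppose an event such as RxR_INSERTS.IRQ has been programmed on every CHA, all the counters have been read, a single cache line has been loaded and flushed many times, and the counters have been read again. The CHA whose count grew the most is the likely home of that line. The function name busiest_cha and the load-and-flush loop are assumptions made for illustration, and the counter programming and reads are left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given per-CHA counts of one event, sampled before and after hammering a
 * single cache line, return the index of the CHA whose count grew the most.
 * Returns -1 if no counter moved. Assumes no counter wrapped in between.
 */
int busiest_cha(const uint64_t *before, const uint64_t *after, size_t n_cha)
{
    int best = -1;
    uint64_t best_delta = 0;

    for (size_t i = 0; i < n_cha; i++) {
        uint64_t delta = after[i] - before[i];
        if (delta > best_delta) {
            best_delta = delta;
            best = (int)i;
        }
    }
    return best;
}
```

Background traffic also lands on every CHA, so the loop count needs to be large enough that the target CHA clearly stands out from the rest.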

 

 

"Dr. Bandwidth"

Hello McCalpin,

Your responses are very detailed, thank you for your effort. Nonetheless I have a question: to read the values you are talking about (such as RxR_INSERTS.IRQ), do you use a tool like perf, or do you need to use the RDPMC instruction? I am not very familiar with the techniques used to measure these values. Sorry for the ignorance and thanks again.

Kind regards,
Marcos HV

The uncore performance counters are accessed by different mechanisms than the core performance counters.

  • The core performance counters can be accessed using the RDPMC instruction, or (in the kernel) using the RDMSR instruction.

    • The user-level interface for MSR access in Linux systems is /dev/cpu/<n>/msr, where <n> is the core that you want to use to read the MSR.
    • For both RDPMC and RDMSR, you need to execute the instruction while running on the specific core for which you want the counter values.
      • With RDPMC you typically pin the process or thread to the desired core, so that all performance counter reads are on the same core.
      • With the /dev/cpu/<n>/msr interface, the kernel sets up an Inter-Processor Interrupt so that the RDMSR instruction will be executed by the target processor.
    • Most people use the "rdmsr.c" and "wrmsr.c" tools from msr-tools-1.3 as their starting point for learning to access MSRs using this interface.
  • Some of the uncore performance counters (including the CHA counters) are also accessed by MSRs.
    • Unlike the core counters, the MSRs used by the uncore performance counters can be accessed by any core on the chip.
    • The uncore counters only need to be read by one core per package -- any core reading the counters will get the same values.
  • Other uncore counters (such as the Memory Controller counters) are accessed through "PCI Configuration Space".
    • These can be accessed through command-line tools such as "lspci" and "setpci".
    • These counters also only need to be read by one core per package.
    • There are device drivers for these as well, but I always forget the specific path.  A reference that uses these drivers is https://github.com/TACC/tacc_stats
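
A minimal sketch of the /dev/cpu/<n>/msr read path described above, in the spirit of rdmsr.c but not copied from it. The device is read with pread(), using the MSR number as the file offset and an 8-byte buffer. The function names are made up for illustration, and the caller is assumed to have the privileges needed to open the device.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Read one 64-bit MSR through an msr device file. The MSR number is used
 * as the file offset. Returns 0 on success, -1 on failure.
 */
int read_msr_file(const char *path, uint32_t msr, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t got = pread(fd, value, sizeof(*value), (off_t)msr);
    close(fd);
    return got == (ssize_t)sizeof(*value) ? 0 : -1;
}

/* Convenience wrapper: read MSR `msr` as seen by logical CPU `cpu`. */
int read_msr(int cpu, uint32_t msr, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
    return read_msr_file(path, msr, value);
}
```

For the uncore MSRs any value of cpu within the package should do, while for core counters it has to be the core whose counter you want.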

The Linux "perf" subsystem might be able to read the counters -- it depends on the specific processor and the specific Linux distribution.  The best way to tell if your Linux distribution understands your processor's uncore counters is to look in /sys/bus/event_source/devices/.  On a system with good support for the uncore counters, you will see lots of devices with "uncore" in the name -- for example a 2-socket Xeon E5-2690 v3 running CentOS 7.3 shows good support:

$ ls -l /sys/bus/event_source/devices/
total 0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 breakpoint -> ../../../devices/breakpoint
lrwxrwxrwx 1 root root 0 Sep  5 11:30 cpu -> ../../../devices/cpu
lrwxrwxrwx 1 root root 0 Sep  5 11:30 intel_bts -> ../../../devices/intel_bts
lrwxrwxrwx 1 root root 0 Sep  5 11:30 intel_cqm -> ../../../devices/intel_cqm
lrwxrwxrwx 1 root root 0 Sep  5 11:30 power -> ../../../devices/power
lrwxrwxrwx 1 root root 0 Sep  5 11:30 software -> ../../../devices/software
lrwxrwxrwx 1 root root 0 Sep  5 11:30 tracepoint -> ../../../devices/tracepoint
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_0 -> ../../../devices/uncore_cbox_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_1 -> ../../../devices/uncore_cbox_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_10 -> ../../../devices/uncore_cbox_10
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_11 -> ../../../devices/uncore_cbox_11
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_2 -> ../../../devices/uncore_cbox_2
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_3 -> ../../../devices/uncore_cbox_3
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_4 -> ../../../devices/uncore_cbox_4
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_5 -> ../../../devices/uncore_cbox_5
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_6 -> ../../../devices/uncore_cbox_6
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_7 -> ../../../devices/uncore_cbox_7
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_8 -> ../../../devices/uncore_cbox_8
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_cbox_9 -> ../../../devices/uncore_cbox_9
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_ha_0 -> ../../../devices/uncore_ha_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_ha_1 -> ../../../devices/uncore_ha_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_0 -> ../../../devices/uncore_imc_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_1 -> ../../../devices/uncore_imc_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_2 -> ../../../devices/uncore_imc_2
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_3 -> ../../../devices/uncore_imc_3
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_4 -> ../../../devices/uncore_imc_4
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_imc_5 -> ../../../devices/uncore_imc_5
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_irp -> ../../../devices/uncore_irp
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_pcu -> ../../../devices/uncore_pcu
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_qpi_0 -> ../../../devices/uncore_qpi_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_qpi_1 -> ../../../devices/uncore_qpi_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_r2pcie -> ../../../devices/uncore_r2pcie
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_r3qpi_0 -> ../../../devices/uncore_r3qpi_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_r3qpi_1 -> ../../../devices/uncore_r3qpi_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_r3qpi_2 -> ../../../devices/uncore_r3qpi_2
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_sbox_0 -> ../../../devices/uncore_sbox_0
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_sbox_1 -> ../../../devices/uncore_sbox_1
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_sbox_2 -> ../../../devices/uncore_sbox_2
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_sbox_3 -> ../../../devices/uncore_sbox_3
lrwxrwxrwx 1 root root 0 Sep  5 11:30 uncore_ubox -> ../../../devices/uncore_ubox

On the other hand, if your Linux distribution does not understand the processor uncore, you will only see a few devices, and none with "uncore" in the name.  For example, a 2-socket Xeon Platinum 8160 system running CentOS 7.3 shows no uncore support:

$ ls -l /sys/bus/event_source/devices/
total 0
lrwxrwxrwx 1 root root 0 Aug 21 03:14 breakpoint -> ../../../devices/breakpoint
lrwxrwxrwx 1 root root 0 Aug 21 03:14 cpu -> ../../../devices/cpu
lrwxrwxrwx 1 root root 0 Aug 21 03:14 intel_bts -> ../../../devices/intel_bts
lrwxrwxrwx 1 root root 0 Aug 21 03:14 intel_cqm -> ../../../devices/intel_cqm
lrwxrwxrwx 1 root root 0 Aug 21 03:14 intel_pt -> ../../../devices/intel_pt
lrwxrwxrwx 1 root root 0 Aug 21 03:14 software -> ../../../devices/software
lrwxrwxrwx 1 root root 0 Aug 21 03:14 tracepoint -> ../../../devices/tracepoint
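
The same check can be done from a program. This is a minimal sketch that counts the entries with "uncore" in the name; the function name count_uncore_devices is made up for illustration, and the directory is passed in so that the helper is not tied to one path.

```c
#include <dirent.h>
#include <string.h>

/*
 * Count entries in `dir` whose names contain "uncore".
 * Returns the count, or -1 if the directory cannot be opened.
 */
int count_uncore_devices(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (strstr(entry->d_name, "uncore") != NULL)
            count++;
    }
    closedir(d);
    return count;
}
```

Calling count_uncore_devices("/sys/bus/event_source/devices") and getting 0 back corresponds to the second listing above.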

If your system has uncore support, then you can use these devices as arguments to "perf stat" commands, such as:

perf stat -e "uncore_imc_0/event=0x04,umask=0x03/" -e "uncore_imc_0/event=0x04,umask=0x0c/"  a.out

This command will program and measure two events on DDR channel 0 on a Sandy Bridge/Ivy Bridge/Haswell/Broadwell system.
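
For anyone calling perf_event_open() directly instead of using the perf tool, here is a minimal sketch of how the two fields are packed into the raw config value. It assumes the format files under /sys/bus/event_source/devices/uncore_imc_0/format/ describe event as config:0-7 and umask as config:8-15, which is worth checking on your own system. The function name uncore_raw_config is made up for illustration.

```c
#include <stdint.h>

/*
 * Pack an uncore event selection into a raw perf config value.
 * Assumed layout: event in config bits 0-7, umask in config bits 8-15.
 */
uint64_t uncore_raw_config(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8);
}
```

With that layout the two events in the command above become 0x0304 and 0x0c04. The value for attr.type would come from the device's "type" file in the same sysfs directory.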

"Dr. Bandwidth"

Quote:

McCalpin, John wrote:

...

  • Other uncore counters (such as the Memory Controller counters) are accessed through "PCI Configuration Space".

    • These can be accessed through command-line tools such as "lspci" and "setpci".
    • These counters also only need to be read by one core per package.
    • There are device drivers for these as well, but I always forget the specific path.  A reference that uses these drivers is https://github.com/TACC/tacc_stats

...

Regarding this comment, I have been trying to imitate the way the MCDRAM counters are measured in tacc_stats. However, the counters always return zero or garbage values. My code is the following:

/**
 * Start MCDRAM counter
 *
 * @param dev MCDRAM number
 * @param map_dev region of memory
 * @param events array of events to measure
 */
int
edc_begin_dev(uint32_t dev, uint32_t *map_dev, const uint32_t *events)
{
  int i;
  uint32_t ctl  = 0x0UL;
  uint32_t status  = 0x0UL;
  size_t n = 4;

  /* BUS_MCDRAM = 0xFF, FUN_MCDRAM = 0x02*/
  uint32_t pci = pci_cfg_address(BUS_MCDRAM, dev, FUN_MCDRAM);
  memcpy(&map_dev[reg(pci, ECLK_PMON_UNIT_CTL_REG)], &ctl, n);
  memcpy(&map_dev[reg(pci, ECLK_PMON_UNIT_STATUS_REG)], &status, n);

  for (i=0; i < NELEMS(events); i++) {
    memcpy(&map_dev[reg(pci, ECLK_PMON_CTRCTL0_REG +4*i)], &events[i], n);
  }
  
  return 0;
}

/**
 * Start MCDRAM counters
 *
 */
int
edc_begin()
{
  int nr = 0;
  int n_pmcs = 0;
 
  const char *path = "/dev/mem";
  uint64_t mmconfig_base = 0xc0000000;
  uint64_t mmconfig_size = 0x10000000;
  uint32_t *mmconfig_ptr;
  
  int fd = open(path, O_RDWR | O_SYNC); /* first check to see if file can be opened with read/write permission */
  if (fd < 0) {
    ERROR("cannot open /dev/mem\n");
    goto out;
  }
  mmconfig_ptr = mmap(NULL, mmconfig_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, mmconfig_base);
  if (mmconfig_ptr == MAP_FAILED) {
    ERROR("cannot mmap `%s': %m\n", path);
    goto out;
  }
 
  int i;
  /* EDC: ECLK
     Devices: 0x18 0x19 0x1a 0x1b 0x1c 0x1d 0x1e 0x1f (ECLK MCDRAM controllers)
     Function: 0x02 */
  for (i = 0; i < NELEMS(mcdram_dev); i++) {
    if (edc_begin_dev(mcdram_dev[i], mmconfig_ptr, mcdram_evts) == 0)
      nr++;
  }

  munmap(mmconfig_ptr, mmconfig_size);
 out:
  if (fd >= 0)
    close(fd);

  return nr > 0 ? 0 : -1;
}

/**
 * Stop MCDRAM 
 *
 * @param dev device number in hexadecimal
 * @param map_dev region of memory
 * @note print results
 */
void
edc_stop_dev(uint32_t dev, uint32_t *map_dev)
{
  int i;
  uint32_t pci = pci_cfg_address(BUS_MCDRAM, dev, FUN_MCDRAM);
  printf("edc_stop_dev %u\n", dev);

#define X(k,r...)							\
  ({                                                                    \
    uint32_t val = 0;                                                   \
    val = map_dev[reg(pci, ECLK_PMON_CTR##k##_REG)];			\
    printf("(32 bits) dev=%x val=%u\n", dev, val);			\
  })
  CTL_KEYS;
#undef X

#define X(k,r...)							\
  ({                                                                    \
    uint64_t val = 0;                                                   \
    val = (uint64_t) (map_dev[reg(pci, ECLK_PMON_##k##_HIGH_REG)]) << 32 | (uint64_t) (map_dev[reg(pci, ECLK_PMON_##k##_LOW_REG)]); \
    printf("(64 bits) dev=%x val=%llu\n", dev, (unsigned long long) val);	\
  })
  CTR_KEYS;
#undef X

}

/**
 * Stop MCDRAM counters
 *
 */
void
edc_stop()
{
  const char *path = "/dev/mem";
  uint64_t mmconfig_base = 0xc0000000;
  uint64_t mmconfig_size = 0x10000000;
  uint32_t *mmconfig_ptr;

  int fd = open(path, O_RDWR);    // first check to see if file can be opened with read permission
  if (fd < 0) {
    ERROR("cannot open /dev/mem\n");
    goto out;
  }

  mmconfig_ptr = mmap(NULL, mmconfig_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmconfig_base);

  if (mmconfig_ptr == MAP_FAILED) {
    ERROR("cannot mmap `%s': %m\n", path);
    goto out;
  }

  int i;
  for (i = 0; i < NELEMS(mcdram_dev); i++) {
    edc_stop_dev(mcdram_dev[i], mmconfig_ptr);
  }

  munmap(mmconfig_ptr, mmconfig_size);
 out:
  if (fd >= 0)
    close(fd);
}

The macros are the same as in tacc_stats. Basically, my code calls edc_begin(), executes a kernel, and then calls edc_stop(). Is there something I am missing?

There are lots of ways to run into trouble accessing PCI configuration space directly....

Some ideas:

  1. Make sure you are mapping the correct address for PCI configuration space! 

    1. I typically check the VID/DID fields for some documented devices.  These are documented in Volume 2 of the Intel Xeon Phi Processor datasheet (document 335265).
    2. The tacc_stats system does this as well, but I don't remember how completely it checks, and I can't tell from your example code if the checks are still in there....
  2. PCI configuration space is only supposed to be accessed using naturally-aligned 8-bit, 16-bit, or 32-bit loads. 
    1. I would not assume that it is safe to use "memcpy".
    2. It is a good idea to generate an assembly-language version of your code and check to make sure that ordinary 32-bit MOVE instructions are used for both loads and stores indexed off the mmconfig_ptr.
  3. Every once in a while I find a system where the counters are not counting.
    1. There is a Global Control Register for the Uncore Counters in the Ubox.  For KNL it is at MSR 0x700.
    2. Writing "1" to bit 63 freezes all uncore counters in all uncore boxes.
    3. BUT, this bit is "Write-Only", so there is no way to tell if the counters are supposed to be frozen.
    4. So somewhere in my setup code, I write a "1" to bit 61 of MSR 0x700 to unfreeze all the uncore counters.
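
To make the first two points concrete, here is a minimal sketch of config-space accessors that use only aligned 32-bit volatile loads and stores, plus a VID/DID sanity check. It assumes the standard PCIe memory-mapped layout (bus in bits 27:20, device in 19:15, function in 14:12, register in 11:0) and that `base` points at a MAP_SHARED mapping of the configuration window, since stores to a MAP_PRIVATE mapping are copy-on-write and would not be expected to reach the device. All function names are made up for illustration.

```c
#include <stdint.h>

/* Byte offset of a register inside the memory-mapped config window. */
static inline uint32_t ecam_offset(uint32_t bus, uint32_t dev,
                                   uint32_t fn, uint32_t reg)
{
    return (bus << 20) | (dev << 15) | (fn << 12) | (reg & 0xFFFu);
}

/* Single aligned 32-bit load; `off` must be a multiple of 4. */
static inline uint32_t cfg_read32(volatile void *base, uint32_t off)
{
    return *(volatile uint32_t *)((volatile uint8_t *)base + off);
}

/* Single aligned 32-bit store; `off` must be a multiple of 4. */
static inline void cfg_write32(volatile void *base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + off) = val;
}

/* Register 0 holds the vendor ID in bits 15:0 and the device ID in 31:16. */
static inline int is_expected_device(volatile void *base, uint32_t bus,
                                     uint32_t dev, uint32_t fn,
                                     uint16_t vid, uint16_t did)
{
    uint32_t id = cfg_read32(base, ecam_offset(bus, dev, fn, 0));
    return (id & 0xFFFFu) == vid && (id >> 16) == did;
}
```

Checking is_expected_device() with vendor ID 0x8086 and the documented device ID before touching any counter register catches a wrong base address early. It is still worth looking at the generated assembly to confirm that the compiler emitted plain 32-bit moves.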
"Dr. Bandwidth"
