bug in Haswell-E Offcore Response counters?

On a Haswell-E processor (Xeon E7-4830 v3, family_signature=06_3f), the offcore response counters seem to work only when the response type is ANY; otherwise they return 0.  Details below.

I'm testing a cache ping-pong program with two threads on two sockets.  If I set requests to DMND_DATA_RD (bit 0) and response to ANY (bit 16), I get expected results:

  % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/ taskset -c 0,12 ./a.out 

 Performance counter stats for 'taskset -c 0,12 ./a.out':

        20,063,594      cpu/event=0xb7,umask=0x1,offcore_rsp=0x10001/

But for any other settings of the response, I get zero.  For example, with L3_HITM:

    % perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/ taskset -c 0,12 ./a.out

 Performance counter stats for 'taskset -c 0,12 ./a.out':

                 0      cpu/event=0xb7,umask=0x1,offcore_rsp=0x40001/

Is this known behavior?  Am I doing something wrong?  For reference, this is the tested ping-pong program:

#include <pthread.h>

volatile unsigned int x;   /* the cache line the two threads ping-pong */

void* run_t1(void* r)
{
    int i;
    for (i = 0; i < 10000000; i++) {
        while (x != i) continue;   /* wait for thread 2's hand-off */
        x = ~0;                    /* signal thread 2 */
    }
    return NULL;
}

void* run_t2(void* r)
{
    int i;
    for (i = 0; i < 10000000; i++) {
        while (x != ~0) continue;  /* wait for thread 1's signal */
        x = i + 1;                 /* hand the line back */
    }
    return NULL;
}

int main(int argc, char** argv)
{
    pthread_t threads[2];
    void* status;
    int i;

    pthread_create(&threads[0], NULL, run_t1, NULL);
    pthread_create(&threads[1], NULL, run_t2, NULL);

    for (i = 0; i < 2; i++)
        pthread_join(threads[i], &status);

    return 0;
}


I don't have any Xeon E7 v3 systems, but I have run into cases on Xeon E5 v3 where the transactions were not what I expected. The two examples that come to mind immediately are:

  1. Several cross-chip interactions use different transaction types in different snooping modes. 
  2. Hardware prefetches can occur in cases where they are not expected.

The programming of these events can be more subtle than a first reading (or second reading, or third reading) of the documentation might suggest.  There are examples of how the offcore response counters can be used at https://download.01.org/perfmon/HSX/haswellx_offcore_v19.tsv.  The "MSRValue" fields in that file set many more bits than you are setting -- for example, OFFCORE_RESPONSE.DEMAND_DATA_RD.LLC_MISS.ANY_RESPONSE shows an MSRValue of 0x3fbfc00001.  This MSR value includes:

  • Setting all of bits 37:31, the "Snoop Response" bits described in Table 18-38 (referenced in Section 18.11.4 of Volume 3 of the SWDM, document 325384-062).
  • Setting 11 of the 15 bits in the "Supplier" field (bits 30:16), described in Table 18-50.
  • Setting only bit 0 of the "Request Type" field (bits 15:0), described in Table 18-47.  This matches your configuration.

Of course, I have also seen plenty of bugs in these counters...

"Dr. Bandwidth"

Thanks for the information, John!  It helped me make some progress.  It looks like my mistake was not setting any snoop-response bits when I set a non-ANY supplier bit.  But the results still don't make sense: the counts for the individual suppliers don't add up to the count with ANY:

% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80010001/ taskset -c 0,12 ./a.out
        20,044,681      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80010001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80020001/ taskset -c 0,12 ./a.out
                 0      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80020001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80040001/ taskset -c 0,12 ./a.out
             7,212      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80040001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80080001/ taskset -c 0,12 ./a.out
               684      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80080001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80100001/ taskset -c 0,12 ./a.out
            14,659      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80100001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80200001/ taskset -c 0,12 ./a.out
             1,456      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80200001/
% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80400001/ taskset -c 0,12 ./a.out
               736      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3f80400001/
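For reference, the seven offcore_rsp values above differ only in which single supplier bit (16 through 22) is set on top of the same base (all snoop bits 37:31 plus DMND_DATA_RD in bit 0).  A quick shell loop, assuming bash arithmetic, regenerates the encodings:

```shell
# Regenerate the seven offcore_rsp encodings used above:
# base = all snoop-response bits (37:31) | DMND_DATA_RD (bit 0),
# plus one supplier bit (16..22) at a time.
for bit in $(seq 16 22); do
    printf '0x%x\n' $(( 0x3f80000001 | (1 << bit) ))
done
```

The first value printed is 0x3f80010001 and the last is 0x3f80400001, matching the perf invocations above.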

Interestingly, the value from haswellx_offcore_v19.tsv isn't supported on my processor:

% perf stat -e cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fbfc00001/ taskset -c 0,12 ./a.out 
 Performance counter stats for 'taskset -c 0,12 ./a.out':

   <not supported>      cpu/event=0xb7,umask=0x1,offcore_rsp=0x3fbfc00001/

But the processor is definitely a Haswell-E:

cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz



The "not supported" message is likely a software limitation -- I have never seen the hardware prevent the setting of bit fields in the performance counter controls before. 

Some MSRs do have protected bit fields -- you might try writing the "<not supported>" bit pattern to MSR 0x1A6 using the "wrmsr.c" program from msr-tools-1.3 to see whether the hardware is preventing writes to some of the bits.  (Table 18-50 says that bits 26:23 are reserved, but they are set in the bit pattern above.)

On my Xeon E5 v3 systems, there is no problem writing the value 0x3fbfc00001 to MSR 0x1A6, so it is probably overzealous software noticing that you are writing to what are documented to be reserved bits.     Yet another reason why I write (almost) all my own performance monitoring code....
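The wrmsr/rdmsr pair from msr-tools makes that check a one-liner.  This is a hardware-dependent sketch (needs root and the msr driver loaded), not something to run blindly on a production box:

```shell
# Sketch: write the "not supported" pattern directly to MSR_OFFCORE_RSP_0
# (0x1A6) on core 0, then read it back to see which bits the hardware kept.
sudo modprobe msr
sudo wrmsr -p 0 0x1a6 0x3fbfc00001
sudo rdmsr -p 0 0x1a6
```

If the read-back matches the written value, the rejection is coming from software, not from the MSR itself.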

For this example, I manually set up PMC0 with 0x004301b7 and set MSR 0x1A6 to 0x3fbfc00001 (both on core 0).  Then I disabled the HW prefetchers and ran the STREAM benchmark pinned to core 0.  For the STREAM parameters used (N=80M, NTIMES=100), I expected about 384 billion cache line reads, and this counter incremented by 386.6 billion during the run.  So the "<not supported>" bit pattern does count demand LLC misses reasonably accurately for at least one test case.

Re-enabling the HW prefetchers reduced the count to 0.97 billion, indicating that the prefetchers are able to stay ahead of the demand loads in this single-threaded test case.  This is not surprising, since the sustained BW for the STREAM kernels was between 19 GB/s and 20 GB/s -- less than 30% of the peak BW of the four DDR4/2133 DRAM channels on socket 0.

"Dr. Bandwidth"
