Software Tuning, Performance Optimization & Platform Monitoring

Performance counter is not accurate enough in Core2 CPUs

Hi there,

According to the Intel manual, L1D_REPL counts the number of lines brought into the L1 data cache on Core 2 CPUs. I tested this event on a Q8200 PC, but the result was not what I expected. I did the following:

1: set IA32_PMC0 to count the L1D_REPL event on all cores

2: disable the caches of all other cores

3: flush the cache hierarchy using wbinvd

4: load counters (high[2] and low[2]) into the cache

5: rdmsr(IA32_PMC0, low[0], high[0])

6: access a buffer of 300*64 bytes aligned on a 64-byte boundary (the cache fills with 64-byte cache lines)
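
The measurement in steps 5 and 6 comes down to two pieces of arithmetic: joining the two 32-bit halves returned by rdmsr into one counter value, and comparing the counter delta with the number of cache lines the buffer spans. Below is a minimal sketch in Python. The 40-bit counter width is an assumption that should be checked with CPUID leaf 0AH on the target machine, and all function names are hypothetical.

```python
LINE_SIZE = 64       # cache line size in bytes, as stated in step 6
COUNTER_BITS = 40    # ASSUMED counter width; verify with CPUID leaf 0AH


def combine(low, high):
    """Join the low (EAX) and high (EDX) halves returned by rdmsr."""
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)


def counter_delta(before, after, bits=COUNTER_BITS):
    """Difference between two readings, tolerating a single wraparound."""
    return (after - before) % (1 << bits)


def lines_touched(start_addr, length, line_size=LINE_SIZE):
    """Distinct cache lines covered by [start_addr, start_addr + length)."""
    if length <= 0:
        return 0
    first = start_addr // line_size
    last = (start_addr + length - 1) // line_size
    return last - first + 1


if __name__ == "__main__":
    # Hypothetical 64-byte-aligned buffer address, for illustration only.
    expected = lines_touched(0x10000, 300 * LINE_SIZE)
    print("cache lines spanned by the buffer:", expected)
```

With an aligned buffer the sketch gives 300 lines. A measured delta above that may include fills from sources other than the buffer itself, so it is worth comparing against this baseline before concluding that the counter is inaccurate.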

TOR_INSERTS.MISS_OPCODE count is bigger than OPCODE. It is unexpected.

Hello

I am trying to measure LLC_RFO_MISS_PCT based on the derived metrics in xeon-e5-2600-uncore-guide, page 48.

What surprised me is that the MISS values are higher than those for OPCODE alone. I expected the opposite, and the expression in the manual says that PCT_MISS = MISS_OPCODE_COUNT / OPCODE_COUNT.
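
The expression quoted from the guide can be written down with a sanity check. This is only a sketch of the arithmetic: the function names are hypothetical, and the check encodes the poster's expectation that the miss count is a subset of the total count for the same opcode.

```python
def miss_pct(miss_opcode_count, opcode_count):
    """PCT_MISS = MISS_OPCODE_COUNT / OPCODE_COUNT, as quoted from the guide."""
    if opcode_count == 0:
        return None  # ratio is undefined when no matching inserts were counted
    return miss_opcode_count / opcode_count


def is_consistent(miss_opcode_count, opcode_count):
    """True when the miss count does not exceed the total it should be part of."""
    return 0 <= miss_opcode_count <= opcode_count
```

A pair of readings for which `is_consistent` returns False, as described in this post, points at the counter setup or filter configuration and not at the workload.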

What might go wrong?

I collected both counters together:

register 0 for TOR_INSERTS.UMASK_OPCODE

register 1 for TOR_INSERTS.UMASK_MISS_OPCODE

opc=0x180  RFO

Results are below:

Stream benchmark runs slower on 8-core E5-2680 when vectorized.

I know there are many hardware geeks in this forum, so I'd like to make my enquiries here.

I compiled Stream.c with 'icc -O -g stream.c -o stream-icc.out -vec-report2' and ran it on an 8-core E5-2680 and a 4-core i7-2600K for comparison. You can see the differences below.

DataLA profiling upon TSX abort event?

Hi all,

I was looking at Intel SDM V3, section 18.10.5.1 Intel TSX and PEBS Support. Specifically, I am interested in the available PEBS support upon an RTM abort event. I saw that in the previous manual version (August 2012), there was an additional line suggesting that a DataLA field would be available (valid in the PEBS record):

Offset 0x98 Data Linear Address (if the uop that triggered PEBS was a load or a store),

Some events missing from the manual are not necessarily nonexistent!

Dear all,

Recently I have been working on an E5-4603 chip. I was a little astonished when I read the relevant part of the official programming manual, because unexpectedly there is no event counter for SIMD integer operations, while many chips have one. For example, on i7 chips there is an event named SIMD_INT_128.PACKED_ARITH, event number 20H, Umask 12H.
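
For reference, an event number and umask like the pair quoted above combine into a single value in the architectural IA32_PERFEVTSELx layout: event select in bits 0-7, unit mask in bits 8-15, USR in bit 16, OS in bit 17 and EN in bit 22. The sketch below shows only that encoding. The function names are hypothetical, and whether a given chip actually counts anything for this encoding is a separate question that the encoding itself cannot answer.

```python
USR = 1 << 16  # count while in user mode (ring 3)
OS = 1 << 17   # count while in kernel mode (ring 0)
EN = 1 << 22   # enable the counter


def raw_event(event, umask):
    """Pack event select (bits 0-7) and unit mask (bits 8-15)."""
    return ((umask & 0xFF) << 8) | (event & 0xFF)


def perfevtsel(event, umask, user=True, kernel=True):
    """Build an IA32_PERFEVTSELx value with the counter enabled."""
    value = raw_event(event, umask) | EN
    if user:
        value |= USR
    if kernel:
        value |= OS
    return value


if __name__ == "__main__":
    # Event number 20H, Umask 12H, as quoted for SIMD_INT_128.PACKED_ARITH.
    print(hex(raw_event(0x20, 0x12)))
    print(hex(perfevtsel(0x20, 0x12)))
```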

Because the slot for event number 20H, Umask 12H is empty in the E5 chip description, I guessed that the event might work on this chip as well. I wrote a very simple benchmark, like:

[PCM] power and thermal measurements

Wei Yun H wrote:

1. Can the PCM get energy consuming for each core?
2. How does the temperature transfer from this to Celsius?

  1. Unfortunately there is no way to break down the energy consumption per core.
  2. You need to subtract the thermal headroom from the maximum temperature of your processor. The maximum depends on your specific CPU; if you are willing to give up some precision, you can use an estimate.
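
The conversion described in answer 2 is a single subtraction. A minimal sketch follows, with a hypothetical function name; the maximum temperature passed in is an assumption to be replaced with the value for your specific CPU.

```python
def celsius_from_headroom(thermal_headroom, max_temperature):
    """Temperature in Celsius = maximum temperature - thermal headroom."""
    return max_temperature - thermal_headroom


if __name__ == "__main__":
    ASSUMED_MAX = 100  # hypothetical maximum temperature in Celsius
    print(celsius_from_headroom(35, ASSUMED_MAX))
```

If the assumed maximum is off by a few degrees, every converted reading is off by the same amount, which is the loss of precision the answer mentions.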

[PCM] PCM not restarting after ctrl-c

I have a problem when using pcm on Linux.
After compiling pcm successfully, I launched pcm.x and it showed the right information. Then I terminated it with Ctrl+C.
But when I try to run pcm.x again, it reports that the PMU is busy every time. The output is like below:
WARNING: Core 0 fixed ctrl:176
Access to Intel(r) Performance Counter Monitor has denied (Performance Monitoring Unit is occupied by other application). Try to stop the application that uses PMU.
Alternatively you can try to reset PMU configuration at your own risk. Try to reset? (y/n)

TSX conflict aborts for single threaded applications

Hi,

I have just started running STAMP (Transactional Memory benchmarks from Stanford) on Haswell. I am monitoring the TSX performance using the libpfm4.4 library on Ubuntu (kernel 3.11-rc6). When I run single-threaded transactional memory benchmarks, I see transaction aborts due to data conflicts (TX_MEM.ABORT_CONFLICT) for certain benchmarks. This behavior seems strange to me: since the benchmark is single-threaded, there shouldn't be any data conflicts on transactionally accessed addresses. Is there something that I am missing?
