Software Tuning, Performance Optimization & Platform Monitoring

QPI link counters

I am testing performance on a two-node NUMA system (an Intel Xeon E5-2620 in each socket).
The test runs 12 threads (6 on each node), and they all access shared memory.
I run it once with all memory allocated on a specific node and once with memory interleaved between the nodes.
The result is that the test runs faster in interleaved mode.
I thought I'd check how much data is actually transferred between the nodes; maybe that would explain why interleaved is faster.

R2PCIe Test

I have collected the RING_THRU_DN_BYTES and RING_THRU_UP_BYTES events.

When I run the glxgears program, DN ~= 5200 MB and UP ~= 8000 MB.

When I run stream, DN ~= 4200 MB and UP ~= 9800 MB.

I can't judge whether the events I collected are right.

How to defeat H/W prefetcher in Intel Core i3/i7

Hello everyone,

I am trying to find a way to defeat the H/W prefetcher: I want to access 4 KB of data in a random order
so that the hardware does not detect a streaming pattern and prefetch the data.

Initially I was thinking of accessing all the even-indexed data in a random pattern, since the H/W prefetcher always
prefetches the next cache line (so when I access an even index, the data at the next odd index has already been prefetched).
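One way to express the random-order idea is a pointer-chasing chain over the cache lines of a 4 KB page, so that each access depends on the previous one. The sketch below only builds and checks such an order. It assumes 64-byte cache lines, the function names are hypothetical, and it makes no claim that this order actually evades the prefetchers on a given Core i3/i7; a real measurement would have to walk the same chain from C or assembly.

```python
import random

PAGE_BYTES = 4096
LINE_BYTES = 64  # assumption: 64-byte cache lines
NUM_LINES = PAGE_BYTES // LINE_BYTES


def build_chase_chain(num_lines=NUM_LINES, seed=None):
    """Return chain, where chain[i] is the line visited after line i.

    Uses Sattolo's algorithm, which produces a permutation with a single
    cycle, so following the chain from any line visits every line once.
    """
    rng = random.Random(seed)
    chain = list(range(num_lines))
    for i in range(num_lines - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        chain[i], chain[j] = chain[j], chain[i]
    return chain


def visit_order(chain, start=0):
    """Follow the chain from start and return the byte offset of each access."""
    offsets = []
    line = start
    for _ in range(len(chain)):
        offsets.append(line * LINE_BYTES)
        line = chain[line]
    return offsets
```

Calling `visit_order(build_chase_chain(seed=1))` gives 64 offsets that cover the whole page without repeats. The single-cycle property matters because an ordinary shuffle can contain short cycles that would revisit the same few lines.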

LLCM

I calculated LLCM. It is about 60% when I run VASP.

I use the following formula:

LONGEST_LAT_CACHE.MISS * 100  /  ( LONGEST_LAT_CACHE.MISS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT )

I have no idea whether this is right.
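For reference, the formula above can be written as a small helper. This is only a sketch that evaluates the expression exactly as posted; it does not settle whether these are the right events for an LLC miss rate. The function name and the sample counts are hypothetical.

```python
def llc_miss_percent(llc_miss, xsnp_hitm, xsnp_hit):
    """Evaluate the posted formula.

    llc_miss  -- LONGEST_LAT_CACHE.MISS
    xsnp_hitm -- MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
    xsnp_hit  -- MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
    """
    total = llc_miss + xsnp_hitm + xsnp_hit
    if total == 0:
        raise ValueError("all three counters are zero; nothing to compute")
    return llc_miss * 100.0 / total


# Hypothetical counter values, not measurements from the post.
example = llc_miss_percent(llc_miss=600, xsnp_hitm=100, xsnp_hit=300)
```

Note that the denominator contains only the two LLC-hit sub-events named in the post, so the result depends on how many LLC hits fall outside those two categories.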

Sudden change in Haswell power consumption

I am monitoring the power consumption of a Haswell processor and have noticed some sudden changes in power consumption for no apparent reason. To demonstrate, as shown in the figure below (the left y-axis is in watts and the right y-axis is in degrees Celsius), I fixed the CPU C-state of all cores to C0 at a fixed frequency and then raised the temperature using a hair dryer. I measure the power consumption both at the CPU ATX power rail and with RAPL. As one can see, the power rises linearly with the temperature until about 16.5 W, and then there is a sudden jump to ~18.5 W.
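For anyone comparing the RAPL numbers against the ATX rail, a minimal sketch of turning two raw package energy readings into average power is shown below. It assumes the readings come from the 32-bit energy status counter (which wraps around) and that the energy unit is 1/2^ESU joules, with ESU read from the RAPL power unit register. The function name and the default ESU of 14 are assumptions for illustration; the real value should be read from the machine.

```python
def rapl_average_power_watts(raw_start, raw_end, interval_s, energy_status_units=14):
    """Average power between two raw RAPL energy counter readings.

    raw_start, raw_end  -- raw 32-bit energy counter values
    interval_s          -- time between the two readings, in seconds
    energy_status_units -- ESU field; one count is 1 / 2**ESU joules
                           (14 is an assumed default, read it from the MSR)
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Masking handles a single wraparound of the 32-bit counter.
    delta_counts = (raw_end - raw_start) & 0xFFFFFFFF
    joules = delta_counts / (1 << energy_status_units)
    return joules / interval_s
```

The sampling interval has to be short enough that the counter wraps at most once between readings; otherwise the computed power is too low.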

What are UBox, Cbo, Sbo, HA, iMC, IRP, PCU, QPI, R2PCIe, R3QPI, etc.?

I'm reading xeon-e5-v3-uncore-performance-monitoring.pdf. If I want to monitor some events, I need to set a *CTL register and then read the count from the matching *CTR register. Both an "MSR Address" and a "PCICFG Address" are listed.

If I monitor PCI events, do I still need to set an MSR *CTL?

I'm confused about the relationship between UBox, Cbo, Sbo, HA, iMC, IRP, PCU, QPI, R2PCIe and R3QPI.

Preventing FP overcounts for AVX instructions on Sandy Bridge

As most readers of this forum are aware, the performance counter events for floating-point operations can overcount significantly on Sandy Bridge/Ivy Bridge platforms.   This applies to both Event 0x10 "FP_COMP_OPS_EXE.*" and Event 0x11 "SIMD_FP_256.*".    The overcounts are clearly related to stalls -- the counts appear to be very close for data in the L1 Data Cache, increase slightly for data in the L2 cache, increase significantly for data in the L3 cache, and are very high for data in memory.  The degree of overcounting depends on the details of the code generated and the load on the me
