I have a parallel application that uses busy waiting for synchronization. The parallel region is a loop, and I have unrolled it a few times in the hope of reducing synchronization overhead, increasing ILP, and exposing more optimization opportunities to the compiler (this is in fact my experiment). Now I need to measure several properties of this parallel program, among them ILP and thread synchronization overhead. My question is: what are the appropriate events for measuring these properties?
I believe it would be wrong to calculate IPC as (INST_RETIRED.ANY / CPU_CLK_UNHALTED.REF_TSC) in this case, because the instructions spent in busy waiting can skew the retired-instruction count. I'm currently considering UOPS_EXECUTED.CYCLES_GE_*_UOPS_EXEC as an approximation of how much ILP is "present" in the loop. Do you think this is a reasonable approach?
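To make the idea concrete, here is a rough sketch of how I would post-process those counters. It assumes the four cumulative events (cycles with at least 1, 2, 3 and 4 uops executed) are collected together with a core-cycle count such as CPU_CLK_UNHALTED.THREAD over the same interval; the function names and the sample numbers are made up for illustration, not measured.

```python
def uops_per_cycle_histogram(cycles, ge1, ge2, ge3, ge4):
    """Convert cumulative 'cycles with >= k uops executed' counts into
    the fraction of cycles with exactly 0, 1, 2, 3 and >= 4 uops."""
    exact = {
        "0": cycles - ge1,
        "1": ge1 - ge2,
        "2": ge2 - ge3,
        "3": ge3 - ge4,
        ">=4": ge4,
    }
    return {bucket: count / cycles for bucket, count in exact.items()}


def min_avg_uops_per_cycle(cycles, ge1, ge2, ge3, ge4):
    """Lower bound on average uops executed per cycle.

    The sum of the cumulative counts equals the sum over all cycles of
    min(uops_in_cycle, 4), so cycles with more than 4 uops are undercounted.
    """
    return (ge1 + ge2 + ge3 + ge4) / cycles


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    sample = dict(cycles=1000, ge1=800, ge2=500, ge3=200, ge4=100)
    print(uops_per_cycle_histogram(**sample))
    print(min_avg_uops_per_cycle(**sample))
```

As far as I can tell, uops executed inside the spin-wait loop would still be counted, so I assume the counters would have to be read around the work portion of the loop only for this to say anything about ILP.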
As for cache-to-cache transfer overhead due to synchronization, I'm considering MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT*_PS and OFFCORE_RESPONSE.DEMAND_RFO.LLC_HIT.HIT*_OTHER_CORE_0 as target counters. Do you think this is a valid approximation?
I'm running these experiments on an Ivy Bridge machine running Ubuntu 13.10.
Any advice on this will be appreciated. Thanks in advance!