Intel Nehalem - Counting Memory Related Stalls

Hi,
I am trying to analyze some benchmarks and see how much of their stall cycles are related to memory access. I looked at the documents: "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors".

From these I gathered that Total Cycles = UOPS_EXECUTED.CORE_STALLS_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, where Total Cycles is CPU_CLK_UNHALTED.THREAD. I also understand that memory-related accesses go through ports 2, 3 and 4, whereas ALU-related operations go through ports 0, 1 and 5.

I could find UOPS_EXECUTED.PORT015_STALL_CYCLES counter to get ALU related stalls but no counter to get memory related stalls. Counter UOPS_EXECUTED.PORT234_CORE seems to be overall memory UOPS and not stall cycles.
Could anyone suggest how to identify memory related stalls?

Also, for the programs I ran, UOPS_EXECUTED.PORT015_STALL_CYCLES was greater than UOPS_EXECUTED.CORE_STALLS_CYCLES. Does that make sense?

I hope this is the right forum for this question. Please correct me otherwise.

Thanks,
Vineeth

Hi Vineeth,

Please refer to this article.

Measure the impact of memory stalls (this is the average number of cycles per stall):
UOPS_EXECUTED.CORE_STALLS_CYCLES / UOPS_EXECUTED.CORE_STALLS_COUNT

UOPS_EXECUTED.CORE_STALLS_CYCLES:

Cycle count when no Uops were executed that were issued on any of the ports. This event must be a core count as port 2, 3 & 4 events are core counts.

UOPS_EXECUTED.CORE_STALLS_COUNT:
Counts when there is 1 or more Uops executed that were issued on any of the ports. This event must be a core count as port 2, 3 & 4 events are core counts.
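
The ratio above is a simple division of two counter readings. A minimal sketch of that arithmetic follows; the function name and the sample readings are hypothetical and only illustrate the calculation, they are not measurements from this thread.

```python
def average_stall_cycles(stall_cycles: int, stall_count: int) -> float:
    """Average length of one execution stall, in cycles.

    stall_cycles: reading of UOPS_EXECUTED.CORE_STALLS_CYCLES
    stall_count:  reading of UOPS_EXECUTED.CORE_STALLS_COUNT
    """
    if stall_count <= 0:
        raise ValueError("stall_count must be positive")
    return stall_cycles / stall_count


# Hypothetical readings, chosen only to show the arithmetic.
example_cycles = 126_000_000
example_count = 9_000_000
print(average_stall_cycles(example_cycles, example_count))
```

Note that this gives the average duration of any execution stall; on its own it does not say whether the stall was caused by memory.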

Hope it helps.

Regards, Peter

Hi Peter,
Thanks for your reply.
I have looked at the document by Dr. Levinthal, but I am still confused about my initial question. Let me ask a few more questions. Do UOPS_EXECUTED.CORE_STALL_CYCLES/COUNT consider only the memory-related ports 2, 3 and 4? Or do they consider all the ports, including the ALU-related ports 0, 1 and 5?
I ran a program to completion and collected the following numbers (in billions):
CPU_CYCLES_UNHALTED.THREAD = 519
UOPS_EXECUTED.CORE_STALL_CYCLES = 126
UOPS_EXECUTED.CORE_ACTIVE_CYCLES = 392
UOPS_EXECUTED.PORT015_STALL_CYCLES = 194
UOPS_EXECUTED.PORT234_CORE = 365
It makes sense to me that CPU_CYCLES_UNHALTED.THREAD = UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, as mentioned in the document.
But how can UOPS_EXECUTED.PORT015_STALL_CYCLES be greater than UOPS_EXECUTED.CORE_STALL_CYCLES? Or is it that UOPS_EXECUTED.CORE_STALL_CYCLES doesn't count ports 0, 1 and 5?
My aim is to segregate memory-related stalls from the total stalls, which include both memory-related and ALU-related stalls.
Thanks again,
Vineeth
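
The figures quoted in this post can be sanity-checked with a few lines of arithmetic. The sketch below uses the values exactly as reported (in billions, so already rounded); the helper name `summarize` and the derived fractions are illustrative assumptions, not part of any Intel tool.

```python
# Counter values as reported in this post, in billions of cycles.
counters = {
    "CPU_CYCLES_UNHALTED.THREAD": 519,
    "UOPS_EXECUTED.CORE_STALL_CYCLES": 126,
    "UOPS_EXECUTED.CORE_ACTIVE_CYCLES": 392,
    "UOPS_EXECUTED.PORT015_STALL_CYCLES": 194,
}


def summarize(c: dict) -> dict:
    """Check total = stall + active and express stalls as fractions of total."""
    total = c["CPU_CYCLES_UNHALTED.THREAD"]
    stall = c["UOPS_EXECUTED.CORE_STALL_CYCLES"]
    active = c["UOPS_EXECUTED.CORE_ACTIVE_CYCLES"]
    port015 = c["UOPS_EXECUTED.PORT015_STALL_CYCLES"]
    return {
        # Cycles not accounted for by stall + active (rounding can leave a small gap).
        "residual": total - (stall + active),
        "core_stall_fraction": stall / total,
        "port015_stall_fraction": port015 / total,
    }


summary = summarize(counters)
print(summary["residual"])
print(round(summary["core_stall_fraction"], 3))
print(round(summary["port015_stall_fraction"], 3))
```

A residual of 1 is within the rounding of the reported values. The two fractions differ because the two events count different conditions; neither isolates memory-related stalls by itself.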

Hi Vineeth,

It doesn't make sense to use UOPS_EXECUTED.PORT234_CORE because, as Dr. Levinthal's document says:
"The signals used to count the memory access uops executed (ports 2, 3 and 4) are the
only core events which cannot be counted on a logical core or HT basis...the ALU ports (0,1,5) count on a
per thread basis"

"Thus in the case where HT is
enabled we have the following inequality
UOPS_EXECUTED.CORE_STALL_CYCLES <= True execution stalls per thread <= UOPS_EXECUTED.PORT015_STALL_CYCLES

Of course with HT disabled then
UOPS_EXECUTED.CORE_STALL_CYCLES = True execution stalls per thread = UOPS_EXECUTED.PORT015_STALL_CYCLES"

In most cases HT is enabled on the system. To keep things simple, just use UOPS_EXECUTED.CORE_STALL_CYCLES whether HT is enabled or not.
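
The inequality quoted above brackets the true per-thread execution stalls between two measurable counters. A minimal sketch, assuming the two readings come from the same run; the function name `stall_bounds` is hypothetical, and the sample values are the ones reported earlier in this thread (in billions).

```python
def stall_bounds(core_stall_cycles: float, port015_stall_cycles: float):
    """Return (lower, upper, width) for true execution stalls per thread.

    lower: UOPS_EXECUTED.CORE_STALL_CYCLES
    upper: UOPS_EXECUTED.PORT015_STALL_CYCLES
    """
    if core_stall_cycles > port015_stall_cycles:
        raise ValueError("readings violate the quoted inequality")
    width = port015_stall_cycles - core_stall_cycles
    return core_stall_cycles, port015_stall_cycles, width


# Values reported earlier in the thread, in billions of cycles.
lower, upper, width = stall_bounds(126, 194)
print(lower, upper, width)
```

The wider the gap between the two bounds, the less either counter alone says about the true stall count.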

Regards, Peter

Dear Peter,
Thanks for your reply, and sorry for my long hiatus. I am back at this problem.
I understand now that UOPS_EXECUTED.CORE_STALL_CYCLES can be less than UOPS_EXECUTED.PORT015_STALL_CYCLES, as the first one is counted only when all ports are stalled. Hence, cycles in which any of ports 2, 3 or 4 is not stalled are not counted.
But my main concern remains: how do I calculate *memory related stalls* on Intel Core i7? The obvious step, it seems to me, is to get the stalls on PORT234 and take them as a percentage of total stalls (ports 0-5), but there is no counter for stalls on PORT234 (the memory ports).
Is there a way to get that, or is there any other way to calculate what percentage of total CPU cycle stalls are due to memory-related reasons rather than other reasons (like ALU etc.)?
My machine has HT disabled, by the way.
Thanks a lot, again!
Vineeth

Hi Vineeth,

Please read this article for optimization guidelines for Intel Core i7 processors.

Hope it helps.

Regards, Peter

Vineethtm,

Any luck with your quest? Did you manage to count the memory-related stall cycles? My understanding is that during a load or a store, there is no "stall" cycle. Maybe the event UOPS_EXECUTED.PORT234 is what you (and I) are searching for?

Best,
Panagiotis
