Intel Nehalem - Counting Memory Related Stalls

Vineeth Mekkat

Hi, I am trying to analyze some benchmarks and see how much of their stall cycles are related to memory access. I looked at the documents: "Intel 64 and IA-32 Architectures Optimization Reference Manual" and "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors".

From these I gathered that Total Cycles = UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, where Total Cycles is CPU_CLK_UNHALTED.THREAD. I also understand that memory-related accesses go through ports 2, 3 and 4, whereas ALU-related operations go through ports 0, 1 and 5. I could find the UOPS_EXECUTED.PORT015_STALL_CYCLES counter to get ALU-related stalls, but no counter for memory-related stalls. The counter UOPS_EXECUTED.PORT234_CORE seems to count overall memory uops, not stall cycles.
Could anyone suggest how to identify memory-related stalls? Also, for the programs I ran, UOPS_EXECUTED.PORT015_STALL_CYCLES was greater than UOPS_EXECUTED.CORE_STALL_CYCLES. Does that make sense? I hope this is the right forum for this question. Please correct me otherwise. Thanks, Vineeth
Peter Wang (Intel)

Hi Vineeth,

Please refer to this article.

To measure the impact of memory stalls, use this ratio (it gives the average duration, in cycles, of each stall):
UOPS_EXECUTED.CORE_STALL_CYCLES / UOPS_EXECUTED.CORE_STALL_COUNT

UOPS_EXECUTED.CORE_STALL_CYCLES:

Cycle count when no Uops were executed that were issued on any of the ports. This event must be a core count as port 2, 3 & 4 events are core counts.

UOPS_EXECUTED.CORE_STALL_COUNT:
Counts when there is 1 or more Uops executed that were issued on any of the ports. This event must be a core count as port 2, 3 & 4 events are core counts.
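
As a rough illustration of the ratio above, here is a minimal Python sketch. The function name and the sample counts are hypothetical; real values would come from whatever profiler you use to read the two counters.

```python
def average_stall_duration(stall_cycles, stall_count):
    """Average length, in cycles, of one execution stall.

    stall_cycles: value read from the stall-cycles counter
    stall_count:  value read from the stall-count counter
    """
    if stall_count <= 0:
        raise ValueError("stall_count must be positive")
    return stall_cycles / stall_count


# Hypothetical counter readings, for illustration only.
example = average_stall_duration(stall_cycles=1_000_000, stall_count=50_000)
print(f"average stall duration: {example:.1f} cycles")
```

A long average suggests that stalls are dominated by long-latency events, but the ratio alone does not say which port or resource caused them.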

Hope it helps.

Regards, Peter

Vineeth Mekkat

Hi Peter, Thanks for your reply. I have looked at the document by Dr. Levinthal, but I am still confused about my initial question, so let me ask a few more. Do UOPS_EXECUTED.CORE_STALL_CYCLES and UOPS_EXECUTED.CORE_STALL_COUNT consider only the memory-related ports 2, 3 and 4? Or do they consider all the ports, including the ALU-related ports 0, 1 and 5?
I ran a program to completion and collected the following numbers (in billions):
CPU_CLK_UNHALTED.THREAD = 519
UOPS_EXECUTED.CORE_STALL_CYCLES = 126
UOPS_EXECUTED.CORE_ACTIVE_CYCLES = 392
UOPS_EXECUTED.PORT015_STALL_CYCLES = 194
UOPS_EXECUTED.PORT234_CORE = 365
It makes sense to me that CPU_CLK_UNHALTED.THREAD = UOPS_EXECUTED.CORE_STALL_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES, as mentioned in the document. But how can UOPS_EXECUTED.PORT015_STALL_CYCLES be greater than UOPS_EXECUTED.CORE_STALL_CYCLES? Or is it that UOPS_EXECUTED.CORE_STALL_CYCLES doesn't count ports 0, 1 and 5? My aim is to separate memory-related stalls from the total stalls, which include both memory-related and ALU-related stalls. Thanks again, Vineeth
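
As a quick sanity check on the figures in this post, here is a minimal Python sketch. The variable names are illustrative, and the values are the posted ones (in billions, so already rounded).

```python
# Counter values as posted, in billions of events.
total_cycles = 519    # unhalted thread cycles
core_stall = 126      # cycles with no uops executed on any port
core_active = 392     # cycles with at least one uop executed
port015_stall = 194   # cycles with no uops executed on ports 0, 1 and 5

# Stall + active should add up to the total, up to rounding.
residual = total_cycles - (core_stall + core_active)

# Share of all cycles spent in each kind of stall.
core_stall_fraction = core_stall / total_cycles
port015_stall_fraction = port015_stall / total_cycles

print(f"residual: {residual} billion cycles")
print(f"core stall fraction: {core_stall_fraction:.1%}")
print(f"port 0/1/5 stall fraction: {port015_stall_fraction:.1%}")
```

The small residual is consistent with rounding to whole billions. Neither fraction isolates memory-related stalls on its own, which is the open question here.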

Peter Wang (Intel)

Hi Vineeth,

It doesn't make sense to use UOPS_EXECUTED.PORT234_CORE, because, as Dr. Levinthal's document says:
"The signals used to count the memory access uops executed (ports 2, 3 and 4) are the only core events which cannot be counted on a logical core or HT basis...the ALU ports (0,1,5) count on a per thread basis"

"Thus in the case where HT is enabled we have the following inequality
UOPS_EXECUTED.CORE_STALL_CYCLES <= True execution stalls per thread <= UOPS_EXECUTED.PORT015_STALL_CYCLES

Of course with HT disabled then
UOPS_EXECUTED.CORE_STALL_CYCLES = True execution stalls per thread = UOPS_EXECUTED.PORT015_STALL_CYCLES"

In most cases HT is enabled on the system, so to reduce complexity simply use UOPS_EXECUTED.CORE_STALL_CYCLES, whether HT is enabled or not.
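
The inequality quoted above can be sketched in Python as a pair of bounds on the true per-thread execution stalls. The function name is hypothetical, and the sample values are the ones posted earlier in this thread (in billions).

```python
def true_stall_bounds(core_stall_cycles, port015_stall_cycles):
    """Return (lower, upper) bounds on true execution stalls per thread.

    Follows the quoted HT-enabled inequality:
    core stall cycles <= true stalls <= port 0/1/5 stall cycles.
    """
    if core_stall_cycles > port015_stall_cycles:
        raise ValueError("core stall cycles should not exceed port 0/1/5 stall cycles")
    return core_stall_cycles, port015_stall_cycles


lower, upper = true_stall_bounds(core_stall_cycles=126, port015_stall_cycles=194)
print(f"true execution stalls lie between {lower} and {upper} billion cycles")
```

Using only the lower bound, as suggested above, is the simpler choice; the width of the interval shows how much uncertainty that simplification hides.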

Regards, Peter

vineethtm

Dear Peter, Thanks for your reply, and sorry for my long hiatus. I am back at this problem. I understand that UOPS_EXECUTED.CORE_STALL_CYCLES can be less than UOPS_EXECUTED.PORT015_STALL_CYCLES, because the first is counted only when all ports are stalled. Hence, a cycle in which any of ports 2, 3 or 4 is not stalled is not counted.

But my main concern remains: how do I calculate *memory related stalls* on Intel Core i7? The obvious step, it seems to me, is to get the stalls on PORT234 and express them as a percentage of total stalls (ports 0-5), but there is no counter for stalls on PORT234 (the memory ports). Is there a way to get that, or any other way to calculate what percentage of total CPU cycle stalls is due to memory-related causes rather than other causes (such as the ALU)? My machine has HT disabled, by the way. Thanks a lot, again! Vineeth

Peter Wang (Intel)

Hi Vineeth,

Please read this article for optimization guidelines for Intel Core i7 processors.

Hope it helps.

Regards, Peter

Panagiotis F.

Vineethtm,

Any luck with your quest? Did you manage to count the memory-related stall cycles? My understanding is that during a load or a store, there is no "stall" cycle. Maybe the event UOPS_EXECUTED.PORT234 is what you (and I) are searching for?

Best,
Panagiotis
