Bandwidth analysis on Xeon Phi

Surya Narayanan N.

According to the tutorials, bandwidth analysis can be performed in two ways: 1. knc-custom analysis (from core events), or 2. knc-bandwidth (uncore events only).

http://www.youtube.com/watch?v=vnOqpyzui_s

I would like to do it the first way, using the formulas below, and I have some questions about it.

Given formulas:

Read bandwidth (bytes/clock)
(L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED

Write bandwidth (bytes/clock)
(L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
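
For illustration, the two formulas above can be written as a small Python sketch. The function and parameter names are hypothetical (they are not part of VTune), and the input values are taken from the "Hardware Event Count:Self" column of the event summary posted further down; whether those counts cover only the application of interest is exactly what the questions below ask, so treat the result as a worked example of the arithmetic rather than a validated measurement.

```python
CACHE_LINE_BYTES = 64  # the factor 64 in the formulas (bytes per cache line)


def read_bandwidth(read_miss_mem_fill, write_miss_mem_fill, hwp_l2miss, clk_unhalted):
    """Read bandwidth in bytes/clock, following the first formula above."""
    if clk_unhalted <= 0:
        raise ValueError("CPU_CLK_UNHALTED must be positive")
    events = read_miss_mem_fill + write_miss_mem_fill + hwp_l2miss
    return events * CACHE_LINE_BYTES / clk_unhalted


def write_bandwidth(victim_req_with_data, snp_hitm_l2, clk_unhalted):
    """Write bandwidth in bytes/clock, following the second formula above."""
    if clk_unhalted <= 0:
        raise ValueError("CPU_CLK_UNHALTED must be positive")
    events = victim_req_with_data + snp_hitm_l2
    return events * CACHE_LINE_BYTES / clk_unhalted


# Event counts copied from the event summary in this post.
read_bw = read_bandwidth(
    read_miss_mem_fill=714000,   # L2_DATA_READ_MISS_MEM_FILL
    write_miss_mem_fill=0,       # L2_DATA_WRITE_MISS_MEM_FILL
    hwp_l2miss=91000,            # HWP_L2MISS
    clk_unhalted=49840000,       # CPU_CLK_UNHALTED
)
write_bw = write_bandwidth(
    victim_req_with_data=0,      # L2_VICTIM_REQ_WITH_DATA
    snp_hitm_l2=0,               # SNP_HITM_L2
    clk_unhalted=49840000,
)

print(f"read bandwidth:  {read_bw:.4f} bytes/clock")
print(f"write bandwidth: {write_bw:.4f} bytes/clock")
```

Multiplying the bytes/clock figure by the clock frequency in Hz would give bytes per second; the frequency is not stated in this thread, so it is left out here.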

I run my multi-threaded application from a script that sets up some environment variables before calling the application, like this:

amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED:sa=10000, (other events with their sampling frequency) -- ssh mic0 "./script.sh"

 

Q1: Are these the statistics of the script only, or of all the processes running during collection? How can I get the statistics of just the application that my script started?

I get an event summary like this:

Event summary
-------------
Hardware Event Type            Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
-----------------------------  -------------------------  --------------------------------  -----------------
HWP_L2MISS                                         91000                                13               1000
CPU_CLK_UNHALTED                                49840000                               712              10000
L2_DATA_READ_MISS_CACHE_FILL                      336000                                48               1000
L2_DATA_READ_MISS_MEM_FILL                        714000                               102               1000
L2_DATA_WRITE_MISS_CACHE_FILL                          0                                 0               1000
L2_DATA_WRITE_MISS_MEM_FILL                            0                                 0               1000
L2_VICTIM_REQ_WITH_DATA                                0                                 0               1000
SNP_HITM_L2                                            0                                 0               1000

Then I view my results using:

amplxe-cl -report hw-events -format=csv -csv-delimiter=comma -report-output=output.csv -show-as=sample -r /home//bandwidth2/ -call-stack-mode=user-only -cumulative-threshold-percent=loop -group-by=process

Q2: Again, does this give the statistics of the script? They are very different from the summary. How can I get the statistics of just the application I am interested in, which was spawned by the script?

Q3: Since I collected the samples with a certain sample-after value (e.g. sa=1000), when I calculate the bandwidth should I multiply the sample count (for example, Hardware Event Sample Count:Self for L2_DATA_WRITE_MISS_MEM_FILL) by the "sa" value to get the correct bandwidth?
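
As a sketch of the arithmetic behind Q3: a sample is recorded once per "sa" occurrences of an event, so an estimated event count is the sample count times that event's sa value. The helper name below is hypothetical, and the sample counts and sa values are copied from the event summary above. Whether VTune's reported "Hardware Event Count" already includes this scaling (or any further scaling) is not established in this thread, so this only illustrates why each event must be scaled by its own sa value before the formula is applied.

```python
def estimated_event_count(sample_count, sample_after):
    """Estimate an event count from a sample count and its sample-after value."""
    return sample_count * sample_after


# (sample count, sample-after value) per event, from the event summary above.
samples = {
    "CPU_CLK_UNHALTED": (712, 10000),
    "L2_DATA_READ_MISS_MEM_FILL": (102, 1000),
    "L2_DATA_WRITE_MISS_MEM_FILL": (0, 1000),
    "HWP_L2MISS": (13, 1000),
}

scaled = {name: estimated_event_count(n, sa) for name, (n, sa) in samples.items()}

reads = (scaled["L2_DATA_READ_MISS_MEM_FILL"]
         + scaled["L2_DATA_WRITE_MISS_MEM_FILL"]
         + scaled["HWP_L2MISS"])
scaled_read_bw = reads * 64 / scaled["CPU_CLK_UNHALTED"]

# Using raw sample counts ignores that CPU_CLK_UNHALTED was sampled with a
# different sa value than the L2 events, which skews the ratio.
raw_reads = sum(samples[k][0] for k in samples if k != "CPU_CLK_UNHALTED")
raw_read_bw = raw_reads * 64 / samples["CPU_CLK_UNHALTED"][0]

print(f"scaled by sa: {scaled_read_bw:.4f} bytes/clock")
print(f"raw samples:  {raw_read_bw:.4f} bytes/clock")
```

Because CPU_CLK_UNHALTED uses sa=10000 and the other events use sa=1000, the unscaled ratio is off by a factor of ten in this example.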

Surya Narayanan N.

Q4: During collection there is a parameter "cpu-mask". If I set it to 0, does that mean hardware events are monitored only on CPU 0? If I set it to "all", does it monitor all 240 logical CPUs? If so, won't my statistics be skewed by events from applications other than my multi-threaded application? I would like to know how to use this parameter.

iliyapolak

I do not know how it is implemented on Xeon Phi, but on Intel CPUs performance counters are not pinned to a specific OS thread, although they can be set to track user-mode or kernel-mode activity. For example, while you are measuring the performance of some application (process), the OS scheduler may swap out the thread that is currently being monitored and schedule an unrelated thread on the same core; in that situation the performance counter will record events generated by a different process. You can set the thread's affinity to a specific core and set its priority to very high in order to prevent it from being swapped out. Still, this relates to Windows and to general-purpose CPUs. It is VTune's job to resolve the addresses (by tracking the IP) of the currently executing thread.
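
The affinity suggestion above might be sketched as follows in Python on Linux. This is only an illustration under stated assumptions: `pin_current_process` is a hypothetical helper, `os.sched_setaffinity` is available only on some platforms (Linux among them), and pinning alone does not stop other processes from being scheduled on the same CPU, so it narrows where the work runs rather than isolating the counters.

```python
import os


def pin_current_process(cpu_ids):
    """Restrict the calling process to the given CPUs where supported.

    Returns the resulting affinity set, or None if the platform does not
    expose sched_setaffinity.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity API not available on this platform
    allowed = os.sched_getaffinity(0)      # CPUs we may currently run on
    target = set(cpu_ids) & allowed        # keep only CPUs we may use
    if not target:
        # Assumption for this sketch: fall back to the lowest allowed CPU
        # instead of failing when none of the requested CPUs is usable.
        target = {min(allowed)}
    os.sched_setaffinity(0, target)        # 0 means the calling process
    return os.sched_getaffinity(0)


affinity = pin_current_process([0])
print("affinity after pinning:", affinity)
```

Worker threads or child processes started after this call inherit the affinity, so a launcher script could pin itself before starting the application.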