Help using Vtune for hardware performance counter monitoring

Hi All,

I am new to VTune and have a few basic questions to get me started. Please help me out.

I am working on a Xeon E5345 processor and I want to monitor a few hardware performance events (such as L1 misses, L2 hits/misses, LLC hits/misses, etc.) using VTune's command-line interface. From reading some guides and the user forum, I think I can run VTune like this:

amplxe-cl -collect-with runsa -knob event-config=CPU_CLK_UNHALTED.CORE:sa=1000000 -- C:\test\sample.exe

My questions are:-

a) What are all the different kinds of events available for monitoring on my system? Is there any limit on which events, and how many, can be monitored together?

b) I also found a "collect general exploration" analysis option in VTune. How is it different from running VTune with the "collect-with" and "runsa" options?

Thanks for your help.

- Rene

Hi Rene:

Yes, you should read up on General Exploration and utilize this predefined set of events, which basically collects information on all areas of processor efficiency, including cache metrics.  The product documentation is available online (Windows* | Linux*), as well as on the system after installation. This would be a good place to start. 

Events are specific to a processor.  To view all available events on the current system from the command line, execute the command 'amplxe-runss -event-list'.  And, while the VTune Amplifier XE documentation includes descriptions, the authoritative guide to processor events is the Software Developer Manual (SDM).  Volume 3 includes info on performance monitoring.

The graphical interface (GUI) will automatically limit you to only trying to collect events applicable to your system, while the command line can only inform you *after* you've attempted a collection.  The GUI will also provide a list of available events when you create your own "custom analysis type", so it is easier to use wrt event selection and configuration.

Finally, can you execute the command 'amplxe-runss -context-value-list' and tell me the value of "PMU"?

Regards, MrAnderson

Also, more information about our "top down" methodology and using General Exploration is available at http://www.intel.com/vtune-tuning-guides.

Regards, MrAnderson

Hi Mr. Anderson,

Thanks for pointing me to the useful documentation, I will read through them.

Since I am working on a remote server (via ssh), I am using the command-line options of vtune instead of the GUI option. So, I am looking for options to configure the events through command line options.

I executed the command and the PMU value is "core2".

I am reading through the "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B", chapter 19. Given my server system (XEON E5345), I am looking at Chapter 19.9 (PERFORMANCE MONITORING EVENTS FOR INTEL® XEON® PROCESSOR 3000, 3200, 5100, 5300 SERIES AND INTEL® CORE™2 DUO PROCESSORS), and it provides a list of all the performance events. Is this applicable to my system then?

Thanks again,

Rene

 

Yes, but, again, you can just use the 'general-exploration' analysis type and all the right events will be collected to triage any inefficient code.  Once you have identified areas of concern, you can select specific events to do a "deep dive."

Regards, MrAnderson

Thank you for your suggestion, I will give that a try.

 

Hi Mr Anderson,

I am trying to analyze a few applications without specifically attaching to them. I ran VTune as follows:
     amplxe-cl -collect general-exploration -analyze-system -allow-multiple-runs -- ./EXECUTABLE

     To generate the report, I used the following command:
     amplxe-cl -report hw-events -group-by=process -report-output=/path-to-final-output/outfile-name

My final output file contains the grouping by processes, but it contains only the following few hardware events:-

Process                Hardware Event Count:CPU_CLK_UNHALTED.CORE:Self (K), Hardware Event Count:INST_RETIRED.ANY:Self (K), Hardware Event Count:CPU_CLK_UNHALTED.REF:Self (K), Hardware Event Count:MEM_LOAD_RETIRED.L2_LINE_MISS:Self (K), Hardware Event Count:BUS_TRANS_MEM.ALL_AGENTS:Self (K), Hardware Event Count:MEM_LOAD_RETIRED.L2_MISS:Self (K), Hardware Event Count:BUS_TRANS_BURST.SELF:Self (K), Hardware Event Count:RESOURCE_STALLS.BR_MISS_CLEAR:Self (K), Hardware Event Count:RS_UOPS_DISPATCHED.CYCLES_NONE:Self (K), Hardware Event Count:CYCLES_L1I_MEM_STALLED:Self (K), Hardware Event Count:BR_MISSP_EXEC:Self (K)

Why are the other events not reported here? Or is this the default configuration for General Exploration?

Thanks,

Rene

 

 

Hi Rene:

The event set for GE is defined by the processor you are collecting data on.  You don't say how long you collected data, but if those are the events displayed then that is all that was collected, either by-design or because the other events did not occur.

So, I checked the configuration and those are the correct events for Core(TM)2 processors.  You have to realize that Core 2 is three generations behind.  Also, there are several other Core 2 analysis types available: core2-bandwidth-breakdown, core2-cycles-uops, core2-memory-access.  I don't know if those will provide any more information, but you can try and see.

If you want other events, you will have to create your own list and specify them on the command line.  I'm sorry, but we developed our tuning methodology *after* the Core 2 processors, basically starting with micro-architecture codename Nehalem and later.

Regards, MrAnderson

Hi Mr. Anderson,

As you suggested, I tried creating a custom event-list and ran vtune as follows:-

amplxe-cl -collect-with runsa -knob event-config=INST_RETIRED.ANY_P,INST_RETIRED.LOADS,INST_RETIRED.STORES,INST_RETIRED.OTHER,BR_INST_RETIRED.ANY,UOPS_RETIRED.ANY -analyze-system -allow-multiple-runs -- ~/executable

I have not specified a collection duration here because I want to collect statistics until my application run finishes.

After my run finished, the overall summary looked as follows:

Event summary
-------------
Hardware Event Type  Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample
-------------------  -------------------------  --------------------------------  -----------------
INST_RETIRED.ANY_P                453778000000                            226889  2000000
INST_RETIRED.LOADS                167182000000                             83591  2000000
INST_RETIRED.STORES                80882000000                             40441  2000000
INST_RETIRED.OTHER                213386000000                            106693  2000000
BR_INST_RETIRED.ANY                68788000000                             68788  1000000
UOPS_RETIRED.ANY                  572308000000                            286154  2000000

Then I generated a report (with hw-events, grouped by process), and the process I am interested in has the following stats:

Hardware Event Count:INST_RETIRED.ANY_P:Self (M)  449832
Hardware Event Count:INST_RETIRED.LOADS:Self (M)  166042
Hardware Event Count:INST_RETIRED.STORES:Self(M)  80398
Hardware Event Count:INST_RETIRED.OTHER:Self (M)  211834
Hardware Event Count:BR_INST_RETIRED.ANY:Self(M)  68236
Hardware Event Count:UOPS_RETIRED.ANY:Self   (M)  547458

I have a few questions regarding this:

(A) Are the above stats in millions? Otherwise the instruction counts for the process look very small, and the counts don't add up to the overall stats.

(B) I ran VTune allowing multiple runs of the application, so are these numbers the average of all the runs or the sum of all the runs?

(C) If I want to compute, say, the CPI, can I simply divide the CPU_CLK_UNHALTED.CORE value by the INST_RETIRED.ANY_P value? Or do I have to take the SAV values (the defaults in my case) into account as well?

Thanks again for your help.

Regards,

Rene
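A quick sanity check on question (A) and on the SAV part of question (C), as a sketch only: the Python below copies the numbers from the two tables in this post and assumes that "Events Per Sample" is the sample-after value (SAV) and that "(M)" in the per-process header means millions. The variable names are illustrative and not part of any VTune output.

```python
# Rows copied from the "Event summary" table:
# event -> (sample count, events per sample, reported event count)
summary = {
    "INST_RETIRED.ANY_P":  (226889, 2_000_000, 453_778_000_000),
    "INST_RETIRED.LOADS":  (83591,  2_000_000, 167_182_000_000),
    "INST_RETIRED.STORES": (40441,  2_000_000, 80_882_000_000),
    "INST_RETIRED.OTHER":  (106693, 2_000_000, 213_386_000_000),
    "BR_INST_RETIRED.ANY": (68788,  1_000_000, 68_788_000_000),
    "UOPS_RETIRED.ANY":    (286154, 2_000_000, 572_308_000_000),
}

# Per-process values from the grouped report, in the unit shown there: (M).
process_m = {
    "INST_RETIRED.ANY_P":  449832,
    "INST_RETIRED.LOADS":  166042,
    "INST_RETIRED.STORES": 80398,
    "INST_RETIRED.OTHER":  211834,
    "BR_INST_RETIRED.ANY": 68236,
    "UOPS_RETIRED.ANY":    547458,
}

# Does every reported count equal samples * events-per-sample?
counts_match = all(s * sav == count for s, sav, count in summary.values())

# Share of the system-wide total attributed to the process,
# assuming the per-process numbers are in millions.
shares = {
    event: process_m[event] * 1_000_000 / summary[event][2]
    for event in summary
}

print("count == samples * SAV for every row:", counts_match)
for event, share in shares.items():
    print(f"{event:20s} process share of system total: {share:.1%}")
```

For these numbers the reported counts match samples times SAV exactly, and the per-process values, read as millions, come to roughly 96 to 99 percent of the system-wide totals. That suggests the reported counts already include the SAV scaling, so ratios such as CPI can presumably be formed from the reported counts directly.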
 

Hi Mr Anderson,

I am not very sure of some of the results I am getting by running my application. For example, I am trying to collect statistics of the instruction mix of the application:-

Hardware Event Count:INST_RETIRED.ANY:Self (M)            1345498
Hardware Event Count:CPU_CLK_UNHALTED.CORE:Self (M)       1042754
Hardware Event Count:INST_RETIRED.ANY_P:Self (M)          448526
Hardware Event Count:INST_RETIRED.LOADS:Self (M)          165702
Hardware Event Count:INST_RETIRED.STORES:Self (M)         80254
Hardware Event Count:INST_RETIRED.OTHER:Self (M)          211444
Hardware Event Count:BR_INST_RETIRED.ANY:Self (M)         68272
Hardware Event Count:UOPS_RETIRED.ANY:Self (M)            547870

a) Why is INST_RETIRED.ANY so different from INST_RETIRED.ANY_P? Aren't they both supposed to count the number of instructions executed?

b) Also, UOPS_RETIRED.ANY is much smaller than INST_RETIRED.ANY, which would make uops/inst < 1, how is that possible?

I have two related questions regarding VTune usage, then.

a) Is there a limit on how many events I can monitor in any given run? There are only a fixed number of hardware counters available on the system, so what happens if I specify more events than that?

b) Also, are there any sets of hardware events that cannot be monitored together?

Thanks,

Rene
 

Hi All,

Can you please help me understand the difference between INST_RETIRED.ANY and INST_RETIRED.ANY_P above? In my runs, INST_RETIRED.ANY is 3 times the value of INST_RETIRED.ANY_P, and the value of UOPS_RETIRED.ANY is much smaller than INST_RETIRED.ANY. Could this be related to the fact that I am allowing multiple runs of my application?

Any help is really appreciated.

Thanks,

Rene

 

When you allow multiple runs, adding events increases the number of runs; VTune collects as many events as possible in each run. For short runs, which is where you would typically select multiple runs, this shouldn't aggravate problems such as cache warm-up and data structure initialization, which may affect the first run. Multiple runs may be more repeatable if you set affinity.
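To make the "more events means more runs" point concrete, here is a minimal Python sketch. It assumes two general-purpose counters per core, as on Core 2, and a naive in-order packing of events into runs. VTune's real grouping logic is not described in this thread and may differ, for example because some events are restricted to particular counters; `plan_runs` is a hypothetical helper, not a VTune function.

```python
def plan_runs(events, counters_per_run):
    """Split events into consecutive groups of at most counters_per_run."""
    return [
        events[i:i + counters_per_run]
        for i in range(0, len(events), counters_per_run)
    ]

# The six programmable events from the custom event-config used earlier.
events = [
    "INST_RETIRED.ANY_P",
    "INST_RETIRED.LOADS",
    "INST_RETIRED.STORES",
    "INST_RETIRED.OTHER",
    "BR_INST_RETIRED.ANY",
    "UOPS_RETIRED.ANY",
]

# Assumption: 2 general-purpose counters are available in each run.
runs = plan_runs(events, counters_per_run=2)
for number, group in enumerate(runs, start=1):
    print(f"run {number}: {', '.join(group)}")
```

Under these assumptions six events need three runs, which happens to match the three application runs reported later in the thread.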

Hi Tim,

Since I added the allow-multiple-runs command-line option, my application ran 3 times before the collection stopped. The total time for the 3 application runs was around 9 minutes (not so short). What I am not able to understand is why INST_RETIRED.ANY is almost 3 times the value of INST_RETIRED.ANY_P below. Shouldn't the two values be almost the same?

INST_RETIRED.ANY:Self (M)            1345498

INST_RETIRED.ANY_P:Self (M)          448526

UOPS_RETIRED.ANY:Self (M)            547870

Also, the value of UOPS_RETIRED.ANY is much smaller than INST_RETIRED.ANY, which makes uops/inst << 1.  How can that be possible?

Is INST_RETIRED.ANY the value for all 3 application runs together, and INST_RETIRED.ANY_P the value for only one application run? In other words, can the discrepancy between the two values be related to the fact that I am allowing multiple runs of my application?

Thanks,

Rene
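The arithmetic behind this question can be laid out in a few lines of Python. This is only a sketch using the three values quoted above. The hypothesis it tests, that INST_RETIRED.ANY is accumulated over all three application runs while the other two events come from a single run each, is an assumption at this point in the thread and not confirmed VTune behavior.

```python
# Values quoted in the post, in millions (M).
inst_retired_any = 1_345_498
inst_retired_any_p = 448_526
uops_retired_any = 547_870
app_runs = 3  # the application ran 3 times under -allow-multiple-runs

# How many times larger is INST_RETIRED.ANY than INST_RETIRED.ANY_P?
any_over_any_p = inst_retired_any / inst_retired_any_p

# uops per instruction with each candidate denominator.
uops_per_inst_any = uops_retired_any / inst_retired_any
uops_per_inst_any_p = uops_retired_any / inst_retired_any_p

print(f"INST_RETIRED.ANY / INST_RETIRED.ANY_P = {any_over_any_p:.3f} (runs: {app_runs})")
print(f"uops/inst using INST_RETIRED.ANY      = {uops_per_inst_any:.3f}")
print(f"uops/inst using INST_RETIRED.ANY_P    = {uops_per_inst_any_p:.3f}")
```

The ratio of the two instruction counts is very close to the number of runs, and uops per instruction only rises above 1 when INST_RETIRED.ANY_P is used as the denominator. Both observations are consistent with the hypothesis, though they do not prove it.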

Since no one seems to be popping up with an answer, perhaps it would be easiest to re-run the test one time to see how INST_RETIRED.ANY and INST_RETIRED.ANY_P are related?

If you are running on Linux and have root access, it is fairly easy to program and read the counters manually using the "wrmsr" and "rdmsr" programs from the msrtools-1.2 package.   This does not provide the PC sampling capability of VTune, and does not automatically deal with counter overflows or process migration, but it is an easy way to get whole-program counts for each core.

John D. McCalpin, PhD "Dr. Bandwidth"
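As a rough illustration of the manual approach, the Python sketch below builds the value for an IA32_PERFEVTSEL register and prints the corresponding `wrmsr` and `rdmsr` command lines for one core. The register addresses, bit positions, and the INST_RETIRED.ANY_P encoding (event select 0xC0, unit mask 0x00) follow the architectural performance monitoring description in the SDM; the helper name `perfevtsel` and the choice of core 0 are illustrative. The sketch only prints commands. It does not zero the counter first, and it does not touch IA32_PERF_GLOBAL_CTRL, which may also need its enable bit set on this processor.

```python
IA32_PERFEVTSEL0 = 0x186  # event-select register for programmable counter 0
IA32_PMC0 = 0xC1          # programmable counter 0

USR = 1 << 16  # count while in user mode
OS = 1 << 17   # count while in kernel mode
EN = 1 << 22   # enable the counter


def perfevtsel(event_select, umask, user=True, kernel=True):
    """Build an IA32_PERFEVTSELx value from an event select and unit mask."""
    value = (event_select & 0xFF) | ((umask & 0xFF) << 8) | EN
    if user:
        value |= USR
    if kernel:
        value |= OS
    return value


# INST_RETIRED.ANY_P: event select 0xC0, unit mask 0x00.
value = perfevtsel(0xC0, 0x00)
cpu = 0  # illustrative choice; repeat for each core of interest

program_cmd = f"wrmsr -p {cpu} {IA32_PERFEVTSEL0:#x} {value:#x}"
read_cmd = f"rdmsr -p {cpu} {IA32_PMC0:#x}"
print(program_cmd)
print(read_cmd)
```

Reading the counter before and after the workload and subtracting gives a whole-program count for that core, with the caveats about overflow and process migration mentioned above.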

Hi Dr. McCalpin,

Thanks for your reply.

When I run my application once, INST_RETIRED.ANY and INST_RETIRED.ANY_P are equal to each other. When my application is run multiple times during a single VTune run (allow-multiple-runs), INST_RETIRED.ANY becomes (number of times the application is run) times INST_RETIRED.ANY_P. The other programmable event stats I get from VTune are of the same order as the INST_RETIRED.ANY_P values (for example, the numbers of loads and stores are about 30% and 20% of the INST_RETIRED.ANY_P value respectively, which is what I would expect anyway). So I am using the value of INST_RETIRED.ANY_P instead of INST_RETIRED.ANY for IMIX calculations.

I have root access on my machine, but I do not have instrumentation capability over the application. I want to monitor the stats for a database server (continuously running), when a particular client connection is active. So I am using vtune to profile the entire system when the client connection is active and grouping my stats by the process (which gives me the stats for the server application). I am not sure how to use wrmsr and rdmsr in such a case.

Thanks,

Rene
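The instruction-mix calculation described here can be sketched as follows in Python, using the per-process values (in millions) posted earlier in the thread and INST_RETIRED.ANY_P as the denominator. The dictionary layout and names are illustrative only.

```python
# Per-process counts in millions (M), copied from the earlier post.
counts_m = {
    "INST_RETIRED.ANY_P": 448526,
    "INST_RETIRED.LOADS": 165702,
    "INST_RETIRED.STORES": 80254,
    "INST_RETIRED.OTHER": 211444,
    "BR_INST_RETIRED.ANY": 68272,
}

# Use the programmable-counter instruction count as the base, since the
# other events were collected the same way.
base = counts_m["INST_RETIRED.ANY_P"]
mix = {
    event: count / base
    for event, count in counts_m.items()
    if event != "INST_RETIRED.ANY_P"
}

for event, fraction in mix.items():
    print(f"{event:20s} {fraction:6.1%} of retired instructions")
```

Because each event is sampled and may come from a different application run, the load, store, and other fractions need not sum to exactly 100 percent.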

 
