Erroneous detailed hardware metrics in VTune Amplifier XE 2013

I'm having problems with my analysis results in VTune. Whenever I run an analysis, the hardware event-based metrics behave strangely. The top-level "aggregator" metric (e.g. Bad Speculation) seems to show correct values, but when I expand it, the detailed metrics (e.g. Branch Mispredict, Machine Clears) always show zero (the blue bar is missing).

To illustrate, here is a C++ code snippet that should trigger a large number of L1 cache misses:

/* ... */

/* kEvilOffset is 0x1000; accessing data at this offset should result in a large number of L1D replacements (assuming a 32 KB L1 data cache) */

__declspec(noinline) void DoStuff() {
  for (size_t i = 0; i < kAllocSize - kEvilOffset; ++i)
  for (size_t j = i; j < kAllocSize; j += kEvilOffset) {
    data[i] *= 39;

/* ... */
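
The reasoning in the comment above can be checked with a small sketch. It assumes an L1D geometry that the post does not state beyond the 32 KB size: 8-way set associative with 64-byte lines, which gives 64 sets. `L1SetIndex` and the constants are hypothetical names used only for illustration. Under these assumptions, byte addresses that differ by a multiple of 0x1000 map to the same set, so more than 8 such lines cannot stay in the cache at once. Since kEvilOffset is an element index, the byte stride is 0x1000 times the element size, which is still a multiple of 0x1000.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed L1D geometry (only the 32 KB size comes from the post):
// 32 KB total, 8 ways, 64-byte lines -> 64 sets.
constexpr std::size_t kLineSize = 64;
constexpr std::size_t kWays = 8;
constexpr std::size_t kCacheSize = 32 * 1024;
constexpr std::size_t kNumSets = kCacheSize / (kLineSize * kWays);

// Set that a byte address maps to under the assumed geometry.
constexpr std::size_t L1SetIndex(std::uintptr_t address) {
  return (address / kLineSize) % kNumSets;
}
```

With this geometry, `L1SetIndex(base)` and `L1SetIndex(base + k * 0x1000)` are equal for any `k`, which is why the stride is expected to cause replacements, provided the strided index is the one actually used to access the array.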

After running a "General Exploration (Sandy Bridge / Ivy Bridge / Haswell)" analysis, the blue bar for "Back-End Bound" is very wide (which, by the way, should be correct, since cache misses fall into that category), but when I expand that category, there are no blue bars at all for any of its sub-metrics. I also attached two screenshots. Am I doing something wrong?

I'm compiling with Intel C++ 14.0, and using VTune Amplifier XE 2013 Update 15 (build 328102).


Very good question.

1. Right-click the data and use Show data as to switch from Bar to Number, so you can see the exact values.

2. Hover the cursor over a column (metric) to see an explanation of that metric and the formula used to generate its data.

3. Each metric, whether it is a top-level or a lower-level one, is calculated from different events.

The numeric representation shows 0.681 for the category "Back-End Bound", and 0.000 for every sub-category of it (it did not help :( ).


Peter E. wrote:

The numeric representation shows 0.681 for the category "Back-End Bound", and 0.000 for every sub-category of it (it did not help :( ).


My understanding is that the sub-category metrics (e.g. Memory Latency, Memory Replacements, Memory Reissues) contribute to "Back-end Bound". However, other factors used in the formula, for example IDQ uops not delivered and uops issued, each divided by CPU clocks, also affect the Back-end Bound value.
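
As a rough illustration of how such a formula combines events, here is a minimal sketch of the level-1 breakdown from Intel's published top-down method. This is an assumption: it is not necessarily the exact formula VTune Amplifier XE 2013 uses (the metric tooltip shows the real one). The struct, the function names, and the 4-slots-per-cycle constant are chosen for illustration; the event names in the comments are the ones that method uses.

```cpp
// Raw event counts, named after the events assumed by this sketch.
struct TopDownCounts {
  double clocks;              // CPU_CLK_UNHALTED.THREAD
  double uops_not_delivered;  // IDQ_UOPS_NOT_DELIVERED.CORE
  double uops_issued;         // UOPS_ISSUED.ANY
  double retire_slots;        // UOPS_RETIRED.RETIRE_SLOTS
  double recovery_cycles;     // INT_MISC.RECOVERY_CYCLES
};

// Assumed pipeline width: 4 issue slots per clock.
constexpr double kSlotsPerCycle = 4.0;

constexpr double Slots(TopDownCounts c) { return kSlotsPerCycle * c.clocks; }

constexpr double FrontEndBound(TopDownCounts c) {
  return c.uops_not_delivered / Slots(c);
}

constexpr double Retiring(TopDownCounts c) {
  return c.retire_slots / Slots(c);
}

constexpr double BadSpeculation(TopDownCounts c) {
  return (c.uops_issued - c.retire_slots +
          kSlotsPerCycle * c.recovery_cycles) / Slots(c);
}

// Back-End Bound is what remains after the other three categories.
constexpr double BackEndBound(TopDownCounts c) {
  return 1.0 - FrontEndBound(c) - BadSpeculation(c) - Retiring(c);
}
```

In this sketch Back-End Bound is a remainder computed from general pipeline events, not a sum of the memory sub-metrics, so it can be large even when every sub-metric reads 0.000.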

Hi Peter E.:

The reality is that the events available to the VTune Amplifier XE from the processor, while vaguely indicating a back-end bound issue, do not provide enough information to pinpoint the problem.  The newer processors are doing a better job.  So, while VTune Amplifier is indicating a *potential* performance issue in the back-end, the problem does not fall into one of the sub-metrics defined for the back-end.  That's the general answer.

Specifically, if you provide more details, we *may* be able to help.  For example, what processor are you collecting this data on?  And, can you share the results with us?  (zip up the results directory and either attach them here or submit an issue at Intel® Premier Support).

I'm performing the analysis on a Core i7 2600K. Since I have a student license, I'm not eligible for Intel Premier Support (as far as I know). I attached both the analysis results and the source code of the small test program I ran it on (it needs a file named data.dat, content irrelevant).


Attachment: CacheTest_0.zip (application/zip, 1004.93 KB)

Thanks for your result file and source. Actually, there was no memory penalty in the example, but I saw a high IDQ_UOPS_NOT_DELIVERED.CORE event count (which caused Back-end Bound to be highlighted). If you select this event in the timeline, the bottom-up report shows high IDQ_UOPS_NOT_DELIVERED.CORE at around 3.3s, so select the 3.3s - 3.35s time range, zoom in, and filter on the selection to generate a new report. You will see DoStuff() with a high Back-end Bound; double-click it to view the source line, which is:

Source Line: 21
Source: for (size_t j = i; j < kAllocSize; j += kEvilOffset) {

CPU_CLK_UNHALTED.THREAD:  3,840,000,000
INST_RETIRED.ANY:         5,268,000,000
CPI Rate: Total:          0.729
CPI Rate: Self:           0.729
Retiring:                 0.373
Bad Speculation:          0.007
Back-end Bound:           0.613
Front-end Bound:          0.011

This line is the inner loop. The event means that uops decoded into the IDQ were not delivered to the RAT (each clock allows only 4 uops into the RAT), so possibly a stall in the RAT? Sometimes you can adjust the algorithm or use Intel C/C++ Composer to optimize it.
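
The CPI Rate reported for line 21 can be reproduced from the two raw counts in the same row. This is a minimal sketch; `CpiRate` and `CategorySum` are hypothetical helper names, and only the numbers come from the report above.

```cpp
// Cycles per instruction: CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY.
constexpr double CpiRate(double clocks, double instructions) {
  return clocks / instructions;
}

// Sum of the four top-level categories; expected to be close to 1.
constexpr double CategorySum(double retiring, double bad_speculation,
                             double back_end_bound, double front_end_bound) {
  return retiring + bad_speculation + back_end_bound + front_end_bound;
}
```

For the row above, `CpiRate(3840000000.0, 5268000000.0)` is about 0.729, matching the CPI Rate columns, and `CategorySum(0.373, 0.007, 0.613, 0.011)` is about 1.004, so the four categories account for essentially all pipeline slots on that line.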


