VTUNE and VPU hotspot analysis

I'm using VTUNE to look at the hotspots in my code. I'm down to the vector instructions and am confused by a few things. I understand that VTUNE isn't a cycle-accurate simulator, but why is it that I see things like the following:

   ...
   vmovaps %zmm8, %k0, %zmm0                324.718 ms
   vmovaps %zmm0, %k0, %zmm8                 84.007 ms
   vpslld $0x1f, %zmm2, %k0, %zmm1           96.931 ms
   vpsrld $0x01, %zmm2, %k0, %zmm2          134.087 ms
   vpord %zmm2, %zmm1, %k0, %zmm1           143.781 ms
   vmovaps %zmm9, %k0, %zmm2                245.558 ms
   vmovaps %zmm21, %k0, %zmm9                75.929 ms
   ...

As far as I can tell there are no data dependencies on prior instructions.

Q1: why are the vmovaps times all over the map?

Q2: why is vpslld 30% different than vpsrld?

Q3: why is there no indication of a pipeline stall on the vpord (due to the prior vpslld/vpsrld instructions)?

Q4: though I can't show it here, all my "CPU time" histogram bars in the source and assembly windows are red indicating "poor".  How is a single vmovaps deemed to be "poor" (vs. Idle, Ok, Ideal, Over)?

My application is compiled with icc -g -debug extended -debug inline-debug-info -debug expr-source-pos -std=c9x -O3 -Wall -openmp -offload ...

Extra credit: It appears that the compiler (icc) is reluctant to rearrange vector instructions to avoid data-dependency pipeline stalls, especially if the instructions come from different source code expressions (i.e. different source lines). Is there some information as to how aggressive I should expect that compiler optimization to be?

Craig,

My apologies for the answer not being forthcoming. We talked to an expert several times over this long silent period. Though he has insisted that he will answer, he hasn't.

We'll pursue the answer through another route.

Regards
---
Taylor
 

Or another "expert," Taylor?  OK, if you're willing to take a "blink" response, I'll give it a shot.

As you say, VTune Amplifier is not cycle-accurate, and is not a simulator. It does some code instrumentation and a lot of sampling. Sampling is subject to sampling error, so everything here must be taken with that understanding.

Your example shows a sequence of instructions and one might initially be led to assume that they are executed in this sequence. However, this would be an incorrect assumption on a modern, out-of-order machine. Recognizing that the instructions are at best partially ordered in this sequence (as you acknowledge in your question about the dependencies on the OR), it becomes more understandable why the "times" on "adjacent" instructions are not the same. With Sample-After-Values possibly anywhere from 100,000 to 10,000,000 depending on expected overall sample times, the addition of one extra sample could have an unexpected impact on the total.

Moreover, there are other factors such as skid (the sample interrupt may occur with the IP pointed at the next instruction address, lumping some samples onto following instructions) that can lead to unequal samples on supposedly sequentially executed instructions. If, for example, the vmovaps on line 2 above was preceded by a memory move to fill %zmm8, I would expect to find more samples on the following instruction, because of time possibly spent waiting for memory or cache. Skid could also explain why the shift instructions don't have an identical number of samples: if the left shift of 31 places takes longer than the right shift of 1, samples taken during the left shift may show up attributed to the right shift. There is a register-naming dependency that may keep the third VMOVAPS in the ROB until the OR has been retired, broadening its window for receiving clock samples from instructions that may have been executed before it.

The OR would not cause a pipeline stall, just a delayed retirement from the ReOrder Buffer until the Reservation Station can detect the availability of both its arguments. The concurrency coloring that VTune Amplifier displays is based strictly on the concurrent readiness of the available HW threads, not on the effectiveness of Instruction Level Parallelism. On a multi-thread machine, a single VMOVAPS keeping only one thread active would always show up as poor utilization.
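To put the Sample-After-Value remark in concrete terms, here is a minimal sketch in C of how coarse a per-instruction time can be. The function names are made up for illustration, and the 1,000,000-tick Sample-After-Value and 1 GHz clock mentioned below are assumptions, not values taken from the collection in question.

```c
#include <stdint.h>

/* Nanoseconds of CPU time that a single sample stands for, assuming each
 * sample is taken after `sample_after_value` clock ticks on a core running
 * at `clock_hz`. Both inputs are hypothetical; read the real ones from the
 * collection settings. */
uint64_t ns_per_sample(uint64_t sample_after_value, uint64_t clock_hz)
{
    return sample_after_value * 1000000000ull / clock_hz;
}

/* Whole number of samples behind a reported time, given the granularity
 * computed above. */
uint64_t samples_for_ns(uint64_t reported_ns, uint64_t ns_each)
{
    return reported_ns / ns_each;
}
```

Under those assumed settings each sample is worth 1 ms, so the 84.007 ms and 75.929 ms entries in the listing would differ by fewer than ten samples, which is small enough for sampling noise and skid to matter.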

Performance analysis and tuning can be tricky using sample-based analysis tools on multi-thread machines with out-of-order instruction execution capabilities.  Our usual answer is to step back from the individual instruction observations to see the patterns within execution basic blocks and to account for architectural idiosyncrasies like skid where they might occur. Sorry that there's not a simpler answer, but hopefully this gets you closer to an understanding of what is happening with this code sequence.

MIC CPUs so far are in-order, so that part of the explanation would apply only to other CPU models.

It's still true that the events in question aren't "precise," and the timings due to a given instruction will spread from there to following instructions which share the pipeline ("skid").
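A toy model may make the skid effect easier to see. The sketch below is only an illustration under simplifying assumptions: a strictly in-order sequence, made-up per-instruction cycle costs, a fixed sampling period, and a fixed skid distance. The function `attribute_samples` and all of its parameters are hypothetical and are not part of VTune or of any hardware interface.

```c
#include <stddef.h>

/* Attribute periodic clock samples to the instructions of a loop body.
 *   cycles[i]  : assumed cost of instruction i, in cycles
 *   n          : number of instructions in the loop body
 *   iterations : how many times the body runs
 *   period     : cycles between samples (must be > 0)
 *   skid       : how many instructions late a sample lands (0 = precise)
 *   samples[i] : output, number of samples charged to instruction i
 */
void attribute_samples(const unsigned *cycles, size_t n, unsigned iterations,
                       unsigned period, size_t skid, unsigned *samples)
{
    unsigned long long clock = 0;      /* cycles elapsed so far */
    unsigned long long next = period;  /* cycle at which the next sample fires */

    for (size_t i = 0; i < n; i++)
        samples[i] = 0;
    if (period == 0)
        return;                        /* avoid an endless loop on bad input */

    for (unsigned it = 0; it < iterations; it++) {
        for (size_t i = 0; i < n; i++) {
            unsigned long long end = clock + cycles[i];
            /* Every sample that fires while instruction i is executing is
             * charged to the instruction `skid` places further on. */
            while (next <= end) {
                samples[(i + skid) % n]++;
                next += period;
            }
            clock = end;
        }
    }
}
```

With costs of {10, 1, 1} cycles and a skid of 1, almost all samples land on the second instruction even though the first one is the slow one, which is the kind of shift to keep in mind when reading per-instruction times.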

MIC performance does depend to a significant extent on the compiler sorting the instructions within a block so as to deal with the latencies. You say there are no dependencies, yet it does appear that the vpord instruction requires the results of the two preceding shifts. One would expect other dependencies in the loop to be longer and have more influence on the selected order.
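For reference, the shift/shift/or triple in the listing has the shape of a 32-bit rotate right by one bit. The C below is a guess at what such source might look like, not the original poster's code, and whether a given compiler emits exactly vpslld/vpsrld/vpord for it is an assumption. The name `rotr1_u32` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate every 32-bit element of src right by one bit and store it in dst. */
void rotr1_u32(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t hi = src[i] << 31; /* left shift by 31: does not depend on lo */
        uint32_t lo = src[i] >> 1;  /* right shift by 1: does not depend on hi */
        dst[i] = hi | lo;           /* OR: needs both shift results */
    }
}
```

The two shifts are independent of each other, so they can be scheduled in either order, but the OR cannot produce its result until both have completed. That is the dependency referred to above.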
