Developer Guide

Contents

Interpret Performance Counter Data

Profiling information helps in identifying poor memory or pipe behaviors that lead to unsatisfactory kernel performance.
The following sections explain the Profiler metrics that the Profiler reports record.

Stall, Occupancy, Bandwidth

For specific lines of kernel code, the
CPU/FPGA Interaction View
shows stall percentage, occupancy percentage, and average memory bandwidth. For definitions of stall, occupancy, data transfer size, and bandwidth, refer to Table 1.
DPC++ generates a pipeline architecture where work-items traverse through the pipeline stages sequentially. For more information, refer to Pipelining.
Simplified Representation of a Kernel Pipeline Instrumented with Performance Counters
The following are simplified equations that describe how the Profiler calculates stall, occupancy, and bandwidth:
ivalid_count
in the bandwidth equation also includes the
predicate=true
input to the load-store unit.
Ideal kernel pipeline conditions:
  • Stall percentage equals 0%
  • Occupancy percentage equals 100%
  • Bandwidth equals the board's bandwidth
For a given location in the kernel pipeline, if the sum of the stall percentage and the occupancy percentage approximately equals 100%, the Profiler identifies the location as the stall source. If the stall percentage is low, the Profiler identifies the location as the victim of the stall.
The Profiler reports a high occupancy percentage if the compiler generates a highly efficient pipeline from your kernel, where work-items or iterations are moving through the pipeline stages without stalling. The following notes may be helpful when interpreting profiling data:
  • If all LSUs are accessed the same number of times, they have the same occupancy value.
  • If work-items cannot enter the pipeline consecutively, bubbles are inserted to the pipeline.
  • In loop pipelining, loop-carried dependencies also form bubbles in the pipeline because of bubbles that exist between iterations. For more information about bubbles, refer to Occupancy.
  • If an LSU is accessed less frequently than other LSUs, such as the case when an LSU is outside a loop that contains other LSUs, this LSU has a lower occupancy value than the other LSUs. This rule also applies to pipes.

Stalling Pipes

Pipes provide a point-to-point communication link between two kernels.
Stalls occur if there is an imbalance between the read and write sides of the pipe, or if the read and write kernels are not running concurrently.
For example, if the kernel that reads is not launched concurrently with the kernel that writes, or if the read operations occur much slower than the write operations, the Profiler identifies a stall for the
ProducerToConsumerPipe::write
call in the write kernel.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804