Developer Guide

Loops Analysis

The report.html file contains information about all loops in your design and their unroll statuses. The Loops Analysis report helps you examine whether the Intel® oneAPI DPC++/C++ Compiler can maximize the throughput of your kernels.
To view the Loops Analysis report, click Throughput Analysis > Loops Analysis. The purpose of this view is to show estimates of performance indicators (such as II) and potential performance bottlenecks. For each loop, you can identify the following using the report:
  • Whether the loop is pipelined
  • Whether the loop uses a hyper-optimized loop structure
  • Any pragma or attribute applied on the loop
  • II of the loop
Note:
  • The Loops Analysis report does not report anything about loops in NDRange kernels.
  • The fMAX II report is now deprecated and merged with the Loops Analysis report.
The left-hand Loops List pane of the Loops Analysis report displays the following types of loops:
  • Fused loops (see Loop Fusion)
  • Fused subloops
  • Coalesced loops
  • Fully unrolled loops
  • Partially unrolled loops
  • Regular loops

Loop Pragma and Attributes

You can use the Loops Analysis report to help determine where to deploy one or more of the following pragmas or attributes on your loops:

Key Performance Metrics

The Loops Analysis report captures the following key performance metrics on all blocks:
  • Source Location: Indicates the loop location in the source code.
  • Pipelined: Indicates whether a loop is pipelined. Pipelining allows many data items to be processed concurrently (in the same clock cycle) while making efficient use of the hardware in the datapath by keeping it occupied.
  • II: Shows the sustainable initiation interval (II) of the loop. Processing data in loops is an additional source of pipeline parallelism. When you pipeline a loop, the next iteration of the loop begins before previous iterations complete. The number of clock cycles between the starts of successive iterations is determined by the number of clock cycles required to resolve any dependencies between iterations. This number is referred to as the initiation interval (II) of the loop. The Intel® oneAPI DPC++/C++ Compiler automatically identifies these dependencies and builds hardware to resolve them while minimizing the II.
  • Scheduled fMAX: Shows the scheduled maximum clock frequency at which the loop operates. The fMAX is the maximum rate at which the outputs of registers are updated.
    The physical propagation delay of the signal between two consecutive registers limits the clock speed. This propagation delay is a function of the complexity of the Boolean logic in the path. The path with the most logic (and the highest delay) limits the speed of the entire circuit and is referred to as the critical path.
    The fMAX is calculated as the inverse of the critical path delay. A high fMAX is desirable because it correlates directly with high performance in the absence of other bottlenecks. The compiler optimizes for different objectives for the scheduled fMAX depending on whether the fMAX target is set and whether #pragma II is set for each of the loops. The fMAX target is a strong suggestion, and the compiler does not error out if it cannot achieve this fMAX, whereas #pragma II triggers an error if the compiler cannot achieve the requested II. The fMAX achieved for each block of code is shown in the Loops Analysis report. This behavior is outlined in the following table:

    Explicitly specify fMAX? | Explicitly specify II? | Compiler's Scheduler Behavior
    No                       | No                     | Use heuristic to achieve best fMAX/II trade-off
    No                       | Yes                    | Best effort to achieve the II for the corresponding loop (may not achieve the best possible fMAX)
    Yes                      | No                     | Best effort to achieve fMAX specified (may not achieve the best possible II)
    Yes                      | Yes                    | Best effort to achieve the fMAX specified at the given II. The compiler errors out if it cannot achieve the requested II.

    Intel® recommends that if you are using an fMAX target in the command line or for a kernel, use #pragma II = <N> for performance-critical loops in your design.
  • Latency: Shows the number of clock cycles a loop takes to complete one or more instructions. Typically, you want to have low latency. However, lowering latency often results in decreased fMAX.
  • Speculated Iterations: Shows the loop speculation. Loop speculation is an optimization technique that enables more efficient loop pipelining by allowing future iterations to be initiated before determining whether the loop has already exited.
  • Max Interleaving Iterations: Indicates the number of interleaved invocations of an inner loop that can be executed simultaneously. For more information, refer to max_interleaving Attribute.
Example
The following is a DPC++ kernel example that includes three loops:
    cgh.single_task<class example>([=]() {
      // First loop: fully unrolled.
      #pragma unroll
      for (int i = 0; i < 10; i++) {
        acc_data[i] += i;
      }
      // Second loop (outer): unroll factor 1, so it is not unrolled.
      #pragma unroll 1
      for (int k = 0; k < N; k++) {
        // Third loop (inner): unrolled by a factor of 5.
        #pragma unroll 5
        for (int j = 0; j < N; j++) {
          acc_data[j] = j + k;
        }
      }
    });
The Loops Analysis report of this design example highlights the unrolling strategy for the different kinds of loops defined in the code.
The Intel® oneAPI DPC++/C++ Compiler implements the following loop unrolling strategies based on the source code:
  • Fully unrolls the first loop because of the #pragma unroll specification.
  • Does not unroll the second loop, which is an outer loop, because of the #pragma unroll 1 specification.
  • Unrolls the third loop (an inner loop of the second loop) five times because of the #pragma unroll 5 specification.
For more examples, refer to the Loops section.
