Design and Optimization Methodology
Model Offloading to Accelerator
- Estimate the offload benefit and overhead for each loop/function in your original C++ or Fortran code to make better decisions on which parts of code to offload
- Check performance gain for a DPC++, OpenCL™, or OpenMP* application if you offload it to a different accelerator
- Compute throughput model: counts arithmetic operations in a region on a baseline platform and estimates the execution time required on a target platform to achieve the same mix of arithmetic operations, treating the region as bound by compute engines only.
- Memory sub-system throughput model: traces memory accesses inside a region on a baseline platform and estimates the execution time on a target platform needed to transfer the same amount of data. Memory traffic is measured with a cache simulator that reflects the target platform's memory configuration.
- Offload data transfer analysis: measures the data read from or written to by a region that would need to be sent over PCIe* if the region were offloaded to a target platform.
Visualize Performance Bottlenecks with Roofline Chart
- Collects timings and memory data for loops/functions (on CPU) or OpenCL™ kernels (on GPU).
- Measures the hardware limitations and collects floating-point and integer operations data.
- Arithmetic intensity (x-axis) - the number of floating-point operations (FLOPs) and/or integer operations (INTOPs) per byte transferred between the CPU/VPU/GPU and memory, determined by the loop/function algorithm.
- Performance (y-axis) - measured in billions of floating-point operations per second (GFLOPS) and/or billions of integer operations per second (GINTOPS).
- Execution time of each loop/function/kernel is reflected in the size and color of each dot. The dots on the chart correspond to OpenCL kernels for GPU Roofline, while for the CPU Roofline, they correspond to individual loops/functions.
- Memory-bandwidth limitations are plotted as diagonal lines.
- Compute-capacity limitations are plotted as horizontal lines.
Check Vectorization Efficiency
- Outer loops
- Loops with user-defined functions
- Loops that the compiler assumes to have data dependencies
- Where vectorization, or parallelization with threads, will pay off the most
- If vectorized loops are providing benefit, and if not, why not
- Un-vectorized loops and why they are not vectorized
- Performance problems in general
Prototype Threading Designs
- Identify issues and fix them before implementing parallelism
- Add threading to C, C++, and Fortran code
- Prototype the performance impact of different threaded designs and project scaling on systems with larger core counts without disrupting development or implementation
- Find and eliminate data-sharing issues during design (when they're less expensive to fix)
Available High-Level Parallel Frameworks
Intel® oneAPI Threading Building Blocks
- Focus on the part of the program that uses the most time.
- Do not guess, measure.