Performance Tuning of OpenCL™ Applications on Intel® Xeon Phi™ Coprocessor using Intel® VTune™ Amplifier XE 2013/2015



Intel® SDK for OpenCL™ Applications provides a development environment for OpenCL applications on both Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors, for both Windows* and Linux* operating systems. The latest SDK includes development tools, a runtime, and support for optimization tools. In addition, recent releases of the Intel® VTune™ Amplifier XE provide essential functionality for tuning OpenCL applications on Intel Xeon Phi coprocessors, including source-level analysis of OpenCL kernels. This article provides a basic workflow for profiling OpenCL applications on Intel Xeon Phi coprocessors and some examples of performance analysis.

Steps to Profile Your OpenCL Application

Profiling OpenCL applications with the Intel VTune Amplifier XE is similar to profiling any native or offload application on the Intel Xeon Phi coprocessor. It works the same way on Windows* and Linux* host operating systems.

Here are the steps you need to follow to profile your OpenCL application.

NOTE: The Intel Xeon Phi coprocessor analysis configuration in Intel VTune Amplifier XE 2015 differs from the Intel VTune Amplifier XE 2013 workflow. This article provides configuration steps for the Intel VTune Amplifier XE 2015 workflow. For more details, see the product documentation.

  1. Create an Intel VTune Amplifier XE project:
    1. source /opt/intel/vtune_amplifier_xe_2015/
    2. amplxe-gui &
    3. Create a project via New-> Project… menu option.
      1. When the Project Properties dialog box appears, choose Intel Xeon Phi coprocessor (host launch) in the Target system combo box and specify your OpenCL application as the Application to launch.
      2. Specify additional search libraries.
        1. Click the “Binary/Symbol Search Directories” tab.
        2. Add the following host search directories:
          • /opt/mpss//sysroots/k1om-mpss-linux/boot
          • /opt/intel/composer_xe_2015/lib/mic
  2. Run Advanced Hotspots analysis.
    1. Click on the New Analysis button.
      The Analysis Type window opens (below).
    2. Select Algorithm Analysis -> Advanced Hotspots.
    3. Click Start.

Figure 1. Intel VTune Amplifier XE 2015 screen shot for choosing the analysis type for the Intel® Xeon Phi™ coprocessor

Analyzing Your OpenCL Application

Once your analysis completes, the Hotspots by CPU Usage viewpoint opens. If you click the Bottom-up tab and choose the grouping as selected below, you will be ready to start tuning your application.

Figure 2. Intel VTune Amplifier XE Bottom-Up screen shot showing a summary by process/module for the Intel Xeon Phi coprocessor.

Some important comments:

  • You should focus on the mic_server process. This process covers all the device-side OpenCL application activities. It is generally recommended to filter by this process.
  • For the overall activity aggregated in the “CPU time” chart on the figure (CPU here means Intel MIC architecture cores), it is recommended to zoom and filter the area of actual kernel execution. In our example analysis, this area is the largest red rectangle in Figure 2.
  • Note that time spent in the mic_server consists of:
    • [Dynamic Code], which constitutes the kernels (below)
    • Intel® Threading Building Blocks (Intel TBB) costs
    • SVML (the vector math library that implements most heavy built-ins, such as math functions)
    • Other functions: for example, Linux kernel routines inside vmlinux.

The previous screen shot showed the hotspots by process. Now let's inspect the same trace for the top hotspots over all modules, assuming you have already filtered by the mic_server process. This is easy once you switch to the Top-down Tree view:

Figure 3. Intel VTune Amplifier XE screen shot of the Top-down Tree view showing the top hotspots for Intel Xeon Phi coprocessor.

Here you get the top list of hotspots from all modules. In this example, note that most hotspots are from dynamic code (specific OpenCL kernel names are listed). There is some contribution from the Intel TBB library as well, and finally some heavy math (__ocl_svml_b2_sqrt) that is attributed to code from SVML (the vector math library module that implements the OpenCL kernel built-ins for the Intel OpenCL platform).

In general, many entries for Intel TBB in the hotspots breakdown might indicate some inefficiency in work group scheduling. For example, a large number of Intel TBB entries could mean that the work groups are too lightweight, so scheduling overhead becomes visible next to the kernel time. Refer to the section called “Example analysis of Intel Xeon Phi coprocessor cores utilization” (below) for an example analysis of work group parallelism.
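A breakdown of mic_server time by module category can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not a VTune feature: the module names other than [Dynamic Code] and vmlinux, the CPU times, and the 15% warning threshold are all hypothetical values chosen for the example.

```python
from collections import defaultdict

def categorize(module):
    """Map a module name to one of the categories discussed above."""
    name = module.lower()
    if name == "[dynamic code]":
        return "OpenCL kernels"
    if "tbb" in name:
        return "Intel TBB"
    if "svml" in name:
        return "SVML"
    return "Other"

def time_share(rows):
    """Return each category's fraction of the total CPU time."""
    totals = defaultdict(float)
    for module, seconds in rows:
        totals[categorize(module)] += seconds
    grand_total = sum(totals.values())
    return {category: t / grand_total for category, t in totals.items()}

# Hypothetical (module, CPU seconds) rows, as they might be copied by hand
# from a Bottom-up view filtered by the mic_server process.
sample_rows = [
    ("[Dynamic Code]", 6.0),
    ("libtbb.so", 2.0),        # assumed module name
    ("svml_module.so", 1.0),   # assumed module name
    ("vmlinux", 1.0),
]
TBB_WARN_SHARE = 0.15  # assumed threshold, tune for your application

shares = time_share(sample_rows)
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{category:15s} {share:6.1%}")
if shares.get("Intel TBB", 0.0) > TBB_WARN_SHARE:
    print("Intel TBB share is high: work groups may be too lightweight")
```

With these made-up numbers the Intel TBB share is 20%, which exceeds the assumed threshold and would prompt a closer look at work group size and count.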

More importantly, you can click on any kernel name and get source-level profiling information (see the example screenshot below). Note that you need to provide the path to the *.cl file with a compile-time option; otherwise, Intel VTune Amplifier will explicitly ask you for the file path.

Figure 4. OpenCL kernels source-level hotspots view in Intel VTune Amplifier XE.

You can also click on any line in the OpenCL kernel source code and inspect its assembly code. This is useful for locating expensive instructions, for example:

  • Heavy math built-ins, which are candidates for experiments with native or relaxed math. Refer to the Intel SDK for OpenCL Applications XE Optimization Guide listed in the references at the end of the article.
  • Prefetching instructions that appear costly in the trace. In that case, it is likely that the prefetching itself is inefficient. Consult the dedicated section on HW/SW prefetching in the Optimization Guide.
  • Similarly, if you are observing gather/scatter instructions in the instruction hotspots, your data layout and/or access are likely to need some improvement. Refer to the corresponding section in the Optimization Guide.
  • If there are masked instructions in the instruction hotspot regions, your code is likely to suffer from divergent branches and associated penalties. For help on this, refer to the Optimization Guide.
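The checklist above can be turned into a rough triage helper for assembly lines copied from the hotspot view. The sketch below is an assumption-laden illustration: it relies on simple substring matching, and the sample mnemonics and the `{k1}` mask notation are assumed spellings that may differ from what your disassembly view shows.

```python
def classify_instruction(asm):
    """Return the optimization hints suggested by one assembly line."""
    text = asm.lower()
    hints = []
    if "prefetch" in text:
        hints.append("prefetch: check HW/SW prefetching efficiency")
    if "gather" in text or "scatter" in text:
        hints.append("gather/scatter: review data layout and access pattern")
    if "{k" in text:  # assumed notation for a write-mask operand
        hints.append("masked: look for divergent branches")
    return hints

# Hypothetical hotspot lines, not taken from a real trace.
sample_lines = [
    "vgatherdps zmm1{k1}, [rax+zmm2*4]",
    "vprefetch0 [rsi+64]",
    "vaddps zmm0, zmm1, zmm2",
]

for line in sample_lines:
    hints = classify_instruction(line)
    print(line, "->", hints if hints else "no hint")
```

A line can match more than one category; a masked gather, for instance, points at both the access pattern and branch divergence.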

Event-based General Exploration on the Intel® Xeon Phi™ Coprocessor

In addition to hotspots analysis, you can conduct experiments using a variety of hardware events and associated efficiency metrics. For example, analyzing kernels for data read/write misses might help you to identify potential improvements in the prefetching code, or better data reuse via tiling.

Figure 5. Intel VTune Amplifier XE screen shot of event-driven profiling analysis for Intel Xeon Phi coprocessors with data read misses as a metric of interest.

The event-driven analysis for the OpenCL application on the Intel Xeon Phi coprocessor is conceptually similar to the analysis for the regular native (or offload) application for the coprocessor. For further details, we direct you to this introduction:

Example analysis of Intel Xeon Phi coprocessor cores utilization

Although looking at individual events is useful, one of the most important general hints from profiling is the aggregated cores utilization. You can select and filter by a region of execution on the timeline. See Figure 2 for a sample screen shot of the timeline, where the approximate area of interest is marked with a red rectangle. You can inspect the level of activity for individual threads, but the overall activity, which is aggregated on the “CPU Time” chart, is an important metric of efficiency.

Specifically, large holes in the core utilization might indicate insufficient parallel slack, tasks that are too short, synchronization that is too frequent, or other pitfalls. Let's consider an example of the overall cores utilization for a custom data-mining OpenCL application executed on an Intel Xeon Phi coprocessor. The application uses an iterative algorithm where the data size grows from iteration to iteration. Since even the early iterations are quite time consuming, it is particularly important to saturate the available compute resources efficiently. Inspecting the aggregated cores utilization revealed poor utilization, especially for the first five iterations.

Figure 6. Original (aggregated) cores utilization on the Intel VTune Amplifier timeline for a custom OpenCL data-mining application executed on the Intel Xeon Phi coprocessor. The iterative nature of the algorithm and the demand for compute power that grows with every iteration are clearly seen.
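The idea of "holes" in the aggregated utilization can be made concrete with a small sketch. It assumes you have, for each timeline interval, a count of busy hardware threads; the sample counts, the total of 240 hardware threads, and the 50% threshold are hypothetical and are not measurements from the application discussed here.

```python
def utilization(busy_threads, hw_threads):
    """Convert busy-thread counts per interval into utilization fractions."""
    return [busy / hw_threads for busy in busy_threads]

def find_holes(util, threshold=0.5):
    """Return (start, end) index ranges, end exclusive, where utilization < threshold."""
    holes = []
    start = None
    for i, value in enumerate(util):
        if value < threshold and start is None:
            start = i                      # a hole begins
        elif value >= threshold and start is not None:
            holes.append((start, i))       # the hole ends before interval i
            start = None
    if start is not None:
        holes.append((start, len(util)))   # hole runs to the end of the trace
    return holes

HW_THREADS = 240  # assumed: 60 cores x 4 hardware threads per core
busy_per_interval = [30, 40, 60, 200, 230, 90, 235, 240]  # hypothetical samples

util = utilization(busy_per_interval, HW_THREADS)
print(find_holes(util))  # prints [(0, 3), (5, 6)]
```

Long ranges at the start of the trace, like the first one here, correspond to the under-saturated early iterations visible in a timeline of this kind.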

The Intel SDK for OpenCL Applications XE Optimization Guide explains the internals of the Intel OpenCL implementation for the Intel Xeon Phi coprocessor (see the reference at the end of the article). Take particular note of how individual work groups are mapped to hardware threads, and of the important recommendation to keep a sufficient number of work groups in flight.

In the given example, the actual reason behind the poor utilization is the suboptimal work group size used in this OpenCL application. Specifically, the value of 32, which originated as the NVIDIA* GPU warp size (since the application was initially targeted at GPUs), combined with the input problem size, resulted in too few work groups for the number of Intel Xeon Phi coprocessor cores. This is an example of the insufficient parallel slack that we mentioned earlier.

After changing the work group size to 16, which still preserved vectorization, the resulting utilization was improved considerably:

Figure 7. Improved (aggregated) cores utilization on the Intel VTune Amplifier timeline for custom OpenCL data-mining application executed on Intel Xeon Phi coprocessor. With the optimized value of work group size, the device is well saturated after the first algorithm iteration.

Correspondingly, the execution time of the kernel improved by 10-40%, depending on the test.
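The work group arithmetic behind this fix can be sketched as follows. The global size and the hardware thread count are hypothetical values picked for illustration (the article does not state the actual problem size), so treat the numbers as an example of the reasoning rather than a reproduction of the measurement.

```python
def work_group_count(global_size, local_size):
    """Number of work groups in a 1-D NDRange.

    The global size is required to be a multiple of the work group size,
    as is the case for OpenCL 1.x NDRanges with an explicit local size.
    """
    if global_size % local_size != 0:
        raise ValueError("global size must be a multiple of the work group size")
    return global_size // local_size

HW_THREADS = 240    # assumed: 60 cores x 4 hardware threads per core
GLOBAL_SIZE = 4096  # assumed work-item count for an early iteration

for local_size in (32, 16):
    groups = work_group_count(GLOBAL_SIZE, local_size)
    per_thread = groups / HW_THREADS
    print(f"work group size {local_size}: {groups} work groups, "
          f"{per_thread:.2f} per hardware thread")
```

Under these assumptions, a work group size of 32 yields 128 work groups, which leaves almost half of the 240 hardware threads idle, while a size of 16 yields 256 work groups, enough to give every hardware thread at least one. A size of 16 also matches the 16 single-precision lanes of the coprocessor's 512-bit vector unit, which is consistent with vectorization being preserved.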

For further information see the following documents:

  1. Information on the general Intel VTune Amplifier workflow:
  2. Intel SDK for OpenCL Applications XE Optimization Guide, with guidelines for OpenCL applications targeting Intel Xeon processors and Intel Xeon Phi coprocessors