Programming Guide

Compilation Flow Overview

When you create a program with offload, the compiler must generate code for both the host and the device. oneAPI tries to hide this complexity from the developer. The developer simply compiles a Data Parallel C++ (DPC++) application with dpcpp, and the same compile command generates both host and device code.
For device code, two options are available: Just-in-Time (JIT) compilation and Ahead-of-Time (AOT) compilation, with JIT being the default. This section describes how host code is compiled, and the two options for generating device code. Additional details are available in Chapter 13 of the Data Parallel C++ book.
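As a concrete sketch, a single compiler invocation produces both halves of the program. The source file name here is illustrative, and the build assumes the oneAPI toolkit is on the PATH (in recent oneAPI releases the dpcpp driver has been superseded by icpx -fsycl):

```shell
# One command builds host and device code into a single fat binary.
# JIT is the default: the device code is embedded as SPIR-V.
dpcpp vector_add.cpp -o vector_add
```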

Traditional Compilation Flow (Host-only Application)

The traditional compilation flow is the standard flow used for C, C++, and other languages when there is no offload to a device.
The traditional compilation phases are shown in the following diagram:
Diagram illustrating the steps of traditional flow
  1. The front end translates the source into an intermediate representation and then passes that representation to the back end.
  2. The back end translates the intermediate representation to object code and emits an object file (host.obj on Windows*, host.o on Linux*).
  3. One or more object files are passed to the linker.
  4. The linker creates an executable.
  5. The application runs.

Compilation Flow for DPC++ Offload Code

The compilation flow for DPC++ offload code adds steps for device code to the traditional compilation flow, with JIT and AOT options for device code. In this flow, the developer compiles a DPC++ application with dpcpp, and the output is an executable containing both host and device code.
The basic compilation phases for DPC++ offload code are shown in the following diagram:
Diagram illustrating the steps of DPC++ Offload flow
  1. The host code is translated to object code by the back end.
  2. The device code is translated to either SPIR-V or a device binary.
  3. The linker combines the host object code and the device code (SPIR-V or a device binary) into a single executable that contains the host binary with the device code embedded in it. The result is known as a fat binary.
  4. At runtime, the operating system starts the host application. If the application contains offload code, the DPC++ runtime loads the device code (translating the SPIR-V to a device binary if needed).
  5. The application runs on the host and a specified device.
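The following sketch shows how host and device code coexist in one DPC++ source file and are built with one command. It assumes the oneAPI toolkit is installed; the file name and kernel are illustrative:

```shell
# Host and device code live in the same source file; one dpcpp invocation
# compiles both into a fat binary.
cat > offload.cpp <<'EOF'
#include <sycl/sycl.hpp>
int main() {
  sycl::queue q;                         // host code: selects an available device
  int data = 0;
  {
    sycl::buffer<int> buf(&data, 1);
    q.submit([&](sycl::handler &h) {
      sycl::accessor a(buf, h, sycl::write_only);
      h.single_task([=] { a[0] = 42; }); // device code: this lambda is the kernel
    });
  } // buffer destructor copies the result back to 'data'
  return data == 42 ? 0 : 1;             // host code checks the result
}
EOF
dpcpp offload.cpp -o offload   # one command compiles host + device code
```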

JIT Compilation Flow

In the JIT compilation flow, the device code is translated to SPIR-V intermediate code by the back end, embedded in the fat binary as SPIR-V, and translated from SPIR-V to a device binary by the runtime. When the application runs, the runtime determines the available devices and generates code specific to each device. This allows more flexibility in where the application runs and how it performs than the AOT flow, which must specify a device at compile time. However, performance may be worse because device compilation occurs when the application runs; larger applications with significant amounts of device code may notice the impact.
The JIT compilation flow is useful when you do not know what the target device will be.
JIT compilation is not supported for FPGA devices.
The compilation phases are shown in the following diagram:
Diagram illustrating the steps of JIT flow
  1. The host code is translated to object code by the back end.
  2. The device code is translated to SPIR-V.
  3. The linker combines the host object code and the device SPIR-V into a fat binary containing host executable code with SPIR-V device code embedded in it.
  4. At runtime:
    1. The device runtime on the host translates the SPIR-V for the device into device binary code.
    2. The device code is loaded onto the device.
  5. The application runs on the host and device available at runtime.
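Because JIT is the default, the flow above needs no extra options. The sketch below assumes the oneAPI toolkit is on the PATH and that app.cpp is any DPC++ source file:

```shell
# Default (JIT) build: device code travels as SPIR-V inside the fat binary.
dpcpp app.cpp -o app
./app   # the runtime selects an available device and finalizes the SPIR-V for it
```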

AOT Compilation Flow

In the AOT compilation flow, the device code is translated first to SPIR-V and then to a device binary at build time, and the resulting device binary is embedded in the generated fat binary. The AOT flow provides less flexibility than the JIT flow because the target device must be specified at compile time. However, executable start-up is faster than with the JIT flow because no device compilation happens at runtime.
  • The AOT compilation flow is good when you know exactly which device you are targeting.
  • The AOT flow is recommended when debugging your application as it speeds up the debugging cycle.
The compilation phases are shown in the following diagram:
Diagram illustrating the steps of AOT flow
  1. The host code is translated to object code by the back end.
  2. The device code is translated to SPIR-V.
  3. The SPIR-V for the device is translated to a device code object for the target device specified by the user on the command line.
  4. The linker combines the host object code and the device object code into a fat binary containing host executable code with device executable code embedded in it.
  5. At runtime, the device executable code is loaded onto the device.
  6. The application runs on a host and specified device.
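A hedged sketch of AOT builds follows. The -fsycl-targets values and the GPU device name are illustrative and vary by toolkit version and hardware; check the compiler documentation for the values that match your installation:

```shell
# AOT for CPU: SPIR-V is translated to a CPU binary at build time.
dpcpp -fsycl-targets=spir64_x86_64 app.cpp -o app_cpu
# AOT for GPU: -Xs forwards options to the device back end to select a device.
dpcpp -fsycl-targets=spir64_gen -Xs "-device skl" app.cpp -o app_gpu
```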

Fat Binary

Both the JIT and AOT compilation flows generate a fat binary: a host binary that includes embedded device code. The contents of the device code vary based on the compilation flow.
Box labeled host code that contains a smaller box labeled device code
  • The host code is an executable in either the ELF (Linux) or PE (Windows) format.
  • The device code is SPIR-V for the JIT flow or a device executable for the AOT flow. Executables are in one of the following formats:
    • CPU: ELF (Linux), PE (Windows)
    • GPU: ELF (Windows, Linux)
    • FPGA: ELF (Linux), PE (Windows)

Product and Performance Information

1. Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.