Get Started

  • Version 0.7
  • 09/18/2020

Get Started with OpenMP* Offload Feature to GPU for the Intel® C++ Compiler and Intel® Fortran Compiler

The OpenMP* Offload to GPU feature of the Intel® C++ Compiler and Intel® Fortran Compiler compiles OpenMP source files for a wide range of accelerators.

Before You Begin

Visit the Release Notes page for known issues and the most up-to-date information.

OpenMP 4.5 Subset for GPUs

The Intel® C++ Compiler for oneAPI supports all OpenMP 4.5 features on the CPU except User-Defined Reduction (UDR) support. GPU and offloading support covers a subset of OpenMP 4.5, plus the OpenMP 5.0 declare variant feature, based on an analysis of the OpenMP constructs used in High Performance Computing (HPC) applications. For more detail, see the OpenMP Support section of the Intel® C++ Compiler for oneAPI Developer Guide and Reference.
OpenMP 4.5 Subset Includes Device Support for GPU Target
Below are the OpenMP pragmas that are supported in the -qnextgen compilers for GPU and CPU:
  • declare_target
  • declare_simd
  • target
  • target_teams
  • target_teams_distribute
  • target_teams_distribute_simd
  • target_parallel
  • target_parallel_for
  • target_parallel_for_simd
  • target_teams_distribute_parallel_for
  • target_teams_distribute_parallel_for_simd
  • target_variant_dispatch
  • simd
  • master
  • atomic
  • barrier
  • parallel_for
  • teams
  • teams_distribute
  • teams_distribute_parallel_for
  • teams_distribute_parallel_for_simd
  • target_data
  • target_update
  • target_enter_data
  • target_exit_data
  • parallel
  • single
  • for
  • for_simd
  • parallel_for_simd
  • distribute_parallel_for
  • parallel_sections
  • sections
Below are the OpenMP clauses supported in the -qnextgen compilers for CPU and GPU:
  • if
  • final
  • num_threads
  • safelen
  • simdlen
  • collapse
  • default
  • private
  • firstprivate
  • lastprivate
  • shared
  • reduction
  • linear
  • aligned
  • proc_bind
  • schedule
  • ordered
  • nowait
  • untied
  • mergeable
  • flush
  • read
  • write
  • update
  • capture
  • seq_cst
  • depend
  • device
  • simd
  • map
  • num_teams
  • thread_limit
  • num_tasks
  • dist_schedule
  • defaultmap
  • to
  • from
  • use_device_ptr
  • is_device_ptr
Below are the runtime support routines available on a CPU host (a short usage sketch follows this list):
  • EXTERN int omp_get_num_devices(void);
  • EXTERN int omp_get_initial_device(void);
  • EXTERN void *omp_target_alloc(size_t size, int device_num);
  • EXTERN void omp_target_free(void *device_ptr, int device_num);
  • EXTERN int omp_target_is_present(void *ptr, int device_num);
  • EXTERN int omp_target_memcpy(void *dst, void *src, size_t length, size_t dst_offset, size_t src_offset, int dst_device, int src_device);
  • EXTERN int omp_target_memcpy_rect(void *dst, void *src, size_t element_size, int num_dims, const size_t *volume, const size_t *dst_offsets, const size_t *src_offsets, const size_t *dst_dimensions, const size_t *src_dimensions, int dst_device, int src_device);
  • EXTERN int omp_target_associate_ptr(void *host_ptr, void *device_ptr, size_t size, size_t device_offset, int device_num);
  • EXTERN int omp_target_disassociate_ptr(void *host_ptr, int device_num);
  • EXTERN int omp_is_initial_device(void);
  • EXTERN void kmp_global_barrier_init(void); // Intel extension
  • EXTERN void kmp_global_barrier(void); // Intel extension
  • EXTERN void omp_set_default_device(int dev_num);
  • EXTERN int omp_get_default_device(void);
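As a minimal, hedged sketch of these host routines (the buffer size, device choice, and target-region body are illustrative, not from the original document):

#include <omp.h>
#include <stdio.h>

#define N 256

int main(void) {
    int host_buf[N];
    for (int i = 0; i < N; i++)
        host_buf[i] = i;

    if (omp_get_num_devices() == 0) {
        printf("No offload device available.\n");
        return 0;
    }
    int dev  = omp_get_default_device();
    int host = omp_get_initial_device();

    /* Allocate device memory and copy the host buffer into it. */
    int *dev_buf = (int *)omp_target_alloc(sizeof(host_buf), dev);
    if (dev_buf == NULL)
        return 1;
    omp_target_memcpy(dev_buf, host_buf, sizeof(host_buf),
                      0 /* dst_offset */, 0 /* src_offset */, dev, host);

    /* Operate on the device pointer inside a target region. */
    #pragma omp target parallel for is_device_ptr(dev_buf) device(dev)
    for (int i = 0; i < N; i++)
        dev_buf[i] *= 2;

    /* Copy the result back and release the device memory. */
    omp_target_memcpy(host_buf, dev_buf, sizeof(host_buf), 0, 0, host, dev);
    omp_target_free(dev_buf, dev);

    printf("host_buf[10] = %d\n", host_buf[10]); /* expect 20 */
    return 0;
}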
Below are the device runtime routines for GPU (an example follows this list):
  • EXTERN int omp_get_team_num(void);
  • EXTERN int omp_get_num_teams(void);
  • EXTERN int omp_get_team_size(int);
  • EXTERN int omp_get_thread_num(void);
  • EXTERN int omp_get_num_threads(void);
  • EXTERN int omp_in_parallel(void);
  • EXTERN int omp_get_max_threads(void);
  • EXTERN int omp_get_device_num(void);
  • EXTERN int omp_get_num_devices(void);
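A hedged sketch showing a device-side query from within a target region; recording results into a mapped array (rather than printing from the device) is an illustrative choice:

#include <omp.h>
#include <stdio.h>

#define N 8

int main(void) {
    int team_of[N];

    /* Each iteration records which team executed it on the device. */
    #pragma omp target teams distribute parallel for map(from: team_of)
    for (int i = 0; i < N; i++)
        team_of[i] = omp_get_team_num();

    for (int i = 0; i < N; i++)
        printf("iteration %d ran on team %d\n", i, team_of[i]);
    return 0;
}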
Below are the environment variables (a combined usage example follows this list):
  • OMP_DEFAULT_DEVICE: Controls the default device.
    • Accepts a non-negative integer value.
  • export OMP_TARGET_OFFLOAD={"MANDATORY" | "DISABLED" | "DEFAULT"}:
    • MANDATORY: Target region code runs on a GPU or accelerator.
    • DISABLED: Target region code runs on the CPU.
    • DEFAULT: Target region code runs on a GPU if the device is available; otherwise, it falls back to the CPU.
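For example, a typical shell setup might be (the device number is an illustrative value):

export OMP_DEFAULT_DEVICE=0          # offload to device 0 by default
export OMP_TARGET_OFFLOAD=MANDATORY  # fail rather than silently fall back to the CPU
./matmul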
OpenMP 4.5 (Including Target Gen9 or Later) Support
The OpenMP offloading runtime performs target-specific mapping for the GPU. The mapping from the OpenMP execution model to GPU hardware is shown below:

OpenMP   GPU Hardware          # of Teams, Threads, SIMD Lanes
Team     Sub-Slice (SS)        2+ teams per SS
Thread   EU Thread             64 EU threads can be used per SS
SIMD     Lane (or "Channel")   SIMD1, SIMD4, SIMD8, SIMD16, SIMD32
For GPU support, multi-level parallelism is enabled via multiple OpenMP teams, threads, and SIMD lanes to fully utilize the hardware resources at all levels. On an Intel® GPU, multiple teams can be mapped to a sub-slice, so applications can run more than 64 threads per sub-slice. This mapping:
  • Permits the use of a hardware barrier across threads in a team.
  • Permits OpenMP thread semantics (independent branching, etc.).
  • Allows synchronization across teams.
  • Uses teams to exploit the whole machine (more migration is needed).
  • Uses OpenMP SIMD or compiler vectorization to exploit SIMD (see the sketch after this list). In the Intel® oneAPI HPC Toolkit, OpenMP SIMD is supported for the CPU. For the GPU, compiler vectorization at the kernel level is used to exploit the SIMT-like SIMD execution model. For MS66, the Intel® oneAPI HPC Toolkit supports OpenMP SIMD on the GPU as well. Under explicit SIMD vectorization mode, the clause safelen(n) is used for a safety check only; the compiler can select simdlen < n. The clause simdlen(n) is mapped to SIMD instruction lanes/channels.
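A hedged sketch expressing all three levels at once; the team count, thread limit, and SIMD length below are illustrative choices, not recommendations from the original document:

#include <stdio.h>

#define N (1 << 16)
float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    /* Teams map to sub-slices, threads to EU threads, simd to SIMD lanes. */
    #pragma omp target teams distribute parallel for simd simdlen(16) \
            num_teams(8) thread_limit(64) map(to: x) map(tofrom: y)
    for (int i = 0; i < N; i++)
        y[i] += 2.0f * x[i];

    printf("y[3] = %f\n", y[3]); /* expect 7.0 */
    return 0;
}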

Options Support for Targeting Gen9 or Later for OpenMP 3.0 Legacy Apps (Based on Earlier Exploration Work)

There are two new options, -fiopenmp and -fopenmp-targets=spir64, that support OpenMP and offloading execution on the CPU and GPU. The -fiopenmp option enables a middle-end that supports the transformation of OpenMP in LLVM* (rather than in the Clang* front-end). The -fopenmp-targets=spir64 option enables the compiler to generate an x86 + SPIR64 fat binary for GPU device binary generation.
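For example, a minimal compile line combining the two options might look like this (the source file name is a placeholder):

icpx -fiopenmp -fopenmp-targets=spir64 offload_sample.cpp -o offload_sample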

Enhanced Support for Gen9 or Later Directives for C++ and Fortran

You may want to write library functions using the OpenMP offloading model, using the offloading region to provide faster versions of the functions. To support this usage model, the compiler supports a target variant dispatch construct, an extension that tells the compiler to emit dispatch code around a function call. The construct can take an optional device clause. The syntax is:
#pragma omp target variant dispatch [device(n)]
   function_call
The dispatch code is a runtime check that decides whether to call the base or the variant version of the function. If the GPU is available, the variant version is called; otherwise, the base version is called. If the device(n) clause is specified, the variant version is called only if device n is available.
To specify the names of the base and variant versions of the function, this compiler release supports a subset of the OpenMP 5.0 declare variant construct, as follows:
#pragma omp declare variant(variant_func) match(construct={target variant dispatch}, device={arch(gen)})
   base_func
Below is an example using this feature:

#include <stdio.h>
#define N 1024

float __attribute__((nothrow, noinline)) vecadd_gpu_offload() {
  float result = 0.0;
  float a[N], b[N];
  #pragma omp target parallel for reduction(+: result) map(to: a, b)
  for (int k = 0; k < N; k++) {
    a[k] = k;
    b[k] = k + 1;
    result = result + a[k] + b[k];
  }
  printf("GPU version was called.\n");
  return result;
}

#pragma omp declare variant(vecadd_gpu_offload) \
    match(construct={target variant dispatch}, device={arch(gen)})
float __attribute__((nothrow, noinline)) vecadd_base() {
  float result = 0.0;
  float a[N], b[N];
  #pragma omp parallel for reduction(+: result)
  for (int k = 0; k < N; k++) {
    a[k] = k;
    b[k] = k + 1;
    result = result + a[k] + b[k];
  }
  printf("CPU version was called.\n");
  return result;
}

int main() {
  float result = 0.0;
  #pragma omp target variant dispatch
  {
    result = vecadd_base();
  }
  if (result == 1048576.0) {
    printf("PASSED: correct results\n");
    return 0;
  }
  printf("FAILED: incorrect results\n");
  return -1;
}

Documentation of Any Restrictions on Gen9 or Later Target Regions

Given that OpenMP offloading is built on top of an OpenCL™ runtime stack for the GPU, the restrictions that apply to OpenCL kernel functions also apply to OpenMP offloading region code. Below is a list of restrictions:
  • Recursive function calls (unless they are a compile-time constant expression)
  • Non-placement new and delete
  • goto statements
  • register and thread_local storage qualifiers
  • Virtual function qualifiers
  • Function pointers (unless they are a compile-time constant expression)
  • Virtual functions
  • Exception handling
  • C++ standard library (e.g., only printf is supported for the GPU)
  • Implicit lambda-to-function-pointer conversion
  • Variadic functions
  • Variable Length Arrays (VLAs); VLAs are also not supported for the tasking model and async-offloading

Usage Example for OpenMP Offloading

The code below is a simple matrix multiplication example using the OpenMP target, teams, distribute, and parallel for combined construct:

// matmul.cpp: Matrix Multiplication Example using OpenMP Offloading
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

#define MAX 128

int A[MAX][MAX], B[MAX][MAX], C[MAX][MAX], C_SERIAL[MAX][MAX];

typedef int BOOL;
typedef int TYPE;

BOOL check_result(TYPE *actual, TYPE *expected, unsigned n) {
  for (unsigned i = 0; i < n; i++) {
    if (actual[i] != expected[i]) {
      printf("Value mismatch at index = %d. Expected: %d"
             ", Actual: %d.\n", i, expected[i], actual[i]);
      return 0;
    }
  }
  return 1;
}

void __attribute__ ((noinline)) Compute() {
  // Combined loop construct: the loop nest must immediately follow the pragma.
  #pragma omp target teams distribute parallel for map(to: A, B) map(tofrom: C) \
          thread_limit(128)
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C[i][j] += A[i][k] * B[k][j];
}

int main() {
  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++) {
      A[i][j] = i + j - 1;
      B[i][j] = i - j + 1;
    }

  for (int i = 0; i < MAX; i++)
    for (int j = 0; j < MAX; j++)
      for (int k = 0; k < MAX; k++)
        C_SERIAL[i][j] += A[i][k] * B[k][j];

  Compute();

  if (!check_result((int*) &C[0][0], (int*) &C_SERIAL[0][0], MAX * MAX)) {
    printf("FAILED\n");
    return 1;
  }
  printf("PASSED\n");
  return 0;
}

Compilation and Run Commands

On Linux*, GCC* 4.8.5 or higher must be installed for host code compilation, to avoid incompatibilities due to changes in the C++ Application Binary Interface (ABI).
  1. Compile: Compile the source code with the icx, icpx, or ifx compiler driver, invoking GPU offloading with:
     icx -fiopenmp -fopenmp-targets=spir64 file.cpp matmul_offload.cpp -o matmul
     OR
     icpx -fiopenmp -fopenmp-targets=spir64 file.cpp matmul_offload.cpp -o matmul
     OR
     ifx -fiopenmp -fopenmp-targets=spir64 file.f90 matmul_offload.f90 -o matmul
  2. Run: Set the environment variable with export OMP_TARGET_OFFLOAD="MANDATORY". Its default value is DEFAULT, which indicates that execution can run on either the CPU or the GPU. Below is an example:
sh-4.2$ ./matmul
** Program Scope patch lists **
** Kernel Patch Lists : Kernel Name = __omp_offloading_811_600f59f6__Z7Computev_l14 **
BINDING_TABLE_STATE
  Entry[ 0 ] : 00000000 (0)
  Entry[ 1 ] : 00000040 (2)
  Entry[ 2 ] : 00000080 (4)
INTERFACE_DESCRIPTOR_DATA = { 00000000, 00000000, 00000000, 00000000, 000000c3, 00000000, 00000000, 00000003 }
  KernelStartPointer = : 0
  Kernel64bitStartPointer = : 0
  SoftwareExceptionEnable = : 0
  MaskStackExceptionEnable = : 0
  IllegalOpcodeExceptionEnable = : 0
  FloatingPointMode = : 0
  ThreadPriority = : 0
  SingleProgramFlow = : 0
  DenormMode = : 0
  SamplerCount = : 0
  SamplerStatePointer = : 0
  BindingTableEntryCount = : 3
  BindingTablePointer = : 6
  ConstantURBEntryReadOffset = : 0
  ConstantURBEntryReadLength = : 0
  NumberOfThreadsInThreadGroup = : 0
  GlobalBarrierEnable = : 0
  SharedLocalMemorySize = : 0
  BarrierEnable = : 0
  RoundingMode = : 0
  CrossThreadConstantDataReadLength : 3
Kernel Name: __omp_offloading_811_600f59f6__Z7Computev_l14
PASSED

Enhanced Compiler Integration of GPU-Optimized LIBM Functions

In some cases, there are multiple variants of a given math function with differing accuracy/performance tradeoffs. The compiler provides a way to choose an appropriate variant based on your compiler options. All fp-models supported by the compiler are supported in the OpenMP Offload to GPU feature of the Intel® C++ Compiler and the Intel® Fortran Compiler; the fp-model support of the Intel® C++ Compiler is migrated as well. Below is the list of math functions supported for Gen9 or later, based on OpenCL built-in math functions.
std::unordered_map<std::string, std::string> llvm::vpo::OCLBuiltin = {
  // float:
  {"sinf", "_Z3sinf"},    {"cosf", "_Z3cosf"},    {"tanf", "_Z3tanf"},
  {"erff", "_Z3erff"},    {"expf", "_Z3expf"},    {"logf", "_Z3logf"},
  {"log2f", "_Z4log2f"},  {"powf", "_Z3powff"},   {"sqrtf", "_Z4sqrtf"},
  {"fmaxf", "_Z4fmaxff"}, {"llvm.maxnum.f32", "_Z4fmaxff"},
  {"fminf", "_Z4fminff"}, {"llvm.minnum.f32", "_Z4fminff"},
  {"fabsf", "_Z4fabsf"},  {"llvm.fabs.f32", "_Z4fabsf"},
  {"ceilf", "_Z4ceilf"},  {"llvm.ceil.f32", "_Z4ceilf"},
  {"floorf", "_Z5floorf"}, {"llvm.floor.f32", "_Z5floorf"},
  // double:
  {"sin", "_Z3sind"},    {"cos", "_Z3cosd"},    {"tan", "_Z3tand"},
  {"erf", "_Z3erfd"},    {"exp", "_Z3expd"},    {"log", "_Z3logd"},
  {"log2", "_Z4log2d"},  {"pow", "_Z3powdd"},   {"sqrt", "_Z4sqrtd"},
  {"fmax", "_Z4fmaxdd"}, {"llvm.maxnum.f64", "_Z4fmaxdd"},
  {"fmin", "_Z4fmindd"}, {"llvm.minnum.f64", "_Z4fmindd"},
  {"fabs", "_Z4fabsd"},  {"llvm.fabs.f64", "_Z4fabsd"},
  {"ceil", "_Z4ceild"},  {"llvm.ceil.f64", "_Z4ceild"},
  {"floor", "_Z5floord"}, {"llvm.floor.f64", "_Z5floord"},
  // fast inverse square root:
  {"invsqrtf", "_Z5rsqrtf"}, {"invsqrt", "_Z5rsqrtd"}};
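As a brief hedged illustration, calling one of the listed functions inside a target region lets the compiler bind it to the corresponding OpenCL built-in; the input values here are arbitrary:

#include <math.h>
#include <stdio.h>

int main(void) {
    float out = 0.0f, x = 2.0f;
    /* sqrtf and sinf in the offload region can map to the OpenCL built-ins listed above. */
    #pragma omp target map(to: x) map(from: out)
    out = sqrtf(x) + sinf(x);
    printf("out = %f\n", out);
    return 0;
}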

Performance Profiling for GPU Kernels and Data Transfers

The libomptarget runtime library implements performance profiling for tracking GPU kernel start/complete time and data-transfer time. This feature can be enabled by setting the environment variable LIBOMPTARGET_PROFILE=T, with results as shown below:
GPU Performance (Gen9, export LIBOMPTARGET_PROFILE=T, usec)
… …
Kernel Name: __omp_offloading_811_29cbc383__ZN12BlackScholesIdE12execute_partEiii_l368
iteration #0 ...
calling validate ... ok
calling close ...
execution finished in 1134.914ms, total time 0.045min
passed
LIBOMPTARGET_PROFILE:
-- DATA-READ: 16585.256 usec
-- DATA-WRITE: 9980.499 usec
-- EXEC-__omp_offloading_811_29cbc383__ZN12BlackScholesIfE12execute_partEiii_l368: 24048.503 usec
The data-read and data-write costs were measured for the OpenCL SVMMap/SVMUnmap calls and the data copy. For example:
INVOKE_CL_RET_FAIL(clEnqueueSVMMap, queue, CL_TRUE, CL_MAP_WRITE,
                   tgt_ptr, size, 0, nullptr, nullptr);
memcpy(tgt_ptr, hst_ptr, size);
INVOKE_CL_RET_FAIL(clEnqueueSVMUnmap, queue, tgt_ptr, 0, nullptr, nullptr);
Unified Shared Memory (USM) support is planned to align with the Level Zero (L0) API support schedule, in order to support the OpenMP 5.0 USM features.

Early Integration of GPU-specific Debug Information

To provide GPU-specific debugging, support for the environment variable export LIBOMPTARGET_DEBUG=1 was added. This allows dumping of offloading runtime debugging information. Its default value is 0, which indicates that no offloading runtime debugging information is dumped. See the example below:
sh-4.2$ export LIBOMPTARGET_DEBUG=1
sh-4.2$ ./matmul
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.nios2.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.nios2.so': libomptarget.rtl.nios2.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64_mic.so'...
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64_mic.so'!
Libomptarget --> No devices supported in this RTL
Libomptarget --> Loading library 'libomptarget.rtl.opencl.so'...
Target OPENCL RTL --> Start initializing OpenCL
Target OPENCL RTL --> cl platform version is OpenCL 2.1 LINUX
Target OPENCL RTL --> Found 1 OpenCL devices
Target OPENCL RTL --> Device#0: Genuine Intel(R) CPU 0000 @ 3.00GHz
Target OPENCL RTL --> max WGs is: 8
Target OPENCL RTL --> max WG size is: 8192
Target OPENCL RTL --> addressing mode is 64 bit
Target OPENCL RTL --> cl platform version is OpenCL 2.1
Target OPENCL RTL --> Found 1 OpenCL devices
Target OPENCL RTL --> Device#0: Intel(R) Gen9 HD Graphics NEO
Target OPENCL RTL --> max WGs is: 24
Target OPENCL RTL --> max WG size is: 256
Target OPENCL RTL --> addressing mode is 64 bit
Libomptarget --> Successfully loaded library 'libomptarget.rtl.opencl.so'!
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region
Libomptarget --> Optional interface: __tgt_rtl_run_target_region_nowait
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_region_nowait
Libomptarget --> Optional interface: __tgt_rtl_run_target_team_nd_region_nowait
Libomptarget --> Registering RTL libomptarget.rtl.opencl.so supporting 1 devices!
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.x86_64.so': libomptarget.rtl.x86_64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.cuda.so': libomptarget.rtl.cuda.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'...
Libomptarget --> Unable to load library 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Target OPENCL RTL --> Target binary is VALID
Libomptarget --> Image 0x000000000060d0a0 is compatible with RTL libomptarget.rtl.opencl.so!
Libomptarget --> RTL 0x0000000002012250 has index 0!
Libomptarget --> Registering image 0x000000000060d0a0 with RTL libomptarget.rtl.opencl.so!
Libomptarget --> Done registering entries!
Libomptarget --> Entering target region with entry point 0x000000000040a2a0 and device Id -1
Libomptarget --> Checking whether device 0 is ready.
Libomptarget --> Is the device 0 (local ID 0) initialized? 0
Target OPENCL RTL --> Initialize OpenCL device
Libomptarget --> Device 0 is ready to use.
Target OPENCL RTL --> Dev 0: load binary from 0x000000000060d0a0 image
Target OPENCL RTL --> Expecting to have 1 entries defined.
Target OPENCL RTL --> Found device RTL: /home/users/xtian/cmplr/dev_xmain/builds/xmainoffloadlinuxefi2_debug/llvm/lib/libomptarget-opencl.a
** Program Scope patch lists **
Libomptarget --> Creating new map entry: HstBase=0x00000000006224f0, HstBegin=0x00000000006224f0, HstEnd=0x000000000062c130, TgtBegin=0x0000000002100ad0
Libomptarget --> There are 40000 bytes allocated at target address 0x0000000002100ad0 - is new
… … … …
Libomptarget --> Launching target execution __omp_offloading_811_600f59f6__Z7Computev_l14 with pointer 0x0000000002a994a0 (index=0).
Target OPENCL RTL --> OpenCL: Kernel Arg 0 set successfully
Target OPENCL RTL --> OpenCL: Kernel Arg 1 set successfully
Target OPENCL RTL --> OpenCL: Kernel Arg 2 set successfully
… … …
Libomptarget --> Image 0x000000000060d0a0 is compatible with RTL 0x0000000002012250!
Libomptarget --> Unregistered image 0x000000000060d0a0 from RTL 0x0000000002012250!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x000000000060ea50
Libomptarget --> Done unregistering library!
Programming with an Intel® GPU is similar to programming with other GPUs. Different GPU (micro)architectures perform differently, and re-tuning code for a new (micro)architecture is harder than functional migration. Intel is working to provide compilers, libraries, and tools to reduce the burden of the latter, but this does not eliminate the need for performance optimization.

Find More

  • OpenMP 5.0 specification PDF: The OpenMP Specification describes how OpenMP offloading can be used for devices.
  • The GNU C/C++ Library
  • SC'16 and SC'17 LLVM-HPC workshop papers on OpenMP support:
    • LLVM Compiler Implementation for Explicit Parallelization and SIMD Vectorization. LLVM-HPC@SC 2017: 4:1-4:11.
    • LLVM Framework and IR Extensions for Parallelization, SIMD Vectorization and Offloading. LLVM-HPC@SC 2016: 21-31.

Notices and Disclaimers

Intel technologies may require enabled hardware, software or service activation.
No product or component can be absolutely secure.
Your costs and results may vary.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804