• 10/30/2018
  • Public Content
Contents

Writing Kernels to Directly Target the Intel® Architecture Processors

Using the OpenCL™ vector data types is a straightforward way to directly utilize the Intel® Architecture vector instruction set. See the Using Vector Data Types section for more information. Consider the following code snippet:
float4 a, b; float4 c = a + b;
After compilation, it resembles the following C snippet in intrinsics:
__m128 a, b; __m128 c = _mm_add_ps(a, b);
Or in assembly:
movaps xmm0, [a] addps  xmm0, [b] movaps [c], xmm0
However, in contrast to the code in intrinsics, an OpenCL kernel that uses
float4
data type, transparently benefits from Intel® Advanced Vector Extensions (Intel® AVX) if the compiler promotes
float4
to
float8
. The vectorization module can pack work items automatically, though it might be sometimes less efficient than manual packing.
If the native size for your kernel requires less than 128 bits and you want to benefit from the explicit vectorization, consider packing work items together manually.
For example, your kernel uses the
float2
vector type. It receives (
x
,
y
) float coordinates, and shifts them by (
dx
,
dy
):
__kernel void shift_by(__global float2* coords, __global float2* deltas) {   int tid = get_global_id(0);   coords[tid] += deltas[tid]; }
To increase the kernel performance, you can manually pack pairs of work items:
//Assuming the target is Intel® AVX enabled CPU __kernel __attribute__((vec_type_hint(float8))) void shift_by(__global float2* coords, __global float2* deltas) {   int tid = get_global_id(0);   float8 my_coords = (float8)(coords[tid], coords[tid + 1],                              coords[tid + 2], coords[tid + 3]);   float8 my_deltas = (float8)(deltas[tid], deltas[tid + 1],                              deltas[tid + 2] , deltas[tid + 3]);   my_coords += my_deltas;   vstore8(my_coords, tid, (__global float*)coords); }
Every work item in this kernel does four times as much job as a work item in the previous kernel. Consequently, they require only one fourth of invocations, reducing the run-time overheads. However, when you use manual packing, you must also change the host code accordingly.
For vectors of 32-bit data types, for example,
int4
,
int8
,
float4,
and
float8
data types use explicit vectorization to improve the performance. Other data types (for example,
char3
) may cause a behind-the-scene upcast of the input data, which has negative impact on performance.
For best performance for a given data type, the vector width should match the underlying SIMD width. This value differs for different architectures. For example, consider querying the recommended vector width using
clGetDeviceInfo
with
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT
parameter. You get vector width of four for the 2nd Generation Intel® Core™ processors, but vector width of eight for higher versions of processors. Use
int8
so that vector width fits both architectures. Similarly for floating point data types, you can use
float8
data to cover many potential architectures.
NOTE
: Using scalar data types such as
int
or
float
is often the most “scalable” way to help the compiler do right vectorization job for the specific SIMD architecture underneath.
 
You can target to a specific Intel Architecture processor using a conditional code with an OpenCL™ C predefined macro
__INTEL_OPENCL_CPU_<CPUSIGN>.
 
The macro tunes the kernel for a specific CPU device microarchitecture.
<CPUSIGN>
is the CPU signature of a device.
You can specify one of the following values for this macro:
  • __INTEL_OPENCL_CPU_SKL__
    - Intel® microarchitecture code name Skylake
  • __INTEL_OPENCL_CPU_SKX__
    - Intel® microarchitecture code name Skylake on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_BDW__
    - Intel® microarchitecture code name Broadwell
  • __INTEL_OPENCL_CPU_BDW_XEON__
    - Intel® microarchitecture code name Broadwell on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_HSW__
    - Intel® microarchitecture code name Haswell
  • __INTEL_OPENCL_CPU_HSW_XEON__
    - Intel® microarchitecture code name Haswell on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_IVB__
    - Intel® microarchitecture code name Ivy Bridge
  • __INTEL_OPENCL_CPU_IVB_XEON__
    - Intel® microarchitecture code name Ivy Bridge on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_SNB__
    - Intel® microarchitecture code name Sandy Bridge
  • __INTEL_OPENCL_CPU_SNB_XEON__
    - Intel® microarchitecture code name Sandy Bridge on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_WST__
    - Intel® microarchitecture code name Westmere
  • __INTEL_OPENCL_CPU_WST_XEON__
    - Intel® microarchitecture code name Westmere on Intel Xeon® processor family
  • __INTEL_OPENCL_CPU_UNKNOWN__
    - Unknown microarchitecture
To tune performance for your target CPU, you can use this macro with
intel_vec_len_hint
extension. For example:
// Kernel side.   // Force vectorization with to 8 on BDW. // Runtime defines a macro corresponding to the device CPU signature. #ifdef __INTEL_OPENCL_CPU_BDW__ __attribute__((intel_vec_len_hint(8))) #endif //BDW __kernel void memcpy1(__global float* src, __global float* dst) {     size_t gid = get_global_id(0);     dst[gid] = src[gid]; }
For more information about
intel_vec_len_hint
attribute extension, refer to Vectorizer Knobs.

See Also

OpenCL 1.2 Specification at

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804