Autovectorization in Intel® OpenCL SDK 1.5

Hi everyone!

Intel just released the Intel® OpenCL SDK version 1.5, and I want to highlight one improvement that is very important but not always visible to the user: the new Implicit CPU Vectorization module.

What are the benefits of using the implicit CPU vectorization module?

SIMD instructions expose a high level of parallelism and are used to accelerate data-parallel applications in multiple domains. The 2nd Generation Intel® Core Processor Family, codenamed “Sandy Bridge”, features the Intel® AVX instruction set, which provides 8-wide floating-point SIMD processing. Applications that take advantage of SIMD instructions can run as much as 8x faster. For example, the Intel® AVX instruction “vaddps” performs an addition of 8 floating-point numbers in parallel. The Implicit CPU vectorization module seamlessly compiles your OpenCL kernels to utilize the full 8-wide floating-point SIMD unit, boosting the performance of user code without user intervention.

Vectorization challenges

SIMD instructions allow the implementation of powerful algorithms but also present a challenge. Traditionally, programming languages such as C or Java allowed the user to write scalar code, which operates on a single data element at a time. SIMD instructions are exposed to C using intrinsics: function calls that the compiler maps to specific machine instructions. For example, the “_mm256_add_ps” intrinsic maps to the ‘vaddps’ instruction described earlier. Intrinsic functions allow experts to take advantage of the SIMD instruction set and write high-performance applications. However, writing intrinsics has remained the domain of performance experts because it requires intimate knowledge of the available instruction set. Moreover, unlike regular C code, compilers are unable to optimize intrinsic code the way they optimize sequential C code.

In addition to the difficulty of programming with SIMD instructions, migrating existing code to new architectures poses a serious challenge. New processor generations expose new instructions that allow faster execution of code. However, taking advantage of the new features often requires rewriting the programs that use the old SIMD intrinsics.
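
For illustration, here is a minimal sketch of what intrinsic-based code looks like (the AVX intrinsics and the “immintrin.h” header are real; the function, array names, and assumptions are made up for this example):

#include <immintrin.h>

/* Adds two float arrays, 8 elements per iteration, using AVX.
   Assumes n is a multiple of 8 and the arrays do not overlap. */
void add_arrays_avx(float *dst, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i); /* load 8 floats from a */
        __m256 vb = _mm256_loadu_ps(b + i); /* load 8 floats from b */
        __m256 vr = _mm256_add_ps(va, vb);  /* compiles to "vaddps" */
        _mm256_storeu_ps(dst + i, vr);      /* store 8 results */
    }
}

Note how such code is welded to one vector width: moving it from 128-bit SSE to 256-bit AVX, or to a future wider instruction set, means touching every line.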

Modern compilers attempt to automatically vectorize user code in order to use the SIMD instruction set. However, general-purpose programming languages were not designed with vectorization and parallelism in mind, and some features of these languages make the detection of parallel code difficult. Issues such as pointer aliasing and memory alignment prevent vectorization in many cases.
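
For example, a compiler may refuse to vectorize a loop like the following (a hypothetical C snippet, not taken from the SDK) because it cannot prove that “dst” and “src” point to disjoint memory:

/* If dst and src overlap (e.g. dst == src + 1), each iteration
   depends on the previous one, so the compiler must keep the
   loop scalar unless it can prove the pointers do not alias. */
void scale_add(float *dst, const float *src, float alpha, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] += src[i] * alpha;
}

OpenCL sidesteps much of this problem: work-items are independent by definition, so the compiler does not need to prove that they can execute in parallel.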

How does the implicit CPU vectorization module work?

Our compiler vectorizes your code to optimize it for both the Sandy Bridge architecture and future architectures. The Intel® OpenCL SDK compiler features a vectorization module that takes scalar code from the user and generates SIMD instructions. The Implicit CPU vectorization module enables you to ignore the underlying architecture and still enjoy the performance of SIMD instructions across multiple Intel® devices, current and future.

The Implicit CPU vectorization module generates SIMD instructions from scalar code by executing multiple kernel work-items together on different SIMD lanes. The first SIMD lane executes the first work-item, the second lane the second work-item, and so on. For example, the code below features the SAXPY function from the BLAS package.
__kernel void saxpy(__global float *X, __global float *Y, float alpha) {
    size_t index = get_global_id(0);
    Y[index] += X[index] * alpha;
}
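
Conceptually, the vectorization module turns this kernel into a loop that executes several work-items per iteration. The following C sketch shows the idea (the function name and the start/end/W parameters are placeholders for this illustration, not actual compiler output):

#include <stddef.h>

/* Conceptual view: W consecutive work-items execute together,
   one per SIMD lane (W = 4 with 128-bit XMM registers,
   W = 8 with 256-bit AVX registers). */
void saxpy_vectorized(float *X, float *Y, float alpha,
                      size_t start, size_t end, size_t W) {
    for (size_t index = start; index < end; index += W)
        for (size_t lane = 0; lane < W; ++lane)       /* this inner loop */
            Y[index + lane] += X[index + lane] * alpha; /* becomes SIMD ops */
}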

The SAXPY function is efficiently compiled by the vectorization module to the code below. Notice the use of the “vaddps” AVX instruction which performs eight instances together. The assembly code below executes much faster than the scalar version.
vmovdqu XMM1, XMMWORD PTR [RDI + 4*R9]  ; load packed elements of Y
vmovdqu XMM2, XMMWORD PTR [R8 + 4*R9]   ; load packed elements of X
vmulps  XMM2, XMM2, XMM0                ; X * alpha (alpha broadcast in XMM0)
vaddps  XMM1, XMM1, XMM2                ; Y + X * alpha, one lane per work-item
vmovups XMMWORD PTR [RDI + 4*R9], XMM1  ; store the results back to Y

The SAXPY function above is a simple function in which all of the work-items execute exactly the same code. However, some programs are more complex and require more advanced use of SIMD instructions. In the modified code below, only values which are greater than 10 are added to the Y array. This requires the vectorization module to use masks in order to execute the code in parallel. The vectorization module executes both the “then” and “else” parts of the condition and uses the “blend” instruction to select the correct value for each of the SIMD lanes.

Input Program:
__kernel void saxpy(__global float *X, __global float *Y, float alpha) {
    size_t index = get_global_id(0);
    float val = X[index] * alpha;
    if (val > 10) Y[index] += val;
}

Output code:
vmovdqu  XMM3, XMMWORD PTR [R8 + 4*R9]   ; load packed elements of X
vmulps   XMM3, XMM3, XMM0                ; val = X * alpha
vcmpltps XMM4, XMM1, XMM3                ; mask = (10.0 < val), constant 10.0 in XMM1
vblendvps XMM3, XMM2, XMM3, XMM4         ; per lane: mask ? val : 0.0 (XMM2 holds 0.0)
vmovups  XMM4, XMMWORD PTR [RDI + 4*R9]  ; load packed elements of Y
vaddps   XMM3, XMM4, XMM3                ; Y + selected value
vmovups  XMMWORD PTR [RDI + 4*R9], XMM3  ; store the results back to Y

Notice that because the compiler has to execute both the “then” and the “else” sides of the if-statement, more instructions are used and the utilization of the SIMD unit is reduced. Despite the reduced utilization, the compiler’s vectorized code still outperforms the scalar version.
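
Per lane, this if-conversion is equivalent to the following branch-free scalar C (a sketch of the transformation, not actual compiler output):

float val = X[index] * alpha;              /* vmulps */
float addend = (val > 10.0f) ? val : 0.0f; /* vcmpltps + vblendvps */
Y[index] += addend;                        /* adding 0.0f leaves Y unchanged */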
In some cases, the Intel® OpenCL SDK detects that the user program may not benefit from SIMD vectorization for various reasons, and disables vectorization.

To summarize, the Intel® OpenCL SDK is an awesome framework that lets developers focus on developing their application while the framework automagically optimizes the code, adding significant value when using OpenCL on CPUs.
For more complete information about compiler optimizations, see our Optimization Notice.

Comments


Quote: "Notice the use of the “vaddps” AVX instruction which performs eight instances together"

Code: vaddps XMM1, XMM1, XMM2

Comment: the vaddps on XMM registers is 128-bit wide (4 items), not 256-bit (8 items), because XMM registers are 128-bit.


So how can I take advantage of this while providing some kind of fallback for users using AMD processors? Do I have to ship two different executables?


Alex, you are correct! In the example above the code uses XMM registers, which are 128-bit wide.

DF: Our compiler operates in JIT mode and will generate code based on the available instruction set. For example, on a processor which has SSE4 and not AVX, our compiler will generate SSE instructions.
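
For reference, here is a minimal host-side sketch of the JIT model (a hypothetical example with error handling omitted): the kernel ships as source and is compiled at run time by clBuildProgram, so a single executable adapts to whatever instruction set the CPU offers.

#include <CL/cl.h>

/* The kernel source is compiled when the program runs, so the
   JIT can target SSE4, AVX, or any future instruction set. */
static const char *src =
    "__kernel void saxpy(__global float *X, __global float *Y, float alpha) {\n"
    "    size_t index = get_global_id(0);\n"
    "    Y[index] += X[index] * alpha;\n"
    "}\n";

cl_program build_saxpy(cl_context ctx, cl_device_id dev) {
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL); /* JIT happens here */
    return prog;
}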


Nadav, the processor in the example obviously supports AVX (since the code uses VEX-coded instructions) but the Intel OpenCL SDK uses only SSE. So your explanation is incorrect.

The article is misleading because it does not show how Intel OpenCL SDK can take advantage of AVX.

It only shows that it can take advantage of SSE and cannot take advantage of AVX, even on processors which support AVX.