To maximize use of the CPU vector units, consider using vector data types in
your kernel code as a more involved performance alternative to the automatic
(compiler-aided) vectorization described in the Benefitting
from Implicit Vectorization
section. This technique enables you to
map vector data types directly to the hardware vector registers. Thus,
the data types you use should match the width of the underlying SIMD instructions.
Consider the following recommendations:
Starting with the 2nd Generation Intel® Core™ processors, which feature Intel®
Advanced Vector Extensions (Intel® AVX) support, use data types such
as float8, so that you bind code to
the specific register width of the underlying hardware. This method
provides maximum performance on a specific platform. However, performance
on other platforms and generations of Intel® Core™ processors might
be less than optimal.
Use wider data types, such as float16, to transparently
cover many SIMD hardware register widths. However, using types wider
than the underlying hardware is similar to loop unrolling. This method
might improve performance in some cases, but also increases register
pressure. For example, consider using the uchar4
data type to process
four pixels simultaneously when operating on pixels with eight bits
per pixel (see the sketch after these notes).
With vector data types, each work item processes N
elements. Make sure you reduce the size of the grid, which is the number of work-items
required to process the same dataset, by the same factor of N (see the host-side
sketch at the end of this section).
Note that using wide integer vector data types improves performance only starting with the 4th Generation Intel® Core™ processors, which add Intel® AVX2 support for 256-bit integer operations.
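As an illustration of the pixel-processing recommendation above, the following kernel is a minimal sketch (the kernel name, buffer layout, and operation are illustrative assumptions, not from the original): each work-item loads one uchar4, that is, four 8-bit pixels, and halves their brightness in a single vector operation.

__kernel void darken_pixels(__global const uchar4 *src,
                            __global uchar4 *dst)
{
    int id = get_global_id(0);
    /* One component-wise vector shift halves all four packed
       8-bit pixels at once. */
    dst[id] = src[id] >> (uchar4)(1);
}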
Using vector data types, you plan the vector-level parallelism yourself
instead of relying on the implicit vectorization module. See the Benefitting
from Implicit Vectorization
section for more information.
This approach is useful in the following scenarios:
You are porting code that originally used the following instruction sets (see the sketch after this list):
Intel® Streaming SIMD Extensions (Intel® SSE)
Intel® Advanced Vector Extensions (Intel® AVX)
Intel® Advanced Vector Extensions 2 (Intel® AVX2)
Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
You want to benefit from hand-tuned vectorization of your code.
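For instance, in the porting scenario, a hand-vectorized Intel® AVX loop such as the following (a hypothetical sketch; the function name is illustrative and n is assumed to be a multiple of eight) maps almost one-to-one onto the float8 OpenCL kernel shown below:

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical AVX source: multiply two float arrays eight lanes at a time. */
void mul_avx(const float *a, const float *b, float *result, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(result + i, _mm256_mul_ps(va, vb));
    }
}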
The following example demonstrates a multiplication kernel that targets
the 256-bit vector units of the 2nd Generation Intel Core processors and later:
__kernel __attribute__((vec_type_hint(float8)))
void edp_mul(__global const float8 *a,
             __global const float8 *b,
             __global float8 *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * b[id];
}
In this example, the data passed to the kernel represents buffers of
float8, and the calculations are performed on eight elements together.
The attribute added before the kernel signals to the compiler, or the
implementation, that this kernel has an optimized vectorized form, so the
implicit vectorization module does not operate on it. Use __attribute__((vec_type_hint(<type>)))
to hint to the compiler that your kernel already processes data using mostly
vector types. For more details on this attribute, see section 6.7.2
of the OpenCL™ 1.2 Specification at https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf.
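On the host side, remember the grid-size note above: because each edp_mul work-item consumes one float8, the global work size is the element count divided by eight. A minimal sketch, assuming the queue and kernel objects are already created and that num_floats is a multiple of eight (the helper name is illustrative):

#include <CL/cl.h>
#include <stddef.h>

/* Hypothetical helper: enqueue edp_mul over num_floats elements. */
cl_int launch_edp_mul(cl_command_queue queue, cl_kernel kernel, size_t num_floats)
{
    size_t global_size = num_floats / 8;  /* one work-item per float8 */
    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_size, NULL, 0, NULL, NULL);
}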