How to Vectorize Code on 32-Bit Intel® Architecture


Challenge

Vectorize code for greater performance. The SIMD features of Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), and MMX™ technology require new methods of coding algorithms; one of these methods is vectorization. Vectorization is the process of transforming sequentially executing, or scalar, code into code that can execute in parallel, taking advantage of the parallelism of the SIMD architecture.


Solution

Compare the performance improvement that can be obtained from assembly code against the cost of achieving it, in order to choose between hand-coding in assembly, using intrinsics, using C/C++ language extensions, or relying on automatic vectorization. Broadly, these techniques trade programmer control for ease of use: assembly offers the most control at the greatest coding effort, intrinsics and the C/C++ language extensions occupy the middle ground, and automatic vectorization requires the least effort but leaves the most decisions to the compiler.

Each of these options is treated in a separate item. Each of those items illustrates the use of coding adjustments to the following simple loop:

void add(float *a, float *b, float *c)
{
    int i;

    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

 

Note that the loop runs for only four iterations. Because four single-precision floating-point values fit exactly into one 128-bit register, this allows a simple replacement of the code with Streaming SIMD Extensions.
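For example, a minimal intrinsics sketch of that replacement might look like the following (the function name add_sse is illustrative, and the code assumes the 16-byte alignment discussed below; the _mm_load_ps, _mm_add_ps, and _mm_store_ps intrinsics are declared in <xmmintrin.h>):

#include <xmmintrin.h>   /* Streaming SIMD Extensions intrinsics */

/* Sketch only: adds four aligned single-precision values in one SIMD operation. */
void add_sse(float *a, float *b, float *c)
{
    __m128 va = _mm_load_ps(a);      /* load a[0..3] into one 128-bit register */
    __m128 vb = _mm_load_ps(b);      /* load b[0..3] */
    __m128 vc = _mm_add_ps(va, vb);  /* four additions in a single instruction */
    _mm_store_ps(c, vc);             /* store the four results to c[0..3] */
}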

For optimal use of the Streaming SIMD Extensions, which require data aligned on a 16-byte boundary, all of the examples in these items assume that the arrays passed to the routine (a, b, and c) are aligned to 16-byte boundaries by the calling routine. For methods of ensuring this alignment, please refer to the application notes for the Pentium® 4 processor available at Intel Resources for Hardware Developers.
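As one illustration (not necessarily the method recommended in those application notes), statically allocated data can be aligned with a compiler-specific alignment declaration, and dynamically allocated buffers with _mm_malloc and _mm_free; the exact header that declares _mm_malloc varies by compiler, and the names below are chosen only for this sketch:

#include <xmmintrin.h>   /* declares _mm_malloc/_mm_free with most SSE-capable compilers */

/* 16-byte-aligned static array (Intel/Microsoft compiler syntax;
   GCC uses __attribute__((aligned(16))) after the declarator instead). */
__declspec(align(16)) float a_static[4];

void alignment_example(void)
{
    /* 16-byte-aligned dynamic buffer for four floats. */
    float *a_dynamic = (float *)_mm_malloc(4 * sizeof(float), 16);

    /* ... use the buffer with the aligned load/store intrinsics ... */

    _mm_free(a_dynamic);
}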

The same techniques may be used for single-precision floating-point data under SSE, for double-precision floating-point and 128-bit packed integer data under SSE2, and for 64-bit packed integer data under MMX technology.
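As a sketch of the integer case, the same pattern carries over to 32-bit integers with the SSE2 intrinsics declared in <emmintrin.h> (the function name add_int is illustrative, and 16-byte alignment is again assumed):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Sketch only: adds four aligned 32-bit integers in one SIMD operation. */
void add_int(int *a, int *b, int *c)
{
    __m128i va = _mm_load_si128((__m128i *)a);   /* load a[0..3] */
    __m128i vb = _mm_load_si128((__m128i *)b);   /* load b[0..3] */
    __m128i vc = _mm_add_epi32(va, vb);          /* four integer additions at once */
    _mm_store_si128((__m128i *)c, vc);           /* store the four results */
}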


Source

IA-32 Intel® Architecture Optimization Reference Manual

 

