How to Vectorize Assembly Code by Hand on 32-Bit Intel® Architecture

December 9, 2008 11:00 PM PST



Challenge

Vectorize code by hand-coding in assembly. Programming directly in assembly language for a target platform may produce the required performance gain, but assembly code is not portable between processor architectures and is expensive to write and maintain.

Consider the following simple loop:

void add(float *a, float *b, float *c)
{
    int i;
    for (i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
}

 


Solution

Code key loops directly in assembly language, either with an assembler or with inlined assembly (C-asm) in C/C++ code. The Intel® Compiler or the assembler recognizes the new instructions and registers and directly generates the corresponding code. This model offers the opportunity to attain the greatest performance, but the resulting code is not portable across different processor architectures.

The following code example shows the Streaming SIMD Extensions inlined-assembly encoding that corresponds to the code in the Challenge section:

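/* a, b, and c must each point to 16-byte-aligned storage,
   because movaps and addps fault on unaligned memory operands. */
void add(float *a, float *b, float *c)
{
    __asm {
        mov    eax, a                     ; eax = address of a
        mov    edx, b                     ; edx = address of b
        mov    ecx, c                     ; ecx = address of c
        movaps xmm0, XMMWORD PTR [eax]    ; load a[0..3] into xmm0
        addps  xmm0, XMMWORD PTR [edx]    ; xmm0 = a[0..3] + b[0..3]
        movaps XMMWORD PTR [ecx], xmm0    ; store the four sums to c[0..3]
    }
}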
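
The __asm syntax above is accepted only by some compilers, and movaps requires every pointer to be 16-byte aligned. As a rough sketch for checking the same three steps (aligned load, packed add, aligned store) with other compilers, the version below uses the SSE intrinsics from xmmintrin.h instead of inlined assembly. This is a substitute technique for illustration, not the method this item describes. The name add_simd, the HAVE_SSE macro, and the scalar fallback are hypothetical and do not come from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: these macros indicate SSE support on common compilers. */
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical counterpart of add(): adds four floats at a time.
   a, b, and c must each be 16-byte aligned, as movaps requires. */
void add_simd(const float *a, const float *b, float *c)
{
    /* Check the alignment precondition in debug builds. */
    assert(((uintptr_t)a % 16) == 0);
    assert(((uintptr_t)b % 16) == 0);
    assert(((uintptr_t)c % 16) == 0);

#if HAVE_SSE
    __m128 va = _mm_load_ps(a);           /* aligned load of a[0..3] */
    __m128 vb = _mm_load_ps(b);           /* aligned load of b[0..3] */
    _mm_store_ps(c, _mm_add_ps(va, vb));  /* packed add, aligned store */
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; i++) {
        c[i] = a[i] + b[i];
    }
#endif
}
```

A caller can obtain suitably aligned storage in C11 by declaring each array with _Alignas(16), for example _Alignas(16) float a[4]. Passing an unaligned pointer to either version is an error.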

 

This item is part of a series of items about coding techniques for vectorization.


Source

IA-32 Intel® Architecture Optimization Reference Manual