Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary. With aligned data, the vectorizer can use aligned load instructions for all arrays rather than the slower unaligned load instructions, and it can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c to use the __attribute keyword, which has the following syntax:

float array[30] __attribute((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary, with an "offset" (default 0) in bytes from that boundary. Example:

FTYPE a[ROW][COLWIDTH] __attribute((aligned(16)));
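
As a minimal sketch of how such a declaration can be checked, the following C fragment declares an array in the same way and tests its starting address. The array sizes, the FTYPE typedef, and the helper names is_aligned and check_matrix_alignment are assumptions made for illustration and are not taken from Driver.c. The sketch uses the __attribute__ spelling documented for GCC-compatible compilers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ROW 4        /* hypothetical sizes, not the tutorial's values */
#define COLWIDTH 8

typedef float FTYPE; /* assumption: FTYPE is float in this sketch */

/* Same form as the tutorial's example declaration. */
FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));

/* Returns 1 if p lies on an alignment-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}

/* Aborts (when assertions are enabled) if a does not start on a 16-byte boundary. */
void check_matrix_alignment(void)
{
    assert(is_aligned(&a[0][0], 16));
}
```

Calling check_matrix_alignment() once at startup is one way to catch a missing or mistyped alignment attribute early in a debug build.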

In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.
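
The padding arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the code from Driver.c: the names COL, ELEMS_PER_BLOCK, padded_row_length, and every_row_is_aligned are hypothetical, FTYPE is assumed to be a 4-byte float, and the unpadded row length of 101 is an arbitrary example value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef float FTYPE;   /* assumption: 4-byte float */

#define ALIGN_BYTES 16
#define ROW 3          /* hypothetical number of rows */
#define COL 101        /* hypothetical unpadded row length */

/* Number of elements that fit in one 16-byte block (4 for float). */
#define ELEMS_PER_BLOCK (ALIGN_BYTES / sizeof(FTYPE))

/* Row length rounded up to a whole number of 16-byte blocks. */
#define COLWIDTH ((COL + ELEMS_PER_BLOCK - 1) / ELEMS_PER_BLOCK * ELEMS_PER_BLOCK)

FTYPE a[ROW][COLWIDTH] __attribute__((aligned(ALIGN_BYTES)));

/* Run-time version of the same rounding, for an arbitrary row length. */
size_t padded_row_length(size_t cols)
{
    size_t per_block = ALIGN_BYTES / sizeof(FTYPE);
    assert(ALIGN_BYTES % sizeof(FTYPE) == 0); /* element size must divide the alignment */
    return (cols + per_block - 1) / per_block * per_block;
}

/* Returns 1 if the first element of every row of a is 16-byte aligned. */
int every_row_is_aligned(void)
{
    for (size_t i = 0; i < ROW; i++) {
        if ((uintptr_t)&a[i][0] % ALIGN_BYTES != 0)
            return 0;
    }
    return 1;
}
```

Because the base of the array is aligned and each row occupies a whole number of 16-byte blocks, every row starts on a 16-byte boundary. Without the padding, a row of 101 floats would occupy 404 bytes and the second row would start 4 bytes past a boundary.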

Note

If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
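
The following sketch shows one way the pieces might fit together: it picks a 32-byte alignment when the compiler defines __AVX__ and 16 bytes otherwise, then marks the inner loop with #pragma vector aligned. It is not the code from Multiply.c; the array sizes, the ALIGN_BYTES macro, and the functions fill_inputs and matvec are assumptions made for illustration. The pragma is specific to the Intel compiler, and compilers that do not recognize it typically ignore it, possibly with a warning.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: choose the boundary from the target instruction set. */
#ifdef __AVX__
#define ALIGN_BYTES 32
#else
#define ALIGN_BYTES 16
#endif

#define ROW 4
#define COLWIDTH 8   /* 8 floats = 32 bytes, so each row is aligned in both cases */

typedef float FTYPE; /* assumption: FTYPE is float in this sketch */

FTYPE a[ROW][COLWIDTH] __attribute__((aligned(ALIGN_BYTES)));
FTYPE x[COLWIDTH] __attribute__((aligned(ALIGN_BYTES)));
FTYPE b[ROW] __attribute__((aligned(ALIGN_BYTES)));

/* Sets every element of a to a_value and every element of x to x_value. */
void fill_inputs(FTYPE a_value, FTYPE x_value)
{
    for (int i = 0; i < ROW; i++)
        for (int j = 0; j < COLWIDTH; j++)
            a[i][j] = a_value;
    for (int j = 0; j < COLWIDTH; j++)
        x[j] = x_value;
}

/* Computes b = a * x, promising the compiler that the inner loop's data is aligned. */
void matvec(void)
{
    /* Debug-build guard: the pragma below is only safe if these hold. */
    assert((uintptr_t)&a[0][0] % ALIGN_BYTES == 0);
    assert((uintptr_t)&x[0] % ALIGN_BYTES == 0);

    for (int i = 0; i < ROW; i++) {
        FTYPE sum = 0;
#pragma vector aligned
        for (int j = 0; j < COLWIDTH; j++)
            sum += a[i][j] * x[j];
        b[i] = sum;
    }
}
```

The assertions cover only the array bases. The rows of a stay aligned here because COLWIDTH times the element size is a multiple of ALIGN_BYTES; if the row length changes, that condition needs to be rechecked before the pragma can be trusted.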

Rebuild the program after adding the ALIGNED preprocessor definition to ensure consistently aligned data. The compiler's vectorization remarks should resemble the following:

Multiply.c(45) (col. 2): remark: loop was not vectorized: not inner loop.
Multiply.c(55) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(140) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(140) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(140) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(141) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(164) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(145) (col. 2): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
Driver.c(81) (col. 2): remark: LOOP WAS VECTORIZED.
Driver.c(69) (col. 2): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(72) (col. 3): remark: LOOP WAS VECTORIZED.
Driver.c(54) (col. 2): remark: loop was not vectorized: not inner loop.
Driver.c(55) (col. 3): remark: loop was not vectorized: vectorization possible but seems inefficient.
Driver.c(60) (col. 3): remark: loop was not vectorized: not inner loop.
Driver.c(61) (col. 4): remark: LOOP WAS VECTORIZED.

Note

Your line and column numbers may be different.

Now, run the executable and record the execution time.

