Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary. The vectorizer can then use aligned load instructions for all arrays instead of the slower unaligned load instructions, and it can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c to use the aligned attribute keyword, which has the following syntax:

float array[30] __attribute__((aligned(base, [offset])));

This instructs the compiler to create an array that is aligned on a "base"-byte boundary, with an optional "offset" in bytes from that boundary (default 0). Example:

FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));
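To confirm that a declaration like the one above actually yields aligned storage, you can inspect the address at run time. The following is a minimal sketch, not the code from Driver.c: the values of ROW and COLWIDTH, the choice of double for FTYPE, and the helper is_aligned are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for illustration; the real ones come from the sample headers. */
#define ROW      47
#define COLWIDTH 48

typedef double FTYPE;   /* assumption: the sample may use float instead */

/* Arrays aligned on a 16-byte boundary, as in the tutorial's example. */
static FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));
static FTYPE x[COLWIDTH]      __attribute__((aligned(16)));

/* Hypothetical helper: returns 1 if ptr lies on a `boundary`-byte boundary, else 0. */
static int is_aligned(const void *ptr, size_t boundary)
{
    return ((uintptr_t)ptr % boundary) == 0;
}
```

Calling is_aligned(a, 16) and is_aligned(x, 16) should return 1 for these declarations. Because each row here occupies 48 * sizeof(double) = 384 bytes, a multiple of 16, is_aligned(&a[1][0], 16) should also return 1.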

In addition, the row length of the matrix a needs to be padded to a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, you also need to tell the vectorizer that it can safely assume the arrays in Multiply.c are aligned, by using #pragma vector aligned.
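The padding arithmetic can be sketched as follows. This is an illustrative macro, not the definition used in the sample: the names PADDED_WIDTH, DEMO_ROW, DEMO_COL, and padded_matrix are hypothetical. The macro rounds the row size in bytes up to the next multiple of the alignment and converts the result back to an element count.

```c
#include <stddef.h>

/* Hypothetical macro: smallest element count >= n whose size in bytes
 * is a multiple of `align`. Assumes `align` is a multiple of sizeof(type). */
#define PADDED_WIDTH(n, type, align) \
    ((((n) * sizeof(type) + (align) - 1) / (align)) * (align) / sizeof(type))

/* Hypothetical dimensions for illustration. */
#define DEMO_ROW 101
#define DEMO_COL 101

/* 101 doubles = 808 bytes per row, which is not a multiple of 16.
 * Padding to 102 doubles = 816 bytes makes every row start 16-byte aligned. */
static double padded_matrix[DEMO_ROW][PADDED_WIDTH(DEMO_COL, double, 16)]
    __attribute__((aligned(16)));
```

With these assumed sizes, a row of 101 doubles is padded to 102 elements, and a row of 101 floats would be padded to 104. For a 32-byte boundary, 101 doubles would be padded to 104.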

Note

If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.
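As a rough illustration of how the pragma is placed, the sketch below applies #pragma vector aligned to the inner loop of a matrix-vector product. It is not the code from Multiply.c: the function name matvec_aligned, the demo arrays, and their sizes are hypothetical. The pragma is guarded so that other compilers skip it, and the assertions check the alignment promise in debug builds before the loop relies on it.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ROWS 4
#define DEMO_COLS 8   /* 8 doubles = 64 bytes per row, a multiple of 16 */

/* Hypothetical demo data, aligned on a 16-byte boundary. */
static double demo_a[DEMO_ROWS][DEMO_COLS] __attribute__((aligned(16)));
static double demo_x[DEMO_COLS]            __attribute__((aligned(16)));
static double demo_b[DEMO_ROWS]            __attribute__((aligned(16)));

/* Computes b = a * x. The caller must pass 16-byte aligned arrays. */
void matvec_aligned(int rows, double a[][DEMO_COLS], double b[], const double x[])
{
    /* Verify the promise made by the pragma below. */
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)x % 16 == 0);

    for (int i = 0; i < rows; i++) {
        double sum = 0.0;
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#pragma vector aligned
#endif
        for (int j = 0; j < DEMO_COLS; j++) {
            sum += a[i][j] * x[j];
        }
        b[i] = sum;
    }
}
```

Filling demo_a with 1.0 and demo_x with 2.0 and then calling matvec_aligned(DEMO_ROWS, demo_a, demo_b, demo_x) should leave 16.0 in every element of demo_b.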

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
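One way to keep a single source file correct for both targets is to select the boundary at compile time. The sketch below assumes the compiler predefines __AVX__ when AVX code generation is enabled, as GCC, Clang, and the Intel compilers do. The names ALIGN_BYTES and avx_demo_data are hypothetical and do not appear in the sample.

```c
#include <stdint.h>

/* Choose the alignment boundary from the target instruction set. */
#ifdef __AVX__
#define ALIGN_BYTES 32   /* AVX registers are 32 bytes wide */
#else
#define ALIGN_BYTES 16   /* SSE registers are 16 bytes wide */
#endif

/* Hypothetical array for illustration, aligned for the selected target. */
static float avx_demo_data[64] __attribute__((aligned(ALIGN_BYTES)));

/* Returns 1 if the demo array sits on the selected boundary, else 0. */
static int avx_demo_is_aligned(void)
{
    return ((uintptr_t)avx_demo_data % ALIGN_BYTES) == 0;
}
```

Row padding must use the same boundary, so a padded row length would need to be a multiple of ALIGN_BYTES rather than of 16.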

Recompile the program with the ALIGNED macro defined to ensure consistently aligned data. Use -qopt-report=4 to see the change in aligned references.

icc -std=c99 -qopt-report=4 -qopt-report-phase=vec -D NOALIAS -D ALIGNED Multiply.c Driver.c -o MatVector

Before adding -D ALIGNED (that is, without #pragma vector aligned), Multiply.optrpt shows:

LOOP BEGIN at Multiply.c(49,9)
   <Peeled loop for vectorization>
   LOOP END

   LOOP BEGIN at Multiply.c(49,9)
      remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
      remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 1.031
      remark #15300: LOOP WAS VECTORIZED
      remark #15442: entire loop may be executed in remainder
      remark #15448: unmasked aligned unit stride loads: 2
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 10
      remark #15477: vector cost: 4.000
      remark #15478: estimated potential speedup: 2.380
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at Multiply.c(49,9)
   <Alternate Alignment Vectorized Loop>
   LOOP END

   LOOP BEGIN at Multiply.c(49,9)
   <Remainder loop for vectorization>
   LOOP END

And after adding -D ALIGNED:

LOOP BEGIN at Multiply.c(49,9)
      remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
      remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.594
      remark #15300: LOOP WAS VECTORIZED
      remark #15448: unmasked aligned unit stride loads: 2
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 10
      remark #15477: vector cost: 4.000
      remark #15478: estimated potential speedup: 2.410
      remark #15488: --- end vector cost summary ---
   LOOP END

   LOOP BEGIN at Multiply.c(49,9)
   <Remainder loop for vectorization>
      remark #15388: vectorization support: reference a[i][j] has aligned access   [ Multiply.c(50,21) ]
      remark #15388: vectorization support: reference x[j] has aligned access   [ Multiply.c(50,31) ]
      remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 2.417
   LOOP END

Note

Your line and column numbers may be different.

Now, run the executable and record the execution time.