The vectorizer can generate faster code when operating on aligned data. In this activity you will improve performance by aligning the arrays a, b, and x in Driver.c on a 16-byte boundary. The vectorizer can then use aligned load instructions for all arrays, rather than the slower unaligned load instructions, and can avoid runtime tests of alignment. Defining the ALIGNED macro modifies the declarations of a, b, and x in Driver.c to use the aligned attribute keyword, which has the following syntax:
float array[30] __attribute__((aligned(base, [offset])));
This instructs the compiler to create an array that is aligned on a "base"-byte boundary with an "offset" (default 0) in bytes from that boundary. Example:
FTYPE a[ROW][COLWIDTH] __attribute__((aligned(16)));
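To confirm that a declaration like this has the intended effect, the start address of the array can be tested at run time. The following is a minimal sketch, not code from Driver.c: the names demo_array, is_aligned, and demo_array_is_aligned are hypothetical, and it assumes a compiler that accepts the GCC-style single-argument aligned attribute (such as icc, gcc, or clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example array (not from Driver.c), requested on a 16-byte boundary. */
float demo_array[30] __attribute__((aligned(16)));

/* Returns 1 if ptr lies on a boundary-byte boundary, 0 otherwise.
   boundary must be a nonzero power of two. */
int is_aligned(const void *ptr, size_t boundary)
{
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return ((uintptr_t)ptr & (boundary - 1)) == 0;
}

/* Returns 1 if demo_array starts on a 16-byte boundary. */
int demo_array_is_aligned(void)
{
    return is_aligned(demo_array, 16);
}
```

A check of this kind is only a debugging aid. It does not replace the declaration itself, which is what lets the compiler know the alignment at compile time.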
In addition, the row length of the matrix, a, needs to be padded out to be a multiple of 16 bytes, so that each individual row of a is 16-byte aligned. To derive the maximum benefit from this alignment, we also need to tell the vectorizer it can safely assume that the arrays in Multiply.c are aligned by using #pragma vector aligned.
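The padding arithmetic can be sketched as follows. This is an illustrative helper, not the macro used in the sample: the name padded_cols and its parameters are assumptions, and it assumes the alignment in bytes is a multiple of the element size (for example 16 bytes with 4-byte float or 8-byte double elements).

```c
#include <assert.h>
#include <stddef.h>

/* Rounds cols up to the smallest count for which cols * elem_size
   is a multiple of align_bytes.
   Assumes align_bytes is a multiple of elem_size. */
size_t padded_cols(size_t cols, size_t elem_size, size_t align_bytes)
{
    assert(elem_size != 0 && align_bytes % elem_size == 0);
    size_t per_block = align_bytes / elem_size;  /* elements per aligned block */
    return (cols + per_block - 1) / per_block * per_block;
}
```

For instance, a row of 101 float elements (404 bytes) would be padded to 104 elements (416 bytes, a multiple of 16), so that if the first row starts on a 16-byte boundary, every following row does too.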
If you use #pragma vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if #pragma vector aligned is not used. See the code under the ALIGNED macro in Multiply.c.
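As a rough illustration of where the pragma goes, here is a small matrix-vector product with the pragma placed directly before the inner loop. It is a sketch under stated assumptions, not the code in Multiply.c: DEMO_ROWS, DEMO_COLS, matvec_aligned, and demo_row_result are hypothetical names, and the pragma is guarded so that compilers other than the Intel compiler skip it.

```c
#include <stddef.h>

/* Hypothetical sizes: 8 floats per row = 32 bytes, a multiple of 16,
   so every row starts on a 16-byte boundary when the array does. */
#define DEMO_ROWS 4
#define DEMO_COLS 8

/* Computes b[i] = sum over j of a[i][j] * x[j].
   The caller must guarantee that a and x are 16-byte aligned. */
void matvec_aligned(float a[DEMO_ROWS][DEMO_COLS],
                    const float x[DEMO_COLS],
                    float b[DEMO_ROWS])
{
    for (size_t i = 0; i < DEMO_ROWS; i++) {
        float sum = 0.0f;
        /* The pragma is a promise to the vectorizer, not a request to align. */
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#endif
        for (size_t j = 0; j < DEMO_COLS; j++) {
            sum += a[i][j] * x[j];
        }
        b[i] = sum;
    }
}

/* Fills aligned arrays with a[i][j] = i + 1 and x[j] = j,
   runs the product, and returns b[row]. */
float demo_row_result(size_t row)
{
    static float a[DEMO_ROWS][DEMO_COLS] __attribute__((aligned(16)));
    static float x[DEMO_COLS] __attribute__((aligned(16)));
    static float b[DEMO_ROWS] __attribute__((aligned(16)));

    for (size_t i = 0; i < DEMO_ROWS; i++)
        for (size_t j = 0; j < DEMO_COLS; j++)
            a[i][j] = (float)(i + 1);
    for (size_t j = 0; j < DEMO_COLS; j++)
        x[j] = (float)j;

    matvec_aligned(a, x, b);
    return b[row];
}
```

The arrays in demo_row_result carry the aligned attribute precisely because the pragma removes the compiler's own alignment checks. Passing unaligned data to matvec_aligned would break the promise the pragma makes.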
If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, #pragma vector aligned advises the compiler that the data is 32-byte aligned.
Recompile the program after adding the ALIGNED macro to ensure consistently aligned data. Use -qopt-report=4 to see the change in aligned references.
icc -std=c99 -qopt-report=4 -qopt-report-phase=vec -D NOALIAS -D ALIGNED Multiply.c Driver.c -o MatVector
Multiply.optrpt before adding the #pragma vector aligned shows:
LOOP BEGIN at Multiply.c(49,9)
<Peeled loop for vectorization>
LOOP END

LOOP BEGIN at Multiply.c(49,9)
   remark #15388: vectorization support: reference a[i][j] has aligned access [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access [ Multiply.c(50,31) ]
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 1.031
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 10
   remark #15477: vector cost: 4.000
   remark #15478: estimated potential speedup: 2.380
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Alternate Alignment Vectorized Loop>
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Remainder loop for vectorization>
LOOP END
And after adding -D ALIGNED:
LOOP BEGIN at Multiply.c(49,9)
   remark #15388: vectorization support: reference a[i][j] has aligned access [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access [ Multiply.c(50,31) ]
   remark #15305: vectorization support: vector length 2
   remark #15399: vectorization support: unroll factor set to 4
   remark #15309: vectorization support: normalized vectorization overhead 0.594
   remark #15300: LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 10
   remark #15477: vector cost: 4.000
   remark #15478: estimated potential speedup: 2.410
   remark #15488: --- end vector cost summary ---
LOOP END

LOOP BEGIN at Multiply.c(49,9)
<Remainder loop for vectorization>
   remark #15388: vectorization support: reference a[i][j] has aligned access [ Multiply.c(50,21) ]
   remark #15388: vectorization support: reference x[j] has aligned access [ Multiply.c(50,31) ]
   remark #15335: remainder loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
   remark #15305: vectorization support: vector length 2
   remark #15309: vectorization support: normalized vectorization overhead 2.417
LOOP END
Now, run the executable and record the execution time.