Tutorial

  • 03/26/2021
  • Public Content

Additional Exercises

The previous examples used double precision arrays. They may instead be built with single precision arrays by changing the command-line option -real-size 64 to -real-size 32. The non-vectorized versions of the loop execute only slightly faster than their double precision counterparts; however, the vectorized versions are substantially faster. This is because a packed SIMD instruction operating on a 16-byte vector register processes four single precision data elements at once instead of two double precision data elements.
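The element counts quoted above follow directly from the element sizes. The following is a minimal sketch of that arithmetic, not part of the tutorial sources: the module name simd_lanes, the function elements_per_register, and the program lanes_demo are hypothetical names chosen for illustration, and the 16-byte register width is taken from the text.

```fortran
module simd_lanes
  implicit none
contains
  ! Number of elements of a given size that fit in one vector register.
  pure integer function elements_per_register(register_bytes, element_bytes) result(n)
    integer, intent(in) :: register_bytes, element_bytes
    n = register_bytes / element_bytes
  end function elements_per_register
end module simd_lanes

program lanes_demo
  use, intrinsic :: iso_fortran_env, only: real32, real64
  use simd_lanes
  implicit none
  ! Assumption from the text: a 16-byte (128-bit) vector register.
  integer, parameter :: register_bytes = 16
  integer :: bytes_single, bytes_double

  ! storage_size returns a size in bits; divide by 8 to get bytes.
  bytes_single = storage_size(1.0_real32) / 8
  bytes_double = storage_size(1.0_real64) / 8

  print '(a,i0)', 'single precision elements per register: ', &
        elements_per_register(register_bytes, bytes_single)
  print '(a,i0)', 'double precision elements per register: ', &
        elements_per_register(register_bytes, bytes_double)
end program lanes_demo
```

With 4-byte and 8-byte reals this prints 4 and 2, which is why the vectorized single precision loop can do roughly twice the work per instruction.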
In the example with data alignment, you will need to set ROWBUF=3 to ensure 16-byte alignment for each row of the matrix a. Otherwise, the directive !dir$ vector aligned, which asserts to the compiler that the data accessed in the loop is aligned, will cause the program to fail.
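The padding value can be checked with a little arithmetic: the padded extent multiplied by the element size must be a multiple of 16 bytes. The sketch below assumes an unpadded extent of 101 elements purely for illustration; that value, the module name alignment_check, and the function padding_needed are assumptions and are not taken from this section.

```fortran
module alignment_check
  implicit none
contains
  ! Smallest padding (in elements) that makes (n + pad) * element_bytes
  ! a multiple of align_bytes.
  pure integer function padding_needed(n, element_bytes, align_bytes) result(pad)
    integer, intent(in) :: n, element_bytes, align_bytes
    pad = 0
    do while (mod((n + pad) * element_bytes, align_bytes) /= 0)
      pad = pad + 1
    end do
  end function padding_needed
end module alignment_check

program rowbuf_demo
  use alignment_check
  implicit none
  ! Hypothetical unpadded extent; an assumption made for this sketch.
  integer, parameter :: n = 101

  print '(a,i0)', 'padding for 4-byte reals: ', padding_needed(n, 4, 16)
  print '(a,i0)', 'padding for 8-byte reals: ', padding_needed(n, 8, 16)
end program rowbuf_demo
```

Under that assumption, single precision needs a padding of 3 elements (104 * 4 = 416 bytes, a multiple of 16), while double precision needs only 1 (102 * 8 = 816 bytes).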
This completes the tutorial that shows how the compiler can optimize performance with various vectorization techniques.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.