Optimization of Vector Arithmetics in Intel® Architectures

  • Overview

In Episode 6 of “Hands-On Workshop (HOW) series on parallel programming and optimization with Intel® architectures”, we discuss the details of performance tuning for automatically vectorized calculations.

We discuss:

  • The choice of data structures for unit-stride memory access and precision control
  • The use of data alignment, padding, and alignment hints
  • A general approach to regularizing the vectorization pattern so as to avoid peel loops and remainder loops
  • The application of strip-mining and loop splitting to expose vectorization opportunities to the compiler

Our discussion is illustrated with three practical examples:

  • Application of Coulomb's Law
  • Lower-Upper (LU) decomposition of small matrices
  • Binning of values in a large array

Performance results on an Intel® Xeon® processor and an Intel® Xeon Phi™ coprocessor are reported for each optimization technique applied to the respective application.

The hands-on part of the episode demonstrates the practical application of the discussed techniques on the example applications used in the lecture.