In Episode 6 of the “Hands-On Workshop (HOW) series on parallel programming and optimization with Intel® architectures”, we discuss the details of performance tuning for automatically vectorized calculations:
- The choice of data structures for unit-stride memory access and precision control
- The usage of data alignment, padding and alignment hints
- The general approach to regularizing the vectorization pattern to avoid peel loops and remainder loops
- The application of strip-mining and loop splitting to expose vectorization opportunities to the compiler
Our discussion is illustrated with three practical examples:
- Application of Coulomb's Law
- Lower-Upper (LU) decomposition of small matrices
- Binning of values in a large array
Performance results on an Intel® Xeon® processor and an Intel® Xeon Phi™ coprocessor are reported for each optimization technique applied to the respective application.
The hands-on part of the episode demonstrates how to apply the discussed techniques to the example applications from the lecture.