Optimization of Vector Arithmetics in Intel® Architectures


In Episode 6 of “Hands-On Workshop (HOW) series on parallel programming and optimization with Intel® architectures”, we discuss the details of performance tuning for automatically vectorized calculations.

We discuss:

  • The choice of data structures for unit-stride memory access and precision control
  • The use of data alignment, padding, and alignment hints
  • The general approach to regularizing the vectorization pattern so as to avoid peel loops and remainder loops
  • The application of strip-mining and loop splitting to expose vectorization opportunities to the compiler

Our discussion is illustrated with three practical examples:

  • Application of Coulomb's Law
  • Lower-Upper (LU) decomposition of small matrices
  • Binning of values in a large array

Performance results on an Intel® Xeon® processor and an Intel® Xeon Phi™ coprocessor are reported for each optimization technique applied to the respective application.

The hands-on part of the episode demonstrates the practical application of the discussed techniques on the example applications used in the lecture.



Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804