A Guide to Optimization Techniques for the Intel® MIC Architecture

Colfax Research provides a three-part educational series on Optimization Techniques for the Intel® MIC Architecture. The series covers selected topics in optimizing applications for Intel’s multi-core and manycore architectures (Intel® Xeon® processors and Intel® Xeon Phi™ processors).

Part 1: Multi-Threading and Parallel Reduction

Part 1 focuses on thread parallelism and race conditions. It discusses the usage of mutexes in OpenMP to resolve race conditions. For a practical illustration, the authors construct and optimize a micro-kernel for binning particles based on their coordinates. Workloads like this one occur in such applications as Monte Carlo simulations, particle physics software, and statistical analysis.

The optimization technique discussed leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures.
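To make the idea concrete, here is a minimal sketch of a binning kernel with a parallel reduction. It is not the authors' code. The names `bin_index` and `bin_particles` are hypothetical, and the layout (a plain array of coordinates, uniform bins) is an assumption made for illustration. Each thread fills a private histogram, and an OpenMP `critical` section (a mutex) protects only the final merge, so the lock is taken once per thread rather than once per particle.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: maps a coordinate to a bin index, assuming n_bins
// uniform bins on [x_min, x_max). Out-of-range values are clamped.
inline int bin_index(float x, float x_min, float x_max, int n_bins) {
    int i = static_cast<int>((x - x_min) / (x_max - x_min) * n_bins);
    if (i < 0) i = 0;
    if (i >= n_bins) i = n_bins - 1;
    return i;
}

// Illustrative sketch: counts particles per bin using thread-private
// histograms that are merged at the end of the parallel region.
std::vector<long> bin_particles(const std::vector<float>& x,
                                float x_min, float x_max, int n_bins) {
    assert(n_bins > 0 && x_max > x_min);
    std::vector<long> global(n_bins, 0);
    const long n = static_cast<long>(x.size());

#pragma omp parallel
    {
        // Private to each thread: no race condition while counting.
        std::vector<long> local(n_bins, 0);

#pragma omp for
        for (long p = 0; p < n; ++p) {
            ++local[bin_index(x[p], x_min, x_max, n_bins)];
        }

        // Mutual exclusion is needed only for the merge into shared memory.
#pragma omp critical
        {
            for (int b = 0; b < n_bins; ++b) global[b] += local[b];
        }
    }
    return global;
}
```

If the file is compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result. Placing the `critical` section around the increment inside the loop would also be correct, but every particle would then contend for the same lock.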

Part 2: Strip-Mining for Vectorization

In this part, the authors discuss data parallelism. The focus is automatic vectorization and exposing vectorization opportunities to the compiler. For a practical illustration, they construct and optimize a micro-kernel for binning particles. Again, similar workloads occur in applications such as Monte Carlo simulations, particle physics software, and statistical analysis.

The optimization technique discussed in this part enables the compiler to vectorize the code, which yields an order-of-magnitude performance improvement on an Intel Xeon processor. Compared to a high-end Xeon, performance on Xeon Phi is 1.4x greater in single precision and 1.6x greater in double precision.
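As a rough illustration of strip-mining, the sketch below splits the particle loop into fixed-width strips. It is an assumed layout, not the authors' implementation. The function name `bin_particles_strip_mined` and the strip width `STRIP` are hypothetical choices. Within each strip, the index computation has no dependence between iterations, so a compiler may vectorize it. The bin increments stay in a separate scalar loop because two particles can land in the same bin.

```cpp
#include <cassert>
#include <vector>

// Assumed strip width; a real code would tune this to the vector length.
constexpr int STRIP = 16;

// Illustrative sketch: strip-mined binning on uniform bins over [x_min, x_max).
std::vector<long> bin_particles_strip_mined(const std::vector<float>& x,
                                            float x_min, float x_max,
                                            int n_bins) {
    assert(n_bins > 0 && x_max > x_min);
    std::vector<long> bins(n_bins, 0);
    const float scale = n_bins / (x_max - x_min);
    const long n = static_cast<long>(x.size());

    for (long start = 0; start < n; start += STRIP) {
        // The last strip may be shorter than STRIP.
        const long remaining = n - start;
        const int len = static_cast<int>(remaining < STRIP ? remaining : STRIP);
        int idx[STRIP];

        // Vectorizable part: independent iterations, unit-stride access.
        for (int k = 0; k < len; ++k) {
            int i = static_cast<int>((x[start + k] - x_min) * scale);
            i = (i < 0) ? 0 : i;
            i = (i >= n_bins) ? n_bins - 1 : i;
            idx[k] = i;
        }

        // Scalar part: increments may collide on the same bin.
        for (int k = 0; k < len; ++k) {
            ++bins[idx[k]];
        }
    }
    return bins;
}
```

Whether the first inner loop is actually vectorized depends on the compiler and its flags, so it is worth checking the compiler's vectorization report rather than assuming it.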

Part 3: False Sharing and Padding

In this part, the authors discuss false sharing, highlight the situations in which it may occur, and show how to eliminate it by padding data containers. For a practical illustration, they construct and optimize a micro-kernel for binning particles based on their coordinates. Results show that false sharing can cost as much as an order of magnitude in performance in a parallel application. The padding required to eliminate false sharing is greater on Intel Xeon processors than on Intel Xeon Phi coprocessors, so real-life applications may use target-specific padding values.
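A minimal sketch of container padding follows. It is an assumption-laden illustration rather than the authors' code. The names `PAD_BYTES`, `padded_stride`, and `bin_with_padded_rows` are hypothetical, and the 64-byte value is only a placeholder for the target-specific padding the text mentions. Each worker writes to its own row of a shared array, and the rows are separated by a full block of unused elements so that two workers never update the same cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed padding in bytes; the appropriate value is target-specific.
constexpr std::size_t PAD_BYTES = 64;
constexpr std::size_t PAD_ELEMS = PAD_BYTES / sizeof(std::int64_t);

// Row length including padding. A full block of unused elements follows
// each row, so rows stay separated even if the array is not aligned.
constexpr std::size_t padded_stride(std::size_t n_bins) {
    return n_bins + PAD_ELEMS;
}

// Illustrative sketch: bin_of[p] is the precomputed bin of particle p and
// is assumed to lie in [0, n_bins). Each worker counts into its own row.
std::vector<std::int64_t> bin_with_padded_rows(const std::vector<int>& bin_of,
                                               int n_bins, int n_workers) {
    assert(n_bins > 0 && n_workers > 0);
    const std::size_t stride = padded_stride(static_cast<std::size_t>(n_bins));
    std::vector<std::int64_t> rows(stride * n_workers, 0);
    const long n = static_cast<long>(bin_of.size());

#pragma omp parallel for num_threads(n_workers)
    for (int t = 0; t < n_workers; ++t) {
        std::int64_t* my_row = rows.data() + t * stride;
        const long begin = n * t / n_workers;        // this worker's chunk
        const long end = n * (t + 1) / n_workers;
        for (long p = begin; p < end; ++p) {
            ++my_row[bin_of[p]];
        }
    }

    // Serial merge of the per-worker rows; padding elements are skipped.
    std::vector<std::int64_t> total(n_bins, 0);
    for (int t = 0; t < n_workers; ++t) {
        for (int b = 0; b < n_bins; ++b) {
            total[b] += rows[t * stride + b];
        }
    }
    return total;
}
```

Setting `PAD_ELEMS` to zero gives the unpadded layout, which produces the same counts but lets neighboring rows share cache lines. Timing the two variants on a given machine is one way to observe the effect.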
