Optimization Techniques for the Intel® MIC Architecture: Part 1 of 3

Abstract

This is part 1 of a three-part educational series introducing select topics in the optimization of applications for Intel multi-core and many-core architectures (Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors).

In this paper we focus on thread parallelism and race conditions. We discuss the use of mutexes in OpenMP* to resolve race conditions, and we show how to implement an efficient parallel reduction using thread-private storage and mutexes.
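As a rough illustration of the reduction pattern described above, the following minimal sketch sums an array with one private partial sum per thread and merges the partial sums under a mutex. The function name `sum_private_partials` is hypothetical and is not taken from the paper; the sketch assumes a compiler with OpenMP support, and it still runs correctly (serially) if the pragmas are ignored.

```c
#include <stddef.h>

/* Hypothetical helper (not from the paper): sums n values using one
   private partial sum per thread, merged under a mutex at the end. */
double sum_private_partials(const double *x, size_t n)
{
    double total = 0.0;              /* shared between all threads */

    #pragma omp parallel
    {
        /* Declared inside the parallel region, so each thread has its own copy. */
        double partial = 0.0;

        #pragma omp for nowait
        for (long i = 0; i < (long)n; i++)
            partial += x[i];         /* no race: only this thread writes partial */

        /* The mutex is taken once per thread rather than once per element. */
        #pragma omp critical
        total += partial;
    }
    return total;
}
```

Updating `total` directly inside the loop would be a race condition, and protecting every such update with a mutex would serialize the loop. Accumulating privately and locking only for the final merge keeps the contended section short.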

For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. This kind of workload occurs in applications such as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture, compared to a single-threaded implementation on the same architecture.
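A binning kernel of this kind might look like the sketch below. It is not the paper's kernel: the function name `bin_particles`, the one-dimensional coordinate, the equal-width bins over [0, x_max), and the compile-time bin count `N_BINS` are all assumptions made for illustration. Each thread fills a private histogram, and the private histograms are merged into the shared one inside a critical section.

```c
enum { N_BINS = 16 };  /* assumed bin count, fixed at compile time */

/* Hypothetical micro-kernel (not from the paper): counts how many of the
   n coordinates in x fall into each of N_BINS equal-width bins covering
   [0, x_max). Coordinates outside that range are skipped. */
void bin_particles(const float *x, long n, float x_max, long hist[N_BINS])
{
    const float scale = (float)N_BINS / x_max;  /* bins per unit of coordinate */

    for (int b = 0; b < N_BINS; b++)
        hist[b] = 0;

    #pragma omp parallel
    {
        long local[N_BINS] = {0};   /* thread-private histogram */

        #pragma omp for nowait
        for (long i = 0; i < n; i++) {
            if (x[i] < 0.0f || x[i] >= x_max)
                continue;           /* out of range: not binned */
            int b = (int)(x[i] * scale);
            if (b >= N_BINS)
                b = N_BINS - 1;     /* guard against floating-point round-up */
            local[b]++;             /* no race: local belongs to this thread */
        }

        /* Merge under a mutex: one short critical section per thread. */
        #pragma omp critical
        for (int b = 0; b < N_BINS; b++)
            hist[b] += local[b];
    }
}
```

Incrementing `hist[b]` directly from all threads would be a race condition, because two particles handled by different threads can land in the same bin. With private histograms, the only synchronized work is the merge, whose cost depends on the number of bins and threads rather than on the number of particles.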

Download the full article (PDF)

In the next publication of this series, Optimization Techniques for the Intel® MIC Architecture: Part 2 of 3, we will demonstrate further optimization of this workload, focusing on vectorization.

For more complete information about compiler optimizations, see our Optimization Notice.