This is part 1 of a 3-part educational series of publications introducing select topics on optimization of applications for the Intel multi-core and manycore architectures (Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors).
In this paper we focus on thread parallelism and race conditions. We discuss the usage of mutexes in OpenMP* to resolve race conditions. We also show how to implement efficient parallel reduction using thread private storage and mutexes.
For a practical illustration, we construct and optimize a micro-kernel for binning particles based on their coordinates. Such a workload occurs in such applications as Monte Carlo simulations, particle physics software, and statistical analysis. The optimization technique discussed in this paper leads to a performance increase of 25x on a 24-core CPU and up to 100x on the MIC architecture compared to a single-threaded implementation on the same architectures.
In the next publication of this series, we will demonstrate further optimization of this workload, focusing on vectorization. Optimization Techniques for the Intel® MIC Architecture: Part 2 of 3
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804