When confronted with nested loops, the granularity of the computations that are assigned to threads will directly affect performance. Loop transformations such as splitting and merging nested loops can make parallelization easier and more productive.
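As a minimal sketch of loop merging, assuming OpenMP and a row-major matrix, the two functions below scale every element of a matrix. The function names and the scaling operation are hypothetical illustrations, not taken from the article. In the nested form only the outer loop's `rows` iterations are divided among threads. In the merged form all `rows * cols` iterations are available to distribute, which helps when `rows` is small relative to the thread count.

```c
#include <assert.h>
#include <stddef.h>

/* Nested form: the pragma applies to the outer loop only, so at most
   `rows` units of work are shared among the threads. */
void scale_nested(double *a, int rows, int cols, double factor)
{
    assert(rows >= 0 && cols >= 0);
    #pragma omp parallel for
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            a[(size_t)i * (size_t)cols + (size_t)j] *= factor;
        }
    }
}

/* Merged form: the two loops are fused into a single loop over
   rows * cols elements, giving the runtime a larger iteration space. */
void scale_merged(double *a, int rows, int cols, double factor)
{
    assert(rows >= 0 && cols >= 0);
    long total = (long)rows * (long)cols;
    #pragma omp parallel for
    for (long k = 0; k < total; k++) {
        a[k] *= factor;
    }
}
```

Both functions produce the same result. If the code is compiled without OpenMP support, the pragmas are ignored and the loops run serially.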
One key to attaining good parallel performance is choosing the right granularity for the application. Granularity is the amount of real work in the parallel task. If granularity is too fine, then performance can suffer from communication overhead.
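One way to see the effect of granularity, sketched here under the assumption that OpenMP is used, is to control how many loop iterations a thread takes each time it asks the runtime for work. The function name `sum_chunked` and its `chunk` parameter are hypothetical. With `chunk` set to 1, each thread returns to the scheduler after every element, so scheduling overhead is likely to dominate. A larger chunk amortizes that overhead over more real work.

```c
#include <assert.h>

/* Sum n values, handing out `chunk` consecutive iterations at a time.
   chunk = 1 is the finest granularity; larger values give coarser tasks. */
double sum_chunked(const double *x, long n, int chunk)
{
    assert(chunk > 0);
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += x[i];
    }
    return total;
}
```

Timing calls such as `sum_chunked(x, n, 1)` against `sum_chunked(x, n, 4096)` on a large array is a simple way to compare the two granularities on a given machine. The best chunk size depends on the workload and hardware.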
How to configure OpenMP in the Intel IPP library to maximize multi-threaded performance of the Intel IPP primitives.
A step-by-step introduction to application performance tuning using the Intel® Compilers version 13 for IA-32 and Intel® 64 processors that are included with Intel® Parallel Studio XE 2013.
Optimize Data Structures and Memory Access Patterns to I
Intel® Math Kernel Library (Intel® MKL) provides highly optimized and extensively threaded general matrix-matrix multiplication (GEMM) functions. In this article, we explain how to design and measure performance tests using Intel MKL SGEMM, and outline about seven tips to help developers run performance tests and quickly evaluate the floating-point computing capability (FLOPS) on a...
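As a rough illustration of how a FLOPS figure is usually derived from a GEMM timing, the helper below uses the conventional operation count of 2·m·n·k for multiplying an m×k matrix by a k×n matrix. The function name `gemm_gflops` is hypothetical, the elapsed time is assumed to have been measured separately around the SGEMM call, and the article's own measurement method may differ.

```c
#include <assert.h>

/* Convert a measured GEMM run time into GFLOPS.
   Assumes the conventional count of 2*m*n*k floating-point operations
   (one multiply and one add per inner-product term). */
double gemm_gflops(long m, long n, long k, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1e9;
}
```

For example, a 1000×1000×1000 multiplication that takes one second corresponds to 2 GFLOPS under this count.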
List of Intel IPP functions optimized for processors code-named Haswell and Skylake
This blog contains additional content for the article "Advanced Vectorization" from Parallel Universe #12:
Reference Link and Download
Intel Vectorization Tools