by Paul Besl
Download the full case study (PDF format)
Download the accompanying source code (ZIP format)
A customer recently purchased a significant number of Intel Xeon Phi coprocessors to augment their cluster of 16-core, dual-socket compute nodes based on Intel Xeon E5-2600 series processors (8 cores per socket). The 61-core Intel Xeon Phi coprocessor can be programmed either natively, as if it were a separate Linux* host, or via an offload method controlled by the Intel Xeon processor host. In this case study, we took one mini-application code, tested the host version, tried out native execution on the Intel Xeon Phi coprocessor, and optimized the native version (and, as often happens, the host version as well) by converting two AoS ("array of structures") arrays containing three-dimensional point data to the SoA ("structure of arrays") format. We also added an "offload pragma" to offload the key O(N²) loop, retaining almost all of the native version's performance in the offload version. In addition, we applied a cache-blocking transformation to the two main loops by splitting them into four loops and then interchanging the two middle (non-inner, non-outer) loops. Finally, we compare the performance of all tested Intel Xeon processor and Intel Xeon Phi coprocessor versions, showing a 9-to-1 performance ratio between the slowest double precision AoS executable and the fastest single precision SoA executable.
A customer recently purchased a significant number of Intel Xeon Phi coprocessors to augment their cluster of 16-core, dual-socket compute nodes based on Intel Xeon E5-2600 series processors (8 cores per socket). They will be supporting a wide variety of software packages on these systems, and they already run clusters with the same set of supported software packages on earlier x86_64 clusters. Intel has developed software technology in the form of Intel® compilers, the Intel® Math Kernel Library (Intel® MKL), Intel® MPI libraries, and Intel performance analysis tools that allow this customer and their users either to port their software to run natively on the Intel Xeon Phi coprocessor's Linux* OS, as if it were a separate Linux host, or to offload compute tasks from the host to the Intel Xeon Phi coprocessor located on the host's PCIe bus.

At this point in time, it has been demonstrated that significant compute server codes (i.e., not client-based, Windows*-oriented, user-interactive codes) can be ported to, or partially offload-enabled for, the Intel Xeon Phi coprocessor without a huge amount of software work. The remaining task for a software developer, however, is to tune and optimize their software so that it runs efficiently and quickly on the coprocessor. An Intel Xeon processor can run x86_64, 128-bit SSE (Streaming SIMD Extensions), and 128-bit or 256-bit Intel® Advanced Vector Extensions (Intel® AVX) code on "big cores" with out-of-order (OOO) execution, but the Intel Xeon Phi coprocessor can only run in-order x86_64 instructions (including x87 instructions) or the new 512-bit Intel® Initial Many Core Instructions (Intel® IMCI) floating point vector instructions. Ideally, we would like to develop a decision tree or step-by-step recipe that software developers can follow in their porting and optimization activities, so that excellent performance can be attained with the minimum amount of developer effort.