Porting applications to Intel® Xeon Phi™ Coprocessor
The following Best Known Methods are assembled here to help users port applications to the Intel® Xeon Phi™ coprocessor. When porting applications, use the following topics as guidelines and keep in mind that many tips assume the use of Intel® compilers.
1. Use Compiler Switches Efficiently
- To compile an application for the Intel Xeon Phi coprocessor natively, use the compiler option -mmic.
- Choose compiler switches that maximize performance while requiring only the minimum precision your application can tolerate.
- Floating point
- Use single rather than double precision where possible.
- Use the various precision controls where applicable: -fimf-*, -[no-]prec-*
- The compiler does not generate low-precision sequences unless low-precision options are added explicitly on the command line. With current compilers, use the -fimf-* flags, such as:
-fimf-domain-exclusion=<n1> -fimf-accuracy-bits=<n2> -fimf-precision=low -fimf-max-error=<n3_ulps>
These options affect code generation for vector as well as scalar code. For the full list of options and detailed descriptions, refer to "Floating-Point Options" in the Compiler User and Reference Guide (Compiler Reference > Compiler Option Categories and Descriptions > Floating-Point Options). The document can be found here: http://software.intel.com/en-us/articles/intel-c-composer-xe-documentation.
- Some combinations that may be useful (SP: single precision, DP: double precision) are:
- (a) -fimf-precision=low -fimf-domain-exclusion=15 (gives the lowest-precision sequences available for both SP and DP)
- (b) -fimf-domain-exclusion=15 -fimf-accuracy-bits=22 (low precision compared to the default for DP)
- (c) -fimf-domain-exclusion=15 -fimf-accuracy-bits=11 (even lower precision for DP, low precision compared to the default for SP)
- (d) -fimf-max-error=2048 -fimf-domain-exclusion=15 (gives lower accuracy than the default max-error of 4 ulp (units in the last place), but higher accuracy than (a) above)
- Perform strength reduction
- Compare the square of something instead of taking the square root.
- Multiply by an inverted value instead of dividing by an invariant value.
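The two strength-reduction tips in this section can be sketched in C as follows. This is a minimal illustration, not code from the compiler documentation; the function names within_radius and scale_by_inverse are hypothetical, and the reciprocal version may differ from a true division in the last bit of the result.

```c
/* Hypothetical helper: returns 1 if the point (x, y) lies within `radius`
   of the origin. Both sides of the comparison are squared, so no sqrtf()
   call is needed. Assumes radius >= 0. */
static int within_radius(float x, float y, float radius)
{
    return x * x + y * y <= radius * radius;
}

/* Hypothetical helper: divides every element by the loop-invariant
   `divisor`. The reciprocal is computed once, outside the loop, and the
   loop body multiplies instead of dividing. */
static void scale_by_inverse(float *data, int n, float divisor)
{
    const float inv = 1.0f / divisor;   /* the only divide */
    for (int i = 0; i < n; ++i)
        data[i] *= inv;                 /* multiply instead of divide */
}
```

For example, within_radius(3.0f, 4.0f, 5.0f) is true because 25 <= 25, without ever computing the square root of 25.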
2. Thread Your Application
- Increase the degree of thread parallelism using programming models and compiler features:
- Hardware threads can be used by a mix of process parallelism, across different Message Passing Interface (MPI) ranks, and thread parallelism within each process. OpenMP* is the most common way to expose thread parallelism, along with Intel® Cilk™ Plus or Intel® Threading Building Blocks (Intel® TBB); pthreads can be used when necessary.
- Don’t count on the compiler to do “auto-parallelization.” See the Compiler User and Reference Guide http://software.intel.com/en-us/articles/intel-parallel-studio-xe-for-linux-documentation/#ccomposer
- Use MPI to increase parallelism if appropriate for your application:
- Use the local option -n to set the number of MPI processes (ranks): mpiexec.hydra -n ./a.out
- Use the option -host to run MPI processes on both the host and the coprocessor: mpiexec.hydra -host ./a.host : -host ./a.mic
- Set the number of OpenMP threads per process with OMP_NUM_THREADS
- Control affinity of processes for MPI with I_MPI_PIN_DOMAIN, e.g., by setting it to omp to get the value of OMP_NUM_THREADS, or by setting it to a value that's evenly divisible by the number of threads per core (4)
- Set I_MPI_DEBUG=5 to reveal the MPI process affinity map.
- Example: mpiexec.hydra -env I_MPI_PIN_DOMAIN 12 -env KMP_AFFINITY balanced -env OMP_NUM_THREADS 9 -env I_MPI_DEBUG 5 -n 20 a.out
- Identify load imbalance with the Intel® Trace Analyzer and Collector.
- Affinitize, and don’t oversubscribe:
- Control affinity for OpenMP/MPI with KMP_AFFINITY. The OpenMP threads will not migrate outside the cores assigned to their parent process.
- Each OpenMP thread may also determine the desired set of operating system processors on which it is to execute and bind to them at run time with the kmp_set_affinity API call. The corresponding environment variable is KMP_AFFINITY. Refer to the "Thread Affinity Interface" topic in the compiler User and Reference Guides for more details (http://software.intel.com/en-us/articles/intel-parallel-studio-xe-for-linux-documentation).
- Compact affinity fills up each core before mapping threads to the next core. Scatter spreads threads across consecutive cores first, then wraps around to the same cores again if there are sufficient threads. Balanced, available only on the Intel Xeon Phi coprocessor, is often best when there is locality across consecutive threads, since it evenly spreads work across cores like scatter, but maps consecutive threads within the same core like compact.
- Which affinity setting is best may vary by algorithm, by the data structures used, and by kernel, so some experimentation may be required to determine the best setting for a sequence of kernels. The affinity may be managed with API calls within the code. Beware that changing the affinity within an application may incur overhead, so make sure those costs are adequately amortized.
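The difference between the compact, scatter, and balanced settings described above can be sketched with a small model. This is a simplified illustration of the placement rules as stated in this section, not the OpenMP runtime's actual algorithm; the function names and the assumption that nthreads does not exceed ncores * tpc are mine.

```c
/* Simplified model of KMP_AFFINITY placements (an assumption-laden sketch,
   not the runtime's implementation). Threads are numbered 0..nthreads-1,
   cores 0..ncores-1, and each core has `tpc` hardware thread contexts.
   Assumes nthreads <= ncores * tpc. Each function returns the core index
   that thread `t` would land on. */

/* compact: fill every context of a core before moving to the next core */
static int core_compact(int t, int tpc)
{
    return t / tpc;
}

/* scatter: round-robin across cores, wrapping around when cores run out */
static int core_scatter(int t, int ncores)
{
    return t % ncores;
}

/* balanced: spread threads evenly over the cores, but keep consecutive
   thread ids on the same core */
static int core_balanced(int t, int nthreads, int ncores)
{
    int base = nthreads / ncores;     /* threads that every core receives */
    int extra = nthreads % ncores;    /* first `extra` cores get one more */
    int cutoff = extra * (base + 1);  /* threads placed on those cores */
    if (t < cutoff)
        return t / (base + 1);
    return extra + (t - cutoff) / base;
}
```

With 8 threads on 4 cores of 4 contexts each, this model places threads 0 to 3 on core 0 under compact, threads 0 and 4 on core 0 under scatter, and threads 0 and 1 on core 0 under balanced.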
3. Code Inspection for Localized Changes
- Type conversions
- In the C/C++ language, constants that don't have an f or F suffix are presumed to be double. Omitting that suffix can have significant performance consequences, both from unnecessary conversion instruction sequences, and from using only half of the vector bandwidth because of a presumption of double precision.
- Use the single precision floating point version of functions wherever possible, e.g., sinf() vs. sin()
- Signed vs. unsigned types
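- In C/C++, unsigned arithmetic is defined to wrap around on overflow, and the compiler must preserve that behavior, which can inhibit loop optimizations. Signed overflow is undefined, so the compiler may assume it does not occur. Use signed instead of unsigned types wherever possible, for example for loop indices.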
- Floating-point precision
- Rewrite "/invariant" as "*(1/invariant)" to use a multiply instead of a divide. The compiler may already do some of this, especially if the invariant is a constant.
4. Tune Your Operating System
Red Hat* Enterprise Linux 6.2: disabling the intel_idle driver to address a bus speed issue. On Red Hat Enterprise Linux* 6.2, we have seen data transfer performance between the coprocessor and the host over the PCI Express* bus, using the offload pragma, degraded by the intel_idle driver in this specific OS release.
To address this, disable the intel_idle driver by passing intel_idle.max_cstate=0 as a kernel boot parameter (this makes the system use acpi_idle instead), then reboot the system.
If performance does not change, an update to the system's BIOS may be needed as well.
Intel, the Intel logo, Cilk, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.