As part of the application readiness efforts for future Intel® Xeon® processors and Intel® Xeon Phi™ processors (code named Knights Landing), developers are interested in improving two key aspects of their workloads: vectorization/code generation and thread parallelism.
This article mainly covers vectorization/code generation and lists some helpful tools and resources for thread parallelism.
Use the latest Intel Compilers (14.0+) and compile with the “-xMIC-AVX512” compiler knob to generate a Knights Landing (KNL) binary. You can then run the binary on the Intel® Software Development Emulator (Intel SDE) as:
sde -knl -- ./<knl.exe> <args.>
OR you can run your MPI application as
mpirun -n <no. of ranks> sde -knl -- ./<knl.exe> <args.>
To generate “Instruction Mix” reports using Intel SDE for different architectures:
Intel Xeon Phi coprocessor
sde -knl -mix -top_blocks 100 -iform 1 -- ./<knl.exe> <args.>
You can also run the corresponding Intel Xeon processor binary on Intel SDE for comparison and analysis:
Intel Xeon processor
sde -ivb -mix -top_blocks 100 -iform 1 -- ./<ivb.exe> <args.>
sde -hsw -mix -top_blocks 100 -iform 1 -- ./<hsw.exe> <args.>
It is recommended to generate instruction mix reports using single MPI/OpenMP* thread runs (OMP_NUM_THREADS=1) to simplify the analysis.
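The commands above can also be assembled from a script when several architectures need to be compared. The following is a minimal Python sketch, not part of Intel SDE: the helper names `sde_mix_command`, `single_thread_env`, and `run_mix` are hypothetical, and the only SDE switches used are the ones already shown in this article. It assumes `sde` is on the PATH when `run_mix` is called.

```python
import os
import subprocess

# Architecture switches exactly as used in the commands in this article.
SDE_ARCH_FLAGS = {"knl": "-knl", "ivb": "-ivb", "hsw": "-hsw"}


def sde_mix_command(arch, exe, args=(), top_blocks=100):
    """Build the argument list for an Intel SDE instruction mix run."""
    return ["sde", SDE_ARCH_FLAGS[arch], "-mix",
            "-top_blocks", str(top_blocks), "-iform", "1",
            "--", exe, *args]


def single_thread_env(base_env=None):
    """Copy an environment and force a single OpenMP thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


def run_mix(arch, exe, args=()):
    """Run the emulator (hypothetical wrapper; requires sde to be installed)."""
    return subprocess.run(sde_mix_command(arch, exe, args),
                          env=single_thread_env(), check=True)


if __name__ == "__main__":
    # Only prints the command line; nothing is executed here.
    print(" ".join(sde_mix_command("knl", "./knl.exe")))
```

Keeping the command construction separate from the call that launches it makes it easy to inspect the exact command line before running it.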
For resolving thread parallelism issues refer to the thread parallelism section below.
SNAP kernel obtained from https://www.nersc.gov.
Extracted Kernel from http://www.berkeleygw.org/.
Total Dynamic Instruction Reduction:
A significant percentage of x87 code is generated for Intel AVX and Intel AVX2 for this kernel.
Intel SDE also provides the top basic blocks for your run based on hot instruction execution.
If you look at the top basic blocks, you see a significant number of x87 instructions in this kernel for the Intel AVX/AVX2 code. Below is just a snippet of the first basic block from the Intel AVX2 instruction mix report.
The corresponding source code for the above basic block is line 459 (as highlighted above).
Looking at the source, we observed that the statement on line 459 involves a “complex” division. The compiler generates an x87 sequence for it to conform to strict IEEE semantics and to avoid any overflows and underflows.
The way to avoid this is to compile with -fp-model fast=2. This allows the compiler to assume that the real and imaginary parts of the double-precision denominator lie in the approximate range, so it generates simple code without the tricks above. It can then generate vector Intel AVX/AVX2 instructions for the entire loop.
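To see why a complex division needs extra care in the first place, the Python sketch below compares the textbook formula with a scaled variant. This is an illustration only: the scaled version uses Smith's algorithm, which is a different technique from the x87 sequence the compiler emits, and the function names `naive_divide` and `scaled_divide` are hypothetical. The textbook formula is roughly the kind of simple code that is safe only when the denominator stays well inside the representable range.

```python
import math


def naive_divide(a, b, c, d):
    """(a + bi) / (c + di) by the textbook formula; c*c + d*d can overflow."""
    den = c * c + d * d
    return ((a * c + b * d) / den, (b * c - a * d) / den)


def scaled_divide(a, b, c, d):
    """Same quotient using Smith's scaling, which never forms c*c + d*d."""
    if abs(c) >= abs(d):
        r = d / c
        den = c + d * r
        return ((a + b * r) / den, (b - a * r) / den)
    r = c / d
    den = c * r + d
    return ((a * r + b) / den, (b * r - a) / den)


if __name__ == "__main__":
    # Moderate values: both formulas agree.
    print(naive_divide(1.0, 2.0, 3.0, 4.0))
    print(scaled_divide(1.0, 2.0, 3.0, 4.0))
    # Very large values: the intermediate products in the naive formula
    # overflow, while the scaled formula still returns a finite quotient.
    big = 1e200
    print(naive_divide(big, big, big, big))
    print(scaled_divide(big, big, big, big))
```

The naive version is shorter and vectorizes easily, which is the trade-off the relaxed floating-point model accepts.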
The EXECUTIONS count in the basic block is the number of times this basic block was executed, and ICOUNT gives the total number of instructions executed for this basic block across all executions. Thus ICOUNT/EXECUTIONS gives the number of instructions in this basic block.
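As a small worked example of the ICOUNT/EXECUTIONS relationship, the sketch below computes the static size of a basic block from the two counters. The function name `instructions_per_block` and the numbers are hypothetical and are not taken from the report discussed in this article.

```python
def instructions_per_block(icount, executions):
    """Return the number of instructions in a basic block.

    icount     -- total instructions executed in the block (ICOUNT)
    executions -- number of times the block was executed (EXECUTIONS)
    """
    if executions <= 0:
        raise ValueError("EXECUTIONS must be positive")
    return icount / executions


if __name__ == "__main__":
    # Hypothetical counters: 1,200,000 instructions over 50,000 executions.
    print(instructions_per_block(1_200_000, 50_000))  # prints 24.0
```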
In addition, the vectorization optimization report generated by the compiler (using -qopt-report=5) can be combined with the Intel SDE top basic blocks for a first-pass ‘vectorization study’. Compiling with -qopt-report=5 generates an optimization report file, kernel.optrpt. You can take the source line for a basic block (for example, line 459 above) and look it up in the optimization report to find whether your loop/basic block was vectorized and, if not, why not. The optimization report also contains messages on, for example, whether arrays in the loop were aligned or unaligned.
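Looking up a source line in the report can be scripted. The Python sketch below is a rough illustration under stated assumptions: it assumes the report marks loops with `LOOP BEGIN at file(line,column)` and `LOOP END` and prefixes messages with `remark`, and the sample text is made up for this example rather than copied from kernel.optrpt. The report layout can differ between compiler versions, so treat the pattern as a starting point. The function name `remarks_for_line` is hypothetical.

```python
import re

# Assumed layout of a loop header in the report: "LOOP BEGIN at file(line,col)".
LOOP_BEGIN = re.compile(r"LOOP BEGIN at (.+)\((\d+),\d+\)")

# Made-up sample text for illustration only.
SAMPLE_REPORT = """\
LOOP BEGIN at kernel.f90(450,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized
   LOOP BEGIN at kernel.f90(459,7)
      remark #15300: LOOP WAS VECTORIZED
   LOOP END
LOOP END
"""


def remarks_for_line(report_text, source_line):
    """Collect remarks attached to loops that begin at the given source line."""
    open_loops = []  # source line of each loop that is currently open
    found = []
    for raw in report_text.splitlines():
        line = raw.strip()
        match = LOOP_BEGIN.match(line)
        if match:
            open_loops.append(int(match.group(2)))
        elif line.startswith("LOOP END"):
            if open_loops:
                open_loops.pop()
        elif line.startswith("remark") and open_loops and open_loops[-1] == source_line:
            found.append(line)
    return found


if __name__ == "__main__":
    for remark in remarks_for_line(SAMPLE_REPORT, 459):
        print(remark)
```

Remarks are attributed to the innermost open loop, so a nested loop's messages are not reported for its parent.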
This is just an example of the kind of analysis that is possible with instruction mix reports from Intel SDE, but a lot more analysis is possible. For more details please see /content/www/us/en/develop/articles/intel-software-development-emulator.html.
Configurations for the run: The instruction mix for the extracted kernel was generated using Intel® SDE version 7.2; the application was compiled with Intel® Compilers version 14.0.2 20140120. The run was conducted by Intel Engineer Karthik Raman. For more information go to http://www.intel.com/performance.
Efficient parallelism is key for applications in the HPC domain to achieve great performance and cluster scaling. It is more critical than ever with many-core architectures (like the Intel Xeon Phi coprocessor) and the increasing core counts of Intel Xeon processors.
The parallelism can span several layers: instruction level (superscalar), data level (SIMD/vectorization), and thread level, with shared memory (OpenMP) and/or distributed memory (MPI). Many HPC programs are moving to a hybrid shared memory/distributed memory programming model where both OpenMP and MPI are used.
You can test the thread scalability and efficiency of your application using existing hardware (Intel Xeon processor and/or Intel Xeon Phi coprocessor (Knights Corner)).
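Thread scalability is commonly summarized with two numbers per thread count: speedup (single-thread time divided by the time with n threads) and parallel efficiency (speedup divided by n). The Python sketch below computes both from measured wall-clock times. The function name `scaling_table` and the timings are hypothetical and are not measurements from this article.

```python
def scaling_table(timings):
    """Return (threads, speedup, efficiency) rows.

    timings -- dict mapping thread count to wall-clock seconds;
               it must contain an entry for 1 thread as the baseline.
    """
    baseline = timings[1]
    rows = []
    for threads in sorted(timings):
        speedup = baseline / timings[threads]
        rows.append((threads, speedup, speedup / threads))
    return rows


if __name__ == "__main__":
    # Hypothetical timings in seconds for 1, 2, 4 and 8 threads.
    measured = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
    for threads, speedup, efficiency in scaling_table(measured):
        print(f"{threads:2d} threads  speedup {speedup:5.2f}  efficiency {efficiency:4.2f}")
```

An efficiency that drops quickly as threads are added is a hint to look for serial sections, load imbalance, or synchronization overhead.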
Many tools are available for thread scalability analysis. A few are listed below: