As part of the application readiness efforts for future Intel® Xeon® processors and Intel® Xeon Phi™ processors (code named Knights Landing), developers are interested in improving two key aspects of their workloads:
- Vectorization/code generation
- Thread parallelism
This article mainly talks about vectorization/code generation and lists some helpful tools and resources for thread parallelism.
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will first be implemented on the Knights Landing processor and coprocessor, and will also be supported on some future Intel Xeon processors scheduled to be introduced after Knights Landing.
For more details on Intel AVX-512 refer to: https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
- Intel AVX-512 offers significant improvements and refinements over the Intel® Initial Many Core Instructions (Intel® IMCI) found on current Intel® Xeon Phi™ coprocessors code named Knights Corner.
- The current Intel® Compiler (14.0+) can compile your code for Knights Landing, and you can run the resulting binary on the Intel® Software Development Emulator (Intel® SDE). Intel® Compilers are available as part of Intel® Parallel Studio XE, which is offered for trial and purchase along with its product documentation.
- Intel SDE is an emulator for upcoming instruction set architecture (ISA) extensions. It allows you to run programs that use new instructions on existing hardware that lacks those new instructions.
- Intel SDE is useful for performance analysis, compiler development, tuning, and application development of libraries.
- Intel SDE for Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake (Client), Goldmont, and Knights Landing (KNL) is available here: http://www.intel.com/software/sde.
- Please note that Intel SDE is a software emulator mainly used for emulating future instructions. It is neither cycle-accurate nor performance-accurate, and it can be very slow (up to 100x slower than native execution).
- Instruction Mix:
- Intel SDE comes with several useful emulator-enabled pin tools, and one of them is the mix histogramming tool.
- This mix histogramming tool can compute histograms using any of the following: dynamic instructions executed, instruction length, instruction category, and ISA extension grouping.
- The mix-mt tool can also display the top N most frequently executed basic blocks and disassemble them.
- Plethora of information from instruction mix reports:
- Top basic blocks in terms of instruction %, dynamic instruction execution for evaluating compiler code generation, function-based instruction count breakdown, instruction count of each ISA type etc.
- With appropriate parser scripts you can also evaluate FLOP counts, INT counts, memory operation counts, SIMD intensity (operations/instructions), etc.
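As a rough illustration of the SIMD intensity metric mentioned above (operations/instructions), the sketch below computes it from a tally of dynamic instruction counts. The tally layout, a list of `(dynamic_count, elements_per_instruction)` pairs, is a hypothetical intermediate format that a parser script would have to produce; it is not the layout of the Intel SDE report itself, and the numbers are made up.

```python
def simd_intensity(tally):
    """Return operations per dynamic instruction.

    `tally` is a list of (dynamic_count, elements_per_instruction) pairs.
    This layout is an assumption for illustration, not the SDE report format.
    """
    total_instructions = sum(count for count, _ in tally)
    if total_instructions <= 0:
        raise ValueError("tally must contain at least one executed instruction")
    # Each instruction contributes one operation per element it processes.
    total_operations = sum(count * lanes for count, lanes in tally)
    return total_operations / total_instructions


# Made-up example: 1000 scalar instructions and 500 instructions
# that each operate on 8 elements.
example_tally = [(1000, 1), (500, 8)]
print(f"SIMD intensity: {simd_intensity(example_tally):.2f}")
```

A value close to 1 suggests mostly scalar code, while larger values indicate that more of the work is done by wide vector instructions.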
Compiling your application for Knights Landing
Use the latest Intel Compilers (14.0+) and compile with the “-xMIC-AVX512” compiler knob to generate a Knights Landing (KNL) binary.
To run your application on Intel SDE
sde -knl -- ./<knl.exe> <args.>
OR you can run your MPI application as
mpirun -n <no. of ranks> sde -knl -- ./<knl.exe> <args.>
To generate “Instruction Mix” reports using Intel SDE for different architectures:
Intel Xeon Phi coprocessor
sde -knl -mix -top_blocks 100 -iform 1 -- ./<knl.exe> <args.>
You can also run the corresponding Intel Xeon processor binary on Intel SDE for comparisons and analysis purposes:
Intel Xeon processor
sde -ivb -mix -top_blocks 100 -iform 1 -- ./<ivb.exe> <args.>
sde -hsw -mix -top_blocks 100 -iform 1 -- ./<hsw.exe> <args.>
It is recommended to generate instruction mix reports using single MPI/OpenMP* thread runs (OMP_NUM_THREADS=1) for analysis simplification purposes.
For resolving thread parallelism issues refer to the thread parallelism section below.
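If you generate instruction mix reports for several architectures, it can help to assemble the command lines programmatically. The following is a minimal sketch that only uses the switches shown above (-knl, -ivb, -hsw, -mix, -top_blocks, -iform); the helper names `build_sde_mix_command` and `single_thread_env` are hypothetical and not part of Intel SDE.

```python
import os

# Architecture switches that appear in this article. Intel SDE accepts
# others, but they are deliberately not guessed at here.
KNOWN_ARCH_FLAGS = ("knl", "ivb", "hsw")


def build_sde_mix_command(arch, executable, app_args=(), top_blocks=100):
    """Return the argument list for an instruction-mix run under Intel SDE."""
    if arch not in KNOWN_ARCH_FLAGS:
        raise ValueError(f"unsupported architecture flag: {arch!r}")
    return [
        "sde", f"-{arch}", "-mix",
        "-top_blocks", str(top_blocks),
        "-iform", "1",
        "--", executable, *app_args,
    ]


def single_thread_env(base_env=None):
    """Copy an environment and force a single OpenMP thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = "1"
    return env


cmd = build_sde_mix_command(
    "knl", "./snap", ["./large-1thread.input", "./large-1thread.output"]
)
print(" ".join(cmd))
```

On a machine where Intel SDE is installed, the list could be passed to `subprocess.run(cmd, env=single_thread_env(), check=True)`; the sketch itself only builds and prints the command.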
Example - Build and Run Application on Intel SDE
- Untar the SNAPJune13.tar.gz – tar xvzf SNAPJune13.tar.gz
- Change directory into the ‘SNAPJune13’ directory – cd SNAPJune13
- Change directory into the ‘src’ directory – cd src
- Untar the src.tar – tar xvf src.tar
- Edit the ‘Makefile’ as follows (vi Makefile):
- Change the “FORTRAN = ftn” to “FORTRAN = mpiifort”
- Change the “FFLAGS = -O3 -mp” to “FFLAGS = -O3 -xMIC-AVX512 -g -openmp -parallel-source-info=2”
- Source the environment for the latest Intel MPI and Intel Compiler (preferably 14.0+).
- Build the executable by running make on the terminal – make
- Copy the executable ‘snap’ from the ../SNAPJune13/src directory to the ../SNAPJune13/large directory.
- Change directory into the ‘large’ directory – cd ../large/
- Copy the large-2048nodes.input as large-1thread.input (cp large-2048nodes.input large-1thread.input)
- Make the following changes to the settings in the large-1thread.input file (vim large-1thread.input)
- Run on the Intel SDE as follows: sde -knl -mix -top_blocks 100 -iform 1 -- ./snap ./large-1thread.input ./large-1thread.output
- The run is completed when you see ‘Success! Done in a SNAP!’ on the stdout.
- This will generate an ‘sde-mix-out.txt’ which contains the instruction mix information.
- A snapshot of the instruction profile would look like –
Example Analysis using instruction mix report from Intel SDE
Extracted Kernel from http://www.berkeleygw.org/.
Total Dynamic Instruction Reduction:
- Intel AVX -> Intel AVX2 Reduction: 1.08x
- Intel AVX2 -> Intel AVX-512 Reduction: 3.15x
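The reduction factors above are ratios of total dynamic instruction counts (count before divided by count after). The short sketch below shows that arithmetic and how the two reported steps compose into an overall Intel AVX to Intel AVX-512 factor, assuming both ratios were measured on the same run of the kernel. The function name and the sample counts in the first call are placeholders, not measured values.

```python
def instruction_reduction(count_before, count_after):
    """Reduction factor: how many times fewer dynamic instructions execute."""
    if count_before <= 0 or count_after <= 0:
        raise ValueError("instruction counts must be positive")
    return count_before / count_after


# Placeholder counts, chosen only to show the calculation.
print(instruction_reduction(2_000_000, 1_000_000))

# Reduction factors reported in this article.
avx_to_avx2 = 1.08
avx2_to_avx512 = 3.15

# Successive reductions multiply, because the intermediate count cancels out.
avx_to_avx512 = avx_to_avx2 * avx2_to_avx512
print(f"Intel AVX -> Intel AVX-512 reduction: {avx_to_avx512:.2f}x")
```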
Function Level Breakdown
Further Breakdown on isa-set categories
Significant % of x87 code for Intel AVX and Intel AVX2 for this kernel.
Intel SDE also provides the top basic blocks for your run, ranked by hot instruction execution.
If you look at the top basic blocks, you see a significant number of x87 instructions in this kernel for the Intel AVX/AVX2 code. Below is just a snippet of the first basic block for Intel AVX2 instruction mix report.
The corresponding source code for the above basic block is line 459 (as highlighted above).
Looking at the source, we observed that the statement on line 459 involves a “complex” division. The compiler generates an x87 instruction sequence for it to conform to strict IEEE semantics and to avoid overflow and underflow.
The way to avoid this is to compile with -fp-model fast=2. This allows the compiler to assume that the real and imaginary parts of the double-precision denominator lie within a range where overflow and underflow are not a concern, so it generates simple code without the x87 sequence described above. It can then generate vector Intel AVX/AVX2 instructions for the entire loop.
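To see why complex division needs care in the first place, the sketch below compares the textbook formula with a scaled variant (Smith's algorithm). This is only an illustration of the overflow problem: Smith's algorithm is a stand-in chosen here for clarity and is not what the Intel compiler emits, which, as described above, uses an x87 sequence. The function names are hypothetical.

```python
import math


def naive_complex_divide(a, b, c, d):
    """(a + bi) / (c + di) with the textbook formula.

    The intermediate c*c + d*d can overflow even when the true quotient
    is a perfectly ordinary number.
    """
    denom = c * c + d * d
    return (a * c + b * d) / denom, (b * c - a * d) / denom


def smith_complex_divide(a, b, c, d):
    """(a + bi) / (c + di) with Smith's scaling.

    Dividing by the larger denominator component first keeps the
    intermediates in range. A zero denominator is not handled here.
    """
    if abs(c) >= abs(d):
        r = d / c
        denom = c + d * r
        return (a + b * r) / denom, (b - a * r) / denom
    r = c / d
    denom = c * r + d
    return (a * r + b) / denom, (b * r - a) / denom


big = 1e200
naive = naive_complex_divide(big, big, big, big)
scaled = smith_complex_divide(big, big, big, big)
print("naive :", naive)
print("scaled:", scaled)
print("naive result is NaN:", math.isnan(naive[0]))
```

When the operands are known to stay in a moderate range, the textbook formula is safe and much easier to vectorize, which is the trade-off that -fp-model fast=2 makes.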
The EXECUTIONS count in the basic block is the number of times this basic block was executed, and ICOUNT is the total number of instructions executed for this basic block across all executions. Thus ICOUNT/EXECUTIONS gives the number of instructions in this basic block.
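The ICOUNT/EXECUTIONS relationship can be written as a one-line helper. The function name and the sample values below are hypothetical; real values would be read from the basic block header in the instruction mix report.

```python
def instructions_per_block(icount, executions):
    """Static size of a basic block, derived from its dynamic counts."""
    if executions <= 0:
        raise ValueError("EXECUTIONS must be positive")
    return icount / executions


# Made-up values: a block that ran 40,000 times and accounted for
# 1,200,000 dynamic instructions in total.
size = instructions_per_block(icount=1_200_000, executions=40_000)
print(f"instructions in this basic block: {size:g}")
```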
In addition, the combination of the vectorization optimization report generated by the compiler (using -qopt-report=5) and the SDE top basic blocks can be used for a first-pass ‘vectorization study’. Compiling with -qopt-report=5 generates an optimization report file, kernel.optrpt. You can look up the source line for a basic block (for example, line 459 above) and map it to the optimization report to find out whether your loop/basic block was vectorized and, if not, why not. The optimization report also contains messages indicating whether arrays in the loop were aligned or unaligned.
This is just an example of the kind of analysis that is possible with instruction mix reports from Intel SDE, but a lot more analysis is possible. For more details please see https://software.intel.com/en-us/articles/intel-software-development-emulator.
Configurations for the run: The instruction mix for the extracted kernel was generated using Intel® SDE version 7.2, and the application was compiled with Intel® Compilers version 14.0.2 20140120. The run was conducted by Intel Engineer Karthik Raman. For more information go to http://www.intel.com/performance.
2) Thread Parallelism
Efficient parallelism is key for applications in the HPC domain to achieve great performance and cluster scaling. It is more critical than before, both on many-core architectures (like the Intel Xeon Phi coprocessor) and on Intel Xeon processors with their increasing core counts.
The parallelism can span several layers: instruction level (superscalar), data level (SIMD/vectorization), and thread level, using shared memory (OpenMP) and/or distributed memory (MPI). Many HPC programs are moving to a hybrid shared memory/distributed memory programming model where both OpenMP and MPI are used.
You can test the thread scalability and efficiency of your application using existing hardware (Intel Xeon processor and/or Intel Xeon Phi coprocessor (Knights Corner)).
Many tools are available for thread scalability analysis. A few are listed below:
- OpenMP scalability analysis using Intel® VTune™ Amplifier XE 2015
Serial vs. Parallel time, Spin Overheads, Potential gains possible etc.
- Intel® Trace Analyzer and Collector
To understand MPI application behavior, quickly find bottlenecks, and achieve high performance for parallel cluster applications.
- Intel® Inspector XE 2015
Memory and threading error debugger and thread dependency analysis.