Does Intel® Cluster Studio XE work on the Intel® Xeon Phi™ coprocessor? With OpenMP*? TBB? MPI?

Intel® Cluster Studio XE 2013 is a tool suite for developing applications. It includes the low-latency Intel® MPI Library, high-performance C++/Fortran compilers, the native profiling component VTune™ Amplifier XE 2013, the node-level analysis component Intel® Trace Collector/Analyzer, and the threading and memory correctness component Inspector XE 2013.

The purposes of this article are to:

  • Become familiar with using Intel® Software Development Products on the Intel® Xeon Phi™ coprocessor
  • Learn the different usage modes for development
  • Become familiar with Intel® Trace Collector/Analyzer and VTune™ Amplifier XE
Note:
1. All demo code is attached in a zip file, so you can practice the demos below.
2. Use amplxe-gui to open the VTune™ Amplifier XE results. Some screenshots are shown in the demos.
Intel® Xeon Phi™ coprocessor software configuration

Key features of the Intel® Xeon Phi™ Coprocessor:

  • 50+ cores which run the Intel instruction set architecture 
  • 4 threads per physical core
  • 512 bit registers for SIMD operations (vector operations)
  • 512 KB L2 cache per core
  • High speed bi-directional ring connecting the 50+ cores

Getting Ready…

  • Ensure the Xeon Phi™ coprocessor is running
    • Use “service mpss status” to check
    • Use “service mpss start” to start it if it is stopped
  • Install Intel® Cluster Studio XE 2013 
  • Install the VTune™ Amplifier driver on the Phi coprocessor
    • Check whether the driver is loaded on the Phi coprocessor:

# ssh mic0

# lsmod | grep sep3 

Example output: sep3_8                 45016  0

If the driver is not installed, install it from the host:

# cd vtune_root/bin64/k1om/

# ./sep_micboot_install.sh

Use “service mpss restart” to restart mpss

Setting environment variables and copying runtime libraries to the coprocessor

  • source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
  • source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
  • source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
  • source /opt/intel/itac/8.1.0.024/bin/itacvars.sh impi4
  • export I_MPI_MIC=1
  • export I_MPI_FABRICS=shm:tcp
  • export VT_LOGFILE_FORMAT=stfsingle
  • scp -r /opt/intel/composer_xe_2013.2.146/compiler/lib/mic/* mic0:/lib64/
  • scp -r /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin/
  • scp -r /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64/
  • scp -r /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/* mic0:/lib64
Demo #1, OpenMP* program on Xeon Phi coprocessor 
1. Compile OpenMP code for Xeon Phi Coprocessor
# icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o omp_pi.MIC
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
2. Copy binary to the target device
# scp omp_pi.MIC mic0:/root
omp_pi.MIC                                    100%   20KB  19.7KB/s   00:00
3. Use VTune™ Amplifier XE to analyze 
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/omp_pi.MIC
 

 
Demo #2, Intel® TBB built program on Xeon Phi coprocessor
1. Compile TBB code for Xeon Phi Coprocessor
# icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/libtbb_debug.so.2 tbb_pi.cpp -o tbb_pi.MIC -lpthread
2. Copy binary to the target device
# scp tbb_pi.MIC mic0:/root
tbb_pi.MIC                                    100%   91KB  90.8KB/s   00:00
3. Use VTune™ Amplifier XE to analyze 
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/tbb_pi.MIC
 
Demo #3, “Offload” program on Xeon Phi coprocessor
1. Compile “offload” code for Xeon Phi Coprocessor
# icc -g -O3 -openmp -openmp-report offload_pi.c -o offload_pi
offload_pi.c(18): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
offload_pi.c(18): (col. 9) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
2. Use VTune™ Amplifier XE to analyze 
# amplxe-cl -collect knc-lightweight-hotspots -- ./offload_pi
 
Demo #4, MPI program on Xeon Phi coprocessor
1. Compile MPI code for Xeon and Xeon Phi Coprocessor
# mpiicc -g -openmp -O3 -o test-openmp test-openmp.c 
# mpiicc -g -openmp -mmic -O3 -o test-openmp.MIC test-openmp.c 
2. Copy binary to the target device
# scp test-openmp.MIC mic0:/root
test-openmp.MIC                               100%   17KB  17.2KB/s   00:00
3. First run the Intel MPI tests separately, on the host and then on the coprocessor:
# mpirun -host `hostname` -n 2 ./test-openmp
# mpirun -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
4. Run the MPI program in hybrid mode, with ranks on both the host and the Xeon Phi coprocessor
# mpirun -env OMP_NUM_THREADS 2 -host `hostname` -n 2 ./test-openmp : -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
 
Demo #5, Use VTune™ Amplifier XE to analyze
1. Compile MPI code for Xeon Phi™ Coprocessor
# make clean && make MIC
2. Copy binary to the target device
# scp poisson.MIC mic0:/root
3. Collect general-exploration data with VTune™ Amplifier XE
# amplxe-cl -collect knc-general-exploration -cpu-mask=1-64 --search-dir all:rp=. -- ssh mic0 OMP_NUM_THREADS=64 /root/poisson.MIC -n 3500 -iter 10
 
Demo #6, Intel Trace Collector / Analyzer
1. Compile MPI code for Xeon Phi™ Coprocessor
# make clean && make
# make clean && make MIC
Note: the Makefile uses the “-tcollect” option, which instruments the code for Intel® Trace Collector
2. Copy binary to the target device
# scp poisson.MIC mic0:/root
3. Run the MPI program in hybrid mode to collect a trace, then open the trace in Intel® Trace Analyzer:
# export VT_LOGFILE_FORMAT=stfsingle
# mpirun -env OMP_NUM_THREADS 1 -host `hostname` -n 2 ./poisson -n 3500 -iter 10 : -env OMP_NUM_THREADS 1 -host mic0 -n 6 /root/poisson.MIC -n 3500 -iter 10
# traceanalyzer poisson.single.stf