Does Intel® Cluster Studio XE work on the Intel® Xeon Phi™ coprocessor? OpenMP*? TBB? MPI?

Intel® Cluster Studio XE 2013 is a tool suite for developing cluster applications. It includes:

- the low-latency Intel® MPI Library
- high-performance C/C++ and Fortran compilers
- VTune™ Amplifier XE 2013, a performance profiler
- Intel® Trace Collector/Analyzer, for tracing and analyzing MPI communication
- Inspector XE 2013, for threading and memory correctness checking

This article will help you:

  • Get familiar with using Intel® Software Development Products on the Intel® Xeon Phi™ coprocessor
  • Learn the different usage modes: native, offload, and hybrid (host plus coprocessor) MPI
  • Get familiar with Intel® Trace Collector/Analyzer and VTune™ Amplifier XE
Note:
1. All demo code is in the attached demos.zip, so you can try the demos below yourself.
2. Use amplxe-gui to open VTune™ Amplifier XE results. Screenshots of the results are attached.
 
[Figure: Intel® Xeon Phi™ coprocessor software configuration (sw-config.png)]
 

Key features of the Intel® Xeon Phi™ Coprocessor:

  • 50+ cores that run the Intel® architecture (x86) instruction set
  • 4 hardware threads per physical core
  • 512-bit vector registers for SIMD (vector) operations
  • 512 KB of L2 cache per core
  • A high-speed bidirectional ring interconnect linking the cores

Getting Ready…

  • Ensure the Xeon Phi™ coprocessor is running
    • Use “service mpss status” to check
    • Use “service mpss start” to start it if it is stopped
  • Install Intel® Cluster Studio XE 2013
  • Install the VTune™ Amplifier XE sampling driver on the coprocessor
    • Check whether the driver is loaded on the coprocessor:

# ssh mic0

# lsmod | grep sep3 

Example output: sep3_8                 45016  0

If the driver is not loaded, install it from the VTune™ Amplifier XE directory:

# cd vtune_root/bin64/k1om/

# ./sep_micboot_install.sh

Then restart MPSS with “service mpss restart”.

Setting environment variables and copying coprocessor runtime files

  • source /opt/intel/composer_xe_2013.2.146/bin/compilervars.sh intel64
  • source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh
  • source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
  • source /opt/intel/itac/8.1.0.024/bin/itacvars.sh impi4
  • export I_MPI_MIC=1
  • export I_MPI_FABRICS=shm:tcp
  • export VT_LOGFILE_FORMAT=stfsingle
  • scp -r /opt/intel/composer_xe_2013.2.146/compiler/lib/mic/* mic0:/lib64/
  • scp -r /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin/
  • scp -r /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64/
  • scp -r /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/* mic0:/lib64
Demo #1, OpenMP* program on the Xeon Phi™ coprocessor (native mode)
1. Compile the OpenMP code for the Xeon Phi™ coprocessor
# icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o omp_pi.MIC
omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
2. Copy the binary to the target device
# scp omp_pi.MIC mic0:/root
omp_pi.MIC                                    100%   20KB  19.7KB/s   00:00
3. Analyze it with VTune™ Amplifier XE
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/omp_pi.MIC
 
 
 

 
Demo #2, Intel® TBB program on the Xeon Phi™ coprocessor (native mode)
1. Compile the TBB code for the Xeon Phi™ coprocessor
# icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x /opt/intel/composer_xe_2013.2.146/tbb/lib/mic/libtbb_debug.so.2 tbb_pi.cpp -o tbb_pi.MIC -lpthread
2. Copy the binary to the target device
# scp tbb_pi.MIC mic0:/root
tbb_pi.MIC                                    100%   91KB  90.8KB/s   00:00
3. Analyze it with VTune™ Amplifier XE
# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/tbb_pi.MIC
 
 
Demo #3, “offload” program on the Xeon Phi™ coprocessor
1. Compile the “offload” code on the host (no -mmic is needed; the compiler generates both host and coprocessor code)
# icc -g -O3 -openmp -openmp-report offload_pi.c -o offload_pi
offload_pi.c(18): (col. 9) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
offload_pi.c(18): (col. 9) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
2. Analyze it with VTune™ Amplifier XE (the binary runs on the host and offloads work to the coprocessor)
# amplxe-cl -collect knc-lightweight-hotspots -- ./offload_pi
 
 
Demo #4, MPI program on the Xeon Phi™ coprocessor
1. Compile the MPI code for both the Xeon host and the Xeon Phi™ coprocessor
# mpiicc -g -openmp -O3 -o test-openmp test-openmp.c 
# mpiicc -g -openmp -mmic -O3 -o test-openmp.MIC test-openmp.c 
2. Copy the coprocessor binary to the target device
# scp test-openmp.MIC mic0:/root
test-openmp.MIC                               100%   17KB  17.2KB/s   00:00
3. First, run the program on the host and on the coprocessor separately:
# mpirun -host `hostname` -n 2 ./test-openmp
# mpirun -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
4. Run on the host and the coprocessor together (hybrid mode)
# mpirun -env OMP_NUM_THREADS 2 -host `hostname` -n 2 ./test-openmp : -env OMP_NUM_THREADS 4 -host mic0 -n 2 /root/test-openmp.MIC
 
 
Demo #5, Use VTune™ Amplifier XE to analyze the poisson program
1. Compile the MPI code for the Xeon Phi™ coprocessor
# make clean && make MIC
2. Copy the binary to the target device
# scp poisson.MIC mic0:/root
3. Collect a general-exploration profile on the coprocessor
# amplxe-cl -collect knc-general-exploration -cpu-mask=1-64 --search-dir all:rp=. -- ssh mic0 OMP_NUM_THREADS=64 /root/poisson.MIC -n 3500 -iter 10
 
 
Demo #6, Intel Trace Collector / Analyzer
 
1. Compile the MPI code for the host and the Xeon Phi™ coprocessor
# make clean && make
# make clean && make MIC
Note: the Makefile passes the “-tcollect” option, which instruments the code for Intel® Trace Collector.
2. Copy the coprocessor binary to the target device
# scp poisson.MIC mic0:/root
3. Run the program on the host and the coprocessor with tracing enabled:
# export VT_LOGFILE_FORMAT=stfsingle
# mpirun -env OMP_NUM_THREADS 1 -host `hostname` -n 2 ./poisson -n 3500 -iter 10 : -env OMP_NUM_THREADS 1 -host mic0 -n 6 /root/poisson.MIC -n 3500 -iter 10
 
4. Open the resulting trace file in Intel® Trace Analyzer:
# traceanalyzer poisson.single.stf
Attachments:
  • demos.zip (69.72 KB)
  • sw-config.png (275.5 KB)
  • omp.png (246.17 KB)
  • tbb.png (112.46 KB)
  • offload.png (8.99 KB)
  • general.png (108.11 KB)
  • itac.png (35.01 KB)