Getting ready to run applications from a multicore platform on the Intel® Xeon Phi™ coprocessor

We are at the stage where most developers do parallel programming on multicore platforms, and are preparing to step into the Many Integrated Core (MIC) architecture.

The Intel® Xeon Phi™ coprocessor (based on the MIC architecture) combines many Intel CPU cores on a single chip, which connects to an Intel Xeon processor (the host) through the PCI Express bus. The coprocessor runs a full-service Linux* operating system and communicates with the host.

The Intel Xeon Phi coprocessor offers high peak floating-point performance (FLOPS), wide memory bandwidth, many hardware threads running in parallel, and a robust Vector Processing Unit (VPU) that processes 512-bit SIMD.

My machine has an Intel® Core™ i7 CPU at 1.6 GHz, with 6 cores and HT technology, and it also hosts an Intel® Xeon Phi™ coprocessor running at 1.1 GHz.

In this article, I will walk through some experiments (cases with samples) that show:

  • How to use Intel® C/C++ Composer XE to recompile code for the Intel Xeon Phi coprocessor
  • How to run it on the MIC device and use VTune™ Amplifier XE to analyze performance

Preparation:

  • Ensure the MIC device is running: use “service mpss status” to check, and use “service mpss start” to start it if it has stopped
  • Ensure Intel® C/C++ Composer XE, the Intel® MPI Library, and Intel® VTune™ Amplifier XE are installed on the system, then set up their environments. For example:
    • source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
    • source /opt/intel/impi/4.1.0/bin64/mpivars.sh
    • source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh

Case 1: Use OpenMP* for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile and run the Pi-OMP code on the multicore system, and analyze performance
    a. # icc -g -O3 -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
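The source of omp_pi.c is not listed in this article. As a point of reference, a minimal midpoint-rule Pi integration that would produce the remark above might look like the following sketch (the function name and step count are my assumptions):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical sketch of the omp_pi.c kernel: integrate 4/(1+x^2) over
// [0,1] with the midpoint rule. The reduction clause is what lets icc
// report "OpenMP DEFINED LOOP WAS PARALLELIZED": each thread keeps a
// private partial sum, combined at the end of the loop.
double compute_pi(long num_steps) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < num_steps; ++i) {
        double x = (i + 0.5) * step;   // midpoint of slice i
        sum += 4.0 / (1.0 + x * x);    // integrand value at the midpoint
    }
    return step * sum;
}
```

Note that without -openmp the pragma is simply ignored and the loop runs serially, which is a convenient way to verify the numeric result before going parallel.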

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m11.008s

user 2m8.496s

sys   0m0.179s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Opening the result in amplxe-gui, we can see:

√ Workloads are balanced across threads

√ Every core is fully utilized

2. Compile the Pi-OMP code on the host, run it on the MIC device, and analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

# scp pi mic0:/root    ; copy the program to the device

You must copy the MIC libraries to the device before running a native MIC program:

# scp -r /opt/intel/composer_xe_2013.1.117/compiler/lib/mic/*  mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m2.524s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Opening the result in amplxe-gui, we can see:

There are 241 threads working in parallel

Workloads are balanced across threads

Each thread took ~2 s, including time spent in Linux and the OpenMP library

The cores are not fully utilized

Case 2: Use MPI for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile and run the Pi-MPI code on the multicore system, and analyze performance
    a. # mpiicc -g -O3 mpi_pi.c -o pi
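mpi_pi.c is likewise not shown. In a typical MPI version of this benchmark, each rank integrates a strided subset of the slices and MPI_Reduce combines the partial sums. The per-rank kernel, written here without MPI calls so it stays self-contained, might look like this sketch (names are my assumptions):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-rank kernel of mpi_pi.c: rank r handles slices
// r, r + nranks, r + 2*nranks, ... (round-robin distribution). The
// caller would combine the returned partials with MPI_Reduce.
double partial_pi(int rank, int nranks, long num_steps) {
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = rank; i < num_steps; i += nranks) {
        double x = (i + 0.5) * step;   // midpoint of slice i
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;                 // this rank's share of Pi
}
```

The real program would pass the values from MPI_Comm_rank and MPI_Comm_size as rank and nranks, then sum the partial results with MPI_Reduce(..., MPI_SUM, ...) on rank 0.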

b. Run program

# time mpirun -n 12 ./pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 21.72 second

real  0m21.760s

user 4m20.592s

sys   0m0.104s

c. Use VTune™ Amplifier XE to analyze (note that lightweight-hotspots is not supported for an MPI program on a single node, because the PMU resource cannot be reused)

# mpirun -n 12 amplxe-cl -r mpi_res_host -collect hotspots -- ./pi 

(Twelve result directories will be generated, one for each of the 12 processes; you can pick any one of them to analyze.)

Opening the result in amplxe-gui, we can see:

The MPI program (12 ranks) ran on 12 cores respectively, with every core fully utilized

Each core ran a single thread

2. Compile the Pi-MPI code on the host, run it on the MIC device, and analyze performance

a. # mpiicc -g -O3 -mmic mpi_pi.c -o pi

# scp pi mic0:/root    ; copy the program to the device

Copy the Intel MPI binaries and libraries to the device:

# scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin

# scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64

b. Run program

# time ssh mic0 /bin/mpiexec -n 240 /root/pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 14.95 seconds

real  0m19.570s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots -r mpi_res_target -search-dir all:rp=. -- ssh mic0 /bin/mpiexec -n 240 /root/pi

(Unlike the host case, all thread information is stored in a single result directory.)

Opening the result in amplxe-gui, we can see:

Most of the time, the MPI processes are working in parallel

All cores are fully utilized most of the time

The Pi calculation itself takes only ~13 s

But vmlinux and the MPI libraries take more time, probably for the “reduction” work between ranks

Case 3: Use Intel® Threading Building Blocks (TBB) for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile and run the Pi-TBB code on the multicore system, and analyze performance
    a. # icpc -g -O3 -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x -ltbb_debug tbb_pi.cpp -o pi

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m10.887s

user 2m9.637s

sys   0m0.008s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Opening the result in amplxe-gui, we can see:

√ operator() takes 10 s on each thread

√ Workloads are balanced across threads

√ 12 threads fully utilize the 12 cores

2. Compile the Pi-TBB code on the host, run it on the MIC device, and analyze performance

a. # icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x tbb_pi.cpp /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/libtbb_debug.so.2 -o pi -lpthread

# scp pi mic0:/root    ; copy program to the device

The TBB libraries also need to be copied to the MIC device:

# scp -r /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/* mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m3.265s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Opening the result in amplxe-gui, we can see:

There are 166 threads working in parallel

Workloads are balanced across threads

Each thread took ~3.25 s, including time spent in operator(), Linux, and the TBB library

The cores are not fully utilized

Case 4: Use OpenMP* for a matrix application, to run on the Xeon host and the MIC device

  1. Compile and run the matrix OMP code on the multicore system, and analyze performance
    a. # icc -g -O3 -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED
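matrix.c is also not listed. The pair of remarks (an OpenMP loop at line 16 and a permuted, vectorized loop at line 18 of the source) matches what icc typically emits for an i-k-j matrix multiply, so the kernel may look roughly like this sketch (written in C++ here; sizes and names are my assumptions):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the matrix.c kernel. Matrices are stored
// row-major in flat arrays. The j loop is unit-stride over b and c,
// so the compiler can permute it inward and vectorize it
// ("PERMUTED LOOP WAS VECTORIZED").
void matmul(const std::vector<double>& a, const std::vector<double>& b,
            std::vector<double>& c, int n) {
    #pragma omp parallel for            // "OpenMP DEFINED LOOP WAS PARALLELIZED"
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            double aik = a[i * n + k];  // hoisted: constant across the j loop
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}
```

The i-k-j loop order is what makes the innermost loop contiguous in memory, and that contiguity is exactly what the 512-bit VPU on the coprocessor needs to pay off.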

b. Run program

# time ./matrix

real  0m7.408s

user 1m17.586s

sys   0m0.344s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./matrix

Opening the result in amplxe-gui, we can see:

Workloads are balanced across threads

Each core is fully utilized (~1,200% CPU for 6 cores with HT)

2. Compile the matrix OMP code on the host, run it on the MIC device, and analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED

# scp matrix mic0:/root    ; copy program to the device

(Before running a native MIC program you must copy the MIC libraries to the device, if you have not already done so.)

b. Run program

# time ssh mic0 /root/matrix

real  0m1.695s
user 0m0.008s
sys   0m0.007s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/matrix

Opening the result in amplxe-gui, we can see:

There are 242 threads working in parallel

Workloads are balanced across threads

Each thread took ~1.08 s, including time spent in Linux and the OpenMP library

The cores are not fully utilized

Conclusion:
 Your HPC applications might be very well suited to the Intel Xeon Phi coprocessor, so it is time to start working with the MIC architecture. Intel C/C++ Composer XE helps you generate MIC code, and VTune Amplifier XE helps you analyze its performance.

For details on compiler optimization, see our Optimization Notice.