Ready to run applications from a multicore platform onto the Intel® Xeon Phi™ coprocessor

We are at the stage where most developers do parallel programming on multicore platforms and are preparing to step into the Many Integrated Core (MIC) architecture.

The Intel® Xeon Phi™ coprocessor (based on the MIC architecture) combines many Intel CPU cores on a single chip, which is connected to an Intel Xeon processor (the host) through the PCI Express bus. The Intel Xeon Phi coprocessor runs a full-service Linux* operating system and communicates with the host.

The Intel Xeon Phi coprocessor offers high peak FLOPS, wide memory bandwidth, a large number of hardware threads running in parallel, and a powerful Vector Processing Unit (VPU) that processes 512-bit SIMD instructions.

My test machine has an Intel® Core™ i7 CPU at 1.6 GHz with 6 cores and HT technology, and it also contains an Intel® Xeon Phi™ coprocessor running at 1.1 GHz.

In this article, I will walk through some experiments (cases with samples) to show:

  • How to use Intel® C/C++ Composer XE to recompile code for the Intel Xeon Phi coprocessor
  • How to run it on the MIC device and use VTune™ Amplifier XE to analyze the performance

Preparation:

  • Ensure the MIC device is running: use “service mpss status” to check, and use “service mpss start” to start it if it is stopped
  • Ensure Intel® C/C++ Composer XE, the Intel® MPI Library, and Intel® VTune™ Amplifier XE are installed on the system, then set up their environments. For example:
    • source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
    • source /opt/intel/impi/4.1.0/bin64/mpivars.sh
    • source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh

Case 1: Use OpenMP* for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile, run the Pi-OMP code on the multicore system, and analyze performance

a. # icc -g -O3 -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED
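
The attached omp_pi.c is not reproduced in the article; a minimal sketch of an OpenMP Pi integration that would produce a remark like the one above could look like this (the step count and variable names are assumptions, the actual attachment may differ):

#include <stdio.h>
#include <omp.h>

#define NUM_STEPS 1000000000

int main(void)
{
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;
    long i;

    /* Integrate 4/(1+x^2) over [0,1]; this parallel loop with a
       reduction is what the compiler reports as parallelized. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; i++) {
        double x = ((double)i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP:  %.9f\n", sum * step);
    return 0;
}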

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m11.008s

user 2m8.496s

sys   0m0.179s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Opening the result in amplxe-gui, we can see:

√ Workloads on the threads are balanced

√ Each core is fully utilized

2. Compile the Pi-OMP code on the host, run it on the MIC device, and analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

# scp pi mic0:/root    ; copy program to the device

You have to copy the MIC runtime libraries to the device before running a native MIC program:

# scp -r /opt/intel/composer_xe_2013.1.117/compiler/lib/mic/*  mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m2.524s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Opening the result in amplxe-gui, we can see:

There are 241 threads working in parallel

Workloads on the threads are balanced

Each thread took ~2s, including time spent in Linux and the OpenMP library

Each core is not fully utilized

Case 2: Use MPI for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile, run the Pi-MPI code on the multicore system, and analyze performance

a. # mpiicc -g -O3 mpi_pi.c -o pi
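
The attached mpi_pi.c is likewise not reproduced here; a minimal sketch, assuming a simple rank-strided integration combined with MPI_Reduce (which would also explain the reduction overhead observed later on the MIC run), might look like this:

#include <stdio.h>
#include <mpi.h>

#define NUM_STEPS 1000000000

int main(int argc, char **argv)
{
    int rank, size;
    long i;
    double step = 1.0 / (double)NUM_STEPS;
    double partial = 0.0, pi = 0.0, start, end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    start = MPI_Wtime();

    /* Each rank integrates an interleaved subset of the steps. */
    for (i = rank; i < NUM_STEPS; i += size) {
        double x = ((double)i + 0.5) * step;
        partial += 4.0 / (1.0 + x * x);
    }

    /* Combine the partial sums on rank 0. */
    MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    end = MPI_Wtime();

    if (rank == 0) {
        printf("Computed value of Pi by using MPI:  %.9f\n", pi * step);
        printf("Elapsed time: %.2f seconds\n", end - start);
    }

    MPI_Finalize();
    return 0;
}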

b. Run program

# time mpirun -n 12 ./pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 21.72 second

real  0m21.760s

user 4m20.592s

sys   0m0.104s

c. Use VTune™ Amplifier XE to analyze (note that lightweight-hotspots is not supported for an MPI program on a single node, since the PMU resource cannot be reused)

# mpirun -n 12 amplxe-cl -r mpi_res_host -collect hotspots -- ./pi 

(There will be 12 result directories generated, one for each of the 12 processes; you can pick any of them to analyze.)

Opening the result in amplxe-gui, we can see:

The MPI program (12 processes) ran on 12 cores, with each core fully utilized

Each core ran a single thread

2. Compile the Pi-MPI code on the host, run it on the MIC device, and analyze performance

a. # mpiicc -g -O3 -mmic mpi_pi.c -o pi

# scp pi mic0:/root    ; copy program to the device

Copy the Intel MPI binaries and libraries onto the device:

# scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin

# scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64

b. Run program

# time ssh mic0 /bin/mpiexec -n 240 /root/pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 14.95 seconds

real  0m19.570s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots -r mpi_res_target -search-dir all:rp=. -- ssh mic0 /bin/mpiexec -n 240 /root/pi

(This is quite different from the host case: all thread information is stored in a single result directory.)

Opening the result in amplxe-gui, we can see:

Most of the time, the MPI processes are working in parallel

All cores are fully utilized most of the time

The Pi calculation itself takes only ~13s

But vmlinux and the OMP library take more time, probably due to the “reduction” work between processes

Case 3: Use Intel® Threading Building Blocks (TBB) for Pi calculation, to run on the Xeon host and the MIC device

  1. Compile, run the Pi-TBB code on the multicore system, and analyze performance

a. # icpc -g -O3 -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x -ltbb_debug tbb_pi.cpp -o pi
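
The attached tbb_pi.cpp is not shown in the article; a minimal sketch using tbb::parallel_reduce, whose body operator() would be the hotspot VTune reports below, could look like this (the class name and step count are illustrative, the actual attachment may differ):

#include <cstdio>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"

const long NUM_STEPS = 1000000000;
const double STEP = 1.0 / (double)NUM_STEPS;

// Reduction body; its operator() is what shows up as the hotspot in VTune.
class PiBody {
public:
    double sum;
    PiBody() : sum(0.0) {}
    PiBody(PiBody &other, tbb::split) : sum(0.0) {}

    void operator()(const tbb::blocked_range<long> &r) {
        double local = sum;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = ((double)i + 0.5) * STEP;
            local += 4.0 / (1.0 + x * x);
        }
        sum = local;
    }

    void join(PiBody &other) { sum += other.sum; }
};

int main()
{
    PiBody body;
    tbb::parallel_reduce(tbb::blocked_range<long>(0, NUM_STEPS), body);
    std::printf("Computed value of Pi:  %.9f\n", body.sum * STEP);
    return 0;
}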

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m10.887s

user 2m9.637s

sys   0m0.008s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Opening the result in amplxe-gui, we can see:

√ operator() on each thread takes ~10s

√ Workloads on the threads are balanced

√ 12 cores are fully utilized by 12 threads

2. Compile the Pi-TBB code on the host, run it on the MIC device, and analyze performance

a. # icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x tbb_pi.cpp /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/libtbb_debug.so.2 -o pi -lpthread

# scp pi mic0:/root    ; copy program to the device

Also, the TBB libraries need to be copied to the MIC device:

# scp -r /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/* mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m3.265s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Opening the result in amplxe-gui, we can see:

There are 166 threads working in parallel

Workloads on the threads are balanced

Each thread took ~3.25s, including time spent in operator(), Linux and the TBB library

Each core is not fully utilized

Case 4: Use OpenMP* for a matrix application, to run on the Xeon host and the MIC device

  1. Compile, run the Matrix-OMP code on the multicore system, and analyze performance

a. # icc -g -O3 -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED

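The attached matrix.c is not shown; a minimal sketch of an OpenMP matrix multiply whose inner loop the compiler can permute and vectorize (matching remarks like those above) might be as follows (the matrix size N and variable names are assumptions):

#include <stdio.h>
#include <omp.h>

#define N 2048

static float a[N][N], b[N][N], c[N][N];

int main(void)
{
    int i, j, k;

    /* Initialize the input matrices. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = (float)(i + j);
            b[i][j] = (float)(i - j);
            c[i][j] = 0.0f;
        }

    /* The outer loop is parallelized by OpenMP; the compiler can permute
       the inner loops and vectorize the unit-stride j loop. */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            for (j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];

    printf("c[%d][%d] = %f\n", N / 2, N / 2, c[N / 2][N / 2]);
    return 0;
}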

b. Run program

# time ./matrix

real  0m7.408s

user 1m17.586s

sys   0m0.344s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./matrix

Opening the result in amplxe-gui, we can see:

Workloads on the threads are balanced

Each core is fully utilized (~1,200% CPU usage for 6 cores with HT)

2. Compile the Matrix-OMP code on the host, run it on the MIC device, and analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED

# scp matrix mic0:/root    ; copy program to the device

(You have to copy the MIC libraries to the device before running a native MIC program, if you did not already do so in Case 1.)

b. Run program

# time ssh mic0 /root/matrix

real  0m1.695s
user 0m0.008s
sys   0m0.007s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/matrix

Opening the result in amplxe-gui, we can see:

There are 242 threads working in parallel

Workloads on the threads are balanced

Each thread took ~1.08s, including time spent in Linux and the OpenMP library

Each core is not fully utilized

Conclusion:
Your HPC applications might be well suited to running on the Intel Xeon Phi coprocessor, so it is time to start working on the MIC architecture. Intel C/C++ Composer XE helps you generate MIC code, and VTune Amplifier XE helps you analyze its performance.

Attachments:
  • omp-pi.c (503 bytes)
  • mpi-pi.c (1.12 KB)
  • tbb-pi.cpp (916 bytes)
  • matrix.c (481 bytes)