Ready to run applications from a multicore platform on the Intel® Xeon Phi™ coprocessor

We are at the stage where most developers do parallel programming on multicore platforms, and are preparing to step into the Many Integrated Core (MIC) architecture.

The Intel® Xeon Phi™ coprocessor (based on the MIC architecture) combines many Intel CPU cores on a single chip, which connects to an Intel Xeon processor (the host) through the PCI Express bus. The Intel Xeon Phi coprocessor runs a full-service Linux* operating system and communicates with the host.

The Intel Xeon Phi coprocessor offers high peak FLOPS, wide memory bandwidth, many hardware threads running in parallel, and a robust Vector Processing Unit (VPU) that processes 512-bit SIMD instructions.

My machine has an Intel® Core™ i7 CPU running at 1.6 GHz, with 6 cores and Intel® Hyper-Threading Technology, and it also contains an Intel® Xeon Phi™ coprocessor running at 1.1 GHz.

In this article, I will walk through some experiments (cases with samples) to show:

  • How to use Intel® C/C++ Composer XE to recompile code for the Intel Xeon Phi coprocessor
  • How to run it on the MIC device and use VTune™ Amplifier XE to analyze the performance

Preparation:

  • Ensure the MIC device is running: use “service mpss status” to check, and use “service mpss start” to start it if it has stopped
  • Ensure Intel® C/C++ Composer XE, the Intel® MPI Library, and Intel® VTune™ Amplifier XE are installed on the system, then set up their environments. For example,
    • source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
    • source /opt/intel/impi/4.1.0/bin64/mpivars.sh
    • source /opt/intel/vtune_amplifier_xe_2013/amplxe-vars.sh
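
To quickly verify the tools are on the PATH after sourcing these scripts, you can check them like this (example commands only, output omitted):

# icc -V
# which mpiicc
# amplxe-cl -version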

Case 1: Use OpenMP* for Pi calculation, to run on the Xeon host and the MIC device
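
The attached omp_pi.c is not listed inline; below is a minimal sketch of the numerical-integration kernel it likely implements (the step count and output format are assumptions and may differ from the attachment):

#include <stdio.h>

int main()
{
    const long num_steps = 1000000000;          /* assumed problem size */
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    /* This is the kind of loop the compiler reports as "OpenMP DEFINED LOOP WAS PARALLELIZED" */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("Computed value of Pi by using OpenMP:  %.9f\n", sum * step);
    return 0;
}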

  1. Compile, run Pi-OMP code on multicore system, analyze performance
    a. # icc -g -O3 -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m11.008s

user  2m8.496s

sys   0m0.179s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Observing the result opened in amplxe-gui, we see:

√ Workloads on threads are balanced

√ Each core is fully utilized

2. Compile the Pi-OMP code on the host and run it on MIC, analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report omp_pi.c -o pi

omp_pi.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

# scp pi mic0:/root        # copy the program to the device

You have to copy the MIC runtime libraries to the device before running a native MIC program:

# scp -r /opt/intel/composer_xe_2013.1.117/compiler/lib/mic/*  mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m2.524s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Observing the result opened in amplxe-gui, we see:

There are 241 threads working in parallel

Workloads on threads are balanced

Each thread took ~2s, including time spent in Linux and the OpenMP library

Each core is not fully utilized
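
If you want to experiment with thread count and placement when running natively, the Intel OpenMP runtime variables can be passed on the ssh command line (the values below are only examples, not the settings used above):

# ssh mic0 "OMP_NUM_THREADS=120 KMP_AFFINITY=scatter /root/pi"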

Case 2: Use MPI for Pi calculation, to run on the Xeon host and the MIC device
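
The attached mpi_pi.c is not listed inline; a minimal sketch of such an MPI Pi kernel might look like this (the strided work split, step count, and output format are assumptions and may differ from the attachment):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long num_steps = 1000000000;          /* assumed problem size */
    double step = 1.0 / (double)num_steps;
    double local = 0.0, pi = 0.0, t0, t1;
    int rank, size;
    long i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    t0 = MPI_Wtime();

    /* Each rank integrates a strided subset of the interval */
    for (i = rank; i < num_steps; i += size) {
        double x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }

    /* The "reduction" work between ranks mentioned in the analysis below */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    t1 = MPI_Wtime();

    if (rank == 0) {
        printf("Computed value of Pi by using MPI:  %.9f\n", pi * step);
        printf("Elapsed time: %.2f seconds\n", t1 - t0);
    }
    MPI_Finalize();
    return 0;
}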

  1. Compile, run Pi-MPI code on multicore system, analyze performance
    a. # mpiicc -g -O3 mpi_pi.c -o pi

b. Run program

# time mpirun -n 12 ./pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 21.72 second

real  0m21.760s

user 4m20.592s

sys   0m0.104s

c. Use VTune™ Amplifier XE to analyze (note that lightweight-hotspots is not supported for an MPI program on a single node, because the PMU resource cannot be reused)

# mpirun -n 12 amplxe-cl -r mpi_res_host -collect hotspots -- ./pi 

(There will be 12 result directories generated, one for each of the 12 processes - you can pick any one of them to analyze)

Observing the result opened in amplxe-gui, we see:

The MPI program (12 processes) ran on 12 cores respectively, with each core fully utilized

Each core ran a single process

2. Compile the Pi-MPI code on the host and run it on MIC, analyze performance

    a. # mpiicc -g -O3 -mmic mpi_pi.c -o pi

# scp pi mic0:/root        # copy the program to the device

Copy the Intel MPI binaries and libraries onto the device:

# scp /opt/intel/impi/4.1.0.024/mic/bin/* mic0:/bin

# scp /opt/intel/impi/4.1.0.024/mic/lib/* mic0:/lib64

b. Run program

# time ssh mic0 /bin/mpiexec -n 240 /root/pi

Computed value of Pi by using MPI:  3.141592654

Elapsed time: 14.95 seconds

real  0m19.570s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots -r mpi_res_target -search-dir all:rp=. -- ssh mic0 /bin/mpiexec -n 240 /root/pi

(This is quite different from the host case: all processes' information is stored in one result directory)

Observing the result opened in amplxe-gui, we see:

Most of the time, the MPI processes are working in parallel

All cores are fully utilized most of the time

The Pi calculation itself takes only ~13s

But vmlinux and the runtime libraries take more time, probably due to the “reduction” work between processes

Case 3: Use Threading Building Blocks (TBB) for Pi calculation, to run on the Xeon host and the MIC device
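
The attached tbb_pi.cpp is not listed inline; a minimal sketch using tbb::parallel_reduce with a body class (the operator() seen in the VTune result) might look like this - the step count and output format are assumptions and may differ from the attachment:

#include <cstdio>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

struct PiBody {
    double sum;
    double step;
    PiBody(double s) : sum(0.0), step(s) {}
    PiBody(PiBody &other, tbb::split) : sum(0.0), step(other.step) {}

    /* Per-chunk integration; this is the operator() that shows up as a hotspot */
    void operator()(const tbb::blocked_range<long> &r) {
        double local = sum;
        for (long i = r.begin(); i != r.end(); ++i) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        sum = local;
    }

    /* Combine partial sums from split bodies */
    void join(const PiBody &rhs) { sum += rhs.sum; }
};

int main()
{
    const long num_steps = 1000000000;          /* assumed problem size */
    double step = 1.0 / (double)num_steps;
    PiBody body(step);
    tbb::parallel_reduce(tbb::blocked_range<long>(0, num_steps), body);
    printf("Computed value of Pi by using TBB:  %.9f\n", body.sum * step);
    return 0;
}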

  1. Compile, run Pi-TBB code on multicore system, analyze performance
    a. # icpc -g -O3 -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x -ltbb_debug tbb_pi.cpp -o pi

b. Run program

# time ./pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m10.887s

user  2m9.637s

sys   0m0.008s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./pi

Observing the result opened in amplxe-gui, we see:

√ operator() on each thread takes 10s

√ Workloads on threads are balanced

√ 12 cores are fully utilized by 12 threads

2. Compile the Pi-TBB code on the host and run it on MIC, analyze performance

a. # icpc -g -O3 -mmic -DTBB_DEBUG -DTBB_USE_THREADING_TOOLS -std=c++0x tbb_pi.cpp /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/libtbb_debug.so.2 -o pi -lpthread

# scp pi mic0:/root        # copy the program to the device

Also, you need to copy the TBB libraries to the MIC device:

# scp -r /opt/intel/composer_xe_2013.1.117/tbb/lib/mic/* mic0:/lib64

b. Run program

# time ssh mic0 /root/pi

Computed value of Pi by using OpenMP:  3.141592654

real  0m3.265s

user 0m0.010s

sys   0m0.003s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/pi

Observing the result opened in amplxe-gui, we see:

There are 166 threads working in parallel

Workloads on threads are balanced

Each thread took ~3.25s, including time spent in operator(), Linux and the TBB library

Each core is not fully utilized

Case 4: Use OpenMP* for a Matrix application, to run on the Xeon host and the MIC device
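
The attached matrix.c is not listed inline; a minimal sketch of an OpenMP matrix multiplication consistent with the compiler remarks below (an OpenMP outer loop plus a vectorizable inner loop) might look like this - the matrix size and loop order are assumptions and may differ from the attachment:

#include <stdio.h>

#define N 1024                                  /* assumed matrix size */

static double a[N][N], b[N][N], c[N][N];

int main()
{
    int i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
            c[i][j] = 0.0;
        }

    /* Outer loop parallelized by OpenMP ("OpenMP DEFINED LOOP WAS PARALLELIZED");
       the compiler may interchange the inner loops to get unit-stride access,
       which is what produces a "PERMUTED LOOP WAS VECTORIZED" remark */
    #pragma omp parallel for private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    printf("c[%d][%d] = %f\n", N / 2, N / 2, c[N / 2][N / 2]);
    return 0;
}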

  1. Compile, run the Matrix-OMP code on multicore system, analyze performance
    a. # icc -g -O3 -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED

b. Run program

# time ./matrix

real  0m7.408s

user  1m17.586s

sys   0m0.344s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect lightweight-hotspots -- ./matrix

Observing the result opened in amplxe-gui, we see:

Workloads on threads are balanced

Each core is fully utilized (~1,200% CPU for 6 cores with HT)

2. Compile the Matrix-OMP code on the host and run it on MIC, analyze performance

a. # icc -g -O3 -mmic -openmp -openmp-report -vec-report matrix.c -o matrix

matrix.c(16): (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED

matrix.c(18): (col. 12) remark: PERMUTED LOOP WAS VECTORIZED

# scp matrix mic0:/root        # copy the program to the device

(You have to copy the MIC runtime libraries to the device before running a native MIC program, if you didn't do so earlier)

b. Run program

# time ssh mic0 /root/matrix

real  0m1.695s
user 0m0.008s
sys   0m0.007s

c. Use VTune™ Amplifier XE to analyze

# amplxe-cl -collect knc-lightweight-hotspots --search-dir all:rp=./ -- ssh mic0 /root/matrix

Observing the result opened in amplxe-gui, we see:

There are 242 threads working in parallel

Workloads on threads are balanced

Each thread took ~1.08s, including time spent in Linux and the OpenMP library

Each core is not fully utilized

Conclusion:

Your HPC applications might be very well suited to running on the Intel Xeon Phi coprocessor, so it's time to start working on the MIC architecture. Intel C/C++ Composer XE helps you generate MIC code, and VTune Amplifier XE helps you analyze the performance.

Attachments:
  • omp-pi.c (503 bytes)
  • mpi-pi.c (1.12 KB)
  • tbb-pi.cpp (916 bytes)
  • matrix.c (481 bytes)
  • Screenshots: mic7.png, mic8.png, mic100.png, mic101.png, mic102.png, mic103.png, mic104.png, mic105.png

2 comments

Peter Wang (Intel):

I haven't seen any error message; please check first that "ssh mic0 ~/pi-mic" runs smoothly. Is it possible that you have to set up some environment on the MIC that is required by the app?

yingbo c.:

Dear Wang, I have read several articles written by you. Now I have a question: does running VTune Amplifier XE to analyze MIC performance require the root user? When I run VTune with the command "amplxe-cl -collect knc-hotspots --search-dir all:rp=./ -- ssh mic0 ~/pi-mic", it printed out the information below:

amplxe: Collection stopped.
amplxe: Using result path `/home/cuiyingbo/r000hs'
amplxe: Executing actions 16 % Resolving module symbols
amplxe: Warning: Cannot locate file `/sbin/sshd'.
amplxe: Warning: Cannot locate file `/lib64/libcrypto.so.10'.
amplxe: Executing actions 18 % Resolving information for `sshd'
amplxe: Warning: Cannot locate file `/boot/vmlinuz-2.6.38.8-g2593b11'.
amplxe: Executing actions 50 % Generating a report

My application needs 5 - 10 minutes on KNC, but the VTune process finished in just about 2 - 3 seconds every time; did something go wrong?
