Intel® Math Kernel Library (Intel MKL) Compiler Assisted Offload and Automatic Offload example on Intel Xeon Phi

Purpose: If you want to manage data manually in some parts of your application and take advantage of Intel MKL Automatic Offload (AO) in others, you can combine the two approaches as shown in the following example.

This article provides an example of using Compiler Assisted Offload and Automatic Offload together.

The example can be used to simulate concurrent access to the coprocessor (MIC) from MKL AO and from the offload compiler, with and without manual synchronization between them.

The example consists of:

  • t.c – the example’s source code
  • run – the run script that sets up the environment, builds the example’s binary, and launches it with user-provided parameters
  • t.simple.c – a simpler version of the example which does not support manual synchronization but is a bit easier to understand (see below)

The example can run the offload DGEMM and the AO DGEMM either one after another (if the first command line argument is ‘0’) or concurrently (if the first command line argument is ‘1’). In the run logs below, the third argument is the matrix dimension N.
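The serial/concurrent switch described above can be sketched as follows. This is only an illustration, not the code from t.c: it assumes POSIX threads are used for the concurrent case, and the names run_state, offload_task, ao_task and run_both are hypothetical. The two task functions are stand-ins for the offloaded DGEMM and the AO DGEMM.

```c
#include <pthread.h>

/* Hypothetical record of which DGEMM variants have finished. */
struct run_state {
    int offload_done;
    int ao_done;
};

/* Stand-in for the DGEMM offloaded through the offload compiler. */
static void *offload_task(void *arg)
{
    ((struct run_state *)arg)->offload_done = 1;
    return NULL;
}

/* Stand-in for the DGEMM that MKL may offload automatically (AO). */
static void *ao_task(void *arg)
{
    ((struct run_state *)arg)->ao_done = 1;
    return NULL;
}

/* concurrent == 0: run the two tasks back to back on the calling thread.
   concurrent != 0: run each task in its own thread and wait for both.
   Returns 0 on success or a pthread error code. */
int run_both(int concurrent, struct run_state *state)
{
    pthread_t offload_thread, ao_thread;
    int rc;

    if (!concurrent) {
        offload_task(state);
        ao_task(state);
        return 0;
    }

    rc = pthread_create(&offload_thread, NULL, offload_task, state);
    if (rc != 0)
        return rc;

    rc = pthread_create(&ao_thread, NULL, ao_task, state);
    if (rc != 0) {
        pthread_join(offload_thread, NULL);
        return rc;
    }

    pthread_join(offload_thread, NULL);
    pthread_join(ao_thread, NULL);
    return 0;
}
```

In this sketch the first command line argument would map directly onto the concurrent parameter. The order of the two calls in the serial branch is arbitrary here.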

The manual synchronization mode is enabled by passing ‘1’ as the second command line argument. It mimics ORSL by placing a mutex around the offload and AO DGEMM calls, and it allows the AO DGEMM to fall back to the host if the lock is already held. The simpler version of the example does not support manual synchronization.
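The locking scheme described above might look roughly like the sketch below. It assumes a single POSIX mutex guards the coprocessor; the names coprocessor_lock, guarded_offload, guarded_ao and dgemm_stub are hypothetical, and the DGEMM calls are passed in as callbacks because the article does not show how t.c selects the host path.

```c
#include <pthread.h>

/* One lock standing for "the coprocessor is busy" (an assumption). */
static pthread_mutex_t coprocessor_lock = PTHREAD_MUTEX_INITIALIZER;

/* Where the AO-style DGEMM ended up running. */
enum dgemm_target { RAN_ON_COPROCESSOR, RAN_ON_HOST };

/* Placeholder for a real DGEMM call; does nothing in this sketch. */
static void dgemm_stub(void)
{
}

/* Offload path: always waits until the coprocessor is free. */
void guarded_offload(void (*offload_dgemm)(void))
{
    pthread_mutex_lock(&coprocessor_lock);
    offload_dgemm();
    pthread_mutex_unlock(&coprocessor_lock);
}

/* AO path: take the coprocessor if it is free, otherwise do not wait
   and run the host version instead. */
enum dgemm_target guarded_ao(void (*ao_dgemm)(void), void (*host_dgemm)(void))
{
    if (pthread_mutex_trylock(&coprocessor_lock) == 0) {
        ao_dgemm();
        pthread_mutex_unlock(&coprocessor_lock);
        return RAN_ON_COPROCESSOR;
    }
    host_dgemm();
    return RAN_ON_HOST;
}
```

The asymmetry is the point of the design: the offload call blocks on the lock, while the AO call uses pthread_mutex_trylock so it never waits and instead computes on the host when the coprocessor is taken.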

Some run logs are shown below. In *this example*, *not* synchronizing is sometimes slightly better than always synchronizing (for N = 4096, 13.81 s versus 14.98 s of wall-clock time), presumably because the computations themselves never overlap, while computation started by one component (the offload compiler or MKL) can overlap with data transfers started by the other. For N = 8192, however, the unsynchronized concurrent run is much slower (184.45 s versus 31.20 s).

Example run logs:

----------------------------------------------
# time -p ./run 0 0 4096
Coprocessor access: serial
Manual synchronization: off
N: 4096
Host/AO 4096x4096 DGEMM: 630.80 GFlops
Offload 4096x4096 DGEMM: 475.65 GFlops
real 17.06
user 26.91
sys 2.07

----------------------------------------------
# time -p ./run 1 0 4096
Coprocessor access: concurrent
Manual synchronization: off
N: 4096
Offload 4096x4096 DGEMM: 288.14 GFlops
Host/AO 4096x4096 DGEMM: 472.66 GFlops
real 13.81
user 19.58
sys 5.13

----------------------------------------------
# time -p ./run 1 1 4096
Coprocessor access: concurrent
Manual synchronization: on
N: 4096
Host/AO 4096x4096 DGEMM: 168.10 GFlops
Offload 4096x4096 DGEMM: 467.55 GFlops
real 14.98
user 35.97
sys 19.91

----------------------------------------------
# time -p ./run 0 0 8192
Coprocessor access: serial
Manual synchronization: off
N: 8192
Host/AO 8192x8192 DGEMM: 811.73 GFlops
Offload 8192x8192 DGEMM: 600.43 GFlops
real 32.31
user 96.96
sys 5.00

----------------------------------------------
# time -p ./run 1 0 8192
Coprocessor access: concurrent
Manual synchronization: off
N: 8192
Offload 8192x8192 DGEMM: 19.97 GFlops
Host/AO 8192x8192 DGEMM: 27.04 GFlops
real 184.45
user 267.15
sys 57.36

----------------------------------------------
# time -p ./run 1 1 8192
Coprocessor access: concurrent
Manual synchronization: on
N: 8192
Offload 8192x8192 DGEMM: 594.91 GFlops
Host/AO 8192x8192 DGEMM: 181.86 GFlops
real 31.20
user 207.67
sys 93.12
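
The logs above report DGEMM throughput in GFlops. The article does not show how t.c derives this figure; a common convention, assumed here, counts 2*N^3 floating-point operations for an NxN DGEMM and divides by the elapsed time. The helper name dgemm_gflops is hypothetical.

```c
/* Throughput of an n x n DGEMM that took `seconds` to complete,
   assuming the conventional count of 2*n^3 floating-point operations.
   Returns GFlops (1e9 floating-point operations per second). */
double dgemm_gflops(long n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

Under this assumption, 630.80 GFlops at N = 4096 corresponds to roughly 0.22 s per DGEMM call, far less than the 17.06 s reported by time, which also covers everything else the run script does, such as building the binary.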

For other articles related to Intel MKL on Intel Xeon Phi, please refer to Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.

 
