Using Intel® Math Kernel Library Compiler Assisted Offload in Intel® Xeon Phi™ Processor

Introduction

Besides native execution, another usage model for the Intel® Math Kernel Library (Intel® MKL) on an Intel® Xeon Phi™ processor is compiler assisted offload (CAO). The CAO usage model allows users to offload Intel MKL functions and data to an Intel Xeon Phi processor, using the Intel® compiler and its offload pragma support to manage the offloaded functions and data.

This document shows how users can offload Intel MKL functions and data to an Intel Xeon Phi processor from an Intel® Xeon® processor-based machine. In order to use Intel MKL CAO on an Intel Xeon Phi processor, users need to set up the Offload over Fabric software first.

Part 1 – Installing Offload over Fabric Software

In this example, Intel® Omni-Path Architecture (Intel® OPA) was used to connect an Intel Xeon processor machine and an Intel Xeon Phi processor machine. For details on installing and configuring IP over Fabric, please refer to the paper How to Install the Intel® Omni-Path Architecture Software.

In this article, an Intel® Xeon® processor E5-2698 v3 @ 2.30 GHz server is the host machine and an Intel Xeon Phi processor is the target machine. Both machines run Red Hat Enterprise Linux* 7.2. Each machine has an Intel® Omni-Path Host Fabric Interface PCIe x16 adapter, and the machines are connected with an Intel® Omni-Path cable.

  • Install the Intel® Omni-Path Architecture Fabric Host Software (IntelOPA-IFS.RHEL72-x86_64.10.4.2.0.7.tgz, available from the Intel download center) on both machines. Note that version 10.4.2.0.7 of the Intel OPA Fabric Host Software requires libfabric version 1.4 or greater. libfabric is a core component of OpenFabrics Interfaces*. Therefore, you need to recompile and install a newer version of libfabric, as shown in the next step (see A BKM for Working with libfabric* on a Cluster System when using Intel® MPI Library).
  • Download libfabric-1.4.2.tar.bz2 and rebuild libfabric:
    # rpmbuild -ta libfabric-1.4.2.tar.bz2 --define 'configopts --enable-verbs=yes'
    # cd /root/rpmbuild/RPMS/x86_64
    # yum install libfabric-1.4.2-1.el7_2.x86_64.rpm libfabric-debuginfo-1.4.2-1.el7_2.x86_64.rpm libfabric-devel-1.4.2-1.el7_2.x86_64.rpm
    # fi_info
    provider: psm2
    
  • After installing the Intel OPA Fabric Host Software on both host and target machines, reboot them.
  • Configure IP over InfiniBand (IPoIB). In this example, the IP addresses on the host and target are 192.168.100.101 and 192.168.100.102, respectively.
  • Bring the IP over Fabric interface up on both machines:
    On the host machine:
    [host]# ifup ib0
    [host]# ifconfig ib0
    ib0: flags=4163  mtu 65520
            inet 192.168.100.101  netmask 255.255.255.0  broadcast 192.168.100.255
            inet6 fe80::211:7501:179:311  prefixlen 64  scopeid 0x20
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
            infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
            RX packets 5415223  bytes 47440566267 (44.1 GiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 5850844  bytes 47481697417 (44.2 GiB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
    Similarly, on the target machine:
    [target]# ifup ib0 
    [target]# ifconfig ib0
    ib0: flags=4163  mtu 65520
            inet 192.168.100.102  netmask 255.255.255.0  broadcast 192.168.100.255
            inet6 fe80::211:7501:174:44e0  prefixlen 64  scopeid 0x20
    Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
            infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
            RX packets 11370  bytes 1989607 (1.8 MiB)
            RX errors 0  dropped 0  overruns 0  frame 0
            TX packets 11551  bytes 5639588 (5.3 MiB)
            TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
    
  • Run the Intel OPA Fabric Manager
    [host]# opaconfig -E opafm
    [host]# service opafm start
    
    [target]# opaconfig -E opafm
    [target]# service opafm start
    
  • Set up password-less Secure Shell (SSH) login for the offload testing. First, generate a pair of authentication keys on the host without entering a passphrase:
    [host]$ ssh-keygen -t rsa
    Then append the host machine's new public key to the target machine's authorized keys using the command ssh-copy-id:
    [host]$ ssh-copy-id 192.168.100.102
  • Download and install the Offload over Fabric software for host version 1.5.2 from the Intel Xeon Phi Processor Software page on the host machine (follow the instructions in the User Guide).
  • Similarly, download the Offload over Fabric for target software version 1.5.2 from the Intel Xeon Phi Processor Software page and install it on the target machine.
  • Finally, install the latest version of Intel® Parallel Studio XE (in this example the Intel Parallel Studio XE 2018 is used) on the host machine.
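
As a minimal illustration of the IPoIB configuration step above, a persistent interface file on Red Hat Enterprise Linux 7 might look like the following. This is a sketch under assumptions: the file path follows the standard RHEL 7 network-scripts convention, and the values shown are for the host machine in this example; adjust the address for the target.

```
# /etc/sysconfig/network-scripts/ifcfg-ib0 (hypothetical example, host machine)
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.100.101
NETMASK=255.255.255.0
```

After editing the file, `ifup ib0` brings the interface up with these settings, as shown in the listings above.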

Part 2 – Using Intel® Math Kernel Library Compiler Assisted Offload

The second part of this article shows an example of using the Intel MKL CAO feature to offload an Intel MKL function to the target machine. The original code from Multiplying Matrices Using dgemm was modified to add offload capability.

You can use the offload pragma to initiate an offload from the host to the target. The in specifier defines a variable as strictly an input to the target; its value is not copied back to the host. The inout specifier defines a variable that is copied from the host to the target before the offload and back from the target to the host afterward.

The program offloads the same function several times, and you can retain data on the target between offloads. In the first offload, specify alloc_if(1) to perform a fresh memory allocation on the target and free_if(0) to retain that memory. In subsequent offloads, specify alloc_if(0) to reuse the memory and free_if(0) to keep retaining it. In the last offload, specify alloc_if(0) to reuse the memory and free_if(1) to free it. In the sample code in the appendix, the program iterates the offload process three times: in the first iteration, memory on the Intel Xeon Phi processor is allocated to store the matrices and retained; the memory is reused in the second iteration; and in the last iteration, the memory is freed.

The code sample offloads the cblas_dgemm function for matrix multiplication to the Intel Xeon Phi processor via Offload over Fabric. The cblas_dgemm function computes a scalar-matrix-matrix product and adds the result to a scalar-matrix product:

C := α·A·B + β·C

where A, B, and C are double-precision matrices, and α and β are double-precision scalars.

The program calls the interface and passes the following arguments:

cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, k, B, n, beta, C, n);

Where:

  • CblasRowMajor indicates that the elements of each row of the matrices are stored contiguously.
  • CblasNoTrans indicates that neither matrix A nor matrix B is transposed or conjugate transposed before the multiplication.
  • m, n, and k are integers that specify the matrix dimensions: m is the number of rows of matrices A and C, n is the number of columns of matrices B and C, and k is the number of columns of A and the number of rows of B. Thus, matrix A is m rows by k columns, matrix B is k rows by n columns, and matrix C is m rows by n columns.
  • alpha and beta are scalars used in the multiplication as shown above in the formula.
  • A, B, and C are arrays used to store the matrices, respectively.
  • k, n, and n are the leading dimensions of matrices A, B, and C, respectively. With row-major storage and no transposition, the leading dimension of each matrix is its number of columns: k for A, and n for both B and C.

To run the application, set the proper compiler environment variables for the Intel® Parallel Studio XE 2018 and compile the code sample from the host machine:

[host]$ source /opt/intel/parallel_studio_xe_2018.0.033/psxevars.sh intel64
[host]$ icc -mkl -qopenmp mkl-cao.c -o mkl-cao.out

Prior to executing the program, you need to set the environment variable OFFLOAD_NODES to the IP address of the target machine on the high-speed Intel OPA network, in this case 192.168.100.102, to indicate that the target is available for offloading.

[host]$ export OFFLOAD_NODES=192.168.100.102

Optionally, to report the offload execution time and the amount of data transferred, you can set the environment variable OFFLOAD_REPORT to 2 (value 1 reports the offload computation time only, while value 3 reports the offload computation time, the amount of data transferred, device initialization, and individual variable transfers).

[host]$ export OFFLOAD_REPORT=2

To run the application, pass the values of m, n, k, alpha, and beta to the application on the host machine. For example, the following command line triggers a matrix multiplication where m=n=k=14906, alpha=1.0, and beta=2.0. Note that for simplicity, all elements of matrix A are initialized to 1.0, and all elements of matrix B are initialized to 2.0. The application allocates memory for the matrices on the host, then offloads the Intel MKL matrix multiplication function and the matrix arrays to the Intel Xeon Phi processor three times. In the first iteration, the target machine allocates memory, performs the matrix multiplication, and sends the results back to the host. In the second iteration, the target machine reuses the allocated memory and performs the matrix multiplication. In the last iteration, the target machine deallocates the memory after performing the matrix multiplication and sending back the result.

[host]$ ./mkl-cao.out 14906 14906 14906 1.0 2.0
m:14906 n:14906 k:14906 alpha:    1.00 beta:    2.00
[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 0
[Offload] [HOST]  [Tag 0] [CPU Time]        8.436242(seconds)
[Offload] [MIC 0] [Tag 0] [CPU->MIC Data]   5332532092 (bytes)
[Offload] [MIC 0] [Tag 0] [MIC Time]        3.096220(seconds)
[Offload] [MIC 0] [Tag 0] [MIC->CPU Data]   1777510688 (bytes)

[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 1
[Offload] [HOST]  [Tag 1] [CPU Time]        3.002104(seconds)
[Offload] [MIC 0] [Tag 1] [CPU->MIC Data]   5332532116 (bytes)
[Offload] [MIC 0] [Tag 1] [MIC Time]        2.669768(seconds)
[Offload] [MIC 0] [Tag 1] [MIC->CPU Data]   1777510688 (bytes)

[Offload] [MIC 0] [File]                    mkl-cao.c
[Offload] [MIC 0] [Line]                    76
[Offload] [MIC 0] [Tag]                     Tag 2
[Offload] [HOST]  [Tag 2] [CPU Time]        3.233454(seconds)
[Offload] [MIC 0] [Tag 2] [CPU->MIC Data]   5332532116 (bytes)
[Offload] [MIC 0] [Tag 2] [MIC Time]        2.668376(seconds)
[Offload] [MIC 0] [Tag 2] [MIC->CPU Data]   1777510688 (bytes)

 Top left corner of matrix A(m x k):
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00
        1.00        1.00        1.00        1.00        1.00        1.00

 Top left corner of matrix B(k x n):
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00
        2.00        2.00        2.00        2.00        2.00        2.00

 Top left corner of matrix C(m x n):
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00
   208684.00   208684.00   208684.00   208684.00   208684.00   208684.00

Conclusion

With the Intel OPA, a host computer can take advantage of Intel® Many Integrated Core Architecture by using the Intel compiler’s offload pragma support for the Intel Xeon Phi processor. Moreover, Intel MKL CAO allows users to use the highly optimized Intel MKL functions on an Intel Xeon Phi processor. This paper shows users, step-by-step, how to set up and configure the Intel OPA and enable Offload over Fabric. The code samples illustrate how to use offload pragma to offload an Intel MKL function from a host to an Intel Xeon Phi processor.

References

A BKM for Working with libfabric* on a Cluster System when using Intel® MPI Library

How to Install the Intel® Omni-Path Architecture Software

Developer Reference for Intel Math Kernel Library 2018 - C

Effective Use of the Intel Compiler’s Offload Features

Download sample code [1.59KB]

Appendix A

The sample code is shown below.

/*
 *  Copyright (c) 2017 Intel Corporation. All Rights Reserved.
 *
 *  Portions of the source code contained or described herein and all documents related
 *  to portions of the source code ("Material") are owned by Intel Corporation or its
 *  suppliers or licensors.  Title to the Material remains with Intel
 *  Corporation or its suppliers and licensors.  The Material contains trade
 *  secrets and proprietary and confidential information of Intel or its
 *  suppliers and licensors.  The Material is protected by worldwide copyright
 *  and trade secret laws and treaty provisions.  No part of the Material may
 *  be used, copied, reproduced, modified, published, uploaded, posted,
 *  transmitted, distributed, or disclosed in any way without Intel's prior
 *  express written permission.
 *
 *  No license under any patent, copyright, trade secret or other intellectual
 *  property right is granted to or conferred upon you by disclosure or
 *  delivery of the Materials, either expressly, by implication, inducement,
 *  estoppel or otherwise. Any license under such intellectual property rights
 *  must be express and approved by Intel in writing.
 */
#include <stdlib.h>
#include <stdio.h>
#include "omp.h"
#include "mkl.h"

#define min(x,y) (((x) < (y)) ? (x) : (y))

int offload(int m, int n, int k, double alpha, double beta)
{
   int i, j;

   /* Allocate memory aligned on a 64-byte boundary using the MKL allocator */
   double *A = mkl_malloc(sizeof(double) * m * k, 64);
   if (A == NULL) 
      return (-1);
   else
   {
      /* Initialize matrix A */
      for (i = 0; i < m*k; i++)
         A[i] = 1.0;
   }

   double *B = mkl_malloc(sizeof(double) * k * n, 64);
   if (B == NULL) 
   {
      mkl_free(A);
      return (-1);
   }
   else
   {
      /* Initialize matrix B */
      for (i = 0; i < k*n; i++)
         B[i] = 2.0;
   }

   double *C = mkl_malloc(sizeof(double) * m * n, 64);
   if (C == NULL)
   {
      mkl_free(A);
      mkl_free(B);
      return (-1);
   }
   else
   {
      /* Initialize matrix C */
      for (i = 0; i < m*n; i++)
         C[i] = 0.0;
   }

   const int NITERS = 3;

   for (i = 0; i < NITERS; i++) 
   {
      static int first_run = 1, last_run = 0;

#pragma offload target(mic:0) in(m, n, k, alpha, beta) \
		in(A: length(m*k) alloc_if(first_run) free_if(last_run)) \
		in(B: length(k*n) alloc_if(first_run) free_if(last_run)) \
		inout(C: length(m*n) alloc_if(first_run) free_if(last_run))
      {
         cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	   m, n, k, alpha, A, k, B, n, beta, C, n);
      }

   
      first_run = 0;
      if (i == NITERS-2)
         last_run = 1;
   }

   // Verify
   printf (" Top left corner of matrix A(m x k): \n");
   for (i=0; i<min(m,6); i++) 
   {
      for (j=0; j<min(k,6); j++) 
         printf ("%12.2f", A[j+i*k]);

      printf ("\n");
   }
        
   printf ("\n Top left corner of matrix B(k x n): \n");
   for (i=0; i<min(k,6); i++) 
   {
      for (j=0; j<min(n,6); j++)
         printf ("%12.2f", B[j+i*n]);
 
      printf ("\n");
   }
    
   printf ("\n Top left corner of matrix C(m x n): \n");
   for (i=0; i<min(m,6); i++) 
   {
      for (j=0; j<min(n,6); j++)
         printf ("%12.2f", C[j+i*n]);

      printf ("\n");
   }

   mkl_free(A);
   mkl_free(B);
   mkl_free(C);

   return 0;
}

int main(int argc, char **argv)
{
   int rc, m, n, k;
   double alpha, beta;

   if (argc != 6) 
   {
      printf("Usage: ./mkl-cao.out m n k alpha beta \n");
      printf("Where m is the number of rows of matrix A \n");
      printf("      n is the number of columns of matrix B \n");
      printf("      k is the number of columns of matrix A \n");
      printf("      alpha and beta are scale factors  \n");

      return argc;
   }

   m = atoi(argv[1]);
   n = atoi(argv[2]);
   k = atoi(argv[3]);
   alpha = atof(argv[4]);
   beta = atof(argv[5]);

   printf("m:%d n:%d k:%d alpha:%8.2f beta:%8.2f\n", m, n, k, alpha, beta);
   rc = offload(m, n, k, alpha, beta);

   return rc;
}