Effective Use of the Intel Compiler’s Offload Features for Intel MIC Architecture

Compiler Methodology for Intel® MIC Architecture

Overview

In this chapter, we examine various best known methods for the Intel® Composer XE 2013 Heterogeneous Offload programming model for the Intel® MIC Architecture.

Topics

Selecting Code Sections to Offload

Selections Based on Parallelism

Choose highly-parallel sections of code to run on the coprocessor. Serial code offloaded to the coprocessor will run much slower than on the CPU.

Changing Scope of Offloaded Sections Based on Data Flow

Selecting code regions purely by their level of parallelism may yield many small offload sections. This must be balanced against the cost of transferring data back and forth between the CPU and the coprocessor. Data exchange can be slow (it is subject to PCI-E speeds). It can also be difficult to program, because of data marshaling (#pragma offload) or the need to insert _Cilk_shared keywords and use _Offload_shared_malloc for dynamic allocation. If two parallel sections are separated by some serial processing, choose between a) moving the output data of the first parallel section back to the CPU, running the serial code on the CPU, and then moving the input data of the second parallel section from the CPU to the coprocessor, and b) keeping the data on the coprocessor and running the serial code there, in other words, making the entire parallel-serial-parallel section of code a single offload unit.
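
The choice between options a) and b) can be estimated with a rough cost model. The sketch below is only an illustration: the struct, the function names, and every parameter (transfer rate, serial slowdown factor) are assumptions supplied by the caller, not measured values and not part of any Intel API.

```c
/* Hypothetical cost model; all inputs are estimates supplied by the caller. */
typedef struct {
    double out_bytes;        /* output of the first parallel section sent back to the CPU */
    double in_bytes;         /* input of the second parallel section sent to the coprocessor */
    double pcie_bytes_per_s; /* assumed sustained CPU<->coprocessor transfer rate */
    double serial_cpu_s;     /* run time of the serial code on the CPU, in seconds */
    double serial_slowdown;  /* assumed slowdown of the serial code on the coprocessor */
} offload_costs;

/* Option a): move data to the CPU, run the serial code there, move data back. */
double cost_split(const offload_costs *c)
{
    return (c->out_bytes + c->in_bytes) / c->pcie_bytes_per_s + c->serial_cpu_s;
}

/* Option b): keep the data on the coprocessor and run the serial code there. */
double cost_merged(const offload_costs *c)
{
    return c->serial_cpu_s * c->serial_slowdown;
}

/* Returns 1 when a single parallel-serial-parallel offload unit is estimated to be cheaper. */
int prefer_single_offload(const offload_costs *c)
{
    return cost_merged(c) < cost_split(c);
}
```

With large intermediate data and a short serial step, the model favors option b); with small data and a long serial step, it favors option a). Real decisions should be based on measurements of the actual application.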

Choosing Data Transfer Mechanism

Copyin/Copyout Model (#pragma offload)
This model is supported in both the Intel C/C++ and the Intel Fortran compilers.
If the data exchanged between CPU and coprocessor is limited to scalars or arrays of bit-wise copyable elements, choose the #pragma offload model. This model requires localized changes to the code at the point of offload, and some markup of function declarations. Fortran programs are limited to this model (Fortran does not support the Shared-Memory model described below).

Shared-memory Model (_Cilk_shared/_Cilk_offload)
This model is available in the Intel C/C++ compiler ONLY (not supported in Fortran).
If the data exchanged between CPU and coprocessor is more complex than simple scalars and bit-wise copyable arrays, consider using the _Cilk_shared/_Cilk_offload constructs. These keywords implement a shared-memory offload programming model. This model requires functions and statically allocated data to be given the _Cilk_shared attribute, and dynamically allocated data to be allocated in shared memory. Implementing the shared-memory programming model with _Cilk_shared/_Cilk_offload can take more effort; however, a richer class of programs can use the Intel MIC Architecture this way, since almost all C/C++ programs can be handled.

Offload Using #pragma offload

Measuring Offload Performance

Initialization Overhead
By default, when a program performs its first #pragma offload, all MIC devices assigned to the program are initialized. Initialization consists of loading the MIC program onto each device, setting up a data transfer pipeline between the CPU and the device, and creating a MIC thread to handle offload requests from the CPU thread. These activities take time. Therefore, do not place the first offload within a timing measurement. Exclude this one-time overhead by performing a dummy offload to the device before the timed region.

    // Example of empty offload for initialization purposes
    int main()
    {
        #pragma offload_transfer target(mic)
        ...
    }

Alternatively, use the OFFLOAD_INIT=on_start environment variable setting to pre-initialize all available MIC devices before starting the main program.

Offload Data Transfer

Minimizing Input Data
Compute Locally if Possible

Keep Data Persistent across Offloads
If data values at the end of an offload are needed by a later offload, keep them on the coprocessor.

When relying on data reuse across offloads, the offloads must be to the same coprocessor. Ensure this is the case by using an explicit coprocessor number in the target clause.

Persistence: Statically allocated data
In C/C++, variables declared at file-scope and function-local variables with storage class “static” are statically allocated. Fortran common blocks, data declared in the PROGRAM block, and data with the “save” attribute are statically allocated. Static data will retain values across offloads as long as they are not over-written with new values. Use the nocopy clause to reuse previous values.

    // File scope
    int x, y[100];
    void f()
    {
        x = 55;
        ...
        // x sent from CPU, y computed on coprocessor
        #pragma offload target(mic:0) in(x) nocopy(y)
        { y[50] = 66; }
        …
        #pragma offload target(mic:0) nocopy(x,y)
        { /* x and y retain previous values */ }
    }

Persistence: Stack allocated data

In C/C++ and Fortran, variables declared within functions and subroutines are given “automatic” or stack storage by default. Minimize the need for retaining function-local values across offloads.

In the offload environment, each offloaded region runs as a separate function on the coprocessor, so stack-allocated variables are not normally retained across offloads. To provide data persistence across offloads, if “nocopy” is requested then scalar values are copied back to the CPU at the end of an offload and sent to the coprocessor again at the next offload, which simulates their retention across offloads. For efficiency reasons, this is not recommended for non-scalars (i.e., large function-local arrays and struct objects). Starting with the version 13.0.0.079 Compiler Product (not earlier Beta versions), the compiler functionally supports offloading of such function-local arrays. However, we do not recommend using this feature in performance-sensitive code.

    void f()
    {
        int x = 55;
        int y[10] = { 0,1,2,3,4,5,6,7,8,9 };
        // x, y sent from CPU
        // To use values computed into y by this offload in next offload,
        // y is brought back to the CPU
        #pragma offload target(mic:0) in(x) inout(y)
        { y[5] = 66; }
        // The assignment to x on the CPU
        // is independent of the value of x on the coprocessor
        x = 30;
        …
        // Reuse of x from previous offload is possible using nocopy
        // However, array y needs to be sent again from CPU
        #pragma offload target(mic:0) nocopy(x) in(y)
        {
            … = y[5]; // Has value 66
            … = x;    // x has value 55 from first offload
        }
    }

Persistence: Heap allocated data

The coprocessor heap is persistent across offloads. There are two ways to use heap memory on MIC:

  1. Using the #pragma offload
  2. Explicitly calling malloc on the coprocessor

Either let the compiler manage dynamic memory using the #pragma or manage it using malloc/free. Compiler-managed dynamic memory is allocated/deallocated using alloc_if and free_if.

Compiler-managed Heap-allocated Data

Memory allocation is controlled by alloc_if and free_if, and data transfer is controlled by in/out/inout/nocopy. The two are independent, but data can only be transferred in and out of allocated memory.
// The following macros are used in all the samples where alloc_if/free_if clauses appear
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define REUSE alloc_if(0) free_if(0)

void f()
{
  int *p = (int *)malloc(100*sizeof(int));
  // Memory is allocated for p, data is sent from CPU and retained
  #pragma offload target(mic:0) in(p[0:100] : ALLOC)
  { p[6] = 66; }
  …
  // Memory for p reused from previous offload and retained once again
  // Fresh data is sent into the memory
  #pragma offload target(mic:0) in(p[0:100] : REUSE)
  { p[6] = 66; }
  …
  // Memory for p reused from previous offload,
  // freed after this offload.
  // Final data is pulled from coprocessor to CPU
  #pragma offload target(mic:0) out(p[0:100] : FREE)
  { p[7] = 77; }
  …
}


Explicitly managed Heap-allocated Data

Code running on the coprocessor may call malloc/free to explicitly allocate/deallocate dynamic memory. Pointer variables pointing to dynamic memory allocated in this way are scalars and are subject to the data persistence rules described above, depending on the scope of their definition (static or function-local).
To prevent interference between compiler-managed dynamic allocation and explicit dynamic allocation, use the nocopy clause for pointer variables referenced within offload regions that are being explicitly managed.

void f()
{
  int *p;
  … 
  // The nocopy clause ensures CPU values pointed to by p
  // are not transferred to coprocessor
  #pragma offload target(mic:0) nocopy(p)
  {
    // Allocate dynamic memory for p on coprocessor
    p = (int *)malloc(100);
    p[0] = 77;
    …
  }
  ..
  // The nocopy clause ensures p is not altered by the offload process
  #pragma offload target(mic:0) nocopy(p)
  {
    // Reuse dynamic memory pointed to by p
    … = p[0]; // Will be 77
  }
    
}

Local Pointers Versus Pointers Used Across Offloads

Pointers used within offload regions are by default inout, that is, data associated with them is transferred in and out. Sometimes a pointer may be used strictly locally, that is, it is assigned and used on the coprocessor only. The nocopy clause is useful in this case to leave the pointer unmodified by the offload clauses, and allow the programmer to explicitly manage the value of the pointer.  In other cases, data is transferred into the pointer from the CPU, and a subsequent offload may want to either a) use the same memory allocated and transfer fresh data into it, or b) keep the same memory and reuse the same data. For case a), an in clause with length equal to the number of elements is useful. For case b) an in clause with length 0 can be used to “refresh” the pointer but avoid any data transfer.

The complete description of in/out/nocopy and the use of the length modifier:

Clause : modifiers                        | Length < 0                                  | Length == 0                                 | Length > 0
------------------------------------------+---------------------------------------------+---------------------------------------------+--------------------------------------------
nocopy : alloc_if(0) free_if(0)           | OK, useful for local MIC ptr                | OK, useful for local MIC ptr                | OK, useful for local MIC ptr
nocopy : alloc_if(0) free_if(1)           | OK, update ptr, free memory (ignore length) | OK, update ptr, free memory (ignore length) | OK, update ptr, free memory (ignore length)
nocopy : alloc_if(1) free_if(0)           | Error, cannot alloc <0                      | Error, cannot alloc 0                       | OK, do alloc, update ptr
nocopy : alloc_if(1) free_if(1)           | Error, cannot alloc <0                      | Error, cannot alloc 0                       | OK, do alloc, update ptr, free memory
in / out / inout : alloc_if(0) free_if(0) | OK, update ptr only                         | OK, update ptr only                         | OK, update ptr, transfer
in / out / inout : alloc_if(0) free_if(1) | OK, update ptr, no transfer, free           | OK, update ptr, no transfer, free           | OK, update ptr, transfer, free
in / out / inout : alloc_if(1) free_if(0) | Error, cannot alloc/transfer <0             | Error, cannot alloc 0                       | OK, alloc, update ptr, transfer
in / out / inout : alloc_if(1) free_if(1) | Error, cannot alloc/transfer <0             | Error, cannot alloc 0                       | OK, alloc, update ptr, transfer, free
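
The combinations listed above can also be read as a small decision function. The following sketch simply encodes those entries as host-side C; the enum and function names are hypothetical and exist only for illustration (they are not part of offload.h or the compiler).

```c
/* Illustrative encoding of the clause/length combinations; names are hypothetical. */
typedef enum {
    CLAUSE_NOCOPY,
    CLAUSE_TRANSFER   /* in, out, or inout */
} clause_kind;

/* Returns the documented behavior for a pointer variable in an offload clause. */
const char *offload_pointer_action(clause_kind kind, int alloc_if, int free_if, long length)
{
    if (alloc_if) {
        /* Allocation always needs a positive length. */
        if (length < 0)
            return kind == CLAUSE_NOCOPY ? "Error, cannot alloc <0"
                                         : "Error, cannot alloc/transfer <0";
        if (length == 0)
            return "Error, cannot alloc 0";
        if (kind == CLAUSE_NOCOPY)
            return free_if ? "OK, do alloc, update ptr, free memory"
                           : "OK, do alloc, update ptr";
        return free_if ? "OK, alloc, update ptr, transfer, free"
                       : "OK, alloc, update ptr, transfer";
    }
    if (kind == CLAUSE_NOCOPY)
        return free_if ? "OK, update ptr, free memory (ignore length)"
                       : "OK, useful for local MIC ptr";
    /* in/out/inout without allocation: data moves only for a positive length. */
    if (length > 0)
        return free_if ? "OK, update ptr, transfer, free"
                       : "OK, update ptr, transfer";
    return free_if ? "OK, update ptr, no transfer, free"
                   : "OK, update ptr only";
}
```

For instance, the "refresh the pointer without transferring data" case corresponds to `offload_pointer_action(CLAUSE_TRANSFER, 0, 0, 0)`.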

An example of the use of in/out/nocopy and use of length clause is below:


Example of Local Pointer

int *p;
int *temp;

p = malloc(SIZE);
temp = p;

// transfer data into “p”, but do nothing with “temp”
#pragma offload target(mic:0) in(p : ALLOC) nocopy(temp)
{
   // temp is allocated locally
    temp = malloc(…);
    memcpy(temp, p, …);
}

// Reuse temp’s value from previous offload
#pragma offload target(mic:0) out(p : FREE) nocopy(temp)
{
   // temp’s value is preserved from the previous offload
    memcpy(p, temp, …);
    free(temp);
}


Example of Persistent MIC Pointer and Selective Data Transfer

#include <stdlib.h>
#include <stdio.h>

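#define SIZE 10
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)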

void func1(int *p)
{
   int i;

   // Because transfer count is 0, no data is transferred to MIC
   // However, because "in" is used,
   // the pointer value gets initialized on MIC
   #pragma offload target(mic:0) \
        in(p : length(0) REUSE)
   { for(i=0; i<SIZE; i++) printf("%3d", p[i]); printf("\n"); }
}

int main()
{
   int i;
   int *a;

   a = (int *)malloc(SIZE*sizeof(int));
   for(i=0; i<SIZE; i++) a[i] = i;

   // Allocate a on MIC only; transfer no data
   #pragma offload_transfer target(mic:0) \
       nocopy(a : length(SIZE) ALLOC)

   // Pick up the memory allocated for a and transfer data into it
   // Transfer count of SIZE is used
   // Each element of a is printed, then incremented
   #pragma offload target(mic:0) \
       in(a : length(SIZE) REUSE)
   { for(i=0; i<SIZE; i++) printf("%3d", a[i]++); printf("\n"); }

   func1(a);

   // Fetch data back to CPU, and free a on MIC
   #pragma offload_transfer target(mic:0) \
       out(a : length(SIZE) FREE)

   return 0;
}

Freeing Memory Used by Offload Without Knowing the Length

void freeOnCoprocessor(void* mem_ptr)
{
    char *c_ptr = (char *)(mem_ptr);

   // no data-transfer, want to free the previously allocated memory on the
   // coprocessor. Don't know the length here, use the value 0
    #pragma offload_transfer target(mic:0) \
               nocopy(c_ptr:length(0) FREE)

    free(mem_ptr);
}

The length value is not needed for freeing, so you can pass a dummy value of 0. The length modifier must still be present with a pointer because alloc_if and free_if are expressions, so whether an offload allocates or frees is known only at run time. The modifier is therefore required lexically, but its value is ignored when freeing.

Transferring non-bitwise Copyable Data Between CPU and MIC

Sometimes a data object containing a mixture of bitwise copyable elements (such as scalars and arrays) and non-bitwise copyable elements (such as pointers to other data) needs to be exchanged between CPU and MIC. By default, the compiler disallows such objects in the in/out clauses. If the program is only concerned with transferring the bitwise copyable elements of such objects, you can disable the error with the -wd<number> compiler switch, or convert the error to a warning with the -ww<number> switch.

Note:

  1. The non-bitwise copyable elements will have indeterminate value and it is your responsibility not to access those fields before first assigning valid values to them.
  2. There may be other circumstances where the compiler will issue the "not bitwise copyable" diagnostic. When the error may be over-ridden the error code is printed. Use that error code in the -wd or -ww switch.
    // Example of Non-Bitwise Object Transfer, Only Bitwise Copyable Data Needs to be Transferred
    --- file wd2568.cpp ---
    #include <complex>
    typedef std::complex<float> Value;

    void f()
    {
        const Value* C;
        #pragma offload_transfer target(mic) in(C:length(2))
    }
    
    > icc -c wd2568.cpp
    wd2568.cpp(8): error #2568: variable "C" used in this offload region is not bitwise copyable
    #pragma offload_transfer target(mic) in(C:length(2))
    ^
    compilation aborted for wd2568.cpp (code 2)
    
    // Compiling with -wd2568 allows the compilation to proceed
    > icc -c wd2568.cpp -wd2568
    >
 

In other cases, all elements of the non-bitwise object need to be transferred. In this case, you must transfer the individual components of the non-transferrable struct object.

The compiler cannot transfer non bit-wise copyable structs as a whole but can transfer individual fields separately, allowing you to specify a length for each pointer variable.

Once the data is transferred it remains persistent. Sometimes, the “nocopy” clause needs to be used so that existing data remains on MIC and the compiler does not attempt to update it.

Here is an example of passing a struct containing pointer fields, keeping the data persistent across offloads. Data is sent from CPU to MIC only once, in the function send_inputs. It is brought back to the CPU at the end, in the function receive_results. In between, you can use the data as many times as you like, as shown in the function use_the_data.

#include <stdio.h>
#include <stdlib.h>
#define SIZE 10
#define ALLOC alloc_if(1) free_if(0)
#define REUSE alloc_if(0) free_if(0)
#define FREE  alloc_if(0) free_if(1)
// Example of Non-Bitwise Object Transfer, All Data Elements Needed
typedef struct {
    int m1;
    int *m2;
} nbwcs;
__declspec(target(mic)) nbwcs struct1;
void send_inputs()
{
    int m1;
    int *m2;
    // Initialize the struct
    struct1.m1 = 10;
    struct1.m2 = (int *)malloc(SIZE * sizeof(int));
    for (int i=0; i<SIZE; i++)
    {
        struct1.m2[i] = i;
    }
    
    // In this offload data is transferred
    m1 = struct1.m1;
    m2 = struct1.m2;
    #pragma offload target(mic:0) in(m1) in(m2[0:SIZE] : ALLOC) nocopy(struct1)
    {
        struct1.m1 = m1;
        struct1.m2 = m2;
        printf("MIC offload1: struct1.m2[0] = %d, struct1.m2[SIZE-1] = %d\n", struct1.m2[0], struct1.m2[SIZE-1]);
        fflush(0);
    }
}
void use_the_data()
{
    // In this offload data is used and updated
    #pragma offload target(mic:0) nocopy(struct1)
    {
        for (int i=0; i<SIZE; i++)
        {
            struct1.m2[i] += i;
        }
        printf("MIC offload2: struct1.m2[0] = %d, struct1.m2[SIZE-1] = %d\n", struct1.m2[0], struct1.m2[SIZE-1]);
        fflush(0);
    }
}
void receive_results()
{
    int *m2;
    // In this offload data is used, updated, freed on MIC and brought back to the CPU
    m2 = struct1.m2;
    #pragma offload target(mic:0) out(m2[0:SIZE] : FREE) nocopy(struct1)
    {
        for (int i=0; i<SIZE; i++)
        {
            struct1.m2[i] += i;
        }
        printf("MIC offload3: struct1.m2[0] = %d, struct1.m2[SIZE-1] = %d\n", struct1.m2[0], struct1.m2[SIZE-1]);
        fflush(0);
    }
    printf("CPU: struct1.m2[0] = %d, struct1.m2[SIZE-1] = %d\n", struct1.m2[0], struct1.m2[SIZE-1]);
}
int main()
{
    send_inputs();
    use_the_data();
    receive_results();
    return 0;
}
 

Transferring Arrays of Pointers

The offload syntax currently disallows specifying arrays of pointers in the IN and OUT clauses. If one or more of the arrays pointed to are needed in offloaded code, assign each required pointer array element to a scalar pointer variable of the same pointer type and use that variable in #pragma offload.

If you need all the data pointed to by an array of pointers at the same time on MIC when doing the computation, then use a loop to allocate and transfer the data to MIC and another loop to free the memory on MIC. In between, do an offload and use the data you transferred. Be careful to use nocopy whenever the pointer array is referenced in the offloaded code, because by default it is treated as inout, but you are transferring in/out the data in the array separately. See below:

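#include <stdio.h>
#include <stdlib.h>
#include <offload.h>
#define ALLOC alloc_if(1) free_if(0)
#define FREE  alloc_if(0) free_if(1)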
__declspec(target(mic)) float *fp[10];
__declspec(target(mic)) float *fp0;
int main()
{
        int i;
        // Allocate memory and initialize
        for (i=0; i<10; i++)
        {
                fp[i] = malloc(100*sizeof(float));
                fp[i][0:100] = i;
        }
        // Transfer input float** array to MIC
        for (i=0; i<10; i++)
        {
                fp0 = fp[i];
                #pragma offload target(mic) in(fp0[0:100] : ALLOC) nocopy(fp)
                {
                        fp[i] = fp0;
                }
        }
        printf("CPU fp[5][5]=%f\n", fp[5][5]);
        #pragma offload target(mic) nocopy(fp)
        {
                fp[5][5] = 55;
                if (_Offload_get_device_number() >= 0)
                {
                        printf("MIC fp[5][5]=%f\n", fp[5][5]);
                        fflush(0);
                }
        }
        printf("CPU fp[5][5]=%f\n", fp[5][5]);
        // Fetch results and free memory allocated on MIC
        for (i=0; i<10; i++)
        {
                fp0 = fp[i];
                #pragma offload_transfer target(mic) out(fp0[0:100] : FREE)
        }
        return 0;
}

Function Inlining into Offload Constructs

Sometimes inlining a function is necessary for optimum performance of the generated code. Functions called directly within a #pragma offload are not inlined by the compiler even if they are marked as inline. To enable optimum performance of code in offload regions, either manually inline functions, or place the entire offload construct into its own function.

In the example below the code in function v1 demonstrates the problem. Without the #pragma offload the function call f(a,i) would have been inlined by the compiler and the loop would have been vectorized. However, when offloaded, the call to f(a,i) is not inlined, which inhibits loop vectorization.

One solution is to manually inline function f, as shown in function v2.

Another solution is to move the offload construct into its own function as shown in function v3.

#pragma offload_attribute(push, target(mic))
__declspec(align(64)) int a[1024];
const int size = 1024;
void f(int *a, int i)
{
    a[i] = i+1;
}
void g(int *a, int size)
{
    int i;
    // This loop is vectorized
    __assume_aligned(a, 64);
    for (i=0; i<size; i++)
    {
        a[i] = i+1;
    }
}
#pragma offload_attribute(pop)
int v1()
{
    int i;
    #pragma offload target(mic) inout(a[0:1024])
    {
        __assume_aligned(&a[0], 64);
        for (i=0; i<size; i++)
        {
            // Function call in loop is not inlined
            // Had it been inlined, loop would have vectorized
            f(a, i);
        }
    }
    return a[0];
}
int v2()
{
    int i;
    #pragma offload target(mic) inout(a[0:1024])
    {
        __assume_aligned(&a[0], 64);
        for (i=0; i<size; i++)
        {
            // Solution 1: manually inline the code
            a[i] = i+1;
        }
    }
    return a[0];
}
int v3()
{
    int i;
    #pragma offload target(mic) inout(a[0:1024])
    {
        // Solution 2: place the offloaded code in a function
        // Loop inside g will be vectorized
        g(a, size);
    }
    return a[0];
}

Checking Status of Offload

NEW in Intel(R) Composer XE 2013 SP1 (in beta testing spring/summer 2013):  This compiler supports new clauses mandatory, optional and status which provide greater control over offload.


NEW in Intel(R) Composer XE 2013 SP1: mandatory and optional clauses

Offload specified using an offload pragma is mandatory by default. The clause “optional” is available to make an individual #pragma offload optional. The clause “mandatory” is also available to specify that a particular offload is mandatory.

NEW in Intel(R) Composer XE 2013 SP1: Compiler switch to control optional/mandatory

All offloads in a file can be made optional or mandatory with the compiler switch -offload-mode={none|mandatory|optional}. The “none” setting turns off the offload feature, which means all #pragma offload directives are ignored.

NEW in Intel(R) Composer XE 2013 SP1: status clause

The status clause specifies a variable that will hold the result of an offload after the offload has executed. The variable in the status clause is of type OFFLOAD_STATUS, defined in offload.h. The macro OFFLOAD_STATUS_INIT(var) can be used to initialize the status variable to a special value to distinguish success or failure.

Offload execution behavior

When offload is optional, if an offload request cannot be met
    •    Execution falls back to the CPU
    •    If a “status” clause had been used, then you will be able to tell what happened

With mandatory offload, if the request cannot be met
    •    No CPU fallback
    •    If a “status” clause is specified, the program won’t terminate. You will need to handle the situation yourself
    •    Without a “status” clause the program will be terminated
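
The rules above can be summarized as a small decision function. This is only a host-side illustration of the behavior described in the two lists; the enum and function names are hypothetical and are not part of offload.h.

```c
/* Hypothetical summary of the offload execution rules; names are illustrative. */
typedef enum {
    RAN_ON_COPROCESSOR,   /* the offload request was met */
    RAN_ON_CPU,           /* optional offload fell back to the CPU */
    NOT_RUN_STATUS_SET,   /* mandatory offload failed; program must check status */
    PROGRAM_TERMINATED    /* mandatory offload failed with no status clause */
} offload_outcome;

offload_outcome offload_behavior(int is_optional, int has_status_clause, int request_can_be_met)
{
    if (request_can_be_met)
        return RAN_ON_COPROCESSOR;
    if (is_optional)
        return RAN_ON_CPU;   /* a status clause, if present, records what happened */
    return has_status_clause ? NOT_RUN_STATUS_SET : PROGRAM_TERMINATED;
}
```

The practical consequence is that a mandatory offload with a status clause shifts the responsibility for error handling to the program, which should inspect the status variable after every such offload.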

NEW in Intel(R) Composer XE 2013 SP1: Example of Optional Offload

#include <stdio.h>
#include "offload.h"

void optional_offload()
{
        _Offload_status x;

        OFFLOAD_STATUS_INIT(x);
        #pragma offload target(mic:0) status(x) optional
        {
                if (_Offload_get_device_number() < 0) {
                        printf("optional offload ran on CPU\n");
                } else {
                        printf("optional offload ran on MIC\n");
                }
        }
        if (x.result == OFFLOAD_SUCCESS) {
                printf("optional offload was successful\n");
        } else {
                printf("optional offload failed\n");
        }
}

When device is available, prints:
optional offload was successful

When device is not available, prints:
optional offload ran on CPU
optional offload failed


NEW in Intel(R) Composer XE 2013 SP1: Example of Mandatory Offload

#include <stdio.h>
#include "offload.h"

void mandatory_offload()
{
        _Offload_status x;

        OFFLOAD_STATUS_INIT(x);
        #pragma offload target(mic) status(x) mandatory
        {
                if (_Offload_get_device_number() < 0) {
                        printf("mandatory offload ran on CPU\n");
                } else {
                        printf("mandatory offload ran on MIC\n");
                }
        }
        if (x.result == OFFLOAD_SUCCESS) {
                printf("mandatory offload was successful\n");
        } else {
                printf("mandatory offload failed\n");
        }
}

When device is available, prints:
 mandatory offload was successful

When device is not available, prints:
 mandatory offload failed

NEW in Intel(R) Composer XE 2013 SP1: Example of Error Condition Checking with status

#include <stdio.h>
#include <stdlib.h>
#include "offload.h"

void offload1()
{
        _Offload_status x;
        int *p = malloc(100*sizeof(int));
        long elements = (long)4*1024*1024*1024;

        OFFLOAD_STATUS_INIT(x);
        #pragma offload target(mic) status(x) in(p[0:elements])
        {
                if (_Offload_get_device_number() < 0) {
                        printf("Ran on CPU\n");
                } else {
                        printf("Ran on MIC\n");
                }
        }
        if (x.result == OFFLOAD_SUCCESS) {
                printf("offload was successful\n");
        } else {
                printf("offload failed\n");
                if (x.result == OFFLOAD_OUT_OF_MEMORY) {
                        printf("offload failed due to insufficient memory\n");
                }
        }
}

When device is available but 16GB cannot be allocated, prints:
offload failed
offload failed due to insufficient memory

When device is not available, prints:
offload failed

Minimize Coprocessor Memory Allocation Overhead

Dynamic memory allocation on the coprocessor can be slow. Minimize allocation/deallocation overhead by doing fewer allocations and frees. If an array is going to be passed multiple times between CPU and coprocessor, allocate it at first use and free it at last use. See the example under “Persistence: Heap allocated data”. Even if the same array is not reused repeatedly in offloaded code, the same memory block allocated on MIC can be reused for different data.

// Allocate a buffer of “count” elements on coprocessor
// accessible through the CPU value of p
#pragma offload_transfer target(mic:0) nocopy(p[0:count] : ALLOC)
…
// CPU data from x is transferred into coprocessor buffer p
// elements1 <= count
#pragma offload target(mic:0) \
        in(x[0:elements1] : REUSE into(p[0:elements1]))
{ …  = p[10]; }
…
// CPU data from y is transferred into coprocessor buffer p
// elements2 <= count
#pragma offload target(mic:0) \
        in(y[0:elements2] : REUSE into(p[0:elements2]))
{  … = p[10]; }
…
// Free the buffer
#pragma offload_transfer target(mic:0) nocopy(p[0:count] : FREE)

Note that memory buffers kept allocated when they are not needed consume available coprocessor memory, so balance the need for memory against the benefit of minimizing allocation/deallocation overhead.

Offload Data Alignment

To enable vectorization of code on the coprocessor align data on 64B boundary or higher. For statically allocated variables this is achieved using __declspec(align(64)). For pointer data transferred to the coprocessor the align modifier of #pragma offload could be useful.
#pragma offload target(mic) in(p[0:2048] :align(64))

Note that the offload library normally assigns the coprocessor data the same offset within 64B as the offset within 64B of the CPU data. This offset matching ensures that fast DMA transfers between CPU and coprocessor will be enabled. An align modifier may override this offset matching. To get the benefits of fast DMA data transfer and proper alignment of coprocessor data, align the CPU data instead, and don’t explicitly use the align directive.
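
To check whether CPU data is already 64B aligned, or what offset within 64B the offload library would try to match, the address can be inspected directly. The helpers below are a minimal host-side sketch; their names are hypothetical, and in practice an aligned allocator such as the memalign call used elsewhere in this chapter is the simpler way to obtain aligned CPU data.

```c
#include <stddef.h>
#include <stdint.h>

#define TRANSFER_ALIGN 64u   /* the 64B boundary discussed in the text */

/* Offset of an address within its 64B block; 0 means the data is 64B aligned. */
size_t offset_within_64(const void *p)
{
    return (size_t)((uintptr_t)p % TRANSFER_ALIGN);
}

/* Returns the first 64B-aligned address at or after raw.
   The caller must have over-allocated the buffer by at least 63 bytes. */
void *align_up_64(void *raw)
{
    uintptr_t addr = (uintptr_t)raw;
    uintptr_t aligned = (addr + (TRANSFER_ALIGN - 1)) & ~(uintptr_t)(TRANSFER_ALIGN - 1);
    return (void *)aligned;
}
```

A pointer for which `offset_within_64` returns 0 on the CPU needs no align modifier in the offload pragma, which keeps the offset matching described above intact.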

Maximize Data Transfer Rate

Data transfer rate between CPU and coprocessor is slowest for stack data and fastest for statically allocated and dynamically allocated data. Align CPU data on a 64B boundary or higher for improved data transfer rate. Align at 2MB for maximum transfer rate.
Make data transfer size a multiple of 64B for improved transfer rate, and a multiple of 2MB for maximum transfer rate. Generally, the larger the data transfer size, the higher the bandwidth. Allocate coprocessor memory in large (2MB) pages for improved data transfer rate. See notes on using the environment variable MIC_USE_2MB_BUFFERS.
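
One way to follow the size recommendation is to pad array allocations so that the number of bytes transferred is a multiple of 64B (or of 2MB for large buffers). The helpers below are a sketch under stated assumptions: the names are hypothetical, the granule must be a power of two, and the element size is assumed to divide the granule evenly (true for float and double).

```c
#include <stddef.h>

#define CACHE_LINE_BYTES ((size_t)64)
#define LARGE_PAGE_BYTES ((size_t)2 * 1024 * 1024)

/* Round n up to the next multiple of granule (granule must be a power of two). */
size_t round_up(size_t n, size_t granule)
{
    return (n + granule - 1) & ~(granule - 1);
}

/* Element count to allocate so that count * elem_size is a multiple of granule.
   Assumes elem_size divides granule evenly. */
size_t padded_count(size_t count, size_t elem_size, size_t granule)
{
    return round_up(count * elem_size, granule) / elem_size;
}
```

For example, an array of 4086 floats occupies 16344 bytes, which is not a multiple of 64; padding the allocation to 4096 elements makes the transfer exactly 16384 bytes.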

Overlapping Data Transfer and Offloaded Computation

Input data needed by an offloaded computation may be sent in advance of the offload. The CPU may continue processing while the data is being transferred. In the example below, f1 and f2 are sent to the coprocessor ahead of the offload that will use their values.

const int N = 4086;
float *f1, *f2;
float result;
f1 = (float *)memalign(64, N*sizeof(float));
f2 = (float *)memalign(64, N*sizeof(float));
...
// CPU issues send and continues
#pragma offload_transfer target(mic:0) in( f1[0:N] ) signal(f1)

...

// CPU issues send and continues
#pragma offload_transfer target(mic:0) in( f2[0:N] ) signal(f2)

...
// CPU issues request to do computation using f1 and f2
// Coprocessor begins execution after pre-sent data is received
#pragma offload target(mic:0) wait(f1, f2) out( result )
{
    result = foo(N, f1, f2);
}

To receive data asynchronously from MIC to CPU, signal and wait are used as clauses of two different pragmas. The first offload performs the compute but only initiates data transfer. The second pragma causes a wait for the data transfer to be completed.

const int N = 4086;
float *f1, *f2;
f1 = (float *)memalign(64, N*sizeof(float));
f2 = (float *)memalign(64, N*sizeof(float));
...

// CPU sends f1 as input synchronously
// The output is in f2, but is not needed immediately
#pragma offload target(mic:0) in(f1[0:N]) nocopy(f2[0:N]) signal(f2)
{
    foo(N, f1, f2);
}
...
#pragma offload_transfer target(mic:0) wait(f2) out(f2[0:N])

// CPU can now use the result in f2
...

Offloading a Lambda Function

A lambda is an inline function. It can be tricky to apply the offload attribute to such a function if it has to be offloaded. Here is an example:

#pragma offload_attribute(push,target(mic))
#include <stdio.h>

template<typename F>
void Run( F f ) {
    f();
}
#pragma offload_attribute(pop)

int main() {
#pragma offload target(mic)
    Run( [&] () __attribute__((target(mic)))
{
#ifdef __MIC__
        printf("MIC says Hello, world\n");
#else
        printf("CPU says Hello, world\n");
#endif
    } );
}

 

Offload Using _Cilk_shared/_Cilk_offload
 

Marking Data and Classes _Cilk_shared

Shared Pointer Declaration

A Shared pointer is declared as follows:

int * _Cilk_shared q;   // Shared pointer q to non-shared int
 

Declaration of a Pointer to Shared Data

A pointer to Shared data is written this way:

_Cilk_shared int * p;   // Non-shared pointer p to shared int
 

A Shared pointer to Shared data is a combination of the two:

_Cilk_shared int * _Cilk_shared r;    // Shared pointer r to shared int
 

Declaring a Class Type as Shared

When a class type must be declared Shared, place the keyword between “class” and the rest of the declaration. Placing _Cilk_shared at the beginning marks the data being declared Shared and not the type:

class _Cilk_shared C {
  // class members
};

Allocating Dynamic Memory for Shared Data

Dynamically allocated Shared data must be allocated from the pool of Shared memory. This is done by using the APIs:

_Offload_shared_malloc
_Offload_shared_free
_Offload_shared_aligned_malloc
_Offload_shared_aligned_free

Using Placement New

In some cases the memory allocation is not under direct user control, for example, with STL objects. The offload.h header provides a placement new mechanism for diverting STL memory allocations into Shared memory. Here is an example:

#pragma offload_attribute (push, _Cilk_shared)
#include <vector>
#include "offload.h"
#pragma offload_attribute (pop)
#include <stdio.h>
using namespace std;
_Cilk_shared vector<int, __offload::shared_allocator<int> > * _Cilk_shared v;

_Cilk_shared void foo() {
   for (int i = 0; i<  5; i++) {
     printf("%d\n", (*v)[i]);  // fault here otherwise
   }
}

int main() {
  // v is allocated in shared mem
  // v's elements are also in shared mem
  v = new (_Offload_shared_malloc(sizeof(vector<int>)))
          _Cilk_shared
          vector<int, __offload::shared_allocator<int> >(5);  
                                
  for (int i = 0; i<  5; i++) {
    (*v)[i] = i;
  }
  _Cilk_offload foo();
  return 0;
}

Improving Performance of _Cilk_offload/_Cilk_shared

By default, the memory model for Shared data assumes that both the CPU and the coprocessor may modify Shared data. A simpler and more efficient synchronization model may be specified if the application's data model meets two conditions: input data of an offload is only sent from the CPU to the coprocessor, and after the offload has finished, all modified data can be sent back to the CPU without being merged with other Shared data that may have been concurrently modified on the CPU. Enable this model using the environment variable MYO_CONSISTENCE_PROTOCOL.

Example

setenv MYO_CONSISTENCE_PROTOCOL HYBRID_UPDATE_NOT_SHARED

Linking Offloaded Code with Coprocessor Libraries

At present, the compiler is very strict about checking the mixing of Shared and non-Shared pointers, and it is necessary to circumvent some of these checks. In general, Shared data can always be processed by routines that know nothing about sharing, as long as casts are used.
Third-party libraries built for MIC may be linked with offloaded code by following these steps:
1. Keep your third-party libraries as they are, i.e., built with -mmic and unaware of any sharing.
2. Offload from the CPU program to functions on MIC that serve as data exchange functions. These functions are marked _Cilk_shared and deal with data marked _Cilk_shared.
3. From these data exchange functions running on MIC, call the MIC-only libraries. Because data and functions referenced in functions marked _Cilk_shared are required to be _Cilk_shared, you will need casts on data and functions defined in the external libraries (which are not built with the Shared keywords).
Schematically: CPU code → SHARED code running on MIC → (casts) code built with -mmic

Example

 --- Makefile ---
main.out: main.cpp libbar.so
	icc -o main.out main.cpp -offload-option,mic,compiler,"-L. -lbar"

libbar.so: libbar.cpp
	icc -mmic -fPIC -shared -o libbar.so libbar.cpp

clean:
	rm *.o *.so main.out

# The variable is set on the same recipe line as the program,
# because make runs each recipe line in a separate shell.
run: main.out
	MIC_LD_LIBRARY_PATH=".:$$MIC_LD_LIBRARY_PATH" ./main.out
--- file main.cpp ---
#include <vector>
#include <stdio.h>
using namespace std;

#include "offload.h"

typedef vector<float, __offload::shared_allocator<float> > _Cilk_shared * SHARED_VECTOR_PTR;
typedef vector<float> * VECTOR_PTR;
SHARED_VECTOR_PTR v;
typedef _Cilk_shared void (*SHARED_FUNC)(VECTOR_PTR v);

extern void libbar(VECTOR_PTR v);

_Cilk_shared void bar(VECTOR_PTR v)
{
        if (_Offload_get_device_number() == 0)
        {
                printf("MIC bar\n");
                for (int i = 0; i<  5; i++) {
                        printf("\t%f\n", (*v)[i]);
                }
        }
}

_Cilk_shared void foo(SHARED_VECTOR_PTR v)
{
        if (_Offload_get_device_number() == 0)
        {
                printf("MIC foo\n");
                for (int i = 0; i<  5; i++) {
                        printf("\t%f\n", (*v)[i]);
                }
        }
        bar((VECTOR_PTR)v);
#ifdef __MIC__
   (*(SHARED_FUNC)&libbar)((VECTOR_PTR)v);
#else
#endif
}

int main()
{
        // v's elements are in shared mem
        v = new (_Offload_shared_malloc(sizeof(vector<float>)))
                _Cilk_shared vector<float, __offload::shared_allocator<float> > (5);


        for (int i = 0; i<  5; i++) {
                (*v)[i] = i;
        }
        _Cilk_offload foo(v);

        return 0;
}


 ---- file libbar.cpp ---
#include <vector>
#include <stdio.h>
using namespace std;

typedef vector<float> * VECTOR_PTR;

void libbar(VECTOR_PTR v)
{
        if (_Offload_get_device_number() == 0)
        {
                printf("MIC libbar\n");
                for (int i = 0; i<  5; i++) {
                        printf("\t%f\n", (*v)[i]);
                }
        }
}

Using MKL and TBB on MIC

MKL and TBB are available on MIC and can be called from functions running on MIC. To use MKL, specify the compiler switch -mkl (Linux) or /Qmkl (Windows) when compiling the program on the CPU. Similarly, to use TBB, specify the compiler switch -tbb or /Qtbb when compiling the program on the CPU. These switches are automatically propagated to the MIC compilation of the offloaded code, and those MKL and TBB functions that are supported on MIC may then be called from code running on MIC.

See the _Cilk_shared/_Cilk_offload example below (taken from the MKL documentation).

/* C source code is found in dgemm_example.c */
#pragma offload_attribute(push,_Cilk_shared)
#define min(x,y) (((x) < (y)) ? (x) : (y))
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
int not_main()
{
    double *A, *B, *C;
    int m, n, k, i, j;
    double alpha, beta;
    printf ("\n This example computes real matrix C=alpha*A*B+beta*C using \n"
            " Intel® MKL function dgemm, where A, B, and  C are matrices and \n"
            " alpha and beta are double precision scalars\n\n");
    m = 2000, k = 200, n = 1000;
    printf (" Initializing data for matrix multiplication C=A*B for matrix \n"
            " A(%ix%i) and matrix B(%ix%i)\n\n", m, k, k, n);
    alpha = 1.0; beta = 0.0;
    printf (" Allocating memory for matrices aligned on 64-byte boundary for better \n"
            " performance \n\n");
    A = (double *)mkl_malloc( m*k*sizeof( double ), 64 );
    B = (double *)mkl_malloc( k*n*sizeof( double ), 64 );
    C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );
    if (A == NULL || B == NULL || C == NULL) {
      printf( "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n");
      mkl_free(A);
      mkl_free(B);
      mkl_free(C);
      return 1;
    }
    printf (" Initializing matrix data \n\n");
    for (i = 0; i < (m*k); i++) {
        A[i] = (double)(i+1);
    }
    for (i = 0; i < (k*n); i++) {
        B[i] = (double)(-i-1);
    }
    for (i = 0; i < (m*n); i++) {
        C[i] = 0.0;
    }
    printf (" Computing matrix product using Intel® MKL dgemm function via CBLAS interface \n\n");
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 
                m, n, k, alpha, A, k, B, n, beta, C, n);
    printf ("\n Computations completed.\n\n");
    printf (" Top left corner of matrix A: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(k,6); j++) {
        printf ("%12.0f", A[j+i*k]);
      }
      printf ("\n");
    }
    printf ("\n Top left corner of matrix B: \n");
    for (i=0; i<min(k,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.0f", B[j+i*n]);
      }
      printf ("\n");
    }
    
    printf ("\n Top left corner of matrix C: \n");
    for (i=0; i<min(m,6); i++) {
      for (j=0; j<min(n,6); j++) {
        printf ("%12.5G", C[j+i*n]);
      }
      printf ("\n");
    }
    printf ("\n Deallocating memory \n\n");
    mkl_free(A);
    mkl_free(B);
    mkl_free(C);
    printf (" Example completed. \n\n");
    return 0;
}
#pragma offload_attribute(pop)
int main()
{
 _Cilk_offload not_main();
 return 0;
}

Customizing Code for CPU and MIC

Avoid #ifdef __MIC__
On occasion, offloaded code requires customization for the coprocessor. When possible, use a dynamic check to select code to run on the coprocessor:

   if (_Offload_get_device_number() >= 0)
   {
       // MIC version
   } else {
       // CPU version
   }

This method of customization may be infeasible if the MIC or CPU versions use intrinsics that are not available on both processors. In these cases an #ifdef may be used. However, do not use an #ifdef __MIC__ directly within a #pragma offload construct because it has the potential to create a mismatch between variables sent from/received by the coprocessor, and sent/received by the CPU. Mismatch may occur because the default for variables is inout and the variable references may not be identical on the two alternate code versions.

 #ifdef __MIC__
    // MIC version
 #else
   // CPU version
 #endif
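
One way to follow this advice is to move the #ifdef into a function that the offload construct calls, so that the construct itself references the same variables in both compilations. The sketch below is only an illustration: sum_array is a hypothetical function, the 16-lane loop is a stand-in for code written with MIC-only intrinsics, and the target(mic) function markup that the Intel offload compiler would require is left out so that the code builds with any C compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel. Both variants use exactly the same parameters, so an
 * offload construct that calls this function lists the same variables in the
 * CPU and MIC compilations. */
float sum_array(const float *data, int n)
{
    float sum = 0.0f;
    int i;

    assert(data != NULL || n == 0);
#ifdef __MIC__
    /* MIC version: stand-in for an intrinsics-based implementation that
     * accumulates into 16 partial sums, one per vector lane. */
    float lane[16] = {0.0f};
    for (i = 0; i < n; i++)
        lane[i % 16] += data[i];
    for (i = 0; i < 16; i++)
        sum += lane[i];
#else
    /* CPU version: plain scalar loop. */
    for (i = 0; i < n; i++)
        sum += data[i];
#endif
    return sum;
}

/* Call site, shown as a comment because it needs the Intel offload compiler:
 *
 *   #pragma offload target(mic) in(data[0:n])
 *   total = sum_array(data, n);
 *
 * No #ifdef appears inside the construct itself. */
```

Built with a compiler that does not define __MIC__, the function takes the CPU branch.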

Controlling Options Passed to CPU and MIC Compilers

Many options that you specify on the command line for an offload compilation apply to both the host-side compilation and the MIC-side compilation.

If you want to pass additional options to the offload compilation, or you would like to override the command line options passed to offload compilation, you must use option [Q]offload-option to specify the additional or overriding options.

Example: You can pass the reporting options only to the MIC-side compilation (this makes it easier to analyze which optimizations are performed for MIC) as follows:

icc -offload-option,mic,compiler,"-vec-report2 -opt-report-phase hlo -opt-report=3" t1.c -c -openmp

In this case, the reporting options are used only for the MIC-side compilation, not for the host-side compilation.

When building a heterogeneous application, the driver passes all compiler options specified on the command-line to the host compilation and only certain options to the offload compilation. To see a list of options passed to the offload compilation, specify option "-watch=mic-cmd".

// The following compiler invocation produces the output shown below.
   ifort -openmp program.f90 -g -o g.out -watch=mic-cmd
MIC command line:
   ifort -openmp program.f90 -g -o g.out
 

If you add the option "-watch=mic-cmd" to the earlier example above, then the compiler reports the exact command-line expansion that will be used for the MIC-side compile as follows (in addition to doing the compile):

scel5% icc -offload-option,mic,compiler,"-vec-report2 -opt-report-phase hlo -opt-report=3" t1.c -c  -openmp -watch=mic-cmd

MIC command line:

icc t1.c -c -watch=mic-cmd -openmp -vec-report2 -opt-report-phase hlo -opt-report=3

Internal compiler options, which typically begin with -m, are not automatically passed from the CPU compilation to the MIC compilation. These options must be specified explicitly for either the CPU or the MIC compilation (using -offload-option,mic,compiler,<option> for the latter).

Example

// The following passes an internal option to the CPU compiler
// and asks for the MIC command line to be printed
  icc -c test.c -watch=mic-cmd -mP2OPT_hlo_pref_indirect_refs=T
MIC command line:
icc -c test.c

// The following passes an internal option to the MIC compiler
// and asks for the MIC command line to be printed
  icc -c test.c -watch=mic-cmd -offload-option,mic,compiler,-mP2OPT_hlo_pref_indirect_refs=T
MIC command line:
icc -c test.c -mP2OPT_hlo_pref_indirect_refs=T

Using Multiple MIC Cards

On a multi-card system, #pragma offload constructs that do not specify an explicit MIC card number are issued to MIC card 0. However, if data is to be reused between offloads, it is safer to use explicit card numbers in the offloads to ensure that data is carried over from one offload to another in a predictable manner, irrespective of the number of cards available in the system.

The offload pragma allows a user to write offloads to logical cards 1 to N. The physical cards available to the process are specified using the environment variable OFFLOAD_DEVICES=<list of physical devices>. Logical card numbers are then mapped to physical cards by computing N % <number of cards available>, meaning that logical card numbers wrap around among the physical cards.

// Transfer data in "p" to card 1 and reuse in next offload
#pragma offload_transfer target(mic:1) in( p[0:length] RETAIN )
...
// Use data previously transferred to card 1
// Use of matching card number ensures data "p" is available on same card
#pragma offload_transfer target(mic:1) nocopy( p[0:length] REUSE )

When using _Cilk_shared and _Cilk_offload, management of data is automatic. Explicit card numbers using _Cilk_offload_to(<card-number>) are useful for manually balancing load between the available cards instead of relying on the round-robin default offload behavior.
See also: environment variable OFFLOAD_DEVICES

Environment Variables for Controlling Offload

There are two categories of environment variables:

  1. Those that affect the way the Offload runtime library operates
  2. Those that are passed through to the coprocessor execution environment by the Offload library

We first describe environment variables in category 1. These are prefixed with either "MIC_" or "OFFLOAD_". The prefix is fixed, as is the environment variable name.

The special environment variable MIC_ENV_PREFIX is used to distinguish variables in category 2. It is described at the end of this section.

MIC_USE_2MB_BUFFERS

Sets the threshold for creating buffers with large pages. A buffer is created with the large pages hint if its size exceeds the threshold value.

Example

// any variable allocated on MIC that is equal to or greater than
// 100KB in size will be allocated in large pages.
setenv MIC_USE_2MB_BUFFERS 100k

This environment variable applies only for data allocated by pragma-offload for pointer variables in in/out/nocopy clauses. If the environment variable is not set, all such allocations happen in 4KB pages.
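
The threshold test can be sketched in a few lines of C. This is not the offload runtime's code: parse_size and uses_large_pages are hypothetical helpers, the suffix handling (K, M and G as powers of 1024, no suffix meaning bytes) is an assumption, and the comparison follows the "exceeds" wording above, so a buffer exactly equal to the threshold is treated as staying in 4KB pages.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: converts text such as "100k", "2M" or "4096" to a
 * byte count. Assumes K, M and G mean powers of 1024 and that a missing
 * suffix means bytes. Returns 1 on success, 0 on malformed input. */
int parse_size(const char *text, unsigned long long *bytes)
{
    char *end;
    unsigned long long value;

    assert(text != NULL && bytes != NULL);
    if (!isdigit((unsigned char)text[0]))
        return 0;
    value = strtoull(text, &end, 10);
    switch (toupper((unsigned char)*end)) {
    case '\0':                     break;
    case 'B':               end++; break;
    case 'K': value <<= 10; end++; break;
    case 'M': value <<= 20; end++; break;
    case 'G': value <<= 30; end++; break;
    default:  return 0;            /* unknown suffix */
    }
    if (*end != '\0')
        return 0;                  /* trailing characters */
    *bytes = value;
    return 1;
}

/* Returns 1 if a buffer of buffer_bytes would get the large-page hint for
 * the given MIC_USE_2MB_BUFFERS value (NULL means the variable is unset). */
int uses_large_pages(unsigned long long buffer_bytes, const char *env_value)
{
    unsigned long long threshold;

    if (env_value == NULL || !parse_size(env_value, &threshold))
        return 0;                  /* unset or unreadable: 4KB pages */
    return buffer_bytes > threshold;
}
```

With the setting from the example, uses_large_pages(200 * 1024ULL, "100k") reports the large-page hint, while a 50KB buffer stays in 4KB pages.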

For more information about the use of large pages on MIC, see the article "Large Page Considerations".

MIC_STACKSIZE

Sets the size of the offload process stack on MIC. This is the overall stack size. Use MIC_OMP_STACKSIZE to modify the stack size of each OpenMP thread.

Example

setenv MIC_STACKSIZE 100M // Sets MIC stack to 100 MB

MIC_LD_LIBRARY_PATH

Sets the path where shared libraries needed by the MIC offloaded code reside.

See example in section "Linking Offloaded Code with Coprocessor Libraries".

OFFLOAD_REPORT

__Offload_report(int on_or_off); // to be called on CPU only 

This API lets you turn offload reporting on or off at runtime when the environment variable OFFLOAD_REPORT is set to 1 or 2.

At runtime, the API may change the value of the flag. This does not affect the setting of the environment variable. It does affect which offloads will produce reports.

Example

#include <stdio.h>
__declspec(target(mic)) volatile int x;
int main()
{
    __Offload_report(0);
    #pragma offload target(mic)
    {
        x = 1;
    }
    __Offload_report(1);
    #pragma offload target(mic)
    {
        x = 2;
    }
    return 0;
}

For the program above, with OFFLOAD_REPORT=1 the report will be as follows:

[Offload] [MIC 0] [File] test_ofld0.c
[Offload] [MIC 0] [Line] 15
[Offload] [MIC 0] [CPU Time] 0.000268 (seconds)
[Offload] [MIC 0] [MIC Time] 0.000022 (seconds)
 

For OFFLOAD_REPORT=2 the report will be as follows: 

[Offload] [MIC 0] [File] test_ofld0.c
[Offload] [MIC 0] [Line] 15
[Offload] [MIC 0] [CPU Time] 0.000263 (seconds)
[Offload] [MIC 0] [CPU->MIC Data] 0 (bytes)
[Offload] [MIC 0] [MIC Time] 0.000023 (seconds)
[Offload] [MIC 0] [MIC->CPU Data] 4 (bytes)
 

where [CPU->MIC Data] and [MIC->CPU Data] are the total amounts of data transferred, in bytes.

OFFLOAD_DEVICES

The environment variable OFFLOAD_DEVICES restricts the process to use only the MIC cards specified as the value of the variable. <value> is a comma separated list of physical device numbers in the range 0 to (number_of_devices_in_the_system-1).

Devices available for offloading are numbered logically. That is, _Offload_number_of_devices() returns the number of allowed devices, and the device indexes specified in the target specifiers of offload pragmas are in the range 0 to (number_of_allowed_devices-1).

Example

setenv OFFLOAD_DEVICES "1,2"

Allows the program to use only physical MIC cards 1 and 2 (for instance, in a system with four installed cards). Offloads to devices numbered 0 or 1 will be performed on physical devices 1 and 2, respectively. Offloads to target numbers higher than 1 will wrap around so that all offloads remain within logical devices 0 and 1 (which map to physical cards 1 and 2). The function _Offload_get_device_number() executed on a MIC device will return 0 or 1 when the offload is running on physical device 1 or 2, respectively.
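
The mapping just described can be written out as a short C sketch. The two functions are hypothetical helpers for illustration and are not part of the offload runtime; the only behavior assumed is the wrap-around rule stated above (target number modulo the number of allowed devices).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parses a comma-separated OFFLOAD_DEVICES value such
 * as "1,2" into physical[]. Returns the number of allowed devices, or -1 if
 * the text is malformed or lists more than max_devices entries. */
int parse_offload_devices(const char *value, int *physical, int max_devices)
{
    int count = 0;
    const char *p = value;

    assert(value != NULL && physical != NULL);
    while (*p != '\0') {
        char *end;
        long id = strtol(p, &end, 10);
        if (end == p || id < 0 || count == max_devices)
            return -1;          /* not a number, negative, or too many */
        physical[count++] = (int)id;
        if (*end == ',')
            end++;              /* skip the separator */
        else if (*end != '\0')
            return -1;          /* unexpected character */
        p = end;
    }
    return count;
}

/* Maps the target number written in an offload pragma to a physical card,
 * wrapping around the allowed devices. Returns -1 for invalid arguments. */
int physical_card_for_target(int target, const int *physical, int count)
{
    if (physical == NULL || count <= 0 || target < 0)
        return -1;
    return physical[target % count];
}
```

For the value "1,2", targets 0 and 1 map to physical cards 1 and 2, and target 2 wraps around to physical card 1 again.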

OFFLOAD_INIT

This environment variable gives the offload runtime a hint about when it should initialize MIC devices.

Supported values:

on_start

All available devices are initialized before entering main.

on_offload

Device initialization is performed right before the first offload to it. Initialization is done only on the MIC device which handles offload.

on_offload_all

All available MIC devices are initialized right before the first offload in a program.

The default is on_offload_all (for backward compatibility).

MIC_ENV_PREFIX

This is the general mechanism to pass environment variable values to the process running on a MIC card.

Note: The setting of this environment variable has no effect on the fixed MIC_* environment variables discussed before this section, namely MIC_USE_2MB_BUFFERS, MIC_STACKSIZE and MIC_LD_LIBRARY_PATH. Those names are fixed.

By default, all environment variables defined in the environment of an executing CPU program are replicated to the coprocessor's execution environment when an offload occurs. You can modify this behavior by defining the environment variable MIC_ENV_PREFIX. When you set MIC_ENV_PREFIX, then not all CPU environment variables are replicated to the coprocessor, but only those environment variables that begin with the value of the MIC_ENV_PREFIX environment variable. The environment variables set on the coprocessor have the prefix value removed. You thus have independent control of OpenMP*, Intel® Cilk™ Plus, and other execution environments that use common environment variable names.

So, if MIC_ENV_PREFIX is not set, the Offload runtime simply replicates the host environment to the coprocessor. If MIC_ENV_PREFIX is set then only those environment variable names whose name begins with the value defined by MIC_ENV_PREFIX are passed to the target (with prefix removed).

Thus, the value of MIC_ENV_PREFIX sets the value of the prefix which is used to recognize environment variable values intended for programs running on MIC devices. For example, setenv MIC_ENV_PREFIX MYCARDS will use “MYCARDS” as the string that indicates that an environment variable is intended for a MIC process.

Environment variable values of the form <mic-prefix>_<var>=<value> will send <var>=<value> to each card.

Environment variable values of the form <mic-prefix>_<card-number>_<var>=<value> will send <var>=<value> to the MIC card numbered <card-number>.

Environment variable values of the form <mic-prefix>_ENV=<variable1=value1|variable2=value2> will send <variable1>=<value1> and <variable2>=<value2> to each card.

Environment variable values of the form <mic-prefix>_<card-number>_ENV=<variable1=value1|variable2=value2> will send <variable1>=<value1> and <variable2>=<value2> to the MIC card numbered <card-number>.

Example

setenv MIC_ENV_PREFIX PHI     // Defines the prefix to be used
setenv PHI_ABCD abcd          // Sets ABCD=abcd on all cards
setenv PHI_2_EFGH efgh        // Sets EFGH=efgh on logical MIC 2
setenv PHI_ENV "X=x|Y=y"      // Sets X=x and Y=y on all cards
setenv PHI_4_ENV "P=p|Q=q"    // Sets P=p and Q=q on MIC 4
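
The four forms listed above amount to a small name-matching rule, sketched below in C. This is an illustration under stated assumptions, not the Offload library's implementation: forward_to_card and append_entry are hypothetical names, the result is returned as newline-separated VAR=value entries purely for convenience, and the fixed MIC_* and OFFLOAD_* variables are not modeled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Appends len characters of text plus '\n' to out if they fit. */
static int append_entry(char *out, size_t out_size, const char *text, size_t len)
{
    size_t used = strlen(out);
    if (used + len + 2 > out_size)
        return 0;
    memcpy(out + used, text, len);
    out[used + len] = '\n';
    out[used + len + 1] = '\0';
    return 1;
}

/* Decides what the host variable name=value contributes to the environment
 * of the given card when MIC_ENV_PREFIX is set to prefix. Writes the
 * resulting VAR=value entries to out and returns how many were written. */
int forward_to_card(const char *prefix, int card, const char *name,
                    const char *value, char *out, size_t out_size)
{
    size_t plen;
    const char *rest, *p;
    int count = 0;

    assert(prefix && name && value && out && out_size > 0);
    out[0] = '\0';
    plen = strlen(prefix);

    /* Only names of the form <prefix>_... are candidates. */
    if (strncmp(name, prefix, plen) != 0 || name[plen] != '_')
        return 0;
    rest = name + plen + 1;

    /* An optional <card-number>_ part restricts the variable to one card. */
    p = rest;
    while (isdigit((unsigned char)*p))
        p++;
    if (p != rest && *p == '_') {
        if (atoi(rest) != card)
            return 0;
        rest = p + 1;
    }
    if (*rest == '\0')
        return 0;

    if (strcmp(rest, "ENV") == 0) {
        /* The value is a '|'-separated list of var=value definitions. */
        const char *start = value;
        for (;;) {
            const char *bar = strchr(start, '|');
            size_t len = bar ? (size_t)(bar - start) : strlen(start);
            if (len > 0 && append_entry(out, out_size, start, len))
                count++;
            if (bar == NULL)
                break;
            start = bar + 1;
        }
    } else {
        /* Single variable: the prefix is removed, the value is kept. */
        char entry[256];
        int n = snprintf(entry, sizeof entry, "%s=%s", rest, value);
        if (n > 0 && (size_t)n < sizeof entry &&
            append_entry(out, out_size, entry, (size_t)n))
            count = 1;
    }
    return count;
}
```

For example, with the prefix PHI, the host variable PHI_2_EFGH=efgh yields EFGH=efgh for card 2 and nothing for any other card, while a variable without the prefix, such as HOME, is not forwarded at all.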

NEXT STEPS

It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on the Intel® Xeon Phi™ coprocessor. The paths provided in this guide reflect the steps necessary to get best possible application performance.

Back to the chapter "Native and Offload Programming Models"

Please refer to the Optimization Notice page for more detailed information regarding performance and optimization in Intel software products.