About Asynchronous Data Transfer

This topic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture).

To transfer data between the CPU and the coprocessor, use the offload_transfer pragma with either all in clauses or all out clauses. Without a signal clause, the data transfer is synchronous: the next statement executes only after the data transfer is complete.
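As a minimal sketch of the synchronous form, the function below sends an array to the coprocessor, updates it there, and copies it back, with each step completing before the next statement runs. The function name sync_round_trip and the array contents are illustrative assumptions. A compiler that does not recognize the offload pragmas typically ignores them, in which case the same code runs entirely on the CPU.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: round-trips n floats through the coprocessor. */
float sync_round_trip(int n)
{
    float *v = (float *)malloc(n * sizeof(float));
    float sum = 0.0f;
    int i;

    assert(v != NULL);
    for (i = 0; i < n; i++)
        v[i] = (float)i;

    /* No signal clause: the CPU blocks here until v is on the coprocessor. */
    #pragma offload_transfer target(mic:0) \
            in(v : length(n) alloc_if(1) free_if(0))

    /* Update the copy that already resides on the coprocessor. */
    #pragma offload target(mic:0) in(n) \
            nocopy(v : alloc_if(0) free_if(0))
    {
        int k;
        for (k = 0; k < n; k++)
            v[k] += 1.0f;
    }

    /* No signal clause: the CPU blocks here until v has been copied back. */
    #pragma offload_transfer target(mic:0) \
            out(v : length(n) alloc_if(0) free_if(1))

    for (i = 0; i < n; i++)
        sum += v[i];
    free(v);
    return sum;
}
```

Calling sync_round_trip(4) fills the array with 0, 1, 2, 3, adds 1 to each element in the offloaded region, and returns the sum of the values copied back.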

Adding a signal clause to offload_transfer makes the data transfer asynchronous. The tag specified in the signal clause is an address expression associated with that data set. The pragma initiates the data transfer, and the CPU continues past the pragma statement without waiting for the transfer to complete.

A later pragma written with a wait clause causes the activity specified in the pragma to begin only after all the data associated with the tag has been received. The data is placed into the variables specified when the data transfer was initiated. These variables must still be accessible.

Alternatively, you can call the non-blocking API _Offload_signaled() to query whether the activity associated with a signal has completed on a specific target device, including whether a section of offloaded code has finished running.
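The polling approach might look like the sketch below. It assumes that _Offload_signaled() is declared in offload.h and takes the target number followed by the tag, and that no further wait is required once the query reports completion; verify both against your compiler's headers and documentation. The function name sum_with_polling is hypothetical. The __INTEL_OFFLOAD guard keeps the API call out of builds where the offload runtime is not available.

```c
#include <assert.h>
#include <stdlib.h>
#ifdef __INTEL_OFFLOAD
#include <offload.h>   /* assumed to declare _Offload_signaled() */
#endif

/* Hypothetical helper: sends n floats, polls for arrival, sums on the target. */
float sum_with_polling(int n)
{
    float *buf = (float *)malloc(n * sizeof(float));
    float total = 0.0f;
    int i;

    assert(buf != NULL);
    for (i = 0; i < n; i++)
        buf[i] = (float)(i + 1);

    /* Initiate the transfer; the pointer buf doubles as the signal tag. */
    #pragma offload_transfer target(mic:0) signal(buf) \
            in(buf : length(n) alloc_if(1) free_if(0))

#ifdef __INTEL_OFFLOAD
    /* Poll target 0 instead of blocking. The query names the same target
       device that the signal was initiated for. */
    while (!_Offload_signaled(0, buf)) {
        /* other useful CPU work could go here */
    }
#endif

    /* Assumption: the transfer is known to be complete, so this offload
       carries no wait clause. It sums the data and frees the target copy. */
    #pragma offload target(mic:0) in(n) out(total) \
            nocopy(buf : alloc_if(0) free_if(1))
    {
        float t = 0.0f;
        int k;
        for (k = 0; k < n; k++)
            t += buf[k];
        total = t;
    }

    free(buf);
    return total;
}
```

Polling suits cases where the CPU has other work to interleave; if it has nothing else to do, a wait clause or offload_wait is simpler.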

Note

The signal and wait clauses, the offload_wait construct and the _Offload_signaled() API refer to a specific target device, so you must specify target-number in the target() clause.

Querying a signal before it has been initiated results in undefined behavior and a runtime abort of the application. For example, consider a query of a signal (SIG1) on target device 0 when the signal was actually initiated for target device 1. No signal (SIG1) is associated with target device 0, so the application aborts.

If, during an asynchronous offload, a signal is created in one thread (Thread A) and waited for in a different thread (Thread B), you are responsible for ensuring that Thread B does not query the signal before Thread A has initiated the asynchronous offload that sets up the signal. If Thread B queries the signal too early, the application aborts at run time.

If the if-specifier of an if clause evaluates to false and the pragma also has a signal (tag) clause, then the signal is undefined and any wait on this signal has undefined behavior.
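One way to stay clear of the last restriction is to guard every wait with the same condition that guarded the signal. The sketch below illustrates the idea; the function name guarded_sum and the flag use_coprocessor are hypothetical, and the pattern is a suggestion rather than a documented requirement.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: sums 1..n, waiting on the tag only if it was set. */
float guarded_sum(int n, int use_coprocessor)
{
    float *data = (float *)malloc(n * sizeof(float));
    float total = 0.0f;
    int i;

    assert(data != NULL);
    for (i = 0; i < n; i++)
        data[i] = (float)(i + 1);

    /* The signal is created only when use_coprocessor is nonzero. */
    #pragma offload_transfer target(mic:0) if(use_coprocessor) signal(data) \
            in(data : length(n) alloc_if(1) free_if(0))

    if (use_coprocessor) {
        /* The transfer above was issued, so waiting on the tag is valid. */
        #pragma offload target(mic:0) wait(data) in(n) out(total) \
                nocopy(data : alloc_if(0) free_if(1))
        {
            float t = 0.0f;
            int k;
            for (k = 0; k < n; k++)
                t += data[k];
            total = t;
        }
    } else {
        /* No signal exists: compute on the CPU and never wait on the tag. */
        for (i = 0; i < n; i++)
            total += data[i];
    }

    free(data);
    return total;
}
```

Both branches produce the same result; only the branch that follows an issued transfer refers to the tag.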

Asynchronous Data Transfer from the CPU to the Coprocessor

To transfer data asynchronously from the CPU to the coprocessor, use a signal clause in an offload_transfer pragma with in clauses. The variables listed in the in clauses form a data set. The pragma initiates the data transfer of those variables from the CPU to the coprocessor. A subsequent offload pragma with a wait clause that uses the same value for tag as that used in the signal clause causes the statement controlled by the pragma to begin execution on the coprocessor only after the data transfer is complete.

Asynchronous Data Transfer from the Coprocessor to the CPU

To transfer data asynchronously from the coprocessor to the CPU, use the signal and wait clauses in two different pragmas. The first offload pragma performs the computation, but only initiates the data transfer. The second pragma causes a wait for the data transfer to complete.
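The two-pragma pattern can be sketched in isolation as follows. The function name sum_of_squares_async and the closed-form check performed on the CPU are illustrative assumptions; the point is only that the CPU can do unrelated work between the pragma that initiates the transfer and the pragma that waits for it, provided it does not read the output array in between.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: computes squares on the target, overlaps CPU work. */
int sum_of_squares_async(int n)
{
    int *sq = (int *)malloc(n * sizeof(int));
    int expected, sum = 0, i;

    assert(sq != NULL);

    /* First pragma: compute on the coprocessor and initiate the transfer
       of sq back to the CPU. The CPU does not wait here. */
    #pragma offload target(mic:0) signal(sq) in(n) \
            out(sq : length(n) alloc_if(1) free_if(1))
    {
        int k;
        for (k = 0; k < n; k++)
            sq[k] = k * k;
    }

    /* Independent CPU work that does not touch sq: closed-form sum of
       0^2 + 1^2 + ... + (n-1)^2. */
    expected = (n - 1) * n * (2 * n - 1) / 6;

    /* Second pragma: block until the data tagged sq has arrived. */
    #pragma offload_wait target(mic:0) wait(sq)

    for (i = 0; i < n; i++)
        sum += sq[i];
    free(sq);

    /* Return -1 if the transferred data does not match the CPU-side check. */
    return (sum == expected) ? sum : -1;
}
```

The array sq must remain allocated and in scope until the wait completes, because the runtime writes the results into it.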

Examples: Coprocessor to CPU and CPU to Coprocessor

The example below demonstrates various asynchronous data transfers between the CPU and coprocessor.

     1  #include <stdio.h>
     2  #include <immintrin.h>   /* declares _mm_malloc */
     3  __attribute__((target(mic)))
     4             void add_inputs(int N, float *f1, float*f2);
     5
     6  void display_vals(int id, int N, float*f2);
     7
     8  int main()
     9  {
    10     const int N = 5;
    11     float *f1, *f2;
    12     int i, j;
    13
    14     f1 = (float *)_mm_malloc(N*sizeof(float),4096);
    15     f2 = (float *)_mm_malloc(N*sizeof(float),4096);
    16
    17     for (i=0;i<N;i++){
    18        f1[i]=i+1;
    19        f2[i]=0.0;
    20     }
    21

Section 1 below (lines 22-56) demonstrates asynchronous data transfers, using IN and OUT, between the CPU and coprocessor with asynchronous computation. The data transfer of the arrays f1 and f2 is initiated at lines 28-30. The offload_transfer does not initiate a computation; its only purpose is to start transferring the data for f1 and f2 to the coprocessor. At lines 40-44 the CPU initiates the computation, with the function add_inputs, on the coprocessor and continues execution to the offload_wait at line 51. The offloaded function uses the data f1 and f2, whose transfer was initiated earlier on the CPU. The execution of the offloaded region on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tag (f1) is set accordingly. While the offloaded region executes on the coprocessor, the CPU waits at line 51 pending completion of the computation and the transfer of the results in f2 to the CPU. Execution on the CPU continues beyond line 51 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.

    22     //-----------  Section 1   --------------------------------------
    23
    24     // Asynchronous transfer IN (to coprocessor) of f1 and f2
    25     //
    26     // CPU issues send and then continues
    27
    28     #pragma offload_transfer target(mic:0)  signal (f1) \
    29                     in( f1 : length(N) alloc_if(1) free_if(0) ) \
    30                     in( f2 : length(N) alloc_if(1) free_if(0) )
    31
    32     // Asynchronous compute and transfer OUT (to CPU) of f2
    33     //
    34     // CPU issues request to perform computation and continues
    35     //
    36     // Coprocessor receives offload request, waits for pre-sent
    37     // data.  After receiving data, performs computation and
    38     // transfers (asynchronous) data OUT (to CPU)
    39
    40     #pragma offload target(mic:0) wait(f1) signal (f2) \
    41                     in( N ) \
    42                     nocopy( f1 : alloc_if(0) free_if(1) ) \
    43                     out( f2 : length(N) alloc_if(0) free_if(1) )
    44     add_inputs(N, f1, f2);
    45
    46     // Wait for offload completion
    47     //
    48     // CPU waits for completion of previous offload and
    49     // data transfer out (to CPU) of f2
    50
    51     #pragma offload_wait target(mic:0)  wait(f2)
    52
    53
    54     // Show current values
    55     display_vals(1, N, f2);
    56

In the same example, Section 2 (lines 57-90) demonstrates multiple asynchronous data transfers, using IN, from the CPU to the coprocessor with synchronous computation and synchronous data transfer, using OUT, from the coprocessor to the CPU. Multiple independent asynchronous data transfers can occur at any time. The two offload_transfer pragmas send f1 and f2 to the coprocessor at different times: first f1 at lines 63-64, and then f2 at lines 68-69. The transfers are independent. At lines 81-85 the execution of the offloaded region and the function add_inputs on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tags (f1 and f2) are both set accordingly. Execution on the CPU waits for the completion of the offloaded computation and the transfer of the results in f2 to the CPU. The data transfer of f2 to the CPU occurs synchronously with the execution of the offloaded region.

    57     //-----------  Section 2   --------------------------------------
    58
    59     // Independent asynchronous transfers IN (to coprocessor)
    60     //
    61     // CPU issues send and continues
    62
    63     #pragma offload_transfer target(mic:0)  signal (f1) \
    64                     in( f1 : length(N) alloc_if(1) free_if(0) )
    65
    66     // CPU issues send and continues
    67
    68     #pragma offload_transfer target(mic:0)  signal (f2) \
    69                     in( f2 : length(N) alloc_if(1) free_if(0) )
    70
    71     // Wait for independent transfers IN (to coprocessor),
    72     // perform synchronous compute and data transfers out
    73     //
    74     // CPU issues request to perform computation and waits for
    75     // completion
    76     //
    77     // Coprocessor receives offload request, waits for pre-sent
    78     // data. After receiving data, performs computation and
    79     // transfers (synchronous) data OUT (to CPU)
    80
    81     #pragma offload target(mic:0) wait(f1 , f2) \
    82                     in( N ) \
    83                     nocopy( f1 : alloc_if(0) free_if(1) ) \
    84                     out( f2 : length(N) alloc_if(0) free_if(1) )
    85     add_inputs(N, f1, f2);
    86
    87
    88     // Show current values
    89     display_vals(2, N, f2);
    90

Section 3 (lines 91-132) in the example demonstrates an independent asynchronous data transfer (IN) from the CPU to the coprocessor, combined with a synchronous data transfer (IN) from the CPU to the coprocessor and a synchronous computation, followed by an independent asynchronous data transfer (OUT) from the coprocessor to the CPU. The offloaded function uses the data f1 and f2. The transfer of f2 was initiated earlier on the CPU at lines 97-98, while f1 is transferred synchronously as part of the offload at lines 111-115. The execution of the offloaded region at lines 111-115 on the coprocessor begins only after the transfers of f1 and f2 are complete and the signal tag (f2) is set accordingly for the transfer of f2. After the offloaded region executes on the coprocessor, the computed results of f2 remain on the coprocessor and execution on the CPU continues beyond line 115. At lines 122-123, the CPU initiates an asynchronous data transfer (OUT) from the coprocessor to the CPU for the computed results in f2 and continues execution to line 128, where the CPU waits for the completion of the transfer of f2. Execution on the CPU continues beyond line 128 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.

    91     //-----------  Section 3   --------------------------------------
    92
    93     // Asynchronous transfer IN (to coprocessor) of f2
    94     //
    95     // CPU issues send and then continues
    96
    97     #pragma offload_transfer target(mic:0)  signal(f2) \
    98                     in( f2 : length(N) alloc_if(1) free_if(0) )
    99
   100     // Synchronous transfer IN (to coprocessor) of f1 with
   101     // synchronous compute of f2 where new computed values
   102     // of f2 remain on coprocessor
   103     //
   104     // CPU transfers values IN (to coprocessor) of f1, then issues
   105     // request to perform computation and waits for completion
   106     //
   107     // Coprocessor receives offload request, waits for pre-sent
   108     // data for f2.  After receiving data, performs the
   109     // computation and holds the results in f2 on coprocessor
   110
   111     #pragma offload target(mic:0) wait(f2) \
   112                     in( N ) \
   113                     in ( f1 : length(N) alloc_if(1) free_if(0) ) \
   114                     nocopy( f2 )
   115     add_inputs(N, f1, f2);
   116
   117
   118     // CPU waits for completion of previous offload, then
   119     // initiates asynchronous transfer OUT (to CPU) of f2
   120     // and continues
   121
   122     #pragma offload_transfer target(mic:0)  signal (f2) \
   123                     out( f2 : length(N) alloc_if(0) free_if(1) )
   124
   125
   126     // CPU waits for completion of transfer of f2 to the CPU
   127
   128     #pragma offload_wait target(mic:0)  wait(f2)
   129
   130     // Show current values
   131     display_vals(3, N, f2);
   132
   133  }
   134
   135  void add_inputs (int N, float *f1, float*f2)
   136  {
   137     int i;
   138
   139     for (i=0; i<N; i++){
   140        f2[i] = f2[i] + f1[i];
   141     }
   142  }
   143
   144  void display_vals (int id, int N, float *f2)
   145  {
   146     int i;
   147
   148     printf("\nResults after Offload #%d:\n",id);
   149     for (i=0; i<N; i++){
   150       printf("     f2[%d]= %f\n",i,f2[i]);
   151     }
   152  }

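The following example double-buffers the inputs to an offload: in each iteration, the transfer of the next input buffer is started asynchronously before the coprocessor computes on the current one, so the transfer and the computation overlap.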

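#pragma offload_attribute(push, target(mic))
int count = 25000000;
int iter = 10;
float *in1, *out1;
float *in2, *out2;
/* Defined elsewhere; prototype inferred from the calls below. */
void compute(float *in, float *out);
#pragma offload_attribute(pop)

/* Every clause below uses alloc_if(0), so in1, in2, out1, and out2 must
   already be allocated on both the CPU and the coprocessor. */
void do_async_in()
{
  int i;
  /* Start sending the first input buffer before entering the loop. */
  #pragma offload_transfer target(mic:0) in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)
  for (i=0; i<iter; i++)
  {
    if (i%2 == 0) {
          /* Start sending in2 for the next iteration (skipped on the last
             iteration), then wait for in1 and compute on it. */
          #pragma offload_transfer target(mic:0) if(i!=iter-1) in(in2 : length(count) alloc_if(0) free_if(0) ) signal(in2)
          #pragma offload target(mic:0) nocopy(in1) wait(in1) out(out1 : length(count) alloc_if(0) free_if(0) )
          compute(in1, out1);
    } else {
          /* Start sending in1 for the next iteration (skipped on the last
             iteration), then wait for in2 and compute on it. */
          #pragma offload_transfer target(mic:0) if(i!=iter-1) in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)
          #pragma offload target(mic:0) nocopy(in2) wait(in2) out(out2 : length(count) alloc_if(0) free_if(0) )
          compute(in2, out2);
    }
  }
}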