Asynchronous Data Transfer and Computation

Introduction

The Intel® C++ Compiler and Intel® Fortran Compiler provide simple language extensions for offload, used to send computations from an Intel® Xeon® processor (host) to one or more Intel® Xeon Phi™ coprocessors. With this syntax, software developers can build applications that execute both synchronously and asynchronously with respect to the host. In asynchronous mode, code running on the host sends work to the coprocessor and continues to execute, blocking only when it needs a result from the coprocessor to continue its own work. This document describes basic techniques used to implement asynchronous algorithms for Intel Xeon Phi coprocessors. It assumes the reader is familiar with basic offload programming, including offload pragmas and directives, and with Intel® Cilk™ keywords.

Explicit Offloading: Non-shared Memory Model

The Intel Xeon processor and the Intel Xeon Phi coprocessor do not share a common memory. Under the non-shared memory model, data is copied to and from the coprocessor under the explicit control of the programmer and the Intel compiler runtime. The programmer marks code blocks with C/C++ pragmas or Fortran directives; the compiler builds the marked block for the coprocessor, and by default the runtime transfers the block's data from the host to the coprocessor, performs the computation on the coprocessor, and returns the results to the host. The non-shared memory model supports scalars, arrays, and bitwise-copyable C structs or Fortran derived types (no embedded pointers or allocatable arrays).
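
For contrast with the asynchronous techniques described below, a minimal synchronous offload might look like the following sketch (the function, array name, and reduction are illustrative):

// Illustrative sketch: a synchronous offload; the host blocks until
// the region completes and its results are copied back.
float sum(float *data, int n)
{
    float s = 0.0f;   // scalars used in the region default to inout
    #pragma offload target(mic) in(data : length(n))
    {
        for (int i = 0; i < n; i++)
            s += data[i];
    }
    return s;   // s now holds the value computed on the coprocessor
}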

Synchronization Clauses

Use offload pragmas and directives with signal and wait clauses to enable asynchronous data transfer and computation.

  1. An offload_transfer pragma/directive with a signal clause begins an asynchronous data transfer.
  2. An offload pragma/directive with a signal clause begins an asynchronous computation.
  3. An offload_wait pragma/directive with a wait clause blocks execution until the asynchronous data transfer or computation associated with that signal is complete.

A wait clause is matched to its signal clause by a unique tag value, typically the address of a variable involved in the transfer.

In C and C++, the syntax is:

#pragma offload_transfer target(mic:n) signal(signal_value)
#pragma offload target(mic:n) signal(signal_value)
#pragma offload_wait target(mic:n) wait(signal_value)

In Fortran, the syntax is:

!dir$ offload_transfer target(mic:n) signal(signal_value)
!dir$ offload target(mic:n) signal(signal_value)
!dir$ offload_wait target(mic:n) wait(signal_value)

The examples in the remainder of this paper are written in C++.

Asynchronous Data Transfer

Use the offload_transfer pragma/directive with a signal clause to transfer data between the host and the coprocessor. The data transfer begins and the CPU continues executing past the pragma statement until it reaches a subsequent pragma written with a matching wait clause. There the host blocks until the coprocessor has received all the data associated with the signal; if the waiting pragma is an offload, its code block then executes on the coprocessor. Multiple independent asynchronous data transfers can be in flight at any time.

Example: Host to Coprocessor

To transfer data asynchronously from the host to the coprocessor, use a signal clause on an offload_transfer pragma that contains only in clauses:

// Host allocates and initializes vector1 and vector2
…
// Host starts asynchronous data transfer to the coprocessor
// with *in* and *signal* clauses.
#pragma offload_transfer in(vector1, vector2 : length(N)) signal(vector1)
…
// Host code continues executing as the data transfer occurs
…
// The offload waits until the data tagged by vector1 arrives
#pragma offload wait(vector1) out(vector2 : length(N))
{
	calculate(vector1, vector2);  // writes the result into vector2
}
…
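
The fragment above elides the memory-management details: by default each pragma allocates and frees the coprocessor copies of its variables, so keeping the transferred data alive for the later offload requires alloc_if/free_if. A complete minimal sketch of the pattern (N, the element values, and the choice of mic:0 are illustrative) follows:

// Minimal self-contained sketch of asynchronous host-to-coprocessor
// transfer; N, the element values, and mic:0 are illustrative.
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main()
{
    float *vector1 = (float *)malloc(N * sizeof(float));
    float *vector2 = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { vector1[i] = 1.0f; vector2[i] = 2.0f; }

    // Allocate coprocessor buffers, start the transfer, keep them alive
    #pragma offload_transfer target(mic:0) \
        in(vector1, vector2 : length(N) alloc_if(1) free_if(0)) signal(vector1)

    // ... host work proceeds here while the transfer is in flight ...

    // Wait for the transfer, compute on the coprocessor, return vector2
    #pragma offload target(mic:0) wait(vector1) \
        nocopy(vector1 : length(N) alloc_if(0) free_if(1)) \
        out(vector2 : length(N) alloc_if(0) free_if(1))
    {
        for (int i = 0; i < N; i++)
            vector2[i] += vector1[i];
    }

    printf("vector2[0] = %f\n", vector2[0]);   // expect 3.000000
    free(vector1);
    free(vector2);
    return 0;
}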

Example: Coprocessor to Host

To receive data asynchronously from the coprocessor on the host, use a signal clause on an offload_transfer pragma that contains only out clauses:

// Host allocates and initializes vector1 and vector2
…
// The offload computes vector2, but result is not sent back
#pragma offload in(vector1, vector2 : length(N)) signal(vector2)
{
	calculate(vector1, vector2);  // writes the result into vector2
}
…
// Host code and offload computation occur concurrently
…
// Host initiates the asynchronous transfer back from the coprocessor;
// vector2 is reused as the tag: wait on the compute, signal the copy-out
#pragma offload_transfer wait(vector2) signal(vector2) out(vector2 : length(N))
…
// Host can continue
…
// When Host needs the result of the offloaded computation
// it waits for data transfer to be completed
#pragma offload_wait wait(vector2)
…
// Host can now use the result in vector2
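
As in the previous example, the coprocessor copy of vector2 must persist between the compute offload and the copy-out. A complete minimal sketch (again with illustrative names, length, and device number) follows:

// Minimal self-contained sketch of asynchronous coprocessor-to-host
// transfer; N, the element values, and mic:0 are illustrative.
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main()
{
    float *vector1 = (float *)malloc(N * sizeof(float));
    float *vector2 = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { vector1[i] = 1.0f; vector2[i] = 2.0f; }

    // Compute asynchronously; keep vector2 alive on the coprocessor
    #pragma offload target(mic:0) signal(vector2) \
        in(vector1 : length(N)) \
        in(vector2 : length(N) free_if(0))
    {
        for (int i = 0; i < N; i++)
            vector2[i] += vector1[i];
    }

    // ... host work proceeds here while the coprocessor computes ...

    // Wait for the compute, then start the copy-out asynchronously
    #pragma offload_transfer target(mic:0) wait(vector2) signal(vector2) \
        out(vector2 : length(N) alloc_if(0) free_if(1))

    // ... more host work can overlap with the copy-out ...

    // Block until vector2 has arrived on the host
    #pragma offload_wait target(mic:0) wait(vector2)

    printf("vector2[0] = %f\n", vector2[0]);   // expect 3.000000
    free(vector1);
    free(vector2);
    return 0;
}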

Asynchronous Computation

Use an offload pragma/directive with a signal clause to initiate an asynchronous computation on the coprocessor. The host launches the offload on the coprocessor and continues executing in parallel. When the host reaches a pragma written with a matching wait clause, it blocks until the computation on the coprocessor is complete. This allows the host to issue offloads and carry on concurrent activity without dedicating any additional CPU threads.

int signal_value;   // tag variable; only its address is used
do {
     #pragma offload target(mic) signal(&signal_value)
     {
           long_running_coprocessor_compute();
     }
     concurrent_host_activity();
     #pragma offload_wait target(mic) wait(&signal_value)
} while (1);
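
A self-contained version of this pattern might look like the sketch below; the work functions, iteration count, and device number are illustrative stand-ins:

// Runnable sketch: overlap coprocessor work with host work.
#include <stdio.h>

// Functions called inside an offload region must be built for the
// coprocessor as well as the host.
__attribute__((target(mic)))
void long_running_coprocessor_compute()
{
    // ... coprocessor-side work would go here ...
}

void concurrent_host_activity()
{
    // ... host-side work would go here ...
}

int main()
{
    int sig;   // tag variable; only its address is used
    for (int iter = 0; iter < 10; iter++) {
        // Launch the offload; the host does not block here
        #pragma offload target(mic:0) signal(&sig)
        {
            long_running_coprocessor_compute();
        }
        concurrent_host_activity();   // runs while the coprocessor works
        // Block until the offload tagged by &sig has finished
        #pragma offload_wait target(mic:0) wait(&sig)
    }
    return 0;
}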

Programming Practices and Strategies

By default, the compiler runtime determines which coprocessor to send the data and computation to in a multi-coprocessor system. The use of signal/wait pairs alone is not enough to ensure that the data and the associated computation are offloaded to the same coprocessor. To reliably use data persistence and asynchronous offload in a multi-coprocessor system, always use an offload pragma/directive with the target clause, target(mic:n), to explicitly indicate which coprocessor should be used.
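
For example, a sketch that pins each buffer and its computation to a specific coprocessor might look like the following; the arrays a, b, resultA, and resultB are illustrative names:

// Pin each buffer and its computation to the same coprocessor so that
// each signal/wait pair and the persisted data refer to the same device
#pragma offload_transfer target(mic:0) \
    in(a : length(N) alloc_if(1) free_if(0)) signal(a)
#pragma offload_transfer target(mic:1) \
    in(b : length(N) alloc_if(1) free_if(0)) signal(b)
…
#pragma offload target(mic:0) wait(a) \
    nocopy(a : length(N) alloc_if(0) free_if(1)) out(resultA : length(N))
{
    // compute resultA from a on coprocessor 0
}
#pragma offload target(mic:1) wait(b) \
    nocopy(b : length(N) alloc_if(0) free_if(1)) out(resultB : length(N))
{
    // compute resultB from b on coprocessor 1
}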

Additionally, there are offload runtime interfaces, defined in offload.h, to determine the number of coprocessors in a system, retrieve the number of the current coprocessor, as well as to time offload regions and measure the amount of data transferred.
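
For instance, a minimal sketch using the device-query functions declared in offload.h (_Offload_number_of_devices and _Offload_get_device_number):

// Query the offload runtime; declarations come from offload.h
#include <offload.h>
#include <stdio.h>

int main()
{
    printf("Coprocessors available: %d\n", _Offload_number_of_devices());

    #pragma offload target(mic:0)
    {
        // Returns the device number when running on a coprocessor,
        // or -1 when the code runs on the host
        printf("Running on device %d\n", _Offload_get_device_number());
    }
    return 0;
}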

Set the environment variable OFFLOAD_REPORT to trace runtime execution. For a single offload and OFFLOAD_REPORT=1, the output looks like:

[Offload] [MIC 0] [File]            sampleC13.c
[Offload] [MIC 0] [Line]            350
[Offload] [MIC 0] [CPU Time]        0.010006 (seconds)
[Offload] [MIC 0] [MIC Time]        0.000246 (seconds)

For a single offload with OFFLOAD_REPORT=2, the output looks like:

[Offload] [MIC 0] [File]            sampleC13.c
[Offload] [MIC 0] [Line]            350
[Offload] [MIC 0] [CPU Time]        0.009827 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   4 (bytes)
[Offload] [MIC 0] [MIC Time]        0.000244 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   0 (bytes)

Implicit Offloading: Virtual Shared Memory Model

With the virtual shared memory model, the Intel compiler runtime creates and manages a virtual memory region that is mapped to the same address range on the CPU and the coprocessor. Programmers mark data as shared using Intel Cilk keywords (C/C++ only). The runtime keeps the two memory spaces synchronized automatically, and the coprocessor transfers only modified data back to the host at the end of an offload. Since explicit data marshaling is no longer required, the shared memory model supports the complex, pointer-based data structures commonly used by C and C++ programmers.
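
A minimal sketch of shared data under this model might look like the following; the variable and function names are illustrative:

// Illustrative sketch: data and a function marked _Cilk_shared.
#include <stdio.h>

_Cilk_shared int result;         // allocated in virtual shared memory

_Cilk_shared void increment()
{
    result += 1;                 // coprocessor updates the shared data
}

int main()
{
    result = 41;
    _Cilk_offload increment();   // runtime syncs shared memory both ways
    printf("result = %d\n", result);   // prints 42
    return 0;
}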

Asynchronous Computation

This example shows how to use Intel Cilk parallel extensions to perform an asynchronous offload. In the code below, the FindArea() function is marked with _Cilk_shared, which declares the function as available for offload. The compiler will compile this function twice to create a version that runs on the host and a version that runs on the coprocessor.

#include <stdlib.h>              // rand_r
#include <time.h>                // clock
#include <cilk/cilk_api.h>       // __cilkrts_get_worker_number
#include <cilk/reducer_opadd.h>  // cilk::reducer_opadd

_Cilk_shared float FindArea(float r)
{
   cilk::reducer_opadd<int> inside(0);   // race-free parallel counter

   _Cilk_for (int i = 0; i < 20000; i++)
   {
      // Per-iteration variables avoid data races between loop iterations
      unsigned int seed  = __cilkrts_get_worker_number() + clock() + 2 * i;
      unsigned int seed2 = __cilkrts_get_worker_number() + clock() + 2 * i + 1;
      float x = (float) rand_r(&seed)  / RAND_MAX;
      float y = (float) rand_r(&seed2) / RAND_MAX;
      x = 2.0f * x - 1.0f;
      y = 2.0f * y - 1.0f;
      if (x * x + y * y < r * r)
      {
         inside++;
      }
   }
   float area = 4.0f * inside.get_value() / 20000.0f;
   return area;
}

In the main program, the call to FindArea() is prefaced with _Cilk_spawn _Cilk_offload. This syntax tells the compiler to run _Cilk_offload FindArea(r2) in parallel with FindArea(r1) on the CPU: one CPU task performs a computation on the host while another CPU task executes an offload, in effect performing an asynchronous offload. By default, the runtime chooses which coprocessor to use, unless the programmer specifies one with _Cilk_offload_to(target_number). The _Cilk_sync statement forces the host to wait for FindArea(r2) to complete before continuing, so that its result can be used for further processing.

int main()
{
   float r1, r2, AreaLg, AreaSm, TotalArea;
   // Get input into r1 and r2, do error checking
   …
   AreaLg = _Cilk_spawn _Cilk_offload FindArea(r2);  // Runs on coprocessor
   AreaSm = FindArea(r1);  // Runs on host
   _Cilk_sync;   // Wait for coprocessor to complete
   TotalArea = AreaLg - AreaSm;   // Use the result
   …
   // Continue processing
}

To summarize, the above code launches an offload to compute the area for r2. Computation resumes on the host to calculate the area for r1; the host then waits until the results from the coprocessor are received. After further processing, the code terminates.

Conclusion

Software developers can increase application performance by sending computations to Intel Xeon Phi coprocessors. The asynchronous offload programming techniques described in this paper allow the host system to send data-parallel computations to the coprocessor(s) while continuing to execute work on the host. The Intel C++ Compiler and Intel Fortran Compiler provide language extensions for offload, which support this asynchronous processing.

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2012 Intel Corporation. All rights reserved.

Performance Notice

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.

For more complete information about compiler optimizations, see the Optimization Notice.