Intel® C++ Compiler 19.0 Developer Guide and Reference
This topic only applies when targeting Intel® Many Integrated Core Architecture (Intel® MIC Architecture).
To transfer data between the CPU and the target device, use the offload_transfer pragma with either all in clauses or all out clauses. Without a signal clause, the data transfer is synchronous: the next statement executes only after the data transfer is complete.
Adding a signal clause to offload_transfer makes the data transfer asynchronous. The tag specified in the signal clause is an address expression associated with that data set. The pragma initiates the data transfer, and the CPU continues past the pragma statement without waiting for the transfer to finish.
A later pragma written with a wait clause causes the activity specified in the pragma to begin only after all the data associated with the tag has been received or shared with the target device. The data is placed into the variables specified when the data transfer was initiated. These variables must still be accessible.
Alternatively, on Intel® MIC Architecture, you can use the non-blocking API _Offload_signaled() to determine whether a section of offloaded code has completed running on a specific target device.
On Intel® MIC Architecture, the signal and wait clauses, the offload_wait construct and the _Offload_signaled() API refer to a specific target device, so you must specify target-number in the target() clause.
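The non-blocking query described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: scale_by_two and offload_and_poll are hypothetical names, the declaration int _Offload_signaled(int target_number, void *signal) from offload.h is assumed, and the sketch assumes that a non-zero return means the offload, including its out transfer, has finished. The Intel-specific lines are guarded with __INTEL_OFFLOAD (assumed to be defined when offload is enabled) so the code also builds and runs on the host with other compilers.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __INTEL_OFFLOAD
#include <offload.h>            /* assumed to declare _Offload_signaled() */
#endif

/* Hypothetical kernel: doubles each element in place. */
#ifdef __INTEL_OFFLOAD
__attribute__((target(mic)))
#endif
void scale_by_two(int n, float *v)
{
    int i;
    for (i = 0; i < n; i++)
        v[i] *= 2.0f;
}

/* Starts an asynchronous offload on target 0 and polls for completion.
   Returns the number of polls that found the offload still running
   (always 0 when the code runs on the host without offload). */
int offload_and_poll(int n, float *v)
{
    int polls = 0;

    assert(v != NULL && n >= 0);

    /* The tag is the address held in v; the target number is explicit. */
#ifdef __INTEL_OFFLOAD
#pragma offload target(mic:0) signal(v) inout(v : length(n))
#endif
    scale_by_two(n, v);

#ifdef __INTEL_OFFLOAD
    /* Query the same tag on the same target number used above. */
    while (!_Offload_signaled(0, v)) {
        polls++;                /* other CPU work could go here */
    }
#endif
    return polls;
}
```

Calling offload_and_poll(4, v) with v = {1, 2, 3, 4} leaves v doubled. The target number passed to _Offload_signaled() matches the one in the target() clause, which is the pairing the text above requires.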
Querying a signal before it has been initiated is undefined behavior and results in a runtime abort of the application. For example, suppose a signal (SIG1) is initiated for target device 1 but queried on target device 0. No signal SIG1 is associated with target device 0, so the application aborts.
If, during an asynchronous offload, a signal is created in one thread, Thread A, and waited for in a different thread, Thread B, you are responsible for ensuring that Thread B does not query the signal before Thread A has initiated the asynchronous offload that sets up the signal. If Thread B queries the signal too early, the application aborts at run time.
If the if-specifier evaluates to false and you use a signal (tag) clause, then the signal is undefined and any wait on this signal has undefined behavior.
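One way to enforce this ordering between threads is a flag that Thread A sets only after its pragma has been issued and that Thread B checks before it waits. The sketch below is an assumption-laden illustration, not a prescribed method: the names transfer_started, thread_a_initiate, and thread_b_wait are hypothetical, C11 atomics are used for the handshake, and the offload pragmas are guarded with __INTEL_OFFLOAD (assumed to be defined when offload is enabled) so the code also runs on the host.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical handshake flag: becomes 1 once the signal exists. */
static atomic_int transfer_started = 0;

/* Thread A: initiate the asynchronous transfer, then publish the flag. */
void thread_a_initiate(int n, float *buf)
{
    assert(buf != NULL && n > 0);

#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) signal(buf) \
        in(buf : length(n) alloc_if(1) free_if(0))
#endif
    /* Release ordering: the pragma above is issued before the flag is set. */
    atomic_store_explicit(&transfer_started, 1, memory_order_release);
}

/* Thread B: do not touch the signal until Thread A has created it.
   Returns 1 once the wait has been issued. */
int thread_b_wait(float *buf)
{
    assert(buf != NULL);

    /* Spin until Thread A reports that the signal has been initiated. */
    while (!atomic_load_explicit(&transfer_started, memory_order_acquire)) {
        /* busy-wait; a condition variable would avoid burning CPU */
    }

#ifdef __INTEL_OFFLOAD
#pragma offload_wait target(mic:0) wait(buf)
#endif
    return 1;
}
```

In a real program each function would run in its own thread. The sketch leaves buf allocated on the target (free_if(0)), so a later pragma with free_if(1) is still needed to release it.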
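A simple way to avoid waiting on an undefined signal is to record the condition once and use the same value to guard the wait. The following is a minimal sketch under stated assumptions: send_if_large and its threshold parameter are hypothetical, and the pragmas are guarded with __INTEL_OFFLOAD (assumed to be defined when offload is enabled) so the code also runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Sends data to target 0 only when n reaches the threshold.
   Returns 1 if the transfer (and therefore the signal) was initiated,
   0 otherwise. */
int send_if_large(int n, float *data, int threshold)
{
    /* Evaluate the condition once so the transfer and the wait agree. */
    int use_target = (n >= threshold);

    assert(data != NULL && n > 0);

#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) if(use_target) signal(data) \
        in(data : length(n) alloc_if(1) free_if(0))
#endif

    /* ... other CPU work could overlap with the transfer here ... */

    if (use_target) {
        /* The signal exists only when the if clause was true. */
#ifdef __INTEL_OFFLOAD
#pragma offload_wait target(mic:0) wait(data)
#endif
    }
    return use_target;
}
```

With a threshold of 100, a call with n = 4 skips both the transfer and the wait, while a call with n = 128 performs both. As written, the sketch leaves data allocated on the target after a successful transfer.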
To transfer data asynchronously from the CPU to the target, use a signal clause in an offload_transfer pragma with in clauses. The variables listed in the in clauses form a data set. The pragma initiates the data transfer of those variables from the CPU to the target. A subsequent offload pragma with a wait clause that uses the same value for tag as that used in the signal clause causes the statement controlled by the pragma to begin execution on the target only after the data transfer is complete.
To transfer data asynchronously from the target to the CPU, use the signal and wait clauses in two different pragmas. The first offload pragma performs the computation, but only initiates the data transfer. The second pragma causes a wait for the data transfer to complete.
The example below demonstrates various asynchronous data transfers between the CPU and target.
1 #include <stdio.h>
2 #include <immintrin.h> /* declares _mm_malloc */
3 __attribute__((target(mic)))
4 void add_inputs(int N, float *f1, float*f2);
5
6 void display_vals(int id, int N, float*f2);
7
8 int main()
9 {
10 const int N = 5;
11 float *f1, *f2;
12 int i, j;
13
14 f1 = (float *)_mm_malloc(N*sizeof(float),4096);
15 f2 = (float *)_mm_malloc(N*sizeof(float),4096);
16
17 for (i=0;i<N;i++){
18 f1[i]=i+1;
19 f2[i]=0.0;
20 }
21
Section 1 below (lines 22-56) demonstrates asynchronous data transfers, using IN and OUT, between the CPU and target with asynchronous computation. The data transfer of the arrays f1 and f2 is initiated at lines 28-30. The offload_transfer does not initiate a computation; its only purpose is to start transferring f1 and f2 to the target. At lines 40-44 the CPU initiates the computation, with the function add_inputs, on the target and continues execution to the offload_wait at line 51. The offloaded function uses the data f1 and f2, whose transfer was initiated earlier on the CPU. The execution of the offloaded region on the target begins only after the transfers of f1 and f2 are complete and the signal tag (f1) is set accordingly. While the offloaded region executes on the target, the CPU waits at line 51 pending completion of the computation and the transfer of the results in f2 to the CPU. Execution on the CPU continues beyond line 51 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.
22 //----------- Section 1 --------------------------------------
23
24 // Asynchronous transfer IN (to target) of f1 and f2
25 //
26 // CPU issues send and then continues
27
28 #pragma offload_transfer target(mic:0) signal (f1) \
29 in( f1 : length(N) alloc_if(1) free_if(0) ) \
30 in( f2 : length(N) alloc_if(1) free_if(0) )
31
32 // Asynchronous compute and transfer OUT (to CPU) of f2
33 //
34 // CPU issues request to perform computation and continues
35 //
36 // target receives offload request, waits for pre-sent
37 // data. After receiving data, performs computation and
38 // transfers (asynchronous) data OUT (to CPU)
39
40 #pragma offload target(mic:0) wait(f1) signal (f2) \
41 in( N ) \
42 nocopy( f1 : alloc_if(0) free_if(1) ) \
43 out( f2 : length(N) alloc_if(0) free_if(1) )
44 add_inputs(N, f1, f2);
45
46 // Wait for offload completion
47 //
48 // CPU waits for completion of previous offload and
49 // data transfer out (to CPU) of f2
50
51 #pragma offload_wait target(mic:0) wait(f2)
52
53
54 // Show current values
55 display_vals(1, N, f2);
56
In the same example, section 2 (lines 57-90) demonstrates multiple asynchronous data transfers, using IN, from the CPU to the target with synchronous computation and synchronous data transfer, using OUT, from the target to the CPU. Multiple independent asynchronous data transfers can occur at any time. The offload_transfer pragmas send f1 and f2 to the target at different times, first f1 in lines 63-64, and then f2 in lines 68-69. The transfers are independent. At lines 81-85 the execution of the offloaded region and the function add_inputs on the target begins only after the transfers of f1 and f2 are complete and the signal tags (f1 and f2) are both set accordingly. Execution on the CPU waits for the completion of the offloaded computation and the transfer of the results in f2 to the CPU. The data transfer of f2 to the CPU occurs synchronously with the execution of the offloaded region.
57 //----------- Section 2 --------------------------------------
58
59 // Independent asynchronous transfers IN (to target)
60 //
61 // CPU issues send and continues
62
63 #pragma offload_transfer target(mic:0) signal (f1) \
64 in( f1 : length(N) alloc_if(1) free_if(0) )
65
66 // CPU issues send and continues
67
68 #pragma offload_transfer target(mic:0) signal (f2) \
69 in( f2 : length(N) alloc_if(1) free_if(0) )
70
71 // Wait for independent transfers IN (to target),
72 // perform synchronous compute and data transfers out
73 //
74 // CPU issues request to perform computation and waits for
75 // completion
76 //
77 // target receives offload request, waits for pre-sent
78 // data. After receiving data, performs computation and
79 // transfers (synchronous) data OUT (to CPU)
80
81 #pragma offload target(mic:0) wait(f1 , f2) \
82 in( N ) \
83 nocopy( f1 : alloc_if(0) free_if(1) ) \
84 out( f2 : length(N) alloc_if(0) free_if(1) )
85 add_inputs(N, f1, f2);
86
87
88 // Show current values
89 display_vals(2, N, f2);
90
Section 3 (lines 91-132) in the example demonstrates an independent asynchronous data transfer (IN) from the CPU to the target with synchronous data transfer (IN) from the CPU to the target and computation, followed by an independent asynchronous data transfer (OUT) from the target to the CPU. The offloaded function uses the data f1 and f2. The transfer of f2 was initiated earlier on the CPU at lines 97-98. The execution of the offloaded region on lines 111-115 on the target begins only after the transfers of f1 and f2 are complete and the signal tag (f2) is set accordingly for the transfer of f2. After the offloaded region executes on the target, the computed results of f2 remain on the target and execution on the CPU continues beyond line 115. At lines 122-123, the CPU initiates an asynchronous data transfer (OUT) from the target to the CPU for the computed results for f2 and continues execution to line 128, where the CPU waits for the completion of the transfer of f2. Execution on the CPU continues beyond line 128 only after the data for f2 is transferred to the CPU and the signal tag (f2) is set accordingly.
91 //----------- Section 3 --------------------------------------
92
93 // Asynchronous transfer IN (to target) of f2
94 //
95 // CPU issues send and then continues
96
97 #pragma offload_transfer target(mic:0) signal(f2) \
98 in( f2 : length(N) alloc_if(1) free_if(0) )
99
100 // Synchronous transfer IN (to target) of f1 with
101 // synchronous compute of f2 where new computed values
102 // of f2 remain on target
103 //
104 // CPU transfers values IN (to target) of f1, then issues
105 // request to perform computation and waits for completion
106 //
107 // target receives offload request, waits for pre-sent
108 // data for f2. After receiving data, performs the
109 // computation and holds the results in f2 on target
110
111 #pragma offload target(mic:0) wait(f2) \
112 in( N ) \
113 in ( f1 : length(N) alloc_if(1) free_if(0) ) \
114 nocopy( f2 )
115 add_inputs(N, f1, f2);
116
117
118 // CPU waits for completion of previous offload, then
119 // initiates asynchronous transfer OUT (to CPU) of f2
120 // and continues
121
122 #pragma offload_transfer target(mic:0) signal (f2) \
123 out( f2 : length(N) alloc_if(0) free_if(1) )
124
125
126 // CPU waits for completion of transfer of f2 to the CPU
127
128 #pragma offload_wait target(mic:0) wait(f2)
129
130 // Show current values
131 display_vals(3, N, f2);
132
133 }
134
135 void add_inputs (int N, float *f1, float*f2)
136 {
137 int i;
138
139 for (i=0; i<N; i++){
140 f2[i] = f2[i] + f1[i];
141 }
142 }
143
144 void display_vals (int id, int N, float *f2)
145 {
146 int i;
147
148 printf("\nResults after Offload #%d:\n",id);
149 for (i=0; i<N; i++){
150 printf(" f2[%d]= %f\n",i,f2[i]);
151 }
152 }
The following example double-buffers the inputs to an offload: while the target computes on one input buffer, the CPU sends the next one.
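#pragma offload_attribute(push, target(mic))
int count = 25000000;
int iter = 10;
float *in1, *out1;
float *in2, *out2;
/* Kernel that runs on the target. Its definition is not part of this
   example; the signature is inferred from the calls below. */
void compute(float *in, float *out);
#pragma offload_attribute(pop)

void do_async_in()
{
    int i;

    /* Start sending the first input buffer and continue. */
    #pragma offload_transfer target(mic:0) \
            in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)

    for (i=0; i<iter; i++)
    {
        if (i%2 == 0) {
            /* Start sending in2 (skipped on the last iteration),
               then compute on in1 once its transfer has completed. */
            #pragma offload_transfer target(mic:0) if(i!=iter-1) \
                    in(in2 : length(count) alloc_if(0) free_if(0) ) signal(in2)
            #pragma offload target(mic:0) nocopy(in1) wait(in1) \
                    out(out1 : length(count) alloc_if(0) free_if(0) )
            compute(in1, out1);
        } else {
            /* Start sending in1 (skipped on the last iteration),
               then compute on in2 once its transfer has completed. */
            #pragma offload_transfer target(mic:0) if(i!=iter-1) \
                    in(in1 : length(count) alloc_if(0) free_if(0) ) signal(in1)
            #pragma offload target(mic:0) nocopy(in2) wait(in2) \
                    out(out2 : length(count) alloc_if(0) free_if(0) )
            compute(in2, out2);
        }
    }
}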
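Every clause in the double-buffering example uses alloc_if(0) and free_if(0), so the buffers must already exist on the target before do_async_in() runs. One possible way to set that up is sketched below. It is an assumption, not part of the original example: alloc_buffers and free_buffers are hypothetical helpers, the host side uses plain malloc, and the pragmas are guarded with __INTEL_OFFLOAD (assumed to be defined when offload is enabled) so the code also runs on the host.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef __INTEL_OFFLOAD
#pragma offload_attribute(push, target(mic))
#endif
float *in1, *out1;
float *in2, *out2;
#ifdef __INTEL_OFFLOAD
#pragma offload_attribute(pop)
#endif

/* Allocates the four buffers on the host and, when offload is enabled,
   reserves matching storage on target 0 without copying any data.
   Returns 0 on success, -1 if a host allocation fails. */
int alloc_buffers(int n)
{
    assert(n > 0);

    in1  = malloc((size_t)n * sizeof(float));
    in2  = malloc((size_t)n * sizeof(float));
    out1 = malloc((size_t)n * sizeof(float));
    out2 = malloc((size_t)n * sizeof(float));
    if (!in1 || !in2 || !out1 || !out2) {
        free(in1); free(in2); free(out1); free(out2);
        in1 = in2 = out1 = out2 = NULL;
        return -1;
    }

    /* alloc_if(1) free_if(0): allocate on the target and keep it. */
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) \
        nocopy(in1, in2, out1, out2 : length(n) alloc_if(1) free_if(0))
#endif
    return 0;
}

/* Releases the target storage first, then the host buffers. */
void free_buffers(int n)
{
    /* alloc_if(0) free_if(1): reuse the existing target memory, then free it. */
#ifdef __INTEL_OFFLOAD
#pragma offload_transfer target(mic:0) \
        nocopy(in1, in2, out1, out2 : length(n) alloc_if(0) free_if(1))
#endif
    (void)n;                    /* n is unused when offload is disabled */
    free(in1); free(in2); free(out1); free(out2);
    in1 = in2 = out1 = out2 = NULL;
}
```

A caller would invoke alloc_buffers(count) once before the loop and free_buffers(count) after the last offload has completed.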