Hi Christopher,
I was also recently trying to implement data persistence on the coprocessor in an offload from a function. The only method I had success with is declaring a static pointer in the scope of the function to hold the data and marking it with "__attribute__((target(mic)))" (or "__declspec(target(mic))"). It actually makes sense that "nocopy" fails for non-static, non-global pointers: the variable that holds the persistent data must survive between function calls. So static or global variables seem like the only way to go. Here is some (rough) code that illustrates the approach I ended up with:
#include <stdio.h>
#include <cstring>

void foo(const char* data, const int n) {
    // Static, so the pointer (and its card-side allocation) survives between calls
    static const char* persistentData __attribute__((target(mic))) = NULL;

    // Transfer only when the host pointer changes; changed contents go unnoticed
    if (data != persistentData) {
        printf("Offloading data...\n");
        persistentData = data;
#pragma offload_transfer target(mic:0) in(persistentData : length(n) alloc_if(1) free_if(0))
    }

#pragma offload target(mic:0) nocopy(persistentData : length(n) alloc_if(0) free_if(0))
    {
        printf("Re-using data...\n");
        for (int i = 0; i < n; i++)
            printf("%c", persistentData[i]);
        fflush(0);
    }
}

int main() {
    char d[50] = "Hello World!\n";
    const int n = std::strlen(d);
    foo(d, n);
    foo(d, n);
    std::strcpy(d, "Goodbye cruel world!\n"); // Will NOT be printed: same pointer, so no new transfer
    foo(d, n);
}
Result:
$ icpc foo.cc
$ ./a.out
Offloading data...
Re-using data...
Hello World!
Re-using data...
Hello World!
Re-using data...
Hello World!
$
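The caching decision in foo() can be studied without a coprocessor. The following is a minimal host-only sketch, not the offload code itself: FakeDevice, cachedHostPtr, and fooSimulated are hypothetical names, and a std::vector stands in for the card-side buffer. It only models why the third call still prints the old string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-only stand-in for the coprocessor; no offload pragmas
// are involved, so this models only the "transfer or reuse" decision.
struct FakeDevice {
    std::vector<char> buffer;             // plays the role of the card-side allocation
    const char* cachedHostPtr = nullptr;  // plays the role of the static pointer in foo()
    int transfers = 0;                    // number of simulated host-to-card copies
};

// Copies data to the fake device only when the host pointer differs from
// the cached one, then returns what the device buffer currently holds.
std::string fooSimulated(FakeDevice& dev, const char* data, std::size_t n) {
    if (data != dev.cachedHostPtr) {
        dev.cachedHostPtr = data;
        dev.buffer.assign(data, data + n);  // simulated in(...) transfer
        ++dev.transfers;
    }
    assert(dev.buffer.size() >= n);         // reuse needs at least n cached bytes
    return std::string(dev.buffer.data(), n);  // simulated nocopy(...) reuse
}
```

Calling fooSimulated three times with the same array, and overwriting the array before the third call, yields one transfer and the original contents each time. Comparing addresses cannot detect that the contents changed, so a caller that modifies the data in place would need some other way to request a fresh transfer.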

Data persistence between offloads
Is it possible to have data persistence between offloads in different functions without using global variables? How do you let the MIC know to reuse the same memory? I know about using "nocopy" in the offload pragma, but I only seem to be able to get this to work when using the same variable name as in the previous offload (and not, for example, when using a pointer to the same memory on the host).
If my question needs more clarification, below is a simple stand-alone program that illustrates my question. It has three offloads. The intention is to initialize an array in the first offload, do some computation in the second offload without copying any data to or from the MIC, and read back the values (after additional modification) in the third offload. I want to time just the middle offload, in order to get just the time for computation without any data transfers to or from the card. Presumably I could add a chain of such operations before eventually transferring the data back to the host.
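One way to keep the measurement confined to the middle stage is to wrap only that stage in a timer. The sketch below is host-only and assumes std::chrono::steady_clock rather than the gettimeofday calls in the program; timeSeconds, initialize, doubleAll, and runPipeline are hypothetical names standing in for the three offloads described above.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double timeSeconds(Work&& work) {
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Host-only stand-ins for the offloaded loops (hypothetical names).
void initialize(std::vector<float>& a) { for (float& x : a) x = 1.0f; }
void doubleAll(std::vector<float>& a)  { for (float& x : a) x = 2.0f * x; }

// Executes all three stages but returns the time of the middle one only.
double runPipeline(std::vector<float>& a) {
    initialize(a);                                               // stage 1: untimed
    const double seconds = timeSeconds([&] { doubleAll(a); });   // stage 2: timed
    assert(seconds >= 0.0);                                      // steady_clock is monotonic
    doubleAll(a);                                                // stage 3: untimed
    return seconds;
}
```

After runPipeline returns, every element holds 4.0f (1.0f doubled twice), which matches the values the full program is meant to read back.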
I am compiling it as follows:
icc -O2 -offload-build -offload-attribute-target=mic -openmp -vec-report3 -openmp-report -o micsimple simple.cpp
There are two #defines at the top of the program to illustrate three cases:
1) When neither #define is uncommented, all three offloads are in the main function, and the middle offload takes about 0.2 seconds.
2) When just the #define FUNCTION_CALL is uncommented, the middle offload occurs in another function, with the "nocopy" clause, and it results in an error: "offload error: process on the device 0 was terminated by signal 11".
3) When FUNCTION_CALL and GLOBAL_ARRAY are both uncommented, the middle offload occurs in another function, but it uses the same global variable for the array as the main function (rather than a pointer passed to it as a parameter). This produces no error, but the middle offload is much slower than in Case 1 (2.3 seconds).
It seems a little awkward to have to use global variables (or confine all offloads to the same function) in order to let the MIC know to reuse memory from a previous offload. I am also unclear as to why the offload is slower in another function even when the global variable is used. Of course, I may be doing something wrong, or there may be a way around this that I have not realized.
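To put the two timings on a common scale, they can be converted into an effective memory rate. This is only a rough sketch: effectiveBandwidthGBps is a hypothetical helper, and it assumes each float is read once and written once by the doubling loop, ignoring any other traffic.

```cpp
#include <cassert>
#include <cstddef>

// Effective memory rate in GB/s (1 GB = 1e9 bytes), assuming every element
// is read once and written once during the timed loop.
double effectiveBandwidthGBps(std::size_t elements, double seconds) {
    assert(seconds > 0.0);
    const double bytesMoved = 2.0 * sizeof(float) * static_cast<double>(elements);
    return bytesMoved / seconds / 1.0e9;
}
```

With SIZE = 1000000000 and 4-byte floats, 0.2 seconds corresponds to about 40 GB/s and 2.3 seconds to about 3.5 GB/s under this assumption, a gap of roughly 11.5x.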
Thanks!
#define SIZE 1000000000
//#define FUNCTION_CALL
//#define GLOBAL_ARRAY

#include <math.h>
#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <stdio.h>

#ifdef GLOBAL_ARRAY
__declspec (target(mic)) float* array;
#endif

void myFunction(
#ifndef GLOBAL_ARRAY
    float* array,
#endif
    int nthreads)
{
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
    {
        unsigned int i;
#pragma omp for
#pragma simd
        for (i=0; i<SIZE; i++)
            array[i] = 2.0f*array[i];
    }
}

int main(int argc, char* argv[])
{
    struct timeval begin, end, diff;
    int nthreads = 16;
#ifndef GLOBAL_ARRAY
    static __declspec (target(mic)) float* array;
#endif
    array = (float*)malloc(SIZE*sizeof(float));

#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(1) free_if(0))
#pragma omp parallel num_threads(nthreads)
    {
        unsigned int i;
#pragma omp for
#pragma simd
        for (i=0; i<SIZE; i++)
            array[i] = 1.0f;
    }

    gettimeofday(&begin, 0);
#ifdef FUNCTION_CALL
    myFunction(
#ifndef GLOBAL_ARRAY
        array,
#endif
        nthreads);
#else
#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
#pragma omp parallel num_threads(nthreads)
    {
        unsigned int i;
#pragma omp for
#pragma simd
        for (i=0; i<SIZE; i++)
            array[i] = 2.0f*array[i];
    }
#endif
    gettimeofday(&end, 0);
    timersub(&end, &begin, &diff);
    float seconds = diff.tv_sec + 1.0E-6*diff.tv_usec;
    printf("Seconds: %f\n", seconds);

#pragma offload target(mic) out(array : length(SIZE) alloc_if(0) free_if(1))
#pragma omp parallel num_threads(nthreads)
    {
        unsigned int i;
#pragma omp for
#pragma simd
        for (i=0; i<SIZE; i++)
            array[i] = 2.0f*array[i];
    }

    for (unsigned int i=0; i<16; i++)
        printf("%f ", array[rand() % SIZE]);
    printf("\n");
    return 0;
}