Data persistence between offloads

Data persistence between offloads

Is it possible to have data persistence between offloads in different functions without using global variables?  How do you let the MIC know to reuse the same memory?  I know about using "nocopy" in the offload pragma, but I only seem to be able to get this to work when using the same variable name as in the previous offload (and not, for example, when using a pointer to the same memory in the host).

If my question needs more clarification, below is a simple stand-alone program that illustrates my question.  It has three offloads.  The intention is to initialize an array in the first offload, do some computation in the second offload without copying any data to or from the MIC, and read back the values (after additional modification) in the third offload.  I want to time just the middle offload, in order to get just the time for computation without any data transfers to or from the card.  Presumably I could add a chain of such operations before eventually transferring the data back to the host.

I am compiling it as follows:
icc -O2 -offload-build -offload-attribute-target=mic -openmp -vec-report3 -openmp-report -o micsimple simple.cpp

There are two #define's at the top of the program to illustrate three cases:
1) When none of the #defines are uncommented, all three offloads are in the main function, and the middle offload takes about 0.2 seconds.
2) When just the #define FUNCTION_CALL is uncommented, the middle offload occurs in another function, with the "nocopy" clause, and it results in an error: "offload error: process on the device 0 was terminated by signal 11".
3) When FUNCTION_CALL and GLOBAL_ARRAY are uncommented, the middle offload occurs in another function, but it uses the same global variable for the array as the main function (rather than a pointer passed to it as a parameter).  This has no error but the middle offload is much slower than Case 1 (2.3 seconds).

It seems a little awkward to have to use global variables (or confine all offloads the same function) in order to let the MIC know to reuse memory from a previous offload.  I am also unclear as to why the offload is slower in another function even when the global variable is used.  Of course, I may be doing something wrong, or there may be a way around this that I have not realized.

Thanks!

#define SIZE 1000000000

//#define FUNCTION_CALL
//#define GLOBAL_ARRAY

#include <math.h>
#include <stdlib.h>
#include <iostream>
#include <sys/time.h>
#include <stdio.h>

#ifdef GLOBAL_ARRAY
  __declspec (target(mic)) float* array;
#endif

void myFunction(
#ifndef GLOBAL_ARRAY
  float* array,
#endif
  int nthreads)
{
  #pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
  #pragma omp parallel num_threads(nthreads)
  {
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
  }
}

int main(int argc, char* argv[])
{
  struct timeval begin, end, diff;
  int nthreads = 16;

#ifndef GLOBAL_ARRAY
  static __declspec (target(mic)) float* array;
#endif
  array = (float*)malloc(SIZE*sizeof(float));

  #pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(1) free_if(0))
  #pragma omp parallel num_threads(nthreads)
  {
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 1.0f;
  }

  gettimeofday(&begin, 0);

#ifdef FUNCTION_CALL
  myFunction(
#ifndef GLOBAL_ARRAY
    array,
#endif
    nthreads);
#else
  #pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))
  #pragma omp parallel num_threads(nthreads)
  {
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
  }
#endif

  gettimeofday(&end, 0);
  timersub(&end, &begin, &diff);
  float seconds = diff.tv_sec + 1.0E-6*diff.tv_usec;
  printf("Seconds: %f\n", seconds);

  #pragma offload target(mic) out(array : length(SIZE) alloc_if(0) free_if(1))
  #pragma omp parallel num_threads(nthreads)
  {
    unsigned int i;
    #pragma omp for
    #pragma simd
    for (i=0; i<SIZE; i++)
      array[i] = 2.0f*array[i];
  }

  for (unsigned int i=0; i<16; i++)
    printf("%f ", array[rand() % SIZE]);
  printf("\n");
 
  return 0;
}

25 post / 0 nuovi
Ultimo contenuto
Per informazioni complete sulle ottimizzazioni del compilatore, consultare l'Avviso sull'ottimizzazione

Hi Christopher,

I was also recently trying to implement data persistence on the coprocessor in an offload from a function. The only method that I was successful with is declaring a static array in the scope of the function to hold the data, and marking it with "__attribute__((target(mic)))" (or __declspec(target(mic)) ). It actually makes sense that "nocopy" fails for non-static and non-global pointers: the variable that holds persistent data must survive between multiple function calls. So static or global variables seem like the only way to go. Here is a (dirty) code that illustrates the approach that I ended up with:

#include <stdio.h>
#include <cstring>

void foo(const char* data, const int n) {

static const char* persistentData __attribute__((target(mic))) = NULL;

if (data != persistentData) {
printf("Offloading data...\n");
persistentData = data;
#pragma offload_transfer target(mic:0) in(persistentData : length(n) alloc_if(1) free_if(0))
}

#pragma offload target(mic:0) nocopy(persistentData : length(n) alloc_if(0) free_if(0))
{
printf("Re-using data...\n");
for (int i = 0; i < n; i++)
printf("%c", persistentData[i]);
fflush(0);
}
}

int main() {
char d[50] = "Hello World!\n";
const int n = std::strlen(d);
foo(d, n);
foo(d, n);
strcpy(d, "Goodbye cruel world!\n"); // Will NOT be printed
foo(d, n);
}

Result:

$ icpc foo.cc
$ ./a.out
Offloading data...
Re-using data...
Hello World!
Re-using data...
Hello World!
Re-using data...
Hello World!
$

Thanks; that is very helpful!

Chris

Having worked with this a little more, I have a follow-up question.

Your code works well for what it was intended to do: allow you to call the same function consecutively with the same input data, reusing the MIC memory.  However, it still seems subject to the constraint that MIC memory can only be reused when the offload pragma is referenced using the same literal variable name.  Is this an unavoidable constraint?  Is there no way to refer to an address in the MIC memory?

What if I wanted to call your function with several different input arrays, multiple times each, interspersed with each other?  What if I wanted to allocate the MIC memory in one function and then reuse it in another?

Below is another simple program to illustrate my question another way.  If I have "#define USE_ARRAY array1", it works correctly.  However, if I have "#define USE_ARRAY array2", it crashes on the offload, even though array1 and array2 are set to be exactly the same thing (at least in host memory).  Thus, it seems that MIC memory is only associated with a variable name, rather than with a variable value.  I understand what you were saying about how the variable needs to be global or static to maintain persistence (as opposed to a stack variable in a function), but I don't see why it does not seem possible to refer to MIC memory allocated using the name of one global variable with another global variable.  I guess there must be a good reason for this, but it seems to make designing clean, efficient code almost impossible.

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_ARRAY array1
//#define USE_ARRAY array1  // works correctly
#define USE_ARRAY array2  // crashes on offload

int* array1;
int* array2;

int main(int argc, char* argv[])
{
  int n = 10;

  array1 = new int[n];
  array2 = array1;

  #pragma offload_transfer target(mic:0) nocopy(ALLOC_ARRAY : length(n) alloc_if(1) free_if(0))  

  #pragma offload target(mic:0) nocopy(USE_ARRAY : length(n) alloc_if(0) free_if(0))
  {
    for (unsigned int i=0; i<n; i++) USE_ARRAY[i] = 5;
  }

  #pragma offload_transfer target(mic:0) out(USE_ARRAY : length(n) alloc_if(0) free_if(1))

  for (unsigned int i=0; i<n; i++)
    printf("%d ", USE_ARRAY[i]);
  printf("\n");

  return 0;
}

I found this from the FAQ at the top of the forum:

http://software.intel.com/en-us/articles/effective-use-of-the-intel-comp...

Based on the "Local Pointers Versus Pointers Used Across Offloads" section, I got the modified version of the code from my previous post to work by changing the offload that reuses previously allocated data to be an in with length 0 rather than a nocopy of length n.  I don't fully understand this yet, but maybe I am getting on the right track.  If I comment out the line "array2 = array1;", it crashes, so apparently maybe there is a connection between the value of a pointer and the memory with which it is associated on the MIC. 

Anyway, I will continue reading that link and experimenting, but if anyone thinks they can clear up my confusion more quickly, please post. 

Thanks!

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_ARRAY array1
#define USE_ARRAY array2

int* array1;
int* array2;

int main(int argc, char* argv[])
{
  int n = 10;

  array1 = new int[n];
  array2 = array1;

  #pragma offload_transfer target(mic:0) nocopy(ALLOC_ARRAY : length(n) alloc_if(1) free_if(0))  

  #pragma offload target(mic:0) in(USE_ARRAY : length(0) alloc_if(0) free_if(0))
  {
    for (unsigned int i=0; i<n; i++) USE_ARRAY[i] = 5;
  }

  #pragma offload_transfer target(mic:0) out(USE_ARRAY : length(n) alloc_if(0) free_if(1))

  for (unsigned int i=0; i<n; i++)
    printf("%d ", USE_ARRAY[i]);
  printf("\n");

  return 0;
}

Hi Chris, 

I am sorry for all this confusion. The host pointer value is used as a key in a table of host-coprocessor pointer data associations. The association is not by name, but by pointer value. If you had two pointer variables on the host with the same value, then, yes, they will be mapped to the same allocated memory on the coprocessor.

As for your previous code, it did not work because of this statement: 

#pragma offload target(mic:0) nocopy(USE_ARRAY : length(n) alloc_if(0) free_if(0))

This statement simply created the pointer 'array2' on the coprocessor but did not update it's value to point the previously allocated array. By using in with length(0), you told the compiler to create the pointer 'array2' but also to update (refresh) the pointer value to point to the previously allocated memory address. That is why your second code worked. 

Hopefully the following code will further clarify the ultility of in with length(0): 

void f()

{

     int * p = malloc(…);

     …

     // 100 elements are transferred to MIC, from p on the CPU

// In subsequent offloads the MIC memory for the transferred data can be located on the CPU using the value of p

     #pragma offload_transfer in(p[0:100] alloc_if(1) free_if(0))

     g(p);

}

 

void g(int *q)

{

     // This pragma does not do data transfer; previously sent data is reused

     // That’s why the length is 0, meaning don’t send data

     // However, we used the “in” clause instead of “nocopy”, to update the MIC pointer q with the previously allocated memory address

     // Because p (at the time of MIC memory allocation) and q in this offload have the same value, the MIC memory is located

     #pragma offload … in(q[0:0])

     {

           int x = q[0] + q[99];

}

}            

I hope this clarify things. 

-Sumedh

To add to what Sumedh's said, the program (in the first post) works using the FUNCTION method when one creates the instance of the pointer ARRAY in the function using IN with length(0).

Instead of this:
  #pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))

use this:
  #pragma offload target(mic) in(array : length(0) alloc_if(0) free_if(0))

 

Chris, in my code I had to call the function with multiple arrays, all of which had to be persistent. In order to pull this off, I had a single pointer to hold all my data on the card in a single big chunk. I did memory management within that big chunk of data "manually". That is, when I had to use one of the data sets, I used a local pointer in the MIC code to point to the respective section of the big array. 

Thanks, Sumedh and Kevin! It is good to know that this task can be done without the acrobatics that I described.

Thank you very much Sumedh, Kevin, and Andrey!  That was definitely helpful.

I'm trying to add MIC support to a large DFT code that does a lot of memory management between main memory and GRAM (currently in CUDA). I couldn't glean from the documentation so far whether this was possible to do in an organized manner in offload mode - I was almost considering doing it over MPI with native mode instead. This thread has been particularly useful for me to figure out how to achieve this correctly - thanks to all involved!

It would be very helpful if a simple example of memory management (without global or static pointers) based on the examples discussed above was added to the "Managing Memory Allocation for pointer variables" page of the MIC documentation (http://software.intel.com/sites/products/documentation/doclib/stdxe/2013...).

Thanks again!

Thank you for this feedback. I appreciate you taking time to post. Expanding/improving the topic noted is planned for a future release so we will consider your feedback with that. Thank you again.

-offload-build option has been deprecated.  You don't need to use the option now since the compiler automatically detects offload constructs and generates the necessary MIC code.

Hi,

I don't understand if a code like the following is possible or not.

Thank you in advance,

Marco

#include <stdlib.h>
void device_malloc(int *t);
void device_free(int *t);
int main()
{
  int *p;
  p = (int *)malloc(3);
  device_malloc(p);
  device_free(p);
  free(p);
}
void device_malloc(int *t)
{
#pragma offload target(mic:0) <FILL PLEASE>
  {
    t = (int *)malloc(3);
  }
}
void device_free(int *t)
{
#pragma offload target(mic:0) <FILL PLEASE>
  {
    free(t);
  }
}

Unfortunately, your post is one of the unfortunate that was missed.

The code is possible assuming you specify that the offload runtime transfers the data pointed to by 't', but it is incorrect in a few ways.

  • (int *)malloc(3) is an unusual length
  • You need to specify something like inout(t:length(x)) but length(x) doesn't make much sense with respect to your malloc(3)
  • t=(int*)malloc(3) creates a memory leak on the card since the heaps on the host and the coprocessor are separate and unique

I believe what you want to do is to malloc and persist a specified amount of data on the card. To do that, first malloc the data on the host, and then pass the pointer and the amount of data to the offload statement. For example,

int *t;
t = (int*) malloc((sizeof int)*N);
#pragma offload target(mic:0) in(t:length(N) alloc_if(1) free_if(0))
{...}
#pragma offload target(mic:0) nocopy(t:length(N) alloc_if(0) free_if(0))
{...}
#pragma offload target(mic:0) out(t:length(N) alloc_if(0) free_if(1))
{...}

Outside of minor syntax errors, this will (1) allocate an array on the host, (2) allocate that same amount on the coprocessor and copy the data over, (3) persist the data to the next offload. You can find examples of this under the Samples directory for the C++ compiler.

Regards
--
Taylor

Hi Marco, 

You might find the following threading helpful:

http://software.intel.com/en-us/forums/topic/382988#comment-1731861

What Taylor suggests seems to be only working when the offloads are being called inside one function. If the offloads are being called from different C function a segmentation error is returned. I experimented this with the following code:

 

int main( int argc, char* argv[])

	{
  

	  const int N = 1000;

	  double * a,*b,*c;
  a=(double*) malloc(sizeof(double)*N*N);

	  b=(double*) malloc(sizeof(double)*N*N);

	  c=(double*) malloc(sizeof(double)*N*N);
// Initialize Data

	  for ( int i = 0; i < N; ++i )

	    for(int j=0; j<N; ++j){

	      a[j+i*N]=i+j;

	      b[j+i*N]=i+j;

	      c[j+i*N]=0.0;

	    }
// Call alloc_mic to just initialize and transfer a, b and c to coprocessor
    alloc_mic(a,b,c,N);
// call MMult_mic to compute Matrix Matrix Multiplication c=a*b, with a,b, and c already in the PHI memory

	    MMult_mic( a,b,c,N );
  return 0;

	}

	void alloc_mic( double * a, double*b,double *c, int N)

	{
#pragma offload_transfer target(mic:0)

	  in(a:length(N*N) alloc_if(1) free_if(0))

	  in(b:length(N*N) alloc_if(1) free_if(0))

	  in(c:length(N*N) alloc_if(1) free_if(0))

	  {

	    int nth=omp_get_max_threads();

	    printf("MIC Max Threads %fn",1.0*nth);

	  }
}

	void MMult_mic( double * a, double*b,double *c, int N )

	{
#pragma offload target(mic:0) 

	  nocopy(a:length(N*N) alloc_if(0) free_if(0))

	  nocopy(b:length(N*N) alloc_if(0) free_if(0))

	  out(c:length(N*N) alloc_if(0) free_if(0))

	  {
#pragma omp parallel for

	    for ( int i = 0; i < N; ++i )

	      for(int k=0; k<N; ++k)

	        for(int j=0; j<N; ++j)

	          c[j+i*N]+=a[k+i*N]*b[j+k*N];

	  }

	}

 

This returns a segmentation fault. However, if I had the second offload (in MMult_mic) done inside alloc_mic function (after its offload is done) there would be no problem. So the question is what is the problem? I have included the source file as an attachment.

Allegati: 

AllegatoDimensione
Download sample.cpp1.44 KB

Kevin, what you said solved my problem. But the syntax is counter intuitive:

#pragma offload target(mic) nocopy(array : length(SIZE) alloc_if(0) free_if(0))

#pragma offload target(mic) in(array : length(0) alloc_if(0) free_if(0))

In the first case, when we specify alloc_if(0) free_if(0), why will the compiler assume that we are resizing (when we explicitly say no-alloc). In fact the second syntax suggests that we want the compiler to copy the array from CPU to Phi using the previously allocated size (since we dont specify a size but we do say in(...) ).

Intel please fix this messed up notation!! Its been over a year and there are still many such issues!!

Intel should at least fix the documentation, because the documentation uses "nocopy(array)" instead of the code that works.

@Amir - your code should use IN with length(0) vs. NOCOPY in MMult_MIC, e.g.

Instead of:
  nocopy(a:length(N*N) alloc_if(0) free_if(0))\
  nocopy(b:length(N*N) alloc_if(0) free_if(0))\

Use:
  in(a:length(0) alloc_if(0) free_if(0))\
  in(b:length(0) alloc_if(0) free_if(0))\

If you have not already done so, refer to the FAQ cited in post #5 above and related blog post: Behind the Scenes: Offload Memory Management on the Intel® Xeon Phi™ coprocessor

@Dhairya - I posted in your other thread here: http://software.intel.com/forums/topic/499257. We apologize for the incorrect examples in the 13.1 documentation. I believe we have corrected those that appeared in the 13.1 C++ User Guide in our 14.0 (Composer XE 2013 SP1) release. Is it possible you might have referenced the 13.1 version of the guide?

@Kevin:

I would highly appreciate it if you would clarify when/where we should use nocopy? According to Taylor's post above nocopy should have worked but it is not working. So are you suggesting that in Taylor's code above one should have used in with length(0)? Then what is the point of this nocopy clause? 

I suspect that there is a bug in intel's nocopy implementation. The reason I am saying this, is that if two pragmas are in the same function nocopy works but when the second pragma is called, it fails to work.

nocopy means nothing is sent or recieved.

in(....length(0)...)   means the pointer for the memory is sent

Since your pointer is on the host stack (passed as argument to the function MMult_mic)   you need to update the corresponding stack  pointer variable on the card with memory which was allocated on the host in an earlier pragma directive.

In Taylor’s post, nocopy is correct because he shows all the pragmas appearing within the same local scope. In other words, the first pragma with in() updates the pointer on the target AND allocates memory AND transfers the current data; therefore, the second, third and any subsequent pragmas within the same scope need not update the target’s copy of the pointer; therefore, he can use nocopy as he did in the second pragma which was simply intended to reuse the data that was transferred to the card by the previous pragma. His nocopy does not require length() and in his context (with alloc_if(0)) it is ignored.

With respect to what Marco was asking, the key restriction to explicit allocation within offloaded code is that memory allocated by the user using malloc or some such API cannot participate in the data transfer pragmas. For the pragmas to be usable, the allocation must be done using the pragmas also. In other words, Marco would have to transfer the values of “p” to the card and memcpy into “t” himself. I discussed this recently related to another user’s post here: http://software.intel.com/en-us/forums/topic/499631

Your example uses pointers within different scopes; therefore, as Ravi indicated, when you change scope you must update the new scope’s corresponding stack pointer variable on the target; therefore the first reference to the pointer in the new scope must not be a nocopy. It must be an in() with an alloc_if(0) and either a non-zero or zero length. Within that same scope, after this in() pragma you would then use nocopy as in Taylor’s code.

So, the use of nocopy depends on the scope and what is needed in terms of allocation, no data transfer, and/or pointer refresh. The FAQ in post #5 contains a discussion about the Local Pointers Versus Pointers Used Across Offloads.

Rather than duplicate a portion of another recent related post, see if the reply here (http://software.intel.com/en-us/forums/topic/499257#comment-1776963) helps further explain this matter.

As a final note, all the focus/discussion here relates to local pointers. For global pointers, once created/updated, there is no need for in() with length(0) when used within different functions since the value is retained in global(static) memory on the target.

So Ravi, what you are saying is that for array we should ALWAYS use in(...) because the CPU pointer will always be sent from CPU to MIC and never the other way. We can use in(...) even when in the same scope because the CPU pointer value is the same.

So what do we do when we need to transfer data from a persistent MIC array back to the CPU? From what you said out(...) will just over write the CPU pointer variable. Or should we first do in(A: length(0) REUSE) and then out(A: length(N) REUSE)

You can always use in(..)  in the same scope but increase the amount of data transferred by 8bytes when it could have been avoided.

The host pointer is the master pointer and will not be overwritten.  You can only update data from MIC to what the host pointer points to.

IMHO, all the complexities and inelegance would be gone if a high-level version of COI (without pragma) could be made by Intel.

The keeper of the city keys
Put shutters on the dreams.
I wait outside the pilgrim's door
With insufficient schemes.
The black queen chants
The funeral march,
The cracked brass bells will ring;
To summon back the fire witch
To the court of the crimson king.

Lascia un commento

Eseguire l'accesso per aggiungere un commento. Non siete membri? Iscriviti oggi