data persistence without host allocation

Hi all,
I am trying to create data persistence on the coprocessor without host allocation, so that host objects can be passed to host subroutines. The goal is a set of basic subroutines that hide the calls to the coprocessor, layered as follows: basic = vector routines; higher = eigensolvers, CG, LD decomposition; highest = main. Currently the only way I have it working is by allocating unnecessary host buffers. A simple example is given below. Essentially, I want to get rid of the host allocation done in the constructor. This scheme would minimize copying by doing everything on the coprocessor and copying back only the resulting data at the very end, while also hiding the calls to the coprocessor. Is this possible? This is the paradigm I used when coding GPU kernels. Essentially, I want the equivalent of a CUDA device pointer: a pointer held on the host that refers to coprocessor memory.

#define REUSE length(0) alloc_if(0) free_if(0)
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)

template<typename T>
struct gen_vector_mic
{
int length;
T* data;

gen_vector_mic(int length) : length(length)
{
this->data = (T*)memalign(64, length*sizeof(T));
T* p = this->data;
#pragma offload_transfer target(mic:0) nocopy(p:length(length) ALLOC)
{}
}

~gen_vector_mic()
{
T* p = this->data;
#pragma offload target(mic:0) nocopy(p:FREE)
{ }
}
};

template<typename T>
void add(gen_vector_mic<T>& v1, gen_vector_mic<T>& v2, gen_vector_mic<T>& v3, int length, int chunk)
{
T* left = v1.data;
T* right = v2.data;
T* res = v3.data;
#pragma offload target(mic:0) in(length) in(left, right, res:REUSE)
{
int i;
#pragma omp parallel
{
#pragma omp for
#pragma ivdep
#pragma vector always
for(i=0; i<length; ++i)
res[i] = left[i] + 5.0*right[i];
}
}
}

typedef gen_vector_mic<double> vector_mic;
typedef gen_vector_mic< float> float_vector_mic;

... use add and other routines like that to construct higher level routines without knowing about coprocessor calls


Hi Craig, 

I am not sure exactly what you are trying to do in this code. As I understand it, you are trying to get rid of the host allocation in the constructor. However, you transfer data from the host to the coprocessor by using the 'in' clause in the offload pragma in the add function.

If you are indeed trying to transfer data to and from the coprocessor, it makes sense to have a corresponding host-side array and hence your current code looks good. 

If you only want to create coprocessor-side arrays, I would recommend using compiler-managed heap-allocated arrays, i.e. arrays allocated using the offload or offload_transfer pragma. (Please refer to this for more information). However, since you cannot use uninitialized pointers to allocate data on the coprocessor using the pragmas, you would need to initialize them with a unique value. For more information on why you cannot use uninitialized pointers, please take a look at this blog. My best guess is that your code should look similar to this:

gen_vector_mic(int length) : length(length)
{
//Created a dummy array with just 1 element
this->data = (T*)memalign(64, 1*sizeof(T));
T* p = this->data;
#pragma offload_transfer target(mic:0) nocopy(p:length(length) ALLOC)
{}
}

I also notice that the variables are lacking the decoration for the MIC architecture (__attribute__((target(mic))) or __declspec(target(mic))). In addition, the syntax of the offload pragma in the add function is incorrect. Please refer to this compiler reference for the correct syntax.

I hope this helps.

Yes that answers my question! I will also clean up the code!

Cheers,

It seems this only works for relatively small vectors. When the vectors get large I get the following error. Why is this happening?

output:

cpelissi@borg01x081:~/scracth/programs> ./reproducer.x
number of bytes i 16*2^0
number of bytes i 16*2^1
number of bytes i 16*2^2
number of bytes i 16*2^3
number of bytes i 16*2^4
number of bytes i 16*2^5
number of bytes i 16*2^6
number of bytes i 16*2^7
number of bytes i 16*2^8
number of bytes i 16*2^9
offload error: address range partially overlaps with existing allocation

code:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include "omp.h"
#include "offload.h"
#include "math.h"

#define REUSE length(0) alloc_if(0) free_if(0)
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define START(message)\

template<typename T>
struct gen_vector_mic
{
int length;
T* data;

gen_vector_mic(int length) : length(length)
{
this->data = (T*)memalign(64, 1*sizeof(T));
T* p = this->data;
#pragma offload_transfer target(mic) nocopy(p:length(length) ALLOC)
{}
}

~gen_vector_mic()
{
T* p = this->data;
#pragma offload target(mic) in(p: length(0) FREE)
{ }
}
};

typedef gen_vector_mic<double> vector_mic;
typedef gen_vector_mic<float> float_vector_mic;
int main(int argc, char** argv)
{
for(int i=0; i<20; ++i)
{
printf("2^%i\n", i);
int length = pow(2,i);
gen_vector_mic<double> v1(length);
gen_vector_mic<double> v2(length);
}

return 0;
}

You are allocating only one element on the host for the v1 and v2 vectors, but using a length that makes the address ranges of the two pointers passed in the "in" clause overlap:
     this->data = (T*)memalign(64, 1*sizeof(T));

data = 38309120 length = 32       <- here the address range is [38309120, 38309376)
allocate  memory
data = 38308992 length = 32       <- here the address range is [38308992, 38309248)
offload error: address range partially overlaps with existing allocation
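
To make the arithmetic in that trace explicit, here is a minimal sketch that reproduces the two ranges and checks them for overlap. It assumes, as the reply above describes, that each allocation is identified by the half-open host address range [data, data + length*sizeof(T)). The names byte_range, make_range and overlaps are hypothetical helpers for illustration, not part of the offload runtime.

```cpp
#include <cstddef>
#include <cstdint>

// Half-open byte range [begin, end) on the host.
// Hypothetical helper type, used only for this illustration.
struct byte_range {
    std::uintptr_t begin;
    std::uintptr_t end;
};

// Range covered by `length` elements of `elem_size` bytes starting at `addr`.
inline byte_range make_range(std::uintptr_t addr, std::size_t length,
                             std::size_t elem_size) {
    byte_range r;
    r.begin = addr;
    r.end = addr + length * elem_size;
    return r;
}

// True if the two half-open ranges share at least one byte.
inline bool overlaps(const byte_range& a, const byte_range& b) {
    return a.begin < b.end && b.begin < a.end;
}
```

The two host pointers in the trace are only 128 bytes apart, so with 8-byte doubles the ranges stay disjoint up to length 16 and collide from length 32 onwards, which is where the error appears.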

So this brings me back to the original question. How can I create a "host pointer" that is associated with a coprocessor buffer and can be passed to host routines, which then use it to execute routines on the coprocessor, without allocating an "image" buffer on the host?

Use the "into" clause:

#include <stdint.h>
#include <stdio.h>
#include <malloc.h>

int main(int yy, char *argv[])
{
int * mic_ptr;
int * ptr1 = malloc(100);
int * ptr2 = malloc(100);
int out_val, rez_OK;

    *ptr1 = 10;
    *ptr2 = 20;

#pragma offload_transfer target(mic:0) nocopy(mic_ptr : length(10) alloc_if(1) free_if(0))

#pragma offload target(mic:0) in(ptr1[0:1] : alloc_if(0) free_if(0) into(mic_ptr[4:1]) )
{
    printf ("TARGET.1  mic_ptr[4] = %d \n", mic_ptr[4]);
}

free(ptr1);
ptr1 = NULL;

#pragma offload target(mic:0)   in (ptr2[0:1] : free_if(0) into(mic_ptr[4:1])) \
                                out(mic_ptr[4:1]  : free_if(1) into(out_val))
{
    printf ("TARGET.2  mic_ptr[4] = %d \n", mic_ptr[4]);
}
    rez_OK = (out_val == *ptr2);

    printf("%s\n", rez_OK? "PASSED" : "FAILED");
    return rez_OK ? 0 : 1;   /* exit status 0 on success */
}

Aren't you still having to allocate host memory? What I wanted was a class with a host pointer that is associated with a coprocessor buffer, without allocating a buffer on the host. It seems like this may not be possible. The response by Sumedh Naik (Intel) was what I was looking for, but unfortunately it didn't work. The reason I want this is that I have already developed a series of templated linear algebra routines for lattice Quantum Chromodynamics suitable for CPUs and GPUs, and I would like to add the Intel Xeon Phi. If I can create the same "host" vector routines that take a vector class as described, then all the higher-level routines will work without writing any more code. For example:

template<typename T>
double cg_solver( T src_vec, T sol_vec, mat_mult<T vec_in, T vec_out>)
{
vector routines ...
}

where T would be vector_cpu, vector_gpu, and hopefully vector_mic.
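
To illustrate the pattern described above, here is a minimal sketch of a solver that is written once against a handful of backend vector routines. Everything in it is an assumption made for illustration: this vector_cpu is a stand-in built on std::vector, not the poster's class, and dot, axpy, xpby and the cg_solver signature are hypothetical. A vector_mic backend would only need to supply the same overloads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU backend; a stand-in for illustration only.
struct vector_cpu {
    std::vector<double> data;
    explicit vector_cpu(std::size_t n, double value = 0.0) : data(n, value) {}
};

// Backend primitives. Each backend would provide overloads with these names.
inline double dot(const vector_cpu& a, const vector_cpu& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.data.size(); ++i) s += a.data[i] * b.data[i];
    return s;
}

// y = y + alpha * x
inline void axpy(double alpha, const vector_cpu& x, vector_cpu& y) {
    for (std::size_t i = 0; i < y.data.size(); ++i) y.data[i] += alpha * x.data[i];
}

// y = x + beta * y
inline void xpby(const vector_cpu& x, double beta, vector_cpu& y) {
    for (std::size_t i = 0; i < y.data.size(); ++i)
        y.data[i] = x.data[i] + beta * y.data[i];
}

// Conjugate gradient written only in terms of the primitives above.
// mat_mult(in, out) must apply a symmetric positive definite operator.
// Vec must be copyable. Returns the final squared residual norm.
template <typename Vec, typename MatMult>
double cg_solver(const Vec& src, Vec& sol, MatMult mat_mult,
                 int max_iter, double tol) {
    Vec r = src;                 // becomes r = src - A * sol
    Vec Ap = src;                // scratch vector of the right size
    mat_mult(sol, Ap);
    axpy(-1.0, Ap, r);
    Vec p = r;                   // first search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && rr > tol * tol; ++k) {
        mat_mult(p, Ap);
        const double alpha = rr / dot(p, Ap);
        axpy(alpha, p, sol);     // sol = sol + alpha * p
        axpy(-alpha, Ap, r);     // r = r - alpha * A p
        const double rr_new = dot(r, r);
        xpby(r, rr_new / rr, p); // p = r + beta * p
        rr = rr_new;
    }
    return rr;
}
```

The solver never touches the element storage directly, so the only backend-specific code is the three primitives and the operator passed as mat_mult.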

Hi Craig, 

I am sorry for the late reply. I may finally have a solution for you. The following code worked for me.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include "omp.h"
#include "offload.h"
#include "math.h"
#define REUSE length(0) alloc_if(0) free_if(0)
#define ALLOC alloc_if(1) free_if(0)
#define FREE alloc_if(0) free_if(1)
#define START(message)
template<typename T>
struct gen_vector_mic
{
int length;
size_t data;
gen_vector_mic(int length) : length(length)
{
size_t data;
#pragma offload target(mic:0) out(data)
{
 T *ptr=(T*)malloc(sizeof(T)*length);
 data=size_t(ptr);
}
this->data=data;
}
~gen_vector_mic()
{
size_t data=this->data;
#pragma offload target(mic:0) in(data)
{
 T *ptr=(T*)(data);
 free(ptr);
}
}
};
typedef gen_vector_mic<double> vector_mic;
typedef gen_vector_mic<float> float_vector_mic;
int main(int argc, char** argv)
{
for(int i=0; i<20; ++i)
{
printf("2^%i\n", i);
int length = pow(2,i);
gen_vector_mic<double> v1(length);
gen_vector_mic<double> v2(length);
}

return 0;
}
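
As a follow-up to the code above, here is a minimal sketch of how vector routines could be written on top of such an integer handle so that only scalars cross the host/coprocessor boundary. It is untested on a coprocessor. The names mic_handle_vector, fill and sum are hypothetical, and add reuses the formula from the first post. It assumes the offload pragma accepts scalar variables in its in and out clauses, as the code above already does. A compiler without offload support ignores the pragmas, in which case everything simply runs on the host.

```cpp
#include <cstddef>
#include <cstdlib>

// Vector whose storage lives on the coprocessor heap. The host keeps only
// the address as an integer and never dereferences it.
template <typename T>
struct mic_handle_vector {
    int length;
    std::size_t data;

    explicit mic_handle_vector(int n) : length(n), data(0) {
        std::size_t handle = 0;
#pragma offload target(mic:0) in(n) out(handle)
        {
            handle = reinterpret_cast<std::size_t>(std::malloc(sizeof(T) * n));
        }
        data = handle;
    }

    ~mic_handle_vector() {
        std::size_t handle = data;
#pragma offload target(mic:0) in(handle)
        {
            std::free(reinterpret_cast<T*>(handle));
        }
    }

    // Copying would free the same coprocessor buffer twice.
    mic_handle_vector(const mic_handle_vector&) = delete;
    mic_handle_vector& operator=(const mic_handle_vector&) = delete;
};

// Set every element to `value`; only scalars are sent to the coprocessor.
template <typename T>
void fill(mic_handle_vector<T>& v, T value) {
    std::size_t h = v.data;
    int n = v.length;
#pragma offload target(mic:0) in(h, n, value)
    {
        T* p = reinterpret_cast<T*>(h);
        for (int i = 0; i < n; ++i) p[i] = value;
    }
}

// v3[i] = v1[i] + 5 * v2[i], the same formula as in the first post.
template <typename T>
void add(const mic_handle_vector<T>& v1, const mic_handle_vector<T>& v2,
         mic_handle_vector<T>& v3) {
    std::size_t h1 = v1.data, h2 = v2.data, h3 = v3.data;
    int n = v3.length;
#pragma offload target(mic:0) in(h1, h2, h3, n)
    {
        const T* left = reinterpret_cast<const T*>(h1);
        const T* right = reinterpret_cast<const T*>(h2);
        T* res = reinterpret_cast<T*>(h3);
        for (int i = 0; i < n; ++i) res[i] = left[i] + T(5) * right[i];
    }
}

// Reduce on the coprocessor and copy back a single scalar.
template <typename T>
T sum(const mic_handle_vector<T>& v) {
    std::size_t h = v.data;
    int n = v.length;
    T total = T(0);
#pragma offload target(mic:0) in(h, n) out(total)
    {
        const T* p = reinterpret_cast<const T*>(h);
        T acc = T(0);
        for (int i = 0; i < n; ++i) acc += p[i];
        total = acc;
    }
    return total;
}
```

A caller would construct three vectors, fill two of them, call add, and read back only the scalar returned by sum. The element data itself never leaves the coprocessor in this sketch.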

Sumedh, so instead of letting the #pragma offload allocate the memory you are specifically allocating it yourself. Another way would be to actually alloc on the host, make the offload_transfer nocopy() call to create the memory on the MIC, then [edit: realloc the memory on the host back to length 1]. Obviously not a good option if speed is necessary.

Both of these methods seem a bit *hacky*. I don't think it is unreasonable to desire a pointer on the host to link to allocated memory on the MIC without actually having something allocated on the host. Incidentally I also have the same problem when trying to create a memory pool on the MIC. I will let you know if I come up with a better solution.

Corey

Sumedh, can you please explain how you will transfer data between CPU and the array you allocate on MIC?

Hi Dhairya,

I gave your question some thought and have yet to come up with a solution.

For the Intel MIC architecture, you can transfer data to the coprocessor by using either #pragma offload or #pragma offload_transfer. However, there is one caveat: the array on the coprocessor must have been allocated by the offload runtime (#pragma offload or #pragma offload_transfer). Hence, we cannot transfer data to the array that we allocated using the above method.

 

 

 
