OpenCL related question

Hello! I have restarted some of my experiments on an Intel Haswell processor and some of them stopped working, namely the examples meant to run on the GPU.

My main question is: if I use clCreateBuffer with the flag CL_MEM_USE_HOST_PTR, what does that flag actually do? I see two possibilities:

1. I create the array on the host, and the GPU uses the same address as the host allocation. If that is the case, then in theory I can compute something on the GPU and, once the kernel terminates (a synchronization point), the data written by the GPU should be in the region allocated on the host.

2. I create the vector on the host, but clCreateBuffer still allocates a second memory region even though I pass CL_MEM_USE_HOST_PTR, and the data is copied between the host and device allocations.

The reason I am asking is mainly the L3 cache, which is shared between the CPU and the GPU. If both operate on the same address, then both can see the data in the L3 cache, so cooperation between the two may be possible.

 

Thanks

Hello Thom,

We have a dedicated OpenCL forum, 'Intel® SDK for OpenCL* Applications', where your question will be answered.

 

Thank you.
--
QIAOMIN.Q
Intel Developer Support
Please participate in our redesigned community support web site:

User forums:                   http://software.intel.com/en-us/forums/

Hi Thom,

According to the OpenCL 1.2 spec (https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf), CL_MEM_USE_HOST_PTR means two things:

1. The initial buffer content reflects the content of the given memory at creation time.

2. clEnqueueMapBuffer always exposes the data at the same memory location, relative to the pointer given to clCreateBuffer.

Everything else is implementation-specific and reflects the hardware capabilities of each device.

1. The Intel OpenCL GPU device and the Intel OpenCL CPU device share physical memory with the host CPU. For these devices the Intel OpenCL runtime does follow your assumption #1, and clEnqueueMapBuffer/clEnqueueUnmapMemObject are almost no-ops.

2. The Intel OpenCL MIC device does not share physical memory with the host CPU. For this device the Intel OpenCL runtime follows your assumption #2. Data is copied back and forth during clEnqueueMapBuffer/clEnqueueUnmapMemObject operations and during some internally defined events, for example if the MIC device is running out of memory.

I do not recommend basing your application's correctness on internal implementation policies, as they may change with any new device or version. I do recommend using this knowledge to optimize for a specific platform or device.
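
One thing that may matter for the zero-copy case in point 1: as far as I recall from Intel's optimization guidance for processor graphics, the runtime can use a CL_MEM_USE_HOST_PTR allocation in place only if the host pointer is aligned to a 4096-byte boundary and the buffer size is a multiple of 64 bytes; otherwise it may fall back to a copy. Treat both numbers as assumptions to verify against the documentation for your driver. The sketch below is plain C11 with no OpenCL calls, and the helper names (round_up, is_zero_copy_candidate, alloc_host_buffer) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed requirements for zero-copy on Intel processor graphics;
   check the documentation for your driver before relying on them. */
#define ZC_ALIGNMENT 4096u /* host pointer alignment, in bytes */
#define ZC_SIZE_UNIT 64u   /* buffer size granularity, in bytes */

/* Round n up to the next multiple of m (m must be non-zero). */
size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Return 1 if (ptr, bytes) meets the assumed alignment and size rules. */
int is_zero_copy_candidate(const void *ptr, size_t bytes)
{
    return ptr != NULL
        && bytes > 0
        && ((uintptr_t)ptr % ZC_ALIGNMENT) == 0
        && (bytes % ZC_SIZE_UNIT) == 0;
}

/* Allocate a host buffer suitable for passing as host_ptr.
   The size is padded to a multiple of the alignment, which satisfies both
   the C11 aligned_alloc contract and the 64-byte size rule.
   The padded size is returned through *padded; release with free(). */
void *alloc_host_buffer(size_t bytes, size_t *padded)
{
    assert(bytes > 0 && padded != NULL);
    *padded = round_up(bytes, ZC_ALIGNMENT);
    return aligned_alloc(ZC_ALIGNMENT, *padded);
}
```

The pointer and padded size returned by alloc_host_buffer would then be the host_ptr and size arguments of clCreateBuffer with CL_MEM_USE_HOST_PTR. A vector obtained from plain malloc or placed on the stack will usually not pass is_zero_copy_candidate.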

Hi! Thanks for the information. I finally managed to get something working and I have some comments on it. This concerns only the GPU and CPU, not the MIC.

I have the following test pseudocode:

1. Create a buffer with CL_MEM_USE_HOST_PTR.

2. Launch the kernel, let it finish its job, and use clWaitForEvents to sync with the end of the kernel.

3. If I then read the input vector and print the data to the screen, I get bogus results, as if the vector were not properly initialized. I will experiment more and post the code I was working on by the end of the day.

 

Thanks

I suggest the following sequence:

clCreateBuffer( CL_MEM_USE_HOST_PTR )
clEnqueueNDRangeKernel()
clEnqueueMapBuffer( blocking_map = CL_TRUE )
<print results here>

 

That is exactly how I am using it. But because you said that clEnqueueMapBuffer is treated as almost a no-op, I guessed it might work without it. But no :). Maybe some of the data is still held in the GPU's internal cache, and therefore one needs a way of flushing that data back to L3/memory.

 

Thanks

Can you provide more details?

Do you use CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE in clCreateCommandQueue() or not?

Do you use a single command queue or not?

Are you sure the kernel was executed for all vector elements? You can use printf() in the kernel to print all global ids.

 

I use the normal in-order execution model.

I use only a single command queue.

The global size I launch the kernel with is equal to the size of the vector.

Wait, how can you print from within the kernel?

For printf() inside kernels, see:

https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf, page 286, section 6.12.13

 

OK. I will have to try it out and go over my whole code again. I will get back to you.

 

Thanks for all your help

So far so good. It works. 

I have one more question.

Let's say I have mapped the buffer for reading and writing. After I write into the vector, if I call clEnqueueUnmapMemObject, will that put the data into the vector so that, if I launch the kernel again, the GPU will see the written data?

I may have some race conditions. Based on the answer I will try to post part of the code.

Thanks

if I call clEnqueueUnmapMemObject, will that put the data into the vector so that, if I launch the kernel again, the GPU will see the written data?

YES.

 

Thank you so much for all your help. It is working.

 

Thom

Follow-up question. I have done some experiments and found some weird things. One of the experiments is as follows:

1. create a vector, a very big one

2. create the corresponding cl_mem object from that vector with CL_MEM_USE_HOST_PTR

3. launch a kernel that does some operations on the elements

4. wait for the kernel to finish using a blocking call

5. afterwards, do some operations on the same part of the same vector on the host side

The weird thing is that even though the GPU has worked on the data, it does not seem to be preserved in the L3 cache: the cycle count for the host computation is no lower than when I run the host computation first and take the compulsory L3 misses. And yes, the vector is big enough to guarantee that the part I am computing on is not already in the L3 cache.

Also, the accesses to the vector are randomized, so that the prefetchers have no effect.

In other words:

first attempt, just reading the data from the host: I got 4 million cycles

doing a loop of 100 iterations over the same vector, once it had been brought in: I got 2 million cycles, so temporal locality has its benefits

running the GPU first, then doing the same operation on the vector from the CPU side and counting the cycles: I got 4 million cycles

So is the conclusion that the data is not shared between the CPU and the GPU, even though they share the L3 cache?

 

Thanks
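
For anyone who wants to reproduce this kind of measurement on the host side, here is a minimal sketch of a randomized pointer-chasing loop. It is not the code from the post above. It assumes that clock() is precise enough for the purpose (the post counts cycles, which needs a platform-specific counter), and the function names are hypothetical. The vector is filled with a single random cycle, so each load depends on the previous one and the hardware prefetchers cannot predict the next address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* xorshift64 PRNG: small and self-contained; the state must be non-zero. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill next[0..n-1] with one random cycle over all n indices
   (Sattolo's algorithm), so that following next[] from any index
   visits every element exactly once before returning to the start. */
void build_random_cycle(uint32_t *next, size_t n, uint64_t seed)
{
    assert(n > 1 && n <= UINT32_MAX && seed != 0);
    for (size_t i = 0; i < n; i++)
        next[i] = (uint32_t)i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(xorshift64(&seed) % i); /* j in [0, i-1] */
        uint32_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follow the chain for `steps` dependent loads, starting at index 0. */
uint32_t chase(const uint32_t *next, size_t steps)
{
    uint32_t i = 0;
    for (size_t s = 0; s < steps; s++)
        i = next[i];
    return i;
}

/* Time one traversal in seconds of CPU time. The final index is written
   to *end so the compiler cannot discard the loop as dead code. */
double time_chase(const uint32_t *next, size_t steps, uint32_t *end)
{
    clock_t t0 = clock();
    *end = chase(next, steps);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

With a vector larger than the L3 cache, calling time_chase once gives the cold number, and calling it again over a sub-range that fits in L3 gives the warm number. To test the sharing hypothesis, the GPU kernel and the map call would go between building the cycle and the first time_chase.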

I would prefer that somebody with exact knowledge of the specific hardware, or someone from the GPU compiler team, answer you.

Maxim Shevtsov (Intel)

Quote:

Doru Adrian Thom P. wrote:

...4. wait for the kernel to finish using a blocking call

5. afterwards, do some operations on the same part of the same vector on the host side

Just as clEnqueueUnmapMemObject was needed in your previous question (so that the GPU sees the updated data), a clEnqueueMapBuffer is needed between steps 4 and 5 so that the CPU sees the latest results after the GPU finishes the job.

I have been trying that, a clEnqueueMapBuffer followed by an unmap, but the CPU still does not find the data in the L3 cache and still has to go to main memory for it. My assumption was that if either the CPU or the GPU operates on data that the other one needs afterwards, the data will be in L3 and I will save the cycles of going to main memory.

 

Thanks
