Behavior of CL_MEM_USE_HOST_PTR different on AMD and Intel Platforms

I am using CL_MEM_USE_HOST_PTR to create my input and output buffers. On the AMD platform, I did the following and everything worked fine, but when I ran the same code on an Intel Ivy Bridge i5-3470, I had to map the buffer, and the outputGpu vector was apparently of no use. I am using Visual Studio 2012 on Windows 7.

//For AMD

//outputGpu and input are vectors

cl::Buffer inputBuffer = cl::Buffer(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, inputDataSizeInBytes, &input.front(), &err);
checkError(err, CL_SUCCESS, "Failed to create input buffer.");
cl::Buffer outputBuffer = cl::Buffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, outputDataSizeInBytes, &outputGpu.front(), &err);
checkError(err, CL_SUCCESS, "Failed to create output buffer.");

err = commandQueue.enqueueNDRangeKernel(m_kernel1, cl::NullRange, cl::NDRange(m_globalSize), cl::NDRange(m_localSize), NULL, &profilingEvt1);
checkError(err, CL_SUCCESS, "Failed to enqueue kernel.");
err = commandQueue.finish();

//On AMD, reading outputGpu directly works:
for (int i = 0; i < cpuResult.size(); i++) {
	if (cpuResult[i] != outputGpu[i]) {
		printf("%d %u %u\t", i, cpuResult[i], outputGpu[i]);
	}
}

//For Intel, I had to map the buffer after kernel execution:
mapPtr = (val_type*)commandQueue.enqueueMapBuffer(outputBuffer, CL_TRUE, CL_MAP_READ, 0,
	sizeof(val_type)*numElements, 0, 0, &err);

for (int i = 0; i < cpuResult.size(); i++) {
	if (cpuResult[i] != mapPtr[i]) {
		printf("%d %u %u\t", i, cpuResult[i], mapPtr[i]);
	}
}
commandQueue.enqueueUnmapMemObject(outputBuffer, mapPtr);

Now I have a couple of questions regarding this.

1) Why was mapping mandatory on the Intel platform? Why wasn't the data written to the outputGpu vector?

2) I couldn't find a way to align my input and output vectors. All the examples on the web suggest using _aligned_malloc with CL_MEM_USE_HOST_PTR, but _aligned_malloc does not seem to work with a vector. Can someone give an example of aligning a vector?

3) I have found numerous online articles explaining the difference between CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR, but unfortunately none of them are clear enough. If I have an input vector (its size ranges from 1 kB to 128 MB and beyond), send it to the GPU for processing, and get the result back in the output vector, which memory flags should I use for my input and output buffers? An example would be appreciated.
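To make the question concrete, this is the CL_MEM_ALLOC_HOST_PTR pattern I am comparing against (a sketch only, not tested on my setup): the runtime allocates the host-accessible memory itself and the host reads the result by mapping, with no outputGpu vector at all.

```cpp
// Runtime-allocated output buffer; no host pointer is supplied.
cl::Buffer outputBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                        outputDataSizeInBytes, NULL, &err);

// ... enqueue the kernel and finish, then read via map/unmap:
val_type* mapPtr = (val_type*)commandQueue.enqueueMapBuffer(
    outputBuffer, CL_TRUE, CL_MAP_READ, 0, outputDataSizeInBytes, 0, 0, &err);
// ... read results through mapPtr ...
commandQueue.enqueueUnmapMemObject(outputBuffer, mapPtr);
```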

3.1) If my application has more than one kernel and uses intermediate buffers (only written and read by kernels, never touched by the host), should I pass no host-pointer flags when creating them, so that they are created and used in GPU memory only?
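In other words, for such intermediate buffers I would expect something like the following to be enough (again just a sketch; intermediateSizeInBytes is a placeholder name):

```cpp
// No USE_HOST_PTR / ALLOC_HOST_PTR flags, so the runtime is free
// to keep this buffer entirely in device memory.
cl::Buffer intermediate(context, CL_MEM_READ_WRITE,
                        intermediateSizeInBytes, NULL, &err);
```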


