I am currently porting an OpenCL kernel from an NVIDIA GPU to an Intel i5-3317U with HD 4000 graphics, but the kernel fails when it executes on the HD 4000. I have traced the problem to a copy from one local memory buffer to another inside the kernel. Every time the kernel fails, the driver crashes; fortunately, after freezing for about a minute, the driver recovers.
I read the datasheet https://01.org/linuxgraphics/sites/default/files/documentation/ivb_ihd_os_vol1_part7.pdf and found that local memory is 64 KB, a portion of L3. This is confirmed by calling clGetDeviceInfo with the CL_DEVICE_LOCAL_MEM_SIZE query parameter. The local memory used by my kernel is 6 KB, which I obtained by calling clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE as the query item. So the buffer size should not be what causes this crash.
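A minimal sketch of the size check described above, in plain C. The helper names local_mem_fits and max_coexisting_groups are hypothetical, and the inputs are assumed to be the values returned for CL_DEVICE_LOCAL_MEM_SIZE and CL_KERNEL_LOCAL_MEM_SIZE. The second helper is only a capacity bound; it assumes nothing about how the hardware actually partitions local memory.

```c
#include <stddef.h>

/* Returns 1 if the kernel's local memory usage fits within the device limit.
   device_bytes: value reported for CL_DEVICE_LOCAL_MEM_SIZE.
   kernel_bytes: value reported for CL_KERNEL_LOCAL_MEM_SIZE. */
int local_mem_fits(size_t device_bytes, size_t kernel_bytes)
{
    return kernel_bytes <= device_bytes;
}

/* Capacity-only upper bound on how many work-groups' local buffers could
   coexist in device_bytes. This is an illustrative assumption, not a
   statement about the real allocation scheme on any device.
   Returns 0 when kernel_bytes is 0, since no bound can be derived. */
size_t max_coexisting_groups(size_t device_bytes, size_t kernel_bytes)
{
    if (kernel_bytes == 0)
        return 0;
    return device_bytes / kernel_bytes;
}
```

With the figures from this post, local_mem_fits(64 * 1024, 6 * 1024) holds and the capacity-only bound is 10 work-groups, so size alone does not explain the crash.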
I am trying to figure out why local memory would make the kernel crash, and I now suspect that different work-groups may be making conflicting accesses to the same local memory address or bank.
To be clearer: on the HD 4000, local memory is shared by all compute units, so work-groups mapped to different compute units may access the same address (or the same bank) in local memory. Since the HD 4000 has only 16 processing units according to http://en.wikipedia.org/wiki/Intel_HD_Graphics, each compute unit would contain only one execution unit, and the local memory is not dedicated to one compute unit but shared by all of them. On NVIDIA, by contrast, each SM (streaming multiprocessor, which maps to a compute unit in OpenCL) can contain 32 or 48 execution units, and shared memory is allocated per SM. That is, each compute unit has its own dedicated local memory, which physically blocks interference between accesses from different work-groups.
To take the question further: on the Intel HD 4000, when a work-group is mapped to compute units, do those compute units execute in lock-step as on NVIDIA, or does each compute unit handle the workload of one work-group? In the former case, is a compute unit on the HD 4000 actually a processing element, and the entire HD 4000 a single compute unit, in terms of OpenCL's programming model?
Any ideas about the questions above?
However, when I query CL_DEVICE_MAX_COMPUTE_UNITS