I was curious about the following question:
What is the actual penalty of using the local memory optimization on a CPU? The document available on the Intel OpenCL webpage (325696-001US) states that there is a "moderate" overhead. That is not very precise, so I wanted to test it.
I wrote several versions of a matrix-matrix multiplication kernel using different approaches to local memory optimization. For matrices of size 1024x1024, the versions using local memory turned out to be almost twice as fast as the version without the optimization. How can this be explained?
The results of the 1024x1024 matrix multiplication for the CPU version, the OpenMP version and the different kernels are available in the attachment. All of them were executed on an Intel i5 2500k CPU.
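
For readers without the attachment, the two access patterns being compared can be sketched roughly as follows in plain C. This is only an illustration under my own assumptions, not the attached OpenCL kernels: the names `matmul_naive` and `matmul_tiled` and the tile size `TILE` are hypothetical, and the two small arrays `ta` and `tb` merely stand in for what would be local memory buffers in a kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 16 /* hypothetical work-group edge, not taken from the attachment */

/* No optimization: every multiply reads A and B from the full matrices. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

/* Local-memory style: stage TILE x TILE blocks of A and B in small buffers,
 * then do all multiplies for that block out of the buffers. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    float ta[TILE][TILE]; /* stand-in for a local memory tile of A */
    float tb[TILE][TILE]; /* stand-in for a local memory tile of B */

    assert(n % TILE == 0); /* the sketch only handles sizes divisible by TILE */
    memset(c, 0, n * n * sizeof *c);

    for (size_t bi = 0; bi < n; bi += TILE)
        for (size_t bj = 0; bj < n; bj += TILE)
            for (size_t bk = 0; bk < n; bk += TILE) {
                /* Copy one tile of A and one tile of B into the buffers. */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t k = 0; k < TILE; ++k)
                        ta[i][k] = a[(bi + i) * n + (bk + k)];
                for (size_t k = 0; k < TILE; ++k)
                    for (size_t j = 0; j < TILE; ++j)
                        tb[k][j] = b[(bk + k) * n + (bj + j)];

                /* Accumulate this tile pair's partial products into C. */
                for (size_t i = 0; i < TILE; ++i)
                    for (size_t j = 0; j < TILE; ++j) {
                        float acc = c[(bi + i) * n + (bj + j)];
                        for (size_t k = 0; k < TILE; ++k)
                            acc += ta[i][k] * tb[k][j];
                        c[(bi + i) * n + (bj + j)] = acc;
                    }
            }
}
```

In an OpenCL kernel the two tile buffers would be declared `__local` and filled cooperatively by the work-group, with a barrier before they are read. Both functions compute the same product; only the order in which memory is touched differs.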

So again, my question is:
How is it possible that using the local memory optimization in a matrix multiplication kernel increases performance on a CPU?


