It's not clear whether your question is about CPU or GPU, so I'll try to answer both. On a GPU, local memory is usually very fast on-chip memory, similar to a CPU cache but explicitly controlled, so the performance benefit is clear. On a CPU, local memory is just a region of ordinary physical memory (DDR), so at first glance you would not expect the same improvement as on a GPU. But we have to take into account that when a work-group uses a local memory buffer, its data is stored in a compact region of memory. When the work-items of a single work-group reuse that data, the cache hit rate is good and the data is processed from cache at a very high rate. Without local memory the cache hit rate is worse, especially when loading elements from a single column, because there is a large stride between consecutive elements.
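To make the stride argument concrete: in a row-major 1024x1024 float matrix, consecutive elements of one column are 1024 * 4 = 4096 bytes apart, so with the 64-byte cache lines typical of Intel CPUs every column element sits on a different line. A 16x16 float tile copied into a compact buffer takes only 1 KB. The plain C sketch below imitates that effect on the host; it is not the kernels from the question. The names `matmul_naive`, `matmul_tiled`, `check_tiled_matches_naive` and the sizes `N` and `TILE` are made up for illustration, and the small arrays `ta` and `tb` only stand in for a `__local` buffer.

```c
#include <assert.h>
#include <string.h>

#define N 64     /* hypothetical matrix edge, kept small for a quick check */
#define TILE 16  /* hypothetical tile edge, plays the role of the work-group size */

/* Naive version: the inner loop reads a column of b with a stride of n floats. */
static void matmul_naive(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
}

/* Tiled version: each TILE x TILE block of a and b is first copied into a
   small contiguous buffer (the stand-in for local memory) and then reused
   TILE times from there. Assumes n is a multiple of TILE. */
static void matmul_tiled(const float *a, const float *b, float *c, int n) {
    float ta[TILE][TILE], tb[TILE][TILE];
    assert(n % TILE == 0);
    memset(c, 0, (size_t)n * (size_t)n * sizeof(float));
    for (int bi = 0; bi < n; bi += TILE)
        for (int bj = 0; bj < n; bj += TILE)
            for (int bk = 0; bk < n; bk += TILE) {
                /* Copy step: strided reads happen once per tile element. */
                for (int r = 0; r < TILE; ++r)
                    for (int s = 0; s < TILE; ++s) {
                        ta[r][s] = a[(bi + r) * n + bk + s];
                        tb[r][s] = b[(bk + r) * n + bj + s];
                    }
                /* Compute step: all reads hit the compact tile buffers. */
                for (int i = 0; i < TILE; ++i)
                    for (int j = 0; j < TILE; ++j) {
                        float acc = 0.0f;
                        for (int k = 0; k < TILE; ++k)
                            acc += ta[i][k] * tb[k][j];
                        c[(bi + i) * n + bj + j] += acc;
                    }
            }
}

/* Returns 1 if both versions produce the same result on test data.
   The inputs are small multiples of 0.5, so every sum is exact in float. */
static int check_tiled_matches_naive(void) {
    static float a[N * N], b[N * N], c1[N * N], c2[N * N];
    for (int i = 0; i < N * N; ++i) {
        a[i] = (float)(i % 7) * 0.5f;
        b[i] = (float)(i % 5) - 2.0f;
    }
    matmul_naive(a, b, c1, N);
    matmul_tiled(a, b, c2, N);
    for (int i = 0; i < N * N; ++i)
        if (c1[i] != c2[i])
            return 0;
    return 1;
}
```

Calling `check_tiled_matches_naive()` from a `main` confirms that the two versions agree. To see any timing difference you would have to raise `N` to something like 1024 and time each function yourself; how large the gap is depends on the CPU and the compiler flags.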
Local memory optimization on CPU



I was curious about the question:
What is the actual penalty of using local memory optimization on a CPU? The document available on the Intel OpenCL webpage (325696-001US) states that there is a "moderate" overhead. That is not very precise, so I wanted to test it.
I have written multiple versions of a matrix-by-matrix multiplication kernel using different approaches to local memory optimization, and it turned out that for matrices of size 1024x1024 the versions using local memory are almost twice as fast as the one without the optimization. How can this be explained?
The results of the 1024x1024 matrix multiplication for plain CPU code, OpenMP and the different kernels are available in the attachment. They were all run on an Intel i5 2500K CPU.
So again, my question is:
How is it possible that using local memory optimization in a matrix multiplication kernel increases the performance on a CPU?