Hello
I have implemented a straightforward, naive matrix
multiplication in OpenCL with the AMD SDK. Running it only on the CPU,
I get a speedup of around 16 on an 8-core system. I then applied
some popular optimizations, such as using private memory and local
memory, and grouping my matrix in one dimension so that I specify
both global and local work sizes. Now I get a speedup of around 24
on the same 8-core CPU.
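
To make the optimizations I mean more concrete, here is a minimal sketch in plain C. It is not my actual OpenCL kernel, only an illustration of the same idea: the naive version reads the big arrays directly, while the tiled version first stages small blocks (the analog of local memory) and accumulates in a scalar (the analog of private memory). The names matmul_naive, matmul_tiled, min_size and the TILE value of 16 are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile edge; it plays the role of the work-group (local) size. */
#define TILE 16

/* Naive version: every output element reads a full row of A and a full
   column of B directly from the large row-major arrays. */
void matmul_naive(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c && n > 0);
    for (size_t row = 0; row < n; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[row * n + k] * b[k * n + col];
            c[row * n + col] = acc;
        }
    }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled version: copy one TILE x TILE block of A and of B into small
   buffers (analogous to __local arrays in a kernel), then reuse them for
   all the partial products of that block. */
void matmul_tiled(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c && n > 0);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0f;

    for (size_t r0 = 0; r0 < n; r0 += TILE) {
        for (size_t c0 = 0; c0 < n; c0 += TILE) {
            for (size_t k0 = 0; k0 < n; k0 += TILE) {
                float ta[TILE][TILE];   /* staged block of A */
                float tb[TILE][TILE];   /* staged block of B */
                /* Clamp block extents so n need not be a multiple of TILE. */
                size_t rows = min_size(TILE, n - r0);
                size_t cols = min_size(TILE, n - c0);
                size_t deep = min_size(TILE, n - k0);

                for (size_t r = 0; r < rows; ++r)
                    for (size_t k = 0; k < deep; ++k)
                        ta[r][k] = a[(r0 + r) * n + (k0 + k)];
                for (size_t k = 0; k < deep; ++k)
                    for (size_t cc = 0; cc < cols; ++cc)
                        tb[k][cc] = b[(k0 + k) * n + (c0 + cc)];

                for (size_t r = 0; r < rows; ++r) {
                    for (size_t cc = 0; cc < cols; ++cc) {
                        /* Scalar accumulator, analogous to private memory. */
                        float acc = c[(r0 + r) * n + (c0 + cc)];
                        for (size_t k = 0; k < deep; ++k)
                            acc += ta[r][k] * tb[k][cc];
                        c[(r0 + r) * n + (c0 + cc)] = acc;
                    }
                }
            }
        }
    }
}
```

Both functions take square n x n matrices stored row-major and should produce the same result. Only the memory access pattern differs.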

First, this much speedup surprises me, because on
8 cores I normally get a speedup of around 8 or less with OpenMP, for
example. So these figures of 16 and 24 amaze me. How is this possible?

Second, I have heard that local and private memory and the grouping
of work-items are optimizations meant only for GPUs, not for CPUs, so I
again wonder how I get such a boost in speedup when I run only on
the CPU.

Third, I wonder how local memory, private memory, and work-item grouping
are handled on CPUs, given that they produce a speedup. Are they mapped to
caches, processor registers, or something else? It seems like magic to get
this much speedup.

I would also like to know what the CPU-specific optimizations in OpenCL are.

Please help me clarify this. I am very new to OpenCL, and it is giving me
such a large performance gain that I can't believe it. I have verified the
results, and they are accurate.

Thanks in advance


