I am a Master's student at the FHNW in Windisch (Switzerland), and I am comparing three Heterogeneous System Architectures (HSA): Microsoft C++ AMP, OpenCL, and OpenACC. In this work I measure not only GPU performance but also CPU performance, in the scope of image convolution on three different images (128x128, 591x591, 2272x1704 pixels) and 10 different filter sizes (3x3, 5x5, ..., 21x21). In the CPU case I compare the HSAs against a simple OpenMP implementation, and I found that my OpenCL implementation, which runs on an Intel Core i7 950 @ 3.06 GHz, performs similarly to OpenMP. As you can see in "Performance_OpenCL_Alg.xls", OpenCL only matches the performance of OpenMP at relatively large filter sizes.
My question is: how can this be explained or fixed? I know that a larger work-group size would enable vectorization and improve performance, but OpenMP does not use vectorization either, so I would like to know why OpenCL does not fully utilize all 4 cores at small filter sizes.
I have added the source code of the OpenMP and OpenCL implementations. As you can see, I use the OpenCV class "Mat" to store my images and the filter. To measure execution time I used the OpenMP function "omp_get_wtime()".
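For reference, here is a minimal sketch of what I mean by the OpenMP CPU baseline. This is not my exact implementation (which uses cv::Mat); it uses a plain row-major std::vector<float> instead, so it is self-contained, but the loop structure and the `parallel for` over rows are the same idea:

```cpp
#include <vector>

// Naive single-channel convolution, parallelized over image rows with OpenMP.
// img:  row-major image of size w*h
// filt: row-major filter of size k*k, k odd (3, 5, ..., 21)
// Border pixels (within the filter radius) are left at 0, i.e. no padding.
std::vector<float> convolve(const std::vector<float>& img, int w, int h,
                            const std::vector<float>& filt, int k) {
    const int r = k / 2;  // filter radius
    std::vector<float> out(img.size(), 0.0f);
    #pragma omp parallel for  // each thread processes a range of rows
    for (int y = r; y < h - r; ++y) {
        for (int x = r; x < w - r; ++x) {
            float acc = 0.0f;
            for (int fy = -r; fy <= r; ++fy)
                for (int fx = -r; fx <= r; ++fx)
                    acc += img[(y + fy) * w + (x + fx)]
                         * filt[(fy + r) * k + (fx + r)];
            out[y * w + x] = acc;
        }
    }
    return out;
}
```

Timing is done by reading omp_get_wtime() immediately before and after the convolve() call and taking the difference.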
I would be very pleased if you could point me in the right direction.