Performance gap on small filters at convolution with OpenCL on CPU

Hi

I am a master's student at the FHNW in Windisch (Switzerland) and I am comparing three Heterogeneous System Architectures (HSA), namely Microsoft C++ AMP, OpenCL and OpenACC. In this work I measure not only GPU performance but also CPU performance for image convolution on three different images (128x128, 591x591 and 2272x1704 pixels) and 10 different filter sizes (3x3, 5x5, ..., 21x21). In the CPU case I compare the HSAs with a simple OpenMP implementation. I discovered that my OpenCL implementation, which runs on an Intel Core i7 950 at 3.06 GHz, does not match OpenMP across the board: as you can see in "Performance_OpenCL_Alg.xls", OpenCL reaches the performance of OpenMP only at relatively large filter sizes.

Now my question is how this can be explained or fixed. I know that a larger work-group size would enable vectorization and give better performance, but OpenMP does not use vectorization either, so I would like to know why OpenCL does not fully use all 4 cores with small filters.

I have attached the source code of the OpenMP and OpenCL implementations. As you can see, I use the OpenCV class "Mat" to store my images and the filter. To measure execution time I used the OpenMP function "omp_get_wtime()".
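For readers without the attachment, a baseline of the kind described here might look roughly like the sketch below. It is not the code from source.zip: the function name `convolve_omp`, the use of `std::vector` instead of OpenCV's `Mat`, and the assumption that the source image is already padded by half the filter size on every side are all illustrative choices.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical baseline, not the attached code. Assumes `src` is padded by
// fsize/2 pixels on every side, so its width is dst_width + fsize - 1.
// The filter is applied without flipping (identical for symmetric filters).
std::vector<std::uint8_t> convolve_omp(const std::vector<std::uint8_t>& src,
                                       int dst_width, int dst_height,
                                       const std::vector<float>& filter,
                                       int fsize) {
    const int src_width = dst_width + fsize - 1;
    std::vector<std::uint8_t> dst(
        static_cast<std::size_t>(dst_width) * dst_height);

    // Output rows are independent, so OpenMP can split them across threads.
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int y = 0; y < dst_height; ++y) {
        for (int x = 0; x < dst_width; ++x) {
            float acc = 0.0f;
            for (int fy = 0; fy < fsize; ++fy) {
                for (int fx = 0; fx < fsize; ++fx) {
                    acc += filter[fy * fsize + fx] *
                           src[(y + fy) * src_width + (x + fx)];
                }
            }
            // Round to nearest and clamp to the 8-bit range.
            const float clamped = std::min(255.0f, std::max(0.0f, acc + 0.5f));
            dst[y * dst_width + x] = static_cast<std::uint8_t>(clamped);
        }
    }
    return dst;
}
```

Each iteration of the outer loop is one row of output pixels, which is the unit of work the OpenMP runtime distributes over the cores.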

I would be very grateful if you could point me in the right direction.

Attachments:
source.zip (2.82 KB)
performance-opencl-alg.xls (126.5 KB)

Hi, from your charts I conclude (please correct me if I am wrong) that the scalar version of the code (i.e. with a local size of 1) is faster than the auto-vectorized code (OpenCLLocalWorkSize1CPU vs. OpenCLSimpleCPU). This is often the case for code operating on (u)char data, since vectorizing these data types requires packing and unpacking, which is currently performed with SSE/AVX integer instructions.

So the first experiment is to disable the vectorizer via vec_type_hint, which should improve the performance of the original OpenCLSimpleCPU version (running the scalar version of the kernel by specifying a local size of 1 carries too much overhead because of the excessive number of work-groups it produces). I see padding in your code, so hopefully the global size of the NDRange is good (e.g. divisible by 8); otherwise (for example, with an uneven size in dimension zero of the NDRange) the scalar version of the kernel would run anyway.
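To make the "divisible by 8" point concrete, here is a minimal host-side sketch for rounding a global size up to the next multiple of 8. The helper name `round_up` is made up for illustration, and the sketch assumes the kernel either bounds-checks the extra work-items or the buffers are padded to match.

```cpp
#include <cstddef>

// Hypothetical helper: round `size` up to the next multiple of `multiple`.
// `multiple` must be greater than zero.
std::size_t round_up(std::size_t size, std::size_t multiple) {
    return ((size + multiple - 1) / multiple) * multiple;
}
```

For the three image widths in this thread, `round_up(128, 8)` and `round_up(2272, 8)` leave the size unchanged, while `round_up(591, 8)` gives 592, so only the 591x591 image would need extra work-items in dimension zero.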

Secondly, to get better results, I suggest operating on (at least) 8 pixels per work-item. This can be done with the uchar8 data type (notice that the number of kernel invocations shrinks accordingly, which reduces the work-item scheduling overhead). This amounts to manual vectorization, yet you can still turn the vectorizer on and off and re-assess the performance of the new version.
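The idea of 8 pixels per work-item can be illustrated on the host side. The sketch below is plain C++ that emulates what a single work-item would compute; it is not an OpenCL kernel and does not use uchar8. The name `convolve_item8`, the argument layout, and the assumption of a source image padded by half the filter size are all hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical emulation of one work-item that produces 8 adjacent output
// pixels in row `y`, starting at column item_x * 8. Assumes `src` is padded
// so that every read below stays in bounds, and that the output width is a
// multiple of 8.
void convolve_item8(const std::uint8_t* src, int src_stride,
                    const float* filter, int fsize,
                    int item_x, int y,
                    std::uint8_t* dst, int dst_stride) {
    float acc[8] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f};
    const int x0 = item_x * 8;

    for (int fy = 0; fy < fsize; ++fy) {
        for (int fx = 0; fx < fsize; ++fx) {
            const float w = filter[fy * fsize + fx];
            const std::uint8_t* p = src + (y + fy) * src_stride + x0 + fx;
            // One filter tap is applied to 8 neighbouring pixels at once;
            // in a kernel this inner loop would be a single uchar8 load.
            for (int i = 0; i < 8; ++i) {
                acc[i] += w * p[i];
            }
        }
    }
    for (int i = 0; i < 8; ++i) {
        // Round to nearest and clamp to the 8-bit range.
        const float clamped = std::min(255.0f, std::max(0.0f, acc[i] + 0.5f));
        dst[y * dst_stride + x0 + i] = static_cast<std::uint8_t>(clamped);
    }
}
```

With this layout the global size in dimension zero becomes the image width divided by 8, which is where the reduction in scheduling overhead comes from.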

Hi. Thank you for your answer. It helped me to run some more tests, especially with auto-vectorization turned off. In addition, I ran some tests with auto-vectorization on and work-group sizes larger than 1x1. In the end, a work-group size of 3x3 with auto-vectorization beats the OpenMP implementation.

But my actual question has not been answered. As you can see in "Performance_OpenCL_CPU_Usage.png", the OpenMP implementation fully uses all 4 cores of the CPU, whereas OpenCL uses the CPU's full resources only at larger filter and image sizes.

My question now is: does this effect appear because the device code is compiled each time it is executed (I run each kernel 5 times in a row)? Or are the calculations for the small images and filters so inexpensive that the CPU does not reach full load before the kernel has finished?
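One way to tell these two explanations apart is to time every repetition separately instead of timing the 5 runs as a block. The sketch below is a generic timing helper, not code from the attachment; `time_runs` is a made-up name, and it uses `std::chrono` rather than `omp_get_wtime()` only to stay self-contained.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: call `run` `repetitions` times and return the
// wall-clock duration of each call in seconds, in call order.
std::vector<double> time_runs(const std::function<void()>& run,
                              int repetitions) {
    std::vector<double> seconds;
    if (repetitions > 0) {
        seconds.reserve(static_cast<std::size_t>(repetitions));
    }
    for (int i = 0; i < repetitions; ++i) {
        const auto start = std::chrono::steady_clock::now();
        run();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(
            std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}
```

If the first entry is much larger than the other four, a one-time cost such as program build or first-touch allocation is inside the timed region. If all five are similar and short, the work per launch is probably just too small to load the cores.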

Thank you in advance.

Attachments:
performance-opencl-cpu-usage.png (13.84 KB)

Hi again. Be careful with plain "CPU core utilization" as the metric to optimize. As you know, a simple infinite loop in each thread can occupy the cores completely, wasting time and watts despite seemingly perfect utilization.

If the performance of, say, two threading models is similar yet the utilization is not, it might indicate different approaches to task stealing and to wait loops in the models. Intel OpenCL for CPU relies on TBB, which is generally more adaptive to the load, especially in cases where there are not enough work-groups (work-groups are the finest granularity for threading in our implementation) to saturate all the threads and facilitate task stealing.
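Since work-groups are the unit that gets scheduled onto threads, a quick count shows when there may be too few of them. The sketch below is plain arithmetic with made-up helper names; it assumes the global size is an exact multiple of the local size in each dimension, and the local sizes in the usage note are hypothetical values, not ones taken from the attached code.

```cpp
#include <cstddef>

// Hypothetical helper: number of work-groups in a 2D NDRange, assuming each
// global dimension is an exact multiple of the matching local dimension.
std::size_t work_group_count(std::size_t global_x, std::size_t global_y,
                             std::size_t local_x, std::size_t local_y) {
    return (global_x / local_x) * (global_y / local_y);
}

// True if there is at least one work-group per hardware thread.
bool can_saturate(std::size_t groups, std::size_t hardware_threads) {
    return groups >= hardware_threads;
}
```

For the 128x128 image, a hypothetical 64x64 local size yields only 4 work-groups, fewer than the 8 hardware threads of a 4-core CPU with Hyper-Threading, while an 8x8 local size on the 2272x1704 image yields 60492 work-groups, leaving plenty of work to steal.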

As for OpenMP saturating the cores completely, I recall that when compiling the same native OpenMP app with Visual Studio 2008 you see uneven core saturation (just like with Intel OpenCL), yet the same app compiled with VS 2010 demonstrates full utilization while delivering the same performance as with VS 2008.
