OpenCL SDK 1.5: 2-3 times slower than 1.1

I've tested the new 1.5 SDK with some selected OpenCL samples from the AMD APP SDK 2.5. These samples allow easy selection of Platform/Device, so they are very useful for quick-and-dirty tests.

The results are discouraging:

BitonicSort (-x 16777216): 10.85/30.61 sec (Intel OpenCL SDK 1.1/1.5)
EigenValue (-x 20480): 10.75/23.5 sec
Nbody (-x 102400): 6.87/20.0 sec
RadixSort (-x 102400000): 9.08/12.75 sec

Not-so-bad result:
SimpleConvolution (-x 8192 -y 8192 -m 16): 9.83/9.99 sec

Good result (the only one of the six samples tested):
Histogram (-x 20480 -y 32768 -i 3): 0.92/0.83 sec

The Histogram sample is the only one 'not vectorized' (per the 'Kernel ... not vectorized' message from the Offline Compiler).

The samples were run with the platform set to Intel (-p 2 on my machine) and timing on (-t). The times are for kernel + data transfer.
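For example, the BitonicSort run above corresponds roughly to an invocation like this (the exact executable name may vary between samples):

BitonicSort.exe -x 16777216 -p 2 -t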

The machine is an i7-2600K (so, AVX) @ 4.5 GHz, Windows 7/x64. Three OpenCL SDKs are installed (Nvidia, AMD and Intel).


Alex,

Thanks for the feedback. Generally, you should expect better performance with new releases. We are working on reproducing your setup locally and investigating the problem.

Thanks,
Nadav

Also, it doesn't look like AVX is really being used.
I've tested with a vector sum kernel:
__kernel void vsum(__global const float *a, __global const float *b, __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The inner loop of the vectorized version is 128-bit:
__Vectorized_.vsum: # @__Vectorized_.vsum
.... initialization ....
LBB3_1: # %SyncBB
# =>This Inner Loop Header: Depth=1
mov R10D, DWORD PTR [RDX]
add R10D, DWORD PTR [RSI]
movsxd R10, R10D
vmovups XMM0, XMMWORD PTR [R8 + 4*R10]
vmovups XMM1, XMMWORD PTR [R9 + 4*R10]
vaddps XMM0, XMM1, XMM0
vmovups XMMWORD PTR [RDI + 4*R10], XMM0
add RDX, 32
inc RAX
cmp RAX, RCX
jb LBB3_1

For some kernels I see a 256-bit load, then an extract (vextractf128) from YMM to XMM, then 128-bit code again.

Hello Alex,

Do you have SP1 installed for Windows 7?

Thanks,
Doron Singer

Yes, SP1 is installed (and all current updates too).

Alex,

Please notice that the AMD sample uses group size = 1, which means that it always runs non-vectorized code.

Nadav

In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).
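For example (my own illustration, not one of the samples), a kernel doing integer arithmetic falls into the second category: first-generation AVX has no 256-bit integer add, so the code stays in 128-bit registers:

__kernel void ivsum(__global const int *a, __global const int *b, __global int *c)
{
    /* Integer add: pre-AVX2 hardware has no 256-bit vpaddd,
       so this maps to 128-bit SIMD instructions. */
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}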


Yes, for the BitonicSort sample the GROUP_SIZE is set to 1, which is very suboptimal.
Other samples differ (a sketch of the group-size selection is below):
* The EigenValue and NBody samples set the group size to min(256, value from clGetKernelWorkGroupInfo)
* The RadixSort sample uses min(64, value from clGetKernelWorkGroupInfo).

The OpenCL driver reports 1024 as the max work-group size, so the real values used are 256 and 64. This did not change between Intel OpenCL 1.1 and 1.5.
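A minimal host-side sketch of that pattern (my illustration, not the samples' code; error handling omitted, and 'kernel'/'device' are assumed to already exist):

#include <CL/cl.h>

size_t pick_local_size(cl_kernel kernel, cl_device_id device, size_t cap)
{
    size_t max_wg = 0;
    /* Ask the runtime for the maximum work-group size for this kernel on this device. */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    return max_wg < cap ? max_wg : cap;  /* cap = 256 or 64 in the samples */
}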

Quote: In some cases our auto-vectorizer estimates that using 128-bit wide registers is likely to generate faster code. In other cases, AVX simply does not implement certain operations (such as integer operations).

For simple float vector addition, a 256-bit vaddps should be way faster than the 128-bit one: it processes 8 floats per instruction instead of 4.

The good news:

For my own code (I added timers for microbenchmarks) I see about a 20% speed-up for Intel OpenCL 1.5 over v1.1.

Hi Alex,

We have investigated the reported BitonicSort and RadixSort issues and found that we do have a performance regression in our 1.5 Gold vs. 1.1 when it comes to executing kernels with small work-group sizes. We are working on eliminating most of this performance regression, and one of our future releases will include the fix for this issue.

To avoid this phenomenon, we recommend using larger work-group sizes; the sweet spot would be at a work-group size > 64.

As a side comment, I would recommend using large work group sizes in general as this would probably be more optimal for our implementation. You can read more about it in the optimization guide which is attached to this release.
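As an illustration (placeholder names, not the samples' code), an explicit local size is passed at enqueue time like this:

size_t global_size = 1024 * 1024;  /* total work-items; a multiple of local_size */
size_t local_size  = 128;          /* work-group size above the >64 sweet spot */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);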

We are still investigating the EigenValue regression. However, we weren't able to reproduce the NBody regression.

Thanks for helping us improve the product,
Boaz

Thanks!

Will wait for the next releases!
