My OpenCL code was running much slower than it should, and I was surprised to find that the clz (count leading zeros) function was the culprit. Writing OpenCL code I got used to clz being fast, and it took me a while to figure out why my code was running at half the speed it should have.
I do understand that, unlike GPU instruction sets, x86 doesn't include anything directly useful for this operation, but it can still be implemented much more efficiently.
The current implementation seems to just loop through the bits until it finds a nonzero one. That's up to 32 loop iterations, and 32 unpredictable conditional jumps are very slow.
__Z3clzi:                               # @_Z3clzi
# BB#0:
        mov     ECX, -2147483648
        xor     EAX, EAX
        mov     EDX, DWORD PTR [ESP + 4]
        jmp     LBB1_1
        .align  16, 0x90
LBB1_3:                                 # in Loop: Header=BB1_1 Depth=1
        inc     EAX
        shr     ECX
LBB1_1:                                 # =>This Inner Loop Header: Depth=1
        test    ECX, ECX
        je      LBB1_4
# BB#2:                                 # in Loop: Header=BB1_1 Depth=1
        test    ECX, EDX
        je      LBB1_3
LBB1_4:
        ret
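For reference, the assembly above corresponds roughly to this C sketch (names are mine, not from the original source):

```c
#include <stdint.h>

/* Scan a one-bit mask down from the top bit; bump the counter on every
 * miss. Worst case (x == 0) this loops 32 times, matching the generated
 * code above. */
int clz_naive(uint32_t x)
{
    uint32_t mask = 0x80000000u;  /* ECX */
    int n = 0;                    /* EAX */
    while (mask != 0 && (mask & x) == 0) {
        n++;
        mask >>= 1;
    }
    return n;
}
```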
A more efficient implementation could at least use a 256-entry lookup table to find the leading zeros within a byte, so clz would only need to step through at most four bytes.
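The byte-table approach could look something like this (a minimal sketch; the table and function names are hypothetical, not from any actual implementation):

```c
#include <stdint.h>

/* Table mapping each byte value to the number of leading zeros within
 * that byte (8 for the byte 0). */
static uint8_t clz8_table[256];

void init_clz8_table(void)
{
    clz8_table[0] = 8;
    for (int i = 1; i < 256; i++) {
        int n = 0;
        for (int bit = 0x80; (i & bit) == 0; bit >>= 1)
            n++;
        clz8_table[i] = (uint8_t)n;
    }
}

/* Count leading zeros of a 32-bit value by testing at most four bytes,
 * most significant first: no data-dependent loop, at most three
 * branches. */
int clz_table(uint32_t x)
{
    if (x >> 24) return clz8_table[x >> 24];
    if (x >> 16) return 8 + clz8_table[x >> 16];
    if (x >> 8)  return 16 + clz8_table[x >> 8];
    return 24 + clz8_table[x & 0xFF];
}
```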
Another problem is that even when given a constant argument it still generates slow code instead of computing the result at compile time.
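For comparison, GCC and Clang already provide a builtin that both compiles to fast code and is folded to a constant when the argument is known at compile time (note that __builtin_clz(0) is undefined, so a wrapper is needed for full 0..32 coverage):

```c
/* Thin wrapper over the compiler builtin; the zero check makes the
 * result well defined for all inputs. With a constant argument the
 * whole call folds away at compile time. */
int clz_builtin(unsigned x)
{
    return x ? __builtin_clz(x) : 32;
}
```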