The compiler does not optimize constant operations written with SIMD intrinsics at all

Hi,

I know that SIMD intrinsics in C/C++ are very limited and that some qualifiers (const, volatile, etc.) are dropped, but I find it really disappointing that this affects the quality of the generated assembly code. For example, let's say you have the following simple code:

int a = (5 + 2)/2;

In this case, the compiler evaluates the constant expression at compile time and generates just a single move:

movl $3, %eax

However, if you provide this SIMD code:

__m512i a = _mm512_div_epi32(_mm512_add_epi32(_mm512_set1_epi32(5), _mm512_set1_epi32(2)), _mm512_set1_epi32(2));

The compiler is not able to simplify the code and it generates:

vmovaps .L_2il0floatpacket.3(%rip), %zmm1
vpaddd .L_2il0floatpacket.2(%rip), %zmm1, %zmm0
call __svml_idiv16

which is really inefficient. Note that it does not even detect that the divisor is 2, in which case the division could be replaced by a shift (plus a small sign correction for negative values).
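To make the "shift instead of divide" point concrete, here is a minimal scalar sketch of what a compiler typically emits for a signed division by 2. The helper name `div2_trunc` is made up for illustration, and the code assumes that `>>` on a negative `int32_t` is an arithmetic shift (implementation-defined in C, but the usual behavior of compilers targeting x86).

```c
#include <stdint.h>

/* Hypothetical helper: truncating signed division by 2 without a divide.
   A plain arithmetic shift rounds toward negative infinity, while C's '/'
   rounds toward zero, so negative inputs need a bias of 1 before shifting.
   Assumption: '>>' on a negative int32_t is an arithmetic shift. */
static inline int32_t div2_trunc(int32_t x) {
    int32_t bias = (int32_t)((uint32_t)x >> 31); /* 1 if x is negative, else 0 */
    return (x + bias) >> 1;
}
```

The same three steps should map onto AVX-512 as `_mm512_srli_epi32`, `_mm512_add_epi32` and `_mm512_srai_epi32`, which is far cheaper than a call into a division routine.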

Of course, I'm compiling with -O3. I would like to know whether it is possible to make the compiler optimize this kind of expression when it is written with intrinsics, since I am not able to provide better-optimized code myself.
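For reference, one possible workaround, sketched under the assumption that all operands really are compile-time constants, is to fold the expression in scalar code and broadcast only the result. The names `FOLDED` and `folded_lanes` are hypothetical; the AVX-512 path is guarded so the snippet also builds on targets without AVX-512.

```c
#include <stdint.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Integer constant expression: the compiler must evaluate it at compile time. */
enum { FOLDED = (5 + 2) / 2 };

/* Hypothetical helper: writes the folded constant into 16 32-bit lanes.
   With AVX-512 this is a single broadcast plus a store, with no add
   and no division call. */
static void folded_lanes(int32_t out[16]) {
#if defined(__AVX512F__)
    _mm512_storeu_si512(out, _mm512_set1_epi32(FOLDED));
#else
    /* Portable fallback for builds without AVX-512. */
    for (int i = 0; i < 16; ++i) {
        out[i] = FOLDED;
    }
#endif
}
```

This only helps when the inputs are known constants; it does not address the general case where the compiler would have to fold the intrinsics itself.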

Kind regards.

Barcelona Supercomputing Center
iliyapolak

It seems that the compiler decided to call the SVML integer division function, which adds the overhead of the call instruction on top of the latency of the division itself.

jimdempseyatthecove

Did you get the instruction sequence from a .ASM listing file produced by the compiler?
Or did you get it from a debugger disassembly window or the VTune disassembly window?

If it came from the .ASM file, I suggest you make a release build (without generating a .ASM file) and look at the resulting code in the debugger (or VTune). You may see additional optimizations.

Jim Dempsey

www.quickthreadprogramming.com