Efficient branching on double vector comparison using intrinsics?

Hi,

I want to check the range of a vector of double-precision variables, in order to branch to a slow path on exceptional out-of-range cases. My code looks like the following:

    // if(any(!(x < 4.) || (x < 2))) { ... }
    __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
    __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
    if(!_mm512_kortestz(toobig, toosmall)) {
        // do something with out-of-range numbers (slow path)
    }
    // do something with in-range numbers (fast path)
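
For reference, the per-lane predicate in the snippet above can be modelled in plain scalar C++. This is only a sketch for reasoning about the mask values, not the vector code itself: `mask8`, `too_big_mask`, `too_small_mask` and `any_out_of_range` are hypothetical names, and `mask8` merely stands in for `__mmask8`. It also shows why the upper bound is written as "not less than": a NaN lane fails `x < 4.` and therefore lands on the slow path.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for __mmask8: one bit per vector lane.
typedef std::uint8_t mask8;

// Scalar model of the "not less than" compare: bit i is set when !(x[i] < hi).
// Any comparison with NaN is false, so NaN lanes are flagged here.
inline mask8 too_big_mask(const double (&x)[8], double hi) {
    mask8 m = 0;
    for (int i = 0; i < 8; ++i)
        if (!(x[i] < hi)) m = static_cast<mask8>(m | (1u << i));
    return m;
}

// Scalar model of the "less than" compare: bit i is set when x[i] < lo.
inline mask8 too_small_mask(const double (&x)[8], double lo) {
    mask8 m = 0;
    for (int i = 0; i < 8; ++i)
        if (x[i] < lo) m = static_cast<mask8>(m | (1u << i));
    return m;
}

// Models the branch condition: true when any lane is outside [2, 4).
inline bool any_out_of_range(const double (&x)[8]) {
    return (too_big_mask(x, 4.0) | too_small_mask(x, 2.0)) != 0;
}

// Small sanity check: only the NaN lane (lane 7) is reported as too big.
inline void self_check() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const double x[8] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, nan};
    assert(too_big_mask(x, 4.0) == 0x80);
    assert(too_small_mask(x, 2.0) == 0x00);
}
```

A model like this can be handy for checking the slow-path trigger on a host machine before looking at the generated code on the coprocessor.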

I expected this to map to a three-instruction sequence (two compares and a kortest) ahead of the branch. However, icc 13.1 generates extra data movement and masking between the comparisons and the test:

###     __mmask8 toobig = _mm512_cmpnlt_pd_mask(x, _mm512_set1_pd(4.));
        vcmpnltpd k2, zmm0, QWORD PTR .L_2il0floatpacket.5[rip]{1to8} #20.23 c1
###     __mmask8 toosmall = _mm512_cmplt_pd_mask(x, _mm512_set1_pd(2.));
        vcmpltpd  k3, zmm0, QWORD PTR .L_2il0floatpacket.6[rip]{1to8} #21.25 c5
        kmov      eax, k2                                       #20.23 c9
        mov       dl, dl                                        #21.25 c9
        kmov      edx, k3                                       #21.25 c13
###     if(!_mm512_kortestz(toobig, toosmall)) {
        movzx     eax, al                                       #22.9 c13
        movzx     edx, dl                                       #22.9 c17
        kmov      k0, eax                                       #22.9 c17
        kmov      k1, edx                                       #22.9 c21
        kortest   k0, k1                                        #22.9 c25
        je        ..B3.3        # Prob 50%                      #22.9 c25

It seems the compiler generates instructions to clear the high-order bits of the mask. As I understand it, vcmppd already clears the upper part of the mask, so the zero-extend instructions do not seem to serve any useful purpose. Since the code before the branch is on the critical path, I would rather avoid the overhead.

I am attaching a self-contained reproducer (mmask8.cpp), compiled with icpc -mmic -fsource-asm -masm=intel -S mmask8.cpp

If I am not using the proper idiom, what is the recommended way to test __mmask8 variables?

Why not use:

    if(!(toobig | toosmall)) {
        // fast path
    } else {
        // alternate path
    }

Jim Dempsey

www.quickthreadprogramming.com

Thanks Jim!

It gets slightly better following your suggestion: I still get the same copies to GPRs and zero-extensions, but at least I avoid the trip back to mask registers.

        vcmpnltpd k1, zmm0, QWORD PTR .L_2il0floatpacket.16[rip]{1to8} #61.23 c1
        vcmpltpd  k2, zmm0, QWORD PTR .L_2il0floatpacket.17[rip]{1to8} #62.25 c5
        kmov      eax, k1                                       #61.23 c9
        mov       dl, dl                                        #62.25 c9
        kmov      edx, k2                                       #62.25 c13
        movzx     eax, al                                       #63.8 c13
        movzx     edx, dl                                       #63.17 c17
        or        eax, edx                                      #63.17 c21
        je        ..B6.3        # Prob 50%                      #63.17 c21

Still much room for improvement... (by the way, I am curious about what mov dl,dl is supposed to do)

The mov dl,dl is there to force a stall. Apparently you cannot issue back-to-back kmovs.

Try "if(!((char)toobig | (char)toosmall))"

or reinterpret_cast

Jim Dempsey

www.quickthreadprogramming.com

Thanks. Indeed, "mov dl,dl" seems to be there to prevent the second kmov from pairing with the first one, and avoid the dependency on k2. This might reduce the latency by pipelining the second vcmp*pd and the first kmov, so eax is available earlier. (I guess the c* comments at the end of the lines are the expected issue time in cycles within a basic block.)

No luck with attempts to cast to char. The code generated is still the same... Actually, __mmask8 seems to be a typedef for unsigned char (in zmmintrin.h).
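
One language-level reason the cast cannot change anything: in C++, both operands of `|` undergo integer promotion, so the expression is an `int` whether the operands are `unsigned char` or `char`. The sketch below checks this with a hypothetical `mask8` typedef standing in for `__mmask8` (assumed here to be `unsigned char`, as reported above); `any_set` and `any_set_via_char` are illustrative names. Whether this promotion is what leads icc to emit the movzx instructions is a guess, not something the listing proves.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for __mmask8 (reported above to be unsigned char).
typedef unsigned char mask8;

// Integer promotion: unsigned char | unsigned char has type int.
static_assert(std::is_same<decltype(mask8() | mask8()), int>::value,
              "mask8 | mask8 is promoted to int");

// Casting to char first does not help: char | char is also an int.
static_assert(std::is_same<decltype(static_cast<char>(mask8()) |
                                    static_cast<char>(mask8())), int>::value,
              "char | char is promoted to int");

// Branch condition written directly on the mask type.
inline bool any_set(mask8 a, mask8 b) {
    return (a | b) != 0;
}

// Same condition with the suggested casts; the truth value is unchanged.
inline bool any_set_via_char(mask8 a, mask8 b) {
    return (static_cast<char>(a) | static_cast<char>(b)) != 0;
}
```

Both helpers test the same eight bits, so a compiler is free to generate identical code for them, which is consistent with what was observed.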
