Missed Optimization: `std::copysignf(1.0f,x)`

As the title says. For `std::copysignf(1.0f, x)`, ICC 17 generates:

my_copysign_1(float):
        movss     xmm1, DWORD PTR .L_2il0floatpacket.1[rip]     #4.12
        movss     xmm2, DWORD PTR .L_2il0floatpacket.0[rip]     #4.12
        andps     xmm0, xmm2                                    #4.12
        andnps    xmm2, xmm1                                    #4.12
        orps      xmm0, xmm2                                    #4.12
        ret                                                     #4.12
.L_2il0floatpacket.0:
        .long   0x80000000
.L_2il0floatpacket.1:
        .long   0x3f800000

GCC, for example, generates much better code:

my_copysign_1(float):
        andps   xmm0, XMMWORD PTR .LC1[rip]
        orps    xmm0, XMMWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1065353216
        .long   0
        .long   0
        .long   0
.LC1:
        .long   2147483648
        .long   0
        .long   0
        .long   0

Even if you don't like the extra space GCC uses for its padded 16-byte constants (and you should like it, because my profiles show it's faster), ICC's actual operations (`andps`, `andnps`, and `orps`) can still be reduced by one instruction.
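
To make explicit what the two-instruction GCC sequence computes, here is a minimal C++ sketch of the same bit operations. It assumes `float` is an IEEE 754 binary32 value; the helper names `to_bits`, `from_bits`, `copysign_one_bits`, and `check_copysign_one_bits` are made up for illustration and appear in neither compiler's output.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret a float's bits (assumes 32-bit IEEE 754).
static std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Hypothetical helper: the inverse of to_bits.
static float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Mirrors the GCC listing: keep only the sign bit of x (andps),
// then OR in the bit pattern of 1.0f (orps).
float copysign_one_bits(float x) {
    const std::uint32_t sign_mask = 0x80000000u;  // .LC1 = 2147483648
    const std::uint32_t one_bits  = 0x3f800000u;  // .LC0 = 1065353216
    return from_bits((to_bits(x) & sign_mask) | one_bits);
}

// Compares the bit-level version with the library call on a few inputs.
void check_copysign_one_bits() {
    const float samples[] = {2.0f, -2.0f, 0.0f, -0.0f, 1e-30f, -1e30f};
    for (float x : samples) {
        assert(to_bits(copysign_one_bits(x)) ==
               to_bits(std::copysign(1.0f, x)));
    }
}
```

Calling `check_copysign_one_bits()` from a test program compares the two versions bit for bit, including the signed-zero cases.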


Hi Ian,

Can you provide a test case for us to look at?

Thanks,

Viet Hoang

Er, the testcase would be literally:

#include <cmath>
float my_copysign_1(float x) {
	return std::copysignf(1.0f,x);
}

See it live here.

I see the asm you mentioned; however, the execution times are the same for the two compilers. You need to show that ICC is slower than GCC before I can submit a perf bug to our developer.

Thanks,

Viet Hoang

vahoang@orcsle139:/tmp$ rm a.out && g++ t.cpp -O2 -c && g++ main.cpp -O2 -c && g++ main.o t.o -O2 && time ./a.out
f is:1.67772e+07

real    0m2.260s
user    0m2.258s
sys     0m0.001s
vahoang@orcsle139:/tmp$ rm a.out && icpc t.cpp -O2 -c && icpc main.cpp -O2 -c && icpc main.o t.o -O2 && time ./a.out
f is:1.67772e+07

real    0m2.260s
user    0m2.256s
sys     0m0.003s
vahoang@orcsle139:/tmp$ cat t.cpp
#include <iostream>
#include <cmath>

float my_copysign_1(float x);

using namespace std;

float my_copysign_1(float x) {
        return copysignf(1.0f,x);
}
vahoang@orcsle139:/tmp$ cat main.cpp
#include <iostream>
#include <cmath>

float my_copysign_1(float x);
using namespace std;

int main () {

    int i = 0;
    int MAX = 1000000000;
    float f = 0.0f;
    for (i = 0; i < MAX ; i++)
         f = my_copysign_1( 2.0f) + f ;
    cout << "f is:"  <<  f << endl;
    return 0;
}

The difference here is dwarfed by other effects, so a single timed run cannot resolve it. I ran your benchmark 64 times, using `perf` (from `linux-tools-generic`) and elevating the scheduling priority to reduce interference:

sudo chrt -f 99 perf stat -r 64 -d ./test-g++
sudo chrt -f 99 perf stat -r 64 -d ./test-icpc

I get:

g++: 3060.461358 ms ± 0.02%
icpc: 3062.379933 ms ± 0.02%

This is statistically significant to better than 99.99% certainty.
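
How large the separation is depends on what the `±` figure means, which the numbers above do not say. A rough sketch of the arithmetic under two assumed readings follows. The function names are hypothetical, the normal-approximation z-score is one possible choice of test and not necessarily the one used here, and `0.02%` is a rounded figure, so treat the results as approximate.

```cpp
#include <cmath>

// Hypothetical helper: z-score for the difference of two means, given each
// mean and the standard error of that mean.
double z_score(double mean_a, double se_a, double mean_b, double se_b) {
    return (mean_b - mean_a) / std::sqrt(se_a * se_a + se_b * se_b);
}

// Assumed reading A: "± 0.02%" is already the relative standard error
// of the mean over the 64 runs.
double z_if_standard_error() {
    const double gcc_ms = 3060.461358;
    const double icc_ms = 3062.379933;
    const double rel = 0.0002;  // 0.02%
    return z_score(gcc_ms, gcc_ms * rel, icc_ms, icc_ms * rel);
}

// Assumed reading B: "± 0.02%" is the per-run relative standard deviation,
// so the standard error of the mean is smaller by sqrt(64).
double z_if_per_run_stddev() {
    const double gcc_ms = 3060.461358;
    const double icc_ms = 3062.379933;
    const double rel = 0.0002 / std::sqrt(64.0);
    return z_score(gcc_ms, gcc_ms * rel, icc_ms, icc_ms * rel);
}
```

Under these assumptions the first reading gives a z-score of roughly 2.2 and the second roughly 17.7.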

The versions involved are:

icpc (ICC) 18.0.0 20170811
g++ (GCC) 7.1.0

The difference presumably isn't larger because superscalar execution hides most of the extra work, and maybe because of hardware effects associated with calling the function or latency stalls. I really don't know. But the above test shows that, for whatever reason, the Intel compiler's code is slower.

It should also, again, be obvious from first principles that this is the case: the Intel compiler's code has twice as many instructions (six versus three, counting `ret`) and performs a strict superset of the operations in g++'s code. Also, independent of this, the code can be optimized by removing at least the `andnps` instruction, even keeping an identical memory literal access pattern: `1.0f` already has a clear sign bit, so masking the sign bit off changes nothing.
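
The claim that `andnps` is removable can be checked with a small integer model of the low 32-bit lane. This is a sketch, not compiler output: `icc_sequence`, `reduced_sequence`, and `check_sequences_agree` are hypothetical names, and the model assumes only the two constants shown in the ICC listing.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kSignMask = 0x80000000u;  // .L_2il0floatpacket.0
constexpr std::uint32_t kOneBits  = 0x3f800000u;  // .L_2il0floatpacket.1

// The andnps step operates only on constants, and its result is the
// unchanged bit pattern of 1.0f.
static_assert((~kSignMask & kOneBits) == kOneBits,
              "andnps leaves the bits of 1.0f unchanged");

// Integer model of the ICC listing, one statement per instruction.
std::uint32_t icc_sequence(std::uint32_t x_bits) {
    std::uint32_t xmm1 = kOneBits;       // movss  xmm1, packet.1
    std::uint32_t xmm2 = kSignMask;      // movss  xmm2, packet.0
    std::uint32_t xmm0 = x_bits & xmm2;  // andps  xmm0, xmm2
    xmm2 = ~xmm2 & xmm1;                 // andnps xmm2, xmm1
    return xmm0 | xmm2;                  // orps   xmm0, xmm2
}

// Same loads, but orps takes the 1.0f constant directly (no andnps).
std::uint32_t reduced_sequence(std::uint32_t x_bits) {
    return (x_bits & kSignMask) | kOneBits;
}

// Checks that both models agree on a few sample bit patterns.
void check_sequences_agree() {
    const std::uint32_t samples[] = {0x00000000u, 0x80000000u, 0x40000000u,
                                     0xc0000000u, 0x7f800000u, 0xffffffffu};
    for (std::uint32_t bits : samples) {
        assert(icc_sequence(bits) == reduced_sequence(bits));
    }
}
```

The model covers only the scalar lane that `my_copysign_1` returns; the upper lanes are not part of the function's result.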
