Parentheses not honored when using FMA

Hello,

I have doubts about the asm generated by the Intel compiler for the following code, which performs an orientation test for a point and a segment:

double orient_test_2d(const double m[2], const double a[2], const double b[2])

{
  double am1 = a[0]-m[0];
  double bm1 = b[1]-m[1];
  double am2 = a[1]-m[1];
  double bm2 = b[0]-m[0];

  return ((am1)*(bm1)) - ((am2)*(bm2));
}

In the return statement the operands are all in parentheses. The Intel compiler optimizes the statement and introduces an FMA instruction. I think this is wrong, because the FMA causes the subtraction and multiplication to be effectively executed at the same time, while the source specifies that the multiplications should be performed before the subtraction.

This is the assembly generated by 'icc-15.0.3 -O3 -march=core-avx2 test.c -c -S'

        vmovsd    (%rdi), %xmm5                                 #3.22
        vmovsd    8(%rdi), %xmm3                                #5.22
        vmovsd    8(%rsi), %xmm2                                #5.17
        vmovsd    (%rdx), %xmm4                                 #6.17
        vsubsd    %xmm3, %xmm2, %xmm6                           #5.22
        vsubsd    %xmm5, %xmm4, %xmm7                           #6.22
        vmovsd    8(%rdx), %xmm1                                #4.17
        vmovsd    (%rsi), %xmm0                                 #3.17
        vsubsd    %xmm3, %xmm1, %xmm8                           #4.22
        vmulsd    %xmm7, %xmm6, %xmm9                           #8.33
        vsubsd    %xmm5, %xmm0, %xmm0                           #3.22
        vfmsub213sd %xmm9, %xmm8, %xmm0                         #8.33
        ret                                                     #8.33

I believe that in order to honor the parentheses icc should NOT generate the FMA, but stick to the programmed operation order. If I wanted a FMA, I would not use parentheses around the multiplications. Current behavior of icc results in inconsistent behavior in my code due to floating point arithmetic errors.

As a side note, the GNU compiler does exactly the same thing.

Can this be considered a bug, or is it the expected behavior? If this is OK, how can I ensure the effect that I need in a portable way?

Thanks!

Marcin Krotkiewski


>>this is wrong because FMA causes the subtraction and multiplication to be effectively executed at the same time

No, it does not. FMA (Fused Multiply-Add) is better read as Fused Multiply *then* Add (or subtract).

The assembler code is correct. The vmulsd performs one of the products, (am2)*(bm2); the final vfmsub213sd performs the other product, (am1)*(bm1), and then subtracts the first product from it.

Jim Dempsey

>> No it does not. FMA == Fused Multiply and Add is better expressed as == Fused Multiply then Add (or subtract).

Of course, you are right. I meant that the fused operation is performed as if with infinite intermediate precision. The problem is that the FP result is different from what two separate instructions would produce. Hence, it can happen that

orient_test_2d(m,a,b) != - orient_test_2d(m,b,a)

The result is consistent when FMA is not used. I hoped that using parentheses would prevent FMA from being generated, but obviously that was wrong. I have now seen the -fp-model strict compiler option, which prevents fp contraction. I guess I must use it but it would be nice to know how to achieve this in a portable way.

Thanks

Marcin

>>I meant that the instruction is performed with infinite precision.

Actually FMA has slightly higher precision.

The issue is (if you want to be frank about it) that you want less precise results, presumably because you have some prior results data that you knew (or should have known) was approximate, but good enough. The different results mean you have to re-evaluate what is good enough.

An alternative to -fp-model strict is to generate an SSE-only code path, but this may produce different results from code generated using the x87 FPU (the non-vector floating-point unit with internal 80-bit intermediates). At some point someone is going to have to recertify your results using FMA, and potentially future instructions such as a (hypothetical) fused dot product: the multiplication of two small vectors, followed by a complete horizontal add/subtract across all elements, then optionally added to an accumulator register (again, hypothetical future thinking). This would provide extended precision across (AVX-512) 16/8 multiplies and 15/7 adds/subtracts (or 16/8 with an accumulator register). While you might argue that the execution time would be approximately the same, the additional precision might be desirable (though you will also receive complaints that the results differ).

Jim Dempsey

As you were advised when you posted on gcc-help, either of the icc options -fp-model precise or, in the latest version, -fprotect-parens will observe the parentheses, as would gcc -std=c99. Given that both gcc and icc require non-default options for your purpose (and gcc will require the default -fno-fast-math as well), you are going against the flow in implying the default should change. You saw on gcc-help that there are many who prefer the K&R traditional treatment, where parentheses are ignored when not algebraically required.

Two reasons you might prefer to prevent FMA in this situation:

1) One of the multiplications is rounded immediately to the declared precision; the other rounding is postponed to the end.

1a) In the case (a*a - b*b) you may get a non-zero result when a == b. To avoid that, use (a+b)*(a-b).

2) It probably takes longer with the FMA, as the two multiplications (or, for 1a, the add and subtract) could have been performed in parallel, or at worst pipelined at a distance of a cycle or no more than what the data access requires.

I see MSVC++ going to greater lengths to avoid using FMA than most compilers do, even to the point of violating C/C++ expression evaluation rules, which it doesn't normally do. I only recently saw how to get FMA in MSVC++ (/arch:AVX2 /fp:fast, VS2013/2015).

>>parentheses are ignored when not algebraically required

IMHO, parentheses should always be obeyed (unless expressly stated otherwise). Parenthetical use should not be demoted to mere algebraic equivalence. Floating point arithmetic itself is not assured to be algebraically equivalent; to wit, (a*a-b*b) and (a+b)*(a-b) are algebraically equivalent but are not necessarily floating point equivalent.

I'd rather the "go with the flow" would have not taken the path it did.

Jim Dempsey

Thank you both for your help. It does seem like I am going against the flow, but I agree with Jim that it is the flow that should reconsider ;)

>> 1). one of the multiplications is rounded immediately to the declared precision; the other rounding is postponed to the end.

Tim hit the nail on the head here. This simple fact - a result of the introduction of FMA - destroys the algorithm. Let me explain a bit better. As you know, in the orientation test the sign of the returned value tells us whether point m is to the left or to the right of the line going through points a and b. For given m, a, b, depending on the order in which you pass a and b to orient_test_2d, you will get either a positive or a negative number. The trouble starts for points that lie 'on' (or, in IEEE arithmetic, close to) the line. Assuming that the parentheses are honored and FMA is not used, the same fp instructions are performed in the same order and on the same arguments regardless of the order of a and b. Only in the final subtraction is the order of the arguments reversed. Hence, it is true that

    orient_test_2d(m, a, b) == -orient_test_2d(m, b, a)

because

    x-y == -(y-x)

Introduction of FMA destroys the symmetry and results in different roundoff errors in the two scenarios. As a result, for points 'on' the line you can often get two negative, or two positive, answers. In my code I do not really care about the 'true' location of m (that can be solved, e.g., using Shewchuk's exact predicates). I do however rely on the assumption that getting the same sign is impossible unless m lies exactly on the line. Note that the absolute accuracy of the FP type does not matter (the whole reasoning works just as well for floats). What is important is the predictability of the order of the performed fp ops and of their arguments.

In my particular case I have found the problem, and thanks to your suggestions I can fix it using non-standard compiler flags (!). I could use non-standard compiler-specific attributes to enforce this, or, as Tim has suggested, move this function to a different file compiled with different flags. But this makes me wonder in how many other cases bugs like this could arise. Take some code written 30 years ago: a user (me), happily compiling with -O2 -mfma, might be unaware that the semantics of the program have changed. After all, I did not use -ffast-math, and I assume that the compiler is conservative. -mfma for me merely means that the target architecture supports the instruction; in my view (correct me if I'm wrong) I did not permit the compiler to use it in places where it changes the semantics of the program.

Moreover, while the -std=c99 flag helps for gcc (-ansi does the job as well - why does it not follow the K&R treatment?), it does not in the case of icc, which requires different flags. And all of this mess can be avoided in an intuitive manner by preserving the parentheses. As most people know: "A language definition that does not require parentheses to be honored is useless for floating-point calculations." (D. Goldberg). The above is a new example of this, at least new for me.

I may exaggerate just a bit, but why should it be obligatory for programmers/users/scientists to know all the compiler flags of all different compilers in order to get portable correct results?


In the case of gcc, some of these decisions on compile options have been taken after discussion among experts.  The gfortran community did insist that -fprotect-parens should become a default, with the only way to change it being to set -fno-protect-parens.  The gcc/g++ community insisted that they want a wider variety of options which automatically perform algebraic simplification, even in violation of the language standards.  I never found out why the -fprotect-parens option was withheld from gcc/g++ after a short trial when those people objected to it being a default for g++ -ffast-math.

For icc, it seems to be somewhat historical, in that there were some cases when x87 code was the default (as it still is for 32-bit gcc) where strict observance of language standards might pose an intolerable performance obstacle.  The recent addition of -fprotect-parens might show a direction toward changing this.  It confused me too in that some of the Windows C++ Q-prefixed options are prefixed with q for linux, where there is no gcc equivalent (other than in gfortran in this case).

As protect-parens is so new to icc, there are situations which may require further work before it could become a default, such as the observed blocking of simd optimization for STL accumulate() and inner_product().  gcc shares the problem in that -ffast-math is required to perform those simd optimizations and that overrides the -std= adherence to the language standard, with no way to restore it under -ffast-math.  ifort has supported the option -assume protect_parens successfully for several years, followed fairly quickly by gfortran -fprotect-parens, so there is a longer history.  Not long after ifort introduced protect_parens, -fp-model precise was made to observe parentheses for both icc and ifort.

By the way, the icc default treatment of parentheses doesn't work as consistently for algebraic simplification as the equivalents in gcc and other compilers do.  As an example, the Kahan summation algorithm may be considered. Certain compilers may reduce that algorithm to a vectorizable sum without totally breaking it as icc does (without protect-parens), by effectively removing parens in a consistent manner.


>> orient_test_2d(m, a, b) == -orient_test_2d(m, b, a)...Introduction of FMA destroys the symmetry and results in different roundoff errors in both scenarios

IOW FMA can yield orient_test_2d(m, a, b) != -orient_test_2d(m, b, a)

And this may now indicate a latent bug/assumption in your code. Does Tim's #5 1a) resolve the problem? While it will correct for the circumstance of a == b, it might not always work for a ~= b.

And additionally, consider that when "orient_test_2d(m, a, b) != -orient_test_2d(m, b, a)", it is uncertain whether you are on the line or off the line (and which way you are off the line). IOW, you get the uneasy feeling that your older code expressed certainty when it should not have.

This may be a good subject for a PhD thesis.

BTW

Jim Dempsey

a and b are always different - otherwise there is no unique line passing through a and b.

I believe that the FMA destroys an otherwise IEEE-correct algorithm, where I can with certainty say that

Theorem:    orient_test_2d(m, u, v) == - orient_test_2d(m, v, u)

Proof: 

Let m[2], u[2], v[2] be 2D points, where m[0] is the point's x coordinate and m[1] is its y coordinate:

     m - inspected point
  u, v - points that define the line

x and y coordinates will be further denoted as, e.g. mx and my. Writing out explicitly two calls to orient_test_2d with u and v order changed we get

  orient_test_2d(m, u, v) = (ux-mx)*(vy-my) - (uy-my)*(vx-mx)
  orient_test_2d(m, v, u) = (vx-mx)*(uy-my) - (vy-my)*(ux-mx)

Since IEEE arithmetic is commutative (not C itself I guess, but that is not relevant for the problem at hand)

x*y==y*x
x-y==-(y-x)

and assuming that only separate subtract and multiply instructions are used (no FMA), we can safely reorder the expressions without changing the final result:

orient_test_2d(m, u, v) == (ux-mx)*(vy-my) - (uy-my)*(vx-mx)
/* x*y == y*x     */    == (vy-my)*(ux-mx) - (vx-mx)*(uy-my)
/* x-y == -(y-x)  */    == -((vx-mx)*(uy-my) - (vy-my)*(ux-mx))
                        == - orient_test_2d(m, v, u)

Introducing an FMA breaks this symmetry (one of the products is no longer rounded separately), hence the reasoning does not hold.

If you say that C language permits it, and that the compilers are allowed to change the semantics of a FP code unless some sort of 'fp-model strict' flag is given, then I of course accept it. But I am still a bit bothered by this.


I'm somewhat concerned about your assertions on IEEE arithmetic, as FMA has also been adopted as an IEEE standard.  Until the associated recommended IEEE facilities become reflected in C standards and accepted into general use, that sort of assertion requires reading between the lines.

A follow-on to one of my replies elsewhere implies that the latest C standards have relaxed the requirements on observation of parentheses which came with C89.  I'm not qualified to judge that.

You might find section 7.2.3 ("Expression Evaluation") of the book Handbook of Floating-Point Arithmetic by J-M Muller et al useful. It discusses the interaction between contracting expressions and the FMA instruction including an example involving sqrt(a*a-b*b). There is a #pragma (#pragma STDC FP_CONTRACT ON/OFF) which controls this but I have no knowledge or experience with it. In particular, I don't know if icc or gcc support it properly.

>> I'm somewhat concerned about your assertions on IEEE arithmetic, as FMA has also been adopted as an IEEE standard.

Right. Now I am also concerned. To summarize: the first thing I need is a means to stop the compiler from performing fp contraction. I understood that this is best done using compiler-specific flags for a separate compilation unit. I read that Visual Studio can do #pragma fp_contract(off), and gcc could get away with an __attribute__((optimize(...))), but in this case icc can do neither (?), so a separate unit is a better choice. For gcc I could use -std=c99 or -ansi, but that only works because fp contraction is off by default, due to the lack of support for the ISO C #pragma STDC FP_CONTRACT OFF (gcc Bug 37845). So for future compatibility I also need to use the now-ignored pragma. For icc I should probably use -fp-model precise. Makes me wonder if it is not simpler to turn off FMA altogether (~ -mno-fma), since I have a separate compilation unit anyway. Does all this sound reasonable to you?

The above approach with compiler options and pragmas will indeed work in this particular case, since for the correctness of my code it is enough that only separate multiply and subtract operations are used, and no FMA. Then there is the more interesting and general problem of observing the parentheses, which is critical for other FP algorithms, like the Kahan summation that Tim mentioned. icc provides the (obvious and very useful) -fprotect-parens option, which - it seems - nobody else has. Why not is a mystery. But since this is the case, is there _any_ way in which I can control the arithmetic operation order in standard C (beyond operator associativity), or do I have to program this in intrinsics/assembly?
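In practice, I suppose the separate compilation unit would be built along these lines (a sketch; orient2d.c is a hypothetical file name, flags as discussed above):

```shell
# Keep the predicate in its own file and compile just that file
# with contraction disabled; the spelling differs per compiler.
icc -O3 -march=core-avx2 -fp-model precise -c orient2d.c    # icc
gcc -O3 -mfma -std=c99 -ffp-contract=off -c orient2d.c      # gcc
```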


I hope that if gcc does decide in future versions to relax the current compliance of -std=c99 after possible introduction of #pragma FP_CONTRACT that plenty of notice is given.

I believe Intel compiler developers did give consideration to the situation of gcc/gfortran when introducing -fprotect-parens to icc. It has been more difficult than was thought at one time to improve compatibility between various compilers in these matters.

I think that compiler writers should sit down and reflect a bit on what they are doing -- they have obviously started behaving as if generated code performance is the only possible benchmark of compiler quality.

In other words, they got carried away trying to be more clever than the programmer who actually wrote the code and are now doing optimizations which have little to no performance benefit and might even have detrimental effects on code correctness.

Consider the following example:

#include <iostream>
#include <vector>

int main(int argc, char *argv[])
{
    int const dimension = 2;
    int const number_of_nodes = 20;

    std::vector<double> x(number_of_nodes*dimension);

    for (int n=0; n<number_of_nodes; ++n)
    {
        x[0+dimension*n] = 0.01*n;
        x[1+dimension*n] = 0.02*n;
    }

    for (int i=0; i<40; ++i)
        std::cout << "x[" << i << "]: " << x[i] << std::endl;

    return 0;
}

When you compile this with C++ XE 2016, the first loop gets fully unrolled and the initialization of array x is replaced with a bunch of MOVSD instructions filling it with double values that are pre-calculated at compile time. My questions are:

1. What exactly is saved by avoiding 40 multiplications?

2. Why use scalar moves, pre-calculated values, and full unrolling instead of vectorized multiplication and vector moves?

3. What happens if the executable's .rdata section gets corrupted (e.g., a bit flipped here and there in some of the pre-calculated values due to failing RAM or storage)? As you know, executable checksums and digital signatures are not enforced -- the executable itself would need to validate its own signature or checksum.

Not only is there no benefit in such an "optimization", but it may even cause harm -- if the calculation were done at runtime as the programmer requested, there would be a much lower chance of initializing x with corrupt data.

Conclusion: Do not become so arrogant to assume that you know my intent or my requirements better than I do.

Just my 0.02c.

A similar problem occurs when the compiler decides to "demote" FMA intrinsic operations to separate Multiply and Add instructions.  

In one example, a loop with 48 FMA intrinsics was compiled into code with 36 FMA instructions, 12 Multiply instructions, and 12 Add instructions.   The 36 FMA instructions happened to correspond to the first 36 FMA intrinsics, so presumably the compiler made the change because it computed a lower overall latency by hoisting the multiplies upstream and only exposing the 3-cycle Add latencies at the end of the loop body.

This is extremely counter-intuitive!  It also means that the accuracy of the code depends on whether the "challenging" rounding cases occur in the first 3/4 of the operations or in the last 1/4, which is not something that I would want to have to analyze in detail...

"Dr. Bandwidth"

Good points, Igor. In some cases the scalar moves will be slower than the vectorized multiplication. As for corruption of the data by failing RAM, I would not think that would be your biggest concern: corruption of the code, which is much larger in this case, would have a higher probability of occurring. Intentionally modifying the data is a different story (e.g., to obfuscate what your code is actually doing).

Back in the late 1960s, I came across a case of a FORTRAN II program calling a subroutine with a literal (arguments are passed by reference) where the called routine would alter the value referenced by the argument and then return. The "hack" changed the value of a literal(!); a subsequent call to his proprietary code would then use the modified literal. IOW, reading the source code says one thing; execution (after the literal modification) does a different thing. In this case the alteration of the literal was intentional (for obfuscation purposes).

John, et al,

I think the main problem is a "stuck in a rut" problem whereby:

an approximation generated with one set of options (or compiler version) being not equal to an approximation generated with a different set of options (or compiler version) is considered an error.

IOW, a difference in approximations is considered an error;
IOW, a difference in errors is considered an error.

In John's example, the demotion for speed may result in lower accuracy (what do you expect when you instruct the compiler that you want it fast?). Well-written code should be able to work with approximations.

Back in the days of my college education, we used slide rules to compute the answers to our test questions. Now there is an instrument of approximation if there ever was one. They were used successfully to solve most engineering problems through the first half of the 20th century.

Jim Dempsey
