advice for accuracy loss on vector subtraction

advice for accuracy loss on vector subtraction

Dear all,

In my code, I am doing some vector subtractions in the form of

t = t - alpha * v

where t and v are vectors and alpha is a scalar. I am using daxpy for this however for some problems the norm of the resulting vector t, differs significantly than the value of the norm computed by MATLAB. Well, I tried almost everything I know to overcome the problem, but there is no success so far. What could be the reason of this problem?

Any pointers are appreciated.

20 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Dear Umut,

While it is difficult to answer without actual dataset, the typical reasonfor accuracy loss in such operation is thecatastrophic cancellation. It happens when t and alpha*v have the same sign and are very close to each other.

Known recipes are as follows:
1) The use of fused-multiply-add (FMA) - it ensures no intermediate roundings, and hence the most accurate results. Downside could be significant performance penalty if the underlying hardware doesn't support FMA
2) Compute alpha * v with some extra precision, e.g. using Dekker's algorithm for exact FP multiplication

More details about cancellation can be found, for example, here

Please let me know if that helps,

Quoting Sergey Maidanov (Intel)

While it is difficult to answer without actual dataset, the typical reasonfor accuracy loss in such operation is thecatastrophic cancellation. It happens when t and alpha*v have the same sign and are very close to each other

Known recipes are as follows:
1) The use of fused-multiply-add (FMA) - it ensures no intermediate roundings, and hence the most accurate results. Downside could be significant performance penalty if the underlying hardware doesn't support FMA
2) Compute alpha * v with some extra precision, e.g. using Dekker's algorithm for exact FP multiplication

More details about cancellation can be found, for example, here

Please let me know if that helps,

Dear Sergey,

For 1)

I used the -fma option to the compiler but it did not make any significant changes. Well, my cpu is an Intel Core 2 duo which is not that new, so I am not sure if the fma is supported on this cpu. However, it also did not slow down the program that much, at least timings are still fine. My complete options to compiler are

icpc -Wall -g -DNDEBUG -O3 -xhost -fp-model extended -prec-div -prec-sqrt -fp-speculation safe -fma -no-fp-port -mp1 -pc80

For 2)

Well, I am not sure if I understood you completely, the algorithm given multiplies two numbers right? So are you proposing to use this algorithm to multiply each element of v by alpha?

Thanks for your interest, because I am really stuck for a long time, but at least now, I know the source of the problem is vector subtraction.

Some extra information for accuracy loss: I looked at my algorithm and some points where accuracy could be lost in addition to the above:

a.) I am scaling vectors, with reciprocals, such as using dscale in mkl, with '1.e0/scalar', can this be a problem, I am guessing that you would scale the vector with the above multiplication. For instance a C++ template is, mkl_scale( vec, 1.e0/scalar ) which calls dscale in mkl.

b.) I am using some dot products, and taking the square roots of the results of the dot products as the 'scalar' in (a).

Any ideas are appreciated, Thanks in advance.


It's difficult to guess the effects of combining options of Intel compilers from different eras. Don't you get a warning about -mp1?
You would need to use the 32-bit icpc and specify -mia32 rather than -xhost in order to get the effects of -fp-model extended. As you say, there is not yet an Intel core CPU with fma implementation.
mkl dscale would not in any case take account of -fp-model extended.

Dear Tim,

Thanks for the reply. At least, I seem to make some little progress, what I understood from you is

1.) do not use dscale and do the multiplications as Sergey suggested with -fp-model extended with and 32 compiler. For the moment I use a compiler for 64-bit, does not this mean that the significant is 53 bits by defaults and setting the -fp-model extended flag makes the effect to perform intermediate computations with extenden precision? This is a question that puzzles me.

There is also an 'fmal' in the cmath library. Shall I understand that since my CPU does not support FMA this function is also of no use? Can you comment on this?

2.) mp1: I get no warnings. Should I?

3.) After long trial and errors I understood that my accuracy is lost in the vector subtractions(used in orthogonalizations basically), this is a naive question but do you think changing environment and compiling on a newer system(from cpu point of view) can make a difference?

Best regards,

Hi Umut,

For 1)
As TimP indicates it's unlikely that -fma will help as the underlying HW doesn't support FMA. What you can do is to try using math library (LIBM) software implementation for fma().

Here's the snapshot from IntelC/C++ compiler documentation:


Description: The fma functions return (x*y)+z.

Calling interface:

double fma(double x, double y, double z);

long double fmal(long double x, long double y, long double z);

float fmaf(float x, float y, float double z);

I should say that using this software fma() equivalent will certainly have negative performance impact since your system doesn't have HW support for FMA.

For 2)
Under assumption thatthe problem is due to catastrophic cancellation, your source of the problem is in the lack of extra precision when you performs multiply alpha*v[i]. Scaling by using reciprocal 1.0/scalar by itself doesn't introduce large error. The problem is really in subtraction of two floating-point numbers which are close to each other. If alpha*v[i] is exact then subtraction of two numbers which are close to each other will be exact. But if alpha*v[i] is just slightly inexact (e.g. just a few least significant bits in double precision are inexact), the subtraction results in huge amplification of this inaccuracy. That's why you really need to compute alpha*v[i] with extra precision (extra means that you need more accurate bits than it is provided by double precision number, i.e. more than 53 accurate bits).

The simplest way to accomplish this is to use long double type. You will also need to use /Qlong-doubleand /Qpc80switches to enable 80-bit FP.

Based on additional information from you I see that Dekker's algorithm implementation will be the challenge, since alpha by itself is the result of a) and b). Try first the simplest method.

1) Try fma() math library function
2) Try to switch to long double to ensure alpha and alpha*v[i] have extra precision (relative to double)


Dear Sergey,

I hope I am converging. Thanks a lot for the detailed overview. There is another question from my side. I call Intel MKL routines from a C++ code, where the matrices are MTL4 matrices however they give direct pointers to internals of the matrices. I can template the matrices with 'long double's however the MKL routines accept doubles only. Does this flag -pc80 for linux take care of this problem as well?

I will let you know about the results.

Thanks a lot for the comments.


MKL doesn't support long doubles. They are far slower on account of not supporting vectorization. As Sergey mentioned, -pc80 would be required to set the x87 precision mode back to 64-bit, in order for long double or -fp-model extended to give you the increased precision. -fp-model extended allows the compiler to evaluate expressions in long double where convenient, possibly preventing vectorization.
You said originally you were interested in float data types. If that were the case, double normally would give you plenty of protection for extended precision.

Dear Umut,

Unfortunately the answer is negative. -pc80 switch affects x87 code only. MKL routines typically rely on SIMD (SSE2/3/4/AVX, etc) which is not controlled by -pc80 switch.

My assumption is that you don't need to rewrite everything in long double. You need long double in relatively isolated code only.

Before going to long double option have you tried the use of fma() from LIBM? This is probly the least effortful thing to try.

Yes, please let me know your results.


Dear TimP and Sergey,

I am not sure if I told I used float, I generally always use doubles and that is what I do here as well. My data definitions are all doubles. So what Sergey proposed is obsolete now, unfortunately.

Edit for Sergey,

I am looking into fma stuff however there is sth that puzzles me, should I use fma in long double right, which is logical however my vectors are represented in 'double' s since MKL expects double pointers. Now the question is that how to conform this two if that is what I have to do?

Any further comments?

Best regards,


So here is a bit more information where these two subtraction operations work

1.) Mathematically, I need to do

t = t - alpha*v1 - beta*v2;

In mkl sense, I chain two vector subtractions in a sense that

t = t-alpha*v1;
t = t-beta*v2;

considering the order of left to right association of C/C++.

2. ) I need another projection of the form

t = t - X X^T M t

where X is a dense marix of m X n and M is a sparse matrix of M m X m. What I am doing is also chain the



a.) use sparse symmetric matrix vector multiplication for Mt, and
b.) use dense matrix dense vector multiplication, transposed, for X^T(Mt)
c.) use dense matrix dense vector multiplication for X( X^T(Mt) )
d.) then do the subtraction

So basically these are the points where vector subtractions are directly involved.

I tried fma function in double format with a template like this

int fma_axpy( vector& v1,
vector& v2,
double alpha )
int m = size(v1), k;
for(k=0;k v1[k] = fma( alpha, v2[k], v1[k]);
return 0;

to simulate v1 = v1 + alpha * v2(with a negative alpha while calling the function)

in the subtractions in 1.) and 2.) d.) as explained above, it did not result in an improvement.

Best regards and thanks for the help,

fma() might be sufficint, i.e. double version. Let me explain why.fma(x,y,z) = x+y*z is required (according to the standard) to be implemented in such a way like if all intermediate operations are done in infinite precision. That means that y*z will be computed with as much precision as needed to avoid catastrophic cancellation in the final addition.

[ Assumption is that alpha and v[] are reasonably accurate (more than half bits of mantissa are accurate) ]


I really do not get it, it is still not accurate, btw, (x*y)+z. I mean the vector norms computed in MATLAB and C++ are really different, well in a couple of iterations the relative errors on the norms of the vectors increase significantly. And this is due to the fact that the vectors computed with fma are not the same vectors anymore. MATLAB also uses double precision by default if it is not doing some assembly or more precise tricks internally, I am completely clueless where the problem is, well I lnow the problem but do not know the cure...

Edit: here are two data sets from C++ and MATLAB


7.5193727503693637e+01 4.8828125226718058e-04 4.9239756144253149e-13


7.5193727359034170e+01 4.8828125132781796e-04 1.1648769165048053e-12

First col is the vector norm of a linear system solution

Ax = Mq; // where Mq is a rhs coming from a matrix vector product M*q,

2nd col is

alpha = dot(x, Mq);

and 3rd col is the norm of x_hat given as

x_hat = x - alpha*q

In the first iteration the process starts the diverge for this problem and for other problems a bit later...

It does look like your accuracy is limited by use of float data type where, as you say, matlab would always use double.

I do not understand what you mean now, I never used floats...

Hi again,

Formally, we need to be really careful when we say 'accurate' or 'inaccurate'. WhatI getis that MATLAB and C++ results significantly deviate. It doesn't necessarily mean that C++ code is inaccurate. It also doesn't necessarily mean that MATLAB code is inaccurate. It fairly might be that both are inaccurate. The reason for that could easily be the nature of floating-point arithmetic (catastrophic cancellation) - when computation is numericallysensitive to FPformat(which is the case when catastrophic cancellation happens)then minor deviations in how computation is done may result in significantly different answers.

The first recommendations don't seem help. So the next step that I suggest is to provide us the test case (C++ might be enough as the next step) so that we can reproduce what you see.

Based on the provided test case I will try to do the numerical analysis of the computations and tell you where the inaccuracy comes from.

By the way, when you use original C++ code and fma() based one do the results deviate significantly?


Dear Sergey,

I checked that MATLAB code is more accurate because I can do some structural engineering checks that some values should not be as highly accurate as they should be. Having that said I know that my C++ vectors are. unfortunately not that accurate, in the sense to satifsy this simple check so in brief MATLAB vectors seem to be more closer. This seems like 99 percent true at the moment.

I can send you a personal short report if you like with some resutls that I get.

And for fma results yes.

I deeply appreciate your efforts.

Best regards,

Hi Umut,would you please give us, via private thread, the test example and also results you have.I will pass all of these details to Sergey who will work with this issue.--Gennady

Dear Gennady,

Thanks for the reply.
It is not that easy to send you a test code since that is a combination of many libraries. However, to reproduce the problem, I can send you the vectors in the expression

t = t - alpha*v

where I think the problem starts. I can send the vectors t and v written as binary files, also alpha.

So you can check with these input I think. Where shall send these files?

Best regards,

Hi Umut,

Do I understand you correctly that you solve the system of equations Ax=Mq by calling MKL and with help of Matlab. Then, you calculate alpha once again in Matlab and by calling ddot, and last, make vectors subtraction independently. If so, this explains the difference inf in first column numbers where ||x|| is shown.
But also, if the above scenario is correct the problem description you started with is a bit incorrect. The expression t=t-alpha*v is computed with slightly different t, aplha and v.

The last column of your example shows that x and q are almost proportional and this is a reason why orthogonal vector cannot be found reliably. BTW, should not the last equation go like x_hat=x-alpha*Mq?
I beleive the algarithm you implement works under assumption that x_hat!=0 and for the case when this vector is 0 it should switch to special branch or stopping criterion should be applied. Please, don't expect the vector to be exactly equal to 0 - all you see is just influence of floating point errors. And the difference between 10^-13 and 10^-12 does not look catastrofic.


Dear Victor,

Good catch.

I also realized this yesterday by writing the problem again and preparing data for Gennady Fedorov, it took also some time for me to realize because there are two sides of the problem. One is the structural and the other is the fluid part. The problem does not appear for the structural side but the fluid side.

Apparently, solving K_hat x = Mq and getting x with q = unit_vector / sqrt(unit_vector^T*M*unit_vector) (Btw, this is a Lanczos procedure, where q is the first Lanczos subspace vector and K_hat = K+sigma*M, sigma a proper shift for the eigenvalue problem)

results in a x vector which is a multiple of q and this leads to a problem in the orthogonalization because vectors are collinear. So there is nothing to orthogonalize. This is also evident from the norm as you realized that it is small enough.

I am trying to understand now why they somehow become proportional after the first linear equation solution phase.

Best regards,

Leave a Comment

Please sign in to add a comment. Not a member? Join today