PARDISO results are not identical as repeatly run

PARDISO results are not identical as repeatly run

Dear all,
I am using PARDISO solver of mkl 10.0. I set default parameters and mtype=-2. Each time I run my program, PARDISO gives slightly different results while I think the results must be identical. For example

8.007956138556888E-4 vs.
8.007956138556876E-4

or

6.178758439680843E-4 vs.
6.178758439680833E-4

Could you give me reasons why this slight difference occurs?

Thank you in advance.
Thanh

30 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I was told it was because the multi-threaded solution may complete its
work in a different order each time we solve. Seams like a lame excuse
to me and I still suspect some other issue.

In my application it makes PARDISO difficult to use. The changes in results are within the expected solution tolerance but since PARDISO cannot tell us the expected tolerance, we have use another solver to check! Hence PARDISO is redundant. I am still stuck using the old IMSL sparse solver.

While identical results from run to run are nice, array alignment and how the algorithm is parallelized may influence whether the library function returns the same result run to run. Dynamic parallelism makes that even more of a challenge, that is, yourtasks may be executed in a different order run to run, and that different order of operations influences what the final result will be. Take the simple example of adding the double precision numbers1 + (-1) + 2^-56 in different orders. (1 + (-1))+ 2^-56 = 2^-56, while 1 + ((-1) + 2^-56) = 0 due to the IEEE rounding which occurs with each floating-point operation. 2^-56 is very different than 0, but from a computer arithmetic point-of-view equally correct. The online MKL user's manual has a section where this topic is discussed anddescribes actionsyou cantake to better assure identical results run-to-run. It sometimes even requires running serially (on a single thread) which may negatively impact performance. A relative error in the range of 10^(-13) or 10^(-14) may be completely reasonable for the problem you're running, number of threads you are executing on,and the alignment of your input data.

To Andrew Smith: In your experience,isthe old IMSL sparse solverfaster than PARDISO solver? Thanks

To Shane Story: Thanks for your explanation. Could you please show me where the section in MKL user's manual you mention is?

It varies with the problem but I would say PARDISO is faster even when run single threaded.

To clarify, I was comparing PARDISO with DLSLXG

See the online Intel MKLLinux Users Guide (pdf format), section 8-1.

Hi

We were burnt by this issue (not by PARDISO but by other solvers in MKL) and had to resort to making sure that the array partitions are aligned correctly.

We have found mkl documenation not always helpful; however the array alignment issue in particular, has been discussed in chapter 6 of the current (mkl with compiler 11.1.065) user's guide.

On a different note, I think that if the difference of the order of absolute tolerance (for example, by rule of thumb 1e-14 when using Real(8) numbers, should not affect the results from an engineering point of view. If it does, it indicates that the system of equations is badly scaled or you need to rewrite the equations or you may want to introduce stringent flags, such as fp:precise, for floating pointer operations. We did find such places in our programs and are more watchful since. I suspect dynamic parallelization will place even more demands on writing programs.

Abhi

So we have to supply PARDISO with 16 byte aligned workspace.

How do we supply 16 byte aligned workspace to PARDISO?

I think there is mkl_malloc that may be used.

However, We used a brute force approach i.e. we allocated N+1 elements and pass section 1:N or 2:N+1 as required. We thus use intrinsic function LOC and mod to see if the 16 byte alignment is there or not.

Abhi

In Fortran, you can use !DEC$ ATTRIBUTES ALIGN to specify alignment for a variable.

Retired 12/31/2016

Hi Steve

I think you meant "in Intel Fortran". We were suspecting/fearful of similar issues with libraries on other "odd" platform we support and hence did not use any directives.

Abhi

My inputs to PARDISO are components of a derived type. I specified /align:rec16byte for the file containing the derived type definition but I still got random results changes run to run. I will try LOC to see if alignment is achieved.

I see from using LOC that the derived type variables are not aligned on 16 byte boundaries when using /align:rec16bytes.

Then I tried creating local copies of the data using mkl_malloc but the array descriptor of my local variable still said it had only one element and assigning the copy failed. e.g.

integer p
real*8 d(1)
pointer (p, d)
p = mkl_malloc(1000, 16)
d = myComponent%d

!Now d is only assigned the firat value

/align:rec16bytes controls the relative offset of fields in a derived type (or RECORD) from the base - it does not specify where the whole variable is allocated. In other words. /align:rec16bytes may insert padding inside the structure to realign fields, but that doesn't help if the structure base is not 16-byte aligned.

In Intel Fortran, one uses

!DEC$ ATTRIBUTES ALIGN:16 :: varname

to specify the base alignment of individual variables. These can be local variables or ALLOCATABLE. For example, to declare an array of REAL(8) so that the base is aligned 16-bytes, one could use:

REAL(8), ALLOCATABLE :: X(:)
!DEC$ ATTRIBUTES ALIGN:16 X
ALLOCATE (X(10000))

This assumes you use Fortran ALLOCATE - using other allocation methods, such as mkl_malloc, are not affected by the directive.

Andrew, in the source you provide, there is no way for the compiler to know how big d is. You declared it as one element (better would probably be to use (*)). Assuming you had done so, you could perhaps have written:

d(1:1000) = myComponent%d

but it would be far better if you made d an ALLOCATABLE array and allocated it to the proper size. Then everything works out for size and, if you choose, alignment.

Retired 12/31/2016

I tried using local variables with !DEC$ ATTRIBUTES ALIGN:16 :: varname as you suggested Steve.

I did this for all real and integer variables. I used stack variables rather than allocatable but I confirmed 16 byte alignment.

I did not do alignment for the control variable: type(MKL_PARDISO_HANDLE) :: mklMemoryPointers(64)

Then I do not get the same answers run to run for PARDISO with multithreading.

Then I also copied the mklMemoryPointers to a local aligned variable and still no luck.

Anybody else got this to work?

I could develop a test program with input data files if that would help.

Andrew,

The MKL PARDISO Version is not able to compute the same answers
with multithreading. Our new version PARDISO 4.0 from
www.pardiso-project.org is the only parallel sparse direct solver that
has this kind of functionality. Please take a look at the webpage. This
option is currently only available for symmetric indefinite matrices
and we hope to add this feature for all matrix types.

Olaf Schenk

Hi Olaf, how does this impact performancewhen comparing it todynamically scheduled parallel tasks? And are youcalling only sequential MKL BLAS from within your algorithm?

>Hi Olaf, how does this impact performancewhen comparing it
todynamically scheduled parallel tasks?

It is a completely redesign of the scheduling in the numerical factorzation. The aim was to improve threading scalabilty compared to PARDISO 3.3, and at the same time, we wanted to have identical results on multicores.

>And are youcalling only
sequential MKL BLAS from within your algorithm?
We are using sequential MKL BLAS for each thread because multi-threaded MKL BLAS can not compute identical results.

Hi,

Here is just a quote related to IEEE 754:

Floating point arithmetic is not associative. This means that in general for floating point numbers x, y, and z: In mathematics, associativity is a property that a binary operation can have. ...

(x + y) + z != x + (y + z)

(x * y) * z!= x * (y * z)

Floating point arithmetic is also not distributive. This means that in general: In mathematics, and in particular in abstract algebra, distributivity is a property of binary operations that generalises the distributive law from elementary algebra. ...

x * (y + z)!= (x * y) + (x * z)

In short, the order in which operations are carried out can change the output of a floating point calculation. This is important in numerical analysis since two mathematically equivalent formulas may not produce the same numerical output, and one may be substantially more accurate than the other.

Thus, bit-to-bit results can be computed if and only if all operations are executed in predefined order. It means that compiler must not change this order doing any optimizations.

Also its important to say, that multithreaded calculations cannot guarantee order of calculations because mostly they are dynamically scheduled regardless of each thread has ordered calculations. BTW, OpenMP cannot guarantee that parallel section will use the same team of threads on the second run.

At last, Id add the following statement from What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys.

Many programmers like to believe that they can understand the behavior of a program and prove that it will work correctly without reference to the compiler that compiles it or the computer that runs it. In many ways, supporting this belief is a worthwhile goal for the designers of computer systems and programming languages.

In my opinion, instead of requiring identical results of calculations every application should try to prove that used algorithm is stable, calculated deviation is validated and estimated to be sure that results are correct.

Thanks,
-- Victor

Victor,

Your examples are well-known.

However, it is still important to design parallel algorithms and software that can compute bit-by-bit identical results on multicores. Our software PARDISO from U Basel is such an example.

We are using this software for large-scale 3D seismic inversion and the software stack in this research application is typically highly complex. Bit-by-bit identical results in a threaded environment helps us to find bugs in other parts of this application. So in our case it is very important to have this feature.

I can also imagine that this is important for commerical applications since this feature can decrease the number of questions related to different simulation results on multicore architectures.

Olaf

Olaf,

Let us be honest. Suppose you have some application with parallelization. Any changes in
- compiler options/version
- number of threads
will result in different results.
Therefore, instead ofidentity of results I'd suggest analyzing of correctness of the application.

Thanks,
-- Victor

No. I can not agree.

We have analyzed correctness and stability in scientific computing since decades. It is now time to add bit-by-bit reproducible results on multicores. This will be an important topic in the future.

Olaf

Olaf,

In your version of PARDISO there is the following AD-statement:
   Reproducibility of exact numerical results on multi-core architectures.
   The solver is now able to compute the exact bit identical solution
   independent on the number of cores without effecting the scalability.
But this is just an illusion because running on different number of threads will give different results.
Also for example, results on CPUs with FMAs will be different from ones on CPUs without FMAs.

Which of them is correct or more correct?

So, bit-to-bit reproducible results on several runs on the same machine cannot help here and
says nothing about correctness of obtained results.
MKL as a library just provides stable implementation of algorithms but analyzing and validation of results is a task of higher level approaches.

Thanks,
-- Victor

No one can argue that identical results from run-to-run, on multicoreand from processor type-to-processor type are nice.These arespirited areas of discussion and debate in the parallel and floating point arithmetic computing communities.This propertyis highly sought after by those whowant their "diffs" to always match during the validation cycle. For a library like MKL, what is less clear is the cost of deterministic parallelism to performance ... imagine if you always had to do your computations for something like a matrix multiply in exactly the same order regardless of whether there were 1 or a billion threads, whether your architecture had a fused multiply add (with a single rounding) or separate add and multiply units (which round independently), whether your processor did aligned versus unaligned loads 10x faster, whether x87 precision control is set to double (Windows default) or double-extended (Linux default), what compiler and what optimization options you use, and so on. Historically trying to control (rather than exploit) all of these factors usually impacted performance negatively and significantly. Not to mention the effort required to code and control the order of all of the operations computed. Note also that guarantees of bitwise identical results usually requireformal proof, and modern formal SW verification checkers probably aren't yet mature enough to be used on Pardiso/LAPACK-like algorithms just yet.There is no doubt, this will continue to be an important topic for the future.

I guess, its just nave expecting identical results from two different floating-point implementations.

Becausefloating-point representations and arithmetic are inexact and

many input values are inherently inexact too, so the question about the output

values isn't whether there is error or identical result as repeatly run, but how much error should be expected.

There are special computer techniques to evaluate and estimate relative accuracy of results and validation of them, such as:

Normalization of data when a range of their exponents is used in calculations (in order to minimize rounding errors)

Estimation of computer operations based of math estimation of corresponding operators

Interval analysis

MKL just could extend the library with additional functions and methods to be helpful for application developers

Thanks,
-- Victor

I read this thread since I am having similar issues myself with Intel Fortran (nothing to do with MKL). There is an important point that I feel might have been missed in these discussions:

Sure, the FP calcs will change from chipset to chipset, with compiler upgrades, with how many cores I use, etc. But I, the user of MKL, have the control over this. But if I understand correctly, you are not giving me NO control over the reproducability of the results from run to run on the same machine.

I am sorry, but in my opinion that sucks bad time. Suppose I am trying to debug a complex engineering calculation with millions of floating point operations. I am trying to track down a bug, which may be in our code, or simply that my customer has set up engineering problem that needs further work (re-scaling; re-posing, whatever). If the numbers change from debug session to debug session, I could be infor a very very hard time trying to track down the issue because the FP goalposts keep changing underneath me.

I guess I am saying that if dynamic multi-threading can change the results, weneed to have an option to turn that off (at the expense of performance). In other words, the users of MKL need to have some control on the reproducability of results.

Thanks,
Tony

Tony, you should have reproducible results if you align your input data on 16 byte boundaries and run MKL in sequential mode (either through setting the number of threads to 1 or linking with the sequential library). These are factors you, as an MKL user, should be able to control.

Thanks Shane - you are correct of course. But the main reason I want to use MKL is to get multi-threading.

So basically we are saying that (in general) MKL is multi-threading in a dynamic way, so reproducable results are not guaranteed on the same machine andthere is not a way of making the multi-threading be the same from run to run.

Its not a nice restriction to have, especially for a numerical library, but at least I understand what is going on. Can you perhaps get someone in the team to write something in the documentation about this? I am sure I am not the only one in the community who might be asking the same questions.

Thanks
Tony

Leave a Comment

Please sign in to add a comment. Not a member? Join today