Question about multiple RHS

Dear All.

I took the multiple-RHS (mrhs) version of PARDISO and integrated it
into our application, following Kalinkin's instructions.
The solver ran normally and the results were correct.
But the problem is the solve time.
In phase=33, with a single RHS
the solve time was 0.03450 s.
But with the 4-RHS case
the solve time was 0.532629 s.
I expected almost the same run-time,
yet the 4-RHS case was about 15 times slower than
single RHS.
I had heard that we can reduce run-time by using multiple RHS.
I tested some other test cases, but the results were similar.

What overhead factors could lead to such results?
Is it correct that PARDISO provides multithreading in the phase=33 solve part?
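For reference, here is a minimal sketch of what such a phase=33 call with 4 RHS looks like in C (illustrative only, not our production code; variable names are placeholders, and pt/iparm/a/ia/ja are assumed to have already been set up by phases 11 and 22):

    #include <mkl.h>          /* pardiso(), MKL_INT */

    /* Sketch: solve-only call (phase=33) with 4 right-hand sides.
       b and x each hold nrhs columns of length n, stored one after another. */
    static void solve_4rhs(void **pt, MKL_INT *iparm, MKL_INT n,
                           const double *a, const MKL_INT *ia, const MKL_INT *ja,
                           MKL_INT *perm, double *b, double *x)
    {
        MKL_INT maxfct = 1, mnum = 1;
        MKL_INT mtype  = 2;        /* real symmetric positive definite */
        MKL_INT phase  = 33;       /* forward/backward substitution only */
        MKL_INT nrhs   = 4;        /* number of right-hand sides */
        MKL_INT msglvl = 0, error = 0;

        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
                a, ia, ja, perm, &nrhs, iparm, &msglvl, b, x, &error);
    }

(The layout matters: the 4 RHS vectors are passed as one contiguous array of length 4*n, so PARDISO can treat them as a block.)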


Hi Bosun,

Could you provide a bit more detail about your configuration? Namely, what are your MKL version, OS and processor? How many cores are in your system? It would also be good to know the number of equations in your task. With this information we will be able to reproduce the situation and give you appropriate advice.

Thanks,
Konstantin

Dear Konstantin.

The MKL version is the latest one,
the OS is Redhat_AS4_U7,
the CPU is an Intel Xeon 3 GHz with 4 cores,
and memory is 64 GB.

Thank you.

Regards.
B. Hwang

Ok, thank you! And what is the number of equations (roughly: 100, 1000, 10000)?
I'll make a couple of runs on a similar task and update you.

Regards,
Konstantin

Dear Konstantin

We have many cases, from 10,000 up to millions of equations.
Here I tested two cases, with 10,000 and 150,000 equations.

Our application uses an iterative solution method:
we perform reordering and factorization only once,
and then solve the equations repeatedly with changing right-hand sides.
Therefore the solve time is very critical for our application,
and we want to reduce the phase=33 run-time.

As I mentioned above, I use 4 RHS per solve because of memory capacity,
so I set up a new block of 4 RHS before every solve phase.
Apart from that, everything is the same as in the single-RHS case.
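Roughly, our call pattern looks like the following (a simplified sketch, not our actual code; fill_next_rhs_block is a hypothetical stand-in for our own routine that writes the next n*nrhs right-hand-side values, pt is assumed to be a zero-initialized 64-element handle array, and iparm is set up as in the single-RHS case):

    #include <mkl.h>

    void fill_next_rhs_block(double *b, MKL_INT n, MKL_INT nrhs);   /* provided elsewhere */

    /* Sketch: factor once, then call phase=33 repeatedly with fresh RHS blocks. */
    static void factor_once_solve_many(void **pt, MKL_INT *iparm, MKL_INT n,
                                       const double *a, const MKL_INT *ia,
                                       const MKL_INT *ja, MKL_INT *perm,
                                       MKL_INT nrhs, int nsolves,
                                       double *b, double *x)
    {
        MKL_INT maxfct = 1, mnum = 1, mtype = 2, msglvl = 0, error = 0;
        MKL_INT phase;
        double ddum;                   /* dummy b/x for phases 11 and 22 */

        phase = 11;                    /* reordering / symbolic factorization: once */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        phase = 22;                    /* numerical factorization: once */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm,
                &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        for (int k = 0; k < nsolves; ++k) {
            fill_next_rhs_block(b, n, nrhs);   /* new right-hand sides */
            phase = 33;                        /* solve only, reusing the factorization */
            pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm,
                    &nrhs, iparm, &msglvl, b, x, &error);
        }
    }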

Regards.
B. Hwang

Hi,

I have a few more questions to have our input conditions 'aligned':
1) Did you link against the libmkl_intel_thread library, not libmkl_sequential?
2) Did you set OMP_NUM_THREADS to any value?
3) Which type of matrix did you use in your test?

Thanks,
Konstantin

Dear Konstantin

Below are my answers.
1) We linked against the libmkl_intel_thread library,
so reordering & factorization run faster than with a single thread.
2) I set OMP_NUM_THREADS = 4, because our system has 4 cores.
3) A positive definite symmetric matrix.
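For completeness, here is the kind of quick check we can run to confirm how many threads MKL and OpenMP actually see (a hypothetical stand-alone check, not part of our application; mkl_get_max_threads and omp_get_max_threads are standard MKL/OpenMP calls):

    #include <stdio.h>
    #include <omp.h>
    #include <mkl.h>

    int main(void)
    {
        /* Both should report 4 if OMP_NUM_THREADS=4 is picked up correctly. */
        printf("OpenMP max threads: %d\n", omp_get_max_threads());
        printf("MKL max threads:    %d\n", mkl_get_max_threads());
        return 0;
    }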

I also checked what happens when switching the threading library to libmkl_sequential.
The result was that the single-threaded version (libmkl_sequential) is much faster than
the multi-threaded one (libmkl_intel_thread) in phase=33.
I don't understand this situation.

Thanks.

Regards.
B. Hwang

Bosun, I think the fastest way of resolving/reproducing this issue is for you to provide us with the test case and the input data with which you are encountering this problem.
--Gennady

Dear Fedorov,

It is impossible for us to provide the test set,
but I'm sure this is not a test-case problem.
The test cases are just simple positive definite symmetric
matrices which were verified as valid
with another solver and with PARDISO in single-RHS mode.
As you can read in the posts above,
all operations were correct.
The problem is the speed of the phase=33 solve
in multiple-RHS mode.

Have you tried an application like this yourselves?
How was it, was the multiple-RHS solve faster than single RHS?

Thank you.

Regards.
B. Hwang

Hi Bosun,

Indeed, we reproduced the problem you described: the performance of the parallel solution phase with a few RHS is low. We will investigate the problem and try to fix it ASAP.

You may try to improve the performance of the parallel solve phase by increasing the number of RHS (if possible) to 16-32. I hope this will let you reduce the time per computation of 1 RHS.

Thank you,
Konstantin

Hi Bosun,
Thank you for the problem you raised. This issue has been submitted to our
internal development tracking database for further investigation, we will
inform you once a new update becomes available.
Here is a bug tracking number for your reference: DPD200190971
Regards, Gennady

Hi Konstantin.

Thank you for your reply.
I'm afraid this is indeed a problem; I hope you can resolve it ASAP.

And I have a question about increasing the number of RHS.
Our system has only 4 cores.
On this system, is it meaningful to use 16-32 RHS?

Thank you.

Regards.
B. Hwang

Hi Bosun,

Of course, solving more RHS is meaningful, because the parallel solve algorithm (where the N RHS are simply split across K threads) works better when N > K ("better" means that the computational time per 1 RHS decreases). I would say that in this case the threading overhead is not significant, since each thread has more work to do.

It is then only a question of your algorithm: how many independent RHS does it have to solve at once? If it can be 16-32 or even more, please try it.

Regards,
Konstantin

Hi Konstantin

I tested your advice.
But the N > K case is also slower than the 1-RHS case.
I hope the problem will be resolved ASAP.

Thank you.

Regards.
B. Hwang

Hi Bosun,

What do you mean by saying: "But the N > K case is also slower than the 1-RHS case"? Do you mean that the time per 1 RHS is larger in the case of many RHS, i.e. time_1RHS < time_NRHS / N?

Of course, solving N RHS cannot be faster than solving 1 RHS, but the time per 1 RHS (one right-hand side) should be better. Here's an example: I used a Linux server similar to yours and set OMP_NUM_THREADS=4:
1 RHS  - 0.033 sec
32 RHS - 0.175 sec
In other words, in the second case I got 0.0055 sec per 1 RHS, or ~6x scalability. Note that such good scalability was achieved because the more efficient level-3 BLAS is used in the implementation of the N-RHS solving phase (the N RHS vectors are treated as a matrix).

Regards,
Konstantin

Hi Konstantin.

Of course, I understood your explanation.
I mean that the total run-time for N RHS was
longer than that of 1 RHS;
that is, the run-time per RHS wasn't improved by
increasing the number of RHS.
In our example, a 16-RHS test case, the solve times were
1 RHS  - 0.023120 s
16 RHS - 0.479048 s
where 0.023120 * 16 ≈ 0.370 s.
This means 16 RHS is slower than 16 separate 1-RHS solves.

Regards.
B. Hwang

Hi Bosun,

Ok, I'm glad that we both use the same terminology: all is clear here now, thanks. As I mentioned, my timings are a bit better (with 16 or 32 RHS, at least, I observe scalability from 2x to 6x in comparison with 1 RHS). Could you please specify the exact name of the processor (like Intel Xeon CPU 5160) so that I can reproduce the situation where NRHS=16 is slower than 1 RHS?

P.S. As Gennady mentioned, we are working on the tracker DPD200190971 regarding the slow solve phase with a small number of RHS.

Regards,
Konstantin

Hi Konstantin.

Our system is an Intel Xeon 3 GHz, x2 c2, x86_64.
Is that what you wanted to know?

And I have a question!
We have been using the TAUCS solver as our calculation engine.
Even though PARDISO is not running multi-threaded in phase=33,
single-threaded PARDISO is more than twice as fast as the TAUCS solver.

Could you tell me the reason? Is it because of an optimizing compiler,
or some special algorithms?
Please give a brief explanation.

Best regards.
B. Hwang

Hi Bosun,

Could you send me the output of this command (or attach it)?
dmesg | grep CPU | sort -u

Regarding the performance difference with TAUCS - I do not know the exact reason. Probably it comes from the more efficient matrix-vector operations which are implemented in MKL BLAS and used in phase=33. But I'm not 100% sure.

Regards,
Konstantin

Hi Konstantin

The CPU is

Intel Xeon CPU 5160 @ 3.00GHz

Thank you.

Hi Konstantin

I found out how I can improve the performance.
I varied the number of RHS from 16 to 1024 in powers of 2.
Above 32 RHS the speed went up,
and beyond 256 RHS the speed-up saturated.
I think the optimum number of RHS is 256 with 4 cores in
our case.
Finally I got about 3~5 times better performance (per RHS) than with 1 RHS.
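The sweep itself was straightforward; roughly like the following sketch (illustrative only, reusing the same setup and argument list as in the sketches earlier in the thread; dsecnd() is MKL's wall-clock timer, and b and x are allocated for the largest block, n*1024 doubles):

    for (MKL_INT nrhs = 16; nrhs <= 1024; nrhs *= 2) {
        MKL_INT phase = 33, error = 0;
        double t0 = dsecnd();                     /* start timer */
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, perm,
                &nrhs, iparm, &msglvl, b, x, &error);
        double t1 = dsecnd();                     /* stop timer */
        printf("nrhs = %4d : %.6f s total, %.6f s per RHS\n",
               (int)nrhs, t1 - t0, (t1 - t0) / nrhs);
    }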

Can you explain this situation?
Does each core operate with multi-threading?

And why couldn't the cases under 32 RHS get
the speed-up?
Is it because of the bug we detected?

Thanks, Konstantin

Regards.
B. Hwang

Hello Bosun,

I'm glad that you achieved good performance and hope it will be a workaround for you for a while. As I mentioned, increasing NRHS should improve scalability to make it (theoretically) closer and closer to optimal: 4x for 4 cores.

And I want to note that we also improved performance for small numbers of NRHS (4-8). We will notify you in which MKL version the fix will be available. However, you should understand that small NRHS is (in general) less efficient than relatively big NRHS.

Regards,
Konstantin

Dear Konstantin

Thank you for your reply.
And I want to ask one more question:
how can the performance get more than
4 times better (in our case 5.3 times better) with 4 cores?
Shouldn't it be under 4 times?
My co-workers asked me about it, but I couldn't answer exactly.
Can you explain the detailed reasons?

Best regards.
B. Hwang

It is possible because level-3 BLAS (matrix-matrix operations) is used in MKL in the case of many RHS, instead of level-2 BLAS (matrix-vector operations) for 1 RHS. And it is known that level-3 efficiency can reach almost 100% of the hardware peak, due to the dominance of floating-point operations over memory operations. At the same time, level-2 BLAS (e.g. DGEMV) has about the same number of fp- and memory-operations, and can be less efficient if not all data resides in cache, due to the significant latency of memory operations.

Most likely, that is the reason for the super-scalability of the N-RHS solve in some cases.

Regards,
Konstantin

You mean it is because of the difference in memory efficiency and the BLAS routines used?
That is, because we have to allocate and free memory on every solve in the 1-RHS case?
Is my understanding correct?

Regards.
B. Hwang

>> That is, because we have to allocate and free memory on every solve in the 1-RHS case?

Not exactly. When we solve 1 RHS, we treat the RHS as a vector and use matrix-vector (MV) operations to compute the forward and backward substitutions (solve phase). In the case of N RHS, we treat the RHS as a matrix and thus use matrix-matrix (MM) operations.

MM operations are more efficient than MV operations, theoretically (and practically) on modern computers, because memory operations are usually more expensive (time-consuming) than floating-point operations (each core has the same number of fp-units, but memory bandwidth is often limited/shared between cores, and they also share caches and so on). An MM product consists of ~N^2 memory ops and ~N^3 fp ops; MV has ~N^2 and ~N^2 respectively. So we can conclude that the MM product does not depend so much on memory operations (if the computations are implemented in an optimal way, as is done in MKL), while the MV product is limited by memory bandwidth much more.

Regards,
Konstantin
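P.S. As a rough back-of-the-envelope illustration of those counts (not exact figures for PARDISO's kernels): the MM product does ~N^3 fp ops over ~N^2 memory ops, i.e. on the order of N floating-point operations per memory access, while the MV product does ~N^2 over ~N^2, i.e. about one floating-point operation per memory access. That is why the MM kernel can keep the fp-units busy while the MV kernel mostly waits for memory.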

Dear Konstantin

I appreciate your cooperation,
and I'm very happy with our experimental results.
Your advice was very helpful for our work,
and so we could achieve our goal.

Thank you so much, Konstantin!

Best regards.
Bosun Hwang

Dear Bosun,

I also want to thank you for using MKL and for your active participation on the forum and reporting your problems to us! We're always ready to help!

Best regards,
Konstantin

Bosun Hwang, we made some improvements in 10.2 Update 7. Could you please check how it works on your side and let us know the results.
--Gennady
