PARDISO's slow reordering step

PARDISO's slow reordering step

Hi,

PARDISO's reordering step looks very slow. Here's some statistics produced by PARDISO itself:

================  PARDISO: solving a real nonsymmetric system  ================


Summary PARDISO: ( reorder to factorize )
================

Times:
======
      Time fulladj: 0.011505 s
      Time reorder: 47.586437 s
      Time symbfct: 0.035378 s
      Time A to LU: 0.000000 s
      Time numfct : 0.033240 s
      Time malloc : 0.032921 s
      Time total  : 47.733567 s total - sum: 0.034086 s

Statistics:
===========
 < Parallel Direct Factorization with #processors: >         1
 < Numerical Factorization with Level-3 BLAS performance >

 < Linear system Ax = b>  
             #equations:                                     188224
             #non-zeros in A:                                215330
             non-zeros in A (%):                            0.000608

             #right-hand sides:                              0

 < Factors L and U > 
             #columns for each panel:                        128
             #independent subgraphs:                         0
 < Preprocessing with state of the art partitioning metis>
             #supernodes:                                    178361
             size of largest supernode:                      5
             number of nonzeros in L                       212862
             number of nonzeros in U                       4320
             number of nonzeros in L+U                     217182
             gflop   for the numerical factorization:        0.000066

             gflop/s for the numerical factorization:        0.002000

I run eight instances of PARDISO simultaneously on a dual-socket Xeon E5530 machine. Each of them works on a sparse banded matrix (bandwidth=61) similar to what produced the above. Is it normal that METIS takes nearly 50 secs to finish?

10 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Quoting - styc
Hi,

PARDISO's reordering step looks very slow. Here's some statistics produced by PARDISO itself:

================  PARDISO: solving a real nonsymmetric system  ================


Summary PARDISO: ( reorder to factorize )
================

Times:
======
Time fulladj: 0.011505 s
Time reorder: 47.586437 s
Time symbfct: 0.035378 s
Time A to LU: 0.000000 s
Time numfct : 0.033240 s
Time malloc : 0.032921 s
Time total : 47.733567 s total - sum: 0.034086 s

Statistics:
===========
< Parallel Direct Factorization with #processors: > 1
< Numerical Factorization with Level-3 BLAS performance >

< Linear system Ax = b>
#equations: 188224
#non-zeros in A: 215330
non-zeros in A (%): 0.000608

#right-hand sides: 0

< Factors L and U >
#columns for each panel: 128
#independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
#supernodes: 178361
size of largest supernode: 5
number of nonzeros in L 212862
number of nonzeros in U 4320
number of nonzeros in L+U 217182
gflop for the numerical factorization: 0.000066

gflop/s for the numerical factorization: 0.002000

I run eight instances of PARDISO simultaneously on a dual-socket Xeon E5530 machine. Each of them works on a sparse banded matrix (bandwidth=61) similar to what produced the above. Is it normal that METIS takes nearly 50 secs to finish?

Hello

There are several ways to reduce the reordering time. First of all, you can try to set iparm(2)=0 (the minimum degree algorithm is used instead of METIS) or iparm(2)=3 (the OpenMP version of the nested dissection from the METIS package is used). It should be noted that iparm(2)=3 is only available in the latestMKL 10.2 Gold.

The another way is to set iparm(5)=1 anddefine your own permutation vector perm like perm(i)=i, i=1,2,.. , n (e.g. no row and column permutations). In this case, PARDISO will not perform reordering and only sympolic factorization will be done. However it might lead to increasing of time in factorization and solver steps.

All the best
Sergey

Hi

I also have the same problem. The PARDISO solver run very slow on the reordering phase. I observed that only 1 core running during reordering even when iparm[1] = 3 (the index is 1 since I called from C). I'm using Intel MKL bundled with Intel Composer 2013. Is it the MKL gold version?

BR

Bui

 

Hi Bui,

yes, Composer 2013 contains MKL version 11.1. This is Gold version.

very slow -- can you give more details?  Did you compare it with another solver? What is the problem size? is that win or lin? 32 or 64 bit.

--Gennady

Yes, the problem size is 287342. Reordering time takes ~56s, factorization is ~20s and back substitution is ~2s. I set iparm[1] = 3. Live monitoring shows just 1 core running during reordering. The system is Debian 64 bit, number of threads=32.

BR

Bui

P/S: I did not compare with other solver in this example, but benchmark with other problems shows that PARDISO gives significant advantage for smaller problem size.

This is the setting for reordering step

        iparm[0] = 1; /* No solver default */
        iparm[1] = 3; /* Fill-in reordering from METIS */
        iparm[2] = omp_get_max_threads();
        iparm[3] = 0; /* no iterative-direct algorithm */
        iparm[4] = 0; /* No user fill-in reducing permutation */
        iparm[5] = 0; /* Write solution into x */
        iparm[6] = 0; /* Not in use */
        iparm[7] = 0; /* Max numbers of iterative refinement steps */
        iparm[8] = 0; /* Not in use */
        iparm[9] = 13; /* Perturb the pivot elements with 1E-13 */
        iparm[10] = 1; /* Use nonsymmetric permutation and scaling MPS */
        iparm[11] = 0; /* Not in use */
        iparm[12] = 1; /* Maximum weighted matching algorithm is switched-on (default for non-symmetric) */
        iparm[13] = 0; /* Output: Number of perturbed pivots */
        iparm[14] = 0; /* Not in use */
        iparm[15] = 0; /* Not in use */
        iparm[16] = 0; /* Not in use */
        iparm[17] = -1; /* Output: Number of nonzeros in the factor LU */
        iparm[18] = -1; /* Output: Mflops for LU factorization */
        iparm[19] = 0; /* Output: Numbers of CG Iterations */

 

one of the potential problem may be the free memory size available while running the application. 

I checked real and spd type on my side wiht the N ~ 400000 and NNZ ~ 1,5*10^6, double precision.  the RAM ~32 Gb, Linux 46 bit,

here is the snipshots from the log I see on my side:

=== PARDISO: solving a symmetric positive definite system ===
The local (internal) PARDISO version is                          : 103911000

..................

Summary: ( reordering phase )    ...............      Total time spent                                                 : 3.267394 s

 

< Linear system Ax = b >
             number of equations:           400000
             number of non-zeros in A:      1582000             

             number of non-zeros in L+U:              188513125  --- this is may be the reason of swapping .... 

Summary: ( factorization phase )  ...... Total time spent                                                 : 5.000359 s

Summary: ( solution phase )  .......... Total time spent                                                 : 0.150976 s

Phase 33  PASSED 

and finally -- here is the total results.
------------------------------------------------------------------
iparm[14] = 388818
iparm[15] = 450473
iparm[16] = 1478810
iparm[17] = 188513125
iparm[60] = 388818
iparm[61] = 450473
iparm[62] = 8968
PARDISO time = 8.423172

 

 

 

 

 

Hi

I ran the problem on the machine of 256GB RAM. Therefore memory should not be the issue. Nevertheless, I benchmarked the reordering phase with various number of threads and obtained this scalability results:

n=4: speedup = 1.13x

n=8: speedup = 1.07x

n=16: speedup = 1.18x

We could say that PARDISO reordering does not scale well for my case. Does it scale for your example? Can we use a pre-computed permutation vector and still doing the parallel factorization later on?

BR

Bui

 

I'm also struggling with this:

I've got a very sparse matrix that's 40,908,400 rows (yes, 40 million), each row has only 2 non-zero elements. Reordering takes two and a half days while factoring takes under 2 minutes, and solving takes under 15 seconds.

I've tried providing my own permutation vector (1,2,3...N) but find it's no faster than minimum-degree.

Here is some program output... sorry I don't have more details (I'm using it via a foreign-function interface from SBCL lisp)

About to reorder...
Reordering completed.
Evaluation took:
  213982.818 seconds of real time
  213300.374440 seconds of total run time (213118.947102 user, 181.427338 system)
  [ Run times consist of 3.200 seconds GC time, and 213297.175 seconds non-GC time. ]
  99.68% CPU
  427,969,437,050,360 processor cycles
  307 page faults
  1,217,815,040 bytes consed
  

About to factor...
Factoring completed.
Evaluation took:
  143.567 seconds of real time
  1140.287264 seconds of total run time (1092.216259 user, 48.071005 system)
  794.25% CPU
  287,137,733,934 processor cycles
  5 page faults
  818,816 bytes consed
  

About to backsub/solve...
Solve completed...
Evaluation took:
  15.203 seconds of real time
  25.661603 seconds of total run time (24.849553 user, 0.812050 system)
  168.80% CPU
  30,405,905,164 processor cycles
  1 page fault
  98,272 bytes consed
  
Evaluation took:
  214161.200 seconds of real time
  214485.436502 seconds of total run time (214249.725771 user, 235.710731 system)
  [ Run times consist of 5.312 seconds GC time, and 214480.125 seconds non-GC time. ]
  100.15% CPU
  428,326,202,022,888 processor cycles
  313 page faults
  2,529,255,376 bytes consed

 

Hi,

Could I ask you to provide testcase to reproduce the issue? We doesn't expect such behavior

Thanks,

Alex

Leave a Comment

Please sign in to add a comment. Not a member? Join today