Question about parameter setting of PARDISO

Question about parameter setting of PARDISO

Dear all,

I use the MKL under WIndows 7 64bit + VS 2012 + MKL 11.0 or 11.2.

I need to solve very large FEM matrix by PARDISO. Currently I set parameters as follows:

  iparm(1) = 1 ! no solver default
  iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm
  iparm(4) = 0 ! no iterative-direct algorithm
  iparm(5) = 0 ! no user fill-in reducing permutation
  iparm(6) = 0 ! =0 solution on the first n compoments of x
  iparm(8) = 9 ! numbers of iterative refinement steps
  iparm(10) = 13 ! perturbe the pivot elements with 1E-13
  iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
  iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy
  iparm(18) = -1 ! Output: number of nonzeros in the factor LU
  iparm(19) = -1 ! Output: Mflops for LU factorization
  iparm(20) = 0 ! Output: Numbers of CG Iterations
  iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process
  iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm

The peak memory usage would reach to 20 GB (windows task manager->commit size). Is there any better parameter settings to decrease memory usage but don't increase the solution time significant?

Thanks,

Zhanghong Tang

 

31 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Hi Zhanghong,

Can you check the out-of-core solver for the pardiso?
iparm(60): switches between in-core (IC) and out-of-core (OOC) Intel MKL PARDISO. OOC can solve very large problems by holding the matrix factors in files on the disk, which requires a reduced amount of main memory compared to IC.

Thanks,
Chao

 

Dear Chao,

Thank you very much for your kindly reply. I have tested the OOC and it is very very slow. The most important is that I have tens of processes call the PARDISO at the same time and the OOC would lead to very large disk usage.

I wish to decrease memory usage of every process and then I can launch more processes in every machine (48 GB memory / machine and I wish to launch 3 processes in every machine).

Thanks,

Zhanghong Tang

Hi Zhanghong,

May I ask you what iparm(15), iparm(16), iparm(17) and iparm(63) returned by PARDISO after reordering step and what value of MKL_PARDISO_OOC_MAX_CORE_SIZE have you set? I ask because OOC PARDISO algorithm is greedy and use all available memory to improve performance. So, if you want to improvement solution time of OOC PARDISO just increase this parameter. Also, i see that you switch on two-level algorithm of factorization - in current version of OOC PARDISO this mode doesn't support.

Thanks,

Alex

 

Dear all,

Thank you very much for your kindly reply. After my several days' checking, I found the problem is from me. The filled matrix is not valid.

Now I have another question: for some large problem the memory is not enough (but I don't want to use the OOC), I can use the CGS by setting iparm(4) to non-zero, right? So how to setup this parameter? I have the following code, is it correct?

 

phase = 11

call pardiso(...)

phase = 22

call pardiso(...)

if(error=-2)then  ! not enough memory

    iparm(4) = 72

    call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

 

Thanks,

Zhanghong Tang

 

 

Furthermore, can I decide whether to use direct method or not by iparm(15), just like this?

phase = 11

call pardiso(...)

phase = 22

if(iparm(15)>16777216)then   ! 16 GB: 16 kB * 1024 * 1024
    iparm(4)=72
endif

call pardiso(...)

if(error=-2)then  ! not enough memory

    iparm(4) = 72

    call pardiso(...)

endif

phase = 33

call pardiso(...)

phase = -1

call pardiso(...)

 

However, I also noticed iparm(20), when the cgs_error is 5 it means factorization is too fast for this matrix. It is better to use the factorization method with iparm(4) = 0

 

How to process this problem?

 

Thanks,

Zhanghong Tang

Dear all,

Is there any suggestion? Currently I did as what I said in my previous post except the following line changed:

  if(max(iparm(15),iparm(16)+iparm(17))>16777216)then   ! 16 GB: 16 kB * 1024 * 1024

 

However, the solution is still done by direct method but the memory usage is larger than 16 GB, usually the peak memory usage reaches to 21 GB.

Could anyone help me to take a look at it?

Thanks,

Zhanghong Tang

Not many of the non-Intel readers here have multiple machines with 48 GB of RAM. If you could scale down the problem to, say, 4 GB, while still seeing the issues that you have mentioned, you would see more answers.

Dear mecej4,

Thank you very much for your kindly reply.

I think it is simple to let it switch to iterative solver by 4 GB limit:

if(max(iparm(15),iparm(16)+iparm(17))>4194304)then   ! 4 GB:  4 kB * 1024 * 1024

    iparm(4)=72

endif

But now the problem is that the solver doesn't work as I set and I don't know what I have missed in parameter setting. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

 

If you provide complete source for an example, I should be happy to look at it. It is, however, not reasonable to expect us to reconstruct an example ourselves, using as basis a piecemeal description of the problem.

Dear mecej4,

Thank you very much for your kindly reply. The following are the complete source code I used to solve my problem. Could you please help me to take a look at it?
 

  subroutine solverall_pardiso(nz,irow,jcol,s,n,nrhs,nzr,irz,jrz,rz,x)
  ! sparse right hand side
  ! CSR format input for PARDISO
  USE mkl_pardiso
  implicit none
  TYPE(MKL_PARDISO_HANDLE), ALLOCATABLE  :: pt(:)
  INTEGER, ALLOCATABLE :: iparm( : )
  integer::maxfct, mnum, mtype, phase, n,nz,nnz,nzr,nrhs, error, msglvl 
  integer,pointer::irow(:),jcol(:),irz(:),jrz(:)
  real*8,pointer::s(:), rz(:)
  real*8,pointer::x(:,:),xi(:),bi(:)
  real*8,allocatable::b(:)
  integer::i,j,ii,ir,jc,idum(1)
  real*8::ddum(1)

  call mkl_set_dynamic(0)          !  prevent Intel MKL from dynamically reducing the number of threads in nested parallel regions. 
  
  maxfct=1
  mnum=1
  idum=0  
  ddum=0.0d0
  ALLOCATE( iparm ( 64 ) )
  do i = 1, 64
     iparm(i) = 0
  end do 
  iparm(1) = 1 ! no solver default
  iparm(2) = 3 ! parallel (OpenMP) version of the nested dissection algorithm
  iparm(4) = 0 ! no iterative-direct algorithm
  iparm(5) = 0 ! no user fill-in reducing permutation
  iparm(6) = 0 ! =0 solution on the first n compoments of x
  iparm(8) = 9 ! numbers of iterative refinement steps
  iparm(10) = 13 ! perturbe the pivot elements with 1E-13
  iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
  iparm(13) = 0 ! maximum weighted matching algorithm is switched-off (default for symmetric). Try iparm(13) = 1 in case of inappropriate accuracy
  iparm(18) = -1 ! Output: number of nonzeros in the factor LU
  iparm(19) = -1 ! Output: Mflops for LU factorization
  iparm(20) = 0 ! Output: Numbers of CG Iterations
  iparm(21) = 1 ! Apply 1x1 and 2x2 Bunch and Kaufman pivoting during the factorization process
  iparm(24) = 1 ! PARDISO uses new two-level factorization algorithm
 
  error = 0  
  msglvl = 0
  mtype = -2  
  !.. Initiliaze the internal solver memory pointer. This is only
  ! necessary for the FIRST call of the PARDISO solver.

  ALLOCATE ( pt ( 64 ) )
  do i = 1, 64
     pt( i )%DUMMY =  0 
  end do
  !.. Reordering and Symbolic Factorization, This step also allocates
  ! all memory that is necessary for the factorization
  phase = 11 ! only reordering and symbolic factorization
  call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  if(error/=0)write(*,*) 'error in solver at step 1, error code = ',error
  if(error/=0)stop 
  
  !.. Factorization.
  phase = 22 ! only factorization
  ! add new feature: check the peak memory usage by pardiso, if larger than available, use iterative instead of 
  ! direct method
  ! Zhanghong Tang @ 09/25/2014
  if(max(iparm(15),iparm(16)+iparm(17))>16777216)then   ! 16 GB: 16 kB * 1024 * 1024
    iparm(4)=72
  endif
  call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  if(error/=0)write(*,*) 'error in solver at step 2, error code = ',error
  if(error/=0)stop 
  
  !.. Back substitution and iterative refinement
  iparm(8) = 2 ! max numbers of iterative refinement steps
  phase = 33 ! only factorization
  if(allocated(b))deallocate(b)
  allocate(b(n))
  b=0.0d0
  do i=1,nrhs
    xi=>x(:,i)
    do j=1,nzr
      if(jrz(j)==i)b(irz(j))=rz(j)
    enddo
    call pardiso (pt, maxfct, mnum, mtype, phase, n, s, irow, jcol, idum, 1, iparm, msglvl, b, xi, error)
    do j=1,nzr
      if(jrz(j)==i)b(irz(j))=0.0d0
    enddo
    if(error/=0)write(*,*) 'error in solver at step 3, error code = ',error
    if(error/=0)stop 
  enddo 
  
  !.. Termination and release of memory
  phase = -1 ! release internal memory
  call pardiso (pt, maxfct, mnum, mtype, phase, n, ddum, idum, idum, idum, nrhs, iparm, msglvl, ddum, ddum, error)
  IF ( ALLOCATED( iparm ) )   DEALLOCATE( iparm )
  IF ( ALLOCATED( pt ) )   DEALLOCATE( pt )
  if(allocated(b))deallocate(b)
  end subroutine

 

You provided only the source for the subroutine. Please provide the source code for a main program that calls this subroutine with the input arguments properly filled in. It takes less effort to let the computer try to run the program and produce some results before one looks at the source code line by line. Secondly, whether a problem occurs depends often on the data on which the program is run.

 

Dear mecej4,

Thank you very much for your kindly reply. The main program only contains several lines-read sparse matrix and the call the subroutine. However, the sparse matrix is too large (more than 1 GB size) to be uploaded.

The code I pasted can work perfectly if the memory is enough. Now the problem is that it can't switch to iterative solver when estimated memory usage exceed the set value. I don't know where I have missed.

In addition, I don't think it requires my complete project (include the complete sparse matrix data) to reproduce this problem: it is a general issue and the problem will occur when the size of sparse matrix reaches to some value, for example, 1000000.

However, if you will still require the matrix, could you please suggest a space and I can upload the sparse matrix to it?

Thanks,

Zhanghong Tang

A short program that generates the matrix elements (by calculation) would work nicely. After all, nobody typed in 1 GB worth of data into a file, did they?

Dear mecej4,

Thank you very much for your kindly reply.

But I am sorry that the program that generates the matrix elements is not short or simple. It is from electromagnetic FEM problem.

On the other hand, as I said before, that's a general problem and it is not related to my special sparse matrix. That is to say, I think the expert will not need the data to debug the subroutine to point out my problem--the problem that set parameters to let the PARDISO choose work by direct method or iterative method.

Thanks,

Zhanghong Tang

 

Dear Intel administrator,

Is there any document/instruction of PARDISO to describe how to use/launch iterative solver when evaluated memory usage larger than the limit set before? For example, how to launch iterative solver (set iparm(4) to non-zero) when max(iparm(15),iparm(16)+iparm(17)) is larger than 16777216 (16 GB)?

Thanks,

Zhanghong Tang

 

Hi,

The main idea of CGS is to use iterative solver for case when we need to solve several systems with precondition as factorized matrix from one system. So it can be used in case of solve several systems only.

I really recommend to set iparm(60) to 1 and MKL_PARDISO_OOC_MAX_CORE_SIZE to memory size that you ready to use for pardiso. In such case PARDISO will choose internally algorithm to use only memory size that set by MKL_PARDISO_OOC_MAX_CORE_SIZE

Thanks,

Alex

Dear Alex,

Thank you very much for your kindly reply. Now the problem is that I will launch 3 or more processes on every machine to solve different systems. The total memory is 48 GB on every machine and I need to solve hundreds of this kind of systems. I am afraid that the write/read large data to/from disk will decrease the speed signification. On the other hand, I am afraid that the disk space is not enough (suppose 40 GB per matrix, the required disk space is at least 40*3 - 48 = 72 GB) since the machines are also used by other people.

I don't know whether the PARDISO do an imcomplete LU factorization according the limit set before and then use CGS to solve the problem.

Thanks,

Zhanghong Tang

Dear Alex,

I tried to use the OOC of PARDISO, but the PARDISO returns error code -10. Could you please help me to take a look at it?

Thanks,

Zhanghong Tang

 

PS: part of the code:

  ...
  iparm(60)=2
   i = MAKEDIRQQ('c:\temp\')
  if(myid<10)then
    write(ss2 , '(i1)') myid
  elseif(myid<100)then
    write(ss2 , '(i2)') myid
  elseif(myid<1000)then
    write(ss2 , '(i3)') myid
  elseif(myid<10000)then
    write(ss2 , '(i4)') myid
  elseif(myid<100000)then
    write(ss2 , '(i5)') myid
  endif
  file_name = 'c:\temp\'//trim(ss2)
  bufLen = len_trim(file_name)
  call mkl_cvt_to_null_terminated_str(buff, bufLen, trim(file_name))
  error = pardiso_setenv(pt, PARDISO_OOC_FILE_NAME, buff)
  open(1,file='pardiso_ooc.cfg',status='new',err=1)  
  write(1,*)'MKL_PARDISO_OOC_MAX_CORE_SIZE = ',16384
  write(1,*)'MKL_PARDISO_OOC_KEEP_FILE = 0'
  close(1)
  call pardiso(...)

 

The program is launched as following command:

mpiexec -wdir "Z:\test" -mapall -hosts 10 n01 2 n02 2 n03 4 n04 4 n05 4 n06 4 n07 4 n08 4 n09 4 n10 4 Z:\fem

where Z: is a mapped driver shared to all hosts.

Hi,

Look's like PARDISO got problem during reading/writing OOC file. Could you set msglvl to 1 and attach output that pardiso return to the screen?

Thanks,

Alex

Dear Alex,

Thank you very much for your kindly reply. The screen output is as follows:

=== PARDISO: solving a symmetric indefinite system ===
The local (internal) PARDISO version is                          : 103911000
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 1.468480 s
Time spent in reordering of the initial matrix (reorder)         : 20.758079 s
Time spent in symbolic factorization (symbfct)                   : 4.008938 s
Time spent in data preparations for factorization (parlist)      : 0.188263 s
Time spent in allocation of internal data structures (malloc)    : 0.310438 s
Time spent in additional calculations                            : 4.130998 s
Total time spent                                                 : 30.865196 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 24
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           2547445
             number of non-zeros in A:      36680014
             number of non-zeros in A (): 0.000565

             number of right-hand sides:    7

< Factors L and U >
             number of columns for each panel: 192
             number of independent subgraphs:  0
             number of supernodes:                    955120
             size of largest supernode:               4429
             number of non-zeros in L:                750724073
             number of non-zeros in U:                1
             number of non-zeros in L+U:              750724074
=== PARDISO is running in Out-Of-Core mode, because iparam(60)=2 ===
Percentage of computed non-zeros for LL^T factorization
*** Error in PARDISO  (read/write OOC data file) error_num= 0

=== PARDISO: solving a symmetric indefinite system ===

Summary: ( factorization phase )
================

Times:
======
Time spent in allocation of internal data structures (malloc)    : 30.196333 s
Time spent in additional calculations                            : -14.900597 s
Total time spent                                                 : 15.295736 s

Statistics:
===========
< Parallel Direct Factorization with number of processors: > 24
< Numerical Factorization with BLAS3 and O(n) synchronization >

< Linear system Ax = b >
             number of equations:           2547445
             number of non-zeros in A:      36680014
             number of non-zeros in A (): 0.000565

             number of right-hand sides:    7

< Factors L and U >
             number of columns for each panel: 192
             number of independent subgraphs:  0
             number of supernodes:                    955120
             size of largest supernode:               4429
             number of non-zeros in L:                750724073
             number of non-zeros in U:                1
             number of non-zeros in L+U:              750724074
             gflop   for the numerical factorization: 639.881780

 error in solver at step 2, error code =          -11

Thanks,

Zhanghong Tang

Just for testing - could you send iparm(60) to 1? I try to check that you correctly read config file. Also, could you set some file name but not only the folder: 'c:\temp\'?
Thanks,
Alex
P.S. Am I correct that you have several process that going to work with one ooc file? It is better to choose different file name for different processes.

Dear Alex,

Thank you very much for your kindly reply. I have set file name not just the folder, the file name is

'c:\temp\'//trim(ss2)

where 'ss2' is a character from 'myid'.

In fact, just now I tested by the following command:

mpiexec -wdir "Z:\test" -mapall -hosts 1 n01 2 Z:\fem

where n01 is just the local machine.

and the PARDISO work successfully.

 

Thanks,

Zhanghong Tang

Good! And what the name of ooc files stored in correspondent machine in such case?

Hi Alex,

Thank you very much for your kindly reply.

The names of these files are:

 Directory of C:\temp

2014/10/02  23:25    <DIR>          .
2014/10/02  23:25    <DIR>          ..
2014/10/02  23:22       196,902,156 0_.ind
2014/10/02  23:22       295,044,736 0_.jal
2014/10/02  23:23     2,147,483,608 0_.lnz
2014/10/02  23:22       147,522,368 0_.lup
2014/10/02  23:22        34,423,440 0_.sin
2014/10/02  23:22        34,423,440 0_.sle
2014/10/02  23:22       202,666,344 0_2.lnz
2014/10/02  23:22                 0 0_3.lnz
2014/10/02  23:22                 0 0_4.lnz
2014/10/02  23:23       195,247,668 1_.ind
2014/10/02  23:22       291,282,696 1_.jal
2014/10/02  23:23       305,741,768 1_.lnz
2014/10/02  23:22       145,641,348 1_.lup
2014/10/02  23:22        34,093,524 1_.sin
2014/10/02  23:22        34,093,524 1_.sle
2014/10/02  23:23                 0 1_2.lnz
2014/10/02  23:23                 0 1_3.lnz
2014/10/02  23:23                 0 1_4.lnz
2014/10/02  23:25                 0 out.txt
              19 File(s)  4,064,566,620 bytes
               2 Dir(s)  54,129,131,520 bytes free

 

But the problem is that the PARDISO returns error -11 when I run by the following command:

mpiexec -wdir "Z:\test" -mapall -hosts 10 n01 2 n02 2 n03 4 n04 4 n05 4 n06 4 n07 4 n08 4 n09 4 n10 4 Z:\fem

Small correction, the initial error is -10, the -11 returned by solving step because factorization have not been competed....My feeling is that something incorrect with sile name - they became the same, but i far from my machine to check it. I understand that i am boring, but it is only way to find the issue - could you call pardiso_getenv routine and pring filename that have been returned by pardiso? To make a proof that name a different?
Thanks,
Alex

Dear Alex,

Thank you very much for your kindly reply. I did as you suggested, the screen output is attached.

 

Thanks,

Zhanghong Tang

Attachments: 

AttachmentSize
Download out_0.txt76.17 KB

Ok, everything looks correct. Let me some time to play with such configuration (several processes per one node with OOC PARDISO on each). Also, just to reproduce it more correctly, could you print here parameters of iparm(15), iparm(16), iparm(17) and iparm(63) after reordering step?

Thanks,

Alex

Dear Alex,

Thank you very much for your kindly reply. I did as you suggested, the screen output is attached.

Thanks,

Zhanghong Tang

Attachments: 

AttachmentSize
Download out_0.txt79.52 KB

Dear Alex,

Is there any progress the investigate this problem?

BTW: I have set MKL_PARDISO_OOC_KEEP_FILE = 0 but I noticed that the OOC files are not deleted after the solver finished (phase = -1). Is there anything missed?

Finally I wish the PARDISO work like this: if evaluated peak memory is larger than the limit set before (for example, 16 GB), the OOC is enabled and the peak memory should not larger than 16 GB, if evaluated peak memory is smaller than the limit, the IC is enabled. For choosing OOC or IC, I think it can be solved by setting Iparm(60) = 1, the key issue is that after enabled OOC, the peak memory should be limitted to 16 GB, is there any additional settings?

Thanks,

Zhanghong Tang

 

Dear Alex,

Is there any progress to investigate this problem?

 

Thanks,

Zhanghong Tang

Leave a Comment

Please sign in to add a comment. Not a member? Join today