I have a reactive transport model that utilizing pardiso as the solver. The model is discretized by control volume method and solved by newton iteration method. I do sybmolic factorization at first, then do numerical factorization and substitution in every time step. The model can work but I do have a question on the speedup problem.
The parameter for pardiso are as follows:
iparm = 0
iparm(1) = 1 ! no solver default
iparm(2) = 3 ! fill-in reordering from METIS ,0-MIN DEGREE, 2-METIS, 3-OPENMP VERSION
iparm(3) = 0 ! numbers of processors. Input the next call mkl_set_dynamic(0), mkl_set_num_threads(n);
iparm(4) = 61 ! 0-no iterative-direct algorithm; 10*L+K, K=1 CGS, K=2 CGS for symmetric, 1.0E-L: stopping criterion
iparm(5) = 0 ! no user fill-in reducing permutation
iparm(6) = 0 ! if == 0, the array of b is replaced with the solution x.
iparm(7) = 0 ! Output, Number of iterative refinement steps performed
iparm(8) = 9 ! numbers of iterative refinement steps, must be 0 if a solution is calculated with separate substitutions (phase = 331, 332, 333)
iparm(9) = 0 ! not in use
iparm(10) = 13 ! Default value 13, perturbe the pivot elements with 1E-13
iparm(11) = 1 ! use nonsymmetric permutation and scaling MPS
iparm(12) = 0 ! not in use
iparm(13) = 1 ! maximum weighted matching algorithm is switched-on (default for non-symmetric)
iparm(14) = 0 ! Output: number of perturbed pivots
iparm(15) = 0 ! Output, Peak memory on symbolic factorization.
iparm(16) = 0 ! Output, Permanent memory on symbolic factorization. This value is only computed in phase 1.
iparm(17) = 0 ! Output, Size of factors/Peak memory on numerical factorization and solution.
iparm(18) = 0 ! Input/output. Report the number of non-zero elements in the factors. >= 0 Disable reporting.
iparm(19) = 0 ! Input/output. Report number of floating point operations to factor matrix A. >= 0 Disable reporting.
iparm(20) = 0 ! Output: Numbers of CG Iterations. >0 CGS succeeded, reports the number of completed iterations.
iparm(24) = 1 ! Parallel factorization control, 0: classic algorithm, 1: two-level factorization algorithm, improve scalability on many threads.
iparm(25) = 0 ! Parallel forward/backward solve control. 0: Use parallel algorithm for the solve step; 1: Use the sequential forward/backward solve.
iparm(27) = 0 !check matrix error, 0-without check, 1-check
maxfct = 1
mnum = 1
nrhs = 1
error = 0 ! initialize error flag
msglvl = 0 ! print statistical information
mtype = 11 ! real unsymmetric
First, I do symbolic factorization with phase = 11 (One time)
phase = 11 ! only reordering and symbolic factorization
call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm, msglvl, ddum, ddum, error)
Then, in every time step, I do newton iteration. For every newton iteration, I do numerical factorization and substitution:
phase = 22 ! only factorization
call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm_in, msglvl, ddum, ddum, error)
phase = 33 ! only substitution
call pardiso (pt_in, maxfct, mnum, mtype, phase, n_in, a_in, ia_in, ja_in, idum, nrhs, iparm, msglvl, b_in, x_out, error)
The matrix is 2573775*2573775 with nozero entry of 12810555.
When I use 4 processors, the time consuming is as follows:
TimeStep NewtonIteration NumericalFactorization Substitution
1 1 2.673447 0.253894
1 2 0.000024 0.614713
1 3 0.000023 0.595046
2 1 0.000023 0.619408
2 2 0.000023 0.613553
My question is: Why does substitution cost so much time after the first step? In common sense, time consuming of substitution is much less than numerical factorization.
System is WIN7 X64
Compiler: Intel Parallel Studio XE 2013