Pardiso issues (small parallel speedup, direct-iterative solver crashes)

Hi,

I am beta testing Pardiso for Mac OS X. I am doing this on a Mac Pro with two Intel Xeon processors, 4 cores per processor (8 cores total), and 4 GB of RAM. The OS is Mac OS X Leopard 10.5.3.

My matrices are about 50,000 x 50,000 and quite sparse (about 3 million non-zero elements). They arise from nonlinear elasticity problems discretized with the Finite Element Method, and are, in my opinion, a very common kind of matrix that one would use with Pardiso.

*** Issue #1: small parallel speedup

I am able to make the direct solver work. However, the parallel speedup is small. I get about 20% speedup when switching from 1 core to 2 cores. With more cores, I even start getting negative speedup (slowdown). These timings include all three phases (1, 2, and 3).

num cores | solve time (seconds)
======================
1 | 6.93
2 | 5.60
3 | 5.64
4 | 5.74
6 | 7.14
8 | 7.57
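
For reference, the timings in the table above can be turned into speedup and parallel efficiency figures. The following is a minimal Python sketch, not part of the original post; the function name `speedup_table` is made up for illustration, and the numbers are copied from the table.

```python
# Timings copied from the table above: number of cores -> wall time in
# seconds for phases 1, 2 and 3 combined.
timings = {1: 6.93, 2: 5.60, 3: 5.64, 4: 5.74, 6: 7.14, 8: 7.57}


def speedup_table(timings):
    """Return (cores, speedup, efficiency) rows relative to the 1-core run."""
    base = timings[1]
    rows = []
    for cores in sorted(timings):
        speedup = base / timings[cores]   # > 1 means faster than 1 core
        efficiency = speedup / cores      # 1.0 would be ideal scaling
        rows.append((cores, speedup, efficiency))
    return rows


rows = speedup_table(timings)
for cores, speedup, efficiency in rows:
    print(f"{cores} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

A speedup below 1.0 in the last rows corresponds to the slowdown described above for 6 and 8 cores.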

I always set the environment variable MKL_NUM_THREADS to the desired number of threads, and I am passing the same value for iparm(3). My computer is lightly loaded - I am not running anything else other than Pardiso.

It would be very helpful if Intel provided sparse test matrices of non-trivial size (e.g., 50,000 x 50,000), together with performance numbers that Intel obtained with those matrices (most notably, the parallel speedup as a function of the number of cores), so that one can roughly know the expected speedup before committing to coding the interface to the solver.

I got excited about Intel's MKL Pardiso because of all the speedup claims, but I see little speedup in my case. It would be good to hear what speedups other people are getting, or whether there are any tricks to make it faster.

*** Issue #2: direct-iterative solver crashes

I am unable to make the direct-iterative solver work. It crashes inside the solver routine:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0xa0b5ed80
0x904c8150 in strlen ()
(gdb) where
#0 0x904c8150 in strlen ()
#1 0x000f1034 in pardiso_open_ooc_file_ ()
Cannot access memory at address 0xfdf8b55c

The out-of-core parameter iparm(60) is set to zero (I have no intention of using OOC), so it makes no sense that an OOC routine should be called. I just spent 5 hours trying many different settings of the iparm values; I always get the crash. I am compiling with the GNU C/C++ compiler, in 32-bit mode, with LP64. I first call Pardiso with phase=11, then attempt to call it repeatedly with phase=23 (loading a different matrix each time, with the same sparsity pattern). I get the crash the first time I call it with phase=23. The crash appears to occur after Pardiso has internally computed the first factorization (judging from the elapsed time before the crash). I am using iparm(4)=62. My matrices are symmetric, and I am only passing the upper triangle, including the diagonal (and all diagonal elements are set).
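
When debugging a crash like this, one thing that can be ruled out independently of the solver is the input structure itself. The sketch below is a hypothetical Python helper (the name `check_upper_csr` and the checks are illustrative, not part of MKL) that tests the layout described above: one-based compressed row storage of the upper triangle, with the diagonal stored first in each row and column indices in increasing order. It assumes the default one-based indexing, and passing these checks would not by itself explain or rule out the crash inside the OOC routine.

```python
def check_upper_csr(n, ia, ja):
    """Return a list of problems found in one-based upper-triangular CSR arrays.

    n  : matrix dimension
    ia : row pointers, length n + 1, one-based
    ja : column indices, one-based
    """
    if len(ia) != n + 1:
        return [f"ia has {len(ia)} entries, expected {n + 1}"]
    problems = []
    if ia[0] != 1:
        problems.append("ia[0] should be 1 for one-based indexing")
    if ia[n] - 1 != len(ja):
        problems.append("ia[n] - 1 should equal the number of stored entries")
    for row in range(1, n + 1):
        start, end = ia[row - 1] - 1, ia[row] - 1  # zero-based slice into ja
        cols = ja[start:end]
        if not cols or cols[0] != row:
            problems.append(f"row {row}: first stored entry is not the diagonal")
        if any(b <= a for a, b in zip(cols, cols[1:])):
            problems.append(f"row {row}: column indices are not strictly increasing")
        if any(c < row or c > n for c in cols):
            problems.append(f"row {row}: column index outside the upper triangle")
    return problems


# Upper triangle of a 3 x 3 symmetric tridiagonal matrix.
good = check_upper_csr(3, [1, 3, 5, 6], [1, 2, 2, 3, 3])
# Same pattern with the first row's column indices swapped.
bad = check_upper_csr(3, [1, 3, 5, 6], [2, 1, 2, 3, 3])
print(good)
print(bad)
```

Running a check of this kind once on the sparsity pattern, before the phase=11 call, is cheap compared with the factorization.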

It would be very helpful if a code example was provided illustrating the direct-iterative solver.

Also, the Pardiso section of the manual is currently not always easy to read. More space between the iparm() paragraphs would help.


Hi,
I see this is a fairly old post, but I am encountering exactly the same problems while attempting to use the parallel solver, so I thought I should post here first, before considering starting a new thread.
(In my case I am solving FE problems, also under Mac OS X but with IFORT, on a Core 2 Duo iMac.) I have implemented the Pardiso solver with a set-up similar to yours (doing phase 11 once, and then iterating phase 23), and I'm very disappointed to see a slight slow-down compared to the very old umfpack 2.2.1 version I was using before. I was hoping to get at least a speed-up of about a factor of 2 by using 2 CPUs, but I am getting virtually no speed-up at all (which effectively means a slow-down by a factor of 2, as I need almost the same clock time on both processors, compared to just one processor before).

Did you get this issue resolved? Are there problems for which some kind of message passing makes the parallel Pardiso unsuitable? One potentially related setting is that I use -framework Accelerate to link to the most optimized BLAS and Lapack libraries on the Mac. It is not necessary to link to the MKL versions, is it?

Other than that, my compilation flags/linking is: -O3 -m64 -xssse3 -framework Accelerate -fast-openmp -I/Library/Frameworks/Intel_MKL.framework/Versions/Current/include -L/Library/Frameworks/Intel_MKL.framework/Versions/Current/lib/em64t/ -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lpthread -L/opt/intel/Compiler/11.0/064/lib/

Any help would be appreciated.

Hi,

Thank you for the question. Let us clarify the problem. As you know, PARDISO solves the task in 3 steps: reordering, factorization and solution. Most probably, in your version of MKL the reordering step was not parallelized yet, so only the factorization phase could give a speedup on many threads. The test case described in the first post deals with fairly dense matrices: 3,000,000 non-zero elements for a size of 50,000 x 50,000. For such tasks the reordering stage could take more than half of the total time. To verify this assumption, could you provide the times that PARDISO spent on the reordering, factorization and solving steps separately? The MKL version and the reordering method used would also be helpful.

- Sergey
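
The "more than half" estimate in the reply above can be checked roughly against the 1-core and 2-core timings from the first post. The sketch below is an illustration only: it assumes the simple Amdahl model, in which a fixed fraction of the single-thread time does not parallelize at all, and uses the Karp-Flatt formula to estimate that fraction. The function names are made up for this example. The model cannot account for the slowdown seen at 6 and 8 cores, which would point to additional overhead.

```python
def serial_fraction(t1, tp, p):
    """Karp-Flatt estimate of the non-parallel fraction.

    t1 : wall time on 1 thread
    tp : wall time on p threads
    """
    speedup = t1 / tp
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


def amdahl_speedup(f, p):
    """Speedup predicted by Amdahl's law for serial fraction f on p threads."""
    return 1.0 / (f + (1.0 - f) / p)


# Timings from the first post: 6.93 s on 1 core, 5.60 s on 2 cores.
f = serial_fraction(6.93, 5.60, 2)
print(f"estimated serial fraction: {f:.2f}")
print(f"upper bound on speedup under this model: {1.0 / f:.2f}")
```

Under these assumptions, roughly 60% of the single-thread time would be serial, which is consistent with a reordering phase that is not parallelized and dominates the run. Separate per-phase timings, as requested above, would be needed to confirm this.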

...I am encountering the same issue (no speedup on multiple cores) when using PARDISO to solve symmetric complex matrices. I'm using 10.2.1.019.
