threadsafe BLACS/SCALAPACK

Hello, I am writing a hybrid OpenMP/MPI program that will end up calling ScaLAPACK routines from threaded regions (to solve independent problems simultaneously), but I am unable to use the BLACS send/receives from a parallel region without a high (>90%) crash rate. I have successfully used the thread-safe Intel MPI library to do multiple threaded MPI send/receives. Three types of error come up for the same code:

  1. Stall in the 'parallel BLACS' region
  2. Segfault in libmkl_blacs ... + libpthread (see snippet 1)
  3. MPI error (see snippet 2)

 xxx@yyy:~/scalapack> ...
PARALLEL BLACS : 1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libpthread.so.0 00002ADF44483C10 Unknown Unknown Unknown
libmkl_blacs_inte 00002ADF4425C6BB Unknown Unknown Unknown 

 

 ... 
PARALLEL BLACS : 1
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x6107b0, flag=0x7fff4ce1cd80, status_array=0x60fe80) failed
MPI_Testall(123): The supplied request in array element 1 was invalid (kind=15)
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

The source code is attached, with the compile line and the relevant parts of the PBS submission script as a footer.

I am using Intel Cluster Studio 2012, ifort 12.1.0, and Intel MPI version 4.0 update 3; I am unsure which MKL I have (it came with Cluster Studio). I suspect this is a threading issue, since the program works with OMP_NUM_THREADS=1.
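For reference, threaded MPI send/receives like the ones mentioned above are only defined behaviour when MPI is initialised with full thread support. A minimal sketch of that initialisation (program and variable names are illustrative, not from the attached test.f90):

```fortran
program init_threaded_mpi
  use mpi
  implicit none
  integer :: provided, ierr, rank, nprocs

  ! Request full thread support; with Intel MPI this also requires
  ! linking the thread-safe library (the -mt_mpi flag).
  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, ierr)
  if (provided < MPI_THREAD_MULTIPLE) then
     write(*,*) 'MPI_THREAD_MULTIPLE unavailable, provided level: ', provided
     call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
  end if

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  write(*,*) 'MPI RANK : ', rank, ' / ', nprocs

  call MPI_Finalize(ierr)
end program init_threaded_mpi
```

If `provided` comes back lower than MPI_THREAD_MULTIPLE, any concurrent MPI calls from OpenMP threads are undefined regardless of what BLACS does on top.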

Any help would be greatly appreciated! 

Thanks,

Andrew

Attachment: test.f90 (3.65 KB)

Hi, I was able to reproduce your problem.
Link-line:
mpiifort -mt_mpi -openmp -I$MKLROOT/include -check bounds -traceback -g blacs.f90 -o blacs -L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -lpthread -lm

Run:
OMP_NUM_THREADS=1 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 8 ./blacs
...
BLACS PARALLEL COMPLETED ALL 1000 TRIALS

But with:
% OMP_NUM_THREADS=2 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 2 ./blacs
MPI INITIALISATION : 3 3 0 0
MPI INITIALISATION : 3 3 0 0
MPI RANK : 1 / 2
MPI RANK : 0 / 2
MPI_SEND : 0 0
MPI_SEND : 0 0
MPI_RECV : 0 0
MPI_RECV : 0 0
MPI PARALLEL TEST COMPLETED
MPI PARALLEL TEST COMPLETED
BLACS SETUP : 0 0 0 0
BLACS SETUP : 0 1 0 1
BLACS SETUP : 1 0 0 0
BLACS SETUP : 1 1 0 1
BLACS SINGLE COMPLETE : 0 1
BLACS SINGLE COMPLETE : 0 0
BLACS SINGLE COMPLETE : 1 0
BLACS SINGLE COMPLETE : 1 1
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x23241b0, flag=0x7fffde64c190, status_array=0x2323900) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)
Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x151701b0, flag=0x41a9bd10, status_array=0x1516f880) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)

However with:
OMP_NUM_THREADS=8 LD_LIBRARY_PATH=$MKLROOT/lib/intel64:$LD_LIBRARY_PATH mpirun -np 8 ./blacs
...
PARALLEL BLACS : 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 5 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 5 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 2 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 4 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 3 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 6 which is greater than the upper bound of 1
forrtl: severe (408): fort: (2): Subscript #1 of the array CONTXT has value 3 which is greater than the upper bound of 1

Fatal error in MPI_Testall: Invalid MPI_Request, error stack:
MPI_Testall(261): MPI_Testall(count=2, req_array=0x1bf483b0, flag=0x41427d10, status_array=0x1bf47880) failed
MPI_Testall(124): The supplied request in array element 1 was invalid (kind=15)

Thanks, -- Victor

Hey Victor,
Thanks for the reply!

Do you know if there is any way around this problem without compiling a separate
BLACS (and then ScaLAPACK) linked against the thread-safe MPI?

I figure another possible way around this would be to spawn the same number of MPI processes
as there are cores and distribute the data for the ScaLAPACK routines across them just before the solve,
making separate contexts for each set of OpenMP threads.
The main question here is: what is the relative cost of communication between MPI processes on the
same physical compute node versus those that require the network?
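That alternative (one MPI rank per core, with a separate BLACS context per disjoint subset of ranks) can be set up with BLACS_GRIDMAP. A hedged sketch, not from the attached code: the number of grids, the MAXP bound, and the consecutive-rank splitting rule are all illustrative choices.

```fortran
! Sketch: split NPROCS ranks into NGRIDS disjoint 1 x (NPROCS/NGRIDS)
! BLACS grids via BLACS_GRIDMAP, so each ScaLAPACK solve runs on its
! own context.  Assumes NPROCS is divisible by NGRIDS.
integer, parameter :: NGRIDS = 2, MAXP = 64
integer :: ictxt(NGRIDS), usermap(1, MAXP)
integer :: g, j, myrank, nprocs, perg

call BLACS_PINFO(myrank, nprocs)
perg = nprocs / NGRIDS
do g = 1, NGRIDS
   do j = 1, perg
      usermap(1, j) = (g - 1) * perg + (j - 1)   ! consecutive global ranks
   end do
   call BLACS_GET(0, 0, ictxt(g))                ! fresh system context
   call BLACS_GRIDMAP(ictxt(g), usermap, 1, 1, perg)
end do
! Ranks listed in usermap for grid g now share context ictxt(g);
! ranks outside that grid receive an invalid context.
```

Each solve would then use only its own ictxt(g), so the contexts never touch the same process group concurrently.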

Thanks,
Andrew

Andrew,

After looking at your code, I see a correct fragment for the serial BLACS test:

! TEST THE SERIAL BLACS
do i = 0, 1
   CALL DGESD2D(CONTXT(i), 10, 1, SEND, 10, 0, MOD(COL(i)+1, MPI_PROCS))
   CALL DGERV2D(CONTXT(i), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(i) - 1, MPI_PROCS))
   WRITE(*,*) 'BLACS SINGLE COMPLETE : ', i, col(i)
   CALL BLACS_BARRIER(CONTXT(i), 'A')
end do

But the next, parallel BLACS fragment is unclear to me: which actions are supposed to happen in parallel?

! TEST THE PARALLEL BLACS
do i = 1, 100
!$OMP PARALLEL
   CALL DGESD2D(CONTXT(THREADNUM), 10, 1, SEND, 10, 0, MOD(COL(THREADNUM)+1, MPI_PROCS))
   CALL DGERV2D(CONTXT(THREADNUM), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS))
   write(*,*) 'PARALLEL BLACS : ', i
!$OMP END PARALLEL
end do

Questions:
Why should it work?
Why are the SEND and RECV arrays used?
Who sends/receives data in parallel?

Also, the PSEND and PRECV arrays you define are not used here.

Thanks, -- Victor

Hey Victor,

For this code chunk

! TEST THE PARALLEL BLACS
do i = 1, 100
!$OMP PARALLEL
   CALL DGESD2D(CONTXT(THREADNUM), 10, 1, SEND, 10, 0, MOD(COL(THREADNUM)+1, MPI_PROCS))
   CALL DGERV2D(CONTXT(THREADNUM), 10, 1, RECV, 10, 0, MOD(MPI_PROCS + COL(THREADNUM) - 1, MPI_PROCS))
   write(*,*) 'PARALLEL BLACS : ', i
!$OMP END PARALLEL
end do


The hope was that each threadnum = OMP_GET_THREAD_NUM() would have an associated BLACS context, so that
the same threadnum on each MPI process would form a communication group (like having used the threadnum as the
communication ID for the MPI send/receives), and that multiple send/receives could be done in parallel, one context per thread. In this case each thread sends to the next (cyclic) column of its context.

Replacing the shared SEND/RECV arrays with the threadprivate PSEND/PRECV in the parallel BLACS region results in the same set of errors.
Sorry not to have that in the version I posted; I was fiddling with the program trying to pin down the error!

To answer your questions:
Why should it work?
From what I can gather, a BLACS communication invokes a set of MPI commands for communication between MPI processes.
So my post is really asking: if a thread-safe MPI library is available, is a thread-safe BLACS (and thus ScaLAPACK) library also
possible/available (with the hope that I haven't made some ridiculous coding error)?
This raises the question of how the BLACS (and ScaLAPACK) libraries in MKL were built by the installer.

Why are the SEND, RECV arrays used?
Sorry again for the edited version! Replacing them with PSEND/PRECV doesn't seem to help.

Who sends/receives data in parallel?
The threads with the same threadnum = OMP_GET_THREAD_NUM() on each MPI process, sending to the next cyclic column in the BLACS context, with one context per threadnum.

One last comment: this program hard-codes a maximum of 2 OpenMP threads per node!
(This was to do with the cluster I was running it on.)

Thanks again!
Andrew
