I'm implementing a dynamic scheduler for solving several sparse matrices in parallel, using the well-known MUMPS solver. When a process completes its task, it asks the work manager for new work (a new matrix, or actually just the number of the matrix). The manager code runs as a separate thread in the master process, so the master process can do some work as well. This works well 9 out of 10 times, but sometimes everything just hangs. When I attach the debugger while this is happening, the processes appear to be blocked in MPI_Test. This should not happen, because MPI_Test is the non-blocking version of MPI_Wait. Any idea what could be wrong, or how I can debug this?
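For illustration, the hand-out scheme described above (each process asks the manager for the next matrix number) amounts to roughly the following bookkeeping. This is only a minimal sketch, not the actual code: the names `work_queue`, `work_queue_init`, `work_queue_take` and `NO_MORE_WORK` are hypothetical, matrices are assumed to be numbered from 0, and the MPI message exchange is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel: returned once every matrix has been handed out. */
#define NO_MORE_WORK (-1)

/* Hypothetical manager state: matrices are identified by numbers 0..total-1. */
typedef struct {
    int next;   /* next matrix number to hand out */
    int total;  /* total number of matrices */
} work_queue;

/* Prepare the queue for `total` matrices (a negative count is treated as 0). */
void work_queue_init(work_queue *q, int total)
{
    assert(q != NULL);
    q->next = 0;
    q->total = total > 0 ? total : 0;
}

/* Called by the manager once per incoming work request.
 * Returns the next matrix number, or NO_MORE_WORK when none are left. */
int work_queue_take(work_queue *q)
{
    assert(q != NULL);
    if (q->next >= q->total)
        return NO_MORE_WORK;
    return q->next++;
}
```

A worker would keep requesting numbers until it receives `NO_MORE_WORK`. In the real program the request and the reply travel as MPI messages, which this sketch does not attempt to model.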
I'm trying to use Intel Trace Analyzer, but I'm only able to get traces of working runs. When my program hangs (some kind of deadlock, I guess) I have to kill all processes, which also means I do not get a trace.
I tried using VTmt.lib to check for errors, but it reported none.
I tried using VTfs.lib to automatically detect deadlocks when tracing, but it is unable to detect this case.
Please advise me on what could cause MPI_Test to block, or on how I can debug this case.
Thanks in advance