impi/mkl_scalapack causing kernel panic

I'm testing the Cluster Toolkit for Linux on EM64T, and the ScaLAPACK Cholesky factorization routine (pdpotrf_) very predictably causes a Machine Check Exception and a kernel panic. No other ScaLAPACK/PBLAS routine has caused any problems so far.

Additionally, each time I start an mpi job using 'mpiexec', the mpd daemon outputs the following:

unable to parse pmi message from the process :cmd=put kvsname=kvs_nerf_4268_1_0 key=DAPL_PROVIDER value=

It may not matter, but the only way I've found to get the mpd ring running is:

[host]$ mpd --ifhn=10.0.0.1 -l 4268 &

[node]$ mpd -h nerf -p 4268 &

Any help clearing these up would be greatly appreciated, as I won't be purchasing the software otherwise. MCEs are unacceptable.

The forum stripped the last part of the error message. After "value=" it shows, in brackets, "NULL string".

Hi,

Could you tell me which Intel MPI Library version you are using? Please check the package ID information in the mpisupport.txt file.

By the way, you can file a bug report at https://primer.intel.com to get technical assistance.

Best regards,

Andrey

Package ID: l_mpi_p_3.1.026

If I had to guess, I would say it's in pdsyrk. I heavily performance tested pdtrsm and dpotrf before trying pdpotrf.
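For context on that guess: a blocked right-looking Cholesky factorization repeats three steps per block column, namely a small unblocked factorization of the diagonal block, a triangular solve on the panel below it, and a symmetric rank-k update of the trailing matrix. If the first two have already been exercised separately, the update is the step left over. The sketch below is a serial, pure-Python illustration of that structure only. It is not ScaLAPACK's code, and the function names `potf2`, `trsm` and `syrk` are borrowed from the LAPACK/BLAS routines purely as labels.

```python
import math


def potf2(a, lo, hi):
    """Unblocked Cholesky of the diagonal block a[lo:hi][lo:hi], in place (lower)."""
    for j in range(lo, hi):
        d = a[j][j] - sum(a[j][k] ** 2 for k in range(lo, j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, hi):
            s = sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def trsm(a, lo, hi, n):
    """Panel solve: overwrite a[hi:n][lo:hi] with A21 * inv(L11)^T."""
    for i in range(hi, n):
        for j in range(lo, hi):
            s = sum(a[i][k] * a[j][k] for k in range(lo, j))
            a[i][j] = (a[i][j] - s) / a[j][j]


def syrk(a, lo, hi, n):
    """Trailing update: A22 -= L21 * L21^T, lower triangle only."""
    for i in range(hi, n):
        for j in range(hi, i + 1):
            a[i][j] -= sum(a[i][k] * a[j][k] for k in range(lo, hi))


def blocked_cholesky(a, nb):
    """Return the lower Cholesky factor of the symmetric matrix a, block size nb."""
    n = len(a)
    l = [list(map(float, row)) for row in a]  # work on a copy
    for lo in range(0, n, nb):
        hi = min(lo + nb, n)
        potf2(l, lo, hi)    # factor the diagonal block
        trsm(l, lo, hi, n)  # solve for the panel below it
        syrk(l, lo, hi, n)  # update the trailing submatrix
    for i in range(n):      # clear the unused upper triangle
        for j in range(i + 1, n):
            l[i][j] = 0.0
    return l
```

In a distributed setting each of these steps also involves communication between processes, which this serial sketch does not model.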

That link doesn't work.

Could you give more details on the cluster configuration? I'd like to understand why you were not able to use mpdboot to launch the MPD ring.

I made a typo, sorry. The correct link is https://premier.intel.com

The test cluster consists of two 4-processor machines behind a firewall. The head node, nerf, has two Ethernet ports: one connected to the firewall and one to the compute node, ball. All IPs are in the 10.0.0.0 network.

When I try:

mpdboot --totalnum=2 --file=./mpd.hosts --rsh=ssh

the output is:

mpdboot_nerf (handle_mpd_output 681): failed to ping mpd on ball; received output={}

Also, the premier support link won't let me in, as I'm only evaluating the software right now.

  1. Are compute nodes allowed to establish connections to the head node? mpdboot always starts the mpd daemon on the local node first. After that, the remote mpd daemons attempt to connect to it.
  2. Are you able to start mpd manually?
  • Run the mpd -e -d command on the head node. The port number will be printed on stdout.
  • Run the mpd -h head_node -p -d command to establish the MPD ring, supplying the port number printed in the previous step.
  • Check whether the ring was established successfully by running the mpdtrace command.
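The first question, whether a compute node can open a connection to the head node at all, can be checked independently of mpd. Here is a minimal sketch in Python, assuming an mpd (or anything else) is already listening on the head node. `can_connect` is a hypothetical helper written for this illustration and is not part of Intel MPI or mpd. It only tests whether a TCP connection can be opened, so it says nothing about whether mpd itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host name and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name-resolution failures
        return False
```

Run on the compute node, a call such as `can_connect("nerf", 4268)` (the host name and port from the manual mpd commands earlier in the thread) would return False if a firewall rule or a wrong interface is blocking the connection.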

Oops! I see that you can start the ring manually. Could you share the contents of your mpd.hosts file and the output of the mpdboot -d -v... command? Is there any useful information in /tmp/mpd2.logfile_

After reconfiguring the network settings several times, and reorganizing all of my environment variables (I had several MPI implementations installed), the problem went away, and I could boot up the MPD daemons via:

mpdboot --file= --rsh=ssh

I wish I could explain more specifically, but I changed far too many things in the process of compiling ScaLAPACK from scratch for several MPI implementations.
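With several MPI implementations installed, one thing that can be checked is which copy of a launcher the shell actually picks up. The sketch below lists every executable with a given name on PATH, in search order. `find_all_on_path` is a hypothetical helper written for this illustration, and whether a PATH conflict was the actual cause here is not known.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable called `name` on PATH, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in hits:
                hits.append(candidate)
    return hits
```

Calling `find_all_on_path("mpiexec")` or `find_all_on_path("mpdboot")` and getting more than one result would indicate that more than one installation is on PATH. The first entry is the one the shell runs.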
