Tips for using the Intel® MPI Library with Intel® Xeon Phi™ coprocessors

1. Check prerequisites

  • Each host and each Intel® Xeon Phi™ coprocessor should have a unique IP address across a cluster;
  • ssh access between host(s) and Intel® Xeon Phi™ coprocessor(s) should be password-less;
  • Update the Intel® Manycore Platform Software Stack (Intel® MPSS) to the current version;
  • Make sure that the Intel® MPI Library has the same path on the host and on the coprocessor. For example, share the directory that contains the installed Intel® MPI Library with the coprocessor through a network file system. Refer to the Intel® MPSS Readme.txt for instructions on how to set it up.
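
Some of these prerequisites can be spot-checked from the host. The following is a minimal sketch, not an Intel-provided tool: the function names and sample names are hypothetical. It assumes an OpenSSH client, where BatchMode=yes makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# Sketch only: function names and sample data are illustrative assumptions.

# Print every IP address that appears more than once on stdin.
# Input format: one "name ip" pair per line.
duplicate_ips() {
    awk '{ print $2 }' | sort | uniq -d
}

# Return 0 if ssh to the given target works without a password prompt.
# BatchMode=yes disables password prompts, so a missing key makes ssh fail.
passwordless_ssh_ok() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Return 0 if the directory $2 exists on the remote target $1.
# Useful to confirm that the MPI install path is also visible on the coprocessor.
remote_dir_exists() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" test -d "$2" >/dev/null 2>&1
}

# Example use (node5-mic0 and the path are placeholders):
#   passwordless_ssh_ok node5-mic0 || echo "ssh to node5-mic0 asks for a password"
#   remote_dir_exists node5-mic0 /path/to/impi || echo "MPI path missing on node5-mic0"
```

An empty result from duplicate_ips means that every listed address is unique.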

2. Check connectivity

  • Check that the firewall is disabled or does not block unprivileged ports;
  • Check that connections between all nodes and coprocessors can be established.
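
A basic reachability pass over all hosts and coprocessors can be scripted. This is a sketch under assumptions: the function names and the PROBE override are hypothetical, and the default probe uses ping, which only shows that a node answers ICMP. It does not prove that unprivileged TCP ports are open.

```shell
#!/bin/sh
# Sketch only: function names and the PROBE variable are illustrative assumptions.

# Probe one host with a single ICMP echo request; returns 0 if it answers.
probe_host() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# Print every host from the argument list for which the probe fails.
# Set PROBE to the name of another command or function to replace the ping check.
unreachable_hosts() {
    for h in "$@"; do
        "${PROBE:-probe_host}" "$h" || echo "$h"
    done
}

# Example use (addresses are placeholders):
#   unreachable_hosts 172.20.5.2 172.20.6.2
```

Run the same check from each host and each coprocessor, because a route can work in one direction only.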

3. "Missing ifname or invalid host/port description" error message

If the application terminates with a message like the following:

Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(647)...................:
MPID_Init(192)..........................: channel initialization failed
MPIDI_CH3_Init(152).....................:
MPID_nem_tcp_post_init(578).............:
MPID_nem_tcp_connect(1100)..............:
MPID_nem_tcp_get_addr_port_from_bc(1207): Missing ifname or invalid host/port description in business card
APPLICATION TERMINATED WITH THE EXIT STRING: Hangup (signal 1)

The workaround is to add a line to the /etc/hosts file in the host OS that specifies the IP address and hostname of the host system.
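
Before editing the file, you can check whether such an entry already exists. The following sketch assumes the usual hosts file layout (an IP address followed by one or more names on each line); the function name and the sample address and hostname are hypothetical.

```shell
#!/bin/sh
# Sketch only: hosts_has_entry is an illustrative helper, not part of any Intel tool.

# Return 0 if file $1 maps IP address $2 to hostname $3.
hosts_has_entry() {
    awk -v ip="$2" -v name="$3" '
        /^[[:space:]]*#/ { next }        # skip comment lines
        $1 == ip {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break     # stop at a trailing comment
                if ($i == name) found = 1
            }
        }
        END { exit !found }
    ' "$1"
}

# Example use (192.168.1.4 and node4 are placeholders for your host):
#   hosts_has_entry /etc/hosts 192.168.1.4 node4 || echo "entry for node4 is missing"
```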

4. "HYDU_sock_connect ... unable to connect" messages when using Intel® Xeon Phi™ coprocessors in a multi-host environment

When running in a multi-host/multi-coprocessor environment (using Intel® Xeon Phi™ coprocessors only), you may see the following diagnostics:

# 172.20.5.2 is the mic0 coprocessor in node5
# 172.20.6.2 is the mic0 coprocessor in node6
# node5 and node6 are different hosts
# this example shows the mpi ranks are running on 'remote' coprocessors ('remote' means the coprocessors are in different hosts)
# mpiexec.hydra starts on host node4, it launches processes on coprocessors 172.20.5.2 and 172.20.6.2

(host-node4)# mpiexec.hydra \
                -host 172.20.5.2 -n 1 /tmp/test.c.exe.mic \
              : -host 172.20.6.2 -n 1 /tmp/test.c.exe.mic

[proxy:0:0@node5-mic0] HYDU_sock_connect (./utils/sock/sock.c:213): unable to connect from "node5-mic0" to "192.168.1.4" (Connection timed out)
[proxy:0:0@node5-mic0] main (./pm/pmiserv/pmip.c:339): unable to connect to server 192.168.1.4 at port 45553 (check for firewalls!)
[proxy:0:1@node6-mic0] HYDU_sock_connect (./utils/sock/sock.c:213): unable to connect from "node6-mic0" to "192.168.1.4" (Connection timed out)
[proxy:0:1@node6-mic0] main (./pm/pmiserv/pmip.c:339): unable to connect to server 192.168.1.4 at port 45553 (check for firewalls!)
APPLICATION TERMINATED WITH THE EXIT STRING: job ending due to timeout = 30

To make this example work in your network configuration, use the '-iface' option of the 'mpiexec.hydra' command and specify the 'micbr0' interface:

(host-node4)# mpiexec.hydra -iface micbr0 \
                -host 172.20.5.2 -n 1 /tmp/test.c.exe.mic \
              : -host 172.20.6.2 -n 1 /tmp/test.c.exe.mic
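
The diagnostics name the address that the proxies tried to reach (192.168.1.4 in this example), which the coprocessors apparently cannot reach. As a rough aid, the sketch below pulls that address out of saved output so that you can compare it with the addresses on the launch host, for instance with 'ip -o -4 addr show'. The function name is hypothetical, and the pattern assumes the message wording shown in the diagnostics.

```shell
#!/bin/sh
# Sketch only: hydra_server_addrs is an illustrative helper name.

# Read mpiexec.hydra output on stdin and print each distinct server
# address from the "unable to connect to server" messages.
hydra_server_addrs() {
    sed -n 's/.*unable to connect to server \([0-9.][0-9.]*\) at port [0-9][0-9]*.*/\1/p' | sort -u
}

# Example use (run.log is a placeholder for a file with the saved output):
#   hydra_server_addrs < run.log
```

If the printed address belongs to an interface that the coprocessors cannot route to, pass a reachable interface with '-iface'.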

5. "HYDU_sock_connect ... unable to connect" messages when using an Intel® Xeon Phi™ coprocessor in a single-host environment

When running in a single-host environment (using Intel® Xeon Phi™ coprocessors only), you may see the following diagnostics:

(host)# mpiexec.hydra \
          -n 2 -wdir /tmp -host 192.168.1.100 /tmp/test_hello
[proxy:0:0@crt31-mic0] HYDU_sock_connect (./utils/sock/sock.c:213): unable to connect from "crt31-mic0" to "127.0.0.1" (Connection refused)
[proxy:0:0@crt31-mic0] main (./pm/pmiserv/pmip.c:339): unable to connect to server 127.0.0.1 at port 50791 (check for firewalls!)

To make this example work in your network configuration, use the '-iface' option of the 'mpiexec.hydra' command and specify the 'mic0' interface:

(host)# mpiexec.hydra -iface mic0 \
          -n 2 -wdir /tmp -host 192.168.1.100 /tmp/test_hello
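
Before passing an interface name to '-iface', it can help to confirm that the interface exists on the launch host. The sketch below relies on Linux listing network interfaces under /sys/class/net; the function names and the SYSFS_NET override are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names and the SYSFS_NET variable are illustrative assumptions.

# Return 0 if the named network interface exists on this Linux host.
# SYSFS_NET can point at another directory, which is mainly useful for testing.
iface_exists() {
    [ -n "$1" ] && [ -e "${SYSFS_NET:-/sys/class/net}/$1" ]
}

# Print the names of all network interfaces known to the kernel.
list_ifaces() {
    ls "${SYSFS_NET:-/sys/class/net}"
}

# Example use:
#   iface_exists mic0 || { echo "mic0 not found; available:"; list_ifaces; }
```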

6. Missing second DAPL provider

If the second provider is not used when you run with two DAPL providers (for example, I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0), use the full name of the Intel® Xeon Phi™ coprocessor: <hostname>-micx, where x stands for the number of the coprocessor (for example, <hostname>-mic0), instead of just micx.
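
The naming rule can be applied mechanically when you build the host list. A minimal sketch follows; the helper name is hypothetical, and the commented commands reuse the provider list and the test binary from the examples in this article.

```shell
#!/bin/sh
# Sketch only: qualify_mic_name is an illustrative helper name.

# Expand a short coprocessor name (micX) to <hostname>-micX.
# $1 is the host name, $2 is the coprocessor name; other names pass through unchanged.
qualify_mic_name() {
    case "$2" in
        mic[0-9]*) echo "$1-$2" ;;
        *)         echo "$2" ;;
    esac
}

# Example use (assumes that "hostname" prints the short host name):
#   export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1,ofa-v2-scif0
#   mpiexec.hydra -host "$(qualify_mic_name "$(hostname)" mic0)" -n 1 /tmp/test.c.exe.mic
```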
