SCIF connection refused

For some reason the SCIF interface in my compute nodes is refusing connections. Any ideas on what's wrong or where to start investigating?

The node has a Mellanox ConnectX-3 HCA and the latest MPSS (Gold Update 2), with everything else set up "by the book". All the IB services and modules load cleanly and seem to work, and I can ssh into the MIC and run natively.

However, if I try to run an offload (LEO or OpenCL) application, it hangs. Running it under strace reveals the following:

mmap(NULL, 10489856, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f737396e000
mprotect(0x7f737396e000, 4096, PROT_NONE) = 0
clone(child_stack=0x7f737436dfd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f737436e9d0, tls=0x7f737436e700, child_tidptr=0x7f737436e9d0) = 26801
open("/dev/mic/scif", O_RDWR)           = 5
fcntl(5, F_SETFD, FD_CLOEXEC)           = 0
ioctl(5, 0xc0087303, 0x7fffa02d2710)    = 0
futex(0x7f737436e9d0, FUTEX_WAIT, 26801, NULL) = 0
close(4)                                = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 10000000}, NULL)          = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 20000000}, NULL)          = 0
ioctl(3, 0xc0087303, 0x7fffa02d27d0)    = -1 ECONNREFUSED (Connection refused)
nanosleep({0, 40000000}, NULL)          = 0
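To read the failing call in the trace, it can help to split the raw ioctl request number into its fields. The sketch below assumes the generic Linux ioctl encoding used on x86 (8-bit command number, 8-bit type character, 14-bit argument size, 2 direction bits); `decode_ioctl` is a hypothetical helper name, not part of MPSS.

```python
def decode_ioctl(request: int) -> dict:
    """Split a Linux ioctl request number into its fields.

    Assumes the generic (x86) layout: bits 0-7 command number,
    bits 8-15 type character, bits 16-29 argument size,
    bits 30-31 direction (1 = write, 2 = read).
    """
    direction = (request >> 30) & 0x3
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        "read": bool(direction & 2),
        "write": bool(direction & 1),
    }


# The request number that returns ECONNREFUSED in the trace above.
print(decode_ioctl(0xC0087303))
```

For 0xc0087303 this gives type 's', command number 3, an 8-byte argument, and both direction bits set. In the mainline Linux SCIF header that combination corresponds to SCIF_CONNECT; whether the MPSS driver in use numbers its ioctls the same way is an assumption, but it is consistent with the ECONNREFUSED result and the doubling retry delays (10 ms, 20 ms, 40 ms).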

Pinpointed the problem: we use a slightly customized system for user management on the MICs, and because of that the 'micuser' user was missing during mpssd and ofed-mic initialization. I have now added the user, and offloading seems to work again. Suggestion: it would be nice to have a sanity check for this.

Olli-Pekka
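A sanity check along the lines suggested above could look roughly like this. It is only a sketch, not anything MPSS ships: `missing_users` is a hypothetical helper that scans passwd-format text for required account names, and where the card's passwd file lives on the host depends on the installation, so the text is passed in rather than read from a fixed path.

```python
def missing_users(passwd_text, required=("micuser",)):
    """Return the required account names absent from passwd-format text.

    Hypothetical helper: each non-empty, non-comment line is expected to
    be colon-separated with the user name in the first field.
    """
    present = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        present.add(line.split(":", 1)[0])
    return [name for name in required if name not in present]


# Illustrative passwd content (made up for this example).
sample = "root:x:0:0:root:/root:/bin/sh\nsshd:x:74:74::/var/empty:/bin/false\n"
for name in missing_users(sample):
    print(f"warning: required user '{name}' is missing")
```

Running such a check against the card's passwd file before mpssd and ofed-mic start would flag the missing account up front instead of leaving offload applications to hang on refused connections.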
