Error: Engine_connect for an offloading code - SCIF problems

Hi there,

A user of ours is building a pre-release version of NAMD that includes Phi offloading support, but when we try to run it, it claims it cannot find the Phi cards. I've also replicated the failure with xhpl_offload_intel64.

Reason: FATAL ERROR: MIC error on Pe 0 (barcoo062 device 0): No MIC devices found.

Running with OFFLOAD_REPORT=2 reveals the following:

[SOURCE][0x9377bc80][1834028774450][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2055063906528][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2276654460069][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect
[SOURCE][0x9377bc80][2497819045011][engine.cpp:186][COILOG_LEVEL_ERROR][ConnectToDaemon]: Error: Engine_connect

Running it under strace shows:

5672  open("/dev/mic/scif", O_RDWR)     = 3
5672  fcntl(3, F_SETFD, FD_CLOEXEC)     = 0
5672  fcntl(3, F_GETFD)                 = 0x1 (flags FD_CLOEXEC)
5672  fcntl(3, F_SETFD, FD_CLOEXEC)     = 0
5672  ioctl(3, 0xc0087301, 0x7fff1c780f38) = 0
[...]
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 10000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 20000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 40000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 80000000}, NULL)    = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 160000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 320000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({0, 640000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({1, 280000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({2, 560000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({5, 120000000}, NULL)   = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({10, 240000000}, NULL)  = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({20, 480000000}, NULL)  = 0
5672  ioctl(3, 0xc0087303, 0x7fff1c780f20) = -1 ECONNREFUSED (Connection refused)
5672  nanosleep({40, 960000000}, NULL)  = 0

After the final sleep it writes out one of the Engine_connect errors shown earlier and starts the retry cycle again.
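For anyone reading a similar trace, here is a small Python sketch (not taken from the COI or SCIF sources) that decodes the ioctl request number using the standard Linux x86 ioctl bit layout and reproduces the sleep schedule seen above. The function names are hypothetical. Reading request 3 of type 's' as the SCIF connect call is my assumption based on the ECONNREFUSED result, not something the trace itself states.

```python
def decode_ioctl(request):
    """Split a Linux (x86) ioctl request number into its encoded fields."""
    return {
        "dir": (request >> 30) & 0x3,       # 1 = write, 2 = read, 3 = read and write
        "size": (request >> 16) & 0x3FFF,   # size of the argument struct in bytes
        "type": chr((request >> 8) & 0xFF), # driver "magic" character
        "nr": request & 0xFF,               # command number within that driver
    }


def backoff_schedule_ns(initial_ns=10_000_000, factor=2, sleeps=13):
    """Sleep durations in nanoseconds for a doubling retry loop.

    The defaults mirror the nanosleep calls in the strace output:
    10 ms first, doubling each time, 13 sleeps per cycle.
    """
    return [initial_ns * factor ** i for i in range(sleeps)]


if __name__ == "__main__":
    print(decode_ioctl(0xC0087303))
    schedule = backoff_schedule_ns()
    print("last sleep (s):", schedule[-1] / 1e9)
    print("total per cycle (s):", sum(schedule) / 1e9)
```

With those defaults the last sleep is 40.96 s and one full cycle sleeps for 81.91 s in total, which matches the roughly 220 s spacing in the log timestamps only loosely, so treat the schedule as a reading of this one trace rather than a documented behaviour.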

The xhpl_offload_intel64 binary used to work under a previous install, so I'd be curious whether anyone knows what might have changed to cause this failure.

All the best,
Chris

Solved - the xCAT cluster management software was copying the passwd file from our management node onto the Xeon Phi cards, so no "micuser" user was present on the cards, which caused the coi_daemon to (quite legitimately) refuse to start. Once we had figured out what was needed, creating that user on the management node fixed it.
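In case it helps anyone hitting the same thing, below is a minimal Python sketch of the kind of check that would have caught this earlier: it parses passwd-format text and reports which required accounts are missing. The function names and the idea of running it against the passwd file before it is pushed to the cards are my own assumptions. The only account this thread confirms is needed is "micuser".

```python
def passwd_users(passwd_text):
    """Return the set of account names found in passwd-format text."""
    users = set()
    for line in passwd_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        users.add(line.split(":", 1)[0])  # first colon-separated field is the name
    return users


def missing_users(passwd_text, required=("micuser",)):
    """List the required account names that are absent from passwd_text."""
    present = passwd_users(passwd_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Hypothetical sample standing in for the file copied to the cards.
    sample = "root:x:0:0:root:/root:/bin/bash\nchris:x:1000:1000::/home/chris:/bin/bash\n"
    print(missing_users(sample))
```

To check a real file, read it with open(path).read() and pass the text to missing_users. An empty list means every required account is present.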