This particular problem is likely not an Intel problem, but someone here may have experienced it and have some advice on resolving it.
I have a compute-intensive application that is MPI distributed and OpenMP threaded. In my office I can run this program directly (no MPI) or distributed (mpirun), on 1 or 2 nodes using MPI. The local systems are a Xeon host (CentOS) and a KNL host (CentOS). I have also run this successfully by ssh-ing into the Colfax Cluster using 1 to 8 KNLs (I couldn't get 16 KNLs to schedule).
I am now running (attempting to run) tests on a hardware vendor's setup.
After resolving configuration and installation issues, I can:
ssh into their login server (Xeon host)
su to super user
ssh to KNL node (Xeon KNL)
mpirun ... using two KNLs
When I mpirun'd the application, it started up as expected (periodically emitting progress information to the console). Several minutes into the run it hung. I thought this was a programming error resulting in deadlock, or that maybe a watchdog timer had killed a thread or process without killing the MPI process manager.
To eliminate possible causes, I started the application standalone (without mpirun).
Several minutes into this run, the program hung as well. So this is not an MPI messaging issue.
Pressing Ctrl-C on the keyboard (through two ssh connections) yielded no response (the application was not killed). I thought one of the systems in the ssh chain had gone down. Before doing anything else on those ssh connections, I wrote an email to my client explaining the hang issue. Several minutes passed.
Now for the interesting part.
After this several-minute "hang", 100 to 150 lines of progress output from the application appeared in the console window, followed by the message that the program had been terminated by Ctrl-C. What appears to have happened is that the application was running fine during the console hang, but the terminal output was suspended (as if flow control had instructed it to stop). And no, I did not press Ctrl-S or the Pause key.
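To illustrate what I think I observed, here is a minimal Python sketch of the same pattern. It is not my application and not a diagnosis. It assumes a plain OS pipe stands in for the ssh/terminal chain, a thread stands in for the program, and a sleep stands in for the stalled console. The names run_demo, num_lines and stall_seconds are made up for this illustration.

```python
import os
import threading
import time


def run_demo(num_lines=100, stall_seconds=0.5):
    """Emit progress lines into a pipe while nobody reads from it."""
    read_fd, write_fd = os.pipe()
    finished_at = []

    def worker():
        # Stand-in for the application: emit progress lines, then finish.
        with os.fdopen(write_fd, "w") as out:
            for step in range(num_lines):
                out.write(f"progress {step}\n")
        # Reached only after all output has been handed to the pipe.
        finished_at.append(time.monotonic())

    thread = threading.Thread(target=worker)
    thread.start()

    # Stand-in for the stalled console: nothing is read for a while.
    time.sleep(stall_seconds)
    drain_start = time.monotonic()

    # The "console" resumes: everything queued so far arrives in one burst.
    with os.fdopen(read_fd) as inp:
        lines = inp.read().splitlines()
    thread.join()

    return {
        "lines": len(lines),
        "worker_done_before_drain": finished_at[0] < drain_start,
    }


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the worker completes while the reader is stalled, because its output is small enough to sit in the pipe buffer, and all of the lines show up at once when reading resumes. If the output were larger than the buffer, the writer would block until the reader caught up, which from the outside would look like a real application hang.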
Does anyone have information on this and how to avoid the hang, or at least how to resume the console output without killing the application?