Hybrid MPI/OpenMP process pinning

I have a large SMP system on which I am trying to run a hybrid MPI/OpenMP code, and am looking for guidance on correct process placement for my system when using Intel MPI. Using I_MPI_PIN_DOMAIN=socket and KMP_AFFINITY=compact gives the expected results, with each rank (and all of its threads) running on a single socket. But this setup always includes the first socket in the system, which does not work since this is a multi-user system and many people run on it at the same time.
The logical next step appears to be using CPU masks with I_MPI_PIN_DOMAIN. I would have expected I_MPI_PIN_DOMAIN=[3F000,FC0000,3F000000,FC0000000] to create domains on cores 12-17, 18-23, 24-29, and 30-35. But one of these domains always ends up being a catch-all for the cores that were not specified, which leads to a rank again being pinned on the first socket. I have tried other methods such as numactl, but Intel MPI does not appear to respect these tools for placement.

As an example, some debugging output with I_MPI_PIN_DOMAIN=[3F000,FC0000,3F000000,FC0000000] is shown below. The Intel MPI is version 4.0 Update 2.

[0] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[0] Rank Pid Node name Pin cpu
[0] 0 8231 host.domain {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239}
[0] 1 8229 host.domain {24,25,26,27,28,29}
[0] 2 8230 host.domain {30,31,32,33,34,35}
[0] 3 8232 host.domain {36,37,38,39,40,41}
[0] MPI startup(): I_MPI_ADJUST_BCAST=3
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_SHM_BUFFER_SIZE=131072
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=192.168.1.1

Are there any suggestions on making process placement work for hybrid MPI/OpenMP where the first socket is never used?

Thanks,
Joel
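For reference, the hex masks above can be derived mechanically from the core ranges. This is a small Python sketch for illustration only (Python is not part of Intel MPI; the helper name is my own):

```python
def core_mask(first_core, last_core):
    """Bitmask with one bit set per logical core in the inclusive range."""
    n_cores = last_core - first_core + 1
    return ((1 << n_cores) - 1) << first_core

# The four six-core domains from the question: cores 12-17, 18-23, 24-29, 30-35.
masks = [format(core_mask(first, first + 5), 'X') for first in range(12, 36, 6)]
print(','.join(masks))  # 3F000,FC0000,3F000000,FC0000000
```

Each mask is six consecutive set bits shifted to the first core of the domain, which reproduces the values passed to I_MPI_PIN_DOMAIN exactly.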

Best Reply

Hi Joel,

The mask you used is correct, but there was an issue in the library which prevented the correct pinning from being set.
Could you please download version 4.0 Update 3 of the Intel MPI Library and give it a try?

Regards!
Dmitry

Dmitry,

That fixes it. Thanks.

Joel

It appears that I am still running into a pinning problem when using masks, but the problem is different than before. I have a set of simulations that I am trying to execute, all using the same pinning variables:

I_MPI_PIN=on
I_MPI_PIN_MODE=mpd
I_MPI_PIN_DOMAIN=[3F(...)000,FC0(...)000,3F(...)000000,FC0(...)000000,etc]

where the masks are associated with cores 144-149, 150-155, (...), 228-233, 234-239 (i.e. 16 ranks placed on sockets 24-39 of a 6-core-per-socket system).

The system on which these simulations are running is idle except for my simulations. In 8 or 9 out of 10 runs Intel MPI does the placement correctly. But in the other cases (again with identical I_MPI_PIN variable settings) it oversubscribes a set of sockets: it places ranks 0-7 AND ranks 8-15 on sockets 24-31 and ignores sockets 32-39.

I have not yet been able to make this a repeatable error, other than by running the software over and over until it occurs. My log file got wiped out, so I do not have the output from I_MPI_DEBUG, and right now the system is placing processes correctly. If I get a case that fails I will attach the output.

Joel
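The long mask strings for these higher core numbers can be generated the same way as before. A Python sketch for illustration (the helper name and the 6-cores-per-socket layout are taken from the description above, not from Intel MPI itself):

```python
def socket_mask(first_core, cores_per_socket=6):
    """Bitmask covering one socket's worth of consecutive logical cores."""
    return ((1 << cores_per_socket) - 1) << first_core

# 16 six-core domains covering cores 144-239 (sockets 24-39 on this system).
masks = [format(socket_mask(c), 'X') for c in range(144, 240, 6)]
domain = 'I_MPI_PIN_DOMAIN=[' + ','.join(masks) + ']'

print(masks[0])    # '3F' followed by 36 hex zeros
print(len(masks))  # 16
```

This shows why the mask strings grow so long: a domain starting at core 144 needs a value of roughly 150 bits, far wider than a single machine word.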

We found I_MPI_DEBUG output from a case that failed to respect the mask settings. In this instance ranks 0, 3, 6, and 9 were placed on socket 28; ranks 1, 4, 7, and 10 on socket 29; and ranks 2, 5, 8, and 11 on socket 30. This job was just run again (with identical settings as before) to see if this would be reproducible, but the system placed the processes correctly this time.

I_MPI_PIN=on
I_MPI_PIN_MODE=mpd
I_MPI_PIN_DOMAIN=[3F000000000000000000000000000000000000000000FC00000000000000000000000000000000000000000003F000000000000000000000000000000000000000000000FC00000000000000000000000000000000000000000000003F000000000000000000000000000000000000000000000000FC00000000000000000000000000000000000000000000000003F000000000000000000000000000000000000000000000000000FC00000000000000000000000000000000000000000000000000003F000000000000000000000000000000000000000000000000000000FC00000000000000000000000000000000000000000000000000000003F000000000000000000000000000000000000000000000000000000000FC0000000000000000000000000000000000000000000000000000000000]

...
[7] MPI startup(): shm and tcp data transfer modes
[0] MPI startup(): shm and tcp data transfer modes
[3] MPI startup(): shm and tcp data transfer modes
[11] MPI startup(): shm and tcp data transfer modes
[8] MPI startup(): shm and tcp data transfer modes
[5] MPI startup(): shm and tcp data transfer modes
[4] MPI startup(): shm and tcp data transfer modes
[2] MPI startup(): shm and tcp data transfer modes
[6] MPI startup(): shm and tcp data transfer modes
[1] MPI startup(): shm and tcp data transfer modes
[10] MPI startup(): shm and tcp data transfer modes
[9] MPI startup(): shm and tcp data transfer modes
[0] Rank Pid Node name Pin cpu
[0] 0 19930 local.domain {168,169,170,171,172,173}
[0] 1 19919 local.domain {174,175,176,177,178,179}
[0] 2 19921 local.domain {180,181,182,183,184,185}
[0] 3 19920 local.domain {168,169,170,171,172,173}
[0] 4 19922 local.domain {174,175,176,177,178,179}
[0] 5 19923 local.domain {180,181,182,183,184,185}
[0] 6 19924 local.domain {168,169,170,171,172,173}
[0] 7 19925 local.domain {174,175,176,177,178,179}
[0] 8 19926 local.domain {180,181,182,183,184,185}
[0] 9 19927 local.domain {168,169,170,171,172,173}
[0] 10 19929 local.domain {174,175,176,177,178,179}
[0] 11 19928 local.domain {180,181,182,183,184,185}
[0] MPI startup(): I_MPI_ADJUST_BCAST=3
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_FABRICS=shm:tcp
[0] MPI startup(): I_MPI_SHM_BUFFER_SIZE=131072
[0] MPI startup(): MPICH_INTERFACE_HOSTNAME=192.168.1.1

Hi Joel,

Right now the pinning mask is limited to 64 bits, so this might be the reason for the unstable behavior.
Please try to avoid using I_MPI_PIN_MODE - it's not needed in your case.
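A quick check illustrates this limit (a Python sketch for illustration only, not Intel MPI code):

```python
# Mask for the first domain in the failing run (cores 168-173).
# Its highest set bit is bit 173, so the value needs 174 bits --
# well past a 64-bit field.
mask = ((1 << 6) - 1) << 168
print(mask.bit_length())  # 174
```

Any domain whose cores sit above core 63 produces a mask wider than 64 bits, which is consistent with the placement going wrong only for the high-numbered sockets.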

Could you please provide the output of the cpuinfo utility (shipped with Intel MPI) and the command line used to run the application? (Also any environment variables that may affect execution.)

Regards!
Dmitry

Hi,

Please try Intel(R) MPI Library 4.1.0.030.

You can find it at https://registrationcenter.intel.com/RegCenter/Download.aspx?productid=1626

--

Dmitry
