Phi fails after a certain number of offload+alloc calls

I have noticed what seems to be a bug when running code that makes a large(ish) number of offload calls. Too many offload (with alloc) calls over the lifetime of the program seem to be the problem, leading to a grinding halt (without error). I can replicate the problem with the code below, which just loops over a few thousand offload calls, allocating a very small amount of memory in each. The program stops dead right after the line for i = 1004 is printed, regardless of the upper limit on the loop, the size of the alloc, or any sleep time I put in the loop.

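#include <stdio.h>
#include <stdlib.h>

int main ( void )
{
   int N = 5000;
   int size = 100;
   float mem = 0.0;   /* running count of floats allocated (elements, not bytes) */
   int i;
   for ( i = 0; i < N; i++ )
   {
      /* Deliberately never freed: every buffer stays allocated on the card. */
      float *thing = (float *) malloc ( size*sizeof(float) );
      mem += size;
      #pragma offload_transfer target(mic:0) \
         in ( thing : length(size) alloc_if(1) free_if(0) )
      printf ( "%d Size: %f Meg\n", i, mem/(1024.0*1000) );
   }
   return 0;
}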

These are the last few lines of output before it halts in its tracks (no error, just a freeze):

1001 Size: 0.097852 Meg
1002 Size: 0.097949 Meg
1003 Size: 0.098047 Meg
1004 Size: 0.098145 Meg

If I put a free_if(1) in the loop (meaning the memory is released when the offload has completed) then it runs perfectly fine, suggesting there is some upper limit on how many (not how much) offloaded memory blocks it can create, or how many host -> MIC indexes it can store.

Any insight into this is much appreciated. Perhaps there is a flag I can set somewhere for the number of offload allocs?


I can reproduce this, Corey. I reported it to Development (see tracking id below) and will post again when I know more.

(Internal tracking id: DPD200245684)

(Resolution Update on 11/06/2013): This defect is fixed in the MPSS 3.1 release (Oct. 2013) for Linux.

Ruibang L.

We are suffering from a similar problem. Does Intel have any follow-up on this?

Bioinformatician

For this particular issue, the MPSS Development team identified the root cause. As they described it: "The root cause is that the number of temp files being generated by COI is exceeding the number of allowable fd's in process, by default. i.e. since in the free_if(0) case at exactly 1005 iterations, is when the number of fd's plus the number of open /tmp/coi_procs/<card>/<pid>/.tmp files currently tracked by coi is at 1024."

They indicated increasing the limit for number of open files (ulimit -n) allows COI to continue generating temp files.

I verified that one can work around this issue by modifying the host file ( /opt/intel/mic/coi/config/coi ) to add line 30 shown in the snippet below. This file is transferred to the card(s) as ( /etc/init.d/coi ). Where more than ~5000 threads are used, one must increase the open file limit accordingly. With the change shown below, I was able to run the test case with N=5000.

Snippet from (host file): /opt/intel/mic/coi/config/coi:

<...>
     30         ulimit -n 5120
     31         $coiexec $coiparams &
     32         echo_ok
<...>

An MPSS change will appear in a future release to increase the default limit on the # of open files and provide administrator configuration control of the limit.

Ruibang L.

Thanks Kevin, it's helpful.

(Quoting Kevin Davis (Intel)'s workaround post above.)

Bioinformatician

The expected change to the COI configuration in MPSS is present in the latest MPSS 3.1 release (Oct. 2013) for Linux.

Details regarding the MPSS 3.1 for Linux are available at: http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx31rel
 
To confirm the change, log in to the card's OS and check that the file /etc/init.d/coi contains the following line:

dmax=10240        # max file descriptors per COI process
                  # (Linux default is 1024)

 
