autoparallelization problem

When I compile the following small program with -O2 or -O3 plus -parallel and run it with OMP_NUM_THREADS>1, it crashes with signal 11, either immediately (-O2) or when trying to print res (-O3). Running the parallel binary with OMP_NUM_THREADS=1 works. When the last printf statement is commented out, the program runs with more than one thread if compiled with -O3 -parallel.

I first observed this with Compiler XE for applications running on Intel(R) 64, Version 13.1.2.183 Build 20130514. Older versions, from 10.1 onward, show the same or similar problems. C++ Compiler for Intel(R) EM64T-based applications, Version 9.1 Build 20060925, is the last version that produces a working parallel binary. All compilers were run under Linux SLES11SP3.

#include <stdio.h>

#define ARDIM 2000

int main (int argc, char **argv)  {
  double a[ARDIM][ARDIM], b[ARDIM][ARDIM], c[ARDIM][ARDIM];
  double di=0.0,dj=0.0,res=0.0 ;
  int i,j,k;

  for (i=0;i<ARDIM;i++) {
    di+=1.0e0;
    for (j=0;j<ARDIM;j++) {
      dj+=1.0e0;
      a[i][j]=di/dj;
      b[i][j]=dj/di;
      c[i][j]=0.0 ;
    }
    dj=0.0;
  }

  for (i=0;i<ARDIM;i++) {
    for (k=0;k<ARDIM;k++) {
      for (j=0;j<ARDIM;j++) {
        c[i][j]+=a[i][k]*b[k][j];
      }
    }
  }

  for (i=0;i<ARDIM;i++) {
    for (j=0;j<ARDIM;j++) {
      res+=c[i][j];
    }
  }

  printf("\n c[1][2] = %f\n",c[1][2]);
  /* printf("\n res = %f\n",res); */
}

I don't know why Build 20060925 worked, but your program runs fine with current compilers as long as you increase the main stack size and the thread stack size.

Your perfectly nested loop does get auto-parallelized, and it has large stack requirements: each double array requires 8*2000*2000 = 32M bytes. With three 32M arrays, each thread needs at least 96M for its private stack. 100M works fine for both the main stack and the threads:

Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.144 Build 20140120
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

$ icc -parallel U509827.cpp -par-report -O3
U509827.cpp(51): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED
$ ulimit -s 100000000
$ export OMP_STACKSIZE=100M
$ ./a.out

 c[1][2] = 9.866605

 res = 6584316148511.074219
$

Patrick
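
The arithmetic above can be checked with a few lines of C. This is only a sketch: stack_bytes_needed is a hypothetical helper introduced here for illustration, not part of the original program or of any compiler tooling.

```c
#include <stddef.h>

/* Hypothetical helper (not from the original post): number of bytes
   occupied by `narrays` square arrays of dim x dim doubles when they
   are declared as local (automatic) variables. */
size_t stack_bytes_needed(size_t dim, size_t narrays)
{
    return narrays * dim * dim * sizeof(double);
}
```

With 8-byte doubles, stack_bytes_needed(2000, 3) is 96000000, which is the 96M figure quoted above and just below the 100M used for OMP_STACKSIZE. Assuming the shell is bash, note that ulimit -s takes its value in units of 1024 bytes, so the limit set in the session above is far more generous than 100M.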

Thanks, defining OMP_STACKSIZE does help; we use ulimit -s unlimited anyway. The stack size must be >= OMP_STACKSIZE*OMP_NUM_THREADS.

However, I do not understand why each thread requires its own copy of the three arrays, and I see a risk of running into the vmem limit on manycore systems.

 

Axel

 

If you wish to avoid threadprivate arrays, which looks feasible with your example, you could use explicit parallelism (OpenMP or Cilk(tm) Plus).
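
A minimal sketch of the explicit OpenMP variant might look like the following. The function name matmul_shared and the flat row-major layout are assumptions made for illustration; they are not taken from the original program. The arrays are passed in by pointer and shared by all threads, so no thread needs a private copy.

```c
#include <stddef.h>

/* Sketch only: matmul_shared is a hypothetical name, not code from this thread.
   Computes c = a * b for n x n row-major matrices. The three arrays are
   shared by all threads; only the loop indices and local scalars are private. */
void matmul_shared(int n, const double *a, const double *b, double *c)
{
    int i;
    /* Each thread works on a disjoint set of rows i, so writes to c never collide. */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        double *ci = c + (size_t)i * n;          /* row i of c */
        for (int j = 0; j < n; j++)
            ci[j] = 0.0;
        for (int k = 0; k < n; k++) {
            const double aik = a[(size_t)i * n + k];
            const double *bk = b + (size_t)k * n; /* row k of b */
            for (int j = 0; j < n; j++)
                ci[j] += aik * bk[j];
        }
    }
}
```

This needs the compiler's OpenMP option to run in parallel (for example -fopenmp with GCC); without it the pragma is ignored and the loop simply runs serially.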

The excessive stacksize requirement can be avoided by simply moving the array declarations outside of main().
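
As a rough sketch of that change, the arrays can be declared at file scope and the loops left as they were. The helper names init_arrays and multiply are hypothetical; the original program keeps everything in main().

```c
#define ARDIM 2000

/* File-scope arrays have static storage duration: they are not placed on
   the stack of main() or of any worker thread. */
static double a[ARDIM][ARDIM], b[ARDIM][ARDIM], c[ARDIM][ARDIM];

/* Same initialization as the original program (helper name is hypothetical). */
void init_arrays(void)
{
    double di = 0.0, dj;
    for (int i = 0; i < ARDIM; i++) {
        di += 1.0;
        dj = 0.0;
        for (int j = 0; j < ARDIM; j++) {
            dj += 1.0;
            a[i][j] = di / dj;
            b[i][j] = dj / di;
            c[i][j] = 0.0;
        }
    }
}

/* Same loop nest as the original; still a candidate for auto-parallelization. */
void multiply(void)
{
    for (int i = 0; i < ARDIM; i++)
        for (int k = 0; k < ARDIM; k++)
            for (int j = 0; j < ARDIM; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```

A main() would call init_arrays() and then multiply() before printing the results.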

In both cases the memory actually used (resident set size) is only slightly more than the size of the three arrays; only the virtual address space required grows by a factor of OMP_NUM_THREADS. It looks as if the threadprivate stack is allocated but (essentially) not used.
