Issues running on Ubuntu 10.04.2 LTS

We have software that runs as a hybrid MPI-OpenMP application. We have tested it on RHEL 4 and 5 and on SUSE 10 and 11. We have a client trying to run it on Ubuntu 10.04.2 LTS, and it fails there. The application is written in Fortran, compiled with Intel Fortran 11.1 and Intel MPI 4. We compile with the mpiifort command and use the -openmp flag. On this Ubuntu system, however, the only way we can get it running is to drop the -openmp flag and instead add -liomp5 at the link stage. What is the explanation for this? Thanks!

Hi,

I am currently attempting to reproduce the behavior you are experiencing on our systems. Can you please provide me with a sample that reproduces this behavior? Thank you.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools

Hi Jim,

We have a Laplace solver (hybrid OpenMP-MPI) that crashes on the system. On the Ubuntu system it will run in pure MPI mode if we don't use the -openmp compiler flag. In hybrid mode, however, it runs only if the arrays u and du are allocated dynamically. If they are allocated statically, it runs only when imax and jmax are each less than 700. We checked 'ulimit -a' and found that max locked memory was set to 64; we changed it to unlimited, but we still get crashes. We've tried Intel MPI 4 as well as MPICH2 1.4; the behavior is the same with both in hybrid mode. We compile with:

mpiifort/mpif90 -openmp laplace.f

Thanks,
Anup

Here is the code (laplace.f):

[fxfortran]      program lpmlp
      include 'mpif.h'
      include "omp_lib.h"

      integer imax,jmax,im1,im2,jm1,jm2,it,itmax
      !parameter (imax=2001,jmax=2001)
      parameter (imax=10,jmax=10)
      parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
      parameter (itmax=100)
      real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol,pi
      parameter (umax=10.0,tol=1.0e-6,pi=3.14159)
! Additional MPI parameters
      integer istart,iend,jstart,jend
      integer size,rank,ierr,istat(MPI_STATUS_SIZE),mpigrid,length
      integer grdrnk,dims(1),gloc(1),up,down,isize,jsize
      integer ureq,dreq
      integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)
      real*8 tstart,tend,gdumax
      logical cyclic(1)
      real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)
! OpenMP parameters
      integer nthrds

! Initialize
      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)
      !call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

! 1D linear topology
      dims(1)=size
      cyclic(1)=.FALSE.
      call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid
     +     ,ierr)
      call MPI_COMM_RANK(mpigrid,grdrnk,ierr)
      call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)
      call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)
      istart=2
      iend=imax-1
      jsize=jmax/size
      jstart=gloc(1)*jsize+1
      if (jstart.LE.1) jstart=2
      jend=(gloc(1)+1)*jsize
      if (jend.GE.jmax) jend=jmax-1
      nthrds=OMP_GET_NUM_PROCS()
      print*,"Rank=",rank,"Threads=",nthrds
      call omp_set_num_threads(nthrds)

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
! Initialize -- done in parallel to force "first-touch" distribution
! on ccNUMA machines (i.e. O2k)
!$OMP DO
      do j=jstart-1,jend+1
         do i=istart-1,iend+1
            u(i,j)=0.0
            du(i,j)=0.0
         enddo
         u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))
      enddo
!$OMP END DO
!$OMP END PARALLEL

! Main computation loop
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tstart=MPI_WTIME()
      do it=1,itmax
! We have to keep the OpenMP and MPI calls segregated...
        call omp_set_num_threads(nthrds)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
!$OMP MASTER
        dumax=0.0
!$OMP END MASTER
!$OMP DO REDUCTION(max:dumax)
         do j=jstart,jend
            do i=istart,iend
               du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
               dumax=max(dumax,abs(du(i,j)))
            enddo
         enddo
!$OMP END DO
!$OMP DO
         do j=jstart,jend
            do i=istart,iend
               u(i,j)=u(i,j)+du(i,j)
            enddo
         enddo
!$OMP END DO
!$OMP END PARALLEL
! Compute the overall residual
         call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0
     +        ,MPI_COMM_WORLD,ierr)
! Send phase
         if (down.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               dobuf(j)=u(i,jstart)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid,
     +           dreq,ierr)
         endif
         if (up.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               uobuf(j)=u(i,jend)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid,
     +           ureq,ierr)
         endif
! Receive phase
         if (down.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(dibuf,length,MPI_REAL8,down,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(dreq,dstat,ierr)
            j=1
            do i=istart,iend
               u(i,jstart-1)=dibuf(j)
               j=j+1
            enddo
         endif
         if (up.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(uibuf,length,MPI_REAL8,up,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(ureq,ustat,ierr)
            j=1
            do i=istart,iend
               u(i,jend+1)=uibuf(j)
               j=j+1
            enddo
         endif
         write (rank+10,*) rank,it,dumax,gdumax
         if (rank.eq.0) write (1,*) it,gdumax
      enddo
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tend=MPI_WTIME()
      if (rank.EQ.0) then
         write(*,*) 'Calculation took ',tend-tstart,'s. on ',size,
     +        ' MPI processes'
     +        ,' with ',nthrds,' OpenMP threads per process'
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
[/fxfortran]

I think you're telling us that mpiifort isn't using the -openmp flag to imply -liomp5. You could verify this by adding -# to the mpiifort options at the failing step and examining the resulting diagnostics. That output might also show whether 32-bit and 64-bit modes and paths have been mixed up, which is easy to do on Ubuntu since it doesn't follow the usual conventions for 64-bit library paths.
Needless to say, if you specify any of the system library directories in your Makefile, you must adjust those to the different locations used by Ubuntu.
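
One quick way to check whether -openmp was actually in effect when a given source file was compiled, independent of what happens at the link step, is the standard OpenMP conditional-compilation sentinel. A minimal sketch (the program and variable names are just placeholders):

[fxfortran]      program ompcheck
! The "!$" sentinel line below is only compiled when OpenMP is
! enabled (-openmp); otherwise it is treated as a comment.
      logical ompon
      ompon=.FALSE.
!$    ompon=.TRUE.
      print*,'OpenMP compilation enabled: ',ompon
      end
[/fxfortran]

Built with mpiifort -openmp this should print T; built without the flag (even if -liomp5 is added at link time) it prints F.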

An update regarding this question (I am the user who is having the problems). We have upgraded to Ubuntu 11.10, as we needed to anyway, and hoped that in the process the problem would be resolved, but it wasn't. The current status: I'm trying to run a hybrid MPI (using MPICH2)/OpenMP problem, and I have a test F77 program that causes the same problem.

(1) If I run the problem from a specific node (say "node1"), and only include "node1" in the machines file, then the code runs fine.
(2) If I run without "-openmp", the code runs fine across nodes (if the machines file has "node1", "node2", etc.), however it obviously isn't threaded.
(3) If I run across nodes ("node1", "node2", etc.) compiling with "-openmp", I get segmentation faults and crashes.

In searching this further I discovered a problem with the statically allocated arrays. If the arrays are "too big" the program crashes with the seg fault; if they are small "enough", it runs to completion fine. As a hunch I converted the program to dynamic memory allocation, which solved the problem no matter how big the arrays were (a sketch of that change appears after the listing below). However, I do not have the ability to change the parent, much larger, program, so this is not a long-term solution, but hopefully it helps determine the problem. Attached here is the test.f file that I use. If imax=jmax>=721 the code crashes, but at 720 or smaller it works fine (a size of 10,000 worked fine in the dynamic-allocation test).

A couple of other items worth mentioning: the stack size is set to unlimited, verified through "ulimit -a", and I have changed KMP_STACKSIZE to upwards of 4G and 8G to make sure this was not an issue, and that worked fine.

One final thing. I tried linking "-liomp5" at compile time instead of using "-openmp", which I thought worked, since no segmentation faults occurred and the code ran to completion. That turned out to be only partly true: no threading was taking place, and CPU utilization never went to 800%. Even though nthrds=OMP_GET_NUM_PROCS() set nthrds to 8 (verified in a number of places), when call omp_set_num_threads(nthrds) was invoked with nthrds being 8, the number of threads was always set to 1, verified by printing out:

call omp_set_num_threads(nthrds)
nthreads = OMP_GET_NUM_THREADS()
print*,"Jack",rank,nthreads,nthrds

and seeing nthrds = 8 but nthreads = 1. I gave up on the -liomp5 linking at this point, as I believe the issue is more related to something manifest in the static vs. dynamic memory allocation behavior, but I'm somewhat stumped.

[fxfortran]      program lpmlp
      include 'mpif.h'
      include "omp_lib.h" 

      integer imax,jmax,im1,im2,jm1,jm2,it,itmax
      parameter (imax=10000,jmax=10000)
      parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
      parameter (itmax=100)
      real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol,pi
      parameter (umax=10.0,tol=1.0e-6,pi=3.14159)
! Additional MPI parameters
      integer istart,iend,jstart,jend
      integer size,rank,ierr,mpigrid,length
      integer grdrnk,dims(1),gloc(1),up,down,isize,jsize
      integer ureq,dreq
      integer ustat(MPI_STATUS_SIZE),dstat(MPI_STATUS_SIZE)
      integer istat(MPI_STATUS_SIZE)
      real*8 tstart,tend,gdumax
      logical cyclic(1)
      real*8 uibuf(imax),uobuf(imax),dibuf(imax),dobuf(imax)
! OpenMP parameters
      integer nthrds,nthreads      

! Initialize
      call MPI_INIT_THREAD(MPI_THREAD_FUNNELED,IMPI_prov,ierr)
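! MPI_THREAD_FUNNELED requests a thread-support level in which only the
! main thread makes MPI calls; IMPI_prov receives the level actually
! provided by the MPI library.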
      call MPI_COMM_RANK(MPI_COMM_WORLD,rank,ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD,size,ierr)

      print*, "hello from", rank
      !call sleep(180) 

! 1D linear topology
      dims(1)=size
      cyclic(1)=.FALSE.
      call MPI_CART_CREATE(MPI_COMM_WORLD,1,dims,cyclic,.true.,mpigrid
     +     ,ierr)
      call MPI_COMM_RANK(mpigrid,grdrnk,ierr)
      call MPI_CART_COORDS(mpigrid,grdrnk,1,gloc,ierr)
      call MPI_CART_SHIFT(mpigrid,0,1,down,up,ierr)
      istart=2
      iend=imax-1
      jsize=jmax/size
      jstart=gloc(1)*jsize+1
      if (jstart.LE.1) jstart=2
      jend=(gloc(1)+1)*jsize
      if (jend.GE.jmax) jend=jmax-1
      nthrds=OMP_GET_NUM_PROCS()
      print*,"Rank=",rank,"Threads=",nthrds
      call omp_set_num_threads(nthrds)
                    
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
! Initialize -- done in parallel to force "first-touch" distribution
! on ccNUMA machines (i.e. O2k)
!$OMP DO
      do j=jstart-1,jend+1
         do i=istart-1,iend+1
            u(i,j)=0.0
            du(i,j)=0.0
         enddo
         u(imax,j)=umax*sin(pi*float(j-1)/float(jmax-1))
      enddo
!$OMP END DO
!$OMP END PARALLEL

! Main computation loop
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tstart=MPI_WTIME()
      do it=1,itmax
! We have to keep the OpenMP and MPI calls segregated...
        call omp_set_num_threads(nthrds)
        !nthreads = OMP_GET_NUM_THREADS()
        !print*,"Jack",rank,nthreads,nthrds
               
        
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
!$OMP MASTER
        dumax=0.0
!$OMP END MASTER
!$OMP DO REDUCTION(max:dumax)
         do j=jstart,jend
            do i=istart,iend
               !nthreads = OMP_GET_NUM_THREADS()
               !print*,"Jack",rank,nthreads,nthrds
               du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
               dumax=max(dumax,abs(du(i,j)))
            enddo
         enddo
!$OMP END DO
!$OMP DO
         do j=jstart,jend
            do i=istart,iend
               u(i,j)=u(i,j)+du(i,j)
            enddo
         enddo
!$OMP END DO
!$OMP END PARALLEL
! Compute the overall residual
         call MPI_REDUCE(dumax,gdumax,1,MPI_REAL8,MPI_MAX,0
     +        ,MPI_COMM_WORLD,ierr)

! Send phase
         if (down.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               dobuf(j)=u(i,jstart)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(dobuf,length,MPI_REAL8,down,it,mpigrid,
     +           dreq,ierr)
         endif
         if (up.NE.MPI_PROC_NULL) then
            j=1
            do i=istart,iend
               uobuf(j)=u(i,jend)
               j=j+1
            enddo
            length=j-1
            call MPI_ISEND(uobuf,length,MPI_REAL8,up,it,mpigrid,
     +           ureq,ierr)
         endif
! Receive phase
         if (down.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(dibuf,length,MPI_REAL8,down,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(dreq,dstat,ierr)
            j=1
            do i=istart,iend
               u(i,jstart-1)=dibuf(j)
               j=j+1
            enddo
         endif
         if (up.NE.MPI_PROC_NULL) then
            length=iend-istart+1
            call MPI_RECV(uibuf,length,MPI_REAL8,up,it,
     +           mpigrid,istat,ierr)
            call MPI_WAIT(ureq,ustat,ierr)
            j=1
            do i=istart,iend
               u(i,jend+1)=uibuf(j)
               j=j+1
            enddo
         endif
         write (rank+10,*) rank,it,dumax,gdumax
         if (rank.eq.0) write (1,*) it,gdumax
      enddo
      call MPI_BARRIER(MPI_COMM_WORLD,ierr)
      tend=MPI_WTIME()
      if (rank.EQ.0) then
         write(*,*) 'Calculation took ',tend-tstart,'s. on ',size,
     +        ' MPI processes'
     +        ,' with ',nthrds,' OpenMP threads per process'
      endif
      call MPI_FINALIZE(ierr)
      stop
      end
[/fxfortran]
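
For reference, the dynamic-allocation change described above amounts to making u and du ALLOCATABLE so that their storage comes from the heap instead of static storage or the stack. A minimal standalone sketch of that idea, not the actual modified solver (the program name and error handling are only illustrative):

[fxfortran]      program dynalloc
! Minimal sketch of the dynamic-allocation workaround: u and du are
! ALLOCATABLE, so the imax x jmax real*8 arrays (about 800 MB each
! at 10000 x 10000) are taken from the heap at run time.
      integer imax,jmax,ierr
      parameter (imax=10000,jmax=10000)
      real*8, allocatable :: u(:,:),du(:,:)
      allocate(u(imax,jmax),du(imax,jmax),stat=ierr)
      if (ierr.NE.0) then
         print*,'allocation failed, stat=',ierr
         stop
      endif
      u=0.0d0
      du=0.0d0
      print*,'allocated two ',imax,' x ',jmax,' arrays on the heap'
      deallocate(u,du)
      end
[/fxfortran]

This mirrors the 10,000 x 10,000 dynamic-allocation test mentioned above, which ran to completion where the static version did not.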

Hi Jack,

As you have found, simply linking to the OpenMP* library (using -liomp5) is insufficient. Linking the library makes the OpenMP* run-time functions available, but the compiler directives are ignored. Only by fully enabling OpenMP* (-openmp) are the directives compiled and the parallel regions actually created.
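
For example, a small test like the following makes the difference visible (this is a sketch, not your attached code; the requested thread count of 4 is arbitrary):

[fxfortran]      program threadcount
! With -openmp the parallel region below is compiled and the call
! inside it reports the granted thread count; linked against -liomp5
! alone, the directives are ignored, the call executes serially, and
! the result is 1.
      include 'omp_lib.h'
      integer nthreads
      nthreads=1
      call omp_set_num_threads(4)
!$OMP PARALLEL
!$OMP MASTER
      nthreads=OMP_GET_NUM_THREADS()
!$OMP END MASTER
!$OMP END PARALLEL
      print*,'threads seen inside the parallel region: ',nthreads
      end
[/fxfortran]

Compiled with -openmp it prints the number of threads granted to the parallel region; compiled without -openmp and merely linked against -liomp5, it prints 1, which matches what you observed with omp_set_num_threads.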

It appears that your primary issue is memory usage. You have stated that you have set your stack to unlimited. Is this a 32-bit or 64-bit system? If you are on a 64-bit system, are you using the 32-bit or the 64-bit Intel Fortran Compiler?

I have compiled the code you provided. On our Ubuntu 10.04 system, I am able to run it with a sufficiently large stack; a stack size limit that is too small prevents the program from executing. The environment variables KMP_STACKSIZE and OMP_STACKSIZE were not set. I would recommend that you verify that an unlimited stack size really is unlimited, and check whether other system constraints, such as other programs running simultaneously, are preventing your program from running.

The Intel Software Network has two knowledge base articles that may help you with this issue.

http://software.intel.com/en-us/articles/openmp-option-no-pragmas-causes-segmentation-fault/
http://software.intel.com/en-us/articles/intel-fortran-compiler-increased-stack-usage-of-80-or-higher-compilers-causes-segmentation-fault/
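
As a rough illustration of what those articles describe (a sketch with assumed names and sizes, not your code): when a large array is PRIVATE in a parallel region, every thread needs room for its own copy on its thread stack, so an unlimited shell stack only covers the initial thread while KMP_STACKSIZE/OMP_STACKSIZE govern the workers:

[fxfortran]      program stacktest
! Each thread gets a private copy of "work" (about 32 MB here) on its
! own stack.  If OMP_STACKSIZE/KMP_STACKSIZE (worker threads) or the
! shell stack limit (initial thread) is smaller than that, the program
! can die with a segmentation fault, as the linked articles explain.
      include 'omp_lib.h'
      integer n
      parameter (n=4000000)
      real*8 work(n)
!$OMP PARALLEL PRIVATE(work)
      work(1)=OMP_GET_THREAD_NUM()
      work(n)=work(1)
!$OMP END PARALLEL
      print*,'survived with ',n*8,' bytes of private data per thread'
      end
[/fxfortran]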

Please let me know if any of this information helps address your issue.

Sincerely,
James Tullos
Technical Consulting Engineer
Intel Cluster Tools
