Loop variable optimized away

I'm experiencing a rather odd circumstance and I'm looking for any advice on how to diagnose it or fix it. I'm implementing a sparse matrix solver, and I'm dividing up a matrix-vector product over a team of OpenMP threads using a do loop with static scheduling and balanced chunks of my matrix.

The problem is, my loop variable for the OpenMP do loop is getting optimized away when optimizations are turned on (-O1, -O2, -O3) and the loop is being run more times than intended.

In my debugging environment, I can only work with one thread ($OMP_NUM_THREADS=1, set by the admin), so this "loop" should behave like serial code. However, my debug messages indicate that my loop variable is going beyond 1, and idb reports, when I'm inside the loop:

(idb) print i
Info: symbol i is defined but not allocated (optimized away)
Error: no value for symbol i
Cannot evaluate 'i'.

How should I go about figuring out what ifort has done in this optimization? Superficially, this acts like a bug, but I'm uncomfortable making that assertion without seeing exactly what the optimizations have done.

Thanks,
Jonathan


When debugging with one thread, add a variable

integer, volatile :: iCopy

Then inside your loop add

iCopy = i

If you are debugging with multiple threads, then iCopy can be an array indexed by thread (Fortran arrays are 1-based, so add 1 to the 0-based thread number):

iCopy(omp_get_thread_num() + 1) = i

You can use a private variable to keep a copy of omp_get_thread_num()
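Putting those pieces together, a minimal sketch (the array size and loop bound `n` are illustrative assumptions, not from your code) might look like:

```fortran
! Sketch: keep a volatile, per-thread copy of the loop variable so the
! debugger can inspect it even when the real index is registerized away.
use omp_lib
integer, volatile :: iCopy(64)        ! assumed maximum thread count
integer :: i, myThread

!$omp parallel private(i, myThread)
myThread = omp_get_thread_num() + 1   ! private copy; +1 for the 1-based array
!$omp do schedule(static)
do i = 1, n                           ! n stands in for your real upper bound
   iCopy(myThread) = i                ! not optimized away; visible in idb
   ! ... loop body ...
end do
!$omp end do
!$omp end parallel
```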

Jim Dempsey

www.quickthreadprogramming.com

Thanks Jim,

The volatile variable helped, and will likely be very useful in the future. What it has revealed is that the optimized instructions ifort created do not check my loop bounds. Debug statements that print the value of my loop variable have been optimized to print the expected value; the actual value (as reported by the volatile variable) does not change.

Is there a minimal expected number of loop iterations built into -O1, -O2, and -O3 which would remove the first loop bounds check?

Thanks,
Jonathan

Fortran does not perform loops the same way that C/C++ performs loops.

In Fortran, the DO iteration space is examined at entry to the DO to produce an iteration count (i.e. it becomes "do the loop N times"). From that point on, the loop control variable might be a) not used, b) registerized, or c) in the event of unrolling, advanced by the unrolled count. At loop termination, the value of the loop control variable is the last one used (not the next), or the initial setting should the loop not iterate.

If you manage to insert a breakpoint into an optimized loop, the debugger should tell you the loop control variable is not available because it has been registerized. If you are not seeing this message, then either the debugger figured out the registerization, or it may be showing you the out-of-sync value of the non-registerized loop control variable.

Also note, modifying the loop control variable within the loop does not alter the iteration count.

And, inserting the iCopy=i in the loop may interfere with unrolling.

Jim Dempsey

www.quickthreadprogramming.com

Quote:

jimdempseyatthecove wrote:

Fortran does not perform loops the same way that C/C++ performs loops.

In Fortran, the DO iteration space is examined at entry to the DO to produce an iteration count (i.e. it becomes "do the loop N times"). From that point on, the loop control variable might be a) not used, b) registerized, or c) in the event of unrolling, advanced by the unrolled count. At loop termination, the value of the loop control variable is the last one used (not the next), or the initial setting should the loop not iterate.

If you manage to insert a breakpoint into an optimized loop, the debugger should tell you the loop control variable is not available because it has been registerized. If you are not seeing this message, then either the debugger figured out the registerization, or it may be showing you the out-of-sync value of the non-registerized loop control variable.

Also note, modifying the loop control variable within the loop does not alter the iteration count.

And, inserting the iCopy=i in the loop may interfere with unrolling.

Jim Dempsey

In practice, the situation with C for loops is complicated enough that I've never seen it fully described. If you don't conform to the optimizable patterns set by individual compilers, performance will suffer or, with OpenMP, you don't get parallelization.

As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop. Modifying the loop counter inside the loop is an error (since 1977).  Some compilers may permit it in certain contexts (such as an EXIT block), as an extension.  In C, with optimization, pre-calculated loop count may also be the case, but due to the standard not requiring it, there are more cases to deal with.

Contrary to what Jim said, a DO loop which terminates normally (not by EXIT...) will set the loop index variable to the next value, analogous to what you expect in C.  Parallelization introduces possibilities in both Fortran and C for behavior to change; I've caught myself ignoring this problem.

TimP,

From my: C:\Program Files (x86)\Intel\Composer XE 2011 SP1\Documentation\en_US\compiler_f\cl\index.htm

After termination, the DO variable retains its last value (the one it had when the iteration count was tested and found to be zero).

Is the document wrong?

Apparently so (there may be a compiler switch to alter this behavior)


  DO I=1,3
      WRITE(*,*) I
  ENDDO
  WRITE(*,*) I

           1
           2
           3
           4

Jim Dempsey

www.quickthreadprogramming.com

That "last value" is the value that the loop index variable had when the loop count was tested and found to be zero, and not the last value with which the body of the loop was actually executed.

The ambiguity is:

Is the test made at the top or bottom of the loop? Or initial top, then subsequently at the bottom?
Is the loop control variable stride-stepped at the top (after initial test), or at the bottom before test?

Regardless, the document should be clear on what happens (and what happens should be consistent with the standard, where it addresses the issue).

Jim

www.quickthreadprogramming.com

The Fortran standard is clearer on this point than the ifort document. At the time the (F77) standard was adopted, compilers varied in whether they tested at the top or the bottom, or even switched with optimization level. A compiler I used had three different treatments as side effects of other options, one of which conformed with F77.

It took quite a while for some compilers to comply with this, and I still don't count on it for cases involving parallelism, e.g. where the loop induction variable might need to be firstprivate or lastprivate (which aren't allowed).

Thank you all for helping to clarify the loop test. There are a few points I need to make in reference to the past posts.

Quote:

jimdempseyatthecove wrote:
If you manage to insert a break point into an optimized loop, the debugger should tell you the loop control variable is not available due to being registerized. If you are not seeing this message then either the debugger figured out the registerization or may be showing you the out of synch value of the non-registerized loop control variable.
The variable is defined, but not allocated. I interpret this to mean that the variable is not being used at all. Copying the value to a volatile variable proved problematic, as the behavior of the program changed (-O0 started experiencing errors later in the code).

My problem is that the loop count does not seem to be tested effectively after the first iteration - it would not matter whether the test is at the beginning or the end of the code segment. The iteration count is 1 for this test case, yet it continues to iterations 2 and 3 before causing a segmentation fault when optimization levels -O1, -O2, or -O3 are used. Using -O0 results in functional (though obviously slower) code.

Quote:

TimP (Intel) wrote:
As Jim says, in Fortran, the number of iterations (for the case without EXIT or equivalent) is determined before entering the loop.
No EXIT or equivalent is present. Since the number of iterations is dependent on run-time conditions (number of OpenMP threads available), this cannot be pre-calculated by the compiler. Is there a way to access the --actual-- iteration count and current iteration used in the assembly compare instruction? Either the number of iterations is not being calculated correctly at run-time, the current iteration is not correct, or the test is not being performed after completing all instructions corresponding to the code block of the loop (necessary regardless of the position of the test in assembly instructions). I suspect those are the three most likely reasons that the bounds of my do loop are not being respected.

Does this sound reasonable? If so, any suggestions for hunting down the cause? I very much suspect this will lead to a bug report, but since my program has dependencies on MKL, Intel LAPACK and BLAS, and a separate sparse-matrix-solving library, I'd like to get all my ducks in a row to explain the issue. Otherwise, the appropriate development team would not have much to work on.

Thanks,
Jonathan

*Correction: -O1 is not behaving as nicely as -O2 and -O3 are.

Segmentation fault upon entering a single region:

Program received signal SIGSEGV
__kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
(idb) backtrace
#0  0x00002af9cc38d71a in __kmp_enter_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so
#1  0x00002af9cc370e16 in __kmpc_single () in /opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so

I thought this had been cleaned up by recompiling my sparse matrix solver library, but it turns out I was wrong. Still, I seriously doubt this issue is related to the original one, so I'll start a thread on it later if it still annoys me. The target optimization for the final program is -O3.

So, let's restrict optimizations considered for this to -O2 and -O3, which experience the same symptoms.

Jonathan

If you intend to run a DO loop over num_threads, would you not use something like

use omp_lib

!$omp parallel private(nt)
    nt = omp_get_num_threads()
!$omp do
    do i = 1, nt
       ...
    end do
!$omp end parallel

Nearly everything in OpenMP depends on loop counts being calculated before entering the loop.  So there's no reason here to rebel against the Fortran standard.  In fact, if you have an OpenMP loop in which later iterations may have nothing to do, you typically need to let them spin without exit, e.g.

if(nomorework)cycle
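In context, a hedged sketch of that pattern (the names `nomorework`, `niter`, and `do_work` are placeholders):

```fortran
!$omp do schedule(static)
do i = 1, niter
   ! EXIT out of an OpenMP worksharing do is not permitted, so iterations
   ! that have nothing left to do simply fall through with CYCLE.
   if (nomorework) cycle
   call do_work(i)
end do
!$omp end do
```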

 

 

This is an abbreviation of what I'm working with:

integer, dimension(:), allocatable :: chunk
...
!$omp parallel shared(chunk)
!$omp master
allocate(chunk(omp_get_num_threads()+1))
...
! find appropriate divisions of the sparse
! matrix for omp_get_num_threads() pieces
...
!$omp end master
!$omp end parallel
...
!$omp parallel shared(chunk, ...) &
!$omp          private(i,j,k,...)
! allocate local result array
...
!$omp do schedule(static)
do i = 1, size(chunk)-1
   do j = 1, ...
      do k = ...
         localResult(j) = localResult(j) + Nonzero(k) * InputVector(k)
      end do
   end do
   !$omp critical
   do j = 1, MatrixSize
      FinalResult(j) = FinalResult(j) + localResult(j)
   end do
   !$omp end critical
   ...
end do
!$omp end do
!$omp end parallel
deallocate(chunk)

If I'm forced to write N copies where N is the maximum number of threads per node I anticipate using, I can, but that's not really the cleanest code.

I should also add that this scheme has worked fine in the past - this error only popped up when I took advantage of Hermitian symmetry and added the reduction phase (critical section).

Jonathan,

In the sketch code you showed, localResult would have to be private. This may be an omission in producing the code snip, but it could also be a coding oversight in your current code.

Also, the sketch code does not show how the thread unique "i"'s disambiguate the data. IOW is there some code between "do i" and "do j" that selects the stripe of data unique to each thread?

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

You are correct, localResult must be private, and it is in my actual code; I paraphrased too quickly. And despite the fact that I edited the code example, the j loop edit did not make it into the post.

do j=chunk(i)+1,chunk(i+1)
are the bounds on the j loop. The chunk array is what disambiguates the data. Also, MatrixSize is a global parameter, Nonzero is a global allocated array (both of which are encapsulated in a module), and finalResult is shared.
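For reference, the kind of chunk construction described above might be sketched like this (a hypothetical, simple balanced split over rows; the real code presumably weights the split by nonzeros per row instead):

```fortran
! Sketch: chunk(t)+1 .. chunk(t+1) are the rows assigned to thread t, so
! consecutive entries give non-overlapping, contiguous row ranges.
nt = omp_get_num_threads()
allocate(chunk(nt+1))
chunk(1) = 0
do t = 1, nt
   chunk(t+1) = (MatrixSize * t) / nt   ! cumulative; ends at MatrixSize
end do
```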

Jonathan

>>do j=chunk(i)+1,chunk(i+1)

Insert if (j .eq. chunk(i)+1) write(*,*) omp_get_thread_num(), chunk(i)+1, chunk(i+1)
(insert immediately following do j=...)

Just to verify non-overlapping chunks

Jim Dempsey

www.quickthreadprogramming.com

Relevant output:

 Time to load hamiltonian into memory:  3.0000000E-03
 Chunk:
           0
          43
 Size of Chunk:           2
 Time to complete Lanczos iteration:  6.0999999E-03
 Dimension of localResult:          43
 Dimension of workingArray:         129
 Pointer to input vector first element:          87
 Pointer to output vector first element:          44
 Bounds on loop var i:           1           1
 Bounds on            1 th loop var j:           1          43
 Bounds on            1 th loop var k:           1         339
 Loop var analysis complete.
 Made it into i loop. Variable i=           1
 Zeroed working array.
           0           1          43
 Made it into j loop. Variable j=           1
 Made it into k loop. Variable k=           1
 Made it into k loop. Variable k=           2
 Made it into k loop. Variable k=           3
 Made it into k loop. Variable k=           4
 Made it into k loop. Variable k=           5
 Made it into j loop. Variable j=           2
 Made it into k loop. Variable k=           6
 Made it into k loop. Variable k=           7
 Made it into k loop. Variable k=           8
 Made it into k loop. Variable k=           9
 ...
 Made it into j loop. Variable j=          43
 Made it into k loop. Variable k=         339
 Reducing matrix-vector product.
 Reduction complete.
 Made it into i loop. Variable i=           2
 Zeroed working array.
 Reducing matrix-vector product.
 Reduction complete.
 Made it into i loop. Variable i=           3
 Zeroed working array.
           0  -860295351          56
 Made it into j loop. Variable j=  -860295351
forrtl: severe (174): SIGSEGV, segmentation fault occurred

I just read the section on do constructs in the Fortran 95 spec. I'm not doing anything that is nonstandard. The iteration count can be calculated during runtime prior to entry into the loop block. I even modified the code by using

do i=1,omp_get_num_threads()

which produced identical results.

But disturbingly, I just logged into the server again and this output occurred:

 Time to load hamiltonian into memory:  2.4999999E-03
 Chunk:
           0
          43
 Size of Chunk:           2
 Time to complete Lanczos iteration in ARPACK:  6.0000003E-04
 Dimension of prodPart:          43
 Dimension of workd:         129
 Pointer to input vector first element:          87
 Pointer to output vector first element:          44
 Bounds on loop var i:           1           1
 Bounds on            1 th loop var j:           1          43
 Bounds on            1 th loop var k:           1         339
 Loop var analysis complete.
 Made it into i loop. Variable i=           1
 Zeroed working array.
           0           1          43
 Made it into j loop. Variable j=           1
 Made it into k loop. Variable k=           1
 Made it into k loop. Variable k=           2
 Made it into k loop. Variable k=           3
 Made it into k loop. Variable k=           4
 Made it into k loop. Variable k=           5
 Made it into j loop. Variable j=           2
 Made it into k loop. Variable k=           6
 Made it into k loop. Variable k=           7
 Made it into k loop. Variable k=           8
 Made it into k loop. Variable k=           9
 Made it into k loop. Variable k=          10
 ...
 Made it into j loop. Variable j=          43
 Made it into k loop. Variable k=         339
 Reducing matrix-vector product.
 Reduction complete.
 Made it into i loop. Variable i=           2
 Zeroed working array.
           0          44   146337608
 Made it into j loop. Variable j=          44
 Made it into j loop. Variable j=          45
 Made it into j loop. Variable j=          46
 Made it into k loop. Variable k=           0
 Made it into k loop. Variable k=           1
 Made it into k loop. Variable k=           2
 Made it into k loop. Variable k=           3
 Made it into k loop. Variable k=           4
 Made it into k loop. Variable k=           5
 Made it into k loop. Variable k=           6
 Made it into k loop. Variable k=           7
 Made it into k loop. Variable k=           8
 Made it into k loop. Variable k=           9
 Made it into k loop. Variable k=          10
 ...
 Made it into k loop. Variable k=         343
 Made it into k loop. Variable k=         344
 Made it into k loop. Variable k=         345
forrtl: severe (174): SIGSEGV, segmentation fault occurred

Image              PC                Routine            Line        Source
diagonalize_8      0000000000411360  Unknown               Unknown  Unknown
libiomp5.so        00002B5ABEFCDFE3  Unknown               Unknown  Unknown

I have no idea why the behavior would vary like this, but at least the continuation of the loop (i=2) is consistent with the previous output.

Jonathan

It turns out that the difference in behavior is because I was testing on a different login node. To the best of my knowledge, the environment is uniform across all nodes in the system, but I'm verifying that with the sysadmins. Still, $OMP_NUM_THREADS is set to 1 for all login nodes, so my executable should not vary due to system load, right?

Anyone know of conditions in the OpenMP library that would cause an executable using only one thread to vary between two systems with identical hardware and software?

Thanks,
Jonathan

In your printout it lists:

Bounds on loop var i:           1           1

Your do i=1,omp_get_num_threads() is producing an i larger than the upper bound of the i index in your array.

Therefore you will need, immediately following the do i=1,omp_get_num_threads(), a statement like

if(i .gt. ubound(...)) exit

Where ... is replaced by the proper reference to obtain the bounds of the array indexed by i (same way as you obtained bounds for above report).

If your i bounds will be small in your production version (IOW smaller than the thread count), then consider moving the parallelization inwards.

Jim Dempsey

 

www.quickthreadprogramming.com

Hi Jim,

Thanks for the suggestion; however, the sysadmins set $OMP_NUM_THREADS=1 on non-compute nodes, so the bounds on variable i match the number of threads. Still, it was worth a try, so I inserted

 if (i > size(chunk)) cycle 

since this is an OpenMP parallelized do loop and EXIT commands are prohibited. However, that had no effect. The program output was identical. This is why I would like to track down the internal loop iteration count variable and number of iterations calculated, but since they're not going to be in the debugging symbols I need a recommendation on how to find them.

In the production environment, this will be running with 16+ threads, but it should work with however many threads are available.

Jonathan

Best Reply

Consider (inside the parallel region, with I private):

do I = omp_get_thread_num() + 1, yourUpper, omp_get_num_threads()
   ...
end do
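Applied to the matrix-vector product sketched earlier, this might look like the following (a sketch, not your exact code; note there is no !$omp do, since the stride of omp_get_num_threads() already gives each thread its own disjoint set of i values):

```fortran
!$omp parallel private(i, j, localResult)
do i = omp_get_thread_num() + 1, size(chunk) - 1, omp_get_num_threads()
   localResult = 0.0
   do j = chunk(i) + 1, chunk(i + 1)
      ! ... accumulate this thread's partial product into localResult ...
   end do
   !$omp critical
   FinalResult = FinalResult + localResult   ! serialized reduction
   !$omp end critical
end do
!$omp end parallel
```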

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

That is a beautiful modification, and it worked. My best guess is that having all bounds of the loop dependent on the environment prohibited ifort from making the assumption that caused the error in optimization.

Thanks much!

Jonathan

The above suggestion will work best when the amount of work for each I is approximately equal.

Also note, if the output needs to be in order of I then consider something like this:

integer, volatile :: NextOutput

(in parallel region, NextOutput shared, I private)

NextOutput = 1 ! all threads reset
!$OMP BARRIER ! assure all threads are past the reset
do I = omp_get_thread_num() + 1, YourUpper, omp_get_num_threads()
   ... ! parallel computational work here
   do while (NextOutput .NE. I)
      call SleepQQ(0)
   end do
   write(*,*) YourOutput
   NextOutput = NextOutput + 1
end do

Jim Dempsey
 

www.quickthreadprogramming.com

I just checked the optimization report - the loop was not unrolled. Any guess for what I should look for to identify the source of the error? I'd like to find the behavior that caused this and submit that as a bug for optimizations.

Thanks,

Jonathan
