Advantage DO CONCURRENT against DO

Maybe this is a dumb question:

What advantage has a DO CONCURRENT loop compared to a normal DO loop? I understand that in DO CONCURRENT you are not allowed to use an index (i-1), but basically every DO CONCURRENT can be written as a normal DO loop (but not the other way round).

In the documentation the following sample code is shown

   DO CONCURRENT (I = 1:N)
      Q = B(I) + C(I)
      D(I) = Q + SIN(Q) + 2
   END DO

which can be written as

   DO I = 1,N
      Q = B(I) + C(I)
      D(I) = Q + SIN(Q) + 2
   END DO

Is the DO CONCURRENT construct faster or what is the advantage?

Another question: I remember that for parallel reasons it would be more effective to declare Q(N) too, so the DO loop can access Q(I) instead of just Q. Is that right?



I don't see any advantage since it is not performed in parallel unless /Qparallel is specified. Both a DO CONCURRENT and an ordinary DO would be parallelized when using /Qparallel and in both cases you would not need to make Q any array since it would be made private to each iteration automatically. At least that is my understanding.

The only difference I can see is that a DO CONCURRENT demands parallel when /Qparallel is specified, whereas a DO is conditionally parallelized.

I would rather that DO CONCURRENT were always parallelized; otherwise, what's the point of having it?

The rules around DO CONCURRENT mean that it is easier for a compiler to analyse the loop and determine whether it can be run concurrently (these rules mean that each iteration is independent from any other single iteration, but there is still potential for dependency across all iterations of the construct as a whole) and what transformations it needs to make to permit that.

A compiler that has its wits about it would probably generate equivalent code for the classic DO loop and the DO CONCURRENT.  That might be parallel code if the compiler thinks it is worthwhile.  There is a semantic difference in that Q is undefined after the DO CONCURRENT, but it must be defined with the value it had from the last iteration with the normal DO.
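The difference in what happens to Q can be sketched as follows. This is only an illustration: the module and routine names are made up for this post, and the BLOCK construct is one F2008 way to give each iteration its own temporary, not something the documentation sample does.

```fortran
module q_scope_sketch
  implicit none
contains
  ! Classic DO: q is an ordinary variable and keeps the value
  ! assigned in the last iteration.
  subroutine classic_do(b, c, d, q)
    real, intent(in)  :: b(:), c(:)
    real, intent(out) :: d(:)
    real, intent(out) :: q
    integer :: i
    q = 0.0                       ! defined even if the loop runs zero times
    do i = 1, size(b)
      q = b(i) + c(i)
      d(i) = q + sin(q) + 2
    end do
  end subroutine classic_do

  ! DO CONCURRENT: the temporary is declared inside a BLOCK, so every
  ! iteration has its own q and nothing is left over after the loop.
  subroutine concurrent_do(b, c, d)
    real, intent(in)  :: b(:), c(:)
    real, intent(out) :: d(:)
    integer :: i
    do concurrent (i = 1:size(b))
      block
        real :: q
        q = b(i) + c(i)
        d(i) = q + sin(q) + 2
      end block
    end do
  end subroutine concurrent_do
end module q_scope_sketch
```

Both routines fill D with the same values; only the classic version hands back a usable Q.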

A compiler might decide that it is not worthwhile incurring the overhead of parallel execution.  Consider also the vector inner, parallel outer mantra.  Consequently I'm not sure that mandating always parallel execution for DO CONCURRENT would necessarily be a good thing. 

As of F2008, there are also syntax differences: the iteration variable can be declared in the scope of the construct, and a mask expression can determine the active iterations.

In order for ifort to optimize DO CONCURRENT with a mask, in the usual case where the compiler will raise "protects exception," !dir$ vector (preferably vector aligned) is needed (as it would be with the similar DO).  Then I observe that the actual location of the conditionals is optimized; for example

!dir$ vector aligned
do concurrent (i = 1:n, d(i) > 0.)
    a(i) = a(i) + b(i)*c(i)
end do

may produce similar code to

!dir$ vector aligned
a(1:n) = a(1:n) + merge(b(1:n)*c(1:n), 0., d(1:n) > 0.)

I think the do concurrent version is a little easier to get right.
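For anyone who wants to check that the two forms give the same result, here is a minimal self-contained sketch. The module and routine names are invented for illustration, and the ifort directives are left out so that it compiles anywhere; it says nothing about which form is faster.

```fortran
module masked_update_sketch
  implicit none
contains
  ! Masked DO CONCURRENT: only iterations with d(i) > 0 are active.
  subroutine update_masked_loop(a, b, c, d)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:), d(:)
    integer :: i
    do concurrent (i = 1:size(a), d(i) > 0.)
      a(i) = a(i) + b(i)*c(i)
    end do
  end subroutine update_masked_loop

  ! Array-expression form: MERGE picks b*c where the mask is true
  ! and 0 elsewhere. Note that b*c may be evaluated for every
  ! element, including the masked-off ones.
  subroutine update_merge(a, b, c, d)
    real, intent(inout) :: a(:)
    real, intent(in)    :: b(:), c(:), d(:)
    a = a + merge(b*c, 0., d > 0.)
  end subroutine update_merge
end module masked_update_sketch
```

The fact that MERGE may compute b*c even where the mask is false is the array-syntax counterpart of the "protects exception" concern mentioned above.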

I suppose ifort doesn't want to use fma here anyway because of the shortage of memory operands in fma3.

With compilers other than ifort, masked do concurrent is likely not to perform as well.

All aspects of do concurrent are defined in f2008; there isn't a pre-f2008 subset definition corresponding to the part of the do concurrent standard implemented by ifort.

A single assignment ifort forall without mask (preceded by !dir$ ivdep if necessary) may be nearly equivalent to a do concurrent. These distinctions vary among implementations.

Implementation of !$omp parallel workshare might be needed to turn do concurrent into a parallel construct.  I don't know if there has been any expert discussion on this point.  workshare appears not to support omp simd.


onkelhotte wrote:

   DO I = 1,N
      Q = B(I) + C(I)
      D(I) = Q + SIN(Q) + 2
   END DO

Another question: I remember that for parallel reasons it would be more effective to declare Q(N) too, so the DO loop can access Q(I) instead of just Q. Is that right?


Some vectorizing compilers of four decades ago had a limitation like the one Markus suggests: vector operations that combined scalars and vectors weren't supported.  This was an unsupportable limitation.

Now we have the option to precede such a loop by

!$omp parallel do simd private(Q)

where the private is mandatory for OpenMP parallelization (not that a compiler is prevented from making it work without private).  Promoting the scalar to an array is too expensive.
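Written out in full, the loop with the directive could look like the sketch below. The module and subroutine names are placeholders, and the directive only takes effect when OpenMP is enabled (for example /Qopenmp with ifort); otherwise it is treated as a comment and the loop runs serially.

```fortran
module omp_private_sketch
  implicit none
contains
  subroutine compute_d(b, c, d)
    real, intent(in)  :: b(:), c(:)
    real, intent(out) :: d(:)
    real    :: q
    integer :: i
    ! private(q) gives each thread (and SIMD lane) its own copy of
    ! the scalar, so q never has to be promoted to an array q(n).
    !$omp parallel do simd private(q)
    do i = 1, size(b)
      q = b(i) + c(i)
      d(i) = q + sin(q) + 2
    end do
    !$omp end parallel do simd
  end subroutine compute_d
end module omp_private_sketch
```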


DO CONCURRENT does not "demand parallel" - it allows/requests it. As others have said, the semantics of DO CONCURRENT make it more likely that the loop can be parallelized correctly. If you're not enabling auto-parallel, there is no benefit to DO CONCURRENT.

Steve - Intel Developer Support

DO CONCURRENT asserts the absence of loop-carried dependencies, as Ian pointed out above. This helps auto-parallelization (threading) by the compiler with /Qparallel, but also helps vectorization, which on Intel Architecture is essentially SIMD parallelism. DO CONCURRENT provides a language standard way to enable vectorization of loops which might otherwise require a non-standard language extension such as a !DIR$ IVDEP compiler directive. Examples:

module mymod
  REAL, pointer, dimension(:) :: A,B,C,D,E,F
end module

subroutine Concurrent(N)
  use mymod

!!dir$ ivdep:loop
!  DO  I = 1, N
  DO CONCURRENT (I = 1:N)
      C(I) = A(I) + B(I) + SIN(E(I)) - 2.*F(I)*SQRT(D(I))
  END DO
end subroutine Concurrent

>ifort /c /nologo /Qvec-report2 test_concurrent1.f90               (DO CONCURRENT version)
test_concurrent1.f90(11): (col. 3) remark: LOOP WAS VECTORIZED

>ifort /c /nologo /Qvec-report2 test_concurrent2.f90               (DO version)
test_concurrent2.f90(10): (col. 3) remark: loop was not vectorized: existence of vector dependence

The compiler vectorizes with DO CONCURRENT, but not with plain DO unless the directive is uncommented, because the pointers could be aliased (point to overlapping regions of memory). It also auto-parallelizes if /Qparallel is specified.

The math functions are incorporated to ensure there is enough computational work to make auto-parallelization worthwhile (the overheads are much higher than for vectorization). The multiplicity of different variables ensures that there are too many combinations for the compiler to test dynamically for pointer aliasing (data overlap) at run time.

subroutine Concurrent(C,D,IND,N)
  REAL,    dimension(:) :: C,D
  INTEGER               :: N
  INTEGER, dimension(:) :: IND

!  DO  I = 1, N
  DO CONCURRENT (I = 1:N)
      D(IND(I)) = sqrt(C(I)) / SIN(C(I))
  END DO
end subroutine Concurrent

This loop would be unsafe to auto-parallelize, and could be unsafe to auto-vectorize, if IND(I) had the same value for two different values of I. If the programmer knows that all values of IND(I) are distinct, then use of DO CONCURRENT will allow the loop to be auto-parallelized and/or vectorized.
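Since the compiler takes the programmer's word for it, it can be worth checking the assumption in a debug build. The sketch below pairs the indexed store with a simple check that IND has no repeated values. The names all_distinct and scatter are hypothetical, and the check is a deliberately naive O(N**2) scan meant for testing, not production use.

```fortran
module scatter_sketch
  implicit none
contains
  ! True when ind holds no repeated value, i.e. no two iterations
  ! would store to the same element of d.
  pure logical function all_distinct(ind)
    integer, intent(in) :: ind(:)
    integer :: i
    all_distinct = .true.
    do i = 2, size(ind)
      if (any(ind(1:i-1) == ind(i))) all_distinct = .false.
    end do
  end function all_distinct

  ! Indexed store; only valid as DO CONCURRENT if all_distinct(ind)
  ! holds, which is the caller's responsibility.
  subroutine scatter(c, d, ind)
    real,    intent(in)    :: c(:)
    real,    intent(inout) :: d(:)
    integer, intent(in)    :: ind(:)
    integer :: i
    do concurrent (i = 1:size(c))
      d(ind(i)) = sqrt(c(i)) / sin(c(i))
    end do
  end subroutine scatter
end module scatter_sketch
```

A caller might guard the call with something like "if (.not. all_distinct(ind)) error stop" while testing, and drop the check once the index vector is known to be a permutation.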

One final example of DO CONCURRENT, for auto-parallelization only:

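subroutine Concurrent(A,B,N)
  INTEGER, intent(in)             :: N
  REAL, intent(in ), dimension(N) :: A
  REAL, intent(out), dimension(N) :: B
  interface
    pure real function myfun(X)
      REAL, INTENT(IN) :: X
    end function myfun
  end interface

!  DO  I = 1, N
  DO CONCURRENT (I = 1:N)
      B(I) = myfun(A(I))
  END DO
end subroutine Concurrent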

The combination of DO CONCURRENT with the "PURE" attribute of the function called within the loop allows this to be auto-parallelized with /Qparallel.

In general, if you prefer that a loop be always parallelized, you should use OpenMP (e.g. !$OMP PARALLEL DO and /Qopenmp) rather than auto-parallelization with /Qparallel. Likewise, if you prefer that a loop be always vectorized, you should use explicit vector programming, either with OpenMP (!$OMP SIMD - needs /Qopenmp or /Qopenmp-simd) or with the corresponding Intel extensions !DIR$ SIMD, etc. In both cases, though, the programmer is responsible for the correctness of the code; the compiler does not do dependency checking.


Is there a missing IND() inside the subscript of D in the right hand side of the assignment of Martyn's second last example?

Yes, indeed. In fact, a simpler example would be if D did not appear on the right hand side at all. The point is just that two different iterations of the loop could be trying to write to the same memory location. But if, for example, IND just contains the integers 1 to N reordered, with none appearing twice, then vectorization or parallelization is safe. I will fix it.

I also inadvertently pasted in command lines from Linux instead of Windows, which I'll correct. I test on both, and the behavior of the vectorizer is generally the same for each.

I just looked at a case where apparently !dir$ vector always is needed for ifort to optimize for AVX2 with indexed store on the left hand side, regardless of whether DO or DO CONCURRENT is used (!$omp simd also can be used with DO).  For AVX2, it's still using sequential stores, so the partial vectorization wouldn't alter the behavior even if the restriction on the index vector were violated.

Latest gfortran does use DO CONCURRENT as about the only method available equivalent to an IVDEP directive, so there should be some cases where this can give you a portable method for optimization.

DO CONCURRENT only asserts that parallel (or vector) execution would be safe; it does not require parallel or vector execution. The compiler will only vectorize if it's fairly confident that this will improve performance. (It often has incomplete information, of course).

!DIR$ VECTOR ALWAYS overrides the compiler's performance estimate and the compiler will vectorize, even if it thinks that is unlikely to lead to a performance gain. Despite its name, the loop is not "always" vectorized; the directive does not override the dependency analysis and the compiler will not vectorize if it thinks there's any possibility of a dependency that would make vectorization unsafe. To override both potential dependencies and the performance estimate, you need !DIR$ VECTOR ALWAYS as well as either DO CONCURRENT  or !DIR$ IVDEP.
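As a sketch of the two combinations just described, applied to an indexed store: the module and routine names are invented, the directives are ifort-specific and are plain comments to other compilers, and whether ifort actually vectorizes still depends on the target and compiler version. Both routines assume IND has no repeated values.

```fortran
module directive_sketch
  implicit none
contains
  ! DO CONCURRENT asserts independence; VECTOR ALWAYS overrides the
  ! compiler's performance estimate.
  subroutine scatter_concurrent(c, d, ind)
    real,    intent(in)    :: c(:)
    real,    intent(inout) :: d(:)
    integer, intent(in)    :: ind(:)
    integer :: i
    !dir$ vector always
    do concurrent (i = 1:size(c))
      d(ind(i)) = sqrt(c(i)) / sin(c(i))
    end do
  end subroutine scatter_concurrent

  ! Plain DO: IVDEP supplies the independence assertion instead.
  subroutine scatter_ivdep(c, d, ind)
    real,    intent(in)    :: c(:)
    real,    intent(inout) :: d(:)
    integer, intent(in)    :: ind(:)
    integer :: i
    !dir$ ivdep
    !dir$ vector always
    do i = 1, size(c)
      d(ind(i)) = sqrt(c(i)) / sin(c(i))
    end do
  end subroutine scatter_ivdep
end module directive_sketch
```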

Whether the compiler thinks vectorization will improve performance depends on how much work there is, compared to the overhead from the indexed store (scatter) on the left hand side. That's why my examples have those math functions on the right hand side - it's a more concise way to increase the computational load than adding extra lines inside the loop.

There is, as you know, a directive that overrides both the dependency analysis and the performance analysis of the compiler, and requires vectorization:

!DIR$ SIMD     or the OpenMP 4.0 equivalent    !$OMP SIMD

These are powerful but have their dangers (like OpenMP threading). People shouldn't use them until they have read about and understood all the implications.

Your point about DO CONCURRENT in gfortran is a good one; as far as I know, non-standard vectorization and parallelization directives are not currently supported. However, I expect that at least the OpenMP 4.0 directives will be supported in due course.

!dir$ vector always is also required to over-ride "protects exception" assumptions in the compiler, which usually come up when using the mask field of do concurrent, and could also come up with merge.  So then it may be necessary to switch to !$omp simd if wishing to take advantage of future portability.
