OpenMP speedup not realized due to spin time

I am attempting to speed up a large program. I have identified the hot spots using profiling, but when I use OpenMP to parallelize the key loops, I get only a slight speed-up (about 30%) instead of the ideal factor of 8.

I created a smaller test program to figure out what is happening. This is the part containing the OpenMP loops (the complete program is attached).

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO  PRIVATE(iRadius)
        DO iRadius = 1, nRadii
            Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! diffusion coefficient at iRadius
            sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! if compression, this will account for it
        END DO
!$OMP END DO
!$OMP END PARALLEL

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                Z(iRadius)=ZCalc(iRadius, iSpecies)
                G(iRadius, 1)=Dradius(iRadius-1)*dt*A1(iRadius, 1)+B(iRadius, 1)-sRadius(iRadius-1)*omSqRun*dt*A2(iRadius, 1)
                G(iRadius, 2)=Dradius(iRadius)*dt*A1(iRadius, 2)+B(iRadius, 2)-sRadius(iRadius)*omSqRun*dt*A2(iRadius, 2)
                G(iRadius, 3)=Dradius(iRadius+1)*dt*A1(iRadius, 3)+B(iRadius, 3)-sRadius(iRadius+1)*omSqRun*dt*A2(iRadius, 3)
            END DO L_R2_Parallel
        nThreads=omp_get_num_threads()
!$OMP END PARALLEL

VTune shows a large amount of time spent in __kmp_fork_barrier and __kmpc_barrier. I don't understand why any significant time is being spent at barriers, since an even division of the workload for each loop should result in all threads finishing at the same time. Task Manager shows 100% CPU usage while the program is running, as expected. I have attached the VTune summary; it also shows a large "spin time", which is mostly the sum of these two.

Compiled under Visual Studio as x64 release build, with option /Qopenmp.

3.4 GHz Haswell (8 logical CPUs), Windows 7 64-bit, Visual Studio 2017 15.6.7, Intel XE2018 update 1.


What is the value of nRadii?

Edit, I see from source 10,000.

This may be more of an issue of non-optimal array index order, preventing the code from vectorizing and increasing memory-bandwidth demand.

Jim Dempsey

Jim -

In the sample pgm I attached, nRadii=10,000. In the actual pgm, 10,000 is an upper limit; typically it's 1,000-3,000.

AFAIK, all the arrays are in the optimal order (1st subscript is the loop index). However, regardless, I expect a more significant speedup when multi-threading. Note that I did not compile the code with parallelization enabled, although /O2 includes vectorization. The Fortran options were

/nologo /O2 /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc150.pdb" /libs:dll /threads /c /Qopenmp

If I were writing the threading manually, I would divide the 10,000 iterations into 8 parts, assign thread 0 iterations 1-1250, thread 1 iterations 1251-2500, etc., and start them all. I would expect them all to finish at about the same time, since the work being done is identical. Further, there should be no contention from invalidated cache lines, because each thread writes to a different region of memory. I cannot tell what OpenMP is doing, though.
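For reference, the even block split described above can be sketched in C (the helper name block_range is hypothetical); this is essentially the division that OpenMP's schedule(static) computes for each thread:

```c
#include <assert.h>

/* Even block partition of iterations 1..n among nthreads.  Thread t gets
   the inclusive range [*start, *end]; any remainder iterations go to the
   first n % nthreads threads so the split stays as even as possible. */
void block_range(int n, int nthreads, int t, int *start, int *end)
{
    int base = n / nthreads, rem = n % nthreads;
    *start = t * base + (t < rem ? t : rem) + 1;
    *end   = *start + base - 1 + (t < rem ? 1 : 0);
}
```

With nRadii = 10000 and 8 threads this reproduces the 1-1250, 1251-2500, ... split, which is why identical per-iteration work should indeed bring all threads to the barrier at about the same time.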

To put it another way, if the loops vectorize, you have some assurance that cache locality may be good enough not to be a factor in limiting omp parallel speedup.  Even the second loop may be tricky to vectorize; the opt-report should show what happened, but it looks like the cache placement should be OK for omp parallel.  If adding the omp parallel disables vectorization, that would be a strong reason for limited speedup (you would be comparing optimized single-thread code against non-optimized multi-thread code).  Setting omp parallel simd might help in that case.  Still, if single-thread vector code already uses a significant fraction of the maximum memory bandwidth, that will limit parallel speedup.

The first loop would probably not be vectorized effectively, and the cache behavior (and thus the memory bandwidth limitation) could depend strongly on the size of first dimension.

If you can't optimize the memory layout, you may need to analyze with a tool like Intel VTune or Advisor to begin to answer your questions.  For Advisor, you need to set up the opt-report anyway.

                Z(iRadius) = ZCalc(iRadius, iSpecies)
                vtd = [ Dradius(iRadius-1), Dradius(iRadius), Dradius(iRadius+1) ]*dt
                vts = [ Sradius(iRadius-1), Sradius(iRadius), Sradius(iRadius+1) ]*omSqRun*dt
                G(iRadius,:) = vtd*a1(iRadius,:) + B(iRadius,:) - vts*A2(iRadius, :)

Looking at what your code does in the loop, I created the lines above (with two temp arrays vtd and vts) that use array operations, which is normally faster. I will note these are non-optimal, since in effect the second index is varying fastest, making it the inner loop. If omSqRun and dt are invariant, then pre-multiplying the whole Dradius and sRadius arrays outside the loop as whole-array operations will be much faster. Is ZCalc an array or a function? If it is an array, you are just copying a slice of ZCalc, so again why have it in a loop? I would sort the maths out for vectorisation before worrying about parallelisation. In reality your loop does not appear to do much work, so the overheads of parallelisation may remove much of the benefit.

 

Maybe Andrew means not much work relative to the amount of data movement.  Loop length 10000 might be enough for a gain from parallelism if memory access time didn't dominate. Dividing a loop length 1000 among threads almost certainly is not productive.  Andrew's points look generally well taken.  The compiler is more likely to take advantage of loop invariants when OpenMP is not applied, although this pattern may be too complicated for the compiler to see.
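To put a rough number on "work relative to data movement": assuming 8-byte reals and no cache reuse across iterations, the G loop reads three elements each of A1, B and A2 plus one new element each of Dradius and sRadius per iteration, writes three elements of G, and performs roughly 21 floating-point operations. A back-of-envelope sketch (the counts are estimates, not measurements):

```c
#include <assert.h>

/* Estimated arithmetic intensity of the G loop: about 14 doubles moved
   per iteration versus ~21 flops (5 multiplies + 2 add/sub per G
   component).  A Haswell-class core can retire far more flops per byte
   of DRAM traffic than this, so the loop is memory-bound and extra
   threads mostly compete for the same bandwidth. */
double g_loop_intensity(void)
{
    int doubles_moved = 3 /*A1*/ + 3 /*B*/ + 3 /*A2*/ + 3 /*G store*/
                      + 1 /*Dradius*/ + 1 /*sRadius*/;       /* = 14  */
    int flops = 3 * 7;                                       /* = 21  */
    return (double)flops / (doubles_moved * 8);   /* flops per byte */
}
```

At under 0.2 flops per byte, even a single vectorized thread can come close to saturating memory bandwidth, which is consistent with the point about data movement dominating.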

A second issue is that the sample code provided is likely not an ideal representation of the actual code. To wit: the first parallel region produces the same output values for DiffCoeff and SedCoeff on every pass and thus should be lifted out of the integration loop. If these procedures were more compute-intensive, the ratio of overhead to useful work would change.

Jim Dempsey

@Tim P.: You make some good points. I failed to notice the G array was not being accessed optimally. G is a tridiagonal matrix storing only the non-0 elements. I will have to look at this further.

@andrew_4619: omSqRun, dt are both independent of iRadius. ZCalc is indeed a function (you can see it in the complete source that I attached). I like your suggestion of rewriting the G calculation as a vector operation.

@jimdempseyatthecove: I have simplified the actual pgm; in fact DiffCoeff and SedCoeff are more complex than just producing constants.

I am confused by the interaction of vectorization, SIMD, parallelization, and OpenMP. These are all compilation options, and naturally I want to use the combination of settings which produces the fastest code, but without having to test a large number of cases (e.g., number of radii, complexity of DiffCoeff, SedCoeff and ZCalc, etc.).

I am now rewriting the entire sample pgm in C++ and doing the threading myself (without OpenMP), so I can see where the bottlenecks actually are. VTune does a good job of summarizing the opportunities for improvement, but doesn't give the information needed to implement that improvement, at least as far as I know.

>>I am now rewriting the entire sample pgm in C++ and doing the threading myself (without OpenMP),

I suggest against doing the conversion. It is unnecessary work.

Regardless of choice of programming language (Fortran, C++, Go, Python, R, C#, ...) a technique for you to consider is to:

a) manually (programmatically) partition the iteration space 1:nRadii by the number of threads (rounding the partition size to a multiple of the cache-line size)
b) have each thread keep private arrays for Dradius and sRadius, each allocated/indexed partitionStart-1:partitionEnd+1, and compute the values for that range. Note that you redundantly compute the -1 and +1 elements at the boundaries shared with adjacent partitions
c) then compute partitionStart:partitionEnd for the Z and G arrays
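Steps (a) and (b) can be sketched as follows (C, with hypothetical names; CACHE_LINE_DOUBLES assumes 64-byte lines and 8-byte reals):

```c
#include <assert.h>

#define CACHE_LINE_DOUBLES 8   /* 64-byte line / 8-byte double */

/* Step (a): split 1..n into per-thread chunks whose size is rounded up
   to a whole cache line, so no two threads write into the same line.
   Step (b): each thread sizes its private Dradius/sRadius arrays over
   [*haloStart, *haloEnd], redundantly computing the -1/+1 boundary
   elements so it never has to wait on a neighbouring thread. */
void halo_range(int n, int nthreads, int t,
                int *start, int *end, int *haloStart, int *haloEnd)
{
    int chunk = (n + nthreads - 1) / nthreads;
    chunk = (chunk + CACHE_LINE_DOUBLES - 1)
            / CACHE_LINE_DOUBLES * CACHE_LINE_DOUBLES;
    *start = t * chunk + 1;
    *end   = (t + 1) * chunk < n ? (t + 1) * chunk : n;
    *haloStart = *start - 1;
    *haloEnd   = *end + 1;
}
```

Step (c) then updates Z and G only over [*start, *end], so no barrier is needed between the two computations.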

Notes:

This method permits you to eliminate the (implied) barrier of your first loop
Threads of each zone can advance without waiting for completion of threads from adjacent zones.
Each zone can integrate at its own pace.
You can move the parallel region to encapsulate the main iteration loop (as opposed to inside the loop)

Notes 2:

If periodically you require to dump the results

do iDtOuter = 1, nDt, dumpInterval
 !$omp parallel do default(shared), private(iDt, othersAsNeeded)
  do iDt=iDtOuter,min(iDtOuter+dumpInterval-1,nDt)
    ...
  end do
  !$omp end parallel do
  call dumpData(iDtouter)
end do

I will leave it as an exercise for you to perform the ... (hint: use BLOCK/END BLOCK or call a contained subroutine)

Also, through use of atomic progress variables, you can have each zone advance independently with loose synchronization.

See: Plesiochronous Phasing Barriers In Action (about 1/3 the way down the web page)

Jim Dempsey


Here are my results, comparing Fortran without OpenMP (i.e., single thread), Fortran with OpenMP, and C++ using my own thread control (no OpenMP or other library).

! Fortran,  Release:    8 threads 10000 radii  Elapsed=   20.0340 sec  CPU time=   144.3945 sec sum= 2.6670250794563941E+02
!                       1 threads 10000 radii  Elapsed=    5.7480 sec  CPU time=     5.6940 sec sum= 2.6670250794563941E+02
!
! C++,      Release:    8 threads 10000 radii  Elapsed=    5.7300 sec  CPU time=        n/a sec sum= 2.6670250794564652e+02

The 'sum' variable is present to keep the optimizer from removing the test code, and to verify that the C++ and Fortran versions are doing the same computation.

Surprisingly (to me, at least), the Fortran serial implementation is virtually as fast as the multi-threaded C++. Sadly, the OpenMP version is much slower; perhaps I'm not using it correctly.

Next up: reformat the arrays to increase locality, as suggested by Tim P. There is no point in eliminating the barrier after the Z,G-loop, because the entire arrays are needed immediately following. It's also true that the computation at each radius should take identical time (absent cache conflicts).

You can halve the number of barriers and reduce memory bandwidth if you can eliminate the Dradius and sRadius arrays and replace with scalar temporaries:

!$OMP PARALLEL DEFAULT(SHARED), PRIVATE(iRadius,Dradius_im1, Dradius_i, Dradius_ip1, sRadius_im1, sRadius_i, sRadius_ip1, once)
once = .true.
!$OMP DO

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                if(once) then
                    once = .false.
                    if(iRadius == 1) then
                        Dradius_im1 = 0.0
                        sRadius_im1 = 0.0
                    else
                        Dradius_im1 = DiffCoeff(iSpecies, iRadius-1, concTotal(iRadius-1, 1:3)) ! diffusion coefficient at iRadius
                        sRadius_im1 = SedCoeff(iSpecies, iRadius-1, concTotal(iRadius-1, 1:3)) ! if compression, this will account for it
                    endif
                        Dradius_i = DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! diffusion coefficient at iRadius
                        sRadius_i = SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! if compression, this will account for it
                else
                    Dradius_im1 = Dradius_i
                    sRadius_im1 = sRadius_i
                    Dradius_i = Dradius_ip1
                    sRadius_i = sRadius_ip1
                endif
                Dradius_ip1 = DiffCoeff(iSpecies, iRadius+1, concTotal(iRadius+1, 1:3)) ! diffusion coefficient at iRadius
                sRadius_ip1 = SedCoeff(iSpecies, iRadius+1, concTotal(iRadius+1, 1:3)) ! if compression, this will account for it
                Z(iRadius)=ZCalc(iRadius, iSpecies)
                G(iRadius, 1)=Dradius_im1*dt*A1(iRadius, 1)+B(iRadius, 1)-sRadius_im1*omSqRun*dt*A2(iRadius, 1)
                G(iRadius, 2)=Dradius_i*dt*A1(iRadius, 2)+B(iRadius, 2)-sRadius_i*omSqRun*dt*A2(iRadius, 2)
                G(iRadius, 3)=Dradius_ip1*dt*A1(iRadius, 3)+B(iRadius, 3)-sRadius_ip1*omSqRun*dt*A2(iRadius, 3)
            END DO L_R2_Parallel
        nThreads=omp_get_num_threads()
!$OMP END PARALLEL

Jim Dempsey

[quote=jimdempseyatthecove]

You can halve the number of barriers and reduce memory bandwidth if you can eliminate the Dradius and sRadius arrays and replace with scalar temporaries:

This was how the code was before I started parallelizing it. I changed the "sliding" method you show to computing the entire array once, to avoid the extra memory writes copying D_i to D_im1 and D_ip1 to D_i.

But I think the barrier issue is a red herring. Each thread should be doing identical work, so it should finish at the same time. Now, I have not verified that for this particular set of loops using VTune, and I should.

I tried reversing the subscripts of G, to improve memory locality, and that made both the serial and parallel cases slightly slower. The same in the C++ multi-threaded version.

>> I changed the "sliding" method you show to computing the entire array once, to avoid the extra memory writes copying D_i to D_im1 and D_ip1 to D_i .

The three temporary variables should be registerized, assuming DiffCoeff and SedCoeff get inlined (and are not too register-intensive). You would have to inspect the disassembly to be sure these temps stay in registers through the call.

*** However, use of the "sliding" method temporaries may interfere with vectorization of this loop. ***

Additional suggestion:

Relocate the calculation of Z array from the second loop to the first loop. This may improve the odds of optimally vectorizing the second loop.

Jim Dempsey


Attached figures show a VTune analysis comparison between the Fortran+OpenMP and multi-threaded C++ (no OpenMP) versions. The OpenMP loops have been modified to incorporate the suggestions of moving the invariant dt and omSqRun out of the Z,G loop and computing G as a vector operation (see below).

In the VTune reports, red lines connect the corresponding loops, and the blue line shows the spin-time comparison.

The Fortran+OpenMP version seems to be spending all its time waiting at barriers; Task Manager and the "Bottom-up" VTune report show 100% CPU utilization, but mostly spinning. The C++ multi-threaded version has much less spin time. Either OpenMP is not being used correctly, or it is very inefficient for problems of this size.

This is the pair of OpenMP regions used for this test:

!$OMP PARALLEL DEFAULT(SHARED)

!$OMP DO PRIVATE(iRadius)
        DO iRadius = 1, nRadii
            Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
            sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
        END DO
!$OMP END DO
!$OMP END PARALLEL
        DradiusDt(1:nRadii)=Dradius(1:nRadii)*dt
        sRadiusOmDt(1:nRadii)= sRadius(1:nRadii)*omSqRun*dt

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                Z(iRadius)=ZCalc(iRadius, iSpecies)
                G(iRadius, 1:3) = DradiusDt(iRadius-1:iRadius+1)*A1(iRadius, 1:3) + B(iRadius, 1:3) - sRadiusOmDt(iRadius-1:iRadius+1)*A2(iRadius, 1:3)
            END DO L_R2_Parallel
    nThreads=omp_get_num_threads()
    !$OMP END PARALLEL


You mentioned logical cpus. If your system has hyperthreading, you might compare to setting OMP_SCHEDULE=spread and OMP_NUM_THREADS= number of physical cores. Otherwise, if you have hyperthreading and don't use all the logical processors, you can get a lot of spin time from unused physical cores. Though your VTune screenshot suggests you are using all the logical processors.

Explicit threading with OpenMP is far more effective than auto-parallelism (automatic threading by the compiler with /Qparallel). This is because you need to thread at a high level to have enough work to amortize the relatively high overhead of setting up threads, and it is hard for the compiler to prove that is safe. If you are using /Qopenmp, I would not use /Qparallel. However, vectorization is pretty much orthogonal to threading, you can and should use both. The compiler can do a much more effective job of auto-vectorization than of auto-parallelism, because the overhead is far lower so it can be done at a lower level (finer grain) that is easier to analyze. Explicit vectorization with OpenMP SIMD requires some care and knowledge, you should probably only attempt that if VTune or Intel Advisor indicate that a hotspot is not being vectorized effectively.

If you are lucky, the compiler will unroll the implied loop over the second subscript of G. If ZCalc can be inlined, it might then be able to auto-vectorize the loop over IRadius. Check the optimization report to see, and add a directive if necessary. Inlining or inter-procedural optimization may be able to eliminate the IF(nonIdeal) block as dead code, which will also make vectorization more efficient. If all this happens, I agree there’s unlikely to be enough work left to make threading worthwhile.

Your functions sedcoeff and diffcoeff look inefficient. You are passing an array section with non-unit stride to a function that expects unit stride, so forcing a temporary array copy, and then only using the first element. I don’t know whether inlining will overcome this.

To restate some of what Martyn has posted, you need to understand the overheads of using !$OMP PARALLEL, especially how many processor cycles it takes to start up. There needs to be sufficient work to overcome the thread startup to show any benefit.
Having !$OMP inside "DO iDt=1,100000" implies there is a lot of parallel region initiation.

I modified the parallel test 1.f90 to place some timing monitoring on the loops and also some simpler loops to identify performance.
(These tests assume that the trivial loops will not be optimised away and that SYSTEM_CLOCK has acceptable precision. Is there a system function integer(8) processor_cycles() based on rdtsc?)
These tests should be run as multi-thread and single thread to compare the performance.

Without a more substantial calculation in DiffCoeff and SedCoeff, this approach does not appear to be productive. Your real program may perform differently, although the use of "DO iDt=1,100000" probably needs a rethink.

Attachment: Parallel_test_1.f90 (13.19 KB)

I moved the parallel region out of the iDt loop, which decreased the test time from 20 to 13 sec, a big improvement. The structure is now:

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP END PARALLEL

CALL SYSTEM_CLOCK(countStart,count_rate)
DO iDt=1,100000
!$OMP DO PRIVATE(iRadius)
    DO iRadius = 1, nRadii
    ...
    END DO
!$OMP END DO

!$OMP DO PRIVATE(iRadius)
    DO iRadius = 1, nRadii
    ...
    END DO
!$OMP END DO
END DO

Quote:

Martyn Corden (Intel) wrote:

You mentioned logical cpus. ... your VTune screenshot suggests you are using all the logical processors.

If you are using /Qopenmp, I would not use /Qparallel. However, vectorization is pretty much orthogonal to threading, you can and should use both.

Your functions sedcoeff and diffcoeff look inefficient. You are passing an array section with non-unit stride to a function that expects unit stride, so forcing a temporary array copy, and then only using the first element.

Yes, all logical processors are in use, per Task Manager.

Removing /Qparallel increases the test time from 12.2 to 13.3 sec.

Removing the array section arg from SedCoeff and DiffCoeff decreases the test time from 13 to 10 sec, and so is a worthwhile optimization.

Did you consider forming your 2d arrays the other way around so you are not jumping around all over the memory?

Referring to #17, my understanding is you have removed the loops from the parallel region, so they run single-threaded. Task Manager can confirm this.

Andrew's recommendation is very good in this small test, and could help in your main program, if suitable. I made the following changes and it had a significant effect:

    REAL(8), DIMENSION(3,10000) :: A1, A2, B
...
    REAL(8), DIMENSION(3,10000+1) :: concTotal
    REAL(8), DIMENSION(3,10000)   :: G
...
      times = 0
      call delta_step (times(1,1))    ! initialise timer
      times = 0
      
 CALL SYSTEM_CLOCK (countStart,count_rate)
 CALL CPU_TIME     (startTime)
    sum   = 0
    sumx  = 0
    count = 0
    DO iDt=1,100000
        dt=dt*1.00002
!
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP  DO PRIVATE(iRadius) schedule(static)
        DO iRadius = 1, nRadii
            Dradius(iRadius) = DiffCoeff (iSpecies, iRadius, concTotal(1:3, iRadius)) ! diffusion coefficient at iRadius
            sRadius(iRadius) = SedCoeff  (iSpecies, iRadius, concTotal(1:3, iRadius)) ! if compression, this will account for it
        END DO
!$OMP  END DO
!$OMP END PARALLEL

      call delta_step (times(1,1))

!$OMP PARALLEL DO           &
!$OMP  DEFAULT(SHARED)      &
!$OMP  PRIVATE(iRadius)     &
!$OMP  schedule(static)     &
!$OMP  REDUCTION(+ : sumx)

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                Z(iRadius)   = ZCalc(iRadius, iSpecies)
                G(1, iRadius)= Dradius(iRadius-1)*dt*A1(1, iRadius) + B(1,iRadius) - sRadius(iRadius-1)*omSqRun*dt*A2(1, iRadius)
                G(2, iRadius)= Dradius(iRadius)  *dt*A1(2, iRadius) + B(2,iRadius) - sRadius(iRadius)  *omSqRun*dt*A2(2, iRadius)
                G(3, iRadius)= Dradius(iRadius+1)*dt*A1(3, iRadius) + B(3,iRadius) - sRadius(iRadius+1)*omSqRun*dt*A2(3, iRadius)
                sumx         = sumx + Z(iRadius) * G(1, iRadius) + G(2, iRadius)* G(3, iRadius)
            END DO L_R2_Parallel
!!!        nThreads=omp_get_num_threads()
!$OMP END PARALLEL DO
!
      call delta_step (times(1,2))
      sum = sum                                               &
          + DOT_PRODUCT(Z(1:nRadii),   G(1,1:nRadii))         &
          + DOT_PRODUCT(G(2,1:nRadii), G(3,1:nRadii))

      call delta_step (times(1,3))
      op_n = 0

!$OMP PARALLEL DO                     &
!$OMP   DEFAULT  (SHARED)             &
!$OMP   PRIVATE  (iRadius)            & 
!$OMP   schedule (static)             & 
!$OMP   REDUCTION(+ : op_n)

            DO iRadius = 1, nRadii
              op_n = op_n+1
            END DO

!$OMP END PARALLEL DO
!
      call delta_step (times(1,4))
      count = count+1

      call delta_step (times(1,5))
    END DO
    
    CALL SYSTEM_CLOCK(countEnd)
    CALL CPU_TIME    (endTime)
    

Note: omSqRun has not been defined

I have now reversed the subscript order on A1, A2, B and G, to increase memory locality. I also defined omSqRun, just in case this had any effect on execution time (it didn't). The Fortran OpenMP version, with 8 threads, now gives essentially the same execution time as the multi-threaded C++ without OpenMP. This is the good news. The bad news is that the Fortran single-threaded version (i.e., compiled without /Qopenmp) is only slightly slower (6.78 sec vs 6.10). So the multi-threading is not buying much.

VTune shows most of the "wasted" CPU time is spent in _kmpc_barrier. In this test pgm, I have two separate loops; these could be combined, using the "sliding" technique suggested by jimdempseyatthecove (and in the original pgm), and perhaps this would lessen the barrier overhead.

This is the code as it stands now. I plan to remove the 'concTotal' arguments, as suggested by Martyn Corden.

The reason for the 'sum' calculation is to keep the optimizer from removing all the code. The reason for the iDt loop is to make the test take long enough that I can time it, and watch its CPU usage on Task Manager. In fact, in the actual pgm, there is a lot of serial code between invocations of the parallel regions in the test pgm.

!$OMP PARALLEL DEFAULT(SHARED)
    DO iDt=1,100000
!$OMP SINGLE
        dt=dt*1.00002_long
!$OMP END SINGLE
!$OMP DO PRIVATE(iRadius)
       DO iRadius = 1, nRadii
            Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
            sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
       END DO
!$OMP END DO

!$OMP DO PRIVATE(iRadius)

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                Z(iRadius)=ZCalc(iRadius, iSpecies)
                G(1, iRadius)=Dradius(iRadius-1)*dt*A1r(1, iRadius)+Br(1, iRadius)-sRadius(iRadius-1)*omSqRun*dt*A2r(1, iRadius)
                G(2, iRadius)=Dradius(iRadius)*dt*A1r(2, iRadius)+Br(2, iRadius)-sRadius(iRadius)*omSqRun*dt*A2r(2, iRadius)
                G(3, iRadius)=Dradius(iRadius+1)*dt*A1r(3, iRadius)+Br(3, iRadius)-sRadius(iRadius+1)*omSqRun*dt*A2r(3, iRadius)
            END DO L_R2_Parallel
!$OMP END DO
    nThreads=omp_get_num_threads()
!$OMP SINGLE
    sum=sum+DOT_PRODUCT(Z(1:nRadii), G(1, 1:nRadii))+DOT_PRODUCT(G(2, 1:nRadii), G(3, 1:nRadii))
!$OMP END SINGLE
    END DO
!$OMP END PARALLEL

 

Within the second loop you in effect multiply two constants omSqRun and dt by each other 3*nRadii times when once would do!

e.g. before the loop have dt2 = omSqRun*dt and then use dt2 in the loop rather than omSqRun*dt

I have also noted in the past that (assuming the array extent is 3) using (1:3) sometimes optimises worse than using (:), but maybe that is fixed in the latest compilers.


Quote:

andrew_4619 wrote:

Within the second loop you in effect multiply two constants omSqRun and dt by each other 3*nRadii times when once would do!

This should be taken care of by the optimizer. It should recognize the common subexpression omSqRun*dt, notice it does not change during the loop, and therefore do the multiplication only once. This was my intent, but it does not work, because both omSqRun and dt are in MODULE ParallelTestData and could potentially be changed by function ZCALC.

To get the Intel compiler to recognize the common subexpression as constant, I can make a local copy of each variable outside the loop. For the code as shown, without this change, the compiled code does load omSqRun and dt from memory once outside the loop and store them in registers (xmm6 and xmm7).

Here is the relevant assembler code produced by the Fortran compiler XE 2018 u1 (I left out most of the computation for clarity):

        movsd     xmm6, QWORD PTR [PARALLELTESTDATA_mp_OMSQRUN] ;125.68
        lea       rax, QWORD PTR [rax+r15*8]                    ;125.68
        movsd     xmm7, QWORD PTR [PARALLELTESTDATA_mp_DT]      ;125.17
...
                                ; LOE rbx rdi r12 r13 r14 r15 esi xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.50::                        ; Preds .B1.107 .B1.49
                                ; Execution count [9.72e+008]
                ; optimization report
                ; OPENMP LOOP
                ; %s was not vectorized: vector dependence prevents vectorization%s
                ; VECTOR TRIP COUNT IS ESTIMATED CONSTANT
...
        call      ZCALC                                         ;123.28
...
        mulsd     xmm1, xmm6                                    ;125.110
        mulsd     xmm3, xmm6                                    ;126.106
        mulsd     xmm5, xmm6                                    ;127.110


...
        mulsd     xmm2, xmm7                                    ;125.17
        mulsd     xmm4, xmm7                                    ;126.17
        mulsd     xmm0, xmm7                                    ;127.17
...
        inc       rdi                                           ;119.7
...
        jb        .B1.50        ; Prob 99%                      ;119.7
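The local-copy fix generalizes beyond Fortran. Here is a hypothetical C analogue (file-scope globals standing in for the module variables, and a stub ZCalc standing in for the external function the compiler cannot see through):

```c
#include <assert.h>

double omSqRun = 2.0, dt = 0.5;   /* stand-ins for the module variables */

/* Stub for this sketch; in the real program ZCalc lives in another unit,
   so the compiler must assume each call may modify omSqRun and dt. */
double ZCalc(int i) { return (double)i; }

double sumG(const double *A2, int n)
{
    /* Local copy taken once: sdt is now provably loop-invariant, even
       though an opaque function is called inside the loop. */
    const double sdt = omSqRun * dt;
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += ZCalc(i) - sdt * A2[i];
    return s;
}
```

Without the local copy (writing omSqRun * dt inside the loop body), the compiler would have to redo the multiplication on every iteration, exactly as the assembler above shows.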

I think one of the problems with your smaller test program is that there is not enough work in each loop iteration to justify !$OMP.
However, your splitting of !$OMP PARALLEL and !$OMP DO does appear to perform better for this test program. Implementing this approach does require more care. I used some counters to confirm the looping and varied between !$OMP SINGLE, !$OMP MASTER and "IF ( iThread==0)" for limiting calculations. Minimising barriers can have an effect.

I experimented with a second-generation 4-core i5 and an eighth-generation 6-core i7 and found that the i5 ran faster than the i7 (?)
I then limited the i5 to 3 threads and the i7 to 6 (threads = number of cores) and found that the run times improved for both.
I concluded from this that using !$OMP PARALLEL with max threads was not optimal and fewer threads work better (no hyper-threading), although this may be specific to my test configuration.

I have attached the changed test program.
I also placed the dot_product as a reduction in the DO loop, which appears to be more suited to the loop workload than a separate dot_product.
I also included some more monitoring of elapsed times, which may reveal some relative performance differences.

Thanks for the example and hopefully some of these results may transfer to your main program.

Attachment: Parallel_test_5.f90 (14.91 KB)
