Poor openmp performance

We have two E5-2670 CPUs, 16 cores in total.
We get the following OpenMP timings (the code is attached below):

 NUM THREADS:           1
 Time:    1.53331303596497  

 NUM THREADS:           2
 Time:   0.793078899383545

 NUM THREADS:           4
 Time:   0.475617885589600

 NUM THREADS:           8
 Time:   0.478277921676636

 NUM THREADS:          14
 Time:   0.479882955551147

 NUM THREADS:          16
 Time:   0.499575138092041     

OK, the scaling is very poor once the number of threads exceeds 4: the time
stays at about 0.48 s. But if I uncomment lines 17 and 24, so that the
initialization is also done by OpenMP, the results are different:

 NUM THREADS:           1
 Time:    1.41038393974304

 NUM THREADS:           2
 Time:   0.723496913909912

 NUM THREADS:           4
 Time:   0.386450052261353

 NUM THREADS:           8
 Time:   0.211269855499268

 NUM THREADS:          14
 Time:   0.185739994049072

 NUM THREADS:          16
 Time:   0.214301824569702

Why are the performances so different?

Some information:
ifort version 13.1.0
ifort -warn -openmp -vec-report=4 openmp.f90

    use omp_lib
    !use mpi
    implicit none
    integer(4), parameter :: nx = 512, ny = 512, nz = 1024
    integer(4) :: ip, np, idx, nTotal = nx * ny * nz
    real(8) :: time, dx, dy, dz, bstore
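    real(8), dimension(:), allocatable :: bx, ey, ez, hx
    allocate(bx(nTotal), ey(nTotal), ez(nTotal), hx(nTotal))
!   initial
    dx = 0.3; dy = 0.4; dz = 0.5
    !!$OMP PARALLEL DO
    do idx = 1, nTotal
       bx(idx) = idx
       ey(idx) = idx * 2
       ez(idx) = idx / 2
       hx(idx) = idx + 1
    end do
    !!$OMP END PARALLEL DO
!   start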
    time = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(ip, bstore, idx)
    np = omp_get_num_threads()
    ip = omp_get_thread_num()
    !$OMP DO 
    do idx = 1, nTotal - 1
        bstore = bx(idx)
        bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz -                        &
            (ez(idx + 1) - ez(idx)) / dy)
        bx(idx) = 1.0 * bx(idx)  + 2.0 * ((ey(idx + 1) - ey(idx)) / dz  -      &
            (ez(idx + 1) - ez(idx)) / dy)
        hx(idx)= 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
    end do
    !$OMP END DO
    !$OMP END PARALLEL
!   end
    print*, "NUM THREADS:", np
    print*, "Time: ", omp_get_wtime() - time
    print*, "Result:", sum(hx)
    deallocate(bx, ey, ez, hx)
    end

The reason for the difference is that when the first loop is parallelized, its iteration space 1:nTotal is partitioned among the threads in the thread team. The same is true for the iteration space 1:nTotal-1 of the second loop. In the first loop, the elements bx(idx), ey(idx), ez(idx), and hx(idx) in a given thread's index sub-range are written not only to the RAM locations of the arrays but also into the cache system used by that thread, and the same thread reads them in the second loop (because the partitioning is essentially the same). In other words, the second loop has a higher probability of cache hits.

Also, if your system BIOS is configured for NUMA and the runtime system uses a "first touch" policy, then, at page-level granularity, each page "touched" (written) by the first loop resides in the RAM attached to (nearest) the socket of the thread that first touched it. Locations subsequently referenced by the second loop that are not in a cache then have faster RAM access, because they are located in the RAM directly attached to the CPU on which the thread resides.

Your program is an excellent example of why one should parallelize the initialization of data in the same manner as the subsequent processing of the data.
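
A minimal sketch of that idea, assuming the same arrays and formulas as the posted code. The module and subroutine names (first_touch_sketch, init_arrays, update) are made up for illustration, and schedule(static) is spelled out in both loops as an assumption so that a given thread gets (almost) the same index sub-range each time:

```fortran
module first_touch_sketch
    implicit none
contains
    ! Initialize in parallel: under a first-touch policy each page is
    ! first written by the thread that will later process it.
    subroutine init_arrays(bx, ey, ez, hx)
        real(8), intent(out) :: bx(:), ey(:), ez(:), hx(:)
        integer :: idx
        !$omp parallel do schedule(static)
        do idx = 1, size(bx)
            bx(idx) = idx
            ey(idx) = idx * 2
            ez(idx) = idx / 2
            hx(idx) = idx + 1
        end do
        !$omp end parallel do
    end subroutine init_arrays

    ! Process with the same static partitioning as the initialization.
    subroutine update(bx, ey, ez, hx, dy, dz)
        real(8), intent(inout) :: bx(:), hx(:)
        real(8), intent(in) :: ey(:), ez(:), dy, dz
        real(8) :: bstore
        integer :: idx
        !$omp parallel do schedule(static) private(bstore)
        do idx = 1, size(bx) - 1
            bstore = bx(idx)
            bx(idx) = 2.0d0 * ((ey(idx + 1) - ey(idx)) / dz -               &
                (ez(idx + 1) - ez(idx)) / dy)
            bx(idx) = 1.0d0 * bx(idx) + 2.0d0 * ((ey(idx + 1) - ey(idx)) / dz - &
                (ez(idx + 1) - ez(idx)) / dy)
            hx(idx) = 3.0d0 * hx(idx) + 4.0d0 * (5.0d0 * bx(idx) - 6.0d0 * bstore)
        end do
        !$omp end parallel do
    end subroutine update
end module first_touch_sketch
```

A caller would allocate the four arrays, call init_arrays once, and then time update. Without the OpenMP compiler flag the directives are plain comments and the code runs serially.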

Jim Dempsey


I am also interested in understanding the difference in performance. But I am doubtful about the local cache/local NUMA pages explanation, because:

1) The amount of data is 512*512*1024 elements * 4 arrays * 8 bytes = 8 GB, which is much greater than the L3 cache of two 8-core Xeons (~60 MB).

2) When I modified the code to run the processing loop (line 35) twice, the run time was identical for both runs. That holds with parallel or serial initialization. If the cache hit ratio were the issue, the second run should have been faster than the first.

3) Also, I eliminated the NUMA hypothesis by using 16 threads and KMP_AFFINITY=compact (my system is 2-socket and has 32 logical cores). With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket. However, when I run this code with multithreaded initialization, I get faster processing than with serial initialization.
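
As a rough check of the data size in point 1, the footprint can be computed from the declarations in the posted code (four real(8) arrays of nx*ny*nz elements each). This is only a sketch: the module and function names are invented for illustration, and it assumes real(8) occupies 8 bytes, which holds for ifort and gfortran but is not guaranteed by the standard:

```fortran
module footprint_sketch
    use iso_fortran_env, only: int64
    implicit none
contains
    ! Bytes occupied by narrays real(8) arrays of n elements each.
    pure function footprint_bytes(n, narrays) result(bytes)
        integer, intent(in) :: n, narrays
        integer(int64) :: bytes
        real(8) :: sample
        ! storage_size returns the element size in bits; the product is
        ! formed in 64-bit integers because it overflows a default integer.
        bytes = int(n, int64) * int(narrays, int64) *                  &
            int(storage_size(sample) / 8, int64)
    end function footprint_bytes
end module footprint_sketch
```

With n = 512*512*1024 and narrays = 4 this gives 8589934592 bytes (8 GiB), far more than any L3 cache, so the processing loop is limited by memory traffic rather than by cache reuse.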


P.S.: Ronglin, if you do not declare the loop index "idx" as PRIVATE, the overall performance increases.

>>With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket

Have you verified this? The behavior seems contradictory. The "only" difference, assuming the same socket for all threads, would be whether the non-master threads began the timed region in an expired KMP_BLOCKTIME state.

Have you run the timed loop several times under VTune to see what is going on? (Set the loop count so that the run takes about 15-30 seconds, to get a meaningful statistical sample.)

Jim Dempsey

OpenMP makes the parallel loop index private by default. To take advantage of first-touch locality you will need affinity set. For one thread per core with HT you might set KMP_AFFINITY=compact,1,1
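
A small sketch of the first point, with hypothetical names (private_index_sketch, sum_of_indices): the loop index idx appears in no PRIVATE clause, yet each thread works on its own copy, so the result is the same as in a serial run.

```fortran
module private_index_sketch
    implicit none
contains
    ! Sum 1..n in parallel; idx is not listed in any data-sharing clause
    ! because the index of a work-shared loop is private by default.
    function sum_of_indices(n) result(total)
        integer, intent(in) :: n
        real(8) :: total
        integer :: idx
        total = 0.0d0
        !$omp parallel do reduction(+:total)
        do idx = 1, n
            total = total + real(idx, 8)
        end do
        !$omp end parallel do
    end function sum_of_indices
end module private_index_sketch
```

For example, sum_of_indices(1000) returns 500500 for any number of threads, so listing idx in PRIVATE, as the posted code does, is redundant but harmless.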

To Jim: I now realize that the initialization is also important for the efficiency of OpenMP parallelization. Thank you.

To Andrey and TimP: I set KMP_AFFINITY=compact, but the results seem even worse.

Thank you all for your replies.
