Performance of double vs single precision

Performance of double vs single precision

Our software does lots of floating point linear algebra (CFD modeling). Many operations require double precision for larger size matrices. Unfortunately the performance we have been getting out of both Intel and AMD processors was very poor: double precision was almost 2x slower than the same code when using single precision.

We tested the software both under 32-bit and 64-bit Windows (and compiled with corresponding 32 and 64 bit options) with no visible difference. Are there compiler options that we are missing that could speed up double precision computations?

Thanks.

40 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

I'm curious about your expectations. Do you have a reason to believe that single and double precision operations should be the same speed? I'll comment that if you have large arrays of data, computation time might be swamped by memory traffic, as double precision doubles the memory use.

In general, I would recommend the following options.

/O3
/QxS (or T or P), depending on which Intel processor model you're using.
/Qprec-div-

I'd also strongly recommend running the application through the VTune Peformance Analyzer and see what it comes up with for hot spots and data on things such as cache misses.

Steve - Intel Developer Support

Thanks,Steve, my expections based on UNIX workstations experience (non-Intel processors), is that it should be roughly the same, i.e. within a few percent.

Arik

Was that an IBM workstation by any chance? The Power architecture does not have single-precision operations, if I recall correctly.

You definitely want to make sure that you are enabling vectorization on the Intel and AMD processors. (For AMD, use /QxW or /QxO (latter for SSE3-capable Opterons).)

/Qipo may also be of use.

Steve - Intel Developer Support

what can we do about cache misses? Thanks,

Arik

I never had an IBM, we used Spark(s), and HPs, as well as supercomputers such as Cray and Alliant.

As to vectorization options, we certainly use them, but they are the ones, which give grief. There are cases where one matrix solver dies, even though it would work fine with these option off.

If you are seeing a lot of cache misses, you may want to look at how the application is accessing the arrays - is it following the elements in memory order or are there gaps between memory locations? The compiler can do a lot of things to help with -O3 but there's a limit. Some judicious use of prefetch directives MAY help once you understand the memory access pattern.

But first, you need to know what exactly the program is doing, and that's where VTune can help.

Steve - Intel Developer Support

Simply telling us you have a CFD code doesn't give much information about your cache miss situation. You would want to process as many dependent variables as possible in a single loop, so as to economize access to your grid information. Strategies for sorting elements so as to improve cache locality may be worth investigation.

Specifically the software spends about 90% of time in subroutines using conjugate-gradient methods to solve banded matrices resulting from discretizing governing equations on a structured grid. All arrays are 3D, i.e. a(i,j,k) and hence each expression typically contains a(i,j+1,k), a(i,j,k-1) etc terms, which are not immediately next to each other storage-wise.

I would appreciate if you could comment more specifically as to what's "economize access to grid information" as well as "cache locality" means and how one can go about dealing with these issues. I don't expect a recipe, but rather a refence to papers/books etc. that explain this. I wonder if Intel has any papers addressing these issues.

Thanks.

Arik

If your data are stored that way, arranging your loops with the i index in the inner loop, j in the middle loop, and operating on all dependent variables in the same loop, should do the job. Typical cache miss problems in such codes are due to irregular storage of automatic meshed data; apparently, you should not be seeing so much difficulty. Ideally, you want to use all the data from a cache line, then move on to the next cache line, allowing hardware prefetch to bring in the cache lines which will be needed next.
If you must operate with the k index in the inner loop, unrolling and "jamming" outer loops so that you still operate on several values of i before moving to the next k should help. ifort -O3 can deal to a limited extent with re-ordering loops, when the nested loops aren't hidden by subroutine calls.

Definitely and always we've used i as the inner loop, etc. Here's a sample loop from the code:

!$omp parallel do
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
do i=1,Nx
! if possible, use canned routines
rb(i,j,k) = z(i,j,k) + beta*rb(i,j,k)
enddo
enddo
enddo

Apparently this has not done the trick.

Are you using version 10.1? That version has an improved optimizer that can both parallelize and vectorize code more efficiently.

You first asked about the relative difference between single and double. How is the overall performance compared to the UNIX platforms?

Steve - Intel Developer Support

As you're using OpenMP on Windows, setting KMP_AFFINITY environment variable should bring some benefit, in case you haven't tested it.

Arikd,

Your sample parallel do loop might benefit from assuring (!dec$ to declare) the arrays rb and z as being aligned and thus improve the vectorization performance of the inner most loop.

If Ny*Nx is quite large then you might benefit from using a dynamic schedule with an appropriate beginning chunk size. i.e. program to keep all available cores to finish at approximately the same time.

Also, depending on the relativevalues of Nz and Ny and the number of cores available it may be advantageous to swap orders of the outer loop. (or test and run on of two !$omp parallel do loops)

As a diagnostic aid during performance tuning I find it beneficial toset thread affinities at initialization. Then due the the threads not squirming about different processors, the per cpu runtime statistics on the nested loop as in your example, will easily disclose work starved threads.

Jim Dempsey

www.quickthreadprogramming.com

We are still on v.9.x, but are planning to port it over to 10right after new year.

I can't say much about UNIX these days, have not used any UNIX computers since last century. At the time when we switched, somewhere around 1994, 486-66 was giving the Spark a good run for the money, especially if you take into account that the least expensive Spark was, if I recall it correctly, ~10+k, while a comparable PC ~3k.

I wonder it the r8 vs r4 performance hit we see is due to Intel chips being still on some level 32 bit chips, while the old Unix workstations, were, I believe, native 64 bit.

No - the 64 vs. 32 bit has to do with address size and nothing to do with floating point.

Steve - Intel Developer Support

In such a case,using double precisionexplains only the doubling of RAM requirements, but not doubling of CPU time. Any ideas?

I would expect double-precision arithmetic to be slower than single-precision. Without a test case to look at, I'm not willing to speculate further. I'd also want to make sure that you're getting the advantage of vectorization.

Steve - Intel Developer Support

Comparing single and double precision vectorized code, double precision generally requires twice as many instructions to run the same job. Also, double precision evidently doubles the volume of data movement. A 90% increase in time to complete the same job in double precision is not unusual. As Steve says, you haven't given adequate information to judge whether this applies to your application.
If you restricted your application to x87 code, you might find that double precision slows down typically by 40%, compared to vectorized code, while single precision slows down much more, in case your goal is to find a compilation mode where there is relatively little difference between single and double precision. Intel compilers support x87 code only when running 32-bit mode. gfortran supports -mfpmath=387 in both 32- and 64-bit mode, in case you have a reason for using it.
IA-64 processors also show less difference in performance between single and double precision than Xeon does. There are no single precision 128-bit wide SIMD operations.

I read some papers on the subject of optimization of code and it appears a bit more than one can do on the fly. I wonder if you could you recommend either individuals or companies, which specialize in optimizing Fortran matrix solvers for Intel processors?

The example you provided is so simple that it's not clear how to improve on the suggestions already made. Where you have matrix operations which fit the lapack and BLAS mold, you have a wide variety of "canned solvers," as your comment says, including Intel MKL, AMD ACML, and more.

Arikd,

Try processing the rank 3 array as rank 1

CALL COMPUTE(rb, z, beta, Nz*Ny*Nx)
...
SUBROUTINE COMPUTE(rb, z, beta, N)
REAL :: rb(N), z(N), beta
INTEGER :: N

!$omp parallel do
!$omp& private( i)
do i=1,N
! if possible, use canned routines
rb(i) = z(i) + beta*rb(i)
enddo
enddo

For portibility issues you may want to insert some conditional compilation code that asserts the assumed indexing order.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

What kind of improvement can we expect?

I recall that long time ago, we did try something like this and it has not helped much ("much" is defined as >10% improvement). Back in the 80s it was helpful, but I recall that since the 90s such "tricks" were no longer required.

Also, the way the code is written, the index runs from 0 to Nx+1, such that at 0 and Nx+1 layers we store boundary conditions. In other words, as it runs from 1 to Nx, the Nx+1 layer has to be skipped, etc. This can be fixed, I assume, but for what benefit?

Thanks,

Arik

ifort -O3 is supposed to look for easily detected opportunities for this optimization. It would allow work to be distributed more evenly among threads, in cases where the outer loop count isn't evenly divisible by the number of threads.
I've seen applications which parallelize by first ignoring the difference between the boundary data, following the main loop by correction of the boundary elements. The automatic unrolling, evidently, wouldn't go this far.

Arik,

On your original code with the three subscripts compile with full optimizations including uses/requires SSE3and no runtime checks. Place a break point at the beginning of the outer loop or at the beginning of the subroutine. Open a disassembly window and examine the nested loop. See what is generated.

Next, make the little 20-line subroutine as suggested but only in a test "Hello world" application complete with a call to it with a test array. compile with full optimizations including uses/requires SSE3and no runtime checks. Place a break point at the beginning of the outer loop or at the beginning of the subroutine. Open a disassembly window and examine the nested loop. See what is generated.

Compare the two code sets.

By the way, running the sample test would likely have taken less time than to post the question on the forum.

Regarding boundary conditions:

If your inner loop has test for boundary conditions, although the integer test is not a large computational consideration, the branching around a piece of code is high computational over head, and the branch may gunk up the vectorization. It is best to eliminate the branches within the inner loop if at all possible.

Often this can be doneby computing beyond the bounds of the perceived array. e.g. if the "perceived" array is ARRAY(1:N), allocate it to (0:N+1) and place appropriate values into the extension cells prior to executing your loop (i.e. to avoid NaN or Divide by 0). This may present a problem elsewhere if you use ARRAY without the subscripts under the assumption that the array bounds is the "perceived" size.

If this presents a problem then work it the other way around - Perform the special conditions for ARRAY(1), and ARRAY(N) outside the loop and run the loop from (2) to (N-1).

I will leave this an exercise to you to extend this to multi-dimensional arrays.

Jim Dempsey

www.quickthreadprogramming.com

We have tested all the suggestions and the only one which produced a tangible improvement was the SSE3 option (~30% faster). The rest were all below 5%. However, our most time-consuming operation - matrix backsubstitution, was not helped at all, apparently because of recurrency. I have included the subroutine below, in case somebody has insights on what we are doing wrong. All the inputs fromeverybody Jim, Tim, Lionel et al, have been very much appreciated.

Arik

!-----------------------------------------------------
subroutine solvm13(
& bp, bw, bs, bu, bnu, beu, bes, r, zt
&)

use numbers
use MultiThreads
use GridBlocks
use grdblks

implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu (0:Nx1,0:Ny1,0:Nz1),
& beu (0:Nx1,0:Ny1,0:Nz1),
& bes (0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1)
! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,
j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu(i,j,k+1) * zt(i,j,k+1)
& - beu(i-1,j,k+1) * zt(i-1,j,k+1)
& - bnu(i,j-1,k+1) * zt(i,j-1,k+1)
& - bs(i,j+1,k) * zt(i,j+1,k)
& - bes(i-1,j+1,k) * zt(i-1,j+1,k)
& - bw(i+1,j,k) * zt(i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

end
!-----------------------------------------------------

Arik,
Rework:
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,j,k) * zt(i+1,j,k-1)
& - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
into:
do i=1,Nx
bu_x_zt(i,j,k) = bu(i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
bue_x_zt(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
enddo
do i=1,Nx
bnu_x_zt(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_x_zt(i,j,k) = bs(i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_x_zt(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_x_zt(i,j,k) = bw(i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_x_zt(i,j,k)
& - beu_x_zt(i,j,k)
& - bnu_x_zt(i,j,k)
& - bs_x_zt(i,j,k)
& - bes_x_zt(i,j,k)
& - bw_x_zt(i,j,k)
enddo
This will improve SSE performance as well as cache performance.

I will leave it an exercize for you do do the backward sweep loop.

Jim Dempsey
www.quickthreadprogramming.com

This changes my assumptions from your original posted code, as you have recurrence which would prevent vectorization of the inner loops. Your outer loop recurrence should raise a complaint from thread checker, as it would no doubt give wrong results if you increase the chunk size.
Most CPUs have fewer hardware prefetch streams for negative than for positive stride, so you may be losing more time on cache misses in the back substitution. If your code is correct otherwise, you may be able to alleviate this with software prefetch or by splitting the inner loop; the optimum strategy is far from evident. If you split the inner loop, you might save the terms which depend on the result of the other thread for the 2nd pass.

Jim,

Unfortunately your proposal made it slower, specifically 1.9 times slower. I can see that it has more do-loop and RAM work, which was apparently not compensated by whatever efficiencies it added. Unless, of course, the implementation was incorrect, so I am including this subroutine below. For compiler (Intel Fortran 10.1) options we used: -c -O3 -QxT -Qopenmp.

thanks,

Arik

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
bu_ (i,j,k) = bu (i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
beu_(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
enddo
do i=1,Nx
bnu_(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_ (i,j,k) = bs (i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_ (i,j,k) = bw (i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_ (i,j,k)
& - beu_(i,j,k)
&&n
bsp; - bnu_(i,j,k)
& - bs_ (i,j,k)
& - bes_(i,j,k)
& - bw_ (i,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
bu_ (i,j,k+1) = bu(i,j,k+1) * zt(i,j,k+1)
enddo
do i=Nx,1,-1
beu_(i-1,j,k+1) = beu(i-1,j,k+1) * zt(i-1,j,k+1)
enddo
do i=Nx,1,-1
bnu_(i,j-1,k+1) = bnu(i,j-1,k+1) * zt(i,j-1,k+1)
enddo
do i=Nx,1,-1
bs_ (i,j+1,k) = bs(i,j+1,k) * zt(i,j+1,k)
enddo
do i=Nx,1,-1
bes_(i-1,j+1,k) = bes(i-1,j+1,k) * zt(i-1,j+1,k)
enddo
do i=Nx,1,-1
bw_ (i+1,j,k) = bw(i+1,j,k) * zt(i+1,j,k)
enddo
do i=Nx,1,-1
&nb
sp; zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu_ (i,j,k+1)
& - beu_(i-1,j,k+1)
& - bnu_(i,j-1,k+1)
& - bs_ (i,j+1,k)
& - bes_(i-1,j+1,k)
& - bw_ (i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

Arikd,

Do you have a single threaded example of the code that works?

By this I mean do not simply strip the !$OMP stuff out of thelast code posted but supply an example of the original single threaded code that worked. There are temporal issues with your code that need to be clarified and can best be clarified by examining the original single threaded code that worked.

Of particular concern for exampleis on your loops (of the attempt at parallel code) is the use of offfset indecies (e.g. i-1 and i+1) and wether or not the algorithm is to use the prior or post state change values. This will be evident by the working pre-parallization attempt at the code change. For portions of the expression(s) relying on prior values of zt this can be resolved by having two zt arrays (zt and zt_) one containing the prior values. The portion of the expression(s) that rely on using the post change state values might not be parallizable unless you can rework the code to compute the post value on-the-fly so to speak from the pre-change state values. But then this adds redundant computations of the same values.

A secondary cause of poor performance is excessive use of locks. It is best to eliminate locks wherever possible and to minimize there use whenever possible. Overuse of locks result in threads burning up CPU time in spinlocks or worse yet yields.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

I am attaching complete text of the original, solvm13, and modified as per your post, solvm13_2, subroutines. They seem to be using the same number of locks etc. Both solvm13 and solvm13_2, the latter is just much slower.

Please comment, thanks,

Arik

*-----------------------------------------------------------------------
subroutine solvm13(
& bp, bw, bs, bu, bnu, beu, bes, r, zt
&)
use numbers
implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu(0:Nx1,0:Ny1,0:Nz1),
& beu(0:Nx1,0:Ny1,0:Nz1),
& bes(0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1)
! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu(i,j,k) * zt(i,j,k-1)
& - beu(i,j,k) * zt(i+1,j,k-1)
&amp
; - bnu(i,j,k) * zt(i,j+1,k-1)
& - bs(i,j,k) * zt(i,j-1,k)
& - bes(i,j,k) * zt(i+1,j-1,k)
& - bw(i,j,k) * zt(i-1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu(i,j,k+1) * zt(i,j,k+1)
& - beu(i-1,j,k+1) * zt(i-1,j,k+1)
& - bnu(i,j-1,k+1) * zt(i,j-1,k+1)
& - bs(i,j+1,k) * zt(i,j+1,k)
& - bes(i-1,j+1,k) * zt(i-1,j+1,k)
& - bw(i+1,j,k) * zt(i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

*-----------------------------------------------------------------------
subroutine solvm13_2(
& bp, bw, bs, bu, bnu, beu, bes, r, zt,
& bw_, bs_, bu_, bnu_, beu_, bes_
&)
use numbers
implicit none

real bp (0:Nx1,0:Ny1,0:Nz1),
& bw (0:Nx1,0:Ny1,0:Nz1),
& bs (0:Nx1,0:Ny1,0:Nz1),
& bu (0:Nx1,0:Ny1,0:Nz1),
& bnu (0:Nx1,0:Ny1,0:Nz1),
& beu (0:Nx1,0:Ny1,0:Nz1),
& bes (0:Nx1,0:Ny1,0:Nz1),
& r (0:Nx1,0:Ny1,0:Nz1),
& zt (0:Nx1,0:Ny1,0:Nz1),
& bw_ (0:Nx1,0:Ny1,0:Nz1),
& bs_ (0:Nx1,0:Ny1,0:Nz1),
& bu_ (0:Nx1,0:Ny1,0:Nz1),
& bnu_(0:Nx1,0:Ny1,0:Nz1),
& beu_(0:Nx1,0:Ny1,0:Nz1),
& bes_(0:Nx1,0:Ny1,0:Nz1)

! internal obj
integer i, j, k

zt = 0.0

! forward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=1,Nz
do j=1,Ny
if ( k > 1 .And. j < Ny ) then
!$ call omp_set_lock( locks(j+1,k-1) )
endif
do i=1,Nx
bu_ (i,j,k) = bu (i,j,k) * zt(i,j,k-1)
enddo
do i=1,Nx
beu_(i,j,k) = beu(i,j,k) * zt(i+1,j,k-1)
&nb
sp; enddo
do i=1,Nx
bnu_(i,j,k) = bnu(i,j,k) * zt(i,j+1,k-1)
enddo
do i=1,Nx
bs_ (i,j,k) = bs (i,j,k) * zt(i,j-1,k)
enddo
do i=1,Nx
bes_(i,j,k) = bes(i,j,k) * zt(i+1,j-1,k)
enddo
do i=1,Nx
bw_ (i,j,k) = bw (i,j,k) * zt(i-1,j,k)
enddo
do i=1,Nx
zt(i,j,k) = r(i,j,k)
& - bu_ (i,j,k)
& - beu_(i,j,k)
& - bnu_(i,j,k)
& - bs_ (i,j,k)
& - bes_(i,j,k)
& - bw_ (i,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo

! backward sweep
!$ do j=1,Ny
!$ do k=1,Nz
!$ call omp_test_lock( locks(j,k) )
!$ enddo
!$ enddo

!$omp parallel do
!$omp& schedule( static, 1 )
!$omp& private( i, j, k )
do k=Nz,1,-1
do j=Ny,1,-1
if ( j > 1 .And. k < Nz ) then
!$ call omp_set_lock( locks(j-1,k+1) )
endif
do i=Nx,1,-1
bu_ (i,j,k+1) = bu(i,j,k+1) * zt(i,j,k+1)
enddo
&n
bsp; do i=Nx,1,-1
beu_(i-1,j,k+1) = beu(i-1,j,k+1) * zt(i-1,j,k+1)
enddo
do i=Nx,1,-1
bnu_(i,j-1,k+1) = bnu(i,j-1,k+1) * zt(i,j-1,k+1)
enddo
do i=Nx,1,-1
bs_ (i,j+1,k) = bs(i,j+1,k) * zt(i,j+1,k)
enddo
do i=Nx,1,-1
bes_(i-1,j+1,k) = bes(i-1,j+1,k) * zt(i-1,j+1,k)
enddo
do i=Nx,1,-1
bw_ (i+1,j,k) = bw(i+1,j,k) * zt(i+1,j,k)
enddo
do i=Nx,1,-1
zt(i,j,k) = zt(i,j,k) / bp(i,j,k)
& - bu_ (i,j,k+1)
& - beu_(i-1,j,k+1)
& - bnu_(i,j-1,k+1)
& - bs_ (i,j+1,k)
& - bes_(i-1,j+1,k)
& - bw_ (i+1,j,k)
enddo
!$ call omp_unset_lock( locks(j,k) )
enddo
enddo
end

Arikd,

I am preparing a test program using dummy data.

In overviewing your code, prior to each sweep you have a double loop issuingCALLs to omp_test_lock. This is a coding error since omp_test_lock is a logical function and not a subroutine. Also, at that point in the code, the locks should be in the state of unlocked (assuming your code properly initialized the locks).

Jim Dempsey

www.quickthreadprogramming.com

Arik,

Your "original" solvm13 is not what I asked for. What you provided is your attempt at parallizing the (or some) code. What I asked for is the non-parallel version of the code that used to work. Please confirm that running solvm13 with OpenMP disabled that the results produced are correct.If your non-parallel results are incorrect then it is not productive for us to suggest improvements in performance of this non-functioning code.

Additionaly, in looking at the solvm13 code, in addition to using CALL to call a function as opposed to a subroutine, your set lock and unset lock are not correct in that (assuming more than one thread) the thread unsetting a lock is not the same as the thread setting the lock. eg Thread 0 gets k=1, Thread 1 gets k=2, Thread 3 gets k=3, Thread 0 gets k=4, ... at some point Thread 0 will lock (j,2) whereas Thread 1 will be unlocking (j,2). This is not permitted (i.e. your code is relying on undefined behavior).

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Your locks note is well taken. It should be fixed, but that has nothing to do w/optimization. The code produces correct results in either configuration, i.e. the way we had it before or with the changes you proposed. The only issue is the solution time. We did run your modificationsboth with and without parallelization and the result is the same, roughly 2x slower than the original code.

Arik

Arik,

In looking at your forward sweep routine (were indexes increment) you are referencing offsets of -1 (i.e. k-1 and j-1). In the non-threaded execution of the code,this requires that those cells of zt have been updated prior to reference (with -1 offset). A similer situation occures in the backward sweep with the offsets of +1 (i.e. k+1 and j+1).

What this causes isthe requirement for the serializationof the process (only one thread at a time can progress).

In many filtering or transformation algorithems the next time state values depend on prior time state values but not the values in the state of flux. i.e. your forward sweep would use index offsets of +1 as well as backward sweep using index offsets of -1. This causes me to call into question if the algorithm as supplied is correct.

From the little understanding I have of what the code isit looks as if you are calculating new values for r in 3D space using faces of a cube/box. If this is so then each call to solvm13 would be part of a state advancement and therefore all the values on the right side of the zt(i,j,k)= ought to reference the prior values of zt(i,j,k) and not the subsequent values during the state advancement within solvm13. Could you comment on this?

Jim Dempsey

www.quickthreadprogramming.com

Jim,

I already questioned these serial dependencies. Given that a chunk size of 1 is used, with luck, and all threads progressing at the same rate, it's possible that the updates have completed soon enough, but there's still a race condition. If it's an iterative method, it's possible that small variations in iteration history from run to run are acceptable and don't cause it to fail.

Tim

Jim and Tim,

Unfortunately, both the algorithm and the code implementing it are correct. It has been tested to produce correct results and hence by definition it is fine. We are solving a banded matrix here, using Incomplete Cholesky Conjugate Gradient, ICCG, method. One of the key elements of this approach is to carry out ILU (Incomplete LU decomposition) on the original matrix and then solve the matrix iteratively. Each iteration involves all kinds of dot products (like the ones I posted in the beginning), but the most time-consuming operation is this forward and back-substitution step. (If you are interested, look up the standard direct, i.e. non-iterative, methods of matrix solution based on LU decomposition - all of them involve this type of a fw/back substitution step.)

Indeed, as I mentioned earlier and as you noticed, the step involves recurrent relations, which apparently impede its efficient implementation. It sounds as if you are saying that if that's what it is,there is not much wecan do. If so, I will look for ways to break that recurrency, somehow.

Thanks,

Arik

Arik,

Hope is not lost.

I spent the day working on a different method of interlocking and came up with a way that produces these results on a system with 4 cores (2x Opteron 270 w/ 2 cores each).

Available Cores 4
Nx=200
Ny=200
Nz=200
solvm13_s = 4.05135552724823 (unthreaded)
solvm13_s = 4.05434086732566 (unthreaded)
solvm13_s = 4.05069780210033 (unthreaded)
solvm13_s2 = 6.38864697469398 Threads=1
solvm13_s2 = 6.37160542188212 Threads=1
solvm13_s2 = 6.36766673671082 Threads=1
solvm13_s2 = 3.33201492624357 Threads=2
solvm13_s2 = 3.32790499506518 Threads=2
solvm13_s2 = 3.32613467006013 Threads=2
solvm13_s2 = 2.32680508820340 Threads=3
solvm13_s2 = 2.33010923257098 Threads=3
solvm13_s2 = 2.33329555764794 Threads=3
solvm13_s2 = 1.83062202343717 Threads=4
solvm13_s2 = 1.82885152008384 Threads=4
solvm13_s2 = 1.82682427763939 Threads=4

The system only has 2GB so I could not bump up the array sizes largest than 200 by 200 by 200 (202 by 202 by 202if you count the 0 and +1 perimeters).

The arrays were declared as REAL(8). There is interlocking in the code but there are no OpenMP locks involved. And no extraneous temporary arrays.

Each test was run 3 times in order to produce a run time with some confidence that it is approximately correct. There is an unthreaded test (Without OpenMP statements and without the interlocking code added to force proper sequencing). The OpenMP threaded tests run with 1, 2, 3, 4 threads and 3 iterations each (in addition to the OpenMP overhead there is also the synchronization code to assert the temporal requirements of the method).

Two sets of arrays were used. Each set of arrays were declared as a user defined type. One of the user defined type was A and the other B (each with 9 arrays zt(:,:,:), r(:,:,:),...).

The arrays were initialized with different values in each array. The values were chosen such that neither Infinities nor NaNs would result in the output array (zt(i,j,k)). The A set of initialized data was the same as the B set of initialized data.

The unthreaded code processed the A set of arrays, the threaded code processed the B set of arrays. After each iteration a comparison was performed

if(A%zt(i,j,k) .ne. B%zt(i,j,k)) write 'err ', i,j,k

No errors were reported between the various numbers of theads and the unthreaded code.

As you can see there is hit from going unthreaded to OpenMP with 1 thread. Additional improvement might be attained with closer examination of the code.

You can email me to discuss this in further detail, my address follows.

Jim Dempsey
jim_dempsey@ameritech.net

www.quickthreadprogramming.com

Further enhancement gets to

Available Cores 4
Nx 200
Ny 200
Nz 200
solvm13_s = 4.03220395976678 (unthreaded)
solvm13_s = 4.03583411965519 (unthreaded)
solvm13_s = 4.04411484347656 (unthreaded)
solvm13_s2 = 5.83800405170768 Threads= 1
solvm13_s2 = 5.83111096965149 Threads= 1
solvm13_s2 = 5.83113517612219 Threads= 1
solvm13_s2 = 3.02408889075741 Threads= 2
solvm13_s2 = 2.99818594055250 Threads= 2
solvm13_s2 = 2.99723869049922 Threads= 2
solvm13_s2 = 2.07681118417531 Threads= 3
solvm13_s2 = 2.07517947163433 Threads= 3
solvm13_s2 = 2.06526678288355 Threads= 3
solvm13_s2 = 1.59286772180349 Threads= 4
solvm13_s2 = 1.59124766616151 Threads= 4
solvm13_s2 = 1.58598867338151 Threads= 4

In examining the dissassembly code it appears that additional and significant improvement can yet be attained.

Jim Dempsey

www.quickthreadprogramming.com

You should try to make some tests and improve the 64 bit options. I will try to put some text on my website tomorrow when I have time.

Leave a Comment

Please sign in to add a comment. Not a member? Join today