WORKSHARE not working

hajek@vzlu.cz:

Hello,
I have problems with the OpenMP WORKSHARE directive in ifort v9.0
(ifort -V produces:
[hajek@dell8 gpr]$ ifort -V
Intel Fortran Itanium Compiler for Itanium-based applications
Version 9.0 Build 20050624 Package ID: l_fc_c_9.0.024
Copyright (C) 1985-2005 Intel Corporation. All rights reserved.
)
If I compile the attached test program with

ifort -O3 -ip -openmp -openmp-report=2 omptest.f90

the compiler produces:
omptest.f90(30) : (col. 6) remark: OpenMP multithreaded code generation for SINGLE was successful.
omptest.f90(33) : (col. 6) remark: OpenMP multithreaded code generation for SINGLE was successful.
omptest.f90(29) : (col. 6) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

There is only one SINGLE section in my parallel region, yet the compiler reports two, so my suspicion is that it simply replaces the WORKSHARE region with a SINGLE; this is confirmed by the fact that I get no speedup with multiple threads:

[hajek@dell8 scratch]$ OMP_NUM_THREADS=4 ; export OMP_NUM_THREADS; time ./a.out
running two-dimensional version
OpenMP: using 4 threads.
0.000000000000000E+000 1.54969673800415

real 0m10.607s
user 0m5.595s
sys 0m5.513s

[hajek@dell8 scratch]$ OMP_NUM_THREADS=1 ; export OMP_NUM_THREADS; time ./a.out
running two-dimensional version
OpenMP: using 1 threads.
0.000000000000000E+000 1.54969673800415

real 0m10.716s
user 0m4.891s
sys 0m5.724s

I've searched the forum and only read that WORKSHARE might not work well with rank-2 arrays. But if I compile the 1D version instead (-D ONED suffices), I get the same results.
I have a code that assembles (relatively) large matrices using FORALL statements, and I really do not want to rewrite them as DO loops. Is WORKSHARE better supported in some later build?
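The attached omptest.f90 is not reproduced in this thread. A minimal sketch of the pattern under discussion (a WORKSHARE region wrapping a FORALL; the names R, X, theta and the triangular mask are taken from the later posts and are otherwise assumptions) might look like:

```fortran
! Hypothetical sketch of the pattern discussed in this thread; the actual
! attachment omptest.f90 is not shown. Array names follow the later posts.
program workshare_sketch
  implicit none
  integer, parameter :: n = 1000
  real(8) :: R(n,n), X(3,n), theta(3)
  integer :: i, j

  call random_number(X)
  theta = 1.0d0
  R = 0.0d0

!$omp parallel
!$omp workshare
  ! The FORALL is a single unit that WORKSHARE may legally hand to one
  ! thread (effectively a SINGLE), which would explain the remarks above.
  forall (i = 1:n, j = 1:n, i >= j) &
       R(i,j) = sqrt(dot_product(theta, (X(:,i) - X(:,j))**2))
!$omp end workshare
!$omp end parallel

  print *, R(1,1), R(n,1)
end program workshare_sketch
```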

Tim Prince:

With OpenMP, plus opportunities for software pipelining, you generally have to be specific about which loop should be parallelized. If you are at all interested in performance, your first interest should be in enabling successful pipelining.

hajek@vzlu.cz:

tim18 wrote:
With OpenMP, plus opportunities for software pipelining, you generally have to be specific about which loop should be parallelized. If you are at all interested in performance, your first interest should be in enabling successful pipelining.

This is indeed true, but compiling with
ifort -O3 -openmp -openmp-report=2 -opt-report -opt-report-phase=ecg_swp -opt-report-file=report.txt omptest.f90

shows that the forall statement in question really gets pipelined.
(see attachment)
Moreover, I tried to rewrite the forall statement as a DO loop in the following way:


do j = 1, n
   do i = 1, n
      if (i .ge. j) R(i,j) = sqrt(dot_product(theta, (X(:,i) - X(:,j))**2))
   end do
end do

and it was parallelized and showed a speedup by a factor of 2
(with 4 threads, but there are also other instructions in the timed region).
So what am I doing wrong?
In the Fortran 95 course at college we were encouraged to use FORALLs and WHEREs wherever possible, to inform the compiler about independence and help it optimize and parallelize the loop. Is that not true?

Tim Prince:

You may have enabled an optimization; the conditional can be removed entirely by starting the inner loop at j:

do j = 1, n
   do i = j, n
      R(i,j) = sqrt(dot_product(theta, (X(:,i) - X(:,j))**2))
   end do
end do

You certainly have clarified the task set to the parallelizing pre-processor and compiler: pipeline the i loop and restrict any parallelization to the j loop.
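That split can also be made explicit with a PARALLEL DO on the outer loop. The following is a sketch (not from the thread); the schedule(dynamic) clause is an assumption, added to balance the triangular iteration space across threads:

```fortran
! Sketch: explicitly parallelize only the outer j loop, leaving the
! inner i loop to the software pipeliner. schedule(dynamic) is an
! assumption to balance the triangular iteration space.
program explicit_parallel
  implicit none
  integer, parameter :: n = 1000
  real(8) :: R(n,n), X(3,n), theta(3)
  integer :: i, j

  call random_number(X)
  theta = 1.0d0
  R = 0.0d0

!$omp parallel do private(i) schedule(dynamic)
  do j = 1, n
     do i = j, n
        R(i,j) = sqrt(dot_product(theta, (X(:,i) - X(:,j))**2))
     end do
  end do
!$omp end parallel do

  print *, R(n,1)
end program explicit_parallel
```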

hajek@vzlu.cz:

In other words, you would discourage the use of FORALL statements on such "totally independent" loops? And I used to trust FORALL to be a great step towards portable parallel programming in Fortran... It seems that a good working implementation of WORKSHARE in Fortran 95 is a real challenge for a compiler. Well, I guess I'll return to good old DO loops and forget these F95 features that glitter but have no gold inside them.
