# Parallel programming: Parallelized DO loops


Hi everyone,

I had a problem parallelizing a Fortran program that I'd like to share. My program has three big nested DO loops which were taking too much time to run, so I decided to parallelize the outside DO loop (n=1,nszf) and then run it on some supercomputers at my university. My current Dell XPS 8500 (Windows 7 Ultimate) PC has 8 threads. Here is that part of my program:

!$omp parallel private(i,j,k,n,l,c) shared(nszf,nband) reduction(- : sk,r1)
!$omp do
      do 300 n = 1,nszf         ! nszf=139020
      i = n
      do 290 l = 2,nband        ! nband=7000
      i = i + 1
      if (sk(n,l)) 240,290,240
240   c = sk(n,l)/sk(n,1)
      j = 0
      do 270 k = l,nband
      j = j + 1
      if (sk(n,k)) 260,270,260
260   sk(i,j) = sk(i,j) - c*sk(n,k)
270   continue
280   sk(n,l) = c
      r1(i) = r1(i) - c*r1(n)
290   continue
300   r1(n) = r1(n)/sk(n,1)
!$omp end do
!$omp end parallel

Without the OMP directives the program runs well, but the parallel version gives some errors. I think the problem is in the private, shared, and reduction clauses. sk is the stiffness matrix and r1 is the force array. Note that this is the well-known Gaussian elimination method for solving the unknowns of a matrix. Is it correct to use reduction for a matrix and an array this way? Please let me know where the problem is.

Thanks


Experiment with having the !$omp do enclose the 270 loop (and remove the reduction clause on the !$omp parallel).

Have only the master thread update the variables outside the !$omp do (and inside the !$omp parallel).

Once this is working, then see how to partition the iteration spaces so that you can move the parallelization to the outer levels. Note that keeping the !$omp parallel at the outer level will reduce the number of parallel-region entries and exits, but comes at the expense of delegating the work outside the !$omp do to one thread.
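An untested sketch of the structure Jim describes, using the loop bookkeeping i = n+l-1 and j = k-l+1 so the inner loop can be work-shared (the master/single placement, the zero tests, and the toy sizes are my assumptions, not code from this thread):

```fortran
! Sketch of Jim's structure: the parallel region encloses the whole
! sweep, only the k-loop is work-shared, and the scalar updates are
! done by one thread.  Untested; barrier placement is an assumption.
program band_reduce_sketch
   implicit none
   integer, parameter :: nszf = 50, nband = 8    ! toy sizes for illustration
   real*8 :: sk(nszf+nband,nband), r1(nszf+nband)
   real*8 :: c
   integer :: n, l, i, k
   call random_number(sk);  sk(:,1) = sk(:,1) + 2.0d0   ! keep pivots non-zero
   call random_number(r1)
!$omp parallel private(n,l,i,k,c) shared(sk,r1)
   do n = 1, nszf                  ! every thread executes the sequential shell
      do l = 2, nband
         i = n + l - 1
         if (sk(n,l) == 0.0d0) cycle
         c = sk(n,l)/sk(n,1)       ! redundant on each thread, but c is private
!$omp do
         do k = l, nband           ! the expensive, work-shared loop
            if (sk(n,k) /= 0.0d0) sk(i,k-l+1) = sk(i,k-l+1) - c*sk(n,k)
         end do
!$omp end do                       ! implicit barrier keeps threads in step
!$omp master
         sk(n,l) = c               ! single-thread updates outside the !$omp do
         r1(i)   = r1(i) - c*r1(n)
!$omp end master
      end do
!$omp single
      r1(n) = r1(n)/sk(n,1)
!$omp end single
   end do
!$omp end parallel
   print *, 'r1(1) =', r1(1)
end program band_reduce_sketch
```

Because every thread executes the sequential shell redundantly with private copies, there is no reduction clause at all; only the work inside the !$omp do is actually divided.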

Jim Dempsey

This is a fixed band direct solver, circa 1972.
My experience with a direct solver has been that it is not easy to apply OpenMP to this solution approach, although I assume it can be done. Nband is typically too small to justify multi-threading the inner loop, while the varying sk(i,j) prohibits the outer loops being parallelised.
With this solver, it might be possible to parallelise loop 290, as this applies row N to all lower rows.
Would taking a copy of row N, say as a vector Row_N, make it easier for the compiler to parallelise loop 290?
I have had more (easier) success with vectorising the calculation, with AVX instructions now offering more in this area. With /O2 (default?) vectorisation is enabled.
Two areas I would check:
1) How often does the test "if(sk(n,k)) 260,270,260 " skip to 270. It might be better to ignore this test and simplify the inner loop.
2) How uniformly banded is your matrix ? There are many examples of skyline solvers, which identify the band-width of each equation, rather than the maximum band. COLSOL by R L Taylor is very well published, including in Zienkiewicz: The Finite Element Method.
For most problems, changing from a fixed band row storage solver to a variable band column storage solver should provide a significant speed improvement.
.
John
.
PS: I changed the code to F90 and reviewed your IF tests, now hopefully without errors!!
Lower-case l and upper-case I are not good variable names with most fonts!
.
!$omp parallel private(i,j,k,n,l,c) shared(nszf,nband) reduction(- : sk,r1)
!$omp do
do n = 1,nszf                      ! nszf=139020
   if (sk(n,1) < tiny_value) then  ! test for singularity; could be an abs test if eigenvalue extraction
      sk(n,1) = 1
      cycle
   end if
   !
   iband = 1
   Row_n(1) = sk(n,1)
   do b = 2,nband
      Row_n(b) = sk(n,b)
      if (Row_n(b) == 0) cycle
      iband = b
      sk(n,b) = Row_n(b)/Row_n(1)
   end do
   !
   ! This loop could be parallelised, with Row_n and iband shared ?
   i = n
   do b = 2,iband                  ! nband=7000
      i = i + 1
      if (Row_n(b) == 0) cycle     ! main zero test
      c = Row_n(b)/Row_n(1)
      j = 0
      do k = b,iband
         j = j + 1
         ! omit if (Row_n(k)==0) cycle   ! low probability test ?
         sk(i,j) = sk(i,j) - c*Row_n(k)
      end do
      ! sk(n,b) = c                ! already done
      r1(i) = r1(i) - c*r1(n)
   end do
   r1(n) = r1(n) / Row_n(1)
   !
end do
!$omp end do
!$omp end parallel

The existing 290/270 loop structure should be more suitable for $omp with the introduction of Row_n.
.
Some other ideas that might help run time are:
1) separating r1 into a separate forward reduction (an order of magnitude faster than the sk reduction)
2) swapping loops 270 and 290 to address sk(i,j) sequentially, which might reduce memory access times
.
! Separate loop for r1 (after sk(n,:) has been modified)
do n = 1,nszf                      ! nszf=139020
   if (sk(n,1) < tiny_value) cycle ! test for singularity; could be an abs test if eigenvalue extraction
   i = n
   do b = 2,nband
      i = i + 1
      if (sk(n,b) == 0) cycle      ! main zero test
      r1(i) = r1(i) - sk(n,b)*r1(n)
   end do
   r1(n) = r1(n) / sk(n,1)
end do
.
! re-ordered i and j might be better (coding needs to be checked)
! if loops b and k were swapped, sk(i,j) would then be addressed sequentially
! but:
!   it involves two ops (* and /), which can be overcome by using sk(n,b)
!   there is no easy zero test in loop j
! index relations:
!   b = i-n+1
!   i = b+n-1
!   k = j+b-1 = j+i-n
!   j = k-b+1
do j = 1,iband
   b = 1
   k = j
   do i = n+1,n+1+iband-j
      b = b + 1                    ! b = i-n+1
      k = k + 1                    ! k = i-n+j
      ! sk(i,j) = sk(i,j) - Row_n(b)*Row_n(k)/Row_n(1)
      sk(i,j) = sk(i,j) - sk(n,b)*Row_n(k)
   end do
end do
.

After further review, there are two areas where you could improve performance:
.
1) storage: you have three possible storage methods to use, being
store by diagonal :: sk(nszf,nband), which you are using
store by row :: sk(nband,nszf), which should be better
store by column :: sk(nband,nszf), which suits the COLSOL approach
row storage is more suited to Gaussian elimination
.
2) zero values: if there is not total non-zero filling of the reduced sk array, then monitoring of non-zero values might be a significant improvement.
It can be done in the calculation of the Row_f vector, reducing the inner loops to only the non-zero coefficients in row n.
!
! Find active values in this row
iband = 1
Row_i(1) = 1
Row_f(1) = skt(1,n)
do b = 2,nband
   if (skt(b,n) == 0) cycle
   iband = iband + 1
   Row_i(iband) = b
   Row_f(iband) = skt(b,n)
   skt(b,n) = Row_f(iband)/Row_f(1)
end do
!
! This key loop could be parallelised, with Row_f, Row_i and iband shared ?
do b = 2,iband
   i = n + Row_i(b) - 1
   c = Row_f(b)/Row_f(1)
   do k = b,iband
      j = Row_i(k) - Row_i(b) + 1
      skt(j,i) = skt(j,i) - c*Row_f(k)
   end do
end do
.
(you would need to check this coding, especially for i and j)
.
I would expect that the use of Row_f will make loops 290/270 easier to parallelise. Parallelising is not easily achieved in COLSOL codings.
If your model is a rectangular mesh with uniform banding, then the identification of zeros might not offer significant improvement.
However, both areas should offer some improvement.
.
John

John, the mesh I use for this program is a uniform rectangular one of length 120 m, width 135 m, and height 55 m, with general boundaries on all sides of the mesh except its top parts.

John Campbell wrote:

1) How often does the test "if(sk(n,k)) 260,270,260 " skip to 270. It might be better to ignore this test and simplify the inner loop.

In sk(nszf, nband) matrix, I would say that around 80% of row is zero.

John Campbell wrote:

2) How uniformly banded is your matrix ? There are many examples of skyline solvers, which identify the band-width of each equation, rather than the maximum band.

I think the actual bandwidth is pretty close to my nband=7000; maybe the actual bandwidth is around 6800. So, this one is okay, I think.

Your last comment seems to be a good alternative, but it looks like transposing the sk(nszf,nband) matrix to sk(nband,nszf) makes it a bit confusing, as we are dealing with the bandwidth too. But I was wondering: couldn't we use sk(nband,nszf) with Row_f?

I'm new to OpenMP too, so is the variable declaration for OpenMP in my program correct, especially reduction (- : sk, r1)?

The whole program should be as follows, right? But what about the r1 array? Based on this program, I should transpose sk to skt, and then transpose it back again after the do loops, as I use sk in the following parts for stress and strain analysis too, right?

!$omp parallel private (i,j,k,n,l,c) shared (nszf,nband,Row_f,Row_i,iband) reduction (- : sk,r1)

do n = 1,nszf                      ! nszf=139020
   if (sk(n,1) < tiny_value) then  ! test for singularity; could be an abs test if eigenvalue extraction
      sk(n,1) = 1
      cycle
   end if

.

.

   iband = 1
   Row_i(1) = 1
   Row_f(1) = skt(1,n)
   do b = 2,nband
      if (skt(b,n) == 0) cycle
      iband = iband + 1
      Row_i(iband) = b
      Row_f(iband) = skt(b,n)
      skt(b,n) = Row_f(iband)/Row_f(1)
   end do

!$omp do
   do b = 2,iband
      i = n + Row_i(b) - 1
      c = Row_f(b)/Row_f(1)
      do k = b,iband
         j = Row_i(k) - Row_i(b) + 1
         skt(j,i) = skt(j,i) - c*Row_f(k)
      end do
   end do

!$omp end do

!$omp end parallel

end do

thanks :)

Alex,

If there is only one r1 array, i.e. only 1 load case in your problem solution, I would put the r1 reduction back in the "290" loop, although it can be a neater solution to have the r1 forward reduction and backward substitution in a separate sk_solve routine.

You stated that "In the sk(nszf,nband) matrix, I would say that around 80% of a row is zero." I think you should generate some statistics on this: if you have a regular 3D mesh as described, then there could be as little as 1% zeros. The values of sk are being tested after previous equations have trickled down their influence. If there is significant non-zero infill, then the Row_f indexed version of Row_n might not be very effective. For each row, count the number of non-zero values and report the average. You might be surprised.
I'd recommend you first try the Row_n approach, as it should help the $omp performance.

As the sk matrix is so large, I think there would be benefit in transposing it. This would localise the memory usage. You must have the value of nband before you start generating the sk matrix. Isn't sk an allocatable array? There should be limited reference to this array in the matrix formation and reduction code. The back-substitution code for r1 might only need the sk transpose change, as it is much faster than the sk reduction.

The combination of the sk transpose and Row_n changes should produce some benefits. (Row_n would work better with sk transposed.)

I hope this helps,

John

Alex,

introduce row_f : the active row of values to remove sk(..) from the RHS in 290 loop
introduce iband : the bandwidth limit for equation n
summary statistics for bandwidth, zero values and elapsed time
changed l to b for easier reading

I assumed that sk was declared as:
real*8, allocatable, dimension(:,:) :: sk
allocate ( sk(nszf,nband) )

I have made some changes to the !$omp directives, but I am not sure the syntax is right.

Could you run it with /O2, both with and without /Qparallel? I would be interested to find out:
whether the results remain the same,
the values of the summary statistics, and
the elapsed-time statistics (using system_clock) compared with the previous version.

These could identify how the changes I identified might benefit the run time.

The next step would be to transpose sk in your program and see the changes.

John

## Attachments:

2.31 KB

Thanks John for the response.

I was trying to run the program with OpenMP. Before running, I noticed a minor issue in your attachment file:

In loop 290, you have used iband, but the row_f(b) index varies from 1 to nband; I mean the array row_f(b) (b=1,nband) in previous parts also includes zero values, and its size is nband, right?

i = n
do b = 2,nband                     ! nband=7000
   i = i + 1
   if (row_f(b) == 0) cycle
   c = row_f(b)/row_f(1)
   j = 0
   do k = b,nband
      j = j + 1
      if (row_f(k) == 0) cycle
      sk(i,j) = sk(i,j) - c*row_f(k)
   end do
end do

The next main issue, as you said, is that sk is an allocatable matrix. I think there is no way to avoid an allocatable array because of its large size:

allocate (sk(nszf,nband)). When running with OpenMP and using the allocatable matrix sk, I get the following error:

A deferred array shape array is not permitted in an openMP

Not using an allocatable array, during the build I face: compilation aborted (code 3), exactly at the first line of the entire program.

So, what else can I do?

thanks.

Alex,

Loop 290 should now use iband. In past posts I have identified three bandwidth measures:
1) nband is the maximum bandwidth of any row
2) iband in gaussean.f90 is the length of the n row to the last non-zero value
3) iband in my post of Wed, 01/30/2013 - 15:22, where I included row_i, is the count of the number of non-zero values that were indexed in row_i. Depending on the statistics this could be a good change.
You should use the iband, as I provided in gaussean.f90.

I provided some statistics in this code. You might want to include the following, before the start of the reduction:

!     initial statistics
real*8 sum_zero
sum_zero = 0
do n = 1,nszf                      ! nszf=139020
   do b = 1,nband
      if (sk(n,b) == 0) sum_zero = sum_zero + 1
   end do
end do
write (*,*) ' % zero before      =', sum_zero/dble(nszf*nband) * 100.

Your comment about sk not being usable with OpenMP is a surprise to me. I would have a main module where sk, nszf, nband and other important system variables are held. You should allocate sk prior to assembly of the global stiffness matrix, once you know nszf and nband. I have given an example of this module definition in gaussean.f90.
Not being able to use an allocatable array with OpenMP should not be the case.
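A minimal sketch of such a module arrangement (the names `system_data` and `formk`'s body are illustrative assumptions, not the actual contents of gaussean.f90):

```fortran
! Sketch: hold the system data in one module so that both the assembly
! and the solver routines USE the same sk.  Names are illustrative.
module system_data
   implicit none
   integer :: nszf, nband
   real*8, allocatable :: sk(:,:)
   real*8, allocatable :: r1(:)
end module system_data

subroutine formk
   use system_data
   allocate ( sk(nszf,nband) )     ! allocate exactly once, after nszf
   allocate ( r1(nszf) )           ! and nband are known
   sk = 0;  r1 = 0
   ! ... assemble the global stiffness matrix here ...
end subroutine formk

program demo
   use system_data
   implicit none
   nszf = 10;  nband = 4           ! toy sizes
   call formk
   print *, shape(sk)
end program demo
```

With this layout there is no allocatable dummy argument anywhere, so the solver never sees a deferred-shape array.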

Is sk declared as real*8 or real*4 ?

Transposing the use of sk from sk(i,j) to sk(j,i) should help the run time.

Could you adapt gaussean.f90 and run it to get the statistics? These would show whether including row_i is worthwhile, and certainly which adaptation of loop 290 should work best. Once we understand this, I think the next step is to determine why loop 290 cannot work with !$omp.

I am tempted to make loop 290 a subroutine; however, I have found this approach does not work with /Qparallel. I had hoped that !$omp would overcome this problem.

John

## Attachments:

3.32 KB

Alex,

I have reviewed the example I gave you and included extra code to generate a smaller dummy stiffness matrix so that I could run it (neq=130000, nband=1700). The results of these runs, especially the degree of non-zero infill, may be different from your case, but they show:
gaussean.f90 is the run where iband (the band envelope) is used for each equation (306 seconds)
gaussean_tran.f90 is the run where SK is transposed (106 seconds)
gaussean_t2.f90 uses the transpose and row_i to remove zeros from the active row (60 seconds)
The run-time improvement is heavily dependent on the percentage of zeros when reduced, here 55%. If your example gives a much lower figure, say < 10% (which I expect), then the savings will be less, but probably still worthwhile.
I also separated out the reduction of R1, as this is much faster than reducing SK. You could also add the loop for back-substitution; I'd recommend this.
You should check gaussean_tran.f90 to see how easy it is to transpose SK (use the FC utility).

All this has been run with /O2 but not /Qparallel or /Qopenmp, which brings us back to where you first began with your question about !$OMP directives in the code. This needs some more review.
I tried !$OMP CRITICAL for the initial setup of Row_f, but I don't think that did what I intended, which was that it run only once and not for each thread.
The !$OMP DO did parallelise this code segment, which is what is required, but I could not guarantee it is working correctly.
I ran gaussean_t2.f90 with /Qopenmp, but it ran in only 6 seconds with 4 processes, which is not right: it should be at least 15 seconds. There is a two-stage process of 1) generation of row_f/row_i, then 2) loop 290. This needs to be enforced in the !$OMP directives.
I think it is close to working, but needs changes to the !$OMP to handle the loops correctly while running the row_f generation only once. Hopefully someone else might be able to review the !$OMP commands to get this right.

I am attaching the 3 versions of the test program plus a run trace of each and the batch file I have been using to test with ifort Ver 12.1.
What I have been describing should work and hopefully I have not introduced other coding errors.

I hope this gives you some ideas. I'd be interested in how you proceed,

John

PS: I have now attached an updated version of option 3, which is a more complete test: it also performs load-case back-substitution, then a check calculation of force - [sk].disp to find the maximum calculation error. This should be useful for testing the parallel operation.
The changes I have now made to the $OMP commands appear to be working, although there might be a better implementation.
!$OMP appears to reduce the run time by about 50% with 4 processes.

## Attachments:

7.36 KB
7.88 KB

In my program, sk is allocated in one subroutine (sk is declared real*4; all of my variables are real*4) and then used in another subroutine, something like:

! Main program

real, allocatable :: sk(:,:)

call formk (sk,nszf,...)

call solve (sk,r1,nszf)

! end of main program

subroutine formk (sk,nszf,...)

real, allocatable :: sk(:,:)
.
.
allocate (sk(nszf,nband))
! sk matrix is formed here, the boundary conditions are applied, and it is reduced to nszf x nband; then it is sent to be solved in the solve subroutine.
end subroutine formk

subroutine solve (sk,r1,nszf)
real, allocatable :: sk(:,:)
allocate (sk(nszf,nband))
nband = 7000
call gaussian_reduction (sk, nszf, nband, r1)   ! (your subroutine)

end subroutine solve
.
subroutine gaussian_reduction (sk, nszf, nband, r1)
real :: sk(nszf,nband)
.. and then the rest of the program you provided ...
This one gives an unhandled exception at .... stack overflow error at the reduction (- : sk, r1) line. I should mention that I've allocated a large stack reserve size (1000000000).

Changing real :: sk(nszf,nband) in the gaussian_reduction subroutine into real, allocatable :: sk(:,:) and then running gives:
A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]

And, in the first case, the program executed the statistics part and gave % zero before = 95%.
You could run gaussean.f90 probably because your dummy stiffness matrix is small.
So, in this case, what can I do?

thank you

Alex

Alex,

You should only allocate SK once. I would place SK in a module and USE the module in both FORMK and SOLVE. (SOLVE is a very common name; I'd change it.) You should try to understand why this is the case.
It is wrong to allocate SK more than once, or to define nband=7000 after the allocate statement in SOLVE. This might be the cause of the error you reported earlier with $OMP and SK.

Look at the way I call gaussean_reduction (sk, nszf, nband), then declare the arguments. This will overcome the "A deferred array shape array is not permitted in an openMP FIRSTPRIVATE, .... [SK]" error.
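The point being made is that an allocatable array can be passed to an explicit-shape dummy argument; inside the solver it is then an ordinary array, which OpenMP data clauses accept. A minimal sketch (toy sizes; the placeholder loop body is mine, not the code in gaussean.f90):

```fortran
! Passing an allocatable array to an explicit-shape dummy argument.
! Inside the subroutine sk is explicit-shape, not deferred-shape,
! so it is legal in OpenMP data clauses.  Toy sizes for illustration.
program pass_allocatable
   implicit none
   integer, parameter :: nszf = 6, nband = 3
   real*8, allocatable :: sk(:,:), r1(:)
   allocate ( sk(nszf,nband), r1(nszf) )
   sk = 1.0d0;  r1 = 2.0d0
   call gaussean_reduction (sk, nszf, nband, r1)
   print *, 'r1(1) =', r1(1)
end program pass_allocatable

subroutine gaussean_reduction (sk, nszf, nband, r1)
   implicit none
   integer :: nszf, nband, n
   real*8  :: sk(nszf,nband)      ! explicit shape: fine in OpenMP clauses
   real*8  :: r1(nszf)
!$omp parallel do shared(sk,r1) private(n)
   do n = 1, nszf
      r1(n) = r1(n) / sk(n,1)     ! placeholder for the real reduction work
   end do
!$omp end parallel do
end subroutine gaussean_reduction
```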

If you don't have any knowledge of the precision required for SK, I'd recommend real*8. You haven't described the system you are modelling, but I'd expect real*8 would be required.
My testing also shows a significant gain from transposing SK. You should try to identify the changes that would be required for this. They should not be extensive, unless you are using SK for other purposes, which would not be advisable for such a large array.

Try the latest version I attached, as I think the $OMP commands are working. It would be good to see results on your i7, which supports 8 processes. I think this version is fairly close.

John

Alex,

I have tidied up the code to include some more statistics and confirmed that it appears to be working with !$OMP.
It could be a good example of a basic OpenMP implementation.

The compilation options I have selected are: /Tf %1.f90 /free /O2 /Qopenmp /QxHost

I have some questions about !$OMP which someone might be able to comment on:
Is it possible to move the following command outside the outer DO loop to improve performance?
!$OMP PARALLEL PRIVATE (c,i,j,k,b), SHARED (SK, N, Row_f, Row_i, iband)
Are the !$OMP commands I have selected the most appropriate?
What would be the minimal install on a PC which does not have ifort installed, so the program could run on another PC?

John

## Attachments:

9.44 KB

Alex,

Could you please run the latest version I attached on your i7, with and without the /Qopenmp compile option? I'd like to see the difference with 8 possible processes. You should run Task Manager to confirm you are getting 100% CPU, rather than 12.5% without /Qopenmp. The R1 solver has not been optimised and does not use !$OMP; it runs much quicker than the SK reduction. It could be improved if it runs many times.

The !$OMP commands could be reviewed; however, the outer "n" loops must run sequentially in all cases. That is why I placed the "do n" outside the !$OMP construct. I'm not sure if this means that the thread processes are set up nszf times, or only once. Once would be better!

The Fortran ALLOCATE statement provides a memory address space for the SK array; once it is allocated, it is fixed for the full run.

Let me know how you go.

John

John;
I ran your latest attached file on my PC, a Dell XPS 8500, assuming sk(nband=3700,nszf=130000). With !$omp it took about 197 sec, and without it about 202 sec. I was checking Task Manager as well: with !$omp, CPU usage was 98-100%, and without it, 12-17%. The same for sk(nband=3700,nszf=70000): with !$omp 98.5 sec, and without it 102 sec.
It seems that using !$omp doesn't greatly affect the calculation time!? I was expecting a large difference between the times.
However, I want to try your latest attachment in my program; I'll post about it here soon.

Thanks
Alex

I just came across another question about parallel programming.
In the following code, I'm trying to transpose a matrix (sk, as in my previous post in this thread). a=sk and b=skt are large arrays and have already been allocated in the main program. As I was monitoring Task Manager during the run of this part of the program (Task Manager > Performance > Physical Memory), the free value of physical memory was 0. And because of using parallel programming, I was expecting to see around 100% CPU usage, while it was around 1-29%. Since the whole point of using parallel programming was to use all the CPU capacity, is the low CPU usage here probably due to the lack of free physical memory? Please correct me if I'm wrong. If so, how can I overcome this issue?

subroutine trans(a,b,row,col)
integer :: i,j,row,col
real*8 a(row,col), b(col,row)
!$omp parallel private(i,j) shared(a,b,row,col)
!$omp do
do i = 1,row
   do j = 1,col
      b(j,i) = a(i,j)
   end do
end do
!$omp end do
!$omp end parallel
end subroutine trans

Thanks
Alex

Parallel optimization of transpose seems to be a frequently set homework assignment; part of the point is to set an incentive to browse prior posts on this.

You might start by checking that your single-thread performance is as good as the TRANSPOSE intrinsic, including checking the effect of vector nontemporal and architecture settings.
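A common single-thread baseline to compare against is a cache-blocked loop transpose (a sketch; the block size of 64 is an assumed starting point to tune, not a value measured in this thread):

```fortran
! Cache-blocked transpose: working in blk x blk tiles keeps both the
! read and write footprints inside cache.  blk = 64 is an assumption.
subroutine trans_blocked(a, b, row, col)
   implicit none
   integer :: row, col, i, j, ii, jj
   integer, parameter :: blk = 64
   real*8 :: a(row,col), b(col,row)
   do jj = 1, col, blk
      do ii = 1, row, blk
         do j = jj, min(jj+blk-1, col)
            do i = ii, min(ii+blk-1, row)
               b(j,i) = a(i,j)
            end do
         end do
      end do
   end do
end subroutine trans_blocked

program test_trans
   implicit none
   real*8 :: a(3,2), b(2,3)
   a = reshape([1d0,2d0,3d0,4d0,5d0,6d0],[3,2])
   call trans_blocked(a, b, 3, 2)
   print *, b            ! the transpose of a, column by column
end program test_trans
```

Timing this against both the plain loop version and the TRANSPOSE intrinsic would show whether the explicit loops are costing performance at one thread.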

Alex,
I would not transpose SK, but generate it in the transposed form, with ALLOCATE ( sk(nband,nszf) ).
There should not be too many occurrences in your code of SK(j,i) =
which you would convert to SK(i,j) =
There might also be some DO loops whose order you could change.

Using real*4 is low precision, unless the problem is very well behaved. You should make sure you back-calculate the force, given the displacement field result:
Force = sum (K_element x Disp)   (using all the element stiffness matrices is usually much quicker)
Error = Force - Applied_Force
checking the maximum error for each degree of freedom. It would be good to declare FORCE and ERROR as real*8.

John

Tim,

I am investigating some other matrix processing problems, related to the use of "row_i" in the above example for indexing the non-zero coefficients in the active row.
You made the comment about "checking the effect of vector nontemporal and architecture settings."
Are you able to provide more details on these settings and how they might be checked ?
How might these settings be altered ?

John

TRANSPOSE has been a standard Fortran intrinsic for 20 years. Normally, it could be expected to be implemented in an effective way for the target platform, although it may not work effectively with auto-parallel or OpenMP. You could use it to check that your explicit DO loops aren't costing you performance at 1 thread, and that you aren't losing performance by adding threads.

Intel Fortran supports

!dir$ vector nontemporal

to request that stores bypass the cache. In some situations, the compiler may do it automatically, if the stored array is expected to be larger than, say, 50% of cache. In a case where the stored data aren't accessed again while remaining in cache on the same CPU, this directive can avoid reading the data prior to writing, thus potentially cutting memory traffic by 50%.
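In context, the directive would sit immediately before the store loop, something like this sketch (whether it actually helps the transpose was not measured in this thread):

```fortran
! Sketch: requesting streaming (nontemporal) stores for a large copy.
! The !dir$ directive applies to the loop that follows it.
subroutine stream_copy(a, b, n)
   implicit none
   integer :: n, i
   real*8  :: a(n), b(n)
!dir$ vector nontemporal
   do i = 1, n
      b(i) = a(i)       ! store can bypass cache when the directive is honoured
   end do
end subroutine stream_copy
```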

If you haven't noticed the advertising about /arch:AVX for corei7-2 CPUs, you still might want to try varying the settings. Depending on the situation, /Qxsse4.1 or the default /Qxsse2 may work best on corei7[-1], in spite of the advertising about /Qxsse4.2 or /QxHost (implied by /fast).

Alex,

I have not been successful with the application of !$OMP commands to this calculation, but I have had further success with improving the sequential procedure:

- Improved timers using QueryPerformanceCounter and rdtsc (single-thread suitable)
- Modified the length of each row of SK to be a multiple of 256 bytes, to suit AVX
- Replaced the inner loop with an array-section calculation, which should still suit OMP
- Duplicated the middle loop (I was trying to force 2 threads)
- Other tidy-ups and improved comments
- Reduced nszf to 13,000 to speed up tests

The key improvements for performance have been transposing SK (allowing row section calculations) and setting the row length as a multiple of 256 bytes. (There may still be potential for improving the run time by improving the % alignment of row_f in the two reduction DO loops. Perhaps two versions of row_f.)

Duplicating the middle loop was an idea to improve the multi-thread option, as I had not been able to get better than a 10% improvement with OMP. I was hoping for 50% with this approach. I could not get this to work with OMP; however, it appeared to improve run time for the single-thread version, probably by enforcing memory availability for the second loop.

For ifort compiler options, I tried /QxAVX or /Qxsse4.1 or /Qxsse4.2, but /QxHost appears to work well.

This latest attached version appears to run well as a single thread.

There remains the problem that I have not succeeded in providing you an appropriate OMP version. To fix this:

- It would be good to move "do n = 1,nszf" into the !$OMP PARALLEL, but ensure that N is processed sequentially
- The determine-IBAND stage should be single-threaded, as iband, row_f and row_i are shared, while
- The SK reduction stage is ideally suited to multi-threading.

I hope this helps you understand the Gaussean solution process.

John

PS : unfortunately, my change to nband to be a multiple of 32, changed the resulting zero infill for the changed SK matrix, so the 256 byte idea has not worked as well as initially thought. Your results will be dependent on the infill in your real SK matrix.

## Attachments:

19.3 KB
428 bytes

Alex,

I have progressed my understanding of the options available for improving run time for the equation solution.

I have tested:
a) The benefit of OpenMP (none)
b) The use of /QxAVX on i5 (73% run time vs Xeon)
c) Gaussean elimination vs Skyline solver (the Skyline solver is 66% vs Gaussean elimination)
d) Transposing SK to suit Gaussean elimination (I adopted this as a base case! 82% vs original)
e) 5 different versions of the inner loop (no significant change)

I have performed the tests on Intel ifort Ver 11.1 and ifort Ver 2011 compilers and Xeon and i5 processors.

OpenMP : I have attempted to implement OpenMP commands to improve performance, but have not been able to achieve any significant gains. Ifort Ver 11.1 failed, with run times 3 to 5 times the serial version. Ifort Ver 2011 produced improvements of at best < 10% to typically 0%. I have contacted others in the forum, but I was not able to identify a way of improving this performance. Tests were on a 4 process Xeon or 4 process i5. Jim’s conclusion was that the process is hitting a memory bandwidth issue with the hardware available. I don’t know how to address this problem.

AVX : Making use of AVX instructions on the i5 processor appears to reduce run time to about 73% of the Xeon performance. The use of these vector instructions is probably a sensible way to go for improving performance.

Skyline solver vs Gaussean Elimination. A Skyline solution improves performance significantly. This reduces the run time to about 66% of the Gaussean solver. The reason for this is that there are fewer changes to the coefficients of the Matrix. For the inner loop, Gaussean has SK(j,i) = SK(j,i) – const * SK(j+ii,n). Skyline has the inner loop as a dot_product, so each coefficient of SK is updated once vs nband times for gaussean.

Transposing SK : Gaussean solver works best if SK is stored as SK(row,column) while skyline is stored as a list of active columns. Both these are better than storing SK(equations,band)

Different Inner Loops. I tried different inner loops for both Gaussean elimination or Skyline solver. The optimisation in ifort resulted in no significant difference between any of the versions. My preference is for case 4
For Gaussean solver the alternative inner loops were:

case (1)    !  Run for standard DO loop'
do ik = ib,iband ; k = row_i(ik) ;  sk(k-ii,i) = sk(k-ii,i) - c*row_f(k) ;  end do

case (2)    !  Run for DO loop for SK(1.'
do k = 1,ienvl-ii ; sk(k,i) = sk(k,i) - c*row_f(k+ii) ; end do

case (3)    !  Run for array sections'
sk(1:k,i) = sk(1:k,i) - c*row_f(b:ienvl)

case (4)    !  Run for Vec_Sub using array syntax'
call Vec_Sub (sk(1,i), row_f(b), k, c)

case (5,6)    !  Run for Vec_Sub using DO syntax
call Vec_Sub_do (sk(1,i), row_f(b), k, c)

The combination of transpose, AVX and Skyline solver reduces the run time from 330 seconds to 127 seconds.

SK(eqn,band) > SK(band,eqn) : 330 seconds to 270 seconds
SK(band,eqn) > AVX instructions : 270 seconds to 192 seconds
SK_AVX to Profile_AVX : 192 seconds to 127 seconds
SK(band,eqn) > Profile : 270 seconds to 173 seconds

This is for 130,000 equations with a max bandwidth of 1,700 and an average bandwidth of 1,686; which is a very evenly banded matrix.

I would look forward to being proved wrong on my estimation of OpenMP performance.

I hope this helps.

John

PS: I cannot reproduce the SK(eqn,band) case at 330 seconds. The best I am now getting is 2,600 seconds!! Using the matrix real(8) SK(130000,1700) is very slow. I think my earlier test results were for a SK(13000,1700) size. This makes the non-transposed case look even worse for large problems.

To test OpenMP for skyline solver, you would likely set schedule(runtime) so as to use the environment variable OMP_SCHEDULE, and try options such as dynamic,2 or guided.  You would also try the KMP environment variables to set 1 thread per core and localize adjacent threads to a single CPU if you have multiple CPUs.  You're mighty sketchy on details.
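A minimal sketch of what Tim describes (the loop body is a placeholder, not the skyline solver; the environment variable values shown are just examples to try):

```fortran
! Sketch: schedule(runtime) defers the scheduling choice to the
! OMP_SCHEDULE environment variable, e.g. (Windows):
!   set OMP_SCHEDULE=dynamic,2
!   set OMP_SCHEDULE=guided
! The loop body is a placeholder, not the skyline solver itself.
program schedule_runtime_sketch
   implicit none
   integer, parameter :: n = 1000
   real*8 :: x(n)
   integer :: i
!$omp parallel do schedule(runtime) private(i) shared(x)
   do i = 1, n
      x(i) = dble(i)*dble(i)      ! stand-in for per-column solver work
   end do
!$omp end parallel do
   print *, 'x(10) =', x(10)
end program schedule_runtime_sketch
```

This lets different schedules be compared without recompiling, which is the usual first step in tuning an uneven workload.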

If your problem is big enough, on a multiple CPU platform, it may be worth while to pass through first and estimate the operation counts, then divide the work among the threads, with each thread working on a contiguous group of columns.

Tim,

I have tried not to be sketchy on the details of the coding I am testing. I am hoping that the examples I have attached could help other ifort users, like myself, who are attempting to use OpenMP with ifort, to be able to implement some basic practical examples.

I am not yet familiar with the scheduling approach you have identified. I was hoping to understand a simpler implementation of OpenMP, similar to the examples I have found in the ifort documentation. I would expect that the options you have highlighted (which I have yet to learn about) are a further level of sophistication.

I have provided two examples of my !$OMP attempts on 4-Feb and 12-Feb, which do not achieve any improvement when compiling with /O2 /Qopenmp /QxHost. Is it possible to achieve improved parallel performance using the !$OMP directives?
I am using Intel Composer XE 2011 (version 12.1.5.344, 6-Jun-12). I also tried Ver 11, but its performance was much worse for the coding I attempted.

I have assumed that the Gaussean reduction is more suited to OpenMP, as the SK matrix is not referenced in the inner 2 DO loops; only re-defined, while for a skyline solver, the modified SK matrix is updated in the middle loop and referenced in the inner loop, limiting OpenMP to apply to the inner dot_product loop.  My limited understanding of OpenMP tells me that Gaussean elimination should be both more suited to OpenMP and easier to implement.

I would welcome your recommendations on alternative directives in either code example I have attached, so that we might see how to proceed.

John