Optimising convolution using BLAS

Hi,

I am using Ifort to perform scientific calculations.

According to Valgrind, the most CPU-costly subroutine is the derivative routine, which performs a simple 2D convolution:

```
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) &
            + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
  end do
end do
```

I am trying to use the BLAS library to accelerate this calculation, but I have failed to find an appropriate way.

I think the best way would be to unroll the loop and then use BLAS level-2 vector/matrix operations, but I am not sure about that.

Does someone have an idea on how to optimize this calculation?

This doesn't look like a BLAS operation nor what is usually described as convolution. Unless you have hidden something from us, it does look as if it should easily auto-vectorize, and (assuming Nx2 is large enough) should easily benefit from parallelization of the outer loop (by OpenMP or -parallel).
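To illustrate the parallelization suggestion: a minimal OpenMP sketch of the original loop could look like the following (assuming the arrays and bounds from the original post; compile with the compiler's OpenMP flag, e.g. -openmp for ifort of that era or -fopenmp for gfortran):

```
!$omp parallel do private(i, j)
do j = 1, Nx2
  do i = 1, Nx1
    DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) &
            + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
  end do
end do
!$omp end parallel do
```

Each iteration of the outer loop writes a distinct column of DF and only reads F and A, so there are no loop-carried dependencies and the parallelization is safe.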

Hi,

I thought this was what is called a convolution. I do apologize.

In fact, my code already uses IFORT with MPI to perform huge calculations on 128 processors, but even with many cores the calculation of that loop is CPU-costly because the array F is not read contiguously: to compute DF(i,j) I read F(i,j-1), F(i,j+1), F(i,j), F(i+1,j) and F(i-1,j), and these values are not stored at nearby places in memory, so there are discontinuities in the access pattern.

I have read somewhere that BLAS can optimize this type of calculation. Do you think I am looking at the wrong library?

The right hand side of the assignment reminds one of the discretized Laplacian in 2D.

As to improving the calculation: what can you tell us about the array A?

Do the elements of A change with i,j?

Are most of the elements A(:,:,1) the same in value? Similarly for A(:,:,2) and A(:,:,3)?

I suggest not unrolling the loop. Instead, try splitting it into one pass per stencil term:

```
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
  end do
end do
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
  end do
end do
```

Try the above and please report the effect on performance.

Jim Dempsey

Also try shearing the inner loop

```
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
  end do
  do i=1,Nx1
    DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
  end do
end do
```

Jim Dempsey

Either approach will reduce the pressure on the TLBs (Translation Look-aside Buffers).

Jim Dempsey

As to improving the calculation: what can you tell us about the array A?

In fact, I have two routines that use this type of calculation:

- the Laplacian discretization, as you said before (in a preconditioned BiCGSTAB) to solve the Poisson equation.
- the fourth-order finite-difference (FD4) derivatives (but with a larger stencil: from F(i-2) to F(i+2)).

In the Laplacian, the coefficients of A(i,j,1) are not all the same, due to immersed bodies in the CFD simulation; that is why I use BiCGSTAB instead of a simple PCG. (A(i-1,j,1) /= A(i,j,1) /= A(i,j,2), etc.)
In the FD4, coefficients with the same last argument are all the same. (A(i,j,1) = A(i-1,j,1) = A(i,j+1,1), but A(i,j,1) /= A(i,j,2).)

So you can consider:

Poisson:
```
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) &
            + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
  end do
end do
```

FD2 (FD4 in normal use, but FD2 for this example):
```
do j=1,Nx2
  do i=1,Nx1
    DF(i,j) = A(1) * F(i,j) + A(2) * F(i-1,j) + A(3) * F(i+1,j) &
            + A(4) * F(i,j-1) + A(5) * F(i,j+1)
  end do
end do
```
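Since the FD2/FD4 coefficients are constant, that second case can also be written with whole-array operations, which gives the compiler an easy vectorization target. A sketch, assuming A is the rank-1 coefficient array from the FD2 example and F carries one ghost cell on each side (F(0:Nx1+1,0:Nx2+1)):

```
! FD2 with scalar coefficients, whole-array form.
! Each term is a shifted section of F; valid because F has a
! one-cell halo around the 1:Nx1 x 1:Nx2 interior.
DF(1:Nx1,1:Nx2) = A(1) * F(1:Nx1,   1:Nx2  ) &
                + A(2) * F(0:Nx1-1, 1:Nx2  ) &
                + A(3) * F(2:Nx1+1, 1:Nx2  ) &
                + A(4) * F(1:Nx1,   0:Nx2-1) &
                + A(5) * F(1:Nx1,   2:Nx2+1)
```

Whether this beats the explicit loops depends on the compiler (it may create temporaries), so it is worth timing against the loop versions.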

Try the above and please report the effect on performance.

I will do it now, and I will report the effects with -O4.

I tried with two compilers, and the results are not the same.
My processor is an Intel Xeon X5570 @ 2.93 GHz.

The code is the following:

```
program perfs

  implicit none

  integer, parameter :: Nx1=512, Nx2=512

  real(8), allocatable, dimension(:,:)   :: F, DF
  real(8), allocatable, dimension(:,:,:) :: A

  integer :: i, j, n

  real(8) :: time1, time2, totaltime

  allocate(F(0:Nx1+1,0:Nx2+1), DF(1:Nx1,1:Nx2))
  allocate(A(1:Nx1,1:Nx2,1:5))

  F(:,:) = dacos(1.0d0)

  ! suppose A coefficients are constant in space
  A(:,:,1) = 1.0d0
  A(:,:,2) = 2.0d0
  A(:,:,3) = 3.0d0
  A(:,:,4) = 4.0d0
  A(:,:,5) = 5.0d0

  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) &
                + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test1 :",totaltime

  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
      end do
    end do
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test2 :",totaltime

  call cpu_time(time1)
  do n=1,1000
    do j=1,Nx2
      do i=1,Nx1
        DF(i,j) = A(i,j,1) * F(i,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,2) * F(i-1,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,3) * F(i+1,j)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,4) * F(i,j-1)
      end do
      do i=1,Nx1
        DF(i,j) = DF(i,j) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
  call cpu_time(time2)
  totaltime = time2 - time1
  print *,"Test3 :",totaltime

  deallocate(F, DF, A)

end program perfs
```

With ifort (non-MPI version):

```
bl@galilee:~/Desktop$ ifort -O4 perfs.f90
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.32000000000000
Test2 : 1.35000000000000
Test3 : 0.910000000000000
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.33000000000000
Test2 : 1.34000000000000
Test3 : 0.910000000000000
```

With gfortran:

```
bl@galilee:~/Desktop$ gfortran -O4 perfs.f90
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.3799999999999999
Test2 : 1.8300000000000001
Test3 : 1.5599999999999996
bl@galilee:~/Desktop$ ./a.out
Test1 : 1.3699999999999999
Test2 : 1.8400000000000003
Test3 : 1.5600000000000001
```

The third way has clearly improved performance with ifort. However, I need the code to behave the same on all Intel platforms, including cases where other compilers are used (i.e. the huge distant supercomputer that I use sometimes, which doesn't use ifort).

I will keep this trick in mind and make a special Ifort routine for the local cluster.

Any other ideas to improve this loop?

A simple remark: the last test was greatly improved by this compilation option:

```
ifort -O4 perfs.f90
Test3 : 0.920000000000000

ifort -O4 -xSSE4.2 perfs.f90
Test3 : 0.530000000000000
```

It is now 2.5 times faster than the original routine.

However, it has no impact on the first two tests.

RE: SSE4.2

The innermost loops of test 2 and the innermost loops of test 3 are effectively the same.
The difference in behavior is that test 3 has a smaller hot-in-cache footprint.
For gfortran, try an SSE level below 4.2 that your hardware supports.
I suspect gfortran is not vectorizing the inner loops, as there appears to be no time difference.

Jim

It has no effect.

My GFortran is :
\$ gfortran --version
GNU Fortran (Ubuntu 4.4.3-4ubuntu5) 4.4.3

I tried with the following options, all together and separately (my processors are Nehalem Xeons):

gfortran -O4 -mtune=barcelona -mfpmath=sse -msse4.2 -mmmx perfs.f90

Is there a way to rewrite the loop in a vectorized form, in order to make gfortran run a little faster?

Or maybe I should try to compute values in blocks of data? If the array is 512*512, I would for example compute blocks of 128*128 so that all the data being used resides in cache (assuming I do the size calculation beforehand and 128*128*numberofarrays uses less than 90% of the L3 cache). Could that improve performance?
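To make the blocking idea concrete, a cache-blocked version of the original loop might look like the sketch below. BS is a hypothetical tile size that would have to be tuned so that the working set of F, DF and A fits in cache; ib and jb are additional integer loop variables:

```
! Cache-blocked stencil sweep (sketch).
! BS is a tuning parameter; 128 matches the 128*128 blocks
! discussed above, but the best value is machine-dependent.
integer, parameter :: BS = 128
integer :: ib, jb

do jb = 1, Nx2, BS
  do ib = 1, Nx1, BS
    do j = jb, min(jb+BS-1, Nx2)
      do i = ib, min(ib+BS-1, Nx1)
        DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) &
                + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
      end do
    end do
  end do
end do
```

Note that for a single sweep over a 512*512 array the data is only touched once, so blocking pays off mainly when the same tiles are revisited (e.g. across BiCGSTAB iterations).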

If you don't want the n=1,1000 loop to be executed 1000 times, simply leave it out.
gfortran is more likely to take such code literally than commercial compilers, for which there is a demand to "cheat" overly simplified benchmarks.

If you're looking for hints specific to gfortran, the gnu Fortran mailing list is the place.

Presumably, -O4 is treated the same as -O3 by the compilers you cite, but why take a chance on undocumented behavior?

why take a chance on undocumented behavior?

That is right. I will try this implementation in the main code and see if performances are really improved or not.

In fact, I was wondering whether I could get the same improvement in calculations on my local workstation (gfortran) and on the cluster (IFORT). I prefer having one subroutine for all compilers rather than one specific to each compiler.

However, I consider that the objective of this topic has been achieved because heavy calculations are only done on the cluster (IFORT).

I would like to thank all of you for your help in this topic. It will allow us to perform larger simulations with the same amount of CPU hours, which are a rare resource here. :)

Ben

It would appear that the test 3 method (with SSE4.2 if using IFORT) is the way to go. The effect on gfortran is benign (i.e. no worse off). If improvements are later made to gfortran's vectorization of such loops, you will get the additional performance with no code change. Your build configuration options will have to vary from platform to platform and/or compiler to compiler.

Jim

GFortran is used only for non-MPI or low-MPI calculations (up to 8 cores), so performance there is not that important.

I will proceed as you said, and keep one subroutine for all compilers.

Build configuration is already specific for each architecture we use, I just need to add SSE4.2 option.

This improvement will allow me to simulate larger fields with our local clusters.

Once again, I really thank you for this idea of loop reorganisation. :)

Ben

Ben,

Look in the ISN Blogs section in the next few days/weeks, I will be submitting an article illustrating a high level of optimization. I've collected some impressive run data on a Core i7 920 but would like to find some time on a larger SMP NUMA system. If you know of someone with a few hours of free system time I would appreciate a referral.

Jim Dempsey