I am using ifort to perform scientific calculations.
According to Valgrind, the most CPU-intensive subroutine is the derivative routine, which performs a simple 2D convolution (a five-point stencil with a different set of coefficients at each grid point):
DF(i,j) = A(i,j,1) * F(i,j) + A(i,j,2) * F(i-1,j) + A(i,j,3) * F(i+1,j) + A(i,j,4) * F(i,j-1) + A(i,j,5) * F(i,j+1)
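To make the operation concrete, here is a minimal, self-contained sketch of that update. The program name, the grid size `nx` by `ny`, the `dp` kind, the test data, and the interior-only loop bounds are all assumptions made for illustration, not the actual routine; boundary handling in particular is left out.

```fortran
program stencil_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)      ! assumed double precision
  integer, parameter :: nx = 6, ny = 5        ! hypothetical grid size
  real(dp) :: F(nx, ny), DF(nx, ny), A(nx, ny, 5)
  integer :: i, j, k

  ! Hypothetical test data: F = 1 everywhere, k-th coefficient equal to k.
  F = 1.0_dp
  do k = 1, 5
    A(:, :, k) = real(k, dp)
  end do
  DF = 0.0_dp

  ! Interior points only (boundaries are not specified in the question).
  ! j is the outer loop and i the inner loop, so the inner loop walks
  ! contiguous memory in Fortran's column-major layout.
  do j = 2, ny - 1
    do i = 2, nx - 1
      DF(i, j) = A(i, j, 1) * F(i, j)     &
               + A(i, j, 2) * F(i - 1, j) &
               + A(i, j, 3) * F(i + 1, j) &
               + A(i, j, 4) * F(i, j - 1) &
               + A(i, j, 5) * F(i, j + 1)
    end do
  end do

  ! With the test data above, each interior value is 1+2+3+4+5 = 15.
  print *, DF(2, 2)
end program stencil_sketch
```

Note that with `A` dimensioned as `(nx, ny, 5)`, the five coefficients used at one grid point are `nx*ny` elements apart in memory, which may matter for cache behaviour.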
I am trying to use the BLAS library to accelerate this calculation, but I have not found an appropriate way to do so.
I think the best way would be to unroll the loop and then use BLAS Level 2 matrix/vector routines, but I am not sure about that.
Does anyone have an idea of how to optimize this calculation?