# Improve Speed of Looped Matrix Multiplication


I have two matrices A(100K,100K) and B(100K,50), and I want to compute C = B'AB (B-transpose times A times B).

A is a matrix from finite differences, but it is stored in a vector to exploit its sparsity. It has two reference variables. The first, S, gives the location in A where the first value of row I is stored, so the values of row I occupy the range A(S(I)) to A(S(I+1)-1).

The second variable is IDX, which gives the column location of a value in the vector A. For example, S(I)+1 is the position of the second value stored in uncompressed row I, and IDX(S(I)+1) is its column.
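To make the storage scheme concrete, here is a small self-contained sketch (my own 3x3 toy example, not the actual finite-difference data; variable names follow the description above):

```fortran
! Toy illustration of the compressed row storage described above.
! The 3x3 matrix
!   | 4 0 1 |
!   | 0 2 0 |
!   | 3 0 5 |
! keeps only its nonzeros, row by row.
PROGRAM CSR_DEMO
  IMPLICIT NONE
  DOUBLE PRECISION :: A(5)   = (/ 4D0, 1D0, 2D0, 3D0, 5D0 /)
  INTEGER          :: IDX(5) = (/ 1, 3, 2, 1, 3 /)  ! column of each stored value
  INTEGER          :: S(4)   = (/ 1, 3, 4, 6 /)     ! row I occupies A(S(I):S(I+1)-1)
  INTEGER :: I, IJ
  DO I = 1, 3
    DO IJ = S(I), S(I+1)-1
      WRITE (*,'(A,I0,A,I0,A,F4.1)') 'row ', I, ' col ', IDX(IJ), ' value ', A(IJ)
    END DO
  END DO
END PROGRAM CSR_DEMO
```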

Below is my current code where I multiply AB first through loops and then multiply BTAB through the MKL/BLAS.

```fortran
DO J = 1, COL
  DO I = 1, ROW
    DO IJ = S(I), S(I+1)-1
      AB(I,J) = AB(I,J) + A(IJ)*B(IDX(IJ),J)
    END DO
  END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
```

Any suggestions on how I can improve this code / speed it up would be greatly appreciated. Right now it is a major choke point in a program I have been working on for the past six months.


You could try:

```fortran
DO I = 1, ROW
  DO IJ = S(I), S(I+1)-1
    DO J = 1, COL
      AB(I,J) = AB(I,J) + A(IJ)*B(IDX(IJ),J)
    END DO
  END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
```

or the inner loop as

```fortran
K = IDX(IJ)
AB(I,1:COL) = AB(I,1:COL) + A(IJ)*B(K,1:COL)   ! extent COL, since AB is (ROW,COL)
```

Alternatively, ignoring the compression, you might be able to work from the attached approach to compute B'AB by keeping a temporary copy of AB. I'm not sure of IDX in your example above.
Are A and B'AB symmetric?

John

## Attachments:

Attachment size: 1.22 KB

Scott, I'd like to make a couple of comments.

Regarding the 1st part of your calculations, that is, the multiplication of matrices A and B: since the sizes are big you need to consider parallelizing your processing. Also, from your initial post it is not clear how long it takes to calculate the 1st part and then the 2nd part. It would be nice to give us at least some numbers.

Then, did you consider a dense layout for matrices A and B? Take into account that Fortran's MATMUL function does those calculations 8x to 9x faster on a system with 4 CPUs (8 cores) in that case; /Qparallel and the parallel version of MKL need to be used.

Another thing is how much physical and virtual memory do you have on your computer? When all matrices are stored in physical memory only and virtual memory is not used, matrix multiplication is CPU bound. As soon as at least some amount of virtual memory is used, matrix multiplication becomes virtual-memory bound and performance is affected significantly. For example, I see that my HDD LED never stops blinking when I multiply two dense 64K x 64K matrices on a system with 32GB of physical and 128GB of virtual memory.

If the first quoted loop doesn't optimize (what do opt-report or equivalent options for your compiler say?), it would be worthwhile to put it in the form required for ifort SIMD reduction. The "vectorized" reduction would be accomplished by a "simulated gather." OpenMP parallelism should be useful (as Sergey said). Software prefetch on the B array may help, either by explicitly prefetching elements for future loop iterations or by looking for the compiler's automatic software prefetch facility.

I assume you're using MKL parallel and setting affinity if on a multiple CPU platform.

>>I have two matrices A(100K,100K).
>>AB(I,J)= AB(I,J) + A(IJ)*B(IDX(IJ),J)

A(IJ) has one dimension; is this a typo, or is your comment wrong?

Are IDX(IJ), IDX(IJ+1), IDX(IJ+2), ... sequential numbers?
If yes, then lift IDX(S(I)) outside the DO IJ loop.
If no, then consider performing a gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.
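Spelled out, the gather idea might look like this sketch (BIJJ is a scratch vector assumed to be large enough to hold the longest row):

```fortran
! Sketch of "gather, then dot": copy the scattered elements of column J
! of B into contiguous scratch storage, then do a unit-stride dot product
! that the compiler can vectorize.
DO J = 1, COL
  DO I = 1, ROW
    DO IJ = S(I), S(I+1)-1
      BIJJ(IJ - S(I) + 1) = B(IDX(IJ), J)   ! gather
    END DO
    DO IJ = S(I), S(I+1)-1
      AB(I,J) = AB(I,J) + A(IJ) * BIJJ(IJ - S(I) + 1)
    END DO
  END DO
END DO
```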

Jim Dempsey

A is not symmetric/positive definite, and it is a sparse square matrix in compressed storage. So in the code it is actually rank 1 (a vector) whose index is referred to as IJ. To prevent confusion, I will call the actual square matrix As and the compressed stored version A. A holds, by row, the nonzero elements of As. The location in A where a new row starts is in S. For example, the elements of row 10 would have their first value stored at S(10), and all the values for row 10 would be A(S(10)), A(S(10)+1), A(S(10)+2), ..., A(S(11)-1). The range from S(10) to S(11)-1 is sequential. IDX is a reference to the column of As that an element in A refers to. For example, the second value in row 10 would be located at column IDX(S(10)+1) of As.

IDX is not sequential nor in any specific order; it can be 75, 10000, 50, 753, ... and so forth. What is:

gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.

and I am not familiar with the automatic "software prefetch facility." I am using MKL parallel, but I am unsure how to set the affinity (I thought it was set to the max # of cores by default). Are you suggesting something like:

```fortran
DO J = 1, COL
  DO I = 1, ROW
    PF = B(IDX(S(I):S(I+1)-1), J)
    AB(I,J) = DOT_PRODUCT( A(S(I):S(I+1)-1), PF )   ! both operands are rank 1
  END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
```

My current code uses the array AB as a temporary storage to multiply out the compressed version. Then I use MKL to do BT*(AB)

The time it currently takes to solve AB on an i7-2600K with 16GB of RAM is 0.33 s, and B'(AB) takes 0.06 s. The reason this is a choke point is that B'AB has to be computed between 10,000 and 100,000 times during a run of the program. This process resides only in physical memory, so it is CPU bound.

I am not familiar with opt-report or any of the other optimization tools; I am just getting into the optimization side of code. I do have Intel Fortran with Visual Studio. I did try converting the I and J outer loops to a single DO CONCURRENT loop, but that resulted in an increase in time to 1 s.

The most basic option to suggest to the compiler the automatic addition of software prefetch is /Qopt-prefetch.  mm_prefetch intrinsics and the like are in the ifort help file.

MKL chooses by default to place 1 thread on each core, regardless of whether HyperThreading is enabled.   You would want the same effect in your code if you add -parallel or OpenMP.  You still didn't say whether you are using a multiple CPU platform where dealing with KMP_AFFINITY is more important.

It is just a single-CPU system, an i7-2600K. The final version of the code will be an executable that runs on personal high-end desktops with i7s. Also, this is one part of an enormous legacy code, so would it impact the rest of the code if I only compile my section with /Qopt-prefetch and -parallel?

This is my current compile option for the entire program:

/nologo /O3 /QaxAVX /QxAVX /heap-arrays1024 /arch:AVX /real_size:64 /fp:precise /module:"Release_x64\\" /object:"Release_x64\\" /Fd"Release_x64\vc110.pdb" /libs:static /threads /Qmkl:parallel /c

>>.../fp:precise...

If you're not worried about accuracy (to some degree), then two floating-point options for faster operations could be used: /fp:fast and /fp:fast=2.

>>gather {BIJJ(IJ) = B(IDX(IJ),J)} loop just prior to DO IJ loop then use gathered BIJJ in DO IJ loop.

The above was notation for the forum message. Your

PF= B(IDX(S(I):S(I+1)-1),J)

is performing a gather as a single statement, as opposed to the DO loop indicated in my post (in other words, attaining the same goal of creating a rank-1 array prior to multiplication). The purpose of this is to improve vectorization. An intrinsic multiply that maps to MKL is likely not inlined by IPO; therefore the gather would likely not be performed in your original code, but it will be performed in your revised code. Your code makes better use of language features (re: array sections).

On large PF arrays, you may find it beneficial to break up the load of PF into smaller pieces such that each piece fits nicely into L1.

Jim Dempsey

PF will actually never be larger than 10, so I will give that a try and see if there is any improvement. Should I use MATMUL or the MKL/BLAS DDOT() to multiply out A(S(I):S(I+1)-1) and PF?

Also, would there be any benefit to prefetching A(S(I):S(I+1)-1), even though S(I):S(I+1)-1 is sequential?

>>...Also, would there be any benefit to prefetching...

You could try the following compiler options:

/Qopt-prefetch[:n]
enables levels of prefetch insertion, where 0 disables it.
n may be 0 through 4 inclusive. The default is 2.

/Qopt-prefetch-
disables (the DEFAULT) prefetch insertion. Equivalent to /Qopt-prefetch:0.

Thanks, I implemented that code and tested all the prefetch options but all the cases actually resulted in slower code. It went from an average of 0.3 s to about 0.55 s.

Attached is an option which does not store AB.
It assumes that B is not a sparse matrix, which might be something that could be utilised.
I have assumed a storage orientation for A and B which might not be correct.
You could compile with /Qparallel to see if you gain from that.

John

## Attachments:

Attachment size: 1.36 KB

Thanks, I will take a look at that and see if it improves the code I have. Right now the best times I have been able to get have been with prefetching and changing the looping. So here is the new code I am using now:

```fortran
DO I = 1, ROW
  LB = S(I)
  UB = S(I+1)-1
  U  = S(I+1)-S(I)
  PF(1:U) = A(LB:UB)               ! PREFETCH VECTOR ON A
  DO J = 1, COL
    AB(I,J) = SUM( PF(1:U)*B(IDX(LB:UB),J) )
  END DO
END DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
```

It does not seem intuitive to me why this would be faster than prefetching B(IDX(LB:UB),J), but it reduced the CPU time from 0.33 s to 0.24 s.

Any suggestions for improving upon this would be greatly appreciated.

Scott,

I don't like "B(IDX(LB:UB),J)". I'd be interested to find out how well it works.

Also, it would be good to know the density of the matrices. You could try something like:

```fortran
call Report_Density ( 'Ae',  Ae,  s(row+1)-1 )
call Report_Density ( 'B',   B,   row*col )
call Report_Density ( 'BAB', BAB, col*col )

subroutine Report_Density ( name, a, num )
character name*(*)
real*8 a(*)
integer*4 num, i, n
!
n = 0
do i = 1,num
   if (a(i)/=0) n = n+1
end do
write (*,*) 'Density for ',name,' =', real(n)/real(max(num,1))*100.,' %'
end
```

I do not like it either, but it is producing the fastest CPU times. The code presented in my first post ran in 0.33 s; the updated version runs in 0.24 s. These are averaged times over 1000 calls. I use the following statements to time my code:

```fortran
CALL SYSTEM_CLOCK(COUNT_RATE=ClockRate)
CALL SYSTEM_CLOCK(COUNT=START1)
! the code being tested
CALL SYSTEM_CLOCK(COUNT=FINISH1)
TIME = DBLE(FINISH1-START1)/DBLE(ClockRate)
```

For the densities, the compressed stored matrix A has no zero values by construction. The matrix B returned 100%, so there are no zero values within it. This is because B is an orthonormal matrix (i.e., B'B = I, but BB' /= I).
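As a sanity check of that property, B'B = I can be verified with the same DGEMM call pattern used elsewhere in this thread (a sketch; BTB is a COL x COL scratch array and ERR a double precision scalar assumed to be declared):

```fortran
! Sketch: compute B'B and report its maximum deviation from the identity.
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,B,ROW,0D0,BTB,COL)
ERR = 0D0
DO J = 1, COL
  DO I = 1, COL
    IF (I == J) THEN
      ERR = MAX(ERR, ABS(BTB(I,J) - 1D0))
    ELSE
      ERR = MAX(ERR, ABS(BTB(I,J)))
    END IF
  END DO
END DO
WRITE (*,*) 'max deviation of B''B from identity:', ERR
```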

If you spelled your option /heap-arrays:1024, remember that it moves to the heap only arrays whose size is known at compile time and larger than that threshold. Remember the comment Steve Lionel made yesterday that the numeric option has no useful effect.

Would you be able to use /assume:protect_parens /Qprec-div /Qprec-sqrt instead of /fp:precise?

As Sergey hinted, your /fp: option instructs the compiler not to optimize the SUM intrinsic. You would need the SIMD reduction if you wish to over-ride that and optimize just this construct. It would be clearer to me if you used DOT_PRODUCT, but the optimization may be the same; neither SUM nor DOT_PRODUCT escapes /fp:precise without an over-ride.

I'm not sure I understand your objection to parallelizing this loop (by OpenMP if not /Qparallel) when you are already linking MKL parallel, although, if you intend to run only on single CPU platforms, you may be able to saturate memory bandwidth without parallelizing it.  For parallelization, you would keep the J loop as outer as you presented it originally.

>>Thanks, I implemented that code and tested all the prefetch options but all the cases actually resulted in slower code. It went
>>from an average of 0.3 s to about 0.55 s.

Could you provide some technical details for your system? Thanks in advance.

I recently detected a prefetch-related issue on Ivy Bridge, and I have not seen even small performance improvements when prefetch was used. I suspect this is due to the larger sizes of the L3, L2 and L1 caches.

Sandy Bridge i7-2600K with 16GB of RAM. I do not see any performance improvement when I turn prefetching on or off, but including that additional vector PF for matrix A speeds up the code a little bit.

Scott,

The following change might improve performance for the example I provided if there are a significant number of empty rows in A:

```fortran
do k = 1, row
   if (s(k) == s(k+1)) cycle
```

This would retain sparsity information for the B'.AB calculation.
{ Special cases of S(k+1)==S(k) (empty row) or S(k+1)==S(k)+1 (single coefficient) could be coded for the B'AB calculation if this approach shows promise. }

Also, as B is so dense, the zero test for BAB accumulation probably hinders /Qparallel.

John

Try:

```fortran
!$OMP PARALLEL DO PRIVATE(PF, LB, UB, U, J)
DO I = 1, ROW
  LB = S(I)
  UB = S(I+1)-1
  U  = S(I+1)-S(I)
  PF(1:U) = A(LB:UB)               ! PREFETCH VECTOR ON A
  DO J = 1, COL
    AB(I,J) = SUM( PF(1:U)*B(IDX(LB:UB),J) )
  END DO
END DO
!$OMP END PARALLEL DO
CALL DGEMM('T','N',COL,COL,ROW,1D0,B,ROW,AB,ROW,0D0,C,COL)
```

Jim Dempsey

Scott,

Storing B as B'(col,row) could improve the memory access and improve the solution time. Attached is a derivation of a solution which uses array sections for the inner loop, where the inputs are A(*) and Bt(col,row).
This would suit AVX instructions.
The attached file shows the development of subroutine Generate_BABt (BtAB, Bt, Ae, row, col, S, idx), which might provide a fast solution using AVX instructions. (You might experiment with col*8 being a multiple of 64 to improve AVX alignment, although I struggle to control this possible improvement.)

I think one of the problems with the other solutions is that B(row,col) addresses a large memory footprint, while addressing columns of B'(col,row) might overcome this (assuming row >> col).
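As a sketch of what this transposed layout might look like (ABt(COL,ROW) holds (AB)', so each gathered column Bt(1:COL,K) is contiguous; the names here are my assumptions, not the attached code):

```fortran
! Sketch: with B stored transposed as Bt(COL,ROW), the inner update
! streams through unit-stride memory, which suits AVX.
DO I = 1, ROW
  ABt(1:COL,I) = 0D0
  DO IJ = S(I), S(I+1)-1
    K = IDX(IJ)
    ABt(1:COL,I) = ABt(1:COL,I) + A(IJ) * Bt(1:COL,K)
  END DO
END DO
! C = B'(AB) = Bt * (ABt)', so the transpose flags change accordingly:
CALL DGEMM('N','T',COL,COL,ROW,1D0,Bt,COL,ABt,COL,0D0,C,COL)
```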

John

## Attachments:

Attachment size: 4.41 KB

>>... you might experiment with col*8 being a multiple of 64 to improve AVX alignment...

Please take into account that when AVX instructions are used, alignment on a 32-byte memory boundary is needed.

Sergey,

This is an area I have experimented with, and I have not found a definite improvement when adjusting the alignment. My understanding is that 32 bytes, or 256 bits, is the size of the AVX register. Perhaps my testing has been flawed, but I do not get a significant improvement.
I'll give it another try.
32 bytes implies col should be a multiple of 4 for real*8. If you try a multiple of 2 or 3 and get a different run time with AVX instructions, then that would show it is worth fixing.

Using col=50 rather than row=100,000 as the 1st dimension of B' should have a much more significant effect. (Perhaps try col=52 and see if this is better than 50?)

I actually do not know why this is a limitation. Surely, with the L1 and L2 caches, this could be addressed without the need to align the memory.

John

32-byte alignment may be useful even when using only 128-bit parallel instructions.  In the present case, with the indirect access, it would not be surprising to see little effect of alignment.  In a packed matrix, to preserve alignment, the leading dimension would need to be a multiple of the alignment size, but then the larger dimensions might need to avoid even sizes which could produce cache mapping conflicts.

There seems to have been confusion in this thread around the assumption that copying an array section before using it takes the place of an optimizing prefetch, and perhaps about the performance loss associated with making unnecessary copies of data. Current compilers have "simulated gather" facilities capable of optimizing a dot product with a gathered operand, provided that optimization is not suppressed by options such as /fp:precise. If you turned on the vectorization report, successful optimization would be reported as VECTORIZED.

The definitions of /fp:precise and the like aren't spelled out fully in the documents.  While ifort makes no distinction between /fp:precise and /fp:source, for icl "precise" refers to improved support of exceptions, so "source" is the one which has similar meaning between C and Fortran.  /fp:source includes the effects of /assume:protect_parens /Qprec-div /Qprec-sqrt /Qftz- in addition to suppressing optimization of sum reduction (unless over-ridden by !dir\$ simd reduction).

A reason for suppressing optimization of sum reduction is that the optimization by vectorization gives rise to slight differences in results with alignment (which could be avoided by aligning operands).  The other standards compliance effects of /fp:precise are potentially more important.

The 2nd-generation Core i7 (which was mentioned above) and 3rd-generation CPUs should support the ftz- setting without loss of performance.

Tim,

Do you have any comment on the use of B(IDX(LB:UB),J) in "AB(I,J)= SUM( PF(1:U)*B(IDX(LB:UB),J) )"?
Does the use of IDX in this way improve the performance, or would it be better to use, say:

```fortran
PF(1:U) = B(IDX(LB:UB),J)
AB(I,J) = SUM( A(LB:UB)*PF(1:U) )
```

Or should "AB(I,J) = SUM( A(LB:UB)*B(IDX(LB:UB),J) )" perform just as well?
Would there be a prefetch to assemble the elements of B?

I presume the use of array sections improves the AVX performance in comparison to the F77 coding of:

```fortran
x = 0
DO ij = S(I), S(I+1)-1
   k = idx(ij)
   x = x + A(ij)*B(k,j)
END DO
AB(I,J) = x
```

I'd appreciate your comments on the merits of the different possible approaches.

John

If the compiler doesn't optimize away your extra copy of the array section, and if you optimize the dot product in both cases, the extra copy will cost performance.

I think you may still be insisting on interpreting prefetch as making a separate copy of the array section or as a synonym for gather.

Indirect prefetch doesn't happen automatically. If you look it up, e.g. with search-engine terms such as "indirect prefetch site:software.intel.com", you will see examples with the prefetch intrinsic for both C and Fortran; possibly the latter will be intended for Intel(r) Xeon Phi(tm) but can be tried on a Xeon host. Additional directive options may come out in future compilers.

You really do need to check what opt-report says about the optimization of your source code. The F77 code should work fine and gives you the additional option of applying !dir$ simd reduction(+:x) to promote optimization with simulated gather as an over-ride of your /fp:precise. This may give you an advantage with AVX compilation. Your MKL call will switch automatically to AVX by default.
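Applied to the F77-style loop quoted above, that directive form would look something like this sketch (directive spelling as used by ifort of this vintage; verify against your compiler's documentation):

```fortran
! Sketch: the directive asks the compiler to vectorize this one
! reduction (with simulated gather), over-riding /fp:precise locally.
DO J = 1, COL
  DO I = 1, ROW
    X = 0D0
!DIR$ SIMD REDUCTION(+:X)
    DO IJ = S(I), S(I+1)-1
      X = X + A(IJ) * B(IDX(IJ), J)
    END DO
    AB(I,J) = X
  END DO
END DO
```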

F77 doesn't require you to introduce the local variable k, as F66 did. Not that ifort has many remaining preferences for F77 code, outside of the cases where ifort array assignments introduce extra array copies, which you seem to like anyway.