Remove Many Bank Conflicts on 64-Bit Intel® Architecture

Submit New Article

March 10, 2009 1:00 AM PDT



Challenge

Remove bank conflicts from high-level loops. Removing bank conflicts is, in many cases, rather simple. Consider the following double-precision matrix multiply in Fortran:

Do k=1,MAX 
Do j=1,MAX
Do i=1,MAX
a(i,k)=a(i,k) + b(i,j)*c(j,k)
enddo
enddo
enddo 

 


Solution

Unroll the inner loop and then interlace the unrolled lines. The first thing to improve performance is to unroll the inner loop.

Do k=1,MAX 
Do j=1,MAX
Do i=1,MAX
a(i,k)=a(i,k) + b(i,j)*c(j,k)
a(I+1,k)=a(I+1,k) + b(I+1,j)*c(j,k)
a(I+2,k)=a(I+2,k) + b(I+2,j)*c(j,k)
a(I+3,k)=a(I+3,k) + b(I+3,j)*c(j,k)
enddo
enddo
enddo 

 

While this coding improves the performance significantly, it may result in many bank conflicts caused by loading successive elements in memory during the same cycle. The simple solution is to interleave the unrolled lines as follows:

Do k=1,MAX 
Do j=1,MAX
Do i=1,MAX
a(i,k)=a(i,k) + b(i,j)*c(j,k)
a(I+2,k)=a(I+2,k) + b(I+2,j)*c(j,k)
a(I+1,k)=a(I+1,k) + b(I+1,j)*c(j,k)
a(I+3,k)=a(I+3,k) + b(I+3,j)*c(j,k)
enddo
enddo
enddo 

 

This change results in separating the loads from the same banks (adjacent addresses in memory) by at least a cycle, thus removing the bank conflict. In the case of single-precision data (four bytes), a greater degree of unrolling (over j) and a more complex interleaving is required to remove the bank conflicts. In a real measurement of the above example, interleaving improves performance by more than 10%.


Source

Introduction to Microarchitectural Optimization for Itanium® Processors