Cache Blocking Techniques

Compiler Methodology for Intel® MIC Architecture
Cache Blocking Techniques

An important class of algorithmic changes involves blocking data structures to fit in cache. By organizing data memory accesses, one can load the cache with a small subset of a much larger data set. The idea is then to work on this block of data in cache. By using/reusing this data in cache we reduce the need to go to memory (reduce memory bandwidth pressure).

Blocking is a well-known optimization technique that can help avoid memory bandwidth bottlenecks in a number of applications. The key idea behind blocking is to exploit the inherent data reuse available in the application by ensuring that data remains in cache across multiple uses. Blocking can be performed on 1-D, 2-D or 3-D spatial data structures. Some iterative applications can further benefit from blocking over multiple iterations (commonly called temporal blocking) to further mitigate bandwidth bottlenecks. In terms of code change, blocking typically involves a combination of loop splitting and interchange. In most application codes, blocking is best performed by the user by making the right source changes with some parameterization of the block-factors.

Original Source:

for (body1 = 0; body1 < NBODIES; body1 ++) {
   for (body2=0; body2 < NBODIES; body2++) {
     OUT[body1] += compute(body1, body2);

In this example, data (body2) is streamed from memory. Assuming NBODIES is large, we would have no reuse in cache. This application is memory bandwidth bound. The application will run at the speed of memory to CPU speeds, less than optimal.

Modified Source (with 1-D blocking):

for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
   for (body1=0; body1 < NBODIES; body1 ++) {
      for (body22=0; body22 < BLOCK; body22 ++) {
         OUT[body1] += compute(body1, body2 + body22);

In this modified code, data (body22) is kept and reused in cache, resulting in better performance.

For instance, the code snippet above shows an example of blocking NBody code. There are two loops (body1 and body2) iterating over all bodies. The original code on the top streams through the entire set of bodies in the inner loop, and must load the body2 value from memory in each iteration. The blocked code at the bottom is obtained by splitting the body2 loop into an outer loop iterating over bodies in multiple of BLOCK, and an inner body22 loop iterating over elements within the block, and interleaving the body1 and body2 loops. This code reuses a set of BLOCK body2 values across multiple iterations of the body1 loop. If BLOCK is chosen such that this set of values fits in cache, memory traffic is brought down by a factor of BLOCK.

Here is the relevant code-snippet from an OpenMP* version of the NBody benchmark (with blocking applied using a factor of CHUNK_SIZE). In this case, the loop unroll-jam transformation is expressed as a pragma and is left to be done by the compiler. In such cases, studying the output from -opt-report can confirm that the compiler did perform the unroll-jam optimization for your loop(s).

#define CHUNK_SIZE 8192

#pragma omp parallel private(body_start_index)
  for(body_start_index = 0; body_start_index < global_number_of_bodies; body_start_index += CHUNK_SIZE) {
    int i;
    int body_end_index = body_start_index + CHUNK_SIZE;

    #pragma omp for private(i) schedule(guided)
    #pragma unroll_and_jam (4)
    for(i=starting_index; i<ending_index; i++) {
      int j;
      TYPE acc_x_0 = 0, acc_y_0 = 0, acc_z_0 = 0;
      for(j=body_start_index; j<body_end_index; j+=1) {
        TYPE delta_x_0 = Input_Position_X[(j+0)] - Input_Position_X[i];
        TYPE delta_y_0 = Input_Position_Y[(j+0)] - Input_Position_Y[i];
        TYPE delta_z_0 = Input_Position_Z[(j+0)] - Input_Position_Z[i];

        TYPE gamma_0 = delta_x_0*delta_x_0 + delta_y_0*delta_y_0 + delta_z_0*delta_z_0 + epsilon_sqr;
        TYPE s_0 = Mass[j+0]/(gamma_0 * SQRT(gamma_0));
        acc_x_0 += s_0*delta_x_0;
        acc_y_0 += s_0*delta_y_0;
        acc_z_0 += s_0*delta_z_0;
      Output_Acceleration[3*(i+0)+0] += acc_x_0;
      Output_Acceleration[3*(i+0)+1] += acc_y_0;
      Output_Acceleration[3*(i+0)+2] += acc_z_0;

Here is an example of a matrix-multiply code in Fortran where the user performs advanced block-unroll-jam transformations (in the modified version) involving local copy-arrays for best performance.

Fortran Source Example:

do j=1,N
  do k = 1,N
    do i = 1,N
      c(i,j) = c(i,j) + a(i,k) * b(k,j)
    end do
  end do
end do

Modified Fortran Source:

     do JJ = 1, N, TJ

       do KK = 1, N, TK
         do jjj = 1,min(tj,N-jj+1)                     ! BCOPY - no transpose
           do kkk = 1, min(tk,N-kk+1)
             p(kkk,jjj-1+1) = B(kk+kkk-1, jj+jjj-1)
           end do
         end do
         do II = 1, N, TI
           do iii = 1,
             min(ti,N-ii+1)                   !ACOPY - transpose
             do kkk = 1, min(tk,N-kk+1)
                Q(kkk,iii) = A(ii+iii-1, kk+kkk-1)
             end do
           end do
           do J = 1, min(tj,N-jj+1), 4
             do I = 1, min(ti,N-ii+1), 2
                t1 = 0 ; t2 = 0 ; t5 = 0 ; t6 = 0 ; t9 = 0 ; t10 = 0 ; t13 =0 ; t14 = 0
                !DIR$ vector aligned                      !DIR$ unroll(2)
                do K = 1,min(TK,N-kk+1)      ! Innermost loop, vectorized and unrolled by 2 after that
                   qi = Q(K,I)           ;    qi1 = Q(K,I+1)  
                   t1 = t1+qi*P(K,J)     ;    t2 = t2+ qi1*P(K,J)
                   t5 = t5+ qi*P(K,J+1)  ;    t6 = t6+ qi1*P(K,J+1)
                   t9 = t9+ qi*P(K,J+2)  ;    t10 = t10+ qi1*P(K,J+2)
                   t13 = t13+ qi*P(K,J+3);    t14 = t14+qi1*P(K,J+3)
                end do
               c(i+ii-1,j+jj-1) = c(i+ii-1,j+jj-1) +t1          ; c(i+1+ii-1,j+jj-1) = c(i+1+ii-1,j+jj-1) + t2
               c(i+ii-1,j+1+jj-1) = c(i+ii-1,j+1+jj-1) + t5     ; c(i+1+ii-1,j+1+jj-1) = c(i+1+ii-1,j+1+jj-1) + t6
               c(i+ii-1,j+2+jj-1) = c(i+ii-1,j+2+jj-1) + t9     ; c(i+1+ii-1,j+2+jj-1) = c(i+1+ii-1,j+2+jj-1) + t10
               c(i+ii-1,j+3+jj-1) = c(i+ii-1,j+3+jj-1) + t13    ; c(i+1+ii-1,j+3+jj-1) = c(i+1+ii-1,j+3+jj-1) + t14
             end do
           end do
         end do
       end do
     end do

Take Aways

Cache Blocking is a technique to rearrange data access to pull subsets (blocks) of data into cache and to operate on this block to avoid having to repeatedly fetch data from main memory. As the examples above show, it is possible to manually block loop data in such a way to reuse cache. For performance critical loops where a performance analysis indicates memory bandwidth constraints and -opt-report is showing that the compiler is not blocking the loop optimally, you may consider hand-unrolling of the loop to better block data for cache reuse.


It is essential that you read this guide from start to finish using the built-in hyperlinks to guide you along a path to a successful port and tuning of your application(s) on Intel® Xeon Phi™ coprocessors. The paths provided in this guide reflect the steps necessary to get best possible application performance.

BACK to Preparing for the Intel® Many Integrated Core Architecture

Para obter informações mais completas sobre otimizações do compilador, consulte nosso aviso de otimização.