Putting Your Data and Code in Order: Optimization and Memory – Part 1

This series of two articles discusses how data and memory layout affect performance and suggests specific steps to improve software performance. The basic steps shown in these two articles can yield significant performance gains. Many articles on software performance optimization address parallelism in one of three forms: distributed memory parallelism (such as MPI), shared memory parallelism (such as threads), or SIMD (also called vectorization); true parallelism should address all three areas. While these items are very important, memory is equally important and is often overlooked. Changes to software architecture and parallel design impact memory and performance.

These two articles are aimed at an intermediate level. It is assumed the reader wants to improve software performance using common C, C++, and Fortran* programming options; the use of assembly and intrinsics is left to advanced users. The author recommends that those seeking more advanced material familiarize themselves with the processor instruction set architecture and consult the many excellent research journal articles on analyzing and designing data placement and layout.

Putting your data and code in order builds on two basic principles: minimize data movement and place data close to where it is used. Once data is brought into, or close to, a processor register, use it as many times as it is needed before it is evicted from cache or moved farther away from the processor execution units.

 

Data Positioning

Let's consider the different levels where data may reside. The closest points to the execution units are the processor registers. Data in the registers can be acted upon immediately: incremented, multiplied, added, or used in a comparison or Boolean operation. Each core in a multicore processor typically has a private first-level cache (called L1). Data can be moved from the first-level cache into a register very quickly. There may be several levels of cache; at minimum, the last level of cache (called the LLC) is typically shared among all cores in the processor. Whether the intermediate levels of cache are shared or private varies by processor. On Intel platforms, the caches maintain coherency across a single platform even when there are multiple sockets. Data movement from the caches to the registers is faster than fetching data from main memory.

Figure 1 gives a figurative illustration of data positioning, showing proximity to the processor registers and the relative time to access data. The closer a block is to the register, the faster the movement, that is, the lower the latency in getting data into the register for execution. The caches are the fastest, lowest-latency storage after the registers; main memory is the next fastest. There may be multiple levels of memory, although different levels of memory are not addressed until Part 2. When pages are swapped out to a hard disk or SSD, going through virtual memory can be significantly slower. The traditional MPI send/receive over an interconnect fabric (Ethernet, InfiniBand, or other) has higher latency than getting data locally, and the access time for data on a remote system varies with the interconnect fabric: Ethernet, InfiniBand, Intel® True Scale, or Intel® Omni Scale.

 

Figure 1. Memory access latency, showing relative time to access data.

 

The closest points to the execution units are the processor registers. Given the number of registers, the latency of loading data into them, and the width of the memory operations queue, it isn't possible to use each value in the registers only once and still bring in data fast enough to keep all of the execution units fully busy. Once data is close to the execution units, it is desirable to increase its reuse before it is evicted from cache or removed from a register. In fact, some variables exist only in registers and are never stored to main memory. The compiler does an excellent job of recognizing when a variable's scope makes it best kept as a register-only variable, so using the "register" keyword in C/C++ is not recommended; compilers already recognize this optimization and are permitted to ignore the keyword.

The main point is for the software developer to look at the code and think about how data is used and how long it needs to exist. Ask yourself: Should I create a temporary variable? Should I create a temporary array? Do I need to keep this many temporary variables around? When going through a performance improvement process, a developer should begin by collecting software performance metrics and then focus the data locality effort on the modules or branches of the code where significant portions of the execution time are spent. A few of the popular performance data collection utilities include Intel® VTune™ Amplifier XE, gprof, and Tau*.
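
As a minimal sketch of one such workflow (gprof with gcc is assumed here; the file name mm.c is simply the matrix multiply example used later in this article), a flat profile is usually enough to find the loops worth reordering or blocking:

gcc -O2 -pg mm.c -o mm      # build with profiling instrumentation
./mm                        # run the program; this writes gmon.out
gprof ./mm gmon.out         # print the flat profile and call graph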

 

Data Use and Reuse

An excellent example for teaching these two principles is matrix multiplication. A matrix multiply, A = A + B*C, for three square n-by-n matrices can be represented by three simple nested for loops as shown below:

for (i = 0; i < n; i++)                  // line 136
   for (j = 0; j < n; j++)               // line 137
      for (k = 0; k < n; k++)            // line 138
         A[i][j] += B[i][k] * C[k][j];   // line 139

The main problem with this ordering is that it contains a reduction operation (lines 138 and 139). The left-hand side of line 139 is a single value. Although the compiler will partially unroll the loop at line 138 to fill the width of the SIMD registers and form 4 or 8 products from elements of B and C, those products must then be summed into a single value. Summing 4 or 8 values into one position is a reduction, which does not parallelize as well and does not utilize the full width of the SIMD units effectively. Minimizing or eliminating reduction operations improves parallel performance, and a single value on the left-hand side inside a loop is an indication of a probable reduction. The data access pattern for a single iteration of line 137 is shown below in Figure 2 (i,j = 2).

Figure 2. Original ordering, showing accumulation into a single value of matrix A.
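
To make the reduction concrete, the following scalar sketch illustrates what partial unrolling of the k loop effectively does for one (i,j) pair (this is an illustration, not compiler output; A, B, C, and n are the same as in the loops above). The combine at the end is the reduction step that limits SIMD efficiency:

// Sketch: partially unrolled k loop for a single (i,j) pair.
// Assumes n is a multiple of 4 for simplicity.
double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
for (int k = 0; k < n; k += 4) {
   sum0 += B[i][k]   * C[k][j];
   sum1 += B[i][k+1] * C[k+1][j];
   sum2 += B[i][k+2] * C[k+2][j];
   sum3 += B[i][k+3] * C[k+3][j];
}
A[i][j] += (sum0 + sum1) + (sum2 + sum3);   // horizontal reduction into one value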

 

Sometimes the reduction can be removed by reordering the operations. Consider the ordering where the two inner loops are swapped. The number of floating point operations is the same, but because the summing of values into a single left-hand side is eliminated, the processor can utilize the full width of the SIMD execution units and registers every time. This improves performance significantly.

for (i = 0; i < n; i++)                  // line 136
   for (k = 0; k < n; k++)               // line 137new
      for (j = 0; j < n; j++)            // line 138new
         A[i][j] += B[i][k] * C[k][j];   // line 139

When this is done, elements of A and C are accessed contiguously.

Figure 3. Updated ordering, showing contiguous access.

 

 

The original ijk ordering is a dot product method: a dot product of two vectors is used to calculate the value of each element of A. The ikj ordering is a saxpy or daxpy operation: a multiple of one vector is added to another vector. Both dot products and axpy operations are level 1 BLAS routines. In the ikj ordering there is no reduction required. A subset of a row of C is multiplied by a scalar from matrix B and added to a subset of a row of A (the compiler determines the subset size depending on the width of the SIMD registers targeted: SSE4, AVX, or AVX-512). Memory access for a single iteration of loop 137new is shown in Figure 3 above (i,k = 2).

The elimination of the reduction required by the dot product form is a significant improvement. With optimization level O2, both the Intel compiler and gcc* generate vectorized code that uses the SIMD registers and execution units. In addition, the Intel compiler automatically swaps the order of the j and k loops. This is revealed by looking at the compiler optimization report, which is generated with the opt-report compiler option (-qopt-report on Linux*). By default, the optimization report is placed in a file named filename.optrpt. For this case the optimization report contains the following text excerpts:

LOOP BEGIN at mm.c(136,4)
   remark #25444: Loopnest Interchanged: ( 1 2 3 ) --> ( 1 3 2 )

 

The report also shows that the interchanged loop was vectorized:

LOOP BEGIN at mm.c(137,7)
    remark #15301: PERMUTED LOOP WAS VECTORIZED
 LOOP END
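
As a minimal sketch (the compiler driver name and report level below are typical choices, not necessarily the exact ones used for this article), the report can be produced and inspected from the command line like this:

icc -O2 -qopt-report=2 -c mm.c    # writes the optimization report to mm.optrpt
cat mm.optrpt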

The gcc compiler (version 4.1.2-55) does not automatically perform this loop interchange; the developer needs to make the swap manually.
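
As a hedged aside for those on newer gcc releases (the flags below assume gcc 4.9 or later), gcc can report which loops were vectorized, which makes it easy to check whether a manual interchange paid off:

gcc -O2 -ftree-vectorize -fopt-info-vec-optimized -c mm.c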

Additional performance gains come from blocking the loops to improve data reuse. In the representation above (Figure 3), each iteration of the middle loop references two vectors of length n (as well as a scalar), and each element of those two vectors is used only once. For large n it is quite common for every element of those vectors to be evicted from the caches between middle-loop iterations. When the loops are blocked to provide some data reuse, performance jumps again.

The final code shows the j and k loops interchanged and blocking added. The code operates on submatrices, or blocks, of size blockSize at a time; in this simple case it is assumed that n is a multiple of blockSize.

for (i = 0; i < n; i += blockSize)
   for (k = 0; k < n; k += blockSize)
      for (j = 0; j < n; j += blockSize)
         for (iInner = i; iInner < i+blockSize; iInner++)
            for (kInner = k; kInner < k+blockSize; kInner++)
               for (jInner = j; jInner < j+blockSize; jInner++)
                  A[iInner][jInner] += B[iInner][kInner] * C[kInner][jInner];

 

In this model the data access for one particular iteration of loop j may look like this:

Figure 4. Block model representation.

 

If blockSize is selected appropriately, it is safe to assume that each block will remain in cache (and perhaps even in the SIMD registers) during the operations of the three inner loops. Each value or element of A, B, and C will be used blockSize times before it is removed from the SIMD registers or evicted from cache, increasing data reuse by a factor of blockSize. When a modest-size matrix is used, there is little to no gain from blocking; as the matrix size increases, the performance difference becomes significant.
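
A rough way to pick a starting value, offered as an assumption-laden sketch rather than a rule from this article, is to require that one block each of A, B, and C fit in the targeted cache level (a 32 KB L1 data cache is assumed below), then tune from there:

#include <math.h>
#include <stddef.h>

// Largest blockSize such that 3 * blockSize^2 * elem_bytes fits in cache_bytes.
// pick_block_size(32*1024, sizeof(double)) returns 36; rounding down to a
// convenient value such as 32 keeps the inner loop lengths SIMD friendly.
static int pick_block_size(size_t cache_bytes, size_t elem_bytes)
{
   return (int)floor(sqrt((double)cache_bytes / (3.0 * elem_bytes)));
}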

The table below shows performance ratios measured on one system with two different compilers. Note that the Intel compiler automatically interchanges the loops at lines 137 and 138, so it shows little difference between the ijk and ikj orderings. This also makes the Intel compiler's baseline much faster, which is why its final speedup over baseline appears lower.

Ordering     Matrix/block size   gcc* 4.1.2 -O2          Intel compiler 16.1 -O2
                                 (speedup over base)     (speedup over base)
ijk          1600                1.0 (base)              12.32
ikj          1600                6.25                    12.33
ikj block    1600/8              6.44                    8.44
ijk          4000                1.0 (base)              6.39
ikj          4000                6.04                    6.38
ikj block    4000/8              8.42                    10.53

Table 1. Performance ratios measured on gcc* and the Intel compiler.

The code sample here is not complicated, and both compilers generate SIMD instructions for it. The older gcc compiler is not used to compare compilers; it serves an important teaching principle: it illustrates the impact of the order of operations and of reductions, even when the data feeding the reduction can be processed in parallel. Many loops are more complicated, and no compiler will recognize every opportunity, so it is worthwhile for developers to examine the time-intensive regions of code, check the compiler reports to see whether the compiler has already restructured them, and consider doing it themselves if not. Second, notice that blocking matters when the data becomes large: there is no performance increase for the smaller of the two matrices, while for the larger matrices the gain is significant. So before blocking all of their code, developers should consider the data size relative to the caches. With the addition of a few nested loops and appropriate bounds, developers can achieve code sections that run 2 to 10 times faster than the original. This is a big performance boost for a reasonable amount of effort.

 

Using Optimized Libraries

There is still more work to be done for the truly motivated. The blocked code above still falls well short of the performance delivered by the level 3 BLAS routine DGEMM from an optimized library such as the Intel® Math Kernel Library (Intel® MKL). For common linear algebra operations and Fourier transforms, modern libraries such as Intel MKL provide better optimizations than the simple blocking and reordering exercises covered in this paper. When such performance-tuned libraries are available, developers should use them.
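
As a minimal sketch of what that looks like (the wrapper name is hypothetical; the CBLAS interface shown is the standard one exposed by Intel MKL and other BLAS implementations), the A = A + B*C operation above maps onto a single DGEMM call:

#include <mkl.h>   // or <cblas.h> with another CBLAS implementation

// Computes A = 1.0*B*C + 1.0*A for row-major n-by-n matrices stored
// contiguously, matching the A[i][j] += B[i][k]*C[k][j] loops above.
void matmul_dgemm(double *A, const double *B, const double *C, int n)
{
   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
               n, n, n,
               1.0, B, n,     // first factor and its leading dimension
                    C, n,     // second factor and its leading dimension
               1.0, A, n);    // accumulate into A
}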

Even though optimized libraries exist for matrix multiply, they do not exist for every situation where blocking improves performance. Matrix multiply simply provides an easy-to-visualize example for teaching the principle; blocking works very well for finite difference stencils as well.

Figure 5. 2D block model representation.

 

A simple nine-point stencil uses the nine highlighted values to update the position at the center. Nine values are brought in to update one position, and when the neighboring position is updated, six of those values are used again. If the code simply progresses through the domain in row order, as shown in the adjacent figure, then to update three positions, 15 values must be brought in; asymptotically this approaches one update for every three values loaded, a 1:3 ratio.

If the updates are instead organized into 2D blocks, as shown in Figure 5, then for a block width of two, six positions are updated while only 20 values are brought into the registers; asymptotically this approaches one update for every two values loaded, a 1:2 ratio.
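
The sketch below (hypothetical array and weight names; interior points only, with halo handling omitted) shows one simple way to block a nine-point stencil in the unit-stride direction so that values loaded for one update are reused for its neighbors before being evicted:

#define BJ 64   // block width in the j (unit-stride) direction; tune for cache

// Nine-point stencil, blocked over j so that the three input row segments
// touched by a block stay cache resident while i sweeps down the domain.
void stencil_blocked(int n, const double in[n][n], double out[n][n],
                     const double w[3][3])
{
   for (int jb = 1; jb < n - 1; jb += BJ) {
      int jend = (jb + BJ < n - 1) ? (jb + BJ) : (n - 1);
      for (int i = 1; i < n - 1; i++)
         for (int j = jb; j < jend; j++)
            out[i][j] = w[0][0]*in[i-1][j-1] + w[0][1]*in[i-1][j] + w[0][2]*in[i-1][j+1]
                      + w[1][0]*in[i  ][j-1] + w[1][1]*in[i  ][j] + w[1][2]*in[i  ][j+1]
                      + w[2][0]*in[i+1][j-1] + w[2][1]*in[i+1][j] + w[2][2]*in[i+1][j+1];
   }
}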

I encourage readers interested in finite difference techniques to study the excellent article "Eight Optimizations for 3-Dimensional Finite Difference (3DFD) Code with an Isotropic (ISO)" by Cedric Andreolli. In addition to blocking, that paper addresses other memory optimization techniques discussed in these two articles.

 

Summary

In summary, there are three key steps illustrated in this paper that developers can apply to their software. First, order operations to avoid parallel reductions. Second, look for data reuse opportunities and restructure nested loops into blocks to allow for data reuse; this alone can double the performance of some operations. Lastly, use optimized libraries when available; they are much faster than what typical developers will obtain with simple reordering.

Download the Code Sample

In Part 2, I will consider parallelism across multiple cores, not just SIMD registers and will address topics of false sharing and arrays of structures.

For more complete information about compiler optimizations, see our Optimization Notice.

Comments

Sergey Kostrov:

>>...
>>These two articles gave the basics that most any developer can apply
>>without needing to learn all the options.
>>...

Strongly disagree with that statement, and you're undervaluing the importance of a correct set of C++ compiler(s) options.

David M.:

Hello Sergey, I am pleased to see your continued interest in this topic. Part 2 has been published now. In it I point out that the better performance with the Intel compiler was because it took better advantage of data alignment on cache line boundaries. When one of the matrices was intentionally offset so that its data is not aligned on a cache line boundary, the performance advantage of the Intel compiler goes away. These are fairly straightforward loops and I think every compiler does a good job with them. For this article I only used the -O2 compiler option. There are additional compiler options and pragmas that can be used to give the compiler more information that allows it to do more optimizations. These two articles gave the basics that most any developer can apply without needing to learn all the options. For those who want to do more, these two articles are good introductions, and I hope to see comments from others.

 

 

Sergey Kostrov:

Here are a couple of more notes.

Modern software developers and engineers think that C/C++ compilers can do almost everything they need and that they could not do anything more themselves. But, in reality, a very simple piece of C code, like a 3-for-loop matrix multiplication, needs to be tuned (!) for every C/C++ compiler if a team uses more than two C/C++ compilers on a project.

This is what I went through and, since I'm dealing with more than five C/C++ compilers on the project, I designed a concept of 'Core Processing Template Based Inline functions' in order to get the best and fastest possible binary code.

For example, this is how a generic piece of matrix multiplication ( transposed version ) looks like:

 // Implementation Code: CTv1D ( LOS=1:1:1 )

 template < class T >
 _RTINLINE RTvoid _MatrixMulProcessingCTv1D( T * _RTRESTRICT ptA, T * _RTRESTRICT ptB,
            T * _RTRESTRICT ptC,
            RTssize_t iM, RTssize_t iK, RTssize_t iN,
            RTint iNumOfThreads )
   {
   // CrtPrintf( RTU("_MatrixMulProcessingCTv1D\n") );
    RTssize_t i;
    RTssize_t j;
    RTssize_t k;
    _RTALIGNED register T tR;
    #if ( defined ( _RTPRAGMAOMP_PARALLELFOR ) )
    #pragma omp parallel for private( i, j, k, tR ) num_threads( iNumOfThreads )
    #endif
    for( i = 0; i < iM; i += 1 )
    {
     for( j = 0; j < iK; j += 1 )
     {
      tR = ( T )0;
      for( k = 0; k < iN; k += 1 )
      {
       tR += ( ptA[ iN*i+k ] * ptB[ iK*j+k ] );
      }
      ptC[ iK*i+j ] = ( T )tR;
     }
     CrtSfence();
    }
   }
 

and, this is how a Tuned C codes look like for Microsoft or Intel C/C++ compilers:

// Implementation Code: CTUnRv1A ( LOS=1:1:4 )

 template < class T >
 _RTINLINE RTvoid _MatrixMulProcessingCTUnRv1A( T ** _RTRESTRICT pptA, T ** _RTRESTRICT pptB,
               T ** _RTRESTRICT pptC,
               RTssize_t iM, RTssize_t iK, RTssize_t iN,
               RTint iNumOfThreads )
   {
   // CrtPrintf( RTU("_MatrixMulProcessingCTUnRv1A\n") );
    RTssize_t iN4 = iN - ( iN & 3 );
    RTssize_t i;
    RTssize_t j;
    RTssize_t k;
    _RTALIGNED register T tR;
    #if ( defined ( _RTPRAGMAOMP_PARALLELFOR ) )
    #pragma omp parallel for private( i, j, k, tR ) num_threads( iNumOfThreads )
    #endif
    for( i = 0; i < iM; i += 1 )
    {
     for( j = 0; j < iK; j += 1 )
     {
      tR = ( T )0;
      for( k = 0; k < iN4; k += 4 )
      {
       tR += ( pptA[i][k  ] * pptB[j][k  ] ) +
          ( pptA[i][k+1] * pptB[j][k+1] ) +
          ( pptA[i][k+2] * pptB[j][k+2] ) +
          ( pptA[i][k+3] * pptB[j][k+3] );
      }
      for( k = iN4; k < iN; k += 1 )
      {
       tR += ( pptA[i][k] * pptB[j][k] );
      }
      pptC[i][j] = ( T )tR;
     }
     CrtSfence();
    }
   }
 

and, this is how a Tuned C codes look like for MinGW C/C++ compiler:

 // Implementation Code: CTv1E ( LOS=1:1:1 )

 template < class T >
 _RTINLINE RTvoid _MatrixMulProcessingCTv1E( T * _RTRESTRICT ptA, T * _RTRESTRICT ptB,
            T * _RTRESTRICT ptC,
            RTssize_t iM, RTssize_t iK, RTssize_t iN,
            RTint iNumOfThreads )
   {
   // CrtPrintf( RTU("_MatrixMulProcessingCTv1E\n") );
    _RTALIGNED T *ptA2 = ( T * )__builtin_assume_aligned( ptA, _RTDEFAULT_ALIGNMENT );
    _RTALIGNED T *ptB2 = ( T * )__builtin_assume_aligned( ptB, _RTDEFAULT_ALIGNMENT );
    _RTALIGNED T *ptC2 = ( T * )__builtin_assume_aligned( ptC, _RTDEFAULT_ALIGNMENT );
    RTssize_t i;
    RTssize_t j;
    RTssize_t k;
    _RTALIGNED register T tR;
    #if ( defined ( _RTPRAGMAOMP_PARALLELFOR ) )
    #pragma omp parallel for private( i, j, k, tR ) num_threads( iNumOfThreads )
    #endif
    for( i = 0; i < iM; i += 1 )
    {
     for( j = 0; j < iK; j += 1 )
     {
      tR = ( T )0;
      for( k = 0; k < iN; k += 1 )
      {
       tR += ( ptA2[ iN*i+k ] * ptB2[ iK*j+k ] );
      }
      ptC2[ iK*i+j ] = ( T )tR;
     }
     CrtSfence();
    }
   }
 

During compilation the correct Core Processing function is selected depending on which C/C++ compiler is used. LOS stands for Loop Optimization Schema.

Sergey Kostrov:

>>...I don't have access to a system now with the gcc 5 compilers - but I am curious to know how they do...

I've been working with MinGW C++ compilers ( GCC-alike / for Windows platforms ) for more than 5 years and I have experience with these versions: v3.4.2, v4.8.1, v4.9.0, v4.9.2. Version 5.1.0 is in Pending state and it should be integrated, with a project I've been working on, soon. I'd like to say that versions of MinGW C++ compiler after v4.8.1 are very impressive and it applies for 32-bit and 64-bit Windows platforms.

There are two IDZ threads where you could see some performance results of MinGW C++ compiler v4.9.2:

 *** Analysis of 128-bit Streaming store codes vs. Non Streaming store codes ***
http://software.intel.com/en-us/forums/watercooler-catchall/topic/609392

and

*** Integration of Open Watcom C++ compiler - details, performance evaluation ***
http://software.intel.com/en-us/forums/watercooler-catchall/topic/606505
 

David M.:

Thanks for your comment, Sergey. You are correct that the gcc compiler used was old. It is important to get the correct message from the article - I used matrix multiply because it is easy to explain, easy to understand, and can be used to show multiple principles (I use it again in Part 2). If the Intel compiler had not automatically interchanged the loops, I would have used only one compiler and eliminated this point. These three nested loops are very simple to analyze, and in this case this version of the Intel compiler recognizes the opportunity. In your code or another developer's code there may be value in restructuring the order of operations where neither the Intel compiler nor the most recent gcc compiler recognizes the opportunity. In these cases developers may do it themselves and gain the benefit. The message is to look at the order of operations: could it be restructured to eliminate reductions and/or improve the use of SIMD? If so, check whether the compiler already did it; if not, it may be worth modifying the source code. This is the message.

Regarding compilers, in general gcc is a very popular compiler and does well. gcc 4.1.2 was the default on the system I had access to for running these tests. I tried gcc 4.7.2 on another system and it also did not interchange the loops. I don't currently have access to a system with the gcc 5 compilers - but I am curious to know how they do. The message from this article is not that one compiler is better than another; all compilers have limitations. Knowing how to give them help can improve performance.

Sergey Kostrov:

I'd like to make a comment about performance evaluation results, that is Intel C/C++ compiler version 2016 vs.
GCC C/C++ compiler version 4.1.2.

- GCC C/C++ compiler version 4.1.2 is almost 8 years old (!) and was released some time in 2008. It supports a set of Instruction Set Architectures ( ISAs ) up to SSE 4.x and it does not support any AVX-like ISAs ( AVX, AVX2, AVX-512 );

- The latest ( stable ) version of GCC C/C++ compiler is version 5.3 and it was released in December 2015. It supports a complete set of ISAs of Intel CPUs;

- Intel C/C++ compiler version 2016 is the latest version of C/C++ compiler released recently by Intel and any attempts to compare its performance against obsolete and already out-of-support compilers, like GCC C/C++ compiler version 4.1.2, are inappropriate;

In other words, if you want to compare the performance of C/C++ compilers, then their capabilities, that is, their support of ISAs, should match each other. Usage of the -O2 compilation option (for both compilers) does not guarantee anything because both compilers use different optimization techniques and, as I've already said, these C/C++ compilers are from different generations.

If I "down-normalize" the performance results of Intel C/C++ compiler version 2016, or "up-normalize" the performance results of GCC C/C++ compiler version 4.1.2, then the overall picture will be very different and it could show who is doing better.
 
