Notes about the Loop-Blocking Optimization Technique to increase performance of processing

[ Note 1 ]

The Loop-Blocking Optimization Technique is well described in the Intel Software Development Manual and in the Intel C++ Compiler User and Reference Guides. After extensive testing I can say that it is very important to select the right Block Size for the last for-loop, and its optimal value depends on the size of the CPU's L1 cache line.
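As a minimal sketch of the technique ( the names, N and BLOCK_SIZE values here are illustrative assumptions, not the actual test code ), blocking the last for-loop over a cache-line-sized chunk of single-precision elements looks like this:

```cpp
#include <cstddef>

// Illustrative values only: 16 floats * 4 bytes = 64 bytes = one cache line
constexpr std::size_t N = 1024;
constexpr std::size_t BLOCK_SIZE = 16;

// Adds array b to array a, processing one cache-line-sized block at a time
void addBlocked( float *a, const float *b )
{
    for( std::size_t i = 0; i < N; i += BLOCK_SIZE )
    {
        // The last for-loop stays inside one 64-byte block of both arrays
        for( std::size_t j = i; j < i + BLOCK_SIZE; ++j )
        {
            a[j] += b[j];
        }
    }
}
```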

 


[ Note 1 - part 2 ]

Here are some performance numbers:

...
Sub-Test - Adds array B to array A - Loop-Blocking Optimization Technique ( Single-Precision Floating-Point type )
...
[ Block size 16 elements ( 64 bytes ) ]
...
ICC - Block Size: 16 - Sub-Test completed in 1516 ticks ( T1 )
MSC - Block Size: 16 - Sub-Test completed in 1531 ticks ( T2 )
MGW - Block Size: 16 - Sub-Test completed in 1546 ticks ( T3 )
...
[ Block size 32 elements ( 128 bytes ) ]
...
ICC - Block Size: 32 - Sub-Test completed in 6422 ticks
MSC - Block Size: 32 - Sub-Test completed in 6671 ticks
MGW - Block Size: 32 - Sub-Test completed in 6734 ticks
...
[ Block size 64 elements ( 256 bytes ) ]
...
ICC - Block Size: 64 - Sub-Test completed in 6593 ticks
MSC - Block Size: 64 - Sub-Test completed in 6735 ticks
MGW - Block Size: 64 - Sub-Test completed in 6765 ticks
...

[ Note 2 ]

Unrolling the last for-loop 4-in-1 improves performance by ~2.3% ( averaged over the three C++ compilers I tested ):

...
[ Block size 16 elements ( 64 bytes ) ]
...
ICC - Block Size: 16 - Sub-Test completed in 1500 ticks Note: Faster by ~1% compared to T1 ( see previous post for all Tx values )
MSC - Block Size: 16 - Sub-Test completed in 1516 ticks Note: Faster by ~1% compared to T2
MGW - Block Size: 16 - Sub-Test completed in 1469 ticks Note: Faster by ~5% compared to T3
...
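The 4-in-1 unrolling of the last for-loop can be sketched as follows ( an illustrative sketch, not the actual test code; BLOCK_SIZE is assumed to be a multiple of 4 ):

```cpp
#include <cstddef>

constexpr std::size_t N = 1024;
constexpr std::size_t BLOCK_SIZE = 16;   // assumed to be a multiple of 4

// Same blocked addition, with the innermost loop unrolled 4-in-1
void addBlockedUnrolled4( float *a, const float *b )
{
    for( std::size_t i = 0; i < N; i += BLOCK_SIZE )
    {
        // Four additions per iteration instead of one
        for( std::size_t j = i; j < i + BLOCK_SIZE; j += 4 )
        {
            a[j]     += b[j];
            a[j + 1] += b[j + 1];
            a[j + 2] += b[j + 2];
            a[j + 3] += b[j + 3];
        }
    }
}
```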

[ Note 3 ]

It is possible that the Loop-Blocking Optimization Technique won't improve performance for very large data sets, for example when they occupy several GBs once loaded into memory and Virtual Memory ( the paging file ) comes into play.

[ Note 4 ]

ICC - Intel C++ compiler
MSC - Microsoft C++ compiler
MGW - MinGW C++ compiler

[ Note 5 ]

Optimizations for speed ( /O2 ) were used for all C++ compilers.

[ Note 6 ]

There is a very good article about the Loop-Blocking Optimization Technique at:
http://software.intel.com/en-us/articles/performance-tools-for-software-...

[ Note 7 ]

Here is another set of performance numbers ( with Intel C++ compiler ):

Sub-Test - MatrixA + MatrixB - Loop-Blocking Optimization Technique
Matrix Size: 4096x4096
...
Block Size : 2 elements ( 8 bytes ) - Sub-Test completed in 1265 ticks
Block Size : 4 elements ( 16 bytes ) - Sub-Test completed in 625 ticks
Block Size : 8 elements ( 32 bytes ) - Sub-Test completed in 485 ticks
Block Size : 16 elements ( 64 bytes ) - Sub-Test completed in 469 ticks <- Best Time
Block Size : 32 elements ( 128 bytes ) - Sub-Test completed in 1678 ticks
Block Size : 64 elements ( 256 bytes ) - Sub-Test completed in 1682 ticks
Block Size : 128 elements ( 512 bytes ) - Sub-Test completed in 1687 ticks
...

By Loop-Blocking Optimization Technique do you mean dividing the data into cache-line-sized blocks ( 32 bytes long ) and iterating the inner loop over every double or float value inside a cache line?

Here is an example:

void arrayAdditionTest2( double (*input)[MAX_SIZE], double (*output)[MAX_SIZE],
                         double (*res)[MAX_SIZE] )
{
    if( input == NULL || output == NULL || res == NULL )
        return;

    // Debug print of the input matrix
    for( int i = 0; i < MAX_SIZE; i++ )
    {
        for( int j = 0; j < MAX_SIZE; j++ )
        {
            printf( "array input[] = %.17f \n", input[i][j] );
        }
    }

    // Walk the matrices in CACHE_LINE x CACHE_LINE blocks
    for( int i = 0; i < MAX_SIZE; i += CACHE_LINE )
    {
        for( int j = 0; j < MAX_SIZE; j += CACHE_LINE )
        {
            // Process every element of the current block before moving on
            for( int ii = i; ii < i + CACHE_LINE; ii++ )
            {
                for( int jj = j; jj < j + CACHE_LINE; jj++ )
                {
                    res[ii][jj] = output[ii][jj] + input[ii][jj];
                    printf( "Loop Blocking test2 = %.17f \n", res[ii][jj] );
                }
            }
        }
    }
}

See Note 1 and Note 6.

Thanks.

[ Note 8 ]

There are some recommendations related to

-fomit-frame-pointer
-fprefetch-loop-arrays

command line options for MinGW C++ compiler in order to improve performance of processing. However, I did not see any performance gains ( especially for -fprefetch-loop-arrays ) when I tried to use both options.

According to gcc docs, -fomit-frame-pointer is implied by -O, for cases where it is possible (-g would turn it off).  It seems it would be important mainly for 32-bit mode.

IIRC -fprefetch-loop-arrays was designed for AMD athlon-32 CPUs.  On any current CPU, it could be useful only for specialized cases, such as where the limit on hardware prefetched streams is exceeded, or DTLB misses can be mitigated without premature cache eviction.

Thanks, Tim, for these comments.

>>According to gcc docs, -fomit-frame-pointer is implied by -O, for cases where it is possible (-g would turn it off).
>>It seems it would be important mainly for 32-bit mode.

I tested it with a test application compiled for Release configuration and here is a part of command line options for the compiler:

...-O2 -m32 -ffast-math -fomit-frame-pointer...

and I use -g option for Debug configuration only:

...-O0 -m32 -g...

>>IIRC -fprefetch-loop-arrays was designed for AMD athlon-32 CPUs...

That's good to know and I'll check docs as well.

>>-fomit-frame-pointer

Is that option used to load the EBP register with arbitrary data, so call stack frames are accessed through the ESP register?

[ Note 9 ]

Here are results of tests on:

Dell Precision Mobile M4700
Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 )
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Windows 7 Professional 64-bit

Two versions of Classic matrix multiplication algorithm tested:

- Transposed Based
- Loop-Blocking Optimization Based
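A sketch of the Transposed Based variant ( illustrative code, not the actual MatMulV2 implementation ): transposing B first turns the innermost accesses on both operands into sequential, cache-friendly row walks.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of transposed matrix multiplication, C = A * B.
// Matrices are row-major n x n, stored in flat vectors.
void matMulTransposed( const std::vector<double> &a,
                       const std::vector<double> &b,
                       std::vector<double> &c, std::size_t n )
{
    // Transpose B so the inner loop reads both operands row-wise
    std::vector<double> bt( n * n );
    for( std::size_t i = 0; i < n; ++i )
        for( std::size_t j = 0; j < n; ++j )
            bt[j * n + i] = b[i * n + j];

    for( std::size_t i = 0; i < n; ++i )
    {
        for( std::size_t j = 0; j < n; ++j )
        {
            double sum = 0.0;
            // Sequential accesses to row i of A and row j of B-transposed
            for( std::size_t k = 0; k < n; ++k )
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
    }
}
```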

Matrix Size: 2048x2048

Block Size : 2 elements ( 8 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1325 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 29781 ticks

Block Size : 4 elements ( 16 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 9142 ticks

Block Size : 8 elements ( 32 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1339 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 3978 ticks

Block Size : 16 elements ( 64 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1329 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2637 ticks

Block Size : 32 elements ( 128 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1336 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2585 ticks

Block Size : 64 elements ( 256 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1321 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2543 ticks

Block Size : 128 elements ( 512 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1323 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2465 ticks

Block Size : 256 elements ( 1024 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2418 ticks

Block Size : 512 elements ( 2048 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2371 ticks

Block Size : 1024 elements ( 4096 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1321 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2372 ticks

Block Size : 2048 elements ( 8192 bytes )
Sub-Test 4.2 - Transposed Technique ( MatMulV2 ) ***************** Completed in 1326 ticks
Sub-Test 4.3 - Loop-Blocking Optimization Technique ( MatMulV3 ) * Completed in 2356 ticks

1. Block Size values smaller than the cache line ( 64 bytes ) don't increase performance of calculations, since such blocks underuse the data that every cache line brings in

2. It is important to note that the Transposed Based version of the Classic matrix multiplication algorithm always outperforms the Loop-Blocking Optimization Based version

[ Summary for Loop-Blocking Optimization Technique on Ivy Bridge ]

Matrix Size: 2048x2048

Block Size : ...2 elements ( ...8 bytes ) - Completed in 29781 ticks
Block Size : ...4 elements ( ..16 bytes ) - Completed in 9142 ticks
Block Size : ...8 elements ( ..32 bytes ) - Completed in 3978 ticks
Block Size : ..16 elements ( ..64 bytes ) - Completed in 2637 ticks
Block Size : ..32 elements ( .128 bytes ) - Completed in 2585 ticks
Block Size : ..64 elements ( .256 bytes ) - Completed in 2543 ticks
Block Size : .128 elements ( .512 bytes ) - Completed in 2465 ticks
Block Size : .256 elements ( 1024 bytes ) - Completed in 2418 ticks
Block Size : .512 elements ( 2048 bytes ) - Completed in 2371 ticks
Block Size : 1024 elements ( 4096 bytes ) - Completed in 2372 ticks
Block Size : 2048 elements ( 8192 bytes ) - Completed in 2356 ticks

>>...Is that option used to load ebp register with arbitrary data?

This is what MSDN says about it:

...
This option speeds function calls, because no frame pointers need to be set up and removed. It also frees one more register, (EBP on the Intel 386 or later) for storing frequently used variables and sub-expressions.

/Oy is only available in x86 compilers.
...

Yep, that's true, but FPO complicates debugging.

I don't understand the note about debugging, and it is not relevant to the subject of this thread.

FPO = frame pointer omission.

Sorry for the off-topic post.
