Improve Code Based on Root-Cause Analysis on 64-Bit Intel® Architecture

Submit New Article

November 24, 2008 11:00 PM PST



Challenge

Improve the efficiency of code based on root-cause analysis of back-end bubbles. That analysis is covered in a separate item How to Perform Back-End Bubble Root-Cause Analysis on 64-Bit Intel® Architecture. The solution in this item is based on that analysis.


Solution

Reorder and unroll loops to improve the efficiency of memory access. Studying the core loops identified in the root-cause analysis shows that in every iteration, a new element of matrix a and b are accessed:

(i = 0; i < 512; i++) { 
(j = 0; j < 512; j++) {
(k = 0; k < 512; k++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
}
}

 

Take a moment for a quick review of how matrixes are stored in memory; the scheme can be visualized as a straight line. If given a segment of memory that can hold 1024 elements, memory position 0000 will hold element [0][0], 0001 will hold [0][1], 0002 will hold [0][2], and so on until 0511 holds [0][511]. The new row of data starts at the next available memory address.

This means that memory position 0512 holds element [1][0], 0513 holds [1][1], and so on until 1023 holds [1][511]. This can also affect cache, because when data is accessed from one segment of memory, data in nearby locations is also loaded, since memory accesses are often close to one another.

Matrix c is static for 512 iterations of k. Matrix a accesses successive elements of memory, but matrix b is accessed in a sequential but not successive manner. This means that matrix a takes advantage of cache, and matrix b does not.

Suppose that the first-level cache can hold 64 elements of a matrix, and the second-level cache can hold 256 elements. As the first element for matrix a is loaded into cache, the next 64 values are loaded into the quickest cache. Those 64 elements plus additional elements are loaded into second-level cache, for a total of 256 values ready for quick access.

Matrix a will not have to wait for a load from main memory for at least another 255 loop iterations. Matrix b, however, has to wait for memory with each iteration, because it increments each row per iteration, instead of per column like matrix a. All the data that gets loaded into cache with the first memory access is then wasted, because none of the other values will get used soon.

In an effort to improve the way matrix b access memory, let us swap the positions of loops j and k, like so:

(i = 0; i < 512; i++) { 
(k = 0; k < 512; k++) {
(j = 0; j < 512; j++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
}

 

Matrix c now gets accessed sequentially, but it changes every iteration. Matrix a still gets accessed sequentially and becomes sta tic for 512 iterations, as matrix c did earlier. Matrix b is accessed sequentially, because j increments every iteration instead of k. Therefore, the program should take advantage of spatial locality (memory elements being next to each other) and run faster. Make the change in code, recompile, and replace the old executable.

Right click the “BACK_END root cause” activity and select “Copy Object.” Right-click on the parent folder and paste. Rename the new “BACK_END root cause” to “BE root cause 02.” Run the sampling activity again:



Notice that the total number of CPU_CYCLES has been reduced by half! BE_EXE_BUBBLE-ALL is still a large consumer of clockticks, though, and there is more that can be done to improve performance.

The next step of tuning this code is to take advantage of the Intel® Itanium® processor’s ability to execute six instructions at one time. Since we have established that the loops are independent of each other, it is possible to unroll the loops, executing multiple computations per iteration.

The way to unroll loops is to write code similar to this:

(i = 0; i < 512; i++) { 
(k = 0; k < 512; k++) {
(j = 0; j < 512; j++) {
c[i][j] = c[i][j] + a[i][k] * b[k][j];
c[i+1][j+1] = c[i+1][j+1] + a[i+1][k+1] * b[k+1][j+1];
c[i+2][j+2] = c[i+2][j+2] + a[i+2][k+2] * b[k+2][j+2];
c[i+3][j+3] = c[i+3][j+3] + a[i+3][k+3] * b[k+3][j+3];
}
}

 

This way, four computations are done per iteration. After recompiling and replacing the executable, the VTune analyzer reports the following:



This item is part of a series, which is introduced in the separate item "How to Resolve Back-End Bubbles on 64-Bit Intel® Architecture."


Source

Identifying Root Causes Using the VTune™ Performance Analyzer