March 2, 2009 11:00 PM PST
Guide the compiler to perform the proper amount of optimization on an inner loop. One of the quickest ways to find an inner loop in an assembly language listing is to look for sections of code with heavy use of instruction predication, often on almost every functional line. When register rotation is used, predicates control almost all the action. This example has a very small inner loop, but two of its three instructions are predicated:
.b1_3:
The inner loop shown here is one of the smallest, and potentially fastest, inner loops possible on the Itanium® processor family: it consists of only three instructions in a single bundle. Using register rotation, it loads and adds successive values from memory. Ideally, this loop could execute in one cycle per iteration.
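For orientation, a loop that "loads and adds successive values from memory" might look like the following in C. This is only a sketch: the actual contents of checksum.c are not shown in this article, so the function name `checksum`, the 32-bit element type, and the parameter names are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the loop in checksum.c (the real source is not
 * shown in the article). It sums successive 32-bit words from memory. */
uint32_t checksum(const uint32_t *data, size_t count)
{
    uint32_t sum = 0;

    /* Each iteration performs one load and one add, plus the loop branch.
     * That is the kind of work a three-instruction inner loop can carry. */
    for (size_t i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
```

Calling `checksum` on an array holding 1, 2, 3, and 4 returns 10. Whether a given compiler turns this source into a single-bundle, register-rotated loop depends on the compiler version and switches, so the assembly listing remains the thing to inspect.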
A more difficult example to understand is generated by optimizing the loop using the -O3 switch, as shown here:
ecl -O3 -S checksum.c
The following sample code shows that the inner loop in the -O3 compiled code is larger than the code with no optimization switches:
.b1_3:
Even without command-line switches, the compiler performs a significant amount of optimization. However, most compiler optimizations involve a tradeoff between execution speed and code size: in general, your program can be made smaller or faster, but not both. It would therefore appear that the larger code produced with the -O3 switch should run faster than the code produced with no optimization switch. That may or may not be true.
Examine the assembly-language representation of the inner loop, and apply your knowledge of the data set to determine the correct level of optimization. Reading a loop like this is not easy. An obvious question is, "Why would a nine-instruction loop be better than a single-cycle, three-instruction loop?" The answer has to do with the possibility that the data being fetched is not coming from L1 cache. In the worst case, it might not be in cache at all. This is the premise that guides the -O3 optimization.
If that assumption is true, the -O3 optimized version of the loop would indeed run much faster. The lfetch instruction brings data up to the highest level of cache during each iteration, ahead of the loads that need it, thus minimizing the effect of main memory and L3 cache latencies. Conversely, if the -O3 switch is not used and the loads must access main memory, the tight inner loop executes very slowly.
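The effect of the prefetch can be approximated at the source level. The sketch below is not the compiler's output and not the article's code: it assumes a GCC- or Clang-compatible compiler that provides `__builtin_prefetch`, and the function name `checksum_prefetch` and the `PREFETCH_DISTANCE` value are hypothetical and untuned.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed, untuned look-ahead in elements. A real value depends on memory
 * latency and on how much work each iteration does. */
#define PREFETCH_DISTANCE 64

/* Hypothetical checksum loop with a software prefetch that plays the role
 * the lfetch instruction plays in the -O3 listing. */
uint32_t checksum_prefetch(const uint32_t *data, size_t count)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < count; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Request data that will be needed PREFETCH_DISTANCE iterations
         * from now: 0 = prefetch for read, 3 = keep in all cache levels. */
        if (i + PREFETCH_DISTANCE < count) {
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        sum += data[i];   /* the same load-and-add as the small loop */
    }
    return sum;
}
```

The result is identical to the plain loop; only the timing can differ. If the data already sits in L1 cache, the extra instructions are pure overhead, which is why knowledge of the data set should drive the choice of optimization level.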
Recognizing Efficient Use of Caches in Code for the Itanium® Processor Family

