March 11, 2009 1:00 AM PDT
Analyze compiler-generated assembly language to determine the logic of critical sections of code. A structured methodology for gaining an understanding of the assembly code is essential for identifying the strengths and weaknesses of that code for a specific implementation.
Trace the natural path of execution by following the predication order or by using the commented numbers inserted by the compiler, then reorganize the sub-operations into a logical order. In heavily predicated loops, the predication order is often the easier trail to follow. Such loops usually start with only predicate p16 set to true on the first iteration; on each later pass that value rotates into p17, then p18, and so forth. As a result, the operations performed on a single item on the register rotation computational "conveyor belt" usually begin with the p16 instructions on the first pass through the loop, followed by the p17 instructions on the second pass, and so forth.
The checksum loop instructions can be thought of as executing on a single input piece of data in something like the order shown in the sample code below. Note that predicate p19 is actually explicitly set by a comparison operation (to p18, which rotates into the p19 position), and so it appears out of the natural 16, 17, 18... order.
(p16) add r32=4,r33 //0: 7
(p16) and r38=7,r35 //0: 7
(p16) ld4 r36=[r33] ;; //0: 8
(p16) cmp4.eq.unc p18,p0=r38,r0 //1: 7
(p16) add r34=1,r35 //1: 7
// looping and register rotation happens here (br.ctop)
(p19) add r39=144,r34 //2: 7
(p19) lfetch.excl [r39] //3: 7
(p17) add r40=r41,r37 //3: 8
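The way stage predicates fill and drain across passes can be illustrated with a toy model. The sketch below is written in Python rather than assembly, and it is a deliberately simplified picture of predicate rotation: the function name `stage_predicates` and its parameters are hypothetical, it ignores the loop-count and epilog-count registers, and it does not model predicates that are set explicitly by a compare (such as p18/p19 in this loop).

```python
def stage_predicates(items, stages=2, first=16):
    """Toy model: list the stage predicates that are true on each loop pass.

    items  -- number of data items fed into the loop (assumed >= 0)
    stages -- number of pipeline stages (2 models p16 and p17)
    first  -- number of the first rotating stage predicate
    """
    preds = [False] * stages   # preds[0] models p16, preds[1] models p17, ...
    remaining = items
    passes = []
    while True:
        # Rotation: every predicate moves up one position; the lowest one
        # is set again only while there are still new items to start.
        preds = [remaining > 0] + preds[:-1]
        if remaining > 0:
            remaining -= 1
        if not any(preds):
            break              # pipeline fully drained
        passes.append([f"p{first + i}" for i, on in enumerate(preds) if on])
    return passes


for number, active in enumerate(stage_predicates(3), start=1):
    print(number, active)
```

With three items and two stages, the model shows p16 alone on the first pass, p16 and p17 together on the middle passes, and p17 alone on the final pass, which is why a single item's work is found by reading the p16 instructions first and the p17 instructions next.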
Another way to trace the natural path of instructions is to follow the trail of commented numbers immediately preceding the colon at the right of each assembly language line. The compiler attempts to show the logical order of parallel instruction groups in the comment field. (A number of Intel® compilers are available for use with the Itanium® processor family. See the Intel® Software Development Products web site for more information.)
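Following the commented numbers by hand gets tedious for long loops, so a small script can help. The Python sketch below is only an illustration: `logical_order` is a hypothetical helper, the regular expression assumes the comment format seen in the listing above (two slashes, a number, a colon), and the input lines are taken from that listing and deliberately shuffled.

```python
import re

# Matches a trailing compiler comment such as "//1: 7" and captures the
# number immediately preceding the colon.
_ORDER = re.compile(r"//\s*(\d+)\s*:")


def logical_order(lines):
    """Stable-sort assembly lines by the number preceding the colon."""
    def key(line):
        match = _ORDER.search(line)
        # Lines without a numbered comment sort last, in their original order.
        return int(match.group(1)) if match else float("inf")
    return sorted(lines, key=key)


shuffled = [
    "(p17) add r40=r41,r37 //3: 8",
    "(p16) add r34=1,r35 //1: 7",
    "(p16) ld4 r36=[r33] ;; //0: 8",
    "(p19) add r39=144,r34 //2: 7",
]
for line in logical_order(shuffled):
    print(line)
```

Because the sort is stable, instructions that share a number keep the relative order in which the compiler emitted them.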
The following sample code shows the unscrambled instruction sequence with explanatory comments. There are essentially two logical sequences intermixed here: one that fetches and adds integers to the checksum, and one that prefetches cache lines every eighth time through the loop. The latter task is done by the lfetch instruction, which fetches a cache line from main memory and loads it in the highest available level in the cache hierarchy.
(p16) add r32=4,r33 // point to next integer to read
(p16) and r38=7,r35 // r38 is loop counter modulo 8
(p16) ld4 r36=[r33] ;; // fetch integer to checksum
(p16) cmp4.eq.unc p18,p0=r38,r0 // is (counter % 8) zero?
(p16) add r34=1,r35 // keep the loop counter going
// looping and register rotation happens here (br.ctop)
(p19) add r39=144,r34 // calculate next prefetch address
(p19) lfetch.excl [r39] // prefetch a cache line
(p17) add r40=r41,r37 // do the checksum addition
The two logically distinct parts of the loop are shown separated in the final passages of sample code, below. Of course, after this much rearrangement of code, the instructions no longer read as logically correct assembly-language code, since register-rotation effects are not made explicit in these modified listings. Nevertheless, this sample code serves to illustrate how a complex out-of-order mix of instructions in an original loop can be deciphered – in this case, as two small in-order loops. (Note that "out-of-order" always refers to static, not dynamic, ordering in the Itanium® processor.)
The three-instruction checksumming operation:
(p16) add r32=4,r33 // point to next integer to read
(p16) ld4 r36=[r33] ;; // fetch integer to checksum
(p17) add r40=r41,r37 // do the checksum addition
The five-instruction prefetching operation:
(p16) and r38=7,r35 // r38 is loop counter modulo 8
(p16) cmp4.eq.unc p18,p0=r38,r0 // is (counter % 8) zero?
(p16) add r34=1,r35 // keep a loop counter going
(p19) add r39=144,r34 // calculate next prefetch address
(p19) lfetch.excl [r39] // prefetch a cache line
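Read together, the two sequences amount to a simple high-level loop. The Python sketch below is a rough model of that loop, not the original source code: the function name `checksum_with_prefetch_trace`, the `base` address, and the recorded list of prefetch addresses are assumptions made for illustration, the prefetch address is approximated as the current read address plus 144 bytes, and register widths and overflow are ignored.

```python
def checksum_with_prefetch_trace(words, base=0, distance=144):
    """Sum the words and record where a prefetch would be requested.

    words    -- sequence of integers standing in for 4-byte values
    base     -- assumed byte address of the first word
    distance -- how many bytes ahead of the current read to prefetch
    """
    total = 0
    prefetched = []                   # addresses an lfetch would be issued for
    address = base
    for counter, word in enumerate(words):
        if (counter & 7) == 0:        # models "and r38=7,..." plus the compare
            prefetched.append(address + distance)
        total += word                 # models the checksum addition
        address += 4                  # models "point to next integer to read"
    return total, prefetched


print(checksum_with_prefetch_trace([1] * 16))
```

For sixteen words starting at address 0, the model adds sixteen values and requests only two prefetches, at byte offsets 144 and 176.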
The three-instruction core checksum sequence is easy enough to follow. The five-instruction prefetch sequence is more challenging. The prefetch is only performed every eighth time through the loop, and the location of the prefetched memory is about 144 bytes ahead of where the current checksum fetches are occurring. The reason that a prefetch occurs only every eighth pass has to do with the 32-byte Itanium processor cache-line size. An integer is (still) four bytes, so eight of them take up 32 bytes. Therefore, every eighth time through the checksum loop, a new cache line is needed.
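The relationship between the cache-line size and the "every eighth pass" test can be checked with a little arithmetic. The Python sketch below assumes the 32-byte line and four-byte integer stated above; the helper names are hypothetical, and the mask trick only applies when the number of elements per line is a power of two.

```python
def elements_per_line(line_bytes=32, element_bytes=4):
    """Number of elements that fit in one cache line."""
    return line_bytes // element_bytes


def prefetch_mask(line_bytes=32, element_bytes=4):
    """Bit mask that makes (counter & mask) equal to counter % elements."""
    count = elements_per_line(line_bytes, element_bytes)
    if count < 1 or count & (count - 1):
        raise ValueError("elements per line must be a power of two")
    return count - 1


def needs_prefetch(counter, mask):
    """True on the passes where a new cache line is about to be needed."""
    return (counter & mask) == 0


mask = prefetch_mask()                       # 7 for 32-byte lines, 4-byte ints
print(mask, [c for c in range(20) if needs_prefetch(c, mask)])
```

This is why the loop uses an and with 7 rather than a division: for a power-of-two element count, masking the low bits gives the same answer as the modulo operation.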
For the cache line to be ready when needed, the prefetch has to start well in advance. Assuming that the checksum loop can execute in the ideal two cycles per iteration, prefetching 144 bytes ahead corresponds to 36 four-byte integers, or about 72 cycles of lead time in which to get the prefetched data from as far away as L3 cache, or even main memory, into an L1 cache line. Although it might seem logically simpler to issue a prefetch request on every pass through the loop, too many outstanding prefetch requests can clog up the prefetch queues, so it is worth a little extra logic to keep the prefetches to a minimum.
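The lead-time estimate is easy to recompute for other prefetch distances or loop speeds. The Python sketch below simply restates the arithmetic from the paragraph above; the function name `prefetch_lead_cycles` and its parameter names are hypothetical, and the result is only as good as the assumed cycles per iteration.

```python
def prefetch_lead_cycles(distance_bytes=144, element_bytes=4,
                         cycles_per_iteration=2):
    """Approximate cycles between issuing a prefetch and using its data."""
    iterations_ahead = distance_bytes // element_bytes   # 144 / 4 = 36
    return iterations_ahead * cycles_per_iteration       # 36 * 2 = 72


print(prefetch_lead_cycles())                        # ideal 2-cycle loop
print(prefetch_lead_cycles(cycles_per_iteration=3))  # a slower loop
```

If the loop runs slower than the ideal two cycles per iteration, the same 144-byte distance buys proportionally more lead time.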
Recognizing Efficient Use of Caches in Code for the Itanium® Processor Family