# What are PEEL and REMAINDER loops? (Fortran and C vectorization support)

By Martyn Corden (Intel), published on October 28, 2014

The vectorization component of the optimization report produced by the Intel® Compiler version 15 may contain remarks such as:

    <Peeled>
       remark #15301: PEEL LOOP WAS VECTORIZED
    <Remainder>
       remark #15301: REMAINDER LOOP WAS VECTORIZED

The compiler does its best to create an efficient, vectorized loop kernel. The “peel” and “remainder” loops are those left-over iterations of the original source loop that occur before and after the main vectorized kernel, respectively. Consider the following example (in Fortran, but similar considerations apply to C and C++; see the example at the end of the article):

    #ifndef DIM_A
    #define DIM_A 8
    #endif

    subroutine pe_rm(a,n)
      implicit none
      real, dimension(DIM_A), intent(inout) :: a
      integer,                intent(in)    :: n
      integer                               :: i

    #ifdef ALIGN_A
    !dir$ assume_aligned a:ALIGN_A
    #endif
      do i=1,DIM_A
         a(i) = sin(a(i))
      enddo

    end subroutine pe_rm

Here, a Fortran77 style DO loop is used, but an array assignment would behave in exactly the same way.

Let’s consider first the simple, default case of a small, fixed size array and compile to create an optimization report and an assembly file:

    ifort -c -S -g -O2 -unroll0 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=8

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,13) ]
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,6) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15300: LOOP WAS VECTORIZED
       remark #15450: unmasked unaligned unit stride loads: 1
       remark #15451: unmasked unaligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 20.000
       remark #15478: estimated potential speedup: 5.070
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
       remark #25015: Estimate of max trip count of loop=2
    LOOP END
    ===========================================================================

In this first example, the -unroll0 option has been used to prevent further unrolling of the vectorized loop. This doesn’t change the behavior we’re interested in, but it simplifies the interpretation.

The default target instruction set is Intel® Streaming SIMD Extensions 2 (Intel® SSE2). The corresponding vector registers are 16 bytes wide, so a single Intel® SSE instruction can operate on 4 REAL(4) data elements at once. The 8 iterations of the original scalar loop can be processed in just two iterations of the vectorized loop, with no scalar iterations left over. The compiler does not know the data alignment (more on that later), so it contents itself with unaligned loads and stores from and to memory. The estimated speedup of 5x is somewhat greater than the less than 4x you might have expected, because the default vectorized math functions are very slightly less accurate but faster than the scalar versions.

The “loop cost” numbers can be thought of as the vectorizer’s rough estimates of the execution time per iteration of the original loop, in arbitrary units, for the original scalar version and for the vectorized loop version. They are used to estimate the likelihood of a performance gain if the loop is vectorized. The actual execution times may of course depend strongly on factors not known to the compiler at vectorization time, including subsequent compiler optimizations, cache and memory access patterns, etc.
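As a rough illustration of the arithmetic above, the following C sketch computes how many elements fit in a vector register and how a scalar trip count splits into vector iterations and left-over scalar iterations. The type `trip_split` and the function `split_trip_count` are hypothetical names introduced here for illustration; they are not part of any compiler interface, and the model deliberately ignores peel loops and unrolling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: how a scalar trip count splits across a
   vectorized kernel when no peel loop is generated. */
typedef struct {
    size_t lanes;        /* elements processed per vector instruction */
    size_t vector_iters; /* iterations of the vectorized kernel       */
    size_t remainder;    /* scalar iterations left over afterwards    */
} trip_split;

/* register_bytes: 16 for Intel SSE2; elem_bytes: 4 for REAL(4) or float. */
static trip_split split_trip_count(size_t n, size_t register_bytes,
                                   size_t elem_bytes)
{
    trip_split s;

    assert(elem_bytes > 0 && register_bytes >= elem_bytes);

    s.lanes = register_bytes / elem_bytes;
    s.vector_iters = n / s.lanes;
    s.remainder = n % s.lanes;
    return s;
}
```

Calling `split_trip_count(8, 16, 4)` gives 4 lanes, 2 vector iterations and no left-over iterations, which matches the maximum trip count of 2 that the report shows for the vectorized loop.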

Next, let’s change the array size and loop trip count from 8 to 10. This would correspond to 2½ iterations of our vectorized loop kernel. What is the meaning of “half an iteration”? This is the origin of the <Remainder> or “remainder loop”:

    ifort -c -S -g -O2 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=10

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,13) ]
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,6) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15300: LOOP WAS VECTORIZED
       remark #15450: unmasked unaligned unit stride loads: 1
       remark #15451: unmasked unaligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 20.000
       remark #15478: estimated potential speedup: 2.790
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
       remark #25015: Estimate of max trip count of loop=2
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
    <Remainder>
       remark #25436: completely unrolled by 2
    LOOP END
    ===========================================================================

The first 8 scalar iterations still correspond to the two iterations of the vectorized loop kernel, but the two additional iterations are not enough for another iteration of the vectorized kernel, so they have become a remainder loop. The remainder loop was subsequently optimized by unrolling its two iterations.
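The split between kernel and remainder can be written out by hand. The C sketch below is only an illustration of the loop structure, not the code the compiler emits: the inner `lane` loop stands in for a single vector instruction, `LANES` is assumed to be 4 (floats per 16-byte Intel SSE register), and a simple doubling replaces the sine call so that the example needs no math library. The function name `scale_with_remainder` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4u /* assumed: 4 floats per 16-byte Intel SSE register */

/* Doubles each element in place and returns how many elements were
   handled by the remainder loop. The doubling is only a stand-in for
   the real loop body (sin in the article's example). */
static size_t scale_with_remainder(float *a, size_t n)
{
    size_t i = 0;
    size_t remainder;

    assert(a != NULL || n == 0);

    /* Main kernel: each outer iteration models one vector iteration
       that processes LANES consecutive elements. */
    for (; i + LANES <= n; i += LANES) {
        size_t lane;
        for (lane = 0; lane < LANES; lane++)
            a[i + lane] = 2.0f * a[i + lane];
    }

    /* Remainder loop: fewer than LANES elements are left. */
    remainder = n - i;
    for (; i < n; i++)
        a[i] = 2.0f * a[i];

    return remainder;
}
```

For `n` equal to 10 the kernel runs twice and the remainder loop handles the last 2 elements; for `n` equal to 8 the remainder loop does nothing.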

If you look at the assembly file, pe_rm.s, you will see comments following the vectorized kernel that mark the remainder loop, followed by the assembly code for the remainder loop. The estimated speedup is almost a factor of 2 lower. This is because execution time is dominated by the expensive sine function; in addition to the two calls from the two iterations of the vectorized kernel, we now have two additional calls from the remainder loop.

Next, let’s increase the array size to 500 elements. The extra work allows the compiler to amortize the overhead from setting up some additional optimization:

    ifort -c -S -g -O2 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=500

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
    <Peeled>
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,13) ]
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,6) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15442: entire loop may be executed in remainder
       remark #15448: unmasked aligned unit stride loads: 1
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 19.250
       remark #15478: estimated potential speedup: 5.300
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
       remark #25015: Estimate of max trip count of loop=125
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
    <Remainder>
    LOOP END
    ===========================================================================

We see that the compiler has created an additional loop, with the annotation <Peeled>, and that the formerly unaligned array accesses in the vectorized loop kernel have been replaced by aligned loads and stores. Loads and stores using Intel® SSE instructions are more efficient when the base memory address is aligned on a 16 byte boundary (similarly, 32 bytes for Intel® Advanced Vector Extensions or 64 bytes for Intel® Initial Many Core Instructions that target the Intel® Xeon Phi™ coprocessor). At compile time, the compiler does not know the alignment of data that is passed as a subroutine argument (it would have more chance to know if the array was accessed through a module). At run time, if the address of a(1) is not aligned at a 16 byte boundary, the generated code executes loop iterations in scalar mode until it reaches a value of i for which a(i) is aligned on a 16 byte boundary. This and subsequent iterations, apart from a possible remainder loop, are then executed in the vector kernel using aligned loads and stores. Those initial iterations executed prior to the vectorized kernel are commonly known as the “peel loop”, because the iterations are “peeled” away one by one until the data accesses in the main loop kernel are aligned. In the assembly file, the corresponding code is prefixed by a comment identifying it as the peel loop.

Sometimes you know that the array argument will always be aligned on a 16 byte boundary (for example, if the entire array is passed and the program is compiled with -align array16byte). In this case, you can assert alignment to the compiler using an ASSUME_ALIGNED directive. Doing this for the above example:

    ifort -c -S -g -O2 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=500 -DALIGN_A=16

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,13) ]
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,6) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15448: unmasked aligned unit stride loads: 1
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 19.250
       remark #15478: estimated potential speedup: 5.500
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
       remark #25015: Estimate of max trip count of loop=125
    LOOP END
    ===========================================================================

Because the compiler knows that the array is already aligned, it does not generate a peel loop, but the vectorized loop kernel still has aligned memory accesses. Because the loop count is known and an exact multiple of the vector length, no remainder loop is needed. The compiler estimates a small additional speedup.
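The length of a peel loop follows directly from the starting address of the array, which is why no peel loop is needed once alignment is known. The C sketch below shows one way that count could be computed at run time; it is an assumption-laden illustration, not the compiler's actual code. The function name `peel_count` is hypothetical, and the sketch assumes the array holds 4-byte floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of leading float elements to process in scalar mode so that
   the element after them starts on an `alignment`-byte boundary.
   The result is capped at n, the total trip count. */
static size_t peel_count(const float *a, size_t n, size_t alignment)
{
    size_t misalign;
    size_t peel;

    assert(alignment >= sizeof(float) && alignment % sizeof(float) == 0);

    misalign = (size_t)((uintptr_t)a % alignment);
    if (misalign == 0)
        return 0;                 /* already aligned: no peel loop */

    if (misalign % sizeof(float) != 0)
        return n;                 /* no element is ever aligned: all scalar */

    peel = (alignment - misalign) / sizeof(float);
    return peel < n ? peel : n;
}
```

With a 16-byte target, an array that starts 4 bytes past a boundary needs 3 peeled iterations, while an array that starts exactly on a boundary needs none, which corresponds to the case where alignment has been asserted.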

It is important to be sure that the data will always be aligned before using an ASSUME_ALIGNED or VECTOR ALIGNED directive. If it is not, the application may fault on some platforms, or run more slowly on others. See the article at https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization for more discussion of data alignment. The impact of data alignment on performance can vary considerably depending on the target processor and on the ratio of memory access to computation in the loop. However, the performance benefit from aligning data is often much less than the benefit from vectorizing the loop in the first place.
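One defensive habit is to allocate with an explicit alignment and to check the address before relying on an alignment assertion. The following C sketch assumes a C11 library that provides `aligned_alloc`; the helper names `is_aligned` and `alloc_floats_16` are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if p lies on an `alignment`-byte boundary. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return (uintptr_t)p % alignment == 0;
}

/* Allocates room for n floats on a 16-byte boundary using C11
   aligned_alloc. The size is rounded up to a multiple of 16 because
   C11 asks for the size to be a multiple of the alignment. Overflow of
   n * sizeof(float) is not checked in this sketch. Release the memory
   with free(). */
static float *alloc_floats_16(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 15u) / 16u * 16u;

    if (rounded == 0)
        rounded = 16;
    return aligned_alloc(16, rounded);
}
```

A caller could check `is_aligned(p, 16)` in a debug build before handing `p` to a routine that was compiled with an alignment assertion.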

The PEEL and REMAINDER loops may themselves be vectorized if that seems likely to help performance, even though they typically have insufficient data to completely fill a vector register. This is more likely on architectures or micro-architectures with wider registers. Suppose we compile our example for native execution on an Intel Xeon Phi coprocessor (if software supporting the Intel Xeon Phi coprocessor is not installed, you may see warning message #10362, but the compilation should still proceed):

    ifort -c -S -g -O2 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=500 -mmic

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
    <Peeled>
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,13) ]
       remark #15389: vectorization support: reference a has unaligned access   [ pe_rm.F90(15,6) ]
       remark #15381: vectorization support: unaligned access used inside loop body
       remark #15301: PEEL LOOP WAS VECTORIZED
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,13) ]
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,6) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 4.810
       remark #15478: estimated potential speedup: 20.150
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
       remark #25018: Total number of lines prefetched=2
       remark #25019: Number of spatial prefetches=2, dist=8
       remark #25021: Number of initial-value prefetches=4
       remark #25139: Using second-level distance 4 for prefetching spatial memory reference   [ pe_rm.F90(15,17) ]
       remark #25015: Estimate of max trip count of loop=31
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
    <Remainder>
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,13) ]
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,6) ]
       remark #15301: REMAINDER LOOP WAS VECTORIZED
    LOOP END
    ===========================================================================

In this case, both peel and remainder loops have been vectorized, using unaligned and aligned memory accesses respectively. Because of the greater width of the vector registers, 64 bytes compared to 16 bytes for Intel SSE, the estimated speedup is nearly 4 times greater. Again, the high level remarks are embedded in the assembly file as comments.

We have used explicit array sizes in order to more clearly illustrate and explain compiler behavior. In practice, array sizes are rarely explicit and are typically variable at run-time. In our example, this corresponds to compiling with -DDIM_A=n:

    ifort -c -S -g -O2 -qopt-report=4 -qopt-report-phase=loop,vec \
          -qopt-report-file=stderr pe_rm.F90 -DDIM_A=n

    Begin optimization report for: PE_RM

        Report from: Loop nest & Vector optimizations [loop, vec]

    LOOP BEGIN at pe_rm.F90(14,3)
    <Peeled>
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,13) ]
       remark #15388: vectorization support: reference a has aligned access   [ pe_rm.F90(15,6) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15442: entire loop may be executed in remainder
       remark #15448: unmasked aligned unit stride loads: 1
       remark #15449: unmasked aligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 106
       remark #15477: vector loop cost: 19.250
       remark #15478: estimated potential speedup: 5.180
       remark #15479: lightweight vector operations: 6
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15488: --- end vector loop cost summary ---
    LOOP END

    LOOP BEGIN at pe_rm.F90(14,3)
    <Remainder>
    LOOP END
    ===========================================================================

In the absence of information to the contrary, the compiler assumes that the trip count is likely to be fairly large and vectorizes accordingly. The behavior is very similar to that seen for an explicit array size of 500, with the generation of peel and remainder loops. Hints to the compiler as to likely values of the loop trip count can be provided via a directive. For example, !DIR$ LOOP COUNT (8) would cause the compiler to generate one optimized loop version for a trip count of exactly 8, similar to our first example, and another loop version for the general case, similar to the last example.
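The effect of such a hint can be pictured as two versions of the same loop selected at run time. The C sketch below writes that selection out by hand purely as an illustration; what the compiler actually generates may differ. The name `scale_versioned` and the constant `EXPECTED_COUNT` are hypothetical, and a simple doubling stands in for the real loop body.

```c
#include <assert.h>
#include <stddef.h>

#define EXPECTED_COUNT 8u /* the hinted trip count from the example */

/* Doubles each element in place. Returns 1 if the version specialized
   for EXPECTED_COUNT was taken and 0 if the general version ran. */
static int scale_versioned(float *a, size_t n)
{
    size_t i;

    assert(a != NULL || n == 0);

    if (n == EXPECTED_COUNT) {
        /* Specialized version: the trip count is a compile-time
           constant, so no run-time trip count handling is needed. */
        for (i = 0; i < EXPECTED_COUNT; i++)
            a[i] = 2.0f * a[i];
        return 1;
    }

    /* General version: the trip count is only known at run time, so a
       vectorizing compiler may add peel and remainder loops here. */
    for (i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
    return 0;
}
```

Both branches compute the same result; the point of the split is that the first branch gives the optimizer a fixed trip count to work with.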

The following C source code will give results and report messages very close to the preceding Fortran example. Replacing ifort by icc on the command line invokes the C compiler. The equivalent C syntax for the loop count directive is #pragma loop_count(8)

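    #ifndef DIM_A
    #define DIM_A 8
    #endif

    #include <math.h>   /* declares sinf */

    void pe_rm(float *a, int n)
    {
    #ifdef ALIGN_A
      /* assert to the compiler that a is aligned on an ALIGN_A byte boundary */
      __assume_aligned (a,ALIGN_A);
    #endif
      int i;

      for (i=0; i<DIM_A; i++)
        a[i] = sinf(a[i]);
    }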