# Puzzle: changing the order of outer loops leads to significant performance increase

## Puzzle: changing the order of outer loops leads to significant performance increase

Hi,

I have a puzzling finding that changing the order of the outer loops led to significant performance increase. I am playing with the following two versions of a small code piece:

Version 1: ii, k, j, i

```1529         do ii = iis, iie
1530           value = vals(ii)
1531           do k = ks, ke
1532             do j = js, je
1533               ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
1534 !DIR\$ SIMD
1535               do i = is, ie
1536                 l0 = ind_offset + i
1537                 dF(l0) = dF(l0) + value * F(l0 + ii)
1538               end do
1539             end do
1540           end do
1541         end do```

Version 2: k, ii, j, i

```1529         do k = ks, ke
1530           do ii = iis, iie
1531             value = vals(ii)
1532             do j = js, je
1533               ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
1534 !DIR\$ SIMD
1535               do i = is, ie
1536                 l0 = ind_offset + i
1537                 dF(l0) = dF(l0) + value * F(l0 + ii)
1538               end do
1539             end do
1540           end do
1541         end do```

The ONLY difference between these two versions is the order of the outermost two loops: Version 1 has a loop order of ii, k, j, i while Version 2 has a loop order of k, ii, j, i. The profiling results of these two versions are summarized as below:

CPU Time(s)     Load Instructions     L1 Cache Hits     L2 Cache Hits     L3 Cache Hits     MainMemory Hits
Version 1           11.282                1.36E+10                  75.86%                  3.46%                    20.69%                0.00%
Version 2             7.372                1.36E+10                  94.76%                  1.24%                      4.00%                0.00%

The results really surprised me in two ways:
(1) I observed a non-trivial speedup 11.282/7.372 = 1.53 and a significant increase in L1 Cache Hits.
(2) The only change I made was rearranging the order of the two OUTER loops, i.e., do ii loop and do k loop.

I have checked the vectorization report and found the inner loop (do i loop) in both of the two versions have been vectorized. So now I really have no idea what is going on. I compiled the code using ifort 13.1.0 with -O2 -xHost. The loop bound (length) for each loop level is:

do ii loop: 5
do k loop: 48
do j  loop: 40
do i  loop: 36

I truly appreciate your time and help.

Best regards,
Wentao

5 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

Hi, Wentao

In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small

While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit  L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.

Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.

Hope this explains.

Quote:

Yolanda Chen (Intel) wrote:

Hi, Wentao

In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small

While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit  L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.

Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.

Hope this explains.

Hi Yolanda,

You explanations REALLY helps. Many thanks!

Best regards,
Wentao

I think another factor is at play as well

(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

I think another factor is at play as well

(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.

Jim Dempsey

Hi Jim,

Thanks for pointing out this :-)

Best regards,
Wentao