I have a puzzling finding that changing the order of the outer loops led to significant performance increase. I am playing with the following two versions of a small code piece:
Version 1: ii, k, j, i
1529 do ii = iis, iie 1530 value = vals(ii) 1531 do k = ks, ke 1532 do j = js, je 1533 ind_offset = ( (k-1)*N2 + (j-1) ) * N1g 1534 !DIR$ SIMD 1535 do i = is, ie 1536 l0 = ind_offset + i 1537 dF(l0) = dF(l0) + value * F(l0 + ii) 1538 end do 1539 end do 1540 end do 1541 end do
Version 2: k, ii, j, i
1529 do k = ks, ke 1530 do ii = iis, iie 1531 value = vals(ii) 1532 do j = js, je 1533 ind_offset = ( (k-1)*N2 + (j-1) ) * N1g 1534 !DIR$ SIMD 1535 do i = is, ie 1536 l0 = ind_offset + i 1537 dF(l0) = dF(l0) + value * F(l0 + ii) 1538 end do 1539 end do 1540 end do 1541 end do
The ONLY difference between these two versions is the order of the outermost two loops: Version 1 has a loop order of ii, k, j, i while Version 2 has a loop order of k, ii, j, i. The profiling results of these two versions are summarized as below:
CPU Time(s) Load Instructions L1 Cache Hits L2 Cache Hits L3 Cache Hits MainMemory Hits
Version 1 11.282 1.36E+10 75.86% 3.46% 20.69% 0.00%
Version 2 7.372 1.36E+10 94.76% 1.24% 4.00% 0.00%
The results really surprised me in two ways:
(1) I observed a non-trivial speedup 11.282/7.372 = 1.53 and a significant increase in L1 Cache Hits.
(2) The only change I made was rearranging the order of the two OUTER loops, i.e., do ii loop and do k loop.
I have checked the vectorization report and found the inner loop (do i loop) in both of the two versions have been vectorized. So now I really have no idea what is going on. I compiled the code using ifort 13.1.0 with -O2 -xHost. The loop bound (length) for each loop level is:
do ii loop: 5
do k loop: 48
do j loop: 40
do i loop: 36
I truly appreciate your time and help.