I'm compiling some Fortran code with the `-xAVX` option (ifort version 12.1.0 20111011). Depending on whether the code is standalone or part of a subroutine, the compiler vectorizes it differently, and I was wondering why. The code is as follows:
```fortran
program example
  implicit none
  double precision, dimension(4, 180000, 4) :: qold, q, res
  double precision, dimension(4, 180000) :: adt
  double precision, dimension(4) :: adti, del
  integer :: i, j, diff, acc

  diff = 180000
  do j = 1, diff
    adti(:) = 1.0 / adt(:,j)              ! line 10
    do i = 1, 4
      del(:) = adti(:) * res(:,j,i)       ! line 12
      q(:,j,i) = qold(:,j,i) - del(:)     ! line 13
      res(:,j,i) = 0.0                    ! line 14
    end do
  end do
  acc = sum(q)
  print *, acc
end program
```
With `-vec-report`, the compiler reports that the loops were indeed vectorized. Furthermore, in the assembly all the relevant instructions use the ymm registers.

However, if I put the code in a subroutine that takes `qold`, `q`, `res`, `adt` and `diff` as arguments, and place that subroutine in a module, things go differently. This time `-vec-report` tells me that only lines 10 and 12 were vectorized; for lines 13 and 14 it reports that vectorization is "possible but seems inefficient".

If I force vectorization with the `!DIR$ SIMD` directive before lines 13 and 14, it then reports that the SIMD loop was vectorized. However, looking at the assembly, instead of 256-bit AVX operations on the ymm registers it seems to be using SSE-width (128-bit) operations on the xmm registers:
For line 13:

```asm
vsubpd generated_module_mp_update_kernel_caller_$DEL.0.2(%rip), %xmm6, %xmm7
vsubpd 16+generated_module_mp_update_kernel_caller_$DEL.0.2(%rip), %xmm8, %xmm9
```

For line 14:

```asm
vxorpd %xmm0, %xmm0, %xmm0
vmovupd %xmm0, (%rbx,%r9)
vmovupd %xmm0, 16(%rbx,%r9)
```
In both cases the memory operation is split into two nearly identical instructions on xmm registers whose addresses differ by 16 bytes (128 bits, the width of an SSE register). So it seems to be issuing two 128-bit vector instructions where a single 256-bit AVX instruction would do.
Does anyone know why this happens and how to change this?
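For reference, the module/subroutine variant described above is not shown in full, so here is a hypothetical sketch of what it might look like. The module and subroutine names are taken from the mangled symbol `generated_module_mp_update_kernel_caller_$DEL.0.2` in the assembly; the explicit-shape dummy arguments and their `intent`s are assumptions (assumed-shape `dimension(:,:,:)` arguments would be another possibility the question doesn't rule out). A small driver with made-up input values is included so the sketch compiles and runs on its own.

```fortran
module generated_module
  implicit none
contains
  ! Hypothetical reconstruction: argument list (qold, q, res, adt, diff) as
  ! described in the question; explicit-shape declarations are an assumption.
  subroutine update_kernel_caller(qold, q, res, adt, diff)
    integer, intent(in) :: diff
    double precision, dimension(4, diff, 4), intent(in)    :: qold
    double precision, dimension(4, diff, 4), intent(out)   :: q
    double precision, dimension(4, diff, 4), intent(inout) :: res
    double precision, dimension(4, diff),    intent(in)    :: adt
    double precision, dimension(4) :: adti, del
    integer :: i, j

    do j = 1, diff
      adti(:) = 1.0 / adt(:,j)            ! "line 10" in the standalone version
      do i = 1, 4
        del(:) = adti(:) * res(:,j,i)     ! "line 12"
        q(:,j,i) = qold(:,j,i) - del(:)   ! "line 13"
        res(:,j,i) = 0.0                  ! "line 14"
      end do
    end do
  end subroutine update_kernel_caller
end module generated_module

program driver
  use generated_module
  implicit none
  integer, parameter :: n = 180000
  double precision, allocatable :: qold(:,:,:), q(:,:,:), res(:,:,:), adt(:,:)
  double precision :: total

  ! Made-up inputs: every q element becomes 2 - (1/1)*1 = 1.
  allocate(qold(4,n,4), q(4,n,4), res(4,n,4), adt(4,n))
  qold = 2.0d0
  res  = 1.0d0
  adt  = 1.0d0

  call update_kernel_caller(qold, q, res, adt, n)

  total = sum(q)
  print *, total
  ! 4*n*4 elements, each equal to 1, so the sum should be exactly 2880000.
  if (total /= 4.0d0 * 4 * n) error stop "unexpected sum"
end program driver
```

The driver only exists to make the sketch self-contained; it says nothing about how the real caller allocates or aligns its arrays.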