I'm having trouble with the optimisation of single-precision complex arithmetic. For a single (scalar) complex number the compiler does well (it adds via movsd/addps/movsd), but when working on arrays it uses the same per-element sequence, wasting the upper half of the xmm registers.
For example, this (trivial) function

    pure function array_plus_array ( left, right ) result ( r )
      complex(kind=C_FLOAT), dimension(2), intent(in) :: left
      complex(kind=C_FLOAT), dimension(2), intent(in) :: right
      complex(kind=C_FLOAT), dimension(2) :: r
      r = left + right
    end function array_plus_array

compiles (using /QxHost /O3 on an i7 920) into

    ; parameter 1: rcx
    ; parameter 2: rdx
    ; parameter 3: r8
    mov r9, QWORD PTR [rcx]
    movsd xmm1, QWORD PTR [rdx]
    movsd xmm0, QWORD PTR [r8]
    addps xmm1, xmm0
    movsd QWORD PTR [r9], xmm1
    movsd xmm1, QWORD PTR [8+rdx]
    movsd xmm0, QWORD PTR [8+r8]
    addps xmm1, xmm0
    movsd QWORD PTR [8+r9], xmm1
    mov rax, rcx
    ret

I've trimmed it to show the important stuff, in that it's still working on each complex number separately.
A much better routine would be along the lines of

    mov r9, QWORD PTR [rcx]
    movups xmm1, XMMWORD PTR [rdx]
    movups xmm0, XMMWORD PTR [r8]
    addps xmm1, xmm0
    movups XMMWORD PTR [r9], xmm1
    mov rax, rcx
    ret

This essentially halves the number of instructions. While not important in this example, I'm working with arrays of ~150 million elements or more, so it adds up quickly.
Is there a way to get better optimisation?
As a potential solution, I have C routines written using intrinsics that generate better assembly. If I compile both the C and Fortran using /Qipo, will it be able to optimise across the mixed languages?
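For reference, the kind of C intrinsics routine I mean looks roughly like the sketch below. It is a minimal illustration, not my actual code: the name `cadd_packed` and its argument list are made up for this post, and it assumes the complex values are stored as interleaved (re, im) floats, which is how a `complex(kind=C_FLOAT)` array is laid out. Unaligned loads and stores are used so that no alignment assumption is needed.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Hypothetical helper: r[i] = left[i] + right[i] for n single-precision
   complex numbers stored as interleaved (re, im) floats, i.e. 2*n floats. */
void cadd_packed(const float *left, const float *right, float *r, size_t n)
{
    size_t nfloats = 2 * n;
    size_t i = 0;

    /* Main loop: each 128-bit register holds two complex numbers,
       so one addps handles both of them. */
    for (; i + 4 <= nfloats; i += 4) {
        __m128 a = _mm_loadu_ps(left + i);   /* unaligned load, like movups */
        __m128 b = _mm_loadu_ps(right + i);
        _mm_storeu_ps(r + i, _mm_add_ps(a, b));
    }

    /* Scalar tail: handles the last complex number when n is odd. */
    for (; i < nfloats; ++i) {
        r[i] = left[i] + right[i];
    }
}
```

A routine like this could be called from Fortran through an `ISO_C_BINDING` interface with `bind(C)`, passing the array length by value.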
EDIT: Sorry, my version of ifort is 184.108.40.206, build 20110811 (Intel 64).
EDIT: Fixed suggested assembly (changed movpd to movups).