Single-precision Complex Optimisation

Hi,

I'm having trouble with the optimisation of single-precision complex arithmetic. For a single complex number it does well (i.e. an add is movsd/addps/movsd), but when working on arrays it repeats the same scalar sequence for every element, wasting the upper half of the xmm registers.

For example, this (trivial) function

pure function array_plus_array ( left, right ) result ( r )
  use, intrinsic :: iso_c_binding, only: C_FLOAT
  complex(kind=C_FLOAT), dimension(2), intent(in) :: left
  complex(kind=C_FLOAT), dimension(2), intent(in) :: right
  complex(kind=C_FLOAT), dimension(2)             :: r
  r = left + right
end function array_plus_array
compiles (using /QxHost /O3 on an i7 920) into
; parameter 1: rcx
; parameter 2: rdx
; parameter 3: r8
mov       r9, QWORD PTR [rcx]
movsd     xmm1, QWORD PTR [rdx]
movsd     xmm0, QWORD PTR [r8]
addps     xmm1, xmm0
movsd     QWORD PTR [r9], xmm1
movsd     xmm1, QWORD PTR [8+rdx]
movsd     xmm0, QWORD PTR [8+r8]
addps     xmm1, xmm0
movsd     QWORD PTR [8+r9], xmm1
mov       rax, rcx
ret
I've trimmed it to show the important part: it's still working on each complex number separately.

A much better routine would be along the lines of

mov       r9, QWORD PTR [rcx]
movups    xmm1, XMMWORD PTR [rdx]
movups    xmm0, XMMWORD PTR [r8]
addps     xmm1, xmm0
movups    XMMWORD PTR [r9], xmm1
mov       rax, rcx
ret
This essentially halves the number of instructions. While not important in this example, I'm working with arrays of ~150 million elements or more, so it adds up quickly.

Is there a way to get better optimisation?

As a potential solution, I have C routines written using intrinsics that generate better assembly. If I compile both the C and Fortran using /Qipo, will it be able to optimise across the mixed languages?

Cheers,
Ben

EDIT: Sorry, my version of ifort is 12.1.0.233, build 20110811 (Intel 64).
EDIT: Fixed suggested assembly (changed movpd to movups).


IPO will optimize across the languages. I'm not sufficiently familiar with the instruction set to judge your suggested code but I will pass it on to the developers for their comments.

Steve - Intel Developer Support

In the example of length-2 arrays, the compiler may decide that vectorizing the complex operations "seems inefficient" even where it could recognize data alignment. !dir$ vector aligned can override the "seems inefficient" heuristic; in the case presented here, though, there is no possibility of determining alignment at compile time.
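For illustration only, the directive is placed immediately before the loop it applies to. Written with an explicit loop for clarity, your original function might look something like the sketch below; the alignment assertion is yours to keep at every call site.

pure function array_plus_array ( left, right ) result ( r )
  use, intrinsic :: iso_c_binding, only: C_FLOAT
  complex(kind=C_FLOAT), dimension(2), intent(in) :: left
  complex(kind=C_FLOAT), dimension(2), intent(in) :: right
  complex(kind=C_FLOAT), dimension(2)             :: r
  integer :: i
  ! Assert that left, right and r are all 16-byte aligned; the compiler
  ! trusts this, so it must actually hold for every caller.
  !DIR$ VECTOR ALIGNED
  do i = 1, size(r)
    r(i) = left(i) + right(i)
  end do
end function array_plus_array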
While not applicable to this case, the option -complex-limited-range is likely to be needed to get vector speedup (at the expense of the limited range) for sequences including complex abs, divide, and sqrt.
Also not applicable to this case: for double-precision complex, even though the compiler optimizes with SIMD instructions, it will report vectorization only for AVX (and MIC) compilations, since the 128-bit SIMD instructions hold only a single real/imaginary pair.

Thanks guys.

Yes, this case is rather trivial. However, I normally use derived types containing an array, like this

type :: testType
   complex(kind=C_FLOAT), dimension(4, 4, 3, 3) :: sc
end type testType
and have helper functions that look like
elemental type(testType) function add_testType ( left, right ) result ( r )
  type(testType), intent(in) :: left
  type(testType), intent(in) :: right
  r%sc = left%sc + right%sc
end function add_testType
I get the same assembly, just completely unrolled (one set of movsd/addps/movsd for each of the 144 elements). Turning vec-report on, it reports this
D:\Code\testing\program.f90(14): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
D:\Code\testing\program.f90(14): (col. 5) remark: loop was not vectorized: vectorization possible but seems inefficient.
D:\Code\testing\program.f90(14): (col. 5) remark: loop was not vectorized: not inner loop.
If I replace the dimension(4, 4, 3, 3) with dimension(144), it does indeed vectorise it (but not completely unrolled, and it looks like it's checking for alignment - is there any way to force all variables of this type to be 16-byte aligned, rather than using directives for each one?). I'm guessing it's getting hung up on the multiple small dimensions, even though they're sequential in memory.

Cheers,
Ben

The only ways the compiler could take advantage of alignment would be by the directive (where you take responsibility that the operands are aligned) or by interprocedural analysis, with the caller that declares the arrays and the callee compiled in a single invocation of ifort.
I don't know whether specifying a sequence attribute would help out; without that or the inter-procedural analysis, the compiler can't assume it.
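For reference, that would just mean adding the attribute inside the type definition (whether it actually helps the optimizer here is the open question):

type :: testType
   sequence   ! storage association: components stored in declaration order
   complex(kind=C_FLOAT), dimension(4, 4, 3, 3) :: sc
end type testType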
I submitted a premier.com report myself recently on a case where the compiler doesn't optimize well with 4 subscripts, some of them short sizes. You are certainly entitled to file such a report where you believe the compiler should do better.
ifort doesn't go out of its way to unroll loops fully; I suppose the alignment and sequence questions are more important here.

You can tell the compiler to assume that the operands are aligned as follows

!DIR$ ASSUME_ALIGNED left:16,right:16

but I didn't find that it helped. There are various directives such as !DIR$ SIMD which may be of help.
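For placement, the directive goes in the body of the routine whose dummy arguments it names; in the derived-type helper from above it would look something like this (just a sketch of where it goes, not a claim that it helps here):

elemental type(testType) function add_testType ( left, right ) result ( r )
  type(testType), intent(in) :: left
  type(testType), intent(in) :: right
  ! Tell the compiler it may assume both dummies start on a 16-byte boundary;
  ! passing misaligned actual arguments then becomes a silent error.
  !DIR$ ASSUME_ALIGNED left:16, right:16
  r%sc = left%sc + right%sc
end function add_testType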

You can't attach alignment to a type, but I think you'll tend to get at least 16-byte alignment anyway.

Steve - Intel Developer Support

Under normal circumstances, X64 gives 16-byte alignment, unless you pass a section in the middle of an array or a COMMON block or derived type with misalignment. So the compiler can't assume alignment for separate compilation; it has to see the connection to the declaration.
!dir$ simd includes the effect of !dir$ vector always (the compiler skips its "seems inefficient" analysis) but doesn't assert alignment.

Under normal circumstances, you will get 16-byte alignment ...

When I used the win32 API function LocalAlloc for memory allocation, it sometimes returned 8-byte-aligned memory and I had problems with optimized code, so do not count on 16-byte alignment from that function.

We were discussing X64, which has better default alignments than win32. Besides, when you go to API or C++ programming, you expose more problems than you see in portable Fortran. In win32, you often require special functions such as _aligned_malloc. The early version of that, _mm_malloc, didn't translate consistently to X64.
