PAOS - Packed Array Of Structures

by Jim Dempsey

Programmers are familiar with the small-vector capability of today’s processors, referred to as Single Instruction – Multiple Data (SIMD). SIMD, in Intel Architecture (IA32, IA64, Intel64), was introduced with MMX, then followed by SSE, SSE2, SSE3, SSSE3, AVX, AVX2 (and a few others). The major benefit of SIMD is that a single instruction can perform the same operation on multiple pieces of data at the same time. Though the SIMD instructions support different sizes and types of data (byte, word, dword, qword, dqword, float, double, …), this article addresses the use of SIMD for manipulating doubles, known as REAL*8 in FORTRAN (a.k.a. DOUBLE PRECISION).

Most processors support the SSE2 or SSE3 SIMD instruction formats, and the recent Intel* Core* i7 (Sandy Bridge) supports AVX, with AVX2 and newer SIMD formats in various stages of development. From the high-level programmer’s viewpoint, the main difference is the width of the data (the number of multiple data) manipulated by a SIMD instruction. While the high-level language programmer relies on the compiler to generate optimal code paths for the instruction differences, this same programmer can take advantage of knowing the width of the MD portion of SIMD when laying out the data.

The SIMD widths supported are:

SSEn    2-wide doubles (2 x 64-bit)

AVX    4-wide doubles (4 x 64-bit)

AVX2    4-wide doubles (4 x 64-bit)

MIC    8-wide doubles (8 x 64-bit) (a.k.a. “Xeon Phi”)

To take advantage of the SIMD capability of the processor, the programmer needs to design the data layout used by the application. The ideal layout is one in which data undergoing the same manipulations are adjacent in memory. When using arrays of data, programmers need not worry about programming considerations other than the nesting order of array indexes for multi-dimensional arrays. This article addresses programming issues relating to data other than arrays of single elements.

To date, most papers in the technical literature discuss two programming techniques for data layout:

AOS    Array of structures

SOA    Structure of arrays

This article presents a third alternative, which is a hybrid of AOS and SOA:

PAOS    Packed array of structures

Or, alternately the programming technique can be called:

PSOA    Partitioned structure of arrays

Programmers tend to prefer array of structures (AOS) during development. This is mostly due to how the program debugger works. With AOS, by providing a single pointer or reference to a structure in the debugger, the programmer can then explore the entire contents of the structure. This makes it very convenient to follow the flow of the program during a debug session.

When debugging sessions are mostly complete, the programmer can assess whether structure of arrays (SOA) might yield better performance. This presents a dilemma: the conversion from AOS to SOA is a non-trivial process that may break working code. AOS has better cache locality for intra-structure manipulations; SOA has better cache locality for inter-structure manipulations. Considering that major portions of any application manipulate data that are neither purely AOS nor purely SOA, projecting the performance impact of a conversion is difficult at best.
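In C++ terms, the AOS and SOA layouts discussed above can be sketched as follows (a minimal sketch with hypothetical field and type names, not the simulator's own types):

```cpp
#include <cstddef>

// AOS: one struct per bead; all fields of one bead are adjacent in memory.
// Good cache locality for intra-structure (per-bead) work.
struct BeadAOS {
    double pos[3];
    double vel[3];
};

// SOA: one array per field; the same field of all beads is adjacent,
// so a loop over beads streams through contiguous memory and vectorizes well.
struct BeadsSOA {
    static const std::size_t N = 1024;   // hypothetical bead count
    double posX[N], posY[N], posZ[N];
    double velX[N], velY[N], velZ[N];
};

// Inter-structure (per-field) Euler update: contiguous in SOA, strided in AOS.
void eulerSOA(BeadsSOA& b, double dT) {
    for (std::size_t i = 0; i < BeadsSOA::N; ++i) b.posX[i] += b.velX[i] * dT;
    for (std::size_t i = 0; i < BeadsSOA::N; ++i) b.posY[i] += b.velY[i] * dT;
    for (std::size_t i = 0; i < BeadsSOA::N; ++i) b.posZ[i] += b.velZ[i] * dT;
}
```

The same update over `BeadAOS[]` would touch `pos[0..2]` then skip over `vel[0..2]` each iteration, which is the strided access pattern the AOS assembly later in this article exhibits.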

Note, this article uses sample code in FORTRAN. C++ programs are easier to adapt to PAOS thanks to C++’s templates; adaptation in FORTRAN (without templates) requires more copy-and-paste of existing code.
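The C++ template route just mentioned might look like this sketch (my names, mirroring the Fortran packed types defined later in the article; the 8-wide typedef is hypothetical):

```cpp
#include <cstddef>

// One template covers every packed width; Fortran, lacking templates,
// needs one hand-written type per width.
template <std::size_t W>
struct Packed {
    double v[W];
};

typedef Packed<2> TypeXMM;   // SSE: 2-wide doubles
typedef Packed<4> TypeYMM;   // AVX: 4-wide doubles
typedef Packed<8> TypeZMM;   // hypothetical 8-wide (MIC-class)

// One generic elementwise update works for any width.
template <std::size_t W>
void packedFma(Packed<W>& x, const Packed<W>& dx, const Packed<W>& dt) {
    for (std::size_t i = 0; i < W; ++i)
        x.v[i] += dx.v[i] * dt.v[i];   // x = x + dx * dt, lane by lane
}
```

In Fortran, each width requires a separate derived type and a separate specific procedure behind a generic interface, hence the copy-and-paste.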

The sample code presented in this article comes from a bead-model synthesis program used principally in the study of tethers attached to objects. The program can be used to model a “tether ball” system, which would look like a volleyball attached by a cord to a pole.

The program was developed to simulate tethered satellites, initially the SED experiment with the Space Shuttle and a smaller satellite connected by a long (20 km) conducting tether. The conducting tether, moving through the magnetosphere, was intended to generate power. Later revisions of the simulation program were made for the study of space elevator designs. A space elevator is a tether attached to the earth near the equator, approximately 100,000 km long, going radially outward to a ballast mass. A space elevator is a self-supporting structure with an electrically powered elevator car/truck that can climb the tether to deliver orbital payloads.

A tether, as used by the simulation program, is a series of one-dimensional segments interconnected and terminated to an object (space shuttle, satellite, earth, etc.). You can view this as a series of springs. The connection point of adjacent segments is called a bead. A bead has position, velocity, mass, and other variables for properties (mostly force vectors: gravity, external, spring, etc.). The segments have: length (un-deformed), length (deformed), length rate, length acceleration, elastic diameter, aerodynamic diameter, resistivity, spring constant, spring damping constant, temperature, and several other properties. The bodies have many properties and may have things attached to them (tether, boom, thrusters, etc.). The simulation can also incorporate the JPL (Jet Propulsion Laboratory) ephemeris database to obtain initial planetary body information as well as satellites (moons, asteroids, comets) and spacecraft. The simulator is a comprehensive modeling system written in FORTRAN and has a development history of over 25 years.

AOS format in FORTRAN (C++ programmers, think of this as a struct)

type TypeFiniteSolution
    SEQUENCE
    ! REAL variables (set compiler option for default real == real(8))
    ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN
    real :: rDELTIS	! Integration step size
    real :: rTISTIM 	! Integration time
	…
    ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH 
    real :: rEL
    
    ! rate of deployment +out, -in through the X end
    real :: rELD
    
    ! rate of rate of deployment +out, -in through the X end
    real :: rELDD
	…
    ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME)
    real ::	rELI(3)

    ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    real ::	rELDI(3)

    ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    real ::	rELDDI(3)
	…
    ! Position with respect to IP IF
    real, pointer :: rBeadIPIF(:,:)   	! (3,NumberOfBeads+2)

    ! Velocity with respect to IP IF
    real, pointer :: rBeadDIPIF(:,:)  	! (3,NumberOfBeads+2)

    ! Acceleration with respect to IP IF
    real, pointer :: rBeadDDIPIF(:,:) 	! (3,NumberOfBeads+2)
	…
    ! SEGMENT LENGTH AND LENGTH RATE
    real, pointer :: rELBSG(:) 	! (NumberOfBeads+1)
    real, pointer :: rELBSGD(:)	! (NumberOfBeads+1)
	…
end type TypeFiniteSolution

With a memory layout of:

[Figure: AOS memory layout. The green bar represents one cache line of 64 bytes (8 doubles); the blue bar represents the adjacent cache line.]

Many current processors fetch two adjacent cache lines at the same time. If we recompose this for SSE opportunities, we find:


[Figure caption: Majority of operations are inter-bead]

[Figure caption: Majority of operations are inter-bead]

The cells are outlined for SSE optimization opportunities. Only when two adjacent like-operation cells are grouped in a box is there an SSE vectorization opportunity. The large array of bead positions (rBeadIPIF) is presented above in two organizations. The first box grouping represents the preponderance of the code, the Bead and Inter-Bead calculations. The second box grouping represents only the last-loop optimizations, where the positions can be advanced (by dV * dT).

The above figures illustrate that AOS has:

No vectorization opportunities for scalars
Half vectorization opportunities for small dimensioned vectors (3) (3,3)
Half vectorization opportunities for preponderance of large vectors (3,nBeads)
Full vectorization opportunity for final integration of large vectors (3,nBeads)

The above, though, is a postulation; so let us examine some code samples.

Simple loop – Euler Integration using AOS

Let’s take a look at the final phase of the integration process where we advance the bead position based on the bead velocity and integration step interval. This is known as Euler integration.

Note, the following code represents one of the least promising opportunities for PAOS over AOS.

call EULINTxyz(DELTIS, NBEAD, rBeadDIPIF(1:3,1:NBEAD), rBeadIPIF(1:3,1:NBEAD))
...
subroutine EULINTxyz_r_n_r_r(dT, n, dX, X)
    implicit none
    real :: dT
    integer :: n
    real :: dX(1:3,1:n)
    real :: X(1:3,1:n)
    integer :: i,j
    
    do i=1,n
        do j=1,3
            X(j,i) = X(j,i) + dX(j,i) * dT
        end do
    end do
end subroutine EULINTxyz_r_n_r_r

(The above code includes generic interface signature “_r_n_r_r”)

The AOS assembly code for Euler integration

; parameter 1(DT): rcx
; parameter 2(N): rdx
; parameter 3(DX): r8
; parameter 4(X): r9
;;; recursive subroutine EULINTxyz_r_n_r_r(dT, n, dX, X)
        push      rbp				; save stack frame pointer
        sub       rsp, 48			; allocate stack space
        lea       rbp, QWORD PTR [32+rsp]	; set new stack frame pointer
        movsxd    rax, DWORD PTR [rdx]		; rax = n
        mov       QWORD PTR [rbp], rax		; temp0 = n (unnecessary)
        mov       QWORD PTR [8+rbp], rax		; temp1 = n (unnecessary)
;;;     implicit none
;;;     real :: dT
;;;     integer :: n
;;;     real :: dX(1:3,1:n)
;;;     real :: X(1:3,1:n)
;;;     integer :: i,j
;;;     
;;;     do i=1,n
        movsxd    r10, DWORD PTR [rdx]		; r10 = n (could have obtained from rax)
        xor       edx, edx			; zero iteration count
        xor       eax, eax			; zero linear index into both arrays
        test      r10, r10			; if(n <= 0)
        jle       .B5.5         ; Prob 3%	;	goto .B5.5
.B5.2::
;;;         do j=1,3
;;;             X(j,i) = X(j,i) + dX(j,i) * dT

        vmovsd    xmm0, QWORD PTR [rcx]		; xmm0= dT (scalar)
        ALIGN     16				; align following code to 16 byte address
.B5.3::						; loop:
        vmulsd    xmm1, xmm0, QWORD PTR [rax+r8]		; xmm1 = dT * dX(index) (scalar)
        inc       rdx					; ++iteration count
        vmulsd    xmm3, xmm0, QWORD PTR [8+rax+r8]	; xmm3 = dT * dX(index+1) (scalar)
        vmulsd    xmm5, xmm0, QWORD PTR [16+rax+r8]	; xmm5 = dT * dX(index+2) (scalar)
        vaddsd    xmm2, xmm1, QWORD PTR [rax+r9]		; xmm2 = dT * dX(index) + X(index) (scalar)
        vaddsd    xmm4, xmm3, QWORD PTR [8+rax+r9]	; xmm4 = dT * dX(index+1) + X(index+1) (scalar)
        vaddsd    xmm1, xmm5, QWORD PTR [16+rax+r9]	; xmm1 = dT * dX(index+2) + X(index+2) (scalar)
        vmovsd    QWORD PTR [rax+r9], xmm2		; X(index) = dT * dX(index) + X(index)
        vmovsd    QWORD PTR [8+rax+r9], xmm4		; X(index+1) = dT * dX(index+1) + X(index+1)
        vmovsd    QWORD PTR [16+rax+r9], xmm1		; X(index+2) = dT * dX(index+2) + X(index+2)
        add       rax, 24				; index += 3 doubles (24 bytes)
        cmp       rdx, r10				; if(iteration count < n)
        jb        .B5.3         ; Prob 82%		;     goto loop
.B5.5::
;;;     end do
;;; end do
;;; end subroutine EULINTxyz_r_n_r_r
        lea       rsp, QWORD PTR [16+rbp]		; restore stack pointer
        pop       rbp					; restore stack frame pointer
        ret						; return

Comments on compiler generated code:

Loop is aligned (good)
The inner and outer loops were fused (good)
There were two unnecessary saves of n onto the stack (not good)
The calculations are all scalar (non-optimal)

While the programmer can address these issues by rewriting the Euler integration function, such as casting the (3,NBEAD) two-dimensional array into a (3*NBEAD) one-dimensional array (using TRANSFER), the programmer would then have to rewrite all loops that traverse the bead (and segment) information for the tethers. This would require edits to hundreds of loops within the program. The object of this article is to recode such that the compiler does most of the work for you.
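The recast mentioned above amounts to viewing the (3,NBEAD) array as one contiguous run of 3*NBEAD doubles. In C++ terms it is roughly this sketch (hypothetical names, not the simulator's code):

```cpp
#include <cstddef>

// Euler step over a (3,n) array viewed as a flat run of 3*n doubles.
// Fortran stores the array contiguously, so the recast is legal; the
// single loop hands the compiler one simple, fully vectorizable stream
// instead of a nested (3,n) traversal.
void eulerFlat(double dT, std::size_t n, const double* dX, double* X) {
    for (std::size_t i = 0; i < 3 * n; ++i)
        X[i] += dX[i] * dT;
}
```

The drawback, as noted, is that every loop touching the bead data would need the same rewrite.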

SOA format

    ! REAL variables (set compiler option for default real == real(8))
    ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN
    real, pointer :: rDELTIS(:)	! (NumberOfTethers)
    real, pointer :: rTISTIM(:)	! (NumberOfTethers)
	…
    ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH 
    real, pointer :: rEL(:)	! (NumberOfTethers)
    
    ! rate of deployment +out, -in through the X end
    real, pointer :: rELD(:)	! (NumberOfTethers)
    
    ! rate of rate of deployment +out, -in through the X end
    real, pointer :: rELDD(:)	! (NumberOfTethers)
	…
    ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME)
    real, pointer ::	rELI(:,:)	! (3, NumberOfTethers)

    ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    real, pointer ::	rELDI(:,:)	! (3, NumberOfTethers)

    ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    real, pointer ::	rELDDI(:,:)	! (3, NumberOfTethers)
	…
    ! Position with respect to IP IF
! *** note change in index order and number of indexes
    real, pointer :: rBeadIPIF(:,:,:) 	! (NumberOfBeads+2, 3, NumberOfTethers)

    ! Velocity with respect to IP IF
    real, pointer :: rBeadDIPIF(:,:,:)  ! (NumberOfBeads+2, 3, NumberOfTethers)

    ! Acceleration with respect to IP IF
    real, pointer :: rBeadDDIPIF(:,:,:) ! (NumberOfBeads+2, 3, NumberOfTethers)
	…
    ! SEGMENT LENGTH AND LENGTH RATE
    real, pointer :: rELBSG(:,:) 	! (NumberOfBeads+1, NumberOfTethers)
    real, pointer :: rELBSGD(:,:)	! (NumberOfBeads+1, NumberOfTethers)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (NumberOfTethers, NumberOfBeads+2, 3)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (NumberOfTethers, 3, NumberOfBeads+2)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (3, NumberOfTethers, NumberOfBeads+2)

The order of the indexes of the SOA is an important consideration and will vary depending on programming requirements. Of particular concern in this application is the performance impact on inter-bead calculations versus bead/intra-bead calculations. The SOA will favor one type of calculation but not both.

Debugging SOA requires inspecting individual array elements and keeping the array indexes synchronized. This is relatively cumbersome compared to the AOS format. Note, the choice of “pointer” versus “allocatable” is my personal programming preference.

For an application such as the one presented in this article, the programmer would arrange the scalars and small vectors in AOS format and consider SOA conversion only for the large, NumberOfBeads-sized allocations. The programmer still faces the difficult decision of the impact of index order, and would likely have to make multiple conversions for different index orders and run performance tests on each. SOA conversion of the application would be partially implemented and optimal for only a portion of the calculations.

PAOS format

Packed array of structures (PAOS) has a different representation. I use three representations: one for SSE (2-wide doubles) and two for AVX/AVX2: a) 4-wide doubles and, interestingly, b) 3-wide doubles.

! AVX four-wide double vector
type TypeYMM
    SEQUENCE
    real(8) :: v(0:3)
end type TypeYMM

! AVX three-wide double vector
type TypeYMM02
    SEQUENCE
    real(8) :: v(0:2)
end type TypeYMM02

! SSE two-wide double vector
type TypeXMM
    SEQUENCE
    real(8) :: v(0:1)
end type TypeXMM

type TypeFiniteSolution
SEQUENCE
    ! REAL variables (4-wide AVX)
    ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN
    type(TypeYMM) :: rDELTIS	! Integration step size
    type(TypeYMM) :: rTISTIM 	! Integration time
	…
    ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH 
    type(TypeYMM) :: rEL
    ! rate of deployment +out, -in through the X end
    type(TypeYMM) :: rELD
    ! rate of rate of deployment +out, -in through the X end
    type(TypeYMM) :: rELDD
	…
    ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME)
    type(TypeYMM) :: rELI(3)
    ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    type(TypeYMM) :: rELDI(3)
    ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME)
    type(TypeYMM) :: rELDDI(3)
	…
    ! Position with respect to IP IF
    type(TypeYMM), pointer :: rBeadIPIF(:,:)   	! (3,NumberOfBeads+2)
    ! Velocity with respect to IP IF
    type(TypeYMM), pointer :: rBeadDIPIF(:,:)  	! (3,NumberOfBeads+2)
    ! Acceleration with respect to IP IF
    type(TypeYMM), pointer :: rBeadDDIPIF(:,:) 	! (3,NumberOfBeads+2)
	…
    ! SEGMENT LENGTH AND LENGTH RATE
    type(TypeYMM), pointer :: rELBSG(:) 	! (NumberOfBeads+1)
    type(TypeYMM), pointer :: rELBSGD(:)	! (NumberOfBeads+1)
	…
end type TypeFiniteSolution

The data organization for PAOS is identical to the AOS with the exception of changing the variable type “real” to “type(TypeYMM)”. The array indexing scheme remains the same. Debugging PAOS is almost as convenient as AOS. You do have to explode the SIMD vector to observe the 4-wide variables.

The in-memory layout for the 4-wide YMM use is:

[Figure: PAOS in-memory layout, 4-wide YMM]

The in-memory 2-wide XMM layout:

[Figure: PAOS in-memory layout, 2-wide XMM]

Note that all references are available for full SIMD vectorization when you perform aggregate work on all tethers. Portions of the code that perform individual access may suffer additional cache line loads, but the preponderance of the calculations are performed on the aggregate of tethers.

Simple loop – Euler Integration using PAOS

call EULINTxyz(DELTIS, NBEAD, rBeadDDIPIF(1:3,1:NBEAD), rBeadDIPIF(1:3,1:NBEAD))
...
subroutine EULINTxyz_ymm_n_ymm_ymm(dT, n, dX, X)
  USE MOD_ALL
  implicit none
  type(TypeYMM) :: dT
  integer :: n
  type(TypeYMM) :: dX(1:3,1:n)
  type(TypeYMM) :: X(1:3,1:n)
  integer :: i,j
    
  do i=1,n
    do j=1,3
      !DEC$ VECTOR ALWAYS
      X(j,i).v = X(j,i).v + dX(j,i).v * dT.v
    end do
  end do
end subroutine EULINTxyz_ymm_n_ymm_ymm

(Note the generic interface selected signature suffix “_ymm_n_ymm_ymm”)
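For C++ readers, the PAOS Euler loop above amounts to the following sketch (hypothetical names; a 4-wide packed element corresponding to TypeYMM):

```cpp
#include <cstddef>

// PAOS element: one scalar's worth of data for 4 tethers, packed for AVX.
struct YMM4 { double v[4]; };

// Euler step over a (3,n) PAOS array: every statement operates on a whole
// packed element, so each iteration is inherently 4-wide vectorizable.
void eulerPAOS(const YMM4& dT, std::size_t n, const YMM4* dX, YMM4* X) {
    for (std::size_t i = 0; i < 3 * n; ++i)   // 3*n packed elements
        for (int s = 0; s < 4; ++s)           // the 4 SIMD lanes
            X[i].v[s] += dX[i].v[s] * dT.v[s];
}
```

Compared with the AOS version, the loop structure is unchanged; only the element type widened, which is exactly why the conversion touches so little code.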

The PAOS assembly code

; parameter 1(DT): rcx
; parameter 2(N): rdx
; parameter 3(DX): r8
; parameter 4(X): r9
;;; recursive subroutine EULINTxyz_ymm_n_ymm_ymm(dT, n, dX, X)
        push      rbp				; save stack frame pointer
        sub       rsp, 48			; allocate stack space
        lea       rbp, QWORD PTR [32+rsp]	; set new stack frame pointer
        movsxd    rax, DWORD PTR [rdx]		; rax = n
        mov       QWORD PTR [rbp], rax		; temp0 = n (unnecessary)
        mov       QWORD PTR [8+rbp], rax		; temp1 = n (unnecessary)
;;;     type(TypeYMM) :: dT
;;;     integer :: n
;;;     type(TypeYMM) :: dX(1:3,1:n)
;;;     type(TypeYMM) :: X(1:3,1:n)
;;;     integer :: i,j
;;;     
;;;     do i=1,n
        movsxd    r10, DWORD PTR [rdx]		; r10 = n (could have obtained from rax)
        xor       edx, edx			; zero loop iteration count
        xor       eax, eax			; zero linear index into dX and X
        test      r10, r10			; if(n <= 0)
        jle       .B6.5         ; Prob 3%	;     goto .B6.5
.B6.2::
;;;         do j=1,3
;;;             !DEC$ VECTOR ALWAYS
;;;             X(j,i).v = X(j,i).v + dX(j,i).v * dT.v
        vmovupd   xmm1, XMMWORD PTR [rcx]	; xmm1 = dT.v(0:1)
        vmovupd   xmm0, XMMWORD PTR [16+rcx]	; xmm0 = dT.v(2:3)
.B6.3::						; loop:
        vmulpd    xmm2, xmm1, XMMWORD PTR [rax+r8]	; xmm2 = dT.v(0:1) * dX(index).v(0:1)
        vmulpd    xmm4, xmm0, XMMWORD PTR [16+rax+r8]	; xmm4 = dT.v(2:3) * dX(index).v(2:3)
        vaddpd    xmm3, xmm2, XMMWORD PTR [rax+r9]	; xmm3 = dT.v(0:1)*dX(index).v(0:1) + X(index).v(0:1)
        vaddpd    xmm5, xmm4, XMMWORD PTR [16+rax+r9]	; xmm5 = dT.v(2:3)*dX(index).v(2:3) + X(index).v(2:3)
        vmulpd    xmm2, xmm1, XMMWORD PTR [32+rax+r8]	; xmm2 = dT.v(0:1) * dX(index+1).v(0:1)
        vmovupd   XMMWORD PTR [rax+r9], xmm3		; X(index).v(0:1) = xmm3
        vmovupd   XMMWORD PTR [16+rax+r9], xmm5		; X(index).v(2:3) = xmm5
        vmulpd    xmm4, xmm0, XMMWORD PTR [48+rax+r8]	; xmm4 = dT.v(2:3) * dX(index+1).v(2:3)
        vaddpd    xmm3, xmm2, XMMWORD PTR [32+rax+r9]	; xmm3 = dT.v(0:1)*dX(index+1).v(0:1) + X(index+1).v(0:1)
        vaddpd    xmm5, xmm4, XMMWORD PTR [48+rax+r9]	; xmm5 = dT.v(2:3)*dX(index+1).v(2:3) + X(index+1).v(2:3)
        vmulpd    xmm2, xmm1, XMMWORD PTR [64+rax+r8]	; xmm2 = dT.v(0:1) * dX(index+2).v(0:1)
        vmovupd   XMMWORD PTR [32+rax+r9], xmm3		; X(index+1).v(0:1) = xmm3
        vmovupd   XMMWORD PTR [48+rax+r9], xmm5		; X(index+1).v(2:3) = xmm5
        vmulpd    xmm4, xmm0, XMMWORD PTR [80+rax+r8]	; xmm4 = dT.v(2:3) * dX(index+2).v(2:3)
        vaddpd    xmm3, xmm2, XMMWORD PTR [64+rax+r9]	; xmm3 = dT.v(0:1)*dX(index+2).v(0:1) + X(index+2).v(0:1)
        vaddpd    xmm5, xmm4, XMMWORD PTR [80+rax+r9]	; xmm5 = dT.v(2:3)*dX(index+2).v(2:3) + X(index+2).v(2:3)
        vmovupd   XMMWORD PTR [64+rax+r9], xmm3		; X(index+2).v(0:1) = xmm3
        vmovupd   XMMWORD PTR [80+rax+r9], xmm5		; X(index+2).v(2:3) = xmm5
        inc       rdx					; ++loop count
        add       rax, 96				; index += 3 elements (3 x 32 bytes)
        cmp       rdx, r10				; if(loop count < n)
        jb        .B6.3         ; Prob 82%		;     goto loop
.B6.5::
;;;     end do
;;; end do
;;; end subroutine EULINTxyz_ymm_n_ymm_ymm

In examining the assembly code you find the use of v…pd instructions (vmulpd, vaddpd) with xmm register notation. These instructions represent 2-wide packed doubles. While AVX supports 4-wide packed doubles, the Sandy Bridge architecture performs fetches and stores 2-wide internally. A 4-wide fetch or store can be issued; however, this forces the two adjacent 2-wide fetches or stores to occur together. Should the data of these two 2-wide fetches or stores span a cache line, the pipeline will stall for the data. By using 2-wide fetches and stores, code optimization can interleave the instructions such that computation occurs concurrently with fetch or store.

vmulpd    xmm2, xmm1, XMMWORD PTR [rax+r8]	; xmm2 = dT.v(0:1) * dX(index).v(0:1)
vmulpd    xmm4, xmm0, XMMWORD PTR [16+rax+r8]	; xmm4 = dT.v(2:3) * dX(index).v(2:3)
vaddpd    xmm3, xmm2, XMMWORD PTR [rax+r9]	; xmm3 = dT.v(0:1)*dX(index).v(0:1) + X(index).v(0:1)
vaddpd    xmm5, xmm4, XMMWORD PTR [16+rax+r9]	; xmm5 = dT.v(2:3)*dX(index).v(2:3) + X(index).v(2:3)

------------ or -------------

vmulpd    ymm2, ymm1, YMMWORD PTR [rax+r8]	; ymm2 = dT.v(0:3) * dX(index).v(0:3)
vaddpd    ymm3, ymm2, YMMWORD PTR [rax+r9]	; ymm3 = dT.v(0:3)*dX(index).v(0:3) + X(index).v(0:3)

On the Sandy Bridge architecture, the two execute at approximately the same time, except when the fetch is split by a cache line, meaning the .v(0:1) lies in one cache line and the .v(2:3) lies in a different cache line. In cache line split circumstances, the 2-wide instructions perform faster.
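As a rough C/C++ intrinsics sketch of the split 2-wide form (SSE2 intrinsics; the function and parameter names are mine, not the simulator's), one packed element update looks like:

```cpp
#include <emmintrin.h>   // SSE2: __m128d, 2-wide packed doubles

// One 4-wide PAOS element update, split into two 2-wide halves as the
// compiler does above: v(0:1) and v(2:3) each use one 128-bit operation,
// so no single access can straddle more than one cache-line boundary.
void updateElement(const double dt[4], const double dx[4], double x[4]) {
    __m128d t0 = _mm_loadu_pd(dt);       // dt.v(0:1)
    __m128d t1 = _mm_loadu_pd(dt + 2);   // dt.v(2:3)
    __m128d r0 = _mm_add_pd(_mm_loadu_pd(x),
                            _mm_mul_pd(t0, _mm_loadu_pd(dx)));      // x(0:1) + dt*dx
    __m128d r1 = _mm_add_pd(_mm_loadu_pd(x + 2),
                            _mm_mul_pd(t1, _mm_loadu_pd(dx + 2)));  // x(2:3) + dt*dx
    _mm_storeu_pd(x, r0);
    _mm_storeu_pd(x + 2, r1);
}
```

The 4-wide alternative would replace each pair of 128-bit operations with one 256-bit AVX operation, at the cache-line-split risk described above.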

Migration to Ivy Bridge can eliminate this penalty. Code targeted for Ivy Bridge realizes an advantage because it can perform 4-wide fetch and store operations, together with 4-wide computation.

In many places in the code, the compiler will perform 4-wide fetch and store (see later example).

Typical problem

The Euler advancement code examined earlier is the example least likely to favor PAOS over AOS. It is also atypical of loop traversal. Let us examine a typical loop; the following loop calculates the spring tension.

The AOS version of the loop

DO JSEG = 1,JSEGlast
  !***********************************************************
  ! FORM SEGMENT VECTOR BETWEEN ADJACENT BEADS (IN INER FRAME)
  !***********************************************************
  do i=1,3
      ! Determine TUI at 1/2 integration step from now
      ! using current velocity and acceleration
      ! Position of X-End of segment
      TOSVX1(i) = rBeadIPIF(i,JSEG-1) &
        & + (rBeadDIPIF(i,JSEG-1) * (rDELTIS/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG-1) * (rDELTIS**2/4.0_8))
      ! Position of Y-End of segment
      TOSVX2(i) = rBeadIPIF(i,JSEG) &
        & + (rBeadDIPIF(i,JSEG) * (rDELTIS/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG) * (rDELTIS**2/4.0_8))
      ! SUI X->Y
      SUI(i) = TOSVX2(i) - TOSVX1(i)
  end do

  ! CALC DIST BETWEEN BEADS ADJACENT TO THIS SEGMENT
  !-------------------------------------------------
  DUMEL = dsqrt(SUI(1)**2 + SUI(2)**2 + SUI(3)**2)
  rELBSG(JSEG) = DUMEL

  ! TALLY TOTAL SPATIAL ARC LENGTH
  ARCTOTlocal = ARCTOTlocal + DUMEL

  !---------------------------------------------
  ! POPULATE THE JSEG ELEMENT OF THE BBTUI ARRAY
  !---------------------------------------------
  ! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT
  if(DUMEL .ne. 0.0_8) then
    TUI = SUI / DUMEL
  else
    call RandomUnitVector(TUI)
  endif
  rBBTUI(1:3,JSEG) = TUI(1:3)
  ! recompute velocity of Bead 0
  if(JSEG .eq. 1) then
    rBeadDIPIF(:,0) = rAVI + (TUI * rELD)
  endif

  ! ARC LENGTH WR/T UNDEFORMED TETHER
  !--------------------------------------------
  ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) 
  rDSDU(JSEG)  = - DUMEL/ELBis

  !**************************************************************
  ! FORM DERIVATIVE OF SEGMENT VECTOR (INER FRAME)
  !**************************************************************
  ! compute equivalent to
  ! SUDI = BUDI(JB1:JB1+2) - BUDI(JB1-3:JB1-1) + ELSGX_x_UITDI_ELSGXD_x_UITI
  SUDI = rBeadDIPIF(:,JSEG) - rBeadDIPIF(:,JSEG-1)
        
  ! CALC DERIV OF DIST BETWEEN BEADS ADJACENT TO SEGMENT
  !-----------------------------------------------------
  DUMELD = DOT(TUI,SUDI)
  rELBSGD(JSEG) = DUMELD

  !--------------------------------------------------------
  ! POPULATE THE SEGMENT DATA ARRAY'S FOR THIS JSEG ELEMENT
  ! (FOR AVERAGING AND NUMERICAL DIFF IN DEPLOY FLOW CALCS)
  !--------------------------------------------------------
  !  BBTUDI(1:3,JSEG) = (SUDI / DUMEL) + (SUI * -DUMELD/DUMEL**2)
  if(DUMEL .ne. 0.0_8) then
    rBBTUDI(1:3,JSEG) =((SUDI) + (SUI * -DUMELD/DUMEL)) / DUMEL
  else
    rBBTUDI(1:3,JSEG) = 0.0_8
  endif

  ! FINALLY, ARC LENGTH DERIV WR/T UNDEFORMED TETHER, AND ITS TIME DERIV
  !---------------------------------------------------------------------
  ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) 
  rDSDUD(JSEG) = - ( DUMELD - DUMEL*ELDoverEL )/ELBis
end do

(Note, the pointers were removed to clarify reading the code)

The PAOS version of the loop

DO JSEG = 1,JSEGlast
!***********************************************************
! FORM SEGMENT VECTOR BETWEEN ADJACENT BEADS (IN INER FRAME)
!***********************************************************
  do i=1,3
      ! Determine TUI at 1/2 integration step from now
      ! using current velocity and acceleration
      ! Position of X-End of segment
      TOSVX1(i).v = rBeadIPIF(i,JSEG-1).v &
        & + (rBeadDIPIF(i,JSEG-1).v * (rDELTIS.v/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG-1).v * (rDELTIS.v**2/4.0_8))
      ! Position of Y-End of segment
      TOSVX2(i).v = rBeadIPIF(i,JSEG).v &
        & + (rBeadDIPIF(i,JSEG).v * (rDELTIS.v/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG).v * (rDELTIS.v**2/4.0_8))
      ! SUI X->Y
      SUI(i).v = TOSVX2(i).v - TOSVX1(i).v
  end do
  ! CALC DIST BETWEEN BEADS ADJACENT TO THIS SEGMENT
  !-------------------------------------------------
  DUMEL.v = dsqrt(SUI(1).v**2 + SUI(2).v**2 + SUI(3).v**2)
  rELBSG(JSEG) = DUMEL

  ! TALLY TOTAL SPATIAL ARC LENGTH
  ARCTOTlocal.v = ARCTOTlocal.v + DUMEL.v

  !---------------------------------------------
  ! POPULATE THE JSEG ELEMENT OF THE BBTUI ARRAY
  !---------------------------------------------
  ! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT
  ! *** code change for SIMD horizontal operation ***
  do s=0,sLast
    if(DUMEL.v(s) .ne. 0.0_8) then
      TUI.v(s) = SUI.v(s) / DUMEL.v(s)
    else
      call RandomUnitVector(TUI.v(s))
    endif
  end do
  rBBTUI(1:3,JSEG) = TUI(1:3)
  ! recompute velocity of Bead 0
  if(JSEG .eq. 1) then
    do i=1,3
      rBeadDIPIF(i,0).v = rAVI(i).v + (TUI(i).v * rELD.v)
    end do
  endif

  ! ARC LENGTH WR/T UNDEFORMED TETHER
  !--------------------------------------------
  ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) 
  rDSDU(JSEG).v  = - DUMEL.v/ELBis.v

  !**************************************************************
  ! FORM DERIVATIVE OF SEGMENT VECTOR (INER FRAME)
  !**************************************************************
  do i=1,3
    SUDI(i).v = rBeadDIPIF(i,JSEG).v - rBeadDIPIF(i,JSEG-1).v
  end do
  ! CALC DERIV OF DIST BETWEEN BEADS ADJACENT TO SEGMENT
  !-----------------------------------------------------
  DUMELD = DOT(TUI,SUDI)
  rELBSGD(JSEG) = DUMELD

  !--------------------------------------------------------
  ! POPULATE THE SEGMENT DATA ARRAY'S FOR THIS JSEG ELEMENT
  ! (FOR AVERAGING AND NUMERICAL DIFF IN DEPLOY FLOW CALCS)
  !--------------------------------------------------------
  if((DUMEL.v(0) .ne. 0.0_8) .and. (DUMEL.v(1) .ne. 0.0_8) &
     & .and. (DUMEL.v(2) .ne. 0.0_8) .and. (DUMEL.v(3) .ne. 0.0_8)) then
    do i=1,3
     rBBTUDI(i,JSEG).v =((SUDI(i).v) + (SUI(i).v * -DUMELD.v / DUMEL.v)) / DUMEL.v
    end do
  else
    do s=0,sLast ! *** code change for SIMD horizontal operation ***
      if(DUMEL.v(s) .ne. 0.0_8) then
        do i=1,3
          rBBTUDI(i,JSEG).v(s) = ((SUDI(i).v(s)) &
             & + (SUI(i).v(s) * -DUMELD.v(s) / DUMEL.v(s))) / DUMEL.v(s)
        end do
      else
        do i=1,3
          rBBTUDI(i,JSEG).v(s) = 0.0_8
        end do
      endif
    end do ! *** code change for SIMD horizontal operation ***
 endif

  ! FINALLY, ARC LENGTH DERIV WR/T UNDEFORMED TETHER, AND ITS TIME DERIV
  !---------------------------------------------------------------------
  ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) 
  rDSDUD(JSEG).v = - ( DUMELD.v - DUMEL.v * ELDoverEL.v ) / ELBis.v
end do

Now let’s decompose the first small loop within the main loop. This computes a vector between two points (beads) as it may reside halfway through the integration interval. This vector will be used later to determine the direction of two spring force vectors (one for each bead on the two segment ends).

AOS

        do i=1,3
            ! Determine TUI at 1/2 integration step from now
            ! using current velocity and acceleration
            ! Position of X-End of segment
            TOSVX1(i) = rBeadIPIF(i,JSEG-1) &
            & + (rBeadDIPIF(i,JSEG-1)*(rDELTIS/2.0_8)) &
            & + (rBeadDDIPIF(i,JSEG-1)*(rDELTIS**2/4.0_8))
            ! Position of Y-End of segment
            TOSVX2(i) = rBeadIPIF(i,JSEG) &
            & + (rBeadDIPIF(i,JSEG)*(rDELTIS/2.0_8)) &
            & + (rBeadDDIPIF(i,JSEG)*(rDELTIS**2/4.0_8))
          ! SUI X->Y
          SUI(i) = TOSVX2(i) - TOSVX1(i)
        end do

PAOS

  do i=1,3
      ! Determine TUI at 1/2 integration step from now
      ! using current velocity and acceleration
      ! Position of X-End of segment
      TOSVX1(i).v = rBeadIPIF(i,JSEG-1).v &
        & + (rBeadDIPIF(i,JSEG-1).v * (rDELTIS.v/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG-1).v * (rDELTIS.v**2/4.0_8))
      ! Position of Y-End of segment
      TOSVX2(i).v = rBeadIPIF(i,JSEG).v &
        & + (rBeadDIPIF(i,JSEG).v * (rDELTIS.v/2.0_8)) &
        & + (rBeadDDIPIF(i,JSEG).v * (rDELTIS.v**2/4.0_8))
      ! SUI X->Y
      SUI(i).v = TOSVX2(i).v - TOSVX1(i).v
  end do

Notice the addition of the suffix “.v” to the statements containing arithmetic operators. This inconvenience can be eliminated if you elect to write overloaded operator functions for these data types. I chose not to, for debugging reasons (wanting the debug build to produce the code inline).
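In C++, the overloaded-operator alternative mentioned above might look like this sketch (hypothetical type name), after which the expression reads exactly like the scalar code, with no explicit “.v” suffix:

```cpp
// 4-wide packed double with overloaded operators, so a statement can be
// written x = x + dx * dt; the per-lane loop is hidden in the operators.
struct YMM {
    double v[4];
    YMM operator+(const YMM& o) const {
        YMM r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] + o.v[i];
        return r;
    }
    YMM operator*(const YMM& o) const {
        YMM r;
        for (int i = 0; i < 4; ++i) r.v[i] = v[i] * o.v[i];
        return r;
    }
};
```

The cost is that a debug (unoptimized) build now routes every operator through a function call, which is exactly the inlining concern raised above.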

AOS assembler code for the small loop of three statements

        mov       r12, QWORD PTR [80+rbp]                       ;178.25
        mov       r11, QWORD PTR [176+rbp]                      ;179.18
        mov       rdi, QWORD PTR [152+rbp]                      ;179.18
        mov       r8, QWORD PTR [272+rbp]                       ;180.18
        mov       rax, QWORD PTR [56+rbp]                       ;178.25
        mov       rdx, QWORD PTR [248+rbp]                      ;180.18
        mov       QWORD PTR [896+rbp], r12                      ;178.25
        mov       QWORD PTR [928+rbp], r11                      ;179.18
        mov       QWORD PTR [912+rbp], rdi                      ;179.18
        mov       QWORD PTR [936+rbp], r8                       ;180.18
        mov       r9, QWORD PTR [rbp]                           ;178.13
        mov       r15, QWORD PTR [64+rbp]                       ;178.25
        mov       rcx, QWORD PTR [88+rbp]                       ;178.25
        mov       r12, QWORD PTR [184+rbp]                      ;179.18
        mov       r11, QWORD PTR [160+rbp]                      ;179.18
        mov       r14, QWORD PTR [96+rbp]                       ;178.25
        mov       rdi, QWORD PTR [192+rbp]                      ;179.15
        mov       r8, QWORD PTR [256+rbp]                       ;180.18
        mov       r10, QWORD PTR [280+rbp]                      ;180.18
        mov       QWORD PTR [920+rbp], rax                      ;178.25
        mov       QWORD PTR [904+rbp], rdx                      ;180.18
        cmp       rax, 8                                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.9::                         ; Preds .B1.8
        cmp       QWORD PTR [912+rbp], 8                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.10::                        ; Preds .B1.9
        cmp       QWORD PTR [904+rbp], 8                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.11::                        ; Preds .B1.10
        mov       QWORD PTR [960+rbp], rbx                      ;
        mov       rbx, QWORD PTR [936+rbp]                      ;182.13
        imul      r10, rbx                                      ;182.13
        mov       QWORD PTR [968+rbp], r10                      ;182.13
        mov       r10, rsi                                      ;182.25
        mov       rax, QWORD PTR [896+rbp]                      ;182.13
        imul      r10, rax                                      ;182.25
        imul      rcx, rax                                      ;182.13
        add       r10, r9                                       ;182.13
        shl       r15, 3                                        ;182.13
        sub       r10, rcx                                      ;182.13
        mov       QWORD PTR [976+rbp], r15                      ;182.13
        sub       r10, r15                                      ;182.13
        mov       r15, rsi                                      ;183.18
        mov       rdx, QWORD PTR [928+rbp]                      ;182.13
        imul      r15, rdx                                      ;183.18
        imul      r12, rdx                                      ;182.13
        add       r15, r14                                      ;182.13
        shl       r11, 3                                        ;182.13
        sub       r15, r12                                      ;182.13
        mov       QWORD PTR [984+rbp], r11                      ;182.13
        sub       r15, r11                                      ;182.13
        mov       r11, rsi                                      ;184.18
        imul      r11, rbx                                      ;184.18
        vmulsd    xmm0, xmm10, QWORD PTR [8+r15]                ;183.45
        mov       QWORD PTR [944+rbp], rsi                      ;
        lea       rsi, QWORD PTR [-1+rsi]                       ;178.25
        imul      rdx, rsi                                      ;179.18
        imul      rax, rsi                                      ;178.25
        imul      rbx, rsi                                      ;180.18
        vaddsd    xmm0, xmm0, QWORD PTR [8+r10]                 ;183.15
        add       r11, rdi                                      ;182.13
        add       r14, rdx                                      ;178.13
        mov       QWORD PTR [952+rbp], rdi                      ;
        sub       r14, r12                                      ;178.13
        mov       rdi, QWORD PTR [968+rbp]                      ;182.13
        sub       r11, rdi                                      ;182.13
        shl       r8, 3                                         ;182.13
        add       r9, rax                                       ;178.13
        sub       r11, r8                                       ;182.13
        sub       r9, rcx                                       ;178.13
        sub       r14, QWORD PTR [984+rbp]                      ;178.13
        mov       rcx, QWORD PTR [952+rbp]                      ;178.13
        add       rcx, rbx                                      ;178.13
        vmulsd    xmm5, xmm8, QWORD PTR [8+r11]                 ;184.49
        sub       rcx, rdi                                      ;178.13
        sub       rcx, r8                                       ;178.13
        vaddsd    xmm14, xmm0, xmm5                             ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [8+r14]                ;179.47
        vmulsd    xmm5, xmm8, QWORD PTR [8+rcx]                 ;180.51
        vmulsd    xmm1, xmm8, QWORD PTR [24+rcx]                ;180.51
        sub       r9, QWORD PTR [976+rbp]                       ;178.13
        vmovsd    QWORD PTR [704+rbp], xmm14                    ;182.13
        mov       QWORD PTR [888+rbp], rsi                      ;178.25
        mov       rbx, QWORD PTR [960+rbp]                      ;186.11
        mov       rsi, QWORD PTR [944+rbp]                      ;186.11
        vaddsd    xmm0, xmm0, QWORD PTR [8+r9]                  ;179.15
        vaddsd    xmm13, xmm0, xmm5                             ;178.13
        vmulsd    xmm0, xmm10, QWORD PTR [16+r15]               ;183.45
        vmulsd    xmm5, xmm8, QWORD PTR [16+r11]                ;184.49
        vaddsd    xmm0, xmm0, QWORD PTR [16+r10]                ;183.15
        vmovsd    QWORD PTR [736+rbp], xmm13                    ;178.13
        vaddsd    xmm12, xmm0, xmm5                             ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [16+r14]               ;179.47
        vmulsd    xmm5, xmm8, QWORD PTR [16+rcx]                ;180.51
        vaddsd    xmm0, xmm0, QWORD PTR [16+r9]                 ;179.15
        vmovsd    QWORD PTR [712+rbp], xmm12                    ;182.13
        vaddsd    xmm11, xmm0, xmm5                             ;178.13
        vmulsd    xmm0, xmm10, QWORD PTR [24+r15]               ;183.45
        vmulsd    xmm5, xmm8, QWORD PTR [24+r11]                ;184.49
        vaddsd    xmm0, xmm0, QWORD PTR [24+r10]                ;183.15
        vmovsd    QWORD PTR [744+rbp], xmm11                    ;178.13
        vaddsd    xmm5, xmm0, xmm5                              ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [24+r14]               ;179.47
        vunpcklpd xmm12, xmm14, xmm12                           ;186.20
        vaddsd    xmm0, xmm0, QWORD PTR [24+r9]                 ;179.15
        vunpcklpd xmm11, xmm13, xmm11                           ;186.32
        vaddsd    xmm0, xmm0, xmm1                              ;178.13
        vsubpd    xmm11, xmm12, xmm11                           ;186.11
        vmovupd   XMMWORD PTR [1024+rbp], xmm11                 ;186.11
        vsubsd    xmm11, xmm5, xmm0                             ;186.11
        vmovsd    QWORD PTR [720+rbp], xmm5                     ;182.13
        vmovsd    QWORD PTR [752+rbp], xmm0                     ;178.13
        vmovsd    QWORD PTR [1040+rbp], xmm11                   ;186.11
        jmp       .B1.13        ; Prob 100%                     ;186.11
.B1.12::                        ; Preds .B1.8 .B1.9 .B1.10
        mov       rax, rsi                                      ;182.25
        mov       rdx, QWORD PTR [896+rbp]                      ;182.13
        imul      rax, rdx                                      ;182.25
        imul      rcx, rdx                                      ;182.13
        add       rax, r9                                       ;182.13
        mov       QWORD PTR [1000+rbp], rcx                     ;182.13
        sub       rax, rcx                                      ;182.13
        mov       rcx, QWORD PTR [920+rbp]                      ;182.13
        imul      r15, rcx                                      ;182.13
        mov       QWORD PTR [960+rbp], rbx                      ;
        mov       rbx, rsi                                      ;183.18
        mov       QWORD PTR [976+rbp], r15                      ;182.13
        sub       rcx, r15                                      ;182.13
        mov       r15, QWORD PTR [928+rbp]                      ;182.13
        imul      rbx, r15                                      ;183.18
        imul      r12, r15                                      ;182.13
        add       rbx, r14                                      ;182.13
        mov       QWORD PTR [1008+rbp], r12                     ;182.13
        sub       rbx, r12                                      ;182.13
        mov       r12, QWORD PTR [912+rbp]                      ;182.13
        imul      r11, r12                                      ;182.13
        mov       QWORD PTR [984+rbp], r11                      ;182.13
        sub       r12, r11                                      ;182.13
        mov       r11, rsi                                      ;184.18
        mov       QWORD PTR [992+rbp], r14                      ;
        mov       r14, QWORD PTR [936+rbp]                      ;182.13
        imul      r11, r14                                      ;184.18
        imul      r10, r14                                      ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [r12+rbx]              ;183.45
        add       r11, rdi                                      ;182.13
        vaddsd    xmm0, xmm0, QWORD PTR [rcx+rax]               ;183.15
        mov       QWORD PTR [968+rbp], r10                      ;182.13
        sub       r11, r10                                      ;182.13
        mov       r10, QWORD PTR [904+rbp]                      ;182.13
        imul      r8, r10                                       ;182.13
        mov       QWORD PTR [944+rbp], rsi                      ;
        lea       rsi, QWORD PTR [-1+rsi]                       ;178.25
        imul      rdx, rsi                                      ;178.25
        imul      r15, rsi                                      ;179.18
        imul      r14, rsi                                      ;180.18
        sub       r10, r8                                       ;182.13
        add       r9, rdx                                       ;178.13
        mov       rdx, QWORD PTR [992+rbp]                      ;178.13
        add       rdi, r14                                      ;178.13
        add       rdx, r15                                      ;178.13
        sub       rdx, QWORD PTR [1008+rbp]                     ;178.13
        sub       r9, QWORD PTR [1000+rbp]                      ;178.13
        sub       rdi, QWORD PTR [968+rbp]                      ;178.13
        mov       QWORD PTR [888+rbp], rsi                      ;178.25
        mov       rsi, QWORD PTR [912+rbp]                      ;183.18
        vmulsd    xmm5, xmm8, QWORD PTR [r10+r11]               ;184.49
        mov       r15, QWORD PTR [976+rbp]                      ;182.13
        lea       r14, QWORD PTR [rsi+rsi]                      ;183.18
        sub       r14, QWORD PTR [984+rbp]                      ;182.13
        vaddsd    xmm14, xmm0, xmm5                             ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [rdx+r12]              ;179.47
        vmulsd    xmm5, xmm8, QWORD PTR [rdi+r10]               ;180.51
        vaddsd    xmm0, xmm0, QWORD PTR [r9+rcx]                ;179.15
        mov       rcx, QWORD PTR [920+rbp]                      ;182.25
        vaddsd    xmm13, xmm0, xmm5                             ;178.13
        vmulsd    xmm0, xmm10, QWORD PTR [r14+rbx]              ;183.45
        mov       r10, QWORD PTR [904+rbp]                      ;184.18
        lea       r12, QWORD PTR [rcx+rcx]                      ;182.25
        sub       r12, r15                                      ;182.13
        lea       rcx, QWORD PTR [rcx+rcx*2]                    ;182.25
        sub       rcx, r15                                      ;182.13
        lea       r15, QWORD PTR [rsi+rsi*2]                    ;183.18
        mov       rsi, QWORD PTR [904+rbp]                      ;184.18
        add       r10, r10                                      ;184.18
        sub       r10, r8                                       ;182.13
        sub       r15, QWORD PTR [984+rbp]                      ;182.13
        vmovsd    QWORD PTR [704+rbp], xmm14                    ;182.13
        vmovsd    QWORD PTR [736+rbp], xmm13                    ;178.13
        vmulsd    xmm5, xmm8, QWORD PTR [r10+r11]               ;184.49
        vaddsd    xmm0, xmm0, QWORD PTR [r12+rax]               ;183.15
        vaddsd    xmm12, xmm0, xmm5                             ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [r14+rdx]              ;179.47
        vmulsd    xmm5, xmm8, QWORD PTR [r10+rdi]               ;180.51
        vaddsd    xmm0, xmm0, QWORD PTR [r12+r9]                ;179.15
        vmovsd    QWORD PTR [712+rbp], xmm12                    ;182.13
        lea       r12, QWORD PTR [rsi+rsi*2]                    ;184.18
        sub       r12, r8                                       ;182.13
        vaddsd    xmm11, xmm0, xmm5                             ;178.13
        vmulsd    xmm0, xmm10, QWORD PTR [r15+rbx]              ;183.45
        vmulsd    xmm5, xmm8, QWORD PTR [r12+r11]               ;184.49
        vmulsd    xmm1, xmm8, QWORD PTR [r12+rdi]               ;180.51
        vaddsd    xmm0, xmm0, QWORD PTR [rcx+rax]               ;183.15
        vmovsd    QWORD PTR [744+rbp], xmm11                    ;178.13
        vaddsd    xmm5, xmm0, xmm5                              ;182.13
        vmulsd    xmm0, xmm10, QWORD PTR [r15+rdx]              ;179.47
        vunpcklpd xmm12, xmm14, xmm12                           ;186.20
        vaddsd    xmm0, xmm0, QWORD PTR [rcx+r9]                ;179.15
        vunpcklpd xmm11, xmm13, xmm11                           ;186.32
        vaddsd    xmm0, xmm0, xmm1                              ;178.13
        vsubpd    xmm11, xmm12, xmm11                           ;186.11
        vmovupd   XMMWORD PTR [1024+rbp], xmm11                 ;186.11
        vsubsd    xmm11, xmm5, xmm0                             ;186.11
        vmovsd    QWORD PTR [720+rbp], xmm5                     ;182.13
        vmovsd    QWORD PTR [752+rbp], xmm0                     ;178.13
        vmovsd    QWORD PTR [1040+rbp], xmm11                   ;186.11
        mov       rbx, QWORD PTR [960+rbp]                      ;186.11
        mov       rsi, QWORD PTR [944+rbp]                      ;186.11
.B1.13::                        ; Preds .B1.12 .B1.11

The assembler code generated for the three statements comprises 214 instructions. These statements are typical of those found within the loops traversing tethers. A significant portion of the generated code computes the indexes to the components, principally because the programmer used pointers, which can have non-stride-1 references, to the following arrays:

    (known stride-1 to the compiler)
    TOSVX1(3), TOSVX2(3), SUI(3)

    (unknown stride to the compiler)
    rBeadIPIF(3,0:NBEAD+1), rBeadDIPIF(3,0:NBEAD+1), rBeadDDIPIF(3,0:NBEAD+1)

These indexes are computed for each bead in all tethers.

Further, all of the double precision floating point computations performed in the AOS code are scalar operations.

Also note, the code generated contains data stride tests:

        cmp       rax, 8                                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.9::                         ; Preds .B1.8
        cmp       QWORD PTR [912+rbp], 8                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.10::                        ; Preds .B1.9
        cmp       QWORD PTR [904+rbp], 8                        ;180.18
        jne       .B1.12        ; Prob 50%                      ;180.18
.B1.11::                        ; Preds .B1.10

Execution takes one of two different code paths depending on the result of the stride test.

PAOS assembler code for the small loop of three statements

        mov       r14, QWORD PTR [11184+rdi]                    ;172.15
        mov       r15, QWORD PTR [11176+rdi]                    ;172.15
        imul      r14, r15                                      ;
        vmovupd   ymm1, YMMWORD PTR [160+r13]                   ;204.10
        vmovupd   xmm2, XMMWORD PTR [32+rdi]                    ;178.94
        vmovupd   ymm0, YMMWORD PTR [_2il0floatpacket.40]       ;
        vmovupd   YMMWORD PTR [576+r13], ymm1                   ;204.10
        vmovupd   ymm1, YMMWORD PTR [_2il0floatpacket.39]       ;
        mov       QWORD PTR [1432+rbp], r15                     ;172.15
        sub       r14, r15                                      ;
        mov       r15, rsi                                      ;176.15
        mov       r8, QWORD PTR [11200+rdi]                     ;172.15
        imul      r15, r8                                       ;176.15
        mov       r12, QWORD PTR [11208+rdi]                    ;172.15
        imul      r12, r8                                       ;
        mov       rax, QWORD PTR [11240+rdi]                    ;173.19
        mov       QWORD PTR [1528+rbp], rax                     ;173.19
        mov       rax, QWORD PTR [11328+rdi]                    ;173.19
        mov       QWORD PTR [1360+rbp], rax                     ;173.19
        mov       rax, QWORD PTR [11320+rdi]                    ;173.19
        mov       QWORD PTR [1408+rbp], rax                     ;173.19
        mov       rax, QWORD PTR [11304+rdi]                    ;173.19
        mov       r9, QWORD PTR [11120+rdi]                     ;172.15
        add       r15, r9                                       ;
        mov       QWORD PTR [1352+rbp], rax                     ;173.19
        sub       r15, r12                                      ;
        mov       rax, QWORD PTR [11296+rdi]                    ;173.19
        sub       r15, r14                                      ;
        mov       QWORD PTR [1400+rbp], r12                     ;
        mov       r12, rsi                                      ;176.15
        mov       r10, QWORD PTR [11080+rdi]                    ;172.15
        mov       QWORD PTR [1440+rbp], rax                     ;173.19
        xor       eax, eax                                      ;173.19
        imul      r12, r10                                      ;176.15
        mov       rdx, QWORD PTR [11088+rdi]                    ;172.15
        mov       QWORD PTR [1416+rbp], r10                     ;172.15
        mov       QWORD PTR [1392+rbp], r14                     ;
        imul      rdx, r10                                      ;
        mov       r14, QWORD PTR [1352+rbp]                     ;
        mov       r10, QWORD PTR [1440+rbp]                     ;
        imul      r14, r10                                      ;
        mov       rbx, QWORD PTR [11064+rdi]                    ;172.15
        sub       r14, r10                                      ;
        mov       r11, QWORD PTR [11056+rdi]                    ;172.15
        mov       r10, rsi                                      ;177.19
        imul      rbx, r11                                      ;
        mov       rcx, QWORD PTR [11000+rdi]                    ;172.15
        add       r12, rcx                                      ;
        sub       rbx, r11                                      ;
        sub       r12, rdx                                      ;
        mov       QWORD PTR [1448+rbp], r11                     ;172.15
        sub       r12, rbx                                      ;
        mov       r11, QWORD PTR [1408+rbp]                     ;
        imul      r10, r11                                      ;177.19
        mov       QWORD PTR [1496+rbp], r12                     ;
        mov       r12, QWORD PTR [1360+rbp]                     ;
        imul      r12, r11                                      ;
        mov       QWORD PTR [1504+rbp], r15                     ;
        mov       r15, QWORD PTR [1528+rbp]                     ;
        add       r10, r15                                      ;
        sub       r10, r12                                      ;
        sub       r10, r14                                      ;
        mov       QWORD PTR [1520+rbp], rax                     ;173.19
        mov       QWORD PTR [1488+rbp], r10                     ;
        mov       QWORD PTR [1480+rbp], rax                     ;
        mov       QWORD PTR [1472+rbp], rax                     ;
        mov       QWORD PTR [1464+rbp], rax                     ;
        mov       QWORD PTR [1456+rbp], rax                     ;
        mov       QWORD PTR [1344+rbp], rcx                     ;
        mov       QWORD PTR [1352+rbp], r14                     ;
        mov       QWORD PTR [1360+rbp], r12                     ;
        mov       QWORD PTR [1368+rbp], rbx                     ;
        mov       QWORD PTR [1376+rbp], rdx                     ;
        mov       QWORD PTR [1384+rbp], r8                      ;
        mov       QWORD PTR [1336+rbp], rsi                     ;
        mov       QWORD PTR [1320+rbp], rdi                     ;
        mov       rax, QWORD PTR [1456+rbp]                     ;
        mov       rdx, QWORD PTR [1464+rbp]                     ;
        mov       rcx, QWORD PTR [1472+rbp]                     ;
        mov       rbx, QWORD PTR [1480+rbp]                     ;
        mov       rsi, QWORD PTR [1488+rbp]                     ;
        mov       r8, QWORD PTR [1504+rbp]                      ;
        mov       r10, QWORD PTR [1520+rbp]                     ;
        mov       r11, QWORD PTR [1440+rbp]                     ;
        mov       r12, QWORD PTR [1432+rbp]                     ;
        mov       r14, QWORD PTR [1448+rbp]                     ;
        vinsertf128 ymm2, ymm2, XMMWORD PTR [48+rdi], 1         ;178.94
        mov       rdi, QWORD PTR [1496+rbp]                     ;
        ALIGN     16
.B1.6::                         ; Preds .B1.6 .B1.5
        vmovupd   xmm10, XMMWORD PTR [rdx+rsi]                  ;178.63
        vmovupd   xmm3, XMMWORD PTR [rbx+r8]                    ;177.62
        vmovupd   xmm7, XMMWORD PTR [rcx+rdi]                   ;177.19
        inc       r10                                           ;173.19
        vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rdx+rsi], 1   ;178.63
        add       rdx, r11                                      ;173.19
        vmulpd    ymm12, ymm0, ymm11                            ;178.63
        vmulpd    ymm13, ymm12, ymm2                            ;178.94
        vmulpd    ymm15, ymm13, ymm2                            ;178.94
        vinsertf128 ymm4, ymm3, XMMWORD PTR [16+rbx+r8], 1      ;177.62
        add       rbx, r12                                      ;173.19
        vmulpd    ymm5, ymm1, ymm4                              ;177.62
        vmulpd    ymm9, ymm5, ymm2                              ;177.64
        vinsertf128 ymm8, ymm7, XMMWORD PTR [16+rcx+rdi], 1     ;177.19
        add       rcx, r14                                      ;173.19
        vaddpd    ymm14, ymm8, ymm9                             ;177.19
        vaddpd    ymm3, ymm14, ymm15                            ;176.15
        vmovupd   YMMWORD PTR [256+r13+rax], ymm3               ;176.15
        add       rax, 32                                       ;173.19
        cmp       r10, 3                                        ;173.19
        jb        .B1.6         ; Prob 66%                      ;173.19
.B1.7::                         ; Preds .B1.6
        mov       rsi, QWORD PTR [1336+rbp]                     ;
        lea       r11, QWORD PTR [-1+rsi]                       ;172.15
        mov       rax, QWORD PTR [1416+rbp]                     ;172.15
        xor       r10d, r10d                                    ;173.19
        imul      rax, r11                                      ;172.15
        vmovupd   ymm0, YMMWORD PTR [_2il0floatpacket.40]       ;
        vmovupd   ymm1, YMMWORD PTR [_2il0floatpacket.39]       ;
        mov       rcx, QWORD PTR [1344+rbp]                     ;
        add       rcx, rax                                      ;
        mov       rdx, QWORD PTR [1376+rbp]                     ;
        sub       rcx, rdx                                      ;
        mov       r8, QWORD PTR [1384+rbp]                      ;
        mov       rdx, QWORD PTR [1408+rbp]                     ;173.19
        imul      rdx, r11                                      ;173.19
        imul      r8, r11                                       ;172.15
        add       r15, rdx                                      ;
        add       r9, r8                                        ;
        xor       r8d, r8d                                      ;
        mov       r12, QWORD PTR [1360+rbp]                     ;
        sub       r15, r12                                      ;
        mov       rbx, QWORD PTR [1368+rbp]                     ;
        sub       rcx, rbx                                      ;
        sub       r9, QWORD PTR [1400+rbp]                      ;
        xor       ebx, ebx                                      ;
        mov       QWORD PTR [1512+rbp], r10                     ;173.19
        xor       edx, edx                                      ;
        mov       r14, QWORD PTR [1352+rbp]                     ;
        sub       r15, r14                                      ;
        mov       QWORD PTR [1424+rbp], r11                     ;172.15
        mov       rdi, QWORD PTR [1320+rbp]                     ;
        sub       r9, QWORD PTR [1392+rbp]                      ;
        mov       r11, QWORD PTR [1512+rbp]                     ;
        mov       r12, QWORD PTR [1440+rbp]                     ;
        mov       r14, QWORD PTR [1432+rbp]                     ;
        mov       rax, QWORD PTR [1448+rbp]                     ;
        ALIGN     16
.B1.8::                         ; Preds .B1.8 .B1.7
        vmovupd   xmm10, XMMWORD PTR [rbx+r15]                  ;174.65
        vmovupd   xmm3, XMMWORD PTR [r10+r9]                    ;173.64
        vmovupd   xmm7, XMMWORD PTR [r8+rcx]                    ;173.19
        inc       r11                                           ;173.19
        vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rbx+r15], 1   ;174.65
        add       rbx, r12                                      ;173.19
        vmulpd    ymm12, ymm0, ymm11                            ;174.65
        vmulpd    ymm13, ymm12, ymm2                            ;174.96
        vmulpd    ymm15, ymm13, ymm2                            ;174.96
        vinsertf128 ymm4, ymm3, XMMWORD PTR [16+r10+r9], 1      ;173.64
        add       r10, r14                                      ;173.19
        vmulpd    ymm5, ymm1, ymm4                              ;173.64
        vmulpd    ymm9, ymm5, ymm2                              ;173.66
        vinsertf128 ymm8, ymm7, XMMWORD PTR [16+r8+rcx], 1      ;173.19
        add       r8, rax                                       ;173.19
        vaddpd    ymm14, ymm8, ymm9                             ;173.19
        vaddpd    ymm3, ymm14, ymm15                            ;172.15
        vmovupd   YMMWORD PTR [352+r13+rdx], ymm3               ;172.15
        add       rdx, 32                                       ;173.19
        cmp       r11, 3                                        ;173.19
        jb        .B1.8         ; Prob 66%                      ;173.19
.B1.9::                         ; Preds .B1.8
        xor       ecx, ecx                                      ;173.19
        xor       edx, edx                                      ;
.B1.10::                        ; Preds .B1.10 .B1.9
        vmovupd   ymm1, YMMWORD PTR [256+r13+rdx]               ;192.11
        inc       rcx                                           ;173.19
        vsubpd    ymm2, ymm1, YMMWORD PTR [352+r13+rdx]         ;192.11
        vmovupd   YMMWORD PTR [608+r13+rdx], ymm2               ;192.11
        add       rdx, 32                                       ;173.19
        cmp       rcx, 3                                        ;173.19
        jb        .B1.10        ; Prob 66%                      ;173.19

PAOS generates 164 instructions (plus an undetermined number of NOPs for the ALIGN 16 pads).

The PAOS code, formerly one loop of 3 iterations in source, now has three loops of 3 iterations each: one to compute TOSVX1, one for TOSVX2, and one for the SUI result. These loops are relatively short, and the major index calculations are performed only once, prior to loop entry. The loop iterations extend the execution path by 98 instructions, for 262 instructions total, producing four SUI results, or 65.5 instructions per SUI result. Comparing this with the AOS method, which generated one SUI result in 214 instructions, we have an approximate 3.26x speedup for this section of code. Memory and cache latencies will affect this rough estimate.

The double precision floating point operations are now vectorized; most are 4-wide:

        vmovupd   ymm1, YMMWORD PTR [160+r13]                   ;204.10
        vmovupd   xmm2, XMMWORD PTR [32+rdi]                    ;178.94
...
        vmulpd    ymm12, ymm0, ymm11                            ;178.63
        vmulpd    ymm13, ymm12, ymm2                            ;178.94
        vmulpd    ymm15, ymm13, ymm2                            ;178.94

The compiler chose full 4-wide loads in the above cases because the potential penalty for a split cache line load was less than the performance gained through multiple re-use of the fetched variable.

Some of the fetches are still performed in two steps,

        vmovupd   xmm10, XMMWORD PTR [rbx+r15]                  ;174.65
	...
        vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rbx+r15], 1   ;174.65

which avoids the potential split cache line load performance hit.

Also note that the compiler can, and does, force alignment of the local temporaries TOSVX1(3), TOSVX2(3), and SUI(3). Therefore, it knows there will be no cache line split issue with 4-wide accesses to these variables.

PAOS has the following advantages:

Index calculations per bead are similar but for only a quarter of the number of tethers
Many fetches and stores are 4-wide, a few are 2-wide
Double precision adds, subtracts, and multiplies are 4-wide
Shorter code means better utilization of L1 instruction cache

Notes for C++ programmers (not familiar with FORTRAN)

In the code shown, the arrays were declared as pointers to arrays:

AOS

    real, pointer :: rBeadIPIF(:,:)
    real, pointer :: rBeadDIPIF(:,:)
    real, pointer :: rBeadDDIPIF(:,:)
...
    rBeadIPIF => pFiniteSolution.rBeadIPIF
    rBeadDIPIF => pFiniteSolution.rBeadDIPIF
    rBeadDDIPIF => pFiniteSolution.rBeadDDIPIF

PAOS

    type(TypeYMM), pointer :: rBeadIPIF(:,:)
    type(TypeYMM), pointer :: rBeadDIPIF(:,:)
    type(TypeYMM), pointer :: rBeadDDIPIF(:,:)
...
    rBeadIPIF => pFiniteSolutionSIMD.rBeadIPIF
    rBeadDIPIF => pFiniteSolutionSIMD.rBeadDIPIF
    rBeadDDIPIF => pFiniteSolutionSIMD.rBeadDDIPIF

In both variants of the code, these pointers point to the entire array in the pFiniteSolution and pFiniteSolutionSIMD structures, respectively; those structure members happen to be pointers to arrays as well. In FORTRAN, pointers can also point to portions of an array that need not be contiguous.

    real, pointer :: f(:,:)
    type(TypeYMM), pointer :: t(:,:)
    real, pointer :: slice(:,:)
    integer :: s
...
    s = 2 ! 3rd cell of 4-wide v(0:3)
    slice(LBOUND(f, DIM=1):UBOUND(f, DIM=1), LBOUND(f, DIM=2):UBOUND(f, DIM=2)) => &
      & t(LBOUND(f, DIM=1), LBOUND(f, DIM=2))%v(s::4)
    f => slice

In the preceding statements, the pointer f has a stride of 4 (doubles), assuming the array pointer in the pFiniteSolutionSIMD structure has a stride of 1 (it need not), and starts at v(s) with s=2. This presents in f a two-dimensional array that looks like (3,0:NBEAD+1) and is indexed as such. However, indexing walks down column s of the v array within the two-dimensional array of 4-wide vectors. C++ programmers may legitimately ask: “Why would anyone need to do this?”

To answer this, recall that only 25% of the code was converted to use PAOS. What about the other 75% of the code? As it turns out, using pointers and non-stride-1 assignments, the other 75% of the code can be left unchanged, yet still access and manipulate all SIMD-allocated arrays as if they were allocated in the former non-PAOS layout. With this capability, conversion of an application need not be an all-or-nothing proposition. Without this capability of FORTRAN, multiple sets of data would have to be continuously synchronized.

Because of the potential for strides not equal to 1, the FORTRAN compiler will generate multiple code paths: one for stride == 1 and another for stride != 1. This is true when the array element is a “natural” size for the language (char, word, integer, float, double), where the processor instruction set can scale the array index by *1, *2, *4, or *8. In the case of the Type(TypeYMM) array, each cell consists of four doubles (32 bytes). Because the instruction set cannot scale by *32, only one execution path is generated (and taken). The overhead is trivial:

add       rdx, 32                                       ;173.19

C++ equivalent code

For the benefit of the C++ programmers, let’s piece together a mockup of the equivalent example:

// AOS
struct xyz_t
{
	double xyz[3];
};

struct FiniteSoluton_t
{
	double	rDELTIS;
	int	nBeads;

	xyz_t*	rBeadIPIF;
	xyz_t*	rBeadDIPIF;
	xyz_t*	rBeadDDIPIF;

	FiniteSoluton_t()
	{
		nBeads = 0;
		rBeadIPIF = NULL;
		rBeadDIPIF = NULL;
		rBeadDDIPIF = NULL;
	}
	void Allocate(int n)
	{
		if(rBeadIPIF) delete [] rBeadIPIF;
		if(rBeadDIPIF) delete [] rBeadDIPIF;
		if(rBeadDDIPIF) delete [] rBeadDDIPIF;
		nBeads = n;
		rBeadIPIF = new xyz_t[n];
		rBeadDIPIF = new xyz_t[n];
		rBeadDDIPIF = new xyz_t[n];
	}
};

FiniteSoluton_t	fs;
// dummy external function to force optimizer NOT to remove unused code
// *** this will generate a link error, ignore error and look at .ASM file
void foo(xyz_t& x);

void TENSEG(FiniteSoluton_t* pFiniteSolution)
{
	int JSEG;
	// pull in frequently used items from fs struct
	int JSEGlast = pFiniteSolution->nBeads+1;
	double	rDELTIS = pFiniteSolution->rDELTIS;
	xyz_t*	rBeadIPIF = pFiniteSolution->rBeadIPIF;
	xyz_t*	rBeadDIPIF = pFiniteSolution->rBeadDIPIF;
	xyz_t*	rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF;
	xyz_t	TOSVX1, TOSVX2, SUI;
	for(JSEG=0; JSEG < JSEGlast; ++JSEG)
	{
		for(int i=0; i<3; ++i)
		{
			// Determine TUI at 1/2 integration step from now
			// using current velocity and acceleration
			// Position of X-End of segment
			TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
			 + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
			 + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
			// Position of Y-End of segment
			TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
			 + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
			 + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
			// SUI X->Y
			SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i];
		} // for(int i=0; i<3; ++i)
		// insert code to assure optimizer does not remove loop
		foo(SUI);
	} // for(JSEG=0; JSEG < JSEGlast; ++JSEG)
}

PAOS code

#include <dvec.h>

struct xyzF64vec4
{
	F64vec4 xyz[3];
};

struct FiniteSolutonF64vec4
{
	F64vec4	rDELTIS;
	int	nBeads;

	xyzF64vec4*	rBeadIPIF;
	xyzF64vec4*	rBeadDIPIF;
	xyzF64vec4*	rBeadDDIPIF;

	FiniteSolutonF64vec4()
	{
		nBeads = 0;
		rBeadIPIF = NULL;
		rBeadDIPIF = NULL;
		rBeadDDIPIF = NULL;
	}
	void Allocate(int n)
	{
		if(rBeadIPIF) delete [] rBeadIPIF;
		if(rBeadDIPIF) delete [] rBeadDIPIF;
		if(rBeadDDIPIF) delete [] rBeadDDIPIF;
		nBeads = n;
		rBeadIPIF = new xyzF64vec4[n];
		rBeadDIPIF = new xyzF64vec4[n];
		rBeadDDIPIF = new xyzF64vec4[n];
	}
};

FiniteSolutonF64vec4	fsF64vec4;

void foo(xyzF64vec4& x);	// dummy external function to force optimizer NOT to remove unused code
					// *** this will generate a link error, ignore error and look at .ASM file

void TENSEG(FiniteSolutonF64vec4* pFiniteSolution)
{
	int JSEG;
	// pull in frequently used items from fs struct
	int JSEGlast = pFiniteSolution->nBeads+1;
	F64vec4	rDELTIS = pFiniteSolution->rDELTIS;
	xyzF64vec4*	rBeadIPIF = pFiniteSolution->rBeadIPIF;
	xyzF64vec4*	rBeadDIPIF = pFiniteSolution->rBeadDIPIF;
	xyzF64vec4*	rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF;
	xyzF64vec4	TOSVX1, TOSVX2, SUI;
	for(JSEG=0; JSEG < JSEGlast; ++JSEG)
	{
        for(int i=0; i<3; ++i)
		{

            // Determine TUI at 1/2 integration step from now
            // using current velocity and acceleration
            // Position of X-End of segment
            TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
             + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
             + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
            // Position of Y-End of segment
            TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
             + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
             + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
          // SUI X->Y
          SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i];
		} // for(int i=0; i<3; ++i)
		// insert code to assure optimizer does not remove loop
		foo(SUI);
	} // for(JSEG=0; JSEG < JSEGlast; ++JSEG)
}

Other than for the type declaration, the code section remains the same.

The sections of code to be adapted for PAOS

Some additional coding considerations will have to be made when programming in PAOS. One example is defensive code for divide by zero protection. An example where you may have had:

xyz_t& unitVector(xyz_t& v)
{
  xyz_t ret;
  double Length = sqrt(
            v.xyz[0]*v.xyz[0] 
            + v.xyz[1]*v.xyz[1] 
            + v.xyz[2]*v.xyz[2]);
  // defensive code to avoid divide by 0
  // When (if) Length = 0.0 then substitute 1.0
  // X = 0.0 / 1.0 = uX 0.0 (not divide by 0)
  // Y = 0.0 / 1.0 = uY 0.0 (not divide by 0)
  // Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0)
  if(Length == 0.0) Length = 1.0;
  ret.xyz[0] = v.xyz[0] / Length;
  ret.xyz[1] = v.xyz[1] / Length;
  ret.xyz[2] = v.xyz[2] / Length;
  return ret;
}

Becomes:
xyzF64vec4& unitVector(xyzF64vec4& v)
{
	xyzF64vec4 ret;
	static const F64vec4	Zero(0.0);
	static const F64vec4	One(1.0);
	F64vec4 Length = sqrt(
		v.xyz[0]*v.xyz[0] 
		+ v.xyz[1]*v.xyz[1] 
		+ v.xyz[2]*v.xyz[2]);
	// defensive code to avoid divide by 0
	// When (if) Length = 0.0 then substitute 1.0
	// X = 0.0 / 1.0 = uX 0.0 (not divide by 0)
	// Y = 0.0 / 1.0 = uY 0.0 (not divide by 0)
	// Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0)
	Length = select_eq(Length, Zero, One, Length);
	ret.xyz[0] = v.xyz[0] / Length;
	ret.xyz[1] = v.xyz[1] / Length;
	ret.xyz[2] = v.xyz[2] / Length;
	return ret;
}

Other than for the type declaration, only one statement was changed.

PAOS advantage in the assembly code

The real advantage of PAOS can be found in the assembly code, with the understanding that the PAOS code is operating on four entities at the same time. Let’s compare the assembly code of both the original AOS code and the modified PAOS code. The source code is the same; the only difference is in the data type. The source code:

    for(int i=0; i<3; ++i)
    {
      // Determine TUI at 1/2 integration step from now
      // using current velocity and acceleration
      // Position of X-End of segment
      TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
        + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
        + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
      // Position of Y-End of segment
      TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
        + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
        + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
      // SUI X->Y
      SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i];
    } // for(int i=0; i<3; ++i)

AOS assembly code

;;;         for(int i=0; i<3; ++i)
;;; 		{
;;; 
;;;             // Determine TUI at 1/2 integration step from now
;;;             // using current velocity and acceleration
;;;             // Position of X-End of segment
;;;             TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
;;;              + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
;;;              + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
;;;             // Position of Y-End of segment
;;;             TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
        vmulsd    xmm1, xmm2, QWORD PTR [_2il0floatpacket.42]   ;66.13
        vmulsd    xmm2, xmm2, xmm2                              ;66.13
        vmulsd    xmm3, xmm2, QWORD PTR [_2il0floatpacket.43]   ;66.13
        vmovddup  xmm0, xmm1                                    ;66.13
        vmovddup  xmm2, xmm3                                    ;66.13
        vmovups   XMMWORD PTR [80+rsp], xmm6                    ;66.13
        vmovapd   xmm6, xmm2                                    ;66.13
        vmovups   XMMWORD PTR [64+rsp], xmm7                    ;66.13
        vmovapd   xmm7, xmm1                                    ;66.13
        vmovups   XMMWORD PTR [48+rsp], xmm8                    ;66.13
        vmovapd   xmm8, xmm3                                    ;66.13
        vmovups   XMMWORD PTR [32+rsp], xmm9                    ;66.13
        vmovapd   xmm9, xmm0                                    ;66.13
        mov       QWORD PTR [136+rsp], rbx                      ;66.13
        mov       rbx, rax                                      ;66.13
        mov       QWORD PTR [128+rsp], rsi                      ;66.13
        mov       rsi, rdx                                      ;66.13
        mov       QWORD PTR [120+rsp], rdi                      ;66.13
        mov       rdi, rcx                                      ;66.13
        mov       QWORD PTR [112+rsp], r12                      ;66.13
        mov       r12, r8                                       ;66.13
        mov       QWORD PTR [104+rsp], r13                      ;66.13
        mov       r13, r9                                       ;66.13
        mov       QWORD PTR [96+rsp], r14                       ;66.13
        mov       r14, r10                                      ;66.13
.B5.3::                         ; Preds .B5.4 .B5.2
        vmulpd    xmm4, xmm9, XMMWORD PTR [24+rbx+r12]          ;66.13
        vaddpd    xmm5, xmm4, XMMWORD PTR [24+rbx+r13]          ;66.13
        vmulpd    xmm4, xmm6, XMMWORD PTR [24+rbx+rdi]          ;66.13
        vaddpd    xmm0, xmm5, xmm4                              ;66.13
        vmulpd    xmm4, xmm9, XMMWORD PTR [rbx+r12]             ;62.13
        vmulpd    xmm5, xmm6, XMMWORD PTR [rbx+rdi]             ;62.13
        vaddpd    xmm4, xmm4, XMMWORD PTR [rbx+r13]             ;62.13
        vaddpd    xmm4, xmm4, xmm5                              ;62.13
;;;              + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
;;;              + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
;;;           // SUI X->Y
;;;           SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i];
        vsubpd    xmm4, xmm0, xmm4                              ;70.11
        vmovupd   XMMWORD PTR [144+rsp], xmm4                   ;70.11
;;; 		} // for(int i=0; i<3; ++i)
        lea       rcx, QWORD PTR [144+rsp]                      ;73.3
        vmulsd    xmm4, xmm7, QWORD PTR [40+rbx+r12]            ;66.13
        vmulsd    xmm5, xmm8, QWORD PTR [40+rbx+rdi]            ;66.13
        vaddsd    xmm4, xmm4, QWORD PTR [40+rbx+r13]            ;66.13
        vaddsd    xmm0, xmm4, xmm5                              ;66.13
        vmulsd    xmm4, xmm7, QWORD PTR [16+rbx+r12]            ;62.13
        vmulsd    xmm5, xmm8, QWORD PTR [16+rbx+rdi]            ;62.13
        vaddsd    xmm4, xmm4, QWORD PTR [16+rbx+r13]            ;62.13
        vaddsd    xmm4, xmm4, xmm5                              ;62.13
        vsubsd    xmm4, xmm0, xmm4                              ;70.11
        vmovsd    QWORD PTR [160+rsp], xmm4                     ;70.11

The AOS code computed one SUI vector in 46 instructions, using a mixture of 2-wide vector and scalar instructions. This loop was unrolled.

PAOS assembly code

;;;         for(int i=0; i<3; ++i)
;;; 		{
;;; 
;;;             // Determine TUI at 1/2 integration step from now
;;;             // using current velocity and acceleration
;;;             // Position of X-End of segment
;;;             TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
        vdivpd    ymm1, ymm0, YMMWORD PTR [_2il0floatpacket.437] ;213.13
        vmulpd    ymm0, ymm0, ymm0                              ;213.13
        vmovupd   YMMWORD PTR [288+r13], ymm1                   ;213.13
        vdivpd    ymm0, ymm0, YMMWORD PTR [_2il0floatpacket.438] ;213.13
        vmovupd   YMMWORD PTR [320+r13], ymm0                   ;213.13
        mov       QWORD PTR [456+rsp], rbx                      ;213.13
        mov       rbx, rax                                      ;213.13
        mov       QWORD PTR [448+rsp], rsi                      ;213.13
        mov       rsi, rdx                                      ;213.13
        mov       QWORD PTR [440+rsp], rdi                      ;213.13
        mov       rdi, rcx                                      ;213.13
        mov       QWORD PTR [432+rsp], r12                      ;213.13
        mov       r12, r8                                       ;213.13
        mov       QWORD PTR [424+rsp], r14                      ;213.13
        mov       r14, r9                                       ;213.13
        mov       QWORD PTR [416+rsp], r15                      ;213.13
        mov       r15, r10                                      ;213.13
.B5.3::                         ; Preds .B5.4 .B5.2
        vmovupd   ymm4, YMMWORD PTR [288+r13]                   ;213.13
;;;              + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
;;;              + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
;;;             // Position of Y-End of segment
;;;             TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
;;;              + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
;;;              + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
;;;           // SUI X->Y
;;;           SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i];
;;; 		} // for(int i=0; i<3; ++i)
;;; 		// insert code to assure optimizer does not remove loop
;;; 		foo(SUI);
        lea       rcx, QWORD PTR [192+r13]                      ;224.3
        vmovupd   ymm3, YMMWORD PTR [320+r13]                   ;213.13
        vmulpd    ymm5, ymm4, YMMWORD PTR [rbx+r12]             ;213.13
        vmulpd    ymm1, ymm3, YMMWORD PTR [rbx+rdi]             ;213.13
        vaddpd    ymm0, ymm5, YMMWORD PTR [rbx+r14]             ;213.13
        vmulpd    ymm5, ymm4, YMMWORD PTR [96+rbx+r12]          ;217.13
        vaddpd    ymm2, ymm0, ymm1                              ;213.13
        vmulpd    ymm1, ymm3, YMMWORD PTR [96+rbx+rdi]          ;217.13
        vmovupd   YMMWORD PTR [r13], ymm2                       ;213.13
        vaddpd    ymm0, ymm5, YMMWORD PTR [96+rbx+r14]          ;217.13
        vaddpd    ymm5, ymm0, ymm1                              ;217.13
        vsubpd    ymm2, ymm5, ymm2                              ;221.11
        vmovupd   YMMWORD PTR [96+r13], ymm5                    ;217.13
        vmovupd   YMMWORD PTR [192+r13], ymm2                   ;221.11
        vmulpd    ymm0, ymm4, YMMWORD PTR [32+rbx+r12]          ;213.13
        vmulpd    ymm2, ymm3, YMMWORD PTR [32+rbx+rdi]          ;213.13
        vmulpd    ymm5, ymm4, YMMWORD PTR [128+rbx+r12]         ;217.13
        vaddpd    ymm1, ymm0, YMMWORD PTR [32+rbx+r14]          ;213.13
        vaddpd    ymm0, ymm5, YMMWORD PTR [128+rbx+r14]         ;217.13
        vaddpd    ymm1, ymm1, ymm2                              ;213.13
        vmulpd    ymm2, ymm3, YMMWORD PTR [128+rbx+rdi]         ;217.13
        vmovupd   YMMWORD PTR [32+r13], ymm1                    ;213.13
        vaddpd    ymm5, ymm0, ymm2                              ;217.13
        vsubpd    ymm1, ymm5, ymm1                              ;221.11
        vmovupd   YMMWORD PTR [128+r13], ymm5                   ;217.13
        vmovupd   YMMWORD PTR [224+r13], ymm1                   ;221.11
        vmulpd    ymm0, ymm4, YMMWORD PTR [64+rbx+r12]          ;213.13
        vmulpd    ymm2, ymm3, YMMWORD PTR [64+rbx+rdi]          ;213.13
        vmulpd    ymm4, ymm4, YMMWORD PTR [160+rbx+r12]         ;217.13
        vmulpd    ymm3, ymm3, YMMWORD PTR [160+rbx+rdi]         ;217.13
        vaddpd    ymm1, ymm0, YMMWORD PTR [64+rbx+r14]          ;213.13
        vaddpd    ymm0, ymm1, ymm2                              ;213.13
        vaddpd    ymm1, ymm4, YMMWORD PTR [160+rbx+r14]         ;217.13
        vmovupd   YMMWORD PTR [64+r13], ymm0                    ;213.13
        vaddpd    ymm1, ymm1, ymm3                              ;217.13
        vsubpd    ymm0, ymm1, ymm0                              ;221.11
        vmovupd   YMMWORD PTR [160+r13], ymm1                   ;217.13
        vmovupd   YMMWORD PTR [256+r13], ymm0                   ;221.11

The PAOS code used all 4-wide vectors, computing four segments at the same time (from four different tethers). The for loop was unrolled, and the resulting code was a spartan 56 instructions per four SUI results, giving 56/4 = 14 instructions per SUI.

AOS took 46 instructions per SUI calculation; PAOS took 14 instructions per SUI. Assuming equal time per instruction, PAOS has a performance advantage of 46/14 = 3.28x over AOS (for this code sample).

Caution: do not assume that you will experience an actual 3.28x improvement in an application as a whole.

While not all applications are suitable for conversion from AOS to PAOS, portions of your application may benefit from conversion. Later in this article, we will show the results of using PAOS modifications in a FORTRAN application for the study of tether systems. In this application only 25% of the code was modified for PAOS (mostly copy and paste, search, and replace). A collection of spacecraft were interconnected with a collection of 96 tethers. This configuration yielded a 2x performance boost. Though not the 3.28x optimization estimate for the code snippet above, it shows a rather impressive performance gain for the programming effort involved.

Program conversion techniques

In the tether simulation program, I took the body of the “type TypeFiniteSolution” and placed it into a separate file (FiniteSolution.inc). Compilation of the module containing the type definition uses the FORTRAN preprocessor (FPP):

! Note, "real" is not defined at this point
! i.e. "real" in the FPP #include file passes through as "real"
type TypeFiniteSolution
    SEQUENCE
#include "FiniteSolution.inc"
    ! No additional data to the original type
end type TypeFiniteSolution

! XMM, YMM and YMM02 variants replace "real" with
! "type(TypeXMM)", "type(TypeYMM)" and "type(TypeYMM02)"
! XMM variant
#define real type(TypeXMM)
type TypeFiniteSolutionXMM
    SEQUENCE
#include "FiniteSolution.inc"
    ! additional members for TypeXMM
    integer :: TetherNumber0
    integer :: TetherNumber1
end type TypeFiniteSolutionXMM
#undef real

! YMM variant
#define real type(TypeYMM)
type TypeFiniteSolutionYMM
    SEQUENCE
#include "FiniteSolution.inc"
    ! additional members for TypeYMM
    integer :: TetherNumber0
    integer :: TetherNumber1
    integer :: TetherNumber2
    integer :: TetherNumber3
end type TypeFiniteSolutionYMM
#undef real

! YMM02 variant
#define real type(TypeYMM02)
type TypeFiniteSolutionYMM02
    SEQUENCE
#include "FiniteSolution.inc"
    ! additional members for TypeYMM02
    integer :: TetherNumber0
    integer :: TetherNumber1
    integer :: TetherNumber2
end type TypeFiniteSolutionYMM02
#undef real

We can use the same technique for the subroutines and functions by using the FPP and redefining “real” to the appropriate type. The main difference is that the non-SIMD code would typically not have the .v suffix, so you may need two versions of any subroutines and functions that are sensitive to SIMD-typed variables.

Generic subroutine and function interfaces

Use of generic subroutine and function interfaces is a major requirement for conversion to PAOS. Consider the requirements of a CROSS product of two arrays of dimension 3.

    interface CROSS
        RECURSIVE SUBROUTINE CROSS_r_r_r(VA,VB,VC)
            use MOD_ALL
            real :: VA(3), VB(3), VC(3)
        END SUBROUTINE CROSS_r_r_r
        
        RECURSIVE SUBROUTINE CROSS_r_r_ymm(VA,VB, VC)
	        USE MOD_ALL
            real :: VA(3), VB(3)
            type(TypeYMM) :: VC(3)
        END SUBROUTINE CROSS_r_r_ymm
        
        RECURSIVE SUBROUTINE CROSS_r_r_ymm02(VA,VB, VC)
	        USE MOD_ALL
            real :: VA(3), VB(3)
            type(TypeYMM02) :: VC(3)
        END SUBROUTINE CROSS_r_r_ymm02
        
        RECURSIVE SUBROUTINE CROSS_r_r_xmm(VA,VB, VC)
	        USE MOD_ALL
            real :: VA(3), VB(3)
            type(TypeXMM) :: VC(3)
        END SUBROUTINE CROSS_r_r_xmm
        
        RECURSIVE SUBROUTINE CROSS_ymm_ymm_ymm(VA,VB,VC)
            use MOD_ALL
            type(TypeYMM) :: VA(3), VB(3), VC(3)
        END SUBROUTINE CROSS_ymm_ymm_ymm
        
        RECURSIVE SUBROUTINE CROSS_ymm02_ymm02_ymm02(VA,VB,VC)
            use MOD_ALL
            type(TypeYMM02) :: VA(3), VB(3), VC(3)
        END SUBROUTINE CROSS_ymm02_ymm02_ymm02
        
        RECURSIVE SUBROUTINE CROSS_xmm_xmm_xmm(VA,VB, VC)
            use MOD_ALL
            type(TypeXMM) :: VA(3), VB(3), VC(3)
        END SUBROUTINE CROSS_xmm_xmm_xmm
    end interface CROSS

Note that a typical application will have various (but not all) permutations of the types of arguments. A generic subroutine interface will sort this out for you.

! EVAL ANGULAR MOMENTUM DERIVATIVE TERM
    CALL CROSS (RPI,RPIDD,  pFiniteSolutionSIMD.rRVDI)

The above calls CROSS_r_r_ymm

! FORM VECTOR IN DIRECTION OF TETHER FRAME "K" UNIT VECTOR
    CALL CROSS (pFiniteSolutionSIMD.rUITI, pFiniteSolutionSIMD.rUJOI,  pFiniteSolutionSIMD.rUKTI)

The above calls CROSS_ymm_ymm_ymm

Subroutines were replicated from the original AOS layout for each of the additional PAOS widths (2-wide, 3-wide, 4-wide). In this respect, C++ is easier, since templates can be used. In FORTRAN with Visual Studio*, you select a subroutine, select all, and copy it to the clipboard. Then in the Solution Explorer, click on the project of interest, choose Add Item, and use the same name as the original source file plus a suffix of your choice (I used ymm, ymm02, and xmm). Paste the contents of the original subroutine into the new empty file. The remaining step is to change the types and add an appropriate USE module.

Some additional considerations for the changed data will have to be addressed. While you can use operator functions in FORTRAN to eliminate suffixing variables with “.v” (the suffixing itself can be done with search and replace), I chose not to, due to an issue with the compiler not inlining the code. This inlining issue may be resolved by a product update by the time you read this article.

Some of your code sections may have to address the PAOS vector horizontally. I wanted to have the same code text for all the different PAOS widths (2, 3, 4 now, 5, 6, 7, 8, … later). In the FORTRAN subroutine, I added:

    integer :: s
    integer, parameter :: sLast = 3

And used it as:

! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT
     do s=0,sLast
         if(DUMEL.v(s) .ne. 0.0_8) then
            TUI.v(s) = SUI.v(s) / DUMEL.v(s)
         else
            call RandomUnitVector(TUI.v(s))
         endif
     end do

(to C++ programmers “parameter” is like “static const”)

The above is a divide-by-zero protection except that instead of returning a NULL vector, a random unit vector is returned. The above modification runs as AOS (non-PAOS).

In the case where production of a NULL vector is acceptable, the code becomes:

 RadiusSquared.v = RelativePOSE_IF(1).v**2 &
     & + RelativePOSE_IF(2).v**2 &
     & + RelativePOSE_IF(3).v**2
 ! N.B we are comparing a bead position within a tether against
 ! a gravitational body position they should not coincide
 ! ignore if at center of object
 !DEC$ VECTOR ALWAYS
 do s = 0, sLast
   if(RadiusSquared.v(s) .eq. 0.0_8) RadiusSquared.v(s) = 1.0_8
 end do
 Radius.v = dsqrt(RadiusSquared.v)
 ACCE_IFlocal(1).v = RelativePOSE_IF(1).v &
    & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v)
 ACCE_IFlocal(2).v = RelativePOSE_IF(2).v &
    & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v)
 ACCE_IFlocal(3).v = RelativePOSE_IF(3).v &
    & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v)

Optimization notes on the above code

The preventive code for the potential divide by 0 is moved to test the square of the vector, rather than testing after the square root. The result will be the same regardless of the path taken; however, the DSQRT will now never be asked to perform a square root of 0.0. Depending on processor microcode, the square root of 0.0 may take a longer path.

The above do s loop, having a constant sLast, will be unrolled and vectorized; surprisingly, this loop will be fully vectorized without branch instructions.

Advantages of PAOS over AOS

While AOS can vectorize the larger arrays allocated to NumberOfBeads related allocations and has limited vectorization opportunities for the small fixed arrays of dimensions (3) and (3,3), AOS is unable to vectorize the manipulation of the scalars. PAOS can fully vectorize the scalars, fully vectorize the small fixed arrays of dimensions (3) and (3,3), and has better opportunities to vectorize the larger arrays allocated to NumberOfBeads related allocations.

Advantage with computation of the scalars

Traditional non-SIMD statements to be repeated on each tether

! CALCULATE 1ST TERM
! AND ACCUMULATE THIS INCREMENT
DUMSCL = DUMCEE * (  2.0_8*(pFiniteSolution%rELD / pFiniteSolution%rEL)**2 &
& - pFiniteSolution%rELDD / pFiniteSolution%rEL  )

SIMD statement to be repeated on a group of four tethers

! CALCULATE 1ST TERM
! AND ACCUMULATE THIS INCREMENT
DUMSCL.v = DUMCEE.v * (  2.0_8*(pFiniteSolution%rELD.v / pFiniteSolution%rEL.v)**2 &
& - pFiniteSolution%rELDD.v / pFiniteSolution%rEL.v  )

Note that the code was changed only slightly: the same statements are used, with the addition of the “.v” suffix to the variables typed as TypeYMM, TypeXMM, and TypeYMM02. While this suffix can be eliminated by declaring operator functions (I may make this revision later), the suffixed form, even in a Debug build with full debugging diagnostics, performed the operations in-line and with vector instructions. These code modifications can essentially be performed using search and replace. If the programmer elects to use operator overload functions, then no code change is required (other than the type declarations).

Advantage in computation with small fixed arrays

Manipulating the small fixed-size arrays has an advantage in PAOS. Consider the positions of the front and back ends of a tether and you are to calculate the vector between the front and back ends as well as the unit vector between the front and back ends (pFiniteSolution% pointers removed for clarity):

    real(8) :: Front(3), Back(3), ELI(3), UELI(3), Length

    ELI(1)   = Back(1) - Front(1)
    ELI(2)   = Back(2) - Front(2)
    ELI(3)   = Back(3) - Front(3)
    Length = DSQRT(ELI(1)**2 + ELI(2)**2 + ELI(3)**2)
    UELI(1) = ELI(1) / Length
    UELI(2) = ELI(2) / Length
    UELI(3) = ELI(3) / Length

The scalar AOS code can vectorize only portions of the above code. On SSE platforms, only two of the three components of Back and Front can be loaded and subtracted at once; the third component is loaded and subtracted separately. The “ELI(1)**2 + ELI(2)**2” can be performed in one operation and the “ELI(3)**2” as a separate operation. The remaining problem is that SSE now has to perform a horizontal add of the results of “ELI(1)**2 + ELI(2)**2”, then an additional scalar add of the “ELI(3)**2”. The result can then be run through a scalar DSQRT operation to produce the Length. The division of ELI(1:3) by the scalar is performed in two steps: one 2-wide for the ELI(1) and ELI(2) components, and one for the ELI(3) component. The store is performed in two steps, as mentioned for the load.

On AVX systems, the code loads each of the Back and Front vectors in two parts, one for the var(1:2) elements and one for the var(3) element, plus an additional XOR to pre-clear the 4th element of the YMM register. The subtraction is performed in one operation, as is the squaring for “ELI(1)**2 + ELI(2)**2 + ELI(3)**2”. However, at this point, the AVX code has to horizontally add the three partial results and then perform a scalar DSQRT to produce the Length. The division of ELI by Length could be performed in one operation after broadcasting the Length across a YMM register; however, this is generally not done in one step due to the uncertainty of the undefined 4th element of the YMM vector. Therefore, the division is performed in two steps: 2-wide, then 1-wide. Note that the load and store are each performed in two parts, with an additional XOR to pre-clear the 4th element of the load YMM register.

The PAOS has a much easier job and it fully vectorizes all of the code:

    type(TypeYMM) :: Front(3), Back(3), ELI(3), UELI(3), Length

    ELI(1).v   = Back(1).v - Front(1).v
    ELI(2).v   = Back(2).v - Front(2).v
    ELI(3).v   = Back(3).v - Front(3).v
    Length.v = DSQRT(ELI(1).v**2 + ELI(2).v**2 + ELI(3).v**2)
    UELI(1).v = ELI(1).v / Length.v
    UELI(2).v = ELI(2).v / Length.v
    UELI(3).v = ELI(3).v / Length.v

In the first statements, each element of the ELI vector is computed 4-wide (four tethers at once). The partial results are kept 4-wide in separate YMM registers. The squaring is now performed on three separate registers, each 4-wide. What used to be a horizontal add of three elements becomes a vertical add of three 4-wide elements. Next, the DSQRT is performed on the 4-wide sum of squares (on Sandy Bridge, DSQRT is internally performed on each half of the YMM register; Ivy Bridge performs a full 4-wide DSQRT). The resultant 4-wide lengths (four separate values) are then used to produce four unit vectors. The above code is completely 4-wide SIMD and potentially executes faster for four tethers than the scalar version does for one tether (speed is dependent on where cache lines are split).

Advantage with large arrays

When looping through larger arrays, similar advantages can be attained when the arrays contain the three (X, Y, Z) components. Effectively the PAOS data flow computations become:

 	OBJECT(1)%X(1) | OBJECT(2)%X(1) | OBJECT(3)%X(1) | OBJECT(4)%X(1)
	OBJECT(1)%Y(1) | OBJECT(2)%Y(1) | OBJECT(3)%Y(1) | OBJECT(4)%Y(1)
	OBJECT(1)%Z(1) | OBJECT(2)%Z(1) | OBJECT(3)%Z(1) | OBJECT(4)%Z(1)
	OBJECT(1)%X(2) | OBJECT(2)%X(2) | OBJECT(3)%X(2) | OBJECT(4)%X(2)
	OBJECT(1)%Y(2) | OBJECT(2)%Y(2) | OBJECT(3)%Y(2) | OBJECT(4)%Y(2)
	OBJECT(1)%Z(2) | OBJECT(2)%Z(2) | OBJECT(3)%Z(2) | OBJECT(4)%Z(2)

Where OBJECT represents the structure containing the array of the application's vector(3)s, now remapped so that each group of four objects stores its vector(3) in three SIMD 4-wide vectors.

Scalar (original) representation

    ! Position with respect to IP IF
    real, pointer :: rBeadIPIF(:,:)   	! (3,NumberOfBeads+2)

AVX 4-wide SIMD representation

    ! Position with respect to IP IF
    type(TypeYMM), pointer :: rBeadIPIF(:,:)   	! (3,NumberOfBeads+2)

The code to produce the vector between adjacent beads and the unit vector of that vector.

Original code (abbreviated) that will optimize to include partial vectorization:

    DO I=1,nBeads
      ELI(1)   = pFiniteSolution%rBeadIPIF(1,I) - pFiniteSolution%rBeadIPIF(1,I+1)
      ELI(2)   = pFiniteSolution%rBeadIPIF(2,I) - pFiniteSolution%rBeadIPIF(2,I+1)
      ELI(3)   = pFiniteSolution%rBeadIPIF(3,I) - pFiniteSolution%rBeadIPIF(3,I+1)
      Length = DSQRT(ELI(1)**2 + ELI(2)**2 + ELI(3)**2)
      pFiniteSolution%rTUI(1,I) = ELI(1) / Length
      pFiniteSolution%rTUI(2,I) = ELI(2) / Length
      pFiniteSolution%rTUI(3,I) = ELI(3) / Length
   END DO

YMM SIMD code fully vectorized across 4 tethers:

    DO I=1,nBeads
      ELI(1).v   = pFiniteSolution%rBeadIPIF(1,I).v - pFiniteSolution%rBeadIPIF(1,I+1).v
      ELI(2).v   = pFiniteSolution%rBeadIPIF(2,I).v - pFiniteSolution%rBeadIPIF(2,I+1).v
      ELI(3).v   = pFiniteSolution%rBeadIPIF(3,I).v - pFiniteSolution%rBeadIPIF(3,I+1).v
      Length.v = DSQRT(ELI(1).v**2 + ELI(2).v**2 + ELI(3).v**2)
      pFiniteSolution%rTUI(1,I).v = ELI(1).v / Length.v
      pFiniteSolution%rTUI(2,I).v = ELI(2).v / Length.v
      pFiniteSolution%rTUI(3,I).v = ELI(3).v / Length.v
    END DO

Advantages over SOA

The SOA layout requires more cache-line loads and more TLB loads than PAOS. An example is the case of producing the vector and unit vector between the beads of the tether. SOA would require separate cache lines and TLB entries for:

   rBeadIPIF(I,1,t) rBeadIPIF(I+1,1,t) (cache line split 1/8 of time)
   rBeadIPIF(I,2,t) rBeadIPIF(I+1,2,t) (cache line split 1/8 of time)
   rBeadIPIF(I,3,t) rBeadIPIF(I+1,3,t) (cache line split 1/8 of time)
   rTUI(I,1,t) (cache line split 1/8 of time)
   rTUI(I,2,t) (cache line split 1/8 of time)
   rTUI(I,3,t) (cache line split 1/8 of time)

Six cache lines per bead for each tether.

PAOS requires four cache lines per bead, for four tethers at a time:

   rBeadIPIF(1,I).v(0:3) rBeadIPIF(1,I+1).v(0:3) (cache line shared with line below)
   rBeadIPIF(2,I).v(0:3) rBeadIPIF(2,I+1).v(0:3) (1/2 of time in line above or below)
   rBeadIPIF(3,I).v(0:3) rBeadIPIF(3,I+1).v(0:3) (cache line shared with line above)
   rTUI(1,I).v(0:3) (cache line shared with line below)
   rTUI(2,I).v(0:3) (1/2 of time in line above or below)
   rTUI(3,I).v(0:3) (cache line shared with line above)

PAOS traverses cache lines faster (and therefore flushes them faster); however, the traversal occurs only 1/4 as many times, since SIMD packs 4 tethers into the data. While the total data storage is the same for both layouts, the demands on separate cache lines and TLB entries are reduced.

Benchmark program

To illustrate the advantage of PAOS, we will adapt a real application rather than use a synthetic benchmark. I chose a complex simulation program, one I am very familiar with, as the benchmark. Over a period of seven years I have optimized and re-optimized this code many times, and prior to the conversion to PAOS I thought there wasn't much left to squeeze out of it. I knew that a large portion of the code was not amenable to vectorization as written. I also knew that the code the compiler did vectorize and optimize was not all that could be vectorized. To resolve this, I developed the packed-array-of-structures technique for use with my application. I considered the SOA format, but knew it would have only a marginal effect for this particular application.

The program is a highly optimized OpenMP* FORTRAN program for simulation of tethers attached to spacecraft. I use the program for investigating Space Elevator design issues as well as tethered satellites. I chose a simulation study configuration for an EDDE spacecraft to use as a benchmark. This spacecraft is proposed for use in debris elimination in near earth orbit (dead satellites and pieces thereof).

The EDDE spacecraft under study is longer than the one depicted above and consists of:

2 Net Managers
2 Controller and Emitters
9 Solar Arrays
2 Relatively short tethers (~15 m) connecting the Payload Managers to the Controller/Emitter
10 Long tethers (1 km each) of aluminum tape

A total of 12 tethers and 13 connected satellites

The PAOS modifications made to this program were performed only on the finite-element portion of the tether simulation. This represents an important but smaller percentage of the total code (~25%). The remaining tether portions and the body portion (Payload Managers, Controllers and Emitters, Solar Arrays, Earth, Moon, Planets, etc.) have yet to be converted to PAOS, so the full advantage of PAOS for this application has not yet been realized.

The Visual Studio solution for this program contains twelve FORTRAN projects and one C++ project; the C++ project is used to query the processor for supported SIMD features. A total of 554 F90 source files are used, many of which contain multiple subroutines and functions. Approximately 100 additional files were added to the original solution for PAOS support. The contents of these files are essentially the same as the original 100 source files, with the addition of the ".v" suffix and different data types for the different SIMD packing formats.

The test platform was an Intel Core i7 2600K (code-named Sandy Bridge), a 4-core processor with HT, running Windows* 7 x64 Pro with 16 GB of RAM.

The system was configured at various times with HT on/off and Turbo Boost on/off. The tests were mainly run with HT off, because this is a 4-core processor and the application is heavy on double-precision floating point.

Because a significant portion of the code deals with the Bodies (approximately 80% of the non-PAOS code), I increased the number of elements (beads) used to simulate the tethers, from 1,000 to 10,000 beads per tether. This factor-of-10 increase reduces the proportional load of the non-PAOS code relative to the PAOS code by a factor of 10. Though more beads could have been used, I decided not to bias the test results too heavily towards the use of PAOS.

To balance iteration loads, the two short end tethers (15 beads each) were replaced with the 1,000-meter aluminum tethers at 10,000 beads, thus making 12 tethers of the same design and integration options.

In this application, tethers with the same number of beads and the same integration options (aerodynamic, electrodynamic, etc.) are suitable for consolidation into SIMD PAOS layouts. Depending on the SIMD compatibility of the tethers and the number of such tethers, 2-wide, 3-wide, or 4-wide packing is chosen at run time. Twelve tethers permit full packing combinations of one (none), 2-wide, 3-wide, and 4-wide.
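The article does not give the run-time selection rule, so the following C sketch (`choose_packs` is a hypothetical name) shows one plausible greedy approach: take 4-wide packs while possible, then one narrower pack (1-, 2-, or 3-wide) for the remainder, so that twelve compatible tethers become three full 4-wide packs:

```c
/* Greedy grouping of nTethers SIMD-compatible tethers into packs:
   prefer 4-wide packs, then one narrower pack for any remainder.
   widths[] receives the width of each pack; the pack count is returned. */
int choose_packs(int nTethers, int widths[])
{
    int m = 0;
    while (nTethers >= 4) { widths[m++] = 4; nTethers -= 4; }
    if (nTethers > 0) widths[m++] = nTethers;  /* 1-, 2-, or 3-wide */
    return m;                                  /* number of packs produced */
}
```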

Wanting to run various permutations of the number of threads (1:4 and 1:8) and various packing densities of the PAOS (1-wide (not packed), 2-wide, 3-wide, and 4-wide doubles), I increased the number of tethers from 12 to 96, so that any combination of packing (1, 2, 3, 4) and thread count (1, 2, 4, 8) presents an equal number of passes per thread in the outer loop.

The additional tethers were attached to the existing satellites so as not to increase the load of the non-PAOS portion of the code used by the body (satellite) simulation. Ideal tether counts for 3, 5, 6, and 7 threads were not configured, as this would have greatly increased the number of tethers.

The test program now consists of:

13	satellites, connected by
96	1 km tethers
  	each tether having
  	~85 scalars
 	57 Arrays of dimension (3)
 	3 Arrays of dimension (3,3)
  	22 Arrays of dimension (3,~10000)
 	52 Arrays of dimension (~10000)
96 * (85 + (57*3) + (3*3*3) + (22*3*1000) + (52*10000)) = 56,283,168 doubles (450 MB)
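The storage arithmetic above can be verified directly; the function below (an illustrative name, not from the application) reproduces the formula and the total:

```c
/* Reproduce the per-tether and total double counts from the formula above:
   85 scalars + 57*(3) + 3*(3,3) + 22*(3,1000) + 52*(10000) doubles,
   times 96 tethers. */
long paos_doubles(void)
{
    long perTether = 85 + 57*3 + 3*3*3 + 22*3*1000 + 52*10000L;  /* 586,283 */
    return 96L * perTether;                                      /* 56,283,168 */
}
```

At 8 bytes per double, this is 450,265,344 bytes, i.e., the ~450 MB quoted.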

There are additional variables being manipulated per tether, and the satellites, planets, etc.

A single-threaded configuration test run, without PAOS SIMD, was made to determine a suitable integration interval for a base-line reference simulation time. On each anniversary of that simulation time interval, the wall-clock elapsed time since the prior anniversary is displayed.

Tests were run for each combination of packing density (no pack, 2-wide, 3-wide, 4-wide) and thread count (1, 2, 3, 4). Each configuration was run for four report intervals, and the smallest value was selected for the comparison tables, under the assumption that it suffered the least interference from whatever else may have been running on the system (Windows 7 x64, anti-virus, Visual Studio phoning home, etc.).

96 tethers, each with 10,000 beads - Turbo On, HT Off

   no pack SIMD      2-wide SIMD         3-wide SIMD         4-wide SIMD
1  202.461830544090  98.8166282926832  110.127472870544  96.2114920825516
2  113.080892757974  64.3881699061922  69.6634485553468  62.4252337711068
3  88.0928441275801  56.0872621388371  60.3187281050650  57.5615369606003
4  78.3640600375229  56.5572142589117  58.9836796998115  58.8164001500936

96 tethers, each with 10,000 beads - Turbo Off, HT Off

   no pack SIMD      2-wide SIMD         3-wide SIMD         4-wide SIMD
1  222.377931407129  108.414170956848  120.604551444653  105.015817485928
2  119.739932457786  67.4135627767355  73.9538575609758  65.4644253658544
3  90.9758964352723  56.9652196622883  61.5711771857414  58.8302754221386
4  80.8031147467163  57.1387076923083  59.7432093058160  59.3558339962474

The first test, with one thread, Turbo On, and HT Off, showed an impressive 2-wide speedup of 2.05x over the original code. The 3-wide pack gave a 1.84x speedup, and the 4-wide a 2.10x speedup over the original code. The difference between the 2-wide and 4-wide was only marginal at 1.027x.
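The quoted ratios can be recomputed directly from the single-thread row of the first table (Turbo On, HT Off):

```c
#include <math.h>

/* Recompute the single-thread speedups from the first table's row 1. */
int speedup_check(void)
{
    double nopack = 202.461830544090;
    double w2 = 98.8166282926832;
    double w3 = 110.127472870544;
    double w4 = 96.2114920825516;
    return fabs(nopack / w2 - 2.05) < 0.01  &&  /* 2-wide vs no pack */
           fabs(nopack / w3 - 1.84) < 0.01  &&  /* 3-wide vs no pack */
           fabs(nopack / w4 - 2.10) < 0.01  &&  /* 4-wide vs no pack */
           fabs(w2 / w4 - 1.027)    < 0.005;    /* 2-wide vs 4-wide  */
}
```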

The multi-threading scaling beyond two threads seemed to suffer; however, keep in mind that the simulation was memory intensive (~450 MB of working memory).

The 3-wide pack performance is worth a closer look, and here is why it may be of interest to you: additional test runs were made with a bead count of 4,000.

4000 beads - Turbo On, HT On

   no pack SIMD      2-wide SIMD         3-wide SIMD         4-wide SIMD
1  91.3673146779801  50.2108811987264  53.7199640737510  47.4793344518262
2  55.4531537381101  34.9485817566820  35.3566061461863  31.8637723342063
3  45.6128820578588  31.2057180011579  30.7373190216704  28.4547075003547
4  45.9029787147183  30.9318707075290  30.1450711425587  29.1348981424198
5  45.9152253990896  30.9467731366030  30.1314135844295  29.2071261732417
6  52.4540275792824  36.6395676842658  34.7508676123107  33.7327393855012
7  55.0174089813809  39.3389245960061  36.7703592835023  36.2333221964363
8  60.0940263305092  42.1638867950896  39.0601884747230  39.5467420973800

Note that with the lower bead count, the single-thread speedup of 2-wide over no pack fell from 2.05x to 1.82x. This was due to the change in the ratio of non-PAOS code to PAOS code. Also note that for the 3- to 8-thread runs the 3-wide outperformed the 2-wide, and in the 8-thread run the 3-wide outperformed the 4-wide. It is apparent from this chart that, under these circumstances, a choice of 3 threads would be optimal for 4-wide. Different processor designs, bead counts, and numbers of tethers would affect the results greatly. Some combinations of last-level cache size and working data set size may favor 3-wide packing.

The marginal speedup for scaling beyond 2 threads is likely due to three factors:

a) The full data set of 450 MB exceeded the L3 cache, and the application was principally memory-bandwidth bound. It should be noted that not all of these data are manipulated during the simulation run (aerodynamic and electrodynamic forces, among others, were disabled).

b) The Intel Core i7 (Sandy Bridge) AVX implementation does not have 256-bit data paths throughout, so some internal transfers require two clock ticks. The newer Ivy Bridge processor resolves this issue.

c) With dual-bank memory, Sandy Bridge is not as memory-bus efficient as Ivy Bridge with quad-bank memory. Additionally, this system has a single memory node; a 2-processor Ivy Bridge system in a NUMA configuration, with quad-bank memory on each of its two NUMA nodes, can improve 4-wide SIMD performance.

Conclusions

PAOS is exceptionally effective in some applications. I anticipate additional performance benefits when I replace my 1P Sandy Bridge with a 1P or 2P Ivy Bridge system. Applications that are not memory-bandwidth limited may see even better gains from the 4-wide PAOS SIMD (AVX). PAOS increases working-storage requirements while exhibiting better cache-line efficiency (a higher hit ratio): 2-wide SIMD doubles the working-storage cache requirements over non-packed code, and 4-wide SIMD quadruples them. Decreasing or partitioning loop runs may be in order when using PAOS. Your results will differ depending on working-set size and processor generation.

Copyright © 2012 Jim Dempsey. All rights reserved.

Mr. Dempsey is available for program- or code-related consultations.
jim.00.dempsey@gmail.com http://www.linkedin.com/in/jim00dempsey

*Other names and brands may be claimed as the property of others.

For more information regarding compiler optimization, see our Optimization Notice.