By jimdempseyatthecove, Added

by Jim Dempsey

Programmers are familiar with the small vector capability of today’s processors, referred to as Single Instruction – Multiple Data (SIMD). SIMD, in Intel Architecture (IA32, IA64, Intel64), was introduced with MMX, then followed up with SSE, SSSE, SSE2, SSE3, AVX, AVX2 (and a few others). The major benefit with SIMD is that a single instruction can perform the same instruction on multiple pieces of data at the same time. Though the SIMD instructions support different sizes and types of data (byte, word, dword, qword, (dqword), float, double, …), this article will address the use of SIMD for manipulating doubles known as REAL*8 in FORTRAN (a.k.a. DOUBLE PRECISION).

Most processors support SSE2 or SSE3 SIMD instruction formats, and the recent Intel* Core* i7 Sandy Bridge supports AVX with AVX2 and newer SIMD formats in various stages of development. From the high-level programmer’s view point, the main difference is in the width of the data (number of multiple data) manipulated by the SIMD instruction. While the high-level language programmer will rely on the compiler to generate optimal code paths for instruction differences, this same programmer can take advantage of the knowledge of the width of the MD portion of SIMD when laying out the data.

The SIMD widths supported are:

SSEn 2-wide doubles (2 x 64-bit)

AVX 4-wide doubles (4 x 64-bit)

AVX2 4-wide doubles (4 x 64-bit)

MIC 8-wide doubles (8 x 64-bit) (a.k.a. “Xeon Phi”)

To take advantage of the SIMD capability of the processor, the programmer needs to design the data layout used by the application. The ideal layout is such, that data undergoing the same manipulations, are adjacent in memory. When using arrays of data, programmers don't need to worry about programming considerations other than the nesting level of array indexes for multi-dimensional arrays. This article will address programming issues relating to data other than arrays of single elements.

To date, most papers in the technical literature discuss two programming techniques for data layout:

AOS Array of structures

SOA Structure of arrays

This article presents a third alternative, which is a hybrid of AOS and SOA:

PAOS Packed array of structures

Or, alternately the programming technique can be called:

PSOA Partitioned structure of arrays

Programmers tend to prefer array of structures (AOS) during development. This is mostly due to how the program debugger works. With AOS, by providing a single pointer or reference to a structure in the debugger, the programmer can then explore the entire contents of the structure. This makes it very convenient to follow the flow of the program during a debug session.

When debugging sessions are mostly complete, the programmer can assess if structure of arrays (SOA) might yield better performance. We have a dilemma whether to perform the conversion from AOS to SOA, which is a non-trivial process that may break working code. AOS has better cache locality for intra-structure manipulations. SOA has better cache locality for inter-structure manipulations. Considering that major portions of any application manipulating data are neither AOS nor SOA, projecting the performance impact with conversion is difficult at best.

*Note, this article uses sample code in FORTRAN. It is easier to adapt C++ programs to PAOS due to C++’s templates. *

*Adaptation in FORTRAN (without templates) requires more copy and paste of existing code.*

The sample code presented in this article came from a bead model synthesis program used principally in the study of tethers attached to objects. The program can be used to model a “tether ball” system. This would look like a volleyball attached by a cord to a pole.

The program was developed to simulate tethered satellites, initially the SED experiment with the Space Shuttle and a smaller satellite connected by long (20 km) conducting tether. The conducting tether, moving through the magnetosphere, was intended to generate power. Later revisions of the simulation program were made for the study of space elevator designs. A space elevator is a tether attached to the earth near the equator and approximately 100,000 km long going radially outwards to a ballast mass. A space elevator is a self-supporting structure with an electrically powered elevator car/truck that can climb the tether to deliver orbital payloads.

A tether, as used by the simulation program, is a series of one-dimensional segments interconnected and terminated to an object (space shuttle, satellite, earth, etc.). You can view this as a series of springs. The connection point of adjacent segments is called a bead. A bead has position, velocity, mass, and other variables for properties (mostly force vectors: gravity, external, spring, etc.). The segments have: length (un-deformed), length (deformed), length rate, length acceleration, elastic diameter, aerodynamic diameter, resistivity, spring constant, spring damping constant, temperature, and several other properties. The bodies have many properties and may have things attached to them (tether, boom, thrusters, etc.). And the simulation can incorporate the JPL (Jet Propulsion Mission Data) ephemeris database to obtain initial planetary body information as well as satellites (moons, asteroids, comets) and space craft. The simulator is a comprehensive modeling system written in FORTRAN and has a development history of over 25 years.

## AOS format in FORTRAN (C++ programmers, think of this as a struct)

type TypeFiniteSolution SEQUENCE ! REAL(8) variables ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN real :: rDELTIS ! Integration step size real :: rTISTIM ! Integration time … ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH real :: rEL ! rate of deployment +out, -in through the X end real :: rELD ! rate of rate of deployment +out, -in through the X end real :: rELDD … ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME) real :: rELI(3) ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME) real :: rELDI(3) ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME) real :: rELDDI(3) … ! Position with respect to IP IF real, pointer :: rBeadIPIF(:,:) ! (3,NumberOfBeads+2) ! Velocity with respect to IP IF real, pointer :: rBeadDIPIF(:,:) ! (3,NumberOfBeads+2) ! Acceleration with respect to IP IF real, pointer :: rBeadDDIPIF(:,:) ! (3,NumberOfBeads+2) … ! SEGMENT LENGTH AND LENGTH RATE real, pointer :: rELBSG(:) ! (NumberOfBeads+1) real, pointer :: rELBSGD(:) ! (NumberOfBeads+1) … end type TypeFiniteSolution

With a memory layout of:

Where the green bar represents one cache line of 64 bytes (8 doubles), and the blue bar represents the adjacent cache line. Many of the current processors will fetch two adjacent cache lines at the same time. If we recompose this for SSE opportunities, we find:

Majority of operations are inter-bead

Majority of operations are inter-bead

Where the cells are outlined for SSE optimization opportunities. Only when two adjacent like operational cells are grouped in a box do you have an SSE vectorization opportunity. The large array of bead positions (rBeadIPIF) are presented above in two organizations. The first box grouping represents the preponderance of code for Bead and Inter-Bead calculations. The second box grouping represents only the last loop optimizations where the positions can be advanced (by dV * dT).

The above figures illustrate that AOS has:

No vectorization opportunities for scalars

Half vectorization opportunities for small dimensioned vectors (3) (3,3)

Half vectorization opportunities for preponderance of large vectors (3,nBeads)

Full vectorization opportunity for final integration of large vectors (3,nBeads)

The above, though, is a postulation; so let us examine some code samples.

## Simple loop – Euler Integration using AOS

Let’s take a look at the final phase of the integration process where we advance the bead position based on the bead velocity and integration step interval. This is known as Euler integration.

Note, the following code should be one of the least optimized opportunities for PAOS over AOS.

call EULINTxyz(DELTIS, NBEAD, rBeadDIPIF(1:3,1:NBEAD), rBeadIPIF(1:3,1:NBEAD)) ... subroutine EULINTxyz_r_n_r_r(dT, n, dX, X) implicit none real :: dT integer :: n real :: dX(1:3,1:n) real :: X(1:3,1:n) integer :: i,j do i=1,n do j=1,3 X(j,i) = X(j,i) + dX(j,i) * dT end do end do end subroutine EULINTxyz_r_n_r_r

*(The above code includes generic interface signature “_r_n_r_r”)*

**The AOS assembly code for Euler integration**

; parameter 1(DT): rcx ; parameter 2(N): rdx ; parameter 3(DX): r8 ; parameter 4(X): r9 ;;; recursive subroutine EULINTxyz_r_n_r_r(dT, n, dX, X) push rbp ; save stack frame pointer sub rsp, 48 ; allocate stack space lea rbp, QWORD PTR [32+rsp] ; set new stack frame pointer movsxd rax, DWORD PTR [rdx] ; rax = n mov QWORD PTR [rbp], rax ; temp0 = n (unnecessary) mov QWORD PTR [8+rbp], rax ; temp1 = n (unnecessary) ;;; implicit none ;;; real :: dT ;;; integer :: n ;;; real :: dX(1:3,1:n) ;;; real :: X(1:3,1:n) ;;; integer :: i,j ;;; ;;; do i=1,n movsxd r10, DWORD PTR [rdx] ; r10 = n (could have obtained from rax) xor edx, edx ; zero iteration count xor eax, eax ; zero linear index into both arrays test r10, r10 ; if(n <= 0) jle .B5.5 ; Prob 3% ; goto .B5.5 .B5.2:: ;;; do j=1,3 ;;; X(j,i) = X(j,i) + dX(j,i) * dT vmovsd xmm0, QWORD PTR [rcx] ; xmm0= dT (scalar) ALIGN 16 ; align following code to 16 byte address .B5.3:: : loop: vmulsd xmm1, xmm0, QWORD PTR [rax+r8] ; xmm1 = dT * dX(index) (scalar) inc rdx ; ++iteration count vmulsd xmm3, xmm0, QWORD PTR [8+rax+r8] ; xmm3 = dT * dX(index+1) (scalar) vmulsd xmm5, xmm0, QWORD PTR [16+rax+r8] ; xmm5 = dT * dX(index+2) (scalar) vaddsd xmm2, xmm1, QWORD PTR [rax+r9] ; xmm2 = dT * dX(index) +

X(index) (scalar) vaddsd xmm4, xmm3, QWORD PTR [8+rax+r9] ; xmm4 = dT * dX(index+1) + X(index+1)

(scalar) vaddsd xmm1, xmm5, QWORD PTR [16+rax+r9] ; xmm1 = dT * dX(index+2) + X(index+2)

(scalar) vmovsd QWORD PTR [rax+r9], xmm2 ; X(index) = dT * dX(index) + X(index)

(scalar) vmovsd QWORD PTR [8+rax+r9], xmm4 ; X(index+1) = dT * dX(index+1) +

X(index+1) (s) vmovsd QWORD PTR [16+rax+r9], xmm1 ; X(index+2) = dT * dX(index+2) +

X(index+2) (s) add rax, 24 ; index += 3 cmp rdx, r10 ; if(iteration count < n) jb .B5.3 ; Prob 82% ; goto loop .B5.5:: ;;; end do ;;; end do ;;; end subroutine EULINTxyz_r_n_r_r lea rsp, QWORD PTR [16+rbp] ; restore stack pointer pop rbp ; restore stack frame ret ; return

Comments on compiler generated code:

Loop is aligned (good)

The inner and outer loops were fused (good)

There were two unnecessary savings of n onto stack (not good)

The calculations are all scalar (non-optimal)

While the programmer can address these issues by rewriting the Euler integration function such as casting the (3,NBEAD) two dimensional array into a (3*NBEAD) one dimensional array (using TRANSFER), the programmer would have to rewrite all loops that traverse the bead (and segment) information for the tethers. This would require edits to hundreds of loops within the program. The object of this article is to recode such that the compiler does most of the work for you.

## SOA format

! REAL variables (set compiler option for default real == real(8) ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN real, pointer :: rDELTIS(:) ! (NumberOfTethers) real, pointer :: rTISTIM(:) ! (NumberOfTethers) … ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH real, pointer :: rEL(:) ! (NumberOfTethers) ! rate of deployment +out, -in through the X end real, pointer :: rELD(:) ! (NumberOfTethers) ! rate of rate of deployment +out, -in through the X end real, pointer :: rELDD(:) ! (NumberOfTethers) … ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME) real, pointer :: rELI(:,:) ! (3, NumberOfTethers) ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME) real, pointer :: rELDI(:,:) ! (3, NumberOfTethers) ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME) real, pointer :: rELDDI(:,:) ! (3, NumberOfTethers) … ! Position with respect to IP IF ! *** note change in index order and number of indexes real, pointer :: rBeadIPIF(:,:,:) ! (NumberOfBeads+2, 3, NumberOfTethers) ! Velocity with respect to IP IF real, pointer :: rBeadDIPIF(:,:,:) ! (NumberOfBeads+2, 3, NumberOfTethers) ! Acceleration with respect to IP IF real, pointer :: rBeadDDIPIF(:,:,:) ! (NumberOfBeads+2, 3, NumberOfTethers) … ! SEGMENT LENGTH AND LENGTH RATE real, pointer :: rELBSG(:,:) ! (NumberOfBeads+1, NumberOfTethers) real, pointer :: rELBSGD(:,:) ! (NumberOfBeads+1, NumberOfTethers)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (NumberOfTethers, NumberOfBeads+2, 3)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (NumberOfTethers, 3, NumberOfBeads+2)

Or:

real, pointer :: rBeadIPIF(:,:,:) ! (3, NumberOfTethers, NumberOfBeads+2)

The order of the indexing of the SOA is an important consideration and will vary depending on programming requirements. Of particular concern in this application is the performance impact on inter-bead calculations verses bead/intra-bead calculations. The SOA will favor one type of calculation but not both.

Debugging SOA will require inspection of individual array elements and synchronizing the array indexes. This is relatively difficult (cumbersome) compared to the AOS format. Note, the choice of “pointer” verses “allocatable” is my personal programming preference.

For an application such as the one presented in this article, the programmer would arrange the scalars and small vectors into AOS format. And only consider the SOA conversion of the large vectors of NumberOfBead sized allocations. The programmer still has the difficult decision as to the impact of index order, and would likely have to resort to making multiple conversions for a different index order, and run performance tests on each. SOA conversion of the application would be partially implemented and optimal for only a portion of the calculations.

## PAOS format

Packed structure of arrays (PAOS) has a different representation. I use three representations. One representation for SSE for 2-wide doubles and two for AVX/AVX2: a) four-wide doubles, and interestingly, b) 3-wide doubles.

! AVX four-wide double vector type TypeYMM SEQUENCE real(8) :: v(0:3) end type TypeYMM ! AVX three-wide double vector type TypeYMM02 SEQUENCE real(8) :: v(0:2) end type TypeYMM02 ! SSE two-wide double vector type TypeXMM SEQUENCE real(8) :: v(0:1) end type TypeXMM type TypeFiniteSolution SEQUENCE ! REAL variables (4-wide AVX) ! INTEGRATION INTERVAL & LAPSED TIME OF FINITE TETHER SOLN type(TypeYMM) :: rDELTIS ! Integration step size type(TypeYMM) :: rTISTIM ! Integration time … ! NOMINAL INSTANTANEOUS DEPLOYED (UNDEFORMED) TETHER LENGTH type(TypeYMM) :: rEL ! rate of deployment +out, -in through the X end type(TypeYMM) :: rELD ! rate of rate of deployment +out, -in through the X end type(TypeYMM) :: rELDD … ! POSITION VECTOR BETWEEN ATT PTS (INER FRAME) type(TypeYMM) :: rELI(3) ! DERIVATIVE OF POS VECTOR BETWEEN ATT PTS (INER FRAME) type(TypeYMM) :: rELDI(3) ! 2ND DERIV OF POS VECTOR BETWEEN ATT PTS (INER FRAME) type(TypeYMM) :: rELDDI(3) … ! Position with respect to IP IF type(TypeYMM), pointer :: rBeadIPIF(:,:) ! (3,NumberOfBeads+2) ! Velocity with respect to IP IF type(TypeYMM), pointer :: rBeadDIPIF(:,:) ! (3,NumberOfBeads+2) ! Acceleration with respect to IP IF type(TypeYMM), pointer :: rBeadDDIPIF(:,:) ! (3,NumberOfBeads+2) … ! SEGMENT LENGTH AND LENGTH RATE type(TypeYMM), pointer :: rELBSG(:) ! (NumberOfBeads+1) type(TypeYMM), pointer :: rELBSGD(:) ! (NumberOfBeads+1) … end type TypeFiniteSolution

The data organization for PAOS is identical to the AOS with the exception of changing the variable type “real” to “type(TypeYMM)”. The array indexing scheme remains the same. Debugging PAOS is almost as convenient as AOS. You do have to explode the SIMD vector to observe the 4-wide variables.

The in-memory layout for the 4-wide YMM use is:

The in-memory 2-wide XMM layout:

Note that all references are available for full SIMD vectorization when you perform aggregate work on all tethers. Portions of the code that perform individual access may suffer additional cache line loads, but the preponderance of the calculations are performed on the aggregate of tethers.

## Simple loop – Euler Integration using PAOS

call EULINTxyz(DELTIS, NBEAD, rBeadDDIPIF(1:3,1:NBEAD), rBeadDIPIF(1:3,1:NBEAD)) ... subroutine EULINTxyz_ymm_n_ymm_ymm(dT, n, dX, X) USE MOD_ALL implicit none type(TypeYMM) :: dT integer :: n type(TypeYMM) :: dX(1:3,1:n) type(TypeYMM) :: X(1:3,1:n) integer :: i,j do i=1,n do j=1,3 !DEC$ VECTOR ALWAYS X(j,i).v = X(j,i).v + dX(j,i).v * dT.v end do end do end subroutine EULINTxyz_ymm_n_ymm_ymm (Note the generic interface selected signature suffix “_ymm_n_ymm_ymm”)

**The PAOS assembly code**

; parameter 1(DT): rcx ; parameter 2(N): rdx ; parameter 3(DX): r8 ; parameter 4(X): r9 ;;; recursive subroutine EULINTxyz_ymm_n_ymm_ymm(dT, n, dX, X) push rbp ; save stack frame pointer sub rsp, 48 ; allocate stack space lea rbp, QWORD PTR [32+rsp] ; set new stack frame pointer movsxd rax, DWORD PTR [rdx] ; rax = n mov QWORD PTR [rbp], rax ; temp0 = n (unnecessary) mov QWORD PTR [8+rbp], rax ; temp1 = n (unnecessary) ;;; type(TypeYMM) :: dT ;;; integer :: n ;;; type(TypeYMM) :: dX(1:3,1:n) ;;; type(TypeYMM) :: X(1:3,1:n) ;;; integer :: i,j ;;; ;;; do i=1,n movsxd r10, DWORD PTR [rdx] ; r10 = n (could have obtained from rax) xor edx, edx ; zero loop iteration count xor eax, eax ; zero linear index into dX and X test r10, r10 ; if(n <= 0) jle .B6.5 ; Prob 3% ; goto .B6.6 .B6.2:: ;;; do j=1,3 ;;; !DEC$ VECTOR ALWAYS ;;; X(j,i).v = X(j,i).v + dX(j,i).v * dT.v vmovupd xmm1, XMMWORD PTR [rcx] ; xmm1 = dT.v(0:1) vmovupd xmm0, XMMWORD PTR [16+rcx] ; xmm0 = dT.v(2:3) .B6.3:: ; loop: vmulpd xmm2, xmm1, XMMWORD PTR [rax+r8] ; xmm2=dT(index).v(0:1) *

dX(index).v(0:1) vmulpd xmm4, xmm0, XMMWORD PTR [16+rax+r8] ; xmm4=dT(index).v(2:3) *

dX(index).v(2:3) vaddpd xmm3, xmm2, XMMWORD PTR [rax+r9] ; xmm3=dT(index).v(0:1)*dX(index).v(0:1)+

X(index).v(0:1) vaddpd xmm5, xmm4, XMMWORD PTR [16+rax+r9] ; xmm5=dT(index).v(2:3)*dX(index).v(2:3)+

X(index).v(2:3) vmulpd xmm2, xmm1, XMMWORD PTR [32+rax+r8] ; xmm2 = dT.v(0:1) * dX(index+1).v(0:1) ; X(index).v(0:1)= dT(index).v(0:1) * dX(index).v(0:1) + X(index).v(0:1) vmovupd XMMWORD PTR [rax+r9], xmm3 ; X(index).v(2:3)= dT(index).v(2:3) * dX(index).v(2:3) + X(index).v(2:3) vmovupd XMMWORD PTR [16+rax+r9], xmm5 vmulpd xmm4, xmm0, XMMWORD PTR [48+rax+r8] ; xmm4 = dT.v(2:3) * dX(index+1).v(2:3) ; xmm3 = dT.v(0:1) * dX(index+1).v(0:1) + X(index+1).v(0:1) vaddpd xmm3, xmm2, XMMWORD PTR [32+rax+r9] ; xmm5 = dT.v(2:3) * dX(index+1).v(2:3) + X(index+1).v(2:3) vaddpd xmm5, xmm4, XMMWORD PTR [48+rax+r9] vmulpd xmm2, xmm1, XMMWORD PTR [64+rax+r8] ; xmm2 = dT(index+2).v(0:1) * dX(index+2).

v(0:1) ; X(index+1).v(0:1)= dT(index+1).v(0:1) * dX(index+1).v(0:1) + X(index+1).v(0:1) vmovupd XMMWORD PTR [32+rax+r9], xmm3 ; X(index+1).v(2:3)= dT(index+1).v(2:3) * dX(index+1).v(2:3) + X(index+1).v(2:3) vmovupd XMMWORD PTR [48+rax+r9], xmm5 vmulpd xmm4, xmm0, XMMWORD PTR [80+rax+r8] ; xmm4 = dT(index+2).v(2:3) * dX(index+2).

v(2:3) ;xmm3=dT(index+2).v(0:1)*dX(index+2).v(0:1)+X(index+2).v(0:1) vaddpd xmm3, xmm2, XMMWORD PTR [64+rax+r9] ; xmm5 = dT(index+2).v(2:3) * dX(index+2).v(2:3) + X(index+2).v(2:3) vaddpd xmm5, xmm4, XMMWORD PTR [80+rax+r9] ; X(index+2).v(0:1)= dT(index+2).v(0:1) * dX(index+2).v(0:1) + X(index+2).v(0:1) vmovupd XMMWORD PTR [64+rax+r9], xmm3 ; X(index+2).v(2:3)= dT(index+2).v(2:3) * dX(index+2).v(2:3) + X(index+2).v(2:3) vmovupd XMMWORD PTR [80+rax+r9], xmm5 inc rdx ; ++loop count add rax, 96 ; index += 3 cmp rdx, r10 ; if(loop count < n) jb .B6.3 ; Prob 82% ; goto loop .B6.5:: ;;; end do ;;; end do ;;; end subroutine EULINTxyz_ymm_n_ymm_ymm

In examining the assembly code you find the use of v….pd with xmm register notation. These instructions represent 2-wide packed doubles. While AVX supports 4-wide packed doubles, the Sandy Bridge architecture performs fetches and stores as 2-wide. A 4-wide fetch or store can be issued; however, this makes the adjacent fetches or stores occur together. Should the data of these two 2-wide fetches or stores span a cache line, the pipeline will stall for the data. By using 2-wide fetches and stores, code optimizations can interleave the instructions such that computations can occur concurrently with fetch or store.

vmulpd xmm2, xmm1, XMMWORD PTR [rax+r8] ; xmm2=dT(index).v(0:1) * dX(index).v(0:1) vmulpd xmm4, xmm0, XMMWORD PTR [16+rax+r8] ; xmm4=dT(index).v(2:3) * dX(index).v(2:3) vaddpd xmm3, xmm2, XMMWORD PTR [rax+r9] ; xmm3=dT(index).v(0:1)*dX(index).v(0:1)+

X(index).v(0:1) vaddpd xmm5, xmm4, XMMWORD PTR [16+rax+r9] ; xmm5=dT(index).v(2:3)*dX(index).v(2:3)+

X(index).v(2:3) ------------ or ------------- vmulpd ymm2, ymm1, XMMWORD PTR [rax+r8] ; ymm2=dT(index).v(0:3) * dX(index).v(0:3) vaddpd ymm3, ymm2, XMMWORD PTR [rax+r9] ; ymm3=dT(index).v(0:3)*dX(index).v(0:3)+X(index)

.v(0:3)

On the Sandy Bridge architecture, the two execute at approximately the same time, except when the fetch is split by a cache line, meaning the .v(0:1) lies in one cache line and the .v(2:3) lies in a different cache line. In cache line split circumstances, the 2-wide instructions perform faster.

Migration to Ivy Bridge can eliminate this penalty. Code targeted for Ivy Bridge realizes an advantage because it can perform 4-wide fetch and store operations, together with 4-wide computation.

In many places in the code, the compiler will perform 4-wide fetch and store (see later example).

## Typical problem

The Euler advancement code examined earlier is the least likely example of PAOS over AOS. It also is atypical of loop traversal. Let us examine a typical loop. The following loop calculates the spring tension.

**The AOS version of the loop**

DO JSEG = 1,JSEGlast !*********************************************************** ! FORM SEGMENT VECTOR BETWEEN ADJACENT BEADS (IN INER FRAME) !*********************************************************** do i=1,3 ! Determine TUI at 1/2 integration step from now ! using current velocity and acceleration ! Position of X-End of segment TOSVX1(i) = rBeadIPIF(i,JSEG-1) & & + (rBeadDIPIF(i,JSEG-1) * (rDELTIS/2.0_8)) & & + (rBeadDDIPIF(i,JSEG-1) * (rDELTIS**2/4.0_8)) ! Position of Y-End of segment TOSVX2(i) = rBeadIPIF(i,JSEG).v & & + (rBeadDIPIF(i,JSEG) * (rDELTIS/2.0_8)) & & + (rBeadDDIPIF(i,JSEG) * (rDELTIS**2/4.0_8)) ! SUI X->Y SUI(i) = TOSVX2(i) - TOSVX1(i) end do ! CALC DIST BETWEEN BEADS ADJACENT TO THIS SEGMENT !------------------------------------------------- DUMEL = dsqrt(SUI(1)**2 + SUI(2)**2 + SUI(3)**2) rELBSG(JSEG) = DUMEL ! TALLY TOTAL SPATIAL ARC LENGTH ARCTOTlocal = ARCTOTlocal + DUMEL !--------------------------------------------- ! POPULATE THE JSEG ELEMENT OF THE BBTUI ARRAY !--------------------------------------------- ! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT if(DUMEL .ne. 0.0_8) then TUI = SUI / DUMEL else call RandomUnitVector(TUI) endif rBBTUI(1:3,JSEG) = TUI(1:3) ! recompute velocity of Bead 0 if(JSEG .eq. 1) then rBeadDIPIF(:,0) = rAVI + (TUI * rELD) endif ! ARC LENGTH WR/T UNDEFORMED TETHER !-------------------------------------------- ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) rDSDU(JSEG) = - DUMEL/ELBis !************************************************************** ! FORM DERIVATIVE OF SEGMENT VECTOR (INER FRAME) !************************************************************** ! compute equivelent to ! SUDI = BUDI(JB1:JB1+2) - BUDI(JB1-3:JB1-1) + ELSGX_x_UITDI_ELSGXD_x_UITI SUDI = rBeadDIPIF(:,JSEG) - rBeadDIPIF(:,JSEG-1) ! CALC DERIV OF DIST BETWEEN BEADS ADJACENT TO SEGMENT !----------------------------------------------------- DUMELD = DOT(TUI,SUDI) rELBSGD(JSEG) = DUMELD !-------------------------------------------------------- ! POPULATE THE SEGMENT DATA ARRAY'S FOR THIS JSEG ELEMENT ! (FOR AVERAGING AND NUMERICAL DIFF IN DEPLOY FLOW CALCS) !-------------------------------------------------------- ! BBTUDI(1:3,JSEG) = (SUDI / DUMEL) + (SUI * -DUMELD/DUMEL**2) if(DUMEL .ne. 0.0_8) then rBBTUDI(1:3,JSEG) =((SUDI) + (SUI * -DUMELD/DUMEL)) / DUMEL else rBBTUDI(1:3,JSEG) = 0.0_8 endif ! FINALLY, ARC LENGTH DERIV WR/T UNDEFORMED TETHER, AND ITS TIME DERIV !--------------------------------------------------------------------- ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) rDSDUD(JSEG) = - ( DUMELD - DUMEL*ELDoverEL )/ELBis end do

*(Note, the pointers were removed to clarify reading the code)*

*The PAOS version of the loop*

DO JSEG = 1,JSEGlast !*********************************************************** ! FORM SEGMENT VECTOR BETWEEN ADJACENT BEADS (IN INER FRAME) !*********************************************************** do i=1,3 ! Determine TUI at 1/2 integration step from now ! using current velocity and acceleration ! Position of X-End of segment TOSVX1(i).v = rBeadIPIF(i,JSEG-1).v & & + (rBeadDIPIF(i,JSEG-1).v * (rDELTIS.v/2.0_8)) & & + (rBeadDDIPIF(i,JSEG-1).v * (rDELTIS.v**2/4.0_8)) ! Position of Y-End of segment TOSVX2(i).v = rBeadIPIF(i,JSEG).v & & + (rBeadDIPIF(i,JSEG).v * (rDELTIS.v/2.0_8)) & & + (rBeadDDIPIF(i,JSEG).v * (rDELTIS.v**2/4.0_8)) ! SUI X->Y SUI(i).v = TOSVX2(i).v - TOSVX1(i).v end do ! CALC DIST BETWEEN BEADS ADJACENT TO THIS SEGMENT !------------------------------------------------- DUMEL.v = dsqrt(SUI(1).v**2 + SUI(2).v**2 + SUI(3).v**2) rELBSG(JSEG) = DUMEL ! TALLY TOTAL SPATIAL ARC LENGTH ARCTOTlocal.v = ARCTOTlocal.v + DUMEL.v !--------------------------------------------- ! POPULATE THE JSEG ELEMENT OF THE BBTUI ARRAY !--------------------------------------------- ! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT ! *** code change for SIMD horizontal operation *** do s=0,sLast if(DUMEL.v(s) .ne. 0.0_8) then TUI.v(s) = SUI.v(s) / DUMEL.v(s) else call RandomUnitVector(TUI.v(s)) endif end do rBBTUI(1:3,JSEG) = TUI(1:3) ! recompute velocity of Bead 0 if(JSEG .eq. 1) then do i=1,3 rBeadDIPIF(i,0).v = rAVI(i).v + (TUI(i).v * rELD.v) end do endif ! ARC LENGTH WR/T UNDEFORMED TETHER !-------------------------------------------- ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) rDSDU(JSEG).v = - DUMEL.v/ELBis.v !************************************************************** ! FORM DERIVATIVE OF SEGMENT VECTOR (INER FRAME) !************************************************************** do i=1,3 SUDI(i).v = rBeadDIPIF(i,JSEG).v - rBeadDIPIF(i,JSEG-1).v end do ! CALC DERIV OF DIST BETWEEN BEADS ADJACENT TO SEGMENT !----------------------------------------------------- DUMELD = DOT(TUI,SUDI) rELBSGD(JSEG) = DUMELD !-------------------------------------------------------- ! POPULATE THE SEGMENT DATA ARRAY'S FOR THIS JSEG ELEMENT ! (FOR AVERAGING AND NUMERICAL DIFF IN DEPLOY FLOW CALCS) !-------------------------------------------------------- if((DUMEL.v(0) .ne. 0.0_8) .and. (DUMEL.v(1) .ne. 0.0_8) & & .and. (DUMEL.v(2) .ne. 0.0_8) .and. (DUMEL.v(3) .ne. 0.0_8)) then do i=1,3 rBBTUDI(i,JSEG).v =((SUDI(i).v) + (SUI(i).v * -DUMELD.v / DUMEL.v)) / DUMEL.v end do else do s=0,sLast ! *** code change for SIMD horizontal operation *** if(DUMEL.v(s) .ne. 0.0_8) then do i=1,3 rBBTUDI(i,JSEG).v(s) = ((SUDI(i).v(s)) & & + (SUI(i).v(s) * -DUMELD.v(s) / DUMEL.v(s))) / DUMEL.v(s) end do else do i=1,3 rBBTUDI(i,JSEG).v(s) = 0.0_8 end do endif end do ! *** code change for SIMD horizontal operation *** endif ! FINALLY, ARC LENGTH DERIV WR/T UNDEFORMED TETHER, AND ITS TIME DERIV !--------------------------------------------------------------------- ! (NOTE, NEGATIVE SIGN REFLECTS MU DEFINITION FROM FAR END) rDSDUD(JSEG).v = - ( DUMELD.v - DUMEL.v * ELDoverEL.v ) / ELBis.v end do

Now let’s decompose the first small loop within the main loop. This computes a vector between two points (beads) as it may reside halfway through the integration interval. This vector will be used later to determine the direction of two spring force vectors (one for each bead on the two segment ends).

**AOS**

do i=1,3 ! Determine TUI at 1/2 integration step from now ! using current velocity and acceleration ! Position of X-End of segment TOSVX1(i) = rBeadIPIF(i,JSEG-1) & & + (rBeadDIPIF(i,JSEG-1)*(rDELTIS/2.0_8)) & & + (rBeadDDIPIF(i,JSEG-1)*(rDELTIS**2/4.0_8)) ! Position of Y-End of segment TOSVX2(i) = rBeadIPIF(i,JSEG) & & + (rBeadDIPIF(i,JSEG)*(rDELTIS/2.0_8)) & & + (rBeadDDIPIF(i,JSEG)*(rDELTIS**2/4.0_8)) ! SUI X->Y SUI(i) = TOSVX2(i) - TOSVX1(i) end do

**PAOS**

do i=1,3 ! Determine TUI at 1/2 integration step from now ! using current velocity and acceleration ! Position of X-End of segment TOSVX1(i).v = rBeadIPIF(i,JSEG-1).v & & + (rBeadDIPIF(i,JSEG-1).v * (rDELTIS.v/2.0_8)) & & + (rBeadDDIPIF(i,JSEG-1).v * (rDELTIS.v**2/4.0_8)) ! Position of Y-End of segment TOSVX2(i).v = rBeadIPIF(i,JSEG).v & & + (rBeadDIPIF(i,JSEG).v * (rDELTIS.v/2.0_8)) & & + (rBeadDDIPIF(i,JSEG).v * (rDELTIS.v**2/4.0_8)) ! SUI X->Y SUI(i).v = TOSVX2(i).v - TOSVX1(i).v end do

Notice the addition of the suffix “.v” to the statements containing arithmetic operators. This inconvenience can be eliminated if you elect to write overloaded operator functions for these data types. I chose not to for debugging reasons (wanting the debug build to produce the code inline).

**AOS assembler code for the small loop of three statements**

mov r12, QWORD PTR [80+rbp] ;178.25 mov r11, QWORD PTR [176+rbp] ;179.18 mov rdi, QWORD PTR [152+rbp] ;179.18 mov r8, QWORD PTR [272+rbp] ;180.18 mov rax, QWORD PTR [56+rbp] ;178.25 mov rdx, QWORD PTR [248+rbp] ;180.18 mov QWORD PTR [896+rbp], r12 ;178.25 mov QWORD PTR [928+rbp], r11 ;179.18 mov QWORD PTR [912+rbp], rdi ;179.18 mov QWORD PTR [936+rbp], r8 ;180.18 mov r9, QWORD PTR [rbp] ;178.13 mov r15, QWORD PTR [64+rbp] ;178.25 mov rcx, QWORD PTR [88+rbp] ;178.25 mov r12, QWORD PTR [184+rbp] ;179.18 mov r11, QWORD PTR [160+rbp] ;179.18 mov r14, QWORD PTR [96+rbp] ;178.25 mov rdi, QWORD PTR [192+rbp] ;179.15 mov r8, QWORD PTR [256+rbp] ;180.18 mov r10, QWORD PTR [280+rbp] ;180.18 mov QWORD PTR [920+rbp], rax ;178.25 mov QWORD PTR [904+rbp], rdx ;180.18 cmp rax, 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.9:: ; Preds .B1.8 cmp QWORD PTR [912+rbp], 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.10:: ; Preds .B1.9 cmp QWORD PTR [904+rbp], 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.11:: ; Preds .B1.10 mov QWORD PTR [960+rbp], rbx ; mov rbx, QWORD PTR [936+rbp] ;182.13 imul r10, rbx ;182.13 mov QWORD PTR [968+rbp], r10 ;182.13 mov r10, rsi ;182.25 mov rax, QWORD PTR [896+rbp] ;182.13 imul r10, rax ;182.25 imul rcx, rax ;182.13 add r10, r9 ;182.13 shl r15, 3 ;182.13 sub r10, rcx ;182.13 mov QWORD PTR [976+rbp], r15 ;182.13 sub r10, r15 ;182.13 mov r15, rsi ;183.18 mov rdx, QWORD PTR [928+rbp] ;182.13 imul r15, rdx ;183.18 imul r12, rdx ;182.13 add r15, r14 ;182.13 shl r11, 3 ;182.13 sub r15, r12 ;182.13 mov QWORD PTR [984+rbp], r11 ;182.13 sub r15, r11 ;182.13 mov r11, rsi ;184.18 imul r11, rbx ;184.18 vmulsd xmm0, xmm10, QWORD PTR [8+r15] ;183.45 mov QWORD PTR [944+rbp], rsi ; lea rsi, QWORD PTR [-1+rsi] ;178.25 imul rdx, rsi ;179.18 imul rax, rsi ;178.25 imul rbx, rsi ;180.18 vaddsd xmm0, xmm0, QWORD PTR [8+r10] ;183.15 add r11, rdi ;182.13 add r14, rdx ;178.13 mov QWORD PTR [952+rbp], rdi ; sub r14, r12 ;178.13 mov rdi, QWORD PTR [968+rbp] ;182.13 sub r11, rdi ;182.13 shl r8, 3 ;182.13 add r9, rax ;178.13 sub r11, r8 ;182.13 sub r9, rcx ;178.13 sub r14, QWORD PTR [984+rbp] ;178.13 mov rcx, QWORD PTR [952+rbp] ;178.13 add rcx, rbx ;178.13 vmulsd xmm5, xmm8, QWORD PTR [8+r11] ;184.49 sub rcx, rdi ;178.13 sub rcx, r8 ;178.13 vaddsd xmm14, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [8+r14] ;179.47 vmulsd xmm5, xmm8, QWORD PTR [8+rcx] ;180.51 vmulsd xmm1, xmm8, QWORD PTR [24+rcx] ;180.51 sub r9, QWORD PTR [976+rbp] ;178.13 vmovsd QWORD PTR [704+rbp], xmm14 ;182.13 mov QWORD PTR [888+rbp], rsi ;178.25 mov rbx, QWORD PTR [960+rbp] ;186.11 mov rsi, QWORD PTR [944+rbp] ;186.11 vaddsd xmm0, xmm0, QWORD PTR [8+r9] ;179.15 vaddsd xmm13, xmm0, xmm5 ;178.13 vmulsd xmm0, xmm10, QWORD PTR [16+r15] ;183.45 vmulsd xmm5, xmm8, QWORD PTR [16+r11] ;184.49 vaddsd xmm0, xmm0, QWORD PTR [16+r10] ;183.15 vmovsd QWORD PTR [736+rbp], xmm13 ;178.13 vaddsd xmm12, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [16+r14] ;179.47 vmulsd xmm5, xmm8, QWORD PTR [16+rcx] ;180.51 vaddsd xmm0, xmm0, QWORD PTR [16+r9] ;179.15 vmovsd QWORD PTR [712+rbp], xmm12 ;182.13 vaddsd xmm11, xmm0, xmm5 ;178.13 vmulsd xmm0, xmm10, QWORD PTR [24+r15] ;183.45 vmulsd xmm5, xmm8, QWORD PTR [24+r11] ;184.49 vaddsd xmm0, xmm0, QWORD PTR [24+r10] ;183.15 vmovsd QWORD PTR [744+rbp], xmm11 ;178.13 vaddsd xmm5, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [24+r14] ;179.47 vunpcklpd xmm12, xmm14, xmm12 ;186.20 vaddsd xmm0, xmm0, QWORD PTR [24+r9] ;179.15 vunpcklpd xmm11, xmm13, xmm11 ;186.32 vaddsd xmm0, xmm0, xmm1 ;178.13 vsubpd xmm11, xmm12, xmm11 ;186.11 vmovupd XMMWORD PTR [1024+rbp], xmm11 ;186.11 vsubsd xmm11, xmm5, xmm0 ;186.11 vmovsd QWORD PTR [720+rbp], xmm5 ;182.13 vmovsd QWORD PTR [752+rbp], xmm0 ;178.13 vmovsd QWORD PTR [1040+rbp], xmm11 ;186.11 jmp .B1.13 ; Prob 100% ;186.11 .B1.12:: ; Preds .B1.8 .B1.9 .B1.10 mov rax, rsi ;182.25 mov rdx, QWORD PTR [896+rbp] ;182.13 imul rax, rdx ;182.25 imul rcx, rdx ;182.13 add rax, r9 ;182.13 mov QWORD PTR [1000+rbp], rcx ;182.13 sub rax, rcx ;182.13 mov rcx, QWORD PTR [920+rbp] ;182.13 imul r15, rcx ;182.13 mov QWORD PTR [960+rbp], rbx ; mov rbx, rsi ;183.18 mov QWORD PTR [976+rbp], r15 ;182.13 sub rcx, r15 ;182.13 mov r15, QWORD PTR [928+rbp] ;182.13 imul rbx, r15 ;183.18 imul r12, r15 ;182.13 add rbx, r14 ;182.13 mov QWORD PTR [1008+rbp], r12 ;182.13 sub rbx, r12 ;182.13 mov r12, QWORD PTR [912+rbp] ;182.13 imul r11, r12 ;182.13 mov QWORD PTR [984+rbp], r11 ;182.13 sub r12, r11 ;182.13 mov r11, rsi ;184.18 mov QWORD PTR [992+rbp], r14 ; mov r14, QWORD PTR [936+rbp] ;182.13 imul r11, r14 ;184.18 imul r10, r14 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [r12+rbx] ;183.45 add r11, rdi ;182.13 vaddsd xmm0, xmm0, QWORD PTR [rcx+rax] ;183.15 mov QWORD PTR [968+rbp], r10 ;182.13 sub r11, r10 ;182.13 mov r10, QWORD PTR [904+rbp] ;182.13 imul r8, r10 ;182.13 mov QWORD PTR [944+rbp], rsi ; lea rsi, QWORD PTR [-1+rsi] ;178.25 imul rdx, rsi ;178.25 imul r15, rsi ;179.18 imul r14, rsi ;180.18 sub r10, r8 ;182.13 add r9, rdx ;178.13 mov rdx, QWORD PTR [992+rbp] ;178.13 add rdi, r14 ;178.13 add rdx, r15 ;178.13 sub rdx, QWORD PTR [1008+rbp] ;178.13 sub r9, QWORD PTR [1000+rbp] ;178.13 sub rdi, QWORD PTR [968+rbp] ;178.13 mov QWORD PTR [888+rbp], rsi ;178.25 mov rsi, QWORD PTR [912+rbp] ;183.18 vmulsd xmm5, xmm8, QWORD PTR [r10+r11] ;184.49 mov r15, QWORD PTR [976+rbp] ;182.13 lea r14, QWORD PTR [rsi+rsi] ;183.18 sub r14, QWORD PTR [984+rbp] ;182.13 vaddsd xmm14, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [rdx+r12] ;179.47 vmulsd xmm5, xmm8, QWORD PTR [rdi+r10] ;180.51 vaddsd xmm0, xmm0, QWORD PTR [r9+rcx] ;179.15 mov rcx, QWORD PTR [920+rbp] ;182.25 vaddsd xmm13, xmm0, xmm5 ;178.13 vmulsd xmm0, xmm10, QWORD PTR [r14+rbx] ;183.45 mov r10, QWORD PTR [904+rbp] ;184.18 lea r12, QWORD PTR [rcx+rcx] ;182.25 sub r12, r15 ;182.13 lea rcx, QWORD PTR [rcx+rcx*2] ;182.25 sub rcx, r15 ;182.13 lea r15, QWORD PTR [rsi+rsi*2] ;183.18 mov rsi, QWORD PTR [904+rbp] ;184.18 add r10, r10 ;184.18 sub r10, r8 ;182.13 sub r15, QWORD PTR [984+rbp] ;182.13 vmovsd QWORD PTR [704+rbp], xmm14 ;182.13 vmovsd QWORD PTR [736+rbp], xmm13 ;178.13 vmulsd xmm5, xmm8, QWORD PTR [r10+r11] ;184.49 vaddsd xmm0, xmm0, QWORD PTR [r12+rax] ;183.15 vaddsd xmm12, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [r14+rdx] ;179.47 vmulsd xmm5, xmm8, QWORD PTR [r10+rdi] ;180.51 vaddsd xmm0, xmm0, QWORD PTR [r12+r9] ;179.15 vmovsd QWORD PTR [712+rbp], xmm12 ;182.13 lea r12, QWORD PTR [rsi+rsi*2] ;184.18 sub r12, r8 ;182.13 vaddsd xmm11, xmm0, xmm5 ;178.13 vmulsd xmm0, xmm10, QWORD PTR [r15+rbx] ;183.45 vmulsd xmm5, xmm8, QWORD PTR [r12+r11] ;184.49 vmulsd xmm1, xmm8, QWORD PTR [r12+rdi] ;180.51 vaddsd xmm0, xmm0, QWORD PTR [rcx+rax] ;183.15 vmovsd QWORD PTR [744+rbp], xmm11 ;178.13 vaddsd xmm5, xmm0, xmm5 ;182.13 vmulsd xmm0, xmm10, QWORD PTR [r15+rdx] ;179.47 vunpcklpd xmm12, xmm14, xmm12 ;186.20 vaddsd xmm0, xmm0, QWORD PTR [rcx+r9] ;179.15 vunpcklpd xmm11, xmm13, xmm11 ;186.32 vaddsd xmm0, xmm0, xmm1 ;178.13 vsubpd xmm11, xmm12, xmm11 ;186.11 vmovupd XMMWORD PTR [1024+rbp], xmm11 ;186.11 vsubsd xmm11, xmm5, xmm0 ;186.11 vmovsd QWORD PTR [720+rbp], xmm5 ;182.13 vmovsd QWORD PTR [752+rbp], xmm0 ;178.13 vmovsd QWORD PTR [1040+rbp], xmm11 ;186.11 mov rbx, QWORD PTR [960+rbp] ;186.11 mov rsi, QWORD PTR [944+rbp] ;186.11 .B1.13:: ; Preds .B1.12 .B1.11

The assembler code generated for three statements is 214 instructions. This represents typical statements found within loops traversing tethers. A significant portion of the generated code computes the indexes to the components. This is principally due to the programmer’s use of pointers, which can have non-stride 1 references. Of the following:

(known stride-1 to the compiler) TOSVX1(3), TOSVX2(3), SUI(3) (unknown stride to the compiler) rBeadIPIF(3,0:NBEAD+1), rBeadDIPIF(3,0:NBEAD+1), rBeadDDIPIF(3,0:NBEAD+1)

These indexes are computed for each bead in all tethers.

Further, all the double precision floating point computations performed in AOS are with scalars.

Also note, the code generated contains data stride tests:

cmp rax, 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.9:: ; Preds .B1.8 cmp QWORD PTR [912+rbp], 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.10:: ; Preds .B1.9 cmp QWORD PTR [904+rbp], 8 ;180.18 jne .B1.12 ; Prob 50% ;180.18 .B1.11:: ; Preds .B1.10

The stride test will take one of two different code paths depending on the result of the test.

**PAOS assembler code for the small loop of three statements**

mov r14, QWORD PTR [11184+rdi] ;172.15 mov r15, QWORD PTR [11176+rdi] ;172.15 imul r14, r15 ; vmovupd ymm1, YMMWORD PTR [160+r13] ;204.10 vmovupd xmm2, XMMWORD PTR [32+rdi] ;178.94 vmovupd ymm0, YMMWORD PTR [_2il0floatpacket.40] ; vmovupd YMMWORD PTR [576+r13], ymm1 ;204.10 vmovupd ymm1, YMMWORD PTR [_2il0floatpacket.39] ; mov QWORD PTR [1432+rbp], r15 ;172.15 sub r14, r15 ; mov r15, rsi ;176.15 mov r8, QWORD PTR [11200+rdi] ;172.15 imul r15, r8 ;176.15 mov r12, QWORD PTR [11208+rdi] ;172.15 imul r12, r8 ; mov rax, QWORD PTR [11240+rdi] ;173.19 mov QWORD PTR [1528+rbp], rax ;173.19 mov rax, QWORD PTR [11328+rdi] ;173.19 mov QWORD PTR [1360+rbp], rax ;173.19 mov rax, QWORD PTR [11320+rdi] ;173.19 mov QWORD PTR [1408+rbp], rax ;173.19 mov rax, QWORD PTR [11304+rdi] ;173.19 mov r9, QWORD PTR [11120+rdi] ;172.15 add r15, r9 ; mov QWORD PTR [1352+rbp], rax ;173.19 sub r15, r12 ; mov rax, QWORD PTR [11296+rdi] ;173.19 sub r15, r14 ; mov QWORD PTR [1400+rbp], r12 ; mov r12, rsi ;176.15 mov r10, QWORD PTR [11080+rdi] ;172.15 mov QWORD PTR [1440+rbp], rax ;173.19 xor eax, eax ;173.19 imul r12, r10 ;176.15 mov rdx, QWORD PTR [11088+rdi] ;172.15 mov QWORD PTR [1416+rbp], r10 ;172.15 mov QWORD PTR [1392+rbp], r14 ; imul rdx, r10 ; mov r14, QWORD PTR [1352+rbp] ; mov r10, QWORD PTR [1440+rbp] ; imul r14, r10 ; mov rbx, QWORD PTR [11064+rdi] ;172.15 sub r14, r10 ; mov r11, QWORD PTR [11056+rdi] ;172.15 mov r10, rsi ;177.19 imul rbx, r11 ; mov rcx, QWORD PTR [11000+rdi] ;172.15 add r12, rcx ; sub rbx, r11 ; sub r12, rdx ; mov QWORD PTR [1448+rbp], r11 ;172.15 sub r12, rbx ; mov r11, QWORD PTR [1408+rbp] ; imul r10, r11 ;177.19 mov QWORD PTR [1496+rbp], r12 ; mov r12, QWORD PTR [1360+rbp] ; imul r12, r11 ; mov QWORD PTR [1504+rbp], r15 ; mov r15, QWORD PTR [1528+rbp] ; add r10, r15 ; sub r10, r12 ; sub r10, r14 ; mov QWORD PTR [1520+rbp], rax ;173.19 mov QWORD PTR [1488+rbp], r10 ; mov QWORD PTR [1480+rbp], rax ; mov QWORD PTR [1472+rbp], rax ; mov QWORD PTR [1464+rbp], rax ; mov QWORD PTR [1456+rbp], rax ; mov QWORD PTR [1344+rbp], rcx ; mov QWORD PTR [1352+rbp], r14 ; mov QWORD PTR [1360+rbp], r12 ; mov QWORD PTR [1368+rbp], rbx ; mov QWORD PTR [1376+rbp], rdx ; mov QWORD PTR [1384+rbp], r8 ; mov QWORD PTR [1336+rbp], rsi ; mov QWORD PTR [1320+rbp], rdi ; mov rax, QWORD PTR [1456+rbp] ; mov rdx, QWORD PTR [1464+rbp] ; mov rcx, QWORD PTR [1472+rbp] ; mov rbx, QWORD PTR [1480+rbp] ; mov rsi, QWORD PTR [1488+rbp] ; mov r8, QWORD PTR [1504+rbp] ; mov r10, QWORD PTR [1520+rbp] ; mov r11, QWORD PTR [1440+rbp] ; mov r12, QWORD PTR [1432+rbp] ; mov r14, QWORD PTR [1448+rbp] ; vinsertf128 ymm2, ymm2, XMMWORD PTR [48+rdi], 1 ;178.94 mov rdi, QWORD PTR [1496+rbp] ; ALIGN 16 .B1.6:: ; Preds .B1.6 .B1.5 vmovupd xmm10, XMMWORD PTR [rdx+rsi] ;178.63 vmovupd xmm3, XMMWORD PTR [rbx+r8] ;177.62 vmovupd xmm7, XMMWORD PTR [rcx+rdi] ;177.19 inc r10 ;173.19 vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rdx+rsi], 1 ;178.63 add rdx, r11 ;173.19 vmulpd ymm12, ymm0, ymm11 ;178.63 vmulpd ymm13, ymm12, ymm2 ;178.94 vmulpd ymm15, ymm13, ymm2 ;178.94 vinsertf128 ymm4, ymm3, XMMWORD PTR [16+rbx+r8], 1 ;177.62 add rbx, r12 ;173.19 vmulpd ymm5, ymm1, ymm4 ;177.62 vmulpd ymm9, ymm5, ymm2 ;177.64 vinsertf128 ymm8, ymm7, XMMWORD PTR [16+rcx+rdi], 1 ;177.19 add rcx, r14 ;173.19 vaddpd ymm14, ymm8, ymm9 ;177.19 vaddpd ymm3, ymm14, ymm15 ;176.15 vmovupd YMMWORD PTR [256+r13+rax], ymm3 ;176.15 add rax, 32 ;173.19 cmp r10, 3 ;173.19 jb .B1.6 ; Prob 66% ;173.19 .B1.7:: ; Preds .B1.6 mov rsi, QWORD PTR [1336+rbp] ; lea r11, QWORD PTR [-1+rsi] ;172.15 mov rax, QWORD PTR [1416+rbp] ;172.15 xor r10d, r10d ;173.19 imul rax, r11 ;172.15 vmovupd ymm0, YMMWORD PTR [_2il0floatpacket.40] ; vmovupd ymm1, YMMWORD PTR [_2il0floatpacket.39] ; mov rcx, QWORD PTR [1344+rbp] ; add rcx, rax ; mov rdx, QWORD PTR [1376+rbp] ; sub rcx, rdx ; mov r8, QWORD PTR [1384+rbp] ; mov rdx, QWORD PTR [1408+rbp] ;173.19 imul rdx, r11 ;173.19 imul r8, r11 ;172.15 add r15, rdx ; add r9, r8 ; xor r8d, r8d ; mov r12, QWORD PTR [1360+rbp] ; sub r15, r12 ; mov rbx, QWORD PTR [1368+rbp] ; sub rcx, rbx ; sub r9, QWORD PTR [1400+rbp] ; xor ebx, ebx ; mov QWORD PTR [1512+rbp], r10 ;173.19 xor edx, edx ; mov r14, QWORD PTR [1352+rbp] ; sub r15, r14 ; mov QWORD PTR [1424+rbp], r11 ;172.15 mov rdi, QWORD PTR [1320+rbp] ; sub r9, QWORD PTR [1392+rbp] ; mov r11, QWORD PTR [1512+rbp] ; mov r12, QWORD PTR [1440+rbp] ; mov r14, QWORD PTR [1432+rbp] ; mov rax, QWORD PTR [1448+rbp] ; ALIGN 16 .B1.8:: ; Preds .B1.8 .B1.7 vmovupd xmm10, XMMWORD PTR [rbx+r15] ;174.65 vmovupd xmm3, XMMWORD PTR [r10+r9] ;173.64 vmovupd xmm7, XMMWORD PTR [r8+rcx] ;173.19 inc r11 ;173.19 vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rbx+r15], 1 ;174.65 add rbx, r12 ;173.19 vmulpd ymm12, ymm0, ymm11 ;174.65 vmulpd ymm13, ymm12, ymm2 ;174.96 vmulpd ymm15, ymm13, ymm2 ;174.96 vinsertf128 ymm4, ymm3, XMMWORD PTR [16+r10+r9], 1 ;173.64 add r10, r14 ;173.19 vmulpd ymm5, ymm1, ymm4 ;173.64 vmulpd ymm9, ymm5, ymm2 ;173.66 vinsertf128 ymm8, ymm7, XMMWORD PTR [16+r8+rcx], 1 ;173.19 add r8, rax ;173.19 vaddpd ymm14, ymm8, ymm9 ;173.19 vaddpd ymm3, ymm14, ymm15 ;172.15 vmovupd YMMWORD PTR [352+r13+rdx], ymm3 ;172.15 add rdx, 32 ;173.19 cmp r11, 3 ;173.19 jb .B1.8 ; Prob 66% ;173.19 .B1.9:: ; Preds .B1.8 xor ecx, ecx ;173.19 xor edx, edx ; .B1.10:: ; Preds .B1.10 .B1.9 vmovupd ymm1, YMMWORD PTR [256+r13+rdx] ;192.11 inc rcx ;173.19 vsubpd ymm2, ymm1, YMMWORD PTR [352+r13+rdx] ;192.11 vmovupd YMMWORD PTR [608+r13+rdx], ymm2 ;192.11 add rdx, 32 ;173.19 cmp rcx, 3 ;173.19 jb .B1.10 ; Prob 66% ;173.19

**PAOS generates 164 instructions (plus an undetermined number of NOPs for the ALIGN 16 pads). **

The PAOS code, formerly one loop of 3 iterations in source, now has three loops each of 3 iterations-one each to compute TOSVX1, TOSVX2, and the SUI result. These loops are relatively short, and the major index calculations are performed only once prior to loop entry. The execution path extends by 98 instructions or 262 instructions total for four SUI results or 65.5 instructions per SUI result. Comparing this with the AOS method that generated one SUI result in 214 instructions, we have an approximate 3.26x speedup for this section of code. Memory and cache latencies will affect this rough estimate.

Vectorization of double precision floating point is found, most are now 4-wide:

vmovupd ymm1, YMMWORD PTR [160+r13] ;204.10 vmovupd xmm2, XMMWORD PTR [32+rdi] ;178.94 ... vmulpd ymm12, ymm0, ymm11 ;178.63 vmulpd ymm13, ymm12, ymm2 ;178.94 vmulpd ymm15, ymm13, ymm2 ;178.94

The reason the compiler chose 4-wide loads and use in the above cases is that the potential penalty for split cache line loads was less than the performance gain through multiple re-use of the variable fetched.

Some of the fetches are still performed in two steps,

vmovupd xmm10, XMMWORD PTR [rbx+r15] ;174.65 ... vinsertf128 ymm11, ymm10, XMMWORD PTR [16+rbx+r15], 1 ;174.65

which removes the potential split cache line load performance hit.

Also note that the compiler can (does) force alignment of the local temps: TOSVX1(3), TOSVX2(3), and SUI(3). Therefore, it can know there will be no cache line split issue with 4-wide access to these variables.

**PAOS has the following advantages:**

Index calculations per bead are similar but for only a quarter of the number of tethers

Many fetches and stores are 4-wide, a few are 2-wide

Double precision adds, subtracts, and multiplies are 4-wide

Shorter code means better utilization of L1 instruction cache

**Notes for C++ programmers (not familiar with FORTRAN)**

In the code shown, the arrays were using pointers to array format:

AOS

real, pointer :: rBeadIPIF(:,:) real, pointer :: rBeadDIPIF(:,:) real, pointer :: rBeadDDIPIF(:,:) ... rBeadIPIF => pFiniteSolution.rBeadIPIF rBeadDIPIF => pFiniteSolution.rBeadDIPIF rBeadDDIPIF => pFiniteSolution.rBeadDDIPIF

POS

type(TypeYMM), pointer :: rBeadIPIF(:,:) type(TypeYMM), pointer :: rBeadDIPIF(:,:) type(TypeYMM), pointer :: rBeadDDIPIF(:,:) ... rBeadIPIF => pFiniteSolutionSIMD.rBeadIPIF rBeadDIPIF => pFiniteSolutionSIMD.rBeadDIPIF rBeadDDIPIF => pFiniteSolutionSIMD.rBeadDDIPIF

In both variants of the code, these pointers point to the entire array in the pFiniteSolution and pFiniteSolutionSIMD structures respectively. These happen to be pointers to arrays as well. In FORTRAN, pointers can point to portions of an array **that need not be contiguous**.

real, pointer :: f(:,:) type(TypeYMM), pointer :: t(:,:) real, pointer:: slice(:,:) ... s = 2 ! 3rd cell of 4-wide v(0:3) slice(LBOUND(f, DIM=1):UBOUND(f, DIM=1), LBOUND(f, DIM=2):UBOUND(f, DIM=2)) => & & t(LBOUND(f, DIM=1), LBOUND(f, DIM=2))%v(s::4) f => slice

In the preceding statements, the pointer f would have a stride of 4 (doubles) assuming the array pointer in the pFiniteSolutionSIMD structure had a stride of 1 (it need not have). And start at v(s), s=2. This presents in f, a two dimensional array that looks like (3,0:NBEAD+1), and is indexed so. However, indexing walks down the column s of the v array in the two dimensional array of 4-wide vectors. C++ programmers may legitimately ask “Why would anyone need to do this?”

To answer this, recall that only 25% of the code was converted to use PAOS. What about the other 75% of the code? As it turns out, using pointers and non-stride 1 assignments, the other 75% of the code can be left unchanged yet access and manipulate all SIMD allocated arrays as they were formerly allocated non-PAOS. With this capability, conversion of an application need not be an all–or-nothing situation. Without this capability of FORTRAN, multiple sets of data would have to be continuously synchronized.

Because of a potential for strides not equal to 1, the FORTRAN compiler will generate multiple paths of code. One for stride=1 and the other for stride != 1. I should correct this statement. The preceding is correct when the array is “natural” numbers for the language (char, word, integer, float, double) where the processor instruction set has the capability of scaling the array cell by *1, *2, *4 or *8. In the case of the Type(TypeYMM) array, each cell in the array consists of four doubles (32 bytes). Therefore, due to the instruction set being incapable of a scale of *32, only one execution path is generated (and taken). The overhead is trivial:

add rdx, 32 ;173.19

## C++ equivalent code

For the benefit of the C++ programmers lets piece together a mockup of the equivalent example:

// AOS struct xyz_t { double xyz[3]; }; struct FiniteSoluton_t { double rDELTIS; int nBeads; xyz_t* rBeadIPIF; xyz_t* rBeadDIPIF; xyz_t* rBeadDDIPIF; FiniteSoluton_t() { nBeads = 0; rBeadIPIF = NULL; rBeadDIPIF = NULL; rBeadDDIPIF = NULL; } void Allocate(int n) { if(rBeadIPIF) delete [] rBeadIPIF; if(rBeadDIPIF) delete [] rBeadDIPIF; if(rBeadDDIPIF) delete [] rBeadDDIPIF; nBeads = n; rBeadIPIF = new xyz_t[n]; rBeadDIPIF = new xyz_t[n]; rBeadDDIPIF = new xyz_t[n]; } }; FiniteSoluton_t fs; // dummy external function to force optimizer NOT to remove

unused code // *** this will generate a link error, ignore error and look at

.ASM file void foo(xyz_t& x); void TENSEG(FiniteSoluton_t* pFiniteSolution) { int JSEG; // pull in frequently used items from fs struct int JSEGlast = pFiniteSolution->nBeads+1; double rDELTIS = pFiniteSolution->rDELTIS; xyz_t* rBeadIPIF = pFiniteSolution->rBeadIPIF; xyz_t* rBeadDIPIF = pFiniteSolution->rBeadDIPIF; xyz_t* rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF; xyz_t TOSVX1, TOSVX2, SUI; for(JSEG=0; JSEG < JSEGlast; ++JSEG) { for(int i=0; i<3; ++i) { // Determine TUI at 1/2 integration step from now // using current velocity and acceleration // Position of X-End of segment TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i] + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0)); // Position of Y-End of segment TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i] + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0)); // SUI X->Y SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i]; } // for(int i=0; i<3; ++i) // insert code to assure optimizer does not remove loop foo(SUI); } // for(JSEG=0; JSEG < JSEGlast; ++JSEG) }

**PAOS code**

#include < dvec.h > struct xyzF64vec4 { F64vec4 xyz[3]; }; struct FiniteSolutonF64vec4 { F64vec4 rDELTIS; int nBeads; xyzF64vec4* rBeadIPIF; xyzF64vec4* rBeadDIPIF; xyzF64vec4* rBeadDDIPIF; FiniteSolutonF64vec4() { nBeads = 0; rBeadIPIF = NULL; rBeadDIPIF = NULL; rBeadDDIPIF = NULL; } void Allocate(int n) { if(rBeadIPIF) delete [] rBeadIPIF; if(rBeadDIPIF) delete [] rBeadDIPIF; if(rBeadDDIPIF) delete [] rBeadDDIPIF; nBeads = n; rBeadIPIF = new xyzF64vec4[n]; rBeadDIPIF = new xyzF64vec4[n]; rBeadDDIPIF = new xyzF64vec4[n]; } }; FiniteSolutonF64vec4 fsF64vec4; void foo(xyzF64vec4& x); // dummy external functon to force optimizer NOT to remove unused code // *** this will generate a link error, ignore error and look at .ASM file void TENSEG(FiniteSolutonF64vec4* pFiniteSolution) { int JSEG; // pull in frequrently used items from fs struct int JSEGlast = pFiniteSolution->nBeads+1; F64vec4 rDELTIS = pFiniteSolution->rDELTIS; xyzF64vec4* rBeadIPIF = pFiniteSolution->rBeadIPIF; xyzF64vec4* rBeadDIPIF = pFiniteSolution->rBeadDIPIF; xyzF64vec4* rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF; xyzF64vec4 TOSVX1, TOSVX2, SUI; for(JSEG=0; JSEG < JSEGlast; ++JSEG) { for(int i=0; i<3; ++i) { // Determine TUI at 1/2 integration step from now // using current velocity and acceleration // Position of X-End of segment TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i] + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0)); // Position of Y-End of segment TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i] + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0)); // SUI X->Y SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i]; } // for(int i=0; i<3; ++i) // insert code to assure optimizer does not remove loop foo(SUI); } // for(JSEG=0; JSEG < JSEGlast; ++JSEG) }

Other than for the type declaration, the code section remains the same.

## The sections of code to be adapted for PAOS

Some additional coding considerations will have to be made when programming in PAOS. One example is defensive code for divide by zero protection. An example where you may have had:

xyz_t& unitVector(xyz_t& v) { xyz_t ret; double Length = sqrt( v.xyz[0]*v.xyz[0] + v.xyz[1]*v.xyz[1] + v.xyz[2]*v.xyz[2]); // defensive code to avoid divide by 0 // When (if) Length = 0.0 then substitute 1.0 // X = 0.0 / 1.0 = uX 0.0 (not divide by 0) // Y = 0.0 / 1.0 = uY 0.0 (not divide by 0) // Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0) if(Length == 0.0) Length = 1.0; ret.xyz[0] = v.xyz[0] / Length; ret.xyz[1] = v.xyz[1] / Length; ret.xyz[2] = v.xyz[2] / Length; return ret; } Becomes: xyzF64vec4& unitVector(xyzF64vec4& v) { xyzF64vec4 ret; static const F64vec4 Zero(0.0); static const F64vec4 One(1.0); F64vec4 Length = sqrt( v.xyz[0]*v.xyz[0] + v.xyz[1]*v.xyz[1] + v.xyz[2]*v.xyz[2]); // defensive code to avoid divide by 0 // When (if) Length = 0.0 then substitute 1.0 // X = 0.0 / 1.0 = uX 0.0 (not divide by 0) // Y = 0.0 / 1.0 = uY 0.0 (not divide by 0) // Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0) Length = select_eq(Length, Zero, Length, One); ret.xyz[0] = v.xyz[0] / Length; ret.xyz[1] = v.xyz[1] / Length; ret.xyz[2] = v.xyz[2] / Length; return ret; }

Other than for the type declaration, only one statement was changed.

## PAOS advantage in the assembly code

The real advantage of PAOS can be found in the assembly code, with the understanding that the PAOS code is operating on four entities at the same time. Let’s compare the assembly code of both the original AOS code and the modified PAOS code. The source code is the same; the only difference is in the data type. The source code:

for(int i=0; i<3; ++i) { // Determine TUI at 1/2 integration step from now // using current velocity and acceleration // Position of X-End of segment TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i] + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0)); // Position of Y-End of segment TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i] + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0)) + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0)); // SUI X->Y SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i]; } // for(int i=0; i<3; ++i)

**AOS assembly code**

;;; for(int i=0; i<3; ++i) ;;; { ;;; ;;; // Determine TUI at 1/2 integration step from now ;;; // using current velocity and acceleration ;;; // Position of X-End of segment ;;; TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i] ;;; + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0)) ;;; + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0)); ;;; // Position of Y-End of segment ;;; TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i] vmulsd xmm1, xmm2, QWORD PTR [_2il0floatpacket.42] ;66.13 vmulsd xmm2, xmm2, xmm2 ;66.13 vmulsd xmm3, xmm2, QWORD PTR [_2il0floatpacket.43] ;66.13 vmovddup xmm0, xmm1 ;66.13 vmovddup xmm2, xmm3 ;66.13 vmovups XMMWORD PTR [80+rsp], xmm6 ;66.13 vmovapd xmm6, xmm2 ;66.13 vmovups XMMWORD PTR [64+rsp], xmm7 ;66.13 vmovapd xmm7, xmm1 ;66.13 vmovups XMMWORD PTR [48+rsp], xmm8 ;66.13 vmovapd xmm8, xmm3 ;66.13 vmovups XMMWORD PTR [32+rsp], xmm9 ;66.13 vmovapd xmm9, xmm0 ;66.13 mov QWORD PTR [136+rsp], rbx ;66.13 mov rbx, rax ;66.13 mov QWORD PTR [128+rsp], rsi ;66.13 mov rsi, rdx ;66.13 mov QWORD PTR [120+rsp], rdi ;66.13 mov rdi, rcx ;66.13 mov QWORD PTR [112+rsp], r12 ;66.13 mov r12, r8 ;66.13 mov QWORD PTR [104+rsp], r13 ;66.13 mov r13, r9 ;66.13 mov QWORD PTR [96+rsp], r14 ;66.13 mov r14, r10 ;66.13 .B5.3:: ; Preds .B5.4 .B5.2 vmulpd xmm4, xmm9, XMMWORD PTR [24+rbx+r12] ;66.13 vaddpd xmm5, xmm4, XMMWORD PTR [24+rbx+r13] ;66.13 vmulpd xmm4, xmm6, XMMWORD PTR [24+rbx+rdi] ;66.13 vaddpd xmm0, xmm5, xmm4 ;66.13 vmulpd xmm4, xmm9, XMMWORD PTR [rbx+r12] ;62.13 vmulpd xmm5, xmm6, XMMWORD PTR [rbx+rdi] ;62.13 vaddpd xmm4, xmm4, XMMWORD PTR [rbx+r13] ;62.13 vaddpd xmm4, xmm4, xmm5 ;62.13 ;;; + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0)) ;;; + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0)); ;;; // SUI X->Y ;;; SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i]; vsubpd xmm4, xmm0, xmm4 ;70.11 vmovupd XMMWORD PTR [144+rsp], xmm4 ;70.11 ;;; } // for(int i=0; i<3; ++i) lea rcx, QWORD PTR [144+rsp] ;73.3 vmulsd xmm4, xmm7, QWORD PTR [40+rbx+r12] ;66.13 vmulsd xmm5, xmm8, QWORD PTR [40+rbx+rdi] ;66.13 vaddsd xmm4, xmm4, QWORD PTR [40+rbx+r13] ;66.13 vaddsd xmm0, xmm4, xmm5 ;66.13 vmulsd xmm4, xmm7, QWORD PTR [16+rbx+r12] ;62.13 vmulsd xmm5, xmm8, QWORD PTR [16+rbx+rdi] ;62.13 vaddsd xmm4, xmm4, QWORD PTR [16+rbx+r13] ;62.13 vaddsd xmm4, xmm4, xmm5 ;62.13 vsubsd xmm4, xmm0, xmm4 ;70.11 vmovsd QWORD PTR [160+rsp], xmm4 ;70.11

The AOS computed one SUI vector in 46 instructions using a mixture of 2-wide vector and scalar

instructions. This loop was unrolled.

**PAOS assembly code**

;;; for(int i=0; i<3; ++i) ;;; { ;;; ;;; // Determine TUI at 1/2 integration step from now ;;; // using current velocity and acceleration ;;; // Position of X-End of segment ;;; TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i] vdivpd ymm1, ymm0, YMMWORD PTR [_2il0floatpacket.437] ;213.13 vmulpd ymm0, ymm0, ymm0 ;213.13 vmovupd YMMWORD PTR [288+r13], ymm1 ;213.13 vdivpd ymm0, ymm0, YMMWORD PTR [_2il0floatpacket.438] ;213.13 vmovupd YMMWORD PTR [320+r13], ymm0 ;213.13 mov QWORD PTR [456+rsp], rbx ;213.13 mov rbx, rax ;213.13 mov QWORD PTR [448+rsp], rsi ;213.13 mov rsi, rdx ;213.13 mov QWORD PTR [440+rsp], rdi ;213.13 mov rdi, rcx ;213.13 mov QWORD PTR [432+rsp], r12 ;213.13 mov r12, r8 ;213.13 mov QWORD PTR [424+rsp], r14 ;213.13 mov r14, r9 ;213.13 mov QWORD PTR [416+rsp], r15 ;213.13 mov r15, r10 ;213.13 .B5.3:: ; Preds .B5.4 .B5.2 vmovupd ymm4, YMMWORD PTR [288+r13] ;213.13 ;;; + (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0)) ;;; + (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0)); ;;; // Position of Y-End of segment ;;; TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i] ;;; + (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0)) ;;; + (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0)); ;;; // SUI X->Y ;;; SUI.xyz[i] = TOSVX2.xyz[i] - TOSVX1.xyz[i]; ;;; } // for(int i=0; i<3; ++i) ;;; // insert code to assure optimizer does not remove loop ;;; foo(SUI); lea rcx, QWORD PTR [192+r13] ;224.3 vmovupd ymm3, YMMWORD PTR [320+r13] ;213.13 vmulpd ymm5, ymm4, YMMWORD PTR [rbx+r12] ;213.13 vmulpd ymm1, ymm3, YMMWORD PTR [rbx+rdi] ;213.13 vaddpd ymm0, ymm5, YMMWORD PTR [rbx+r14] ;213.13 vmulpd ymm5, ymm4, YMMWORD PTR [96+rbx+r12] ;217.13 vaddpd ymm2, ymm0, ymm1 ;213.13 vmulpd ymm1, ymm3, YMMWORD PTR [96+rbx+rdi] ;217.13 vmovupd YMMWORD PTR [r13], ymm2 ;213.13 vaddpd ymm0, ymm5, YMMWORD PTR [96+rbx+r14] ;217.13 vaddpd ymm5, ymm0, ymm1 ;217.13 vsubpd ymm2, ymm5, ymm2 ;221.11 vmovupd YMMWORD PTR [96+r13], ymm5 ;217.13 vmovupd YMMWORD PTR [192+r13], ymm2 ;221.11 vmulpd ymm0, ymm4, YMMWORD PTR [32+rbx+r12] ;213.13 vmulpd ymm2, ymm3, YMMWORD PTR [32+rbx+rdi] ;213.13 vmulpd ymm5, ymm4, YMMWORD PTR [128+rbx+r12] ;217.13 vaddpd ymm1, ymm0, YMMWORD PTR [32+rbx+r14] ;213.13 vaddpd ymm0, ymm5, YMMWORD PTR [128+rbx+r14] ;217.13 vaddpd ymm1, ymm1, ymm2 ;213.13 vmulpd ymm2, ymm3, YMMWORD PTR [128+rbx+rdi] ;217.13 vmovupd YMMWORD PTR [32+r13], ymm1 ;213.13 vaddpd ymm5, ymm0, ymm2 ;217.13 vsubpd ymm1, ymm5, ymm1 ;221.11 vmovupd YMMWORD PTR [128+r13], ymm5 ;217.13 vmovupd YMMWORD PTR [224+r13], ymm1 ;221.11 vmulpd ymm0, ymm4, YMMWORD PTR [64+rbx+r12] ;213.13 vmulpd ymm2, ymm3, YMMWORD PTR [64+rbx+rdi] ;213.13 vmulpd ymm4, ymm4, YMMWORD PTR [160+rbx+r12] ;217.13 vmulpd ymm3, ymm3, YMMWORD PTR [160+rbx+rdi] ;217.13 vaddpd ymm1, ymm0, YMMWORD PTR [64+rbx+r14] ;213.13 vaddpd ymm0, ymm1, ymm2 ;213.13 vaddpd ymm1, ymm4, YMMWORD PTR [160+rbx+r14] ;217.13 vmovupd YMMWORD PTR [64+r13], ymm0 ;213.13 vaddpd ymm1, ymm1, ymm3 ;217.13 vsubpd ymm0, ymm1, ymm0 ;221.11 vmovupd YMMWORD PTR [160+r13], ymm1 ;217.13 vmovupd YMMWORD PTR [256+r13], ymm0 ;221.11

The PAOS used all 4-wide vectors computing four segments at the same time (from four different tethers). The for loop was unrolled, and the resulting code was a very Spartan 56 instructions per four SUI results. Giving 56/4 = 14 instructions per SUI.

AOS took 46 instructions per SUI calculation; PAOS took 114 instructions per SUI. Assuming equal time per instruction PAOS has a performance advantage of *3.28x* over AOS (for this code sample).

**Caution:** do not assume that you will experience an actual 3.28x improvement in an application as a whole.

While not all applications are suitable for conversion from AOS to PAOS, portions of your application may benefit from conversion. Later in this article, we will show the results of using PAOS modifications in a FORTRAN application for the study of tether systems. In this application only 25% of the code was modified for PAOS (mostly copy and paste, search, and replace). A collection of spacecraft were interconnected with a collection of 96 tethers. This configuration yielded a 2x performance boost. Though not the 3.28x optimization estimate for the code snippet above, it shows a rather impressive performance gain for the programming effort involved.

## Program conversion techniques

In the tether simulation program, I took the body of the “type TypeFiniteSolution” and placed it into a separate file (FiniteSolution.inc). Compilation of the module containing the type definition uses the FORTRAN preprocessor (FPP):

! Note, "real" is not defined at this point ! i.e. "real" in the FPP #include file passes through as "real" type TypeFiniteSolution SEQUENCE #include "FiniteSolution.inc" ! No additional data to the original type end type TypeFiniteSolution ! XMM, YMM and YMM02 variants replace "real" with ! "type(TypeXMM)", "type(TypeYMM)" and "type(TypeYMM02)" ! XMM variant #define real type(TypeXMM) type TypeFiniteSolutionXMM SEQUENCE #include "FiniteSolution.inc" ! additional members for TypeXMM integer :: TetherNumber0 integer :: TetherNumber1 end type TypeFiniteSolutionXMM #undef real ! YMM variant #define real type(TypeYMM) type TypeFiniteSolutionYMM SEQUENCE #include "FiniteSolution.inc" ! additional members for TypeYMM integer :: TetherNumber0 integer :: TetherNumber1 integer :: TetherNumber2 integer :: TetherNumber3 end type TypeFiniteSolutionYMM #undef real ! YMM02 variant #define real type(TypeYMM02) type TypeFiniteSolutionYMM02 SEQUENCE #include "FiniteSolution.inc" ! additional members for TypeYMM02 integer :: TetherNumber0 integer :: TetherNumber1 integer :: TetherNumber2 end type TypeFiniteSolutionYMM02 #undef real

We can use the same technique for the subroutines and functions by using the FPP and redefining “real” to the appropriate type. The main difference is the non-SIMD would typically not have the .v suffix. So you may need two versions of your subroutines and functions that are sensitive to SIMD-typed variables.

**Generic subroutine and function interfaces**

Use of generic subroutine and function interfaces is a major requirement for conversion to PAOS. Consider the requirements of a CROSS product of two arrays of dimension 3.

interface CROSS RECURSIVE SUBROUTINE CROSS_r_r_r(VA,VB,VC) use MOD_ALL real :: VA(3), VB(3), VC(3) END SUBROUTINE CROSS_r_r_r RECURSIVE SUBROUTINE CROSS_r_r_ymm(VA,VB, VC) USE MOD_ALL real :: VA(3), VB(3) type(TypeYMM) :: VC(3) END SUBROUTINE CROSS_r_r_ymm RECURSIVE SUBROUTINE CROSS_r_r_ymm02(VA,VB, VC) USE MOD_ALL real :: VA(3), VB(3) type(TypeYMM02) :: VC(3) END SUBROUTINE CROSS_r_r_ymm02 RECURSIVE SUBROUTINE CROSS_r_r_xmm(VA,VB, VC) USE MOD_ALL real :: VA(3), VB(3) type(TypeXMM) :: VC(3) END SUBROUTINE CROSS_r_r_xmm RECURSIVE SUBROUTINE CROSS_ymm_ymm_ymm(VA,VB,VC) use MOD_ALL type(TypeYMM) :: VA(3), VB(3), VC(3) END SUBROUTINE CROSS_ymm_ymm_ymm RECURSIVE SUBROUTINE CROSS_ymm02_ymm02_ymm02(VA,VB,VC) use MOD_ALL type(TypeYMM02) :: VA(3), VB(3), VC(3) END SUBROUTINE CROSS_ymm02_ymm02_ymm02 RECURSIVE SUBROUTINE CROSS_xmm_xmm_xmm(VA,VB, VC) use MOD_ALL type(TypeXMM) :: VA(3), VB(3), VC(3) END SUBROUTINE CROSS_xmm_xmm_xmm end interface CROSS

Note that a typical application will have various (but not all) permutations of the types of arguments. A generic subroutine interface will sort this out for you.

! EVAL ANGULAR MOMENTUM DERIVATIVE TERM CALL CROSS (RPI,RPIDD, pFiniteSolutionSIMD.rRVDI)

The above calls CROSS_r_r_ymm

! FORM VECTOR IN DIRECTION OF TETHER FRAME "K" UNIT VECTOR CALL CROSS (pFiniteSolutionSIMD.rUITI, pFiniteSolutionSIMD.rUJOI, pFiniteSolutionSIMD.rUKTI)

The above calls CROSS_ymm_ymm_ymm

Subroutines were replicated from the original AOS layout for each of the additional PAOS widths (2-wide, 3-wide, 4-wide). In this respect, C++ is easier since templates can be used. In FORTRAN with Visual Studio* you select a subroutine, select all, and copy to clip board. Then in the solution Explorer, click on the project of interest, Add Item, add same name as original code plus suffix of your choice (I used ymm, ymm02, and xmm). Then paste the contents of the original subroutine into the new empty file. The remaining step is to change the types and add an appropriate USE module.

Some additional consideration for data changed will have to be addressed. While you can use operator functions in FORTRAN to eliminate suffixing variables with “.v,” which can be done with search and replace, I chose not to due to an issue with the compiler not inline-ing the code. This inline issue may be resolved with a product update by the time you read this article.

Some of your code sections may have to address the PAOS vector horizontally. I wanted to have the same code text for all the different PAOS widths (2, 3, 4 now, 5, 6, 7, 8, … later). In the FORTRAN subroutine, I added:

integer :: s integer, parameter :: sLast = 3

And used it as:

! FORM UNIT VECTOR FROM X-END TO Y-END OF BEAD SEGMENT do s=0,sLast if(DUMEL.v(s) .ne. 0.0_8) then TUI.v(s) = SUI.v(s) / DUMEL.v(s) else call RandomUnitVector(TUI.v(s)) endif end do

*(to C++ programmers “parameter” is like “static const”)*

The above is a divide-by-zero protection except that instead of returning a NULL vector, a random unit vector is returned. The above modification runs as AOS (non-PAOS).

In the case of the production of NULL vector, the code becomes:

RadiusSquared.v = RelativePOSE_IF(1).v**2 & & + RelativePOSE_IF(2).v**2 & & + RelativePOSE_IF(3).v**2 ! N.B we are comparing a bead position within a tether against ! a gravitational body position they should not coincide ! ignore if at center of object !DEC$ VECTOR ALWAYS do s = 0, sLast if(RadiusSquared.v(s) .eq. 0.0_8) RadiusSquared.v(s) = 1.0_8 end do Radius.v = dsqrt(RadiusSquared.v) ACCE_IFlocal(1).v = RelativePOSE_IF(1).v & & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v) ACCE_IFlocal(2).v = RelativePOSE_IF(2).v & & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v) ACCE_IFlocal(3).v = RelativePOSE_IF(3).v & & * ((pObject.rGMft3Ps2 / RadiusSquared.v) / Radius.v)

**Optimization notes on the above code**

The preventative code for the potential for the divide by 0 is moved to test the square of the vector as opposed to after the square root. The result will be the same regardless of path taken. However, now the dsqrt will never be asked to perform a square root of 0.0. Depending on processor microcode, the square root of 0.0 may take a longer path.

The above do s loop, having a constant sLast, will be unrolled and vectorized and surprisingly for this loop will be fully vectored **without** branch instructions.

## Advantages of PAOS over AOS

While AOS can vectorize the larger arrays allocated to NumberOfBeads related allocations and has limited vectorization opportunities for the small fixed arrays of dimensions (3) and (3,3), AOS is unable to vectorize the manipulation of the scalars. PAOS can fully vectorize the scalars, fully vectorize the small fixed arrays of dimensions (3) and (3,3), and has better opportunities to vectorize the larger arrays allocated to NumberOfBeads related allocations.

**Advantage with computation of the scalars**

Traditional non-SIMD statements to be repeated on each tether

! CALCULATE 1ST TERM ! AND ACCUMULATE THIS INCREMENT DUMSCL = DUMCEE * ( 2.0_8*(pFiniteSolution%rELD / pFiniteSolution%rEL)**2 & & - pFiniteSolution%rELDD / pFiniteSolution%rEL )

SIMD statement to be repeated on a group of four tethers

! CALCULATE 1ST TERM ! AND ACCUMULATE THIS INCREMENT DUMSCL.v = DUMCEE.v * ( 2.0_8*(pFiniteSolution%rELD.v / pFiniteSolution%rEL.v)**2 & & - pFiniteSolution%rELDD.v / pFiniteSolution%rEL.v )

Note that the code was changed only slightly. The same code is used with the addition of the “.v” suffix to the variables typed as TypeYMM, TypeXMM, and TypeYMM02. While this suffix can be eliminated by declaring operator functions (I may make this revision later), the suffix use, such as the Debug Build, was followed with full debugging diagnostics and performed the operations in-line and with vector instructions. These code modifications can essentially be performed using search and replace. If the programmer elects to use operator overload functions, then no code change is required (other than for type declarations).

**Advantage in computation with small fixed arrays**

Manipulating the small fixed-size arrays has an advantage in PAOS. Consider the positions of the front and back ends of a tether and you are to calculate the vector between the front and back ends as well as the unit vector between the front and back ends (pFiniteSolution% pointers removed for clarity):

real(8) :: Front(3), Back(3), ELI(3), UELI(3), Length ELI(1) = Back(1) – Front(1) ELI(2) = Back(2) – Front(2) ELI(3) = Back(3) – Front(3) Length = DSQRT(ELI(1)**2 + ELI(2)**2 + ELI(3)**2) UEIL(1) = ELI(1) / Length UEIL(2) = ELI(2) / Length UEIL(3) = ELI(3) / Length

The scalar AOS code can vectorize only portions of the above code. On SSE platforms, only two of the 3 components of Back and Front could be loaded and subtracted at once; the 3rd component is loaded and subtracted separately. The “ELI(1)**2 + ELI(2)**2” could be performed in one operation and the “ELI(3)*2” as a separate operation. The remaining problem is the SSE now has to perform a horizontal add of the results of “ELI(1)**2 + ELI(2)**2”, then an additional scalar add of the “ELI(3)*2”. The result can then be run through a scalar DSQRT operation to produce the Length. The division of ELI(1:3) by the scalar is performed in two steps: one 2-wide for the “ELI(1) + ELI(2)”, and one for the “ELI(3)” component. The store is performed in two steps as mentioned for the load.

On AVX systems, the code loads each of the Back and Front in two parts, one for the var(1:2) parts and one for the var(3) parts, plus an additional XOR to pre-clear the 4th element of the YMM register. The subtraction is performed in one operation, same with the squaring of “ELI(1)**2 + ELI(2)**2 + ELI(3)*2”. However, at this point, the AVX code has to horizontally add the three partial results and then perform a scalar DSQRT to produce the Length. The division of the ELI by Length is performed in one operation after broadcasting the Length across a YMM register; however, this is generally not done in one step due to the uncertainty of the undefined 4th element of the YMM vector. Therefore, the division is performed in two steps: 2-wide, 1-wide. Note that the load and store are performed each in two parts with an additional XOR to pre-clear the 4th element of the load YMM register.

**The PAOS has a much easier job and it fully vectorizes all of the code:**

type(TypeYMM) :: Front(3), Back(3), ELI(3), UELI(3), Length ELI(1).v = Back(1).v – Front(1).v ELI(2).v = Back(2).v – Front(2).v ELI(3).v = Back(3).v – Front(3).v Length.v = DSQRT(ELI(1).v**2 + ELI(2).v**2 + ELI(3).v**2) UEIL(1).v = ELI(1).v / Length.v UEIL(2).v = ELI(2).v / Length.v UEIL(3).v = ELI(3).v / Length.v

In the first statements, each element of the ELI vector is computed 4-wide (four tethers at once). The partial results are kept 4-wide in separate YMM registers. The squaring now is performed on three separate registers, each 4-wide. What used to be the horizontal add of the three elements now becomes a vertical add of three 4-wide elements. Next, the DSQRT is performed on the 4-wide sum of the squares (on Sandy Bridge DSQRT is internally performed on each half of the YMM register, Ivy Bridge is full 4-wide DSQRT). The resultant 4-wide lengths (separate values) is then used to produce four unit vectors. The above code is completely 4-wide SIMD and potentially executes faster with four tethers than it does with one tether (speed is dependent on where cache lines are split).

**Advantage with large arrays**

When looping through larger arrays, similar advantages can be attained when the arrays contain the three (X, Y, Z) components. Effectively the PAOS data flow computations become:

OBJECT(1)%X(1) | OBJECT(2)%X(1) | OBJECT(3)%X(1) | OBJECT(4)%X(1) OBJECT(1)%Y(1) | OBJECT(2)%Y(1) | OBJECT(3)%Y(1) | OBJECT(4)%Y(1) OBJECT(1)%Z(1) | OBJECT(2)%Z(1) | OBJECT(3)%Z(1) | OBJECT(4)%Z(1) OBJECT(1)%X(2) | OBJECT(2)%X(2) | OBJECT(3)%X(2) | OBJECT(4)%X(2) OBJECT(1)%Y(2) | OBJECT(2)%Y(2) | OBJECT(3)%Y(2) | OBJECT(4)%Y(2) OBJECT(1)%Z(2) | OBJECT(2)%Z(2) | OBJECT(3)%Z(2) | OBJECT(4)%Z(2)

Where OBJECT represents the structure containing the array of application vector(3)’s, but now remapped into SIMD 4-wide vectors stored in three units of SIND 4-wide vectors.

Scalar (original) representation

! Position with respect to IP IF real, pointer :: rBeadIPIF(:,:) ! (3,NumberOfBeads+2)

AVX 4-wide SIMD representation

! Position with respect to IP IF type(TypeYMM), pointer :: rBeadIPIF(:,:) ! (3,NumberOfBeads+2)

The code to produce the vector between adjacent beads and the unit vector of that vector.

Original code (abbreviated) that will optimize to include partial vectorization:

DO I=1,nBeads ELI(1) = pFiniteSolution%rBeadIPIF(1,I) – pFiniteSolution%rBeadIPIF(1,I+1) ELI(2) = pFiniteSolution%rBeadIPIF(2,I) – pFiniteSolution%rBeadIPIF(2,I+1) ELI(3) = pFiniteSolution%rBeadIPIF(3,I) – pFiniteSolution%rBeadIPIF(3,I+1) Length = DSQRT(ELI(1)**2 + ELI(2)**2 + ELI(3)*2) pFiniteSolution%rTUI(1,I) = ELI(1) / Length pFiniteSolution%rTUI(2,I) = ELI(2) / Length pFiniteSolution%rTUI(3,I) = ELI(3) / Length END DO

YMM SIMD code fully vectorized across 4 tethers:

DO I=1,nBeads ELI(1).v = pFiniteSolution%rBeadIPIF(1,I).v – pFiniteSolution%rBeadIPIF(1,I+1).v ELI(2).v = pFiniteSolution%rBeadIPIF(2,I).v – pFiniteSolution%rBeadIPIF(2,I+1).v ELI(3).v = pFiniteSolution%rBeadIPIF(3,I).v – pFiniteSolution%rBeadIPIF(3,I+1).v Length = DSQRT(ELI(1).v**2 + ELI(2).v**2 + ELI(3).v*2).v pFiniteSolution%rTUI(1,I).v = ELI(1).v / Length pFiniteSolution%rTUI(2,I).v = ELI(2).v / Length pFiniteSolution%rTUI(3,I).v = ELI(3).v / Length END DO

**Advantages over SOA**

The SOA layout requires more cache line loads and more TLB loads than PAOS. An example is the case of producing the vector and unit vector between the beads of the tether. SOA would require separate cache lines **and TLBs** for:

rBeadIPIF(I,1,t) rBeadIPIF(I+1,1,t) (cache line split 1/8 of time) rBeadIPIF(I,2,t) rBeadIPIF(I+1,2,t) (cache line split 1/8 of time) rBeadIPIF(I,3,t) rBeadIPIF(I+1,3,t) (cache line split 1/8 of time) rTUI(I,1,t) (cache line split 1/8 of time) rTUI(I,2,t) (cache line split 1/8 of time) rTUI(I,3,t) (cache line split 1/8 of time)

Six cache lines per bead for each tether.

The PAOS requires 4 cache lines per bead 4 tethers at a time:

rBeadIPIF(1,I).v(0:3) rBeadIPIF(1,I+1).v(0:3) (cache line shared with v) rBeadIPIF(2,I).v(0:3) rBeadIPIF(2,I+1).v(0:3) (1/2 time in above/below) rBeadIPIF(3,I).v(0:3) rBeadIPIF(3,I+1).v(0:3) (cache line shared with ^) rTUI(1,I).v(0:3) (cache line shared with v) rTUI(2,I).v(0:3) (1/2 time in above/below) rTUI(3,I).v(0:3) (cache line shared with ^)

The PAOS traverses through cache lines faster (meaning flushing faster); however, the traversal occurs 1/4 as many times (SIMD is packing 4 tethers into the data). While the total data storage is the same for both layouts, the demands on separate cache lines and TLB entries are reduced.

## Benchmark program

To illustrate the advantage of PAOS, we will adapt a real application as opposed to using a synthetic benchmark. I chose to use a complex simulation program, one I am very familiar with, as a benchmark. Over a period of seven years, I have optimized and re-optimized the code many times. Prior to conversion to PAOS, I thought there wasn’t much left to squeeze out of the code. I knew that a large portion of the code was not amenable to vectorization as written. I also knew that what code that was vectorized and optimized by the compiler, was not all that could be vectorized. I tried to resolve this and developed the packed array of structures technique to use with my application. I considered the SOA format, but knew that this would have marginal effect for this particular application.

The program is a highly optimized OpenMP* FORTRAN program for simulation of tethers attached to spacecraft. I use the program for investigating Space Elevator design issues as well as tethered satellites. I chose a simulation study configuration for an EDDE spacecraft to use as a benchmark. This spacecraft is proposed for use in debris elimination in near earth orbit (dead satellites and pieces thereof).

The EDDE spacecraft under study is longer than the one depicted above and consists of:

2 Net Managers

2 Controller and Emitters

9 Solar Arrays

2 Relatively short tethers (~15 m) connecting the Payload Managers to the Controller/Emitter

10 Long tethers (1 km each) of aluminum tape

A total of 12 tethers and 13 connected satellites

The PAOS modifications made to this program were performed only on the Finite Element portion of the tether portion of the application. This represents an important but smaller percentage of the total code (~25%). The remaining tether portions and body portion (Payload Managers, Controller and Emitters, Solar Arrays, Earth, Moon, Planets, etc.) have yet to be converted to PAOS. Full advantage of PAOS for this application has not been addressed.

The Visual Studio solution for this program contains twelve FORTRAN projects and one C++ project. The C++ project is used to query the processor for supported SIMD functions. A total of 554 F90 source files are used, many of which contain multiple subroutines and functions. Approximately 100 additional files were added to the original solution for PAOS support. The contents of these files are ostensibly the same as the original source of these 100 files with the addition of the “.v” suffix and different data types for the different SIMD packing formats.

The test platform was an Intel Core i7 2600K, which is 4 core processors with HT, code-named Sandy Bridge. It had Windows* 7 x64 PRO and 16 GB RAM.

The system was configured at various times with HT on/off and Turbo Boost on/off. The tests were mainly run with HT off because this was a 4-core processor, and the application is double precision floating point heavy.

Due to a significant portion of the code dealing with the Bodies, representing (approximately 80% of non-PAOS code), I increased the number of elements (beads) for simulating the tethers. The number of beads per tether was increased from 1,000 to 10,000. The increase in the number of beads per tether by a factor of 10 will reduce the proportional load of the non-PAOS code to the PAOS code by a factor of 10. Though more beads could be used, I decided not to bias the test results too heavily towards use of PAOS.

To balance iteration loads, the end two short tethers (each 15-beads) were substituted with the 1,000 meter aluminum tethers and at 10,000 beads, thus making 12 tethers of the same design and integration options.

In this application, tethers of same number of beads and same integration options (Aerodynamic, Electrodynamic, etc.) are suitable for consolidating into SIMD PAOS layouts. Depending on compatibility of tethers for SIMD and the number of such tethers, 2-wide, 3-wide, or 4-wide packing is chosen at run time. Twelve tethers permit full packing combinations of one (none), 2-wide, 3-wide, and 4-wide.

Wanting to run various permutations of the numbers of threads (1:4 and 1:8) and various packing densities of the PAOS (1-wide (not packed), 2-wide, 3-wide, 4-wide, of doubles), I increased the number of tethers from 12 to 96, thus making any combination of packing (1,2,3,4) and thread count (1,2,4,8) present an equal number of passes per thread in the outer loop.

The additional tethers were attached to the existing satellites in order to not increase the load of the non-PAOS modified portion of code used by the body (satellite) simulation. Ideal tether counts for 3, 5, 6, 7 threads were not configured as this would greatly increase the number of tethers.

The test program now consists of:

13 satellites, connected by 96 1 km tethers each tether having ~85 scalars 57 Arrays of dimension (3) 3 Arrays of dimension (3,3) 22 Arrays of dimension (3,~10000) 52 Arrays of dimension (~10000) 96 * (85 + (57*3) + (3*3*3) + (22*3*1000) + (52*x10000)) = 56,283,168 doubles (450 MB)

There are additional variables being manipulated per tether, and the satellites, planets, etc.

A single-threaded configuration test run, without PAOS SIMD, was made to determine a suitable integration interval to perform for a base-line reference simulation time. On each anniversary of that simulation time interval, the wall clock elapse time from the prior simulation time anniversary is displayed.

Tests were run for each combination of packing density: no pack, 2-wide, 3-wide, 4-wide, and for each number of threads: 1, 2, 3, 4. Each configuration was run for four report intervals and the smallest value was selected for use in producing the comparison tables. The choice of the smallest value was made under the assumption of having the least interference by whatever else may have been running on the system (Windows 7 x64, anti-virus, Visual Studio phone home, etc.).

**96 tethers, each with 10,000 beads - Turbo On, HT Off**

no pack SIMD 2-wide SIMD 3-wide SIMD 4-wide SIMD 1 202.461830544090 98.8166282926832 110.127472870544 96.2114920825516 2 113.080892757974 64.3881699061922 69.6634485553468 62.4252337711068 3 88.0928441275801 56.0872621388371 60.3187281050650 57.5615369606003 4 78.3640600375229 56.5572142589117 58.9836796998115 58.8164001500936

**96 tethers, each with 10,000 beads - Turbo On, HT Off**

no pack SIMD 2-wide SIMD 3-wide SIMD 4-wide SIMD 1 222.377931407129 108.414170956848 120.604551444653 105.015817485928 2 119.739932457786 67.4135627767355 73.9538575609758 65.4644253658544 3 90.9758964352723 56.9652196622883 61.5711771857414 58.8302754221386 4 80.8031147467163 57.1387076923083 59.7432093058160 59.3558339962474

The first test with one thread, Turbo On, HT Off showed an impressive speedup of 2.05x over the original code. The 3-wide pack was 1.88x speedup, and the 4-wide was 2.10x speedup over the original code. The difference between the 2-wide and 4-wide was only marginal at 1.027x.

The multi-threading scaling beyond two threads seemed to suffer; however, keep in mind that the simulation was memory intensive (~450 MB of working memory).

Regarding the 3-wide pack performance and why this may be of interest to you. Test runs were made with bead counts of 4000.

**4000 beads - Turbo On, HT On**

no pack SIMD 2-wide SIMD 3-wide SIMD 4-wide SIMD 1 91.3673146779801 50.2108811987264 53.7199640737510 47.4793344518262 2 55.4531537381101 34.9485817566820 35.3566061461863 31.8637723342063 3 45.6128820578588 31.2057180011579 30.7373190216704 28.4547075003547 4 45.9029787147183 30.9318707075290 30.1450711425587 29.1348981424198 5 45.9152253990896 30.9467731366030 30.1314135844295 29.2071261732417 6 52.4540275792824 36.6395676842658 34.7508676123107 33.7327393855012 7 55.0174089813809 39.3389245960061 36.7703592835023 36.2333221964363 8 60.0940263305092 42.1638867950896 39.0601884747230 39.5467420973800

Note that the lesser bead count for single thread, no pack vs. 2-wide fell from a speedup of 2.05x to 1.82x. This was due to the change in the ratio of the non-PAOS code vs. PAOS code. Also note that for the 3 to 8 thread runs that the 3-wide outperformed the 2-wide, and in the 8-thread run the 3-wide outperformed the 4-wide. It is apparent from this chart and running under these circumstances that a choice of 3 threads would be optimal for 4-wide. Different processor design, bead counts, numbers of tethers would affect the results greatly. Some combinations of Last Level Cache size and working data set size may favor 3-wide packing.

**The marginal speedup for scaling beyond 2 threads is likely due to three factors: **

a) The full data set of 450 MB exceeded that of the L3 cache, and the application was principally memory bandwidth bound. It should be noted that not all of these data are manipulated during the simulation run (aerodynamic and electro-dynamic forces were disabled among others).

b) The Intel Core i7 (Sandy Bridge) AVX does not have all 256-bit data paths. This requires two clock ticks for some of the internal transfers. The newer Ivy Bridge processor resolves this issue.

c) With Sandy Bridge having Dual Bank memory, it is not as memory-bus efficient as Ivy Bridge with Quad Bank memory. Additionally, this system has one node of memory. A 2-processor Ivy Bridge with two NUMA nodes in NUMA configuration, each with Quad Bank memory can improve the 4-wide SIMD instruction performance.

## Conclusions

PAOS is exceptionally effective in some applications. I anticipate getting additional performance benefits when I replace my 1P Sandy Bridge with a 1P or 2P Ivy Bridge system. It is also expected that non-memory bandwidth limited applications may experience better performance gains using the 4-wide PAOS SIMD (AVX). PAOS will increase size requirements and exhibit better cache line efficiency (higher hit ratio). 2-wide SIMD will double the working storage cache requirements over non-packed and 4-wide SIMD will quadruple the working storage cache requirements. Decreasing or partitioning loop runs may be in order when using PAOS. Your results will differ depending on working set size and processor generation.

Copyright © 2012 Jim Dempsey. All rights reserved.

Mr. Dempsey is available for program- or code-related consultations.

jim.00.dempsey@gmail.com http://www.linkedin.com/in/jim00dempsey

*Other names and brands may be claimed as the property of others.

## Comments (1)

Topjimdempseyatthecove said on

This is a follow-up to email responses to this article:

Some of the readers of this article have tried the PAOS approach on some very simple test programs where the standard approach in these test programs shows a high degree of vectorization.

PAOS is not recommended where you have high degree of vectorization.

PAOS is recommended where you have a low degree of vectorization, and where conversion to PAOS yields a better degree of vectorization.

Check the Home » Forums » Intel® AVX and CPU Instructions » Missed optimization opportunities (August 8, 2010) for an example of when not to consider PAOS.

Jim

## Add a Comment

Top(For technical discussions visit our developer forums. For site or software product issues contact support.)

Please sign in to add a comment. Not a member? Join today