### Challenge

**Improve memory utilization by manipulating data-structure layout.** For certain algorithms, like 3D transformations and lighting, there are two basic ways of arranging the vertex data. The traditional method is the array of structures (*AoS*) arrangement, with a structure for each vertex, as shown below:

```c
typedef struct {
    float x, y, z;
    int a, b, c;
    . . .
} Vertex;

Vertex Vertices[NumOfVertices];
```

This method does not take full advantage of SIMD capabilities.

### Solution

**Arrange the data in an array for each coordinate, taking advantage of the structure of arrays (***SoA***) processing method.** The SoA data structure is shown here:

```c
typedef struct {
    float x[NumOfVertices];
    float y[NumOfVertices];
    float z[NumOfVertices];
    int a[NumOfVertices];
    int b[NumOfVertices];
    int c[NumOfVertices];
    . . .
} VerticesList;

VerticesList Vertices;
```

There are two options for computing on data in AoS format: perform the operations on the data as it stands in AoS format, or rearrange it (*swizzle* it) into SoA format dynamically. The following code samples show each option, based on a dot-product computation:

```asm
; The dot product of an array of vectors (Array) and a
; fixed vector (Fixed) is a common operation in 3D
; lighting operations,
; where Array = (x0,y0,z0),(x1,y1,z1),...
; and Fixed = (xF,yF,zF)
; A dot product is defined as the scalar quantity
; d0 = x0*xF + y0*yF + z0*zF.

; AoS code
; All values marked DC are “don’t-care.”
; In the AoS model, the vertices are stored in the
; xyz format.
movaps  xmm0, Array      ; xmm0 = DC, x0, y0, z0
movaps  xmm1, Fixed      ; xmm1 = DC, xF, yF, zF
mulps   xmm0, xmm1       ; xmm0 = DC, x0*xF, y0*yF, z0*zF
movhlps xmm1, xmm0       ; xmm1 = DC, DC, DC, x0*xF
addps   xmm1, xmm0       ; xmm1 = DC, DC, DC, x0*xF+z0*zF
movaps  xmm2, xmm0
shufps  xmm2, xmm2, 55h  ; xmm2 = DC, DC, DC, y0*yF
addps   xmm2, xmm1       ; xmm2 = DC, DC, DC,
                         ; x0*xF+y0*yF+z0*zF

; SoA code
;
; X = x0,x1,x2,x3
; Y = y0,y1,y2,y3
; Z = z0,z1,z2,z3
; A = xF,xF,xF,xF
; B = yF,yF,yF,yF
; C = zF,zF,zF,zF
movaps xmm0, X           ; xmm0 = x0,x1,x2,x3
movaps xmm1, Y           ; xmm1 = y0,y1,y2,y3
movaps xmm2, Z           ; xmm2 = z0,z1,z2,z3
mulps  xmm0, A           ; xmm0 = x0*xF, x1*xF, x2*xF, x3*xF
mulps  xmm1, B           ; xmm1 = y0*yF, y1*yF, y2*yF, y3*yF
mulps  xmm2, C           ; xmm2 = z0*zF, z1*zF, z2*zF, z3*zF
addps  xmm0, xmm1
addps  xmm0, xmm2        ; xmm0 = (x0*xF+y0*yF+z0*zF), ...
```

Performing SIMD operations on the original AoS format can require more calculations, and some of the operations leave SIMD elements unused. Therefore, this option is generally less efficient.

The recommended way for computing data in AoS format is to swizzle each set of elements to SoA format before processing it using SIMD technologies. This swizzling can either be done dynamically during program execution or statically when the data structures are generated. See Chapters 4 and 5 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual for specific examples of swizzling code. Performing the swizzle dynamically is usually better than using AoS, but is somewhat inefficient as there is the overhead of extra instructions during computation. Performing the swizzle statically, when the data structures are being laid out, is best, as there is no runtime overhead.
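As a sketch of what a static swizzle amounts to, the portable C below converts an AoS vertex array into the SoA layout once, up front, so the SIMD loops that follow can load `x[]`, `y[]`, `z[]` with plain vector loads. The structure shapes follow the samples shown earlier (with the elided fields dropped); the function name and array size are our own.

```c
#include <stddef.h>

enum { NumOfVertices = 8 };  /* illustrative size */

typedef struct { float x, y, z; int a, b, c; } Vertex;  /* AoS layout */

typedef struct {                                        /* SoA layout */
    float x[NumOfVertices], y[NumOfVertices], z[NumOfVertices];
    int   a[NumOfVertices], b[NumOfVertices], c[NumOfVertices];
} VerticesList;

/* One-time "static" swizzle: copy the AoS data into the SoA layout when
 * the data structures are generated, so no per-frame shuffling is needed. */
static void aos_to_soa(const Vertex *src, VerticesList *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst->x[i] = src[i].x;  dst->y[i] = src[i].y;  dst->z[i] = src[i].z;
        dst->a[i] = src[i].a;  dst->b[i] = src[i].b;  dst->c[i] = src[i].c;
    }
}
```

A dynamic swizzle would do the same transposition in registers, per group of vertices, at the cost of extra instructions in the inner loop.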

As mentioned earlier, the SoA arrangement allows more efficient use of the parallelism of the SIMD technologies because the data is ready for computation in a more optimal vertical manner: multiplying components **x0,x1,x2,x3** by **xF,xF,xF,xF** using four SIMD execution slots to produce four unique results. In contrast, computing directly on AoS data can lead to horizontal operations that consume SIMD execution slots but produce only a single scalar result, as shown by the many “don’t-care” (*DC*) slots in the previous code sample.
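In C terms, the vertical SoA computation is just the loop below, where iteration *i* maps onto SIMD lane *i*: a vectorizing compiler (or hand-written `mulps`/`addps` code) processes all four lanes at once. The function name is our own; the array names mirror the assembly sample.

```c
/* Four dot products against the fixed vector (xF, yF, zF), computed
 * "vertically": each iteration corresponds to one lane of the
 * mulps/mulps/mulps/addps/addps sequence shown above. */
void dot4(const float X[4], const float Y[4], const float Z[4],
          float xF, float yF, float zF, float d[4])
{
    for (int i = 0; i < 4; ++i)
        d[i] = X[i] * xF + Y[i] * yF + Z[i] * zF;
}
```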

Use of the SoA format for data structures can also lead to more efficient use of caches and bandwidth. When the elements of the structure are not accessed with equal frequency, such as when elements **x, y, z** are accessed ten times more often than the other entries, SoA not only saves memory but also prevents fetching the unnecessary data items **a, b, c**.
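To make the bandwidth point concrete, here is a back-of-the-envelope calculation with our own assumed sizes (4-byte floats and ints, a hot loop that reads only **x, y, z**): AoS pulls whole structs through the cache, unused fields included, while SoA touches only the three float arrays.

```c
/* Assumed sizes: each vertex holds 3 floats (x,y,z) + 3 ints (a,b,c). */
enum { N = 1000, FLOAT_BYTES = 4, INT_BYTES = 4 };

/* AoS: a loop reading only x, y, z still streams entire structs. */
unsigned aos_bytes(void) { return N * (3 * FLOAT_BYTES + 3 * INT_BYTES); }

/* SoA: only the three coordinate arrays are fetched. */
unsigned soa_bytes(void) { return N * (3 * FLOAT_BYTES); }
```

Under these assumptions SoA fetches half the bytes; the ratio grows with every cold field added to the struct.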

Note that SoA can have the disadvantage of requiring more independent memory-stream references. A computation that uses the arrays **x, y,** and **z** from the first code sample in the Solution section would require three separate data streams. This can require more prefetches and additional address-generation calculations, and can have a greater impact on page-access efficiency. A hybrid SoA approach blends the two alternatives:

```c
NumOfGroups = NumOfVertices / SIMDwidth;

typedef struct {
    float x[SIMDwidth];
    float y[SIMDwidth];
    float z[SIMDwidth];
} VerticesCoordList;

typedef struct {
    int a[SIMDwidth];
    int b[SIMDwidth];
    int c[SIMDwidth];
    . . .
} VerticesColorList;

VerticesCoordList VerticesCoord[NumOfGroups];
VerticesColorList VerticesColor[NumOfGroups];
```

In this case, only two separate address streams are generated and referenced: one containing **xxxx,yyyy,zzzz,xxxx**,... and the other containing **aaaa,bbbb,cccc,aaaa**,... This also prevents fetching unnecessary data, assuming that the variables **x, y, z** are always used together, and that the variables **a, b, c** are also used together, but not at the same time as **x, y, z**. This hybrid SoA approach ensures the following:

- Data is organized to enable more efficient vertical SIMD computation
- Simpler/less address generation than AoS
- Fewer streams, which reduces page misses
- Use of fewer prefetches, due to fewer streams
- Efficient cache-line packing of data elements that are used concurrently
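As a sketch, addressing a single vertex in the hybrid layout splits its index into a group and a lane. The accessor names and the 4-wide `SIMDwidth` below are our own assumptions, matching the structure shown above with the color fields omitted.

```c
enum { SIMDwidth = 4, NumOfVertices = 16,
       NumOfGroups = NumOfVertices / SIMDwidth };

typedef struct {
    float x[SIMDwidth], y[SIMDwidth], z[SIMDwidth];
} VerticesCoordList;

VerticesCoordList VerticesCoord[NumOfGroups];

/* Vertex i lives in group i / SIMDwidth, lane i % SIMDwidth; each group's
 * x[], y[], z[] blocks are contiguous and SIMD-load ready. */
float get_x(int i)          { return VerticesCoord[i / SIMDwidth].x[i % SIMDwidth]; }
void  set_x(int i, float v) { VerticesCoord[i / SIMDwidth].x[i % SIMDwidth] = v; }
```

With a power-of-two `SIMDwidth`, the divide and modulo compile down to a shift and a mask, which is the "simpler address generation" noted in the list above.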

With the advent of the SIMD technologies, the choice of data organization becomes more important and should be based carefully on the operations to be performed on the data. In some applications, traditional data arrangements may not lead to the maximum performance. Application developers are encouraged to explore different data arrangements and data-segmentation policies for efficient computation. This may mean using a combination of AoS, SoA, and hybrid SoA in a given application.

The following items are related to this one:

- How to Use Strip Mining to Optimize Memory Use on 32-Bit Intel® Architecture
- How to Use Loop Blocking to Optimize Memory Use on 32-Bit Intel® Architecture

### Source

Intel® 64 and IA-32 Architectures Optimization Reference Manual (PDF)