Best Known Method: Coaxing the Compiler to Vectorize Structured Data via Gathers

Published:02/22/2013   Last Updated:02/22/2013

Best Known Method

To help the compiler generate better vector code, sometimes it helps to decompose complex data structures to allow the compiler to understand the available parallelism and vectorize the code.  Vectorizing critical kernels is recommended for improving performance generally but is particularly important when targeting the Intel® Xeon Phi Coprocessor.  Decomposing data accesses may allow the compiler to use more advanced features like vector gather and scatter. Though adjacent data elements are preferred in order to maximize the performance advantage of vector loads and stores, sometimes the requirement to access data through indices is unavoidable.  Though use of vector gathers and scatters is generally slower than vector loads and stores, they can be beneficial if the amount of vector computation following the creation of the vectors is enough to offset the vector gather time. The compiler will generate vector gathers and scatters as needed, but sometimes it can be challenged by complex data structures. Showing the compiler how to access these complex data structures can help.

Original Code

This BKM originated with analysis of a Molecular Dynamics (MD) code. In MD codes, atom forces are typically maintained by three values: x, y, and z.  These force values are arranged in a position array that contains all atoms.

// atom forces
typedef struct float3 {
   float x;
   float y;
   float z;
} float3;
// position is an array of float3
float3* position; 

 Force calculations for a given atom are determined by its neighbor atoms.  For each atom, there is a list of neighbor atoms. To compute the forces for a given atom, this list is accessed to find all of its neighbors.

In the code segment below the position array, containing all of the atoms, is accessed by using an index in the neighList array.  The neighbors’ values are placed into the struct jpos. This is followed by calculations using the atom forces of this neighbor atom.

181    for (int k=0; k<dis; k++){

183        jpos = position[ neighList[j*dis + k + maxNeighbors * i] ];

192        float delx = ipos.x - jpos.x;
193        float dely = ipos.y - jpos.y;
194        float delz = ipos.z - jpos.z;
195        float r2inv = delx*delx + dely*dely + delz*delz;

The following is a portion of the assembly code generated for the above code segment. It shows that the position array elements are accessed one at a time even though vector instructions are being used. The neighList is accessed is get the index for the position array (line #183).  This index is loaded into register %rax. Each of the three adjacent elements of position indexed by %rax, x, y, and z, are loaded from memory separately at the vsubps subtraction instructions (line #192, #193, #194). Notice the {1to16} instruction modifier on the vsubps instruction indicating a broadcast of a single data value into all 16 elements of the source vector for the subtraction. All subsequent computations are completed as scalar operations even though vector instructions are used.

    movl      8(%r14,%r15,4), %eax                          #183.30 c1
    movslq    %eax, %rax                                    #183.21 c3
    shlq      $4, %rax                                      #183.21 c5
    vsubps    (%rax,%rbx){1to16}, %zmm19, %zmm28{%k4}       #192.28 c8
    vsubps    4(%rax,%rbx){1to16}, %zmm18, %zmm27{%k4}      #193.28 c10
    vsubps    8(%rax,%rbx){1to16}, %zmm22, %zmm26{%k4}      #194.32 c12
    vmulps    %zmm27, %zmm27, %zmm0{%k4}                    #195.41 c14
    vmovaps   %zmm28, %zmm1                                 #195.53 c16
    vmovaps   %zmm26, %zmm2                                 #195.53 c18
    vfmadd213ps %zmm0, %zmm28, %zmm1{%k4}                   #195.53 c20
    nop                                                     #195.53 c22
    vfmadd213ps %zmm1, %zmm26, %zmm2{%k4}                   #195.53 c24

Improved Code

To improve performance, we would like the compiler to exploit vector memory accesses and vector instructions using full vectors rather than scalar values. After loading the neighbor values from the position array, there are many force calculations to be completed (not shown in the code segment). Ideally these calculations will be completed using the performance advantage of full vectors rather than scalar values.

We note that the neighList is a sequential list of indexes that are used to access the position array. This is ideal for vector gathers.  The position array could be accessed using vector gather instructions, where the index is the vector of neighList that was just loaded.

Creating vectors for complicated data types like structs is a challenge for today’s compilers that will likely be addressed in the future.  For now, to help the compiler on this code, we recompose indexing the complex data type to explicitly access the individual data elements.  After this code change, the compiler is able to use vector gathers and fully vectorize the code. In the code below, the accesses to the elements in position array are made explicitly.

181     for (int k=0;k<dis;k++){ 

184         jposx = position[ neighList[j*dis + k + maxNeighbors * i] ].x;
185         jposy = position[ neighList[j*dis + k + maxNeighbors * i] ].y;
186         jposz = position[ neighList[j*dis + k + maxNeighbors * i] ].z;

192         float delx = iposx - jposx;
193         float dely = iposy - jposy;
194         float delz = iposz - jposz;
195         float r2inv = delx*delx + dely*dely + delz*delz;

The following is a portion of the assembly code generated for the above code segment. It shows that the position array is accessed by vector gather instructions (line #184 #185 #186).  The vector-gather instructions place the x, y, and z data values into three vector registers %zmm21, %zmm22, and %zmm23. The vector-subtract (vsubps) instructions (line #192 #193 #194) now use full vectors rather than scalar values:

        vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4}                #185.22
        jkzd      ..L318, %k4   # Prob 50%                      #185.22
 ..L319 vgatherdps 4(%r13,%zmm20,4), %zmm22{%k4}                #185.22
        jknzd     ..L319, %k4   # Prob 50%                      #185.22
 ..L318 vgatherdps (%r13,%zmm20,4), %zmm21{%k3}                 #184.22
        jkzd      ..L320, %k3   # Prob 50%                      #184.22
 ..L321 vgatherdps (%r13,%zmm20,4), %zmm21{%k3}                 #184.22
        jknzd     ..L321, %k3   # Prob 50%                      #184.22
 ..L320 vsubps    %zmm22, %zmm11, %zmm2                         #193.27 c27
        vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5}                #186.22
        jkzd      ..L322, %k5   # Prob 50%                      #186.22
 ..L323 vgatherdps 8(%r13,%zmm20,4), %zmm23{%k5}                #186.22
        jknzd     ..L323, %k5   # Prob 50%                      #186.22
 ..L322 vsubps    %zmm21, %zmm12, %zmm3                         #192.27 c33
        cmpl      %r15d, %r10d                                  #158.9 c33
        vmulps    %zmm2, %zmm2, %zmm31                          #195.41 c35
        vsubps    %zmm23, %zmm9, %zmm1                          #194.31 c37
        vfmadd231ps %zmm3, %zmm3, %zmm31                        #195.41 c39
        vfmadd231ps %zmm1, %zmm1, %zmm31                        #195.41

 To understand this listing, note that the vector gather instructions work in pairs, the first one initiating an operation and the second one continuing it until it is completed.  The remaining vector instructions in this example operate on vectors of distinct values, leaving a full vector of results in %zmm1.

With this code change, there was a 2.5X performance improvement in the overall MD application as the force calculations were completed using full vectors rather than scalar values.


Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804