Algorithms to vectorize load groups in x86

The article describes algorithms used to achieve vectorization described in the blog.

Basically the algorithms convert an array of structures to the structure of arrays, where array length is vector length. For example:

We have an array of 32bit float coordinates X, Y, Z. Vector length is 128 bits. Each vector consists of 4 32bit float elements. As we have 3 elements in structure we need 4 XYZ structures (gray, blue, green and orange):

to get 3 vectors of X, Y and Z:

 

It’s obvious that this works only when number of elements in a vector is greater than number of elements in a structure.

There are at least 2 algorithms to complete this. Let’s call them: Direct permutations and Shift permutations.

1. Direct permutations. Complexity N2+N*(N-1). Complexity to get 1 element 2*N-1. Where N is number of elements in structure.

To get a vector of X elements we put X into its position (using “shuffle” instructions) in each vector and then unite all elements (using “or” or “blend” instructions):

 

The complexity to get 1 structure element is N “shuffles” + (N-1) “ors”. If we need vectors for all structure elements the complexity will be N*N “shuffles” + N*(N-1) “ors”.

This algorithm has high parallelism. Potentially we can count all N vectors in parallel.

2. Shift permutations. Complexity N+N2-1. Complexity to get 1 element (N2+N+2)/2. Where N is number of elements in structure.

First we group all structure elements in a vector (using shuffles). Should take N shuffles:

 

Now we are able to operate with groups. What we need is to put each group one-by-one. We will do this in N-1 steps using “palignr” instructions:

Here we have Z vector ready and therefore need just N-1 “palignr” more:

 

Finally X and Y vectors are ready.

The complexity of the algorithm is N “shuffles” + (N-1) steps * N “palignr” instructions + (N-1) “pailgnr” instructions at final step: N + (N-1)*N + N-1 = N2+N-1.

The complexity for 1 structure element is N “shuffles” + (N-1 + N-2 + … +1) + 1 (if on final step a vector is not ready we need 1 more “palignr”. Finally 1 element complexity is (N2+N+2)/2.

This algorithm has less complexity than previous but lower parallelism.

3. Power of M case: N = MK

When number of elements in structure N is power of some natural number N = MK we can lower complexity of the algorithms to ~N*log(N).

Let A be one of described above algorithms, A(N) its complexity for N element structures. When its complexity could be lowered for the case N = MK if we K times sequentially apply algorithm for M element structures to each M vectors group. The complexity will be N/M – number of groups to which we should apply the algorithm multiplied by K*A(M). So finally N * K * A(M) / M = N * logM(N) * A(M) / M.

Below is example for N = 4, M = 2

Generally when we decompose N on multipliers = M * L, we can get final result applying algorithm for M elements to N/M groups and then algorithm for L elements to N/L groups. The complexity of such algorithm will be N * (A(M)/M + A(L)/L).

 

The algorithms are implemented in GCC compiler (functions vect_permute_load_chain and vect_shift_permute_load_chain): https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/tree-vect-data-refs.c?revision=218160&view=markup&pathrev=218160

 

有关编译器优化的更完整信息,请参阅优化通知