How to further optimize?

Hello everyone,

I have started overcoming the performance issues that I described in a previous thread (https://software.intel.com/en-us/forums/topic/516335). As suggested there, I enabled reports for vectorization and of course nothing was being vectorized. I started changing the code and yesterday I scored a small victory. I reduced execution time from 430 sec to 308 sec on the Phi, which is also faster than the 330 sec required on the 8 threads of the CPU (1x i7-3770 @3.40GHz). Now I am trying to further optimize the code and I have identified two points where I believe a lot of time can be saved.

The main computation in my application is to calculate small (3x3) rotation matrices and multiply them with another 3x3 matrix. For the experiment I am currently running, about 15 billion such multiplications are performed. For the CPU, I found that simply unrolling all loops and writing out all calculations manually for this multiplication gives the best performance. I was wondering, however, whether such a small calculation can be efficiently vectorized on the Phi. Each matrix does not even fill a single vector register (I have float values, so 16 values fit into a vector register). I have searched a lot, but couldn't find anything relevant to this. I would like to hear your opinion on whether it is worth it or not. And if so, how can this actually be vectorized? Can pragmas provide better performance for such small calculations, or would it be better to use intrinsics directly?
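
For reference, the fully unrolled CPU version looks roughly like this (names here are only illustrative; matrices are stored in row-major order):

static inline void Mult3x3(const float A[9], const float B[9], float C[9])
{
    /* C = A * B, fully unrolled, row-major 3x3 matrices */
    C[0] = A[0]*B[0] + A[1]*B[3] + A[2]*B[6];
    C[1] = A[0]*B[1] + A[1]*B[4] + A[2]*B[7];
    C[2] = A[0]*B[2] + A[1]*B[5] + A[2]*B[8];
    C[3] = A[3]*B[0] + A[4]*B[3] + A[5]*B[6];
    C[4] = A[3]*B[1] + A[4]*B[4] + A[5]*B[7];
    C[5] = A[3]*B[2] + A[4]*B[5] + A[5]*B[8];
    C[6] = A[6]*B[0] + A[7]*B[3] + A[8]*B[6];
    C[7] = A[6]*B[1] + A[7]*B[4] + A[8]*B[7];
    C[8] = A[6]*B[2] + A[7]*B[5] + A[8]*B[8];
}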

The second point is a calculation of the type:

for (i = 0; i < M; i++) {
    C[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);
}

However, the vectorization report says that the sqrtf() function cannot be vectorized. Reading more on this I found that sqrtf() can be vectorized. Therefore, I started simplifying my application and indeed at some point the above loop gets vectorized. However, I have still not identified what exactly may hinder vectorization in the original code.

If I enforce vectorization in the original code through a #pragma simd I get messages of the type "remark: *MIC* vectorization support: gather was generated for the variable (unknown):  indirect access, 64bit indexed" and "remark: *MIC* vectorization support: scatter was generated for the variable C:  indirect access, 64bit indexed". The performance gets worse in this case.

I would appreciate any help on the above issues.

Ioannis E. Venetis

sqrtf() can be vectorized for MIC only via the SVML library.  By default, it should use imf-precision medium (roughly 23-bit precision), which is much faster than full 24-bit.  Setting the domain-exclusion option to handle only finite operands should produce a measurable speedup.  If you want only 13-bit precision, imf-precision low will be faster yet.  A chance to study the docs.
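
For reference, the options I have in mind are along these lines (check the compiler documentation for the exact spelling and defaults; the exclusion mask value is only an example, assuming 15 excludes extreme values, NaNs, infinities and denormals, and mysource.c stands for your file):

icc -mmic -O3 -fimf-precision=medium -fimf-domain-exclusion=15 mysource.c
icc -mmic -O3 -fimf-precision=low mysource.c      # roughly 13-bit sqrtf, faster still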

In a stride-1 loop such as the one you show, gather and scatter instructions should be used only for the initial peel loop for alignment and for the partial cache line at the end.  Aligning the data and setting the safe-padding assertion should eliminate these.  safe-padding tells the compiler it's OK to access the cache-line remainders and to store the original contents back.  I don't know whether this might incur a sort of race condition if the loop is parallelized, e.g. by #pragma omp parallel for simd; parallel simd is probably useful only for loop counts upwards of 10000.
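
A hypothetical sketch of the aligned version (64-byte allocation plus icc alignment assertions; initialization omitted), to be combined with the safe-padding option on the compile line:

#include <math.h>
#include <xmmintrin.h>      /* _mm_malloc / _mm_free */

void sqrt_loop_aligned(int M)
{
    /* 64-byte aligned buffers so vector loads/stores need no peel loop */
    float *A = (float *)_mm_malloc(M * sizeof(float), 64);
    float *B = (float *)_mm_malloc(M * sizeof(float), 64);
    float *C = (float *)_mm_malloc(M * sizeof(float), 64);

    /* ... fill A and B here ... */

    __assume_aligned(A, 64);    /* icc-specific alignment assertion */
    __assume_aligned(B, 64);
    __assume_aligned(C, 64);
    for (int i = 0; i < M; i++)
        C[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);

    _mm_free(A); _mm_free(B); _mm_free(C);
}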

If the comment about 64-bit indexed refers to use of a 64-bit data type for loop induction variable, that would be extra slow for gather-scatter.  It may be better to assure local int type:

for (int i = 0; i < M; i++)

(also assuring that M is a plain int, not a 64-bit type or an expression re-evaluated on each iteration).

https://software.intel.com/en-us/articles/utilizing-full-vectors

Likely reason for not voluntarily vectorizing would be that you didn't assert no aliasing, e.g. by

float * __restrict A, * __restrict B, ....

I noticed a report that gcc requires all the pointers to be defined as  __restrict for it to have an effect; I haven't checked icc thoroughly in that respect.   If M involves re-calculation for each iteration of the loop or int64 promotion, that could make the compiler report "seems inefficient."
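
For example, a version of your loop with the aliasing assertion might look like this (the function wrapper is hypothetical, just to show where the qualifiers go):

#include <math.h>

/* __restrict tells the compiler that A, B and C do not overlap;
   the local int induction variable avoids 64-bit indexing */
void sqrt_loop_restrict(int M, const float * __restrict A,
                        const float * __restrict B, float * __restrict C)
{
    for (int i = 0; i < M; i++)
        C[i] = sqrtf(A[i] * A[i] + B[i] * B[i]);
}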

Hello,

Just in case anyone else looks for something similar: I realized that the reason sqrtf() was not being vectorized was not in my code, but in the compilation flags that MATLAB automatically adds. More specifically, MATLAB adds the -ansi parameter, which hindered vectorization of sqrtf(). Removing this parameter allowed vectorization and, furthermore, almost halved the execution time of my application.

I also tried to implement the 3x3 matrix multiplication using intrinsics, but it actually took more time. The reason was that the 3x3 matrices did not fit well into the 512 bits of the vector registers, so multiple loads, sets and swizzles were required. I then embedded the 3x3 matrices into 4x4 matrices and tried again. I now get 20% better performance. Hopefully I can optimize this further.

Best regards,

Ioannis E. Venetis

That option -ansi was replaced by -ansi-alias long before the first MIC compiler came out, so the compiler may have misinterpreted it. It's not an appropriate option for MATLAB to issue for currently supported compilers.   In the 15.0 compilers, at least for MIC, ansi-alias would become a default.  If -ansi were interpreted as disabling in-line expansion of sqrtf or requiring ERRNO setting, that could prevent vectorization.

I don't know whether -opt-assume-safe-padding would make a difference in your case.  It asserts that it's OK to grab whole cache lines and store back the original contents of the padded bytes, which your 4x4 expansion has assured are safely padded.

Dear Tim,

Thank you for your explanation about -ansi. With respect to -opt-assume-safe-padding, it actually makes things worse. When using that option, time goes from 119 sec to 146 sec.

Best regards,

Ioannis E. Venetis

Hello,

This is the code I have written for the 4x4 matrix multiplication. The matrices are stored in row-major order and the operation performed is a3 = a1 * a2. My question is whether the code could be reorganized (perform operations in a different order) so as to get better performance.

static __inline__ void Matrix_Mult(float a1[16], float a2[16], float a3[16])
{
        __m512  b1, b2, b3, b4, c1, c2, c3, c4, r1, r2, r3, r4;
        __m512  a = _mm512_load_ps(a1);                 /* load all 16 elements of a1 (must be 64-byte aligned) */

        /* Broadcast column k of each row of a1 across its 128-bit lane */
        c1 = _mm512_swizzle_ps(a, _MM_SWIZ_REG_AAAA);   /* a1[i][0] in every slot of lane i */
        c2 = _mm512_swizzle_ps(a, _MM_SWIZ_REG_BBBB);   /* a1[i][1] */
        c3 = _mm512_swizzle_ps(a, _MM_SWIZ_REG_CCCC);   /* a1[i][2] */
        c4 = _mm512_swizzle_ps(a, _MM_SWIZ_REG_DDDD);   /* a1[i][3] */

        /* Broadcast row k of a2 to all four lanes, then accumulate a1[i][k] * a2[k][j] */
        b1 = _mm512_extload_ps(a2, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        r1 = _mm512_mul_ps(c1, b1);
        b2 = _mm512_extload_ps(a2 +  4, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        r2 = _mm512_fmadd_ps(c2, b2, r1);
        b3 = _mm512_extload_ps(a2 +  8, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        r3 = _mm512_fmadd_ps(c3, b3, r2);
        b4 = _mm512_extload_ps(a2 + 12, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        r4 = _mm512_fmadd_ps(c4, b4, r3);
        _mm512_store_ps(a3, r4);                        /* a3 = a1 * a2 */
}

Ioannis E. Venetis

As the code shown here might be expected to show a latency of more than 30 cycles, due to the sequential dependency of the simd float operations plus memory operations, it seems to rely on pipelining of multiple invocations within each thread for performance.  It might as well use one destination to combine all of r1...r4.

Different decisions between icc and gcc about riffling (multiple independent accumulators) seem to account for differences up to 50% in performance, with neither getting it right all the time.  The way Intel compilers do it, multiple accumulators rely on a significant loop count before entering the tree of adds combining accumulators.

If you have checked and found the compiler actually using fma rather than splitting into separate simd multiply and add instructions (to reduce the latency), a simple change in that direction would be to start with 2 independent mul_ps + fmadd_ps groups, adding the 2 accumulators at the end.  This might reduce  function latency by 10 cycles, if cache misses aren't significant.  Assuming you have pipelining between invocations of the function, this reduction in latency may show up only once per loop.
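
Applied to the function you posted, that variant might look roughly like this (untested sketch; it reuses the c1...c4 swizzles and only replaces the accumulation chain):

        /* two independent mul/fmadd chains, combined once at the end */
        b1 = _mm512_extload_ps(a2,      _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        b2 = _mm512_extload_ps(a2 +  4, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        b3 = _mm512_extload_ps(a2 +  8, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        b4 = _mm512_extload_ps(a2 + 12, _MM_UPCONV_PS_NONE, _MM_BROADCAST_4X16, _MM_HINT_NT);
        r1 = _mm512_mul_ps(c1, b1);                  /* chain 1 */
        r2 = _mm512_mul_ps(c2, b2);                  /* chain 2 */
        r1 = _mm512_fmadd_ps(c3, b3, r1);
        r2 = _mm512_fmadd_ps(c4, b4, r2);
        _mm512_store_ps(a3, _mm512_add_ps(r1, r2));  /* combine the two chains */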

I can't guess whether using a memory operand for each mul and fma intrinsic (so as to shorten the code) could show any advantage.

Dear Tim,

Thank you for your suggestions. I have tried them (and many more!), but it seems that I was lucky and the code I posted here is actually the fastest one. Splitting the multiplications and additions to reduce latency seemed a good idea, but the result was worse. Furthermore, using one destination (r1) instead of 4 (r1...r4) had no effect on execution time in some tries, but in others it made things worse. I assume the reason is that there are dependencies, i.e., the result placed into r1 is required by the next instruction and has to be computed before it can be used. Using more destinations seems to remove some of these dependencies.

The only other idea I have to improve performance of this operation is to completely reorganize the order of calculations. As it is now, the calculations are performed as in a k-i-j loop for matrix multiplication. Do you think that changing this would benefit performance?

Ioannis E. Venetis

>>The main computation in my application is to calculate small (3x3) rotation matrices and multiply them with another 3x3 matrix. For the current experiment I use, about 15 billion such multiplications are performed.

I suggest running an experiment by converting your AOS to SOA. Usually a single rotation matrix is applied to multitudes of vectors, such as a single 3x3 matrix being multiplied against a list of 3x3 matrices. Let me express the following in Fortran; it is easier to describe than pseudo code. You should be able to transcribe it to C. Be mindful that in Fortran the left-most index of an array walks adjacent data, while in C/C++ it is the right-most.

subroutine ApplyRotation(A, R, C)
    real, intent(in)  :: A(:,:,:) ! A(nA, 3, 3)
    real, intent(in)  :: R(3,3) ! rotation matrix
    real, intent(out) :: C(:,:,:) ! C(nA, 3, 3)
    do j=1,3
        do i=1,3
            do n=1, size(A, DIM=1)
                C(n,i,j) = A(n,i,1)*R(1,j)+ A(n,i,2)*R(2,j) + A(n,i,3)*R(3,j)
            end do
        end do
    end do
end subroutine ApplyRotation

The above will fully vectorize.
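
A rough C transcription of the same idea might look as follows (the layout is hypothetical: the matrix index n is the right-most, unit-stride dimension, so the inner loop is stride 1):

/* A and C hold nA matrices in structure-of-arrays form:
   A[(i*3 + k)*nA + n] is element (i,k) of matrix n. */
void ApplyRotation(int nA, const float * __restrict A,
                   const float R[3][3], float * __restrict C)
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            for (int n = 0; n < nA; n++)
                C[(i*3 + j)*nA + n] = A[(i*3 + 0)*nA + n] * R[0][j]
                                    + A[(i*3 + 1)*nA + n] * R[1][j]
                                    + A[(i*3 + 2)*nA + n] * R[2][j];
}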

As a proof-of-concept test, insert a conversion routine, something like the above code (you won't have to restructure your data), and a routine to convert back.

Run it, timing the conversion routine, the converted (rotation) routine, and the convert-back routine separately.

With the three timing results in hand, you can determine the efficacy of:

a) using your current best method

b) using your current layout with the conversion, rotate, conversion back

c) reworking your data layout to avoid conversion and conversion back

Jim Dempsey
