- Mr. Compiler, may I help you with the loop vectorization?
- Just don't do me a disservice, please.

Any parent knows the simple rule: "Never help a child with a task he can succeed at himself. Otherwise you do no good for the kid, for yourself, or for the whole planet."
While a compiler is not a child (actually it is, since the Intel C/C++ Compiler is still less than 16 years old), the rule fully applies to it as well.



To prove it, let's look at the following simple code from the open-source OpenCV library.

template<typename T, class Op> static void
cvtScale_( const Mat& srcmat, Mat& dstmat, double _scale, double _shift )
{
    Op op;
    typedef typename Op::type1 WT;
    typedef typename Op::rtype DT;
    Size size = getContinuousSize( srcmat, dstmat, srcmat.channels() );
    WT scale = saturate_cast<WT>(_scale), shift = saturate_cast<WT>(_shift);

    for( int y = 0; y < size.height; y++ )
    {
        const T* src = (const T*)(srcmat.data + srcmat.step*y);
        DT* dst = (DT*)(dstmat.data + dstmat.step*y);
        int x = 0;

        for( ; x <= size.width - 4; x += 4 )
        {
            DT t0, t1;
            t0 = op(src[x]*scale + shift);
            t1 = op(src[x+1]*scale + shift);
            dst[x] = t0; dst[x+1] = t1;
            t0 = op(src[x+2]*scale + shift);
            t1 = op(src[x+3]*scale + shift);
            dst[x+2] = t0; dst[x+3] = t1;
        }

        for( ; x < size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }
}


It is a template function working with chars, shorts, ints, floats, and doubles. As you can see, its authors decided to help the compiler with loop vectorization by unrolling the inner "x" loop by 4 (and processing the remaining data tail separately).
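For readers unfamiliar with the template machinery: Op is a functor whose type1 is the working type (WT) used for the arithmetic and whose rtype is the destination type (DT). Here is a minimal hypothetical sketch of such a functor for a float-to-uchar conversion (an illustration of mine, not the actual OpenCV functor):

// Hypothetical Op functor: float working type, uchar destination.
// Illustration only - OpenCV's real functors differ in detail.
struct OpCvtFloatToUchar
{
    typedef float         type1;  // WT: type used for the arithmetic
    typedef unsigned char rtype;  // DT: destination element type

    rtype operator()(type1 v) const
    {
        // Round and clamp into [0, 255], as saturate_cast would.
        int iv = (int)(v + 0.5f);
        if( iv < 0 )   iv = 0;
        if( iv > 255 ) iv = 255;
        return (rtype)iv;
    }
};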

So, do you think modern optimizing compilers will vectorize this loop properly?

Let's check it using the Intel Compiler 12.0 with the /QxSSE2 optimization option (using any other SSEx or AVX option gives the same result as below).
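To reproduce the check yourself, one possible command line is:

icl /c /O2 /QxSSE2 /FAs /Qvec-report2 cvtscale.cpp

where cvtscale.cpp is a placeholder file name, /FAs writes an assembly listing with interleaved source, and /Qvec-report2 makes the compiler report which loops were vectorized and which were not.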

And the compiler-generated assembly output is very surprising: the compiler produces some SSE instructions, but only scalar ones, not vector ones. The unrolled loop is NOT vectorized, while the remaining data tail, the non-unrolled loop that processes just 1-3 elements, IS vectorized!

If we remove the unrolling, making our code simple:

for( int y = 0; y < size.height; y++ )
{
    const T* src = (const T*)(srcmat.data + srcmat.step*y);
    DT* dst = (DT*)(dstmat.data + dstmat.step*y);
    int x = 0;

    for( ; x < size.width; x++ )
        dst[x] = op(src[x]*scale + shift);
}


...and check the asm output again, we find that the compiler fully vectorizes the code, resulting in a performance increase of up to 2-4x depending on the input data type!
Conclusion: more work spent on manual unrolling, lower performance. Don't do it.
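If you would like to reproduce the measurement outside of OpenCV, a minimal timing harness along the following lines will do. This is a sketch of mine for the float case with an identity op; the function and variable names are made up, and the exact speedup will of course vary with the compiler, the option set, and the hardware.

#include <cstdio>
#include <ctime>
#include <vector>

// Standalone copy of the manually unrolled inner loop (float, identity op).
static void scaleUnrolled( const float* src, float* dst, int width,
                           float scale, float shift )
{
    int x = 0;
    for( ; x <= width - 4; x += 4 )
    {
        float t0 = src[x]*scale + shift, t1 = src[x+1]*scale + shift;
        dst[x] = t0; dst[x+1] = t1;
        t0 = src[x+2]*scale + shift; t1 = src[x+3]*scale + shift;
        dst[x+2] = t0; dst[x+3] = t1;
    }
    for( ; x < width; x++ )
        dst[x] = src[x]*scale + shift;
}

// Standalone copy of the compact loop.
static void scaleSimple( const float* src, float* dst, int width,
                         float scale, float shift )
{
    for( int x = 0; x < width; x++ )
        dst[x] = src[x]*scale + shift;
}

int main()
{
    const int n = 1 << 20, iters = 500;
    std::vector<float> src( n, 1.0f ), dst( n );

    // Time the unrolled variant.
    std::clock_t t = std::clock();
    for( int i = 0; i < iters; i++ )
        scaleUnrolled( &src[0], &dst[0], n, 2.0f, 1.0f );
    std::printf( "unrolled: %.0f ms\n", 1000.0*(std::clock() - t)/CLOCKS_PER_SEC );

    // Time the compact variant.
    t = std::clock();
    for( int i = 0; i < iters; i++ )
        scaleSimple( &src[0], &dst[0], n, 2.0f, 1.0f );
    std::printf( "simple:   %.0f ms\n", 1000.0*(std::clock() - t)/CLOCKS_PER_SEC );
    return 0;
}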

Please note that the Microsoft compiler (Visual Studio 2010 and 2008) with the /arch:SSE2 option does NOT vectorize the code above, neither the unrolled nor the compact version. The code produced in both cases is very similar in appearance and performance, so the manual unrolling buys nothing there either; it just confirms the conclusion above.

And what if you want to keep the unrolling for some reason but still need the desired vectorization (and the performance that comes with it)?
Then use the Intel compiler pragmas as shown below:
----------------------------------------------------------
#pragma simd
        for( x = 0; x <= size.width - 4; x += 4 )
        {
            DT t0, t1;
            t0 = op(src[x]*scale + shift);
            t1 = op(src[x+1]*scale + shift);
            dst[x] = t0; dst[x+1] = t1;
            t0 = op(src[x+2]*scale + shift);
            t1 = op(src[x+3]*scale + shift);
            dst[x+2] = t0; dst[x+3] = t1;
        }
#pragma novector
        for( ; x < size.width; x++ )
            dst[x] = op(src[x]*scale + shift);
    }
----------------------------------------------------------

Here #pragma simd forces vectorization of the unrolled loop despite the compiler's own heuristics, while #pragma novector keeps the 1-3 element tail scalar, where vector code would not pay off. It is self-explanatory, isn't it?
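One practical note: these pragmas are Intel-specific, so if the same source must also build with other compilers (which would merely warn about the unknown pragmas), you can hide them behind the standard __INTEL_COMPILER predefined macro. A small sketch of mine, not OpenCV code:

// __pragma is the Windows (MSVC-compatible) form understood by the Intel
// compiler; on Linux the standard _Pragma("simd") operator does the same.
#ifdef __INTEL_COMPILER
#  define LOOP_SIMD     __pragma(simd)
#  define LOOP_NOVECTOR __pragma(novector)
#else
#  define LOOP_SIMD
#  define LOOP_NOVECTOR
#endif

// Usage: put LOOP_SIMD before the unrolled loop and LOOP_NOVECTOR before
// the tail loop, exactly where the pragmas stand in the snippet above.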

For more complete information about compiler optimizations, see our Optimization Notice.