Compiler bug report: Incorrect auto-vectorized SSE2 code

*****************************
Bug Description
*****************************
The program below should always print "0". That is indeed the case when compiling without optimizations or only with size optimizations (/O1). However, with speed optimizations enabled (/O2 or /O3) the program prints "1". The compiler attempts to auto-vectorize the inner loop using SSE2 instructions, but the generated code incorrectly sets the variable "flag" to 1.

This does not appear to be related to precision since none of the conditions that can set the variable "flag" to a nonzero value are even remotely close to being true.

Additional observations:
- the bug does not appear when setting the architecture to SSE or older, only with SSE2 and up.
- the bug does not appear if the variable "flag" is an int instead of a short. This may be related to the fact that in the former case the compiler does not generate the "packssdw" instruction.

*****************************
Configuration
*****************************
Compiler Version:
"Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.3.300 Build 20120130"

Operating System:
Windows 7 32-bit

CPU:
This was found on an Intel Core i7-870 CPU at 2.93 GHz. It was also reproduced on an Intel Core i7-2700 at 3.40 GHz.

How To Reproduce:
To reproduce the bug:
>> icl /arch:SSE2 /O2 auto_vectorizer_bug.c
>> auto_vectorizer_bug.exe
Result: 1

To get the correct result:
>> icl /Od auto_vectorizer_bug.c
>> auto_vectorizer_bug.exe
Result: 0

*****************************
Sample Program (also attached)
*****************************

// auto_vectorizer_bug.c : Demonstrates what appears to be a bug in the Intel compiler's auto-vectorization for SSE2
// When compiled with -Od, the program prints "0", which is the correct result.
// When compiled with -O3 or -O2 the program prints "1".

#include <stdio.h>
#include <assert.h>

#define DIM   8

float g_buffer[DIM][DIM];

void init_buffer(float p_buf[DIM][DIM])
{
    int i, j;

    /* initialize all elements to 0.5 */
    for (i = 0; i < DIM; i++)
    {
        for (j = 0; j < DIM; j++)
        {
            p_buf[i][j] = 0.5;
        }
    }
}

int main(int argc, char* argv[])
{
    int i, j;
    short flag;
    float x1, x2, x3, x4, x5, x6;
    int dim;

    flag = 0;

    /* initialize all array entries to 0.5 */
    init_buffer(g_buffer);

    /* make it appear as if the array dimensions are not known
     * at compile time */
    assert(argc==1);
    dim = argc*DIM;

    for (i = 1; i < dim; i++)
    {
        for (j = 0; j < dim; j++)
        {
            x1 = g_buffer[i][j];
            x2 = g_buffer[i-1][j];

            /* this condition should never be true */
            if ((x1 == 0) || (x2 == 0))
            {
                flag = 1;
            }
            else
            {
                x3 = x1 * x1;
                x4 = x1 * x3;
                x5 = x2 * x2;

                /* this condition should never be true */
                if ((x4 * 0.1) > x5)
                {
                    flag = 1;
                }
                else
                {
                    x6 = x2 * x5;

                    /* this condition should never be true */
                    if (x3 < (x6 * 0.1))
                    {
                        flag = 1;
                    }
                }
            }
        }
    }

    /* The result should always be 0, but with optimizations enabled we get the result 1. */
    printf("Result: %d\n", flag);
    return 0;
}


Hi Roy,

>>...This was found on an Intel Core i7-870 CPU at 2.93 GHz. It was also reproduced on an Intel Core i7-2700 at 3.40 GHz.

That looks odd. I'll verify it on a computer with an Intel Core i7-3840QM and 64-bit Windows 7. Thanks for the test case.

Best regards,
Sergey

PS: So far my main concern is that you use 32-bit Windows 7 and I use 64-bit Windows 7.

I can't reproduce the problem with either the 64-bit 12.1.7.371 or the 13.0.1.119 compiler. I don't have the 32-bit one installed.
The only vectorization I see is in init_buffer(). There's not much room for variation there.
There is no /arch:SSE distinct from SSE2 in recent compilers. Is it treated as /arch:IA32?

The inner loop in the main() function is getting vectorized. The compiler's optimization report explicitly states that (let me know if you'd like me to attach the report). And it's in that loop that the problem occurs.

Hi,
I was able to reproduce this error with all the 12.1 icc compilers but was unable to reproduce it with the 13.0 compiler, so this issue has been fixed in 13.0. You can download that compiler from registrationcenter.intel.com.

Regards,
Sukruth h V

Hi,

I compiled your sample with Parallel Studio XE 2013, compiler 13.0.1 20121010, under openSUSE 12.2 Linux.
No problems: I used -O1, -O2, and -O3, and the result was always '0'.

Greetings,

Franz

>>...I was able to reproduce this error with all the 12.1 icc compilers but was unable to reproduce it with the 13.0 compiler...

That is good news, and thanks for the update.

Good news indeed. My company doesn't yet offer the 13.0 compiler, but I'll ask for it.

For others who may be using the 12.1 compiler and experiencing this issue, a simple workaround you can use is to add a "#pragma novector" right before the problematic loop.

Thank you all for your help!

Roy
