*****************************
Bug Description
*****************************
The program below should always print "0". That is indeed the case when compiling without optimizations or only with size optimizations (/O1). However, with speed optimizations enabled (/O2 or /O3) the program prints "1". The compiler attempts to auto-vectorize the inner loop using SSE2 instructions, but the generated code incorrectly sets the variable "flag" to 1.
This does not appear to be related to precision since none of the conditions that can set the variable "flag" to a nonzero value are even remotely close to being true.
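To make the size of the margin concrete, here is a minimal sketch that evaluates the same three checks for a single pair of values. The helper name `which_condition` is hypothetical and not part of the attached program; the arithmetic mirrors the sample, including the `0.1` double constant. With x1 = x2 = 0.5 we get x3 = 0.25, x4 = 0.125, x5 = 0.25 and x6 = 0.125, so the two comparisons are 0.0125 > 0.25 and 0.25 < 0.0125, both false by a factor of 20.

```c
/* Hypothetical helper (not in the original program): returns the number
 * of the first check that would set "flag", or 0 if none of them fires. */
int which_condition(float x1, float x2)
{
    float x3, x4, x5, x6;

    if ((x1 == 0) || (x2 == 0))
        return 1;                 /* first check: an input is zero */

    x3 = x1 * x1;
    x4 = x1 * x3;
    x5 = x2 * x2;
    if ((x4 * 0.1) > x5)
        return 2;                 /* second check: x1^3 / 10 > x2^2 */

    x6 = x2 * x5;
    if (x3 < (x6 * 0.1))
        return 3;                 /* third check: x1^2 < x2^3 / 10 */

    return 0;                     /* no check fires */
}
```

For the value used in the sample, `which_condition(0.5f, 0.5f)` returns 0, so a rounding difference of a few ulps could not flip any of the checks.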
Additional observations:
- The bug does not appear when the architecture is set to SSE or older; it appears only with SSE2 and up.
- The bug does not appear if the variable "flag" is an int instead of a short. This may be related to the fact that with an int the compiler does not generate the "packssdw" instruction.
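For readers unfamiliar with "packssdw": it narrows signed 32-bit lanes to signed 16-bit lanes with signed saturation, which is presumably why it only shows up when "flag" is a short. The sketch below is a scalar model of one lane, written only to illustrate the instruction's documented behavior. The function name `pack_ssdw_lane` is hypothetical, and this is not a reconstruction of the code the compiler actually generates.

```c
#include <stdint.h>

/* Scalar model of a single PACKSSDW lane (illustration only):
 * signed 32-bit input, signed 16-bit output, saturating at the
 * limits of int16_t instead of wrapping around. */
int16_t pack_ssdw_lane(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;   /* clamp large positive values to 32767 */
    if (v < INT16_MIN)
        return INT16_MIN;   /* clamp large negative values to -32768 */
    return (int16_t)v;      /* value fits, keep it unchanged */
}
```

Note that an all-ones 32-bit compare mask (-1) packs to an all-ones 16-bit mask (-1) and 0 packs to 0, so the instruction by itself preserves comparison masks; the fault would have to be in how the surrounding code builds or combines them.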
*****************************
Configuration
*****************************
Compiler Version:
"Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1.3.300 Build 20120130"
Operating System:
Windows 7 32-bit
CPU:
This was found on an Intel Core i7-870 CPU at 2.93 GHz. It was also reproduced on an Intel Core i7-2700 at 3.40 GHz.
How To Reproduce:
To produce the bug:
>> icl /arch:SSE2 /O2 auto_vectorizer_bug.c
>> auto_vectorizer_bug.exe
Result: 1
To get the correct result:
>> icl /Od auto_vectorizer_bug.c
>> auto_vectorizer_bug.exe
Result: 0
*****************************
Sample Program (also attached)
*****************************
// auto_vectorizer_bug.c : Demonstrates what appears to be a bug in the
// Intel compiler's auto-vectorization for SSE2.
// When compiled with /Od, the program prints "0", which is the correct result.
// When compiled with /O2 or /O3, the program prints "1".
#include <stdio.h>
#include <assert.h>

#define DIM 8

float g_buffer[DIM][DIM];

void init_buffer(float p_buf[DIM][DIM])
{
    int i, j;

    /* initialize all elements to 0.5 */
    for (i = 0; i < DIM; i++)
    {
        for (j = 0; j < DIM; j++)
        {
            p_buf[i][j] = 0.5;
        }
    }
}

int main(int argc, char* argv[])
{
    int i, j;
    short flag;
    float x1, x2, x3, x4, x5, x6;
    int dim;

    flag = 0;

    /* initialize all array entries to 0.5 */
    init_buffer(g_buffer);

    /* make it appear as if the array dimensions are not known
     * at compile time */
    assert(argc == 1);
    dim = argc * DIM;

    for (i = 1; i < dim; i++)
    {
        for (j = 0; j < dim; j++)
        {
            x1 = g_buffer[i][j];
            x2 = g_buffer[i-1][j];

            /* this condition should never be true */
            if ((x1 == 0) || (x2 == 0))
            {
                flag = 1;
            }
            else
            {
                x3 = x1 * x1;
                x4 = x1 * x3;
                x5 = x2 * x2;

                /* this condition should never be true */
                if ((x4 * 0.1) > x5)
                {
                    flag = 1;
                }
                else
                {
                    x6 = x2 * x5;

                    /* this condition should never be true */
                    if (x3 < (x6 * 0.1))
                    {
                        flag = 1;
                    }
                }
            }
        }
    }

    /* The result should always be 0, but with optimizations enabled we get the result 1. */
    printf("Result: %d\n", flag);
    return 0;
}
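One way to check whether the vectorizer is responsible, without lowering the optimization level, might be to disable vectorization for the inner loop only. The sketch below assumes the Intel-specific `#pragma novector` is honored by this compiler version; I have not verified that it changes the result. The function name `scan_buffer_novector` is hypothetical, and the pragma is guarded so that other compilers simply ignore it.

```c
#define DIM 8

/* Hypothetical helper: performs the same three checks as main() in the
 * sample, with auto-vectorization of the inner loop disabled when built
 * with the Intel compiler (assumption: #pragma novector is supported). */
short scan_buffer_novector(float buf[DIM][DIM], int dim)
{
    short flag = 0;
    int i, j;

    for (i = 1; i < dim; i++)
    {
#if defined(__INTEL_COMPILER)
#pragma novector
#endif
        for (j = 0; j < dim; j++)
        {
            float x1 = buf[i][j];
            float x2 = buf[i - 1][j];
            float x3 = x1 * x1;
            float x4 = x1 * x3;
            float x5 = x2 * x2;
            float x6 = x2 * x5;

            /* same conditions as the nested if/else chain in the sample */
            if ((x1 == 0) || (x2 == 0)
                || ((x4 * 0.1) > x5)
                || (x3 < (x6 * 0.1)))
            {
                flag = 1;
            }
        }
    }
    return flag;
}
```

Calling `scan_buffer_novector(g_buffer, dim)` after `init_buffer(g_buffer)` should return 0. If it does so at /O2 while the original loop still yields 1, that would point more firmly at the vectorized code path.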



