Vectorization - unsigned / signed condition

I am using the Intel C++ Compiler v7.1 on the Windows platform. Why is it that this loop will vectorize (the code doesn't do anything; it's just an example)...

short Array1 [100];
short Array2 [100];
short Val = 34;
for (int i = 0;i < 100; ++i)
{
if (Array1[i] > Val)
{
Array2[i] = 3;
}
}
But this will not? Using the -Qvec_report3 parameter, the compiler says the if condition is too complex. But the only difference here is unsigned versus signed shorts...

unsigned short Array1 [100];
unsigned short Array2 [100];
unsigned short Val = 34;
for (int i = 0;i < 100; ++i)
{
if (Array1[i] > Val)
{
Array2[i] = 3;
}
}

Hi,

I tried this with an 8.1 compiler and neither loop vectorized. My code was a little different from yours, as I had to make assumptions about how your code is situated in a function. In any event, the report stated that my loop did not vectorize because the "condition may protect exception". I'm not sure what this means. I'll see if I can get a vectorizer engineer to respond to this post.

Best regards,

Max

"condition may protect exception" usually prevents Intel compiler vectorization in the presence of a conditional. You may override this and promote vectorization by using
#pragma vector always
on the line preceding the for(), or, if you wish to assert in addition that the operands are aligned,
#pragma vector aligned

In an example such as you quote, it looks impossible that an exception could be raised by speculative execution, but the compiler appears not to be able to make the distinction.
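As a minimal sketch of where the pragma goes, the loop from the question can be wrapped in a function like the one below. The function name `clip_mark` and its parameters are made up for illustration, and the `__INTEL_COMPILER` guard is an assumption added here so that other compilers do not warn about an unknown pragma.

```cpp
#include <cstddef>

// Stores 3 into dst[i] wherever src[i] exceeds val.
// clip_mark and its parameter names are hypothetical, for illustration only.
void clip_mark(const short* src, short* dst, std::size_t n, short val) {
#if defined(__INTEL_COMPILER)
    // Intel-specific hint placed on the line preceding the for(),
    // asking the compiler to vectorize the loop that follows.
#pragma vector always
#endif
    for (std::size_t i = 0; i < n; ++i) {
        if (src[i] > val) {
            dst[i] = 3;
        }
    }
}
```

Whether the loop is actually vectorized still has to be confirmed from the vectorization report of the compiler version in use.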

I'm not familiar enough with the parallel 16-bit operations to be certain, but I assume "condition is too complex" would refer to lack of a 16-bit parallel unsigned compare instruction.

Message Edited by tim18 on 08-18-2004 07:45 PM

Dear user,

The reason the first loop is vectorized with 8-way SIMD parallelism on packed words but the second is not is the lack of support for a packed unsigned > comparison (viz. the instruction pcmpgtw supports > comparisons on packed signed words, but not on packed unsigned words). In contrast, a == or != comparison will vectorize in both cases, because the instruction pcmpeqw (possibly negated) works for both packed signed and unsigned words (provided that both operands are consistently either sign extended or zero extended, as in your example, so that the comparison can be done in the narrower 16-bit precision to allow for maximum parallelism).
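To see why a signed-only compare cannot simply be reused for unsigned words, here is a small scalar sketch. The helper names are invented for illustration; each function models on a single 16-bit value what a packed instruction would do per lane. The casts rely on two's-complement wraparound, which is guaranteed since C++20 and is what mainstream compilers do in earlier standards.

```cpp
#include <cstdint>

// Models a signed-only "greater than": the 16-bit patterns are
// reinterpreted as signed values before comparing.
bool signed_gt_on_bits(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::int16_t>(a) > static_cast<std::int16_t>(b);
}

// The comparison the unsigned source loop actually asks for.
bool unsigned_gt(std::uint16_t a, std::uint16_t b) {
    return a > b;
}

// Equality only looks at the bit patterns, so the signed and unsigned
// interpretations always agree.
bool eq_on_bits(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::int16_t>(a) == static_cast<std::int16_t>(b);
}
```

For example, with a = 40000 and b = 34 the unsigned comparison is true, but the bit pattern of 40000 reads as a negative number when treated as signed, so the signed comparison is false. Equality gives the same answer either way.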

The "condition may protect exception" message appears for Max because the 8.x compilers are more careful than the 7.x compilers about vectorizing a condition with bit-masking (where conditionally executed code is moved into the always-taken path). Suppose the trip count were not known for this loop and Array1 were much longer than Array2. Then the condition Array1[i] > Val could be used in the prefix to indicate that Array2 may still be accessed, while all remaining elements of Array1 are set to a value <= Val to signal that Array2 would be out of bounds. This may seem contrived for this example, but, in general, compilers must make conservative assumptions. The 8.x compilers use only some simple symbolic manipulations (and not yet the actual range of subscripts) to prove that conditionally executed code can be moved into the always-taken path without raising exceptions, which explains the rather conservative rejection of the example where the trip count and array sizes are statically known. As Tim said, any vectorization-enabling pragma will tell the compiler that it is okay to skip the analysis of whether conditions guard exceptions. I will work on further improving the analysis in the 9.x products.
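The bit-masking transformation can be illustrated in scalar form. This is only a sketch of the idea, not the compiler's actual output; the function names are hypothetical. Note that the masked version loads and stores dst[i] on every iteration, which is exactly why the compiler must be sure those accesses cannot fault.

```cpp
#include <cstddef>
#include <cstdint>

// Original form: dst[i] is touched only when the condition holds.
void mark_branchy(const std::int16_t* src, std::int16_t* dst,
                  std::size_t n, std::int16_t val) {
    for (std::size_t i = 0; i < n; ++i) {
        if (src[i] > val) {
            dst[i] = 3;
        }
    }
}

// Bit-masked form: the store is moved into the always-taken path.
void mark_masked(const std::int16_t* src, std::int16_t* dst,
                 std::size_t n, std::int16_t val) {
    for (std::size_t i = 0; i < n; ++i) {
        // All ones when the condition holds, all zeros otherwise.
        const std::uint16_t mask = (src[i] > val) ? 0xFFFFu : 0x0000u;
        // Unconditional load of the old value.
        const std::uint16_t old_bits = static_cast<std::uint16_t>(dst[i]);
        // Select the new value where mask is set, the old value elsewhere.
        const std::uint16_t new_bits =
            static_cast<std::uint16_t>((3u & mask) | (old_bits & ~mask));
        // Unconditional store.
        dst[i] = static_cast<std::int16_t>(new_bits);
    }
}
```

Both functions produce the same array contents when every dst[i] is a valid location; they differ only in which memory accesses are performed.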

For such and many other details on vectorization in the Intel compiler, please refer to the recently published book:

Aart J.C. Bik. The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance. Intel Press, June, 2004, http://www.intel.com/intelpress/sum_vmmx.htm

Hope this was useful.

Aart Bik

Thanks a lot for the detailed explanation, it was extremely helpful. I will look into the recommended reading.

As a minor follow-up, I slightly improved the potential out-of-bounds analysis to capture at least some straightforward cases (because obtaining speedup with bit-masking can be tricky, the analysis had never been made very advanced). Given the program:

short Array1 [100];
short Array2 [100];
short Val = 34;
doit () {
int i;
for (i = 0; i < 100; ++i)
if (Array1[i] > Val)
{
Array2[i] = 3;
}
}

With context insensitive analysis, the message used to be:

[C:/cmplr/temp] icl -QxN -Qvec_report2 s.c
s.c(9) : (col. 4) remark: loop was not vectorized: condition may protect exception.

With context sensitive analysis (exploiting the constant trip-count), the analysis now disproves out-of-bounds, and vectorization can proceed:

[C:/cmplr/temp] icl -QxN -Qvec_report2 s.c
s.c(8) : (col. 1) remark: LOOP WAS VECTORIZED.

Quoting - userx03

I am using the Intel C++ Compiler v7.1 on the Windows platform. Why is it that this loop will vectorize (the code doesn't do anything; it's just an example)...

short Array1 [100];
short Array2 [100];
short Val = 34;
for (int i = 0; i < 100; ++i)
{
if (Array1[i] > Val)
{
Array2[i] = 3;
}
}
But this will not? Using the -Qvec_report3 parameter, the compiler says the if condition is too complex. But the only difference here is unsigned versus signed shorts...

unsigned short Array1 [100];
unsigned short Array2 [100];
unsigned short Val = 34;
for (int i = 0; i < 100; ++i)
{
if (Array1[i] > Val)
{
Array2[i] = 3;
}
}

I checked with ICC v11.0 on an x86_64 Linux machine, and the same behavior still persists.

The top one (with signed integers) is vectorized:
--
]$ icpc test1.cc
test1.cc(9): (col. 1) remark: LOOP WAS VECTORIZED.
--

but with unsigned integers it does not vectorize.

The Intel reference document (Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2B, p. 4-91, Order# 253667-025US) only speaks about signed bytes/words/doubleword integers.

~BR

BR,

Although SSE3 does not support unsigned integer math I think you can roll your own code to use SSE3

unsigned short SignBit = 0x8000;
unsigned short Array1[100];
unsigned short Array2[100];
...
if(
( ((short)Array1[i] - (short)Val) & (short)SignBit)
!=
((short)Array1[i] & (short)SignBit)
) Array2[i] = 3;
...

The following may or may not work (too lazy to try), as I do not know how SSE3 treats the xor of a register loaded with short integers with another register of short integers when the operation is issued as XORPD (xor of shorts)

if(
(
(
((short)Array1[i] - (short)Val)
^
((short)Array1[i])
)
&
(
(short)SignBit
)
)
) Array2[i] = 3;

I hope the above reads ok.

Basically you are performing signed math on the unsigned values (subtraction) then testing to see if the sign bit flipped.

*** Val will have to be set such that the unvectored loop uses Array[i] >= Val and not >.

Jim Dempsey

www.quickthreadprogramming.com

I meant to say "XORPD floats"

SSE should have an XORPI (xor packed integers) for registers loaded as integers.

Jim Dempsey

www.quickthreadprogramming.com
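For comparison, a commonly used way to get an unsigned 16-bit "greater than" out of the signed pcmpgtw is to flip the sign bit of both operands first and then compare them as signed values. This is a different technique from the sign-bit test sketched above, shown here only as a hedged illustration. The function name `mark_greater_u16` and the `HAVE_SSE2_SKETCH` macro are invented for this example. The integer XOR is done with `_mm_xor_si128`, the SSE2 intrinsic for the integer PXOR instruction. On targets without SSE2 the sketch falls back to the plain scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2_SKETCH 1  // hypothetical macro used only in this sketch
#endif

// Stores 3 into dst[i] wherever src[i] > val, comparing as unsigned.
void mark_greater_u16(const std::uint16_t* src, std::uint16_t* dst,
                      std::size_t n, std::uint16_t val) {
    std::size_t i = 0;
#if defined(HAVE_SSE2_SKETCH)
    // XOR with 0x8000 maps unsigned order onto signed order.
    const __m128i bias = _mm_set1_epi16(static_cast<short>(-32768));
    const __m128i vval =
        _mm_xor_si128(_mm_set1_epi16(static_cast<short>(val)), bias);
    const __m128i three = _mm_set1_epi16(3);
    for (; i + 8 <= n; i += 8) {
        const __m128i a =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        const __m128i old =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(dst + i));
        // Signed compare on biased values: all ones per lane where src > val.
        const __m128i gt = _mm_cmpgt_epi16(_mm_xor_si128(a, bias), vval);
        // Blend: take 3 where gt is set, keep the old value elsewhere.
        const __m128i res = _mm_or_si128(_mm_and_si128(gt, three),
                                         _mm_andnot_si128(gt, old));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), res);
    }
#endif
    // Scalar remainder (and full fallback when SSE2 is unavailable).
    for (; i < n; ++i) {
        if (src[i] > val) {
            dst[i] = 3;
        }
    }
}
```

The vector path reads and writes dst unconditionally, eight words at a time, so it assumes both arrays hold at least n valid elements.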

I am not quite sure why srimks has felt the urge to raise this dead thread from 2004.

What is being done here is basically an array clipping operation, I wrote an article for the ISN some time ago covering the subject:

http://software.intel.com/en-us/articles/array-clipping/

You might want to take a look at it. Too bad that even ICC 11.1 beta five years of compiler development later still doesn't use that trick to vectorize such code.

Regards,
Igor Levicki

Quoting - Igor Levicki
I am not quite sure why srimks has felt the urge to raise this dead thread from 2004.

What is being done here is basically an array clipping operation, I wrote an article for the ISN some time ago covering the subject:

http://software.intel.com/en-us/articles/array-clipping/

You might want to take a look at it. Too bad that even ICC 11.1 beta five years of compiler development later still doesn't use that trick to vectorize such code.

I raised the issue because Aart Bik indicated that it would be supported or resolved in ICC v9.0, but somehow even ICC v11.0 doesn't have unsigned support to date.

I think forums are meant for Intel users to learn and share; if issues arise, users should report them to Intel developers to investigate and fix if needed, for the betterment of Intel tools.

~BR

Quoting - srimks

I raised the issue because Aart Bik indicated that it would be supported or resolved in ICC v9.0, but somehow even ICC v11.0 doesn't have unsigned support to date.

I think forums are meant for Intel users to learn and share; if issues arise, users should report them to Intel developers to investigate and fix if needed, for the betterment of Intel tools.

~BR

No, Aart Bik said that he (as in personally) will work to better support such cases in 9.x -- not fix it or resolve them simply because such issues aren't easily fixable or resolvable which he did explain in his rather long and high quality post.

That was however in August 2004, and Aart Bik doesn't work for Intel anymore since May 2007 so to me it seems a bit unrealistic to expect something from bringing this subject up.

As for the purpose of the forums -- if I understand it correctly, forums are intended mostly for self-help, although Intel engineers do watch it carefully and sometimes even jump in.

With that said, if you have a specific issue with code which doesn't vectorize when it should, I suggest you file a bug report on Premier Support. Also, if you believe that some code should be vectorizable, feel free to file a feature request on Premier Support instead of bringing up old threads.

Finally, feel free to adapt my sample code; you can also ask me for help if you are having problems understanding it. After all, I was just trying to be helpful.

Regards,
Igor Levicki

Quoting - Igor Levicki
I am not quite sure why srimks has felt the urge to raise this dead thread from 2004.

What is being done here is basically an array clipping operation, I wrote an article for the ISN some time ago covering the subject:

http://software.intel.com/en-us/articles/array-clipping/

You might want to take a look at it. Too bad that even ICC 11.1 beta five years of compiler development later still doesn't use that trick to vectorize such code.

What about using 64-bit arithmetic for integer multiplication that produces 128-bit or larger products, which are commonly used in cryptographic algorithms? Normally, a processor cannot perform large-number multiplication natively; the operands must be broken into chunks that the architecture permits (32-bit or 64-bit additions or multiplications).

If I consider - Multiplication of two unsigned 64-bit numbers.

Can this multiplication not be vectorized by any version of ICC, or optimized to give better performance than 32-bit code using vectorization?

~BR

Quoting - srimks

What about using 64-bit arithmetic for integer multiplication that produces 128-bit or larger products, which are commonly used in cryptographic algorithms? Normally, a processor cannot perform large-number multiplication natively; the operands must be broken into chunks that the architecture permits (32-bit or 64-bit additions or multiplications).

If I consider - Multiplication of two unsigned 64-bit numbers.

Can this multiplication not be vectorized by any version of ICC, or optimized to give better performance than 32-bit code using vectorization?

~BR

The problem with 64-bit multiplication is that there is no SIMD hardware support for it, be it signed or unsigned.

Therefore, the compiler cannot utilize something that doesn't exist. The code I wrote is a workaround for smaller datatypes because for those datatypes instructions do exist (except that they deal only with signed numbers).

So the answer is NO.

If you are looking for optimized code for large number multiplication you should consider specialized libraries.
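The chunked approach described in the question can be sketched in plain scalar code: split each 64-bit operand into 32-bit halves, form the four 32x32 partial products, and add them up with carries. The `U128` struct and `mul_64x64` function are hypothetical names for this illustration; this is not vectorized code and not what any particular library does internally.

```cpp
#include <cstdint>

// Hypothetical 128-bit result type: value = hi * 2^64 + lo.
struct U128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

// Unsigned 64x64 -> 128-bit multiply built from 32x32 -> 64-bit products.
U128 mul_64x64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask32 = 0xFFFFFFFFull;
    const std::uint64_t a_lo = a & mask32, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask32, b_hi = b >> 32;

    const std::uint64_t p0 = a_lo * b_lo;  // weight 2^0
    const std::uint64_t p1 = a_lo * b_hi;  // weight 2^32
    const std::uint64_t p2 = a_hi * b_lo;  // weight 2^32
    const std::uint64_t p3 = a_hi * b_hi;  // weight 2^64

    // Sum of three values below 2^32 cannot overflow 64 bits.
    const std::uint64_t mid = (p0 >> 32) + (p1 & mask32) + (p2 & mask32);

    U128 r;
    r.lo = (mid << 32) | (p0 & mask32);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

Larger products follow the same pattern with more limbs, which is where specialized libraries become the practical choice.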

Regards,
Igor Levicki

Quoting - Igor Levicki

The problem with 64-bit multiplication is that there is no SIMD hardware support for it, be it signed or unsigned.

Therefore, the compiler cannot utilize something that doesn't exist. The code I wrote is a workaround for smaller datatypes because for those datatypes instructions do exist (except that they deal only with signed numbers).

So the answer is NO.

If you are looking for optimized code for large number multiplication you should consider specialized libraries.

This seems to be a hardware limitation of Intel processors. But nowadays, "Grand-challenge HPC Problems" are looking for ways to work around it for multiplication above 64 bits.

Perhaps some other processor supports it.

~BR

Quoting - srimks

This seems to be a hardware limitation of Intel processors. But nowadays, "Grand-challenge HPC Problems" are looking for ways to work around it for multiplication above 64 bits.

Perhaps some other processor supports it.

~BR

Not that I am aware of. The only sort of advice I can give you for big number multiplication is to use FFT.

Regards,
Igor Levicki
