SSE4 ?

There was a post on a German site about SSE4 mnemonics in the VS beta:
http://translate.google.com/translate?hl=en&sl=de&u=http://www.heise.de/...

Can anyone from Intel comment on whether this is true and upcoming, and when the specs will be published?

Also, is there a research group at Intel that investigates which SIMD instructions are missing or could be beneficial for applications, and to which one could submit suggestions?

at.

Dear Alex,
Many engineers and architects at Intel read this forum (or at least know somebody that does :-), so if you would like to share suggestions for new SIMD extensions, you can simply post them here.
Aart

I've been working with different SIMD instruction sets (Intel, AltiVec, Equator, TI) for video encoding applications. Below are just a few intrinsics that are clearly missing and could benefit many video encoding/decoding applications as well as other signal processing tasks.

--Absolute value intrinsic. It seems that SSE4 will have it. Right?

--One critical missing SIMD instruction is 16-bit multiplication with rounding, probably with saturating and cut-off (truncating) variants, like:
(A*B+32767)>>16

In the general case, an instruction that could be applied to a wider variety of applications:
(A*B+C)>>R

where A, B, and C are 16-bit shorts and the 32-bit intermediate result is either cut off (truncated) or saturated to 16 bits. This would make a 16-bit multiply-add with rounding possible.
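
To make the requested semantics concrete, here is a minimal scalar sketch in C of (A*B+C)>>R with a 32-bit intermediate and both variants. The function names are hypothetical (they are not existing intrinsics), and the code assumes R is below 32 and that right-shifting a negative value is an arithmetic shift, which is implementation-defined in C but holds on mainstream compilers.

```c
#include <stdint.h>

/* Clamp a 32-bit value to the int16_t range. */
static inline int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical reference for the saturating variant: (a*b + c) >> r.
 * The product of two 16-bit values plus a 16-bit addend always fits
 * in 32 bits, so the intermediate cannot overflow. Assumes r < 32. */
static inline int16_t mul_add_shift_sat(int16_t a, int16_t b, int16_t c,
                                        unsigned r)
{
    int32_t acc = (int32_t)a * b + c;
    return sat16(acc >> r);   /* arithmetic shift assumed for acc < 0 */
}

/* Hypothetical reference for the cut-off variant: keep the low 16 bits.
 * Converting an out-of-range value to int16_t is implementation-defined;
 * mainstream compilers wrap modulo 2^16. */
static inline int16_t mul_add_shift_trunc(int16_t a, int16_t b, int16_t c,
                                          unsigned r)
{
    int32_t acc = (int32_t)a * b + c;
    return (int16_t)(uint16_t)(acc >> r);
}
```

With c = 32767 and r = 16 this reproduces the first formula above; a SIMD version would apply the same operation to each 16-bit lane.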

--Sign application, useful for quantization in many algorithms:
SignApply (A,B) => B<0 ? (-A): A

Example:
A 10 2 0 7
B 1 -10 10 0
S 10 -2 0 7
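
As a scalar reference for the definition and example above, a minimal sketch in C might look like this. The names sign_apply and sign_apply_n are hypothetical. Note that with this definition a zero in B leaves A unchanged, as in the last column of the example.

```c
#include <stddef.h>
#include <stdint.h>

/* SignApply(A, B): negate a when b is negative, otherwise return a.
 * For a == INT16_MIN the negated value does not fit in 16 bits and the
 * conversion back to int16_t is implementation-defined. */
static inline int16_t sign_apply(int16_t a, int16_t b)
{
    return (int16_t)(b < 0 ? -a : a);
}

/* Element-wise version over n values, one "lane" per array element. */
static inline void sign_apply_n(const int16_t *a, const int16_t *b,
                                int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = sign_apply(a[i], b[i]);
}
```

Applied to A = {10, 2, 0, 7} and B = {1, -10, 10, 0}, this gives {10, -2, 0, 7}. For comparison, the SSSE3 instruction PSIGNW differs in one case: it writes zero to a lane whose sign operand is zero.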

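Sign application is quite critical. One example is a shift with sign, i.e. the division
A/2^n
If A is negative, an arithmetic shift rounds toward negative infinity, so the result can be -1 instead of zero:
-1>>4=-1
With sign application, the shift is applied to the magnitude and the sign is restored afterwards:
SignApply(ABS(A)>>N,A)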
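
The difference between the two shifts can be checked with a small scalar sketch in C. The function name shift_toward_zero is hypothetical, and the code assumes n is below 31 and a is not INT32_MIN (whose magnitude does not fit in 32 bits).

```c
#include <stdint.h>

/* SignApply(ABS(a) >> n, a): shift the magnitude, then restore the sign.
 * The result is rounded toward zero, matching a / 2^n in C. */
static inline int32_t shift_toward_zero(int32_t a, unsigned n)
{
    int32_t mag = (a < 0 ? -a : a) >> n;  /* ABS(a) >> n, never negative */
    return a < 0 ? -mag : mag;            /* re-apply the sign of a */
}
```

For example, shift_toward_zero(-1, 4) returns 0 and shift_toward_zero(-33, 4) returns -2, whereas a plain arithmetic shift of the same inputs would typically give -1 and -3.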

As you probably know, many new compression algorithms use arithmetic coding, such as H.264 and JPEG 2000, as well as proprietary ones. These are mainly sequential algorithms.
One thing that could benefit the execution of such algorithms is conditional instructions, like those found in the TI architecture. They would avoid pipeline flushes:
e.g.
if (RAX) RBX++ as a single instruction.
An instruction set for efficient arithmetic coding requires further research, since arithmetic coding can be one of the main bottlenecks in modern codecs. Maybe some sort of reciprocal division would be useful.
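
For illustration, the conditional increment above can be written without a branch in C, as a rough sketch of what such an instruction would compute. The function name inc_if_nonzero is hypothetical; whether a compiler actually emits branch-free code for it depends on the compiler and target.

```c
#include <stdint.h>

/* "if (cond) counter++" without a branch: the comparison yields 0 or 1,
 * which is simply added to the counter. */
static inline uint64_t inc_if_nonzero(uint64_t counter, uint64_t cond)
{
    return counter + (uint64_t)(cond != 0);
}
```

In an arithmetic coder this pattern would replace a data-dependent branch in the inner loop, which is where a mispredicted branch is most costly.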

alex@streambox.com
