SSE/SSE2 INTRINSICS CODES

Given a 4x4 matrix X:
X[4][4] = X00 X01 X02 X03
          X10 X11 X12 X13
          X20 X21 X22 X23
          X30 X31 X32 X33

If the rows are denoted as follows:
X00 X01 X02 X03 is denoted by X0
X10 X11 X12 X13 is denoted by X1
X20 X21 X22 X23 is denoted by X2
X30 X31 X32 X33 is denoted by X3
and:

STAGE 1        STAGE 2
A0 = X0 + X3   Y0 = A0 + A1
A1 = X1 + X2   Y1 = A2 + (A3 << 1)
A2 = X1 - X2   Y2 = A0 - A1
A3 = X0 - X3   Y3 = A3 - (A2 << 1)

where << denotes shift left.

How can I write the code for the above scenario using SSE/SSE2 intrinsics?


According to your other posts, you generally seem to know how to use intrinsics. I am therefore unsure which specific aspect you are struggling with.

You can implement the computation as follows (you seem to be working on 32-bit integers):

__int128i A0 = _mm_add_epi32(X0, X3);
__int128i A1 = _mm_add_epi32(X1, X2);
__int128i A2 = _mm_sub_epi32(X1, X2);
__int128i A3 = _mm_sub_epi32(X0, X3);

__int128i Y0 = _mm_add_epi32(A0, A1);
__int128i Y1 = _mm_add_epi32(A2, _mm_slli_epi32(A3, 1));
__int128i Y2 = _mm_sub_epi32(A0, A1);
__int128i Y3 = _mm_sub_epi32(A3, _mm_slli_epi32(A2, 1));

In case your results do not match your expectations, you can print the registers as discussed in this thread, or use a debugger that can display SSE registers.

For gaining an overview on what intrinsics are available, I strongly recommend the interactive "Intel Intrinsics Guide" which is available on this page.

Is the __int128i A0 OK, or is it supposed to be __m128i A0? See the error I get:

error C2065: '__int128i' : undeclared identifier

I think it should be __m128i.
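
With the type written as __m128i, the two stages can be put together into one function. This is only a minimal sketch, assuming the four rows hold 32-bit integers and are stored one after another in memory. The function name transform_rows and the use of unaligned loads and stores are my own assumptions, not something from the posts above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: x and y each point to 4 rows of 4 int32_t
   (16 values, row after row). No alignment is assumed. */
static void transform_rows(const int32_t *x, int32_t *y)
{
    /* Load one row per register. */
    __m128i X0 = _mm_loadu_si128((const __m128i *)(x + 0));
    __m128i X1 = _mm_loadu_si128((const __m128i *)(x + 4));
    __m128i X2 = _mm_loadu_si128((const __m128i *)(x + 8));
    __m128i X3 = _mm_loadu_si128((const __m128i *)(x + 12));

    /* Stage 1 */
    __m128i A0 = _mm_add_epi32(X0, X3);
    __m128i A1 = _mm_add_epi32(X1, X2);
    __m128i A2 = _mm_sub_epi32(X1, X2);
    __m128i A3 = _mm_sub_epi32(X0, X3);

    /* Stage 2: the shift is applied before the add/subtract. */
    __m128i Y0 = _mm_add_epi32(A0, A1);
    __m128i Y1 = _mm_add_epi32(A2, _mm_slli_epi32(A3, 1));
    __m128i Y2 = _mm_sub_epi32(A0, A1);
    __m128i Y3 = _mm_sub_epi32(A3, _mm_slli_epi32(A2, 1));

    /* Store the four result rows. */
    _mm_storeu_si128((__m128i *)(y + 0),  Y0);
    _mm_storeu_si128((__m128i *)(y + 4),  Y1);
    _mm_storeu_si128((__m128i *)(y + 8),  Y2);
    _mm_storeu_si128((__m128i *)(y + 12), Y3);
}
```

Each of the four 32-bit lanes is processed independently, so the function applies the two stages to all four columns at once.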

Thanks a lot. Just one more question: how do I write code using SSE2 intrinsics to show that

the matrix elements x00 x01 x02 x03 are denoted by x0?

Your question is not clear. Are you asking how x0 (an __m128i variable) is loaded with the 4 consecutive elements x00 x01 x02 x03?
It also depends on the data types of x00, x01, x02 and x03.
If they are 32-bit integers, you can use a simple SSE2 load instruction to load the data:
_mm_load_si128(const __m128i *data), which requires a 16-byte-aligned address, or _mm_loadu_si128(), which does not.

E.g. if they are char (8 bits each), you can also use the SSE4.1 instructions (PMOVZX), if the data is packed:
__m128i x0 = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(*Input)); where Input is a pointer to a 32-bit integer containing the 4 elements.

Similarly, if each element is a short, you need to use _mm_cvtepu16_epi32(), and _mm_cvtepu32_epi64() to widen 32-bit ints to 64 bits.
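
If SSE4.1 is not available, the same zero-extension of four packed 8-bit values can be done with SSE2 alone by interleaving with zeros. This is a different technique from PMOVZX, shown here only as a sketch; the helper name load4_u8_as_epi32 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads exactly 4 unsigned bytes from p and returns
   them zero-extended to four 32-bit lanes. */
static __m128i load4_u8_as_epi32(const uint8_t *p)
{
    int32_t packed;
    memcpy(&packed, p, sizeof packed);      /* read 4 bytes, no alignment needed */

    __m128i v    = _mm_cvtsi32_si128(packed); /* 4 bytes in the low 32 bits */
    __m128i zero = _mm_setzero_si128();

    v = _mm_unpacklo_epi8(v, zero);         /* 8-bit  -> 16-bit, zero-extended */
    return _mm_unpacklo_epi16(v, zero);     /* 16-bit -> 32-bit, zero-extended */
}
```

The result can then be used like any other row register, for example as an operand to _mm_add_epi32().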

from

1 2 3 4 row x0
5 6 7 8
3 4 6 7
7 6 2 1 row x3
how do I load and add the elements of row x0 and row x3 using SSE2 intrinsics? Any example code?

The answer to your question depends on how the values are stored in memory.

Instead of trying to figure everything out directly, you might want to work through some tutorial and ready-made examples first. For example, there is this tutorial or this article.
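
As one possible starting point, here is a minimal sketch for the case where the matrix is stored row by row as 32-bit integers. That storage layout is an assumption, and the function name add_rows is hypothetical. Unaligned loads are used so the rows need not be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: adds two rows of four 32-bit integers element-wise
   and writes the four sums to sum. */
static void add_rows(const int32_t *row_a, const int32_t *row_b, int32_t *sum)
{
    __m128i a = _mm_loadu_si128((const __m128i *)row_a); /* e.g. row x0 */
    __m128i b = _mm_loadu_si128((const __m128i *)row_b); /* e.g. row x3 */
    __m128i s = _mm_add_epi32(a, b);                     /* four adds at once */
    _mm_storeu_si128((__m128i *)sum, s);
}
```

With the matrix declared as int32_t m[4][4], calling add_rows(m[0], m[3], sum) adds row x0 and row x3.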
