Matrix Transpose for char/short array

We can find the macro _MM_TRANSPOSE4_PS for transposing floats. But I am interested in transposing an array of characters. I was able to rewrite _MM_TRANSPOSE4_PS myself using unpack and move intrinsics, but I can't find similar intrinsics for chars.

Can anyone please suggest what approach should be taken in this situation?

Thanks
HG


Depending on the array size, PSHUFB can be a solution, or at least helpful in doing it.
Please give an example.

Show us your declaration for the character array.
This will give us an idea of the size of the array and the number of dimensions.

Jim Dempsey

www.quickthreadprogramming.com

The '_MM_TRANSPOSE4_PS' macro declared in 'xmmintrin.h' is designed to take a 4x4 matrix of floats
( single precision ) as input.

In case of a similar approach for the 'char' type the biggest dimension will be 16x16, and for the 'short' type it will be 8x8.

I'd like to repeat the same question:

How big are your 'char' / 'short' matrices?

Best regards,
Sergey

Quoting gautam.himanshu: We can find the macro _MM_TRANSPOSE4_PS for transpose of floats...

Does it make sense to use an SSE-based transpose for a 4x4 matrix instead of a Classic algorithm?

Please take a look at the results of a test:

DEBUG configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 19657 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 8640 ticks // 2.28x faster
> Test1028 End <

RELEASE configuration

> Test1028 Start <
Sub-Test 5 - 200,000,000 calls to [ CLASSIC 4x4 Matrix Transpose ] - 18563 ticks
Sub-Test 6 - 200,000,000 calls to [ SSE 4x4 Matrix Transpose ] - 5843 ticks // 3.18x faster
> Test1028 End <

I am working on some benchmarks and generally taking sizes like 1K x 1K. Shuffling the XMM registers seems the only possible way, which I don't think will give good gains.

Quoting gautam.himanshu: I am working on some benchmarks and generally taking sizes like 1K x 1K. Shuffling the XMM registers seems the only possible way, which I don't think will give good gains.

I could provide you with performance numbers for two Matrix Transpose algorithms, applied to a
1K x 1K matrix, that I've implemented for my current project. That is,

- a Classic ( Two-For-Loops / Non-Inplace )

and

- a Diagonal-Based ( Two-For-Loops / Inplace )

The Diagonal-Based algorithm doesn't need a second output matrix and has a reduced number of
exchanges. It never "touches" values along the diagonal from the top-left corner to the bottom-right corner of the matrix.

Please take a look at performance results.

Matrix size: 1,024 x 1,024

Classic Transpose - ( 128 transposes in 10.015 sec ) = 0.0782421875 sec
Diagonal Transpose - (128 transposes in 5.609 sec) = 0.0438203125 sec => ~1.79x faster

Please take a look at results of another test.

If four __m128 variables:

...
__m128 row1 = { 0x0 };
__m128 row2 = { 0x0 };
__m128 row3 = { 0x0 };
__m128 row4 = { 0x0 };
...

initialized with characters as follows:

...
row1.m128_u8[ 0] = '0'; row1.m128_u8[ 1] = '1'; row1.m128_u8[ 2] = '2'; row1.m128_u8[ 3] = '3';
row1.m128_u8[ 4] = '4'; row1.m128_u8[ 5] = '5'; row1.m128_u8[ 6] = '6'; row1.m128_u8[ 7] = '7';
row1.m128_u8[ 8] = '8'; row1.m128_u8[ 9] = '9'; row1.m128_u8[10] = 'A'; row1.m128_u8[11] = 'B';
row1.m128_u8[12] = 'C'; row1.m128_u8[13] = 'D'; row1.m128_u8[14] = 'E'; row1.m128_u8[15] = 'F';
...
< the same for rows row2, row3 and row4 >
...

a Source Matrix ( as characters ) will look like:

0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF
0123456789ABCDEF

and after a call to:

...
_MM_TRANSPOSE4_PS( row1, row2, row3, row4 );
...

a Transposed Matrix will look like:

0123012301230123
4567456745674567
89AB89AB89AB89AB
CDEFCDEFCDEFCDEF

This result is wrong, and there is nothing unusual here. The _MM_TRANSPOSE4_PS macro cannot be used for
transposing a 4x16 matrix of characters because it was designed to transpose a 4x4 matrix of floats: it moves 4-byte blocks, not individual bytes.

Quoting gautam.himanshu: I am working on some benchmarks and generally taking sizes like 1K x 1K. Shuffling the XMM registers seems the only possible way, which I don't think will give good gains.

It would be interesting to see the results of your R&D. Please provide some technical details and performance
numbers if you can.

Did you consider the Eklundh method of Matrix Transpose?

Best regards,
Sergey

Thanks for your interest. I finally managed to do a good transpose using unpack epi8/16/32/64 instructions. It's hard to give any numbers as the transpose was a part of the actual problem. Anyway, I am interested in the Eklundh method. What kind of numbers are reachable there?

Quoting gautam.himanshu: Thanks for your interest. I finally managed to do a good transpose using unpack epi8/16/32/64 instructions. It's hard to give any numbers as the transpose was a part of the actual problem...

It would be nice to see a performance comparison of your SSE-based algorithm with a Classic algorithm.

The Eklundh method for a matrix transpose makes more iterations and more exchanges compared to a
Diagonal-based algorithm.

Quoting gautam.himanshu: ...Anyway, I am interested in the Eklundh method. What kind of numbers are reachable there?...

Here is a comparison of the number of exchanges for different algorithms, in case of an 8x8 matrix:

Classic  - 64 exchanges
Diagonal - 28 exchanges
Eklundh  - 48 exchanges

Take into account that for the Diagonal and Eklundh algorithms the input matrix must be square, and both
algorithms are inplace ( they don't need an output matrix ).

pshufb will only work for transposing 4x4 byte matrices. I've attached assembly code for x64 that will transpose 16x16 byte data very rapidly. It is based on using punpcklbw and punpckhbw to interleave data between 16x1 columns stored in xmm registers.

To link it to your C++ project you will need to assemble the file ( I use the JWasm assembler ), link the obj file to your project, then insert these lines to declare it as a function:

#ifdef __cplusplus
extern "C" {
#endif
void Transpose16x16A(char* a, char* b);
#ifdef __cplusplus
}
#endif

The basic idea is to interleave data between columns. I will use a 4x4 matrix as an example ( each register holds one column ):

xmm0 xmm1 xmm2 xmm3
 a0   b0   c0   d0
 a1   b1   c1   d1
 a2   b2   c2   d2
 a3   b3   c3   d3

The MERGE macro in my code is defined like this:

MERGE MACRO FIRST, SECOND, TEMP
    movdqa    TEMP, FIRST
    punpcklbw FIRST, SECOND
    punpckhbw TEMP, SECOND
    movdqa    SECOND, TEMP
ENDM

It interleaves data from 2 columns of the matrix. In our example, if we apply

MERGE xmm0, xmm2, xmm4
MERGE xmm1, xmm3, xmm4

we get:

xmm0 xmm1 xmm2 xmm3
 a0   b0   a2   b2
 c0   d0   c2   d2
 a1   b1   a3   b3
 c1   d1   c3   d3

The data has been moved across columns and we are just one step away from the transpose. Applying

MERGE xmm0, xmm1, xmm4
MERGE xmm2, xmm3, xmm4

we get:

xmm0 xmm1 xmm2 xmm3
 a0   a1   a2   a3
 b0   b1   b2   b3
 c0   c1   c2   c3
 d0   d1   d2   d3

The 16x16 transpose operates in the same way but requires more MERGE operations.

One thing to note is that my code DOES NOT PRESERVE ANY XMM REGISTERS. The reason for this is that it is slow to save the registers and my code didn't need to. If you need to preserve registers, my suggestion would be to make a function that saves the registers, runs the 16x16 transpose in an assembly loop to produce a batch of 16x16 transposes, and then restores the registers.

When I timed this algorithm ( Intel Core i2 processor ) on byte data, I achieved speeds of, on average, 1/4 clock cycle per byte. That means the full 16x16 transpose should take roughly 64 clock cycles. I found this to be about 6x faster than the regular scalar algorithm ( doubly nested loop with loop unrolling ).

Attachments:

Transpose16x16A.asm ( 4.48 KB )

That looks interesting and I'll try to test it ( Visual Studio 2005 on a 32-bit Windows platform ).

A couple of days ago I tested the '_MM_TRANSPOSE4_PS' macro vs. 'No-For-Loops' code ( just exchanges )
for a 4x4 matrix of floats, and the latter outperformed the macro by a couple of times. I'll post results for comparison later.
