Fast method to exchange values of two double-precision variables

Let's say there are two double-precision variables declared as follows:

...
double dValueA = 55.55L;
double dValueB = 77.77L;
...

What is the fastest method in assembler to exchange the values of these two double-precision variables?

Best regards,
Sergey

Hello Sergey,
Are you doing x87 math or SSE2 math?
Is this showing up as a bottleneck?
Pat

Hi Patrick,

Quoting Patrick Fay (Intel)...
Are you doing x87 math or SSE2 math?

[SergeyK] x87 - Yes (this is because the solution has to be highly portable).
An SSE2 solution could also be considered.

Is this showing up as a bottleneck?

[SergeyK] Yes, and I need to make the exchange as fast as possible.

...

I'd like to provide some technical details. I don't need this to do the math, but I need to use it in several sorting algorithms, like MergeSort, QuickSort, etc.,
in cases when 'double' data types are used.

In a generic form it looks like:

			...
			dTemp = dValueA;
			dValueA = dValueB;
			dValueB = dTemp;
			...
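
For reference, the generic form above can be wrapped in a small function so it can be dropped into a sorting routine. This is only a minimal sketch: the name swap_double is hypothetical and does not come from the original code.

```c
/* Hypothetical helper; the thread itself only shows the three assignments. */
static inline void swap_double(double *a, double *b)
{
    double dTemp = *a;  /* dTemp = dValueA;   */
    *a = *b;            /* dValueA = dValueB; */
    *b = dTemp;         /* dValueB = dTemp;   */
}
```

A call such as swap_double( &dValueA, &dValueB ) leaves dValueA holding the old dValueB and vice versa. Whether a compiler keeps the temporary in a register depends on the compiler and its settings.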

Here is a solution I currently implemented:

#define HrtXchgDATATYPE_RTDOUBLE( dValueA, dValueB ) \
		{ \
			_asm	FLD		[dValueA] \
			_asm	FLD		[dValueB] \
			_asm	FSTP	[dValueA] \
			_asm	FSTP	[dValueB] \
		}

The solution with FLD-FSTP instructions is ~1.6x faster, and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?

Best regards,
Sergey

What is keeping the compiler from using portable source code to accomplish the same thing? Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

Quoting TimP (Intel)...
What is keeping the compiler from using portable source code to accomplish the same thing?

[SergeyK] Nothing, but my top priority is optimization of the source code in the first place.
It means that the code must be highly optimized at the C/C++ level, sometimes with inline assembler, and
I can't rely all the time on the optimizations of a C/C++ compiler.

Are you trying to include the cases of misaligned data? If not, wouldn't 128-bit parallel moves be preferable?

[SergeyK] Could you provide more technical details with an example?

Thanks in advance.
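
One possible reading of the "128-bit parallel moves" suggestion, as a minimal sketch only: with SSE2, a single 128-bit load or store moves two doubles at once, so two pairs of doubles can be exchanged with two loads and two stores. The helper name swap_pair_sse2 is hypothetical, and unaligned loads/stores are used here so that no alignment assumption is needed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: exchanges a[0..1] with b[0..1] using 128-bit moves. */
static inline void swap_pair_sse2(double *a, double *b)
{
    __m128d va = _mm_loadu_pd(a);  /* va = { a[0], a[1] } */
    __m128d vb = _mm_loadu_pd(b);  /* vb = { b[0], b[1] } */
    _mm_storeu_pd(a, vb);          /* a[0..1] = old b[0..1] */
    _mm_storeu_pd(b, va);          /* b[0..1] = old a[0..1] */
}
```

This only pays off when the items being exchanged are 16 bytes wide (or a multiple of that). It does not by itself speed up the exchange of two separate scalar doubles.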

Quoting Sergey Kostrov...
The solution with FLD-FSTP instructions is ~1.6x faster, and it improves the performance of the sorting algorithms.

Is it possible to make the exchange faster?

I've done a set of tests with 'Load-Shuffle-Store' intrinsic functions, like

			...
			RTdouble ddA[2] = { 77.0L, 55.0L };
			__m128d ddV = { 0.0L, 0.0L };
			ddV = _mm_loadu_pd( &ddA[0] );
			ddV = _mm_shuffle_pd( ddV, ddV, 1 );
			_mm_store_pd( &ddA[0], ddV );
			...

but it is not as fast as the 'Fld-Fstp' based exchange. The final relative results of my tests are as follows:

Generic based Exchange - ~1.5x slower than Fld-Fstp
Fld-Fstp based Exchange - 1.0x
Shuffle based Exchange - ~2.5x slower than Fld-Fstp
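
For anyone who wants to repeat the 'Load-Shuffle-Store' test above, here is a self-contained sketch of the same idea. The helper name swap_adjacent_sse2 is hypothetical. It assumes plain double elements and uses the unaligned store _mm_storeu_pd instead of _mm_store_pd, because a local array of two doubles is not guaranteed to be 16-byte aligned.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: swaps p[0] and p[1] in place. */
static inline void swap_adjacent_sse2(double *p)
{
    __m128d v = _mm_loadu_pd(p);   /* v = { p[0], p[1] } */
    v = _mm_shuffle_pd(v, v, 1);   /* v = { p[1], p[0] } */
    _mm_storeu_pd(p, v);           /* write the swapped pair back */
}
```

Note that this form only exchanges two values that sit next to each other in memory, which is a narrower case than two independent variables such as dValueA and dValueB.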
