Last Modified On: October 20, 2008 9:47 AM PDT
Several instructions are available on the Intel® Pentium® 4 Processor for moving integer data between SIMD registers. However, replacing the straightforward register-to-register moves with other instructions can reduce the number of cycles the code takes to execute. Whether these replacement instructions help depends on both the organization of the code and the execution units the instructions require.
The instructions covered in this document are SSE2 128-bit integer, SSE 64-bit integer, and MMX Technology instructions. The SSE2 128-bit replacement instructions can replace only the SSE2 128-bit standard move instructions, and the SSE 64-bit replacement instructions can replace only the MMX Technology standard move instructions.
Standard Move Instructions
The following standard register-to-register move instructions are available for MMX and SSE2. Their latency and throughput are as listed in the Intel® Pentium® 4 and Intel Xeon™ Processor Optimization Reference Manual:
Streaming SIMD Extension 2 128-bit Integer Instructions
| Instruction | Latency (clocks) | Throughput (clocks) | Execution Unit |
| MOVDQA xmm, xmm | 6 | 1 | FP_MOVE |
| MOVDQU xmm, xmm | 6 | 1 | FP_MOVE |
Table 1: SSE2 Integer Full Register Move Instructions
MMX Technology 64-bit Instructions
| Instruction | Latency (clocks) | Throughput (clocks) | Execution Unit |
| MOVQ mm, mm | 6 | 1 | FP_MOVE |
Table 2: MMX Full Register Move Instructions
Replacement Move Instructions
The following instructions can be used as replacements for the standard move instructions. They are available for SSE and SSE2, and their latency and throughput are as listed in the Intel® Pentium® 4 and Intel Xeon™ Processor Optimization Reference Manual:
Streaming SIMD Extension 2 128-bit Integer Instructions
| Instruction | Latency (clocks) | Throughput (clocks) | Execution Unit |
| PSHUFD xmm, xmm, imm8 | 4 | 2 | MMX_SHFT |
Table 3: SSE2 Integer Replacement Move Instructions
Streaming SIMD Extension 64-bit Integer Instructions
| Instruction | Latency (clocks) | Throughput (clocks) | Execution Unit |
| PSHUFW mm, mm, imm8 | 2 | 1 | MMX_SHFT |
Table 4: MMX Replacement Move Instructions
Note that the replacement for the MMX Technology MOVQ instruction is an SSE 64-bit integer instruction.
Using the Replacement Move Instructions
Replacing the move instructions is simple. As the previous sections show, all of the replacement instructions listed are forms of the shuffle instruction for the various technologies (i.e., SSE integer and SSE2 integer).
The shuffle instructions move data from the source register to the destination register using the immediate parameter to determine where the data comes from in the source. The algorithm for PSHUFD is as follows:
dest[31-0]   <= (src >> (order[1-0]*32))[31-0]
dest[63-32]  <= (src >> (order[3-2]*32))[31-0]
dest[95-64]  <= (src >> (order[5-4]*32))[31-0]
dest[127-96] <= (src >> (order[7-6]*32))[31-0]
Equation 1: Shuffle Algorithm
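The selection rule above can be modeled in plain C, which makes it easy to check what a given immediate does before committing it to assembly. The following is a minimal sketch of the data movement only, not of timing; the function name `pshufd_model` and the array representation of a 128-bit register are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical software model of the PSHUFD data movement.
 * A 128-bit register is represented as four 32-bit dwords, with
 * index 0 holding bits 31-0. Each 2-bit field of `order` selects
 * the source dword copied into the matching destination dword.
 * dest and src must not overlap. */
void pshufd_model(uint32_t dest[4], const uint32_t src[4], uint8_t order)
{
    for (int i = 0; i < 4; i++) {
        /* Bits (2i+1)..(2i) of the immediate pick the source dword. */
        unsigned sel = (order >> (2 * i)) & 0x3u;
        dest[i] = src[sel];
    }
}
```

Calling `pshufd_model(d, s, 0xE4)` leaves `d` equal to `s`, while an immediate of 0x1B reverses the order of the four dwords.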
To perform a straightforward copy, use an immediate value of 0xE4 (binary 11 10 01 00). Reading the 2-bit fields from right to left, 0xE4 tells the shuffle instruction to copy word/dword 0 of the source register to word/dword 0 of the destination register, word/dword 1 of the source to word/dword 1 of the destination, and so on. The instruction should therefore be written as follows:
pshufd xmm0, xmm1, 0xE4 ; REPLACING movdqa xmm0, xmm1
pshufw mm0, mm1, 0xE4 ; REPLACING movq mm0,mm1
Equation 2: How to replace movdqa with pshufd
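To see where 0xE4 comes from, the immediate can be assembled from its four 2-bit selectors. The sketch below uses two hypothetical helpers, `shuffle_imm` and `shuffle_sel`, which are illustrative names and not part of any Intel header; the argument order (highest element first) follows the same convention as the `_MM_SHUFFLE` macro in `xmmintrin.h`.

```c
#include <stdint.h>

/* Hypothetical helper: pack four 2-bit source selectors into a
 * shuffle immediate. selN names the source element that is copied
 * into destination element N. */
uint8_t shuffle_imm(unsigned sel3, unsigned sel2, unsigned sel1, unsigned sel0)
{
    return (uint8_t)(((sel3 & 3u) << 6) | ((sel2 & 3u) << 4) |
                     ((sel1 & 3u) << 2) | (sel0 & 3u));
}

/* Hypothetical helper: extract the selector for destination
 * element n (0..3) from an immediate. */
unsigned shuffle_sel(uint8_t imm, unsigned n)
{
    return (imm >> (2u * n)) & 3u;
}
```

With these helpers, `shuffle_imm(3, 2, 1, 0)` yields 0xE4, the identity pattern used for a plain copy.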
For a more detailed description of how the shuffle instructions work, refer to the IA-32 Intel® Architecture Software Developer's Manual, Volume 2: Instruction Set Reference.
Replacement Factors
The replacement factors listed in this section are guidelines for doing fast moves. There is no guarantee that the fast move will be faster, and in some cases it may even be slower. The main reason is that the Pentium® 4 Processor executes instructions out of order, so there is no guarantee when an instruction will be executed in relation to another instruction.
Latency is the time it takes for a certain instruction to return its result. So if an instruction has a latency of 4 clocks, the result of that instruction will not be available until 4 clocks after it starts.
The shuffle instructions can be faster because of their lower latency. So if an instruction immediately following a move has a dependency on it, it can be executed sooner if a shuffle is used (see Figure 1).
Figure 1: Fast Move Latency Benefit (panels: Standard, Shuffle)
However, if the next instruction dependent on the move is several instructions away, you may not see a benefit from using the fast move (see Figure 2).
Figure 2: Fast Move No Latency Benefit (panels: Standard, Shuffle)
Throughput is the number of clocks an execution unit needs before it can accept another instruction of the same kind. So if an instruction has a throughput of 2 clocks, a second such instruction cannot start until 2 clocks after the first, even though the first may not yet have returned its result.
With the exception of PSHUFW, the throughput of the shuffle instructions is longer than that of the move instructions. Several back-to-back move instructions with no dependencies will complete sooner than the same number of back-to-back shuffles, as can be seen in Figure 3.
Figure 3: Back-to-back Moves (panels: Standard, Shuffle)
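The latency and throughput trade-off can be estimated with simple arithmetic. The sketch below is a deliberately simplified model, not a cycle-accurate description of the processor: it assumes all instructions of one kind share a single execution unit and ignores port contention and out-of-order effects. The function names are hypothetical.

```c
/* Simplified model (an assumption, not a simulator): n independent
 * instructions of one kind share one execution unit, so a new one
 * can start every `throughput` clocks, and the last result arrives
 * `latency` clocks after the last one starts. */
unsigned independent_clocks(unsigned n, unsigned latency, unsigned throughput)
{
    if (n == 0)
        return 0;
    return (n - 1) * throughput + latency;
}

/* n instructions where each one consumes the previous result:
 * every instruction waits the full latency of its predecessor. */
unsigned dependent_clocks(unsigned n, unsigned latency)
{
    return n * latency;
}
```

Using the values from Tables 1 and 3, four independent MOVDQA instructions take 3 * 1 + 6 = 9 clocks under this model, while four independent PSHUFD instructions take 3 * 2 + 4 = 10. For a chain of eight dependent instructions the result flips: 8 * 6 = 48 clocks for MOVDQA against 8 * 4 = 32 for PSHUFD.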
The move instructions are on a different port and execution unit than the other SIMD instructions, and therefore will not interfere with their scheduling. The shuffle instructions, on the other hand, may interfere with scheduling since they share the same port, and sometimes the same execution unit, with other SIMD instructions. Figure 2 shows that it takes the shuffle path an extra clock to complete because PXOR cannot enter the port at the same time as the shuffle. If, for example, PXOR used the same execution unit as the shuffle then it would have to wait two clocks before it could execute.
Another method of doing fast moves back-to-back is by combining a standard move with a shuffle instead of doing back-to-back shuffles (refer to section 2.4.2 for an explanation of why back-to-back shuffles are slow).
Figure 4: Move and Shuffle Parallelism (panel: Standard + Shuffle)
Figure 4 illustrates the standard move plus shuffle completing 2-3 clocks sooner than the pure moves and pure shuffles in Figure 3. The reason is that the pure moves and pure shuffles have to wait for their execution units to be freed before they can execute, while the combined moves and shuffles can be issued at the same time because they use different ports and execution units.
In order to make code more readable when using the alternate move instructions the following macros can be utilized:
#ifdef _USE_FASTMOV
// SSE2 Integer
#define FASTMOVDQ( dst, src ) __asm pshufd dst, src, 0xE4
// SSE Integer
#define FASTMOVQ( dst, src ) __asm pshufw dst, src, 0xE4
#else
// SSE2 Integer
#define FASTMOVDQ( dst, src ) __asm movdqa dst, src
// SSE Integer
#define FASTMOVQ( dst, src ) __asm movq dst, src
#endif
Equation 3: FASTMOV Macros
While using fast moves can produce faster code, it needs to be done with care because not every instance will be beneficial. Also keep in mind that later changes to the code may affect the factors listed in Section 2.4, so it may become necessary to return to a regular move.
Finally, remember to use the FASTMOV macros. They exist to make your code more readable and give you the ability to quickly revert to the standard move for testing.
//***************************************
// INCLUDE FILES
//***************************************
#include <stdio.h>
//***************************************
// DEFINES
//***************************************
// Serialize with CPUID, then read the time-stamp counter into the 64-bit variable x.
#define ReadTSC( x ) __asm cpuid \
                     __asm rdtsc \
                     __asm mov dword ptr x,eax \
                     __asm mov dword ptr x+4,edx
// Chain of eight dependent moves: each instruction reads the register
// written by the previous one, so the timing reflects latency.
#define TEST_XMM_LAT( x ) __asm x xmm0,xmm7 \
                          __asm x xmm1,xmm0 \
                          __asm x xmm2,xmm1 \
                          __asm x xmm3,xmm2 \
                          __asm x xmm4,xmm3 \
                          __asm x xmm5,xmm4 \
                          __asm x xmm6,xmm5 \
                          __asm x xmm7,xmm6
#define TEST_XMM_LAT_IMM( x,imm ) __asm x xmm0,xmm7,imm \
                                  __asm x xmm1,xmm0,imm \
                                  __asm x xmm2,xmm1,imm \
                                  __asm x xmm3,xmm2,imm \
                                  __asm x xmm4,xmm3,imm \
                                  __asm x xmm5,xmm4,imm \
                                  __asm x xmm6,xmm5,imm \
                                  __asm x xmm7,xmm6,imm
#define TEST_MMX_LAT( x ) __asm x mm0,mm7 \
                          __asm x mm1,mm0 \
                          __asm x mm2,mm1 \
                          __asm x mm3,mm2 \
                          __asm x mm4,mm3 \
                          __asm x mm5,mm4 \
                          __asm x mm6,mm5 \
                          __asm x mm7,mm6
#define TEST_MMX_LAT_IMM( x,imm ) __asm x mm0,mm7,imm \
                                  __asm x mm1,mm0,imm \
                                  __asm x mm2,mm1,imm \
                                  __asm x mm3,mm2,imm \
                                  __asm x mm4,mm3,imm \
                                  __asm x mm5,mm4,imm \
                                  __asm x mm6,mm5,imm \
                                  __asm x mm7,mm6,imm
#define MACROx5( x ) x x x x x
//
// 128-bit Integer
//
#define TEST_MOVDQU MACROx5( MACROx5( TEST_XMM_LAT( movdqu ) ) )
#define TEST_MOVDQA MACROx5( MACROx5( TEST_XMM_LAT( movdqa ) ) )
#define TEST_PSHUFD MACROx5( MACROx5( TEST_XMM_LAT_IMM( pshufd, 0xE4 ) ) )
//
// 64-bit Integer
//
#define TEST_MOVQ MACROx5( MACROx5( TEST_MMX_LAT( movq ) ) )
#define TEST_PSHUFW MACROx5( MACROx5( TEST_MMX_LAT_IMM( pshufw, 0xE4 ) ) )
__declspec( align( 16 ) ) static unsigned int mem[ 4 ];
int main( int argc, char *argv[ ], char *envp[ ] )
{
__int64 StartTime, EndTime;
int aTime[ 6 ];
__asm
{
pxor xmm0,xmm0
pxor xmm1,xmm1
pxor xmm2,xmm2
pxor xmm3,xmm3
pxor xmm4,xmm4
pxor xmm5,xmm5
pxor xmm6,xmm6
pxor xmm7,xmm7
}
//
// Perform the tests.
//
int i, j;
printf( "Int movdqu movdqa pshufd movq pshufw\n" );
printf( "--- ------ ------ ------ | ---- ------\n" );
for ( j=0; j < 5; j++ )
{
//
// Test MOVDQU.
//
ReadTSC( StartTime );
for ( i=0; i < 2000; i++ )
{
TEST_MOVDQU
}
ReadTSC( EndTime );
aTime[ 0 ] = (int)(EndTime-StartTime) / 1000;
//
// Test MOVDQA.
//
ReadTSC( StartTime );
for ( i=0; i < 2000; i++ )
{
TEST_MOVDQA
}
ReadTSC( EndTime );
aTime[ 1 ] = (int)(EndTime-StartTime) / 1000;
//
// Test PSHUFD.
//
ReadTSC( StartTime );
for ( i=0; i < 2000; i++ )
{
TEST_PSHUFD
}
ReadTSC( EndTime );
aTime[ 2 ] = (int)(EndTime-StartTime) / 1000;
//
// Test MOVQ.
//
ReadTSC( StartTime );
for ( i=0; i < 2000; i++ )
{
TEST_MOVQ
}
ReadTSC( EndTime );
aTime[ 3 ] = (int)(EndTime-StartTime) / 1000;
//
// Test PSHUFW.
//
ReadTSC( StartTime );
for ( i=0; i < 2000; i++ )
{
TEST_PSHUFW
}
ReadTSC( EndTime );
aTime[ 4 ] = (int)(EndTime-StartTime) / 1000;
printf( "Pass %d: %6d %6d %6d %5d %6d KClocks\n",
j+1, aTime[ 0 ], aTime[ 1 ], aTime[ 2 ], aTime[ 3 ], aTime[ 4 ] );
}
printf( "\n" );
__asm emms
return 0;
}
The following is a sample output of the source listed in Appendix A on a 1.7 GHz Pentium® 4 Processor:
Int movdqu movdqa pshufd movq pshufw
--- ------ ------ ------ | ---- ------
Pass 1: 2400 2400 1600 2770 801
Pass 2: 2400 2400 1674 2400 800
Pass 3: 2400 2681 1600 2400 800
Pass 4: 2400 2400 1600 2460 800
Pass 5: 2652 2400 1600 2400 800
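The KClocks figures can be converted back into clocks per instruction to compare them with the latencies in Tables 1 through 4. The sketch below assumes the loop structure of the listing above (2000 iterations, a 5 x 5 macro expansion, and 8 instructions per chain) and ignores loop and timing overhead; the function name `clocks_per_instruction` is hypothetical.

```c
/* Each timed loop runs 2000 iterations of a 5 x 5 expansion of an
 * 8-instruction dependent chain, i.e. 400,000 instructions. */
enum { LOOP_ITERATIONS = 2000, UNROLL = 5 * 5, CHAIN_LENGTH = 8 };

/* Convert a reported KClocks figure into clocks per instruction.
 * Returns a double because the noisy passes do not divide evenly. */
double clocks_per_instruction(unsigned kclocks)
{
    const double instructions =
        (double)LOOP_ITERATIONS * UNROLL * CHAIN_LENGTH;
    return (kclocks * 1000.0) / instructions;
}
```

The steady-state values of 2400, 1600, and 800 KClocks work out to 6, 4, and 2 clocks per instruction, which matches the listed latencies of the move instructions, PSHUFD, and PSHUFW respectively.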
