Optimizing for the Intel® Pentium® 4 Processor Using Assembly Language

by Khang Nguyen
Intel Corporation


When talking about optimizing programs for the Intel® Pentium® 4 processor, people usually think of using Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions to improve performance. This holds true in most cases: by using SSE2 instructions on 128-bit XMM registers, performance often increases dramatically. However, if a program works on 64-bit data and it is inefficient to group that data into 128-bit registers, there is a way to optimize it without using SSE or SSE2.

Using SIMD instructions on 128-bit wide registers can vastly improve the performance of a program; there is no question about it. However, if for some reason you can only work with 64-bit data, what can you do? Working with 64-bit data doesn't mean you can't use the 128-bit XMM registers: you can use them by rearranging the data and packing it into XMM registers. The problem is that the overhead involved in massaging and inserting data into XMM registers can sometimes be so large that you gain no performance at all. This is where Hyper-Threading Technology benefits the user. Systems with Hyper-Threading Technology enabled differ from true multi-processor systems in the sense that the logical processors share physical resources such as cache memory and execution units. The following sections cover some techniques for optimizing a program for Pentium 4 processors with Hyper-Threading Technology enabled.

General Concepts

To optimize an application for the Pentium 4 processor, you can use either Intel MMX™ technology or SSE/SSE2 instructions. The question is when to use MMX technology versus SSE/SSE2. It is common practice to explore SSE/SSE2 first because they operate on XMM registers, which are 128 bits wide, versus the 64-bit MMX technology registers. However, even though XMM registers are wider than MMX technology registers, some instructions take much longer to complete on them. For example, the instruction "paddq" takes 6 clock cycles on XMM registers but only 2 clock cycles on MMX registers. The rule is: if your application involves intensive calculations on 128-bit data, using SSE/SSE2 makes sense. If it takes too much overhead to massage and load data into XMM registers, then MMX technology, although not as wide, is the better alternative. Sometimes a combination of both MMX technology and SSE/SSE2 is advantageous: you can use the XMM registers as extra storage when MMX operations require temporary storage of large quantities of data. By using the XMM registers as temporary storage, you don't have to swap data out or wait until the MMX registers are free before loading new data, which saves the clock cycles involved in loading and reloading data.
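For a concrete feel for the 128-bit path, here is a minimal C sketch using SSE2 intrinsics (the function name add_pairs_sse2 is made up for illustration). _mm_add_epi64 compiles to the paddq instruction discussed above, adding two 64-bit values in one XMM operation — provided the data is already packed, which is exactly the trade-off in question:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Add two pairs of 64-bit integers with a single paddq on an XMM
   register. Whether this beats two scalar adds depends on how
   cheaply the data can be packed into the register. */
static inline void add_pairs_sse2(const int64_t a[2], const int64_t b[2],
                                  int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```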

Look at the overall program or function for ways to simplify it. Sometimes replacing branching conditions or relocating them can make a big difference. For example, hoisting a loop-invariant condition out of a loop eliminates one branching statement from a loop body that executes 640 times (as in the number of pixels per scan line of a 640x480 picture). The improvement is even greater if that function is called many times in the program. Moving data from memory to registers or between memory locations can be costly if not handled carefully; the common practice is to load data well in advance of its use. Another point worth attention is equivalent instructions with shorter latency. For example, use the instruction "pshufw" (2 clock cycles, with the order value 0xE4) in place of "movq" (6 clock cycles) when copying data among MMX registers. Note that you have to take into account not only the latency of an instruction but also its throughput: the time it takes for an execution unit to serve an instruction before it is ready to receive the next one. The importance of throughput can be seen in Tip 2 below. Finally, it is very important to remember that machines with Hyper-Threading Technology enabled share execution units; therefore, for statements next to or close to each other, use appropriate instructions to distribute tasks among execution units. This way you can hide latencies and improve performance. This technique benefits not just systems with Hyper-Threading Technology, but all systems. The following section describes some of these techniques.
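The branch-hoisting advice above can be sketched in C (the scanline function and its invert flag are hypothetical, chosen to match the 640-pixel example):

```c
#include <stddef.h>

/* The loop-invariant condition is tested once, before the loop,
   instead of once per pixel inside a 640-iteration loop. */
static void invert_scanline(unsigned char *px, size_t n, int invert)
{
    if (!invert)          /* checked once, not n times */
        return;
    for (size_t i = 0; i < n; i++)
        px[i] = (unsigned char)(255 - px[i]);
}
```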

Tips and Tricks

Tip 1: Initializing Data

Set a Register to Zero:

Instead of:

mov eax, 0

use:

xor eax, eax
pxor mm0, mm0
pxor xmm0, xmm0
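In C, the same idiom is available through intrinsics; compilers typically emit the pxor-with-itself shown above for _mm_setzero_si128() (the helper name zero128 is made up):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Store a zeroed XMM register; _mm_setzero_si128() compiles to
   "pxor xmm, xmm", the idiom shown above. */
static inline void zero128(int64_t out[2])
{
    _mm_storeu_si128((__m128i *)out, _mm_setzero_si128());
}
```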


Set All Bits of MM0 to 1s:

Instead of:

C declaration: unsigned temp[4] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};

asm { movq   mm0, temp
      movdqa xmm1, temp }

use:

pcmpeqd mm0, mm0
pcmpeqd xmm1, xmm1
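The intrinsic form of the pcmpeqd trick compares a register with itself, which sets every bit without touching memory (all_ones128 is a made-up helper name):

```c
#include <emmintrin.h>
#include <stdint.h>

/* pcmpeqd xmm, xmm: every 32-bit lane compares equal to itself,
   so every bit of the result is set. No memory load is needed. */
static inline void all_ones128(uint64_t out[2])
{
    __m128i x = _mm_setzero_si128();       /* any starting value works */
    __m128i ones = _mm_cmpeq_epi32(x, x);  /* pcmpeqd */
    _mm_storeu_si128((__m128i *)out, ones);
}
```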


Tip 2: Creating a Constant

Set mm7 to 0xFF00FF00FF00FF00:

pcmpeqd mm7, mm7 // 0xFF FF FF FF FF FF FF FF

psllq mm7, 8 // 0xFF FF FF FF FF FF FF 00

pshufw mm7, mm7, 0x0 // 0xFF 00 FF 00 FF 00 FF 00


Each instruction takes 2 clock cycles to complete. The whole operation takes 6 clock cycles.


A faster alternative:

pxor mm7, mm7       // 0x0000000000000000

pcmpeqd mm0, mm0    // 0xFFFFFFFFFFFFFFFF

punpcklbw mm7, mm0  // 0xFF00FF00FF00FF00


pxor and pcmpeqd are both handled by the MMX-ALU execution unit, and each takes 2 clock cycles to complete, but the MMX-ALU waits only 1 cycle after starting pxor before it can accept pcmpeqd, rather than waiting for pxor to finish. Therefore, the whole operation takes only 5 clock cycles to complete instead of 6.

Set mm7 to 0x00FF00FF00FF00FF:

pxor mm0, mm0       // 0x0000000000000000

pcmpeqd mm7, mm7    // 0xFFFFFFFFFFFFFFFF

punpcklbw mm7, mm0  // 0x00FF00FF00FF00FF


Note: You can use the same technique with XMM registers with some minor modifications because you can only work on half of the XMM register at a time.
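As a sketch of the XMM variant mentioned in the note, here is the same pxor/pcmpeqd/punpcklbw sequence written with SSE2 intrinsics (make_ff00_pattern is a made-up name). punpcklbw consumes only the low eight bytes of each source, so a single unpack fills the whole 128-bit register:

```c
#include <emmintrin.h>
#include <stdint.h>

/* Interleave 0x00 and 0xFF bytes: the result is the byte pattern
   00 FF 00 FF ..., i.e. 0xFF00 repeated in every 16-bit word. */
static inline void make_ff00_pattern(uint64_t out[2])
{
    __m128i zeros = _mm_setzero_si128();            /* pxor      */
    __m128i ones  = _mm_cmpeq_epi32(zeros, zeros);  /* pcmpeqd   */
    __m128i pat   = _mm_unpacklo_epi8(zeros, ones); /* punpcklbw */
    _mm_storeu_si128((__m128i *)out, pat);
}
```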

Tip 3: Loading Data

Instead of:

movq mm1, mm2

use:

pshufw mm1, mm2, 0xE4


Note: The trick lies in the magic number 0xE4; do not change the order value. This is a useful way to copy the contents of one register to another. The movq instruction takes 6 clock cycles to complete, compared to only 2 for pshufw. However, don't substitute pshufw for movq automatically; take care to make sure the appropriate execution unit is not busy at the time. The movq and pshufw instructions use the FP_MOV and MMX_SHFT execution units, respectively.
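The SSE2 analogue of the pshufw copy uses pshufd with the same order value; 0xE4 is binary 11 10 01 00, which selects elements 3, 2, 1, 0 — an identity shuffle (copy_via_shuffle is a made-up name):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Copy a 128-bit value via pshufd with order 0xE4 (identity). */
static inline void copy_via_shuffle(const uint64_t in[2], uint64_t out[2])
{
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    __m128i y = _mm_shuffle_epi32(x, 0xE4);  /* pshufd x, x, 0xE4 */
    _mm_storeu_si128((__m128i *)out, y);
}
```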

Tip 4: Swapping Data

Swap the Hi and Lo Portions of a Register:

pshufw mm0, mm0, 0x4E

pshufd xmm0, xmm0, 0x4E


Note: If you reverse the order value from 0x4E to 0xE4, the operation becomes a copy instead of a swap.
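The 0x4E swap carries over to pshufd as well; 0x4E is binary 01 00 11 10, which selects dwords 2, 3, 0, 1 and thus exchanges the two 64-bit halves (swap_halves is a made-up name):

```c
#include <emmintrin.h>
#include <stdint.h>

/* pshufd with order 0x4E swaps the high and low 64-bit halves. */
static inline void swap_halves(const uint64_t in[2], uint64_t out[2])
{
    __m128i x = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi32(x, 0x4E));
}
```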

Create Patterns:

Load register mm0 with 0xAADDAADDAADDAADD

mov eax, 0xAADD

movd mm0, eax

pshufw mm0, mm0, 0x0


Note: The order value 0x0 copies the first word "AADD" to all subsequent words of mm0. You can use the same technique with XMM registers: build the pattern in the lower half, shift the register left to move it to the upper half, and issue the shuffle again to take care of the lower half.
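On the intrinsics side, _mm_set1_epi16 performs the equivalent broadcast for a full XMM register; compilers typically implement it with a move plus shuffles, much like the mov/movd/pshufw sequence above (broadcast16 is a made-up helper name):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Replicate one 16-bit value into all eight words of an XMM register. */
static inline void broadcast16(uint16_t v, uint64_t out[2])
{
    _mm_storeu_si128((__m128i *)out, _mm_set1_epi16((short)v));
}
```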

Tip 5: Using lea Instructions

Instead of:

mov edx, ecx

sal edx, 3

use:

lea edx, [ecx + ecx]

add edx, edx

add edx, edx


Note: An lea instruction followed by two add instructions is fast, but don't go beyond three adds; any more instructions defeat the purpose of replacing the shift with lea.
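Both sequences compute the same value, x * 8; a C sketch makes the equivalence easy to check (the function names are made up). On the Pentium 4 the shift unit was comparatively slow, which is why the lea-plus-adds form could win; on other processors, measure before choosing:

```c
#include <stdint.h>

/* mov edx, ecx / sal edx, 3 */
static inline uint32_t mul8_shift(uint32_t x) { return x << 3; }

/* lea edx, [ecx + ecx] / add edx, edx / add edx, edx */
static inline uint32_t mul8_lea(uint32_t x)
{
    uint32_t t = x + x;  /* x * 2 */
    t += t;              /* x * 4 */
    t += t;              /* x * 8 */
    return t;
}
```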


We don't always have to use SSE/SSE2 instructions to gain performance when optimizing programs for the Intel® Pentium® 4 processor. If operations involve only integers on 64-bit data, use MMX technology instead of SSE/SSE2. First, use some common sense and look at the overall picture to see what you can do to simplify the functions. If you have many identical operations that need to stay close to each other, try to spread them across different execution units to hide latencies. Things like unused code, placement of branching conditions, looping, and faster equivalent instructions are all worth considering during the optimization process. Finally, MMX instructions run faster on Pentium 4 processors for this kind of workload; whenever possible, use MMX instructions when working with 64-bit data.

About the Author

Khang Nguyen is a Senior Applications Engineer working with Intel's Software and Solutions Group. He can be reached at khang.t.nguyen@intel.com.



Comments

$udh@k@r wrote:

Hi Khang Nguyen,

I am trying to implement "_mm_set1_epi16(imm8_val)" in assembly. While referring to the manual I learned that it is a composite intrinsic. I would like to know the actual instructions used to implement this set1 intrinsic.


Nguyen, Khang T (Intel) wrote:

Hi BR,

You can find information about instruction cycles and latency numbers in Appendix C of the Intel 64 and IA-32 Architectures Optimization Reference Manual.


srimks wrote:

How can I get information about instruction cycles and latency numbers? Could you share the reference(s) that disclose that?


anonymous wrote:

Cool article.

Maybe you can give me some advice. I want to swap the bytes in xmm0 by doing something like this:

__declspec( align(16) ) BYTE g_carReverseC1ShufleMask[16] =
{ 0x0F, 0x0E, 0x0D, 0x0C, 0x0B, 0x0A, 0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0x00 };

movdqa xmm1, oword ptr [g_carReverseC1ShufleMask]

pshufb xmm0, xmm1

It's perfect, but "pshufb" is an SSSE3 instruction, and the question is: can I do something like this using only earlier SSE instructions?
