Differences in Optimizing for the Pentium® 4 Processor vs. the Pentium® III Processor

by Bryan Hayes
Hayes Technologies


Introduction

Abstract

The Pentium® 4 processor introduces a completely new micro-architecture for the IA32 instruction set, the NetBurst™ architecture. It is specifically designed to enable industry-leading clock speeds and to achieve the highest performance for advanced applications such as graphics, video, speech or multimedia processing. This paper points out the most important performance-related differences between the NetBurst architecture and the earlier Pentium family processors, as implemented by the Pentium® III processor, and gives advice on how to optimize software for the NetBurst architecture.

Definitions

Earlier Pentium Family Processor - The processors and micro-architectural features introduced with the Pentium® Pro processor and enhanced by the Pentium® II processors and Pentium III processors [1, 2].

NetBurst Architecture - The micro-architecture of the Pentium® 4 processor [3, 4].

Scope

The differences between the NetBurst architecture in the Pentium 4 processor and the architecture of the earlier Pentium family processors are numerous. This paper focuses on those that are most relevant to performance. The paper does not cover optimizations common to both architectures, such as data alignment issues, nor does it provide a complete description of the architectures; please see the references [8, 9, 10, 11] for additional information. Also, the Intel® Xeon™ processors [5, 6] are not explicitly covered, but it is important to note that they employ the NetBurst architecture (please see [7] for a description of the Hyper-Threading Technology introduced by the 0.13µm Xeon processor).


Key Differences and the Consequences for Optimization

Data Bus

The NetBurst architecture features a 400 MHz, 64-bit wide system bus with a peak bandwidth of 3.2 Gbytes/s, an increase of more than a factor of three over the earlier Pentium family processor architecture, which maxed out at a 133 MHz, 64-bit wide bus with a peak bandwidth of 1.064 Gbytes/s. When the memory implementation utilizes this bandwidth, memory-intensive software is greatly sped up, specifically array-oriented applications such as numerical/technical computing, multimedia, image, video or audio processing.

As memory bandwidth is a fundamental barrier to speed improvements, the Pentium® 4 can be as much as 3x faster than a Pentium® III on applications that are limited by memory performance. A consequence of the high memory performance is that a lot of software that previously was memory-bound (speed limited by the memory bandwidth) is now compute-bound (speed limited by the computations rather than by the memory bandwidth) and therefore subject to optimization. Especially the single-instruction, multiple-data (SIMD) instructions should be considered for optimization, since they have the highest potential for speed-ups and are particularly suited to array-oriented computations.

Cache Sizes and Behavior

The following table lists the key parameters of L1 and L2 caches:

L1 Data cache
                             NetBurst™                    Earlier Pentium Family
  Size                       8 Kbytes                     16 Kbytes
  Line size                  64 bytes                     32 bytes
  Associativity              4                            4
  Policy                     Write-through                Write-back
  Read width / rate          128 bits / 1 clock cycle     64 bits / 1 clock cycle
  Write width / rate         128 bits / 1 clock cycle**   64 bits / 1 clock cycle
  Integer data read latency  2 clock cycles               3 clock cycles
  FP/SIMD data read latency  6 clock cycles               3 clock cycles

L1 Code cache
                             NetBurst™                    Earlier Pentium Family
  Size                       12K µops                     16 Kbytes
  Line size                  6 µops                       32 bytes
  Associativity              8                            8
  Policy                     NA                           Write-back
  Read width / rate          3 µops / 1 clock cycle       128 bits / 1 clock cycle
  Write width / rate         NA                           NA
  Data read latency          NA                           NA

L2 cache
                             NetBurst™                    Earlier Pentium Family
  Size                       256-512 Kbytes†              256-512 Kbytes†
  Line size                  128 bytes*                   32 bytes
  Associativity              8                            8
  Policy                     Write-back                   Write-back
  Read width / rate          512 bits / 2 clock cycles    256 bits / 2 clock cycles
  Write width / rate         512 bits / 2 clock cycles    256 bits / 2 clock cycles
  Integer data read latency  7 clock cycles               7 clock cycles
  FP/SIMD data read latency  7 clock cycles               7 clock cycles

 

† 512 K for 0.13µm models. * 2x64 bytes; only the modified 64-byte segment(s) are written back to main memory. ** As the L1 cache is write-through and the L2 cache can only accept a write every 2 clock cycles, the effective throughput for L1 writes is 2 clock cycles.

The key differences are:

  • The number of bits that can be read or written per clock cycle has doubled from 64 to 128
  • The latencies for integer reads have been reduced significantly
  • The code cache contains decoded µops instead of raw instructions, removing the decode step from the execution pipeline in most cases

 

In addition, because the NetBurst architecture schedules µops that are dependent on loads as if the load will hit the L1 cache, it needs to replay (discard the results and reschedule) them if an L1 cache miss actually occurs. This leads to penalties of at least 12 clock cycles.

Recommendations

  • Use the smallest possible data types for arrays to conserve memory bandwidth and cache space (e.g. use "unsigned char x[ 5000 ]" instead of "unsigned int x[ 5000 ]" if possible)
  • Use Streaming SIMD Extensions/Streaming SIMD Extensions 2 (SSE/SSE2) instructions to utilize the doubled data read/write bandwidth (see the sketch after this list)
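
For illustration, a minimal sketch (not from the paper) combining both recommendations; "sumbytes" is a hypothetical helper that assumes a is 16-byte aligned, n is a multiple of 16 and the total fits in 32 bits. It sums an array of unsigned chars 16 bytes at a time using the SSE2 psadbw instruction via intrinsics:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

unsigned int sumbytes( const unsigned char a[], size_t n )
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = _mm_setzero_si128();

    for( size_t z = 0; z < n; z += 16 )
    {
        /* one 128-bit load per clock cycle */
        __m128i v = _mm_load_si128( (const __m128i *)&a[ z ] );

        /* psadbw against zero yields two 64-bit lanes holding byte sums */
        acc = _mm_add_epi64( acc, _mm_sad_epu8( v, zero ) );
    }

    /* combine the two partial sums */
    return (unsigned int)_mm_cvtsi128_si32( acc )
         + (unsigned int)_mm_cvtsi128_si32( _mm_srli_si128( acc, 8 ) );
}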

 

Data Prefetching

The NetBurst architecture introduced a hardware prefetch mechanism, which automatically prefetches data into the L2 cache in the case of sequential data accesses with a constant stride. This prefetch mechanism was subsequently also included in the 0.13 µm Pentium® III processors. Its basic properties are:

  • It is invoked after the cache misses 3-4 times in a constant-stride pattern
  • It prefetches 256 bytes ahead
  • It handles only one read or write stream per 4 Kbytes page
  • Up to 8 streams can be prefetched in total
  • It does not prefetch past 4 Kbytes page boundaries
  • It fetches past the last items to be accessed

 

Also, the behavior of the software prefetch instructions has changed: the NetBurst architecture always fetches 128 bytes into the L2 cache only, whereas the earlier Pentium family processor architecture fetches 32 bytes into either L1 only, L2 only or both. This implies that prefetching/preloading data into the L1 cache can only be accomplished by a dummy load.
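
A minimal sketch of such a dummy load (the register and address are placeholders):

mov eax,[esi] ;loaded value is never used; the line lands in L1 (and L2)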

Recommendation

Although the hardware prefetch mechanism does a good job on large arrays, it is generally preferable to use software prefetch instructions (see the sketch after this list), because they:

  • can prefetch all of the data
  • do not prefetch unneeded data (if programmed accordingly)
  • can handle irregular and non-array accesses
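
A minimal sketch (not from the paper) of software prefetching with the prefetch intrinsic; the prefetch distance of 128 doubles is a placeholder that needs tuning, and on the NetBurst architecture every hint fetches into the L2 cache only:

#include <xmmintrin.h>  /* _mm_prefetch */
#include <stddef.h>

double sumarray_pf( const double a[], size_t n )
{
    double sum = 0;

    for( size_t z = 0; z < n; z++ )
    {
        /* issue one prefetch per 128-byte line (16 doubles) */
        if( ( z & 15 ) == 0 && z + 128 < n )
            _mm_prefetch( (const char *)&a[ z + 128 ], _MM_HINT_T0 );

        sum += a[ z ];
    }

    return( sum );
}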

 


Pipeline and Execution

Pipeline Front End

Earlier Pentium family processor architecture featured a rather classic front end, consisting of instruction decoders able to decode up to 3 instructions (one complex, or one non-simple and two simple) per clock cycle. The NetBurst architecture employs a totally new concept: a trace cache, which holds sequences (traces) of µops, effectively removing the decoding stages (~5) from the pipeline in most cases and thereby reducing branch misprediction penalties.

The trace cache has a capacity of 12K µops and can deliver up to 3 µops per clock cycle (vs. up to 6, typically 2-4, for the earlier Pentium family processor architecture; there, however, the renaming pipeline stage can only handle 3 µops per clock cycle). In cases where instructions are not in the trace cache, a decoder has to fetch them from the L2 cache, decode them and store the result in the trace cache. Although this decoder is limited to decoding one instruction per clock cycle, this is typically not a problem, since most performance-relevant instructions are fetched from the trace cache.

Recommendations

  • Instruction scheduling to match decoder capabilities ("4-1-1 scheduling"), as with the earlier Pentium family processors, is not an issue
  • Minimize the number of µops (generally 1 for "op reg,reg" or "mov reg,mem", 2 for "mov mem,reg", 2 or 3 for "op reg,mem" and 4 for "op mem,reg" instructions); see the sketch after this list
  • Employ SIMD instructions to further reduce the number of µops necessary to accomplish a task
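
As a hypothetical illustration of the µop counts listed above (register and address choices are arbitrary):

add [esi],eax  ;"op mem,reg": 4 µops (load, add, store address, store data)

mov ebx,[esi]  ;"mov reg,mem": 1 µop
add ebx,eax    ;"op reg,reg":  1 µop
mov [esi],ebx  ;"mov mem,reg": 2 µops (store address, store data)

Both sequences total 4 µops here, but when the loaded value is reused or several operations are combined before storing, the register form saves µops.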

 

Execution Units

The following table details the execution unit structure of both architectures:

(Throughput refers to the number of clock cycles separating successive instructions of the same kind, and latency to the number of clock cycles that pass until the result becomes available for use.)

NetBurst architecture (entries are throughput / latency):

Port 0
  Integer ALU†                               0.5 / 0.5
  Integer store data                         1 / NA
  FPU / MMX / SSE / SSE2 move                1 / 6
  FPU exch                                   1 / 0
  FPU / MMX / SSE / SSE2 store data          16-128 bits: 1 / NA

Port 1
  Integer ALU†                               0.5 / 0.5
  Branches (jcc, call, ret)                  jcc: 0.5 / NA; call: 1 / 5; ret: 1 / 8
  Integer shift, rotate by 1                 1 / 4
  adc, sbb                                   reg,reg: 3 / 8; reg,imm: 2 / 6
  inc, dec                                   1 / 1
  Complex integer                            NA
  FPU / SSE / SSE2 add, sub                  FPU: 1 / 5; SSE / SSE2: 2 / 4
  Integer / MMX / FPU / SSE / SSE2 mul       Int: 3-5 / 14-18; MMX: 1 / 8; FPU: 2 / 7; SSE / SSE2: 2 / 6
  Integer / FPU / SSE / SSE2 div, sqrt       Int: 23 / 56-70; FPU: 23-43 / 23-43; SSE: 32 / 32; SSE2: 62 / 62
  FPU misc                                   NA
  MMX / SSE / SSE2 ALU                       MMX: 1 / 2; SSE / SSE2: 2 / 2
  MMX / SSE / SSE2 shift, rotate, shuffle, pack, unpack   MMX: 1 / 2; SSE / SSE2: 2 / 2
  MMX misc; SSE / SSE2 reciprocal            NA; 4 / 6

Port 2
  Load data / prefetch                       8-128 bits: 1 / 2

Port 3
  Store address generation                   1 / 1

(The NetBurst architecture has no port 4.)

Earlier Pentium family processors (entries are throughput / latency):

Port 0
  Integer ALU (add, sub etc.)                1 / 1
  Integer shift                              1 / 1
  Integer multiply                           1 / 4
  LEA instruction                            1 / 1
  FPU add/sub                                1 / 3
  FPU mul                                    2 / 5
  FPU div                                    37 / 38
  MMX ALU                                    1 / 1
  MMX mul                                    1 / 3
  SSE mul                                    2 / 5
  SSE div, sqrt                              36-58 / 36-58
  SSE move                                   1 / 1

Port 1
  Integer ALU (add, sub etc.)                1 / 1
  MMX ALU                                    1 / 1
  MMX shift, rotate, shuffle, pack, unpack   1 / 2
  SSE add / sub                              2 / 4
  SSE shuffle                                1-4 / 1-2
  SSE reciprocal, reciprocal sqrt            2 / 2
  SSE move                                   1 / 1

Port 2
  Load data                                  8-64 bits: 1 / 3; 128 bits: 2 / 4

Port 3
  Store address generation                   1 / 1

Port 4
  Store data                                 8-64 bits: 1 / NA; 128 bits: 2 / NA

 

† Only mov, movzx, movsx, add, sub, and, or, xor, cmp, test, neg, not and nop instructions; the integer ALUs can each handle 2 µops per clock cycle.



The main structural differences are:

  • The store data operation has been moved from port 4 to port 0
  • All FPU and SIMD operations (except load, store and move) have been concentrated in port 1
  • There are 2 integer ALUs, which can handle the most common integer instructions at a rate of 2 per clock cycle (even if they are fully dependent, e.g. "add eax,ebx" followed by "sub edx,eax"). This increases the maximum number of dispatched µops from 5 (earlier Pentium family processors) to 6 per clock cycle.

 

Other key differences are:

  • The load and store data operations can now handle up to 128 bits per clock cycle. This doubles the data bandwidth into and out of the execution units per clock cycle.
  • The throughput and latencies have changed for many operations

 

The maximum retirement rate of 3 µops per clock cycle has remained unchanged. Other related resources have been increased as detailed by the following table:

Resource                                 NetBurst                  Earlier Pentium Family
Physical registers (used by renaming)    128                       40
ROB entries                              126                       40
Reorder entries                          Included in ROB entries   20
Store buffers                            24                        12
Pending loads                            48                        12

 

Recommendations

Instruction(s)    Recommendation(s)

adc / sbb         Avoid if possible; replace by other code sequences.

inc / dec x       Replace by add / sub x,1 if possible.

imul / mul        Avoid if possible; schedule dependent instructions as far away as possible; consider switching to SIMD or FPU code; replace multiplications by small constants with other code sequences.

Shifts, rotations Avoid if possible; schedule dependent instructions as far away as possible; replace shl with additions (see the sketch after this table).

lea               The decoder internally generates mov, add and shl µops for this instruction, e.g. "lea eax,[ebx+edx]" leads to an add µop and "lea eax,[eax+ecx*2+4]" leads to a shl µop and 2 add µops; the shl µop has a latency of 4, so try to avoid the scaled index if this is a problem.

movsx             Uses one more µop than movzx; consider using unsigned data.

fxch              Try to avoid this instruction, as it now consumes a µop (the earlier Pentium family processor architecture handles it by register renaming, not by the execution units).

FPU               Consider using the SSE / SSE2 FP instructions instead; this can result in a significant speed-up due to the doubled data bandwidth (128-bit loads and stores), the parallel utilization of the execution units, the lower latency and the lower number of µops. The SSE / SSE2 FP instructions are the only way to keep the FP execution units fully busy.

FPU / MMX / SSE / SSE2 move reg,reg   Try to avoid such moves, as they have long latencies.

MMX               Replace with SSE2 instructions; this can result in a significant speed-up due to the doubled data bandwidth (128-bit loads and stores), the parallel utilization of the execution units and the lower number of µops.

SSE / SSE2        Try to generate a balanced mix of add/sub, mul, shift etc. instructions, so that no execution unit is oversubscribed.
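
As a sketch of the kind of replacement meant for shifts and small-constant multiplications (register choice is arbitrary; the latencies are taken from the execution unit table above), "shl eax,2" (latency 4) can become:

add eax,eax ;*2, latency 0.5 on the double-pumped ALU
add eax,eax ;*4

Similarly, a multiplication by 5 without imul:

mov ebx,eax
add eax,eax ;*2
add eax,eax ;*4
add eax,ebx ;*5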

 

Replaying µops in Case of L1 Cache Misses

The NetBurst architecture schedules µops that are dependent on loads as if the load will hit the L1 cache. If an L1 cache miss actually occurs, it needs to replay them, leading to penalties of at least 12 clock cycles.

Dealing with Latencies

Due to its large reorder buffer (ROB) and other resources, the NetBurst architecture is uniquely positioned to exploit the parallelism present in the instruction stream. However, if there is little or no parallelism in the code due to instruction dependencies (instructions needing results produced by previous instructions), high instruction latencies can worsen the problem.

Example:

double sumarray( double a[], size_t n )
{
    double sum = 0;

    for( size_t z = 0; z < n; z++ )
        sum += a[ z ];

    return( sum );
}

 

The problem with this code is that the key instruction generated by the compiler (the floating-point add) is fully dependent on its own previous result. As a consequence, the processor mainly sits idle, waiting for the result of the previous add.

The solution to this problem is to increase the available parallelism by breaking dependency chains (sequences of dependent instructions), in this case by introducing a second accumulator:

double sumarray( double a[], size_t n )
{
    double sum0 = 0, sum1 = 0;
    size_t z = 0; // declared outside the loop, as it is also needed afterwards

    if( n >= 1 )
    {
        for( ; z < n - 1; z += 2 )
            sum0 += a[ z ], sum1 += a[ z + 1 ];

        if( z < n ) // handle the last iteration
            sum0 += a[ z ];
    }

    return( sum0 + sum1 );
}

 

In other cases it may be possible to include more work in the inner loop, either the same type of work (e.g. summing several arrays within one loop) or other work (e.g. loop fusion, sketched below). In addition, latencies due to loading data from main memory can be dealt with by employing prefetch instructions, if the load address is known far enough in advance.
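
A minimal sketch of loop fusion (sum and sumsq are hypothetical accumulators); two separate passes over a[] become one pass containing two independent dependency chains:

/* before: two loops, each traversing a[] separately */
for( size_t z = 0; z < n; z++ )
    sum += a[ z ];

for( size_t z = 0; z < n; z++ )
    sumsq += a[ z ] * a[ z ];

/* after (fused): one pass over a[], more independent work per iteration */
for( size_t z = 0; z < n; z++ )
{
    sum   += a[ z ];
    sumsq += a[ z ] * a[ z ];
}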


Register Access and Branch Prediction

Partial Register Accesses / 16-bit Operands

The NetBurst architecture treats every access to a sub-register (e.g. bx, bh, bl) as an access to the underlying 32-bit register, so that accessing ax, al or ah is treated as an access to eax. As a consequence, accesses to ah, al etc. are not independent, as they are with the earlier Pentium family processor architecture. On the other hand, there are no penalties for reading a register after writing to a sub-part of it, e.g. accessing ecx after modifying ch.

The NetBurst architecture has no penalties for 16-bit operands although these have an operand-size prefix.

Recommendation

Avoid code that treats ah etc. as independent from al (although even this case is handled very well thanks to the double-pumped ALUs), as sketched below.
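
A minimal sketch of the point above (register and address choices are placeholders):

mov al,[esi]    ;avoid: al and ah are parts of eax,
mov ah,[esi+1]  ;so these two loads are not independent

movzx eax,byte ptr [esi]    ;prefer separate registers; movzx also
movzx edx,byte ptr [esi+1]  ;avoids partial-register writes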

Branch Prediction

The NetBurst architecture has improved the branch prediction mechanism considerably. The following table lists the differences:

Architecture                                    NetBurst              Earlier Pentium Family
Size of BTB [entries]                           4096                  512
Return address stack size                       16                    16
Static prediction                               Backward conditional jumps: taken; forward conditional jumps: not taken (both architectures)
Static branch hint prefixes                     Yes (see [8, 9])      No
Dynamic prediction:
  Fully predicted loops                         1-16 loop iterations  1-4 loop iterations
  Mispredictions if the loop count is exceeded  1 (last iteration)    2 (first and last iteration)
  Taken / not-taken pattern length
  with no misprediction                         1-4                   1-4
Misprediction penalty [clock cycles]            Min. 20               10-15, max. 26
Penalty for correctly predicted
taken branches [clock cycles]                   0                     1 (for instruction fetch)
Average misprediction rate                      ~6%                   ~10%

 

The NetBurst architecture significantly improves the handling of branches, the only exception being the higher misprediction penalty.

Recommendations

Branches should be avoided or their number reduced, because even if they are correctly predicted, they consume resources, potentially limiting the throughput. They also inhibit compiler optimizations in many cases.

One way to achieve this is to eliminate loop invariant branches from loops:

Example:

for( z = 0; z < n; z++ )
    if( op == MULT )
        r[ z ] = a[ z ] * b[ z ];
    else
        r[ z ] = a[ z ] + b[ z ];

 

Replace this with:

if( op == MULT )
    for( z = 0; z < n; z++ )
        r[ z ] = a[ z ] * b[ z ];
else
    for( z = 0; z < n; z++ )
        r[ z ] = a[ z ] + b[ z ];

 

Another technique is loop unrolling (reducing the loop count by repeating the loop body). Please be aware that the compiler may already be unrolling a loop, so a manual unroll can in fact cause a performance degradation in such a case. Loop unrolling also makes additional optimizations possible in many cases.

Example (count spaces, assuming no character codes below the space character; the subtraction is negative exactly for spaces, so the & ~UCHAR_MAX mask yields -256 for a space and 0 otherwise, each space thus adding 256 to nrspaces, and the final shift divides by 256; with a 32-bit size_t this limits the technique to strings with fewer than 2^24 spaces):

size_t countspaces( unsigned char s[], size_t n )
{
    size_t nrspaces = 0;

    for( size_t z = 0; z < n; z++ )
        nrspaces -= ( s[ z ] - (' ' + 1) ) & ~UCHAR_MAX;

    return( nrspaces >> CHAR_BIT );
}

 

Replace this with (loop unrolling by 2):

size_t countspacesopt( unsigned char s[], size_t n )
{
    size_t nrspaces = 0;
    size_t z = 0; // declared outside the loop, as it is also needed afterwards

    if( n >= 1 )
    {
        for( ; z < n - 1; z += 2 )
            nrspaces -= ( s[ z ] - (' ' + 1) ) & ~UCHAR_MAX,
            nrspaces -= ( s[ z + 1 ] - (' ' + 1) ) & ~UCHAR_MAX;

        if( z < n ) // handle the last iteration
            nrspaces -= ( s[ z ] - (' ' + 1) ) & ~UCHAR_MAX;
    }

    return( nrspaces >> CHAR_BIT );
}

 

Another powerful optimization is inlining. This not only eliminates the branches, but also the parameter-passing overhead, and in many cases it enables further optimizations.

Finally, some conditional branches can be eliminated by using the cmovcc/fcmovcc and setcc instructions, by employing some clever arithmetic (see the example above) or by table lookup; adc, sbb, lahf and the shift and rotate instructions may also be of use, but beware of the latencies.

Example:

;if( eax >= 8 ) eax = 0;

cmp eax,8
jc donotsettozero ;jump if unsigned eax < 8
xor eax,eax
donotsettozero:

 

Replace this with:

cmp eax,8
cmovnc eax,ebx ;ebx must be 0
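
The setcc instructions mentioned above work similarly; a minimal sketch (hypothetical register use) computing eax = (eax < 8) ? 1 : 0 without a branch:

cmp eax,8
setc al      ;al = 1 if unsigned eax < 8, else 0
movzx eax,al ;zero-extend to avoid a partial-register access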

 

To take advantage of the static branch prediction, arrange conditions/code accordingly (i.e. with "if( x ) ... else ..." x should be true with >50% probability).

In addition, pair all calls with returns to utilize the return address stack mechanism.


Instruction Scheduling

Because the trace cache holds decoded instructions in the form of µops, the scheduling of instructions is not critical; only the instruction latencies should be considered (see also the sub-section "Dealing with Latencies" in the section "Execution Units"). Specifically, no 4-1-1 decoder template, as with the earlier Pentium family processors, needs to be matched.

SIMD Instructions

With the arrival of the Streaming SIMD Extensions 2 (SSE2) [8, 9, 13], the Pentium® 4 processor provides parallel instructions for all basic integer and floating-point data types, thereby enabling significant speed-ups for all array-oriented processing.

Compared to the earlier Pentium family processors, the main differences are:

  • The new SSE2 instructions also handle double-precision floating-point values and make all of the MMX™ Technology integer instructions available for the 128-bit wide XMM registers, while adding additional support for 32-bit data types
  • Loads and stores can handle 128 bits per clock cycle as opposed to 64 bits
  • All operations except load, store and move have been concentrated in one port
  • There is one SIMD integer ALU as opposed to two
  • The latencies have generally increased

 

Recommendations

  • Port MMX code to SSE2
  • Avoid denormals, NaNs and exceptions, as they incur high penalties; enable the DAZ and FTZ modes for floating-point operations [9] (see the sketch after this list)
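
An illustrative sketch (not from the paper) of enabling both modes via the MXCSR register with SSE intrinsics; FTZ is bit 15 and DAZ is bit 6 of MXCSR, and the DAZ bit must only be set on processors that support it:

#include <xmmintrin.h>  /* _mm_getcsr / _mm_setcsr */

#define MXCSR_FTZ 0x8000u  /* flush-to-zero: denormal results become 0 */
#define MXCSR_DAZ 0x0040u  /* denormals-are-zero: denormal inputs read as 0 */

void enable_ftz_daz( void )
{
    _mm_setcsr( _mm_getcsr() | MXCSR_FTZ | MXCSR_DAZ );
}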

 

Floating-Point Instructions

The main changes are slightly higher latencies, the concentration of all arithmetic operations in one port, and the fact that fxch is now executed rather than being handled by register renaming.

Recommendations

  • Consider switching to the SIMD SSE/SSE2 instructions; this can result in significant speed-ups (up to a factor of 4) due to the lower number of µops, the lower latencies, the full utilization of the floating-point execution units and the 128-bit data buses from and to the caches (see the sketch after this list)
  • Reduce the number of mode switches (also see [12]), as they incur high penalties
  • Avoid the fxch instruction
  • Avoid denormals, NaNs and exceptions, as they incur high penalties
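
As an illustrative sketch of such a switch (not from the paper), the sumarray example from the sub-section "Dealing with Latencies" rewritten with SSE2 intrinsics; it assumes a is 16-byte aligned and n is a multiple of 4, and combines the 128-bit loads with two accumulators to break the dependency chain:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

double sumarray_sse2( const double a[], size_t n )
{
    __m128d sum0 = _mm_setzero_pd();  /* two accumulators break the */
    __m128d sum1 = _mm_setzero_pd();  /* addpd dependency chain     */

    for( size_t z = 0; z < n; z += 4 )
    {
        sum0 = _mm_add_pd( sum0, _mm_load_pd( &a[ z ] ) );
        sum1 = _mm_add_pd( sum1, _mm_load_pd( &a[ z + 2 ] ) );
    }

    double tmp[ 2 ];
    _mm_storeu_pd( tmp, _mm_add_pd( sum0, sum1 ) );
    return( tmp[ 0 ] + tmp[ 1 ] );
}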

 


Conclusion

The NetBurst architecture is designed for very high clock speeds, legacy-instruction-reduced software and parallel processing of arrays via its powerful SSE/SSE2 instructions. Developing software with this and the details given in this paper in mind will ensure that the full potential of the Pentium® 4 processor materializes, resulting in industry-leading performance.


References

 


Further Reading

 


About the Author

Bryan Hayes is the founder and Managing Director of Hayes Technologies (http://www.hayestechnologies.com), a company focused on providing development services for software speed optimization. He has been involved in software development since 1983 and particularly in performance optimization since 1984, even writing complete compilers in assembly while still at school. Bryan Hayes studied Business Administration in St. Gallen, Switzerland, and Hamburg, Germany. Prior to founding Hayes Technologies in 2001 he was with Basler Vision Technologies, a leading company in the field of machine vision, for more than 11 years, holding various key positions: Manager Development, Manager Corporate Development, Manager Research & Technology and Member of the Board of Management. His e-mail address is bryan.hayes@hayestechnologies.com.

