Introducing Intel® NetBurst® MicroArchitecture Optimization


Introduction

A Deeper Pipeline and New Cache Structure

The Intel NetBurst® micro-architecture was introduced with the Pentium® 4 and Intel® Xeon® processors. Although these processors maintain backward compatibility with all previous Intel processors, the Intel NetBurst micro-architecture has new features that require attention when optimizing an application.

The Intel NetBurst micro-architecture of the Pentium 4 and Xeon processors uses several new techniques that can dramatically improve the performance of a given application. One of its biggest differences from the P6 micro-architecture, used in the Intel® Pentium® Pro, Intel® Pentium® II, and Pentium® III processors, is a much deeper pipeline coupled with a new cache structure, which together allow significantly faster clock speeds.

To take full advantage of these new features, target your applications at the Intel NetBurst micro-architecture by compiling them with the latest versions of tools that are aware of it. Version 5.0 and above of Intel's compilers take advantage of the Intel NetBurst micro-architecture features of the Pentium 4 and Xeon processors.


Available Resources and Documentation

Five Core Documents
Five core documents that help you optimize for the Intel NetBurst micro-architecture are currently available online, at no cost, from Intel. The most up-to-date versions are available in electronic format. If you prefer, hard copies may be ordered as well.

Document | Order Number
Intel® 64 and IA-32 Architectures Optimization Reference Manual | 248966-001
Desktop Performance and Optimization for Intel® Pentium® 4 Processor | 24943801
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture | 245470
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2: Instruction Set Reference | 245471
Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3: System Programming Guide | 245472

 

Table 1 - Core Resource Documents


Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization
Figure 1 - Intel® Pentium® 4 Processor Optimization Reference Manual Cover Page

This 333-page manual (Order Number 248966) provides detailed information on the Intel NetBurst micro-architecture and on code optimization techniques. It can be downloaded from Intel's developer support web site.

Much of the information in this document is derived from the Intel® Pentium® 4 Processor Optimization Reference Manual (see Figure 1). Please refer to that document for additional information and details.


Additional Resources

In addition to the five core documents, the following resources will assist you in gaining maximum performance with Intel products.

  • Intel® Developer Zone is a web site that focuses on tools, solutions and resources for software developers.
  • Intel® Software Development Products
    Using Intel's latest compilers will ensure that performance sensitive features of the target processor are utilized. Re-compiling with third party compilers that take advantage of Intel NetBurst micro-architecture features will also yield significant performance benefits.
  • Intel® VTune™ Performance Analyzer support is available for Intel NetBurst micro-architecture in the latest release, version 5.0.
  • Intel® Software Partner Home allows developers to gain access to detailed, technical information, pre-release hardware and software, training and co-marketing opportunities.

 


Identified Optimization Issues and Best Practices

Coding Pitfalls

This section identifies a few optimization issues with the Intel NetBurst micro-architecture and provides a summary of best practices. Examples are provided that set the stage for further study in the referenced manuals and documentation.

To obtain optimum performance, it is important to know how to avoid coding pitfalls that limit the performance of the target processor. While most useful to assembly-level programmers, a general knowledge of the following will benefit any developer working with the Intel NetBurst micro-architecture.

Table 3¹ lists several known coding pitfalls that affect performance in the Intel NetBurst micro-architecture.

Factors Affecting Performance | Symptom | Example (if applicable) | Section Reference
Small, unaligned load after large store | Store-forwarding blocked resulting in increased memory latency | Example 2-10 | Store Forwarding, Store-Forwarding Restriction on Size and Alignment
Large load after small store; load dword after store dword, store byte; load dword, AND with 0xff after store byte | Store-forwarding blocked resulting in increased memory latency | Example 2-11, Example 2-12 | Store Forwarding, Store-Forwarding Restriction on Size and Alignment
Cache line splits | Access across cache line boundary (2 accesses instead of 1) | Example 2-9 | Align data on natural operand size address boundaries
Integer shift and multiply latency | Longer latency than Pentium® III processor | (none) | Use of the Shift and Rotate Instructions, Integer and Floating-Point Multiply
Denormal inputs and outputs | Slows x87, SSE, SSE2 floating point operations | (none) | Floating-point Exceptions
Cycling more than 2 values of Floating-point Control Word | fldcw now optimized to improve on performance seen in P6 core | (none) | Floating-point Modes

 

Table 3 - Factors Affecting Performance in the Pentium® 4 Processor

The store-forwarding issues listed in Table 3 are illustrated in Figure 2².

Figure 2 - Size and Alignment Restrictions in Store Forwarding


Example Scenarios

The following four examples, referenced in Table 3 above, are republished from the Intel® Pentium® 4 Processor Optimization Reference Manual. They portray scenarios where the pitfalls listed there may be encountered.

Example 2-9: Code That Causes Cache Line Split
This example moves a block of data, two double words at a time, from one base address to another. The source base address is not 4-byte aligned. This causes 4-byte loads to occasionally cross cache line boundaries and hence require two lines to be loaded.

    mov esi, 029e70feh          ; not a 4-byte aligned address
    mov edi, 05be5260h
Blockmove:
    mov eax, DWORD PTR [esi]    ; 4-byte load that can span 2 cache lines
    mov ebx, DWORD PTR [esi+4]
    mov DWORD PTR [edi], eax
    mov DWORD PTR [edi+4], ebx
    add esi, 8
    add edi, 8
    sub edx, 1                  ; loop counter, assumed initialized before this fragment
    jnz Blockmove
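To see how often a misaligned block move like the one above splits a cache line, the address arithmetic can be sketched in C. This is only an illustration: the function names are hypothetical, and the line size is a parameter rather than a statement about any particular cache level.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if an access of `size` bytes starting at `addr` touches two
   cache lines. `line_size` must be a power of two (assumption). */
int crosses_cache_line(uintptr_t addr, size_t size, size_t line_size)
{
    uintptr_t mask       = ~(uintptr_t)(line_size - 1);
    uintptr_t first_line = addr & mask;
    uintptr_t last_line  = (addr + size - 1) & mask;
    return first_line != last_line;
}

/* Count how many of `count` back-to-back accesses of `size` bytes,
   starting at `base`, are split across two cache lines. */
size_t count_line_splits(uintptr_t base, size_t count,
                         size_t size, size_t line_size)
{
    size_t splits = 0;
    for (size_t i = 0; i < count; i++) {
        if (crosses_cache_line(base + i * size, size, line_size))
            splits++;
    }
    return splits;
}
```

Assuming a 64-byte line, count_line_splits(0x029e70fe, 32, 4, 64) reports 2 split accesses out of 32, while the 4-byte aligned base 0x029e7100 reports none. That matches the "occasionally" in the description of Example 2-9: only the access that straddles each line boundary is affected.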

 

Example 2-10: Several Situations of Small Loads After Large Store
The following example stores a 32-bit value to memory and then loads 8-bit operands from that location into registers. Only the loads aligned with the start of the store are forwarded. The rest are blocked. See the rules depicted in Figure 2, which illustrate the alignment issues.

    mov DWORD PTR [EBP], 'abcd'
    mov AL, [EBP]        ; not blocked - same alignment
    mov BL, [EBP + 1]    ; blocked
    mov CL, [EBP + 2]    ; blocked
    mov DL, [EBP + 3]    ; blocked
    mov AL, [EBP]        ; not blocked - same alignment
                         ; n.b. passes older blocked loads

 

Example 2-11: A Non-forwarding Example of Large Load After Small Store
The following example shows a large load that is blocked because it is larger than the preceding stores. This simple case can easily be avoided, as noted in the closing comment.

    mov BYTE PTR [EBP], 'a'
    mov BYTE PTR [EBP + 1], 'b'
    mov BYTE PTR [EBP + 2], 'c'
    mov BYTE PTR [EBP + 3], 'd'
    mov EAX, [EBP]       ; blocked
    ; The first four small stores can be consolidated into
    ; a single DWORD store to prevent this non-forwarding situation.
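The behavior shown in Examples 2-10 and 2-11 can be summarized as a small predicate. This is a deliberately simplified model built only from what those two examples show (same start address, load no larger than the store). It is not a description of the processor's actual forwarding logic, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model (assumption): a load can be forwarded from an earlier
   store only if it starts at the same address as the store and is no
   larger than the store. */
bool can_forward(uintptr_t store_addr, size_t store_size,
                 uintptr_t load_addr, size_t load_size)
{
    return load_addr == store_addr && load_size <= store_size;
}
```

Under this model, a byte load from the start of a 4-byte store is forwarded, a byte load from offset 1 is blocked, and a 4-byte load after a 1-byte store is blocked. Refer to Figure 2 and the reference manual for the full set of rules.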

 

Example 2-16: Rearranging a Data Structure
This example shows how to make a specific data structure more efficient. The integer elements consume 32 bits in memory, while each character element is padded from 8 to 32 bits so that the integer that follows it aligns on its natural operand boundary. The compiler may not have the flexibility to rearrange these elements for optimal storage utilization. Rearranging the elements by hand minimizes padding and reduces the overall size of the structure. Additionally, the new size allows the structure to fit evenly within the 128-byte cache lines, reducing the number of loads that span multiple cache lines.

struct unpacked { /* fits in 20 bytes due to padding */
    int a;
    char b;
    int c;
    char d;
    int e;
};

struct packed { /* fits in 16 bytes */
    int a, c, e;
    char b, d;
};
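The sizes quoted in the comments depend on the compiler and ABI, so it can be worth checking them on the target in use. The following is a minimal sketch using sizeof and offsetof; the helper names padding_saved and print_layouts are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

struct unpacked { int a; char b; int c; char d; int e; };
struct packed   { int a, c, e; char b, d; };

/* Bytes saved by the rearranged layout on the current compiler/ABI. */
size_t padding_saved(void)
{
    return sizeof(struct unpacked) - sizeof(struct packed);
}

/* Print the total size and the offsets of the char members, which show
   where the compiler inserted padding. */
void print_layouts(void)
{
    printf("unpacked: size=%zu b@%zu c@%zu d@%zu e@%zu\n",
           sizeof(struct unpacked),
           offsetof(struct unpacked, b), offsetof(struct unpacked, c),
           offsetof(struct unpacked, d), offsetof(struct unpacked, e));
    printf("packed:   size=%zu b@%zu d@%zu\n",
           sizeof(struct packed),
           offsetof(struct packed, b), offsetof(struct packed, d));
}
```

On a typical ABI with a 4-byte int aligned to 4 bytes, this should report 20 bytes for the first layout and 16 for the second, in line with the comments in Example 2-16. Other ABIs may differ.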

 

The processor can handle spin loops more efficiently if the PAUSE instruction is used. This instruction is compatible with all previous micro-architectures. See page 2-15 of the Intel® Pentium® 4 Optimization Reference Manual for more information.
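As a rough illustration, a spin-wait loop can issue PAUSE from C through the _mm_pause() intrinsic. The sketch below assumes a C11 compiler with <stdatomic.h>; the cpu_relax macro and the spin_until_set function are hypothetical names, and the non-x86 branch is only a placeholder so the code still compiles elsewhere.

```c
#include <stdatomic.h>

#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
#include <emmintrin.h>               /* declares _mm_pause() */
#define cpu_relax() _mm_pause()      /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)        /* placeholder on non-x86 targets */
#endif

/* Spin until *flag becomes nonzero. Returns how many times the loop
   body ran, which is 0 if the flag was already set. */
unsigned long spin_until_set(atomic_int *flag)
{
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        cpu_relax();                 /* hint: this is a spin-wait loop */
        spins++;
    }
    return spins;
}
```

Another thread would set the flag with atomic_store_explicit(flag, 1, memory_order_release). A loop like this is only a building block. Real code would normally bound the spin and fall back to a blocking wait.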


Coding Best Practices

Guidelines in the optimization manual can make application optimization a lot easier. This section presents a few suggestions and highlights some key coding best practices.

First, consider the following five coding practices, which are even more critical for performance on the Intel NetBurst micro-architecture of the Pentium 4 processor:

  • Use good branch prediction.
  • Avoid memory access stalls.
  • Choose the appropriate instructions, including use of SIMD instructions.
  • Examine instruction scheduling to maximize trace cache³ bandwidth.
  • Use vectorization.
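As one hedged illustration of the vectorization and SIMD points above, a loop is easier for a vectorizing compiler to handle when it has unit-stride accesses, no branches in its body, and pointers that are known not to alias. The function name saxpy is an assumption chosen for this sketch, and whether SIMD instructions are actually generated depends on the compiler and the options used.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   `restrict` promises the compiler that x and y do not overlap, which
   removes one common obstacle to automatic vectorization. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiler reports or a profiler such as the VTune analyzer can confirm whether the loop was vectorized on a given build.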

 

Second, understanding the Intel NetBurst micro-architecture of the Pentium 4 processor, and how traditional coding techniques affect it, will also yield improved performance. Consider the following two insights:

  • Excessive loop unrolling can thrash the trace cache. For example, unrolling can be harmful if the loop no longer fits in the trace cache. Page 2-20 of the Intel® Pentium® 4 Optimization Reference Manual begins a discussion of the factors to weigh when unrolling loops.
  • Longer cache lines have several performance implications. For example, sparse data structures should be avoided. Data organization should always be considered to optimize cache usage. Page 2-31 of the Intel® Pentium® 4 Optimization Reference Manual discusses data layout optimizations.
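The cost of a sparse layout with long cache lines can be estimated with simple arithmetic. The sketch below is hypothetical throughout: the record type, the function names, and the assumption that arrays start on a line boundary are all illustrative, and the 128-byte line size is the figure quoted earlier in this article.

```c
#include <stddef.h>

#define LINE_BYTES 128u   /* assumption: line size quoted in this article */

/* Hypothetical record: one frequently read key plus rarely used data. */
struct record {
    int  key;
    char rarely_used[124];
};

/* Cache lines touched when reading only the key of each of n records
   stored as an array of struct record. */
size_t lines_touched_interleaved(size_t n)
{
    size_t stride        = sizeof(struct record);
    size_t keys_per_line = stride >= LINE_BYTES ? 1 : LINE_BYTES / stride;
    return (n + keys_per_line - 1) / keys_per_line;
}

/* Cache lines touched when the same n keys live in a dense int array. */
size_t lines_touched_dense(size_t n)
{
    size_t keys_per_line = LINE_BYTES / sizeof(int);
    return (n + keys_per_line - 1) / keys_per_line;
}
```

With a 4-byte int, scanning 1000 keys touches 1000 lines in the interleaved layout but only 32 lines in the dense one, which is the kind of difference the data layout discussion in the manual is concerned with.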

 

Finally, use the best tools and methodologies available to receive maximum benefit from all that the Intel NetBurst micro-architecture of the Pentium 4 and Xeon processors has to offer. Applications should be recompiled with tools that take full advantage of this technology, and it is important to understand and follow the coding rules outlined in the reference manuals and summarized here.

1 Republished from the Intel® Pentium® 4 Processor Optimization Reference Manual: Table 2-1

2 Republished from the Intel® Pentium® 4 Processor Optimization Reference Manual: Figure 2-2

3 See Chapter 2 of the Intel® Pentium® 4 Optimization Reference Manual or Chapter 1 of IA-32 Intel Architecture Software Developer's Manual: Volume 1 for an explanation of the Trace Cache.

