Last Modified On: October 17, 2008 11:25 AM PDT
A Deeper Pipeline and New Cache Structure
The Intel NetBurst® micro-architecture was introduced in the Pentium® 4 and Intel® Xeon™ processors. Although these processors maintain backwards compatibility with all previous Intel processors, the Intel NetBurst micro-architecture has new features that require attention when optimizing an application.
Several new techniques are utilized in the Intel NetBurst micro-architecture feature of the Pentium 4 and Xeon processors that can dramatically improve the performance of a given application. One of the biggest differences in the Intel NetBurst micro-architecture from the P6 micro-architecture used on the Intel® Pentium® Pro, Intel® Pentium® II, and Pentium® III microprocessors is a much deeper pipeline coupled with a new cache structure that allows for significantly faster clock speeds.
To take full advantage of these new features, your applications should be targeted for the Intel NetBurst micro-architecture by compiling with the latest versions of Intel NetBurst micro-architecture-aware tools. Version 5.0 and above of Intel's compilers will take advantage of the Intel NetBurst micro-architecture features of the Pentium 4 and Xeon processors.
Five Core Documents
Five core documents to help you optimize for the Intel NetBurst micro-architecture are currently available online, at no cost, from Intel. The most up-to-date document versions are available in electronic format. If you prefer, hard copies may be ordered as well.
Table 1 - Core Resource Documents
Figure 1 - Intel® Pentium® 4 Processor Optimization Reference Manual Cover Page
This 333-page manual (Order Number 248966) provides detailed information on the Intel NetBurst micro-architecture and code optimization techniques and can be downloaded from Intel's developer support web site: http://developer.intel.com/design/index.htm
Much of the information in this document is derived from the Intel® Pentium® 4 Processor Optimization Reference Manual (see Figure 1). Please refer to that document for additional information and details.
In addition to the five core documents, the following resources will assist you in gaining maximum performance with Intel products.
Coding Pitfalls
This section identifies a few optimization issues with the Intel NetBurst micro-architecture and provides a summary of best practices. Examples are provided that set the stage for further study in the referenced manuals and documentation.
To obtain optimum performance, it is important to know how to avoid coding pitfalls that limit the performance of the target processor. While most useful to assembly-level programmers, a general knowledge of the following will benefit any developer working with the Intel NetBurst micro-architecture.
Table 1 lists several known coding pitfalls that affect performance in the Intel NetBurst micro-architecture (see footnote 1).
| Factors Affecting Performance | Symptom | Example (if applicable) | Section Reference |
| Small, unaligned load after large store | Store-forwarding blocked resulting in increased memory latency | Example 2-10 | Store Forwarding, Store-Forwarding Restriction on Size and Alignment |
| Large load after small store; load dword after store dword, store byte; load dword, AND with 0xff after store byte | Store-forwarding blocked resulting in increased memory latency | Example 2-11, Example 2-12 | Store Forwarding, Store-Forwarding Restriction on Size and Alignment |
| Cache line splits | Access across cache line boundary (2 accesses instead of 1) | Example 2-9 | Align data on natural operand size address boundaries |
| Integer shift and multiply latency | Longer latency than Pentium® III processor | | Use of the Shift and Rotate Instructions, Integer and Floating-Point Multiply |
| Denormal inputs and outputs | Slows x87, SSE, SSE2 floating point operations | | Floating-point Exceptions |
| Cycling more than 2 values of Floating-point Control Word | fldcw now optimized to improve on performance seen in P6 core | | Floating-point Modes |
Table 3 - Factors Affecting Performance in the Pentium® 4 Processor
The store-forwarding issues mentioned in Table 1 are depicted in Figure 2 (see footnote 2).
Figure 2 - Size and Alignment Restrictions in Store Forwarding
The following four examples, referenced in Table 1 above, are republished from the Intel® Pentium® 4 Processor Optimization Reference Manual. They portray scenarios where the mentioned pitfalls may be encountered.
Example 2-9: Code That Causes Cache Line Split
This example moves a block of data, two double words at a time, from one base address to another. The source base address is not 4-byte aligned. This causes 4-byte loads to occasionally cross cache line boundaries and hence require two lines to be loaded.
mov esi, 029e70feh ; not a 4-byte aligned address
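The split condition described in Example 2-9 can be checked with simple address arithmetic. The following is a minimal sketch, not code from the manual: the helper name `crosses_cache_line` and the 64-byte value in `ASSUMED_LINE_SIZE` are assumptions for illustration, and the actual line size should be taken from the documentation for the target processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size, for illustration only. */
#define ASSUMED_LINE_SIZE 64u

/* Returns true when an access of `size` bytes starting at `addr`
   touches more than one cache line of `line_size` bytes. */
static bool crosses_cache_line(uintptr_t addr, size_t size, size_t line_size)
{
    assert(size > 0);
    /* line_size must be a nonzero power of two */
    assert(line_size > 0 && (line_size & (line_size - 1)) == 0);

    uintptr_t first_line = addr / line_size;              /* line holding the first byte */
    uintptr_t last_line  = (addr + size - 1) / line_size; /* line holding the last byte  */
    return first_line != last_line;
}
```

With the source address from the example, `crosses_cache_line(0x029e70fe, 4, ASSUMED_LINE_SIZE)` reports a split, because the 4-byte load begins 62 bytes into a line. A base address that is a multiple of 4 can never split a 4-byte access.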
Example 2-10: Several Situations of Small Loads After Large Store
The following example stores 32 bits to the memory location addressed by a general-purpose register and then loads 8-bit operands from that location into 4 other registers. Only the loads aligned with the store will be forwarded. The rest are blocked. See the rules depicted in Figure 2 that illustrate the alignment issues.
mov [EBP], 'abcd'
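One way to sidestep the blocked byte loads is to reload the stored value once at its full width and extract the individual bytes in registers with shifts and masks. The sketch below illustrates that idea in C; the function names `byte_of` and `split_bytes` are hypothetical, and a compiler is free to generate different loads than the source suggests, so the generated code should be inspected when it matters.

```c
#include <assert.h>
#include <stdint.h>

/* Extracts byte `index` (0 = least significant) from a 32-bit value
   using a shift and a mask, with no additional memory access. */
static uint8_t byte_of(uint32_t word, unsigned index)
{
    assert(index < 4);
    return (uint8_t)((word >> (8u * index)) & 0xFFu);
}

/* Reads the stored dword once, at the same size and address as the
   store, then splits it into bytes in registers. */
static void split_bytes(const uint32_t *stored, uint8_t out[4])
{
    uint32_t word = *stored; /* single full-width load */
    for (unsigned i = 0; i < 4; ++i) {
        out[i] = byte_of(word, i);
    }
}
```

The full-width load matches the preceding store in size and alignment, which is the case the text above describes as forwardable.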
Example 2-11: A Non-forwarding Example of Large Load After Small Store
The following example shows where a large load will be blocked because it is larger than the previous store. This simple case could be easily avoided as noted.
mov [EBP], 'a'
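A common way to avoid this case is to assemble the wide value in a register and write it with a single store of the same size as the later load. The following C sketch shows the idea under that assumption; `pack_bytes` and `store_packed` are hypothetical names, and the compiler ultimately decides which store instructions are emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds a 32-bit value in a register from four bytes
   (b0 becomes the least significant byte). */
static uint32_t pack_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0
         | ((uint32_t)b1 << 8)
         | ((uint32_t)b2 << 16)
         | ((uint32_t)b3 << 24);
}

/* Writes the packed value with one 32-bit store, so that a later
   32-bit load from the same address matches the store in size. */
static void store_packed(uint32_t *dst,
                         uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    assert(dst != NULL);
    *dst = pack_bytes(b0, b1, b2, b3);
}
```

Calling `store_packed(&slot, 'a', 'b', 'c', 'd')` replaces four byte-sized stores with one dword store before `slot` is read back as a dword.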
Example 2-16: Rearranging a Data Structure
This example shows how to make a specific data structure more efficient. The integer elements occupy 32 bits in memory, while each 8-bit character element is padded out to 32 bits so that the element that follows it stays aligned on its natural operand boundary. The compiler may not have the flexibility to rearrange these elements for optimal storage utilization. By rearranging the elements, padding is minimized, reducing the overall size of the structure. Additionally, the new size of the structure allows it to fit evenly within the 128-byte cache lines, reducing the number of loads across multiple cache lines.
struct unpacked { /* fits in 20 bytes due to padding */
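The rearrangement can be sketched as follows. The struct and field names here are hypothetical, since the original listing is not reproduced in full, and the exact sizes depend on the compiler and ABI. On a typical target with 4-byte `int`, the interleaved layout occupies 20 bytes and the rearranged layout occupies 16.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: ints and chars interleaved, so each char is
   followed by padding that keeps the next int aligned. */
struct unpacked_sketch {
    int  a;
    char b;
    int  c;
    char d;
    int  e;
};

/* Same fields, ints first: the two chars share the tail of the
   structure and only trailing padding remains. */
struct rearranged_sketch {
    int  a;
    int  c;
    int  e;
    char b;
    char d;
};

/* Number of bytes saved per structure by the rearrangement. */
static size_t bytes_saved(void)
{
    assert(sizeof(struct rearranged_sketch) <= sizeof(struct unpacked_sketch));
    return sizeof(struct unpacked_sketch) - sizeof(struct rearranged_sketch);
}
```

Printing `sizeof` for both structures, or inspecting field positions with `offsetof`, is a quick way to confirm what the compiler actually produced for a given build.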
The processor can handle spin loops more efficiently if the PAUSE instruction is used. This instruction is compatible with all previous micro-architectures. See page 2-15 of the Intel® Pentium® 4 Optimization Reference Manual for more information.
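A spin-wait loop that issues PAUSE on each iteration might look like the sketch below. It assumes a compiler that provides the `_mm_pause` intrinsic through `<immintrin.h>` on x86 targets; the `CPU_RELAX` macro, the `spin_wait` function, and the bounded iteration count are illustrative choices rather than anything prescribed by the manual.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#if defined(__i386__) || defined(__x86_64__) || defined(_M_IX86) || defined(_M_X64)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()   /* emits the PAUSE instruction */
#else
#define CPU_RELAX() ((void)0)     /* no-op fallback on other architectures */
#endif

/* Spins until *flag becomes nonzero or max_spins iterations have
   elapsed. Returns the number of iterations spent waiting. */
static unsigned spin_wait(atomic_int *flag, unsigned max_spins)
{
    assert(flag != NULL);
    unsigned spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0
           && spins < max_spins) {
        CPU_RELAX();
        ++spins;
    }
    return spins;
}
```

Bounding the loop lets the caller fall back to a blocking wait if the flag is not set quickly; an unbounded spin would simply omit the `max_spins` check.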
Guidelines in the optimization manual can make application optimization a lot easier. This section presents a few suggestions and highlights some key coding best practices.
First, consider the following five coding practices that are even more critical for increasing performance on the new Intel NetBurst micro-architecture feature of the Pentium 4 processor:
Second, understanding the new Intel NetBurst micro-architecture feature of the Pentium 4 processor and how traditional coding techniques affect it will also yield improved performance. Consider the following two insights:
Finally, use the best tools and methodologies available to receive maximum benefit from all that the Intel NetBurst micro-architecture feature of the Pentium 4 and Xeon processors has to offer. Applications should be recompiled with tools that take full advantage of this technology, and it is important to understand and follow the coding rules outlined in the reference manuals and summarized here.
1 Republished from the Intel® Pentium® 4 Processor Optimization Reference Manual: Table 2-1
2 Republished from the Intel® Pentium® 4 Processor Optimization Reference Manual: Figure 2-2
3 See Chapter 2 of the Intel® Pentium® 4 Optimization Reference Manual or Chapter 1 of IA-32 Intel Architecture Software Developer's Manual: Volume 1 for an explanation of the Trace Cache.
