Optimizing Software for Intel® Centrino® Mobile Technology and Intel® NetBurst™ Microarchitecture

by James Rose

Introduction

So you're a software developer and you want the best performance for your software on Intel's latest generation desktop and mobile processors? This may seem like a daunting task, but by following a few guidelines you can significantly improve the performance of applications targeted to run on the Intel® NetBurst™ microarchitecture and Intel® Pentium® M processors. This paper focuses on performance optimization for the desktop NetBurst microarchitecture and the mobile Pentium M processor. Other, non-performance-related mobility optimizations, such as reducing power consumption to extend battery life and techniques for occasionally connected applications, are covered in detail in the documents referenced at the end of this paper. The following sections describe important application- and source-level practices, tools, coding rules, and recommendations that will help you optimize your application's performance on the latest IA-32 processors. The majority of the coding guidelines benefit both processors based on the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture; some benefit one microarchitecture and have negligible or small performance impact on the other. As a whole, these coding rules enable software to be optimized for the common performance features of the Intel NetBurst and Pentium M processor microarchitectures.


Optimize Performance Using the Best Features of Each Processor

For maximum performance, take advantage of the best available features on each processor, such as Hyper-Threading Technology, Streaming SIMD Extensions, and increased cache size. Where optimum performance on all processor generations is desired, an application can use the CPUID instruction to identify the features available at runtime, allowing you to integrate processor-specific instructions (such as SSE2), or additional threads on Hyper-Threading Technology enabled processors, into the source code where appropriate.
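
As a simple sketch of this run-time check, the following code assumes the __cpuid intrinsic provided by recent Intel and Microsoft compilers; the helper function names are illustrative only.

#include <intrin.h>   // __cpuid intrinsic (Intel and Microsoft compilers)

// Returns true if the processor reports SSE2 support (CPUID leaf 1, EDX bit 26).
bool HasSse2()
{
    int info[4] = { 0 };   // EAX, EBX, ECX, EDX
    __cpuid(info, 1);
    return (info[3] & (1 << 26)) != 0;
}

// Returns true if the processor reports Hyper-Threading Technology capability
// (CPUID leaf 1, EDX bit 28). This flag indicates capability only; the feature
// may still be disabled by the BIOS or operating system.
bool HasHyperThreading()
{
    int info[4] = { 0 };
    __cpuid(info, 1);
    return (info[3] & (1 << 28)) != 0;
}

An application can test these flags once at startup and then select an SSE2 code path, or create additional worker threads, only when the processor actually supports those features.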

The compiler can also help you transparently use available features of the latest IA-32 processors. For SSE and SSE2 instructions, the Intel® C++ Compiler supports the integration of different versions of the code for each target processor within the same binary library or executable. The selection of which code path to execute at runtime is made based on the CPU identifier that is read with the CPUID instruction. Binary code targeted for different processor generations can be generated either under the control of the programmer or automatically by the compiler. This capability is discussed in more detail in later sections.

For applications that must run on both the Pentium® 4 and Pentium M processors with minimum binary code size and a single code path, a compatible code strategy is best. Optimizing your application for the Intel NetBurst microarchitecture is likely to deliver high performance, efficiency and scalability when running on processors based on current and future generations of IA-32 processors including the Pentium M processor.


Optimizations and Power Consumption

What about power consumption on mobile computers? If your software is optimized, will it consume more power on a mobile system? Typically, code optimized for performance is also good for power, since the task completes faster and the processor can enter an idle or low-power state earlier, or the same application can run at a lower voltage and frequency for the same performance level. In general, performance optimizations tend to reduce power consumption as long as the overall runtime or CPU utilization is reduced, allowing a mobile processor to run in a lower power state. Keep in mind, however, that the processor is only one part of overall power consumption in a mobile PC, and there are other ways to reduce overall power consumption by involving other system components such as the hard drive, CD/DVD drives, and LCD display. Many of these techniques are discussed in the whitepaper Application Power Management for Mobility.


Using the Intel C++ Compiler for Optimum Performance

The fastest and most efficient way to achieve the best performance for your application is to use a current-generation compiler. Compilers such as the Intel C++ Compiler version 7.1 and the Microsoft .Net 2003* C++ compiler have been designed to maximize performance on the latest Intel microprocessors. The current versions of these compilers produce optimized code and ensure that code targeted for the NetBurst microarchitecture also performs very well on Pentium M processors and future IA-32 processors. The next two sections of this paper discuss features of the Intel C++ and Microsoft .Net 2003 compilers.

To take best advantage of the capabilities of the Intel compiler, you need to understand which switches will likely produce the most benefit for your application. The following paragraphs discuss some of the most important switches to set to achieve maximum benefit from the compiler.

Produce Code Optimized for the Pentium 4 and Pentium M Processors

By default, the Intel C++ compiler uses the -O2 option for release builds, optimizing for best performance. If code size is critical for your application, the -O1 option may be used to help reduce code size expansion due to optimization. Experiment with both options to determine which one gives you the best combination of performance and code size.

The compiler includes switches to target specific processor implementations, including the -G7 switch, which tells the compiler to optimize for the NetBurst microarchitecture, the Pentium M processor, and later processors. By default, the Intel C++ compiler uses the -G7 setting, since it is more likely to yield significant performance benefits on the latest processors. The compiler does not introduce any Pentium 4 or Pentium M processor-specific instructions into the binary when using this switch, so code generated with -G7 remains backward compatible with previous processors. The use of this switch ensures the best performance on Pentium 4 and Pentium M processors through improved instruction scheduling heuristics, better instruction selection, store-forwarding penalty avoidance, and code generation that reduces the number of partial stalls and other penalties.

Use The Compiler Switches for Interprocedural Optimization

Interprocedural optimizations have the potential to increase performance outside of the scope of a function or procedure, and can produce significant benefits. Some of these include inline function expansion across different modules, optimized argument passing, interprocedural constant propagation and others.

There are two switches that control interprocedural optimization: -Qip for single-file and -Qipo for multi-file interprocedural optimization. With -Qip, the compiler performs optimizations on functions defined within the current source file. When you use -Qipo to specify multi-file IPO, the compiler may also perform optimizations on functions defined in separate files. For this reason, when you specify -Qipo, it is important to compile the entire application, or multiple related source files, together. The -Qip and -Qipo options can, in some cases, significantly increase compile time and code size.

Use Switches for SIMD Optimizations and Vectorization

Some of the most important performance gains can be realized by utilizing the MMX, SSE, and SSE2 SIMD instructions in applicable code. Using these instructions can significantly improve performance in specific application areas, and the Intel compiler's automatic vectorization capabilities can help your application take advantage of these instructions.

There are two types of switches that produce SIMD instructions: the -Qax{M|K|W} options produce optimized code along with a generic code path for non-compatible processors, while the -Qx{M|K|W} options produce optimized code only for the selected target instruction set. The M, K, and W switches enable MMX, SSE, and SSE2 instructions, respectively. For example, an application targeting automatic vectorization for both SSE and SSE2 instructions with support for earlier processors would include the options -QaxK and -QaxW. An application targeting automatic vectorization for SSE and SSE2 without additional code paths for earlier processors would include the -QxK and -QxW options. If you decide to use the -QxK and/or -QxW options, remember that the generated code is incompatible with earlier processors and will cause exceptions on processors that don't support these instructions.

The best performance for the NetBurst microarchitecture and Pentium M processors is likely to be achieved by targeting the SSE2 instruction set using the -QaxW or -QxW options. If your code includes single-precision floating point, also include the -QaxK or -QxK option. There are several other important things that you can do to help make automatic vectorization more effective; some performance hints to help the compiler vectorize your code are included in the section "Optimize Floating-Point Performance and Vectorization" later in this paper.
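
For illustration, a simple loop such as the following (the function and array names are hypothetical) is a good candidate for the compiler's automatic vectorizer when -QxW or -QaxW is specified: the accesses are unit-stride and there are no dependencies between iterations.

// Candidate for automatic vectorization: unit-stride accesses, a simple
// induction variable, and no inter-iteration dependencies.
void ScaleAndAdd(float* a, const float* b, const float* c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 2.0f + c[i];
}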

In addition to enabling automatic vectorization, the -QxW and -QaxW options also enable the generation of other latest-generation instructions, such as automatic software prefetch and conditional move instructions. Each of these is discussed in later sections of this paper.


Using the Microsoft .Net 2003 C++ Compiler for Optimum Performance

The Microsoft .Net 2003 C++ Compiler also includes important capabilities to improve performance on the latest generation Intel processors. The most important features include Pentium 4 and Pentium M tuned optimizations with the /G7 flag, the /GL whole program optimization switch (similar to the Intel compiler -Qipo switch) and the /arch:SSE2 flag which can emit scalar SSE/SSE2 code.

Produce Code Optimized for the Pentium 4 and Pentium M Processors

By default, the Microsoft C++ compiler uses the /O2 option for release builds, which optimizes for the best performance. If code size is critical, the /O1 option may be used to help reduce any code size expansion due to optimization. Again, it is best to experiment with these two options, since in some cases /O1 may produce faster code.

The .Net 2003 C++ compiler includes switches to target specific processor implementations, the /G6 switch for Pentium Pro, Pentium II, and Pentium III processors and the /G7 switch for Pentium 4 and later processors. By default, the .Net compiler uses the /G6 switch, but the /G7 switch is recommended since it is more likely to yield significant performance benefits on the latest processors.

Like the analogous switch in the Intel compiler, the /G7 switch tells the compiler to optimize for the NetBurst microarchitecture and the Pentium M processor. With this switch, the compiler does not introduce any Pentium 4 or Pentium M processor-specific instructions into the binary, so the generated code remains backward compatible across processor generations. The use of this switch helps ensure the best performance on Pentium 4 and Pentium M processors through optimizations similar to those in the Intel C++ compiler.

Use The Compiler Switches for Interprocedural Optimizations

The /GL whole-program optimization switch enables the compiler to perform optimizations with information on all modules in the program. Whole-program optimization is off by default and must be explicitly enabled. With information on all modules, the compiler can optimize the use of registers across function boundaries and inline a function even when the function is defined in another module. If you compile your program with /GL, you should also use the /LTCG linker option to create the output file.

Use Switches for SIMD Optimizations

The .Net 2003 compiler supports generation of code using Streaming SIMD Extensions (SSE) and Streaming SIMD Extensions 2 (SSE2) instructions. For example, /arch:SSE allows the compiler to use SSE instructions, and /arch:SSE2 allows the compiler to use SSE2 instructions.

The optimizer chooses when and how to make use of the SSE and SSE2 instructions when /arch is specified. Currently SSE and SSE2 instructions are used for some scalar floating-point computations, when it is determined that it is faster to use the SSE/SSE2 instructions and registers rather than the x87 floating-point register stack. As a result, your code will actually use a mixture of both x87 and SSE/SSE2 for floating-point computations. Additionally, with /arch:SSE2, SSE2 instructions may be used for some 64-bit integer operations.

In addition to making use of the SSE and SSE2 instructions, the compiler also makes use of other instructions that are present on the processor revisions that support SSE and SSE2 such as the CMOV instruction.

Note that the .Net 2003 compiler does not vectorize for SSE or SSE2 instructions, which limits the overall performance benefit of the .Net 2003 compiler for floating point intensive applications compared to the Intel C++ compiler.


Use Current Generation Performance Monitoring Tools

After you've compiled using a current-generation compiler, how can you be sure that your application isn't suffering from other problems that limit overall performance? Current-generation performance monitoring tools, such as the Intel® VTune™ Performance Analyzer, can help you identify performance issues using event-based sampling, estimate the performance impact of those issues, and suggest ways to resolve them through code changes. An important principle of software optimization is to optimize only the parts of the code that consume enough CPU time to affect overall performance. VTune can help ensure that you don't spend valuable development resources trying to fix what isn't broken. Before focusing on specific performance issues, make sure that you have properly identified them.


Application and Source Level Optimization Tips

The remaining sections of this document outline basic source and application level tips that help promote excellent performance on NetBurst microarchitecture and Pentium M processors. For some of these tips, such as branch and cache optimizations, it is important to focus optimization efforts on application hotspots instead of trying to optimize parts of a program which don't significantly impact performance.

Optimize Branch Predictability and Performance

If your program has a significant amount of mispredicted branches, branch optimizations may offer a great deal of potential performance improvement. Understanding the flow of branches and improving the predictability of branches can improve performance significantly. Using VTune, try to identify areas of your code that may have a larger percentage of mispredicted branches so that you can focus on the real problem areas in your code.

Eliminate Unnecessary Branches

In high-performance processors such as the Pentium 4 and Pentium M processors, eliminating unnecessary branches improves performance by reducing mispredictions and reducing the pressure on hardware branch prediction resources. Note that every branch affects performance, since even correctly predicted branches reduce the amount of useful code delivered to the processor.

Possible ways to eliminate branches include optimizing code flow, making basic blocks contiguous, enabling compiler switches that make use of the cmov and setcc instructions, and applying other code optimizations such as the rewrite sketched below.
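
As a small, hedged example of the kind of rewrite the compiler can perform (or that you can perform by hand), a data-dependent branch can often be replaced by a conditional expression that compiles to a cmov instruction instead of a branch; the function names here are illustrative.

// Branchy form: mispredicts frequently when the comparison outcome is random.
int ClampBranch(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

// Branchless form: the compiler can implement the conditional expression
// with a cmov instruction, removing the branch entirely.
int ClampCmov(int x, int limit)
{
    return (x > limit) ? limit : x;
}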

Arrange Code to Improve Branch Predictability

Sometimes the most common code path of a conditional expression, as it is initially written, doesn't match up well with the processor's branch prediction algorithms. When a branch instruction is encountered for the first time, the processor statically assumes that backward branches will be taken and forward branches will not be taken. You can improve branch predictability and optimize instruction prefetching by arranging code to be consistent with these static branch prediction assumptions. In the case of an if-else statement, for example, this means arranging the statement so that the code executed most often is in the if body and falls through, rather than in the else body.

// Forward conditional branches are statically predicted not taken
// (fall through), so place the most common case in the if body.
if (condition) {
    ...   // most frequently executed path falls through here
}
else {
    ...   // less common path
}

// The forward branch that guards a for loop also falls through
// into the loop body when the condition holds.
for (condition) {
    ...
}

// Backward conditional branches are statically predicted taken.
// In each of the following cases, the backward branch at the bottom
// of the loop is predicted to be taken back to the top.
for (condition) {
    ...
}

do {
    ...
} while (condition);

Inline Functions According to Coding Recommendations

Generally, small functions that would otherwise suffer from excessive call overhead, such as C++ Get/Set methods, should continue to be declared with the __inline qualifier. Beyond that, current-generation compilers are best able to determine when function inlining improves performance. By compiling with the recommended settings, automatic function inlining is likely to improve performance rather than degrade it, as excessive manual inlining can sometimes do. Please see page 2-23 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual for details. Where possible, also avoid indirect calls and virtual functions in C++, since these calls incur greater overhead at function call time.
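
As a simple sketch (the class and member names are hypothetical), small accessor methods like these are good candidates for the __inline qualifier, since the call overhead would otherwise dominate the one-instruction body:

class Particle
{
public:
    __inline float GetMass() const      { return m_mass; }   // trivial accessor
    __inline void  SetMass(float mass)  { m_mass = mass; }   // trivial mutator
private:
    float m_mass;
};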

Optimize Spin-Wait Loops

If you write your own synchronization primitives, make sure that your code follows the best advice for optimal performance. This technique is described in detail in Chapter 7 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.
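
The key recommendation is to place a pause instruction inside the wait loop. The following is a minimal sketch, assuming the _mm_pause intrinsic and the Win32 InterlockedExchange function; a production-quality lock would also need back-off and fairness considerations.

#include <windows.h>     // InterlockedExchange
#include <emmintrin.h>   // _mm_pause

// Acquire: try to take the lock, then spin read-only with pause until it frees.
void AcquireSpinLock(volatile long* lock)
{
    while (InterlockedExchange(lock, 1) != 0)
    {
        while (*lock != 0)
            _mm_pause();   // hint to the processor that this is a spin-wait loop
    }
}

// Release: clear the lock variable.
void ReleaseSpinLock(volatile long* lock)
{
    *lock = 0;
}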

Optimize Memory Accesses

Memory optimizations can also improve performance significantly. This section discusses some of the most important application- and source-level practices that help to improve performance. These techniques are described in detail in Chapter 6 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual.


Optimize for Increased Cache Size

In high-performance applications, take advantage of increased cache sizes where possible by dynamically detecting the cache size at runtime with the cpuid instruction and adjusting performance-critical code accordingly. Optimize data structures to fit either in one-half of the first-level cache or in the second-level cache. Optimizing for one-half of the first-level cache brings the greatest performance benefit. If one-half of the first-level cache is too small to be practical, optimize for the second-level cache. Optimizing for a point in between (for example, for the entire first-level cache) will likely not bring a substantial improvement over optimizing for the second-level cache. Although current compilers often do a good job of enhancing locality, also consider manual techniques such as blocking, loop interchange, and loop skewing, as described in the article Cache Blocking Technique on Hyper-Threading Technology Enabled Processors and sketched below.
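
The following is a minimal sketch of loop blocking for a matrix transpose, assuming a square matrix whose dimension is a multiple of the block size; N and BLOCK are illustrative and should be chosen so that a tile fits in the targeted cache level.

const int N = 1024;     // matrix dimension (assumed to be a multiple of BLOCK)
const int BLOCK = 64;   // tile size, tuned so a tile fits in the target cache

void TransposeBlocked(float* dst, const float* src)
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            // Work on one BLOCK x BLOCK tile at a time so the cache lines
            // it touches stay resident while they are reused.
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j * N + i] = src[i * N + j];
}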

Enable prefetch generation in your compiler by using the /QxW or /QaxW switch in the Intel C++ compiler or the /arch:SSE2 switch in the .Net 2003 C++ compiler. As compilers' prefetch implementations improve, automatic prefetch insertion by the compiler may outperform manual insertion for all but code-tuning experts. If you are using a compiler that does not support software prefetching, intrinsics or inline assembly may be used to insert prefetch instructions manually. Chapter 6 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual contains an example of using software prefetch to implement a memory copy algorithm.

If a load is found to miss frequently with significant negative performance impact, first try moving the load up so that it executes earlier. If that change doesn't reduce the number of load misses, insert a prefetch before the load of the data, as in the sketch below. Be aware that manual prefetching is independent of the hardware prefetching capabilities of both the NetBurst microarchitecture and the Pentium M processor: the mechanisms are separate, hardware prefetching is not improved by manual prefetching, and excessive manual prefetching can degrade performance.
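
The sketch below shows manual prefetch insertion with the _mm_prefetch intrinsic; the prefetch distance is illustrative and must be tuned, and for a simple unit-stride loop like this the hardware prefetcher would normally do the job on its own.

#include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_T0

const int PREFETCH_AHEAD = 16;   // illustrative distance, in elements

float SumArray(const float* data, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
    {
        // Request a future element early so it is in cache when needed.
        if (i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char*)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}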


Align and Organize Data for Better Performance

Unaligned data can be another potentially serious performance problem. The guidelines in this section help you minimize performance losses due to unaligned data. Remember to focus on data elements in the most CPU-intensive parts of your program.

Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16 byte boundaries. For best performance, align data as follows:

  • Align 8-bit data at any address.
  • Align 16-bit data to be contained within an aligned four byte word.
  • Align 32-bit data so that its base address is a multiple of four.
  • Align 64-bit data so that its base address is a multiple of eight.
  • Align 80-bit data so that its base address is a multiple of sixteen.
  • Align 128-bit data so that its base address is a multiple of sixteen.

 

Also, pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary. If the operands are packed in a SIMD instruction, align to the packed element size (64- or 128-bit). Align data by providing padding inside structures and arrays. Programmers can reorganize structures and arrays to minimize the amount of memory wasted by padding.

The __declspec(align(sizeInBytes)) extended attribute is supported in both the Intel and Microsoft compilers and causes the compiler to align variables on the specified boundary. For example, the following declaration ensures that the variable signMask is aligned on a 16-byte boundary, suitable for vector instruction processing.

__declspec(align(16)) static const int signMask[4] = {-1,-1,-1,-1};

 

Employ data structure layout optimization to make efficient use of the 64-byte cache line. Sometimes frequently used fields of a data structure are more than 64 bytes apart, with other, less frequently used data in between. If these fields are rearranged so that they are close together, they are more likely to fall on the same cache line, which can reduce the number of cache misses and the memory footprint loaded into the data cache, as illustrated below.
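
A hedged example of this kind of rearrangement is shown below; the structure and field names are hypothetical. In the first layout the two fields touched every frame are separated by a large, rarely used buffer; in the second they share a cache line.

// Before: hot fields separated by 256 bytes of cold data,
// so they land on different cache lines.
struct RenderObjectBefore
{
    float position[3];   // accessed every frame
    char  name[256];     // rarely accessed
    float velocity[3];   // accessed every frame
};

// After: frequently accessed fields grouped together at the front.
struct RenderObjectAfter
{
    float position[3];   // hot
    float velocity[3];   // hot, now adjacent to position
    char  name[256];     // cold data moved to the end
};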

Ensure proper data alignment to prevent data from being split across a cache line boundary. In some cases a data element can span a cache line boundary, causing two separate cache lines to be accessed to retrieve the data. This can be a very costly performance limiter. Using VTune, you can identify cases where this is occurring and realign or re-lay out the data to prevent it.


Special Memory Considerations for Hyper-Threading Technology

False sharing occurs when two separate threads repeatedly access different data that happens to reside on the same cache line, and at least one of them writes to it. Beware of false sharing within a cache line (64 bytes) for Pentium 4, Intel Xeon, and Pentium M processors, and within a sector of 128 bytes on Pentium 4 and Intel Xeon processors. If you detect this problem with VTune in data not associated with synchronization, consider padding the data so that the variables are located on separate cache lines (typically 128 bytes or more apart), as in the sketch below.
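
One common fix is to pad and align per-thread data so that each thread's variables occupy their own 128-byte block. A minimal sketch (the names are illustrative) using the __declspec(align) attribute discussed earlier:

const int MAX_THREADS = 4;

// Each counter occupies its own 128-byte-aligned block, so threads
// updating different counters never touch the same cache line or sector.
struct __declspec(align(128)) PaddedCounter
{
    long count;
    char pad[128 - sizeof(long)];
};

PaddedCounter g_counters[MAX_THREADS];   // one element per thread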

Consider using a special memory allocation library to avoid aliasing. One way to implement a memory allocator to avoid aliasing is to allocate more than enough space and pad. For example, allocate structures that are 68 KB instead of 64 KB to avoid the 64 KB aliasing, or have the allocator pad and return random offsets that are a multiple of 128 Bytes (the size of a cache line).

When padding variable declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines, suggesting an offset of 128 bytes or more.

These topics, and many others concerning optimizing for Hyper-Threading Technology, are covered in detail on Intel's developer Web site.


Optimize Floating-point Performance and Vectorization

How can you make sure that your application has great floating-point performance? This section includes application- and source-level guidelines to help you achieve excellent floating-point performance.

As mentioned earlier in this paper, the quickest way to optimize floating point intensive programs is to enable the compiler's use of SIMD instructions with appropriate switches. These switches help your application take advantage of the SIMD capabilities of Streaming SIMD Extensions (SSE), and Streaming SIMD Extensions 2 (SSE2) instructions.

Use the smallest possible floating-point or SIMD data type to enable more parallelism through more elements per SIMD vector. For example, use single precision instead of double precision and short integers instead of long integers where possible. The integer instructions of the SIMD extensions are primarily targeted at 16-bit operands; not all of the operators are supported for 32-bit operands, so some source code cannot be vectorized unless smaller operands are used.

Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. In particular, avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a later iteration, which is called a lexically backward dependence.

Avoid the use of conditionals inside loops and try to keep induction (loop) variable expressions simple. Also try to replace pointers with arrays and indices, as in the sketch below.
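
For example (the function names are illustrative), a pointer-walking loop can be rewritten with an array and an index so that the vectorizer can see the unit-stride access pattern:

// Pointer form: harder for the vectorizer to analyze.
void ScalePtr(float* p, int n, float s)
{
    while (n-- > 0)
        *p++ *= s;
}

// Array/index form: a simple induction variable and unit-stride accesses,
// which the vectorizer handles much more readily.
void ScaleIdx(float* a, int n, float s)
{
    for (int i = 0; i < n; i++)
        a[i] *= s;
}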

Avoid denormalized input values, denormalized output values, and explicit constants that could cause denormal exceptions. Out-of-range numbers cause very high overhead.

Do not use double precision unless necessary. Set the precision-control (PC) field in the x87 FPU control word to "Single Precision". This allows single-precision (32-bit) computation to complete faster for some operations (for example, divides, due to early-out). However, be careful not to introduce more than a total of two values for the floating-point control word, or there will be a large performance penalty.
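
With the Microsoft and Intel run-time libraries, one way to set the precision-control field is the _controlfp function, as in this minimal sketch; call it once at start-up so the control word keeps a single value.

#include <float.h>   // _controlfp, _PC_24, _MCW_PC

void UseSinglePrecisionX87()
{
    // Set the x87 precision-control field to a 24-bit significand
    // (single precision) for faster divides and square roots.
    _controlfp(_PC_24, _MCW_PC);
}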

Dependence chains can sometimes hurt performance by introducing artificial dependencies that are an artifact of how an expression is written rather than true data dependencies. For best performance, break dependence chains where possible. The following shows an example dependence chain and a simple rewrite that improves overall performance and parallelism.

To calculate z = a + b + c + d, instead of

x = a + b;
y = x + c;
z = y + d;

use

x = a + b;
y = c + d;
z = x + y;

Hand Coded SIMD Optimizations

In some cases, complete vectorization is not possible and you may want to include hand-coded SIMD instructions for the best possible performance. There are several excellent resources on developer.intel.com to help you create optimized SIMD code that can significantly improve your application's performance in CPU-intensive code.

To help reduce the impact of denormal inputs or outputs when using assembly or intrinsics, be sure to enable flush-to-zero (FTZ) mode and denormals-are-zero (DAZ) mode, as described on page 2-58 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual and sketched below.
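
A minimal sketch of enabling both modes is shown below. FTZ is set through the helper macro in xmmintrin.h; DAZ is set by writing MXCSR bit 6 directly, since it is only available on processors that support it and the DAZ helper macro is not present in all headers of this generation.

#include <xmmintrin.h>   // _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE

void EnableFtzDaz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);   // flush-to-zero (MXCSR bit 15)
    _mm_setcsr(_mm_getcsr() | 0x0040);            // denormals-are-zero (MXCSR bit 6)
}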

In addition, be sure to use the fast float-to-int conversion instructions cvttss2si and cvttsd2si when coding with Streaming SIMD Extensions 2.


Summary

In this paper, we have considered some of the most important application and source code level tips and practices that help your application achieve excellent performance on the Intel NetBurst microarchitecture and Pentium M processors. Following these guidelines can have a significant impact on your application's performance and can help ensure that your software runs best on the latest IA-32 processors of today and tomorrow. For more detailed information about many of these coding guidelines and suggestions, please refer to the Intel® 64 and IA-32 Architectures Optimization Reference Manual.


Related Resources

 


About the Author

James Rose joined Intel in 1994 and has been a senior software engineer in Intel's Solutions Enabling Group for the past two years. He focuses on software optimization and performance tuning for the NetBurst microarchitecture and for Intel® Centrino® Mobile Technology. James holds Bachelor's and Master's degrees in Electrical Engineering from Brigham Young University.


For more complete information about compiler optimizations, see our Optimization Notice.