Preparing Applications for Intel® Core™ Microarchitecture

by Khang Nguyen
Contributors: Bob Valentine, Erik Niemeyer, Paul Lindberg


Introduction

Currently, optimizing applications for a desktop platform is not the same as optimizing them for a mobile platform, due to differences in the usage models of the two platforms. Intel® Core™ microarchitecture combines the best of the desktop Intel NetBurst® microarchitecture and the mobile Pentium® M microarchitecture. As Intel will be using a single microarchitecture for both desktop and mobile platforms, the challenge is how to prepare your applications so that they run well on Intel Core microarchitecture. What can we do with existing and new desktop and mobile applications to make them ready when the new Intel processors hit the market? This paper is not intended to show users everything they can do to improve the performance of existing applications on Intel Core microarchitecture. It only suggests some techniques to either improve or maintain the performance of an existing application when running on systems with these new Intel® processors.


Techniques

Cache

Data in cache is accessed much faster than data in main memory. Therefore, always try to keep data in cache as much as possible. One of the features of Intel® Core™ microarchitecture is that the level 2 (L2) cache is shared among cores. The primary benefit of a shared L2 is data sharing between threads running on different cores on the same die. This necessitates reevaluating how hot data sections are mapped to threads in an application, to ensure maximum hits in the L2. The other advantage of a shared L2 cache is that if one core is disabled, the remaining core can make use of the full L2 cache. To get the number of threads that share a cache level, execute the cpuid instruction with eax = 4 and ecx = 0, 1, 2, ... (0, 1, 2 corresponding to cache levels 1, 2, and 3, if present, respectively). The number of threads is obtained by adding 1 to the value returned in eax[25:14].
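
As a minimal sketch, assuming a compiler that provides the __cpuidex intrinsic in <intrin.h> (recent Microsoft and Intel compilers do), the cache-sharing information can be read like this:

#include <intrin.h>
#include <stdio.h>

int main()
{
    int regs[4]; // eax, ebx, ecx, edx

    // enumerate the caches with cpuid leaf 4; ecx selects the cache index
    for (int index = 0; index < 3; ++index)
    {
        __cpuidex(regs, 4, index);

        int cacheType = regs[0] & 0x1F;  // eax[4:0] == 0 means no more caches
        if (cacheType == 0)
            break;

        int level   = (regs[0] >> 5) & 0x7;          // eax[7:5]: cache level
        int threads = ((regs[0] >> 14) & 0xFFF) + 1; // eax[25:14], plus 1

        printf("Cache index %d (L%d): shared by up to %d threads\n",
               index, level, threads);
    }
    return 0;
}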

Note that in some systems there is a BIOS option to toggle the “maximum input value”; you need to disable it. This option limits the maximum value returned by executing cpuid with eax=0 to 3, and it is needed to boot Windows* NT 4.0: without the limit, Windows NT hangs with a blue screen. With the option enabled, however, executing cpuid with eax=4 results in an error. Make sure that the “maximum input value” setting is the same on all processors.

More information about how to get the number of threads per cache level can be found in the IA-32 programmer's manual. The link is in the Additional Resources section.


Instructions

Macro fusion

Macro fusion merges two adjacent instructions into a single micro-op, resulting in higher average decoder throughput. For fusion to occur, the CMP or TEST instruction must be immediately followed by the JCC. Below are the situations in which the pair can and cannot be fused (a short example follows the list):

  • Valid:

    CMP/TEST register – register + JCC
    CMP/TEST register – memory + JCC
    CMP/TEST register – immediate value + JCC
    CMP/JE
    CMP/JNE
    CMP/JA
    CMP/JAE
    CMP/JB
    CMP/JBE

  • Invalid:

    CMP/TEST memory – immediate value
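
As a minimal sketch (Microsoft-style inline assembly, so 32-bit builds only), the loop below ends with a register–immediate CMP immediately followed by JNE, a pair eligible for macro fusion:

// a minimal sketch; assumes n > 0
void spin_down(int n)
{
    __asm {
        mov  eax, n
    again:
        dec  eax
        cmp  eax, 0    // CMP register - immediate value...
        jne  again     // ...immediately followed by JNE: the pair can fuse
    }
}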

 

Partial Register Stall

A partial register stall happens when a full-width read follows a write to only part of the same register. For example, writing a value to the 16-bit AX register and later reading the full 32-bit EAX causes a partial stall, since AX is a subset of EAX. A partial stall slows an application down because it forces the execution engine to insert an additional micro-op that assembles the different parts of the register. The safe way to avoid a partial stall is to read and write the full 32-bit register, or to operate on the full register before the partial update. For example, issue the following statement:

xor eax, eax

before partially writing AX and later reading the full EAX.
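
A minimal sketch (Microsoft-style inline assembly, 32-bit builds only) showing both the stalling pattern and the fix:

void partial_stall_example()
{
    int out;
    __asm {
        // stalling pattern: partial write, then full-width read
        mov  ax, 10    // writes only AX
        mov  out, eax  // reading the full EAX needs an extra merge micro-op

        // the fix: operate on the full register before the partial update
        xor  eax, eax  // clears all of EAX
        mov  ax, 10
        mov  out, eax  // no partial register stall
    }
}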

 

Length Changing Prefixes

Length Changing Prefixes (LCPs) are prefixes that alter the length of the immediate data or displacement in an instruction. For example:

0x67 - Address Override Prefix

0x66 - Operand Override Prefix

 

When this results in a change in the length of a displacement or immediate field, the processor undergoes a 5-cycle stall. The most frequent case is immediate data used in instructions that operate on word data. To overcome this problem:

  • Avoid using instructions with immediate values that require an LCP.
  • Promote word operations to doubleword. For example:

 

and ax, 0xf0f0        // word immediate: requires an LCP

is similar to

and eax, 0xfffff0f0   // doubleword operation: no LCP

 

  • Move LCP instructions with immediate values out of loops.

 

Store Forwarding

The store-forwarding problem still exists, although future processors might improve store-forwarding detection. Use the Intel® VTune™ analyzer to detect store-forwarding problems: select the store-forwarding counter, since this problem is difficult to track down by hand. Follow the store-forwarding rules to avoid it; details of the rules can be found in the IA-32 programmer reference manual. In cases where the rules cannot be applied, make sure to store the data well before it is read, so that the store has enough time to commit to memory. Below are the cases where the store can and cannot be forwarded:

[Figures omitted: store-forwarding cases, showing loads that can and cannot be forwarded from a preceding store]
Note that the figures above assume the stores are aligned. It is very important to align the data whenever possible.
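
As a minimal sketch (the pointer cast is purely illustrative; production code would need to respect strict-aliasing rules), the first function reads more bytes than the preceding store wrote, which blocks forwarding, while the second matches the store's size and address:

unsigned int forwarding_blocked(unsigned short* p)
{
    p[0] = 1;                  // 2-byte store
    return *(unsigned int*)p;  // 4-byte load from the same address: wider
                               // than the store, so it cannot be forwarded
}

unsigned int forwarding_ok(unsigned int* p)
{
    *p = 1;                    // 4-byte store
    return *p;                 // load of the same size and address:
                               // the store forwards directly to the load
}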

Useful Streaming SIMD Extensions (SSE) Instructions

There is a very good chance that some of the SSE instructions will be optimized in future processors to run faster. Unless they cause performance problems, the following SSE instructions should be used whenever possible:

unpcklps, unpckhps, packsswb, packuswb, packssdw, pshufd, shufps, shufpd

 

SSE Versus X87

SSE implementations are getting better and better with each processor generation. Therefore, use SSE instructions instead of x87 instructions whenever possible.
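
A minimal sketch of scalar single-precision math done with SSE intrinsics rather than x87 (compilers can also generate such code automatically, for example with the /QxP switch described below):

#include <xmmintrin.h>

float scale(float x, float factor)
{
    // MULSS (scalar SSE multiply) instead of the x87 FMUL
    __m128 v = _mm_mul_ss(_mm_set_ss(x), _mm_set_ss(factor));
    float result;
    _mm_store_ss(&result, v);
    return result;
}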

Do Not Mix SSE FP and SSE Integer on the Same Register

Moving data between the SSE integer and SSE FP domains incurs an additional cycle of latency. Therefore, use PXOR with SSE integer data only; use XORPS or XORPD with FP data when dealing with single and double precision, respectively.
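
A minimal sketch with intrinsics, matching the XOR flavor to the data type:

#include <emmintrin.h>

__m128i zero_int(__m128i v) { return _mm_xor_si128(v, v); } // PXOR: integer data
__m128  zero_ps(__m128 v)   { return _mm_xor_ps(v, v); }    // XORPS: single precision
__m128d zero_pd(__m128d v)  { return _mm_xor_pd(v, v); }    // XORPD: double precision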

Instruction LDDQU

This instruction is very beneficial on the Intel NetBurst (Pentium 4) microarchitecture. It replaces the MOVDQU instruction for loading 16 bytes from an unaligned memory address: it reads a full 32 bytes instead of just 16 bytes to prevent a cache-line split. On the mobile microarchitecture, lddqu is just an alias for movdqu, so it produces no performance improvement in mobile applications, nor does it suffer any penalty from a store after the unaligned load. Because Intel® Core™ microarchitecture combines the best of both Intel NetBurst microarchitecture and the mobile Pentium M microarchitecture, use this instruction whenever dealing with unaligned loads.
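
A minimal sketch using the SSE3 intrinsic for this instruction (the code must be compiled for SSE3, for example with the /QxP switch described below):

#include <pmmintrin.h>

__m128i load_unaligned(const void* p)
{
    return _mm_lddqu_si128((const __m128i*)p); // emits LDDQU
}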

Inline Assembly in 64-Bit Platforms

The ability to use assembly language in a high-level language like C/C++ plays an important role in today’s application performance. Of course, we can also use intrinsics to replace some, but not all, assembly instructions. With the introduction of the 64-bit Windows operating system (OS), more and more 32-bit applications are being ported to 64-bit. So what happens to existing 32-bit applications with inline assembly when they are ported to a 64-bit environment? In most cases, those applications run fine on a 64-bit OS. However, if you want to modify or recompile them for a 64-bit environment, you will run into problems with Microsoft Visual Studio* 2005, since its 64-bit compiler does not support inline assembly. We can solve this problem in one of two ways. First, you can recompile the applications with the Intel® Compiler version 9.1 or above, since it allows inline assembly. Second, if you want to recompile the applications with Visual Studio 2005, you must convert all functions that contain inline assembly into assembly routines, store them in an .asm file, and then recompile as usual. In general, just convert the existing inline assembly functions to assembly routines if you already have Visual Studio 2005; this way the code is guaranteed to work with both the Intel compiler and Visual Studio 2005.
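
A minimal sketch of the C++ side after such a conversion. AddSaturated is a hypothetical routine name; its body, formerly inline assembly, is assumed to live in a separate .asm file that is assembled (for example with ml64) and linked into the project:

// hypothetical routine, now implemented in a separate .asm file
extern "C" int AddSaturated(int a, int b);

int UseIt()
{
    return AddSaturated(3, 4); // called like any other C function
}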

Alignment

Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. The natural boundaries for words, doublewords, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively. However, to improve the performance of programs, data structures should be aligned on natural boundaries whenever possible. The reason for this is that the processor requires two memory accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.

Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception). However, additional memory bus cycles are required to access unaligned data from memory.

Misalignment can incur significant performance penalties. One such case is the cache-line split: instead of getting all the necessary data from one cache line, the processor has to wait until it can access the next cache line to get the remaining data, which slows things down considerably. The rule is to align data on natural operand-size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries. Aligning data makes it load faster and also improves the chance of the stores being forwarded (see the store-forwarding section for more information). Below is an example showing how to align an array to 16 bytes.

static const __declspec(align(16)) float cfVec_1[3] = {1.0F, 1.0F, 1.0F};
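
Once aligned, the data can be loaded with an aligned vector load. A minimal sketch with SSE intrinsics (the array is padded to four elements here so the 16-byte load stays within bounds):

#include <xmmintrin.h>

__declspec(align(16)) static const float cfVec[4] = {1.0F, 1.0F, 1.0F, 0.0F};

__m128 load_ones()
{
    return _mm_load_ps(cfVec); // emits MOVAPS, which requires 16-byte alignment
}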

 


Tools

CPUCount

CPUCount has been used by the compatibility and validation lab at Intel to test all new platforms, and it has been adopted as the official Intel tool to detect multi-core and Hyper-Threading Technology (HT Technology). Use CPUCount to retrieve the number of physical processors, available cores, and available logical processors that an application can use. This is very important when optimizing multithreaded applications. For example, suppose we have a 2-thread application running on a Pentium® 4 processor Extreme Edition machine. Since the Pentium 4 processor Extreme Edition is dual-core with HT Technology enabled, it provides 4 logical processors altogether. The problem is how to make sure that the 2 threads run on 2 separate cores instead of on 2 logical processors of the same core. Windows* XP might be smart enough to run threads on logical processors of different physical processors, but there is no guarantee that it will do the same with cores. Let’s take a look at the output of CPUCount running on a Pentium 4 processor Extreme Edition machine:

Capabilities:

Hyper-Threading Technology: Enabled
Multi-core: Yes
Multi-processor: No


Hardware capability and its availability to applications:

System-wide availability: 1 physical processor, 2 cores, 4 logical processors
Multi-core capability: 2 cores per package
HT capability: 2 logical processors per core

All cores in the system are enabled for this application.


Relationships between OS affinity mask, Initial APIC ID, and 3-level sub-IDs:

AffinityMask = 1; Initial APIC = 0; Physical ID = 0, Core ID = 0, SMT ID = 0
AffinityMask = 2; Initial APIC = 1; Physical ID = 0, Core ID = 0, SMT ID = 1
AffinityMask = 4; Initial APIC = 2; Physical ID = 0, Core ID = 2, SMT ID = 0
AffinityMask = 8; Initial APIC = 3; Physical ID = 0, Core ID = 2, SMT ID = 1

 

Windows might decide to choose the first 2 logical processors, corresponding to affinity masks 1 and 2, respectively. That would be unfortunate, since those 2 logical processors are on the same core. Looking closely at the output above, we can see that the logical processors with affinity masks 1 and 4 are on 2 separate cores. Setting the appropriate affinity mask forces the application to run only on the desired logical processors (separate cores). Affinity masks of 1 and 4 mean that bits 0 and 2 of the bit-vector processor mask are set, respectively. Therefore, to force the application to run only on cores 0 and 2, set bits 0 and 2 of the processor mask; that is, assign the value 5 in decimal (101 in binary) to the processor mask. Use the system function SetProcessAffinityMask to set the processor mask.
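
A minimal sketch using the Windows API (the mask value 5 matches the CPUCount output above; it will differ on other topologies):

#include <windows.h>

int main()
{
    DWORD_PTR mask = 1 | 4; // bits 0 and 2: logical processors on separate cores
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
        return 1;           // the call failed
    // ... create the two worker threads; they can now be scheduled only
    // on the two logical processors selected above
    return 0;
}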

More information about CPUCount can be found in the white paper titled “Detecting Multi-Core Processor Topology in an IA-32 Platform.” Note that the BIOS “maximum input value” caveat described in the Cache section above applies to CPUCount as well.

High Precision Counter

The Windows API functions QueryPerformanceCounter and QueryPerformanceFrequency are used as a high-precision timer. The two functions are implemented on top of various hardware timers, such as the FSB clock or the BIOS timer. These timers are independent of Enhanced Intel SpeedStep® Technology, which is extremely valuable for game programming. Future Intel processors might implement the read time-stamp counter (RDTSC) using the front-side bus clock, which is independent of Enhanced Intel SpeedStep Technology. The paper “Measure Code Sections Using the Enhanced Timer” describes the use of these two functions. (See the link in the Additional Resources section.)
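
A minimal sketch of timing a code section with these two functions:

#include <windows.h>
#include <stdio.h>

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);    // counter ticks per second
    QueryPerformanceCounter(&start);

    // ... code section to measure ...

    QueryPerformanceCounter(&stop);
    double seconds = (double)(stop.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("Elapsed: %f seconds\n", seconds);
    return 0;
}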

Intel® Compilers

For application-level optimization, always try to use the latest version of the Intel compiler. In addition, always vectorize applications whenever possible, since vectorization will automatically use new instructions from Intel Core microarchitecture when they become available. (The new instructions will also be supported by intrinsics and by inline assembly.) Vectorization detects patterns of sequential data accessed by the same instruction and transforms the code for SIMD execution, including use of the SSE, SSE2, and SSE3 instruction sets. The vectorizer also applies alignment strategies, including loop peeling and loop unrolling. Loop peeling can generate aligned loads, enabling faster application performance. Loop unrolling matches the prefetch of a full cache line and allows better scheduling.
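
A minimal sketch of the kind of loop the vectorizer can transform into SIMD code (unit-stride accesses, with the same operation applied to every element):

void scale(float* dst, const float* src, float k, int n)
{
    for (int i = 0; i < n; ++i) // sequential access, one operation per element:
        dst[i] = src[i] * k;    // a good candidate for SSE vectorization
}

Below are some of the useful Intel compiler switches: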

/O3 (High-level Optimizations)
This switch will create the fastest code in most cases with loop transformation and cache optimizations. To get the full benefit, this switch should be accompanied by the /QxP or /QaxP switches for Intel® Pentium® M and Pentium 4 processors, and processors supporting Intel® Extended Memory 64 Technology (Intel® EM64T).

/QxP or /QaxP
This switch will generate instructions for, and optimize for, processors with Streaming SIMD Extensions (SSE, SSE2, and SSE3) and Intel EM64T. /QaxP will also create generic code to allow applications to run on processors other than those specified above.

/Qip or /Qipo[value]
/Qip allows inlining and other optimizations within one single file, while /Qipo permits them across multiple files. Value specifies the maximum number of object files to be produced. For example, /Qipo3 specifies a maximum of three object files the compiler may choose to create. If value is unspecified, it defaults to one file, just like using /Qip. Using these switches has the following benefits:

  • Decreasing the number of branches, jumps, and calls within code; this reduces overhead when executing conditional code.
  • Reducing call overhead further through function inlining.
  • Providing improved alias analysis, which leads to better code vectorization and loop transformations.
  • Enabling limited data layout optimization, resulting in better cache usage.
  • Performing interprocedural analysis of memory references, which allows registerization of more memory references and reduces application memory accesses.

 

/Qparallel
This switch will detect loops that may benefit from multithreaded execution, and it automatically generates the appropriate threading calls.

/Qopenmp
This switch will allow the use of OpenMP* directives in the applications.

Threading Tools and Libraries

Intel® Threading Tools includes the Intel® Thread Checker, which detects threading bugs such as data races, and the Intel® Thread Profiler, which helps optimize threading. Intel also offers libraries such as the Intel® Math Kernel Library (Intel® MKL) and the Intel® Integrated Performance Primitives (Intel® IPP), which are highly optimized for threading. Links to these tools are provided in the Additional Resources section.

OpenMP*

OpenMP provides a simple and fast way to implement threading. Unlike raw Windows threads, OpenMP takes care of all the thread initialization and cleanup. The Intel compiler already supports it, and the future Microsoft compiler will also support it; however, the object files produced by the two compilers might not be compatible with each other. More information about OpenMP can be found at the OpenMP links in the Additional Resources section.
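
A minimal sketch of an OpenMP work-sharing loop (compile with /Qopenmp on the Intel compiler):

void add_arrays(float* c, const float* a, const float* b, int n)
{
    #pragma omp parallel for    // iterations are divided among the threads
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}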


Additional Resources

 

 


For more complete information about compiler optimizations, see our Optimization Notice.