Recent years have yielded an amazing number of new operating systems, new processors, and new platform capabilities that provide exciting opportunities for application developers. Consider the changes the computing industry has witnessed over the past five years and the implications of those changes for an application developer.
- Microsoft* has released Windows* 95, 98, ME, 2000, and XP, a couple of embedded operating systems, and the soon-to-be-released 64-bit Windows. Linux* has found a home on the desktop, the server, and handheld devices.
- Intel has released six new processor lines, increased processor frequency from roughly 100 MHz to 2.0 GHz, and introduced three new instruction set extensions (MMX™ Technology*, Streaming SIMD Extensions* (SSE), and Streaming SIMD Extensions 2* (SSE2)), representing 175 new instructions. Just this year Intel released its first 64-bit microprocessor, and this fall Intel introduced a new simultaneous multi-threading technology called Hyper-Threading*, which presents a single physical processor as two logical processors.
- Handheld computing devices like Compaq's iPAQ Pocket PC* and Palm's Palm* devices have established a viable application market for developers.
In this new, diverse computing environment, many developers find themselves targeting applications for multiple operating systems and processors. Exploiting these platform advances while maintaining application performance and portability across platforms can significantly increase the complexity of the development effort.
Application developers are already saddled with meeting the current release schedule, adding new capabilities, maintaining operating system and processor compatibility, fixing bugs, and, of course, further testing. Without the right tools, an ISV deploying applications to multiple platforms may face the time-consuming task of rewriting the code for each operating system and processor: an order of magnitude more work. Fortunately, there are tools that have kept pace with the changing times and can help alleviate the task of developing high-performance cross-platform applications.
Obviously, this article cannot solve all the problems associated with cross-platform application development. Instead, it attempts to shed light on some of the software tools that ease the process of developing high-performance cross Intel® Architecture (cross-IA) applications. To better define the scope of this paper, cross-IA refers to deploying an application that runs on any combination of Intel's embedded, 32-bit, and 64-bit microprocessors (the Pentium® processor through the Pentium® 4 processor, the Intel® Xeon® processor, the StrongARM* processor, and the Itanium® processor), running Microsoft's* Windows operating systems (Windows 98, NT* 4, 2000, XP) or RedHat* Linux* 6.2 and 7.1. Consider this article a first pass at identifying some of the potential issues encountered during the development of cross-IA applications and the tools that can ease the development effort. The majority of the tools presented in this article are provided by Intel. Intel tools are a good starting point for discussion; who better to provide optimized tools and libraries than the processor manufacturer? Future articles will discuss other 3rd-party development tools.
"No Assembly Required"
When Intel launched MMX Technology in 1997, the only way to take advantage of the new instructions was through assembly language, and the performance-critical components of applications (3D engines, audio and video multimedia CODECs, scientific simulation, and CAD/CAM visualization) were hand-tuned by the developer. Outperforming the compiler through the use of MMX Technology required a developer with a firm understanding of the underlying micro-architecture. By and large this was acceptable, because most developers of the day were focused on targeting Pentium®-based platforms running Windows* 95. Outperforming the compiler today, across multiple platforms, at the assembly level, requires a savvy developer with a lot of knowledge and time: intimate knowledge of multiple micro-architectures (CISC, RISC, out-of-order superscalar, EPIC) and instruction set extensions (MMX, SSE, and SSE2). Assembly-level optimizations are important, but for the majority of cross-platform developers they are better left to the compiler.
Identifying the Processor & OS Capabilities
Before jumping into which tools are available for performance tuning through high-level programming, let's discuss the mechanisms required to deliver high-performance applications with processor-specific optimizations. A three-step process is required to execute processor-specific optimized code: first, identify the OS; second, identify the processor and its capabilities; and third, based on the processor information, choose the functions best suited to that platform (a DLL or code path). Identifying the OS is usually no more difficult than making the correct system call and reading the return value. Determining which processor is present, and what its capabilities are, is not such an intuitive process.
The quick and dirty way to identify the processor and its capabilities is to attempt to execute an instruction specific to a known SIMD instruction set. If an invalid-opcode exception is thrown, either the processor or the OS does not support that technology. Since this quick and dirty method requires the use of assembler, the function below illustrates how to detect a processor that supports SSE2. The SSE2 instruction could be replaced with an MMX or SSE instruction to determine whether those instruction sets are supported.
Intel's official method of identifying the processor is to follow the guidelines in application note AP-485, Intel Processor Identification and the CPUID Instruction. AP-485 not only describes the process of identifying the supported SIMD instruction sets, but also provides the information needed to identify every processor introduced since the 8086. The following chart shows which processors support which SIMD instruction sets.
| Processor Name | MMX™ Technology | Streaming SIMD Extensions | Streaming SIMD Extensions 2 |
|---|---|---|---|
| Itanium® Processor Family | X | X | - |
| Pentium® Pro Processor | - | - | - |
| Pentium® Processor w/ MMX™ Technology | X | - | - |
| Pentium® II Processor | X | - | - |
| Pentium® III Processor | X | X | - |
| Pentium® 4 Processor | X | X | X |
| Pentium® II Xeon® Processor | X | - | - |
| Pentium® III Xeon® Processor | X | X | - |
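The chart above corresponds to feature bits returned by the CPUID instruction. As a minimal sketch (the EDX bit positions are those documented in AP-485 for CPUID leaf 1; actually obtaining the EDX value requires executing CPUID itself, for example through a compiler helper, which is omitted here):

```c
#include <stdio.h>

/* SIMD feature bits in EDX after CPUID leaf 1, per Intel AP-485. */
#define CPUID_EDX_MMX   (1u << 23)
#define CPUID_EDX_SSE   (1u << 25)
#define CPUID_EDX_SSE2  (1u << 26)

/* Report which SIMD instruction sets a given EDX feature word claims. */
void print_simd_support(unsigned int edx)
{
    printf("MMX Technology: %s\n", (edx & CPUID_EDX_MMX)  ? "yes" : "no");
    printf("SSE:            %s\n", (edx & CPUID_EDX_SSE)  ? "yes" : "no");
    printf("SSE2:           %s\n", (edx & CPUID_EDX_SSE2) ? "yes" : "no");
}
```

A Pentium III, for example, reports the MMX and SSE bits set and the SSE2 bit clear, matching its row in the chart.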
Identifying Support in the Processor and Operating System
Now that we have seen how to identify the platform (OS/CPU) capabilities, let's examine some high-level development tools for creating a high-performance cross-IA application.
High-Performance Development Tools
Depending upon which high-level programming language the application is developed with, there are a variety of tools that can assist with improving the portability and performance.
Applications developed using C/C++ can take advantage of intrinsics. Intrinsics enable the developer to target optimizations at specific processor capabilities while placing the onus of issuing and scheduling the optimal instruction sequences for each platform on the compiler. Intrinsics still require the developer to know and understand the various SIMD instruction sets, but they also give the compiler some flexibility in generating the best code for each platform.
To a Java* or Visual Basic* developer, intrinsics are by and large unusable; they are simply too low-level. For these developers, an IA-optimized just-in-time (JIT) compiler or cross-platform libraries and SDKs may be the answer. In the case of Java*, IBM's JIT compiler issues optimized code for the Pentium® III processor and the NetBurst™ (Pentium® 4 and Xeon®) micro-architecture. IBM also supplies a 64-bit version of the JDK for the Itanium® processor family.
The figure below depicts the relative ease of application development and maintenance using the different tools described above. It places low-level hand-tuned assembly programming at the top because it requires the most time and is the most difficult to implement, maintain, and port. Residing below assembly-level optimizations are the intrinsics and C/C++ classes, which issue platform-specific instructions but leave the scheduling of those instructions to the compiler. This yields performance close to that of assembly while reducing the amount of work required of the developer. The next layer down is a vectorizing compiler, which can automatically identify optimization opportunities and generate optimized SIMD code. Finally, at the bottom are the performance libraries, whose functions are optimized specifically for each processor. A performance library relies on a code dispatch mechanism that identifies the processor and targets the optimal code path. Performance libraries are the easiest to use because the developer does not have to do the optimization work, maintain compatibility, or ensure portability.
Optimizing applications through the above strategies is fine, but issuing SIMD instructions alone is not enough to ensure performance gains. Performance analysis tools are needed to assure the developer that the optimization work is actually paying off on the target platform.
Before delving into the different approaches to developing high-speed code, let's quickly look at the typical methodology for analyzing and improving application performance.
This process assumes that the application is already developed, which is normally the rule rather than the exception. Many developers do not worry about performance until the end of the design cycle, and by then, optimizing the application may require major reworking of both data types and algorithms. Understanding the alternative methods of improving performance early in the development process can simplify the actual optimization work at the end of the project. This can be as easy as organizing your data as a structure-of-arrays instead of an array-of-structures and letting the compiler identify opportunities for parallelism.
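The structure-of-arrays suggestion is easy to see in code. A sketch (the point/translate example is illustrative, not from the original article):

```c
#include <stddef.h>

#define N 1024

/* Array-of-structures: the x, y, z of one point are interleaved,
 * so consecutive x values sit 12 bytes apart in memory. */
struct PointAoS { float x, y, z; };

/* Structure-of-arrays: each field is a contiguous, unit-stride
 * stream, which is exactly what a SIMD load wants. */
struct PointsSoA {
    float x[N];
    float y[N];
    float z[N];
};

/* A loop over SoA data is a straightforward candidate for the
 * compiler to parallelize with SIMD instructions. */
void translate_x(struct PointsSoA *p, float dx)
{
    for (size_t i = 0; i < N; i++)
        p->x[i] += dx;
}
```

The same translate loop over the AoS layout would force the compiler to gather every third float, which usually defeats vectorization.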
The above process of gathering performance information, identifying the performance critical code segments, then choosing the method to optimize the application can be a daunting task without the right tools to do the job effectively. So, with that, let's look at the different strategies used in optimizing applications and the analysis tools that reveal the performance gains from the implementation work.
Today most developers are trained in using high-level languages like C/C++ and Java*. To accelerate development targeted to specific processor technologies, intrinsics can be utilized to provide low-level optimizations through a high-level language. More importantly, this level of abstraction enables the compiler to decide which instruction is best for a given platform. For example, a program that utilizes the __m128 data type could be implemented with SSE or SSE2 instructions or translated to an equivalent IA-64 code sequence.
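As a sketch of what that looks like in source (a four-float addition; the function name is illustrative):

```c
#include <xmmintrin.h>  /* SSE intrinsics and the __m128 data type */

/* Add two arrays of four floats with one SIMD operation.  The
 * developer picks the intrinsic; the compiler picks and schedules
 * the actual instruction sequence, and on a 64-bit target it may
 * translate the same source into an equivalent IA-64 sequence. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);   /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Compare this with the hand-written assembly equivalent: the intrinsic version has no register allocation and no scheduling decisions in it, yet it still issues packed SIMD instructions.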
Compiler vendors provide intrinsics to enable developers to create cross-platform code optimized for a given micro-architecture. Intrinsics can be used to generate specific code for IA-32 platforms, based on processor capabilities, and for the Itanium product family. Cross-IA intrinsics are supported by Microsoft's Visual C/C++* compiler and Intel's C/C++ compiler. Microsoft's intrinsics enable developers to move between the 32-bit Windows operating systems and the soon-to-be-released 64-bit version of Windows, but be aware that intrinsic support is added through the Microsoft* Visual C++ 6.0 Processor Pack*. Intel's intrinsics enable developers to deploy intrinsic-based code on both 32/64-bit Windows and 32/64-bit Linux operating systems, as long as the Intel® C/C++ compiler is used.
Intrinsics are not the complete answer to cross-IA development. They provide SIMD opportunities for micro-architectures that can take advantage of them, and they can be translated by the compiler into an equivalent 64-bit instruction sequence. However, they do not support the embedded StrongARM* processor line, and for Java*, .NET*, or Visual Basic* developers, intrinsics are simply too low-level to be useful. For these developers, performance libraries and optimized JIT/JVMs will probably yield the best solution for cross-IA application deployment.
Compilers, Profile Guided Optimizations, and Auto-vectorization
Compiler technology has kept pace through the years. Most compilers support the use of SIMD instructions, but most do not support automatically generating SIMD-optimized code.
One optimization strategy employed by modern-day compilers is profile-guided optimization. The process consists of running the code, measuring its performance, then applying optimization techniques specific to the detected performance bottlenecks. The level of support for these features varies from compiler to compiler.
For instance, GCC can perform profile-guided optimizations, but they are limited to the general x86 architecture and generate nothing specifically targeted at SIMD instruction sets or particular x86 micro-architectures. Microsoft's compiler can target P6 family processors, but not the Pentium® 4 processor. Intel's compiler, on the other hand, can generate optimizations specific to each micro-architecture. This functionality is useful when creating high-performance cross-IA applications.
Since the focus of this article is on using high-level development tools to improve cross-IA applications, let's examine some of the optimization techniques the Intel® C/C++ compiler uses to generate high-performance code. One method used to improve performance is called auto-vectorization.
Auto-vectorization is a fine-grained optimization technique that unrolls sequential loops and applies SIMD instructions to improve loop performance. This optimization is only available when the SIMD instruction set matches the data types used in the loop. For example, consider the following loop, which the auto-vectorizer unrolls four times.
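The loop listing itself is missing from this copy of the article; a representative loop of the kind the auto-vectorizer handles would be:

```c
/* Single-precision, unit-stride, no cross-iteration dependence:
 * the vectorizer can unroll this four ways and replace the body
 * with one packed SSE addition per four iterations. */
void vadd(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```
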
Do not think that unrolling the loop alone is enough. Typically, doing so causes the loop to miss the cache more often than the sequential version did. Cache misses can be reduced through the use of prefetch instructions, which work just as the name implies: memory elements are fetched from main memory and brought closer to the processor's data cache prior to use. A successful prefetch reduces memory latency, especially when data elements are being operated on in parallel. Note that there are a few variations of the prefetch instruction; they are detailed in the processor instruction manual.

Consider the following situation to understand why prefetching data is important and advantageous. A loop misses the L2 cache on each iteration. Simply unrolling the loop will cause it to miss the cache more often per iteration. In turn, performance can degrade so much that the benefit of the SIMD optimizations is lost; in many cases performance falls below that of the original loop. To alleviate this issue, compiler directives can be used that enable the compiler to identify opportunities to automatically insert prefetch instructions into loop structures in the code. To improve the performance of applications that do not take advantage of the cache control instructions, the Intel® Pentium® 4 processor also has a hardware prefetch mechanism that is invoked when cache misses occur in a recognizable pattern.

For further information on cache control and the prefetch instructions, the Streaming SIMD Extensions and General Vector Operations presentation offers good examples to get a developer started. For a formal discussion of the cache control instructions, the Pentium® 4 Processor and Intel® Xeon® Processor Optimization Manual is an excellent reference.
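At the source level, the same idea can be expressed explicitly with the _mm_prefetch intrinsic, which maps to the prefetch instructions discussed above. In this sketch the prefetch distance of 16 floats (one 64-byte line) is an illustrative tuning choice, not a recommended value:

```c
#include <xmmintrin.h>

/* Sum an array while hinting the processor to start pulling data
 * toward the cache one line ahead of the current element. */
float sum_with_prefetch(const float *a, int n)
{
    float total = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
        total += a[i];
    }
    return total;
}
```

The hint (_MM_HINT_T0 here, for all cache levels) selects among the prefetch variations mentioned above; the right distance and hint depend on the loop body and must be measured, not guessed.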
A coarse-grained optimization technique called parallelization is used by some modern-day compilers to improve performance on multi-processor platforms. Parallelization improves performance by creating multiple threads to execute the code. Only a handful of compilers currently support this method of optimization; KAI's and Intel's compilers implement parallelization through the use of OpenMP. Regardless of which compiler is used, the important point is that the compiler performs the optimizations, not the developer.
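An OpenMP-based parallelization is a one-line annotation from the developer's point of view. In this sketch, a compiler built without OpenMP support simply ignores the pragma and runs the loop serially, so the source stays portable:

```c
/* Dot product parallelized across threads with OpenMP.  The
 * reduction clause tells the compiler to give each thread a private
 * partial sum and combine the partial sums at the end. */
double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```
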
Performance libraries are great if you are willing to let the performance programmers do the work and the library meets your requirements. They enable high-level developers, especially Visual Basic* and Java* developers, to realize the performance gains of processor-specific optimizations without doing the low-level work themselves. Performance libraries also carry the benefit of placing the maintenance, compatibility, and optimization work on the library vendor, relieving developers from concerning themselves with processor-specific optimization.
Performance libraries work by identifying the processor and the processor's capabilities. Based on that information, the code dispatch unit determines which code is best suited for that platform. The diagram below illustrates this process.
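A dispatch mechanism of this kind can be sketched with a function pointer that is resolved once at startup. The two variants below are deliberately plain C stand-ins; in a real library the second would be an SSE2-optimized build of the routine:

```c
/* Two implementations of the same operation. */
static void scale_generic(float *a, int n, float s)
{
    for (int i = 0; i < n; i++) a[i] *= s;
}

static void scale_sse2(float *a, int n, float s)
{
    /* Stand-in: a real library would use SIMD instructions here. */
    for (int i = 0; i < n; i++) a[i] *= s;
}

/* The dispatch target, chosen once based on detected capabilities. */
static void (*scale)(float *, int, float);

void init_dispatch(int has_sse2)
{
    scale = has_sse2 ? scale_sse2 : scale_generic;
}
```

After init_dispatch runs, every call through scale goes straight to the best code path with no per-call capability check, which is essentially what the libraries described below do behind the scenes.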
Many performance libraries are available to developers targeting high-performance IA-32 applications. However, very few extend their support to include embedded and 64-bit processors. Since our goal is cross-IA development, let's examine tools that have been extended to support all IA platforms.
The Intel® Performance Libraries are a good example of a library set optimized for IA-32. The Intel Performance Libraries have been available for numerous years. The various components of the Intel Performance Libraries include optimized math routines, basic linear algebra routines, image processing routines, signal processing routines, and speech/character recognition routines. These routines are optimized with processor technologies that have been introduced through the 32-bit line of processors. So, while they are optimized with MMX Technology, SSE, and SSE2 instructions, they do not include support for the Itanium product family or the StrongARM and personal client architectures. Recognizing this, Intel has developed the Intel Integrated Performance Primitives, which extend the Intel Performance Libraries support to the StrongARM and Itanium processors.
Cross IA Performance Libraries - The Intel® Integrated Performance Primitives
To ease the effort of developing optimized cross-IA applications, Intel created the Integrated Performance Primitives (IPP). IPP represents the next step for the Intel® Performance Libraries: it contains their SIMD-based optimizations and extends the platforms a developer can target to include the Itanium, XScale, and StrongARM processors. A developer using IPP can create an application that delivers best-of-class performance regardless of the IA platform, with the added benefit of leaving the processor-specific optimization work to Intel engineers. IPP has a broad range of applicability. It can benefit modem device drivers that perform signal processing, image or video effects, audio decode algorithms (an MP3 player), or even a 3D graphics or physics engine.
IPP is a low-level API that provides functions highly-optimized for image, JPEG, computer vision, speech, and signal processing. It is optimized for Intel's embedded, 32-bit, and 64-bit processors, and runs on both the Windows OS and Linux. IPP is designed to easily integrate into and improve application performance.
IPP is packaged and redistributed either as a dynamically linked library (DLL) containing optimized code for each IA processor, or as a static library. For the embedded StrongARM and XScale processors, IPP is packaged as a static library. This flexibility enables the developer to integrate IPP functionality directly into the executable or to leave it separate as a DLL. Leaving the IPP code in a DLL allows the application to take advantage of future versions of IPP without recompiling.
Java* and Visual Basic* developers can also take advantage of IPP. In the case of Java*, an interface between IPP and the Java* Native Interface (JNI) can be used to call IPP functions. In preliminary testing, some image processing effects (such as a blur) ran more than five times faster than the Java* 2D API. Of course, this will vary from function call to function call, but it gives a reasonable idea of the kind of performance gains to expect.
Going forward, the common functionality between the Intel performance libraries and IPP will be moved into IPP, and the Intel performance libraries will become open-source reference designs. Future versions of IPP will include component primitives for audio decode/encode (MP3), video decode/encode (H.263, MPEG-4), and vector/matrix math routines. Expect this functionality to be publicly released at the beginning of 2002.
An evaluation version of the Intel® Integrated Performance Primitives is available. Also, a good introduction to the Intel® Integrated Performance Primitives is available online.
As a developer, it is up to you to decide whether to let the library vendor worry about performance and portability or to do it yourself. Regardless of whether you choose to optimize your application through intrinsics or a performance library, the need still exists to validate the performance gains realized from the optimization work. As with optimizations, there are multiple methods and tools to ease the performance validation process.
Application Performance Analysis
Measuring the performance of your application in your favorite development environment is probably limited to profiling function time and code coverage. Profiling code can take the form of counting the number of milliseconds, or counting the number of clock cycles, that transpire during the execution of a function. We'll discuss the advantages and disadvantages of a few popular methods of performance analysis.
Profiling is the most common method for gauging performance, probably because the capability is available within almost every development environment, and it provides a quick and dirty way of measuring the amount of time spent in each function call. Timing functions is a reasonable approach to gauging the performance of a routine, but timing alone does not provide insight into micro-architectural performance issues. For example, timing your functions does not reveal the number of cache misses or missed branch predictions. Before delving into event-based sampling, let's examine the process of timing your code.
Measuring the amount of time a function call takes is as simple as acquiring the current system clock, running the function, acquiring the time the function ended, subtracting the two, and voilà. It is a very simple process. However, many developers cannot measure performance in terms of milliseconds because the function of interest takes less than a millisecond to execute. To work around this, many developers count the number of clock cycles a function takes. Since this is a little more difficult to implement than measuring time, let's look at what it takes to measure a function's performance in terms of clock cycles.
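The subtract-two-timestamps approach looks like this with the standard C clock() routine (busy_work is a made-up workload for illustration). A function that finishes in under one clock tick reports zero, which is precisely the limitation that pushes developers toward cycle counting:

```c
#include <time.h>

/* Sample the clock, run the work, sample again, subtract. */
double time_ms(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    clock_t end = clock();
    return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC;
}

/* An artificial workload so there is something to measure. */
static void busy_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 100000; i++)
        x += i;
}
```
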
With the Pentium processor, Intel introduced the read timestamp counter instruction (rdtsc). The timestamp represents the number of clock cycles accrued since boot, and the read timestamp instruction, as its name implies, returns the current (64-bit) timestamp. There are a few issues to be aware of when measuring performance with the timestamp counter. First, the out-of-order execution nature of the P6 family (Pentium Pro, Pentium II, and Pentium III processors) requires the use of a serializing instruction to ensure that all instructions in the pipeline have retired before the timestamp is read. Think of it this way: an out-of-order execution engine could execute the read timestamp instruction before all the preceding instructions have retired, and thus report an incorrect number of clock cycles.
The easiest way to work around this issue is to issue a serializing instruction. A serializing instruction (such as cpuid) will not execute until the entire execution pipeline is empty. This ensures that all instructions of interest are measured prior to executing the read timestamp instruction. Since reading the timestamp counter is common practice, and it requires assembler, code is provided below that exemplifies how to determine the number of clock cycles a function takes.
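This copy of the article is missing the listing it refers to, so the sketch below reconstructs the technique with GCC-style inline assembly for IA-32/x86-64; Microsoft's compiler would use an __asm block or the __rdtsc intrinsic instead. cpuid is issued first as the serializing instruction:

```c
#include <stdint.h>

/* Serialize with cpuid, then read the 64-bit timestamp counter. */
static uint64_t read_tsc_serialized(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(
        "cpuid\n\t"   /* drain the out-of-order pipeline */
        "rdtsc"       /* EDX:EAX <- timestamp counter    */
        : "=a"(lo), "=d"(hi)
        : "a"(0)
        : "ebx", "ecx");
    return ((uint64_t)hi << 32) | lo;
}

/* Clock cycles consumed by fn(), bracketed by serialized reads. */
uint64_t cycles(void (*fn)(void))
{
    uint64_t start = read_tsc_serialized();
    fn();
    return read_tsc_serialized() - start;
}

/* A small workload to measure. */
static void sample_work(void)
{
    volatile int x = 0;
    for (int i = 0; i < 1000; i++)
        x += i;
}
```

Note that cpuid clobbers EBX and ECX in addition to the registers rdtsc writes, which is why both appear in the clobber list.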
Performance Counters & Event Based Sampling
Implementing your own event based sampling system represents a good amount of work, and it is not recommended for a number of reasons. First, it will require the use of assembler. Second, it requires the development of a device driver that operates at the system level (ring 0). Finally, there are a large number of events to account for, and events vary between processor lines (Pentium® processor, Pentium® Pro processor, Pentium® II processor, Pentium® III processor, Pentium® 4 processor, and Itanium® processor all support different sets of event-based sampling). If you are interested in undertaking this endeavor, there is a good article that describes the process of implementing an event-based sampling system for the Pentium processor (written by Robert Wyatt in the May 1998 issue of Game Developer Magazine at: http://www.gdmag.com*). Developing your own system to monitor IA processor events represents a significant amount of work, and to be honest, this is better left to tools with event-based sampling capabilities. One such tool is Intel's VTune™ Performance Analyzer. It supports both time-based and event-based sampling for all of IA (except the StrongARM processor).
Intel's VTune™ Performance Analyzer
Intel's VTune Performance Analyzer* has been available for a number of years. It provides developers with the ability to monitor the processor events that occur within an application and the amount of time a specific routine takes. By monitoring processor events, you'll get better insight into why your application performs the way it does. To date, the VTune Performance Analyzer is limited to running on 32-bit IA processors under the Windows operating system (Windows* 98, NT4, and Windows* 2000). However, a recently added feature gives it the ability to remotely collect performance information from Linux* platforms as well as Itanium®-based platforms. The VTune Performance Analyzer supports monitoring the performance of stand-alone executables and Java*-based applications.
VTune Performance Analyzer also contains a "code coach" mechanism that offers recommendations for speeding up your code, based on analysis of the event- and time-based sampling. This can be helpful to developers who do not have the background necessary to interpret the performance information VTune Performance Analyzer gathers from the performance counters.
It is beyond the scope of this paper to discuss how to use Intel's VTune Performance Analyzer, but for those interested, there is a trial version of VTune Performance Analyzer on the web and a plethora of information on using the tool.
For more information on the VTune Performance Analyzer go to:
Download a free 30-day evaluation copy of the VTune Performance Analyzer:
Extensive information on how to use VTune Performance Analyzer at:
This article only scratches the surface of developing high-performance cross-IA applications, but in case the message was missed, let it be reiterated: unless the application requires the utmost performance, avoid the use of assembly language for optimization. Assembly code is difficult to maintain and does not port easily between micro-architectures. Opting for high-level optimization tools such as the compiler, intrinsics, and performance libraries makes the application easier to develop, maintain, and port. In the end, this approach enables the application to perform best-of-class across the wide variety of Intel® architectures with a minimal amount of effort. Future articles will examine overlooked topics such as multi-threaded optimizations, Hyper-Threading Technology, and other 3rd-party libraries and tools used in developing high-performance cross-IA applications.