February 22, 2009 11:00 PM PST
by Andrew Binstock
The Pentium® 4 processor's Streaming SIMD Extensions 2 (SSE-2) are new processor instructions designed to accelerate the performance of applications that use double-precision floating point and integer instructions. These instructions are particularly important in applications that rely on:
- 3D graphics and geometry, such as ray tracing with floating-point code
- Signal processing, high-precision simulation, and modeling algorithms that use floating-point math
- Video encoding and decoding algorithms
This article explains what SIMD is, what this second generation consists of, and how to use the new instructions.
SIMD is an acronym for Single Instruction, Multiple Data, and refers to the ability to execute a single processor instruction in parallel across several data items in a single operation. (The acronym is not an Intel creation; rather it is one of a family of similar acronyms to describe multiprocessing architectures.)
SIMD's roots at Intel go back to the early days of the PC. Readers who can remember back to those early days know that the 8086 processor could work with an optional floating-point coprocessor called the 8087. The 8087 chip could be detected by an application. If found, the application could issue escape instructions to the 8086 telling it to load several data items into the 8087 and have the 8087 perform a mathematical operation on them. The application could then retrieve the result from the 8087. The 8086 could be made to emulate floating-point operations without the 8087, but these operations were very slow. So loading the 8087 and having native hardware support for floating-point math was a significant performance gain, especially for scientific applications. An additional benefit was the 8087's 80-bit data registers. These enabled the lowly 16-bit PC to perform integer math on very, very large numbers.
Successive generations of the x86 processor line saw parallel shipments of math coprocessors. This tradition of separate coprocessor chips saw its last instantiation in the 386 generation, which had a 387 coprocessing sibling. The 486DX chip had the 387 core built into it. This chip was the first to integrate the coprocessor as an on-board floating-point unit, or FPU.
As most readers know, after the 486 processor the new generation of Intel processors was called the Pentium® processor. During the heyday of the 486, some Intel engineers in Israel (where much of the x87 coprocessor technology had been developed) realized that the built-in 80-bit registers of the FPU were mostly unused in PC applications, since prior to the 486 few machines had floating-point hardware.
They began to explore the possibility of making these registers available as a pair of trick 32-bit registers. Specifically, they examined the possibility of loading two 32-bit values into an FPU register and performing the same operation on them. This was the first time Intel considered SIMD in its general-purpose processors. By 1992, the notion had been extended significantly, so that the FPU registers might also hold four 16-bit data items. And the operations need not be floating point; the registers could contain multiple integers. While this research was being pursued, PC applications were moving towards graphical interfaces and multimedia. Multimedia in particular requires numerous computations of identical type performed on a large set of data items. It was a perfect fit for Intel's SIMD extensions.
The Pentium® generation of processors was the first to have the FPU integrated into the core chip in all versions. As such, it made sense for Intel to deploy its SIMD research in those chips. It did so under the name of multimedia extensions, or as they're commonly known, Intel® MMX technology.
MMX technology appeared only in the second generation of the Pentium® processor. The first generation had the integrated floating-point unit, but not SIMD capability. MMX technology added 57 instructions and eight 64-bit registers, which were mapped onto the Pentium® processor's eight 80-bit FPU registers. These MMX technology registers could be loaded with packed integers of different widths: eight 8-bit, four 16-bit, or two 32-bit values, or a single 64-bit value. The instruction chosen by the software defined the width.
To tell whether a system supported MMX technology, Intel provided a test routine. It recommended that application vendors provide dynamic link libraries (DLLs) for systems with and without MMX technology. This way they could capitalize on it whenever it was installed. The Pentium® II and all subsequent generations of the x86 processors have MMX technology built-in. As a result, most applications vendors today simply assume MMX technology is available. They enforce this assumption simply by requiring at minimum a Pentium® II processor.
The introduction of the Pentium® III processor saw the advent of "Internet Streaming SIMD" extensions, or what in hindsight might be dubbed SSE-1 (technically there's only SSE and SSE-2). (The use of the term "Internet" is likely a marketing device, since the Web was all the rage when the Pentium® III processor was introduced. However, even if you correct Internet to Web, it's difficult to see what aspect of the Web needed additional floating-point instructions. Intel quietly dropped the "Internet" qualifier when it released SSE-2.)
SSE introduced 70 new instructions that accelerated floating-point math and 3D computations. With SSE, Intel moved to a new architecture. The SSE instructions are not extensions to the MMX instruction set; rather, they are a different instruction set that uses 128-bit registers. SSE registers are not mapped onto the 8087 registers the way MMX technology registers are; SSE has its own registers. (SSE and MMX instructions can co-exist with no difficulty.) The "streaming" part of SSE refers to the subset of instructions that perform very fast floating-point operations needed by video and audio encoding standards, especially MPEG-1 and 2. In fact, to provide floating-point capabilities for both scientific and entertainment purposes, SSE supports the IEEE* floating-point standard as well as a flush-to-zero mode. The latter gives up precision for very small results, which are rounded to zero, but the operations can be performed considerably faster. In many imaging applications, the loss of the extra precision is of no consequence, since its effect cannot be discerned.
The second generation of SSE, so-called SSE-2, appeared in the Pentium® 4 processor, which shipped in 2000. SSE-2 added a series of new instructions to SSE that extend the operations that can be performed on packed data (where two or more data items in a single register are operated on at once) to scalar data as well (where only one element in the register is used in the operation). In conjunction with these operations, SSE-2 also adds several instructions for moving operands between registers. This can be useful where operands and results of one computation serve as inputs to a second operation.
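The difference between packed and scalar operations is easiest to see side by side. The following is a minimal sketch, assuming a C compiler that provides the SSE-2 intrinsics header `<emmintrin.h>` and targets an SSE-2 capable x86 processor; the function names `add_packed` and `add_scalar` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE-2 intrinsics */

/* Packed add (ADDPD): both 64-bit lanes are summed in one instruction.
   out[0] = a[0] + b[0], out[1] = a[1] + b[1]. */
void add_packed(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    __m128d va = _mm_loadu_pd(a);   /* unaligned load of two doubles */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* Scalar add (ADDSD): only the low lane is summed.
   out[0] = a[0] + b[0], out[1] = a[1] (upper lane copied from a). */
void add_scalar(const double a[2], const double b[2], double out[2])
{
    assert(a && b && out);
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_sd(va, vb));
}
```

With `a = {1.0, 2.0}` and `b = {10.0, 20.0}`, the packed version yields `{11.0, 22.0}`, while the scalar version yields `{11.0, 2.0}`: the upper element passes through untouched.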
Of these, the most important change in SSE-2 is the ability to perform the same math operation simultaneously on every item packed in a register, whether the items are integers or single- or double-precision floating-point values, including pairs of 64-bit items. SSE saw the arrival of 128-bit registers. The largest operands such a register can hold are a pair of 64-bit items, which is still true. However, only a limited set of operations could be performed on a limited set of data types under SSE. The current generation of SSE gets rid of most of these limitations. Pretty much all operations, including complex operations like square roots and reciprocals, can be performed by SSE-2 on pairs of 64-bit numbers, regardless of their precision.
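As an illustration of operating on a pair of 64-bit items, here is a small sketch in C using SSE-2 intrinsics. It assumes a compiler that ships `<emmintrin.h>` and an SSE-2 capable x86 target; the helper names `sqrt_pair` and `add_i64_pair` are hypothetical. The first computes two double-precision square roots with one instruction, the second adds two pairs of 64-bit integers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE-2 intrinsics */
#include <stdint.h>

/* Square root of two doubles at once (SQRTPD). */
void sqrt_pair(const double in[2], double out[2])
{
    assert(in && out);
    _mm_storeu_pd(out, _mm_sqrt_pd(_mm_loadu_pd(in)));
}

/* Lane-wise add of two pairs of 64-bit integers (PADDQ). */
void add_i64_pair(const int64_t a[2], const int64_t b[2], int64_t out[2])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi64(va, vb));
}
```

Both helpers use unaligned loads and stores for simplicity. Code that can guarantee 16-byte alignment may prefer the aligned variants (`_mm_load_pd`, `_mm_store_pd`, and so on).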
The primary use of SSE-2 will undoubtedly be found in graphics rendering. The SSE-2 operations have been designed to improve performance of these operations in particular:
- Alpha saturation: this process determines the opacity of the colors on the screen. The SSE-2 registers can be loaded with four doublewords, each packed with the four 8-bit values that represent a pixel (RGB plus alpha). Alpha saturation can then be computed on four pixels at a time with a series of simple SSE-2 math instructions.
- Blending: this process is used to add skin to a model or to clip images as they are moved off the edge of a viewport. Again, the ability to load multiple data points in memory and perform the necessary calculations in parallel is where SSE-2 delivers its most effective performance enhancement.
- 3D-light calculation
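To make the four-pixels-at-a-time idea concrete, the sketch below adds two groups of four RGBA pixels channel by channel with saturation, so that a sum above 255 clamps to 255 instead of wrapping around. It is a generic saturating add, not necessarily the exact alpha-saturation computation a given renderer uses. It assumes C with `<emmintrin.h>` on an SSE-2 capable x86 target, and the function name `add_pixels_saturated` is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE-2 intrinsics */
#include <stdint.h>

/* a, b, out each hold four pixels as 16 bytes: R,G,B,A, R,G,B,A, ...
   All 16 channel values are added in a single instruction (PADDUSB),
   each one clamped to the 0..255 range. */
void add_pixels_saturated(const uint8_t a[16], const uint8_t b[16],
                          uint8_t out[16])
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

For example, a channel pair of 200 and 100 produces 255, while 10 and 20 produces 30.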
In addition, numerous calculations involved in MPEG-2 encoding can be facilitated by new instructions that allow greater use of packing operands into the SSE-2 registers and performing SIMD calculations. The last section of this article points to further information on this.
Currently, to use these instructions, developers need either Intel® C/C++ compilers and assemblers or Microsoft Visual C++* and MASM. If using the latter tools, make sure to get the updated packs for the Pentium® 4 processor. Borland's website makes no mention of support for SSE-2. Its Delphi 6.0* product supports SSE and C++ Builder 5.0 supports only MMX technology; as of early September no support for Pentium® 4 processor SSE-2 instructions had been announced.
As with all previous processor-specific extensions, Intel requires you to check for the presence of the needed processor with the CPUID routine. The necessary code for this is presented in the Optimization Reference Manual discussed in the next section.
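For orientation, a processor-side check might look like the sketch below. It assumes GCC or Clang on x86, whose `<cpuid.h>` header provides `__get_cpuid` (Microsoft Visual C++ offers a different `__cpuid` intrinsic in `<intrin.h>`). The function name `cpu_has_sse2` is hypothetical. SSE-2 support is reported in bit 26 of EDX for CPUID function 1. This tests the processor only; Intel's manual remains the reference for a complete check, which also covers operating system support.

```c
#include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

/* Returns 1 if the processor reports SSE-2 support, 0 otherwise. */
int cpu_has_sse2(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if CPUID function 1 is not available. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (int)((edx >> 26) & 1u);  /* EDX bit 26 = SSE-2 */
}
```

An application would typically call this once at startup and select either an SSE-2 code path or a fallback based on the result.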
The definitive source for information on the Pentium® 4 processor SSE instructions is the Intel website. Here are some useful pages.
This link* contains an interactive tutorial that steps a user through the new SSE-2 instructions, showing the code in C and assembly language and, using animation, the effect of the instructions on operands in the 128-bit registers. For most developers, this short tutorial is the place to start.
In terms of manuals, all programmers interested in SSE-2 should download the Pentium® 4 Processor Optimization Reference Manual, which comes as a 333-page PDF file, downloadable here. Other manuals of interest on the Pentium 4 processor can be downloaded here.
This link (use keyword sse2) contains a variety of technical articles that show how to use SSE-2 to solve specific problems. Each article is accompanied by C and assembly-language code and timings, showing the performance boost provided by the new instructions. The articles cover big-number multiplication, tessellation, 3D transforms, and motion compensation in codecs such as MPEG-2.
Andrew Binstock is the principal analyst at Pacific Data Works LLC, a firm that specializes in market analysis and the composition of high-tech white papers. He can be reached at abinstock@pacificdatworks.com.
Copyright © 2001 DevX Inc.

