How Special Silicon Facilitates Parallel Arithmetic

Published:06/18/2009   Last Updated:06/18/2009

One of the most effective forms of parallelization is found deep inside Intel® x86 processors: the ability to execute parallel calculations with a single instruction. You can do this manually or let the compiler and the hardware handle it automagically. Either way, the performance of your parallel code will improve dramatically.

By Andrew Binstock

In a series of articles published on this portal, various authors have been explaining how to use lightweight threading to derive the maximum benefit from parallelization. Because visualization applications and the graphics used in games frequently rely on large data sets that can be processed concurrently, lightweight threading is a nearly optimal solution for those applications. For example, rendering graphics, transforming images, computing object physics, and converting scientific data into visual representations-all these activities rely on parallel processing of large data arrays.

The previous discussions of threading for data decomposition [Boost Game Performance with Threads, Fine Grained Parallelism in Games, Data Decomposition: Sharing the Love and the Data] examined ways to decompose data problems and distribute them among threads. For example, a common, but heavyweight, approach is to break up an array or matrix of data points and assign portions of it to specific threads that run independently-generally one to a core or processing pipeline. As with many heavyweight approaches, this design works, but it is does not do load balancing well. If one thread finishes early, it does nothing to support the other threads that are still working.

Lightweight multithreading, in contrast, breaks up the calculations into small tasks and send the tasks to a thread pool, which runs each one to completion. When a thread completes its task, the thread pool manager gives it a new one. In this way, load is balanced across all threads in the pool. Generally, the pool has logic to determine the optimal number of threads it should create, so this approach results in very effective use of the parallel resources on the processor.



SIMD and Intel® SSE


But we can go deeper and lighter still! We can use the parallel resources within the execution pipeline of a single thread. Specifically, we can exploit a feature that Intel first put into its x86 processors years ago, and which the company has expanded and refined with each generation of chips. This feature called SIMD (for single instruction, multiple data) enables one arithmetic instruction to be performed across up to 16 data items simultaneously. SIMD was first implemented by Intel as a way to issue more instructions per clock cycle. It consisted of series of special, wide registers and extensions to the instruction set to enable parallel arithmetic. Intel calls these instructions Intel® Streaming SIMD Extensions, or Intel® SSE.

Over the years, various generations of Intel® SSE have been released by Intel, thereby greatly expanding the capabilities and number of values that can be processed in one operation. The first generation was called Intel® SSE, and successors were called Intel® SSE-2, Intel® SSE-3, and so on. In this article, I'll use the term Intel® SSE to refer to all generations.

A typical Intel® SSE sequence involves loading a series of data values into a single large register and then executing one of the Intel® SSE instruction across all the values at once. The resulting values, often placed in a second Intel® SSE register, can then serve as input to a new calculation or be "unpacked" into their constituent fields and then mapped back to their locations in RAM.

Essentially the cycle of Intel® SSE calculations is: load, calculate one or more times using Intel® SSE, unload. An Intel® SSE register can hold many different types of values. For example, today's 128-bit register can hold:



  • 1 ea. 128-bit value
  • 2 ea. 64-bit values (either integer or floating-point)
  • 4 ea. 32-bit values (either integer or floating-point)
  • 8 ea. 16-bit values
  • 16 ea. 8-bit values

An Intel® SSE register must contain all values of the same type, so you can't mix any of the above. To support operations the various sizes of data elements, Intel® SSE instructions frequently have numerous variants - one for each data type. As a result, looking at Intel® SSE code can often reveal complex instruction names that can be easily parsed into descriptions of what they do and what data types they work with.

Coding For SIMD

There are several ways to code Intel® SSE operations. Let's start with a simple loop in C and see how it would look in assembly language using Intel® SSE op codes.

This example adds two arrays of 128 32-bit values and places the results into a third array.



/* x, y, z are 16-byte aligned arrays */
for ( i = 0; i < 128; i++ ) {
    x[i] = y[i] + z[i];


In SIMD, we load 128-bit SIMD registers, which are named xmm0-7.





L: movdga	xmm0,	y[eax]	; load 4 doublewords from y
   padd	xmm0, z[eax]	; add 4 doublewords from x
   movdga	x[eax], xmm0	; store resulting 4 doublewords into x
   add	eax, 16		; increase the point we start the next cycle from
   cmp	eax, 512		; is it past the end of the array? 
   jb		L			; looping until end of initial array (32 times)


There are several salient points to examine. The C code performs 128 moves and adds. The SIMD code performs 32 moves and adds. So, as can be seen, the SIMD code performs the same operation as the C code, but it uses one-quarter as many logical steps. If you were to time this code, depending on the platform, you would find the SIMD code more than 4x faster due to the fewer instructions in the loop when compared with the C equivalent. However, given the speed of today's processors, the difference of those few instructions would probably not show up. However, the SIMD benefit of a 4-to-1 ratio would definitely be visible, especially in a tight frequently iterated loop. In almost all cases, SIMD provides a substantial lift to array and matrix arithmetic.





SIMD Coding Options

The previous example used assembly language to code the SIMD logic. This is a common way of doing SIMD programming, because it gives the developer the greatest amount of control over the code. Typically, the code is written as an inline function to a C or C++ program. (If this form of coding appeals to you, the Software Vectorization Handbook [] from Intel Press is the fastest way to get started.)

There are, however, other alternatives for gamers and scientists who don't mind losing a little control over the code in exchange for simplifying the programming work.

The first option is called intrinsics. Intrinsics are instructions that are higher-level than assembly language, but lower level than C/C++. They are generally called from C and C++ programs, so they adopt those syntactic conventions.

Here is some code in C that steps through an array of 16 integers and shifts each one's value right by 2 bits (effectively dividing it by 4).





void shift2( int array[], int len ) {
    for( int i = 0; i < len; i++ ) {
        array[i] = array[i] >> 2;


Using intrinsics, this loop would appear as show in the next code snippet, provided that the value of len is evenly divisible by 4 and the array is 16-byte aligned. The former condition might not always be the case, but arrays should generally be 16-byte aligned, even for non-parallel code, as many instructions work better and faster when data is 16-byte aligned. This is especially true for compound data items.






void shift2( int array[], int len) {
    __m128i *array4 = (__m128i * ) array;
    for ( int i = 0; i < len / 4; i++ ) {
        array4[i] = _mm_srai_epi32( array4[i], 2);


This code divides the loop into quarter size arrays and runs the shift operation across ¼ of each array. The _m128i data type refers to a 128-bit Intel® SSE register for storing the 32-bit integers.

Notice that the use of intrinsics does not require leaving C or C++. Intrinsics integrate right in and if it were not for the sometimes unusual names they use, they would be difficult to distinguish from standard functions. Several C/C++ compilers for the x86 system support SIMD intrinsics, including the Intel® C/C++ Compiler (Windows* and Linux) and the Microsoft* Visual C++ (Windows* only) compiler, among others.





Moving Upscale

The assembly language option and intrinsics are both fairly low-level propositions. For developers not so comfortable down in the bits and bytes, Intel offers two higher-level means of programming Intel® SSE that make it easier, at the slight cost of intimate control of specific operations.

The first of these is to use C++ class libraries that ship with the Intel® C/C++ Compiler. These libraries wrap intrinsics with a truly high-level interface. In fact, by use of overloading, the libraries enable developers to use standard math operations. The key is to use the defined Intel® SSE data types. Here is an example that reproduces the code in the previous two snippets:






void shift2( int array[], int len ) {
    Is32vec4 *array4 = (Is32vec4 *) array;
    for( int i = 0; i < len/4; i++ ) {
        array4[i] = array4[i] >> 2;


As can be seen, the familiar C syntax for right shifts has replaced the intrinsic call of the previous example. The Is32vec4 identifier is a data type that corresponds to one of numerous Intel® SSE register types. It contains four 32-bit signed integers.

For developers who have access to the Intel® C/C++ Compiler, this approach is the highest-level SIMD solution if they want direct control over the Intel® SSE operations.

But there is another option: automatic vectorization. Opportunities for vectorization (a term that refers to the kind of arithmetic we've been discussing) can be identified by the Intel® C/C++ Compiler and Fortran Compiler. Because the compilers have all the needed information: operands, operators, and operations, it is capable of substituting Intel® SSE operations behind the scenes, without changing the results of the calculations.

To derive this benefit, the Intel® compiler must be run with the appropriate switches (in the -Q… family) set. It then looks for loops and vectorizes them automatically. The output messages from the compiler identify which loops it was able to vectorize. Should there be other parts of the program that could be vectorized, but which the compiler could not detect, then the use of the C++ libraries described previously is an elegant solution.





Super Extra Lightweight

The auto-vectorization capabilities of the Intel® compiler enable a very fine level of threading without much difficulty. It actually puts within reach the possibility of double parallelization. This can be achieved by combining lightweight threading with auto-vectorization. In this scenario, data decomposition is used to create small tasks that are handled in parallel by a thread pool (as mentioned earlier and discussed in detail in other articles on this site). Some of those tasks consist mostly or even entirely of Intel® SSE computations and, hence, they can be parallelized automatically by the Intel compiler. At that point, you have parallel operations occurring within parallel tasks-which truly maximizes the performance lift that today's processors can offer parallel computing.





Wrapping Up

If you're not using Intel® SSE for your arithmetic, chances are good you're not availing yourself of the full capabilities of the Intel processors. Fortunately, Intel® SSE instructions are easy to use and highly portable. If you don't know where to start, consider tight loops that perform arithmetic and branch out from there.

It's worth noting that Intel continues to enhance the Intel® SSE with their Advanced Vector Extensions (AVX). For information on AVX, see

To find out more information on programming with Intel® SSE, consult "The Software Vectorization Handbook" and The Software Optimization Cookbook, 2nd Ed., [] both from Intel Press. The examples in this article are adapted from them. They are remarkably approachable volumes that can help you exploit Intel® SSE effectively. Additional note: An excerpt from the first book's chapter on using auto-vectorization via the Intel® compiler is available at no charge here:





About the Author

Andrew Binstock writes technical white papers at Pacific Data Works LLC. He is also a senior contributing editor for InfoWorld and a columnist for SD Times. He is the author or co-author of several books on programming, including "Programming With Hyper-Threading Technology" from Intel Press. During his free time, he contributes to the open-source typesetting and page-layout project, Platypus, ( He can be reached through his blog at



Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804