Multi-core Introduction

Why is Multi-core Important?

Some people say that extending Moore’s Law into the future isn’t necessary, and that today’s computer hardware and software are good enough. This is a dubious notion given the history of the information technology industry. All too often, claims that computing capabilities are good enough, or that an innovation will never find a market in the first place, are proven wrong – no matter who’s making them.

In the 1970s, for example, an Intel engineer came to Gordon Moore with an idea for a computer that would be used in the home. Moore asked what it would be good for, and all the engineer could think of was that such a device might be used by housewives to store recipes. Moore, who couldn’t imagine his wife with her recipes on a computer in the kitchen, declared the idea not to have any practical application at all.

Today, there are more than half a billion PC users worldwide doing lots more than storing recipes.

Mobile users of notebook PCs and handheld devices want all this mobile gadgetry to work together better and be easier to use, especially when it comes to rich media applications involving music, photos and videos.

Business users of software for 3D modeling, scientific calculations or high-end digital content creation have their own growing list of performance-heavy requirements as well. And in a world where cost counts and competition is global, businesses increasingly seek computing and communications infrastructure platforms that can support end-to-end solutions.

With PCs moving out of the den and into the living room, digital-home users are seeking computing and communications platforms that emphasize entertainment applications and easy interoperability with consumer electronics devices.

Finally, with health care costs rising in most industrialized countries and the post-war baby boomers aging into their retirement years, a huge digital-health market is emerging. Problems in health-care research, diagnostics and productivity, and personal health care are ripe for new solutions based on increasingly powerful computing platforms and applications.

For all these users, performance means more than wringing additional benefits from a single application.

First, there’s the issue of power consumption and energy costs, dominant themes for policymakers, business IT managers and even families looking to trim their monthly expenses. In a New York Times article published Feb. 10, 2005, reporter Ian Austen wrote that “[w]ith their increased processing capabilities, computers today have become power gluttons.” Austen went on to cite an estimate by the U.S. Environmental Protection Agency which states that operating a typical computer in power-saving mode would result in an annual savings of $100 per computer.

Then there’s the issue of the evolving, interrupt-driven computer usage pattern. Users increasingly multitask, actively toggling between two or more applications, or working in an environment where many background processes compete with each other and with open applications for scarce processor resources.

In a 2002 Intel-sponsored Harris Interactive survey, 76 percent of computer users said they multitask at least occasionally or frequently on the PC, and nine out of 10 users reported that they’ve experienced problems while trying to perform multiple processor-intensive tasks on a PC. Problems include computer freezes, time lags, function shut-downs, choppy screens and audio distortion. Nearly 60 percent of survey respondents said they feel bored when a computer function makes them wait, so they do something else on the computer at the same time.

Aside from multitasking, users are faced with handling vastly growing quantities of application data. The amount of data in existence is doubling every 24 months – a sort of data equivalent to Moore’s Law. Worldwide, data growth is increasing so quickly it’s now measured by the exabyte. That’s 10^18 bytes, or a billion billion bytes. Other technologies associated with this explosion of data – storage, networking, input/output (I/O) and database – are advancing rapidly. But without another leap forward in processing capabilities, the opportunity to harvest and take advantage of this wealth of data will be missed.

No one expects a leap forward in processors’ core execution engines, which are already at the edge of the manufacturing envelope. In a few years, these processor cores may have larger cache sizes and at least slightly faster clock speeds than those on today’s chips. But because of physics- and engineering-related challenges, these advances will be hard won and won’t keep pace with performance demands. Inevitably, meeting new demands will require a scale-out approach that divvies up computing tasks among multiple processing cores.

One way to do this divvying up is to build single computers that are powered by multiple processors. For years this has been the norm in high-performance computing and supercomputing, two industry segments in which single machines often contain hundreds or even thousands of processors.

But adding even one additional processor isn’t always practical for mainstream PC users. Because of the additional socket and associated chips and circuitry, motherboards for dual-processor systems are more expensive than their single-processor counterparts. Also, since the processor is often the most expensive single component in a given computer system, trying to achieve scale-out performance by swapping a single-processor machine for a multiprocessor machine can be expensive.

What is the solution?

Another scale-out option, and the one Intel has embraced in its move to multi-core architectures, is to add two or more brains to each processor. Explained most simply, multi-core processor architecture entails silicon design engineers placing two or more execution cores – or computational engines – within a single processor. This multi-core processor plugs directly into a single processor socket, but the operating system perceives each of its execution cores as a discrete logical processor with all the associated execution resources.

Multi-core chips do more work per clock cycle, and thus can be designed to operate at lower frequencies than their single-core counterparts. Since power consumption goes up proportionally with frequency, multi-core architecture gives engineers the means to address the problem of runaway power and cooling requirements.

Multi-core processing may enhance user experience in many ways, such as improving the performance of compute- and bandwidth-intensive activities, boosting the number of PC tasks that can be performed simultaneously and increasing the number of users that can utilize the same PC or server. And the architecture’s flexibility will scale out to meet new usage models that are bound to arise in the future as digitized data continues to proliferate.

Any application that will work with an Intel single-core processor will work with an Intel multi-core processor. However, to make the most of a multi-core processor today, the software running on the platform must be written such that it can spread its workload across multiple execution cores. This functionality is called thread-level parallelism, or “threading,” and applications and operating systems that are written to support it (such as Microsoft Windows* XP, Windows* Server and various Linux* vendor offerings) are referred to as “threaded” or “multithreaded.”

A multi-core processor can execute multiple and completely separate threads of code. This can mean that one thread runs from an application and a second thread runs from an operating system, or parallel threads run from within a single application. Multimedia applications are especially conducive to thread-level parallelism because many of their operations can run in parallel.

Despite the recent attention focused on multi-core architecture, Intel has been delivering threading-capable products for more than a decade. By 1994, the Intel Pentium processor already featured instruction-level parallelism, an architectural feature that extracted instructions in a single thread of code, executed the instructions in parallel, and then recombined them in the same order.

That year, Intel added “glue-less” dual-processing capability – two full processors that plugged into two board sockets – to provide a hardware-enhanced threaded environment for servers and workstations. The company expanded its efforts in 1995, providing glue-less multiprocessing capability with the introduction of the Intel® Pentium® Pro processor. The Intel Pentium Pro processor enabled the seamless connection of as many as four processors on a single board, providing servers and workstation-class products with the means to attain higher compute throughput in threaded software environments.

These efforts provided a springboard for delivering higher degrees of thread-level parallelism in a single processor on volume platforms. In the early part of the 2000s, Intel introduced Hyper-Threading Technology (HT Technology) into its Intel NetBurst® microarchitecture (for Intel® Pentium® 4 and Intel® Xeon® processors) as an innovative means to deliver higher thread-level parallelism on volume platforms. HT Technology enables processors to execute tasks in parallel by weaving together multiple threads in a single-core processor.

By fall 2004, HT Technology had shipped on well over 50 million Intel Pentium 4 products for desktops, servers and mobile PCs, offering new incentive for software developers to design applications capable of processing information in parallel for greater efficiency.

Intel has been offering multi-core processors since 2005. At the spring 2006 Intel Developer Forum event in San Francisco, the company disclosed details of the Intel® Core™ microarchitecture, the industry-leading foundation for Intel’s multi-core server, desktop and mobile processors. Intel Core microarchitecture products built with advanced 65 nanometer process technology deliver higher-performing, yet more energy-efficient processors that spur more stylish, quieter and smaller mobile and desktop computers and servers. Likewise, these new machines can reduce electricity- and real estate-associated costs, and provide critical capabilities such as enhanced security, virtualization and manageability for consumers and businesses.

With Intel Core microarchitecture, each core is equipped with a nearly complete set of hardware resources, including cache memory, floating point and integer units, etc. One programming thread can utilize all these resources while another thread can use all the hardware resources on another core. The same programming techniques that for years have been used to write threaded applications for multiprocessor systems, or more recently for platforms based on HT Technology, can be used to take advantage of multi-core processors.

Thanks to $28 billion in R&D spending through the tech downturn from 2001 to 2003, Intel has the manufacturing capacity in place to quickly increase its production of multi-core chips. This investment provides the foundation of Intel’s twin 2006 transitions – to a new 65-nanometer process technology in several of its fabrication facilities and to a powerful new microarchitecture. By the end of 2006, Intel Core microarchitecture will be at the heart of its PC and server platforms.

Further on the horizon, Intel plans to deliver additional processors with two or more cores for mobile, desktop and server platforms. At the spring 2005 Intel Developer Forum in San Francisco, Intel described goals to mass-produce chips with a hundred or more processing cores. Applications will have to be multithreaded to fully exploit these architectural innovations. The alternative – a single-threaded application using just 1/100th of a processor’s throughput – would likely be a tough sell, no matter what users are demanding in the future.

Original Article by Geoff Koch

On Threading and Intel tools



Product Brief

Intel® Core™ Duo processor breaks new ground. Its dual-core technology rewrites the rules of computing, delivering optimized, power-efficient computing and breakthrough dual-core performance with amazingly low power consumption.+ The Intel Core Duo processor is at the heart of Intel’s premium desktop and notebook platforms: Intel® Viiv™ Technology and Intel® Centrino® Duo mobile technology, respectively.

More Efficient Use of Power: Demand for greater power efficiency in compute platforms is on the rise across all client segments and form factors. The Intel Core Duo processor balances great dual-core computing capabilities with power savings that enable extended battery life in notebooks. Its enhanced voltage efficiency supports cooler and quieter desktop-type systems.


Traditional mobile and desktop processors limit system design options. Users find they must compromise in areas such as cooling fan noise, battery life, performance and capabilities. With Intel Core Duo processors, the world’s most innovative PC manufacturers can drive a new generation in computer product designs that meet end customer needs more effectively.

The Intel Core Duo processor includes two mobile-optimized execution cores in a single processor. This design enables execution of parallel threads or applications on separate cores with dedicated CPU resources. The result is outstanding dual-core performance and greater system response when running multi-threaded or multiple demanding applications simultaneously.+

The Intel Core Duo processor features a high-performance core architecture that uses micro-op fusion and Advanced Stack Management techniques to maximize performance while optimizing power efficiencies. Micro-op fusion combines micro-ops derived from the same macro-op. Advanced Stack Management reduces the number of micro-ops in stack-related operations by tracking relative stack pointer changes locally. Reducing the number of micro-ops results in more efficient scheduling for better performance at lower power.

Features and Benefits:

Intel® Smart Cache (2MB L2 cache with Advanced Transfer Cache Architecture) -

Delivers a smarter and more efficient cache and bus design to enable enhanced dual-core performance and power savings.

Intel® Digital Media Boost -

Micro-architectural enhancements, including instruction optimizations and performance improvements, accelerate a wide variety of processing-intensive tasks, such as audio/video processing, image processing, 3D graphics, and scientific calculations.

Intel® Dynamic Power Coordination with Dynamic Bus Parking -

Provides on-demand, coordinated dual-core performance together with an enhanced low-power management feature, Dynamic Bus Parking. This enables platform power savings by allowing the chipset to power down along with the processor in low-frequency mode states.

Enhanced Intel® Deeper Sleep with Dynamic Cache Sizing -

Allows the processor to lower its voltage below the Deeper Sleep minimum voltage to enable enhanced power savings. Dynamic Cache Sizing is a new power-savings mechanism that enables the Intel® Smart Cache to be dynamically flushed to system memory based on demand or during periods of inactivity.

Intel® Advanced Thermal Manager -

A new thermal management system delivers enhanced accuracy and more precise acoustic control to enable quieter, cooler, thinner system designs.

Power-Optimized 667 MHz System Bus -

Utilizes Source-Synchronous Transfer (SST) of address and data to enable improved performance by transferring data at 4X the bus clock. Advanced Gunning Transceiver Logic (AGTL+) signaling technology, a variant of GTL+ signaling technology, delivers low-power enhancements.

Enhanced Intel SpeedStep® Technology Support -

Multiple performance modes enable optimum performance at the lowest power, using real-time dynamic switching of the voltage and frequency between multiple performance modes based on CPU demand.


New Intel 65nm Process Technology -

Smaller transistors enable more logic and more frequency headroom for increased performance.


+ System performance, battery life, and functionality will vary depending on your specific operating system, hardware and software configurations. References to enhanced performance as measured by SYSmark*2004, PCMark*2005 and 3DMark*2005 refer to comparisons with previous generation processors. References to improved battery life as measured by MobileMark*2005, if applicable, refer to previous generation processors. See the Intel® Centrino® Web site for more information.


Intel® Core™ Duo Processor Original Article:

What is Pthreads?

Pthreads is a set of threading interfaces developed by the IEEE (Institute of Electrical and Electronics Engineers) committees in charge of specifying a Portable Operating System Interface (POSIX). The P in Pthreads stands for POSIX and, in fact, Pthreads are occasionally referred to as POSIX threads. Essentially, the POSIX committee defined a basic set of functions and data structures that it hoped would be adopted by numerous vendors so that threaded code could be ported easily across operating systems. The committee’s dream was realized by UNIX vendors, which by and large all implemented Pthreads. (The notable exception is Sun, which continues to favor Solaris* threads as its primary threads API.) The portability of Pthreads has been expanded further by Linux’s adoption of the interface and by a port to the Windows platform.

Pthreads specifies the API to handle most of the actions required by threads. These actions include creating and terminating threads, waiting for threads to complete, and managing the interaction between threads. In the latter category exist various locking mechanisms that prevent two threads from trying to modify the same data values simultaneously: mutexes, condition variables, and semaphores. (Technically speaking, semaphores are not part of Pthreads, but they are conceptually closely aligned with threading and available on all systems on which Pthreads run.)

To make use of Pthreads, developers must write their code specifically for this API. This means they must include header files, declare Pthreads data structures, and call Pthreads-specific functions. In essence, the process is no different than using other libraries. And like other libraries on UNIX and Linux, the Pthreads library is simply linked with application code (via the -lpthread parameter).
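
The minimal sketch below, with illustrative function and variable names, shows the workflow just described: include the Pthreads header, declare the thread handles, create and join the threads, and build the file with the -lpthread option (for example, cc app.c -lpthread).

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];   /* Pthreads data structures */
    int ids[4];
    int i;

    for (i = 0; i < 4; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);   /* wait for each thread to complete */

    return 0;
}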

While the Pthreads library is fairly comprehensive (although not quite as extensive as some other native API sets) and distinctly portable, it suffers from a serious limitation common to all native threading APIs: it requires considerable threading-specific code. In other words, coding for Pthreads irrevocably casts the codebase into a threaded model. Moreover, certain decisions, such as the number of threads to use, can become hard-coded into the program. In exchange for these constraints, Pthreads provides extensive control over threading operations; it is an inherently low-level API that mostly requires multiple steps to perform simple threading tasks. For example, using a threaded loop to step through a large data block requires that threading structures be declared, that the threads be created individually, that the loop bounds for each thread be computed and assigned to the thread, and ultimately that the thread termination be handled – all of which must be coded by the developer, as in the sketch below. If the loop does more than simply iterate, the amount of thread-specific code can increase substantially. To be fair, the need for this much code is true of all native threading APIs, not just Pthreads.
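
To make the amount of thread-specific code concrete, here is a rough sketch of the threaded-loop scenario just described; the array size, thread count, and names are chosen purely for illustration. The threading structures are declared, each thread is created individually with its own precomputed loop bounds, and the threads are joined at the end.

#include <pthread.h>

#define N        1000000
#define NTHREADS 4

static double data[N];

struct bounds { int start; int end; };

static void *process_chunk(void *arg)
{
    struct bounds *b = (struct bounds *)arg;
    int i;
    for (i = b->start; i < b->end; i++)
        data[i] = data[i] * 2.0;          /* stand-in for real per-element work */
    return NULL;
}

int main(void)
{
    pthread_t     threads[NTHREADS];
    struct bounds b[NTHREADS];
    int           t, chunk = N / NTHREADS;

    for (t = 0; t < NTHREADS; t++) {
        b[t].start = t * chunk;
        b[t].end   = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&threads[t], NULL, process_chunk, &b[t]);
    }
    for (t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);   /* handle thread termination */

    return 0;
}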

Because of the amount of threading code needed to perform straightforward operations, developers have been increasingly looking for a simpler alternative to Pthreads.


Threading Models for High-Performance Computing: Pthreads or OpenMP?


What is OpenMP?

In 1997, a group of vendors came together under the aegis of hardware manufacturer Silicon Graphics to formulate a new threading interface. Their common problem was that the primary operating systems of the time all imposed drastically different ways of programming for threads. UNIX employed Pthreads, Sun used Solaris threads, Windows used its own API, and Linux used Linux threads (until its subsequent adoption of Pthreads). The committee wanted to design an API that would enable a codebase to run without changes equally well on Windows and UNIX/Linux. In 1998, it delivered the first API specification of what was called OpenMP. (In those days, the term ‘open’ was associated with the concept of support from multiple vendors – as in open systems – rather than with today’s implication of open source.)

The OpenMP specification consists of APIs, a set of pragmas, and several settings for OpenMP-specific environment variables. As further revisions have been made to the standard, it has become clear that one of OpenMP’s most useful features is its set of pragmas. By judicious use of these pragmas, a single-threaded program can be made multithreaded without recourse to APIs or environment variables. With the recent release of OpenMP 2.0, the OpenMP Architecture Review Board (ARB), which is the official name for the committee that formulates the OpenMP specification, made clear its preference that developers use the pragmas rather than the APIs. Let’s examine this approach in greater depth, starting with a recap of what pragmas are.

The following definition of pragmas, taken from Microsoft’s documentation, is one of the clearest explanations: “The #pragma directives offer a way for each compiler to offer machine- and operating system-specific features while retaining overall compatibility with the C and C++ languages. Pragmas are machine- or operating system-specific by definition, and are usually different for every compiler. In C and C++, where they are most commonly used, pragmas have the form: #pragma token-string”

A key aspect of pragmas is that if a compiler does not recognize a given pragma, it must ignore it (according to the ANSI C and C++ standards). Hence, it is safe to place library-specific pragmas in code without worrying that the code will be broken if it is compiled with a different toolset.

Figure 1 shows a simple OpenMP pragma in action.

#pragma omp parallel for
for ( i = 0; i < x; i++ )
printf ( "Loop number is %d%d%d\n", 
i, i, i );

Figure 1. Threading a loop with a simple OpenMP pragma

This pragma tells the compiler that the succeeding for-loop should be made multithreaded, and that the threads should execute in parallel. Between work done by the compiler and the OpenMP libraries, the for-loop will be executed using a number of threads, as explained shortly. OpenMP will take care of creating the threads, threading the for-loop by dividing the iterations among the threads, and handling the threads once the for-loop completes.

While OpenMP does not guarantee a priori how many threads will be created, it usually chooses a number equivalent to the number of available execution pipelines. On standard multiprocessor environments, this number is the number of processors. On systems with processors endowed with Hyper-Threading Technology, the number of pipelines is twice the number of processors. An API function or environment variable can be used to override the default number of threads.
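
As a minimal illustration, the default can be overridden either by setting the OMP_NUM_THREADS environment variable before running the program or by calling the API, as in this sketch:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(2);   /* override the default for subsequent parallel regions */

    #pragma omp parallel
    {
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}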

OpenMP offers numerous other pragmas that identify code blocks to thread, scope variables as shared across threads or local to individual threads, specify where to synchronize threads, control how tasks or loop iterations are scheduled to threads, and so forth. Ultimately, it provides medium-grained control over threading functionality, as sketched below. At this level of granularity, which is sufficient for many HPC applications, OpenMP delivers better than most other options on its promise of portability, optimal execution, and, especially, minimized disruption to the codebase.
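
The sketch below, with a placeholder loop body, illustrates a few of these clauses: variables scoped as shared or private, a tunable schedule, and a reduction that combines each thread’s partial result. The function and array names are illustrative.

#include <math.h>

double sum_sqrt(const double *a, int n)
{
    double total = 0.0;
    int i;

    #pragma omp parallel for shared(a) private(i) reduction(+:total) schedule(dynamic, 1000)
    for (i = 0; i < n; i++)
        total += sqrt(a[i]);   /* each thread accumulates its own copy of total,
                                  combined automatically when the loop ends */
    return total;
}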




Which Threading Model is Right For You?

OpenMP is convenient because it does not lock the software into a preset number of threads. This kind of lock-in poses a big problem for threaded applications that use lower-level APIs such as Pthreads or Win32. How can software written with those APIs scale the number of threads when running on a platform where more processors are available? One approach has been to use thread pools, in which a set of threads is created at program start-up and the work is distributed among them. However, this approach requires considerable thread-specific code, and there is no guarantee that it will scale optimally with the number of available processors. With OpenMP, the number of threads need not be specified in the code at all.

OpenMP’s pragmas have another key advantage: by disabling support for OpenMP, the code can be compiled as a single-threaded application. Compiling the code this way can be tremendously advantageous when debugging a program. Without this option, developers will frequently find it difficult to tell whether complex code is working incorrectly because of a threading problem or because of a design error unrelated to threading.
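
For instance, the standard _OPENMP macro, which OpenMP-aware compilers define only when OpenMP support is switched on, lets the same source report how it was built; with OpenMP disabled, the sketch below compiles as an ordinary single-threaded program.

#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    printf("OpenMP enabled; up to %d threads available\n", omp_get_max_threads());
#else
    printf("Compiled without OpenMP support; running single-threaded\n");
#endif
    return 0;
}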

Should developers need finer-grained control, they can avail themselves of OpenMP’s threading API. It includes a small set of functions that fall into three areas: querying the execution environment’s threading resources and setting the current number of threads; setting, managing, and releasing locks to resolve resource access between threads; and a small timing interface. Use of this API is discouraged because it takes away the benefits provided by the pragma-only approach. At this level, the OpenMP API is a small subset of the functionality offered by Pthreads. Both APIs are portable, but Pthreads offers a much greater range of primitive functions that provide finer-grained control over threading operations. So, in applications in which threads have to be individually managed, Pthreads or the native threading API (such as Win32 on Windows) would be the more natural choice.
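
A brief sketch of those three API areas, with a trivially simple shared counter standing in for real work, might look like this:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_lock_t lock;
    double start;
    int count = 0;

    printf("System reports %d processors\n", omp_get_num_procs());   /* query resources */

    omp_init_lock(&lock);
    start = omp_get_wtime();          /* timing interface */

    #pragma omp parallel
    {
        omp_set_lock(&lock);          /* serialize access to the shared counter */
        count++;
        omp_unset_lock(&lock);
    }

    printf("%d threads ran in %f seconds\n", count, omp_get_wtime() - start);
    omp_destroy_lock(&lock);
    return 0;
}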

To run OpenMP, a developer must have a compiler that supports the standard. On Linux and Windows, Intel® Compilers for C/C++ and Fortran support OpenMP. On the UNIX platform, SGI, Sun, HP, and IBM all provide OpenMP-compliant compilers. An open-source implementation is also available from the Omni OpenMP Compiler project.*

So, if you’re writing UNIX or Linux applications for HPC, look at both Pthreads and OpenMP. You might well find OpenMP to be an elegant solution.




OpenMP for Prototyping and Testing
OpenMP’s ease of use is the foremost reason for prototyping with it. To add threading to a serial application, the easiest way (assuming it will compile with an OpenMP-aware compiler such as those from Intel) is to insert some appropriate OpenMP pragmas, then recompile and see whether you gain any performance benefit on dual-processor or Hyper-Threading Technology enabled systems. If the performance is acceptable and maintainable, you can keep the OpenMP version. If not, recode everything that OpenMP did, using Win32 threads or Pthreads, to see if results improve. Either way, OpenMP provides a “quick and dirty” test for potential benefits from multithreading.

OpenMP can also be used in prototyping to find the best load balance for a set of loop iterations. The API allows run-time choice of loop schedule, which allows you to test different schedules and “chunk sizes” to achieve the appropriate balance. In the following example, the program finds all the prime numbers between two integer values, using an algorithm that divides the number under examination by all the numbers from 2 to the square root of the number. When it finds a factor, the search stops; otherwise, when the set of potential factors is exhausted, the number is declared to be prime:


#include <stdio.h>
#include <stdlib.h>   /* for atoi() */
#include <omp.h>
#include <math.h>

int main(int argc, char *argv[])
{
    int i, j, limit;
    int start, end;   /* range of numbers to search */
    int prime;        /* boolean to mark if number is prime */

    start = atoi(argv[1]);   /* start and end from command line */
    end = atoi(argv[2]);
    if (!(start % 2)) start++;   /* ensure the starting value is odd */

    printf("Range to check for Primes: %d - %d\n\n", start, end);

    #pragma omp parallel private(j, limit, prime)
    #pragma omp for schedule(runtime)
    for (i = start; i <= end; i += 2) {
        limit = (int) sqrt((float)i) + 1;
        prime = 1;   /* assume number is prime */
        j = 3;
        while (prime && (j <= limit)) {
            if (i % j == 0) prime = 0;
            j += 2;
        }
        if (prime) printf("%5d is prime\n", i);
    }

    printf("\nProgram Done.\n");
    return 0;
}


In the above example, the number of possible factors that need to be tried can vary from one test value to the next. Different iterations of the i-loop therefore take different amounts of time, and the larger the number being tested, the more factors must be tried before a prime can be declared. If the iterations of that loop aren't assigned well to threads, it is very easy to create a load imbalance. This algorithm is thus a good candidate for letting the run-time system pick the schedule and for investigating the impact different chunk sizes have on the load balance.

OpenMP can also help test code to find out if it is thread safe. You can easily construct test cases that, for example, create threaded calls into a library using OpenMP. If your design allows threads that call function A at the same time other threads are calling function B, you can set up calls to function A and function B. A small thread-safety checking application can be written using OpenMP to call function A from one thread and function B from another thread. This approach can be useful when one team is working on a threaded application that makes calls into a library, and another team is working on the library being called. In this case, OpenMP will generate the threads needed to call the library functions you are testing. For example:


#pragma omp parallel sections
{
    #pragma omp section
    function_A();   /* one of the two library functions under test; */

    #pragma omp section
    function_B();   /* each section runs on its own thread          */
}

The library doesn’t ultimately have to be called from OpenMP threads; the finished application may use another threading interface. However, OpenMP can test the thread safety, quickly generating threaded calls to multiple combinations of functions to see if they actually work together.
Related Links
Choosing Between OpenMP* and Explicit Threading Methods

Synchronization: The Key to Multitasking Benefits

Threading is nothing new to most programmers. Even those targeting single-processor, non-HTT environments may use threads as a matter of convenience in programming and debugging. Sometimes, separate threads are spawned off to handle specific tasks. However, on multiprocessor or HTT systems, programs can be designed to run optimally by performing a number of tasks simultaneously. To take advantage of this capability, the threads must be synchronized appropriately to do what they need to do at the right time.

In order to avoid race conditions during the execution of a threaded application, mutual exclusion to shared resources is required to allow a single thread to access and change the state of shared resources. The shared resource can be a data structure or variable in the address space. Minimizing synchronization overheads is critical to application performance. In multithreaded applications, while a thread is executing a code section that accesses a shared resource (critical region), other threads waiting to access that resource are either spinning or waiting in a queue.

Typically, synchronization duties are handled by making use of locks (semaphores being a generalized form) which are a means of limiting or controlling access to a resource when many threads are executing. Locks can be implemented in many ways and are the more traditional means of handling synchronization between threads. In addition to conventional multithreading methodologies, some developers also vehemently believe that there can be performance benefits gained by developing multithreaded applications that either don’t use locks or use them very sparingly.

Multiple Approaches to Multithreaded Applications



Locks: an Overview

A number of issues are associated with the use of locks, and the Intel document Developing Multithreaded Applications: A Platform Consistent Approach is probably the most valuable resource available for those delving into the realm of multiprocessing, especially on Intel platforms. Locks are examined in depth in Chapter 4 of that paper, which serves as a good guideline for synchronizing multiprocessor and Hyper-Threading Technology applications. That said, the following should be considered simply an overview of factors to keep in mind when using locks.

Synchronization constructs serialize execution. However, very few multithreaded programs are entirely synchronization-free. Fortunately, it is possible to mitigate some of the system overhead associated with synchronization by choosing appropriate constructs. If the variable being updated is shared among threads, the load→update→store instructions must be atomic (that is, the sequence of instructions must not be preempted before completion). Controlling how long a waiting thread spins before relinquishing the CPU is especially important for small and high-contention critical regions. Spin-waits can significantly impact performance on SMP systems and CPUs with Hyper-Threading technology. Critical regions, whether built with the Win32* CRITICAL_SECTION variables or mutex HANDLEs, should have only one point of entry and exit.

Jumping into critical regions defeats synchronization. Jumping out of a critical region without calling LeaveCriticalSection or ReleaseMutex will deadlock waiting threads. It’s important to prevent situations where threads terminate while holding CRITICAL_SECTION variables because this will deadlock waiting threads. Single entry and exit points also make for clearer code.
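
As an illustration of the single-entry, single-exit guideline, the following sketch uses a Win32 CRITICAL_SECTION around an assumed shared counter; the one Leave call is always reached, so no early return or jump can leave the lock held. The shared state and function names are illustrative.

#include <windows.h>

static CRITICAL_SECTION g_cs;     /* initialized elsewhere with InitializeCriticalSection(&g_cs) */
static int g_items_remaining;     /* illustrative shared state */

int take_item(void)
{
    int item = -1;                /* -1 means nothing was available */

    EnterCriticalSection(&g_cs);  /* single point of entry */
    if (g_items_remaining > 0) {
        g_items_remaining--;
        item = g_items_remaining; /* stand-in for the real shared-data update */
    }
    LeaveCriticalSection(&g_cs);  /* single point of exit, always executed */

    return item;
}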

Lock Contention: Critical regions ensure data integrity when multiple threads attempt to access shared resources. They also serialize the execution of code within the critical section. Threads should spend as little time inside a critical region as possible to reduce the amount of time other threads sit idle waiting to acquire the lock. In other words, it is best to keep critical regions small. On the other hand, using a multitude of small, separate critical regions can quickly introduce system overhead, from acquiring and releasing each separate lock, to such a degree that the performance advantage of multithreading is negated. In the latter case, one larger critical region could be best. Programmers using locks need to balance the size of critical regions against the overhead of acquiring and releasing locks.

Threading APIs: Application programmers sometimes write hand-coded synchronization routines rather than using constructs provided by a threading API to reduce synchronization overhead or provide different functionality than existing constructs offer. Unfortunately, using hand-coded synchronization routines may have a negative impact on performance, performance tuning, or debugging of multithreaded applications. Using the routines provided by a threading API, such as omp_set_lock/omp_unset_lock or critical/end critical directives for OpenMP, pthread_mutex_lock/pthread_mutex_unlock for Pthreads, and EnterCriticalSection/LeaveCriticalSection or WaitForSingleObject or WaitForMultipleObjects and ReleaseMutex for the Win32 API can be helpful. It’s important to study the threading API synchronization routines and constructs to find one that is appropriate for a given situation, as it is sometimes the case that more than one method could be used within the same application.

Non-Blocking Locks: Most threading implementations, including the Win32 and POSIX threads APIs, provide both blocking and non-blocking thread synchronization primitives. Often, the blocking primitives are used as a default. When the lock attempt is successful, the thread gains control of the lock and executes the code in the critical region. In the case of an unsuccessful attempt, however, a context-switch occurs and the thread is placed in a queue of waiting threads. Context-switch overheads can be considerable, especially if the thread implementation is based on kernel threads. If useful work is available that does not rely on execution of the critical region code, non-blocking system calls can sometimes alleviate the performance penalties by allowing the application to postpone the wait on the lock, execute other work, and attempt to gain access to the critical region at a later time. The Win32 API provides blocking and non-blocking critical sections as does Pthreads. It’s also possible to specify timeouts for Win32 synchronization primitives.
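
A sketch of that pattern using Pthreads is shown below; do_other_work() and update_shared_data() are placeholders for application-specific code. pthread_mutex_trylock returns immediately instead of blocking, so the thread can stay productive and come back to the lock later.

#include <pthread.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void do_other_work(void)      { /* work that does not need the lock */ }
static void update_shared_data(void) { /* the critical-region work */ }

void worker(void)
{
    int done = 0;

    while (!done) {
        if (pthread_mutex_trylock(&mutex) == 0) {   /* 0 means the lock was acquired */
            update_shared_data();
            pthread_mutex_unlock(&mutex);
            done = 1;
        } else {
            do_other_work();          /* lock busy: postpone the wait and do useful work */
        }
    }
}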

These guidelines are really just the tip of the iceberg in terms of implementing locks. Other concerns also come into play, such as the fact that certain issues need to be addressed slightly differently depending on whether you are developing for a true SMP system or a system that supports Hyper-Threading Technology. However, the real benefit of locks is that they are tried and true. An abundance of documentation, from Intel and other sources, is available to developers who want to use them as a means of synchronizing multithreaded applications. In addition, they are supported by tools like the Intel® Thread Checker.




The Lock-Free Approach

As with any subject that has to do with programming, there are always those who choose to take their own path. Sometimes that path leads to a dead end, and sometimes it leads to an approach that revolutionizes or at least offers different options to developers. In the case of multithreading, there are a number of vocal proponents of a “lock-free” method of programming multithreaded applications. The idea is that locks can cause situations in which a high-priority task has to wait for a low-priority task to release a lock. Lock-free applications attempt to manage synchronization and access to shared resources, typically in conjunction with real-time process schedulers, without the downfall of tasks failing to receive the appropriate priority; this can lead to higher performance in certain types of applications.

Lock-free algorithms also frequently require the use of atomic assembly language instructions such as “compare and swap” (CAS) operations. CAS operations compare the contents of a memory location to a given value and, if they are the same, modify the contents of that memory location to a given new value. They are intended to take the place of locks. For example1:


Boolean CAS (addr: pointer to memory; old: expected value; new: new value)
    x = read (addr)
    if (x == old)
        write (addr, new)
        return TRUE
    else
        return FALSE
    endif
end CAS


LIFO (last in, first out) stacks are an example of a data structure that can be made thread-safe using lock-free algorithms for the two operations of push and pop. Consider the implementation of the stack as a linked list with a head pointer referencing the top node on the stack and each node being a structure with a data portion and a pointer to the next node in the stack. Figure 1 gives a pictorial representation with some nodes already pushed onto the stack.

Figure 1: An LIFO stack with nodes pushed onto the stack

To implement the push of a new node onto the stack, an “infinite” loop is executed. The body of the loop assigns the pointer within the new node the address of the node at the top of the stack (copied from the head pointer). A CAS operation compares the address held by the head pointer and the new node’s next pointer. If the comparison is successful, the head pointer is updated to hold the address of the new node being pushed onto the stack and the loop is exited. If the compare is not successful, some other thread has updated the head pointer between the assignment of the new node’s next pointer and the CAS operation. The push routine will loop back, reset the pointer within the node and compare with the head pointer until the CAS operation is successful.

The pop operation is similar to the push. After ensuring the stack is not empty, the address held in the head pointer is copied to a local variable, top, and the address of the second node in the stack (copied from the pointer of the head node), is copied to another local variable, second. The CAS operation performs a comparison to see if the address of the head node and the address held in top are the same. If not, the head pointer has been changed and the pop routine loops to get new addresses to compare. If the compare is successful, the head pointer is updated with the address in second and the pop operation returns the top address as the pointer to the node that was just removed from the stack.
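
The push and pop loops just described translate fairly directly into C. The sketch below uses GCC’s __sync_bool_compare_and_swap builtin as the CAS operation; note that, as discussed next, the pop shown here is still exposed to the ABA problem unless a version counter or similar safeguard is added.

#include <stddef.h>

struct node {
    void        *data;
    struct node *next;
};

static struct node *head;   /* head pointer referencing the top node on the stack */

void push(struct node *n)
{
    struct node *old_top;
    do {
        old_top = head;
        n->next = old_top;                /* point the new node at the current top */
    } while (!__sync_bool_compare_and_swap(&head, old_top, n));
}

struct node *pop(void)
{
    struct node *top, *second;
    do {
        top = head;
        if (top == NULL)
            return NULL;                  /* the stack is empty */
        second = top->next;               /* address of the second node */
    } while (!__sync_bool_compare_and_swap(&head, top, second));
    return top;                           /* node just removed from the stack */
}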

One drawback to using CAS is that it can fall prey to the "ABA problem." The ABA problem manifests when the contents of addr still match the expected (old) value at the time of the CAS comparison, even though they have been changed and changed back since the values were initially read. For example, consider a thread executing the pop operation described above. The current stack pointer values are stored in top and second, but the thread is blocked before the CAS operation. Other threads may then push and pop elements on the stack in such a way that the same memory address is used for the head node when the first thread is allowed to proceed. In this case, CAS will return TRUE (since the head pointer and top are equal), but the wrong value of second could be stored back into the stack's head pointer. Solutions to the ABA problem usually expand the CAS to compare and swap two values, using the second value as a "version number" to double-check that no changes have been made to the values. See ABA Prevention using Single-Word Instructions* for other methods of solving the ABA problem.

A number of other examples of lock-free programming are also available online, including Chris Thomasson’s extensive list of algorithms supporting the idea at AppCore: A Portable High-Performance Thread Synchronization Library*.

So far, proponents of the lock-free approach seem to be concentrated in the area of computer multimedia applications, where concurrent access from separate threads could be beneficial, and in managing large databases, where large numbers of users need to access servers at the same time. Other potential applications for a lock-free approach that should be mentioned are automotive system control and video-conferencing systems.

The cited benefits of a lock-free approach sound promising: increased fault tolerance, better performance, and better scalability. The unfortunate part about the method is that most of the work being done in the area is experimental and/or theoretical. It’s in the realm of university studies and those who have the time and energy to spend trying to pioneer a better and faster method of making use of multiprocessor systems and perhaps even Hyper-Threading Technology.

1- Dominique Fober, Yann Orlarey, and Stephane Letz. "Lock-Free Techniques for Concurrent Access to Shared Objects."




Hybrid Multithreading Methods
In addition to completely “lock-free” programming, another method is often used that still falls logically into this category because it uses locks sparingly: double-checked locking (DCL). In other words, it is a hybrid of locking and non-locking approaches. When initializing shared, read-only data, it can be tempting to let multiple threads perform the entire initialization asynchronously. The initialization will be correct provided the threads are all writing the same values to the global data. However, asynchronous initialization could incur a serious performance penalty as multiple threads invalidate each other’s cache lines. In addition, the overhead of using threads for small data sets could mean performance degradation, and coordinating threads to perform the operation may increase the code complexity far beyond the simplicity of the task being done. For small tasks that are only executed once (initialization, file opening/closing, dynamic memory allocation), it is often possible to use a single thread. To ensure that only one thread performs the task, a DCL scheme can be employed.

In DCL, if-tests are used to avoid locking after the first thread has executed the task. Instead of using a lock object, it is possible to write a DCL that employs CAS for a lock-free implementation. DCL can also be useful when threads need to repeatedly check whether an operation has been completed. The following pseudo-code illustrates how if-tests are used in DCL:


Boolean initialized = FALSE

function InitOnce
    if not initialized
        acquire lock
        if not initialized        // Double-check!
            perform initialization
            initialized = TRUE
        release lock


Here, multiple threads can evaluate the first if-test as TRUE. However, only the first thread to acquire the lock may perform the initialization and set the Boolean variable to TRUE. When the lock is released, subsequent threads re-check the Boolean variable. Failure to double-check the Boolean control variable can result in re-initialization, possibly with different data, which could lead to unexpected results. In addition, threads that call the function after initialization has occurred will not seek to acquire the lock since the first if-test will evaluate to FALSE. No thread can return unless the initialization is complete (an implicit barrier without the overhead of an explicit barrier).

The astute reader will notice a data race exists for the Boolean variable. Specifically, a thread can read its value while another thread is modifying its value. This data race is benign because only the thread holding the lock can modify the variable. Any thread reading the Boolean while another thread writes won't defeat the intent of the algorithm due to the second check after acquiring the lock. However, the Intel® Thread Checker will still report storage conflicts on the Boolean variable.

More critical than the potential for a data race on the Boolean variable is the possibility of instructions being reordered by the compiler. Consider the two innermost lines of the above DCL: the initialization and the setting of the Boolean control variable. If the setting of the Boolean variable is done before some or all of the initializations, the initialization thread may block without having completed the full set. Another thread calling the DCL would find the Boolean variable set to TRUE, skip around the lock acquisition, and begin using data objects that have not been properly initialized. Several solutions to this potential problem have been put forth, but discussion of these is beyond the scope of this paper. See the recent Dr. Dobb's Journal article2 for more information and solutions.

2- Scott Meyers and Andrei Alexandrescu, “C++ and the Perils of Double-Checked Locking” (Parts I and II), Dr. Dobb’s Journal, July 2004 and August 2004.




To Lock or Not to Lock

While a lock-free approach may be beneficial in many situations, the problem with it is that most programmers are working in a production environment and, without the availability of tools, libraries, and documentation, those who use it are pretty much on their own. In the future, as more research is done and better support is available for the approach, it may become a viable and beneficial solution for certain applications, especially as more desktop machines have multithreading capabilities via Hyper-Threading technology.

For now, however, most professional programmers will probably stick with using locks for synchronization purposes, whether benefits could be reaped by a lock-free approach or not. In addition to a plethora of documentation on the subject, developers targeting the Intel platform also have access to tools like the Intel® VTune™ Performance Analyzer, Intel® C/C++ Compilers (which enable threading by supporting both OpenMP and auto-parallelization), the Intel® Thread Checker and Intel® Thread Profiler, and the parallel libraries that are part of the threading tools suite in the Intel® Math Kernel Library and Intel® Integrated Performance Primitives. That’s quite an arsenal for shaving production time, compared with attempting to gather disparate knowledge from those who have been working with a lock-free approach. The lock-free approach is viable; it’s just not ready for prime time. When it is, and support and documentation are in place, it’s sure to prove beneficial to some developers. Otherwise, a lock-free approach wouldn’t be creating such a buzz in the programming community at large and in the Threading on Intel® Parallel Architectures community specifically. When developers communicate with each other, anything’s possible.




References, Related Links and Articles

  • Developing Multithreaded Applications: A Platform Consistent Approach
  • Helping Developers Succeed with the Pentium® 4 Processor
