Multiple Approaches to Multithreaded Applications


Introduction

The different means of building multithreaded applications have long been a source of debate. Some developers hold that the traditional method of using locks is best, while others advocate avoiding locks entirely. Each camp has its points; what follows covers the basics of both methods, along with an attempt to outline their respective pros and cons.

Multithreaded programming, until recently, provided the greatest benefits to those building applications in the world of servers, where execution on multiple processors with access to shared memory was the norm. While programmers of server applications are still building programs that distribute the processing burden across more than one processor, the advent of the Intel® Pentium® 4 processor with Hyper-Threading Technology (HTT) has brought multithreading to the desktop. HTT makes a single physical processor function as two logical processors, allowing developers of desktop software to take advantage of the virtues of multiprocessing. Whether you're building server applications or discovering the advantages of desktop multithreading, you'll find, as in all types of programming, different schools of thought on how best to reap the benefits of distributing and synchronizing threads while avoiding such pitfalls as deadlock. Here, we'll try to make some sense of the design of multithreading systems.


Synchronization: The Key to Multitasking Benefits

Threading is nothing new to most programmers. Even those targeting single-processor, non-HTT environments may use threads as a matter of convenience in programming and debugging, sometimes spawning separate threads to handle specific tasks. On multiprocessor or HTT systems, however, programs can be designed to run optimally by performing a number of tasks simultaneously. To take advantage of this capability, the threads must be synchronized appropriately, so that each does its work at the right time.

To avoid race conditions during the execution of a threaded application, mutual exclusion is required so that only one thread at a time can access and change the state of a shared resource, which may be a data structure or a variable in the address space. In multithreaded applications, while a thread is executing a code section that accesses a shared resource (a critical region), other threads waiting to access that resource are either spinning or waiting in a queue. Minimizing this synchronization overhead is critical to application performance.
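To make the hazard concrete, here is a minimal C sketch (ours, not from the article) of such a race: two threads increment a shared counter with no mutual exclusion, so their load-update-store sequences interleave and updates are lost. The Pthreads API and the name worker are illustrative choices.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                /* shared resource */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* load-update-store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Frequently prints less than 2000000: interleaved updates are lost. */
    printf("counter = %ld\n", counter);
    return 0;
}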

Typically, synchronization duties are handled with locks (semaphores being a generalized form), which are a means of limiting or controlling access to a resource when many threads are executing. Locks can be implemented in many ways and are the more traditional means of handling synchronization between threads. In addition to these conventional methodologies, some developers firmly believe that performance benefits can be gained by developing multithreaded applications that either don't use locks at all or use them very sparingly.


Locks: An Overview

A number of issues are associated with the use of locks. The Intel document Developing Multithreaded Applications: A Platform Consistent Approach is probably the most valuable resource available for those delving into the realm of multiprocessing, especially on Intel platforms; locks are examined in depth in Chapter 4, and the paper serves as a good guideline for synchronizing applications on multiprocessor and Hyper-Threading Technology systems. That said, the following should be considered simply an overview of factors to keep in mind when using locks.

Synchronization constructs serialize execution, yet very few multithreaded programs are entirely synchronization-free. Fortunately, it is possible to mitigate some of the system overhead associated with synchronization by choosing appropriate constructs. If the variable being updated is shared among threads, the load-update-store sequence must be atomic (that is, the sequence of instructions must not be preempted before completion). Controlling how long a waiting thread spins before relinquishing the CPU is especially important for small and high-contention critical regions, since spin-waits can significantly impact performance on SMP systems and on processors with Hyper-Threading Technology. Critical regions, whether built with Win32* CRITICAL_SECTION variables or mutex HANDLEs, should have only one point of entry and exit.
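As an illustration of the atomic-update requirement, the sketch below uses the C11 <stdatomic.h> interface, which postdates this article; take it as one portable rendering of the idea rather than the article's own method (on Win32, InterlockedIncrement serves the same purpose).

#include <stdatomic.h>

static atomic_long hits;    /* zero-initialized; shared among threads */

void record_hit(void)
{
    /* The whole load-update-store sequence executes atomically; no other
       thread can interleave with, or observe, a partial update. */
    atomic_fetch_add(&hits, 1);
}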

Jumping into a critical region defeats the synchronization, and jumping out of one without calling LeaveCriticalSection or ReleaseMutex will deadlock waiting threads, as will allowing a thread to terminate while it holds a CRITICAL_SECTION. Single entry and exit points also make for clearer code.
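A minimal Win32 sketch of this single-entry, single-exit discipline follows; the list structure and names are hypothetical.

#include <windows.h>

typedef struct Node { struct Node *next; int data; } Node;

CRITICAL_SECTION listLock;   /* call InitializeCriticalSection(&listLock) once at startup */

void list_push(Node **head, Node *n)
{
    EnterCriticalSection(&listLock);    /* the only way in */
    n->next = *head;
    *head = n;
    LeaveCriticalSection(&listLock);    /* the only way out; no early return or
                                           jump may skip this call, or waiting
                                           threads will deadlock */
}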

Lock Contention: Critical regions ensure data integrity when multiple threads attempt to access shared resources, but they also serialize the execution of the code within them. Threads should spend as little time inside a critical region as possible, to reduce the time other threads sit idle waiting to acquire the lock; in other words, it is best to keep critical regions small. On the other hand, using a multitude of small, separate critical regions can introduce enough overhead, from acquiring and releasing each separate lock, that the performance advantage of multithreading is negated. In that case, one larger critical region could be best. Programmers using locks need to balance the size of critical regions against the overhead of acquiring and releasing locks.
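The following sketch (hypothetical names; Pthreads flavor, though the same structure applies to Win32 critical sections) shows one way to keep the critical region small: do the expensive work on private data first, and hold the lock only for the brief shared update.

#include <pthread.h>

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static double shared_total = 0.0;

extern double expensive_transform(double x);   /* hypothetical long computation */

void accumulate(double x)
{
    double local = expensive_transform(x);     /* no lock held during the work */

    pthread_mutex_lock(&total_lock);           /* critical region kept tiny... */
    shared_total += local;
    pthread_mutex_unlock(&total_lock);         /* ...so waiters idle only briefly */
}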

Threading APIs: Application programmers sometimes write hand-coded synchronization routines rather than using constructs provided by a threading API, in order to reduce synchronization overhead or provide functionality that existing constructs do not. Unfortunately, hand-coded synchronization routines may have a negative impact on the performance, performance tuning, or debugging of multithreaded applications. Using the routines provided by a threading API is generally preferable: for example, omp_set_lock/omp_unset_lock or the critical/end critical directives for OpenMP, pthread_mutex_lock/pthread_mutex_unlock for Pthreads, and EnterCriticalSection/LeaveCriticalSection or WaitForSingleObject/WaitForMultipleObjects with ReleaseMutex for the Win32 API. It's important to study the synchronization routines and constructs of a threading API to find the one appropriate to a given situation; more than one method may be useful within the same application.
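For instance, the same shared update can be expressed with either of the OpenMP constructs mentioned above. This C sketch assumes an OpenMP-enabled compiler and uses illustrative names.

#include <omp.h>

static int shared_count = 0;
static omp_lock_t count_lock;     /* call omp_init_lock(&count_lock) once first */

void bump_with_directive(void)
{
    #pragma omp critical
    shared_count++;               /* the runtime provides the mutual exclusion */
}

void bump_with_lock(void)
{
    omp_set_lock(&count_lock);
    shared_count++;
    omp_unset_lock(&count_lock);
}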

Non-Blocking Locks: Most threading implementations, including the Win32 and POSIX threads APIs, provide both blocking and non-blocking synchronization primitives, with the blocking primitives often used by default. When a lock attempt is successful, the thread gains control of the lock and executes the code in the critical region. In the case of an unsuccessful attempt, however, a context switch occurs and the thread is placed in a queue of waiting threads. Context-switch overhead can be considerable, especially if the thread implementation is based on kernel threads. If useful work is available that does not rely on execution of the critical-region code, non-blocking calls can alleviate the performance penalty by allowing the application to postpone the wait on the lock, execute other work, and attempt to gain access to the critical region at a later time. Both the Win32 API and Pthreads provide non-blocking lock attempts, and timeouts can be specified for Win32 synchronization primitives.
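A sketch of this pattern with the Pthreads non-blocking primitive is shown below (Win32 offers TryEnterCriticalSection for the same purpose); the helper functions are hypothetical.

#include <pthread.h>

extern pthread_mutex_t queue_lock;   /* hypothetical lock guarding a work queue */
extern void drain_queue(void);       /* the critical-region work */
extern int do_other_work(void);      /* hypothetical; performs a unit of other
                                        work, returns 0 when none was available */

void worker_loop(void)
{
    for (;;) {
        if (pthread_mutex_trylock(&queue_lock) == 0) {   /* 0 == acquired */
            drain_queue();
            pthread_mutex_unlock(&queue_lock);
            return;
        }
        /* Lock busy (EBUSY): postpone the wait and stay productive. */
        if (!do_other_work())
            break;
    }
    pthread_mutex_lock(&queue_lock);   /* no useful work left: block normally */
    drain_queue();
    pthread_mutex_unlock(&queue_lock);
}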

These guidelines are really just the tip of the iceberg. Other concerns come into play as well, such as the fact that certain issues must be addressed slightly differently depending on whether you are developing for a true SMP system or for a system that supports Hyper-Threading Technology. The real benefit of locks, however, is that they are tried and true: an abundance of documentation, from Intel and other sources, is available to developers who want to use them to synchronize multithreaded applications, and they are supported by tools like the Intel® Thread Checker.


The Lock-Free Approach

As with any programming subject, there are always those who choose to take their own path. Sometimes that path leads to a dead end, and sometimes it leads to an approach that revolutionizes the field, or at least offers developers different options. In the case of multithreading, a number of vocal proponents advocate a "lock-free" method of programming multithreaded applications. The motivation is that locks can cause priority inversion: a high-priority task may have to wait for a low-priority task to release a lock. Lock-free applications attempt to manage synchronization and access to shared resources, typically in conjunction with real-time process schedulers, without this risk of tasks being denied their appropriate priority, which could lead to higher performance in certain types of applications.

Lock-free algorithms also frequently require the use of atomic instructions such as compare-and-swap (CAS). A CAS operation compares the contents of a memory location to an expected value and, only if they are the same, modifies the contents of that location to a given new value. Such operations are intended to take the place of locks. For example [1]:

 

Boolean CAS (addr: pointer to memory; old: expected value; new: new value)
    x = read (addr)
    if (x == old)
        write (addr, new)
        return TRUE
    else
        return FALSE
    endif
end CAS
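The pseudocode maps directly onto the compare-and-swap support that modern compilers expose. The following C sketch uses the C11 <stdatomic.h> interface, which postdates this article, so treat it as one illustrative rendering rather than the authors' implementation.

#include <stdatomic.h>
#include <stdbool.h>

bool cas(_Atomic long *addr, long old, long new_value)
{
    /* Atomically: if (*addr == old) { *addr = new_value; return TRUE; }
       else return FALSE. On failure, atomic_compare_exchange_strong
       writes the current contents into 'old', a local copy that is
       simply discarded here. */
    return atomic_compare_exchange_strong(addr, &old, new_value);
}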




LIFO (last in, first out) stacks are an example of a data structure that can be made thread-safe using lock-free algorithms for its two operations, push and pop. Consider an implementation of the stack as a linked list, with a head pointer referencing the top node on the stack and each node being a structure with a data portion and a pointer to the next node in the stack. Figure 1 gives a pictorial representation with some nodes already pushed onto the stack.



Figure 1: A LIFO stack with nodes pushed onto the stack

To implement the push of a new node onto the stack, an “infinite” loop is executed. The body of the loop assigns the pointer within the new node the address of the node at the top of the stack (copied from the head pointer). A CAS operation compares the address held by the head pointer and the new node’s next pointer. If the comparison is successful, the head pointer is updated to hold the address of the new node being pushed onto the stack and the loop is exited. If the compare is not successful, some other thread has updated the head pointer between the assignment of the new node’s next pointer and the CAS operation. The push routine will loop back, reset the pointer within the node and compare with the head pointer until the CAS operation is successful.

The pop operation is similar to the push. After ensuring the stack is not empty, the address held in the head pointer is copied to a local variable, top, and the address of the second node in the stack (copied from the pointer of the head node), is copied to another local variable, second. The CAS operation performs a comparison to see if the address of the head node and the address held in top are the same. If not, the head pointer has been changed and the pop routine loops to get new addresses to compare. If the compare is successful, the head pointer is updated with the address in second and the pop operation returns the top address as the pointer to the node that was just removed from the stack.
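A C sketch of the push and pop loops just described appears below (C11 atomics; the names are ours). Note that this naive version is precisely the code that suffers from the ABA problem discussed next, and it ignores safe memory reclamation, so it is illustrative only.

#include <stdatomic.h>
#include <stddef.h>

typedef struct node {
    void *data;                 /* data portion */
    struct node *next;          /* next node down the stack */
} node_t;

static _Atomic(node_t *) head;  /* top-of-stack pointer */

void push(node_t *n)
{
    node_t *top = atomic_load(&head);
    do {
        n->next = top;          /* new node points at the current top */
        /* CAS succeeds only if head still equals top; on failure the
           current head is written into top and the loop retries. */
    } while (!atomic_compare_exchange_weak(&head, &top, n));
}

node_t *pop(void)
{
    node_t *top = atomic_load(&head);
    while (top != NULL) {                       /* empty stack: return NULL */
        node_t *second = top->next;
        if (atomic_compare_exchange_weak(&head, &top, second))
            break;                              /* head swung from top to second */
        /* CAS failed: top was refreshed with the new head; retry. */
    }
    return top;
}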

One drawback to using CAS is that it can fall prey to the "ABA problem." The ABA problem manifests when the value at addr matches the expected old value at the time of the CAS comparison, even though other threads have changed, and changed back, the contents since the parameter values were initially read. For example, consider a thread executing the pop operation described above. The current stack pointer values are stored in top and second, but the thread is blocked before the CAS operation. Other threads may then push and pop elements in such a way that the same memory address ends up being used for the head node by the time the first thread is allowed to proceed. In this case, CAS will return TRUE (since the head pointer and top are equal), but the wrong value of second could be stored back into the stack's head pointer. Solutions to the ABA problem usually expand the CAS to compare and swap two values, using the second as a "version number" to double-check that no changes have been made. See http://www.research.ibm.com/people/m/michael/RC23089.pdf* for other methods of solving the ABA problem.
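The version-number remedy can be sketched as follows: the head is widened to a (pointer, counter) pair and both are compared and swapped together, so a recycled address no longer matches a stale snapshot. Whether the compiler emits a true double-width CAS or falls back to a lock is platform-dependent, so this is an assumption-laden sketch, not a portable guarantee.

#include <stdatomic.h>
#include <stdint.h>

struct node { void *data; struct node *next; };   /* as in the sketch above */

typedef struct {
    struct node *ptr;    /* top of the stack */
    uintptr_t    ver;    /* incremented on every successful update */
} tagged_head_t;

static _Atomic tagged_head_t thead;

void push_tagged(struct node *n)
{
    tagged_head_t old = atomic_load(&thead);
    tagged_head_t new_head;
    do {
        n->next = old.ptr;
        new_head.ptr = n;
        new_head.ver = old.ver + 1;   /* the version change defeats ABA even
                                         if the same address is reused */
    } while (!atomic_compare_exchange_weak(&thead, &old, new_head));
}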

A number of other examples of lock-free programming are also available online, including Chris Thomasson’s extensive list of algorithms supporting the idea at http://www.nnseek.com/a/APK63395mfDp/Chris%20Thomasson/*.

So far, proponents of the lock-free approach have tended to come from computer multimedia applications, where concurrent access by separate threads can be beneficial, and from the management of large databases, where large numbers of users need to access servers at the same time. Automotive system control and video-conferencing systems are other potential applications for a lock-free approach.

The cited benefits of a lock-free approach sound promising: increased fault tolerance, better performance, and better scalability. The unfortunate part about the method is that most of the work being done in the area is experimental and/or theoretical. It’s in the realm of university studies and those who have the time and energy to spend trying to pioneer a better and faster method of making use of multiprocessor systems and perhaps even Hyper-Threading Technology.

[1] Dominique Fober, Yann Orlarey, and Stephane Letz, "Lock-Free Techniques for Concurrent Access to Shared Objects." /sites/default/files/m/f/5/c/L17_Fober.pdf


Hybrid Multithreading Methods

In addition to completely lock-free programming, another method falls logically into this category because it uses locks sparingly: double-checked locking (DCL), a hybrid of the locking and non-locking approaches. When initializing shared, read-only data, it can be tempting to let multiple threads perform the entire initialization asynchronously. The initialization will be correct provided the threads all write the same values to the global data. However, asynchronous initialization can incur a serious performance penalty as multiple threads invalidate each other's cache lines. In addition, the overhead of using threads on small data sets can degrade performance, and coordinating the threads may increase code complexity far beyond the simplicity of the task being done. For small tasks that are executed only once (initialization, file opening and closing, dynamic memory allocation), it is often better to use a single thread. To ensure that only one thread performs the task, a DCL scheme can be employed.

In DCL, if-tests are used to avoid locking after the first thread has executed the task. Instead of using a lock object, it is possible to write a DCL that employs CAS for a lock-free implementation. DCL can also be useful when threads need to repeatedly check whether an operation has been completed. The following pseudo-code illustrates how if-tests are used in DCL:

Boolean initialized = FALSE

function InitOnce
{
    if not initialized
    {
        acquire lock
        if not initialized    // Double-check!
        {
            perform initialization
            initialized = TRUE
        }
        release lock
    }
}

 

Here, multiple threads can evaluate the first if-test as TRUE. However, only the first thread to acquire the lock may perform the initialization and set the Boolean variable to TRUE. When the lock is released, subsequent threads re-check the Boolean variable. Failure to double-check the Boolean control variable can result in re-initialization, possibly with different data, which could lead to unexpected results. In addition, threads that call the function after initialization has occurred will not seek to acquire the lock since the first if-test will evaluate to FALSE. No thread can return unless the initialization is complete (an implicit barrier without the overhead of an explicit barrier).

The astute reader will notice a data race exists for the Boolean variable. Specifically, a thread can read its value while another thread is modifying its value. This data race is benign because only the thread holding the lock can modify the variable. Any thread reading the Boolean while another thread writes won't defeat the intent of the algorithm due to the second check after acquiring the lock. However, the Intel® Thread Checker will still report storage conflicts on the Boolean variable.

More critical than the potential data race on the Boolean variable is the possibility of instructions being reordered by the compiler. Consider the two innermost lines of the DCL above: the initialization and the setting of the Boolean control variable. If the Boolean variable is set before some or all of the initialization completes, the initializing thread may block without having finished the full set of writes. Another thread calling the DCL would find the Boolean variable set to TRUE, skip the lock acquisition, and begin using data objects that have not been properly initialized. Several solutions to this problem have been put forth, but discussing them is beyond the scope of this paper. See the recent Dr. Dobb's Journal article [2] for more information and solutions.
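One widely cited remedy is to give the control variable explicit release/acquire semantics, so that neither the compiler nor the processor can reorder the initialization writes past the setting of the flag. The C11 atomics used below postdate this article, so take the sketch as an illustration of the idea rather than one of the specific solutions alluded to above; perform_initialization is a hypothetical placeholder.

#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

static atomic_bool initialized;        /* false until the work is done */
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;

extern void perform_initialization(void);   /* hypothetical */

void init_once(void)
{
    /* Acquire pairs with the release below: a thread that sees 'true'
       also sees every write made by perform_initialization. */
    if (!atomic_load_explicit(&initialized, memory_order_acquire)) {
        pthread_mutex_lock(&init_lock);
        if (!atomic_load_explicit(&initialized, memory_order_relaxed)) {
            perform_initialization();
            /* Release: the initialization writes cannot be reordered
               after this store to the flag. */
            atomic_store_explicit(&initialized, true, memory_order_release);
        }
        pthread_mutex_unlock(&init_lock);
    }
}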

[2] Scott Meyers and Andrei Alexandrescu, "C++ and the Perils of Double-Checked Locking" (Parts I and II), Dr. Dobb's Journal, July 2004 and August 2004.


To Lock or Not to Lock

While a lock-free approach may be beneficial in many situations, the problem is that most programmers work in production environments, and without readily available tools, libraries, and documentation, those who adopt it are largely on their own. In the future, as more research is done and better support becomes available, it may become a viable and beneficial solution for certain applications, especially as more desktop machines gain multithreading capabilities via Hyper-Threading Technology.

For now, however, most professional programmers will probably stick with locks for synchronization, whether or not a lock-free approach could offer benefits. In addition to a plethora of documentation on the subject, developers targeting Intel platforms have access to tools like the Intel® VTune™ Performance Analyzer, the Intel® C/C++ Compilers (which enable threading by supporting both OpenMP and auto-parallelization), the Intel Thread Checker and Thread Profiler, and the parallel libraries in the Intel® Math Kernel Library and Intel® Integrated Performance Primitives. That is quite an arsenal for shaving production time, compared with trying to gather disparate knowledge from those who have been working with a lock-free approach. The lock-free approach is viable; it's just not ready for prime time. When it is, with support and documentation in place, it's sure to prove beneficial to some developers. Otherwise, it wouldn't be creating such a buzz in the programming community at large and on the Threading on Intel® Parallel Architectures Forum specifically. When developers communicate with each other, anything's possible.




About the Author

George Walsh is a veteran technical editor and writer with experience in fields ranging from embedded systems programming to CAD. As a freelance researcher and writer he has provided his expertise to clients in a wide variety of markets.

 

