Multi-Threading for Experts: Synchronization

Introduction

By Sergey N. Zheltov and Stanislav V. Bratanov

Anyone designing multithreaded applications will agree: the problem of efficient synchronization is among the most difficult tasks of parallel programming. Modern operating systems provide a multitude of synchronization objects, which determine a variety of synchronization methods and schemes.

This article addresses some of the issues of parallel program synchronization and tries to clarify at least a few of them.

Below, several synchronization schemes are discussed; synchronization object behavior is described; internal system implementation of user-visible synchronization objects or functions is also explained where applicable. Hidden timings and latencies are provided; execution thread layout is shown over time, and non-obvious thread time shifts are pointed out.

Readers will also find code examples illustrating important points of the synchronization problem.

All discussions of synchronization objects and algorithms appearing in this article pertain to Microsoft Windows* implementations of such objects and algorithms.

Definitions

Several abbreviations are used throughout the article to denote the following terms:

IPI stands for Inter-Processor Interrupt, an interruption signal sent by one processor to another.

APC, Asynchronous Procedure Call, which is a Microsoft Windows* notification scheme enabling the execution of a specified procedure within the context of a specified thread.


Internal Implementation

This article addresses wake-up and signaling schemes which are used to directly control the execution of threads. Schemes of mutual exclusion, which are mostly employed to synchronize resource usage (granting access to a resource to only one thread), are beyond the scope of the article.

Also, in this article we assume that all necessary resources have been already divided between threads; the task is to wake up the thread and load each of them with an appropriate amount of work.

As various mutual exclusion constructions are not within the scope of this article, only two sorts of synchronization objects typically used for signaling are considered: events and semaphores.


Thread Wake Algorithms


Useful Timings

Let’s prove or rebut the above considerations on the synchronization methods and their usability with the actual timing information. A sample OpenMP application built using different compilers (in order to reproduce at least two different synchronization models) and a manual thread control example were investigated. The results are shown below.



Figure 1: Parallel Program Layout

Now we’ll proceed with the timings.

APC Invocation Delay

Figure 2 pr ovides an example of an APC-based thread control. The main thread inserts three APC objects into the queues of each controlled thread. In the figure, the beginning of APC insertion and the start of the last thread are marked with dashed lines. The marked interval corresponds to 47 microseconds (as typed in the bottom right corner of the figure), which is the APC invocation delay.



Figure 2: APC Timings

Event Triggering Delay

Figure 3 presents similar information for the event-based synchronization method. The execution region ranging from triggering the first event to the activation of the last thread covers 33 microseconds, which is faster than the time the APC version takes to execute.



Figure 3: Event Timings

Semaphore Release Delay



Figure 4: Semaphore Timings

The above timing information proves our assumptions on the internal algorithms involved in synchronization. As it was predicted, the event- and semaphore-based methods yield similar performance, and the double-staged APC queuing introduces additional overhead compared with events and semaphores.


Parallel Executions of Threads

Consider Figure 5. We marked with yellow the regions of code where the actual computation is performed. You will notice that no part of the computational code is executed in parallel. These results are slightly disappointing, because there is no guarantee that a properly threaded application outperforms its sequential analog.





Figure 5: Regions of Parallel Execution

All things considered, to ensure that at least part of an application runs in parallel, the following condition should hold true: the Sequential Time of an application divided by the Number of Processors should be greater than 50 microseconds for the worst case (see Figure 2).

Now the question: is there a method to trigger multiple threads simultaneously?

A closer look at Figure 4 would reveal a discrepancy between the expected behavior of threads (as the type of a user-level synchronization object used would suggest) and the actual operation. Indeed, if three threads are waiting for the same synchronization object, one expects them to wake up synchronously as that object acquires a signaled state. But the figure looks exactly as if each thread was triggered independently.

This only confirms the above hypothetical algorithms of operating system kernel synchronization. As one can see, when releasing a semaphore, the operating system picks a new thread from a wait list (associated with the semaphore), selects an idle processor, and then signals to the processor (by means of inter-processor interrupt) to start executing the new thread. This procedure is repeated until the wait list is empty or the semaphore limit is reached. As the wait list is analyzed on one processor and other processors are invoked sequentially, the delay between threads becomes greater or equal to the time of wait list processing plus the time of sending and receiving an IPI.

Currently, although there is no means to enable synchronous thread start (at least in the operating systems considered in this article), it may be easily amended:


Conclusion

A variety of synchronization objects and schemes exist, rooted in the basic support of the operating system kernel. The actual efficiency of different user-level synchronization schemes may appear to be the same, as such schemes may serve as mere synonyms of one general kernel mode construction.

On the contrary, some synchronization schemes may support a very specific case, and their performance in other cases of synchronization may degrade.

That is why it may be useful, while writing a well-balanced multithreaded application, to keep in mind the above relationships between user- and kernel-based objects, and only apply synchronization schemes that have been specifically designed for a particular purpose for the user.

The article provided examples of both the non-standard usage of a synchronization scheme, and the usage of different schemes having similar performance due to the same underlying system kernel algorithm.

Non-evident timings of synchronization calls and time shifts between threads are also provided. Such timing information is extremely necessary for planning a well-threaded application, since not taking into account the thread switching overhead and thread scheduling order may adversely affect the scalability of the application, and even decrease its performance as compared with a sequential version.

A point that lacks operating system support was identified: relatively large time shifts between threads that were triggered by a common semaphore. The solution with broadcast IPIs was proposed.

In summary:

  • Use synchronization objects and schemes specifically designed by operating system manufacturers to suit your task (it is very likely that such schemes are already optimized for solving your type of synchronization problems),
  • Avoid redundant API calls (in a synchronization loop),
  • Pay attention to the amount of work you are going to load your threads with (it may not be worth it, considering the overhead),
  • Try to compensate for the time shifts by delegating less work to lagging threads (especially for short threads).

 


References

 


About the Authors

Sergey Zheltov

Sergey Zheltov is a project manager in Microprocessor Research, Intel Labs. His research interests include media compression and processing, software and platforms architecture, signal processing, high-order spectra. He received a Diploma in radio-physical engineering and MS degree in theoretical and mathematical physics from Nizhny Novgorod State University.

Stanislav Bratanov

Stanislav Bratanov is a software engineer in Microprocessor Research, Intel Labs. His research interests include multi-processor software platforms, operating system environments, and platform-dependent media data coding. He graduated from Nizhny Novgorod State University, Russia.


Download the PDF (810KB)


Download the code sample


For more complete information about compiler optimizations, see our Optimization Notice.