Is the Free Lunch Really Over? Scalability in Manycore Systems Part 2: Using Locks Efficiently




Introduction

Review the first installment in this series: Is the Free Lunch Really Over? Scalability in Manycore Systems, Part 1.

On today's multi-core, shared-memory computing platforms, applications are typically rendered parallel with one threading model or another - extensions, such as Pthreads or OpenMP, to sequential languages. While this situation is less than ideal (see, for example, the analysis by Lee, 2006), it is current industry practice. The shared-state paradigm of typical threading models, in which all threads share the same memory with equal access to its data, requires careful protocol to prevent unintended consequences. The key difficulty is the race condition, where threads access data in an unanticipated sequence, with unintended results. Race conditions are the worst kind of bug: insidious, and very difficult to detect, reproduce, or diagnose. Tools are essential to mitigating these difficulties [shameless plug: the Intel® Thread Checker is quite good at this]. Another category of irritants is deadlocks and livelocks, though these are normally easier to diagnose and correct because execution visibly halts. These correctness challenges are widely discussed in threading books and articles, and will not be addressed here. We focus, instead, on a necessary component of threaded programming - locks - and their potential impact on performance.

Even when parallel programs have been scrubbed to the point of reasonable confidence in their correctness, there remain potential unintended consequences in the form of performance degradation. Applying a lock does, after all, "re-serialize" that portion of execution; the goal, then, is to use only as much locking as is necessary, but no more.

This scalability series assumes the code is correct, and looks at factors which may inhibit the scalability of the correctly-running application. In this episode, we look at the efficient use of locks, and in particular at two guiding principles: lock the data, not the code, and use the right lock for the job.


Factors impacting scalability: using locks efficiently

Profile view: when locks intrude

Good lock management is essential to scalable parallel performance. Figure 2-1 illustrates the opposite: a profile trace of poor lock management. Each bar represents a timeline of execution for a particular thread: green areas indicate busy time, executing presumably useful work; idle, as the term suggests, is time spent waiting for other threads; and in critical indicates time spent inside locked regions of execution. (This example is from an actual profile, illustrated using the Intel® Thread Profiler.) In the profile shown, once a critical region is reached by one thread, all others are rendered idle; overall execution in those sections has been serialized. This particular problem often occurs when the work inside and outside the protected region (e.g., a critical section) is very small, so threads "pile up" on the lock. Such design limitations may be avoided by following the guidelines discussed below.

Figure 2-1. Thread performance degradation from inefficient locking mechanisms.

Lock data, not code

Lock only what needs to be accessed serially. What usually needs this kind of protection is data; whenever possible, lock data, not code. Programming adepts have been repeating this for years, and it bears repeating again and again (see, for example, Russell's somewhat salty and entertaining Unreliable Guide to Locking, published in 2000; more recently, Sutter nicely alludes to this technique, without actually repeating the phrase, in a 2008 article on concurrency-friendly data structures).

In our own courses at the Intel Academic Community, we teach this concept right up front, using a linked-list example. Consider the creation of a hash table (a collection of linked lists), shown schematically in Figure 2-2. The lists may be operated on in parallel, but updates to any particular list must be protected (locked) to prevent a race condition. One very simple way to accomplish this is with a critical section, shown (with OpenMP syntax) in Figure 2-3.

Figure 2-2. Schematic representation of hash table creation.

Figure 2-3. Code segment showing inefficient use of Critical Section (don't do this).

 

#pragma omp parallel for private (index)
for (i = 0; i < elements; i++) {
    index = hash(element[i]);
    #pragma omp critical
    insert_element (element[i], index);
}

 

In this example, we have locked the code: we prevent two threads from trying to update the same list simultaneously by putting the call to insert_element inside a critical section. Unfortunately, this effectively serializes the loop, because inserting an element into the hash table is the principal activity of the loop. This approach will yield little, if any, speed improvement from parallelism.

A much better approach is to lock the data. As with most threading models, OpenMP gives us the ability to lock individual chains of the hash table. We associate a lock variable with each index in the hash table. When we are ready to insert an element into a particular chain, we set the lock associated with that chain. The advantage of this approach is that a group of threads may all be inserting elements into the hash table in parallel, as long as those elements hash to different table indices.

To use this approach, a bit of housekeeping is in order: the lock variables must be declared as type omp_lock_t and initialized with the function omp_init_lock. Figure 2-4a shows these steps for the hash table case, where we have declared an array of lock variables, hash_lock.

Figure 2-4a. Code segments showing housekeeping steps to use omp_set_lock.

 

/* hash_lock declared as an array of type omp_lock_t */
omp_lock_t hash_lock[HASH_TABLE_SIZE];

/* locks initialized in function 'main' */
for (i = 0; i < HASH_TABLE_SIZE; i++)
    omp_init_lock(&hash_lock[i]);

 

Figure 2-4b. Code segment showing efficient locking with omp_set_lock (do this).

 

void insert_element (ELEMENT e, int i)
{
    omp_set_lock (&hash_lock[i]);
    /* Code to insert element e */
    omp_unset_lock (&hash_lock[i]);
}
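
With the lock moved inside insert_element, the driving loop no longer needs a critical section at all, and once the table is built the locks can be released. Below is a minimal sketch of these two steps, reusing the names from Figures 2-3 and 2-4a; the cleanup loop is not shown in the original figures and is included here as an assumption of good practice.

/* Driving loop: no critical section needed; each thread locks only
   the chain (table index) it is currently updating */
#pragma omp parallel for private (index)
for (i = 0; i < elements; i++) {
    index = hash(element[i]);
    insert_element (element[i], index);
}

/* After the parallel loop completes, release the lock resources */
for (i = 0; i < HASH_TABLE_SIZE; i++)
    omp_destroy_lock(&hash_lock[i]);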

 

Use the right locking mechanism for the job

Using the right locking mechanism nearly always means using the lightest mechanism that will do the job. Threading models supply a hierarchy of locks; in the Win32 API, for example, the choices of synchronization primitives are as follows, in increasing order of execution cost (a short sketch contrasting the two lightest options follows the list):

  • Atomic increments/decrements
    • InterlockedIncrement
  • Critical Section, Critical Section with spin count
    • EnterCriticalSection, LeaveCriticalSection, SetCriticalSectionSpinCount
    • Works within a single process
  • Events
    • Signal that a condition has been changed or satisfied
  • Mutex
    • Works both within and across processes (since it's a kernel object)
  • Semaphore
    • Also works across processes (also a kernel object)
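
As a rough sketch of the light end of this hierarchy, consider protecting a simple shared counter; the names below (counter, counter_cs, and the two functions) are illustrative only, not from the article.

#include <windows.h>

static volatile LONG counter = 0;       /* shared counter */
static CRITICAL_SECTION counter_cs;     /* call InitializeCriticalSection(&counter_cs) once at startup */

void bump_with_interlocked(void)
{
    /* A single atomic operation: the cheapest mechanism on the list */
    InterlockedIncrement(&counter);
}

void bump_with_critical_section(void)
{
    /* Correct, but pays the enter/leave overhead for a one-line update */
    EnterCriticalSection(&counter_cs);
    counter++;
    LeaveCriticalSection(&counter_cs);
}

For a lone counter, the interlocked form is the natural choice; the critical section earns its keep only when several operations must appear atomic together.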

 

The "spin count" variant of critical section can work well in highly-contended situations: the critical section is initialized with a count; whenever a thread attempts to access that section, rather than sleeping, it remains active for a time period determined by the spin count, and then tests again. This avoids the overhead of blocking, then reawakening, the thread which is waiting for a lock to be released.

Measurements of the overhead of each locking mechanism (the actual lock times, normalized to the smallest time, that of InterlockedIncrement) are compared in Figure 2-5. The experiment was arranged to avoid contention (thus the critical section shows the same performance with and without a spin count). Notice that the mutex and semaphore locks are dramatically more expensive, more than 50 times slower than the others, going right off the scale of the graph. To repeat, in terms of cost:

Mutex or semaphore >> critical section > InterlockedIncrement

Figure 2-5. Relative cost for different locking mechanisms, scaled to InterlockedIncrement.

For those who'd like to try this at home: the complete lab which generated the results of Figure 2-5, including source code, is available for college/university classrooms through Intel's Academic Community.

Besides choosing the right level of lock - that is, the least expensive lock which will do the job - it is important to apply the lock in the least intrusive way possible. In general, global (static) locks are to be avoided. A simple case study illustrates this point. A client code was found to show negative scaling at four threads or more - that is, the application ran slower with four threads, on four cores, than it did with a single thread. The root cause was a class which defined a critical section as a static member variable. The solution: remove the static declaration so that each instance of the class uses its own lock. Once this was done, application performance did increase with increasing core/thread count. Figure 2-6 shows the concept in schematic form, before and after this change.

Figure 2-6. Schematic comparison of static (Before) to private locks (After).
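
In sketch form, the change looked roughly like this; the class and member names (WorkerBefore, WorkerAfter, s_lock, m_lock) are hypothetical, reconstructed from the description above, since the client code itself is not shown.

#include <windows.h>

// Before: one static lock shared by every instance - all objects contend for it
class WorkerBefore {
    static CRITICAL_SECTION s_lock;   // single, effectively global lock
    // ... members protected by s_lock; the static member also needs a
    //     definition and a one-time InitializeCriticalSection call ...
};

// After: each instance owns its lock - unrelated objects no longer contend
class WorkerAfter {
    CRITICAL_SECTION m_lock;          // per-instance lock
public:
    WorkerAfter()  { InitializeCriticalSection(&m_lock); }
    ~WorkerAfter() { DeleteCriticalSection(&m_lock); }
};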

To conclude: efficient lock usage has a dramatic impact on scalability. Lock data, not code. Use the least expensive locking mechanism possible. Beware of static locks.


References

Lee, E. A., "The Problem with Threads," IEEE Computer, May 2006.

Russell, R., "Unreliable Guide to Locking," Linux kernel documentation, 2000.

Sutter, H., "Choose Concurrency-Friendly Data Structures," Dr. Dobb's Journal, June 2008.


Acknowledgments

This material is derived from a longer session by the Intel Academic Community.


About the Author

Michael Wrinn is a senior course architect in the Intel Academic Community, where he collaborates with universities to bring parallel computing to the mainstream of undergraduate education. Prior assignments include managing Intel's software engineering lab in Shanghai, and directing the human interface technology research lab. He was Intel's representative to the committee which produced the first OpenMP specification, and remains active in the parallel computing community. Before joining Intel, Michael worked at Accelrys (San Diego), implementing commercial and research simulation codes on a wide variety of parallel/HPC systems. He holds a Ph.D. (in quantum mechanics) and a B.Sc. (mathematics/chemistry/physics) from McGill University.

 



Comments
Dny:

Hello All,

I'm new to this area of parallel programming. Can you please answer the following query?

As we know, OpenMP can't be used when the loop count comes from STL iterators, so we need to use explicit threading to scale the application. What are the different ways to enable the free lunch using explicit threading, and what are the hurdles and the solutions to them?

I apologize if it's a stupid question.

Thanking You,

Regards,
Digambar.

Dmitry Vyukov:

Anthony, I totally agree. Distribution of data is the best possible solution. I just wanted to point out that it's possible to create centralized data structures in a scalable, low-overhead way (if it's impossible to distribute the data).

Dmitry Vyukov:

2 John Engelhart: btw, I see there are warnings like
"This library/framework is largely untested. It deals with multithreading issues that are notoriously hard to debug and get right. Not only that, it attempts to do so in a lock-free manner."

You can use Relacy Race Detector to verify the implementation:
http://groups.google.ru/group/relacy
With Relacy you can find nearly 100% of bugs, verification is extremely easy, and it can detect all kinds of bugs typical for such things as multi-threaded hash maps (even bugs related to relaxed memory models).
If the source code is under BSD, then usage of Relacy is completely free.

Dmitry Vyukov:

2 John Engelhart: Are there any benchmarks on SMP/multicore machines? I found benchmarks only on a single-core G4... Hmmm... IMHO it makes little sense to benchmark synchronization primitives on a single core, because the main thing is scalability; I have observed up to 100-1000x performance degradation when moving from a single core to a quad core.
And if the goal is "single-threaded" performance, then as I see it a single operation on the NSInteger hash map takes ~630 cycles... hmmm... I think a plain spin-lock based hash map would be significantly faster...

anonymous:

Take a look at TransactionKit (http://transactionkit.sourceforge.net/), an open source (BSD license) project of mine.

It's a lockless, concurrent multi-reader, multi-writer, transaction-capable hash table. Since it uses transactions, it supports deterministic enumeration of the hash table contents while still allowing other threads to continue mutating the table. All table operations are lockless; atomic CAS / LL-SC operations are used to insert and remove items from the table.

anthony_williams:

Of course you can write lock-free algorithms that don't have independent data, but an independent set of data per thread is definitely the easiest to handle.

Dmitry Vyukov:

"No locking" doesn't necessarily mean "independent set of data per thread".
Non-mutating operations can access data w/o locks in some conditions.
Even mutating operations on shared data can be made "no locking". Think of single-producer/single-consumer queue, or per-thread statistics.

Dmitry Vyukov:

And why not say it explicitly?

anthony_williams:

Of course the ultimate in "minimal locking" is "no locking" - use a completely independent set of data per thread.

anthony_williams:

FYI: The last couple of figures in this article are wrong: fig 2.5 is a duplicate of fig 2.4, and fig 2.6 is what should be fig 2.5. The real fig 2.6 appears to be missing.


