Fun with Locks and Waits - Performance Tuning

Threaded software often requires critical sections, mutexes, or other locks. Do developers always know which of these objects has the biggest impact on their code? If I want to examine my software to minimize that impact, or to restructure data so some of these synchronization objects can be eliminated and performance improved, how do I know where changes will yield the biggest improvement? Intel Parallel Amplifier can help me determine this.

Intel Parallel Amplifier provides three basic analysis types: hotspots (with call stack data), concurrency, and locks and waits. The locks and waits analysis highlights which synchronization objects block threads the longest. It is common for software to have too many or too few synchronization points. Too few synchronization points lead to race conditions and indeterminate results (if you have this problem you need Intel Parallel Inspector, not Intel Parallel Amplifier; see this MSDN web seminar for more on Parallel Inspector: Got Memory Leaks?). If you have too many synchronization objects, you want to know which ones, if removed, would improve performance the most. Even if all the synchronization objects you have are necessary, you might still want to know how much they impact performance; this can help you decide whether it is worth refactoring the software so a synchronization object can be removed. The Locks and Waits analysis in Parallel Amplifier tracks how much time threads spend waiting on each synch object and reports the times in an ordered table: the synch objects that cause the most waiting are listed at the top, the rest in declining order.

Let’s look at a simple example. One of the computational exercises commonly used to teach parallel programming is a simple program to calculate pi. The main process thread creates NTHREADS threads, sets them off to execute a routine called calcpi, and then waits for them to complete. The basic algorithm in calcpi is shown below:

47       int my_begin = myid * (ITER/NTHREADS) ;
48       int my_end = (myid +1) * (ITER/NTHREADS) ;
49       step = 1.0 / (double) ITER ;
50       for (int i = my_begin ; i < my_end ; ++i)
51       {
52           x = (i+0.5) * step ;
53           EnterCriticalSection(&protect);
54           pi = pi + 4.0/(1.0 + x*x) * step ;
55           LeaveCriticalSection(&protect) ;
56       }

Each thread takes a section of the for loop to operate on. Pi is a global variable, so when each thread updates pi, a critical section protects the access from any data races. After creating NTHREADS threads and setting them to work in calcpi, the main thread waits for all the threads to complete; a sketch of that create-and-wait pattern is shown below. When I run this through the Locks and Waits analysis of Intel Parallel Amplifier, I see three synch objects appear in the analysis section, as shown in the screenshot below. I can expand each of these three synch objects and drill down to the source code. The top synch object, where a thread waits the most time, is the main thread waiting for all the threads it created to complete. All this thread does is create threads and wait, so there are no intermediate steps to tune, and we certainly don’t want to exit the program before the calculations are done. Let’s keep this synchronization point and look down the list for the next opportunity.
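For reference, the article does not show the code in main, but the create-and-wait pattern looks roughly like the sketch below (the thread-handle array, the forward declaration of calcpi, and the final printf are my own illustration; NTHREADS, calcpi, pi, and protect are the names used above, and the calcpi signature is assumed). The WaitForMultipleObjects call is the synch object that appears at the top of the table.

    #include <windows.h>
    #include <stdio.h>

    #define NTHREADS 2                       // the runs described here used 2 threads

    CRITICAL_SECTION protect ;
    double pi = 0.0 ;

    DWORD WINAPI calcpi(LPVOID arg) ;        // worker routine shown above (signature assumed)

    int main()
    {
        HANDLE threads[NTHREADS] ;
        InitializeCriticalSection(&protect) ;

        // Launch the workers; each one runs calcpi on its own slice of the loop.
        for (int i = 0 ; i < NTHREADS ; ++i)
            threads[i] = CreateThread(NULL, 0, calcpi, (LPVOID)(INT_PTR)i, 0, NULL) ;

        // The main thread does nothing but block here until every worker finishes;
        // this wait is the synch object at the top of the Locks and Waits table.
        WaitForMultipleObjects(NTHREADS, threads, TRUE, INFINITE) ;

        for (int i = 0 ; i < NTHREADS ; ++i)
            CloseHandle(threads[i]) ;
        DeleteCriticalSection(&protect) ;

        printf("pi = %f\n", pi) ;            // a printf like this is what the Stream entry maps to later
        return 0 ;
    }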

The second item in the table highlights a critical section; this second synch object showing significant wait time is the critical section in calcpi. I can double-click to drill down to source and verify that the critical section causing so much delay is the one shown in the code segment above. In the screenshot you can also see this was line 53 in my code (I rearranged the display so the creation-line column is next to the wait time; you may rearrange column ordering as you like by dragging each column to the desired position). So let’s look at how I can minimize the number of entrances into the critical section. The contention is around access to the global variable pi, but there is no reason I must accumulate each contribution in the global variable. I can create a partial sum within each thread and combine these partial sums for the final result. The first thing I do is create a local copy of pi for each thread, which I call my_pi. Each thread calculates my_pi, and when the for loop is complete, each thread adds its portion of the calculation into the global variable pi. For the data collected here, I used a dual-core system and 1,200,000 iterations with NTHREADS set to 2. So the first algorithm entered the critical section 1,200,000 times; the algorithm below enters the critical section only twice.

50         double my_pi = 0.0 ;
51         for (int i = my_begin ; i < my_end ; ++i)
52         {
53             x = (i+0.5) * step ;
54             my_pi = my_pi + 4.0/(1.0 + x*x) ;
55         }
56         my_pi *= step ;
57         EnterCriticalSection(&protect);
58         pi += my_pi ;
59         LeaveCriticalSection(&protect) ;

As a further optimization, I pulled the multiplication by step out of the for loop, so it is done once per thread rather than once per iteration. When I run this version through Parallel Amplifier, the result is shown in the screenshot below.



Once again the top of the table is the thread in main waiting for all the worker threads to complete. The code runs so much faster that the scale of the display changed from seconds to milliseconds, and the amount of time spent waiting is far shorter. The second entry in this screenshot (Stream) maps back to a print statement; remember that printf implicitly synchronizes access to the output stream, and that waiting is tracked just like the calls to critical sections or WaitForMultipleObjects.

So what to do next? Get an evaluation copy of Intel Parallel Amplifier and analyze your code. Check which synch objects have threads waiting the most, then see if you can find some things to optimize. The difference in runtime between the two algorithms I showed here is about 50x. This was a simple example where the worst case performed worse than the sequential algorithm, so changes in performance of this magnitude are not common; improved performance from minimizing time spent waiting on synchronization objects is common, though. So see what you can get. Now, if I had begun with Intel Threading Building Blocks, I would have let its parallel_reduce algorithm handle the reduction intelligently and I would have had a very different experience, but that is a topic for a different blog.
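To give a flavor of that, though, here is a rough sketch of what a Threading Building Blocks version of the calculation might look like (this is my own illustration, not code from the program above; the name calc_pi_tbb is made up). parallel_reduce gives each task its own running sum and combines the partial sums for you, so the critical section disappears entirely.

    #include <tbb/parallel_reduce.h>
    #include <tbb/blocked_range.h>
    #include <functional>

    double calc_pi_tbb(int iterations)
    {
        const double step = 1.0 / (double) iterations ;
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<int>(0, iterations),   // TBB splits this range across tasks
            0.0,                                      // identity value for the reduction
            [=](const tbb::blocked_range<int>& r, double local) {
                // Each task accumulates into its own local sum; no lock needed.
                for (int i = r.begin() ; i != r.end() ; ++i)
                {
                    double x = (i + 0.5) * step ;
                    local += 4.0 / (1.0 + x * x) ;
                }
                return local ;
            },
            std::plus<double>()) ;                    // combine the partial sums
        return sum * step ;
    }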
