Fun with Locks and Waits - Performance Tuning

At times, threaded software requires critical sections, mutexes, or locks. Do developers always know which of these objects in their code has the most impact? If I want to examine my software to minimize the impact of these synchronization objects, or restructure data to eliminate some of them, how do I know where to make changes to get the biggest performance improvement? Intel Parallel Amplifier can help me determine this.

Intel Parallel Amplifier provides three basic analysis types: hotspots (with call stack data), concurrency, and locks and waits. The locks and waits analysis highlights which synchronization objects block threads the longest. It is common for software to have too many or too few synchronization points. Insufficient synchronization leads to race conditions and indeterminate results (if you have this problem you need Intel Parallel Inspector, not Intel Parallel Amplifier; see this MSDN web seminar for more on Parallel Inspector: Got Memory Leaks?). If you have too many synchronization objects, you want to know which ones, if removed, would improve performance the most. Even if all the synchronization objects you have are necessary, you might want to know how much they impact performance; this can help you decide whether to spend time refactoring the software so a synchronization object can be removed. The Locks and Waits analysis in Parallel Amplifier tracks how much time threads spend waiting on each synch object and reports the times in an ordered table, with the synch objects that cause the most waiting listed at the top and the rest in declining order.

Let’s look at a simple example. One of the computational exercises commonly used to teach parallel programming is a simple program to calculate pi, by numerically integrating 4/(1 + x*x) from 0 to 1 with the midpoint rule. The main process thread creates NTHREADS threads and sets them off executing a routine called calcpi, then waits for them to complete. The basic algorithm in calcpi is shown below:

47       int my_begin = myid * (ITER/NTHREADS) ;      // each thread takes its own slice
48       int my_end = (myid +1) * (ITER/NTHREADS) ;   //   of the iteration space
49       step = 1.0 / (double) ITER ;                 // width of each rectangle
50       for (int i = my_begin ; i < my_end ; ++i)
51       {
52           x = (i+0.5) * step ;                     // midpoint of rectangle i
53           EnterCriticalSection(&protect);          // contended on every iteration
54           pi = pi + 4.0/(1.0 + x*x) * step ;       // update the shared global pi
55           LeaveCriticalSection(&protect) ;
56       }
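For completeness, the surrounding setup in main looks roughly like this. This is a minimal sketch assuming the Windows threading API; the handle array and the way the thread id is passed are my own illustration, not the exact code from this example:

    #include <windows.h>

    #define NTHREADS 2

    CRITICAL_SECTION protect ;            /* guards the shared global pi */
    DWORD WINAPI calcpi(LPVOID arg) ;     /* the worker routine shown above */

    int main()
    {
        HANDLE handles[NTHREADS] ;
        InitializeCriticalSection(&protect) ;
        for (int t = 0 ; t < NTHREADS ; ++t)
            handles[t] = CreateThread(NULL, 0, calcpi, (LPVOID)(INT_PTR)t, 0, NULL) ;
        /* This wait is what shows up as the top synch object in the analysis. */
        WaitForMultipleObjects(NTHREADS, handles, TRUE, INFINITE) ;
        DeleteCriticalSection(&protect) ;
        return 0 ;
    }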

Each thread takes a section of the for loop to operate on. Because pi is a global variable, each update to pi is wrapped in a critical section to protect it from data races. After creating NTHREADS threads and assigning them to execute calcpi, the main thread waits for all the threads to complete. When I run this through the Locks and Waits analysis of Intel Parallel Amplifier, three synch objects appear in the analysis section, as shown in the screenshot below. I can expand each of these three synch objects and drill down to the source code. The top synch object, where a thread waits the most time, is the main thread waiting for all the threads it created to complete. All this thread does is create threads and wait; there are no intermediate steps to tune, and we certainly don't want to exit the program before the calculations are done, so let's keep this synchronization point and look down the list for the next opportunity.

The second item in the table, the next synch object showing significant wait time, is the critical section in calcpi. I can double click to drill down to source and verify that the critical section causing so much delay is the one shown in the code segment above. In the screenshot you can also see this was line 53 in my code (I rearranged the display so the creation line column is next to the wait time; you may rearrange column ordering as you like by dragging each column to the desired position). So let's look at how I can minimize the number of entrances into the critical section. The contention is around access to the global variable pi, but there is no reason I must accumulate each contribution in the global variable. I can create a partial sum within each thread and combine these partial sums for the final result. The first thing I do is create a local copy of pi for each thread, which I call my_pi. Each thread accumulates into my_pi, and when its for loop is complete, it adds its portion of the calculation into the global variable pi. For the data collected here, I used a dual core system and 1,200,000 iterations with NTHREADS set to 2. So for the first algorithm I entered the critical section 1,200,000 times; for the algorithm below, I enter the critical section only twice.

        double my_pi = 0.0 ;                  // per-thread partial sum
        for (int i = my_begin ; i < my_end ; ++i)
        {
            x = (i+0.5) * step ;
            my_pi = my_pi + 4.0/(1.0 + x*x) ; // no lock needed inside the loop
        }
        my_pi *= step ;                       // multiply by step once per thread
        EnterCriticalSection(&protect);       // entered once per thread, not per iteration
        pi += my_pi ;
        LeaveCriticalSection(&protect) ;

As a further optimization, I pulled the multiplication by step out of the for loop, so it is done once per thread rather than once per iteration. When I run this version through Parallel Amplifier, the result is shown in the screenshot below.



Once again the top of the table is the thread in main waiting for all the worker threads to complete, but the amount of time spent waiting is far shorter; the code runs so much faster that the scale of the display changed from seconds to milliseconds. The second entry in this screenshot (Stream) maps back to a print statement: printf implicitly locks the output stream, and that behavior is tracked just like the calls to critical sections or WaitForMultipleObjects.

So what to do next? Get an evaluation copy of Intel Parallel Amplifier and analyze your code. Check which synch objects have threads waiting the most, then see if you can find some things to optimize. The difference in runtime between the two algorithms I showed here is about 50x. This was a simple example where the worst case performed worse than the sequential algorithm, so changes in performance of this magnitude are not common; improved performance from minimizing time spent waiting on synchronization is common, though. So see what you can get. Now, if I had begun with Intel Threading Building Blocks, I would have let its parallel_reduce algorithm handle the reduction intelligently and I would have had a very different experience, but that would be a topic for a different blog.
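To give a flavor of that alternative, here is a minimal sketch of the same reduction using tbb::parallel_reduce. This is my own illustration using the functional form of the template, not code from this example; ITER is assumed to be defined as above:

    #include "tbb/parallel_reduce.h"
    #include "tbb/blocked_range.h"
    #include <functional>

    double calcpi_tbb()
    {
        const double step = 1.0 / (double) ITER ;
        double sum = tbb::parallel_reduce(
            tbb::blocked_range<int>(0, ITER),        /* TBB picks the chunking */
            0.0,                                     /* identity for the sum */
            [=](const tbb::blocked_range<int> &r, double local) -> double {
                for (int i = r.begin() ; i != r.end() ; ++i) {
                    double x = (i + 0.5) * step ;
                    local += 4.0 / (1.0 + x * x) ;
                }
                return local ;                       /* per-range partial sum */
            },
            std::plus<double>()) ;                   /* combine partials; no locks needed */
        return sum * step ;
    }

The reduction over partial sums is the same idea as the hand-rolled my_pi version, but the library handles the splitting and combining for you.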


Comments

David Mackay (Intel):

Interesting discussion. The first thing that occurs to me, though, is: why would I accept the limitation of only one producer and one consumer thread? Why not think of a larger thread pool? Quad cores are common now, and systems with 16 or 24 cores are not uncommon. Dual core systems might be the most common platform my customers currently use, but the number of cores will be increasing in the future, and I might have more ideas for larger workloads and activities. Why not think of a more general implementation that I expect to scale as the number of cores and threads increases?

-David

sumitagarwal.smit:

Hi, I had a small question.

Creating a thread safe producer consumer queue in C++ without using locks

Yesterday, someone in the newsgroups asked a question about synchronization problems he was having. I explained that, if properly programmed, a producer consumer queue does not need synchronization even on a multiprocessor machine.

A Microsoft employee then asked me to explain exactly how this was done, i.e. to put my money where my mouth is.

The idea
A queue is a chunk of memory that can be divided into equal sized pieces. To simplify, you can think of a queue as a traditional one-dimensional C style array.

You should also note that there is one reader and one writer, each working in its own thread. This means that reads can happen concurrently with writes, but there will be no multiple concurrent reads or multiple concurrent writes. It is possible to provide that level of concurrency without locks (I did, once) but that is significantly more complex, and not the purpose of this example.

To manage the producer and consumer parts, we need two additional variables: a read index and a write index.

The read index is the index of the element right after the element that was last read by the consumer.

The write index is the index of the element right after the element that was last written by the producer.

A writer can always add elements to the queue as long as it can move the write pointer without overtaking the read pointer; an overtake would mean that the queue had overflowed.

There are three possible cases:

1. The read and write pointers are equal. This means the queue is empty (this is also its initialized state). Any read will fail; any write will succeed.
2. The read pointer points to the first element after the write pointer. This means the queue is full: if one more element were added, the read and write pointers would become equal, and the reader would assume the queue was empty. Any read will succeed; any write will fail.
3. The read and write pointers are not equal and the previous case does not apply. Any read will succeed; any write will succeed.
The implementation
Adding elements

bool PushElement(T &Element)
{
    int nextElement = (m_Write + 1) % Size;
    if (nextElement != m_Read)
    {
        m_Data[m_Write] = Element;   // store the data first...
        m_Write = nextElement;       // ...then publish it by advancing the index
        return true;
    }
    else
        return false;
}

As long as the first element after the write pointer is not the read pointer, it is not occupied. The element can be added and the write pointer can be updated.

Otherwise the queue is full and the write operation will fail.

Removing elements

bool PopElement(T &Element)
{
    if (m_Read == m_Write)           // pointers equal: the queue is empty
        return false;

    int nextElement = (m_Read + 1) % Size;
    Element = m_Data[m_Read];        // copy the data out first...
    m_Read = nextElement;            // ...then free the slot by advancing the index
    return true;
}
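For context (this part is my reconstruction, not from the quoted article): the two member functions assume surrounding state roughly along these lines, with the class name and the volatile qualifiers being my own guesses:

    template <typename T, int Size>
    class ProducerConsumerQueue
    {
    public:
        ProducerConsumerQueue() : m_Read(0), m_Write(0) {}
        bool PushElement(T &Element);   // shown above, called only by the producer
        bool PopElement(T &Element);    // shown above, called only by the consumer
    private:
        T m_Data[Size];                 // the fixed-size ring buffer
        volatile int m_Read;            // advanced only by the consumer
        volatile int m_Write;           // advanced only by the producer
    };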

The above article talks about creating a thread safe producer consumer queue without using locks. How safe is this?

Regards,
Sumit

