Using Locks Effectively in Parallel Programming

**********************************************


Title of the Research Paper:
Using Locks Effectively in Parallel Programming

Research Area:
Using Locks Effectively in Parallel Programming

Team ID: TC2009028

Authors:
Rishabh Rao
6th semester, Department of Information Science & Engineering
(rishabhsrao@hotmail.com)

Narendra G S
6th semester, Department of Information Science & Engineering
(naren_2110@yahoo.com)

Naga Anudeep Reddy
4th semester, Department of Computer Science & Engineering
(deepuneo@gmail.com)

Faculty mentor:
Mr. Manjunath
Department of Computer Science & Engineering
(majubt_rvce@yahoo.co.in)

Name of the Institution:
JSS Academy of Technical Education, Bangalore

**********************************************


1. Abstract:
Thread synchronization is required to maintain data consistency in multi-threaded/parallel programs. Since locks are the primary tool for preventing race conditions, this article highlights a few techniques/strategies for using them effectively.

In this article, we first look at why multi-threading is required and at the problem of the race condition. We then highlight a few locking techniques/strategies: keeping the critical section to the minimum, spin locks, blocking the waiting threads, using more than one lock, using high-level APIs, using a lock manager and using read/write locks. Their advantages and disadvantages are also discussed.


2. Background:

A single-threaded process (especially GUI-based) will sometimes appear unresponsive to the user if this single thread is performing a complex operation (such as printing a large text file, performing a processor intensive calculation, or attempting to connect to a remote server thousands of miles away.) But with the advent of Intel Hyper-Threading technology [ref. W1b] and multi-core processors, this drawback can be addressed with the help of the multi-threading/parallel programming paradigm.

The multi-threading/parallel programming paradigm makes it possible for the primary thread of execution to spawn additional secondary threads in the background. Each thread (primary or secondary) becomes a unique path of execution in the process and has concurrent access to all shared data.

Although multi-threading increases the performance of the application, it introduces a new problem: the race condition.


3. Problem Statement:
A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time (for example, by using Hyper-Threading technology or a multi-core processor), but, because of the nature of the system, the operations must be performed in the proper sequence to be done correctly.

For example, let us assume that two threads, T1 and T2, each want to increment the value of a shared integer variable 'i' (initial value = 0) by one. Also, let us assume that thread T1 uses a register, r_A, with an initial value of 0 and thread T2 uses a register, r_B, with an initial value of 0.

Ideally, we wish the following sequence of operations (pseudo-machine instructions) to take place:

Ideal sequence of operations:

    T1: load i into r_A       (r_A = 0)
    T1: increment r_A         (r_A = 1)
    T1: store r_A into i      (i = 1)
    T2: load i into r_B       (r_B = 1)
    T2: increment r_B         (r_B = 2)
    T2: store r_B into i      (i = 2)

The final value of 'i' is 2. But due to the race condition problem, the following operations may take place instead:
Again, assuming initial value of 'i' = 0. Please note that instructions placed side-by-side are executed in parallel (either by using Hyper-Threading technology or on a multi-core processor.)

A typical race condition sequence:

    T1: load i into r_A     |  T2: load i into r_B      (r_A = 0, r_B = 0)
    T1: increment r_A       |  T2: increment r_B        (r_A = 1, r_B = 1)
    T1: store r_A into i    |  T2: store r_B into i     (i = 1)

The final value of 'i' is 1, instead of the expected value 2.

This occurred because both the threads read and modified the shared variable 'i' at the same time, leading to invalid data. In other words, the shared memory resource was not accessed in a controlled manner due to lack of synchronization between the two threads.
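This lost-update problem can be reproduced with a short pthreads program. The following is a minimal sketch (our illustration, not one of the paper's original listings; the names are hypothetical): two threads each increment a shared counter one million times with no lock, so the final value is usually less than the expected 2000000.

```cpp
/* race.cpp -- demonstrating the race condition described above */
#include <pthread.h>

static volatile long sharedCounter = 0;   /* the shared variable */

static void *incrementLoop(void *unused)
{
        (void) unused;
        for(long n = 0; n < 1000000; n++)
                sharedCounter = sharedCounter + 1;  /* load, add 1, store -- not atomic */
        return NULL;
}

/* run the experiment once and return the final value of the counter */
long runRace(void)
{
        pthread_t t1, t2;
        sharedCounter = 0;

        pthread_create(&t1, NULL, incrementLoop, NULL);
        pthread_create(&t2, NULL, incrementLoop, NULL);

        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        return sharedCounter;   /* expected 2000000; usually less */
}
```

Printing the value returned by runRace() from main() on a multi-core machine will typically show a different value, below 2000000, on each run.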


4. Methodology:
The key to synchronization is the concept of the monitor (closely related to the mutex and the binary semaphore.) A monitor is an object that is used as a mutually exclusive lock, or mutex. Only one thread can own a monitor at a given time. When a thread acquires the lock, it is said to have entered the monitor (or its critical section.) All other threads attempting to enter the locked monitor are suspended (or some other action may be taken, depending on the operating system) until the first thread exits the monitor. These other threads are said to be waiting for the monitor.

4.1. A Few Locking Techniques/Strategies:
4.1.1. Use critical section to the minimum:
Ideally, a thread should execute in its critical section for a minimum required set of instructions.

For example, instead of the following sequence of steps:

[Figure: Inefficient method]

Preferably, consider re-structuring them to:

[Figure: Preferred method]
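As a concrete sketch of this advice (our illustration; the log and function names are hypothetical, not from the paper), the expensive thread-private work is done before the lock is taken, and the mutex guards only the shared update:

```cpp
/* minlock.cpp -- keep the critical section to the minimum */
#include <pthread.h>
#include <string>
#include <vector>

static pthread_mutex_t logLock = PTHREAD_MUTEX_INITIALIZER;
static std::vector<std::string> sharedLog;   /* the shared resource */

void addLogEntry(int id, const std::string &message)
{
        /* expensive formatting touches only local data -- no lock needed */
        std::string entry = "thread " + std::to_string(id) + ": " + message;

        pthread_mutex_lock(&logLock);
        /* ***** CRITICAL SECTION: only the shared update ***** */
        sharedLog.push_back(entry);
        pthread_mutex_unlock(&logLock);
}
```

Because the string is built outside the lock, other threads are blocked only for the brief push_back() call, not for the whole formatting step.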

4.1.2. Spin locks:
When a thread is in its critical section, any other process that tries to enter the critical section must loop continuously in the entry code. This type of semaphore is called a spin lock because the threads 'spin' while waiting for the lock.

A sample implementation of spin locks using POSIX.1c PThreads is given below.

The echoThis() function may be called by any thread to print some text to the standard output. This function ensures that only one thread can print the text to the standard output at a time.

/* spinlock.cpp */
#include <pthread.h>
#include <iostream>

using namespace std;

/* create a spin lock */
static pthread_spinlock_t echoLock;

void *echoThis(void *text)
{
        /* spin until lock is acquired */
        if(pthread_spin_lock(&echoLock))
        {
                /* error acquiring spin lock */
                /* NOTE: if the lock is already acquired by another thread, */
                /* then it is not an error, the thread will simply spin */
                /* until the lock is acquired */
                perror("pthread_spin_lock");
        }

        /* lock was successfully acquired */
        /* ***** START OF CRITICAL SECTION ***** */
        /* print the text */
        cout << "TID: " << pthread_self() << "\t\tTEXT: " << (char *) text << endl << flush;

        /* ***** END OF CRITICAL SECTION ***** */
        /* release the lock */
        if(pthread_spin_unlock(&echoLock))
        {
                /* error releasing lock */
                perror("pthread_spin_unlock");
        }
        /* lock successfully released */

        return NULL;
}

int main(int argc, char *argv[])
{
        /* create as many threads as the number of command line arguments */
        int maxThreads = argc - 1;
        int i;
        pthread_t tid[maxThreads];

        /* initialize the spin lock */
        if(pthread_spin_init(&echoLock, PTHREAD_PROCESS_PRIVATE))
        {
                /* error initializing lock */
                perror("pthread_spin_init");
        }

        /* create threads which will call echoThis */
        for(i = 0; i < maxThreads; i++)
        {
                /* create a thread with the default properties */
                if(pthread_create(&tid[i], NULL, echoThis, argv[i + 1]))
                {
                        /* error creating thread */
                        perror("pthread_create");
                }
        }

        /* wait for each thread to join; joining only the last one */
        /* would let main() exit while other threads are still running */
        for(i = 0; i < maxThreads; i++)
        {
                if(pthread_join(tid[i], NULL))
                {
                        /* error joining thread */
                        perror("pthread_join");
                }
        }

        /* destroy the spin lock */
        if(pthread_spin_destroy(&echoLock))
        {
                /* error removing the spin lock */
                perror("pthread_spin_destroy");
        }

        return 0;
}
/* end of spinlock.cpp */


In the above example, the main() function initializes the spin lock echoLock, which is shared by all the threads in the process. It then creates multiple threads, one per command-line argument; each thread calls the echoThis() function to display its argument string on the standard output. It does not matter how many threads call echoThis() simultaneously, as the function serves only one thread at a time.

Two sample runs of the above program are shown below.


The computer, on which this program was run, had the following configuration:
Processor: Intel® Pentium D CPU 2.80GHz and 2.80GHz
RAM: 1024 MB
OS: Knoppix 5.1 (with X Window System running a KDE desktop)
Thread model: POSIX
Compiler used: gcc version 4.1.2 20061028 (prerelease) (Debian 4.1.1-19)


$ g++ spinlock.cpp -lpthread

$ ./a.out The quick brown fox jumps over the lazy dog
TID: 3049794480 TEXT: jumps
TID: 3066571696 TEXT: brown
TID: 3083348912 TEXT: The
TID: 3041405872 TEXT: over
TID: 3033017264 TEXT: the
TID: 3074960304 TEXT: quick
TID: 3016240048 TEXT: dog
TID: 3058183088 TEXT: fox

$ ./a.out The quick brown fox jumps over the lazy dog
TID: 3075533744 TEXT: quick
TID: 3058756528 TEXT: fox
TID: 3050367920 TEXT: jumps
TID: 3041979312 TEXT: over
TID: 3083922352 TEXT: The
TID: 3067145136 TEXT: brown
TID: 3033590704 TEXT: the
TID: 3025202096 TEXT: lazy
TID: 3016813488 TEXT: dog

$


It is worth mentioning that, although the echoThis() function serves only one thread at a time, it does not preserve the order in which the command-line arguments are printed. Also, in the first sample run, the thread corresponding to "lazy" did not execute at all: if main() returns before every thread has been joined, the process exits and kills any threads still waiting for the lock, so it is important to join each thread individually.


4.1.3. Blocking the waiting threads:
This method solves the same problem as spin locks, but instead of busy-waiting for the lock, a contending thread blocks itself. The block operation places the thread into a waiting queue associated with the lock (known as a turnstile in Solaris) and the thread is switched to the waiting state. Control then transfers to the CPU scheduler, which selects another thread to execute.

The spinlock.cpp program above, re-implemented using this strategy as echotext.cpp, is shown below.

Please note that most of the code below is the same as in spinlock.cpp. For the sake of clarity, the repeated code has been replaced with /* ... */.

/* echotext.cpp */

/* ... */

/* create a mutex lock */
static pthread_mutex_t echoLock;

void *echoThis(void *text)

{
        /* acquire the mutex lock */
        if(pthread_mutex_lock(&echoLock))
        {
                /* error acquiring lock */
                perror("pthread_mutex_lock");
        }

/* ... critical section code here ... */
}

int main(int argc, char *argv[])
{

        /* ... */

        /* initialize the mutex lock */
        if(pthread_mutex_init(&echoLock, NULL))
        {
                /* error initializing lock */
                perror("pthread_mutex_init");
        }


        /* ... */

        /* remove the mutex */
        if(pthread_mutex_destroy(&echoLock))
        {
                /* error removing the mutex */
                perror("pthread_mutex_destroy");
        }

        /* ... */

}
/* end of echotext.cpp */


4.1.4. Using more than one lock:
Mutex locks tend to become inefficient as the critical section they are locking becomes longer. A simple solution is to split the critical section into smaller ones and provide a lock for each sub-section.

Here, one big critical section protects 3 mutually independent shared data items/resources. If one thread occupies the critical section, all the other threads have to wait for it to exit.

[Figure: Big critical section]

It would be more efficient to provide a separate lock for each shared data/resource, so that at most 3 threads can enter their respective critical sections simultaneously.

[Figure: Split critical sections]
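This strategy can be sketched as follows (our illustration; the counters and lock names are hypothetical). Each independent shared resource gets its own mutex, so a thread updating one resource never blocks a thread updating the other:

```cpp
/* splitlocks.cpp -- one lock per independent shared resource */
#include <pthread.h>

/* two mutually independent shared resources */
static int counterA = 0;
static int counterB = 0;

/* one lock per resource instead of one big lock for both */
static pthread_mutex_t lockA = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lockB = PTHREAD_MUTEX_INITIALIZER;

void bumpA(void)
{
        pthread_mutex_lock(&lockA);     /* protects counterA only */
        counterA++;
        pthread_mutex_unlock(&lockA);
}

void bumpB(void)
{
        pthread_mutex_lock(&lockB);     /* can run while lockA is held */
        counterB++;
        pthread_mutex_unlock(&lockB);
}
```

A thread inside bumpA() holds only lockA, so another thread can enter bumpB() at the same time; with a single shared lock, one of them would have to wait.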


4.1.5. Using high-level APIs:

Although semaphores provide a convenient and effective mechanism for thread synchronization, their incorrect usage can still result in timing errors that are difficult to detect, since these errors happen only when particular execution sequences take place, and these sequences do not always occur. Such errors may be the result of honest programming mistakes. High-level thread synchronization APIs, such as Java's synchronized keyword [ref. TB3] and C#'s lock keyword [ref. TB2], provide simple yet robust synchronization tools. This leads to easier programming and fewer programming errors.

A sample implementation in Java is shown below. The main() method uses the ThreadCreator class to create 4 sample threads. Each ThreadCreator starts a thread that calls the Printer class's printThis() method, which prints its arguments.

It is important to note that in the printThis() method, each thread sleeps for some time. This gives the thread scheduler a chance to context-switch to another thread, introducing a race condition. But because printThis() is declared with the synchronized keyword, only one thread can execute it at a time. Access to printThis() is therefore said to be serialized.

package threadsynch;

class Printer
{
    synchronized void printThis(Thread t, String text)
    {
        System.out.println("My thread ID: " + t);

        /* let the current thread sleep */
        /* so that the thread-scheduler context-switches */
        /* to another thread, introducing a race condition */
        try
        {
            Thread.sleep(100);
        }
        catch(InterruptedException exp)
        {
            System.out.println("INTERRUPTED: Thread.sleep()");
        }

        /* now print the string */
        System.out.println("My payload: " + text);
    }
}

class ThreadCreator implements Runnable
{
     String txt;
     Printer dest;
     Thread thr;

     /* Initialize and create the threads */
     public ThreadCreator(Printer d, String t)
     {
         dest = d;
         txt = t;

         thr = new Thread(this);
         thr.start();
     }

     public void run()
     {
         dest.printThis(thr, txt);
     }
}

public class Main
{
    public static void main(String[] args)
    {
        Printer destination = new Printer();

        /* Create 4 sample threads */
        ThreadCreator tc1 = new ThreadCreator(destination, "the quick");
        ThreadCreator tc2 = new ThreadCreator(destination, "brown fox");
        ThreadCreator tc3 = new ThreadCreator(destination, "jumps over");
        ThreadCreator tc4 = new ThreadCreator(destination, "the lazy dog");

        /* Wait for all threads to join */
        try
        {
            tc1.thr.join();
            tc2.thr.join();
            tc3.thr.join();
            tc4.thr.join();
        }
        catch(InterruptedException exp)
        {
            System.out.println("INTERRUPTED: Thread.join()");
        }
    }
}

A sample run of the above program is shown below.

My thread ID: Thread[Thread-0,5,main]
My payload: the quick
My thread ID: Thread[Thread-1,5,main]
My payload: brown fox
My thread ID: Thread[Thread-3,5,main]
My payload: the lazy dog
My thread ID: Thread[Thread-2,5,main]
My payload: jumps over

Notice that the order in which the threads are created and the order in which the text is printed, need not be the same.

Now, if we code the program without using the synchronized keyword, i.e. if we change,

class Printer
{
    synchronized void printThis(Thread t, String text)
    {

to

class Printer
{
    void printThis(Thread t, String text)
    {

then the function will no longer be synchronized. A sample output, without synchronization, is shown below.

My thread ID: Thread[Thread-0,5,main]
My thread ID: Thread[Thread-1,5,main]
My thread ID: Thread[Thread-2,5,main]
My thread ID: Thread[Thread-3,5,main]
My payload: the quick
My payload: jumps over
My payload: brown fox
My payload: the lazy dog

All the threads enter the printThis() function simultaneously and race against each other to complete the function, mixing up their outputs.


4.1.6. Using a lock manager:
This method is analogous to having a waiter at the table in the Dining Philosophers' Problem [ref. TB1]. The lock manager grants the mutex to one of the lock-contending threads. This thread, upon exiting the critical section, returns the mutex to the lock manager. The lock manager then selects another thread from the remaining waiting threads, and so on. The lock manager may maintain a queue of contending threads (either blocked or spinning) and may also order them according to some priority.

A simplified diagram of the working of a lock manager is shown below.

[Figure: A lock manager]
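One simple way to realize a lock manager, sketched below (our illustration, not a referenced implementation; the names are hypothetical), is a ticket scheme built on a mutex and a condition variable: each contending thread takes a ticket to join the queue, and the manager admits threads to the critical section strictly in ticket order.

```cpp
/* lockmanager.cpp -- a FIFO "lock manager" using tickets */
#include <pthread.h>

struct LockManager
{
        pthread_mutex_t lock;
        pthread_cond_t  turn;
        unsigned long nextTicket;   /* next ticket to hand out */
        unsigned long nowServing;   /* ticket currently allowed in */
};

void lmInit(LockManager *lm)
{
        pthread_mutex_init(&lm->lock, NULL);
        pthread_cond_init(&lm->turn, NULL);
        lm->nextTicket = 0;
        lm->nowServing = 0;
}

void lmAcquire(LockManager *lm)
{
        pthread_mutex_lock(&lm->lock);
        unsigned long myTicket = lm->nextTicket++;   /* join the queue */
        while(myTicket != lm->nowServing)            /* wait for our turn */
                pthread_cond_wait(&lm->turn, &lm->lock);
        pthread_mutex_unlock(&lm->lock);
        /* the caller is now inside the critical section */
}

void lmRelease(LockManager *lm)
{
        pthread_mutex_lock(&lm->lock);
        lm->nowServing++;                    /* serve the next thread */
        pthread_cond_broadcast(&lm->turn);   /* wake the waiting threads */
        pthread_mutex_unlock(&lm->lock);
}
```

Unlike a plain mutex, this grants the critical section in strict first-come, first-served order, which is the "priority ordering" role of the lock manager described above.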


4.1.7. Using read/write locks:
Read/write locks are similar to mutex locks, but they can be acquired for reading or for writing. One or more threads can hold a read lock on the same read/write lock simultaneously.

Only one thread at a time is allowed to hold a write lock on a read/write lock; any other thread trying to acquire a read or write lock must wait. Also, any thread that tries to acquire a write lock must wait until all the read locks on that read/write lock are released.
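Read/write locks of this kind were standardized in POSIX (after POSIX.1c) as pthread_rwlock_t. The minimal sketch below (our illustration, with hypothetical names) shows the usual pattern: readers take the lock in shared mode, while a writer takes it exclusively.

```cpp
/* rwlock.cpp -- reading shared data with a read/write lock */
#include <pthread.h>

static pthread_rwlock_t dataLock = PTHREAD_RWLOCK_INITIALIZER;
static int sharedData = 0;

int readData(void)
{
        pthread_rwlock_rdlock(&dataLock);   /* shared: many readers at once */
        int value = sharedData;
        pthread_rwlock_unlock(&dataLock);
        return value;
}

void writeData(int value)
{
        pthread_rwlock_wrlock(&dataLock);   /* exclusive: one writer only */
        sharedData = value;
        pthread_rwlock_unlock(&dataLock);
}
```

Any number of threads may be inside readData() simultaneously, but a thread inside writeData() excludes all readers and other writers.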

5. Key Results/Discussion:
Keeping the critical section to the minimum definitely improves performance, and should be applied wherever the structure of the code permits.

Spin locks require busy waiting. The continual looping of the thread is a problem on single-CPU computers because it wastes CPU cycles that some other thread might use productively. The advantage of spin locks is that no context switch or thread re-scheduling is required, because the waiting thread keeps running. Since a CPU can otherwise spend more time performing context switches than executing instructions, spin locks are useful when locks are expected to be held for short times.

The blocking-waiting-threads strategy may be used instead of spin locks, but it suffers from thread context-switch overhead. It is therefore better suited to longer critical sections.

Splitting critical regions and using more locks can be advantageous; but care must be taken not to use too many locks, as the threads may spend too much time contending for the locks at each sub-section, which may degrade performance.

High-level APIs hide the specific synchronization implementation details, but generally use primitive synchronization tools in their inner implementations.

The lock-manager method is useful as it addresses the synchronization problem in an orderly manner. But maintaining another thread for the lock manager can be an overhead.

Read/write locks can lead to writer starvation: as long as at least one reading thread holds the lock, a writer thread cannot acquire it. A variant of read/write locks, known as write-preferred locks, prevents any new readers from acquiring the lock while a writer is queued and waiting for it. The advantage of read/write locks, of course, is the ability to read shared data simultaneously. The original POSIX.1c standard did not define read/write locks; they were standardized later, as pthread_rwlock_t, in the Single UNIX Specification and POSIX.1-2001.

6. Scope for future work:
There is a lack of standardization in the area of multi-threaded/parallel programming. In the future, we hope to see international standards organizations, such as IEEE (which publishes POSIX), ANSI and ISO, unite and create a common standard so that developers around the world can create compatible applications.


7. Conclusion:

No locking strategy can be selected as the best one. It is up to the programmer to select the best strategy depending on his/her application. As the multi-threaded/parallel programming paradigm starts to become popular, the search for better synchronization tools will continue.


8. References:

8.1. Web/Internet resources:
[W1a] SIGCSE thoughts: Preparing Students for Ubiquitous Parallelism: http://software.intel.com/en-us/blogs/2009/04/01/sigcse-thoughts-preparing-students-for-ubiquitous-parallelism/
[W1b] Intel Hyper-Threading technology: http://www.intel.com/products/ht/hyperthreading_more.html
[W1c] Intel Multi-Core Community: http://software.intel.com/en-us/multi-core/
[W1d] Intel Training: http://software.intel.com/en-us/developertraining/
[W1e] Intel Threading Glossary: http://software.intel.com/en-us/blogs/2008/5/07/parallel-programming-glossary-of-technical-terms/
[W1f] Intel Threading Knowledge Base: http://software.intel.com/en-us/articles/all/1/
[W2] ThinkingParallel.com
[W3] Wikipedia.org
[W4] CodeProject.com
[W5] Differences - mutex vs. semaphore: http://geekswithblogs.net/shahed/archive/2006/06/09/81268.aspx
[W6] Keyword Definitions: WhatIs.com and http://searchnetworking.techtarget.com/
[W7] Writing a Research Paper: http://owl.english.purdue.edu/workshops/hypertext/ResearchW/
[W8] How to avoid race conditions: http://www.asta.va.fh-ulm.de/LDP/HOWTO/Secure-Programs-HOWTO/avoid-race.html
[W9] POSIX.1c Thread (PThread) Tutorial: https://computing.llnl.gov/tutorials/pthreads/
[W10] Using Spin Locks (Multithreaded Programming Guide) - SUN Microsystems: http://docs.sun.com/app/docs/doc/816-5137/ggecq?a=view

8.2. Textbooks:
[TB1] Operating System Concepts (6th edition) by Silberschatz, Galvin, Gagne (2002, John Wiley & Sons, Inc.)
[TB2] C# and the .NET Platform (2nd edition) by Andrew Troelsen (2003, Apress, Berkeley, CA, USA)
[TB3] The Complete Reference Java 2 (5th edition) by Herbert Schildt (2002, Tata McGraw Hill)
[TB4] Unix System Programming Using C++ (1st impression) by Terrence Chan (2008, Pearson Education, Inc.)


9. Acknowledgements:
We would like to thank Intel for providing us with such an excellent opportunity and platform for presenting our research paper. We are extremely grateful to the organizers of the Intel Threading Champion 2009 contest. We would also like to thank the software.intel.com support team.

We would also like to thank all our team members (Rishabh, Narendra and Anudeep) and our faculty mentor (Mr. Manjunath) for their support.
