Bubble, Bubble, Toil and Trouble; Mutex Lock and Buffer Double

A problem that I’ve been seeing recently within genomic codes stems from the I/O required. Before you can start processing them, large data files of DNA reads have to be read into memory and converted to a format that will be used by the computation, e.g., compressing the byte-sized characters (“A”, “C”, “T”, or “G”) to a 2-bit equivalent. The natural way to code up this processing is to input a read, call some conversion function, and store the result. Repeat until all files have been input.

While it is commonly accepted knowledge that I/O can be a bottleneck to performance, regardless of the relative speed between the input of a DNA read and the initial processing/storing of that input, the above serial execution is leaving some easy parallelism on the table. Obviously, the conversion processing of each read is likely independent of the conversion of some other read, but, more to my point, the input of a new read is independent of the conversion of the previous read. Overlapping computation and I/O by having a separate thread doing each of the two tasks is a good way to boost performance when it can be done.

However, to keep the results of the input and conversion correct I can’t be overwriting the previous unconverted read by a new read (like stabbing a perfectly usable King Duncan in his sleep). Nor do I want to re-process an already converted read (like raging at an empty chair as if Banquo’s ghost was sitting there). To this end, I can declare and use a second input buffer. While the I/O thread is filling up one buffer with the next read, the conversion thread can be working on processing the previous read. In this post I want to sketch out one way you can coordinate the two threads without having them overwrite good data or pull out stale data from the two shared buffers.

You may have realized that the cooperation between the input and conversion threads is an example of the Producer/Consumer pattern. The (pseudo-) code below is an implementation of Producer/Consumer with a shared double buffer structure used to pass items from the Producer thread to the consumer thread. I’ve implemented it with Pthreads to use a condition variable for control of access to the two buffers. First, declarations for the buffers, some index pointers, and the synchronization objects, all of which are visible to both producer and consumer threads.


	#define BUFSIZE 3

	dnaReadType buffer[BUFSIZE];

	int in=1;   // index to store next element

	int out=0;  // index of last removed element

	pthread_mutex_t b_lock=PTHREAD_MUTEX_INITIALIZER;

	pthread_cond_t b_cond=PTHREAD_COND_INITIALIZER;

	

I’ve allocated an array of three items to hold whatever data or structure, called dnaReadType, I’m inputting from a file. This could just as easily be an array of pointers to allocated memory if the input isn’t a known or constant size. (This latter idea is more in keeping with multiple data buffers. The single array buffer is then simply syntactic sugar to quickly reference one or the other and to easily switch from one data buffer to another.) Either way, the consumer (conversion) thread will be accessing the data from one element while the producer (input) thread fills up another. The in index will refer to the element that is next to receive new data from the producer and the out index will reference the element that last held a read for conversion by the consumer. The mutex, b_lock, and condition variable, b_cond, will be used to control access to the buffer elements and to protect updates to in and out.

The scheme and code I describe below requires that the number of items in buffer (given as BUFSIZE) be one more than the number of data buffers I want to use. This makes the test for a full or empty set of data buffers easier to code, as I’ll explain below.

If the data buffers are full (i.e., two buffer elements contain valid data awaiting conversion) the producer must wait until a slot opens up. Otherwise the new read can simply be put into the open position of buffer. The relevant code for the Producer is given here:


	while (more to produce){

	   newRead = getNewRead(inputFile); // produce something

	   pthread_mutex_lock(&b_lock);   

	   while (out == in)                // buffer is full

	      pthread_cond_wait(&b_cond, &b_lock);                                          

	   buffer[in++]= newRead;           // store in buffer

	   in %= BUFSIZE;

	   pthread_cond_broadcast(&b_cond);                                               

	   pthread_mutex_unlock(&b_lock);                                                 

	}

	

After inputting some new read from the input file, the mutex b_lock controls access to buffer and protects the reading and writing of the in and out index variables. The conditional expression checks the status of buffer. The buffers are full if out and in reference the same item in the array. This item is the extra element (which is not considered part of the usable buffer) that I included by making BUFSIZE equal to 3 for a set of two data buffers. While this array slot can (and will) hold data at some point, it is not considered a valid entry whenever two data buffers both contain unprocessed reads. Whenever buffer is full, the producer thread goes to “sleep” on the condition variable, b_cond.

At such time as the consumer signals that at least one of the data buffers is available, the producer thread wakes up, stores the new read (or performs the input of new data), increments the in index with a modulus on the BUFSIZE, signals the consumer thread (in case it is waiting on an empty buffer), and releases the mutex to allow the consumer to access buffer.

The consumer thread code would be implemented like this:


	while (more to consume) {                                                           

	  pthread_mutex_lock(&b_lock);                                                   

	    while (in == ((out+1) % BUFSIZE))  // buffer is empty

	      pthread_cond_wait(&b_cond,&b_lock);

	    out++;

	    out %= BUFSIZE;

	    newRead = buffer[out];            // save data for processing

	    pthread_cond_signal(&b_cond);                                                  

	  pthread_mutex_unlock(&b_lock);

	  ProcessRead(newRead);               // consume the data

	}

	

As with the producer code, the mutex, b_lock, controls access to buffer and the update of the out and in index variables. For the consumer, if the data buffers are empty, processing of a new read must be blocked. The conditional expression tests for the empty buffer. When the out index is found to be one position “behind” the in position (modulo the BUFSIZE), then buffer has no valid read stored in the next out position.

Once the consumer thread knows that there is at least one element to be converted by either being signaled to wake up or by finding a non-empty buffer (and skipping the call to pthread_cond_wait), the out index is incremented and the read stored in buffer[out] is pulled out for conversion. Before the new read is processed, the consumer must signal the producer (which may be waiting on full data buffers) and then release the mutex.

With any parallel algorithm that requires synchronized coordination between two or more threads, you need to consider whether or not the code actually will work in all cases. Better yet, prove that it works in all cases. For my Producer/Consumer code with a single instance of each, the tricky part is whether or not the tests for full and empty buffer will pan out as expected. As a thought experiment, if you’re unsure about the scheme I’ve just discussed, see if you can convince yourself that the following cases will work as expected: 1) the consumer thread finding the initially empty buffer, 2) the producer finding the initial empty buffer, 3) due to a slow consumer thread, the producer fills up buffer and then has one more item to be stored, and 4) due to an initially quick, but now slow producer thread, the consumer initially encounters a full buffer and proceeds to process all data stored (and wants more) before a new read can be input.

One last thing to note is that the size of the buffer can be much larger than 3 elements. You can also use more consumer threads if the conversion processing time is much slower than the time to input a new read from the input file. Similarly, more than one producer can be added to the above code. The scenarios needed to prove that everything still works as it should become a bit more complex, but nothing that some simple logic and hand-waving can’t accomplish. I mean, it’s not like trying to get the Great Birnam Wood to come to Dunsinane to add to Macbeth’s long list of problems.

Einzelheiten zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.