| November 2, 2008 11:00 PM PST | |
Other articles on this site have discussed the two primary ways in which a program's work is broken down into pieces that can be handled by individual threads. To recap, these two principal approaches are functional decomposition and data decomposition. Using the former, work is broken down by function (here used in its non-programming sense) and each separate task or function is given its own thread. For example, in Microsoft Word* one thread handles user input, another thread performs background pagination, and a third thread performs on-the-fly spell-checking.
Data decomposition, in contrast, divides work not by function but by the data on which operations are performed. It is ideally suited to scenarios where the same operation is being performed across a large number of data items. The need for this arises typically with multimedia and imaging applications, as well as in scientific computing. Suppose a large set of data points must be recalculated. Using data decomposition, this set would be broken down into multiple, smaller sets of numbers, and each set would be assigned to its own thread. When all computations are done (all the data threads have completed), the results are folded back into the resulting data set.
These two forms of decomposition are the ones generally taught in computer-science courses and are familiar to any developer working on threaded applications. However, a third form of decomposition is increasingly gaining attention because it occurs frequently and it has unique requirements, which if not handled correctly can have a severe impact upon performance. This form of decomposition is called producer/consumer.
Producer/consumer (or P/C) occurs anytime one thread's output is the input to another thread. In this arrangement, the consumer thread is unable to work until the producer thread has begun generating output. And, of course, this particular dependence is the nub of the problem: threads that you want to have executing in parallel start off executing serially.
Let's look at a few typical cases. I/O operations are a frequent source of P/C when the data being read is necessary to the work of consumer threads. For example, if a program must read a large configuration file at startup, consumer threads must wait for some I/O to complete before they can begin processing the configuration settings.
Another example is parsing—the process of breaking down a data stream into discrete pieces that are processed individually. A compiler parses source code in order to generate the corresponding object code, for example. The parser must read the data and separate out the individual semantic components (called a tokens). Only then, can other threads take over processing the tokens. Because parsing is almost always context-sensitive, two threads cannot properly parse a file at the same time. Hence, token-consuming threads must wait on the producer thread to generate the tokens.
In some situations, the dependency can be particularly complex. For example, in several video formats, such as MPEG, data is encoded as deltas (differences , that is) from a reference frame. To process a delta frame, the software draws the reference frame and then applies the deltas to it. To do this, the reference frame must previously have been processed—creating the P/C relationship. This dependence is exacerbated in the case of MPEG, because frames arrive out-of-order, so that the reference frame may actually arrive several frames after the delta frame that is waiting on it.
Suffice it to say that in almost every field of programming endeavor, P/C is a common relationship between tasks.
An effective way to handle the delays inherent in P/C is to enable Consumer (C) to start working before Producer (P) has completed all its work (The worst scenario is for the Consumer thread (C thread) to be obliged to wait until the Producer thread (P thread) completes). To do this, data must be passed from the P thread as it processes its work. This hand-off requires a data structure that can hold the chunks of output from P in the generated order, while C retrieves them for processing. The most common data structure for this purpose is the queue. The queue, you undoubtedly remember, is a FIFO data container. Data is placed in one end of the queue (called enqueuing the data) and removed from the other (called dequeuing) in the same order in which it was inserted. Circular arrays and linked lists are common forms of queue implementations. Returning to P/C: As soon as P has some output ready, it places that data in the queue. The C thread receives a signal and then enters a loop in which it continuously checks the queue for any output from P and processes it accordingly. Figure 1 shows how this might work for a program that counts words in a novel.
Figure 1. The producer thread parses the text from Dickens' A Tale of Two Cities and places the words one at a time in the queue, where they are picked up for processing by the consumer thread.
For this scheme to work, the producer thread must have a special token it can place in the queue when it has reach the end of file. (Note that simply letting the queue empty will not work, because the consumer thread cannot tell whether the empty queue means a delay has occurred in the P thread (it's reading some more from disk, for example) or whether it's end of file).
Developers must exert some care when implementing this model of P/C, because the queue data structure is being manipulated by both threads. When two threads modify the same data item, trouble frequently ensues. In the case of the queue, three variables are typically maintained: a count giving the number of elements in the queue, a pointer to the head (where the next element will be removed) and the tail (where the next element will be added). We keep a count so that the C thread can tell when there is something to be processed. The count also prevents the P thread from overflowing the queue. (If the C thread is running slowly and, as a result the queue is full, the P thread must wait until spa ce becomes available in the queue. Again, we see the potential performance costs of two threads depending on each other; except in this case the P thread is now waiting on the C thread rather than vice versa.)
Suppose the P and C threads simultaneously access the counter that keeps track of the number of elements in the queue. For example, suppose the queue has 9 elements, if the P thread adds an element and the C thread subtracts an element, they'll both want to update the counter. If the P thread updates it while the C thread is updating an error will ensue. P will take the 9 and change it to a 10. C will take the 9, subtract 1 and write the variable back with a value of 8. Now a queue that truly holds 9 elements will have a count of 8. Should this scenario happen several times a full queue could conceivably claim to have 0 elements.
To avoid this situation, threads must have exclusive access to the counter when updating it. In this way, they can read, modify, and write back the counter without concern that another thread might be modifying it at the same time. To obtain exclusive access for specific threads, the program relies on thread-synchronization devices such as mutexes, critical sections, or semaphores. To find out more about how to use the tools, consult the threading APIs for your platform. It is important to note that the queue mechanism described here is just one of many ways of solving the problem. More-elaborate schemes are commonly found in multiprocessor software.
As we have discussed, P/C situations frequently have a detrimental effect on performance. How then to avoid them? Says Rich Gerber, author of Intel Press's Software Optimization Cookbook, "Get rid of them. Some P/C situations cannot be avoided. But all others should be eliminated." It is frequently true that programs can be redesigned so that two threads do not have a serial relationship to each other, especially in sections where high performance is a requirement. However, it is impossible to eliminate all P/C instances. So, what can be done to reduce their cost?
- Eliminate the work done by the P thread that is unrelated to generating data for the C thread. For example, if the P thread initializes the application by reading a data file, and only then starts putting data in the queue, consider changing P's workload. For example, if it's known that the data file must always be read, then let the P thread read the file immediately at program launch, and assign-program initialization work to another thread. (Obviously, this has to be done carefully so that the C thread is not compromised by incomplete initialization).
- Reduce the time it takes the P thread to produce data the C thread can consume. This technique suggests doing things rather differently than they're normally done. Returning to our example of a program in which the P thread reads in a data file. Have the P thread read into a data buffer from which the C thread can read the incoming data. And try to shorten the amount of time required before C can begin reading this data. For example, suppose you're writing a threaded compiler. If your compiler reads the source file into memory, begin reading the source file right away. (To fulfill the previous suggestion, immediately kick off a thread that sets up the remaining data structures, such as the hash table and the token queue). In this way, while slow reads are being made from disk, the program is doing its housekeeping. As soon as data appears in the read buffer, a second thread should be parsing into tokens, and these tokens should be placed immediately in the queue. Then the C thread should begin accessing the queue and doing its work. Some caveats apply here: the parsing cannot be performed until the hash table has been initialized, so you don't want to get ahead of yourself: The initial housekeeping thread must have run at least to the point of setting up the hash table before the parsing can be undertaken. There is no way to know whether housekeeping has gone that far except by use of a flag or signal. This is so because launching a thread only asks the operating system to run it—there is no certainty in what order threads will be run, nor in extremis, that they will be run at all. Hence, coordinating flags are necessary.
- Increasing the number of P threads. Not all P/C situations occur at start up. Let's turn to a non-computer example: putting the tires on a car on the assembly line. Tires go on last. They are inherently sequential and this step is difficult, if not impossible, to parallelize. However, one way to put more tires on cars is to open more assembly lines. This reduces the backup in overall car manufacturing and means that more tires can be placed on cars at any given moment. In factories, this step is obviously impossible to achieve, but in programming this is not so. Returning to the example of data consisting of deltas to be applied to reference data items. If more threads can assure a greater likelihood that the reference item will be complete and in memory when the deltas need to be processed, then you have effectively reduced the time the C threads spend waiting.
The important thing about P/C is to be able to recognize it and reduce its performance cost. Optimally, it should be eliminated. But if the program cannot be modified to provide this option, then the full bag of tricks should be brought to bear on reducing the delay in starting the C threads. While this requirement undoubtedly increases the complexity of the code, it can manifestly improve your program's performance.
- Threading on Intel® Parallel Architectures Forum
- Multi-Core Developer Community
- The Software Optimization Cookbook, 2nd Edition
- Threading/Multi-Core
Andrew Binstock is the principal analyst at Pacific Data Works LLC. Previously he was the director of PriceWaterhouseCoopers's Global Technology Forecasts. His book "Practical Algorithms for Developers", co-written with John Rex, is in its 12th printing at Addison-Wesley and in use at more than 30 university computer science programs. Binstock can be reached at: abinstock@pacificdataworks.com
For more complete information about compiler optimizations, see our Optimization Notice.


Todd Bezenek
565
-Todd