Apply Data Decomposition to Create Threaded Code


Challenge

Apply data decomposition to a serial function to produce a threaded version. The threaded version creates threads, each of which performs one piece of a computationally intensive operation.

In any threaded design, the first areas that are targeted are the most time-consuming parts of the code. Consider the example of a video-editing application, where a stream of uncompressed video is read in, special effects are applied in real time, and the processed video stream is stored onto disk. If this application were built as a serial application, the sequence of actions would be as follows:

char *frameReadBuffer;  // assumed to point at storage for one full frame

while ( ReadFrame(frameReadBuffer) ) {
  ProcessFrame(frameReadBuffer);
  WriteFrame(frameReadBuffer);
}

If the special effects to be performed on each pixel of the video frame are complex, then the function ProcessFrame() will be computationally intensive. Were a profiler like the Intel® VTune™ Performance Analyzer run on such an application, the function ProcessFrame() would stand out prominently as a hot spot.


Solution

Divide the work of the function into multiple parts that threads can process concurrently. In the data-decomposed, threaded version of ProcessFrame(), the main thread acts as a master thread that divides the current video frame into parts and wakes up the worker threads.

Each thread, including the master, operates on its own section of the video frame. Once a thread has processed its share of the data, it waits at a barrier until all threads have completed their sections of the frame. The worker threads then go back to sleep, and the master writes the processed frame to disk before reading the next available frame from the stream.

The pseudo-code for the threaded version of ProcessFrame() is shown below:

typedef struct
{
  int startx, endx;   // horizontal limits of the region
  int starty, endy;   // vertical limits of the region
  char *data;         // frame data to operate on
} ThreadData;

void ProcessFrame(char *data)
{
  // nThreads (total thread count, master included) is assumed to be set elsewhere
  ThreadData perThreadData[nThreads];

  //
  // Master sets the limits for the region each thread has to process
  //
  DecomposeData(data, perThreadData, nThreads);

  //
  // Master wakes the worker threads with information about their data.
  // Each worker thread will also execute ProcessSection()
  //
  for (int i = 1; i < nThreads; ++i)
    WakeWorkerThread(i, perThreadData[i]);

  //
  // Master does its share of the work
  //
  ProcessSection(perThreadData[0]);

  //
  // Master waits for all the threads to complete processing. Each
  // worker thread goes to sleep after calling ProcessSection()
  //
  WaitForAllThreads();
}
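
DecomposeData() is left undefined in the pseudo-code above. One plausible way to fill it in is to cut the frame into horizontal bands of nearly equal height, one per thread. The sketch below is only an illustration: the helper name DecomposeIntoBands, the explicit width and height parameters, and the half-open ranges [start, end) are assumptions of this example, not part of the original design.

```cpp
#include <cassert>
#include <vector>

// Same fields as the ThreadData structure in the pseudo-code.
struct ThreadData {
    int startx, endx;   // column range [startx, endx) (assumed half-open)
    int starty, endy;   // row range [starty, endy) (assumed half-open)
    char *data;         // start of the whole frame buffer
};

// Hypothetical helper: split a width x height frame into nThreads
// horizontal bands whose heights differ by at most one row.
std::vector<ThreadData> DecomposeIntoBands(char *data, int width,
                                           int height, int nThreads)
{
    assert(nThreads > 0 && width > 0 && height >= 0);

    std::vector<ThreadData> parts(nThreads);
    const int base  = height / nThreads;  // rows every band receives
    const int extra = height % nThreads;  // first `extra` bands get one more

    int row = 0;
    for (int i = 0; i < nThreads; ++i) {
        const int rows = base + (i < extra ? 1 : 0);
        parts[i] = ThreadData{0, width, row, row + rows, data};
        row += rows;
    }
    return parts;
}
```

Splitting by rows keeps each thread's pixels contiguous in a row-major frame buffer, which tends to be friendlier to caches than splitting by columns. When the frame height is smaller than the thread count, some bands are empty and their threads simply have nothing to do.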

 

When the master thread returns from ProcessFrame(), it has successfully completed the processing of the video frame and continues by writing the frame to disk. The caller function remains the same as shown in the serial code sample.
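
For readers who want something that compiles, the following is a minimal C++ sketch of the same master/worker split. It makes several assumptions that the article does not: the frame is one byte per pixel in row-major order, ProcessFrame() takes explicit width, height, and nThreads parameters, and the "special effect" is a placeholder that inverts each byte. It also swaps in a simpler mechanism: instead of waking sleeping workers and waiting at a barrier, it starts std::thread objects for each frame and joins them, which plays the role of WakeWorkerThread() and WaitForAllThreads().

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct ThreadData {
    int startx, endx;   // column range [startx, endx)
    int starty, endy;   // row range [starty, endy)
    char *data;         // start of the whole frame buffer
};

// Placeholder effect (an assumption of this sketch): invert every byte
// in the section. Rows are assumed to be `width` bytes apart.
void ProcessSection(const ThreadData &td, int width)
{
    for (int y = td.starty; y < td.endy; ++y) {
        for (int x = td.startx; x < td.endx; ++x) {
            char &p = td.data[static_cast<std::size_t>(y) * width + x];
            p = static_cast<char>(~p);
        }
    }
}

void ProcessFrame(char *data, int width, int height, int nThreads)
{
    assert(nThreads > 0);

    // Decompose the frame into horizontal bands, one per thread.
    std::vector<ThreadData> parts(nThreads);
    for (int i = 0; i < nThreads; ++i) {
        parts[i] = ThreadData{0, width,
                              height * i / nThreads,
                              height * (i + 1) / nThreads,
                              data};
    }

    // Start one worker per band except band 0, which the master keeps.
    std::vector<std::thread> workers;
    workers.reserve(nThreads - 1);
    for (int i = 1; i < nThreads; ++i) {
        workers.emplace_back([&parts, i, width] {
            ProcessSection(parts[i], width);
        });
    }

    // Master does its share of the work.
    ProcessSection(parts[0], width);

    // Joining every worker stands in for the barrier.
    for (std::thread &t : workers) {
        t.join();
    }
}
```

Because the bands do not overlap, the threads never write to the same byte and no locking is needed inside ProcessSection(). Creating threads per frame is easier to reason about but pays the thread start-up cost on every frame; keeping workers asleep between frames, as the article describes, avoids that cost at the price of extra synchronization code.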


Source

Threading Methodology: Principles and Practice

 

