Overlapping I/O and processing in a pipeline

I've been experimenting with TBB pipelines as a means to overlap I/O and processing on a multi-core system, trying to understand how they work.  Having a copy of Intel® Thread Profiler has been a wonderful aid in understanding what is going on, but there are some tricks I've learned that add even more power to the process.

The TBB Pipeline


Pipelines in Threading Building Blocks are organized around the notion that the pipeline data represents a greater weight of memory movement costs than the code needed to process it.  Rather than "do the work, toss it to the next guy," it's more like the workers are changing places while the work stays in place. 
pipe1.JPG

This is the simplest form of a pipeline, having a source and sink for the stream and at least one processing stage. TBB pipes can have multiple processing tasks strung together, each of which may be declared as either serial or parallel.  By the principle described above, the HW thread that performs the input task then moves on to do the processing and output tasks on that load of data. 

Pipelines, a form of functional decomposition, are typically not considered scalable algorithms: the number of stages is fixed by the processing task, which limits how the work can be shared among a set of HW threads.  Being able to declare a stage as parallel opens the possibility of applying a little domain decomposition (also called data decomposition, where operations over a range of data are divided among the threads) to individual pipeline stages, but there's another way to apply parallelism to this simple pipe.
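As an aside, here is a minimal sketch of the data decomposition mentioned above, written with plain std::thread rather than TBB. The function name parallel_sum and the slicing scheme are my own illustration, not part of the prototype: each thread simply sums one contiguous slice of the range and the partial results are combined at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sum `data` by giving each worker one contiguous slice.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads >= 1);
    std::vector<long long> partial(num_threads, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    // Round up so that every element lands in some slice.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0LL);
        });
    }
    for (auto& w : workers) w.join();
    // Combine the per-thread partial sums.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The number of slices follows the number of threads, which is what lets this style scale in a way a fixed chain of stages does not.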

pipe2.JPG

Each HW thread takes a buffer and interleaves reads and writes with a bit of processing.  As long as we can fit the ins and outs of buffers in sequence, we should be able to add more streams. If we can increase the size of the block so as to lengthen the time of the processing step, we might even be able to accommodate more streams, although that will also tend to lengthen the I/O sections.
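The idea of several threads each carrying a buffer through read, process and write can be modeled without TBB. The sketch below is only an illustration under my own assumptions: run_streams is a hypothetical name, a vector of strings stands in for the input file, and upper-casing stands in for the processing step. The read and write steps are serialized (and the writes kept in input order), while the processing step overlaps across threads.

```cpp
#include <cassert>
#include <cctype>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of the three-stage pipe; this is not the TBB API.
std::vector<std::string> run_streams(const std::vector<std::string>& input,
                                     unsigned num_threads) {
    assert(num_threads >= 1);
    std::vector<std::string> output;
    std::mutex in_mutex;               // serializes the "read" stage
    std::mutex out_mutex;              // serializes the "write" stage
    std::condition_variable out_turn;  // signals whose turn it is to write
    std::size_t next_in = 0;           // next block to read
    std::size_t next_out = 0;          // next block allowed to write

    auto worker = [&] {
        for (;;) {
            std::size_t seq;
            std::string buffer;
            {   // Stage 1 (serial): read one block and note its sequence number.
                std::lock_guard<std::mutex> lock(in_mutex);
                if (next_in == input.size()) return;  // input exhausted
                seq = next_in++;
                buffer = input[seq];
            }
            // Stage 2 (parallel): the same thread processes the buffer it read.
            for (char& c : buffer)
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            {   // Stage 3 (serial, in order): wait for this block's turn, then write.
                std::unique_lock<std::mutex> lock(out_mutex);
                out_turn.wait(lock, [&] { return next_out == seq; });
                output.push_back(std::move(buffer));
                ++next_out;
            }
            out_turn.notify_all();
        }
    };

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return output;
}
```

Calling run_streams with more threads adds more streams; how much that helps depends on how long the processing step is relative to the two serialized I/O steps.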

I got a prototype demonstrating this technique from a fellow engineer who suggested the idea. The spine of the pipeline was described something like this:

// Pipeline stages are linked in the order created
tbb::pipeline pipeline;
// Establish the input task
PipeReader inpipe("input.dat");
if (!inpipe.is_ready()) {
    cerr << "input file could not be opened" << endl;
    exit(1);
}
pipeline.add_filter(inpipe);
// Add the processing task
PipeProcessor procpipe;
pipeline.add_filter(procpipe);
// Add the sink to this pipe
PipeWriter outpipe("output.dat");
if (!outpipe.is_ready()) {
    cerr << "output file could not be opened" << endl;
    exit(1);
}
pipeline.add_filter(outpipe);
// Run the pipe
tick_count t0 = tick_count::now();
pipeline.run(8);
tick_count t1 = tick_count::now();
double elapsed = (t1 - t0).seconds();  // seconds() returns a double

Three filter classes define the three pipeline stages. The last section times the run using TBB's tick_count timer. The pipeline.run(8) statement is where the pipeline executes. Its argument is the token count, which caps the number of buffers in flight at once and so bounds memory use; without that limit, the scheduler could keep calling the input stage until the file is completely buffered or memory is exhausted. Eight seemed a reasonable number at the time.

I built the test code with Microsoft VS.NET 2005 and tried it on my laptop, which has an Intel® Core™ Duo processor. Scaling was very good: the run took almost half the time of the single-threaded version.  But then I tried it on an Intel® Xeon® workstation with Intel® Core™2 Duo processors, 4 HW threads total. The best runs barely scored above 1.5x! Huh!? That's no good.
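To make the role of the token count in pipeline.run(8) more concrete, here is a small stand-alone model of the idea. It is not how TBB implements it; run_with_tokens and RunStats are hypothetical names, and a short sleep stands in for processing. A reader thread may only bring in a new block while fewer than max_tokens blocks are in flight, so a fast reader cannot buffer the whole input ahead of a slower consumer.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

struct RunStats {
    std::size_t processed = 0;       // blocks that left the pipe
    std::size_t peak_in_flight = 0;  // most blocks alive at any one time
};

// Hypothetical model of a token-limited pipe; this is not the TBB API.
RunStats run_with_tokens(std::size_t total_blocks, std::size_t max_tokens) {
    assert(max_tokens >= 1);
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::size_t> ready;  // blocks read but not yet consumed
    std::size_t in_flight = 0;      // tokens currently in use
    bool input_done = false;
    RunStats stats;

    std::thread reader([&] {
        for (std::size_t block = 0; block < total_blocks; ++block) {
            std::unique_lock<std::mutex> lock(m);
            // Wait for a free token before reading another block.
            cv.wait(lock, [&] { return in_flight < max_tokens; });
            ++in_flight;
            stats.peak_in_flight = std::max(stats.peak_in_flight, in_flight);
            ready.push(block);
            cv.notify_all();
        }
        std::lock_guard<std::mutex> lock(m);
        input_done = true;
        cv.notify_all();
    });

    std::thread consumer([&] {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return !ready.empty() || input_done; });
            if (ready.empty()) return;  // input finished and queue drained
            ready.pop();
            lock.unlock();
            // Stand-in for the processing and output work on this block.
            std::this_thread::sleep_for(std::chrono::microseconds(50));
            lock.lock();
            --in_flight;  // the block has left the pipe: release its token
            ++stats.processed;
            cv.notify_all();
        }
    });

    reader.join();
    consumer.join();
    return stats;
}
```

With run_with_tokens(100, 8), all 100 blocks are processed but the reader never gets more than eight blocks ahead, which is the memory bound the token count is meant to provide.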


Fortunately, I was able to call up my favorite threading analysis tool, Intel® Thread Profiler, and take a look at what's happening. I did an instrumented run and saw this:

pipe3.JPG

According to the bar graph at the top, most of the time there's only one thread running, the rest blocked. Very little time is spent with all four or even three of the four threads running at the same time. The timeline shows some work being done (the dark green segments), but much more of the time is spent spinning (the green striped segments).

What's it all mean?


So, something is wrong: the code is not scaling to use all four HW threads, and the concurrency and timeline graphs confirm it. But what does all that checkerboard in the timeline mean? We could drill down on those event transitions (the yellow vertical lines in the timeline) and view the source code. But there's another technique that requires a bit more instrumentation to reveal a lot more detail about the timeline. Next time we'll add some Intel® Threading Tools event notification code and show how that helps us visualize the pipeline. (See part 2.)

