No speedup in pipeline

This is probably a hard question to be asking in the forums, since it deals with my specific implementation of a pipeline, but I figured I'd give it a try since I'm kinda stuck in my debugging and just need some more hints on where to look.

I have a problem that lends itself greatly to a pipeline. I have a file that contains a list of objects which I just read out serially. Then I need to calculate "attributes" of those objects which get saved into another file that is just a list of attributes.

Schematically, it looks like this:

While not end of file
    Read object
    Calculate attribute 1
    Calculate attribute 2
    ...
    Calculate attribute n
    Save attributes to a different file

Each attribute calculation depends only on local data and the (constant) object read in from the file. So I set this up as a pipeline where I pass a pointer to the object along the pipeline until it exits at the end. Obviously the Read step (filter) is serial, and the Save step is also serial (although it need not be, it's just easier to implement that way). All the other steps can be parallel.

My test machines are a dual-core AMD and a quad-core Mac. Neither machine showed any speedup when I used anywhere from 1 to 8 threads, and the result is also insensitive to the number of tokens in the pipeline (from 1 to 32).

You may say that I'm I/O bound on my two serial steps, but I did a bit of profiling on the serial version of this code and found that the Read step is 10% of the total time and the Save step is 7% of the total time. The remaining 83% of the time is consumed in the different parallel steps. So I'd expect a reasonable amount of speedup on a 2 or 4 core machine.
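To put a rough number on "reasonable", here is a minimal sketch that applies Amdahl's law to the profile above. It assumes the 17% spent in Read and Save is strictly serial and the other 83% scales perfectly with the thread count. The function name `amdahl_speedup` is made up for illustration and is not part of TBB. Because a pipeline can overlap the Read and Save steps with each other and with the parallel steps, this is probably a pessimistic estimate rather than a hard limit.

```cpp
#include <cassert>

// Amdahl's law: estimated speedup when part of the run time must stay serial.
// serial_fraction: share of the single-threaded run time that cannot be
//                  parallelized (0.17 for the profile quoted above).
// workers:         number of threads available for the parallel part.
// Hypothetical helper for illustration only; not a TBB function.
double amdahl_speedup(double serial_fraction, int workers) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(workers >= 1);
    const double parallel_fraction = 1.0 - serial_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / workers);
}
```

Calling `amdahl_speedup(0.17, 2)` gives roughly 1.7 and `amdahl_speedup(0.17, 4)` gives roughly 2.6, so even under these conservative assumptions a flat 1.0x on both machines looks like something other than the serial steps is holding the pipeline back.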

Are there any tweaks to TBB or things I should be on the lookout for? I find it odd that I'm seeing absolutely no speedup when I increase the number of threads.

Thanks,
Mike


I'm unclear on how your pipeline is parallelized. You suggest that all the attribute calculations could run in parallel, but your note doesn't make clear what the pipeline structure is, beyond having a serial input filter and a serial output filter.

Is each of the attribute calculations substantial? Do they vary in the amount of work (time) they perform? In my experiments, the more work there is relative to the read and write times, the more parallelism is exposed.
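One low-tech way to answer those two questions is to time each stage in a plain serial loop. The sketch below uses only the C++ standard library; `time_stages` and its parameters are hypothetical names, and the stages stand in for your attribute calculations. It returns the total seconds spent in each stage, so you can see both how much work each one does and how much they differ.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: runs every stage on every item, serially, and returns
// the accumulated wall-clock seconds spent in each stage.
template <typename Item>
std::vector<double> time_stages(
        std::vector<Item>& items,
        const std::vector<std::function<void(Item&)>>& stages) {
    using clock = std::chrono::steady_clock;
    std::vector<double> seconds(stages.size(), 0.0);
    for (Item& item : items) {
        for (std::size_t s = 0; s < stages.size(); ++s) {
            const auto start = clock::now();
            stages[s](item);  // e.g. "Calculate attribute s+1"
            seconds[s] +=
                std::chrono::duration<double>(clock::now() - start).count();
        }
    }
    return seconds;
}
```

Dividing each total by the number of items gives the average cost per object per stage. If that comes out very small, the overhead of handing each object from filter to filter may be comparable to the work itself, which would be one possible explanation for a flat scaling curve. Note that the timer calls add their own overhead, so very short stages will look more expensive than they are.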

Have you tried running something like Intel Thread Profiler to visualize what is really happening?
