A need for balance
Oops. Things got hectic recently: no time to muse on past work. A shame, since I left this series at its penultimate issue. If you can remember back two months to my last post (or need to start at the beginning), we left the pipeline purring on four cores, but only three-quarters of the time:
What, in this Intel® Thread Profiler view, are these sequences of short events stacked up on one worker thread? What about these bursts of synchronization between the threads, and coincident idleness? The buffer tracking labels should provide data to understand what’s going on.
Collecting the label data from the balloon hints associated with the events, I can map the buffer uses as shown above. Those short events were obviously I/O, and the labels make that clear and reveal the output/input pairing. Note also that there are eight buffers bouncing around, each being read or written on one thread but usually processed by another. That buffer count is controllable, just by varying the token count passed to the pipeline.run() call. Though the token count is touted only as a means of controlling the amount of memory used by the pipeline, perhaps it also has a performance effect?
Simple to test: recompile with a token count of 4, the number of hardware threads available on our test machine, and collect another profile run:
Now that’s more like it! Concurrency level 4 is achieved 97% of the time! Note how these collected data look like the concept diagram posed at the beginning of this series. This is sort of like laminar flow in fluid studies: no “turbulence” to upset the smooth utilization of buffers by the threads. The labels underlying the above markers reveal that each thread has taken ownership of a particular buffer and reuses it to process the next quantum of sequential file data. As in musical chairs, varying the token count by as little as 1 from the HW thread count injects an imbalance in buffer management that leads to thrashing and a lower concurrency level.
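To make the token idea concrete, here is a minimal sketch of a token-limited work loop written with only the standard library. It is not TBB's implementation and makes no claim about how pipeline.run() works internally; `TokenPool` and `run_with_tokens` are hypothetical names invented for illustration. It only shows the principle that a token count caps how many buffers can be in flight at once, however many worker threads exist.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pipeline's token pool (not TBB code):
// at most `limit` buffers may be in flight at any moment.
class TokenPool {
public:
    explicit TokenPool(std::size_t limit) : free_(limit) {}

    // Block until a token is free, then take it.
    void acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return free_ > 0; });
        --free_;
    }

    // Return a token and wake one waiting thread.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++free_;
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t free_;
};

// Process `items` units of work on `workers` threads while holding at most
// `tokens` buffers in flight. Returns the peak in-flight count observed.
std::size_t run_with_tokens(std::size_t tokens, std::size_t workers,
                            std::size_t items) {
    assert(tokens > 0 && workers > 0);
    TokenPool pool(tokens);
    std::mutex m;
    std::size_t next = 0, in_flight = 0, peak = 0;

    auto worker = [&] {
        for (;;) {
            pool.acquire();
            {
                std::lock_guard<std::mutex> lock(m);
                if (next == items) {   // no work left: hand the token back
                    pool.release();
                    return;
                }
                ++next;
                ++in_flight;
                peak = std::max(peak, in_flight);
            }
            std::this_thread::yield(); // stand-in for processing a buffer
            {
                std::lock_guard<std::mutex> lock(m);
                --in_flight;
            }
            pool.release();
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < workers; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return peak;
}
```

Calling `run_with_tokens(4, 8, 100)` never reports more than four buffers in flight, even with eight threads competing for them. The sketch says nothing about which thread ends up with which buffer, so it cannot reproduce the ownership pattern seen in the profile; that part is down to the scheduler.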
Admittedly, this is the simplest of pipeline examples, synthesizing a process where the same amount of work is applied to each buffer and the ratio of processing to I/O times is large enough to permit near-linear scaling out to 4 processors, on the order of 3.9x over running the pipe with only a single thread. More complex computing or pipelining topologies could easily inject all kinds of noise and delays, all to the detriment of the sustained concurrency level. But this example shows us what is possible, and it shows some tools that will be very useful for understanding more complex code that doesn’t scale as well.
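For a sense of what 3.9x on 4 processors means, the arithmetic can be sketched as below. The function names are hypothetical, and the second formula is the standard Karp-Flatt estimate of the serial fraction, swapped in here as one common way to read a speedup figure rather than anything measured in this series.

```cpp
#include <cassert>

// Parallel efficiency: the fraction of ideal linear speedup achieved.
double efficiency(double speedup, int processors) {
    assert(processors > 0);
    return speedup / processors;
}

// Karp-Flatt metric: the serial fraction implied by an observed speedup.
// e = (1/S - 1/N) / (1 - 1/N), for speedup S on N processors.
double serial_fraction(double speedup, int processors) {
    assert(processors > 1 && speedup > 0.0);
    double n = processors;
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n);
}
```

With the figures quoted above, `efficiency(3.9, 4)` comes to 0.975, in line with the concurrency level the profile reports, and `serial_fraction(3.9, 4)` comes to a little under 1%.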