Last time I laid out my plan for studying the partitioners. All that was left was to run the tests and to collect and organize the data. It must have been a busy several months; it all went by in a blur. So here I sit, snowed in on my Christmas break, and I finally have a chance to report.
I got an internal release of Intel® Threading Building Blocks 2.1 Update 2 so I built it on my 8-core Linux system and ran a set of tests, collecting five runs with each of the partitioners wired to both inner and outer parallel_for loops. The Perl script I provided last time converted the sorted logs into two sets of tables in CSV format. A lot of hand massaging of those data resulted in the spreadsheet attached to this post.
Going through the motions again, I added a few embellishments. The population charts were easily transposed in the spreadsheet so that they could be displayed with roughly the same aspect ratio as the concurrency chart. These population charts were derived from log file information that identified which threads were initially assigned outer loop indices. That outer loop information is then used to mark the inner loop distribution chart to distinguish thieves from owners, as you can see at the right. This means I can hide the outer loop columns altogether.
Marrying the population chart to the concurrency chart looks something like this, with one more embellishment. The outer loop assignment information can also be used to mark the original thread assignments and group them into blocks, showing some of the higher-level partitioning decisions and highlighting inner loop index stealing: the white blocks in the concurrency chart example to the left. The colors are chosen at random to distinguish the blocks and have no other significance.
What did I find? That there is a lot of run-to-run variability, even for a given partitioner. The same Simple partitioner that produced the results shown above also generated the run below, which seems to have some difficulty getting started:
Staring at the plots for long periods of time, I began to discern some differences, though they could not be called definitive. Many of the runs show repetitive cycles in which threads that finish a block of outer indices scavenge parts of other threads’ outer blocks before assigning themselves another outer block. Mostly, though, each worker starts with a block of 1-3 originally assigned indices and completes them before moving on to another block. The only counterexample in the collected runs for Simple and Auto is Simple run 4, which shows thread 8 starting off by stealing from thread 3. However, three of the five Affinity runs show multiple workers starting off by stealing from others in the pool. More runs would be required to determine whether this is a significant trend.
What about fragmentation? Could I glean any statistics from the number of ways each outer index was split that might distinguish the partitioners? I used properties of the spreadsheet to generate a fragmentation curve, counting the number of outer indices handled completely by one thread, then by two, three, and so forth up to eight. Plotting the fragmentation curves showed no distinct clustering of the data, with Simple, Auto and Affinity colored yellow, red and blue respectively.
Though this has been an interesting exercise, I haven’t drawn much practical benefit from it. I’ve generated some pretty graphs, but few technical insights into improving performance. I’ll have to take another look as time allows in the new year and see whether there is anything more to be learned. Merry Christmas.