Under the hood: watching TBB task scheduler partitioning

By Robert Reed (Intel) (26 posts) on October 29, 2008 at 4:29 pm

I thought I was ready to dig into the scheduler using the tools I’ve built so far.  But I was disturbed by some of the concurrency diagrams my current data collector and post-processing program generated.  Here’s a typical case:

This diagram to the left appears to show a process where threads 1 and 2 are lazily working on one major index while thread 6 burns through ten of them, then helps thread 2 on index 25.Can this really be what is happening?Perhaps it's time to revisit the details of the collection and analysis process and try to improve confidence I'm really collecting and representing the correct data and analysis.

Going back to first principles

This process is based upon capturing a blend of writes in an output stream, each generated by a different thread.  With eight threads and a relatively slow process of locked file I/O, there’s certainly a chance for this to be a bottleneck which, depending on the fairness of the scheduler, could flip the order in which threads get their lines written. 

My solution was to add a time stamp, first by collecting a base time:

tbb::tick_count Ztime;
int main()
{
   Ztime = tbb::tick_count::now();

and then marking each timing point in reference to this base value, capturing the difference:

class outer_loop {
public:
    void operator() (tbb::blocked_range<int>& r) const {
        tbb::tick_count tcheck = tbb::tick_count::now();
        double tabs = (tcheck - Ztime).seconds();

Then it’s a simple matter to timestamp each of the progress reports:

        TLS &tls = GetTLS();
        int wid = tls.workerID;
        {
            MyLock::scoped_lock Printing(P);
            std::cerr << tabs << " outer loop ["
                      << r.begin() << "," << r.end()
                      << ") worker " << wid << std::endl;
        }

 

With this code in place, program output starts out like the sample to the right, though it can still vary greatly from run to run. Generally the clock counts increase, but occasionally the clock steps backwards as it does between the ninth and tenth output lines.  It is not clear whether this is due to contention on the output stream delaying thread 6’s output, or maybe there is just some inherent skew between the clocks on threads 6 and 8.    Looking at the times logged for these threads through the rest of the run, the relative time for thread 6 generally lags that for thread 8, but the difference jumps all over the place, suggesting the cause could be a little of each.I don’t see the two threads drifting apart in their relative times, so though we may have some noise obscuring the view, I think I can go ahead with the timestamp as an index for sorting thread activities.Another part of the analysis I looked at was the method I used to determine concurrency between the threads.  My previous rule was to read sequentially through the reports, doing lazy updates to the output lines, forcing a line out each time one of the workers started a second new outer index.  The first new outer index would be written to the concurrency reporting line while the second index would go into a holding buffer for a future concurrency line update.  In addition, each of the major indices was tracked for work completion (noting the completion of all the inner indices) so one of the possible state changes was a worker running out of work.

This latter rule seemed right at the time, but I discovered that it didn’t allow for the possibility of holes in the work flow within the processing of an outer index.  I opted for a simpler rule, clearing the holding buffer each time I wrote it out so that the concurrency output would be more responsive to worker inactivity.  And because I expect to do more processing of these data once generated, I switched the output of my perl post-processing program to generate CSV rather than neat columns of text.  A new version of the script is available (I had to change the file extension to post it, but it is a perl script)..

One final addition to the script: since I had the interval testing code already written for my previous outer index completion testing, I repurposed it to generate a population graph, inverting the data to show which workers complete which parts of the task.

New analysis of new data

sample concurrency map

sample concurrency map

 

Recapping, strange formations of the concurrency data led me to rethink the process to use a lighter weight mechanism for contending threads, spitting out a time-from-start which can be used to sort the concurrency data.  Pumping my sample run through the pipe, sorting and feeding it through the script and then doing a little extra formatting in the spreadsheet, here’s what drops out at the back end.The idle gaps I saw before are gone, at least from this run—I’ll need to make several more trials to see whether that remains so from run to run.  There’re also activity gaps: thread 7 shows work done on outer index 34 and then a gap before finally completing it; likewise on threads 5 and 6.I’m also using the October 19th open source release of TBB, so maybe the problem I was seeing before was due to the scheduler.  More data collection should help answer that.There’s also the population report that gets stuck at the bottom of the CSV file.  It generates a big spreadsheet so I can only show part of it here.  There’re actually two tables generated. Each displays a result for each thread and each outer index. The upper table marks the thread that received the outer parallel_for task for each outer index. 

The lower table shows a percentage of the inner indices on each outer index that was processed by each thread. 

I’m losing some resolution here to fit it all in, but it should be clear enough how some threads process outer indices to completion while others get part of the task stolen by another worker.  Or several.  A simple conditional-format in the spreadsheet adds the outer loop reservation data as a highlight in the work distribution table.

I’ve already got the test program instrumented to compare the results using different partitioners.  I’ll take up that topic next time, job permitting.  Hopefully it won’t be another two months.

Categories: Open Source, Parallel Programming, Threading Building Blocks

Comments (1)

October 30, 2008 9:31 AM PDT

Josh Bancroft (Intel)
Josh Bancroft (Intel)Total Points:
3,504
Status Points:
3,004
Brown Belt
Great post, Robert. Thanks. Oh, and for what it's worth, your code samples look fine when viewed in the feed, even if they have that weird gray background when displayed on the site. :-)

Trackbacks (0)


Leave a comment  

To obtain technical support, please go to Software Support.
Name (required)*

Email (required; will not be displayed on this page)*

Your URL (optional)


Comment*