TBB spawn overhead

TBB spawn overhead

Hi,

In order to judge how many clockticks a task should have at minumum so that it is worthwhile to run it in parallel I'm running a simple experiment:

Run a loop doing a lengthy calculation sequentially (and measure the time using a hp timer).
Then I run the same loop devided into two tbb tasks (there is no memory interaction between the tasks).
Of course, I repeat this many times to get some statistics.
Eventually, I substract the (average) parallel run time from the (average) sequential run time (= overhead).

Finally, I run the same experiment with many different loop lengths.

For very small tasks (<3k clock cycles on my nahelem) the sequential execution is slightly faster (as to be exspected).
Actually, the parallel run time is very good for small tasks (compared to my own task scheduler), aparantly tbb measures the task size and does not run the task in parallel if it is too small. For larger task sizes, the parallel execution is faster than sequential and for task sizes > 200k cycles the speed up is about 1.9.

I would have exspected the overhead to be pretty much independent of the actual task size (at least beyond a specific size where you see scheduler noise mostly).

As for my question:

I measure, that the average spawn/join overhead to be linearly dependent on the loop size up until a task size of about 300k clocks. Afterwards the overhead is a constant (+ some fluctuation).

How does tbb decide, wether to run a spawned task in parallel or not and might this mechanism cause the linear dependency of the overhead on the loop size? (I guess tbb might run the task sequentially the first time it is spawned and measure the task size (execution time) to decide this?)

Thx,
Alex

publicaciones de 14 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

"Eventually, I substract the (average) parallel run time from the (average) sequential run time (= overhead)."
To measure the overhead you should subtract the sequential run time from either the sum of the parallel run times or N times the longest parallel run time (including overhead owing to lack of parallel slack), depending on whether you consider this part of a job or the entire job, respectively. But perhaps that's what you really meant all along?

"How does tbb
decide, wether to run a spawned task in parallel or not and might this
mechanism cause the linear dependency of the overhead on the loop size?
(I guess tbb might run the task sequentially the first time it is
spawned and measure the task size (execution time) to decide this?)"
As far as I can tell, it makes no such decision, but small tasks may just be executed efficiently in quick succession without as much stealing per executed task.

Note that the smaller overhead per task for small tasks still tends to add up to more overhead per job compared to a division with fewer but larger tasks (not counting the overhead owing to lack of parallel slack). Does that explain what you have observed, or did I miss your point?

I guess what one means by "overhead" is a matter of definition. The way I define it, it is the amount of time/clock cycles that the parallel execution of a specific job takes longer than half the sequential execution of the same job. So the overhead tells me how much more time was spend than (theoretically) necessary (on a perfect system without overhead the parallel tasks should take exactly half as long as the sequential task, as the two tasks are completely independent).

My measurements (not these ones but others) hint very strongly that there is a mechanism to not run (very) small tasks in parallel in tbb and I have heard some rumors about this from other ppl, I haven't checked the code though.

I'm trying to measure an absolute (not relative to job size) overhead per parallel job execution. I would expect this to be constant (cost of spawn/schedule+join at the end which are all independent of the actual job). Instead I measure something like this:

overhead as defined above
^
| _________
| /
| /
| /
| /
-------|---------------- > job size
|
300k cycles

The experiment looks like this:

gettime_t1
repeat N-times for statistic:
{
run_seq job
run_seq job
}
gettime_t2

t_seq=(t2-t1)/N

gettime_t3
repeat N-times for statistic:
{
spawn job
run_seq job
wait_spawned_job
}
gettime_t4

t_par=(t4-t3)/N

overhead=t_par-t_seq/2

The complete experiment is performed for different job sizes.

So given the overhead (maybe a s a function of job_size) one can estimate the run time of parallel exection of a job :

t_est_par=overhead(job_size)+job_size/2
t_seq=job_size

So parallelization might be sensible if t_est_par
Alex

"I guess what one means by "overhead" is a matter of definition."
I'll give you the factor if you give me the sign. :-)

"My measurements (not these ones but others) hint very strongly that
there is a mechanism to not run (very) small tasks in parallel in tbb
and I have heard some rumors about this from other ppl, I haven't
checked the code though."
Other than for parallel_sort I don't believe there is such a mechanism.

As for overhead related to task size, I can't even remember why I
thought that it would be lower for small tasks, so please disregard!

Sorry, I'm somewhat distracted, also leading to that sloppy answer above. Maybe somebody else can say something useful here?

Best Reply

I had another look, and hopefully my understanding is a bit better now.

First, let's agree to call (only) a tbb::task a task (well, also a sequentially executed part of the total work not involving TBB), and not use the word "job" anymore (I believe that you were using "job" for "task", which is confusing). To keep it simple, task overhead should be measured for a large-enough total amount of work, to amortise startup (where workers take a while to get going) and imperfect parallel slack (where workers don't finish at exactly the same time). Here it is necessary to distinguish between individual task overhead and aggregate task overhead (to include all the descendants).

Tasks can be spawned more efficiently by using task_list, but this is not scalable (one thread does all the dividing, only one task is stolen at a time), and while this may be associated with smaller tasks, it is not an obvious correlation by far. The recommendation is for recursive parallelism, where the division of work is also parallelised and therefore scalable (for both dividing and executing), even if this means that the first divisions have a high individual task overhead related to stealing, because this is expected to be amortised down to a very small part of the aggregate task overhead.

Now, would it be possible to repeat the experiment where you have more than just one task per available hardware thread? So instead of 2 tasks you should have maybe 100 tasks or more for meaningful parallel slack.

My prediction, conditional to having understood thing correctly, is the following. So far you have really observed evidence that it takes a while for another worker to steal a spawned task, and a small task may be executed locally before it is stolen with a likelihood related to how quickly the first task is executed, i.e., to how small the tasks are. Since stealing, idling owing to dearth of parallel slack, and waiting incur a significant penalty, and with individual and aggregate task overhead the same, this overhead will gradually increase on average with increasing task size (decreasing percentage of locally executed second tasks), until it reaches a plateau (second task nearly always stolen). You can probably diagnose whether the second task was in fact stolen, and incorporate this number into the statistics. If you increase the number of tasks, the percentage of tasks that were not stolen will decrease compared to a lower number of tasks of equal size, and the plateau will be reached earlier relative to task size. If you use recursive parallelism to create the tasks, e.g., by using tbb::parallel_for(), the plateau will be much lower (better scalability), but I'll abstain from predicting how quickly it is reached in this case.

Hopefully this is more substantial than my kneejerk speculation in #1. Ideally you would modify your experiment to prove or disprove the hypothesis, but of course I would also be happy if you would simply believe me. :-)

Hm..that sounds sensible. The task stealing might be the mechanism which avoids running short tasks in parallel I was refering to, before. I should measure the ratio of tasks that were stolen. I wasn't aware that tasks need to be stolen to run in parallel at all but it makes sense, now that I think about it.

Thank you!

Experiments should take into account that the threads are created on first spawn.. and thus time for stealing of the first task is dramatically bigger than when all threads are online.Another consideration is that tbb::tasks are low-level stuff.. And usually not supposed for direct usage. There is generally no sense to distinguish overhead of producing/consuming of 1K and 10 K tasks because users are supposed to pass such a big numbers to parallel_for or parallel_reduce instead. However, parallel_for/reduce will aggregate the iterations by default and produce nearly the same number of tasks in case of input ranges specified with sizes of 1K and 10K[Sorry if it is not quite on the target because I didn't read everything carefully enough]

Hm, it was a nice theory. Unfortunately, the ratio of stolen tasks reaches 50% way before the linear region ends
(at about 6k clocks vs 300k clocks task size). Back to the drawing board :)

It seems that stolen tasks start executing (and finish late) about 1/25 task_size later (latency).
So somehow the slack/steal overhead depends on the task size?

Alex

Let's call that latency, not slack. Parallel slack is the availability of sufficient tasks to be distributed among workers to combat idleness especially at the end of the workload.

Sorry, with this new evidence (thanks for testing!), I'm fresh out of ideas. Anton?

Do you make sure the threads are created and experiment counts for stealing only but excludes thread creation as I wrote in #6?

"Do you make sure the threads are created and experiment counts for stealing only but excludes thread creation as I wrote in #6?"
With the refuting evidence already presented against my hypothesis (only a fraction of the linear part of the graph is associated with imperfect scaling), surely this won't be related to a latency as significant as thread creation? Or am I still missing something?

I see delays for thread creation (huge), wake-up (significant), polling (small, O(parallelism)?). Here I don't expect any but the last to play a role because the experiment is repeated for statistical significance and I don't presume that this is done by repeated execution of the entire program?

I don't know how significant this is, but it's still a mystery, especially without seeing the source code...

The program looks almost exactly like the pseudo code I have given (see end of post). So the program is indeed executed only once.

As for something vaguely related, by disabling hyper-threading I was ably to remove some weird effects I was having with certain job implementations (the actual work performed by the tasks).

Executing this as work per task (see comment):

uint64 n=TASK_SIZE;

uint64 i=0;
        __asm__ volatile (

            "0:;"

            "incq %[i];"

            "cmpq %[n],%[i];"

            "jl 0b;"

            : [i] "+r" (i) //replacing "r" by "m" here improved scaling massively with hyper-threading enabled

            : [n] "r" (n)

            ); 

Apparently linux scheduled both tasks on the same (hyper-threaded) core but the available shared resources (I guess the "jump" port on my nehalem) were not sufficient to scale this beyond 1.2 speedup. With HT disabled (or with the constraint set to "m" and HT enabled) it scaled to the expected 1.9 something.

Disabling HT also reduced the noise somewhat and additional measurements show that the ovearhead is still linear after reaching ca. 300k clocks but the gradient is reduced from about 0.2 to 0.04 something (see attached img):

//task to execute in parallel

class TBBTask:public tbb::task

{

public:
    TBBTask(void*v)

    {}
    task* TBBTask::execute()

    {

        uint64 n=TASK_SIZE;

        uint64 i=0;
        __asm__ volatile (

            "0:;"

            "incq %[i];"

            "cmpq %[n],%[i];"

            "jl 0b;"

            : [i] "+r" (i)

            : [n] "r" (n)

            );
        return NULL;

    }

};
TBBRoot::execute()
{
       
        tbb::tick_count tp1 = tbb::tick_count::now();
        TBBTask t2(NULL);

        for (int i = 0; i < ITER; i++)
        {
            this->set_ref_count(2);

            spawn(*new (allocate_child()) TBBTask(NULL));
           
            t2.execute();
            
            this->wait_for_all();           
        }
        
        tbb::tick_count tp2 = tbb::tick_count::now();
}

main()
{
 //read TASK_SIZE,ITER from cmdl
tbb::task::spawn_root_and_wait(*new (tbb::task::allocate_root()) TBBRoot);
}

As for your question, whether this is significant. There is always the odd chance that I'm doing something stupid unknowingly or that there is a compiler bug or such like. On the other hand, I guess in the long run I only care about the speed up. So next to curiosity, nothing stops me from just looking at that.

Alex

Yes, it is better to exclude the lazy initialization if you want to measure pure scheduling overhead. Otherwise, for small TASK_SIZE the tasks can be easily executed *sequentially* for a significant number of ITERations (thousands and more, esp. on some Windows where threads creating is really heavy operation).It is also not clear to me why to call t2.execute().. I expect permanent gap between when t2 and unnamed TBBTask finish even if they run in parallel. Why don't just wait for TBBTask to complete in a spin-loop instead if you imply the parallelism anyway. Then, the lazy initialization will be completed during only the first iteration and its result can be easily thrown away.Also, you need to call task_scheduler_init init(2); to request concurrency level which is enough for your mandatory parallelism and not too big to provoke scheduler turbulence with corresponding overhead due to excessive threads looking for absent work.

Excluding the lazy init has helped to clear some other issues I was having, the overhead hasn't changed though.

The reason to call t2.execute() directly was to avoid going via the scheduler if in 99% of the cases one of the two tasks will be executed on the spawning thread anyways (so instead of spawn, no steal ,local execute I'm executing it locally right away, maybe saving some time). But it turned out that it makes no difference wether I call t2. execute() or spawn it as well.

By the way, is there a call to tbb::something to force initialization of the worker threads or do I have to call task::spawn() a couple of times and wait a bit?

Thx,
Alex

Inicie sesión para dejar un comentario.