20% performance penalty vs simple hand-coded thread pool

Working on a legacy app that has its own thread pool to which it dispatches batches of equal-sized tasks — task size = total size / # cores, with a minimum task size to prevent excessive overhead. Tasks are generally in the 50-300 µs range, but sometimes as low as 5-15 µs. We also use TBB, so I wanted to get rid of our dedicated thread pool. Using the same task sizing and parameter-passing mechanism, the average time to execute tasks increases by 20%, with the penalty higher on higher-core-count machines. I see the same deficit in a test harness I built to compare them. I also compared TBB parallel_for vs task_group and didn't see much difference. Our application uses this code very heavily, so small changes in the performance of these tasks are easily measurable in overall application tests.

When I execute a nearly-null task in the harness, TBB is much faster, having a fraction of a microsecond of overhead per task vs several microseconds for our dedicated thread pool. But when I make the test task compute for a more reasonable 50-200 µs, TBB falls behind. With 4 tasks on a 10-physical / 20-logical-core machine, I see times of 65 µs for our dedicated pool vs 82 µs for TBB. With 10 tasks, it's 100 vs 120. At 19 it's 134 vs 142. Running 20 tasks generates noisy measurements, since any other transient CPU activity will affect the overall time to complete a task group.

For TBB, what I'm doing is this (timing code outside the loop omitted):

void MeasureTBB(int numTasks) {
   tbb::task_group g;
   for (int task = 0; task < numTasks - 1; ++task)
      g.run([=]{ TestFunction(task); });
   g.run_and_wait([=]{ TestFunction(numTasks - 1); });
}

The TestFunction's parameter is used to index into shared memory where the actual parameters to the function are stored. For our dedicated pool, we create one fewer thread than the number of CPUs. They sit waiting on condition variables. We store the pointer to TestFunction in a global and then notify all the condition variables. Then the main thread calls TestFunction itself and spin-loops waiting on each worker thread to set a boolean indicating it's done.

Measurements taken from our actual app running a benchmark show that for task groups split (always equally) across 8 cores, we ran 2357 task groups which our thread pool completed in an average of 83 µs vs TBB's 98. For task groups that were split across 10 cores, we ran 34280 groups with an average time of 255 µs in our thread pool vs 304 for TBB.

I'd like to get rid of this legacy thread pool (especially now that we're seeing CPUs with 56-64 logical cores), but I can't take a 20% performance hit. Any suggestions greatly appreciated.


How do you perform measurements? Do you run MeasureTBB one time or multiple times (and gather statistics)? I am worried about thread creation time, because TBB creates its threads lazily on first use (and even after first use it is not guaranteed that all threads have been created).

I'm running 200 iterations of the test, then taking the average of the fastest 90% of the samples. 

Can it be related to cache locality? Does TestFunction(task) process memory? As an experiment, can you replace it with dummy code, e.g. for (volatile int i = 0; i < 10000; ++i) ; ?

It loops calling an external function (to avoid dead code elimination / unrolling) that does a couple of floating point operations on its arguments (which are on the stack) — specifically to avoid memory bandwidth / cache contention issues. 

One suspicion is that with this pattern: equal-sized tasks where #tasks <= #cores, any queueing / task stealing behavior will have big costs. If it ever assigns a task to a thread that's already busy and then there's a delay until another thread steals it, that delay goes straight to the bottom line. A way to test this would be if there's some way to tell the task scheduler "give this task to an idle thread if possible". 

Do you know the number of tasks in advance? Can you use parallel_for to minimize task distribution overhead?

Yes, I know the number of tasks in advance. My test harness tries both task_group and parallel_for. For essentially "nop" tasks, where there's nothing but task distribution overhead, the task_group is significantly faster (at least up to the max of 20 tasks that I'm testing on this 20-logical-core machine). For the test tasks that compute for ~100 µs, they are equal within measurement noise.

... actually, with further sampling, the task_group and parallel_for approaches are essentially equal for either the empty task or the 100 µs "real" task. But the trend remains: TBB is faster for null tasks (it has lower minimum task-scheduling overhead) than my thread pool, but significantly slower for real tasks.

Will my parallel_for or task_group result in initial distribution of tasks to separate threads where they'll run to completion (since the tasks are equal sized and there are no more of them than there are worker threads)? Or is there likely to be task stealing going on in this circumstance? If so, is there any way to reduce or defeat it?

More data and a hypothesis — sampling with Instruments shows that for one iteration of a 5-second benchmark, my private thread pool uses 9 of its threads for 1.2 s each, and the other 10 threads for 40 ms each. That says most of the work is divided 10 ways — the number of physical cores — and the same 9 threads are re-used for most of this work. With TBB, almost all the worker threads were used for about 1.4 s each — the range was 1.23-1.62. Some of this is work that was otherwise distributed to TBB, so it's unsurprising the overall time per thread is longer. But the key thing is that the work was spread pretty evenly across all 19 worker threads, instead of being concentrated on the same threads.

Looking at the per-cpu measurements for my thread pool, it looks like the heavily used threads spent most of their time on the physical cores, while the seldom-used threads spent more time on the associated logical cores. 

This all suggests cache effects: the system thread scheduler is leaving my most-used threads assigned to physical cores and my scheduler is re-using them for most tasks, thus concentrating use on the physical cores and getting more cache re-use. TBB is distributing tasks across threads — and cores — more evenly. 

Next step is to measure actual cache misses in each scenario.

Smoking gun: it's cache misses. I ran Instruments to collect L1, L2, and L3 cache misses (specifically MEM_LOAD_RETIRED.L1_MISS, MEM_LOAD_RETIRED.L2_MISS, and MEM_LOAD_RETIRED.L3_MISS, which I believe also capture instruction misses in L2 and L3 at least; I couldn't find any instruction-cache miss counters in the list for my Skylake processor).

I get misses as follows (average of 2 runs each). This is running an overall benchmark in the app, not the parallel test harness (all values in millions):

                  L1     L2    L3
My thread pool    598     99   22
TBB               934    249   30

I'm frankly surprised by the size of this effect, but it makes sense. Is there anything I can use to affect TBB's choice of threads?

Running my test program, which uses a task that generates almost no cache misses (it just loops doing some scalar floating point math on a couple of local variables), I get this:

10 threads (the number of physical cores), misses in thousands:

                  L1     L2    L3
My thread pool     134    67    1
TBB               1165   574    2

With 19 threads:

                  L1     L2    L3
My thread pool     439   177    3
TBB               1676   758    6

A million more L1 cache misses and half a million more L2 misses in a test that takes 40-50 ms seems like a lot (there's a 10 ms difference in runtime between TBB and my thread pool).

Your pseudo code is incorrect. It is likely missing if (numTasks <= 0) return; at the beginning, or if (numTasks > 1) just before the recursion.

You should consider using parallel_invoke. You will have to craft the parallel_invoke calls, perhaps with __sync_fetch_and_add to generate the task number for TestFunction.

Jim Dempsey

And in my test harness, if I reduce the number of threads TBB uses to the number of physical cores or fewer, it matches the performance of my thread pool for that number of simultaneous tasks. I think this is pretty conclusive evidence that the problem is the interaction of TBB's task assignment and the OS thread scheduler, especially with respect to physical vs logical cores.

Is it possible to share your test case with us? 

Jim Dempsey — code further up the call chain in both the test harness and the real app ensures that numTasks is >= 2.

 
