Cores are "used up"

I've used the Advanced Task Programming example from the TBB book to create tasks that run in parallel alongside the main application task. I have two such tasks (one manages a Direct3D window and the other performs heavy-duty calculations). The main application task (in principle a win32 GUI) controls the parallel tasks by sending commands via a producer-consumer queue (one tbb::concurrent_queue per parallel task). When a parallel task isn't doing any work it sleeps (it waits on a tbb::mutex).

The problem I'm having is that each parallel task seems to "consume" one processor core. I have 4 cores and only 2 are available to do actual work in the main application task. I noticed this by first scanning a std::vector serially and then doing the exact same scan in parallel using a tbb::parallel_for. The parallel scan is only twice as fast as the serial one. It should be 4 times as fast, shouldn't it? That's why I suspect the two parallel tasks are "holding on" to one core each, leaving only 2 to do work in the main application task.

To test the above assumption I switched off one of the parallel tasks (the Direct3D window). And now the parallel scan is 3 times faster than the serial one. I then introduced a dummy parallel task so there's a total of 3 parallel tasks. Now the serial and the parallel scanning tasks are equally fast. With two dummy parallel tasks, that is a total of 4 parallel tasks, the application just hangs. There's no assertion or anything (from MS Visual Studio in debug mode). I can note that it's the last parallel task started that fails (because the Direct3D window never shows up as it should if the parallel task associated with it were working).

This behavior makes me believe each parallel task is holding on to a core. According to the TBB book this isn't what should happen, is it? I've got the impression that all 4 cores would be available in every task.

This is my version of the example from the TBB book:

#ifndef MTHREAD2_MINTONCE
#define MTHREAD2_MINTONCE

#include "tbb/task.h"

namespace mthreads {

	class MThread2 { // to be privately inherited
	protected:
		~MThread2() {}
		MThread2() : process(new(tbb::task::allocate_root()) tbb::empty_task) {}

		virtual void run() =0; // to be overridden. runs in separate thread

		void start() {
			process->set_ref_count(2);
			tbb::task* s(new(process->allocate_child()) Task(this));
			process->spawn(*s);
		}

		void stop(){
			process->wait_for_all();
			process->destroy(*process);
		}

	private:
		MThread2(const MThread2&);
		MThread2& operator=(const MThread2&);

		class Task : public tbb::task {
		public:
			~Task(){}
			explicit Task(MThread2* e) : enclosure(e) {}
		private:
			Task(); 
			Task(const Task&);
			Task& operator=(const Task&);

			tbb::task* execute() { 
				enclosure->run();
				return 0;
			}

			MThread2* const enclosure;
		};

		tbb::empty_task* const process;

	};

}
#endif

From a quick read of the description (I skipped the code), you want to simply substitute tasks for threads, but this is destined to fail. The scheduler is unaware that those tasks are waiting: it assumes that they are making full use of their assigned hardware thread (by way of their assigned worker thread), and should therefore be left alone. Perhaps if you review the tutorial material with this in mind you might come up with a more suitable design.

Quoting - Raf Schietekat
From a quick read of the description (I skipped the code), you want to simply substitute tasks for threads, but this is destined to fail. The scheduler is unaware that those tasks are waiting: it assumes that they are making full use of their assigned hardware thread (by way of their assigned worker thread), and should therefore be left alone. Perhaps if you review the tutorial material with this in mind you might come up with a more suitable design.

I haven't invented this design. It's described in the TBB book (in the Advanced Task Programming example I mentioned). It states this:

"One of the beautiful things about this approach is that each half of the program is free to invoke as much parallelism as it desires. The task-based approach of Threading Building Blocks does the load balancing and manages the assignment of tasks to threads without causing oversubscription."

I'm starting the secondary parallel tasks exactly as described in this example, and according to the statement above TBB should then be aware of all tasks and be able to balance the cores between them. So on one hand I have this example in the TBB book explaining how to "start a large task in parallel with the main program", and on the other hand I have you saying that it's destined to fail.

I don't know how to proceed really. It would be good to have it established whether the example is wrong or promises too much, or if I've misunderstood it somehow, stretched it too far, or made a programming error.

If I indeed have to drop this approach do you have a suggestion in which direction to go?

I don't have the book with me (I'll try to have a look before tomorrow), but I don't remember any example involving long-lived tasks that were waiting for something to happen much of the time (and I think I would...). In TBB, you continuously create tasks that are neither too small (because they always carry some overhead, however small compared to user threads) nor too large (because otherwise you might be missing an opportunity for parallelisation). Splitting work is great: maybe some idle worker may pick up a piece ("steal" it). Maybe it just so happens that you cannot split up part of the work, and then a big task can be tolerated because it is carrying its part of the load, but it should always be doing something, except for the occasional predictably short pause, because that's part of your contract with the scheduler.

You could proceed by arranging for a user thread to do the waiting, e.g., through tbb_thread, and when work is detected the thread can invoke TBB algorithms (or tasks directly) to process the work. But I'll hold off more specific suggestions until I've compared this with the book. One thing to look out for is that concurrency should be optional: you wouldn't want things to stop working if no additional worker threads are available.

Quoting - Raf Schietekat

I don't have the book with me (I'll try to have a look before tomorrow), but I don't remember any example involving long-lived tasks that were waiting for something to happen much of the time (and I think I would...). In TBB, you continuously create tasks that are neither too small (because they always carry some overhead, however small compared to user threads) nor too large (because otherwise you might be missing an opportunity for parallelisation). Splitting work is great: maybe some idle worker may pick up a piece ("steal" it). Maybe it just so happens that you cannot split up part of the work, and then a big task can be tolerated because it is carrying its part of the load, but it should always be doing something, except for the occasional predictably short pause, because that's part of your contract with the scheduler.

You could proceed by arranging for a user thread to do the waiting, e.g., through tbb_thread, and when work is detected the thread can invoke TBB algorithms (or tasks directly) to process the work. But I'll hold off more specific suggestions until I've compared this with the book. One thing to look out for is that concurrency should be optional: you wouldn't want things to stop working if no additional worker threads are available.

I really appreciate if you would look into this.

I realize that TBB is best for fine-grained multitasking and that having long-lived thread-like tasks kind of goes against the intended usage of TBB. It's only because I found this example in the TBB book that I tried it. I got the impression that this approach was preferred and that it would make it easier for TBB to make sure the cores got utilized optimally over the whole of the application (in my case between the main application GUI thread and the two "worker" threads).

Because it may be part of the problem I also post here my producer-consumer queue (PCQueue). The main task uses one such queue to communicate with each of the parallel tasks.

Initially the producer acquires a lock on the queue. When the producer pushes an item it checks whether there's a sleeping consumer waiting. If so it releases the lock. This wakes up the consumer, which pops the item. Then the producer re-acquires the lock.

#ifndef PCQUEUE_MINTONCE
#define PCQUEUE_MINTONCE

#include "tbb/mutex.h"
#include "tbb/queuing_mutex.h"
#include "tbb/tbb_thread.h"
#include "tbb/atomic.h"
#include "tbb/concurrent_queue.h"

namespace mthreads {

	template <class T>
	class PCQueue { // producer-consumer queue (consumer thread sleeps while waiting)
	public:
		~PCQueue() {} 
		PCQueue() {produceLock.acquire(mutex);} // producer initially acquires lock
		void push(const T& item); // produce item
		void push(); // just wakeup consumer
		bool pop(T& item); // consume item
		bool empty() {return queue.empty();} // queue empty?

	private:
		PCQueue(const PCQueue&);
		PCQueue& operator=(const PCQueue&);

		typedef tbb::mutex Mutex;
		Mutex mutex; // threads sleep until lock granted
		Mutex::scoped_lock produceLock, consumeLock;
		tbb::concurrent_queue<T> queue;
		tbb::atomic<bool> sleeping; // initializes to false
	};
	
	template <class T>
	void PCQueue<T>::push(const T& item) { // producer
		queue.push(item);
		if (sleeping) { // consumer waiting?
			produceLock.release(); // release lock to let waiting consumer in
			while (sleeping) tbb::this_tbb_thread::yield(); // wait for consumer to wake up
			produceLock.acquire(mutex); // relock
		}
	}

	template <class T>
	bool PCQueue<T>::pop(T& item) { // consumer
		sleeping = true;
		if (!queue.try_pop(item)) {
			consumeLock.acquire(mutex);   // sleep until producer releases lock
			sleeping = false;
			consumeLock.release();
			if (!queue.try_pop(item)) return false;
		} 
		sleeping = false;
		return true; // returns true if item was popped
	}

	template <class T>
	void PCQueue<T>::push() { // produce nothing, just wake the consumer
		if (sleeping) { // consumer waiting?
			produceLock.release(); // release lock to let waiting consumer in
			while (sleeping) tbb::this_tbb_thread::yield(); // wait for consumer to wake up
			produceLock.acquire(mutex); // relock
		}
	}

}
#endif

Hmm... have you tried using concurrent_queue without that wrapper?

Here are a few thoughts/comments I had while reading the conversation.

The advanced technique described in the book was required to express the functional parallelism ("do a few different jobs concurrently") pattern with TBB 1.x and 2.0. Later TBB 2.1 added a thread class that can also be used for this purpose, and TBB 2.2 added two other interfaces, the parallel_invoke function and the task_group class. Which of the three ways is best depends on the purpose, but the technique described in the book is clearly obsolete.

> This behavior makes me believe each parallel task is holding on to a core. According to the TBB book this isn't what should happen, is it? I've got the impression that all 4 cores would be available in every task.

No, this is not what should be expected. Once a task is started, it holds on to a thread, and by itself it can not use other threads. The only way to share the work with other threads is to spawn other tasks from inside the running one.

Your implementation of MThread class is fine, and if enclosure->run() used TBB inside it to express its own parallelism, this work would be shared with other threads. But it looks like the functions you try to run in parallel are in fact thread-like, i.e. long running and do not share their work. In this case, a separate thread (tbb_thread) is best choice, while parallel_invoke, task_group, or the technique from the book will all cause one TBB worker thread being employed to execute this function and lost for other work.

> "One of the beautiful things about this approach is that each half of the program is free to invoke as much parallelism as it desires. The task-based approach of Threading Building Blocks does the load balancing and manages the assignment of tasks to threads without causing oversubscription"

I understand how this statement led you to the expectations you had. "As much parallelism as it desires" sounds as though it covers having no nested parallelism at all. Well, this is somewhat an overstatement. On the other hand, the statement holds true in some sense, as it does not cause oversubscription in your case, "just" undersubscription :)

> I ... create tasks that run in parallel along the main application task. I have two such tasks (one manages a Direct3D window and the other performs heavy duty calculations). The main application task (in principle a win32 GUI) controls the parallel tasks by sending commands via a producer-consumer queue (one tbb::concurrent_queue per parallel task). When a parallel task isn't doing any work it sleeps (it waits on a tbb::mutex).

Let me share my understanding; and correct me if it's wrong.
So you have three long-running functions in the app - user interaction, managing the Direct3D window, and doing heavy calculations. The first two probably don't consume CPU most of the time. The third can consume lots of CPU power, and possibly has quiet periods as well. Basically, first of all you want to use all CPUs for the heavy calculations.
One possibility to do that is to start the long-running functions in their own threads, and use TBB tasks (or parallel algorithms) only inside the heavily loaded one. In this case, you may have temporary oversubscription during the periods when GUI or D3D processing overlaps with calculations. This degree of oversubscription might not be that bad in practice. If you think or know that (and why) such a setup will not work well, some adjustments can possibly be made to address specific issues.

Quoting - Raf Schietekat

Hmm... have you tried using concurrent_queue without that wrapper?

I've tried that now. I've replaced the push method with,

queue.push(item);

and the pop method with,

while (!queue.try_pop(item)) {
	tbb::this_tbb_thread::yield();
}

Unfortunately the problem remains as before (each parallel task "removes" performance corresponding to one core from the main task).

So the PCQueue wrapper doesn't seem to cause the problem. Maybe I should say that the reason I use it is that I want the consumer threads to be sleeping while waiting instead of spinning.

"So the PCQueue wrapper doesn't seem to cause the problem. Maybe I should say that the reason I use it is that I want the consumer threads to be sleeping while waiting instead of spinning."
The wrapper would not be responsible for that issue, but it doesn't provide any benefit, adds some fixed overhead even with few cores, and the use of a mutex may also undo some or all of the scalability that has presumably been crafted into concurrent_queue. I think that you can get everything you need from concurrent_queue alone by just using push() and pop().

Quoting - Raf Schietekat

"So the PCQueue wrapper doesn't seem to cause the problem. Maybe I should say that the reason I use it is that I want the consumer threads to be sleeping while waiting instead of spinning."
The wrapper would not be responsible for that issue, but it doesn't provide any benefit, adds some fixed overhead even with few cores, and the use of a mutex may also undo some or all of the scalability that has presumably been crafted into concurrent_queue. I think that you can get everything you need from concurrent_queue alone by just using push() and pop().

Note that the PCQueue wrapper has two operating modes. It sleeps during periods of inactivity, and then it acts like a pure concurrent_queue during bursts of activity.

If this behaviour can be achieved using a concurrent_queue alone I would do that of course. Maybe you could suggest something?

> Quoting - Alexey Kukanov
>
> Here are a few thoughts/comments I had while reading the conversation.

Thank you.

Well, if the example in the TBB book is considered obsolete I'll drop it of course. On the other hand it works very well, apart from the problem I've reported in this thread, so I wouldn't mind keeping the solution I have.

Is this a bug really, or is it something that can never be fixed due to the way TBB works? I'm just curious. Anyway, I'll seek an alternate solution looking into the new TBB techniques you mention.

I think I'll stick to the overall structure for my program: a main thread consisting of the Win32 GUI and then two heavy-duty worker threads, one responsible for a Direct3D window and one for the number crunching. I'm working on a molecular dynamics application. The number cruncher constantly calculates new positions for the molecules shown in the Direct3D window, and the user can rotate and translate the molecules at the same time, so everything is heavy duty really.

I know this division of an application along functional lines isn't the TBB-recommended approach, but I like the subdivision into modules with thin communication channels between them. Otherwise everything gets so intertwined.

I've submitted this problem as a bug. It has id 157.

I feel that even if the Advanced Task Programming example in the TBB book can be considered obsolete in the sense that there are newer alternate techniques that can replace it, it still uses standard TBB code and it doesn't work as intended. This makes it a bug in my view. And it may very well be that the problem I've reported is caused by some deeper anomaly in TBB. At least it should be accounted for.

Even if the problem isn't considered a bug I hope this will lead to an official work-around. I liked the idea of using long-lived TBB tasks instead of ordinary threads, at least if that would help TBB to optimize the overall core utilization over the whole of the program.

I'm afraid you have misinterpreted the example in the book, which has no long-lived tasks. It is obsolete only in the sense that there are now different ways to do the same thing. Tasks do something specific and then finish. They are not cheap threads, rather they are dispatched to threads. And that's a powerful idea, if used correctly (see above for some suggestions), so nothing needs changing based on this discussion.

For the concurrent_queue, have a look over at "Questions on TBB Concurrent Queue Implementation", about its use of condition variables to avoid busy waiting. TBB's scheduler still thinks the task is busy, though (a different meaning of busy), and will not dispatch other tasks to the same worker thread, so long waits should generally only occur in user threads, not in tasks.

Quoting - uj

I've submitted this problem as a bug. It has id 157.

I feel that even if the Advanced Task Programming example in the TBB book can be considered obsolete in the sense that there are newer alternate techniques that can replace it, it still uses standard TBB code and it doesn't work as intended. This makes it a bug in my view. And it may very well be that the problem I've reported is caused by some deeper anomaly in TBB. At least it should be accounted for.

Even if the problem isn't considered a bug I hope this will lead to an official work-around. I liked the idea of using long-lived TBB tasks instead of ordinary threads, at least if that would help TBB to optimize the overall core utilization over the whole of the program.

This behavior is not a bug, but rather a design decision. I understand your code does not work as you expected it to, but it does work as intended. TBB tasks do not preempt each other, so a started task occupies a thread till its completion. Core utilization and scalability is achieved by having lots of lightweight tasks, rather than a few long-running thread-like ones.

So far, an official workaround is to use tbb_thread for such long-running non-cooperative jobs. Using tasks for that will not help the TBB scheduler to optimize core utilization, simply because these tasks do not interact with the scheduler during execution. If some of these long-running jobs also consume a lot of CPU, you may consider decreasing the number of TBB worker threads; e.g. in your case, if the main GUI thread is mostly idle but the D3D thread is mostly busy, you may decrease the number of TBB workers by one, to accommodate the busy thread.

We continue looking for a better solution for this kind of design challenge.

Quoting - Raf Schietekat
I'm afraid you have misinterpreted the example in the book, which has no long-lived tasks. It is obsolete only in the sense that there are now different ways to do the same thing. Tasks do something specific and then finish. They are not cheap threads, rather they are dispatched to threads. And that's a powerful idea, if used correctly (see above for some suggestions), so nothing needs changing based on this discussion.

For the concurrent_queue, have a look over at "Questions on TBB Concurrent Queue Implementation", about its use of condition variables to avoid busy waiting. TBB's scheduler still thinks the task is busy, though (a different meaning of busy), and will not dispatch other tasks to the same worker thread, so long waits should generally only occur in user threads, not in tasks.

I don't think I have misinterpreted the example. From what I can see there's no limitation on how long a parallel task can run. It could run forever without problems couldn't it? The problem seems to be the way I communicate between that main task and the parallel task. For some reason setting up a standard pipe using a concurrent_queue and a mutex doesn't fit the bill.

I still think spawning off worker tasks the way the example suggests is a good idea. It should help TBB balance the load better than if ordinary threads were used, shouldn't it? In that case I have two options I guess. Either I keep the "eternal" worker tasks but find a better way to communicate with them. Or I spawn off a series of worker tasks that each performs a limited amount of work and then finishes. That's probably more in line with the example.

Or I give up the "tasks" approach altogether and use ordinary threads.

Quoting - uj
I don't think I have misinterpreted the example. From what I can see there's no limitation on how long a parallel task can run. It could run forever without problems couldn't it? The problem seems to be the way I communicate between that main task and the parallel task. For some reason setting up a standard pipe using a concurrent_queue and a mutex doesn't fit the bill.

I still think spawning off worker tasks the way the example suggests is a good idea. It should help TBB balance the load better than if ordinary threads were used, shouldn't it? In that case I have two options I guess. Either I keep the "eternal" worker tasks but find a better way to communicate with them. Or I spawn off a series of worker tasks that each performs a limited amount of work and then finishes. That's probably more in line with the example.

Or I give up the "tasks" approach altogether and use ordinary threads.

The basic philosophy of tasks in TBB is that they are quanta or quantifiable and can be managed and scheduled with lower overhead by making use of a pool of reusable threads. If one task is going to just take a thread and never return it to the pool, it doesn't really play in the pool and Alexey's suggestion of a tbb_thread makes sense, as well as paying attention to oversubscription of the available HW threads. On the other hand, for your molecular modeling component, you'll want to have that adapt to however many HW threads are available in the likely-increasing future.

What is the nature of the molecular modeling work? Are you folding molecules and dealing with collision detection, or dealing with forces or that sort of thing? Are you doing a time-step dynamic animation? I presume the graphics subsystem can take 3D models and do all the world-space viewing work. Does the GPU do more than that? How can you structure the modeling work as a hierarchy of tasks you can schedule with the thread pool?

I've been working on some sample code that uses code very similar to what you quote from James's book, except in my case it's buried inside the implementation of parallel_invoke, the two-task version, which I use in a recursive parallelism example to manage the n² interactions between bodies in a gravitational simulation. This is more the intention for the dual spawn and wait-for-all pattern contained in the original code. My sample doesn't have either the interactive front end or the graphics pipeline to deal with--those sound like work queues better suited to other parallel patterns, event and pipeline specifically--but it does deal with repeated operations over a simulation time step.

"I don't think I have misinterpreted the example. From what I can see there's no limitation on how long a parallel task can run. It could run forever without problems couldn't it?"
TBB can't protect itself against such abuse. There is no legal enforcement of your contract with the scheduler. But why would you try to trick it if its sole purpose in life is to serve you?

In the example, some limited amount of work is presented to the scheduler for execution at a convenient time (the title is "Start a Large Task...", not "Start an Unlimited Task..."). With a single hardware thread, that time is no earlier than the call to WaitForSideShow(), by default! I'll repeat what I wrote earlier: tasks are not cheap threads. In particular, they should not be used for required concurrency (even if they thrive on optional parallelism). I guess they take a bit of getting used to. Once you know all the rules, you might decide to bend them, at your own responsibility, but no earlier.

"The problem seems to be the way I communicate between that main task and the parallel task. For some reason setting up a standard pipe using a concurrent_queue and a mutex doesn't fit the bill."
Have you actually observed any significant amount of wasted cycles using just push() and pop() on a concurrent_queue all by itself?

"I still think spawning off worker tasks the way the example suggests is a good idea. It should help TBB balance the load better than if ordinary threads were used, shouldn't it? In that case I have two options I guess. Either I keep the "eternal" worker tasks but find a better way to communicate with them."
Bad idea: there is no such thing as a worker task, because tasks are not cheap threads (have I said this before?). TBB couldn't balance a mouse mat if you break your contract, and you won't be able to communicate your way out of that.

"Or I spawn off a series of worker tasks that each performs a limited amount of work and then finishes. That's probably more in line with the example."
Now you're talking!

"Or I give up the "tasks" approach altogether and use ordinary threads."
Each have their role to play, see the suggestions above.

Quoting - Raf Schietekat

>> "I don't think I have misinterpreted the example. From what I can see there's no limitation on how
>>long a parallel task can run. It could run forever without problems couldn't it?"

> TBB can't protect itself against such abuse. There is no legal enforcement of your contract with the
> scheduler. But why would you try to trick it if its sole purpose in life is to serve you?

Well, I guess we'll have to let Mr. Reinders himself settle this dispute. :)

But until he does I think you're wrong. The running length of the parallel tasks started according to the technique described in the example is irrelevant. Let me convince you. When would you say the running length of such a task becomes a problem? After a nanosecond maybe, or a microsecond, or a second, an hour, or even a year? Well, you cannot say, for the simple reason that it never becomes a problem. The running length just isn't an issue.

In my view the purpose of the example is to show how you can use TBB tasks in a thread-like manner instead of resorting to "real" threads. The advantage being that this would make it easier for TBB to balance the cores throughout. The disadvantage is that such TBB task-threads have limitations compared with "real" threads. The example doesn't mention this and I had to learn it the hard way.

Anyway I've made a few changes to my program and it now works at the expected performance level (a parallel scan of a vector is 4 times faster than a serial scan on my quad-core machine). I'm still using the TBB task-threads described in the example as worker threads. It's just that they no longer wait on a mutex when nothing happens. Instead they're created anew each time there's something for them to do. They read items from a concurrent_queue as long as there are any. When there's nothing more to do for the moment they lay down and die (only to rise again from the ashes like the mythical Phoenix when more work arrives :). I only had to change a few lines of code actually.

Despite the fact that we disagree on some issues, I would like to thank you for the insight you've provided. I would also like to thank everybody else who has contributed to this thread. Thank you all!

"When would you say the running length of such a task becomes a problem?"
See #3, first paragraph.

"Anyway I've made a few changes to my program and it now works at the expected performance level (a parallel scan of a vector is 4 times faster than a serial scan on my quad core machine). I'm still using the TBB task-threads described in the example as worker threads. It's just that they no longer wait on a mutex when nothing happens. Instead they're created anew each time there's something for them to do. They read items from a concurrent_queue as long as there are any. When there's nothing more to do for the moment they lay down and die (only to rise again from the ashes like the mythical Phoenix when more work arrives :). I only had to change a few lines of code actually."
That's exactly it: tasks do something specific, and then finish. Of course it is beneficial to also think of optimisations like recycling them after they finish, and/or, as in this case, letting them do more before they finish. Just realise that requiring concurrency may happen to work here, but it can very easily break down. In general, it is better to write programs that do not need additional worker threads, for easier debugging and for more general applicability (I'm not sure, but maybe requiring concurrency from different places can add up to exhaust any finite number of worker threads, even if you decide to require a multicore system?), and to use user threads instead when concurrent operation is needed (provided through time slicing if not through parallelism). Maybe TBB could add support for executing code on a thread of its own (from a separate pool), to take away the temptation to use tasks for this purpose...

(Added 2009-09-06) Actually, I have no clear view yet on what a separate thread pool would do, and in this case it would make no difference because the thread would be long-lived, so please ignore that last sentence for now.

Quoting - Robert Reed

> The basic philosophy of tasks in TBB is that they are quanta or quantifiable and can be managed and scheduled with lower overhead by making use of a pool of reusable threads. If one task is going to just take a thread and never return it to the pool, it doesn't really play in the pool and Alexey's suggestion of a tbb_thread makes sense, as well as paying attention to oversubscription of the available HW threads.

Well, what you say is not supported by the Advanced Task Programming example. It states:

"Instead of having all threads execute portions of a problem, it is possible to start a task in parallel with the main application."

What's more, it then goes on to state that this usage is advantageous:

"One of the beautiful things about this approach is that each half of the program is free to invoke as much parallelism as it desires. The task-based approach of TBB does the load balancing and manages the assignment of tasks without causing oversubscription"

Until anyone comes up with firm evidence to the contrary, I'm going to believe the example shows good TBB usage.

I believe that the running length of a TBB task is basically irrelevant. A task running a picosecond or a millisecond isn't any better than a task running a minute or a day. On the contrary, long-lived tasks have certain advantages over short-lived ones. One is that the overhead of creating and destroying the task becomes a smaller portion of its lifetime; a long-lived task simply spends more time doing useful work. Another is that the TBB scheduler is more likely to reach a steady state of optimal performance when tasks are long-lived rather than short-lived and in turbulent flux.

What's of paramount importance, though, is that the TBB tasks aren't used in a way that gets in the way of the TBB scheduler. That was my problem, not that my worker tasks were long-lived.

So TBB is perfectly capable of handling tasks of any length. In fact it's likely to do a better job if long-lived TBB tasks are used in place of "normal" OS threads, because then it has more control. This is my firm belief after having read the Advanced Task Programming example. I'm happy to be proved wrong, but then it should be with technical evidence, not by repeating a mantra (that TBB requires short-lived tasks to work well).

Quoting - Raf Schietekat

>> "When would you say the running length of such a task becomes a problem?"

> See #3, first paragraph.

It's perfectly clear that for TBB to be useful you need to divide a program into tasks that can be executed in parallel, and the more tasks the better. That's obvious, and that's not what we're discussing.

You are claiming that a task that runs in parallel with the rest of a program for a long time is bad. What's more, you're denying that the Advanced Task Programming example, in opposition to you, holds the opposite view, namely that this is in fact a good thing.

I'm very confident that you're wrong. Contrary to what you claim, the running length of a TBB task is irrelevant to the TBB scheduling mechanism. It works equally well with shorter tasks used to break up an algorithm to run in parallel as it does with longer tasks created to exploit the overall parallelism of a whole program. And contrary to what you claim, this is exactly what the example from the TBB book states.

In short, to fully utilize TBB all kinds of parallelism should be used. There's absolutely no reason why a TBB task cannot run forever (if the rest of the program does that too). In fact a long-running task is much better than the alternative of using an ordinary thread, as long as you realize that the long-running task isn't an ordinary thread and are willing to accept its limitations. In addition, you're encouraged to further parallelize the long-running task by spawning more tasks from it - short-lived, long-lived, and anything in-between.

If you feel my position is wrong, please give some technical evidence rather than keep repeating the "TBB tasks must be short-lived because that's the way it is" mantra.

Quoting - Alexey Kukanov

> We continue looking for a better solution for this kind of design challenges.

Well, I keep pushing it a little. Something has to give. :)

At least the Advanced Task Programming example should have a disclaimer discussing the pros and cons of this approach in relation to using ordinary threads.

If my position is wrong now maybe it becomes right in the near future? :)

"Until anyone is coming up with firm evidence to the contrary I'm going to believe the example is showing good TBB usage."
At least we agree about something... but the devil is in (how you interpret) the details.

"I'm very confident that you're wrong."
How exciting!

"If you feel my position is wrong please give some technical evidence rather than keep repeating the "TBB tasks must be shortlived because that's the way it is" mantra."
A long-lived task is perfectly fine if it is only long-lived together with its children; just realise that it doesn't take long for task overhead to amortise to zero (a second is already like an eternity). A long-lived task is tolerable if it could not be completely broken up but is doing useful work all the time; it would only be a big stroke of luck if the other hardware threads all find something else to do in the meantime, and one of the central ideas in TBB is to create lots of tasks as opportunities for useful parallelism. It is in breach of contract if it is merely waiting for other things to happen for longer than predictably short pauses; instead, you should use user threads in such a situation, e.g., through the tbb_thread interface. Is that a mantra? Additionally, your design is problematic because it requires concurrency: another reason to use a user thread instead. Of course you are free to ignore this advice if you think you see a good reason to do so, but you should probably add a note about that for the benefit of whoever has to maintain the program after you.
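Raf's suggestion above, moving the waiting role onto a user thread, can be sketched in plain C++. The `Manager` class and `Command` enum below are hypothetical stand-ins for something like the original poster's Direct3D controller; `tbb_thread` mirrors the `std::thread` interface used here, so the same shape applies either way. The point is that the blocking wait happens on a dedicated user thread, not on a TBB worker:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical command type for a long-lived "manager" role
// (e.g. the Direct3D window controller from the original post).
enum class Command { DoWork, Quit };

// A dedicated user thread that blocks waiting for commands, so no TBB
// worker is tied up while it sleeps. tbb_thread offers the same interface.
class Manager {
public:
    Manager() : worker_(&Manager::run, this) {}
    ~Manager() { stop(); }

    void post(Command c) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(c); }
        cv_.notify_one();
    }
    void stop() {
        if (worker_.joinable()) { post(Command::Quit); worker_.join(); }
    }
    int handled() const { return handled_; }  // only meaningful after stop()

private:
    void run() {
        for (;;) {
            Command c;
            {
                std::unique_lock<std::mutex> lk(m_);
                // Blocking here is fine: this is a user thread, not a task.
                cv_.wait(lk, [this] { return !q_.empty(); });
                c = q_.front(); q_.pop();
            }
            if (c == Command::Quit) return;
            ++handled_;  // stand-in for real work (which may itself use TBB algorithms)
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Command> q_;
    int handled_ = 0;
    std::thread worker_;  // declared last so it starts after the other members
};
```

Because commands are consumed in FIFO order, posting `Quit` last guarantees all earlier commands are handled before the thread exits.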

I hope this has been useful, but if you still disagree maybe you should ask somebody from Intel to settle the matter.

Quoting - uj
I'm very confident that you're wrong. Contrary to what you claim, the running length of a TBB task is irrelevant to the TBB scheduling mechanism. It works equally well with shorter tasks used to break up an algorithm to run in parallel as it does with longer tasks created to exploit the overall parallelism of a whole program. And contrary to what you claim, this is exactly what the example from the TBB book states.

The fault in your reasoning is that, unlike threads, tasks aren't timesliced. This means that if you have a never-ending task running, it's going to be started on one thread and stay there, never allowing anything else to happen on that thread and never allowing other threads to contribute to the work that's happening in that thread (unless it's spawning more tasks as well).

Now sure, as long as you have one less never-ending task than you have cores in your PC (including 'fake' HyperThreading cores), your program will still work, but as you noticed in your OP you run into trouble as soon as you have one more never-ending task. For example, with 4 continuous tasks running on a system with 4 cores, there will be 4 threads, each running one of these tasks, and any more tasks will just have to wait forever.
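The starvation described here can be reproduced with a minimal fixed-size pool. This is not TBB's scheduler, just a sketch that shares the relevant property: a submitted job runs to completion on one thread and is never time-sliced in favour of other queued jobs, so two never-ending jobs on a two-thread pool starve a third:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal fixed-size pool: like TBB's worker pool, a job runs to
// completion on a thread and is never preempted in favour of queued jobs.
class FixedPool {
public:
    explicit FixedPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~FixedPool() {  // drain remaining jobs, then join the workers
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> f) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(f)); }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> f;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (q_.empty()) return;  // done_ set and queue drained
                f = std::move(q_.front()); q_.pop();
            }
            f();  // runs to completion: no time slicing between jobs
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
    bool done_ = false;
    std::vector<std::thread> threads_;
};
```

While the two long tasks spin, the short task sits in the queue indefinitely, exactly like the extra long-lived task in the original post; it only runs once a worker frees up.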

You also mentioned something along the lines of long-running tasks leading to more stable scheduling. While technically true (there's 1 less thread to schedule tasks to) there's really no reason not to run these tasks as independent threads and just decrease the number of available threads for the scheduler by one. The scheduler will not be able to schedule anything else on that thread anyway.

The whole point of TBB's approach is to subdivide big algorithms into nicely parallelizable tasks to let the scheduler keep all of the cores busy. One of the most important steps is to avoid blocking calls as much as possible, so that instead of waiting, the thread can run another task in between. Now your Direct3D window task sounds like the perfect example of what NOT to put in a single task. This task will be doing nothing while waiting for input, while waiting for the buffers to swap, etc. If you could make this into short, periodic tasks, all the waiting time could be filled with actual work.

Quoting - mtsr
unlike threads tasks aren't timesliced.

Yes. You might find our experience interesting and perhaps helpful. In our application we had to alleviate a computation bottleneck in a large existing application. We wound up combining TBB mechanisms with a traditional dedicated Windows thread.
I created a TBB pipeline inside a newly created Windows thread. The input of the pipeline comes from popping lines to be processed off a concurrent_queue object. Existing code, outside the thread, pushes lines to be processed into the queue. I switched to the bounded type of queue, so if the queue is at capacity the push operation blocks internally in an efficient manner; similarly, if processing inside the pipeline gets ahead so that the queue becomes empty, the pop operation blocks. We limit the maximum number of objects in the pipeline to 8 (actually to the number of hardware cores found), as in Reinders' pipeline example.

Thus, though the TBB task threads do not get timesliced, since the whole pipeline is in a separate thread which DOES get timesliced, other processes run simultaneously.

It did take some programming help to get this running, namely that the task_scheduler_init has to be created inside the special thread even though the concurrent_queue is created outside the thread. Also, the input and output filters of the pipeline (both serial) must access existing code outside the thread. The TBB pipeline ensures that unprocessed lines are received in order, processed fully in parallel, and the results written serially in sequential order.
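The blocking behaviour this design relies on from `tbb::concurrent_bounded_queue` can be sketched in plain C++. This is an illustration of the semantics only, not TBB's implementation: push blocks while the queue is at capacity, pop blocks while it is empty:

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>

// A bounded blocking queue with the behaviour described above:
// push blocks when full, pop blocks when empty.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [this] { return q_.size() < cap_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }

private:
    std::size_t cap_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
};
```

With a capacity of 8, a producer can never run more than 8 items ahead of the consumer, which is what keeps the pipeline from being flooded.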

Quoting - turks

Quoting - mtsr
unlike threads tasks aren't timesliced.

Yes. You might find our experience interesting and perhaps helpful. In our application we had to alleviate a computation bottleneck in a large existing application. We wound up combining TBB mechanisms with a traditional dedicated Windows thread.

Thanks turks, I certainly found your experience interesting. I am also setting up a similar arrangement.

Pipeline 1 : Hosted in a tbb::tbb_thread

Processes packet data and generates messages

Pipeline 2 : Hosted in another tbb::tbb_thread

Processes the messages generated by pipeline 1.

What I found was
1. I could have merged the two pipelines into one, but the filters inside the two pipelines work on completely different data. So this would break the main advantage the pipeline offers, which is that work stays in place and workers move. This is something to consider.

2. There has to be enough work for the filters in each work item, so I batched the input, say 200-500 work packets at a time. I don't know if this helps, but I kept the total size of each work item to about half the available cache.

> Thus, though the TBB task threads do not get timesliced, since the whole pipeline is in a separate thread which DOES get timesliced, other processes run simultaneously.

3. That is interesting. I thought that hosting a pipeline inside a tbb_thread did not change its behavior significantly. Tasks are still mapped to the same internal TBB threads. Once a task (in our case a tbb::filter) grabs a TBB internal thread, it does not let go until done.

4. I think TBB 2.2 calls task_scheduler_init automatically. Is it better to just let TBB handle the initialization the way it sees fit?

Thanks again for sharing your experience. This style of programming is so new that it is hard to find help on the internet. Well, maybe in 5 years...
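The batch-sizing rule of thumb in point 2 above can be sketched as a small helper. The cache and packet sizes used below are illustrative assumptions, and `batch_count` is a hypothetical name, not something from the original application:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rule of thumb from point 2: batch roughly 200-500 packets per work item,
// keeping each batch to about half the cache. All sizes are illustrative.
std::size_t batch_count(std::size_t packet_bytes, std::size_t cache_bytes,
                        std::size_t min_batch = 200, std::size_t max_batch = 500) {
    std::size_t fit = (cache_bytes / 2) / packet_bytes;  // packets per half-cache
    // Clamp into the 200-500 window; the floor wins if packets are large,
    // trading some cache pressure for enough work per filter invocation.
    return std::max(min_batch, std::min(fit, max_batch));
}
```

For 1 KiB packets and a 2 MiB cache, 1024 packets would fit in half the cache, so the 500-packet ceiling applies; for 8 KiB packets only 128 fit, so the 200-packet floor applies instead.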

"1. I could have merged the two pipelines into 1, but the filters inside the two pipelines work on competely different data. So this would break the main advantage the pipeline offers which is work stays in place and workers move. This is something to consider."
You might do well to challenge that assumption (and then tell us about your findings).

"3. That is interesting. I thought that hosting a pipeline inside a tbb_thread did not change its behavior significantly. Tasks are still mapped to the same internal tbb threads. Once a task (in our case a tbb::filter) grabs an tbb internal thread it does not let go until done."
I think you thought correctly.

First off, Raf is the one with TBB experience. This is my first use of TBB and in fact Raf helped us get our app up and running. (Tnx)

My app was designed using the much earlier TBB 2.0, which I believe does not have "tbb_thread". My app uses regular CreateThread() for the pipeline thread.

My inclination is that the use of two pipelines is preferable since they are operating on two logically different streams of data. Since they are all in the same process, even though you need a separate task_scheduler_init object in each thread, the internal subthreads are managed as a group by TBB. I think the coding is certainly easier and cleaner with separate pipelines.

Regarding the second point, that hosting a pipeline inside another thread does not change its tendency to not let go, Raf may be right.
1. My case uses a Windows thread, not a tbb_thread, and that may make a difference.
2. We are seeing reduction in scalability if we start multiple versions of our app. This is another TBB discussion topic ("Many Simultaneous Task Schedulers"). One app never uses less than 40% CPU even on simple jobs where the queue pushes/pops are most often in blocked state. Four separate apps never achieve more than 2.5x throughput, even on an 8-core machine.

Our big benefit is for a single app running a complex job, for which the pipeline mostly remains full. Inside the TBB pipeline, 8 cores get used instead of one for all the crunching, resulting in a huge speed-up, limited only by Amdahl's law. (The serial part(s) limit overall throughput.)

I am going to investigate whether use of tbb_thread instead of CreateThread has any advantage or disadvantage in our application.

"I am going to investigate whether use of tbb_thread instead of CreateThread has any advantage or disadvantage in our application."
A thread by any other API would run as fast.
