TBB's UnFair Task Scheduling

TBB's UnFair Task Scheduling

I have a couple of questions regarding TBB's unfair scheduling.

1. I read through the user manual that describes Task Scheduler and this what it says

" Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called fair, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress."

And as I understand the above TBB scheduler makes sure that the thread executing a task runs until the task completes. Is this correct? If yes how is this done?

2. "Indeed, it often delays starting a task until it can make useful progress."
What does the above line mean?

3. Im planning to use the TBB scheduler in my application wherein I have a thread(boost thread) which waits on data packets to arrive. The boost thread creates a tbb::task and spawns it to the TaskScheduler and waits for taking the next data packet. The frequency at which the data packets arrive is not determinate( they may arrive at high frequency and sometimes at a low frequency). Hence the decision to use boost thread to wait on the data packets and spawn tasks for processing the packet.

And lets say I initialize the scheduler with number of physical threads( lets say 4). Also lets assume the boost thread reads four data packets and spawns tasks for them to the task scheduler( using task::spawn() ) and waits for further data packets to arrive. All the four worker threads in the scheduler steals a task each and start executing the stolen task. Now lets say a data packet arrives when all the worker threads are busy executing their task.

Now the question is when will the boost thread get its turn for collecting the data packet and spawning a task. Will it get its turn only when one of the worker thread finishes executing the task? How will the overall scheduling work now? Will it be fair or unfair?

4. Now in the above case the boost thread will always spawn a task into its own readypool and the worker threads will try to always steal from the ready pool of the boost thread. Is it possible that I target the spawn into the ready pools of other worker threads in the scheduler( may be randomly)? Because it makes sense for me to that because the boost thread will never process tasks in its own ready pool( they are always stolen by worker threads in the scheduler).

--------
Sankar

45 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

1. There is no separate active scheduler, so a worker thread just churns along until it needs to find something else to do, at which time it interacts with the scheduler-related data structures.

2. Nothing comes to mind at this time.

3. In TBB, "stealing" means taking from another thread. If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them. If you wait_for_all(), then the thread will not be available to process further packets until all tasks have completed, so if you want to be able to race ahead of processing you might allocate an empty_task, to which you add tasks using allocate_additional_child_of(), if you can accept that these will be processed in LIFO order, otherwise you have to store the items outside of the task scheduler, e.g., in a concurrent_queue,and have each task take the oldest one.

4. You could try all sorts of things, but I see no reason to assume that this would be something that needs investigating.

TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Queues must be correctly padded to prevent false-sharing of course.
The easiest way to handle worker thread blocking (when all the queues are empty) is to use hybrid spin waiting. And kernel blocking is a bit trickier, it can be easily handled with an eventcount, but unfortunately my eventcount proposal is still not accepted to the main codebase.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov
TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
[...]

That seems like a big fight to avoid a small fight. :-)

Quoting - Raf Schietekat

3. In TBB, "stealing" means taking from another thread. If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them. If you wait_for_all(), then the thread will not be available to process further packets until all tasks have completed, so if you want to be able to race ahead of processing you might allocate an empty_task, to which you add tasks using allocate_additional_child_of(), if you can accept that these will be processed in LIFO order, otherwise you have to store the items outside of the task scheduler, e.g., in a concurrent_queue,and have each task take the oldest one.

Well I dont plan to use wait_for_all(). Im only going to make the boost thread call only spawn() and wait on the next data packet to arrive. Then this means the boost thread is not executing any of the tasks. All the tasks spawned by the boost thread are stolen by threads in the Scheduler.

So I dont understand what you mean by saying "If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them".

Now in this case since all the worker threads in the scheduler are stealing tasks, the processing of data packets will be in FIFO order I guess.

Quoting - Dmitriy Vyukov
TBB task scheduler is a bad fit for your problem. You will get basically no advantages from it, and will have to fight it's obstacles.
I would recommend something like the following.
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.
Queues must be correctly padded to prevent false-sharing of course.
The easiest way to handle worker thread blocking (when all the queues are empty) is to use hybrid spin waiting. And kernel blocking is a bit trickier, it can be easily handled with an eventcount, but unfortunately my eventcount proposal is still not accepted to the main codebase.

Well I was also thinking of having N worker threads and N concurrent queues and distributing them( of course I dint think of how to handle worker thread blocking, kernel blocking). Well the only problem is that I would not be able to reap the other benefits of TBB scheduler. The tasks which my IO thread submits are recursively splittable tasks which can be heavily parallelized.

Now if I have to be able to support recursive splitting and continuation passing and stuff like that, will I not end up writing something like tbb scheduler again?

and what is the difference between when you say worker thread blocking and kernel blocking?

Quoting - Raf Schietekat

1. There is no separate active scheduler, so a worker thread just churns along until it needs to find something else to do, at which time it interacts with the scheduler-related data structures.

I dint understand this yet. So if I create a threadpool with number of threads equal to one less than actual physical threads and make the main thread in my application submit tasks to the threadpool ( and also make the main thread participate in executing the tasks whenever it is free), Can I say that my threadpool does fair scheduling? Or does the TBB scheduler do something other than this to say that it does fair scheduling?

Quoting - Shankar
So I dont understand what you mean by saying "If your boost thread spawns 4 tasks, 3 tasks may typically get stolen because the thread itself would have started executing one of them".

Now in this case since all the worker threads in the scheduler are stealing tasks, the processing of data packets will be in FIFO order I guess.

Then it seems like all tasks would in fact be stolen by other threads. It just wasn't entirely clear to me that this is what you meant.

I guess that you guess correctly... until it changes around again? :-)

"I dint understand this yet. So if I create a threadpool with number of threads equal to one less than actual physical threads and make the main thread in my application submit tasks to the threadpool ( and also make the main thread participate in executing the tasks whenever it is free), Can I say that my threadpool does fair scheduling? Or does the TBB scheduler do something other than this to say that it does fair scheduling?"
It used to be that both local processing and stealing were LIFO, which was definitely unfair. Since 20090313 (according to CHANGES) it seems that stealing is done FIFO, so now at least the thieves are fair (who would have expected that!). If all your work is stolen from one parent, that may be all you need, because any remaining unfairness is purely in the interest of getting individual data packets processed faster (assuming their processing is subdivided).

So some of what I wrote earlier was based on behaviour that has changed (sorry for that, and thanks for drawing my attention to the matter), and it seems you can entirely rely on the scheduler after all.

(Added) You can't have the spawning thread participate in processing, though, otherwise there is a chance that some data packets will get buried under other work. But you may not even have to account for it when deciding on the number of TBB threads, unless it does something expensive that would cause oversubscription of processing resources anyway, but I don't expect that to be the case.

Quoting - Shankar
Well I was also thinking of having N worker threads and N concurrent queues and distributing them( of course I dint think of how to handle worker thread blocking, kernel blocking). Well the only problem is that I would not be able to reap the other benefits of TBB scheduler. The tasks which my IO thread submits are recursively splittable tasks which can be heavily parallelized.

Now if I have to be able to support recursive splitting and continuation passing and stuff like that, will I not end up writing something like tbb scheduler again?

Well, requirement to do intra-task parallelization changes picture.
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Frankly I do not know what is the best/official way to do that. Probably you can modify the scheme I've described in the following way. Worker thread (in your thread pool) instead of directly processing a packet submits a task to TBB scheduler. This way you will get around blocking of IO thread and scheduler unfairness.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.

Would it be a significant improvement to distribute the work, do you think? It doesn't seem unlikely, because workers wouldn't always need to find work by stumbling upon it by chance, as they normally do. If so, could this also beachieved using affinity (I haven't used that yet, and I see no statement about FIFO or LIFO)?

Quoting - Dmitriy Vyukov
In such situation you indeed better to somehow hack TBB task scheduler to do this.

Why?

Quoting - Shankar

and what is the difference between when you say worker thread blocking and kernel blocking?

I just meant how does worker thread handle absence of work. It may just sleep for some time and then recheck queues, sleep for some time, then recheck queues, and so on (this is called spinning). Or it may block on some kernel primitive (semaphore, condition variable, event) until new work arrive (this is called kernel blocking).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

Quoting - Dmitriy Vyukov
In such situation you indeed better to somehow hack TBB task scheduler to do this.

Why?

To not reimplement it.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

To not reimplement it.

Why change anything, I mean?

Quoting - Raf Schietekat

Quoting - Dmitriy Vyukov
Create N worker threads. Create N FIFO concurrent queues. IO thread submits new packets to FIFO queues in a round-robin fashion. Worker threads try to get next packet to process from "own" queue first, if empty try to steal from other queues.

Would it be a significant improvement to distribute the work, do you think? It doesn't seem unlikely, because workers wouldn't always need to find work by stumbling upon it by chance, as they normally do. If so, could this also beachieved using affinity (I haven't used that yet, and I see no statement about FIFO or LIFO)?

Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue.
What can be achieved with affinity?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

Why change anything, I mean?

To get around IO thread blocking until previous packet processing is completed, for example.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue."
Hmm, you're the one who proposed to use N queues, so now I'm a bit confused.

"What can be achieved with affinity?"
I'm trying to find a possible rationale for your proposal with N queues, and bring it over to the TBB side. Since there is only one thread to steal from, and there's no way to tell the other threads about that, there might be some overhead by letting them find the stealable work by accident (this thread? no. that thread? no. ... yonder thread? yes.). Instead of using explicit queues, I wanted to know your opinion on whether the affinity mailboxes (sic?) could fill that role (they would have to be FIFO for that to work, though).

"To get around IO thread blocking until previous packet processing is completed, for example."
I think we established that this is not an issue: the I/O thread is kept separate.

Quoting - Dmitriy Vyukov
Well, requirement to do intra-task parallelization changes picture.
In such situation you indeed better to somehow hack TBB task scheduler to do this.
Frankly I do not know what is the best/official way to do that. Probably you can modify the scheme I've described in the following way. Worker thread (in your thread pool) instead of directly processing a packet submits a task to TBB scheduler. This way you will get around blocking of IO thread and scheduler unfairness.

However note that your tasks must be large enough in order to intra-task parallelization to make sense. By large enough I mean, let's say, larger than 1 ms. Otherwise you better not do any intra-task parallelization at all.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

"Since tasks are BIG, distibution will make little sense. Simplier design will be to use single centralized queue."
Hmm, you're the one who proposed to use N queues, so now I'm a bit confused.

Initially I assumed that tasks are not that big, so I proposed distributed queues.
Then Shankar said that intra-task parallelization is worth doing, this means that tasks are large enough. So I said that if tasks are large, there is little sense in distributed queues.

Quoting - Raf Schietekat

"What can be achieved with affinity?"
I'm trying to find a possible rationale for your proposal with N queues, and bring it over to the TBB side. Since there is only one thread to steal from, and there's no way to tell the other threads about that, there might be some overhead by letting them find the stealable work by accident (this thread? no. that thread? no. ... yonder thread? yes.). Instead of using explicit queues, I wanted to know your opinion on whether the affinity mailboxes (sic?) could fill that role (they would have to be FIFO for that to work, though).

That's how TBB works anyway, when thread is idle it polls other queues. The overhead is usually not that big, and note that overhead is associated with a thread that is otherwise idle.
And it's not the case that there is only one thread to steal from. All threads but one may have not empty queues.
As a possible optimization you may also apply the following. When IO thread unblocks any worker threads, it communicates to then pointer to the non-empty queue (the queue it's just enqueued task to). So that worker threads have some idea as to what queue to check first.

What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?

Quoting - Raf Schietekat

"To get around IO thread blocking until previous packet processing is completed, for example."
I think we established that this is not an issue: the I/O thread is kept separate.

Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"So I said that if tasks are large, there is little sense in distributed queues."
Why is that decision related to size?

"And it's not the case that there is only one thread to steal from. All threads but one may have not empty queues."
In this case it is in fact just the I/O thread that has (the most eligible) stealable work.

"What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?"
Explicit queues are so... explicit. :-)

"Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)"
Seems more like normal use to me.

Quoting - Raf Schietekat

"So I said that if tasks are large, there is little sense in distributed queues."
Why is that decision related to size?

Distributed queues/deques are employed in order to minimize per-task overheads. If your tasks are so large that current per-tasks overheads are negligible, then you do not need distributed queues/deques.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Distributed queues/deques are employed in order to minimize per-task overheads. If your tasks are so large that current per-tasks overheads are negligible, then you do not need distributed queues/deques.

I will have to take your word on that, because I don't know enough about queue implementation to draw any conclusion about per-task overhead.

Quoting - Raf Schietekat

"And it's not the case that there is only one thread to steal from. All threads but one may have not empty queues."
In this case it is in fact just the I/O thread that has (the most eligible) stealable work.

Nope, in my design IO thread immediately evenly distributes tasks over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other.

And if you switch to discussion of another design, then I do not quite see the point anyway. If only IO thread has stealable work, then about what polling and about what inefficiencies are you talking about? Other threads will steal work only from IO thread, no polling, no inefficiencies.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

"What exactly do you mean by affinity mailboxes? How do they differ from explicit queues?"
Explicit queues are so... explicit. :-)

Sorry, I do not understand you.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

"Do you mean hack with allocate_additional_child_of()? Well... it's one of the options to "hack TBB scheduler" :)"
Seems more like normal use to me.

As far as I understand any normal use of TBB implies that thread that submits root task blocks until it completes. And that's what we are trying to avoid here. Any way around that is a hack IMHO.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

#23 "Nope, in my design IO thread immediately evenly distributes tasks over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other."
Well, "in this case" means(mostly) the original design. In this case (still the same...), the only thing that workers could steal from each other is parts of a packet, but it would be better to steal a full packet from the I/O thread instead.

#23 "And if you switch to discussion of another design, then I do not quite see the point anyway. If only IO thread has stealable work, then about what polling and about what inefficiencies are you talking about? Other threads will steal work only from IO thread, no polling, no inefficiencies."
Stealing is by random choice, so with a lot of worker threads there's a lot of opportunity to make a wrong or suboptimal choice, which can lead to a potentially significant waste of performance.

#24 "Sorry, I do not understand you."
Affinity mailboxes seem to offer exactly the qualities that you would look for to help workers find the most useful thing to do: a pointer to the task, and built-in overridability in case another thread wants to have a go. So why reimplement that outside of the scheduler...

Quoting - Raf Schietekat

I will have to take your word on that, because I don't know enough about queue implementation to draw any conclusion about per-task overhead.

Queue implementation is irrelevant here.
Regardless of the queue implementation distributed queues (where only 1 thread accesses each end of a queue most of the time) have lower overhead than single centralized queue. So if tasks are relatively small you prefer distributed queues.
And if intra-task parallelization is worth doing, i.e. tasks are at least 1 ms, then queuing overheads are irrelevant. Because every sane queue will have negligible overheads.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Sankar,

Coordinating compute class tasks with I/O bound tasks is problematic with TBB. This is not the case with other parallel tasking systems such as QuickThread (www.quickthreadprogramming.com) which has two classes of thread pools (compute and I/O) with an elegant inter-class communication capability. Sketch of your program using QuickThread.

// compute class task to process packet
void PacketProcessTask(Packet_T* packet)
{
	// may spawn multiple tasks
	// by using parallel_... templates
	// may include ArbitraryTask() below
}

// qtControl for I/O class task to write packet
qtControl SocketWriteControl;
// I/O class task to write packet
void SocketWrite(Packet_T* packet)
{
	// code to write to socket
	WriteToSocket(packet); // may wait for completion
}

// qtControl for I/O class task to read packets
// until done or until fatal error
qtControl SocketReadControl;
// I/O class task to read packets
// until done or until fatal error
void SocketRead()
{
	bool Done = False;
	while(!Done)
	{
		Packet_T* packet = ReadNextData(); // may wait for packet
		if(packet)
		{
			// schedule compute class task
			parallel_task(PacketTask, packet)
		}
		else
		{
			// check for recoverable error
			// or fatal error
		}
	}
}

// arbitrary compute class task
// which includes enque of packet to SocketWrite I/O class thread
void ArbitraryTask()
{
	// code here
	// ...
	// now write packet using I/O class thread
	parallel_task(IO$, &SocketWriteControl, SocketWrite, packet);
	// code here
	// ...
}

// application
int _tmain(int argc, _TCHAR* argv[])
{
	qtInit(-1, 3);	// compute class = # HW threads
			// I/O class uses 3 threads
	// start "forever" SocketRead I/O class task
	parallel_task(IO$, &SocketReadControl, SocketRead);
	// While above runs
	// run parallel app code here
	// ...
                // end of program cleanup
	// Assure SocketRead task completed (with/without error)
	SocketReadControl.WaitTillDone();
	// SocketWrite may have pending I/O
	SocketWriteControl.WaitTillDone();
	return 0;
}

The TBB developers need to examine ways to resolve this issue. What is true above holds true for similar parallel programming requirements such as parallel_pipeline. The QuickThread implementation of parallel_pipeline makes use of both classes of thread pools. With occasional proding from others like me, the TBB development team will address this issue.

Dmitriy's suggestion of multiple dedicated queues is a good solution for certain types of problems. The sketch above is more of a general solution. Optimal performance may require a blend of techniques.

Jim Dempsey
www.quickthreadprogramming.com

www.quickthreadprogramming.com

Quoting - Dmitriy Vyukov
As far as I understand any normal use of TBB implies that thread that submits root task blocks until it completes. And that's what we are trying to avoid here. Any way around that is a hack IMHO.

More like a normal workaround to bridge the blocking-I/O world and the computational world, IMHO. Besides, TBB uses this internally several times.

Quoting - Dmitriy Vyukov

I just meant how does worker thread handle absence of work. It may just sleep for some time and then recheck queues, sleep for some time, then recheck queues, and so on (this is called spinning). Or it may block on some kernel primitive (semaphore, condition variable, event) until new work arrive (this is called kernel blocking).

Oh ok. This is clear.

Quoting - Dmitriy Vyukov

However note that your tasks must be large enough in order to intra-task parallelization to make sense. By large enough I mean, let's say, larger than 1 ms. Otherwise you better not do any intra-task parallelization at all.

My tasks are definitely more than 1ms. I can even say for sure that the split tasks would take more than 1 ms.

Quoting - jimdempseyatthecove
Coordinating compute class tasks with I/O bound tasks is problematic with TBB. This is not the case with other parallel tasking systems such as QuickThread (www.quickthreadprogramming.com) which has two classes of thread pools (compute and I/O) with an elegant inter-class communication capability.

Oh ok. Is QuickThread available so that I can try my sample there?

Raf Schietekat, Dmitriy Vyukov

Ok from what I can understand from all of your inputs is the following :

1. Since the frequncy at which the data packets arrive is not deterministic and its IO bound, it makes sense to create an IO thread to receive data packets.

2. Since my tasks are sufficiently intra parallelizable( and also recursively parallelizable) it makes sense to use the TBB scheduler.

2. However the only caveat with the above solution is that the worker threads in TBB scheduler would make a number of unsuccessful( or inefficient) stealing efforts in trying to steal from the other worker thread's ready pool. So it makes sense to some how target the tasks made by the IO thread on to the worker threads.

So to workaround this, one way is to make the IO thread create N dummy tasks and create additional_child_of the dummy tasks( as and when a data packet is available) in a round robin fashion. So this way the tasks created by the IO thread under a dummy tasks would be targeted to the worker thread that stole the dummy task. Is this correct?

What I dint understand is about that affinity stuff that you were talking about. Will it help me in any way?

I was just speculating whether this was part of the rationale behind Dmitriy's alternative design. But unless the tasks are smallish (they aren't) and you have dozens of hardware threads (you probably don't), I wouldn't really worry about this yet (seems like premature optimisation at this point, and you know what somebody famous said about that).

(Affinity marks a task as a preferable target for a specific thread to steal instead of a randomly chosen one, and also tells that thread where to find it. Its normal purpose in TBB is to enhance cache locality.)

Quoting - Shankar

Oh ok. Is QuickThread available so that I can try my sample there?

First read the documentation on www.quickthreadprogramming.com
This is currently a Windows only solution (Linux is under development).

If your sample program is relatively simple I can take a look at it for comments.
use jim at quickthreadprogramming dot com.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - Raf Schietekat

#23 "Nope, in my design IO thread immediately evenly distributes tasks over a pool of worker threads. So all worker threads may have work to do, and they may steal from each other."
Well, "in this case" means(mostly) the original design. In this case (still the same...), the only thing that workers could steal from each other is parts of a packet, but it would be better to steal a full packet from the I/O thread instead.

You answered to my post, and you mention N queues, so I was thinging that you are talking about my design. And I do not understand what's the original design.
Anyway, yes, IMHO, it would be better to get full packet from IO thread. That's how .NET TPL handle things. They have separate queue for "root" tasks submitted from external threads. If thread is out of work then it checks root queue first. This ensures that if there are enough separate root "tasks" then all the workers will work on separate root "tasks", instead of helping each other. IMHO that's reasonable.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

#24 "Sorry, I do not understand you."
Affinity mailboxes seem to offer exactly the qualities that you would look for to help workers find the most useful thing to do: a pointer to the task, and built-in overridability in case another thread wants to have a go. So why reimplement that outside of the scheduler...

Ah, I see, you seem to mean TBB internal affinity mailboxes. Humm, well, affinity mailboxes may improve situation somehow. However note that other threads can't steal from mailboxes. So if thread is out of work and it's mailbox is empty then it will probably steal sub-task (packet part) from another worker, instead of picking brand new whole packet.
And you still have to get around IO thread blocking somehow...

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Raf Schietekat

More like a normal workaround to bridge the blocking-I/O world and the computational world, IMHO. Besides, TBB uses this internally several times.

Ok, let's call it "workaround" instead of "hack", I actually do not care too much about that.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

#36 "Anyway, yes, IMHO, it would be better to get full packet from IO thread."
I'm assuming that step 5 in the scheduling algorithm implies that TBB won't see those first, but then it also never says anything about only stealing at greater depth than where it is now.

#37 "However note that other threads can't steal from mailboxes."
Open Source Reference.pdf 1.16 of 2009-07-03 says in 10.3.8 Affinity: "The id is a hint and may be ignored by the scheduler."

LIFO/FIFO is still a potential concern, however: some source comments say "mailbox queue" (FIFO), but the code itself uses names like push and pop (LIFO).

#38 "Ok, let's call it "workaround" instead of "hack", I actually do not care too much about that."
It was just confusing to read "In such situation you indeed better to somehow hack TBB task scheduler to do this.", which seemed to mean changing the library itself.

Quoting - Raf Schietekat

#37 "However note that other threads can't steal from mailboxes."
Open Source Reference.pdf 1.16 of 2009-07-03 says in 10.3.8 Affinity: "The id is a hint and may be ignored by the scheduler."

Sorry, of course, these tasks can be stolen. I meant that only mailbox owner gives priority to affinitized tasks, and other threads will ignore the fact that task is in mailbox. So I guess that when a thread steals task in most cases it will steal packet part instead of whole packet.
So mailboxes do not give you global "steal priorities", they give you only local "steal priorities", so to say.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Sorry, of course, these tasks can be stolen. I meant that only mailbox owner gives priority to affinitized tasks, and other threads will ignore the fact that task is in mailbox. So I guess that when a thread steals task in most cases it will steal packet part instead of whole packet.
So mailboxes do not give you global "steal priorities", they give you only local "steal priorities", so to say.

I agree, but, while performance may be suboptimal, it may also be a good enough return for the implementation price, at least initially.While processing is lagging, which is when optimal processing is most important, all workers are likely busy on their own assigned work; it's only when a few have caught up already that efficiency would drop somewhat.

I'm not disputing that, in general, the best solution for infinite workloads will likely have to sidestep the scheduler. Using a queue seems entirely appropriate in this case. But, if nobody exercises the scheduler for infinite workloads, where's the incentive to improve it? :-)

Quoting - Shankar
I have a couple of questions regarding TBB's unfair scheduling.

1. I read through the user manual that describes Task Scheduler and this what it says

" Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called fair, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress."

And as I understand the above TBB scheduler makes sure that the thread executing a task runs until the task completes. Is this correct? If yes how is this done?

2. "Indeed, it often delays starting a task until it can make useful progress."
What does the above line mean?

3. Im planning to use the TBB scheduler in my application wherein I have a thread(boost thread) which waits on data packets to arrive. The boost thread creates a tbb::task and spawns it to the TaskScheduler and waits for taking the next data packet. The frequency at which the data packets arrive is not determinate( they may arrive at high frequency and sometimes at a low frequency). Hence the decision to use boost thread to wait on the data packets and spawn tasks for processing the packet.

And lets say I initialize the scheduler with number of physical threads( lets say 4). Also lets assume the boost thread reads four data packets and spawns tasks for them to the task scheduler( using task::spawn() ) and waits for further data packets to arrive. All the four worker threads in the scheduler steals a task each and start executing the stolen task. Now lets say a data packet arrives when all the worker threads are busy executing their task.

Now the question is when will the boost thread get its turn for collecting the data packet and spawning a task. Will it get its turn only when one of the worker thread finishes executing the task? How will the overall scheduling work now? Will it be fair or unfair?

4. Now in the above case the boost thread will always spawn a task into its own readypool and the worker threads will try to always steal from the ready pool of the boost thread. Is it possible that I target the spawn into the ready pools of other worker threads in the scheduler( may be randomly)? Because it makes sense for me to that because the boost thread will never process tasks in its own ready pool( they are always stolen by worker threads in the scheduler).

--------
Sankar

Sankar, I am not a pro like some others here, but I ask, why would you dedicate a whole thread to solely wait on data from the network?

Wouldn't it be faster and more efficient to override the task scheduler for something like this:

task::execute(){
spawn( Read data from the network)
spawn(do something else concurrently using the other threads that are doing nothing)
}
task::execute(){
spawn(process the data just read in parallel)
}
task::execute(){
spawn(send out data to the network)
spawn(do something else concurrently using the other threads that are doing nothing)
}

I am getting tired so i just broke down the general program. Your program needs to be broke into three parts. Unfortunately, when network data processing comes in, this is your best bet. In your program, one entire thread is essentially, dead. It's sole existence is to wait on data. Sure, there might be some sleep time in there, but from what I have read, I don't think that was your plan. And, that would actually slow your whole program down too. So, if you are running a quad core, you are down 25%. The real goal is to find work for your threads. Always keep the thread busy and use implicit locking and you will be golden.

I belive the above outline is a faster way to process network data than to have on thread running concurrently with the rest of the program checking for data and filling some kind of a queue with data received off the network. The queue would be under heavy load from these worker threads checking for data continuously.

Now, you may have to restructure your program to find work for your threads while you are reading and writing to the network, and I know you can do it, because I did. If you minimize the amount of dead time your threads spend doing nothing, you will speed up your program...... isn't that obvious? so.... find work for your threads!

"why would you dedicate a whole thread to solely wait on data from the network?"
Because TBB assumes that hardware threads executing tasks shouldn't be bothered with anything else, so as a general rule any potentially significant waiting should be done in an external thread.

(Added) That one external thread might also handle all communications at once, but if you spawn tasks to wait on individual data channels, each of those tasks blocks a whole software thread (by default one per hardware thread), and you may easily starve the whole process even with many hyperthreaded cores available.

Sorry for the delayed reply.

Quoting - smasherprog
In your program, one entire thread is essentially, dead. It's sole existence is to wait on data. Sure, there might be some sleep time in there, but from what I have read, I don't think that was your plan. And, that would actually slow your whole program down too. So, if you are running a quad core, you are down 25%

What I would do is deffer the creation of threads in the scheduler as there are cores and not as default(wherein the scheduler creates one thread lesser than cores). And I would also have an extra IO thread which would wait on data and just spawn tasks. This way Im not down by 25%.

Leave a Comment

Please sign in to add a comment. Not a member? Join today