tbb::task has its own context?

Does a tbb::task have its own context, e.g. its own stack like coroutines or fibers?

I am trying to evaluate whether it is possible to write code similar to Go routines with TBB, i.e. each task has its own stack and can be suspended in its current stack frame and resumed later (perhaps while waiting for some external event).
 

Raf Schietekat's picture

A task is executed on a thread, using that thread's stack. The thread may switch to other tasks before the task's execution completes, but the task's execution stays with that thread. If the task is recycled, it behaves as if it were a new task. There are lots of details here, but basically a task is like a functor.

In Go, go routines have small but dynamically growing stacks, something that TBB does not easily or portably do, although GCC has recently added provisions for that. However, that is not enough to scale to the numbers of go routines that are present in a Go program: somehow a thread context switch is far more expensive than a go routine context switch, so they are probably multiplexed on traditional threads, perhaps like the "green threads" earlier in Java; TBB does not incur this thread context switch because each thread just keeps executing tasks, which is part of its efficiency, together with the effect on cache locality of its scheduling details. There would also be an issue with primitives married to the thread model like mutual exclusion primitives or thread-local storage that would have to be resolved before the same model could be adopted in C++.

I like the ideas in Go, but they have the advantage of starting from scratch. I would also be very interested in any in-depth review of the issues.

A different matter is whether TBB is a good alternative for things you can do in Go. Clearly you have to do things differently, because go routines can block without problems, and this is a big no-no with TBB: instead, you would use a continuation task, and find a way to activate it in some kind of reactor pattern after the blocking call returns, perhaps portable like Boost.Asio or perhaps an environment-specific asynchronous interface. You could also keep these things in traditional threads, and use TBB only for those parts that are only CPU-bound and that will not block themselves.

How am I doing so far?

Raf Schietekat's picture

It is a fascinating topic. In TBB, tasks are generally intended as low-level plumbing for more convenient reusable algorithms, but those algorithms are never a complete set for all purposes, so sometimes somebody has to implement new ones based on tasks, or perhaps the problem cannot be generalised and has to be implemented at least partly in terms of tasks.

The big elephant in the room is blocking. This is where reactive programming comes in… but it's not the most intuitive of approaches, leaving you with a choice of callback hell or lots of little functions (right?). So the apparent solution for that is a new language (Go), based on Hoare's communicating sequential processes idea. It is only natural to ask if this couldn't just replace an approach like TBB altogether.

But then you are also asking questions about tool support and established code base: even TBB is more or less the portable equivalent of Cilk (Plus), giving up some of its theoretical elegance for a far better fighting chance. When you arrive new on the scene as a developer, new tools are always better because they allow you to quickly build experience equalling or exceeding that of your seniors, but after a while you get to realise that this is not a scalable approach (either individually or collectively), that you should definitely keep track of new developments, but that it's a huge deal to switch over entirely and bet the farm on the survivability, let alone dominance, of a new platform. Go has great backing and momentum, but it's still nowhere near C and C++, so you can bet part or all of your time as an individual or team or small company, but probably not more than that just yet.

My intuition says that TBB is incomplete, because it does not incorporate a reactor engine like Boost Asio. The latter one is incomplete because it does not have the fancy task scheduler and algorithms for architecture-oriented performance that TBB has (if one overlooks the NUMA thing that is not yet part of the story). But together they are probably a sufficient foundation to implement something like Go, and by that I mean doing similar things manually or translating a language like Go using some kind of precompiler.

How about coroutines? Well, Go 1.2 added preemptive scheduling for go routines, so maybe cooperative scheduling isn't really all you ever wanted, even without the compatibility issues for thread-local storage and mutexes etc.

I also have questions about Go (that maybe don't all belong here), like where it stands on cache-related behaviour (TBB has depth-first scheduling locally and breadth-first scheduling when stealing), and the paradox that Java abandoned green threads reportedly because of performance reasons while Go is listed as using green threads.

Then again, it's also called software engineering, and engineering is all about making a suitable compromise out of imperfect resources.

Let me see if I can lure somebody else in here who might well have some interesting things to say...

Quote:

Raf Schietekat wrote:

A task is executed on a thread, using that thread's stack. The thread may switch to other tasks before the task's execution completes, but the task's execution stays with that thread. If the task is recycled, it behaves as if it were a new task. There are lots of details here, but basically a task is like a functor.

AFAIK you have to derive from tbb::task and implement tbb::task::execute(), i.e. your action/code is located inside the stack frame of tbb::task::execute(). Because you say that a task is like a functor, I assume that "the thread may switch to other tasks before the task's execution completes" means that TBB calls/executes other tasks (child tasks?) inside the (parent) task's tbb::task::execute(). That means the parent task resides on the stack and each child task is put on the stack as well, so you can only return to the parent task's tbb::task::execute() by unwinding the stack (i.e. the child tasks must finish). Probably therefore TBB -> graph of tasks. Is this model of TBB correct?

Quote:

Raf Schietekat wrote:

In Go, go routines have small but dynamically growing stacks, something that TBB does not easily or portably do, although GCC has recently added provisions for that. However, that is not enough to scale to the numbers of go routines that are present in a Go program: somehow a thread context switch is far more expensive than a go routine context switch, so they are probably multiplexed on traditional threads, perhaps like the "green threads" earlier in Java; TBB does not incur this thread context switch because each thread just keeps executing tasks, which is part of its efficiency, together with the effect on cache locality of its scheduling details. There would also be an issue with primitives married to the thread model like mutual exclusion primitives or thread-local storage that would have to be resolved before the same model could be adopted in C++.

You could implement a context switch (similar to Go) in C++ too (OK, it requires some assembler code). But I understand that TBB does not do that; TBB == chains of tasks on the stack of one OS thread.

Quote:

Raf Schietekat wrote:

A different matter is whether TBB is a good alternative for things you can do in Go. Clearly you have to do things differently, because go routines can block without problems, and this is a big no-no with TBB: instead, you would use a continuation task, and find a way to activate it in some kind of reactor pattern after the blocking call returns, perhaps portable like Boost.Asio or perhaps an environment-specific asynchronous interface. You could also keep these things in traditional threads, and use TBB only for those parts that are only CPU-bound and that will not block themselves.

I'm confused: does the TBB continuation task use context switches (as described above, i.e. storing/restoring some CPU registers + the stack pointer)?

What I have in mind is to make code in the event-driven model using asynchronous operations (as Boost.Asio provides) look like sequential code while still operating asynchronously. Boost.Asio already provides such a coding style with its async-result feature (it uses Boost.Coroutine). I was wondering whether this pattern could be implemented with TBB too. After reading your comment I assume it can't.

Quote:

Raf Schietekat wrote:

It is a fascinating topic. In TBB, tasks are generally intended as low-level plumbing for more convenient reusable algorithms, but those algorithms are never a complete set for all purposes, so sometimes somebody has to implement new ones based on tasks, or perhaps the problem cannot be generalised and has to be implemented at least partly in terms of tasks.

But in the TBB world you are limited to the stack frame of tbb::task::execute(), right? I mean you cannot jump out of tbb::task::execute() and leave its stack frame intact (e.g. local data etc.) and resume it at a later time.

Quote:

Raf Schietekat wrote:

The big elephant in the room is blocking. This is where reactive programming comes in… but it's not the most intuitive of approaches, leaving you with a choice of callback hell or lots of little functions (right?). So the apparent solution for that is a new language (Go), based on Hoare's communicating sequential processes idea. It is only natural to ask if this couldn't just replace an approach like TBB altogether.

But then you are also asking questions about tool support and established code base: even TBB is more or less the portable equivalent of Cilk (Plus), giving up some of its theoretical elegance for a far better fighting chance. When you arrive new on the scene as a developer, new tools are always better because they allow you to quickly build experience equalling or exceeding that of your seniors, but after a while you get to realise that this is not a scalable approach (either individually or collectively), that you should definitely keep track of new developments, but that it's a huge deal to switch over entirely and bet the farm on the survivability, let alone dominance, of a new platform. Go has great backing and momentum, but it's still nowhere near C and C++, so you can bet part or all of your time as an individual or team or small company, but probably not more than that just yet.

I agree. I want something similar to Go in C++.

Quote:

Raf Schietekat wrote:

I also have questions about Go (that maybe don't all belong here), like where it stands on cache-related behaviour (TBB has depth-first scheduling locally and breadth-first scheduling when stealing), and the paradox that Java abandoned green threads reportedly because of performance reasons while Go is listed as using green threads.

Hmm, a context switch (-> cooperative scheduling, C++ + assembler) takes 10 ns on my 7-year-old machine (x86_64, quad core); I don't find that too expensive. I'm wondering why the Java community abandoned green threads.

Raf Schietekat's picture

Quote:

Oliver K. wrote:

Is this model about tbb correct?

Mostly, except that the descendants could be "stolen" by other threads, and the thread might itself steal other tasks.

Quote:

Oliver K. wrote:

I'm confused: does the TBB continuation task use context switches (as described above, i.e. storing/restoring some CPU registers + the stack pointer)?

No, it's just another task (perhaps recycled) with the same parent, so that the current task can finish and vacate the stack. The continuation is invoked when its children finish, but of course the execution method will be called from the beginning. Until it is called, the continuation only lives on the heap.

 

Quote:

Oliver K. wrote:

What I have in mind is to make code in the event-driven model using asynchronous operations (as Boost.Asio provides) look like sequential code while still operating asynchronously. Boost.Asio already provides such a coding style with its async-result feature (it uses Boost.Coroutine). I was wondering whether this pattern could be implemented with TBB too. After reading your comment I assume it can't.

If that's all you want to do, you should probably stick with Boost Asio for its reactor functionality.

 

Quote:

Oliver K. wrote:

But in the TBB world you are limited to the stack frame of tbb::task::execute(), right? I mean you cannot jump out of tbb::task::execute() and leave its stack frame intact (e.g. local data etc.) and resume it at a later time.

Correct, but that's not an issue for implementing CPU-bound throughput-optimising code, which TBB does very well.

Quote:

Oliver K. wrote:

Hmm, a context switch (-> cooperative scheduling, C++ + assembler) takes 10 ns on my 7-year-old machine (x86_64, quad core); I don't find that too expensive. I'm wondering why the Java community abandoned green threads.

It will probably take some more experimenting before a different context model than standard threads solidifies. It's a tug of war between performance and continued support for the things everybody agrees you can do with a thread. Even standard C++11 futures still have issues.

Raf Schietekat's picture

Apparently green threads in Java used a single native thread, which would certainly be a performance-related problem. But the same source also says that the current implementation is in fact many-to-many, on Solaris anyway. I didn't immediately find more information, but that seems like the right thing to do, whether still using the name green threads or not. Well, it still depends on some other details (preemption, scheduling, stack allocation details), but I don't see that it would be possible to have the same number of Java threads as you might have go routines. More info needed...

(2014-03-21 Added) It's difficult to find more information. I added the reference above because I remembered having seen it before (long ago), but I've never seen corroborating evidence: all other sources point to green (many-to-one, on one) at first and then native (one-to-one, on many). Maybe it was an attempt on Solaris only in a previous implementation but subsequently abandoned by HotSpot etc.?

Quote:

Raf Schietekat wrote:

Apparently green threads in Java used a single native thread, which would certainly be a performance-related problem.

I don't know the timings for Java, but in the context of C++, using green threads together with asynchronous I/O will provide better performance.

Quote:

Raf Schietekat wrote:

But the same source also says that the current implementation is in fact many-to-many, on Solaris anyway. I didn't immediately find more information, but that seems like the right thing to do, whether still using the name green threads or not. Well, it still depends on some other details (preemption, scheduling, stack allocation details), but I don't see that it would be possible to have the same number of Java threads as you might have go routines. More info needed...

My intention was to figure out whether it's possible to combine asynchronous I/O (for instance Boost.Asio) and TBB, and especially how TBB performs in such a context, but it seems that TBB is not designed for this case.

Raf Schietekat's picture

Quote:

Oliver K. wrote:

I don't know the timings for Java, but in the context of C++, using green threads together with asynchronous I/O will provide better performance.

Agreed.

Quote:

Oliver K. wrote:

My intention was to figure out whether it's possible to combine asynchronous I/O (for instance Boost.Asio) and TBB, and especially how TBB performs in such a context, but it seems that TBB is not designed for this case.

I think that Boost Asio (or something like it) and TBB are complementary: one provides the asynchronous I/O, the other optimal execution for CPU-bound computation. Of course you might also only need one or the other, or speculate that abundant concurrency of I/O requests makes parallel execution of individual requests redundant or even counterproductive. The basic and very straightforward use case is to call TBB from Boost Asio, although I'm curious to know if this will work with coroutines. The other way around should also work, but it becomes a bit of a juggling act because you have to avoid blocking by using a continuation task that you currently still have to activate by spawning a dummy child.

Just forget about the idea of TBB as a more Go-like experience than Boost Asio.

(2014-03-22 Clarification) Added "of individual requests" (like only parallelising an outer loop if that already provides sufficient parallel slack).

Dmitry Vyukov's picture

Mostly what Raf has said.

In TBB you need to stick to asynchronous continuations, and you can't simply block inside a task. This can lead to a painful programming model if you need to block in lots of places.

One of the significant advantages of the Go concurrency model is that you *can* block. You can block on I/O. And you can block waiting for other goroutines. The Go runtime will magically schedule goroutines in user space (read: fast) and keep stacks small (thanks to segmented stacks, and now even continuous growable stacks).

Stack management is the main roadblock for TBB to support such a model. It's theoretically possible to have segmented stacks in C++ (some software will break, though). But it's impossible to have continuous growable/shrinkable stacks.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Raf Schietekat's picture

Quote:

Dmitry Vyukov wrote:

Stack management is the main roadblock for TBB to support such a model. It's theoretically possible to have segmented stacks in C++ (some software will break, though). But it's impossible to have continuous growable/shrinkable stacks.

GCC now has -fsplit-stack, which seems backward compatible. Could you specify what software "will break"?

What is "continuous growable/shrinkable"? With a segmented stack you can start small and add/discard segments as needed (which should be less of an issue on 64 bits, if pages are indeed individually initialised as they are brought in). Discarded segments can also be unmapped or reused elsewhere, whereas shrinking a linear stack leaves rubbish behind that might then be swapped out, I suppose (or not?). I couldn't compare with "continuous", because I don't know what it means in this context.

How does Go do preemptive scheduling without involving the kernel (presumably)?

TBB executes a task graph with depth-first exploration and breadth-first stealing, for best use of the cache. Is there an equivalent in Go for optimal CPU-bound computing?

And do you happen to know the situation with Java (was/is many-to-many real or not)?

Dmitry Vyukov's picture

Quote:

Raf Schietekat wrote:

GCC now has -fsplit-stack, which seems backward compatible. Could you specify what software "will break"?

Software that assumes that stacks are continuous and that the stack pointer moves in one direction. This includes profiling and debugging libraries (stack unwinding), garbage collectors/leak detectors (usually assume that a stack is something that can be described by a pointer to the beginning and a size) and various systems/hacky software (e.g. measure library call stack consumption by subtracting 2 stack pointers).

As far as I remember it was Vim that compares 2 stack pointers to determine direction in which stack grows. Don't ask me why a text editor needs this.

 

Dmitry Vyukov's picture

Quote:

Raf Schietekat wrote:

What is "continuous growable/shrinkable"? With a segmented stack you can start small and add/discard segments as needed (which should be less of an issue on 64 bits, if pages are indeed individually initialised as they are brought in). Discarded segments can also be unmapped or reused elsewhere, whereas shrinking a linear stack leaves rubbish behind that might then be swapped out, I suppose (or not?). I couldn't compare with "continuous", because I don't know what it means in this context.

Continuous growable/shrinkable stack means that at any point in time the stack occupies a single continuous region of memory, but the region can be grown or shrunk as necessary by means of either copying or shrinking/growing the same memory block (if the memory allocator supports this, e.g. realloc).

In the context of Go we've discovered that segmented stacks have 2 problems:

1. A segment switch has noticeable cost, and if it happens inside an inner loop, it can have a very negative effect on performance. You can easily get a 10x performance hit if you are unlucky. I guess that Intel processors have some hardware support for stacks (manipulating the SP register and accessing memory through it, e.g. PUSH/POP/CALL/RET), and that support assumes that the stack is continuous. So changing SP "on-the-fly" causes a performance hit.

2. These performance penalties are very unpredictable and flaky. E.g. you change the HTTP library and as a result JSON parsing in an HTTP handler suddenly becomes 2x slower.

Because of that, Go is switching to growable/shrinkable stacks.

 

 

Dmitry Vyukov's picture

Quote:

Raf Schietekat wrote:

which should be less of an issue on 64 bits, if pages are indeed individually initialised as they are brought in

Not quite.

What would be a reasonable limit for a stack? Probably 1 GB is a reasonable limit (if you want to pretend that stacks are effectively unlimited). But then you can have at most ~100'000 such stacks on 64-bit Linux (47-bit address space), and only 8'000 stacks on Windows (43-bit address space).

I've heard of a Go server that serves 500'000 TCP connections, which means 500'000 goroutines.

Also stack memory reuse becomes more expensive -- you need to do syscalls to release unused stack pages.

So it would not work for Go.

Dmitry Vyukov's picture

> How does Go do preemptive scheduling without involving the kernel (presumably)?

In order to support segmented/growable stacks, each function first checks whether there is enough stack space for it. Basically, it compares current stack pointer SP with a /stack guard/ variable.

I've exploited it to implement preemption. When runtime needs to preempt a goroutine it modifies the goroutine /stack guard/ to a special value that fails all stack checks. When the goroutine does stack check next time, it decides that it does not have enough stack and calls into runtime to grow stack. Stack growing routine first checks whether the goroutine was requested to be preempted, if so the routine just deschedules the goroutine.

This approach has several nice properties:

1. Preemption is fully user-space.

2. Preemption is fully synchronous, i.e. goroutines are not preempted at arbitrary points (which would cause lots of implementation issues).

3. Preemption has virtually zero cost for the non-preempted case.

 

Dmitry Vyukov's picture

> TBB executes a task graph with depth-first exploration and breadth-first stealing, for best use of the cache. Is there an equivalent in Go for optimal CPU-bound computing?

No, Go does FIFO scheduling and it does not assume a "recursive decomposition" work graph.

It's an open question how to support both computational workloads and general network server type workloads in a single process.

 

> And do you happen to know the situation with Java (was/is many-to-many real or not)?

No, I don't. I think widespread JVMs just use OS threads.

 

Raf Schietekat's picture

Quote:

Dmitry Vyukov wrote:

and various systems/hacky software (e.g. measure library call stack consumption by subtracting 2 stack pointers)

TBB now also wants to know the current fraction of stack use before deciding whether to steal work from another thread.

But if this is widely desired, a small set of tool makers might well decide to all come on board too. For end-level software like vim, the question remains whether the backward compatibility would save it or not (you have to opt in).

Quote:

Dmitry Vyukov wrote:

Continuous growable/shrinkable stack means that at any point in time stack occupies a single continuous region of memory

Ah, I would have called that "contiguous". Sounds expensive, all that moving around. I'm surprised that it was preferable to a segmented stack, I would expect this to have evolved the other way around. Doesn't bode too well for -fsplit-stack then...

Quote:

Dmitry Vyukov wrote:

But then you can have at most ~100'000 such stacks on 64-bit linux (47-bit address space). And only 8'000 stacks on windows (43-bit address space).

So much for 64 bits…

Quote:

Dmitry Vyukov wrote:

3. Preemption has virtually zero cost for the non-preempted case.

That was indeed the tricky part. Neat solution! I do wonder what happens when a routine spends a very long time in a single stack frame, though.

 

Quote:

Dmitry Vyukov wrote:

It's an open question how to support both computational workloads and general network server type workloads in a single process.

TBB for Go… :-)

Thanks, that was enlightening!

 

Dmitry Vyukov's picture

> But if this is widely desired, a small set of tool makers might well decide to all come on board too.

This will take an eternity in the C/C++ world. What if you use TBB and a proprietary driver to access a proprietary DB; the next release of TBB switches to segmented stacks and the DB driver starts crashing... there is not much you can do. The situation is pretty much unfixable for the C/C++ world as a whole. That's not to say that it's impossible to switch a particular isolated project to segmented stacks.

> Sounds expensive, all that moving around.

Moving happens infrequently, segment splits can happen very frequently. Think of std::vector that you grow to the necessary size once, and then enjoy fast direct indexing w/o any additional overheads.

However, yes, there can be corner cases where copying is more expensive than splitting.

> I do wonder what happens when a routine spends a very long time in a single stack frame, though.

It is out of luck then :)

The obvious evolution of the preemption scheme is to emit preemption checks at back edges as well (inside loops). Then it will provide strong guarantees on the preemption time bound. But this would add overhead for the non-preempted case, and we have not yet seen a real need for such an extension.

 

Dmitry Vyukov's picture

> TBB for Go… :-)

I've actually done it. Several times.

Basically you can implement a "user-space" scheduler on top of goroutines. Create NumCPU goroutines, each with a light-weight work-stealing deque. Then distribute work between the goroutines manually.

It's also wrap-able in a nice way (thanks to closures):

A, B, C := make([]float64, N), make([]float64, N), make([]float64, N)
parallel.For(func(i int) {
    A[i] = B[i] * C[i]
})

[sorry the code is highlighted as C++, IDZ does not support Go]

Raf Schietekat's picture

Quote:

Dmitry Vyukov wrote:

Moving happens infrequently, segment splits can happen very frequently.

Unfortunately, in C/C++, since a program can have references by address to things on the stack, moving is a nonstarter.

Quote:

Dmitry Vyukov wrote:

1. A segment switch has noticeable cost, and if it happens inside an inner loop, it can have a very negative effect on performance. You can easily get a 10x performance hit if you are unlucky. I guess that Intel processors have some hardware support for stacks (manipulating the SP register and accessing memory through it, e.g. PUSH/POP/CALL/RET), and that support assumes that the stack is continuous. So changing SP "on-the-fly" causes a performance hit.

context jumping on x86_64/Q6700 + SysV/ELF ABI using:

- fixed-size stack: the jump consumes 11 ns (storing/restoring CPU registers + stack pointer)

- segmented stack: the jump consumes 75 ns (of course some additional functions have to be called)

Quote:

Dmitry Vyukov wrote:

This will take an eternity in the C/C++ world. What if you use TBB and a proprietary driver to access a proprietary DB; the next release of TBB switches to segmented stacks and the DB driver starts crashing... there is not much you can do. The situation is pretty much unfixable for the C/C++ world as a whole. That's not to say that it's impossible to switch a particular isolated project to segmented stacks.

Do you have code demonstrating that using segmented stacks will crash an application?

I'm wondering because I have a small test app that recursively allocates an array on the stack. With a fixed-size stack (having a guard page at its bottom) I get a segmentation fault after a few iterations. When the app uses a segmented stack with an initial stack size of 4 kB, it never crashes (it finishes all 500 iterations).

#include <iostream>

int count = 500;

// noinline forces a real store through the pointer, so the compiler cannot
// elide the stack buffer
void access( char *buf) __attribute__ ((noinline));
void access( char *buf)
{
    buf[0] = '\0';
}

// each recursion level allocates another 4 kB on the stack
void bar( int i)
{
    char buf[4 * 1024];

    if ( i > 0)
    {
        access( buf);
        std::cout << i << ". iteration" << std::endl;
        bar( i - 1);
    }
}

 

bar is called from the special context.

 

Quote:

Dmitry Vyukov wrote:

> TBB for Go… :-)

I've actually done it. Several times.

Basically you can implement a "user-space" scheduler on top of goroutines. Create NumCPU goroutines, each with a light-weight work-stealing deque. Then distribute work between the goroutines manually.

It's also wrap-able in a nice way (thanks to closures):

A, B, C := make([]float64, N), make([]float64, N), make([]float64, N)
parallel.For(func(i int) {
    A[i] = B[i] * C[i]
})

[sorry the code is highlighted as C++, IDZ does not support Go]

Is it really equivalent? In the previous postings I was told that TBB does not support preserving CPU registers + stack pointer, i.e. you can't jump out of tbb::task::execute() and leave the stack frame intact. Local data (for instance loop counters) are not preserved, even if an asynchronous completion task is used. Or is my understanding of TBB wrong?

Dmitry Vyukov's picture

Quote:

Oliver K. wrote:

Quote:

Dmitry Vyukov wrote:

1. A segment switch has noticeable cost, and if it happens inside an inner loop, it can have a very negative effect on performance. You can easily get a 10x performance hit if you are unlucky. I guess that Intel processors have some hardware support for stacks (manipulating the SP register and accessing memory through it, e.g. PUSH/POP/CALL/RET), and that support assumes that the stack is continuous. So changing SP "on-the-fly" causes a performance hit.

 

context jumping on x86_64/Q6700 + SysV/ELF ABI using:

- fixed-size stack: the jump consumes 11 ns (storing/restoring CPU registers + stack pointer)

- segmented stack: the jump consumes 75 ns (of course some additional functions have to be called)

 

How have you measured? I would expect a plain function call to take <11ns.

 

> Unfortunately, in C/C++, since a program can have references by address to things on the stack, moving is a nonstarter.

Exactly!

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture

Quote:

Oliver K. wrote:

Quote:

Dmitry Vyukov wrote:

This will take an eternity in the C/C++ world. What if you use TBB and a proprietary driver to access a proprietary DB; the next release of TBB switches to segmented stacks and the DB driver starts crashing... there is not much you can do. The situation is pretty much unfixable for the C/C++ world as a whole. That's not to say that it's impossible to switch a particular isolated project to segmented stacks.

 

Do you have code demonstrating that using segmented stacks will crash an application?

I'm wondering because I have a small test app that recursively allocates an array on the stack. With a fixed-size stack (with a guard page at its bottom) I get a segmentation fault after a few iterations. When the app uses a segmented stack with an initial stack size of 4kB, it never crashes (it finishes all 500 iterations).

#include <iostream>

int count = 500;  // recursion depth used by the caller

void access( char *buf) __attribute__ ((noinline));
void access( char *buf)
{
  buf[0] = '\0';
}

void bar( int i)
{
    char buf[4 * 1024];

    if ( i > 0)
    {
        access( buf);
        std::cout << i << ". iteration" << std::endl;
        bar( i - 1);
    }
}

 

bar is called from the special context.

 

 

Yes, see #12. It's all based on our experience deploying a similar stack-perturbing technology in a huge C/C++ code base.

I meant crashes that happen because there is a lot of code out there that simply assumes that the stack is contiguous, rather than crashes because of stack overflows.

 

 

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
Dmitry Vyukov's picture

Quote:

Oliver K. wrote:

Quote:

Dmitry Vyukov wrote:

> TBB for Go… :-)

I've actually done it. Several times.

Basically you can implement a "user-space" scheduler on top of goroutines. Create NumCPU goroutines, each with a light-weight work-stealing deque. Then distribute work between the goroutines manually.

It's also wrap-able in a nice way (thanks to closures):

A, B, C := make([]float64, N), make([]float64, N), make([]float64, N)
parallel.For(func(i int) {
    A[i] = B[i] * C[i]
})

[sorry the code is highlighted as C++, IDZ does not support Go]

 

Is it really equivalent? In previous postings I was told that TBB does not support preserving CPU registers + stack pointer, i.e. you can't jump out of tbb::task::execute() and leave the stack frame intact. Local data (for instance loop counters) is not preserved, even if an asynchronous completion task is used. Or is my understanding of TBB wrong?

 

It is equivalent with respect to scheduling order. Raf asked about the characteristics of the Go scheduler. I said it's possible to mimic TBB's scheduling order in Go.

It may be not equivalent with respect to other characteristics.

> you can't jump out of tbb::task::execute() and leave the stack frame intact

I believe this is true.

 

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net

Quote:

Dmitry Vyukov wrote:

Yes, see #12. It's all based on our experience deploying a similar stack perturbing technology in a huge C/C++ code base.

I meant crashes that happen because there is a lot of code out there that simply assumes that the stack is contiguous, rather than crashes because of stack overflows.

Dmitry, can you provide a code example which crashes if GCC's split stacks are used, please? (Your posting is the first one I've read which says that segmented stacks will cause segmentation faults.)

Quote:

Dmitry Vyukov wrote:

Software that assumes that stacks are contiguous and that the stack pointer moves in one direction. This includes profiling and debugging libraries (stack unwinding), garbage collectors/leak detectors (which usually assume that a stack is something that can be described by a pointer to its beginning and a size), and various systems/hacky software (e.g. measuring a library call's stack consumption by subtracting 2 stack pointers).

As far as I remember, it was Vim that compares 2 stack pointers to determine the direction in which the stack grows. Don't ask me why a text editor needs this.

Until now I assumed that GCC's split stacks grow downwards on x86 - is this not true?

Doesn't the ABI (or the architecture) determine in which direction the stack grows? For instance, the ARM architecture allows the stack to expand to higher or lower addresses (depending on the mode), but the AAPCS ABI requires the stack to grow towards lower addresses.

Dmitry Vyukov's picture

Quote:

Oliver K. wrote:

Quote:

Dmitry Vyukov wrote:

Yes, see #12. It's all based on our experience deploying a similar stack perturbing technology in a huge C/C++ code base.

I meant crashes that happen because there is a lot of code out there that simply assumes that the stack is contiguous, rather than crashes because of stack overflows.

 

Dmitry, can you provide a code example which crashes if GCC's split stacks are used, please? (Your posting is the first one I've read which says that segmented stacks will cause segmentation faults.)

I have not been collecting all the examples. But here is one -- Oilpan (a garbage collector for C++) assumes that a thread's stack is a single contiguous region of memory. With AddressSanitizer's detect_use_after_return option, stacks become fragmented, and this blows up Oilpan:

https://code.google.com/p/chromium/issues/detail?id=339813

Here is another example:

http://llvm.org/viewvc/llvm-project/compiler-rt/trunk/lib/sanitizer_comm...

If you look at the StackTrace::FastUnwindStack function, you will notice that it can't possibly work correctly with segmented stacks.

Quote:

Oliver K. wrote:

Quote:

Dmitry Vyukov wrote:

Software that assumes that stacks are contiguous and that the stack pointer moves in one direction. This includes profiling and debugging libraries (stack unwinding), garbage collectors/leak detectors (which usually assume that a stack is something that can be described by a pointer to its beginning and a size), and various systems/hacky software (e.g. measuring a library call's stack consumption by subtracting 2 stack pointers).

As far as I remember it was Vim that compares 2 stack pointers to determine direction in which stack grows. Don't ask me why a text editor needs this.

 

Until now I assumed that GCC's split stacks grow downwards on x86 - is this not true?

Doesn't the ABI (or the architecture) determine in which direction the stack grows? For instance, the ARM architecture allows the stack to expand to higher or lower addresses (depending on the mode), but the AAPCS ABI requires the stack to grow towards lower addresses.

This is true.

I meant the following situation: if your program directly or indirectly depends on any library that simply compares 2 stack pointers to determine the direction of stack growth, your program will misbehave with segmented stacks (because the library can draw the wrong conclusion about the stack growth direction if the 2 stack pointers happen to be in different stack segments).

 

All about lock-free algorithms, multicore, scalability, parallel computing and related topics: http://www.1024cores.net
