segmentation fault with recursively nested parallelism (gcc snapshot)

segmentation fault with recursively nested parallelism (gcc snapshot)

I
am trying to run some benchmarks with recursively nested parallelism
(e.g., nQueens, TSP, ...) with CilkPlus but even trivial codes like the
following segfault with tiny inputs. Here's what I did.

I installed CilkPlus on Linux following the steps here http://software.intel.com/en-us/forums/showpost.php?p=166122

The test code:
#include

void rec_loop(int depth, int maxdepth) {
if (depth cilk_for(int i=0; i rec_loop(depth+1, maxdepth);
}
} // else return
}

int main(void)
{
rec_loop(0,8);
return 0;
}

------
For rec_loop(0, x) where 0<=x<=4 the program terminates without a segfault.

For x=5 there is ~50% chance of segfault. I also got the following error once
./test:
symbol lookup error:
/home/atzannes/cilkPlus-git/cilkPlus-install/lib/libcilkrts.so.5:
undefined symbol: __cilkrts_return

For 6<=x<=10 it segfaults

For x>10 not tested.

Increasing the size of the stack using 'ulimit -s unlimited' or 'ulimit -s 100000' does not help either.

15 Beiträge / 0 neu
Letzter Beitrag
Nähere Informationen zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.

Hello,

you're using GCC from the "cilkplus" branch. Thus, I've informed Balaji Iyer to take a look why it fails.

Besides, there's something I'd like to mention. You might be aware of it already but I'd like to avoid others using this as "good" example:

What you're doing here is recursively creating tasks. 8 tasks in the first level, 8!/6! in the second, and so forth. In total you're ending up in creating

8! + 8!/1! + 8!/2! + 8!/3! + 8!/4! + 8!/5! + 8!/6! + 8!/7! ~= 100.000+ tasks
[edit: see below]

In the worst case non has been synchronized and all have to be managed by the scheduler.
I don't think it should fail right here but increasing the number of tasks sooner or later eats up all available resources. I don't see a practical reason in having that many tasks running.

Anyways, it's a good stress-test and we should take a look why it fails. Esp. becasue I was able to go higher for my system with using the Intel Compiler without problems.

Best regards,

Georg Zitzlsberger

>>8! + 8!/1! + 8!/2! + 8!/3! + 8!/4! + 8!/5! + 8!/6! + 8!/7! ~= 100.000+ tasks

I figgure it as

8 + 8^2 + 8^3 + 8^4 + 8^5 + 8^6 + 8^7 + 8^8 = 19,173,960 tasks

(each level spawn 8 tasks and not 8, 7, 6, 5, 4, 3, 2, 1 tasks respective of level)

Perhaps an error message ("Task count exceeds maximum - nnnn") would be preferred prior to segfault.

While I do not like artificial limits on task creation, the task creation code could anticipate the current request may fail and therefore print the error/warning message. When critical condition is reached, the code could execute as if in serial. I think a error/warning message is sufficient.

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

each recursion level reduces the number of tasks spawned as "depth" is incremented. 8 in first level, 7 in second, 6 in third, ... for each node per level in the task tree!

Anyways, it's an extremely high number which, for sure, shows more overhead because of synchronization instead of meaningful parallelism for almost all practical algorithms. Think of Amdahl's Law here!

It's hard to know when to bail-out with an error message. An example would be

while(1) fork();

You can tear down every system with that (provided no hard limits are set). Tasking is similar (frankly speaking).

It's rather a theoretic discussion we're having here. I'll let Balaji decide what to do.

Best regards,

Georg Zitzlsberger

Exhaustion of the deque is a known problem in both thegcc and iccimplementations. We have ideas on how to handle it, but have not implemented any yet. For now, the answer is "Don't do that."

While cilk_spawns are (relatively) cheap, they are still more expensive than a simple call. As a general rule, you want about 10P tasks, where P is the number of processors. This gives you a good balance between the overhead of a cilk_spawn and the chance that one of your tasks takes much longer than the others.

We've discussed a number of ideas for handling deque exhaustion. Oneis to have the generated code examine the deque and call the function instead of spawning it when the deque is exhausted. This might also be used to automatically convert spawns to calls if the deque depth gets "too deep." Of course, this introduces additional code and overhead. And it may end up cutting over to calls at the "wrong place" in the spawn tree.We need to implement it and experiment with it.

As an aside, "tasks" is a really bad term, since it brings to mind thread pools or roll-your-own threading. Unfortunately we haven't come up with a better one.

Finally, wesuspect thatyou're running into the same problem assegmentation fault with simple cilk_for loop (gcc cilkplus branch). We should be posting a fix for that soon.

- Barry

>>

cilk_for(int i=0; i

rec_loop(depth+1, maxdepth);

}
<<

That cilk_for loop executes 8x each time it is called. The depth+1 only indicate when the cilk_for is bypassed and not the number of iterations. Perhapse you were thinking to code read:

cilk_for(int i=depth; i

rec_loop(depth+1, maxdepth);

}

Failing gracefully on out-of-memory situations is rather difficult since most systems seem to employ "map memory on first touch". Meaning allocations will succeed, but use will fail. i.e. Try/Catch is useless for memory allocation unless (until) the heap manager factors in page file availability _and_ performs a page-by-page walk/touch. This ain't going to happen since it will slow down all allocations.

Jim Dempsey

www.quickthreadprogramming.com

Hello Jim,

you're right! I don't know why but I really had in mind this code:

cilk_for(int i=depth; i rec_loop(depth+1, maxdepth);
}

For the original version it's indeed:
8^1 + 8^2 + 8^3 + 8^4 + 8^5 + 8^6 + 8^7 + 8^8 number of tasks created

Sorry for the confusion.

Best Regards,

Georg Zitzlsberger

Some comments
(0) I'm running on an 2core/4thread i5 (for now).

(1) I looked at the segfault with the simple cilk_for loop but was unable to reproduce it, that's why I decided to post this one. I might be using a different snapshot of the gcc cilkplus branch. By the way I also tried doing the "cilk_spawn nothing(); cilk_sync" trick of that post but it didn't help. I also very briefly looked at the cilk runtime code to see where deque overflow checks occur but I quickly gave up without finding them.

(2) While the number of tasks created during the execution is indeed 8 + 8^2 + ... + 8^8, the maximum number of task descriptors (or lambdas or whatever they are called) in the deques should be much lower because of the iterative binary splitting of those ranges. Per deque, it should look something like Depth * log(Width) = 8 * log 8 = 24. I'm not sure I understood the comment saying that in the worst case none of the tasks has been synchronized and all have to be managed by the scheduler.

(3) TBB and other implementations of work stealing support such codes for much higher values (e.g., 12, 13, 14) without a problem, that's why I suspect there is a bug.

(4) Having the programmer control the amount of parallelism (comment about having 10 P tasks) is contrary to the spirit of work-stealing of declaring parallelism and having the scheduler efficiently take advantage of that. Moreover, such constraints prevent developing parallel code in a modular way and are an important barrier of adoption for parallel programming for general purpose code. Clearly there are practical limitations to making declarative parallel programming a reality (which I'm working on) but current state-of-the art should allow for *much*much*much* more than 10 P.

(5) The idea of switching to sequential evaluation when the deque is full is very closely related to "Lazy Binary Splitting" [http://www.cs.umd.edu/~tzannes/docs/ppopp164-updated.pdf] although there the motivation is to improve performance and avoiding overflowing the deque is a happy side-effect. More up-to-date information will be available online very shortly (in my dissertation).

Thanks all for the feedback. Looking forward to more :-)

> The idea of switching to sequential evaluation when the deque is full is very closely related to "Lazy
>Binary Splitting" [http://www.cs.umd.edu/~tzannes/docs/ppopp164-updated.pdf] although there the
>motivation is to improve performance and avoiding overflowing the deque is a happy side-effect.
> More up-to-date information will be available online very shortly (in my dissertation)

It will definitely improve performance. We've done some prototypes with deep recursionthat below a certain spawn depth switch to calls and seen excellent speedup. And the fact that the same code can be used to deal with deque overflow increases the likelihood that we'll implement it. :o)

I look forward to reading thepaper. Thanks for the pointer.

- Barry

Do you know which version of the cilkplus branch you are using? I conjecture that you are still running into the same problem as the other cilk_for issue, even though you haven't reproduced that bug. [There is a subtle but known issue with that version that we are working on.]

If you rewrite your program to use recursive divide-and-conquer (and avoid cilk_for), do you still see a problem?

Just to be avoid any confusion in the statement about "10P" tasks. The "10P" number is nothing more than an approximate heuristic, and should NOT be interpreted as a claim that Cilk programmers should be specifying the number of tasks based on P. In fact, if at all possible, users should NOT have P explicitly appearing in their code.

With many work-steailng schedulers, including Cilk and TBB, etc., if you have work that you are divide among processors, you generally to subdivide your work intoat least 10P tasks (assuming tasks are equal in size), because you want enough parallel slackness in your execution.

Dividing into exactly P tasks is generally notsufficient --- even thoughconceptually the tasks might be of equal size, in practice, giventheunpredictability introduced by schedulers, memory systems, caches, processors, etc.,theworkmay not get divided as evenly as you might expect.
When tasks are unbalanced, it gets even trickier to devise a heuristic number.

You also want the size of each task to be large enough so that the overhead of spawn is not overwhelming the work done within each task. In many Cilk programs, this threshhold can be set independently of P. This thresshold is analogous to the one you have for many recursive algorithms --- you need to make sure you base case is sufficiently large to avoid the recursion overhead.

Ideally, we'd like to have a user not do thus tuning --- you would be able to recurse all the way down to 1, and the system would figure out the right base case. But as a practical matter, we aren't quite there yet. :)

In summary, when writing a Cilk program, you generally want tasks to be as small as possible without having the overhead of spawn dominating.

Cheers,

Jim

Hey Jim, I believe we met at PPoPP'10 in India.

I am using the snapshot from this link
http://software.intel.com/en-us/forums/showpost.php?p=166122
(for some reason the link is not working right now)

For the recursive algorithms are you talking about sequential or parallel ones. I can think of the cache-oblivious cases where you switch to a non cache oblivious near the leaves of the recursion, but do you have other examples from the sequential realm?

I agree with all you wrote about the programmer ideally not getting involved in coarsening, in fact a chapter in my dissertation is dedicated to that. Once UMD puts it online I can post a link and we can have a better discussion on the topic. Personally, I believe improving the current systems so that manual coarsening is not necessary* will have much greater benefits than just relieving the programmer from a mundane task.

Cheers!

P.S. what is necessary is of course a matter of perspective but I think that right now, without it, parallel applications can be expected not to perform close to their full potential. (I also count as coarsening not parallelizing a loop that is inherently parallel)

I've never been in India (although I am aware that there is at least one other programmer with the name Jim Dempsey).

In the example posted at the start of this thread, the code was written as an all-or-none. i.e. all recursions were invoking parrallel tasks (or none-were invoking parallel tasks). The more optimal method would be invoke parallel tasks to the extent necessary and at the point in the recursion where it is optimal. Usually parallization should be invoked at the outer most layers but sometimes this is not always true.

For a large problem set on a large system (many processors/NUMA nodes/Last level Cache)

parallel partition by NUMA node (one thread per node)
(a thread inNODE)
serial partition by L3 cache sized chunks
(a partition)
parallel partition the L3 cache sized chunk to threads on current CPU
(each thread on CPU sharing L3)
computepartition within L3, within chunk of partition on NODE
(end thread on CPU sharing L3)
endparallel partition the L3 cache sized chunk to threads on current CPU
(end a partition)
end serial partition by L3 cache sized chunks
(end a thread inNODE)
end parallel partition by NUMA node (one thread per node)

The above is designed to exploit NUMA locality, minimize the number of spawn/join of tasks, and to maximize L3 cache utilization.

Jim Dempsey

www.quickthreadprogramming.com

Ah yes, now I recall our meeting at PPoPP 2010. :)

I will have to try and track down the version you have.

I mostly had in mind divide-and-conquer algorithms --- mostly sequential, but also potentially parallel. Matrix multiply is onegeneric example I had in mind. For example, if you are implementing Strassen's algorithm, you usually only want to recurse down to matrices of a certain size before you end up switching to an optimized method for the base case (whichprobably looks nothing like Strassen's ).

The parallel case gets even more "interesting" (or complex) because it might introduce another threshhold for when it is efficient to serialize vs. execute in parallel. Sometimes those two threshholds can be the same, but arguably there are cases where they should be different.

Anyway, I will look forward to reading your disseration!

Cheers,

Jim

[ Fortunately, the forum shows user names to resolve ambiguity. :) ]

I'm running into a similar problem (as the OP) on my 32-bit Linux machine (Ubuntu 10.04 LTS).

Interestingly, the program seems to work fine on 64-bit architectures. Any thoughts?

Anlagen: 

AnhangGröße
Herunterladen fib.c279 Bytes

Hm, that is rather curious. Which compiler are you using to compile fib?
"gcc -v" or "icc -V" on Linux?

Kommentar hinterlassen

Bitte anmelden, um einen Kommentar hinzuzufügen. Sie sind noch nicht Mitglied? Jetzt teilnehmen