Branch and bound search: avoid copying of data

Hi there.

I want to implement a multi-threaded version of a branch-and-bound search. The search space is a 16-level, 128-ary tree.

So far, the intermediate result is stored in an array. This array contains not only the 16 level values, but also other entries, which are deduced from those values and are needed to check the constraints for the bound condition.

Cilk Plus, with its work stealing, seems well suited to implementing this conveniently, especially as the framework itself decides whether or not to split a task.

The problem is that every thread needs its own copy of the array, as it is part of the current state. Not knowing where a split might happen, I assume I have to copy the data for every working packet I use.

In a single-threaded implementation with recursive calls that copies the array for every single call (a DFS on the search tree), performance drops dramatically compared to a single-threaded do-try-undo approach that uses only one array.

Is there anything I can do about this in Cilk Plus besides adjusting the size of the working packets (e.g. testing 10 or 20 values in one recursive call instead of just one)?
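
To make the two serial variants concrete, here is a minimal sketch; State, bound_ok and record_solution are made-up placeholders for my actual data and constraint checks:

    enum { LEVELS = 16, BRANCH = 128 };

    struct State {
        int value[LEVELS];   // the 16 decision levels
        int deduced[64];     // entries derived from the values (size made up)
    };

    // Placeholders for the real constraint check and result handling.
    bool bound_ok(const State&, int /*level*/) { return true; }
    void record_solution(const State&)         {}

    // Variant A: copy the whole state for every tree node -- this is the
    // version whose performance drops dramatically.
    void dfs_copy(const State& s, int level)
    {
        if (level == LEVELS) { record_solution(s); return; }
        for (int v = 0; v < BRANCH; ++v) {
            State child = s;                // full copy per tree node
            child.value[level] = v;
            // ... update child.deduced from v ...
            if (bound_ok(child, level))
                dfs_copy(child, level + 1);
        }
    }

    // Variant B: do-try-undo on a single shared state.
    void dfs_undo(State& s, int level)
    {
        if (level == LEVELS) { record_solution(s); return; }
        for (int v = 0; v < BRANCH; ++v) {
            s.value[level] = v;
            // ... apply the deduced updates (remembering how to revert) ...
            if (bound_ok(s, level))
                dfs_undo(s, level + 1);
            // ... undo the deduced updates ...
        }
    }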

I ran into this issue recently too, and could not find a good solution other than to use parallel recursion (with the copying cost) high in the recursion tree and serial recursion (with the undo list) deep in the recursion tree.  The root problem is that the stealing of a continuation is dynamic, and may happen long after the current thread has recursed deeply and left a long "undo" list.
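
A sketch of that hybrid, reusing the pieces from the sketch in your first post; CUTOFF is a tuning parameter and the only new name here:

    #include <cilk/cilk.h>

    const int CUTOFF = 3;   // stay parallel (with copies) only this deep

    void dfs_hybrid(State s, int level)
    {
        if (level >= CUTOFF) {            // deep in the tree: serial, no copies
            dfs_undo(s, level);
            return;
        }
        for (int v = 0; v < BRANCH; ++v) {
            State child = s;              // copy only near the root
            child.value[level] = v;
            // ... update child.deduced from v ...
            if (bound_ok(child, level))
                cilk_spawn dfs_hybrid(child, level + 1);  // spawn gets its own copy
        }
        cilk_sync;
    }

The copies are then bounded by the number of nodes above the cutoff rather than the number of nodes in the whole tree.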

Arch, Robert,

Although Cilk Plus and this problem are inherently recursive, I have an unfounded gut feeling that you may find some use in coding the problem as a pseudo-recursive parallel pipeline. Let's see if I can outline how to do this:

A traditional parallel pipeline has a fixed number of tokens (a.k.a. buffers): subject to availability, the input stage acquires a token (or waits for one), conditions its data, then passes the token on for parallel processing (think of an SPMC queue from the input stage to the next stage).

What I would like to suggest is something that could be described as token splitting or cloning, whereby additional tokens are created (e.g. a buffer is split in two) and recycled to the input stage to be enqueued again for processing or further splitting. This may also require a second parallel-pipeline-like construct that behaves like a reducer (processing pairs of finished tokens).

The key point here is that you can control the number of concurrent tokens. The tokens can carry the intermediate results (as opposed to keeping them on the recursion stack), and these intermediate results may be manipulated in situ instead of being copied.
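
Not working code, just one way the plumbing could look, sketched with plain C++11 threads instead of a real pipeline framework (Packet, worker, run and the counts are made-up names; State, dfs_undo and bound_ok come from the sketch in the first post; recycling a token back into the queue after a split is omitted for brevity):

    #include <algorithm>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Packet {
        int   first, last;   // sub-range of branch values to explore
        int   level;         // tree level this packet starts at
        State state;         // private state buffer, copied once at creation
    };

    std::mutex              m;
    std::condition_variable cv;
    std::queue<Packet*>     ready;     // tokens with work attached
    int                     pending;   // packets not yet finished

    void worker()
    {
        for (;;) {
            Packet* p;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [] { return !ready.empty() || pending == 0; });
                if (ready.empty()) return;   // all work drained
                p = ready.front();
                ready.pop();
            }
            // Explore the packet's sub-range serially, manipulating
            // p->state in situ (do-try-undo) -- no further copies.
            for (int v = p->first; v < p->last; ++v) {
                p->state.value[p->level] = v;
                // ... apply/undo the deduced updates around the call ...
                if (bound_ok(p->state, p->level))
                    dfs_undo(p->state, p->level + 1);
            }
            {
                std::lock_guard<std::mutex> lk(m);
                --pending;
            }
            cv.notify_all();   // the token is now free to be recycled
        }
    }

    void run(const State& root, int ntokens, int nthreads)
    {
        std::vector<Packet> pool(ntokens);   // the fixed token pool
        int per = (BRANCH + ntokens - 1) / ntokens;
        for (int i = 0; i < ntokens; ++i) {
            pool[i].first = i * per;
            pool[i].last  = std::min(BRANCH, (i + 1) * per);
            pool[i].level = 0;
            pool[i].state = root;            // the one copy per token
            ready.push(&pool[i]);
        }
        pending = ntokens;
        std::vector<std::thread> ts;
        for (int i = 0; i < nthreads; ++i) ts.emplace_back(worker);
        for (auto& t : ts) t.join();
    }

The number of full copies of the State is the number of tokens you create in run(), not the number of tree nodes visited.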

Jim Dempsey

www.quickthreadprogramming.com

Robert,

It sounds as if what you want is a "splitter" hyperobject (see Section 7 of http://software.intel.com/sites/default/files/m/d/4/1/d/8/hyper.pdf).  My research group at MIT has explored how to implement splitters generally, but we don't have a compelling mechanism for the general case.  In my experience, however, one can sometimes find a solution for a particular case.  I would be interested in hearing more specifics about the data structure you're using, as well as about the algorithm itself.

Cheers,
Charles

Thank you for your ideas.

@Jim Dempsey:

As far as I understand your idea, your "tokens" are the same as working packets. The splitting that happens to the tokens dynamically is the same as subdividing working packets (a level-2 packet creates several new level-3 working packets, etc.). When splitting or merging those tokens, the data is copied just as it is on the recursive stack of working packets created in my approach, isn't it?

So I don't see why this token approach would require significantly fewer copy operations than my current approach - or did I misunderstand your idea? Besides this: how would Cilk (or any other framework) assist there? You described an interesting approach, but I am missing an idea of how to implement it.

@Charles:

So a splitter is basically a data structure which keeps track of changes within one data structure on a per-thread basis? This is definitely an interesting approach to reducing memory usage, but does it improve performance?

At the moment I'm still reading the paper you provided and trying to understand the details. I was wondering about the pre- and post-state semantics around a cilk_spawn: why not deterministically use either the pre-state or the post-state? In my case, there are only a few search results, so a default drop of changes (i.e. a deterministic pre-state) would be optimal, as the undo operations would no longer be necessary.

@all:

Another interesting idea, from an associate, was to keep only the 16 level values as the state and re-deduce the other entries whenever needed. This reduces the number of bytes that need to be copied, but increases the amount of computation. What do you think about this?
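
In code, the idea would look roughly like this (rebuild_deduced is a made-up stand-in for the actual re-deduction; State and LEVELS are from my earlier sketch):

    struct Path {
        int value[LEVELS];   // just the 16 decision values, nothing deduced
    };

    // Hypothetical stand-in for however the derived entries are recomputed
    // from the chosen values.
    void rebuild_deduced(State& s, const Path& p, int level)
    {
        for (int i = 0; i < level; ++i) s.value[i] = p.value[i];
        // ... recompute s.deduced from s.value[0..level-1] ...
    }

    void dfs_path(Path p, int level)
    {
        State s;                       // scratch state, rebuilt per packet
        rebuild_deduced(s, p, level);  // pay computation instead of copy bandwidth
        // ... continue with do-try-undo on s from `level` onward, spawning
        //     children with a copy of p (16 ints) instead of the full State ...
    }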

Robert,

Tokens == "working packets"

The primary complaint you raised in your original post was the (apparently) excessive overhead of data copying as you (Cilk Plus) split (and presumably later merge, though possibly to a lesser extent). The packet concept reduces the number of splits to the number of tokens (a number of your choosing), as opposed to a recursive process that could have thousands (or many more) of splits. 16 levels could conceivably produce 65,536 splits, possibly multiplied by a permutation.

Note that if you construct the "working packets" the same way you construct your recursive scheme, then indeed you will have the same number of split/copy/merge/copy operations.

The point of the suggestion is NOT to run a program-flow recursive algorithm, but rather to create a data-flow iterative system where the common copy process occurs once, at token creation, as opposed to at each split point. As the tokens get recycled, you reuse the copied data.

From your original description:

The problem is that every thread needs its own copy of the array, as it is part of the current state. Not knowing where a split might happen, I assume I have to copy the data for every working packet I use.

Your recursive model appears to require that every split have a copy of the data; the working-packet system (described above as a parallel pipeline) only requires that every thread (or token) have one.
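
Reduced to a sketch (all names here are hypothetical, building on the fragments above): each worker copies the root state exactly once, and thereafter only replays and unwinds moves on its private buffer.

    struct PathPacket { int value[LEVELS]; int level; };  // path only, no State

    PathPacket* next_packet();                        // hypothetical packet source
    void replay_path(State& s, const PathPacket& p);  // hypothetical: apply p's moves
    void unwind_path(State& s, const PathPacket& p);  // hypothetical: revert them

    void worker_loop(const State& root)
    {
        State mine = root;                  // the one copy for this worker
        while (PathPacket* p = next_packet()) {
            replay_path(mine, *p);          // walk `mine` down to p's subtree
            dfs_undo(mine, p->level);       // explore in situ, no copies
            unwind_path(mine, *p);          // walk back up to the root state
        }
    }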

Jim Dempsey

www.quickthreadprogramming.com
