# Multicore-enabling a Binary Decision Diagram algorithm

Posted by Yuxiong He originally on www.cilk.com on Fri, May 29, 2009

Reduced and Ordered Binary Decision Diagram (a.k.a. ROBDD, though I will refer to it as BDD in this post for simplicity) is an acyclic graph representation for Boolean functions.  Because this representation provides a canonical form and is quite succinct in most cases, it is widely used for formal verification of protocols and digital circuits.  BDD can be used to look for satisfying assignment of Boolean expressions and comparing whether two Boolean expressions are equivalent.  Because the construction of BDD for certain large and complex Boolean functions can be very time consuming, we thought it would be interesting to parallelize BDD operations in order to achieve efficient execution on multicore CPUs.

Our parallel BDD is developed based on BuDDy, an open source BDD package.  In this post, I will describe some key algorithms of BDD operations and introduce the approach we took to parallelize them.   We tested the performance of our parallel BDD by constructing a BDD for 13-bit multiplier, and obtained a 10x speedup on a 16-core machine.

(More detailed performance results are presented in Section 4 of this blog.)

Brief Introduction of BDD

A binary decision diagram is built based on the If-then-else Normal Form (INF) of a Boolean expression.  For example, a Boolean expression (x AND y) of variable x and y can be represented using the following INF form shown in Figure 1(a), which corresponds to the binary decision diagram in Figure 1(b).   Each BDD node has a variable, and two successor links.  The dashed line points to the low successor of the node representing the branch where the variable value is false.  The solid line points to the high successor of the node representing the branch where the variable value is true.

Figure 1.  (a) INF form of x AND y.  (b) BDD of x AND y.

Reduced and Ordered BDD (ROBDD) is more restricted than general BDDs.  A binary decision diagram is ordered (an OBDD) if the variables occur in the same order on all paths from the root of the BDD.  For example, BDD in Figure 1(b) follows variable order x < y.  (Please note that the operator “<” here is not the familiar less-than operator in math.  Instead, it indicates that the variable x always occurs before variable y in this BDD.)  An OBDD is reduced if both of the following conditions are satisfied:
• (uniqueness) no two distinct nodes have the same variable name and low and high successor.
• (no-redundancy) no variable node has identical low- and high-successor.

An important property of ROBDD is that for each Boolean expression with a fixed variable order, there is exactly one ROBDD.   Figure 2 shows an example of ROBDD for expression (x1 1) AND (x2 y2) with variable order x1 < x2 < y1 < y2 .  The operator (called bi-implication) means x is equivalent to y, i.e., the value of x y is 1 if and only if (x=0 AND y=0) OR (x=1 AND y=1).

Figure 2.  BDD of expression (x11) AND (x2 y2) with variable order x1 < x2 < y1 < y2.

Algorithm for BDD Construction

To describe the algorithm of constructing BDD, we need a data structure to represent BDD.    The data structure here is straightforward, which is a direct representation of directed acyclic graph – a set of nodes with outgoing edges.   As shown in Figure 2, each intermediate node has three attributes – variable, low edge node and high edge node.  There are two leaf nodes representing logical true and logical false, respectively.  Since reduced BDDs do not contain any redundant nodes, a hash table is required to provide a guarantee on the absence of any redundancies.   A hash table in BDD serves two purposes: to store the nodes, and to provide efficient lookup of nodes.

Constructing BDD for a Boolean expression can be treated as an iterative process of combining two Boolean expressions with a Boolean operator.   Let’s call this combination operation Apply(expression e1, expression e2, operator op).  For example, the boolean expression (x1 y1) AND (x2 y2) can be constructed using Apply(e1, e2, AND) where e1 = (x1 1) and e2 =(x2 y2).  Moreover, e1 can be further constructed using Apply(x1, y1, ), and so does e2.    Thus, Apply is a key operation in BDD construction and manipulation.

The pseudo code of the Apply operation is shown in figure 3.

Figure 3.  The pseudo code of sequential Apply operation in BDD.

To apply a logic operation on two BDDs, the evaluation consists of several steps, as shown in Figure 3.

First, if this operation was just performed and the result is cached, the result is returned directly.  This requires a special data structure – a cache table.   A cache table is analogous to a hash table, and maintains <key, value> pair and returns the corresponding value given the key.  The difference between the two is that cache table doesn’t guarantee to record all <key, value> pairs inserted.  It works like a cache, and only stores the recent accessed <key, value> pairs.     Second, if both BDDs are constants (0 or 1), it simply applies the simple boolean operation on constants.  Otherwise, the Apply operation is recursively conducted on low edge and high edge of two bdds.

Parallel  Algorithm for BDD Construction

There was existing work on parallelizing BDD construction using thread models [1-3].  Because their parallel languages / libraries don’t have dynamic scheduling for load balancing, their codes have to divide the work manually or maintain a work queue in order to balance the load among processors.  These all involve significant changes to the sequential code.  In contrast, when we use Cilk++ to parallelize the algorithm, all we need to do is to add a handful of cilk_spawn and cilk_sync statements to expose potential parallelism (Figure 4).   We leave the heavy lifting to the Cilk++ runtime system featuring efficient dynamic scheduling and load balancing.

Figure 4.  The pseudo code of parallel Apply operation in BDD.

When the parallel Apply procedure is called, several instances of MakeNode can run concurrently, so do the cache table and hash table accesses.  To ensure correct parallel execution without data races, we need thread-safe cache table and hash table.

Our thread-safe cache table is a simple fixed-size table supporting lookup(key) and insert(key, value).

Each entry of the table includes a <key, value> pair.  The position of a pair in the table is determined by the hashed value of the key.  If two <key, value> pairs are hashed into the same entry, the new pair will overwrite the old one.  Each entry of the table is protected such that only one thread can read/write it at any time.  This is done by using one bit per entry to indicate status of the entry, and using Compare-and-Swap(CAS) to ensure atomicity.

The hash table in BDD needs to support concurrently execution of one key operation – lookup_insert(key* k), which returns the key if it already exists and inserts the key otherwise.  Since the space requirement by BDD operations is usually unknown, we need a resizable hash table.  When the probe number is greater than a certain threshold value, the hash table performs auto-resizing.
Our implementation uses an open addressing hash table.  The thread-safeness is ensured by atomic Compare-And-Swap (CAS) operations.    Lookup_insert operations can be safely executed in parallel.  During hash table resizing, any worker that calls lookup-update will first help with the resizing, and only then complete their lookup-update operations.  This all-participation scheme distributes the workload of hash table resizing to all the workers that use the hash table.   This ensures that resizing can be done efficiently, and that the worker triggering the resizing won’t be solely burdened with the task.

Performance Results

Our BDD achieves good speedup on multicore machines.   For the multiplier application – a classic example to test scalability of BDD implementation, it achieves 10x speedup on a 16-core system (four quad-core AMD Barcelona CPUs).

Figure 5.  Speedup results of 13-bit Multiplier Construction using multicore-enabled BDD
Although the speedup of multicore BDD is good, we don’t get perfect linear speedup here.   An important reason is that BDD multiplier has a considerable memory bandwidth requirement.  A simple test to confirm this is to run the sequential version of the program for two cases – (1) run one copy of the sequential program on each core, and (2) run a single copy of the sequential program on the entire machine.   If the program’s execution time in case 1 is significantly larger than that in case 2, the memory bandwidth demand of the program could be high enough to become the primary constraint for program scalability.  If we run a sequential BDD multiplier application, the execution time on case (1) is approximately 1.6 times of the execution time on case (2).  This highlights the intrinsic memory bandwidth demand of this application, and explains the imperfect linear speedup.

## References:

[1] Implementation of an Efficient Parallel BDD Package.  Tony Stornetta  and Forrest Brewer.
[2] A Parallel Algorithm for Constructing Binary Decision Diagrams.  Shinji Kimura and Edmund M. Clarke.
[3] BDDNOW: A Parallel BDD Package.  Kim Milvang-Jensen.

The choice of a multiplier is interesting because it is provably worst-case for BDD's, but because if that, it's atypical of real applications. There is ample literature on BDD's including benchmarks; it would be interesting to see what scalability is achievable on them.

posted @ Wednesday, June 03, 2009 8:59 AM by Derek Beatty

Derek,

We also ran N-Queen problem with 12 queens. It gets 3x speedup on 4 cores, and 5x on 8 cores.

Do you have other benchmarks to recommend? Those run more than a few seconds.

Regards,
Yuxiong

posted @ Wednesday, June 03, 2009 10:06 AM by Yuxiong He

This is very cool. I'm using BuDDy now, and I'd love to be able to get some extra performance out of it. I don't suppose I could get my hands on a code / binary version of this?

posted @ Thursday, July 09, 2009 12:53 AM by Ethan

posted @ Monday, July 27, 2009 2:37 PM by Bin Xin

Ethan and Bin Xin,

Thank you for your interest in the parallel BDD. We only parallelized some well-used operations of BuDDy such as bdd_and, bdd_or, and bdd_not to demonstrate the applicability and efficiency of Cilk++ in this domain. These operations are sufficient for applications like adder, multiplier, nqueen problem, however, we didn't develop a parallel version supporting full functionality of BuDDy. Therefore, we don't provide public download and evaluation.

Regards,
Yuxiong

posted @ Wednesday, July 29, 2009 3:42 PM by Yuxiong

For more complete information about compiler optimizations, see our Optimization Notice.

Top

Those looking for an implementation might be interested in this project I did: https://github.com/nlguillemot/robdd

It implements the design of this article using Intel TBB, with good performance improvements on problems like n-queens. A full report is in the "about.pdf" included in the project's GitHub repository.

HI,
I'm currently working on a project and I need to know if the package is available now for download as I urgently need it.
If not could you please recommend other availabe packages that also utilize parallelism in BDD computations.