| October 28, 2009 12:00 AM PDT | |
Posted by Yuxiong He originally on www.cilk.com on Fri, May 29, 2009
Reduced and Ordered Binary Decision Diagram (a.k.a. ROBDD, though I will refer to it as BDD in this post for simplicity) is an acyclic graph representation for Boolean functions. Because this representation provides a canonical form and is quite succinct in most cases, it is widely used for formal verification of protocols and digital circuits. BDD can be used to look for satisfying assignment of Boolean expressions and comparing whether two Boolean expressions are equivalent. Because the construction of BDD for certain large and complex Boolean functions can be very time consuming, we thought it would be interesting to parallelize BDD operations in order to achieve efficient execution on multicore CPUs.
Our parallel BDD is developed based on BuDDy, an open source BDD package. In this post, I will describe some key algorithms of BDD operations and introduce the approach we took to parallelize them. We tested the performance of our parallel BDD by constructing a BDD for 13-bit multiplier, and obtained a 10x speedup on a 16-core machine.
(More detailed performance results are presented in Section 4 of this blog.)
Brief Introduction of BDD
A binary decision diagram is built based on the If-then-else Normal Form (INF) of a Boolean expression. For example, a Boolean expression (x AND y) of variable x and y can be represented using the following INF form shown in Figure 1(a), which corresponds to the binary decision diagram in Figure 1(b). Each BDD node has a variable, and two successor links. The dashed line points to the low successor of the node representing the branch where the variable value is false. The solid line points to the high successor of the node representing the branch where the variable value is true.
Figure 1. (a) INF form of x AND y. (b) BDD of x AND y.
- (uniqueness) no two distinct nodes have the same variable name and low and high successor.
- (no-redundancy) no variable node has identical low- and high-successor.
An important property of ROBDD is that for each Boolean expression with a fixed variable order, there is exactly one ROBDD. Figure 2 shows an example of ROBDD for expression (x1 1) AND (x2
y2) with variable order x1 < x2 < y1 < y2 . The operator
(called bi-implication) means x is equivalent to y, i.e., the value of x
y is 1 if and only if (x=0 AND y=0) OR (x=1 AND y=1).
Figure 2. BDD of expression (x1
Algorithm for BDD Construction
To describe the algorithm of constructing BDD, we need a data structure to represent BDD. The data structure here is straightforward, which is a direct representation of directed acyclic graph – a set of nodes with outgoing edges. As shown in Figure 2, each intermediate node has three attributes – variable, low edge node and high edge node. There are two leaf nodes representing logical true and logical false, respectively. Since reduced BDDs do not contain any redundant nodes, a hash table is required to provide a guarantee on the absence of any redundancies. A hash table in BDD serves two purposes: to store the nodes, and to provide efficient lookup of nodes.
Constructing BDD for a Boolean expression can be treated as an iterative process of combining two Boolean expressions with a Boolean operator. Let’s call this combination operation Apply(expression e1, expression e2, operator op). For example, the boolean expression (x1 y1) AND (x2
y2) can be constructed using Apply(e1, e2, AND) where e1 = (x1
1) and e2 =(x2
y2). Moreover, e1 can be further constructed using Apply(x1, y1,
), and so does e2. Thus, Apply is a key operation in BDD construction and manipulation.
The pseudo code of the Apply operation is shown in figure 3.
Figure 3. The pseudo code of sequential Apply operation in BDD.
To apply a logic operation on two BDDs, the evaluation consists of several steps, as shown in Figure 3.
First, if this operation was just performed and the result is cached, the result is returned directly. This requires a special data structure – a cache table. A cache table is analogous to a hash table, and maintains <key, value> pair and returns the corresponding value given the key. The difference between the two is that cache table doesn’t guarantee to record all <key, value> pairs inserted. It works like a cache, and only stores the recent accessed <key, value> pairs. Second, if both BDDs are constants (0 or 1), it simply applies the simple boolean operation on constants. Otherwise, the Apply operation is recursively conducted on low edge and high edge of two bdds.
Parallel Algorithm for BDD Construction
There was existing work on parallelizing BDD construction using thread models [1-3]. Because their parallel languages / libraries don’t have dynamic scheduling for load balancing, their codes have to divide the work manually or maintain a work queue in order to balance the load among processors. These all involve significant changes to the sequential code. In contrast, when we use Cilk++ to parallelize the algorithm, all we need to do is to add a handful of cilk_spawn and cilk_sync statements to expose potential parallelism (Figure 4). We leave the heavy lifting to the Cilk++ runtime system featuring efficient dynamic scheduling and load balancing.
Figure 4. The pseudo code of parallel Apply operation in BDD.
When the parallel Apply procedure is called, several instances of MakeNode can run concurrently, so do the cache table and hash table accesses. To ensure correct parallel execution without data races, we need thread-safe cache table and hash table.
Our thread-safe cache table is a simple fixed-size table supporting lookup(key) and insert(key, value).
Each entry of the table includes a <key, value> pair. The position of a pair in the table is determined by the hashed value of the key. If two <key, value> pairs are hashed into the same entry, the new pair will overwrite the old one. Each entry of the table is protected such that only one thread can read/write it at any time. This is done by using one bit per entry to indicate status of the entry, and using Compare-and-Swap(CAS) to ensure atomicity.
A Thread-Safe Hash Table
The hash table in BDD needs to support concurrently execution of one key operation – lookup_insert(key* k), which returns the key if it already exists and inserts the key otherwise. Since the space requirement by BDD operations is usually unknown, we need a resizable hash table. When the probe number is greater than a certain threshold value, the hash table performs auto-resizing.
Our implementation uses an open addressing hash table. The thread-safeness is ensured by atomic Compare-And-Swap (CAS) operations. Lookup_insert operations can be safely executed in parallel. During hash table resizing, any worker that calls lookup-update will first help with the resizing, and only then complete their lookup-update operations. This all-participation scheme distributes the workload of hash table resizing to all the workers that use the hash table. This ensures that resizing can be done efficiently, and that the worker triggering the resizing won’t be solely burdened with the task.
Performance Results
Our BDD achieves good speedup on multicore machines. For the multiplier application – a classic example to test scalability of BDD implementation, it achieves 10x speedup on a 16-core system (four quad-core AMD Barcelona CPUs).
Figure 5. Speedup results of 13-bit Multiplier Construction using multicore-enabled BDD
Although the speedup of multicore BDD is good, we don’t get perfect linear speedup here. An important reason is that BDD multiplier has a considerable memory bandwidth requirement. A simple test to confirm this is to run the sequential version of the program for two cases – (1) run one copy of the sequential program on each core, and (2) run a single copy of the sequential program on the entire machine. If the program’s execution time in case 1 is significantly larger than that in case 2, the memory bandwidth demand of the program could be high enough to become the primary constraint for program scalability. If we run a sequential BDD multiplier application, the execution time on case (1) is approximately 1.6 times of the execution time on case (2). This highlights the intrinsic memory bandwidth demand of this application, and explains the imperfect linear speedup.
References:
[1] Implementation of an Efficient Parallel BDD Package. Tony Stornetta and Forrest Brewer.[2] A Parallel Algorithm for Constructing Binary Decision Diagrams. Shinji Kimura and Edmund M. Clarke.
[3] BDDNOW: A Parallel BDD Package. Kim Milvang-Jensen.
COMMENTS
posted @ Wednesday, June 03, 2009 8:59 AM by Derek Beatty
We also ran N-Queen problem with 12 queens. It gets 3x speedup on 4 cores, and 5x on 8 cores.
Do you have other benchmarks to recommend? Those run more than a few seconds.
Regards,
Yuxiong
posted @ Wednesday, June 03, 2009 10:06 AM by Yuxiong He
posted @ Thursday, July 09, 2009 12:53 AM by Ethan
posted @ Monday, July 27, 2009 2:37 PM by Bin Xin
Thank you for your interest in the parallel BDD. We only parallelized some well-used operations of BuDDy such as bdd_and, bdd_or, and bdd_not to demonstrate the applicability and efficiency of Cilk++ in this domain. These operations are sufficient for applications like adder, multiplier, nqueen problem, however, we didn't develop a parallel version supporting full functionality of BuDDy. Therefore, we don't provide public download and evaluation.
Regards,
Yuxiong
posted @ Wednesday, July 29, 2009 3:42 PM by Yuxiong
For more complete information about compiler optimizations, see our Optimization Notice.


Mahesh
This is good work. Did you publish a detailed report on this? If yes, do provide the link. Also the link for pdhash.tar file you have provided is not working. Can you provide a different link?
Thanks