| October 28, 2009 12:00 AM PDT | |
Posted by Yuxiong He originally on www.cilk.com on Tue, Apr 28, 2009
Murphi is a popular finite-state machine verification tool, used widely in the design of cache coherence algorithms and protocols, link-level protocols, executable memory model analysis, and analysis of cryptographic and security-related protocols. These complex protocols are often verified by examining all reachable protocol states from a set of possible start states. With proper memory management like hash compression and space reduction techniques, runtimes become a major limiting factor. We developed a multicore-enabled Murphi using the Cilk++ concurrency platform, and achieved excellent speedup:
Our parallelized Murphi is based on standard Murphi 3.1. The standard Murphi packages include a Murphi compiler and verifier, where the compiler translates models writing in the Murphi language to C++ code, and the verifier conducts the verification of this code. The Murphi compiler has approximately 19K lines of code, and the verifier has 7K lines. It took one engineer 2 weeks to parallelize Murphi using Cilk++. The parallel version supports all key functions of the sequential version. (It also fixes some severe memory leakage issues in the original code and provides 64-bit support where the original is highly dependent on the 32-bit platform.) Below I briefly describe the changes we made to parallelize the Murphi code.
Murphi verifies a protocol by traversing the entire state space, which is essentially a graph-traversal process. The most commonly used graph traversal algorithms are depth-first search (DFS) and breath-first search (BFS). We developed parallel DFS and BFS in the "Cilkified" version of Murphi.
- Depth-First Search: Sequential DFS in Murphi uses a hash table to store all visited states and a stack to track new states. With a thread-safe concurrent hash table and a concurrent stack, one can easily get DFS running in parallel. However, development of a thread-safe and almost contention-free concurrent stack could be a time-consuming task. We therefore took an alternative approach and implemented a recursive version of DFS without using a stack. Then all we needed is a concurrent hash table to make the sequential DFS run in parallel. Our concurrent hash table is lock-free, and uses atomic operation Compare_and_Swap to ensure thread safety for entry lookup and insertion.
- Breadth-First Search: Sequential BFS in Murphi uses a queue to track new states, and a hash table to store all visited states. We implemented the parallel breadth-first search slightly differently than the sequential version. Instead of using a queue approach, we used an implementation of an unordered set, called a bag . The parallel search keeps two bags at any time to represent the frontier of the search. The current bag contains the existing frontier of nodes and is walked in parallel, while the next bag accumulates the new frontier one step further from the root of the search. The start node for the search is initially placed into the current bag. The search is broken into phases, where each phase consists of examining all the nodes in the current bag, computing their successors, and storing those successors that have not yet been seen into the next bag. At the end of each phase, the current bag is destroyed, and the next bag becomes the new current bag. Since unioning bags is an associative operation, we declare the bags to be reducers, which allows each worker to operate largely on its own local bag. Our bag provides efficient implementation of insertion, merge, and split operations. Starting from n separate items, bag insertion and merging take constant amortized time and walking through splitting runs in linear time. (The bag implementation we used in Murphi is contributed by Tao Benjamin Schardl from his final year project at MIT.)
Another issue we faced when parallelizing Murphi code is that the sequential code uses a single buffer to perform all state operations. For each state operation like rule matching, invariant verification, etc., it copies the state into the single buffer, performs the operation, and copies the result out of the buffer. This structure is not parallel-friendly because that single buffer becomes the location of many data races if it is not protected by locks or atomic operations, and it becomes the center of contention if it is protected. One fix is to rewrite the program through passing state as parameter to avoid using the single buffer, however this fix would involve significant changes in the structure of the original verifier program as well as the compiler code. Alternatively, we chose to implement the single buffer as a hyperobject of the state, and it works as a holder where each worker has a different view of the buffer. This successfully solves the race and contention problem with just a few lines of changes to the original sequential program.
Our Murphi achieves excellent speedup on multicore machines. For most of the non-trivial applications in the example directory of Murphi package, speedup is linear. The figure above illustrates the results of checking ldash - a complex cache coherence protocol model. Multicore Murphi achieves 15.5X and 14.9X speedup using BFS and DFS respectively on 16-core AMD machine (four quad-core AMD Opteron processors connected using HyperTransport).
Our Multicore Murphi is available here for evaluation. This release includes the compiler binary, and source code of the verifier.
- If you are interested in speeding up your model checking process, give it a try!
- If you are interested in how to parallelize a full-size application using Cilk++, the source code of the Murphi verifier may be a good example.
- If you are interested in how to conduct parallel graph traversal or build an efficient dag, you may find our concurrent hash table and Bag implementation useful.
- Of course, feel free to try it just for fun as well!
For more complete information about compiler optimizations, see our Optimization Notice.


Omar Saad
Verry interesting article but ther is soem links are broken in the article especially the "here" link in the phrase "Our Multicore Murphi is available here for evaluation". Is it possile to have access to the code I am verry intersted in the cilkfied version of the depth-first search.
Regards
Omar Saad
IREQ/HQ