Benchmark of Parallel Bipartite Graph Matching and Parallel Smith-Waterman on the Intel® Manycore Testing Lab

Submit New Article

February 6, 2011 11:00 PM PST


Authors
Rosni Abdullah and Ibrahim Umar
School of Computer Sciences, Universiti Sains Malaysia

ABSTRACT

Bipartite graph-based matching (BGM) technique has a powerful ability to give a global and optimized matching result. At the same time, one can take advantage of graph-based data representation used with the technique. Graph-based representation has the ability to encapsulate the attribute of the objects and also the relationship details among the elements of the compared objects. The combination of graph-based matching technique and of graph-based data structure were best applied in protein tertiary structure matching, because protein structure has rich information that needs to be handled with care. The matching result was measured using an exponential similarity function.

Sequence alignment algorithm plays an important role in finding the homologous groups of biological sequences which may help to determine the function of new sequences. This operation is currently one of the most heavily used operations in computational biology. The Smith-Waterman (SW) algorithm has been the most accurate sequence alignment algorithms available but also consumes a great amount of time during its operation.

By the aid of parallel processing using OpenMP and the 32-cores within the Intel® Manycore Testing Lab platform, we aimed to accomplish the objective of a multiple-fold faster execution time of BGM and SW algorithm computations.

Our previous works of parallel BGM and SW were benchmarked on dual quad-core machines and have shown respectable time speed-ups on those platforms. We expect to get even better speed-ups on the Manycore Testing Lab nodes as we have designed our code to scale on the multi-core platform.

RESULTS

We ported our existing BGM and SW parallel codes to suit the Manycore Testing Lab running environment. We used 3 different compilers (Gnu* C Compiler v4.4.3, Gnu* C Compiler v4.5.1, and Intel® C Compiler v11.1.072) which are available within the Manycore Testing Lab to compile them. All combinations of number of threads and compilers are repeated 6 times each. The average of 6 repetitions will be taken as the final result whenever the standard deviation is low. We used the UNIX's time command to test the time performance of the running algorithm.

No heavy modifications were done to the BGM and SW codes in order to run in the Manycore Testing Lab environment. However, some simple modifications were made. We omitted most of the standard output (STDOUT) writing command to make sure that the increase in OpenMP operating threads would not also increase the time needed to write outputs to the console.



Figure 1. The parallel Bipartite Graph Matching code benchmark on Intel Manycore Testing Lab

The BGM benchmark on the Manycore Testing Lab delivered a promising result in terms of scalability. While there are no significant differences between the different compilers used, the graph in Figure 1 showed good scalability of the parallel BGM algorithm's codes.

The transition from 1 to 2 threads showed the most optimal speedup (1.94x - 1.95x) and efficiency (+/- 97%). After 2 threads, the increment in threads delivers speedups that continue to rise but with decreasing efficiency. During the highest speedups recorded, the efficiency is only around 40%-50% across all the different compilers.

Also in Figure 1, performance degradation in all of the compilers can be observed after more than 32 threads are used. The reason behind this behavior is the difference between the "physical processor" and "logical processor" sizes. In the Manycore Testing Lab node, the supported number of threads or logical processor is 64 while the amount of the physical processor or the real physical cores available is 32. These available physical cores are using the Intel® Hyper-Threading Technology which doubles the amount of threads which can be processed at the same time. However, the speed of hyper threaded threads is not equivalent to to real processor core threads performance. An added overhead in creating and processing threads to a core after all available physical processors are used was greater than the additional processing power added, thus dragging down the whole execution time performance, especially when a thread is heavily accessing matrices data from main memory and raw data from disk -- as is happening in this parallel BGM code.



Figure 2. The parallel Smith-Waterman code benchmark on the Manycore Testing Lab

As we stated in the beginning of this section, using different compilers did not give a significant performance difference. However, it seems that the Intel C Compiler delivers a marginally better performance for the parallel BGM code after 32 threads or whenever the Intel Hyper-Threading Technology is activated.

In Figure 2, we show that the implementation of parallel SW is not really scalable, especially in high number of threads. The maximum speedups recorded were 7.87 in 15 threads using Gnu C Compiler 4.4, 7.67 in 14 threads using Gnu C Compiler 4.5 and 6.87 in 13 threads when using Intel C Compiler 11.1. This brought an efficiency of 52.4%, 51%, and 52% respectively in the highest speedups. After that, additional threads thrown in the execution only to worsen the execution time performance, as can be seen in the inclining line graph after 13-15 threads.

To understand the cause of this inferior behavior performance on the Manycore Testing Lab, we need to understand that initially SW was developed for the dual quad-core Intel® Xeon® processor. One developer of this codes' algorithm did not foresee the possibility of testing the algorithm in the higher number of core resources. So, portions of the code itself were hard-coded to accommodate only to a maximum number of 8 threads.

There are no significant differences between different compilers in this parallel SW benchmark. However, the Intel C Compiler 11.1 time performance was slightly slower compared to others in higher number of threads.

CONCLUSION

The benchmark tests have shown the level of scalability of both the BGM and SW codes. Our implementation of parallel BGM is more scalable compared to the parallel SW implementation. However we perceived the scalability of parallel BGM and SW can be pushed even further by minimizing parallel overhead and improving the method of how data is fetched.

Currently, there are plenty of file I/O during the execution of the BGM that can be made more efficient by introducing pre-processing steps. We can also align the data in memory to take advantage of the hyper-threading that is superior when processing locally adjacent data. Bigger data sets can also feed to step up the efficiency of the OpenMP parallel loop; therefore could shrink the overall overhead ratio.

The inferior scalability of parallel SW can be improved by optimizing its data fetching and processing mechanism. Our parallel SW code was operating using a defined size of data block. We can improve this by finding the optimal block size or if possible make it as dynamic to the available computing resources. Also we found some of the hard-coded optimization for 8 cores can be fixed in order for the SW code to accommodate more available computing cores.

In conclusion, Intel Manycore Testing Lab has provided us an exquisite platform to work on. Our parallel BGM code can easily enjoy up to 14x time reduction by only adding several lines of OpenMP loop pragmas. The parallel SW was disappointing however; a maximum of 7x speedup is not really that bad. The road to improvement has been identified earlier in this section and currently serves as our future directives.

SOURCES

1. Othman, F., Abdullah, R., and Salam, R. A. 2008. Bipartite Graph for Protein Structure Matching. In Proceedings of the 2008 Second Asia international Conference on Modelling & Simulation (Ams) (May 13 - 15, 2008). AMS. IEEE Computer Society, Washington, DC, 928-933. DOI= http://dx.doi.org/10.1109/AMS.2008.89

2. Othman, F., Umar, I., Abdullah, R., and Rathi, A. 2010. Parallel Bipartite Graph Algorithm for Protein Structure Matching Using OpenMP. In Proceedings of the 2010 Second international Conference on Computer Research and Development (May 07 - 10, 2010). ICCRD. IEEE Computer Society, Washington, DC, 25-29. DOI= http://dx.doi.org/10.1109/ICCRD.2010.58

3. Umar, I., Alqudami, N., Rashid, N. A., and Abdullah, R. 2009. Parallelisation of sequence comparison algorithms using hybridised parallel techniques. In Proceedings of the 6th international Conference on High Capacity Optical Networks and Enabling Technologies (Alexandria, Egypt, December 28 - 30, 2009). IEEE Press, Piscataway, NJ, 30-35.

4. Noorian, M., Pooshfam, H., Noorian, Z., and Abdullah, R. 2009. Performance enhancement of smith-waterman algorithm using hybrid model: comparing the MPI and hybrid programming paradigm on SMP clusters. In Proceedings of the 2009 IEEE international Conference on Systems, Man and Cybernetics (San Antonio, TX, USA, October 11 - 14, 2009). IEEE Press, Piscataway, NJ, 492-497.