Fast Spectral Estimation of Genetic Homology

Submit New Article

July 5, 2010 9:00 PM PDT




Dr. Anton Pankratov and Dr. Ruslan Tetuev

Maxim Pyatkov, PhD candidate

Moscow State University, Department of Computational Mathematics and Cybernetics, Intel-MSU Students Lab

Institute of Mathematical Problems of Biology, Russian Academy of Sciences


Abstract

Imagine if you want to view homology of all known genetic sequences at once. As one single image it would be huge indeed. To do this we could identify pixels at positions (i,j) of the image with bright red points when the estimated similarity value of corresponding i-th and j-th DNA segments is above a predefined threshold value; green points would denote inverted identities, and so on. The questions remain: how could such a huge image be generated quickly, how would a user browse over its broad plane, and how would they pull out interesting subsequences separately for additional exploration?

Here we describe a novel approach for these tough problems. We believe that they could be easily resolved by means of high-performance computing on multi-core systems. We call it N.A.S.C.A.: Numerical Analytical Spectral Comparing Approach. With this approach we apply numerical analytical methods instead of using dynamic programming. Previously, it was demonstrated through Chebyshev's approximation that the comparison of DNA segments could be performed as comparison of spectral values of fragments of its GC content profile. This allows us to operate with extremely long DNA segments (up to 100000 base pairs) using only their spectral representations. The length of GC scanning usually scales as 256, 512, 1024, etc.

Next, we had to speed up the process. Because of recurrent formulas of Chebyshev orthogonal polynomials this approach lends itself well to being parallelized. Current implementations of our algorithm parallelization were based on OpenMP pragmas and Intel® Integrated Performance Primitive functions. We found that this scales linearly on 2- and 4-cores systems and have every reason to believe that the total acceleration of the algorithm will be achieved on a 32-core system. If it doesn't scale well, then we'd like to find out why and improve the speed of processing.

The work we are doing is at the intersection of several areas of research: bioinformatics, pattern recognition, and parallel computing. Our main objective using the Intel® Manycore Testing Lab is to develop an algorithm and program for the detection of inaccurate extended repeats in genomes over the complete comparison.

We believe that by testing our project in a 32-core development environment we can show the existence of extended repeats in the genomes, and demonstrate the possibility of exhaustive search in the problem of finding repeats.

Results

We used two variants of algorithms: one with indexing and one without to understand which one is better. Intel® C Compiler, GCC Compiler with OpenMP support, and Intel® Integrated Performance Primitives Library were used for analysis.

We found that indexing involves not only the preliminary calculations but also data compression, which results in better scalability on multicore processors. This gives us an idea for solving the IO problems in future applications.


Figure 1. The sample matrix of genetic homology.


Figure 2. The results of execution of program on different number of threads.

Conclusion

Next steps will include investigation of further parallelization and the scalability of algorithms. We expect that further research could be applied in the fields of Biology, Medicine, Bioinformatics, and Ontology.

Sources

All algorithms used are original with the following exceptions:

1) Gauleg routine borrowed from Numerical Recipes/ Oxford University Press, 2nd ed., 1992

2) XgetOpt by Hans Dietrich

Additional Resources:

Intel® Manycore Testing Lab: www.intel.com/software/manycoretestinglab

Intel Academic Community: www.intel.com/software/academic