Matching DNA substrings - Intel Acceler8 contest

Intel Acceler8 contest
Teacher at INSA of RENNES : Pascal Garcia
Students : Cédric Andreolli, Marine Messager


may 2012  

 

Introduction

We present in this article our algorithm to solve the Intel acceler8 problem of may 2012. We first give an overview of the problem, then present our approach and conclude.

 

Problem

As stated in the problem definition web page, we have to compare one reference sequence (in a file) with a variable number of sequences (from one or more files). The sequences are composed from four characters : A, T, C, G. The goal is to find substrings of at least minSize (a parameter) characters in the input files matching exactly in the reference one.
For example:

  • the reference sequence is: ATCTTGAACGTCAGTCA,
  • the minimum size of a match: 4,
  • the first input sequence is: AGGGTGAACGTCA,
  • the second input sequence is: GAAGA,

then we have one match of length 9 in the first input sequence and none in the second one.
Note that some postprocessing is done on the matches found to reduce the number of results. We let the reader look the details of this pruning in the problem description page.

 

Solution

 

Principle

For a given reference and input sequences, the basic principle of our algorithm is to match all substrings of length minSize (the minimum matching length) in the input sequence against the reference one and if a match occurs, try to expand this match. For example, if the reference sequence is:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A T C T T G A A C G T C A G T C

the input sequence is:

0 1 2 3 4 5 6 7 8 9 10
G T C T T G C A G T C

and minSize is equal to 4, we will consider all substrings of length 4 in the input string: GTCT, TCTT, CTTG, ..., AGTC. There is |input| - minSize + 1 such sequences. In this example, we find that the substrings TCTT, CTTG, CAGT and AGTC match in the reference sequence. We need to expand each match to find the maximum match possible. So finally, in this example, we obtain the two matches TCTTG and CAGTC.
If we use a naive algorithm to search in the reference sequence each substring of length minSize of the input one, the complexity would be too high (O(|reference| x |input| x minSize ). We describe in the next section how we handle this search.

 

Searching for substrings

Representation of the characters

Our alphabet for this problem consists of only four characters (A, C, T, G). As we can see on the following figure, we can represent each character with two bits instead of 8. We use the representation of figure 1 to save space.

 

Rabin-Karp

We use a Rabin-Karp style algorithm [1] for searching the different substrings of length minSize in the reference sequence, with some modifications for our problem. The Rabin-Karp algorithm use a hash value to represent substrings. For example, if we try to find the substring ACTG, of length 4, in the reference string AATGCTGACTGA, the Rabin-Karp algorithm does the following:


  • compute the hash value for ACTG, hash(ACTG) = vACTG,


  • scan the reference string:


    • first compute a hash value for AATG, hash(AATG) = vAATG. We compare this value with vACTG. If the values match we possibly have a match. It is possible that two strings give the same hash value without being equals. So when there is a match, we have to verify that the strings are really equals. But, if the two values are not equals there cannot have a match,


    • we then proceed to the next block ATGC but we do not have to compute its hash value from scratch. This hash value is computed from the last value vAATG and the value of the new character C. We will not detail this step further because we will use a different approach,


    • all the blocks of size 4 of the reference string are processed as descibed in the preceding item.



We describe in the next section how we compute the hash value.

Hash value

Because we compress the representation of the four characters (A, T, C, G) we can pack 32 of them in 64-bits (a long on a 64-bits architecture). For example, the string:

TACTGAACCTGGAATTC
is represented by the number:


10 00 01 10 11 00 00 01 01 10 11 11 00 00 10 10 01

If we want to compute a hash value for 4 characters for example, we take the number represented by the concatenation of our compressed representation of each characters. For example, the hash value of TACT is 10000110. As we fix a limit to the maximum number of characters we consider to compute a hash value (currently 12 characters in our program) it is easy to compute it with bit operations on a maximum of two long integers. To be more precise, if we need to compute a hash value for a string of more characters than our limit, we consider the first characters of the string up to our limit to compute the hash value.

As we discussed in the Rabin-Karp section, we have to search all substrings of length minSize from the input sequence in the reference sequence. If the input sequence is of length n, we will have to do n such searches. To save some time, we first process the reference sequence to cache some values in a hash table. We described this step in the next section.

Preprocessing the reference sequence

The idea is to compute the hash values of the substrings of length minSize in the reference sequence and save the coordinate of the first character of each substring in a hash table. For example, if the reference sequence is:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A T C G G T T C A T G T C A G T

we will save the value 0 associated to the key hash(ATCG) into the hash table, the value 1 associated to the key hash(TCGG), the value 2 associated to the key hash(CGGT) and so on. When we try to find a substring s of the input sequence in the reference one, we compute the hash value of s and see if this hash value is present in the hash table. If it is present, we get the stored coordinates for this key in the hash table and inspect each of them to see if a match is really present. Note that in our hash table, we only save the coordinates and not the keys. This save space and it was better (in our experiments) to consider more coordinates for a potential match than to save the additional keys.

 


But we can save a lot more space because there is a lot in common between the different substrings we cache in the hash table. We define a new value called minForHash that is derived from minSize and that is strictly less than minSize. For example, let minSize=6 and minForHash=3.The following figure shows that the substring GGT, of size minForHash, is common to the four consecutives substrings of length minSize.

 

 



We use this property to save space by computing only hashes of size minForHash:

 

  • we compute hash of length minForHash in the reference sequence at indices multiple of minSize - minForHash + 1. In our example, we will save in the hash table the coordinates at indices 0, 4, 8, 12, ..., the coordinate 0 will be for the hash value hash(ATC), the coordinate 4 for the hash value hash(GTT) and so on,

     

  • when we scan the input sequence, we compute hashes of length minForHash at every position in this string (0, 1, 2, 3, ...) and use the preceding hash table to see if it is possible to find a match. If a match occurs, we check if we really have one by scanning the reference sequence at each coordinates returned.

     

We cannot miss a match with this method. To have an idea of the proof of this claim, look at the next figure.

 

 

 

In this Figure, we consider minSize=10 and minForHash=4. The string in this figure is a chunk of the reference sequence. The boxes around the characters represent a sequence of characters used to compute a hash value and store the coordinate of the first character of this sequence. The first characters of two consecutive boxes are separated by minSize - minForHash + 1. In the figure is represented a window of length minSize. If we slide this window anywhere, it will always contain a sequence of minForHash characters for which we computed a hash value. Because we test each sequences of minForHash characters in the input sequences, we cannot miss a match of length minSize between the reference and the input sequence.

 

Expanding the matches

When we found a match of length minForHash between the input and reference sequences we have to expand it to the left and the right to see if a match of length at least minSize exists. The only point that deserves more explanation is the expansion toward the left. Indeed, as illustrated in the following figure, if we start expanding from point p1 of the reference sequence and find a match to the left up to point p2, where p2 is the starting character of a hashed sequence, we can drop this match because another one to the left will match it.

 


Complexity

The worst case complexity of the Rabin-Karp algorithm is O((|ref erence| − minSize + 1) × minSize) if we search a substring of size minSize in the reference sequence. Consequently, the worst case complexity of our algorithm is approximately: O((|ref erence| − minSize + 1)/(minSize − minF orHash + 1) + (|input| − minF orHash + 1) × bucketSize × minSize)

where bucketSize is the max size of a bucket in the hash table. Note that we estimate the worst match length to minSize because we prune to the left when we expand and so it seems quite reasonable.

Parallelization of the search

To parallelize the search:

 

  • we create a hash table by thread. Each thread will fill is own hash table to avoid contention between threads. We implement our hash table with the idea of avoiding pointer chasing (like the classical one with buckets) and trying to be more cache friendly,

     

  • we then distribute the scan of an input sequence to the different threads, where a thread will try to find a match in the reference sequence using all the hash tables. So each thread is given a chunk of the input sequence and will try to match all substrings of size minForHash of this chunk in the reference sequence and if a match occurs try to expand it and save the results if necessary.

     

We used OpenMP to parallelize our code.

 

Reading the files

We first load the reference sequence with memory mapped file I/O and compress (see the section on our compressed representation) it into a vector of long. For each input sequence file:

 

  • we load the file with memory mapped file I/O,

     

  • we scan in parallel the file to find the position of each character '>' to split this file in input sequences,

     

  • each thread then process each input sequence to compress them into a vector of long.

     

 

Conclusion

We have tested different approaches and found that matching substrings of the minimum length and expanding them afterward was the best for us. Our approach is based on the Rabin-Karp algorithm [1] with some modifications for computing the hash function. We add also a reduction in the number of hashes to compute due to the specificities of the problem.

 

Bibliography

 

1
Richard M. Karp and Michael O. Rabin.
Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev., 31(2):249-260, March 1987.

 

You can download our full sources by clicking on this link.

Reportez-vous à notre Notice d'optimisation pour plus d'informations sur les choix et l'optimisation des performances dans les produits logiciels Intel.