Teacher at INSA of RENNES : Pascal Garcia
We present in this article our algorithm to solve the Intel acceler8 problem of may 2012. We first give an overview of the problem, then present our approach and conclude.
As stated in the problem definition web page, we have to compare one reference sequence (in a file) with a variable number of sequences (from one or more files). The sequences are composed from four characters : A, T, C, G. The goal is to find substrings of at least
minSize (a parameter) characters in the input files matching exactly in the reference one.
- the reference sequence is:
- the minimum size of a match:
- the first input sequence is:
- the second input sequence is:
then we have one match of length 9 in the first input sequence and none in the second one.
Note that some postprocessing is done on the matches found to reduce the number of results. We let the reader look the details of this pruning in the problem description page.
For a given reference and input sequences, the basic principle of our algorithm is to match all substrings of length
minSize (the minimum matching length) in the input sequence against the reference one and if a match occurs, try to expand this match. For example, if the reference sequence is:
the input sequence is:
minSize is equal to 4, we will consider all substrings of length 4 in the input string:
AGTC. There is |input| - minSize + 1 such sequences. In this example, we find that the substrings
AGTC match in the reference sequence. We need to expand each match to find the maximum match possible. So finally, in this example, we obtain the two matches
If we use a naive algorithm to search in the reference sequence each substring of length
minSize of the input one, the complexity would be too high (O(|reference| x |input| x minSize ). We describe in the next section how we handle this search.
Representation of the characters
Our alphabet for this problem consists of only four characters (
A, C, T, G). As we can see on the following figure, we can represent each character with two bits instead of 8. We use the representation of figure 1 to save space.
We use a Rabin-Karp style algorithm  for searching the different substrings of length
minSize in the reference sequence, with some modifications for our problem. The Rabin-Karp algorithm use a hash value to represent substrings. For example, if we try to find the substring
ACTG, of length 4, in the reference string
AATGCTGACTGA, the Rabin-Karp algorithm does the following:
- compute the hash value for
hash(ACTG) = vACTG,
- scan the reference string:
- first compute a hash value for
hash(AATG)= vAATG. We compare this value with vACTG. If the values match we possibly have a match. It is possible that two strings give the same hash value without being equals. So when there is a match, we have to verify that the strings are really equals. But, if the two values are not equals there cannot have a match,
- we then proceed to the next block
ATGCbut we do not have to compute its hash value from scratch. This hash value is computed from the last value vAATG and the value of the new character
C. We will not detail this step further because we will use a different approach,
- all the blocks of size 4 of the reference string are processed as descibed in the preceding item.
- first compute a hash value for
We describe in the next section how we compute the hash value.
Because we compress the representation of the four characters (
A, T, C, G) we can pack 32 of them in 64-bits (a
long on a 64-bits architecture). For example, the string:
TACTGAACCTGGAATTCis represented by the number:
10 00 01 10 11 00 00 01 01 10 11 11 00 00 10 10 01
If we want to compute a hash value for 4 characters for example, we take the number represented by the concatenation of our compressed representation of each characters. For example, the hash value of
10000110. As we fix a limit to the maximum number of characters we consider to compute a hash value (currently 12 characters in our program) it is easy to compute it with bit operations on a maximum of two
long integers. To be more precise, if we need to compute a hash value for a string of more characters than our limit, we consider the first characters of the string up to our limit to compute the hash value.
As we discussed in the Rabin-Karp section, we have to search all substrings of length
minSize from the input sequence in the reference sequence. If the input sequence is of length n, we will have to do n such searches. To save some time, we first process the reference sequence to cache some values in a hash table. We described this step in the next section.
Preprocessing the reference sequence
The idea is to compute the hash values of the substrings of length
minSize in the reference sequence and save the coordinate of the first character of each substring in a hash table. For example, if the reference sequence is:
we will save the value 0 associated to the key
hash(ATCG) into the hash table, the value 1 associated to the key
hash(TCGG), the value 2 associated to the key
hash(CGGT) and so on. When we try to find a substring
s of the input sequence in the reference one, we compute the hash value of
s and see if this hash value is present in the hash table. If it is present, we get the stored coordinates for this key in the hash table and inspect each of them to see if a match is really present. Note that in our hash table, we only save the coordinates and not the keys. This save space and it was better (in our experiments) to consider more coordinates for a potential match than to save the additional keys.
But we can save a lot more space because there is a lot in common between the different substrings we cache in the hash table. We define a new value called
minForHash that is derived from
minSize and that is strictly less than
minSize. For example, let
minForHash=3.The following figure shows that the substring
GGT, of size
minForHash, is common to the four consecutives substrings of length
We use this property to save space by computing only hashes of size
- we compute hash of length
minForHashin the reference sequence at indices multiple of
minSize - minForHash + 1. In our example, we will save in the hash table the coordinates at indices
0, 4, 8, 12, ..., the coordinate
0will be for the hash value
hash(ATC), the coordinate
4for the hash value
hash(GTT)and so on,
- when we scan the input sequence, we compute hashes of length
minForHashat every position in this string (
0, 1, 2, 3, ...) and use the preceding hash table to see if it is possible to find a match. If a match occurs, we check if we really have one by scanning the reference sequence at each coordinates returned.
We cannot miss a match with this method. To have an idea of the proof of this claim, look at the next figure.
In this Figure, we consider
minForHash=4. The string in this figure is a chunk of the reference sequence. The boxes around the characters represent a sequence of characters used to compute a hash value and store the coordinate of the first character of this sequence. The first characters of two consecutive boxes are separated by
minSize - minForHash + 1. In the figure is represented a window of length
minSize. If we slide this window anywhere, it will always contain a sequence of
minForHash characters for which we computed a hash value. Because we test each sequences of
minForHash characters in the input sequences, we cannot miss a match of length
minSize between the reference and the input sequence.
Expanding the matches
When we found a match of length
minForHash between the input and reference sequences we have to expand it to the left and the right to see if a match of length at least
minSize exists. The only point that deserves more explanation is the expansion toward the left. Indeed, as illustrated in the following figure, if we start expanding from point
p1 of the reference sequence and find a match to the left up to point
p2 is the starting character of a hashed sequence, we can drop this match because another one to the left will match it.
The worst case complexity of the Rabin-Karp algorithm is O((|ref erence| − minSize + 1) × minSize) if we search a substring of size minSize in the reference sequence. Consequently, the worst case complexity of our algorithm is approximately: O((|ref erence| − minSize + 1)/(minSize − minF orHash + 1) + (|input| − minF orHash + 1) × bucketSize × minSize)
where bucketSize is the max size of a bucket in the hash table. Note that we estimate the worst match length to minSize because we prune to the left when we expand and so it seems quite reasonable.
Parallelization of the search
To parallelize the search:
- we create a hash table by thread. Each thread will fill is own hash table to avoid contention between threads. We implement our hash table with the idea of avoiding pointer chasing (like the classical one with buckets) and trying to be more cache friendly,
- we then distribute the scan of an input sequence to the different threads, where a thread will try to find a match in the reference sequence using all the hash tables. So each thread is given a chunk of the input sequence and will try to match all substrings of size
minForHashof this chunk in the reference sequence and if a match occurs try to expand it and save the results if necessary.
OpenMP to parallelize our code.
We first load the reference sequence with memory mapped file I/O and compress (see the section on our compressed representation) it into a
long. For each input sequence file:
- we load the file with memory mapped file I/O,
- we scan in parallel the file to find the position of each character
'>'to split this file in input sequences,
- each thread then process each input sequence to compress them into a
We have tested different approaches and found that matching substrings of the minimum length and expanding them afterward was the best for us. Our approach is based on the Rabin-Karp algorithm  with some modifications for computing the hash function. We add also a reduction in the number of hashes to compute due to the specificities of the problem.
- Richard M. Karp and Michael O. Rabin.
Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev., 31(2):249-260, March 1987.
You can download our full sources by clicking on this link.