Teacher at INSA of RENNES : Pascal Garcia
may 2012
Introduction
We present in this article our algorithm to solve the Intel acceler8 problem of may 2012. We first give an overview of the problem, then present our approach and conclude.
Problem
As stated in the problem definition web page, we have to compare one reference sequence (in a file) with a variable number of sequences (from one or more files). The sequences are composed from four characters : A, T, C, G. The goal is to find substrings of at least minSize
(a parameter) characters in the input files matching exactly in the reference one.
For example:
 the reference sequence is:
ATCTTGAACGTCAGTCA
,  the minimum size of a match:
4
,  the first input sequence is:
AGGGTGAACGTCA
,  the second input sequence is:
GAAGA
,
then we have one match of length 9 in the first input sequence and none in the second one.
Note that some postprocessing is done on the matches found to reduce the number of results. We let the reader look the details of this pruning in the problem description page.
Solution
Principle
For a given reference and input sequences, the basic principle of our algorithm is to match all substrings of length minSize
(the minimum matching length) in the input sequence against the reference one and if a match occurs, try to expand this match. For example, if the reference sequence is:
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
A  T  C  T  T  G  A  A  C  G  T  C  A  G  T  C 
the input sequence is:
0  1  2  3  4  5  6  7  8  9  10  
G  T  C  T  T  G  C  A  G  T  C 
and minSize
is equal to 4, we will consider all substrings of length 4 in the input string: GTCT
, TCTT
, CTTG
, ..., AGTC
. There is input  minSize + 1 such sequences. In this example, we find that the substrings TCTT
, CTTG
, CAGT
and AGTC
match in the reference sequence. We need to expand each match to find the maximum match possible. So finally, in this example, we obtain the two matches TCTTG
and CAGTC
.
If we use a naive algorithm to search in the reference sequence each substring of length minSize
of the input one, the complexity would be too high (O(reference x input x minSize ). We describe in the next section how we handle this search.
Searching for substrings
Representation of the characters
Our alphabet for this problem consists of only four characters (A, C, T, G
). As we can see on the following figure, we can represent each character with two bits instead of 8. We use the representation of figure 1 to save space.
RabinKarp
We use a RabinKarp style algorithm [1] for searching the different substrings of length minSize
in the reference sequence, with some modifications for our problem. The RabinKarp algorithm use a hash value to represent substrings. For example, if we try to find the substring ACTG
, of length 4, in the reference string AATGCTGACTGA
, the RabinKarp algorithm does the following:
 compute the hash value for
ACTG
,hash(ACTG) = v_{ACTG}
,  scan the reference string:
 first compute a hash value for
AATG
,hash(AATG)
= v_{AATG}. We compare this value with v_{ACTG}. If the values match we possibly have a match. It is possible that two strings give the same hash value without being equals. So when there is a match, we have to verify that the strings are really equals. But, if the two values are not equals there cannot have a match,  we then proceed to the next block
ATGC
but we do not have to compute its hash value from scratch. This hash value is computed from the last value v_{AATG} and the value of the new characterC
. We will not detail this step further because we will use a different approach,  all the blocks of size 4 of the reference string are processed as descibed in the preceding item.
 first compute a hash value for
We describe in the next section how we compute the hash value.
Hash value
Because we compress the representation of the four characters (A, T, C, G
) we can pack 32 of them in 64bits (a long
on a 64bits architecture). For example, the string:
TACTGAACCTGGAATTCis represented by the number:
10 00 01 10 11 00 00 01 01 10 11 11 00 00 10 10 01
If we want to compute a hash value for 4 characters for example, we take the number represented by the concatenation of our compressed representation of each characters. For example, the hash value of TACT
is 10000110
. As we fix a limit to the maximum number of characters we consider to compute a hash value (currently 12 characters in our program) it is easy to compute it with bit operations on a maximum of two long
integers. To be more precise, if we need to compute a hash value for a string of more characters than our limit, we consider the first characters of the string up to our limit to compute the hash value.
As we discussed in the RabinKarp section, we have to search all substrings of length minSize
from the input sequence in the reference sequence. If the input sequence is of length n, we will have to do n such searches. To save some time, we first process the reference sequence to cache some values in a hash table. We described this step in the next section.
Preprocessing the reference sequence
The idea is to compute the hash values of the substrings of length minSize
in the reference sequence and save the coordinate of the first character of each substring in a hash table. For example, if the reference sequence is:
0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15 
A  T  C  G  G  T  T  C  A  T  G  T  C  A  G  T 
we will save the value 0 associated to the key hash(ATCG)
into the hash table, the value 1 associated to the key hash(TCGG)
, the value 2 associated to the key hash(CGGT)
and so on. When we try to find a substring s
of the input sequence in the reference one, we compute the hash value of s
and see if this hash value is present in the hash table. If it is present, we get the stored coordinates for this key in the hash table and inspect each of them to see if a match is really present. Note that in our hash table, we only save the coordinates and not the keys. This save space and it was better (in our experiments) to consider more coordinates for a potential match than to save the additional keys.
But we can save a lot more space because there is a lot in common between the different substrings we cache in the hash table. We define a new value called minForHash
that is derived from minSize
and that is strictly less than minSize
. For example, let minSize=6
and minForHash=3.
The following figure shows that the substring GGT
, of size minForHash
, is common to the four consecutives substrings of length minSize
.
We use this property to save space by computing only hashes of size minFor
Hash
:
 we compute hash of length
minForHash
in the reference sequence at indices multiple ofminSize  minForHash + 1
. In our example, we will save in the hash table the coordinates at indices0, 4, 8, 12, ...
, the coordinate0
will be for the hash valuehash(ATC)
, the coordinate4
for the hash valuehash(GTT)
and so on,  when we scan the input sequence, we compute hashes of length
minForHash
at every position in this string (0, 1, 2, 3, ...
) and use the preceding hash table to see if it is possible to find a match. If a match occurs, we check if we really have one by scanning the reference sequence at each coordinates returned.
We cannot miss a match with this method. To have an idea of the proof of this claim, look at the next figure.
In this Figure, we consider minSize=10
and minForHash=4
. The string in this figure is a chunk of the reference sequence. The boxes around the characters represent a sequence of characters used to compute a hash value and store the coordinate of the first character of this sequence. The first characters of two consecutive boxes are separated by minSize  minForHash + 1
. In the figure is represented a window of length minSize
. If we slide this window anywhere, it will always contain a sequence of minForHash
characters for which we computed a hash value. Because we test each sequences of minForHash
characters in the input sequences, we cannot miss a match of length minSize
between the reference and the input sequence.
Expanding the matches
When we found a match of length minForHash
between the input and reference sequences we have to expand it to the left and the right to see if a match of length at least minSize
exists. The only point that deserves more explanation is the expansion toward the left. Indeed, as illustrated in the following figure, if we start expanding from point p1
of the reference sequence and find a match to the left up to point p2
, where p2
is the starting character of a hashed sequence, we can drop this match because another one to the left will match it.
Complexity
The worst case complexity of the RabinKarp algorithm is O((ref erence − minSize + 1) × minSize) if we search a substring of size minSize in the reference sequence. Consequently, the worst case complexity of our algorithm is approximately: O((ref erence − minSize + 1)/(minSize − minF orHash + 1) + (input − minF orHash + 1) × bucketSize × minSize)
where bucketSize is the max size of a bucket in the hash table. Note that we estimate the worst match length to minSize because we prune to the left when we expand and so it seems quite reasonable.
Parallelization of the search
To parallelize the search:
 we create a hash table by thread. Each thread will fill is own hash table to avoid contention between threads. We implement our hash table with the idea of avoiding pointer chasing (like the classical one with buckets) and trying to be more cache friendly,
 we then distribute the scan of an input sequence to the different threads, where a thread will try to find a match in the reference sequence using all the hash tables. So each thread is given a chunk of the input sequence and will try to match all substrings of size
minForHash
of this chunk in the reference sequence and if a match occurs try to expand it and save the results if necessary.
We used OpenMP
to parallelize our code.
Reading the files
We first load the reference sequence with memory mapped file I/O and compress (see the section on our compressed representation) it into a vector
of long
. For each input sequence file:
 we load the file with memory mapped file I/O,
 we scan in parallel the file to find the position of each character
'>'
to split this file in input sequences,  each thread then process each input sequence to compress them into a
vector
oflong
.
Conclusion
We have tested different approaches and found that matching substrings of the minimum length and expanding them afterward was the best for us. Our approach is based on the RabinKarp algorithm [1] with some modifications for computing the hash function. We add also a reduction in the number of hashes to compute due to the specificities of the problem.
Bibliography

 1
 Richard M. Karp and Michael O. Rabin.
Efficient randomized patternmatching algorithms.
IBM J. Res. Dev., 31(2):249260, March 1987.
You can download our full sources by clicking on this link.