# Matching DNA substrings - Intel Acceler8 contest

Intel Acceler8 contest
Teacher at INSA of RENNES : Pascal Garcia
Students : Cédric Andreolli, Marine Messager

may 2012

# Introduction

We present in this article our algorithm to solve the Intel acceler8 problem of may 2012. We first give an overview of the problem, then present our approach and conclude.

# Problem

As stated in the problem definition web page, we have to compare one reference sequence (in a file) with a variable number of sequences (from one or more files). The sequences are composed from four characters : A, T, C, G. The goal is to find substrings of at least `minSize` (a parameter) characters in the input files matching exactly in the reference one.
For example:

• the reference sequence is: `ATCTTGAACGTCAGTCA`,
• the minimum size of a match: `4`,
• the first input sequence is: `AGGGTGAACGTCA`,
• the second input sequence is: `GAAGA`,

then we have one match of length 9 in the first input sequence and none in the second one.
Note that some postprocessing is done on the matches found to reduce the number of results. We let the reader look the details of this pruning in the problem description page.

# Solution

## Principle

For a given reference and input sequences, the basic principle of our algorithm is to match all substrings of length `minSize` (the minimum matching length) in the input sequence against the reference one and if a match occurs, try to expand this match. For example, if the reference sequence is:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 A T C T T G A A C G T C A G T C

the input sequence is:

 0 1 2 3 4 5 6 7 8 9 10 G T C T T G C A G T C

and `minSize` is equal to 4, we will consider all substrings of length 4 in the input string: `GTCT`, `TCTT`, `CTTG`, ..., `AGTC`. There is |input| - minSize + 1 such sequences. In this example, we find that the substrings `TCTT`, `CTTG`, `CAGT` and `AGTC` match in the reference sequence. We need to expand each match to find the maximum match possible. So finally, in this example, we obtain the two matches `TCTTG` and `CAGTC`.
If we use a naive algorithm to search in the reference sequence each substring of length `minSize` of the input one, the complexity would be too high (O(|reference| x |input| x minSize ). We describe in the next section how we handle this search.

## Searching for substrings

Representation of the characters

Our alphabet for this problem consists of only four characters (`A, C, T, G`). As we can see on the following figure, we can represent each character with two bits instead of 8. We use the representation of figure 1 to save space.

Rabin-Karp

We use a Rabin-Karp style algorithm [1] for searching the different substrings of length `minSize` in the reference sequence, with some modifications for our problem. The Rabin-Karp algorithm use a hash value to represent substrings. For example, if we try to find the substring `ACTG`, of length 4, in the reference string `AATGCTGACTGA`, the Rabin-Karp algorithm does the following:

• compute the hash value for `ACTG`, `hash(ACTG) = vACTG`,

• scan the reference string:

• first compute a hash value for `AATG`, `hash(AATG)` = vAATG. We compare this value with vACTG. If the values match we possibly have a match. It is possible that two strings give the same hash value without being equals. So when there is a match, we have to verify that the strings are really equals. But, if the two values are not equals there cannot have a match,

• we then proceed to the next block `ATGC` but we do not have to compute its hash value from scratch. This hash value is computed from the last value vAATG and the value of the new character `C`. We will not detail this step further because we will use a different approach,

• all the blocks of size 4 of the reference string are processed as descibed in the preceding item.

We describe in the next section how we compute the hash value.

Hash value

Because we compress the representation of the four characters (`A, T, C, G`) we can pack 32 of them in 64-bits (a `long` on a 64-bits architecture). For example, the string:

```TACTGAACCTGGAATTC
```
is represented by the number:

```10 00 01 10 11 00 00 01 01 10 11 11 00 00 10 10 01
```

If we want to compute a hash value for 4 characters for example, we take the number represented by the concatenation of our compressed representation of each characters. For example, the hash value of `TACT` is `10000110`. As we fix a limit to the maximum number of characters we consider to compute a hash value (currently 12 characters in our program) it is easy to compute it with bit operations on a maximum of two `long` integers. To be more precise, if we need to compute a hash value for a string of more characters than our limit, we consider the first characters of the string up to our limit to compute the hash value.

As we discussed in the Rabin-Karp section, we have to search all substrings of length `minSize` from the input sequence in the reference sequence. If the input sequence is of length n, we will have to do n such searches. To save some time, we first process the reference sequence to cache some values in a hash table. We described this step in the next section.

Preprocessing the reference sequence

The idea is to compute the hash values of the substrings of length `minSize` in the reference sequence and save the coordinate of the first character of each substring in a hash table. For example, if the reference sequence is:

 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 A T C G G T T C A T G T C A G T

we will save the value 0 associated to the key `hash(ATCG)` into the hash table, the value 1 associated to the key `hash(TCGG)`, the value 2 associated to the key `hash(CGGT)` and so on. When we try to find a substring `s` of the input sequence in the reference one, we compute the hash value of `s` and see if this hash value is present in the hash table. If it is present, we get the stored coordinates for this key in the hash table and inspect each of them to see if a match is really present. Note that in our hash table, we only save the coordinates and not the keys. This save space and it was better (in our experiments) to consider more coordinates for a potential match than to save the additional keys.

But we can save a lot more space because there is a lot in common between the different substrings we cache in the hash table. We define a new value called `minForHash` that is derived from `minSize` and that is strictly less than `minSize`. For example, let `minSize=6` and `minForHash=3.`The following figure shows that the substring `GGT`, of size `minForHash`, is common to the four consecutives substrings of length `minSize`.

We use this property to save space by computing only hashes of size `minFor``Hash`:

• we compute hash of length `minForHash` in the reference sequence at indices multiple of `minSize - minForHash + 1`. In our example, we will save in the hash table the coordinates at indices `0, 4, 8, 12, ...`, the coordinate `0` will be for the hash value `hash(ATC)`, the coordinate `4` for the hash value `hash(GTT)` and so on,

• when we scan the input sequence, we compute hashes of length `minForHash` at every position in this string (`0, 1, 2, 3, ...`) and use the preceding hash table to see if it is possible to find a match. If a match occurs, we check if we really have one by scanning the reference sequence at each coordinates returned.

We cannot miss a match with this method. To have an idea of the proof of this claim, look at the next figure.

In this Figure, we consider `minSize=10` and `minForHash=4`. The string in this figure is a chunk of the reference sequence. The boxes around the characters represent a sequence of characters used to compute a hash value and store the coordinate of the first character of this sequence. The first characters of two consecutive boxes are separated by `minSize - minForHash + 1`. In the figure is represented a window of length `minSize`. If we slide this window anywhere, it will always contain a sequence of `minForHash` characters for which we computed a hash value. Because we test each sequences of `minForHash` characters in the input sequences, we cannot miss a match of length `minSize` between the reference and the input sequence.

Expanding the matches

When we found a match of length `minForHash` between the input and reference sequences we have to expand it to the left and the right to see if a match of length at least `minSize` exists. The only point that deserves more explanation is the expansion toward the left. Indeed, as illustrated in the following figure, if we start expanding from point `p1` of the reference sequence and find a match to the left up to point `p2`, where `p2` is the starting character of a hashed sequence, we can drop this match because another one to the left will match it.

Complexity

The worst case complexity of the Rabin-Karp algorithm is O((|ref erence| − minSize + 1) × minSize) if we search a substring of size minSize in the reference sequence. Consequently, the worst case complexity of our algorithm is approximately: O((|ref erence| − minSize + 1)/(minSize − minF orHash + 1) + (|input| − minF orHash + 1) × bucketSize × minSize)

where bucketSize is the max size of a bucket in the hash table. Note that we estimate the worst match length to minSize because we prune to the left when we expand and so it seems quite reasonable.

Parallelization of the search

To parallelize the search:

• we create a hash table by thread. Each thread will fill is own hash table to avoid contention between threads. We implement our hash table with the idea of avoiding pointer chasing (like the classical one with buckets) and trying to be more cache friendly,

• we then distribute the scan of an input sequence to the different threads, where a thread will try to find a match in the reference sequence using all the hash tables. So each thread is given a chunk of the input sequence and will try to match all substrings of size `minForHash` of this chunk in the reference sequence and if a match occurs try to expand it and save the results if necessary.

We used `OpenMP` to parallelize our code.

We first load the reference sequence with memory mapped file I/O and compress (see the section on our compressed representation) it into a `vector` of `long`. For each input sequence file:

• we load the file with memory mapped file I/O,

• we scan in parallel the file to find the position of each character `'>'` to split this file in input sequences,

• each thread then process each input sequence to compress them into a `vector` of `long`.

# Conclusion

We have tested different approaches and found that matching substrings of the minimum length and expanding them afterward was the best for us. Our approach is based on the Rabin-Karp algorithm [1] with some modifications for computing the hash function. We add also a reduction in the number of hashes to compute due to the specificities of the problem.

## Bibliography

1
Richard M. Karp and Michael O. Rabin.
Efficient randomized pattern-matching algorithms.
IBM J. Res. Dev., 31(2):249-260, March 1987.