# Fast algorithm VS scalability

## Fast algorithm VS scalability

Hi,

There are two different algorithms for solving the substring finding problem. The first one is to build a suffix tree and the second is to use dynamic programming.

Using a suffix tree has the benefit of having the complexity O(n + m), but is hard to parallelize. Has someone tried to parallelize this algorithm? How good does it scale? I haven't tried it, because I assume that it wont scale well due to data congestion.

Using dynamic programming has the benefit of good scalability, but has a complexity of O(n * m). It will loose against O(n + m) if n and m are both big. We focused on improving the dynamic programming algorithm.

I am curious what you are using. Are you confident enough in your solution that you are willing to share some of your knowledge?

17 posts / 0 new
For more complete information about compiler optimizations, see our Optimization Notice.

In my opinion, i think that our goal is not to use some advanced alogrithms and to speed up our program in that way, because this is not algorithm competition. Its recommended to speed up the program using only parallel structures.I would like if Xavier can confirms this statement.In my case, i ve made my own input reading and i ve got really good speed up, but i m not sure if that gonna count in final results.I m curious, if someone has fast algorithm and ,for an example, gets real time 1, and someone else has worse result (2 ), but normal dynamic algorithm without crucial changes, who will be in advantage?

I really hope that you are wrong.
The only fair way is: the program which runs faster gets more points.
Test can be performed on a machine with lots of cores to give an advantage to parallel algorithms, but the above statement should stay true, otherwise it would become not an optimization competition.

Well, this is not ACM. It's necessary for a program to scale as well as be fast.

I am curious however how much scaling is worth and how much speed is worth. And by scaling, I don't mean input-level parallelization, i.e. one sequence per thread or something like that.

Your question is interesting but according to the rules :

The criteria : execution time (on various single memory machines/inputs/conditions)
rules

So you definitely want to use faster algorithm, no matter of it is not as scalable as the slower one. Another reason for using faster algorithm is because more complicated programs are usually more difficult to parallelize so it is natural problem.

Sorry, I haven't seen your comment. I think measuring scaling by itself would be unfair because the more sophisticated algorithm will be harder to parallelize. I think it is natural if more parallelized algorithms will win because they will simply work faster on 40 cores.

You could always implement more algorithms and decide at runtime which one to choose, depending on the size of the input, number of available threads and so on.

I think that algorithm adaptations for better parallelization are needed and welcomed by intel. The sample code is hard to parallelize well and needs some changes. Choosing a different algorithm is fine too, but if you have algorithm1 and algorithm2, and the 1 runs faster sequentially and 2 runs faster on 40 cores, the number 2 wins. This is, after all, a parallel programming contest. I think Intel gave us the sampled algorithm (dynamic programming), because it's easier to optimize for parallel execution, but there's no explicit statements that we can't use other or adapted algorithms.

Best regards,

I have read many posts about LCS algorithms (and longest common substrings too) and only recommendation was to use dynamic programming. As i see, we must find better solution then this (not just to optimize the sample algorithm). Can anybody name the algorithm that is better then the given one?In the sample code, we have a lot of issues, such as a lot of dependencies, and it is practically impossible to get a top performance.

I also think that the fastest program should win. Because otherwise (i.e. if scalability were the main criterion), if you have a fast algorithm you could artificially slow it down for low thread numbers and thus achieve better scalabilty. And that wouldn't be fair.

I have tried to open a discussion about lower bounds for time complexity on this thread also but I didn't get any replyhttp://software.intel.com/fr-fr/forums/showthread.php?t=104819&o=a&s=lrThe thing is that I believe that the output itself is O(N*M) for small values of the minMatchLength, so I don't believe an O(N+M) algorithm exists.Try for example the following:input seg: AAACCCTTTGGG x N/12 timesreference: AAACCCGTTTGGGT x M/14 timesFor N=40K and M=20K and K=minMatchLength=6 it gives a total of 10M solutions. I might be wrong but I think this is O(N*M/K^2).refseqn2mare.txthttp://software.intel.com/file/43678inputn2.txthttp://software.intel.com/file/43679

I have two points for that:

1. I think the fastest program should win (besides the other criteria like documentation and community). Because this just combines the two points: A good algorithm and good scalability. I do not think you can beat somewhat good parallel solutions with just a sophisticated algorithm running in one thread.

2. There is a paper about suffix trees and parallel clone detection of Rainer Koschke: Large-Scale Inter-System Clone Detection Using Suffix Trees. Unfortunately, he does not tell us the worst case execution time of building suffix trees, but claims to have a parallel solution. Again, unfortunately, he runs only one thread per input file and one thread for the reference. We could improve that by splitting our inputs into sequences, but I do not think that is sufficient. And on the other hand, it is hard to implement and not very scalable (it does not depend on the input size, but on the number of inputs). There is a paper by Baker: Parameterized pattern matching: Algorithms and Applications that should mention the worst case execution time of suffix-tree-based alroithms, but I could not open the document (I found it on google scholar).

Regards,
Basti

I doubt that at this point in the contest one would actually try a new algorithm implementation, also as i recall the past winner did not use another algorithm then the rest top10 but rather improved it with heuristics.

Of course you can "beat somewhat good parallel solutions with just a sophisticated algorithm running in one thread."

If your sophisticated algorithm is O(N) and the parallel solution is O(N*N), even considering the parallel solution scales perfectly on P cores; you have P ~ 40 and N >> 40 ...

With "somewhat good solution" I thought of a solution that tackles our problem. And I doubt that there is a solution that runs in linear time with every input (correct me if I'm wrong). And I doubt, that anyone comes up with am execution time greater than O(N*N). with execution times that are closer together, the race gets more interesting.

But independet on the problem, my statement is wrong and you are right ;-)

Hi,

we thought of a new algorithm we believe runs in o((n+m)*log(n-k)) were k is the minMatchLength with a space usage of O(n+m) using a multiset. We are not sure how to parallelize it though since we are still debugging.

greetz ;)

This sounds interesting. You do not have to answer this, but are you using suffix trees? The execution time sounds like that.