Regarding Benchmark Design and Results.

Hi everybody,

It is the last day of the contest, and it has been pretty intense. I would like to share some thoughts about the benchmarks offered in the contest, both with the Intel staff and with the community.

The design of the benchmarks (and I am talking about a massive-parallelization benchmark here) is a bit misleading in my opinion. This primarily concerns the benchmark released yesterday.

We entered the contest as a parallelization contest, and as such focused on the key aspects of parallelization at multiple levels. This includes what is called thread-level parallelism (balancing work across multiple physical cores, using techniques like pthreads, OpenMP, Intel TBB, etc.) and instruction-level parallelism (SIMD instructions, vectorization techniques). In order to test our results and measure their performance, we assumed that the target of the optimization is to scale by leveraging parallelism. In the benchmarks, this was reflected by a growing input size _together_ with a larger number of available parallel resources (allowed threads resp. physical cores).

However, the last benchmark seems to go in a different direction. Increasing the input size WITHOUT a corresponding increase in parallel resources (e.g., number of cores/threads) does not motivate parallel algorithms; it motivates algorithms with minimal ASYMPTOTIC COMPLEXITY. Or, in other terms: when the input size grows but the number of parallel resources does not, the best parallel algorithm is inferior to the worst implementation of a single-threaded algorithm if the latter offers better asymptotic complexity.
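To put rough numbers on this (purely illustrative, not measured from the benchmark): with a fixed p = 40 cores and equal-length inputs n = m, a parallel O(n*m) algorithm needs on the order of n^2 / 40 steps, while a sequential O(n+m) algorithm needs about c * 2n steps for some constant factor c. The parallel version wins only while n^2 / 40 < 2*c*n, i.e. while n < 80*c. Even granting the suffix-tree code a generous constant of c = 1000, the crossover is around n = 80,000 characters, far below the size of a chromosome.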

This benchmark makes it hard to compare or "grade" solutions, because, to be honest, it feels odd when non-parallel algorithms outperform highly parallel algorithms in a contest like this. It undermines the idea of comparing results based on real time when scaling focuses so much more on the size of the input than on the number of available parallel resources.

On the other hand, computing CPU% as user time / wall time is a bad idea in many cases too, since spinning threads (often found in OpenMP implementations) can show 100% usage while effectively idle (busy waiting, persistent keep-alive thread pools). In that case the algorithm does not really scale, but user time grows linearly with the number of threads, even when the input size remains fixed.
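To illustrate (a deliberately artificial sketch, not anyone's contest code): the program below burns seven cores on busy-waiting while a single thread does all the work. Under /usr/bin/time it reports roughly 800% CPU, although the spinners contribute nothing.

// compile: g++ -O2 -pthread spin_demo.cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<bool> done{false};
    std::vector<std::thread> spinners;

    // Seven "workers" that only spin on a flag: no useful work,
    // but each one burns a full core of user time.
    for (int i = 0; i < 7; ++i)
        spinners.emplace_back([&] {
            while (!done.load(std::memory_order_relaxed)) { /* busy wait */ }
        });

    // The actual work happens on a single thread.
    // (volatile keeps the compiler from folding the loop away)
    volatile unsigned long long sum = 0;
    for (unsigned long long i = 0; i < 2000000000ULL; ++i) sum += i;

    done = true;
    for (auto& t : spinners) t.join();

    // Under /usr/bin/time, user time is about 8x the real time
    // ("CPU ~ 800%"), although only one core did anything useful.
    std::printf("sum = %llu\n", (unsigned long long)sum);
    return 0;
}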

I do not know how you will evaluate this benchmark, or what the other competitors think. But with the late addition of yesterday's benchmark, most of us do not have enough time to change our algorithm (or would feel kind of forced to) to an O(n+m) one (suffix trees, non-parallelizable). This feels a bit against the spirit of the competition, in our opinion.

This just reflects my opinion; others are valid too. Maybe the organizers can keep this in mind for the next competition. All in all, it is a great idea for a contest, and this late surprise aside, it is real fun to take part in.

Jakob Mund,
Team Zubrowka,
Technical University, Munich (TUM)


I disagree with your analysis. First, this contest is as much an algorithm contest as a parallelization contest, and that is what makes it so interesting. I don't know why you would assume that the programs that solve the new test are sequential; in our case that is not so, we have a parallel program. I would add that in this contest we had the opportunity to think about an algorithm, to improve it, and to work at high and low levels (with some asm code in our C++ code), and we found that really interesting. Those are things I had never done before and I really liked doing them :) And anyway, I think the goal of the contest is to have a fast algorithm that solves the given problem. If a sequential solution can beat a parallel solution, why not? This is only my opinion :)

First of all, I wanted to raise some concerns about the way the benchmarks are used to compare parallel algorithms, and to point out that scalability is not only scalability with the input size but also scalability with the number of cores (and for a contest on algorithms and parallelization, the latter should be at least equally important).

And I am sure your results and code are great; it was by no means my aim to question that.

Good to hear you enjoyed the contest too; I am pretty interested in the results.

It isn't true that only the trivial algorithm is parallelizable. Suffix array construction can be done in parallel. Algorithms with suffix trees can't have linear speedup, but they can still be done in parallel (the BFS/DFS part). I don't know how to parallelize the construction part.
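For what it's worth, here is a minimal sketch of the simplest way to build a suffix array in parallel: sorting the suffix start positions with a C++17 parallel sort. This is my own illustration, not pawnbot's code, and it is nowhere near the O(n log n) constructions from the literature (each comparison is O(n) worst case); it only shows that the construction work itself parallelizes.

// compile: g++ -O2 -std=c++17 sa_demo.cpp -ltbb
#include <algorithm>
#include <cstdio>
#include <execution>   // C++17 parallel algorithms
#include <string>
#include <vector>

// Naive parallel suffix-array construction: sort all suffix start positions.
std::vector<std::size_t> suffix_array(const std::string& s) {
    std::vector<std::size_t> sa(s.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
    std::sort(std::execution::par, sa.begin(), sa.end(),
              [&s](std::size_t a, std::size_t b) {
                  // lexicographic comparison of the suffixes s[a..] and s[b..]
                  return s.compare(a, std::string::npos,
                                   s, b, std::string::npos) < 0;
              });
    return sa;
}

int main() {
    for (std::size_t i : suffix_array("banana"))
        std::printf("%zu ", i);   // prints: 5 3 1 0 4 2
    std::printf("\n");
}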

Yes, sure, but with two major classes of solutions it's hard to compare results when the input size is not scaled linearly with the number of cores.

However, those are sick results: 14 s of user time on the large benchmark. Well done.

Even if I agree with some parts of your analysis, I like the current form of this contest. We have to think differently and create a contribution outside of existing limits, using the maximum of our knowledge. There is no perfect solution, and you can't produce the best contribution by simply thinking like everyone else. You have to follow your own way and intuition.

The contest before this one was a little different. There was only one well-known solution to the problem (2D Kadane), but many ways to optimize it (matrix transposition, file reading, ...).

The two before that followed the same model as this one, but they were only open to French students.

The system of automatic benchmarks offers some advantages, especially this evening. Last time it was impossible to connect to the MTL, so it is a shame not to have access to it during this contest. We were lucky that I got a computer with 8 cores recently; during previous contests we were totally dependent on the MTL.

Personally, I don't think it was a good idea to add a new benchmark at the very end of the contest; it just added a lot of pressure. But the whole benchmarking system is a very good idea: we were able to test our code on big multi-core machines easily. The only thing I would add is to make the benchmark do some profiling (with VTune or other tools) and send back the result; that would be very helpful to locate bottlenecks.

Like flesti, I think that this benchmark at the very end was quite depressing. It made us doubt our algorithm and methods at a time when non-trivial modifications were no longer possible. It is also true that people who did not have access to a multicore machine have probably been in a difficult position when crafting their code, trying to guess the limitations of their algorithm from the output of the time command.

The benchmark system was a great improvement compared to last year's Acceler8. I am not sure how I feel about adding that last benchmark in the 24th hour, but I don't think it would have made too much of a difference had it been released earlier.

And as far as the algorithm complexity debate goes, it's hard to find a compromise. Maybe in the future solutions will be restricted to just one algorithm, so people can focus more on "acceler8-ing" the code.

I think what mund is telling us is that there is a difference between how you solve the problem and how this is reflected in the benchmark results. If you are using a massively parallel algorithm with worse complexity, you always have to lose once the problem size gets big enough and the computing resources do not scale along with it. pawnbot, you wrote "Don't know how to parallelize construction part." This is true; many people have tried that before and could not succeed with that problem. I think that is reflected in the CPU% of your benchmark results. (But really, those are great results!) We focused more on a highly parallel algorithm with minimal overhead in thread synchronization and the like, and hoped the benchmarks would reward that.
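To make the traversal point concrete, here is a hedged sketch (placeholder types and work, not anyone's actual contest code) of how the BFS/DFS part over an already-built tree parallelizes with OpenMP tasks; the construction itself stays sequential:

// compile: g++ -O2 -fopenmp dfs_demo.cpp
#include <atomic>
#include <cstdio>
#include <vector>

struct Node {
    std::vector<Node*> children;   // a real suffix tree also carries edge labels etc.
};

std::atomic<long> visited{0};

// Placeholder for whatever per-node work the matching needs.
void visit(Node*) { visited.fetch_add(1, std::memory_order_relaxed); }

void parallel_dfs(Node* n) {
    visit(n);
    for (Node* c : n->children) {
        #pragma omp task firstprivate(c)   // each subtree becomes a task
        parallel_dfs(c);
    }
    #pragma omp taskwait
}

int main() {
    // Tiny stand-in tree; building a real suffix tree stays sequential.
    Node leaf1, leaf2, leaf3, inner, root;
    inner.children = {&leaf1, &leaf2};
    root.children  = {&inner, &leaf3};

    #pragma omp parallel
    #pragma omp single
    parallel_dfs(&root);

    std::printf("visited %ld nodes\n", visited.load());   // prints: visited 5 nodes
}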

I just wonder how much the parallelization comes into play when the results are compared. In fact, as far as I understand the rules, there are experts reviewing our code and our results who will take into account what we have done to solve the problem, and I really think it is a fair competition then. Perhaps it would be better to release the hard benchmarks at the beginning of the contest as well, so that everybody has a clue about how to scale what.

I don't mind the addition of the last benchmark. I guess they had to provide bigger inputs because the submissions got better and better.
As for the algorithm: maybe they should rank the contestants separately, half the prizes for those who worked on the provided algorithm and half for those who chose a completely different solution. I think that would only be fair, because, after all, the point of the contest is to accelerate your code, not to change it completely to a different one; but it would also be unfair to forbid the use of other algorithms, because it is interesting to see how fast the given problem can be solved. We changed our algorithm just recently and have much better benchmark results because of it, but this is still just my opinion. In any case, this is our second time around in this contest and we really enjoy it, regardless of these tiny flaws.
The submission system was great, although access to the MTL would have been very useful too. Maybe next time they could also allow us to set up our own inputs in addition to the predefined benchmarks.
Anyway, thumbs up :)

Best regards,
Nenad

I think the problem was well posed, and the last benchmark isn't that big a problem if you swap the sequences before processing them (so that the reference is the smaller one again for your algorithm).
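For anyone who skipped that trick, a minimal sketch of the idea (illustrative, with hypothetical names; a real solution also has to swap the reported match coordinates back):

#include <string>
#include <utility>

// The mutual-substring problem is symmetric in its two inputs, so index the
// shorter one: make sure `ref` is the smaller sequence before building over it.
void solve(std::string& ref, std::string& test) {
    bool swapped = ref.size() > test.size();
    if (swapped)
        std::swap(ref, test);
    // ... run the existing matching code with `ref` as the indexed side ...
    // ... if `swapped`, exchange the two coordinate columns in the output ...
}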

The question is: when does your parallel algorithm start to scale? Parallel algorithms usually lose against sequential algorithms on one core and for small data because of the overhead involved.
To design a parallel algorithm you need to know the boundary conditions (number of cores, data size) very well, because only then can you tell whether your work is good, i.e., whether the parallel algorithm outperforms the sequential ones.

Informing contestants about those boundary conditions is something this competition still hasn't done to our full satisfaction: the target input sizes do not appear in the official rules. There has only been one post on the forums about them, and it has been quoted a lot.

Looking at the results, it seems that ours kicks in a bit on the late side compared to, e.g., canderolli's solution, which is a bit faster but uses more CPU%:

(quote)
on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-13105960671904317035: real:1.36 user:18.35 sys:2.01 CPU:1497.06%

vs (ours)

on a 40-cores HT machine, using 20 worker threads, running benchmark AE12CB-13105960671904317035: real:1.64 user:4.15 sys:0.55 CPU:286.585%

However, the CPU% is very different.
I daresay that our solution probably scales better if you put in 10x the data. The question is: will that be used to measure the performance/score? Probably not :(

I want to share three benchmarking results to show my point. (And because I just looked at the results myself and they fit the topic ^^)
This runs the default test (237 MB) in test_input3 on a 16-core machine:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.21.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.22.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.X.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.Y.fa > myoutput.txt

real: 1.15 user: 4.61 sys: 0.45 Percent of CPU this job got: 437%

It doesn't scale very well :(
If we put in all the DNA data (~2.7 GB), it starts looking better:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* > myoutput.txt

real: 4.44 user: 53.54 sys: 2.98 CPU: 1271%

And then it scales up: with 6x the input (~16.2 GB), we get:

/usr/bin/time -v ../../solution-windowhash/run 16 16 test_input_3.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.* > myoutput.txt

real: 23.61 user: 328.39 sys: 14.38 CPU: 1451%

I.e., only with 16.2 GB do we start to saturate the cores. If we compare the data size to the real time, we get the following throughput:

size (MB)           throughput (MB/s)
238.9926160703      207.8196661481
2799.001407059      630.4057223106
16794.0084423539    711.3091250468

So we see a strong increase from 239 MiB to 2.7 GiB, and it gets even better at ~17 GiB of data. However, I doubt that these data sizes will be used for benchmarking the solution entries, which shows that you need to optimize for the data sizes of the actual use cases :)

Cheers,
Andreas

With a huge number of input strings, parallelization is trivial. In my opinion, more interesting results come from tests like this:

time ./run X 800 ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.X.fa ATCG-Homo_sapiens.GRCh37.66.dna.chromosome.Y.fa

I think pawnbot is somewhat right. Increasing the number of test sequences only tests trivial parallelization, since the problem is not a k-matching substring problem but a 2-matching (mutual) substring problem.

It is interesting how much parallelization can be added when considering one reference sequence and one test sequence, because that is non-trivial with many approaches. And, as has been said in this thread, my main point was that any sequential algorithm with better asymptotic complexity will eventually outperform the best parallel algorithm if the input is just large enough (and the number of cores fixed).
