Looking for help defining a Threaded Programming Contest for 2009

Since we had such a great response to the previous Threading Challenge Contest, we want to do another one or two or six. From that year-long contest that ended in September 2008 there are a few things we hope we can do better this time around. Some of the lessons learned include:

* Don't go on so long. Twelve months is a virtual programming marathon, and I appreciate everyone who stuck with it. While it did separate the dabblers from the hardcore hackers from the thread monkeys, it discouraged participation from some innovative and deserving contestants.

* Try to have more objective judging criteria. Execution speed is the whole reason for concurrent and multi-core programming. Does it matter how you get there or what tools are used? Code that screams through a data set should be judged higher than code that is consistently well indented, right?

* Be more open and precise about the judging machine configuration(s). It would be much better to have a single machine that all contestants could access to develop, test, and submit their entries. Unfortunately, this would require administrators and rack space and all sorts of other expenses. Second best is to be sure everyone understands how the machines are set up, both for hardware and software.

The driving idea behind threaded programming events for 2009 would be to take "classic" algorithms and parallelize them. The term "classic" would mean something that might be studied in an undergraduate course on algorithm design. Categories of classic algorithms would include things like searching, sorting, numerical computations, string processing, computational geometry, and graph algorithms. In addition to fostering threaded code solutions to "classic" algorithms, we are hoping to somehow generate content for the Intel Software Network.

Since we are still in the planning stages, things are still in flux. Other ideas that have been brought up are to limit the languages used, pick a single OS to be used, require that some problems be done in a non-traditional programming language (e.g., Haskell), or offer more than one problem per month.

In order to make this contest the best it can be, we'd like to solicit your input. If you have any ideas (especially if you participated in the Threading Challenge Contest) about how the contest could be administered, judging criteria to be used, a problem that could be posed, or how contest entries could be featured in ISN and educate readers about multi-core programming, then we want to hear them. You can reply to this post or send me a message through the email facility here in the forums.

With your help, we can make threaded programming contests this year and in coming years better and more fun, and provide exceptional examples for programmers around the world.


Very good outline for the contest.

The contest duration should be long enough for the programmer to perform the task in intermittent steps (between their normal work obligations). As to what that duration should be, I do not have a good recommendation. 12 months is too long; 1 month might be too short.

I agree, the contest should be judged on results: accuracy and performance. Because performance will vary by target platform, might I suggest the following: perform the tests on multiple different target platforms of varying numbers of cores, speeds, instruction sets, memory configurations, etc. Intel does have a test lab, I assume, with various configurations. Choose a selection criterion for platforms, such as no older than n years and nothing not yet released. You may choose equal weighting or bias the results by some factor such as the number of particular processors sold (e.g. Q6600 would have heavier weight than the Extreme series).

Performance and accuracy are all that matter. In the case of a tie, other factors can be used (e.g. earliest entry, ...).

Language or tool selection should be immaterial. A user should be free to choose hand coding in assembler as well as writing an Excel macro.

The challenge should be generic and offer the contestant the widest of leeway in producing the solution.

Bad example:

Write the fastest multi-threaded Quick Sort routine given this data-set and results desired.

Good example:

Write the fastest multi-threaded sort routine given this data-set and results desired.

If someone wishes to submit an entirely new sort routine, that is entirely up to them.

Jim Dempsey

Quoting - Clay

The driving idea behind threaded programming events for 2009 would be to take "classic" algorithms and parallelize them. The term "classic" would mean something that might be studied in an undergraduate course on algorithm design. Categories of classic algorithms would include things like searching, sorting, numerical computations, string processing, computational geometry, and graph algorithms. In addition to fostering threaded code solutions to "classic" algorithms, we are hoping to somehow generate content for the Intel Software Network.
The future of processors is multicore, therefore the future of software has to be multithreaded :)
The more people work/think on parallel software the better, and the Intel Threaded Programming Contest might help with that...
>"The driving idea [...] would be to take "classic" algorithms and parallelize them."
Does it mean that if you want to parallelize real-life applications you need to parallelize all the algorithms within?
For instance, does it mean that if I want to parallelize video compression software, I should try to parallelize the compression algorithm? For antivirus software, do I really need to make the heuristic algorithm parallel? And for a spell checker, a video decompressor, a compiler, a speech recognizer, a renderer, a neural network pattern recognizer, a 3D modeler, a physics engine or a video game?
But most of these applications can be parallelized pretty easily: an antivirus may load multiple files in memory, one file per core; a video compression algorithm may work on different parts of the video, if you can cut the video into enough pieces, etc... You can do the same for a spell/grammar checker, rendering software, etc... but not for all of them.
Some software is "not so easy to parallelize": you really need to multithread the video decompression algorithm itself, since there is no point in decompressing different parts of the video at the same time... Same problem with a speech recognizer or a physics engine...
Parallelizing that kind of "not so easy to parallelize" software is more interesting than "classic" algorithms...
Who really thinks nowadays that multithreading every part of a single-threaded program can make a PERFORMANT multithreaded version of the software? Who really thinks that using a multithreaded library like TBB, with protected vectors/data structures and multithreaded algorithms, can make PERFORMANT multithreaded software?
A TBB-like library is essential, and access to a performant multithreaded quick sort is essential, but your application will not scale if you don't rethink and refactor the core architecture from scratch.
My very personal opinion: the path to performance in multithreading is refactoring the software architecture, not refactoring the tools.
The Intel Threaded Programming Contest should be oriented toward real-life applications (video decompression, physics engines, video games or any "not so easy to parallelize" software). Rewards need to be more interesting (being a first client; financing with hardware/training/money, etc...). And the only judging criterion must be performance scaling on future manycore.
Vincent

Clay,

The problem with threading challenge contests is that when you define a problem as a write-from-scratch contest, you generally restrict yourself to small, easily understood algorithms. These types of contests are good, but they are not the only type of problem to be presented for competition.

A second type of problem to parallelize is a more complicated application. An example might be to provide the contestants with a working base-level program and then challenge them to multi-thread it. An example might be the Havok Smoke demo or other similar applications. Note, although Smoke is already multi-threaded using Windows threads and/or TBB threads, these efforts at threading leave considerable room for improvement.

Jim Dempsey


Quoting - Jim Dempsey

A second type of problem to parallelize is a more complicated application. An example might be to provide the contestants with a working base-level program and then challenge them to multi-thread it. An example might be the Havok Smoke demo or other similar applications. Note, although Smoke is already multi-threaded using Windows threads and/or TBB threads, these efforts at threading leave considerable room for improvement.

This will be interesting indeed! And close to real-life.
The Smoke demo is a good candidate. Just take Smoke and make it as fast as possible.
Hmm... Although now Smoke is already a bad candidate, because some cheaters have already started optimizing it speculatively :)


Quoting - Jim Dempsey

I agree, the contest should be judged on results: accuracy and performance. Because performance will vary by target platform, might I suggest the following: perform the tests on multiple different target platforms of varying numbers of cores, speeds, instruction sets, memory configurations, etc. Intel does have a test lab, I assume, with various configurations. Choose a selection criterion for platforms, such as no older than n years and nothing not yet released. You may choose equal weighting or bias the results by some factor such as the number of particular processors sold (e.g. Q6600 would have heavier weight than the Extreme series).

I would prefer a single defined platform, because I will not have enough time to make optimized solutions for many platforms, though I can make one for a single platform. One may think of it in the following way: I've made an optimized solution for a single platform just because time was limited; if it were a real-life problem, and I had more time, and speed mattered, then I would make solutions for many platforms, each as optimized as the one I made in the contest for the single platform. So the task is to determine how good a solution I can make, not how many solutions I can make.

Quoting - fraggy

But most of these applications can be parallelized pretty easily: an antivirus may load multiple files in memory, one file per core; a video compression algorithm may work on different parts of the video, if you can cut the video into enough pieces, etc... You can do the same for a spell/grammar checker, rendering software, etc... but not for all of them.
Some software is "not so easy to parallelize": you really need to multithread the video decompression algorithm itself, since there is no point in decompressing different parts of the video at the same time... Same problem with a speech recognizer or a physics engine...
Parallelizing that kind of "not so easy to parallelize" software is more interesting than "classic" algorithms...

Yeah, many algorithms are trivially parallelizable, thus not interesting.
Many algorithms are hard to parallelize but have known published parallelization techniques - not interesting either, since the task is just to find the right paper.
And many algorithms are hard to parallelize and don't have known published parallelization techniques, where such techniques would be a publishable result or even a PhD thesis. Not interesting either, because few or zero participants will figure out how to do it...

Quoting - fraggy

And the only judging criterion must be performance scaling on future manycore.

That's an interesting judging criterion!
Assume we have a 4 processors x 6 cores x 2 HT threads NUMA system - 48 hardware threads total.
First, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1) - the baseline case.
Then, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1 | 2) - how much speedup can you get from HT?
Then, the program is executed on 6 cores of 1 processor - how much speedup can you get from 6 full-fledged cores?
Then, the program is executed on 2 cores, but on 2 different processors - can you take advantage of NUMA?
And so on up to 48 cores.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are serious problems such as saturation of cache-coherence interconnects or memory interconnects, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...

I would really like to see such a judging criterion on at least 1 appropriate problem. What do you think?
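For illustration, a minimal sketch of such a harness (assuming a 64-bit Windows test machine; run_workload() is a hypothetical stand-in for the contest entry, not part of any actual contest):

// Sketch of an affinity-stepping scalability harness.
#include <windows.h>
#include <cstdio>

void run_workload()
{
    // Placeholder - the real contest entry's computation goes here.
    volatile double x = 0;
    for (int i = 0; i < 100000000; ++i) x = x + 1;
}

static double timed_run()
{
    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    run_workload();
    QueryPerformanceCounter(&t1);
    return double(t1.QuadPart - t0.QuadPart) / double(f.QuadPart);
}

int main()
{
    double base = 0;
    for (DWORD_PTR n = 1; n <= 32; n *= 2) // 1, 2, 4, ... logical CPUs (a full harness would also hit 48)
    {
        DWORD_PTR mask = (DWORD_PTR(1) << n) - 1; // restrict to CPUs 0 .. n-1
        SetProcessAffinityMask(GetCurrentProcess(), mask);
        double t = timed_run();
        if (n == 1) base = t; // single-thread baseline
        std::printf("%2u logical CPUs: %.3fs, speedup %.2fx\n",
                    unsigned(n), t, base / t);
    }
    return 0;
}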

Quoting - Dmitriy Vyukov
Yeah, many algorithms are trivially parallelizable, thus not interesting.
Many algorithms are hard to parallelize but have known published parallelization techniques - not interesting either, since the task is just to find the right paper.
And many algorithms are hard to parallelize and don't have known published parallelization techniques, where such techniques would be a publishable result or even a PhD thesis. Not interesting either, because few or zero participants will figure out how to do it...

You're right :)

The contest should be open to any algorithm (trivially parallelizable or not, with a known solution or not). But still, the reward for an algorithm with no known solution might be a little bit more interesting :)
Vincent, maybe something like the Loebner Prize?

I think that execution speed must be preferred over a consistent, well-structured solution that uses the "right" libraries. That's the approach taken by the AMD Multicore Threadfest, and I think it's right for this kind of contest. They allowed use of the AMD APL library, but it was neither obligatory nor affected the score. The judging criterion was very simple and formal. The problem for round 4 of the AMD Multicore Threadfest was an image filter, and the judging criterion was: SCORE = C1 / (execution_time + C2 * relative_error). In my opinion that is a very good judging criterion.
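As a concrete sketch, that rule is just the following (the constant values here are made up for illustration; the real C1 and C2 were fixed by the organizers):

// Sketch of the quoted scoring rule; C1 and C2 are assumed values.
double score(double execution_time_sec, double relative_error)
{
    const double C1 = 1e6; // overall scale (assumed)
    const double C2 = 1e3; // penalty weight on inaccuracy (assumed)
    return C1 / (execution_time_sec + C2 * relative_error);
}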

In the AMD Multicore Threadfest they precisely described the machine configuration that would be used for the competition: the number of processors, processor models, amount of cache, amount of memory, OS version, compiler version. IMO, that's very right too. In general, solutions should be general enough to run smoothly on any platform; but taking into account the limited amount of time, competitors are just unable to develop such general solutions, so it's OK to target one particular platform.

In the AMD Multicore Threadfest the only OS allowed was Linux. But I think that many programmers (including me) who feel more comfortable with the Windows platform and toolchain found that very inconvenient. I spent a huge amount of my very limited time struggling with the OS API, the language dialect, and especially with asm and intrinsics. So I think that both Linux and Windows must be allowed.

Quoting - Dmitriy Vyukov

Assume we have a 4 processors x 6 cores x 2 HT threads NUMA system - 48 hardware threads total.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are serious problems such as saturation of cache-coherence interconnects or memory interconnects, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...

The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.

Vincent, running a test on 48 hardware threads is a dream!!!

As for non-traditional programming languages: I think that would limit the number of participants in such contests significantly. I will not install and learn a new language and toolchain for a competition. And even if I did, I would have no chance against participants already experienced in that language. Although I do think it's good to provide a broad choice of languages for the competition - programmers always tend to make language shootouts - C/C++ programmers can use asm while Haskell programmers can use STM. It's interesting and fun.

As for generating content for ISN: I think the good answer is to publish a description and the code of the winning solution (and maybe of some other interesting solutions). Descriptions can be provided by the participants or by the organizers.

As for tasks for competitions, I can describe my experience with the AMD Multicore Threadfest. The round 4 problem was an image filter. Although it was a very interesting competition in itself, it was in no way a multicore/threading competition. It was an advanced math competition (and it was formally classified on TopCoder that way); it required knowledge and application of such things as matrix decomposition and FFT-based convolution (participants without such advanced math were really out of the deal). And the threading part in many solutions was as simple as a single '#pragma omp parallel for' on the outer-most loop (and it was perfectly enough!). As for me, it's just ridiculous to call it a threading fest - it was a 100% plain old single-threaded problem.

I would like to see as tasks something like the following.

Parallel concurrent garbage collector (parallel here means that garbage collection works in several threads, and concurrent means that garbage collection runs concurrently with the user program, i.e. no stop-the-world). The problem statement must define a set of interfaces between the run-time and the garbage collector; some interfaces must be defined by the participant - for example, the object header layout - and the task is to implement a GC according to these interfaces. The test-suite will stress the GC under different kinds of load.
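A hypothetical sketch of what the fixed side of such a boundary could look like (every name and signature below is invented for illustration, not from any actual contest):

// Assumed run-time/GC boundary for such a problem.
#include <cstddef>

struct runtime_hooks // provided by the organizers' test-suite
{
    virtual void enumerate_roots(void (*visit)(void** slot)) = 0;
    virtual std::size_t object_size(void* obj) = 0;
    virtual ~runtime_hooks() {}
};

struct collector // implemented by the contestant
{
    virtual void* allocate(std::size_t size) = 0;              // may start a collection
    virtual void write_barrier(void** slot, void* value) = 0;  // mutator write hook
    virtual ~collector() {}
};

collector* create_collector(runtime_hooks* rt); // contestant's factory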

Tricky concurrent data structure. The problem defines a set of use cases - read/write ratio, object size, object properties and so on - and an interface to the data structure. The task is to implement that data structure.
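For instance, a round of this kind might pin down something like the following (a made-up example of a problem-statement interface, with the use case fixed by the rules):

// Made-up example: a concurrent map with the use case fixed by the
// problem statement (say, 90% find / 10% insert-erase, up to 48 threads).
#include <cstdint>

struct concurrent_map // implemented by the contestant
{
    virtual bool insert(std::uint64_t key, std::uint64_t value) = 0;
    virtual bool find(std::uint64_t key, std::uint64_t& value) = 0;
    virtual bool erase(std::uint64_t key) = 0;
    virtual ~concurrent_map() {}
};

// The organizers' test-suite would hammer an instance from many threads,
// verifying results and measuring throughput under the declared mix.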

Run-time for an asynchronous message-passing library (Erlang-like). The problem defines a set of interfaces between the library and user code, and a set of test use cases. The task is to implement the run-time.

Banking server. The server holds a number of accounts; accounts are organized into a tree hierarchy (accounts/subaccounts), and a balance is attached to each account. The server handles a number of requests: adjustments to an account, transfers between accounts, output of an account balance, aggregations over several accounts, creation of new accounts. Accounts can be in several statuses (open, preclosed, closed); the status defines the set of allowed operations. Each request and each result must be logged to some common journal. There are also some global settings, which can be changed at run-time and define some aspects of request processing. The task is to implement the server run-time: request handlers, logging subsystem, settings change handler.

Such tasks deliberately require frequent synchronization between threads (like some real-life problems), not just a single 'omp parallel for' on the outer-most loop as the threading part. They also let you feel what 'my NUMA node' and 'remote NUMA node' are, what a shared L2$ is, and what an rw-lock on a highly contended data structure costs.

Quoting - fraggy

The contest should be open to any algorithm (trivially parallelizable or not, with a known solution or not). But still, the reward for an algorithm with no known solution might be a little bit more interesting :)

What's the point of a trivially parallelizable problem in a threading contest?

Quoting - Dmitriy Vyukov

What's the point of a trivially parallelizable problem in a threading contest?

:)

A contest must be an opportunity for everyone. Some programmers don't have much experience in threading; a multithreaded quicksort could be interesting, for instance.
But I agree with you: if it's just a matter of an OpenMP pragma to be used, there is no point...
From the Intel point of view, the more participants, the better :)
Vincent
Quoting - Dmitriy Vyukov

I would prefer a single defined platform, because I will not have enough time to make optimized solutions for many platforms, though I can make one for a single platform. One may think of it in the following way: I've made an optimized solution for a single platform just because time was limited; if it were a real-life problem, and I had more time, and speed mattered, then I would make solutions for many platforms, each as optimized as the one I made in the contest for the single platform. So the task is to determine how good a solution I can make, not how many solutions I can make.

In all fairness, the winners (awards) could be split into two or more categories:

Single socket
Multi-Socket
Multi-Socket NUMA

If you were to assign an overall winner, then some sort of weights would have to be applied to the configurations.

Jim Dempsey

Quoting - Dmitriy Vyukov

Quoting - fraggy

And the only judging criterion must be performance scaling on future manycore.

That's an interesting judging criterion!
Assume we have a 4 processors x 6 cores x 2 HT threads NUMA system - 48 hardware threads total.
First, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1) - the baseline case.
Then, the program is executed with SetProcessAffinityMask(GetCurrentProcess(), 1 | 2) - how much speedup can you get from HT?
Then, the program is executed on 6 cores of 1 processor - how much speedup can you get from 6 full-fledged cores?
Then, the program is executed on 2 cores, but on 2 different processors - can you take advantage of NUMA?
And so on up to 48 cores.

The task is to show good performance in the baseline case, and then linear or super-linear speedup in as many cases as possible. The idea is to determine whether a solution is really ready for the heterogeneous, distributed, hierarchical many-core future, or just a silly solution that can take advantage of only a very limited number of cores. If a solution shows 1.99x speedup on 2 cores, then 3.5x on 4 cores, and 6.0x on 8 cores, that suboptimal scaling means there are serious problems such as saturation of cache-coherence interconnects or memory interconnects, or excessive blocking/contention; and this usually grows with the number of cores and the introduction of NUMA-ness. So on 16 cores it will be 9.0x, and on 32 cores it will be 10.0x...

I would really like to see such a judging criterion on at least 1 appropriate problem. What do you think?

The above requires that the core/thread management be performed externally to the application and not be part of the optimizations permitted by the programmer.

Jim

Quoting - fraggy

The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.

Vincent, running a test on 48 hardware threads is a dream!!!

Vincent, I disagree. One entrant's single-threaded program might run 4x faster than another entrant's 4-threaded program. The first entrant's program might scale poorly while the other program scales better. The criterion should not be the slope of the curve but the area under the curve.
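For illustration, a minimal sketch of such an "area under the curve" score (the trapezoidal rule and the choice of sample points are assumptions here, not part of the proposal):

// Sketch: trapezoidal area of absolute performance vs. core count,
// so raw speed and scaling behavior both contribute to the score.
double area_score(const double cores[], const double perf[], int n)
{
    double area = 0;
    for (int i = 1; i < n; ++i)
        area += 0.5 * (perf[i - 1] + perf[i]) * (cores[i] - cores[i - 1]);
    return area;
}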

Jim Dempsey

Dmitriy,

>>Parallel concurrent garbage collector

This implies your memory allocate/deallocate generates (much) garbage. The test program should be one that stresses memory allocation/deallocation hard enough that alternate means may be introduced to reduce the interaction. Often these alternate means generate garbage which periodically needs collection. The contest rules should score the performance of allocation/deallocation (best case, average case, worst case) regardless of whether garbage collection is performed concurrently with each allocation/deallocation or at some later time.

Your other suggestions are good. These fall into the class of distributing a base-level program with input files and output files, and a set of guidelines as to whether the output file has to be exactly the same or merely has to pass an "approximately the same" check when run through a verification program (also supplied). The user can do whatever they want with the sample program, but they must use the supplied input data as-is.

Jim Dempsey

Quoting - fraggy
The judging criterion should be the "best" performance scaling only, even if it's suboptimal scaling performance :p.

Assume my solution is:
on single-core - 1x
on dual-core - 1.9x
And your solution is:
on single-core - 0.9x
on dual-core - 1.8x
Which is the best performance?

Now assume that we add quad-core into the tests.
My solution is:
on quad-core - 3.5x
Your solution is:
on quad-core - 3.6x
Which is the best performance now?

In the single-core world performance was a single value - a scalar. For example, memory allocation/free takes 20 cycles; that's enough to characterize the performance of a memory allocator.
But in the multi-core world such an approach is totally busted. Performance is not a scalar any more; performance is a function of the number of cores, or more precisely of the system configuration: performance = f(system_configuration). OK, malloc/free takes 20 cycles, but what will happen if I start calling malloc/free from 2/4/8 cores? Will it be 20/20/20 cycles, or 200/500/6000 cycles? Will it supply me with free lunches?
Is performance perfectly linear, y=x? Or is it y=0.8x? Or y=sqrt(x)? Or y=log(x)? Or y=x for x=1..4 and y=4-x for x>4? It's extremely important to capture the whole y=f(x), and not some single y0, in order to evaluate a component.
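A toy probe of exactly this point might look as follows (a sketch only - a serious harness would pin threads, warm up, and repeat runs; the 64-byte block size is an arbitrary assumption):

// Toy sketch: measure malloc/free cost as a function of thread count,
// i.e. sample y = f(x) instead of quoting a single number.
#include <omp.h>
#include <cstdlib>
#include <cstdio>

int main()
{
    const int iters = 1000000;
    for (int threads = 1; threads <= 8; threads *= 2)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel num_threads(threads)
        {
            for (int i = 0; i < iters; ++i)
                std::free(std::malloc(64)); // all threads allocate at once
        }
        double ns = (omp_get_wtime() - t0) * 1e9 / iters;
        std::printf("%d threads: ~%.0f ns of wall time per pair\n", threads, ns);
    }
    return 0;
}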

But those are just two different competitions - "squeeze the max from today's multi-core" vs. "are you ready for the many-core future?" (or "scalability at any price!"). Both are useful and interesting.

Quoting - Dmitriy Vyukov

But those are just two different competitions - "squeeze the max from today's multi-core" vs. "are you ready for the many-core future?" (or "scalability at any price!"). Both are useful and interesting.

Maybe a mix could be interesting.

Pure performance is always important, of course. The right solution could be a balance of both: a grade on pure performance and a grade on scaling performance.

Pure performance is the only thing that matters.

It is easy to contrive a poor-performing application that exhibits super scaling.

Jim


Quoting - Jim Dempsey

Pure performance is the only thing that matters.

It is easy to contrive a poor-performing application that exhibits super scaling.

Jim

Performance scaling is the only thing that matters.
If it doesn't scale correctly on a 48-hardware-thread machine, it's useless :)
Don't forget that a few years from now, processors may have thousands of cores...
Vincent, the truth must be somewhere in between...

What I am saying is: when an entry comes in where a fast single-threaded version of the application runs faster than a slow 48-hardware-thread application, and where the fast single-threaded application has a negative scaling slope while the 48-hardware-thread application has a nice linear scaling factor, you have a situation at an extreme. For some problems this situation is real. Good scaling means little to nothing when you cannot attain the desired performance.

The above situation can be avoided by "the committee" selecting problems that should benefit from scaling. Once selected, the measure should be placed solely on the performance of the submitted entries.

Jim Dempsey

Quoting - Jim Dempsey

The above situation can be avoided by "the committee" selecting problems that should benefit from scaling. Once selected, the measure should be placed solely on the performance of the submitted entries.

If a problem "should benefit from scaling", does that mean it's easy to parallelize? The contest would be more interesting if the problems were not so easy to solve.

For instance, a physics engine is not easy to scale; a lot of people have worked on that for the last 5/6 years and there is still a lot of improvement to make.
We can grade the multithreaded version of an application against the single-threaded one, comparing pure performance results on 48 threads (or 24, or both).
We can also grade the pure scaling of the multithreaded version from 1 to 48 cores.
The final grading could be a balance of the above :)
Vincent
Ya know, I don't think that whole "mirroring" from the blog post comments is working... so I'm going to copy my comments here:

Well, I kind'a liked the last contest (my least favorite part was using C++... 'cause I wanted the points). I also rather liked the fact that humans looked at the code, because I think there is more to writing code than just how fast it goes. Still, I understand how some people didn't like it, and it had to be hard on the judges.

I would pick one central theme and build around it. So if the main idea is to parallelize "classic" algorithms then you wouldn't really need code at all :-)
If you really want people to only express the algorithms in code, then I would say that correctness should be the primary criterion, followed by speed (speed or throughput? ... I'm thinking about the use of memory-mapped files in the last contest). Is memory usage important? If my code runs 5% slower but only uses 1/4 of the memory, is it better?

Perhaps you should use a (more or less) fixed set of test machines, something like Windows 32 & 64-bit and Linux 32 & 64-bit. Within these four configurations, compare the contestants against each other; this would give more of an apples-to-apples comparison.

As to the length of the contest, maybe you should consider more of a carnival approach... play for points; when you get enough points you can redeem them for something, or keep banking them for a better prize. The prize bank would be limited, so the first person to redeem gets the prize. Unused points can (perhaps) be added to the "belt" status points.

Does MIMD/SIMD programming count toward educating about multi-core?

:-)

I like the points idea. I would also like to have several categories: of programmers, of problems. Can someone state, explicitly, what the contest objectives are? Not programming objectives, but those of the contest in itself (bring more programmers to multicore, obtain better algorithms, have fun, establish some sort of "best multithreaded programmers league", ...).

That's one question. About the platform: here at the Univ, the guys from the IT department teach courses with 200 students. That means that for an Artificial Intelligence course at least 100 projects must be submitted. These all go to the same machine and are automatically tested. How hard would it be for Intel to have such a machine (or time on such a machine) where everyone would send their programs, and it would run them automatically and spit out the points (by speed, by scaling, by...)?

dweeb,

Good point on the memory footprint. For some problems that may be important. For many problems small doesn't matter. When you use 500KB vs 1MB on a system with 2GB of RAM, I would not think the size of the working set would be an issue. The performance of the application would be the driving factor - and indirectly, because the 500KB program might fit better in the cache, it could also run faster. The 500KB might fit in L2 cache where the 1MB may spill over into L3 or RAM.

The test machines should include Windows 32 & 64-bit and Linux 32 & 64-bit, but the choice of machine should be up to the contestant.

Jim Dempsey

Quoting - rreis
That's one question. About the platform: here at the Univ, the guys from the IT department teach courses with 200 students. That means that for an Artificial Intelligence course at least 100 projects must be submitted. These all go to the same machine and are automatically tested. How hard would it be for Intel to have such a machine (or time on such a machine) where everyone would send their programs, and it would run them automatically and spit out the points (by speed, by scaling, by...)?

The AMD Multicore Threadfest was hosted on the TopCoder platform. Contestants could submit tests during the whole marathon; tests were automatically executed, and results were sent back to the contestant. Results included execution time, score (scoring was 100% formal), and stdout/stderr output. So contestants were able to submit small tests just to see whether one variant or another ran faster on the target machine (the machine for tests was the same as for final evaluation). It was very, very handy.

Quoting - Jim Dempsey

Your other suggestions are good. These fall into the class of distributing a base-level program with input files and output files, and a set of guidelines as to whether the output file has to be exactly the same or merely has to pass an "approximately the same" check when run through a verification program (also supplied). The user can do whatever they want with the sample program, but they must use the supplied input data as-is.

In the AMD Threadfest, participants were submitting not whole programs but just source-code implementations of some interface. These were then compiled with a test-suite provided by the organizers, so there were actually no input/output files.
In this variant the test-suite is also able to control the timing parameters of the workload. For example, the test-suite may present a uniform workload or a bursty workload.

For example, the interface for a server may look like:

struct server
{
    void handle_request(msg1 const&);
    void handle_request(msg2 const&);
    void handle_request(msg3 const&);
    void handle_config_change(config const&);
    void fetch_result_state(result&);
};

The test-suite will create an instance of that class, call some sequence of 'handle_' methods, and then call fetch_result_state() to verify the behavior of the implementation.
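A minimal sketch of how such a test-suite might drive it (msg1, config and result are whatever types the problem statement defines; the real suite would issue requests from many threads):

// Sketch of a single-threaded driver for the interface above.
int main()
{
    server s;

    config c;
    s.handle_config_change(c); // apply initial settings

    msg1 m;
    for (int i = 0; i < 1000000; ++i)
        s.handle_request(m);   // the real suite: many threads, mixed messages

    result r;
    s.fetch_result_state(r);   // compared against a reference answer
    return 0;
}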

What I had in mind regarding 'scalability testing'.

The contest will probably run on a machine with a very limited number of cores, for example 8 cores total. On such a limited number of cores a solution may be not linearly scalable but just fast. So while that solution will be the fastest on 8 cores, it may not be "a parallel solution to the problem suitable for thousands of cores". So my idea was/is to distinguish "real parallel solutions to the problem suitable for thousands of cores" from "just fast solutions suitable only for today's very limited number of cores".
For example,
Solution 1:
1 core - 1x (base case)
2 cores - 1.9x
4 cores - 3.5x
8 cores - 6.0x

Solution 2:
1 core - 0.7x
2 cores - 1.4x
4 cores - 2.8x
8 cores - 5.6x

I.e. solution 1 is the fastest on 8 cores, but we see that it significantly degrades, and the degradation progresses with an increasing number of cores. Solution 2 is linearly scalable, although on 8 cores it is still slower.
Solution 1 is not a real parallel solution and is suitable for 8/16 cores maximum. Solution 2 is a real parallel solution and is suitable for thousands of cores (at least based on this very limited info). So in a scalability-oriented competition, solution 2 must win, because if we extrapolate the performance curves to 16/32 cores, we see that it is solution 2 that is the fastest.

So, the formal criterion for a scalability-oriented contest may be the following: extrapolate the performance curve to, for example, 4x more cores and then see who is the fastest there. The extrapolation technique must be formalized.
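For illustration, one possible formalization (an assumption on my part, not an actual contest rule) is a least-squares fit of speedup = a * cores^b in log-log space, with entries compared at 4x the largest measured core count:

// Sketch: fit speedup = a * cores^b by least squares in log-log space,
// then compare entries at 4x the largest measured core count.
#include <cmath>
#include <cstdio>

double extrapolate(const double cores[], const double speedup[], int n,
                   double target_cores)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; ++i)
    {
        double x = std::log(cores[i]), y = std::log(speedup[i]);
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // scaling exponent
    double a = std::exp((sy - b * sx) / n);
    return a * std::pow(target_cores, b);
}

int main()
{
    double c[] = { 1, 2, 4, 8 };
    double s1[] = { 1.0, 1.9, 3.5, 6.0 }; // "solution 1": fast but degrading
    double s2[] = { 0.7, 1.4, 2.8, 5.6 }; // "solution 2": slower but linear
    std::printf("at 32 cores: solution 1 ~%.1fx, solution 2 ~%.1fx\n",
                extrapolate(c, s1, 4, 32), extrapolate(c, s2, 4, 32));
    // prints roughly 20.4x vs 22.4x - the linearly scaling solution wins
    return 0;
}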

Although, a much better criterion for a scalability-oriented contest would be just to run solutions on a 128-core system and see who is the fastest. On that many cores, solutions just have to scale. If Intel is able to provide such machines, it will be great.

Why extrapolate? Intel is presumably hosting this contest. I would guess that they have access to a 128-core system.

Jim


Why extrapolate? Intel is presumably hosting this contest. I would guess that they have access to a 128-core system.

The variant with a 128-core system is the best. It is good in itself, and there is no need to extrapolate to measure scalability. But I am curious: on what system was the previous contest run? And on what system will the 2009 contest run?

The AMD Threadfest (the last round was in Dec 2008) was run on an 8-core system (2 processors x 4 cores). This is why I am nervous about scalability :)

Quoting - Dmitriy Vyukov

The AMD Threadfest (the last round was in Dec 2008) was run on an 8-core system (2 processors x 4 cores). This is why I am nervous about scalability :)

Here is a good quotation from James Reinders:

-----------------------------------------------
In addition to "Think Parallel" the two key skills are:
* Writing scalable code. The shift in mindset here is fundamental: a good program needs to scale much more than it needs to be efficient. This is a very difficult concept for most to embrace, and if you develop a skill at seeing how to write code that scales, you will have a skill that will be in demand. To survive, we all need to cope with the issue of scaling some.
-----------------------------------------------

From:
http://www.ddj.com/hpc-high-performance-computing/214500435

Quoting - Dmitriy Vyukov
Banking server. The server holds a number of accounts; accounts are organized into a tree hierarchy (accounts/subaccounts), and a balance is attached to each account. The server handles a number of requests: adjustments to an account, transfers between accounts, output of an account balance, aggregations over several accounts, creation of new accounts. Accounts can be in several statuses (open, preclosed, closed); the status defines the set of allowed operations. Each request and each result must be logged to some common journal. There are also some global settings, which can be changed at run-time and define some aspects of request processing. The task is to implement the server run-time: request handlers, logging subsystem, settings change handler.

James Reinders:

--------------------------------------------------
In addition to "Think Parallel" the two key skills are:

* Writing scalable code....
* Dealing with shared information, managing global state sharing. Do you share mutable state? if not, how do you update state globally? The desire to make global state visible, but not globally changeable, is really similar to the debates in OOP about public vs. private with all the same issues.
----------------------------------------------------

From:
http://www.ddj.com/hpc-high-performance-computing/214500435

Btw, Clay, we would also like to hear your comments on our comments on your comments on the threading contest :)
