Benchmark Results Explanation

Hi everyone,

I have a question about the benchmark results:
What exactly do the parameters we get (real, user, sys, CPU) represent?

As I understand it, real is the time our program has been executing on the machine,
user is the aggregated execution time on all used cores,
while sys and CPU show the CPU load.

If this is true, then the time we need (to see whether we are progressing) is real
(user should stay approximately the same whether or not we parallelized our program), but
because CPU varies, we get different results every time.
Or am I wrong?

Hello :)

Here's something you may find useful:
time command - wikipedia
time command - linux about

There you can find something about user time, system (kernel) time, etc.

System time can be interpreted like this: the overhead brought by thread scheduling (and everything related to it).

You can build a small benchmarking system on your local computer (using time command).

You should also read Overview of parallel development with OpenMP from http://intel-software-academic-program.com/courses/index.html.

The time command can be launched like this (with your executable as a parameter):
/usr/bin/time ./run number-of-cores min-length ref.txt input.txt

Regards,
andrei

PS: start by reading the presentation from the software academic program; it is very useful.
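
For the local benchmarking suggested above, it may help to see what a time-like wrapper actually measures. The following is a minimal sketch in C, assuming a POSIX system; `run_times` and `time_command` are illustrative names and not part of any contest tooling. real is elapsed wall-clock time, while user and sys are the CPU time charged to the child process in user and kernel mode, summed over all of its threads.

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Timings in seconds, named after the fields of the benchmark report. */
typedef struct {
    double real; /* elapsed wall-clock time */
    double user; /* CPU time in user mode, summed over all threads */
    double sys;  /* CPU time in kernel mode, summed over all threads */
} run_times;

static double tv_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Runs cmd[0] with the NULL-terminated argument vector cmd and fills *out.
   Returns the child's exit code (127 if exec failed), or -1 if fork or
   wait failed or the child was killed by a signal. */
int time_command(char *const cmd[], run_times *out) {
    struct timeval start, end;
    struct rusage before, after;
    int status;

    getrusage(RUSAGE_CHILDREN, &before);
    gettimeofday(&start, NULL);

    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execvp(cmd[0], cmd);
        _exit(127); /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }

    gettimeofday(&end, NULL);
    getrusage(RUSAGE_CHILDREN, &after);

    /* RUSAGE_CHILDREN accumulates over all waited-for children,
       so subtract the snapshot taken before this run. */
    out->real = tv_seconds(end) - tv_seconds(start);
    out->user = tv_seconds(after.ru_utime) - tv_seconds(before.ru_utime);
    out->sys = tv_seconds(after.ru_stime) - tv_seconds(before.ru_stime);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With a multithreaded program, user can be larger than real, because the CPU time of all threads is added together while real only counts elapsed time once.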

Thank you andrei, I found what I wanted to know and those presentations have proved useful.

But the point I mentioned before still stands:
there is no guarantee that no other programs will be running
during our program's execution, and those would affect the
benchmark results we get. But I now know that this is not as important as I first thought. So
it can't be helped, I guess :)

Regards,
Nemanja

Hi, I have a question for the organizers. Is it possible to have a more detailed report for the submission? I get the "output is different than the expected one" message. I have tested my program with many inputs and I get the same results as the reference code. So I am curious about the case which 'knocks out' my program. Best regards, Dumi

It might be because you have omitted to sort the results alphabetically by sequence name when there is more than one input file.
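
If that is the cause, sorting by name before printing is a small change. A minimal sketch in C using the standard qsort; the `seq_result` record and its fields are hypothetical placeholders, since the actual output format is defined by the contest rules and not shown in this thread.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical result record; the real output format is defined
   by the contest rules, not by this sketch. */
typedef struct {
    const char *name; /* sequence name */
    int match_count;  /* placeholder payload */
} seq_result;

/* qsort comparator: orders records by name, byte-wise via strcmp. */
static int by_name(const void *a, const void *b) {
    const seq_result *ra = a;
    const seq_result *rb = b;
    return strcmp(ra->name, rb->name);
}

/* Sorts the n records in place, alphabetically by sequence name. */
void sort_results_by_name(seq_result *results, size_t n) {
    qsort(results, n, sizeof results[0], by_name);
}
```

Note that strcmp compares bytes, so uppercase letters sort before lowercase ones; whether that matches the expected order is something to check against the reference program.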

Xavier H. (Intel):

Hi,

It's no longer possible to get a more detailed report; people started to abuse this function by outputting things other than errors.

Try comparing AAAAA and AAAAAAAAAAAAAA for a length of 3, and compare your result with our program's. I wanted to test your submission myself, but the last version you sent is the reference one.

And when we run your submissions, they run under the same conditions as the others, so don't worry about that.

Hi, I get:
1 5 10 14
which is the same as the output of the sample code. I will submit a new solution soon. Thank you.

Hi, I wonder: will the reference sequence be much larger than the input sequence, or vice versa? How will it be in the benchmark tests?

nickraptis:

Some info like that from 'time -v' would be nice to have, though.
I'm particularly interested in major page faults, which would indicate swap thrashing, now that we get to work with larger sequences.
But I guess I can deduce it from the CPU utilization.

Is it possible to have separate messages for "too slow" and "unexpected error"?

Xavier H. (Intel):

Hi, these messages are generally separate. It's only on the 40-core machine that they can't always be distinguished yet. If your program runs fine on the other benchmarks on the same machine, it is certainly a bit too slow; you just have to continue to improve it ;)

Well, it happened on 12 cores. Here is the output:

error on a 12-cores HT machine :
program terminated (too slow or unexpected error).

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-16325737234926730915:
real:xxx user:xxx sys:xxx CPU:xxx

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915:
real:xxx user:xxx sys:xxx CPU:xxx

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-7570939485803595530:
valid submission

Not sure what happened there, since I also tried the serial intel sample code and had no error. And yes, our current algorithm is faster than the sample code. ;).

Hi, on the benchmark the report says 'valid submission', but the real time never exceeds 0.04 (with either AE12CB-10353053912364647132, AE12CB-16325737234926730915 or AE12CB-7570939485803595530). Is there a problem, or is the data just very small?

Xavier H. (Intel):

Hi, the scalability of your program was misjudged because it was too fast. I corrected that, so you should have more benchmarks now ;)

@Mihaio07: it seems the issue you have on the 12-core HT machine comes from the use of std::string in a non-thread-safe way.

Highly unlikely, because I use C. No STL and no string.

Can I get the exact specific error message? Is it segmentation fault, double free or corruption, or something else entirely?

Xavier H. (Intel):

Resubmit your program. If it is a segfault or a double free, you'll know it. Otherwise I'll tell you as soon as I can.

Thanks. I now have a new problem: I get an error on the 40-cores HT machine: "couldn't unzip submission -> timeout of command execution, >1500 ms". But it works on the 12-cores machine and the 12-cores HT machine.

Thank you very much. I have just resubmitted the program.

Hi,

I used to have the results on the 40-core machine in my benchmark report, but after making some changes to my code (which significantly improved the results on the 12-core machines), the report doesn't show the results on the 40-core anymore. It doesn't even show an error message.

Something like that happened to me on different machines. All I did to fix it was to re-upload the file, and the report for the 40-core machine appeared.

I've already tried this a couple of times but unfortunately it didn't bring the results for the 40-core back.

It seems that your code needs more optimization to be able to run the 40-core benchmark.

But my code needs less than 0.2s on the 12-cores and before I changed it (when the 40-cores were still in the report), it needed more than 2s.

Just to know, how many benchmarks can we try at the moment?

Stevan Medić:

I tested my code a few days ago, and I tested it again today. The code is the same, of course, but the results are not. Today's results are much better, and I don't know exactly why (it must be because of the CPU load, but I'm not sure). Here are the results:

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915:
real:4.27 user:3.64 sys:0.62 CPU:99.7658%

and

on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915:
real:7.89 user:3.7 sys:0.56 CPU:53.9924%

Same test ID, same 12-core machine... Can someone explain this difference to me?
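
The two reports are consistent with the CPU field being computed as (user + sys) / real: 4.26 / 4.27 gives 99.77% and 4.26 / 7.89 gives 53.99%. In other words, both runs consumed the same CPU time (4.26 s) and only the elapsed time differs, so in the second run the process was running on a CPU for only about half of the wall-clock time. This formula is inferred from the numbers above, not taken from an official description. A minimal sketch in C, with `cpu_percent` and `print_report` as illustrative names:

```c
#include <stdio.h>

/* CPU field as inferred from the two reports quoted above (an assumption,
   not an official definition): CPU time used divided by elapsed time. */
double cpu_percent(double real, double user, double sys) {
    if (real <= 0.0) {
        return 0.0; /* avoid dividing by zero for degenerate input */
    }
    return 100.0 * (user + sys) / real;
}

/* Prints one line in a layout similar to the benchmark report. */
void print_report(double real, double user, double sys) {
    printf("real:%g user:%g sys:%g CPU:%g%%\n",
           real, user, sys, cpu_percent(real, user, sys));
}
```

Under this reading, a value close to 100% with 12 worker threads suggests that the work was effectively serial, while a program that keeps N cores busy would approach N x 100%.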

I also have a question. How fast must our code be to move on to more cores? On the 12-cores machine with 12 worker threads I have 1.43 seconds... I have improved in the last few days, but still no support for very large strings :-)

I think that both benchmarks are the same.

The user time is the only thing that matters in the benchmark.

Stevan Medić:

Hmm. Are you sure? Can somebody from Intel confirm that? Thanks for your answer :)

We also don't have any errors on the 12 cores machine and our timings are fairly low. Could Xavier confirm if the 40 cores machine is down or maybe what are the requirements to be able to run its benchmarks? Thanks in advance.

I just ran some tests on the 40-cores HT machine. So I guess the answer is that it's up :)

@candreolli: so what is your time at the 12 core machines? xD

@candreolli, dieter84: or maybe he'll tell us the minimal time since his program started using the 40 cores machine, if he wants to keep the real value a secret :)

Actually, I don't understand the algorithm used to grant access to the next benchmarks. Sometimes, with very close values from one test to another, I have access to the 40 cores, and sometimes I don't. I don't really want to post my times because they will probably evolve before the end of the contest, and it's something we want to keep to ourselves for the moment; I hope you will understand. But maybe Xavier, Anthony or Paul will explain the rules used to decide whether we are able to access the next step.

It seems like the algorithm also takes into account how much faster our program executes when using more threads. When I wasn't getting the results on the 40-cores anymore, I was able to get them back by adding something stupid like

if (atoi(argv[1]) < 12) {
    for (int i = 0; i < 1000000; i++) {
        // do something
    }
}

to my code such that the execution time is higher when using fewer threads.

So... it seems that they are looking at the % CPU utilization (I guess you ran that for loop in parallel, right?)

I'm not sure, because in our first tests we ran our sequential algorithm and, if I'm right, we got access to the 40 cores.

No, I didn't run the for loop in parallel. I just put it in the beginning of the main method, the idea being that the loop is only executed if the number of threads (as given by argv[1]) is small and thus increasing the execution time in that case.

Hello

Since this week-end my code is not tested anymore on the 40-cores machine (even if it has never been as fast :)).
I have added your loop, and surprisingly I get the results on the 40-cores machine.

I hope the organizers will not evaluate our submission using this automated mechanism, in which case my code is limited to 12 cores (without this loop).

Xavier H. (Intel):

We sometimes modify the access rules for the 40-core machine depending on its availability. You have access to it if your code is fast and scalable enough. If it is a bit less fast but more scalable, that is OK too... and that is where you found how to cheat the system ;) But some of you who are using that trick don't actually need it.

I think my problem is that the run with 6 worker threads is as fast as the run with 12 worker threads. I have a CPU usage of over 1000% with 6 threads... is this even possible? :P With 12 threads the CPU usage stays the same as with 6 threads... Does anyone have a tip for me on which direction I could investigate? (By the way, my time is 0.8s on both runs and I solved the greedy problem.)

nickraptis:

Quoting dieter84
I have a cpu usage of over 1000% with 6 threads... is this even possible? :P

I think you might be using nested parallelism and spawning more threads than you think.

I am using only one OpenMP parallel for loop so far... so I don't think it's nested parallelism. Also, when I artificially slow down the tests with 6 threads, I get to the HT machines, but unfortunately I get the following error:

error on a 12-cores HT machine :
error during benchmark : timeout of command execution, >150000 ms.

did anyone already encounter this problem?

I was going to post something on the same issue. I am also testing my serial implementation by artificially slowing down for 6 threads. Yesterday I got results for 5 tests, something like:

Your submission from 2012-05-03 21:47:54 CET has been ran on our servers, here are the results :
on a 12-cores HT machine, using 6 worker threads, running benchmark AE12CB-10353053912364647132:
real:... CPU:99.7147%
on a 12-cores HT machine, using 12 worker threads, running benchmark AE12CB-10353053912364647132:
real:... CPU:99.8205%
on a 12-cores HT machine, using 24 worker threads, running benchmark AE12CB-10353053912364647132:
real:... CPU:99.8205%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-16325737234926730915: real:... CPU:99.1228%
on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915: real:... CPU:98.8095%

But today, when I submit what I believe is the same file as yesterday, I get the same error as you, and the 24-worker run of benchmark AE...32 doesn't appear in the list:

error on a 12-cores HT machine :
error during benchmark : timeout of command execution, >150000 ms.

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-10353053912364647132: real:... CPU:99.8519%
on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-10353053912364647132: real:... CPU:99.812%

on a 12-cores machine, using 6 worker threads, running benchmark AE12CB-16325737234926730915: real:... CPU:100%
on a 12-cores machine, using 12 worker threads, running benchmark AE12CB-16325737234926730915: real:... CPU:100%

I don't know if I messed up my solution or the benchmark has changed. Did the 6 and 12 worker tests pass for you?

They are constantly changing the benchmarks, so I wouldn't worry about it too much. It would be nice to get an official Intel update from time to time, though.
And, by the way, I get the same error...

Best regards,
Nenad

Xavier H. (Intel):

Benchmark ...32 has been replaced by a bigger one on the 12-core HT machine; your programs are not yet fast enough to pass it.

I would like to get an explanation on a certain matter.
When I analyzed our benchmarking results, I noticed that we have different CPU usage on different CPU types (CPUs that support hyperthreading and those that don't). On an HT CPU, our CPU usage is good (e.g. ~580% for 6 threads on a 12-core machine), even when there are twice as many worker threads as physical cores (e.g. ~2370% for 24 threads on a 12-core machine). On the other hand, CPU usage on non-HT machines is much worse (e.g. ~1000% for 12 threads on a 12-core machine).
Does anybody have an explanation for this, because I really can't think of any?

Best regards,
Nenad

Hi, why must the output test sequences be in the same order as in the input? It's normal in parallel code that the output order depends on the size of the test sequences. The result is correct, but the benchmark test returns an error because of that.

Those are the rules and that's that. If you need some information about the way the sequences and the solutions should be sorted, take a look at this thread:
http://software.intel.com/fr-fr/forums/showthread.php?t=104499&o=a&s=lr
If you need some unusual inputs to test the logic of your code, there is a nice post by candreolli on the fourth page of this thread:
http://software.intel.com/fr-fr/forums/showthread.php?t=104707&o=a&s=lr

Best regards,
Nenad
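
On the ordering question: one common way to keep the output in a fixed order while still processing sequences in parallel is to have each loop iteration write into the slot that matches its input index, and to print only after the parallel loop has finished. A minimal sketch in C with OpenMP follows; `process_sequence` is a hypothetical stand-in for the real matching work, and the printed format is a placeholder, not the contest's format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the real per-sequence work. */
static size_t process_sequence(const char *seq) {
    return strlen(seq);
}

/* Fills results[i] for inputs[i]. Each iteration writes only its own slot,
   so the order in which iterations finish cannot change the result order. */
void process_all(const char *const inputs[], size_t results[], int n) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        results[i] = process_sequence(inputs[i]);
    }
}

/* Prints sequentially, after the parallel loop, in input order. */
void print_results(const char *const inputs[], const size_t results[], int n) {
    for (int i = 0; i < n; i++) {
        printf("%s %zu\n", inputs[i], results[i]);
    }
}
```

Compile with OpenMP enabled (for example -fopenmp with GCC); without it the pragma is ignored and the loop runs serially with the same result. The dynamic schedule is only a guess at what suits sequences of very different sizes.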

I didn't see that :) Thanks a lot

Hi guys, I also have a question about the benchmark system. I have two versions of my application, and one of them is faster than the other (according to tests that I ran on my local machine). However, on the Intel benchmark, the slower solution passes but the faster one has this problem: "error during benchmark : timeout of command execution, >20000 ms." Does anyone else have this problem? An answer from the organizers would be appreciated.

Xavier H. (Intel):

Hi, the slowest solution may be the fastest one for cases you have not tested. Your latest submission requires more than 20 seconds for that same benchmark (I tested it on a different machine and it took 2m20s).
