Accuracy of benchmark results

How accurate are your benchmark results? Do you get the same time measurements for the same program or do they differ much?

The times differ by up to 0.3 s on a 40-core HT machine, using 40 worker threads for our solution. I want to know whether OpenMP or the testing machine is the reason for this variation. Measuring small improvements with such erratic results isn't easy.

Absolute numbers won't help much here, I guess. For us, the results are *quite* stable, within the range of 5% of total performance, but they are by no means absolutely accurate.

Was this what you're looking for?
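One way to put a number on run-to-run variation like the 0.3 s and 5% figures mentioned above is to repeat the same benchmark several times and look at the spread. The following is only a minimal sketch: the function names, the sample timings, and the "gain must exceed k standard deviations" rule are assumptions chosen for illustration, not something the contest or its test machine prescribes.

```python
import statistics


def summarize_runs(times_s):
    """Summarize repeated wall-clock timings (in seconds) of the same benchmark."""
    mean = statistics.mean(times_s)
    spread = max(times_s) - min(times_s)
    return {
        "min": min(times_s),
        "median": statistics.median(times_s),
        "mean": mean,
        # Sample standard deviation needs at least two runs.
        "stdev": statistics.stdev(times_s) if len(times_s) > 1 else 0.0,
        "spread": spread,                  # absolute deviation, e.g. 0.3 s
        "relative_spread": spread / mean,  # e.g. 0.05 for "within 5%"
    }


def is_real_improvement(before_s, after_s, k=2.0):
    """Heuristic (an assumption, not a formal test): accept a speed-up only if
    the median gain is larger than k times the noisier of the two stdevs."""
    before, after = summarize_runs(before_s), summarize_runs(after_s)
    gain = before["median"] - after["median"]
    noise = max(before["stdev"], after["stdev"])
    return gain > k * noise


if __name__ == "__main__":
    # Hypothetical timings, not measurements from the contest machine.
    before = [3.10, 3.25, 3.40, 3.18, 3.32]
    after = [3.05, 3.22, 3.35, 3.12, 3.30]
    print(summarize_runs(before))
    print("improvement distinguishable from noise:",
          is_real_improvement(before, after))
```

Comparing medians rather than single runs makes a small improvement easier to tell apart from noise, at the price of running every variant several times.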

Absolute deviation values for the 40-core machine would be interesting. I assume that OpenMP is the reason for the different results.

When using schedule(guided), the execution time for AE12CB-9510636300373457187 with 40 threads is only a little bit faster than with 20 threads, but the user time is nearly twice as long.

Do you have a nearly constant user time for the same benchmark with different numbers of worker threads?

We do not use guided, but we have the same problem. Real time improves by about 0.5 s from 20 to 40 threads, but user time nearly doubles.
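To compare real and user time across thread counts in a repeatable way, the numbers reported by the shell's `time` can also be collected from a script. This is a rough sketch that assumes a Unix-like system (the `resource` module is not available on Windows); `time_command` and `avg_busy_cores` are hypothetical helper names, and the command being timed is a placeholder for your own benchmark binary.

```python
import resource
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (real_s, user_s, sys_s), similar to `time`."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # waits for the child, so its usage is counted
    real_s = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (real_s,
            after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def avg_busy_cores(real_s, user_s, sys_s=0.0):
    """CPU time divided by wall time: how many cores were busy on average."""
    return (user_s + sys_s) / real_s


if __name__ == "__main__":
    # Placeholder command; substitute the benchmark under test.
    real_s, user_s, sys_s = time_command([sys.executable, "-c", "sum(range(10**6))"])
    print(f"real {real_s:.3f} s, user {user_s:.3f} s, sys {sys_s:.3f} s")
    print(f"average busy cores: {avg_busy_cores(real_s, user_s, sys_s):.2f}")
```

If user time nearly doubles from 20 to 40 threads while real time barely moves, the average number of busy cores roughly doubles as well, which means the extra threads are burning CPU without doing much additional useful work.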

Same thing for us: the results on the 87 between 20 and 40 threads are similar. We tried our program at school on a 20-core computer and we don't have this problem there.

Well, sometimes it's just normal: more threads begin to disturb each other. Maybe that is the explanation.

Often, these kinds of observations come from using "thread pools" with persistent threads and busy waiting.

If you have a thread pool of, say, 40 threads that spin in a while loop waiting for incoming jobs but do not get any (e.g., because the current task cannot be parallelized), that time is counted towards the USER TIME, but does not reduce WALL TIME/REAL TIME at all.

Thus, CPU Usage in those scenarios is kind of misleading.
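The effect can be reproduced in miniature. The sketch below is an illustration under stated assumptions, not the contest setup: `measure_idle_pool` is a hypothetical helper, the serial phase is faked with a sleep, and because of Python's global interpreter lock the spinning threads share roughly one core, so the inflation is capped near one times the wall time. With native pthreads or OpenMP threads, each spinning thread can occupy its own core, so the inflation can be far larger.

```python
import threading
import time


def measure_idle_pool(num_workers, busy_wait, serial_phase_s=0.3):
    """Return (wall_s, cpu_s) for a serial phase during which idle workers
    wait for a job, either by spinning or by blocking."""
    job_ready = threading.Event()

    def worker():
        if busy_wait:
            # Busy waiting: poll in a loop, burning CPU time while idle.
            while not job_ready.is_set():
                pass
        else:
            # Blocking wait: the thread sleeps until it is signalled.
            job_ready.wait()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of the whole process
    for t in threads:
        t.start()
    time.sleep(serial_phase_s)       # stands in for work that cannot be parallelized
    job_ready.set()
    for t in threads:
        t.join()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    for busy in (True, False):
        wall_s, cpu_s = measure_idle_pool(4, busy_wait=busy)
        label = "busy-wait" if busy else "blocking "
        print(f"{label}: wall {wall_s:.2f} s, cpu {cpu_s:.2f} s")
```

Both variants take about the same wall time, but only the busy-waiting pool accumulates CPU time while idle. For OpenMP runtimes, the standard `OMP_WAIT_POLICY` environment variable (`active` or `passive`) is the documented hint for whether waiting threads should spin or sleep; whether it changes the numbers discussed in this thread has not been tested here.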

If the user time increases for the same task with more threads, the coordination overhead is increasing. OpenMP does not seem to be the best solution for scaling to 40 cores. I tried different OpenMP schedulers with different chunk sizes. The dynamic scheduler with an appropriate chunk size gave the best result, but I am not satisfied yet. We will probably have to drop OpenMP and use pure pthreads.
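The trade-off between schedule kinds and chunk sizes can be explored with a toy model before touching the real code. The following is a simulation, not OpenMP: it assumes known per-iteration costs, a hypothetical fixed `dispatch_cost` paid each time a thread fetches a chunk, and a simplified static schedule that hands each thread one contiguous block. Real runtimes differ in the details.

```python
import heapq
import math


def static_makespan(costs, num_threads):
    """One contiguous block per thread, loosely modelling schedule(static)."""
    block = math.ceil(len(costs) / num_threads)
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, num_threads, chunk, dispatch_cost=0.0):
    """The first thread to become free takes the next chunk of iterations,
    loosely modelling schedule(dynamic, chunk)."""
    free_at = [0.0] * num_threads  # min-heap of times at which threads become free
    heapq.heapify(free_at)
    for start in range(0, len(costs), chunk):
        earliest = heapq.heappop(free_at)
        work = sum(costs[start:start + chunk])
        heapq.heappush(free_at, earliest + dispatch_cost + work)
    return max(free_at)


if __name__ == "__main__":
    # Hypothetical workload: iteration i costs i time units, so late
    # iterations are much more expensive than early ones.
    costs = list(range(1, 41))
    print("static           :", static_makespan(costs, 4))
    for chunk in (1, 2, 4, 10):
        print(f"dynamic, chunk {chunk:2d}:",
              dynamic_makespan(costs, 4, chunk, dispatch_cost=0.5))
```

Small chunks balance uneven work better but pay the dispatch cost more often; large chunks approach the static split. Which chunk size wins depends on how uneven the iterations are and how expensive a dispatch is on the real machine, so the model only suggests where to start measuring.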

When it comes to performance, I don't think anything can beat pthreads. The downside is that the code is more difficult to write, and it's not as easy to modify and parallelize your algorithm as with OpenMP. You lose the overhead, but you also lose simple fine-tuning such as the scheduler and chunk size mentioned above; those kinds of changes take more time.

Best regards,
Nenad

I wouldn't rely too much on pthreads. Last contest, me and my teammate developed two solutions in parallel: one using pthreads and one using openmp. Their performance was nearly identical, so in the end we decided to stick to openmp due to the simpler, more organized layout of the code.

As I said, pthread programming requires much more work compared to OpenMP. Last contest, we got a significant improvement after switching to pthreads, but it's not as easy as adding a couple of pragmas and having everything work like a charm. Personally, I think OpenMP is good for developing the algorithm, but well-written pthread code beats OpenMP in performance.

Best regards,
Nenad

In our team, we develop our code with pthreads. So far, our results are not so bad, and we have learned a lot about synchronization using semaphores or mutexes. What we do would not have been possible using OpenMP (or not as simply as with pthreads).

You are comparing a low-level parallelization technique (pthreads) to a high-level one (OpenMP). Essentially, pthreads imposes an implementation overhead, but offers more control over parallelization and can enable some parallelization patterns that are not possible with OpenMP. BUT in most cases pthreads is used to implement constructs that OpenMP already offers, and in those cases OpenMP is comparable in terms of speed, but much easier to use and much less error-prone!

Did you compare code parallelized with openmp with identical code parallelized in an identical way with pthreads?

There's more to openMP than just adding a couple of pragmas.

Yes, we did; that's why I was able to state that pthreads has better performance. The scalability was also better with pthreads. I think it's quite logical: with pthreads you eliminate all the unnecessary overhead that OpenMP can produce. Besides, I'm not sure, but I think OpenMP uses pthreads in the end. By using pthreads yourself, you just eliminate the extra work OpenMP is doing.

Best regards,
Nenad

When you say better performance, what percentage are you talking about?

I think it was about 15-20%, not a small difference. It could be just a special case, but I doubt it...

Best regards,
Nenad

This would be interesting. This blogger found exactly the opposite behaviour:

http://berenger.eu/blog/2011/01/20/c-cpp-openmp-vs-pthread-%E2%80%93-ope...

I did not look at his code, but the results are quite puzzling.

EDIT: I just found that post:

http://www.futurechips.org/tips-for-power-coders/open-mp-pthreads.html#m...

That author tells us that pthreads is slightly faster at runtime: by an overwhelming 3% or so ;-)

There were various opinions during the last contest, too. Pthread always worked better for me, but I guess it's up to the implementation. Either I'm right, or I'm just good at implementing with pthreads and bad at implementing with openmp :)

Best regards,
Nenad
