Auto-parallelization runs slower and heavier

I have written a program which relies heavily on array assignments and operations, among other things, so many of its DO loops are parallelizable; in fact, I have written some of them as DO CONCURRENT for that reason. However, I am facing a situation where the program compiled with auto-parallelization runs considerably slower than the one compiled without it.

The strangest thing about it is that in my system monitor, when the auto-parallelized program runs, I see all cores fully busy and the temperatures of all cores rising fast, which means work is being done. But if that is so, the program not only takes longer, it also does roughly eight times more work.

I speculate this might be an artifact of code that is a poor candidate for parallelization, but I assumed the compiler could resolve all problems related to parallelizing. It still baffles me, however, and because I suspect this is a matter of lack of knowledge on my side, I am reluctant to post any details yet. Am I missing something here? Please advise as to what details might be needed if the matter is not straightforward.

Thanks in advance!

Trent

CPU: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz

System: Kubuntu 13.04


I suggest you learn how to apply OpenMP to your application. Autoparallelization can only handle the simple stuff.

OpenMP is relatively easy to learn - jump right in.

Jim Dempsey

Yeah, while struggling to find a solution I also started to study OpenMP. It still baffles me how auto-parallelization can break performance so badly. I wish the DO CONCURRENT construct offered more control over what is parallelized via the source (I have yet to check coarrays), but OpenMP ultimately offers the full toolkit to do the job right. Thanks Jim.

Trent,

Expect a few stumbles along the way. In particular, you must decide what data is to be private to a thread within the context of a parallel region, for example temporaries and inner loop control variables, and what data needs to be shared among threads. Then there are those shared variables that need read/modify/write access (sum = sum + value), which need special consideration: OpenMP can declare sum as a reduction variable (naively you would think of sum as merely a summation variable). For longer sections of code there are critical sections and other directives to control execution. At first this may seem daunting, but hang in there; these are all relatively simple concepts to learn.
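A minimal sketch of those clauses on a made-up summation loop (the names here are illustrative, not from Trent's program):

```fortran
! Hypothetical example: summing a scaled copy of an array.
! x is shared; t is private to each thread; s is a reduction
! variable, so per-thread partial sums are combined safely at
! the end of the parallel region. The loop index i is private
! automatically.
program omp_reduction_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   real :: x(n), s, t
   integer :: i

   x = 1.0
   s = 0.0
!$omp parallel do private(t) reduction(+:s)
   do i = 1, n
      t = 2.0 * x(i)      ! t is a per-thread temporary
      s = s + t           ! safe: s is reduced, not raced on
   end do
!$omp end parallel do
   print *, 's =', s
end program omp_reduction_demo
```

Without the reduction clause, every thread would race on the shared s and you would get both wrong answers and terrible performance from the contended memory location.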

Remember the adage: Parallel Outer - Vector Inner
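To illustrate the adage with another made-up loop nest: put the OpenMP parallelism on the outer loop so each thread gets a large chunk of work, and leave the stride-1 inner loop for the compiler to vectorize.

```fortran
! Hypothetical matrix update: threads split the outer loop over
! columns, while each inner loop walks contiguous (column-major)
! memory and can be vectorized with SIMD instructions.
!$omp parallel do private(j)
do i = 1, m
   do j = 1, n                     ! stride-1 access: vectorizable
      a(j, i) = a(j, i) + b(j, i) * c
   end do
end do
!$omp end parallel do
```

Doing it the other way around (parallel inner, vector outer) pays the thread start-up cost on every outer iteration and defeats vectorization, which is one common way a "parallel" build ends up slower than a serial one.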

Jim Dempsey

It's not common for an autoparallel application to run slower. Do you have a sample program showing this behavior we could look at? You might try a build with the -guide option added to see if it has any advice. It may be that the compiler's heuristics here need adjustment. Or it could be something else entirely - you are measuring wall-clock time and not CPU time, yes? And there is sufficient work to be done in each of the threads?

Retired 12/31/2016

The loop sections of the code are very large, so I will have to write a smaller sample covering the kinds of operations that bear on whether a loop is parallelizable or not. Unfortunately, I have not been able to reproduce the problem in small sample programs, no matter how many operations I try that "could" violate normal parallel behavior (not documented as such, however, and not reported as such by the compiler).

In the meantime, I used -par-report2 to examine closely which loops in the program were parallelized and which were not, and the results were precisely what I would expect from my code (my DO CONCURRENTs are in the right places too, not that it makes a difference with -parallel, I think). So there are no cases of "insufficient computational work"; auto-parallelization filters those out. The -parallel -guide4 log gives me 0 entries/recommendations.

I do use SYSTEM_CLOCK for measuring time, as well as my "gut's clock", since the application takes considerable time for a sufficient number of simulation runs and the differences in time are considerable. (That is what scares me: it takes longer while heating up all 8 hardware threads, which theoretically means a load of extra work, something that doesn't make sense.)
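For what it's worth, a quick way to report both clocks at once in a case like this: SYSTEM_CLOCK gives wall time, while CPU_TIME sums time across all threads, so the two are expected to diverge under parallel execution. A sketch (do_work is a placeholder for the measured section):

```fortran
program timing_demo
   implicit none
   integer :: c0, c1, rate
   real :: t0, t1, wall, cpu

   call system_clock(c0, rate)   ! wall-clock start
   call cpu_time(t0)             ! CPU-time start

   call do_work()                ! the code being measured

   call cpu_time(t1)
   call system_clock(c1)

   wall = real(c1 - c0) / real(rate)
   cpu  = t1 - t0
   print *, 'wall time:', wall, 's   cpu time:', cpu, 's'
   ! In a healthy parallel run, cpu time can approach
   ! nthreads * wall time; only a growing WALL time means
   ! parallelization actually made things slower.
contains
   subroutine do_work()
      ! placeholder for the measured section
   end subroutine do_work
end program timing_demo
```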

I would suspect a problem with my installation of ifort, but the success of simple programs tells me it is definitely something in my code, something perhaps undesirable to the compiler's auto-parallelization checks. So I suspect the situation is more like what Jim describes: there is some non-trivial "entanglement" in the program that the compiler tries to parallelize (but fails to detect), and the result is this performance overload. In truth, I really don't know.

I will get in touch with my supervisor to see if I can send you the full source for examination, if this problem cannot be resolved any other way.

UPDATE: I was informed parallelization is not critical at this point, and deadlines are pressing, so further updates on the issue may come a bit delayed. I will get back to you with a working sample as soon as I can. Thank you!

I would next suggest running the program with VTune Amplifier XE and let it report on most common issues.

Retired 12/31/2016
