-openmp -parallel -What have you done for me lately?

Sorry for the silly title; I watched Eddie Murphy: Raw and it's stuck in my head! Anyway, I was under the impression that the Intel compiler could magically modify my code to run on multi-core CPUs: all I have to do is have a multi-core CPU to test on and use -parallel. After some time trials, I'm not so sure. It's doing something, but my times are all over the place, which makes it hard to tell. Since multi-core CPUs are the way of the future and I'm mostly writing games that could benefit from this technology, I've got questions...

1) Will my parallel code still run on single-core CPUs?

2) Are there any compiler options that don't mesh well with -parallel?

I spent most of the day yesterday getting my code to compile with -openmp. Now that it will compile with -openmp and/or -parallel and the time-trial results are in, I know they can have a powerful impact (sometimes) on performance.

3) What exactly am I asking the compiler to do when I specify -parallel? Should it always be used with -openmp?

I've figured out that OpenMP is a library. I've been to the website and seen some truly weird stuff (#pragmas).

4) Do I need to participate by adding things like "#pragma omp parallel" to get any good use out of it at all?

5) Any good resources for dummies to learn about this stuff?

Thanks for any input.

-openmp and -parallel both generate threaded code supported by the OpenMP run-time library. If you don't set the number of threads by environment variable or by an OpenMP directive or call, it adjusts the number of threads automatically to the CPU (one thread on a single-core, non-HT CPU).
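For example, the thread count can be set from the environment or from the code (a minimal sketch; the program is made up, but OMP_NUM_THREADS and the omp_lib routines are standard OpenMP):

  ! shell: request four threads before running
  !   export OMP_NUM_THREADS=4
  program show_threads
    use omp_lib                          ! OpenMP run-time routines
    implicit none
    call omp_set_num_threads(4)          ! or request the count from code
  !$omp parallel
    print *, 'thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$omp end parallel
  end program show_threads
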
Some people like to remove most optimizations other than -openmp or -parallel, as that generally permits more impressive threaded speedup. On the other hand, most experts recommend that code be optimized thoroughly, including vectorization, before working on threading, on the assumption that the goal is to optimize performance by all available means. New CPUs introduced over the last year have improved hardware support for combining parallel with other optimizations.
-parallel generates hidden OpenMP code. Parallel Studio for Windows generates OpenMP code which presumably you could use as a starting point for modification.
In my experience, -parallel works much better on source code which has been organized for OpenMP. You can switch back and forth to get an idea whether your OpenMP implements the optimizations which -parallel may find. A problem with -parallel is that the optimum level of -par-threshold varies from loop to loop.
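For reference, a couple of compile lines to switch back and forth with (a sketch using the classic Linux ifort spellings; the exact syntax of the threshold suffix may differ by compiler version):

  # auto-parallelization, with the profitability threshold lowered so that
  # more loops are considered, plus a report of what was parallelized
  ifort -O3 -parallel -par-threshold50 -par-report2 mycode.f90

  # explicit OpenMP directives only
  ifort -O3 -openmp mycode.f90
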
OpenMP is generally described as a library, but it also involves a pre-processing step which replaces the OpenMP pragmas with calls into that library. -parallel likewise looks for opportunities to optimize with calls to the OpenMP library. The compiler's other optimizations are also influenced by the parallelization: -parallel performs more aggressive loop optimizations to improve threading, some of which would otherwise be undertaken only at -O3, while OpenMP parallel regions turn off many such optimizations ("what you see is what you get").
Intel -parallel doesn't normally deal with situations where OpenMP lastprivate is needed for threading. It may recognize some situations for schedule(guided) and firstprivate.
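For reference, here is what a hand-written loop using those clauses might look like (a sketch, not compiler output; the subroutine and variable names are made up):

  subroutine scale_add(a, c, b, scale, n, xlast)
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: a(n), c(n), scale
    real, intent(out)   :: b(n), xlast
    real    :: x
    integer :: i
  !$omp parallel do schedule(guided) firstprivate(scale) lastprivate(x)
    do i = 1, n
       x = scale * a(i)        ! each thread gets its own copy of x ...
       b(i) = x + c(i)
    end do
  !$omp end parallel do
    xlast = x                  ! ... but the value from iteration n survives
  end subroutine scale_add
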
-openmp has practically no effect if your code contains no OpenMP pragmas, while -parallel works only on code regions which aren't already designated as OpenMP parallel (unless the -openmp option is removed, in which case the directives are ignored and -parallel can consider those regions too).
An OpenMP directive is essentially an assertion to the compiler that the programmer knows the loop can be parallelized, without the compiler checking the consequences. So the programmer takes responsibility for eliminating data races, etc.
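A minimal illustration of that responsibility (hypothetical code): without the reduction clause below, the concurrent updates of total would be a data race, and the compiler would not warn about it inside an OpenMP region.

  subroutine sum_array(a, n, total)
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: a(n)
    real, intent(out)   :: total
    integer :: i
    total = 0.0
  !$omp parallel do reduction(+:total)
    do i = 1, n
       total = total + a(i)     ! safe only because of the reduction clause
    end do
  !$omp end parallel do
  end subroutine sum_array
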
Web references for learning about OpenMP are easy to find, including reviews of the textbook "Using OpenMP," by Chapman, Jost, van der Pas. References about compiler auto-parallelization are weak.

Quoting - tim18

That's all I needed to hear! Thank you for taking the time to help me.

Hi,
I'd like to comment/ask/expand on this issue. Somewhere else on this forum, I read that when the PARDISO solver is used in `parallel mode' (which requires -openmp) combined with -parallel, performance might actually go down, because there could be conflicting attempts to parallelize by -parallel vs. -openmp. Is this true?

And here comes an additional question: as I mentioned, I need -openmp for PARDISO to run in parallel. At the same time, I'm trying to rewrite the bulk of a fairly large code to avoid loops as much as possible, replacing them wherever possible with FORALL statements and algebraic expressions on whole vectors/matrices/arrays, such as dot_product and matmul. This is suggested as `parallel thinking' in the Numerical Recipes book, because the manipulations can be done in any order, rather than in the fixed order of `time-oriented' do-loops.
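For example, this is the sort of rewrite I mean (just a sketch):

  program array_style
    implicit none
    integer, parameter :: n = 300
    real :: A(n,n), x(n), y(n), z(n), s
    integer :: i
    call random_number(A)
    call random_number(x)
    ! instead of nested do-loops:
    y = matmul(A, x)                        ! matrix-vector product
    s = dot_product(x, y)                   ! scalar product
    forall (i = 1:n) z(i) = x(i)**2 + y(i)  ! order-independent assignment
    print *, s, sum(z)
  end program array_style
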

My question is: when I get rid of all the explicit do-loops, how do I get any idea of how well the parallelization works? More specifically, will the -parallel flag work well for this? Does anything get optimized with just the -openmp library? Should I use the OMP directives on simple expressions like a = b + c, when a, b, and c are all conformable arrays?

I would appreciate it if anyone can shine some more light on this.
Thanks,
--J

Quoting - moortgatgmail.com

In this case, you can use the Intel VTune tool to identify the hotspots that consume the most CPU time, and then try to parallelize that code using OpenMP or other threading methods. Hope it helps.

Thanks,
Wise

As I mentioned above, in my experience -parallel works only on code regions which aren't already parallelized by OpenMP, so there shouldn't be any conflict. You can use -openmp simply to instruct the compiler to link the threading libraries required by MKL; it has no significant impact where you have no OpenMP directives.
Your comments about Fortran have already been dealt with to some extent on the two Fortran forum sections, but I'll take the risk of commenting here.
I don't own a recent enough version of Numerical Recipes to see its advice about FORALL. If you're interested in that subject, you'll find more expert advice in the comp.lang.fortran newsgroup archives. FORALL was fashionable enough 15 years ago that it had to be added to Fortran to avoid a threatened schism, yet most compilers still can't optimize it as well as the equivalent DO loops, and if you read the standard carefully there's more to FORALL than appears on the surface. OpenMP's WORKSHARE directive allows parallelization of individual instances of these Fortran constructs, but few compilers implement it as more than an OpenMP SINGLE region. Improvements are hoped for in new versions of ifort this year.
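For reference, a WORKSHARE region looks like this (a sketch with made-up names; as noted above, current compilers typically execute the body as if it were a SINGLE region, i.e. on one thread):

  subroutine saxpy_ws(a, x, y, n)
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: a, x(n)
    real, intent(inout) :: y(n)
  !$omp parallel
  !$omp workshare
    y = y + a * x          ! array assignment, divided among threads in principle
  !$omp end workshare
  !$omp end parallel
  end subroutine saxpy_ws
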
WORKSHARE is not so relevant for C or C++.
Compilers should prefer vectorization of a single-level loop, reserving threaded parallelism for an outer loop, except where it is clear that the single-level loop is time-consuming enough to benefit from combined vectorization and threading (e.g. length > 5000). Intel compilers make a default assumption that a loop of unknown maximum length should be optimized for length 100.
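A typical shape for that (a sketch, hypothetical names) is to thread the outer loop and leave the contiguous inner loop to the vectorizer:

  subroutine smooth(a, b, n, m)
    implicit none
    integer, intent(in) :: n, m
    real, intent(in)    :: a(n,m)
    real, intent(out)   :: b(n,m)
    integer :: i, j
  !$omp parallel do private(i)
    do j = 2, m - 1                ! outer loop: one chunk of columns per thread
       do i = 1, n                 ! inner loop: unit stride, vectorizable
          b(i,j) = (a(i,j-1) + a(i,j) + a(i,j+1)) / 3.0
       end do
    end do
  !$omp end parallel do
  end subroutine smooth
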
Fortran dot_product() and its partial C++ equivalent std::inner_product() are quite useful constructs within a single thread. Currently maintained compilers have learned to "vectorize" them quite well, by parallel sum reduction.
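For example (a sketch), the intrinsic and the explicit loop below are equivalent, and the compiler vectorizes either one by accumulating partial sums in the vector registers:

  function dotp(x, y, n) result(s)
    implicit none
    integer, intent(in) :: n
    real, intent(in)    :: x(n), y(n)
    real :: s
    s = dot_product(x, y)
    ! equivalent explicit form, vectorized the same way:
    !   s = 0.0
    !   do i = 1, n
    !      s = s + x(i) * y(i)
    !   end do
  end function dotp
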
Likewise, compiler auto-vectorization of Fortran MATMUL() has reached a satisfactory state in recent compilers, but implementation of OpenMP WORKSHARE is still in the development stages and may not become competitive for matrix operations with libraries such as MKL BLAS in the foreseeable future. For code examples such as those presented in a textbook, standard language constructs, where they fit, are definitely preferable to lower-level expansions. gfortran offers options to implement MATMUL automatically as BLAS library calls; this has the disadvantage that it frequently implies an extra dynamically allocated temporary, which might be avoided by a direct BLAS library call or by BLAS wrappers such as CBLAS or BLAS95. But any standard version of an array or matrix operation is preferable to a local, non-modular reinvention.
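For comparison, the direct BLAS call mentioned above looks like this (a sketch with made-up names; it assumes MKL or another BLAS library is linked and double precision data):

  subroutine multiply(A, B, C, n)
    implicit none
    integer, intent(in)           :: n
    double precision, intent(in)  :: A(n,n), B(n,n)
    double precision, intent(out) :: C(n,n)
    ! intrinsic version (compiler-vectorized, may create a temporary):
    !   C = matmul(A, B)
    ! direct BLAS version: C = 1.0*A*B + 0.0*C, no extra temporary
    call dgemm('N', 'N', n, n, n, 1.0d0, A, n, B, n, 0.0d0, C, n)
  end subroutine multiply
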
