OpenMP and the Intel compiler

Hi Everybody,

Recently I've started working on optimizing a program with OpenMP to take advantage of HT technology on P4 processors. Besides the difficulties inherent in parallelizing software, I've been trying to solve and understand some basic issues about OpenMP and the Intel compiler, but I still have the following questions.

1) I've tried to find some OpenMP C++ samples on the web. I've only found about five; does anybody know where I could find more? Nearly all the samples I've found are Fortran-specific.

2) Does it make any sense to use register variables to optimize OpenMP code? Can they be used?

3) I've read in some docs about optimizing Pentium IV code that one should "avoid mixing code & data. How? Pad 1024 bytes apart (one cache line)". What does this mean? Although I've been programming C++ applications for a good while now, I've never seen anything like this. How can it be done? Is it specific to function calls? Global variables? Parameters? Are there any samples?

4) I read in another post that OpenMP programs work faster with the -Od option... Is this so? Does the -O3 option work?

Thanks to everyone. Greetings,

1. Part of the problem is that many of the data parallel applications, where OpenMP has always been useful, are written in Fortran. The examples should be equally applicable to C; the OpenMP pragmas are directly equivalent. Many of the bigger OpenMP applications are written with multiple languages.
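
To see how directly those Fortran samples translate, here is a minimal C++ sketch of the pattern most of them show, a parallel loop with a reduction (my own illustration, not one of the samples mentioned above):

// Minimal OpenMP C++ sketch: a parallel loop with a reduction,
// the pattern most Fortran examples demonstrate.
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Iterations are split across threads; each thread accumulates a
    // private partial sum, combined when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];

    std::printf("dot product = %f using up to %d threads\n",
                sum, omp_get_max_threads());
    return 0;
}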

2. Using OpenMP doesn't affect the situation with register variables. Using the register qualifier is generally a waste of time with modern compilers. You do need to use auto variables and give the compiler full scope to registerize as it chooses.
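
For instance (my own sketch, not from this thread): declaring work variables inside the parallel loop makes them private to each thread automatically and leaves the compiler free to keep them in registers, with no register keyword needed.

// Sketch: locals declared inside the loop body are private to each
// thread and are freely registerized by the compiler.
#include <omp.h>

void scale(double *x, int n, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        double tmp = x[i] * factor;  // auto local; no 'register' needed
        x[i] = tmp;
    }
}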

3. In a normal C or C++ program, the compiler takes care of any problems with mixing code and data. You may see advantages in aligning local arrays on 16-byte boundaries, particularly where vectorization is possible, e.g. using __declspec(align(16)) or the _mm_malloc() family; the compiler implements those by padding. Beyond this, I think you are talking mostly about issues that belong on the Threading forum.

The big issue with padding comes in avoiding 64K aliasing of the thread stacks. All but the latest P4 family processors have this problem, and Windows hits it by starting the thread stacks 1 MB apart. You correct it by padding each thread's stack by a different amount, as you suggested. Intel OpenMP ought to take care of that automatically.

False sharing of data arrays between threads is a similar issue. You don't want one thread writing into the same pair of 64-byte cache lines that another thread is reading, or into the same cache line that another thread is writing. With 64K aliasing, the problem can also occur when the cache lines are separated by a multiple of 64K. These cache-line sharing problems cause thrashing of the read/write combine buffers. You have to look at your data layout and thread scheduling to see that you avoid forcing such problems, but they can also occur by accident. When they affect performance, you should be able to locate them with profilers such as VTune.
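
As an illustrative sketch only (the sizes, struct name, and thread cap below are my assumptions, not from this thread): padding per-thread accumulators out to a 128-byte block keeps each thread's writes off the pair of 64-byte cache lines any other thread touches, and _mm_malloc() gives 16-byte-aligned storage for vectorizable loops.

// Sketch: avoid false sharing by padding per-thread data, and use
// _mm_malloc() for 16-byte-aligned storage where vectorization helps.
// Assumes 64-byte cache lines fetched in 128-byte pairs and at most
// 64 threads; adjust for your target.
#include <omp.h>
#include <xmmintrin.h>   // _mm_malloc / _mm_free
#include <cstdio>

struct PaddedSum {
    double value;
    char pad[128 - sizeof(double)];  // keep each slot in its own line pair
};

int main()
{
    const int n = 1 << 20;
    float *a = static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
    for (int i = 0; i < n; ++i)
        a[i] = 1.0f;

    PaddedSum sums[64] = {};  // one slot per thread; written fields never share a line

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; ++i)
            sums[tid].value += a[i];  // each thread writes only its own padded slot
    }

    double total = 0.0;
    for (int t = 0; t < omp_get_max_threads(); ++t)
        total += sums[t].value;
    std::printf("total = %f\n", total);

    _mm_free(a);
    return 0;
}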

4. If someone had a situation where /Od was faster, they should be looking for problems such as those mentioned above. In many applications, it would be difficult for OpenMP to make up for performance lost by cutting back to /Od. I doubt that /O3 would have any particular connection with OpenMP. If your program needs those additional optimizations, they should work the same with OpenMP.
