OpenMP performance comparisons and concerns about MP overhead

Dear Intel and icc users:

I was looking for a way to parallelize a perfectly parallelizable simulation program. It calculates some big data structures and then runs 22 independent jobs of varying workload based on these data, which are treated as constants.

I learned a bit of boost::threads, but I am glad that I did not dive into it before I found OpenMP. OpenMP is so easy to use that I basically only added one line,
#pragma omp parallel for
to my program and an '-openmp' option to the makefile.
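
For reference, here is a minimal sketch of the kind of change I mean (the function and variable names are placeholders, not my actual code):

#include <cstdio>

// Placeholder for one of the 22 independent, variable-length jobs.
void runJob(int job) {
    std::printf("job %d done\n", job);
}

int main() {
    const int nJobs = 22;
    #pragma omp parallel for        // iterations are independent, so they can run in parallel
    for (int job = 0; job < nJobs; ++job) {
        runJob(job);
    }
    return 0;
}

I did some performance comparison tests and here is the result: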

System: dual Xeon 2.8 GHz with Hyper-Threading on, 2 GB memory, Red Hat Linux 9 with a 2.4.25 kernel, icc 8.0.055. The OS recognizes four CPUs, and one full-speed process shows 25% CPU usage in the system monitor. The make options are
-O3 -parallel -openmp -tpp7 -xN -fno-exceptions
with or without -openmp

1. without openmp, 1 full-speed process, 25% CPU usage.
real 7m 45s
user 7m 41s

2. starting four non-openmp jobs, each given roughly 1/4 of the workload. 100% CPU usage.
real 2m32s 3m7s 3m34s 4m8s
user 2m27s 3m2s 3m28s 4m3s

so actually: real 4m8s, user 13m
real time could be better (ideally 13m/4 ≈ 3m15s) if I equalized the workloads of the jobs.

3. with openmp, 1 thread, dynamic or static schedule, chunk size 1 or 22. 25% CPU usage.

real 10m 11s
user 10m 61s

4. with openmp, 2 threads, 50% CPU usage (2 full-speed processes), chunk size 1 or 2.
real 5m 56s
user 11m 25s

5. with openmp, 4/8/22 threads, chunk size 1. 100% CPU usage.
real 4m 48s
user 17m 36s

If I understand correctly, real time is the time I wait and user time is the sum of the time spent on all CPUs. I am concerned about the overall user time. Several things are interesting:

1. the difference between cases 1 and 3
2. between case 2 and cases 4 or 5
3. between cases 4 and 5

All of these differences indicate significant OpenMP overhead. Is this expected? I expected similar results for 1 vs. 3 and for 2 vs. 5, but the differences are quite big.

I tried to find some explanations for this. According to http://www.openmp.org/presentations/sc99/sc99_tutorial.pdf, the usual causes of OpenMP overhead are:

1. parallel startup cost
I 'firstprivate' all the global variables. I also tried 'shared' and the results are almost identical. (Question: is there any need for a private copy of constant global objects? If there are no write operations, there should be no risk of race conditions, right? See the sketch after this list.)

2. small loops
I only OpenMP-ed one big loop.

3. additional parallel code
What is that?

4. over-optimized inner loops
How can I tell?

5. less optimization for parallel code
I compared the "LOOP WAS VECTORIZED" lines between the non-OpenMP and OpenMP compilation outputs; they are identical except for the MP part.

6. load imbalance
I use dynamic scheduling (schedule(dynamic), as in the sketch after this list), and this is exactly why I like OpenMP better than several independent processes. However, the latter is the performance king.

7. synchronized sections
Not much synchronization is involved.

8. non-stride-1 references
What is that?

9. many shared references
As far as I can tell, by using firstprivate, every thread has local access to everything.

10. low cache affinity
What is this?
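
To make the firstprivate/shared and dynamic-schedule points above concrete, here is how I understand the clauses would be combined (a sketch with illustrative names, not my actual code). My understanding is that data that are only read inside the loop can simply be shared, which avoids the per-thread copies that firstprivate makes at parallel startup:

#include <cstddef>
#include <vector>

std::vector<double> bigTable;        // stands in for the big read-only global structures
std::vector<double> results(22);     // one output slot per job

void runAll() {
    const int nJobs = 22;
    // bigTable is only read inside the loop, so shared(bigTable) should be race-free and
    // avoids the per-thread copy that firstprivate(bigTable) would make at parallel startup.
    #pragma omp parallel for schedule(dynamic, 1) shared(bigTable, results)
    for (int job = 0; job < nJobs; ++job) {
        double sum = 0.0;                                // local to each iteration
        for (std::size_t i = 0; i < bigTable.size(); ++i)
            sum += bigTable[i];                          // read-only access to the shared data
        results[job] = sum;                              // each iteration writes only its own element
    }
}

If that understanding is right, shared and firstprivate should behave the same for constant data, which would match the nearly identical timings I saw.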

Sorry for such a long email. I would appreciate any comments, suggestions and
of course answers to my questions.

Many thanks in advance.


8. The biggest problem caused by non-stride-1 references is the inability to vectorize effectively. There are also possible consequences in less effective use of read and write-combining buffers. The compiler option -vec_report3 would give a comment along the lines of vectorization not appearing effective when it declines to vectorize because it does not see stride 1. If your code has variable stride, but stride 1 is a common case, it could be worthwhile to make a special version of the loop with stride 1 built in. Getting better vectorization should improve performance overall, but it is likely to reduce the advantage gained from HT parallelism.
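
To illustrate (a sketch of my own, not your code): a stride-1 loop touches consecutive elements, which the vectorizer handles well, while a strided loop skips through memory. When stride 1 is the common case, dispatching to a specialized unit-stride version keeps that path vectorized:

// Stride-1: consecutive elements, straightforward to vectorize.
void scale_unit(float* x, int n, float a) {
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}

// General stride: the compiler cannot assume unit stride and may not vectorize effectively.
void scale_strided(float* x, int n, int stride, float a) {
    for (int i = 0; i < n; i += stride)
        x[i] *= a;
}

// If stride == 1 is the common case, a special-cased call keeps the fast path vectorized.
void scale(float* x, int n, int stride, float a) {
    if (stride == 1)
        scale_unit(x, n, a);
    else
        scale_strided(x, n, stride, a);
}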

10. It looks like you are seeing this problem. If the scheduler keeps moving your threads from one physical CPU to the other, you lose the advantage of having data in cache. With Red Hat schedulers, this seems to be much worse when you don't have enough threads to keep all logical CPUs busy. If 2 threads want to share the same cache, you would want a way to put them on the same physical CPU. I think this is an advantage of certain MPI implementations; Windows threading has CPU affinity support. You would still need to fix false-sharing issues caused by one logical CPU writing to a cache line needed by the other.
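
For what it is worth, on Linux systems whose kernel and glibc expose sched_setaffinity (which a stock 2.4 setup may not; treat this as a sketch of the idea rather than something guaranteed to build on your machine), pinning the calling thread to one logical CPU looks roughly like this:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1          // needed for cpu_set_t, CPU_SET, sched_setaffinity in <sched.h>
#endif
#include <sched.h>
#include <cstdio>

// Pin the calling process/thread to one logical CPU (0-based index).
// The exact prototype of sched_setaffinity has varied across glibc versions.
int pinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Inside an OpenMP parallel region you could call something like this once per thread, keyed off omp_get_thread_num(), although which logical CPU numbers map to which physical package is BIOS/chipset dependent.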

Many of your questions look like they belong to the Threading forum. OpenMP is on topic there.

Thank you very much for your reply.

low cache affinity: This might well explain the difference between 4 (2 threads) and 5 (4/8/22 threads). With more threads, the CPUs might spend more time moving data in and out of cache. The situation might be worsened by Hyper-Threading, since 2 logical CPUs share the same cache. I have actually heard that it is better to turn off HT during high-intensity computation. I will turn off HT and run my tests again.

However, this cannot explain the difference between 1 (non-MP) and 3 (MP with 1 thread). I tried -vec_report3 but was confused by the amount of output. I have heard that some OpenMP-capable compilers can emit the transformed source code after the MP pass for inspection. Is there an option/trick to make icc do this? For example, I would know what is wrong if I saw lots of lock/unlock calls going on.

BTW, I could not find any threading forum.

Yes, the compiler should permit you to save the pre-processed source code, for example with -E.

Here is the URL I see for the threading forum:
http://softwareforums.intel.com/ids/board?board.id=42

It's always been open or blocked at the same times as the others.
