Dear Intel and ICC users:
I was looking for a way to parallelize a perfectly parallelizable simulation program. It computes some big data structures and then runs 22 independent jobs of varying workloads based on this data, which is treated as constant.
I learned a bit of boost::threads, but I am glad I had not dived into it before I found OpenMP. OpenMP is so easy to use that I basically only added one line,
#pragma omp parallel for
to my program and an '-openmp' option to the makefile.
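Roughly, my program has the structure sketched below. All names and the dummy workload are placeholders rather than my real code, and the schedule clause is the one I vary in the tests:

#include <cstddef>
#include <cstdio>
#include <vector>

static std::vector<double> globals;      // stands in for the big constant structures

static void build_globals() {            // serial setup phase
    globals.assign(1000000, 1.0);
}

static double run_job(int job) {         // one of the 22 independent jobs;
    double sum = 0.0;                    // it only reads 'globals', never writes
    for (std::size_t k = 0; k < globals.size(); ++k)
        sum += globals[k] * (job + 1);
    return sum;
}

int main() {
    build_globals();

    double results[22];
    // the one line I added; compiled with -openmp
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 22; ++i)
        results[i] = run_job(i);

    for (int i = 0; i < 22; ++i)
        std::printf("job %d -> %g\n", i, results[i]);
    return 0;
}

I did some performance comparison tests and here are the results: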
System: dual Xeon 2.8 GHz with Hyper-Threading on, 2 GB memory, Red Hat Linux 9 with a 2.4.25 kernel, icc 8.0.055. The OS recognizes four CPUs, and one full-speed process shows 25% CPU usage in the system monitor. The make options are
-O3 -parallel -tpp7 -xN -fno-exceptions
with or without -openmp
1. Without OpenMP: 1 full-speed process, 25% CPU usage.
real 7m 45s
user 7m 41s
2. Four non-OpenMP jobs started in parallel, each given roughly 1/4 of the workload. 100% CPU usage.
real 2m32s 3m7s 3m34s 4m8s
user 2m27s 3m2s 3m28s 4m3s
So effectively: real 4m8s, user 13m.
Real time could be better (ideally 13m/4 ≈ 3m15s) if I equalized the workloads of the jobs.
3. With OpenMP, 1 thread, dynamic or static schedule, chunk size 1 or 22. 25% CPU usage.
real 10m 11s
user 10m 61s
4. With OpenMP, 2 threads, chunk size 1 or 2. 50% CPU usage (2 full-speed processes).
real 5m 56s
user 11m 25s
5. With OpenMP, 4/8/22 threads, chunk size 1. 100% CPU usage.
real 4m 48s
user 17m 36s
If I understand correctly, real time is the wall-clock time I wait, and user time is the sum of the time spent on all CPUs. I am mainly concerned with the overall user time. Several things are interesting:
1. the difference between cases 1 and 3,
2. between case 2 and cases 4 or 5,
3. between cases 4 and 5.
All of these differences indicate significant OpenMP overhead. Is this expected? I expected similar results for 1 vs. 3 and for 2 vs. 5, but the differences are quite big: user time grows from 7m41s to about 11m between cases 1 and 3, and from about 13m to 17m36s between cases 2 and 5.
I tried to find some explanations for this. According to http://www.openmp.org/presentations/sc99/sc99_tutorial.pdf, the usual causes of OpenMP overhead are:
1. Parallel startup cost
I 'firstprivate' all the global variables. I also tried 'shared' and the results are almost identical. (Question: is there any need for a private copy of constant global objects? If there are no write operations, there should be no risk of race conditions, right? See the first sketch after this list for what I mean.)
2. Small loops
I only OpenMP-ed one big loop.
3. Additional parallel code
What does this refer to?
4. Over-optimized inner loops
How can I tell?
5. Less optimization for parallel code
I compared the "LOOP WAS VECTORIZED" lines between the non-OpenMP and OpenMP compilation outputs; they are identical except for the MP part.
6. Load imbalance
I use a dynamic schedule (as in the first sketch below), and this is exactly why I like OpenMP better than several independent processes. Yet the latter is still the performance king.
7. Synchronized sections
Not much synchronization is involved.
8. Non-stride-1 references
What are these? My guess is sketched second after this list; please correct me if I am wrong.
9. Many shared references
As far as I can tell, by using firstprivate every thread has local access to everything (again, see the first sketch).
10. Low cache affinity
What is this?
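To make items 1, 6, and 9 concrete, here is a sketch of the two clause combinations I compared; 'table' and the dummy loop body are placeholders for my real data and jobs:

#include <cstddef>
#include <cstdio>
#include <vector>

static std::vector<double> table(1000000, 2.0);   // read-only during the loop

int main() {
    double out[22];

    // What I do now: 'firstprivate' gives every thread its own copy of
    // 'table' (the copy itself costs time and memory when the region starts).
    #pragma omp parallel for schedule(dynamic, 1) firstprivate(table)
    for (int i = 0; i < 22; ++i) {
        double s = 0.0;
        for (std::size_t k = 0; k < table.size(); ++k) s += table[k];
        out[i] = s + i;
    }

    // The alternative I also tried: 'shared' (the default for globals).
    // No copies are made; this should be safe as long as no thread writes
    // to 'table' inside the loop, which I believe is my case.
    #pragma omp parallel for schedule(dynamic, 1) shared(table)
    for (int i = 0; i < 22; ++i) {
        double s = 0.0;
        for (std::size_t k = 0; k < table.size(); ++k) s += table[k];
        out[i] = s + i;
    }

    std::printf("%g\n", out[0]);
    return 0;
}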
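And my guess for item 8 (please correct me if it is wrong) is that "non-stride-1" means the inner loop does not touch consecutive memory addresses, e.g. walking a 2-D array down its columns instead of along its rows:

#include <cstdio>

enum { N = 1024 };
static double a[N][N];

int main() {
    double s1 = 0.0, s2 = 0.0;

    // Stride-1: the inner loop walks consecutive memory addresses
    // (row-major in C/C++), which vectorizes and caches well.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            s1 += a[i][j];

    // Non-stride-1: the inner loop jumps N*sizeof(double) bytes per step,
    // so each access may miss the cache and the loop is hard to vectorize.
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            s2 += a[i][j];

    std::printf("%g %g\n", s1, s2);
    return 0;
}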
Sorry for such a long email. I would appreciate any comments, suggestions, and of course answers to my questions.
Many thanks in advance.