OpenMP & low speed-up

bigjim33in:

I have a question on parallel computing. I'm working with a quad Xeon 550 MHz machine running Debian Linux, and I use ifc with OpenMP directives. The program I built is essentially one big loop, so parallelization should help a lot, but the results are not so good. When I compare execution time with and without the OpenMP directives, I get a speedup of only about 1.8x: 100 seconds instead of 180, which is far below the ideal 4x on four processors. Why? Please comment on these possible causes (or suggest others!); a simplified sketch of the loop and how I time it follows the list:
- huge thread management overhead
- when I don't use OpenMP directives, more than one processor is still used, so I don't compare 4 vs 1 but 4 vs ?
- memory bandwidth saturation: the same RAM is shared by all four processors
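For reference, the structure is roughly like this (a heavily simplified sketch; the array names and the work inside the loop are just placeholders for my real code, and I time the serial build with the shell's time command instead):

program big_loop
   use omp_lib
   implicit none
   integer, parameter :: n = 10000000
   real, allocatable :: x(:), y(:)
   double precision :: t0, t1
   integer :: i

   allocate(x(n), y(n))
   x = 1.0
   t0 = omp_get_wtime()
!$omp parallel do private(i)
   do i = 1, n
      y(i) = sqrt(x(i)) + 0.5*x(i)   ! placeholder for the real per-element work
   end do
   t1 = omp_get_wtime()
   print *, 'elapsed seconds:', t1 - t0, '  y(n) =', y(n)
end program big_loop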

Thank you in advance,
Marco

Tim Prince:

As my previous reply appears not to have made it on c.l.f, bear with me if I appear to be repeating myself. The cache consistency logic on the usual P-III Xeon 4-way was quite slow in dealing with L2 cache misses, so your application would need a very high L2 cache hit rate to perform well. This is one reason why the 4-way boxes often had the larger L2 cache chips. The 4-way Xeon could easily prove slower than an equivalent 2-way, even if the 4-way had more cache per chip, for this type of application. False sharing, where a thread writes to a cache line which another thread is using, would kill any possibility of parallel speedup. If you could optimize the mapping of threads to physical processors, so that any cache sharing occurs between physical pairs, you might find some improvement.
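To make the false-sharing point concrete, here is an illustrative pattern (not taken from the original program; the partial-sum array is a made-up example): adjacent elements of a small shared array are updated by different threads, so the threads keep bouncing the same cache line between processors even though no element is logically shared.

program false_sharing_demo
   use omp_lib
   implicit none
   integer, parameter :: n = 5000000
   real, allocatable :: x(:), psum(:)
   integer :: i, me

   allocate(x(n), psum(omp_get_max_threads()))
   x = 1.0
   psum = 0.0
!$omp parallel private(me)
   me = omp_get_thread_num() + 1
!$omp do private(i)
   do i = 1, n
      ! adjacent psum elements sit on one cache line, so every update by
      ! one thread invalidates that line in the other processors' caches
      psum(me) = psum(me) + x(i)
   end do
!$omp end parallel
   print *, sum(psum)
end program false_sharing_demo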

bigjim33in:

Thanks.

Martyn Corden (Intel):

Without knowing more about your application (and what fraction of it is parallel), we're only guessing. But you might explore varying the OpenMP scheduling options, using the environment variable OMP_SCHEDULE.
If consecutive loop iterations get scheduled on different processors, you might be more likely to get false sharing than if large chunks of iterations get scheduled on each processor.
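For example (an illustrative sketch, not your code; the loop body is a placeholder), compiling the loop with schedule(runtime) lets you experiment with the schedule and chunk size through OMP_SCHEDULE at run time, without recompiling:

program schedule_demo
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable :: x(:), y(:)
   integer :: i

   allocate(x(n), y(n))
   x = 1.0
   ! schedule(runtime) takes the schedule from the OMP_SCHEDULE environment
   ! variable: e.g. OMP_SCHEDULE="static,10000" gives each thread large
   ! contiguous chunks, while OMP_SCHEDULE="static,1" interleaves
   ! consecutive iterations across processors
!$omp parallel do schedule(runtime) private(i)
   do i = 1, n
      y(i) = 2.0*x(i) + 1.0   ! placeholder loop body
   end do
   print *, y(n)
end program schedule_demo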

Martyn

alex3650:

I'm using the MIPSpro compiler on an SGI. I don't know the performance of the ifc compiler, but this speedup is too low; there is probably a problem with your program. First of all, when you run your program, you see the executable four times, right? All of them should consume about 100% of CPU time. If they use less than that, it might be due to synchronization and thread management overhead (180 s is very short, and overhead can easily become important). Consider extending the parallel region to the entire code and avoiding unnecessary barriers (example: omp end do -> omp end do nowait), as in the sketch below.
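Something along these lines (an illustrative sketch with made-up arrays, not your actual loops): both loops live inside one parallel region, and because the second loop does not depend on the first loop's output, the barrier after the first loop can be dropped with nowait.

program nowait_demo
   implicit none
   integer, parameter :: n = 100000
   real, allocatable :: a(:), b(:), c(:)
   integer :: i

   allocate(a(n), b(n), c(n))
   a = 1.0
   ! one parallel region encloses both loops; b and c are written
   ! independently, so the barrier after the first loop is not needed
!$omp parallel private(i)
!$omp do
   do i = 1, n
      b(i) = 2.0*a(i)
   end do
!$omp end do nowait
!$omp do
   do i = 1, n
      c(i) = a(i) + 1.0
   end do
!$omp end parallel
   print *, b(n), c(n)
end program nowait_demo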

> when I don't use OpenMP directives, more than one processor is still used, so I don't compare 4 vs 1 but 4 vs ?

A program without OpenMP (or MPI, PVM) cannot use more than one CPU.

> memory bandwidth saturation: the same RAM is shared by all four processors

This is also a possible cause: all four processors share the same memory bus. Try a simple, well-behaved test program and measure how it scales, something like one explicit step of a diffusion equation:

program diffusion_test
   implicit none
   real :: A(0:1001,0:1001), B(1000,1000)
   integer :: i, j

   A = 0.0
   A(500,500) = 1.0

   ! one explicit step of a 5-point diffusion stencil
!$omp parallel do private(i,j)
   do j = 1, 1000
      do i = 1, 1000
         B(i,j) = A(i,j) + &
            0.001*(A(i+1,j) + A(i-1,j) - 4*A(i,j) + A(i,j+1) + A(i,j-1))
      end do
   end do

   ! copy the result back for the next step
!$omp parallel do private(i,j)
   do j = 1, 1000
      do i = 1, 1000
         A(i,j) = B(i,j)
      end do
   end do

   print *, A(500,500)
end program diffusion_test

This example should scale very well.

Regards,
Alex
