Tony_the_D
October 10, 2008 9:48 AM PDT
Multi-core example with OpenMP slower than single core?

I am testing OpenMP as a way to take advantage of multi-core processors.  An example found on the web that computes e and pi works and runs faster than its single-core equivalent.

But the code snippet below (from a matrix inversion routine) appears to work and shows 100% CPU usage on both processors, yet it takes longer to run than the single-core equivalent.

!$omp parallel sections shared(a,f)
!$omp section
     do 50 k=ip1,mp
        a(j,k)=a(j,k)+f*a(i,k)
50  continue
!$omp end parallel sections

Question:  Has anyone else experienced cases where the dual core processing runs slower than the single core equivalent?  Any ideas why this is so?

I also tried using OpenMP to speed up a Quicksort algorithm and got the same result: the single-core version is faster than the dual-core one.

 

Steve Nuchia
October 11, 2008 9:31 PM PDT
#1
There is a lot of overhead in OpenMP.  It will run slower unless the loop takes a significant fraction of a second single-threaded.  Tens of milliseconds, minimum.
Also, if the task is memory-bandwidth bound rather than compute or cache bound, it will run no faster parallelized, regardless of the API used.  At least on most single-socket hardware.  You have to know your system architecture here.

 



tim18
October 11, 2008 10:42 PM PDT
#2 Reply to #1

In the last example posted in this thread, I can't imagine why parallel sections would be used, rather than parallel do, nor why the inner loop would be designated for OpenMP parallel.  If threaded parallelism is required without any thought given to optimization, /Qparallel would be preferable, even though still not often effective.

As to the minimum problem size for effective OpenMP parallel, I have an example which achieves excellent threaded scaling on Core 2 Duo, when the non-threaded version takes only 1 millisecond.   Of course, this is an ideal case; the cache sharing is effective, as are the persistent threads left from a previous parallel region.  The Intel OpenMP run-time does show a reduced overhead, compared with the Microsoft and gnu libraries.

The basic point, that OpenMP parallelism will not have an advantage for a simple inner loop of length 1000, does apply to the posted case.
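
For illustration, the parallel do suggestion might look something like the sketch below. The posted snippet only shows the inner k loop, so everything around it is an assumption: the enclosing row loop over j, the way f is computed, the sizes n and mp, and the single pivot step i = 1 are all made up for the example. A parallel sections construct with only one section gives all of the work to a single thread, and the other thread may simply spin at the closing barrier, which would be one possible explanation for seeing 100% CPU with no speedup. Here the directive is moved to the assumed outer loop, so each thread updates whole rows and there is one parallel region per pivot step instead of one per row.

```fortran
program row_update_sketch
  implicit none
  integer, parameter :: n  = 400      ! hypothetical matrix order
  integer, parameter :: mp = 2*n      ! hypothetical column count
  double precision, allocatable :: a(:,:), ref(:,:)
  double precision :: f
  integer :: i, j, k, ip1

  allocate(a(n,mp), ref(n,mp))
  call random_number(a)
  do j = 1, n
     a(j,j) = a(j,j) + dble(n)        ! keep the pivots well away from zero
  end do
  ref = a

  i   = 1                             ! one pivot step only, for illustration
  ip1 = i + 1

  ! Serial reference: eliminate column i from every row below the pivot.
  do j = ip1, n
     f = -ref(j,i)/ref(i,i)
     do k = ip1, mp
        ref(j,k) = ref(j,k) + f*ref(i,k)
     end do
  end do

  ! Same update with the rows shared among threads. f and k must be
  ! private, otherwise the threads overwrite each other's values.
  !$omp parallel do private(j,k,f) shared(a,i,ip1)
  do j = ip1, n
     f = -a(j,i)/a(i,i)
     do k = ip1, mp
        a(j,k) = a(j,k) + f*a(i,k)
     end do
  end do
  !$omp end parallel do

  ! Self-check: the parallel result should agree with the serial one.
  if (maxval(abs(a - ref)) > 1.0d-12) error stop "parallel result differs"
  print *, "parallel row update matches the serial result"
end program row_update_sketch
```

The rows are independent within one pivot step because row i and column i are only read, never written, inside the loop. Whether this actually runs faster still depends on the matrix size, as discussed above. Note also that with Fortran's column-major storage the k loop walks a(j,k) with a stride of n elements, which may limit the benefit regardless of threading.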



Tony_the_D
October 13, 2008 1:00 PM PDT
#3 Reply to #1
Quoting - Steve Nuchia

There is a lot of overhead in OpenMP.  It will run slower unless the loop takes a significant fraction of a second single-threaded.  Tens of milliseconds, minimum.
Also, if the task is memory-bandwidth bound rather than compute or cache bound, it will run no faster parallelized, regardless of the API used.  At least on most single-socket hardware.  You have to know your system architecture here.

 

Thanks for the feedback.  Your comments provide me something to work on.  I can easily construct a test case where I gradually increase the time of the inner loops so that I can evaluate the effect of the overhead.  Knowing about the overhead will also provide a better base for evaluating other parts of our software which could be parallelized.
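
A test case along those lines could be sketched roughly as follows. This is only an assumed setup, not code from the original routine: the array names x and y, the factor f, the repeat count reps, and the loop lengths are all invented for the example. It times the same simple update loop with and without a parallel do directive while the loop length grows by a factor of ten each pass, using omp_get_wtime from the standard omp_lib module.

```fortran
program overhead_sweep
  use omp_lib                          ! provides omp_get_wtime
  implicit none
  integer, parameter :: nmax = 1000000 ! hypothetical largest loop length
  integer, parameter :: reps = 200     ! repeats per length, to average timings
  double precision, allocatable :: x(:), y(:)
  double precision :: f, t0, t_serial, t_par, check
  integer :: n, r, k

  allocate(x(nmax), y(nmax))
  call random_number(x)
  f = 1.5d0

  print '(a10,3a14)', 'n', 'serial (s)', 'parallel (s)', 'checksum'
  n = 100
  do while (n <= nmax)
     ! Serial timing of one pass over the loop, averaged over reps.
     y(1:n) = 0.0d0
     t0 = omp_get_wtime()
     do r = 1, reps
        do k = 1, n
           y(k) = y(k) + f*x(k)
        end do
     end do
     t_serial = (omp_get_wtime() - t0)/reps
     check = sum(y(1:n))               ! use the result so the loop is not removed

     ! Same loop, but a parallel region is entered on every pass,
     ! so the per-region overhead is included in the measurement.
     y(1:n) = 0.0d0
     t0 = omp_get_wtime()
     do r = 1, reps
        !$omp parallel do shared(x,y,f,n)
        do k = 1, n
           y(k) = y(k) + f*x(k)
        end do
        !$omp end parallel do
     end do
     t_par = (omp_get_wtime() - t0)/reps
     check = check + sum(y(1:n))

     print '(i10,3es14.4)', n, t_serial, t_par, check
     n = n*10
  end do
end program overhead_sweep
```

It has to be built with OpenMP enabled (for example -fopenmp with gfortran), otherwise omp_lib is not available. The loop length at which the parallel column drops below the serial one gives a rough idea of the break-even point on a given machine. Since this particular loop does very little arithmetic per element, it may also end up memory-bandwidth bound at the larger sizes.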



Tony_the_D
October 13, 2008 1:06 PM PDT
#4 Reply to #2
Quoting - tim18

In the last example posted in this thread, I can't imagine why parallel sections would be used, rather than parallel do, nor why the inner loop would be designated for OpenMP parallel.  If threaded parallelism is required without any thought given to optimization, /Qparallel would be preferable, even though still not often effective.

As to the minimum problem size for effective OpenMP parallel, I have an example which achieves excellent threaded scaling on Core 2 Duo, when the non-threaded version takes only 1 millisecond.   Of course, this is an ideal case; the cache sharing is effective, as are the persistent threads left from a previous parallel region.  The Intel OpenMP run-time does show a reduced overhead, compared with the Microsoft and gnu libraries.

The basic point, that OpenMP parallelism will not have an advantage for a simple inner loop of length 1000, does apply to the posted case.

Thanks for the feedback.  I am new to the world of parallelization (obviously), so I will look at parallel do as well.




