OpenMP

Dear all,
I am trying to use the OpenMP PARALLEL construct. The running time, however, is the same as the code without it. I then checked the number of threads used in the parallel construct and found that only the master thread is used. I have a 6-core machine; shouldn't the number of threads be 6? I tried CALL OMP_SET_NUM_THREADS(6), but it does not do anything.

Is there anything I should change in order to use OpenMP? If so, would you mind telling me how to do it in Microsoft Visual Studio?

Thanks a lot.

An OMP PARALLEL region by itself doesn't divide work among threads. The most common usage in Fortran is an OMP DO directive for a DO loop inside the parallel region. Consult published examples, or show us what you would like to do.
In the Visual Studio project properties for ifort, there is an option to enable /Qopenmp. Without that setting, you will get warnings that your OpenMP directives aren't being used.
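
For reference, a minimal compilable sketch of that shape (the array names here are just for illustration):

PROGRAM omp_do_shape
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 8
  REAL :: a(n), b(n)
  INTEGER :: i
  b = 1.0
  !$OMP PARALLEL
  !$OMP DO
  DO i = 1, n
    a(i) = 2.0 * b(i)   ! iterations of this loop are divided among the threads
  END DO
  !$OMP END DO
  !$OMP END PARALLEL
  WRITE (*,*) a
END PROGRAM omp_do_shape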

Thanks for your reply.
Here is what I want to do:

!OMP PARALLEL
!$OMP DO

iloop: do i=1,N
  jloop: do j=1,J
    ......
  enddo jloop
enddo iloop
!$OMP enddo

!OMP end parallel

There is no error message, but somehow only one thread was used.
Thanks a lot.

Your first and last lines are comments, not OMP directives (you wrote !OMP instead of !$OMP). The inner loop also uses its own index variable, j, as its upper bound, which is suspect.

Consider the PARALLEL DO directive, which combines the parallel and worksharing construct into one:

PROGRAM perhaps_omp
  !$ USE OMP_LIB            ! only compiled when OpenMP is enabled
  IMPLICIT NONE
  INTEGER :: my_thread_num
  INTEGER :: i
  my_thread_num = -1        ! sentinel value reported when OpenMP is not enabled
  !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(my_thread_num)
  iloop: DO i = 1, 200
    !$ my_thread_num = omp_get_thread_num()
    WRITE (*,*) 'Hello from thread ', my_thread_num
    ! CALL do_something_useful()
  END DO iloop
END PROGRAM perhaps_omp

Compile with /Qopenmp to do things in parallel.

Thanks. May I ask how to compile with /Qopenmp in Microsoft Visual Studio?

In the Solution Explorer, right-click on the project name and select "Properties". In the left pane select "Configuration Properties" > "Fortran" > "Language", and in the right pane set the value for "Process OpenMP Directives" to "Generate Parallel Code (/Qopenmp)".

Thanks a lot. This does give me all 12 threads. However, I ran into a stack overflow. How can I change the stack size?
I tried the following:

integer::OMP_set_STACKSIZE_s,KMP_set_STACKSIZE_s,KMP_STACKSIZE,OMP_STACKSIZE,kMP_get_STACKSIZE_S

call kMP_set_STACKSIZE_s(16384)
but it tells me that this is a function being called as a subroutine. However, I believe it is a subroutine.

I also tried
KMP_STACKSIZE=300000

but this does nothing to change the stack size.

Can anyone please give me a hint on how to change the stack size in general, to avoid the overflow problem? Thanks a lot.

Antfu,

Remove:

integer::KMP_set_STACKSIZE, ...

Add:

USE OMP_LIB

You do not define the OpenMP interfaces yourself. Use the interface declarations contained in the supplied OpenMP module "omp_lib" by way of a USE statement.

The KMP_SET_xxx are generally subroutines.
The KMP_GET_xxx are generally functions.

Using the module will specify which is which, along with the argument types and any library name decorations and/or calling conventions.
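
A minimal sketch, assuming the Intel-supplied omp_lib module (the 16 MB value is just an example, and the call should come before the first parallel region):

PROGRAM set_stack
  USE OMP_LIB      ! supplies the KMP_* interfaces and the KMP_SIZE_T_KIND kind
  IMPLICIT NONE
  ! KMP_SET_STACKSIZE_S is a subroutine; KMP_GET_STACKSIZE_S is a function
  CALL KMP_SET_STACKSIZE_S(16777216_KMP_SIZE_T_KIND)   ! 16 MB per thread
  WRITE (*,*) 'thread stack size: ', KMP_GET_STACKSIZE_S()
END PROGRAM set_stack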

Note that IanH's response has

!$ USE OMP_LIB

This form conditionally includes "USE OMP_LIB" only when OpenMP compilation is enabled.

When your code is always compiled with OpenMP, then use

USE OMP_LIB

without the !$

Jim Dempsey

www.quickthreadprogramming.com

The global stack size (the one set by passing a stack size to link, settable in the project properties for link, modifiable by tools such as editbin) is a likely culprit here. kmp_stacksize affects only the thread stack, and defaults to 2 MB even for the 32-bit compiler, so it doesn't look like you should have exceeded that limit.

Thanks.
I checked that the kmp_stacksize is 2097152.

I also checked Project Properties -> Linker (not link) -> System.
All of the following are zero:
Stack Commit Size, Stack Reserve Size, Heap Reserve Size, Heap Commit Size. This seems weird. Maybe this is not the stack size I should be looking at?

You should try setting a stack reserve size. I suppose the 0 means it takes Microsoft's default.

Now it works after changing the stack reserve size. However, the running time is the same as without parallelization. Can someone tell me what's wrong here? It seems the parallelization is not working at all. Thank you so much.
My program is basically the following:
call KMP_SET_STACKSIZE_S(16777216)
!$OMP PARALLEL
!$OMP DO
do i=1,N
  call A1(...)
enddo
!$OMP END PARALLEL
!$OMP END DO
do i=1,N
  write(10,*) stuff
enddo

subroutine A1(...)
  do j=1,J
    compute stuff
  enddo
end subroutine A1

subroutine A1(...)
  do j=1,J

You have your "do-variable" the same as the "terminal parameter" (Fortran is case-insensitive, so j and J are the same variable). Where do you assign a value to J?

What happens in "compute stuff"?

Hi, I am sorry. I meant do j=1,N2
In "compute stuff", it is a program that compute the values backwards, i.e. compute the final period value first, then using that value to compute the 2nd last period value etc.
v(bigT+1)=0
do t=bigT,1,-1
v(t)=f(v(t+1))
enddo

! where f is a function defined by me

Thanks.

I think you need to supply a program outline with greater detail. Your code may be calling serializing library functions (like the random number generator, or a function containing a critical section).

Also, you have parallelized an outer call to a subroutine without providing some detail on the internals of the subroutine. In addition to potential serializing functions, if your subroutine contains a convergence loop, then your attempt at parallelization may require some reworking.

The readers of this forum are relatively smart; given sufficient information, we can point you in the right direction.

Jim Dempsey

www.quickthreadprogramming.com

Dear Jim,

Thanks a lot for your reply. The running time of my code is the same with and without parallelization. The basic structure of the code is like this:
!$OMP PARALLEL
!$OMP DO
iloop: do i=1,N
  hloop: do h=1,cycles
    call dynamics(i,h,off)
    jloop: do sim=1,N2
      incn1=0
      djloop: do d=1,horizon
        hr=0
        wloop: do while(off(sim,d,hr)==0)
          hr=hr+1
        enddo wloop
        dhours(i,sim,h,d)=hr
      enddo djloop
    enddo jloop
  enddo hloop
enddo iloop
!$OMP END DO
!$OMP END PARALLEL
do i=1,N
  do sim=1,N2
    do h=1,cycles
      do d=1,horizon
        write(11,100) i,sim,h,d, dow(d,h), dhours(i,sim,h,d)
      enddo
    enddo
  enddo
enddo

Subroutine dynamics(...) is basically:

do d=horizon,1,-1
  do sim=1,N2
    do hr=24,1,-1 ! total hours
      V1(j,hr)=vstop(i,sim,hr,d,h)+delta*EV(d+1,hr)
      if(V1(j,hr)>some number) then
        off(sim,d,hr)=1
      endif
    enddo
  enddo
enddo

where vstop and EV are functions defined by me.
There is no convergence loop contained in these procedures.
Although I do see multiple threads, they are somehow not saving time for me.

Thanks a lot for your hints and advice.

What value is N?

Your code does have a convergence loop:

dynamics conditionally sets a flag off(sim,d,hr)=1

and the main code has a do while(off(...

meaning the main code can get hung up waiting on off.

I assume off is marked volatile.

Jim

www.quickthreadprogramming.com

Dear Jim,
In my trial version, I set N to 12, equal to the number of threads I have.

The off(..) values are calculated for every possible combination of their arguments in subroutine dynamics. The main program is just trying to find the earliest case where off() is 1. I assume that at this point (after I have called subroutine dynamics), all the off() values are already known and the main program should not be waiting for new information.
Thanks.

Is the order of the OMP END PARALLEL and OMP END DO statements an issue? The END PARALLEL appears before the END DO, but the PARALLEL section starts before the DO, so the END DO should come first.

I don't see any PRIVATE/SHARED/etc. clauses listed in your OMP directives. That's rather suspicious; it is very unlikely that the defaults are appropriate for every variable in that construct for anything other than trivial code.

Because you are only posting uncompilable fragments of code, it is difficult to diagnose; but if the code extract is exactly as you posted it, then off (where is it declared?) is shared amongst all the members of your OMP team. One thread could be writing to part of off while another is reading the same part. Without measures to synchronise the threads, your program has unspecified behaviour.

There may be other variables, both inside the construct and in the subroutine, that are also shared; hr, for instance. Two threads may be merrily trying to increment hr at the same time, while a third thread is setting it to zero. Chaos.

Consider adding the DEFAULT(NONE) clause to the parallel directive (sketched below), then going through each variable that is subsequently flagged in the errors, deciding whether that variable is private or shared, and explicitly adding it to a PRIVATE or SHARED clause. For shared variables, make sure that you are not reading and/or writing the same "storage location" (an element of an array, for instance) without some sort of synchronisation. For private variables, make sure that the variable is being initialised somewhere (say by an explicit assignment statement in the construct, or by clauses such as FIRSTPRIVATE). If private variables are referenced after the construct, then you may need to think about which thread should provide the value for the variable.
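
A hedged sketch of what that might look like on your outer loop (the private/shared split below is only a guess from the posted fragment; off in particular probably needs to be private if each thread fills it for its own i):

!$OMP PARALLEL DO DEFAULT(NONE) &
!$OMP   PRIVATE(i, h, sim, d, hr, incn1, off) &
!$OMP   SHARED(N, cycles, N2, horizon, dhours)
iloop: do i=1,N
  ! ... loop body as before ...
enddo iloop
!$OMP END PARALLEL DO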

Then go through all the procedures referenced inside the parallel construct and do the same checks for variables that are implicitly shared (variables from common blocks or modules, saved variables, etc.). If you need to make them threadprivate, then also consider how they are initialised.

Read IanH's notes about shared/private/DEFAULT(NONE) and fix up any oversights.

If nothing shows up, then the threads may be doing redundant work.
If a walk-through of your code does not expose the redundancy (usually due to thinking serially when performing the walk-through), then I suggest adding sanity-checking code (conditionally compiled).

Example:
Add an array of integers that shadows the work being done, initialized to 0, then increment the shadow array in your compute function each time you do work (which should occur only once per element). At the end of the parallel region, assert that all elements of the shadow array == 1. Note, to be technically correct you will have to use an atomic update:

!$OMP ATOMIC
sanity(i) = sanity(i) + 1
if(sanity(i) .ne. 1) call HaveBug()

The ATOMIC may add overhead and hide the error. If the problem cures itself when you add the sanity check, then you may have a race condition that is hidden by the ATOMIC. IanH gave some hints on how to track down this condition.

Jim Dempsey

www.quickthreadprogramming.com

Thank you for your advice. Before doing what you suggested, I ran a very simple test:
!$OMP PARALLEL private(i)
!$OMP do

do i=1,100000000
temp=matmul(temp2,temp1)
enddo

!$OMP ENDDO
!$OMP end PARALLEL

The result is shocking: without parallel, it takes 0.14 minutes. With parallelization, it takes 2.9 minutes.

The compiler realized that this program did nothing and eliminated most of it. With parallel enabled, it eliminated less.

You've made a very common error in writing performance tests - coding it such that the compiler can remove large chunks of it since the output is never used.

Steve - Intel Developer Support
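
A hedged rewrite of the test that keeps the result live (the matrix size and the checksum are made up for illustration):

PROGRAM matmul_bench
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 200
  REAL :: a(n,n), b(n,n), c(n,n)
  REAL :: checksum
  INTEGER :: i
  DOUBLE PRECISION :: t0, t1
  CALL RANDOM_NUMBER(a)
  CALL RANDOM_NUMBER(b)
  checksum = 0.0
  t0 = OMP_GET_WTIME()
  !$OMP PARALLEL DO PRIVATE(c) REDUCTION(+:checksum)
  DO i = 1, 100
    c = MATMUL(a, b)               ! the result is consumed below...
    checksum = checksum + c(1,1)   ! ...so the compiler cannot delete the loop
  END DO
  !$OMP END PARALLEL DO
  t1 = OMP_GET_WTIME()
  WRITE (*,*) 'checksum:', checksum, ' wall time:', t1 - t0, ' s'
END PROGRAM matmul_bench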

Dear Jim and all others,

I tried the test you suggested as follows:

integer:: sanity(12) ! I have 12 threads
!$OMP parallel
sanity=0
!$omp ATOMIC
sanity(i)=sanity(i)+1
IF(sanity(i)/=1) print*,'wrong',i
!$omp end parallel

I got an error message: "Subscript #1 of the array SANITY has value -1, which is lower than the lower bound of 1".

Can you tell me what is wrong here?

Also, I assume that if I parallelize the following:
!$omp parallel do private(i,j)
iloop: do i=1,N
jloop: do j=1,Z
...
enddo jloop
enddo iloop
!$omp end parallel do
Then each thread should take some work from the iloop, and for each i, the responsible thread does the entire jloop serially? That is, what other threads do in jloop should not matter for the current thread's work in jloop?
Am I wrong?

Thank you so much for your advice.

I believe the idea was to add such a test in a parallel loop in your own code, with sanity initialized outside, e.g.

sanity=0
!$OMP parallel
!$omp do
do i=1,size(sanity)
  !$omp ATOMIC
  sanity(i)=sanity(i)+1
  IF(sanity(i)/=1) print*,'wrong',i
end do
!$omp end parallel

Apparently, you never initialized i, while you asked each thread to zero the entire sanity array.

For your case of nested loops, yes: you want each inner loop to be independent of the others, so that threads don't interfere with each other. In Fortran OpenMP, the DO counters (j in your example) are automatically private, so each thread has its own copy. Explicit PRIVATE (or FIRSTPRIVATE/LASTPRIVATE) declarations are needed for other variables/arrays which aren't shared; a minimal sketch follows.
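
A minimal compilable sketch of that data-sharing pattern (the names n, m, and a are made up for illustration):

PROGRAM nested_private
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 4, m = 3
  REAL :: a(n,m)
  INTEGER :: i, j
  !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i, j) SHARED(a)
  DO i = 1, n
    DO j = 1, m          ! the inner loop runs serially within each thread's i
      a(i,j) = REAL(i*j)
    END DO
  END DO
  !$OMP END PARALLEL DO
  WRITE (*,*) a
END PROGRAM nested_private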

Dear all,

Thank you all for your helpful advice. I just found out why my parallel program was "slower" than the serial program: I used CALL CPU_TIME instead of OMP_GET_WTIME(). The former somehow reported the timing wrong!

Now that my simple program runs faster in parallel, I will try the more complicated ones.

Thanks.

cpu_time is probably designed to total up the time used by all threads; it would be exceptional for it to decrease with parallelization. omp_get_wtime() should give elapsed (wall-clock) time, as would SYSTEM_CLOCK, except that the latter doesn't have as good an implementation on Windows x64.
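
A small sketch of the difference (the loop body is just a stand-in for real work):

PROGRAM timing_demo
  USE OMP_LIB
  IMPLICIT NONE
  REAL :: t_cpu0, t_cpu1, s
  DOUBLE PRECISION :: t_wall0, t_wall1
  INTEGER :: i
  s = 0.0
  CALL CPU_TIME(t_cpu0)
  t_wall0 = OMP_GET_WTIME()
  !$OMP PARALLEL DO REDUCTION(+:s)
  DO i = 1, 10000000
    s = s + SIN(REAL(i))
  END DO
  !$OMP END PARALLEL DO
  CALL CPU_TIME(t_cpu1)
  t_wall1 = OMP_GET_WTIME()
  WRITE (*,*) 'result    :', s                        ! keep the result live
  WRITE (*,*) 'cpu_time  :', t_cpu1 - t_cpu0, ' s (summed over all threads)'
  WRITE (*,*) 'wall clock:', t_wall1 - t_wall0, ' s'
END PROGRAM timing_demo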

Now, I am running my whole program, but I got this message:
"insufficient virtual memory". My machine has 24GB, 12 threads. I am not sure whether this is really about memory. Thanks for hints.
