Slow parallel processing

I am currently using Visual Fortran Compiler XE. I have a section of code that uses two threads to run two subroutines in parallel. One subroutine simulates the movement of vehicles on freeways and the other simulates the movement of vehicles on streets. I'm sure there is no interaction between the two subroutines. With a previous version of the compiler (about a year ago), parallel processing gave a significant improvement in run time. Now it actually takes longer with two threads than with one. I can't think of anything I've changed since then that would cause the problem. Any suggestions?







Try adding some diagnostics
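A minimal sketch of such diagnostics, assuming the two routines are called from an OpenMP PARALLEL SECTIONS construct. The names update_freeway and update_street are hypothetical stand-ins for the real UPDATE... subroutines, and the construct is an assumption about how the two calls are parallelized:

```fortran
program thread_diagnostics
  use omp_lib
  implicit none

  ! Run the two routines in separate sections on a team of two threads.
  !$omp parallel sections num_threads(2)
  !$omp section
  call update_freeway()
  !$omp section
  call update_street()
  !$omp end parallel sections

contains

  ! Hypothetical stand-in for the freeway simulation routine.
  subroutine update_freeway()
    print '(a,i0,a,i0)', 'UPDATE_FREEWAY: thread ', omp_get_thread_num(), &
          ' of team size ', omp_get_num_threads()
  end subroutine update_freeway

  ! Hypothetical stand-in for the street simulation routine.
  subroutine update_street()
    print '(a,i0,a,i0)', 'UPDATE_STREET:  thread ', omp_get_thread_num(), &
          ' of team size ', omp_get_num_threads()
  end subroutine update_street

end program thread_diagnostics
```

If both routines report the same thread number, or a team size of 1, the two calls are not actually running concurrently.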


What do you see?
Do your UPDATE... subroutines use !$OMP CRITICAL?
Do your UPDATE... subroutines use subroutines or functions that implicitly contain !$OMP CRITICAL? (e.g. random number generator)

Jim Dempsey

Hello Jim

I added the diagnostics you recommended, plus the current simulation time when the functions are called. The results are as expected:

(I tried to copy the results here but somehow it triggers the website spam filter and the message is rejected)

I am not using !$OMP CRITICAL anywhere and, as far as I know, the subroutines do not implicitly contain it.

So you see two different values (0, 1) for the OpenMP thread numbers (parallel region team member numbers)?

Implicit critical sections occur in: random number functions, memory allocation functions, I/O, and other functions that I cannot enumerate at this time.

If you have VTune you should be able to see if excessive use of critical sections is the cause of the slow down.

Another cause could be ineffective cache utilization. Fortran stores arrays in column-major order, so check that your code uses the leftmost index in the inner loop and the rightmost index in the outer loop:

do OuterIndex=1,nOuter
  do InnerIndex=1,nInner
    Array(InnerIndex, OuterIndex) = Something(InnerIndex, OuterIndex)...
  end do
end do
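To see whether loop order matters on a given machine, a small timing test along these lines could be used. This is only a sketch: the array size n and the program name are arbitrary choices, and an optimizing compiler may interchange the loops itself, which would hide the difference.

```fortran
program loop_order
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 2000          ! arbitrary size for illustration
  real, allocatable :: a(:,:)
  integer :: i, j
  integer(int64) :: t0, t1, t2, rate

  allocate(a(n, n))
  a = 0.0

  call system_clock(t0, rate)

  ! Cache-friendly order: the left index varies fastest (inner loop).
  do j = 1, n
    do i = 1, n
      a(i, j) = a(i, j) + 1.0
    end do
  end do
  call system_clock(t1)

  ! Cache-unfriendly order: the right index varies fastest (inner loop).
  do i = 1, n
    do j = 1, n
      a(i, j) = a(i, j) + 1.0
    end do
  end do
  call system_clock(t2)

  print '(a,f10.4,a)', 'left index inner:  ', real(t1 - t0) / real(rate), ' s'
  print '(a,f10.4,a)', 'right index inner: ', real(t2 - t1) / real(rate), ' s'
  print *, 'checksum: ', sum(a)           ! keeps the loops from being optimized away
end program loop_order
```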

Jim Dempsey

Yes, I see thread numbers 0 and 1 assigned randomly between the two subroutine calls.

I appreciate your suggestions and it's likely that my code could be improved, but the point of my post is that the efficiency is not as good as it used to be with a previous version of the compiler. Nothing has changed in my code. I used to get better results using two threads and now I get better results using a single thread.

Are you running and compiling the code on the same computer you used a year ago? Are you using the same compiler options? If you still have the old compiler, can you try recompiling the code with it?

When the code is running in parallel, can you open the Windows Task Manager to make sure it is using only 2 threads? Can you also make sure that you have enough RAM and that the code is not swapping when running in parallel?




Perhaps more of interest is: does the program run faster than it did before? Sometimes adding threading slows things down if the threading overhead is too high.

Steve - Intel Developer Support

I don't think so, but I'm trying to revert to an older version of the compiler. If I can do that I'll compare the run times. 

My apologies. I think I understand the problem better now. It appears that the run times are longer now because of parts of the program that are executed sequentially that were not included in my previous timing tests.

To summarize the problem as I currently understand it, when I use OpenMP to allow part of the program to run in parallel, that part of the program does run faster, but other parts of the program that are sequential run slower than when OpenMP is not used. My project settings only enable OpenMP for one source file and that is the only file that uses two threads, but it causes the rest of the program to run slower. When I disable OpenMP for the entire project the rest of the program runs faster.
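One way to keep such comparisons consistent is to time the parallel part and the sequential part separately in the same run. The following is a rough sketch only: busy_work is a hypothetical placeholder for the real simulation routines, and the iteration counts are arbitrary.

```fortran
program split_timing
  use omp_lib
  implicit none
  double precision :: t_start, t_par, t_end
  double precision :: fw, st, rest

  t_start = omp_get_wtime()

  ! Parallel part: two independent pieces of work, one per section.
  !$omp parallel sections num_threads(2)
  !$omp section
  fw = busy_work(20000000)
  !$omp section
  st = busy_work(20000000)
  !$omp end parallel sections

  t_par = omp_get_wtime()

  ! Sequential part: runs on the initial thread only.
  rest = busy_work(40000000)

  t_end = omp_get_wtime()

  print '(a,f10.4,a)', 'parallel part:   ', t_par - t_start, ' s'
  print '(a,f10.4,a)', 'sequential part: ', t_end - t_par, ' s'
  print *, 'results: ', fw, st, rest      ! use the results so the work is not removed

contains

  ! Hypothetical placeholder workload; its locals are private to each caller.
  function busy_work(n) result(s)
    integer, intent(in) :: n
    double precision :: s
    integer :: k
    s = 0.0d0
    do k = 1, n
      s = s + sqrt(dble(k))
    end do
  end function busy_work

end program split_timing
```

Running the same program with and without OpenMP enabled, and comparing the two reported times individually, separates a slowdown in the sequential code from the speedup in the parallel code.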

I hope that's an understandable description of the problem.

Try setting environment variable KMP_BLOCKTIME=0.

If the rest of your program is multi-threaded but non-OpenMP, then this will release spinwait time back to the application.
It should not make a difference if the rest of the program is completely serial.

*** NOTE: if the serial portion calls the multi-threaded version of MKL, then your application has two distinct OpenMP thread pools (read: oversubscription). Setting KMP_BLOCKTIME would be beneficial in this situation.
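The environment variable has to be set before the program starts. As an alternative sketch, the block time can be set from inside the program. This assumes the Intel OpenMP runtime, whose omp_lib module declares kmp_set_blocktime and kmp_get_blocktime as Intel-specific extensions; the code will not build against other OpenMP runtimes.

```fortran
program blocktime_demo
  use omp_lib   ! assumption: Intel's omp_lib, which also declares the kmp_* routines
  implicit none

  ! Ask idle worker threads to sleep immediately after a parallel region
  ! instead of spin-waiting (same intent as KMP_BLOCKTIME=0).
  call kmp_set_blocktime(0)
  print '(a,i0)', 'block time (ms): ', kmp_get_blocktime()

  !$omp parallel num_threads(2)
  print '(a,i0)', 'hello from thread ', omp_get_thread_num()
  !$omp end parallel

  ! Any serial or non-OpenMP threaded work placed here no longer competes
  ! with spinning OpenMP worker threads for CPU time.
end program blocktime_demo
```

The call should come before the first parallel region so that the setting applies to the threads created for it.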

Jim Dempsey

Thanks, Jim. That seems to help a little.

Now that I realize I wasn't comparing the run times consistently I am not too worried about it. The performance is about the same as before when I only compare run times for the parallel sections.
