Language Ref states that !DEC$ PARALLEL "enables auto-parallelization for an immediately following DO loop".

Does this apply to an outer loop that has many other loops & subroutine calls etc. within it? I.e., is each cycle of the outer loop processed in a separate thread, even if there is a substantial amount of code within the loop?

That would seem a very simple means of parallel execution & in my case should speed execution significantly (quad core, x64), since I have many 1000's of sites of independent activity. However, when I tried it I see no increase in speed, with CPU usage rarely exceeding 25% - 27%.



I did a little more testing, looking at CPU usage per core.

  • With no PARALLEL directives set CPU usage is ~90% (mean) in one core, and maybe 2% - 5% in the other 3 cores. Overall total usage ranges 25% - 27%.
  • With /Qopenmp & !DEC$ PARALLEL, CPU usage is 33%- 38% (each, mean) in cores 1 & 2 (slightly higher in 1), and ~10% & ~20% in the other two cores. Overall total usage ranges 25% - 27%.

So the PARALLEL command does appear to be affecting the distribution of activity between cores but with no net effect on the overall execution speed.

Quite likely I am missing something here (likely a few statements!), but it seems the PARALLEL command is having some effect, just not anything beneficial.


BTW - is it possible to add images (~40kb) to these posts (eg Task Manager window image)?


I realise(?) "!DEC$ PARALLEL" is not OpenMP, but setting /Qopenmp did seem to result in more balancing between the cores - it's just not beneficial.

One other thought - is "!DEC$ PARALLEL" restricted to just a single DO loop; i.e. must it not have inner loops? There are many inner loops in my case - plus calls to subroutines etc. that also contain their own loops.

It's hard to say why your program doesn't use 100 percent of your CPUs.

Is the Process Priority set high enough? Which version of Fortran do you use?

Using version 10.1.013. Not certain about the Process Priority - will be default, whatever that is as I have not set anything.


Try increasing the priority. But keep in mind that the other processes on your system may run slower and may appear to stop responding in the higher modes. This example is for normal behaviour:

use dfwin

integer*4 hProcess



!DEC$ PARALLEL is ignored unless you also use /parallel. It is not a magic wand for parallelism and is rather conservative. Large complex loops and loops containing routine calls may inhibit parallelism.

The compiler offers detailed optimization reports telling you what loops did and did not parallelize and why. Read about them in the documentation.

Retired 12/31/2016


Consider using OpenMP as opposed to auto-parallelization. OpenMP uses !$OMP ... directives and will give you better control over your parallelization endeavors.

Jim Dempsey

What I had set was /Qparallel which seems to be the only option offered in my case. Is there a difference?

FWIW full command line is

/nologo /O3 /Og /QaxS /Qunroll:3 /Qparallel /assume:buffered_io /Qopenmp /module:"x64Release/" /object:"x64Release/" /libs:static /threads /c

Did try with & without /Og but it seemed to make no difference.

Thanks Jim

I did try the OpenMP PARALLEL. Compiled OK but crashed during runtime (seemed to crash before it got to the // code parts. But maybe it did get there?)

Basically I am analysing a series of EQ scenarios. In theory I could run each one as a separate analysis, except that there are 10's of thousands of cases modelled. Each EQ is "entirely separate" & there is a considerable amount of analysis for each one so it seems an ideal candidate for // computing.

A couple of issues that may be preventing // analysis?

1. After each "scenario" a message is written to the screen with the time it was completed (used to indicate progress of the run); ie a "WRITE (*, fmt) loc, ..." statement is used, where "loc" is the "scenario location" number (1 - ~30,000 say). "loc" is the DO loop variable, so if // is working would expect non-sequential values of "loc" printed to screen. I assume that is not a problem?

2. Though each EQ is a totally separate event, the analysis of each event does access common data values; ie. generically:

eq_res1(loc) = fn ( loc, a, b, c, ...), where some data "a, b, c, ..." is not a function of loc.

That means if multiple loc's are analysed concurrently, multiple threads may try to access ("read") data "a" (or say an array value) at the same time.

I assumed the implementation of // computing in the compiler is designed to handle that type of situation, but maybe not????


Minor nit-pick: /QaxS generates a special code path for Penryn CPUs only whenever there is an opportunity to use SSE3 or later. Otherwise, SSE/SSE2 are used. This seems fairly unlikely to show an advantage.


What alternative should I be using?

I am running on a QX9650 CPU machine.

If you have vectorizable complex math, -xP or -xT would have an advantage, otherwise you can take the ifort 10 default (-xW).


!$OMP PARALLEL DO
DO loc=1,NumberOf_loc
eq_res1(loc) = fn(loc,a,b,c,d)
if(mod(loc,100) .eq. 0) write(*,fmt) loc, ...
END DO
!$OMP END PARALLEL DO

In the above, "loc" is private per thread; however, the loop progresses in such a manner that no two threads execute the same values of loc within the DO loop. The values a, b, c, d, ... are assumed to be shared (the default is DEFAULT(SHARED)). If, say, "a" were computed to be unique for a given loc, then "a" should be declared as private to the thread.

!$OMP PARALLEL DO PRIVATE(a)
DO loc=1,NumberOf_loc
a = SomeExpression
eq_res1(loc) = fn(loc,a,b,c,d)
write(*,fmt) loc, ...
END DO
!$OMP END PARALLEL DO
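A minimal, self-contained sketch of the private/shared distinction described here. The arithmetic is a hypothetical stand-in for fn(loc,a,b,c,d); only the data-sharing clauses are the point.

```fortran
program private_vs_shared
  implicit none
  integer, parameter :: n = 1000
  integer :: loc
  real :: a, b
  real :: eq_res1(n)

  b = 2.0   ! shared: only read inside the loop

  ! loc is private automatically (it is the parallel loop index);
  ! a is recomputed for every loc, so it must be private as well
  !$omp parallel do private(a) shared(b, eq_res1)
  do loc = 1, n
     a = real(loc) * 0.5          ! hypothetical per-loc intermediate value
     eq_res1(loc) = a * b         ! each iteration writes its own element
  end do
  !$omp end parallel do

  ! every element should equal loc * 0.5 * 2.0 = loc
  if (any(abs(eq_res1 - [(real(loc), loc = 1, n)]) > 1.0e-5)) error stop "mismatch"
  print *, "ok"
end program private_vs_shared
```

If PRIVATE(a) is removed, all threads write to the same "a" and the results can become wrong intermittently, which is the kind of error that is hard to spot by inspection.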

The DO loops in OpenMP can be scheduled to run in various ways. Look at SCHEDULE in the OpenMP section of the documentation.

For the above example.

Case 1: NumberOf_loc very large, fn(loc,...) very small compute time.

For this case you would want to use a scheduling method that distributes large chunks of the loop iterations to each thread (to reduce thread maintenance overhead).

iCHUNK = NumberOf_loc / OMP_GET_MAX_THREADS()

Note, the above is typically the default for parallel do loops so the above coding would not be necessary. However consider

iCHUNK = (NumberOf_loc / OMP_GET_MAX_THREADS()) / 2

As to why you would want to perform more thread distributions consider what happens while you are running your application if something else runs on your system. (Browsing, eMail, writing a report). In this situation the something else will be stealing processor time from your compute intensive application. This will skew the relative completion times. i.e. each thread of your application will not perform the same amount of work in the same time.

Case 2: NumberOf_loc moderate, fn(loc,...) very large compute time and computation time varies as function of loc

For this case you would want to use a schedule that parcels out the iterations one at a time.
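A sketch of what one-at-a-time scheduling could look like, assuming SCHEDULE(DYNAMIC, 1). The function work() is a hypothetical stand-in for fn(loc,...) whose cost grows with loc.

```fortran
program dynamic_schedule
  implicit none
  integer, parameter :: n = 200
  integer :: loc
  real :: res(n)

  ! schedule(dynamic, 1): each idle thread grabs the next single iteration,
  ! which evens out the load when the cost varies strongly with loc
  !$omp parallel do schedule(dynamic, 1)
  do loc = 1, n
     res(loc) = work(loc)
  end do
  !$omp end parallel do

  print *, "sum of results:", sum(res)

contains

  ! Hypothetical stand-in for fn(loc, ...): run time grows with loc
  pure function work(loc) result(r)
    integer, intent(in) :: loc
    real :: r
    integer :: i
    r = 0.0
    do i = 1, loc * 1000
       r = r + 1.0 / real(i)
    end do
  end function work
end program dynamic_schedule
```

Dynamic scheduling costs a little bookkeeping per chunk, so it suits loops where each iteration is long compared with that overhead.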


There are other forms of scheduling, each with differing characteristics. Get your program running first using the defaults for scheduling. Shake out any problems where you may be sharing a temporary variable when it should be a private variable. Once that is working, then consider tweaking the performance by modifying the scheduling and chunk size.

Jim Dempsey


Many, many thanks for the detailed reply.

My situation is as per your case 2 - both the large compute time for "fn" and that it varies (considerably) with loc. ("fn" = several inner levels of loops in several subroutines, working with many variables & multi-MB of data. Actual CPU time c. in the range <0.01sec to several seconds per loc. NumberOf_loc moderate - typically ~30,000).

Have just got on deck (8am here in NZ). Will digest then have another shot at it.



A suggestion for use later:

If there is a low computational overhead way to determine ahead of time the amount of time that will be required within a given fn(loc,...), then I would suggest that you consider performing the task like a sieve. Perform the large runtimes first, then the smaller runtimes last. This way you avoid the chance of having the longest iteration running last (i.e. all but one core idle during the last lengthy iteration).

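iRTmin = 1.0 ! Minimum Runtime Threshold
! First pass: the long-running iterations
DO loc=1,N_loc
iRT = EstimateRunTime(loc)
IF(iRT .GT. iRTmin) eq_res1(loc) = fn(loc,...)
END DO
! Second pass: the short-running iterations
DO loc=1,N_loc
iRT = EstimateRunTime(loc)
IF(iRT .LE. iRTmin) eq_res1(loc) = fn(loc,...)
END DO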

Generally one or more of the arguments to fn(loc,...) can be used to compute a weight as opposed to a time. Pick an appropriate weight as opposed to run time.

Jim Dempsey

Had a closer look and I now see that the situation is a bit more complex.

The situation for one instance I was looking to use parallel analysis for is more like:

DO loc=1,NumberOf_loc
~200 lines of code, incl several loops and several calls to subroutines
(which in turn call other subroutines)

The ~200 lines of code (& the contents of the subroutines) contain large numbers of intermediate variables ("int_vars") that are functions of "loc" (& probably many that aren't)

ie. in effect analysis stream is:

Large amount of raw data & parameters (indep of loc)
--> compute int_vars = fn( loc, raw data & parameters)
--> compute results(loc) = fn(loc, int_vars)

The results are stored in arrays (one element per loc) or are aggregated over all loc, so are not a problem. But trying to catch all the int_vars that are functions of loc & declaring them all PRIVATE could be a bit messy.

Is there any way to declare a subroutine "PRIVATE" so that all variables calculated within (including those calculated within secondary subroutines called by the "PRIVATE" subroutine) are PRIVATE? (Clutching at straws!!)

eg. If I rolled the ~200 lines of code (or most of them) into a subroutine so that I now had

DO loc=1,NumberOf_loc
call new_sub(loc, .....)
+ a few lines of code that do not matter (not fn of loc / admin etc)

But looking through the Language Ref that seems unlikely.

Possibly simpler to manually implement? - eg. create say three copies of the new sub (new_sub1, new_sub2, new_sub3), though this may also be difficult to implement. Will need to think it over a bit more.

I can currently achieve a form of parallel operation in some cases by running say 3 analyses concurrently when needed (sometimes multiple runs of the program are required). Implementing OMP would allow // analysis for the more common case of single runs, which would be helpful but is not critical.

The local "automatic" (no SAVE, no external reference) variables and arrays in a subroutine called inside a threaded region are automatically private, when the subroutine is compiled with OpenMP or other options which imply thread safety (default automatic).
As to the automatic load balancing Jim referred to, that is usually done by schedule dynamic and possibly adjustment of chunk size (the default chunk size 1 may be OK for you).
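To illustrate the point about local variables, here is a small sketch in which the loop body has been moved into a subroutine. The names work_mod and new_sub and the arithmetic are hypothetical; the locals tmp and j need no PRIVATE clause because each call gets its own copies.

```fortran
module work_mod
  implicit none
contains
  ! tmp and j are ordinary locals (no SAVE), so every call, and hence
  ! every thread, gets its own copy
  subroutine new_sub(loc, res)
    integer, intent(in)  :: loc
    real,    intent(out) :: res
    real    :: tmp
    integer :: j
    tmp = 0.0
    do j = 1, 10
       tmp = tmp + real(loc * j)
    end do
    res = tmp
  end subroutine new_sub
end module work_mod

program locals_are_private
  use work_mod
  implicit none
  integer, parameter :: n = 1000
  integer :: loc
  real :: results(n)

  !$omp parallel do
  do loc = 1, n
     call new_sub(loc, results(loc))
  end do
  !$omp end parallel do

  ! the sum over j = 1..10 of loc*j is 55*loc
  if (any(abs(results - [(55.0 * loc, loc = 1, n)]) > 1.0e-2)) error stop "mismatch"
  print *, "ok"
end program locals_are_private
```

Note that a local given an initial value in its declaration (e.g. real :: tmp = 0.0) acquires the SAVE attribute implicitly and is then shared between threads, so initialise such variables with an executable statement instead.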

Hmm, perhaps I overlooked the DEFAULT ( PRIVATE ) option. Seems like the following should be feasible:


..... multiple lines of code

DO loc=1,NumberOf_loc
~200 lines of code, incl several loops and several calls to subroutines (which in turn call other subroutines)

I'm attempting to use this in two subroutines. In the first (a small part of the analysis) it seems to cause no problems(?) but I have not yet worked out if the code is actually running in // there or not. In the second subroutine (the bulk of the analysis) the program crashes on reaching the // coded part.

Have I misinterpreted DEFAULT (PRIVATE)?
eg, does it apply to variables in subroutines called from within the "lexical extent of a parallel region"?

Possibly should be declaring the large arrays of "raw data" as SHARED?
They are mostly accessed within the nested subroutines, so would a SHARED statement preceding the DO loop even be effective?


[EDIT] Had not seen Tim's response before posting the above. Many of the variables I use are declared in Modules, rather than locally within each subroutine.

Hence will still need DEFAULT (PRIVATE) ???

I've seen people bitten often enough by defaults that they have chosen DEFAULT(NONE) so as to force all to be specified. Yes, if a module variable is visible and writable by multiple threads, it will need private, firstprivate, lastprivate etc. so that each thread gets a local copy and knows when it inherits or passes a global value. shared of course is fine for variables which aren't to be modified within the threads. It becomes practically impossible to verify without a tool like Intel Thread Checker.
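A minimal sketch of the DEFAULT(NONE) style, assuming a module variable that is only read inside the loop. The module data_mod and the variable scale are hypothetical names.

```fortran
module data_mod
  implicit none
  real :: scale = 3.0        ! module variable, only read inside the loop
end module data_mod

program default_none_demo
  use data_mod
  implicit none
  integer, parameter :: n = 500
  integer :: loc
  real :: tmp, results(n)

  ! default(none) makes the compiler reject any variable used in the
  ! region that is not listed in a data-sharing clause
  !$omp parallel do default(none) shared(scale, results) private(tmp)
  do loc = 1, n
     tmp = scale * real(loc)
     results(loc) = tmp + 1.0
  end do
  !$omp end parallel do

  if (any(abs(results - [(3.0 * loc + 1.0, loc = 1, n)]) > 1.0e-3)) error stop "mismatch"
  print *, "ok"
end program default_none_demo
```

Dropping tmp from the PRIVATE list should then produce a compile-time error rather than a silent race, which is the attraction of this style.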

Thanks Tim

In light of the complexity of included code & number of variables to be made private etc, I decided I had better learn to crawl before trying to run.

To check if I was on track, I decided to first implement the // coding on one of the inner loops with only 11 variables needing to be declared private. I expected this to have a relatively small impact on the analysis time as the loop accounted for perhaps only 20% - 30% of the work of the outer loop [on closer inspection, is likely << 20% of the total]. Plus likely high overheads as the inner loop itself is run ~200,000 times per analysis. Still the inner loop contained ~25 lines of code, including two subroutine calls so I expected to see some benefit? (Number of cycles executed in the inner do loop typically in the 10's of thousands [EDIT - mean cycles for the loop is actually 2,800, range 1 - 17,000]).

Looked good for a moment or two. CPU usage jumped to 90 - 100% (all four cores at close to 100% for the first time on this PC). That was surprising though, since there was a lot of work outside the // section.

After about 5 minutes it became apparent that despite the CPU working 3 - 4 times harder, the analysis was running MUCH slower than without the // code!

First attempt retained the SCHEDULE(DYNAMIC) spec. I retried without it but the analysis still ran SIX times slower than without the // coding (& CPU still at 90 - 100%).

Will need to do a lot more digging to confirm that the implementation is correct, but it does not look hopeful at this point :-(


Follow up to above post:

In the above attempt it seems some variables may not have been correctly 'typed' (private vs shared, etc), as several intermediate 'debug' values output were incorrect (analysis terminated long before completion).


Tried on another inner loop with fewer statements & only one "simple" function reference (function contains only local variables & the dummy arguments); ie. no complication with module variables in called subroutines.

This second loop is also executed ~200,000 times but has more than 20x as many cycles (mean >77,000 cycles) as the previous case and has its own inner loop of 15 cycles & a third level loop within the function (suspect this section of code accounts for > 30% of the total analysis time). SCHEDULE not specified, hence STATIC?

Based on CPU_TIME running times written to screen for each "loc", it appeared this case was ~3x slower than the non-parallel case. But it turns out CPU_TIME is not 'valid' for 4 cores running in parallel (likely not news to many). DATE_AND_TIME recorded at start & end of program shows that the second parallel case was actually 16.5% faster than the non-parallel case (divide CPU_TIME by 4?). However, not worth committing 4 cores @ close to 100% for a 17% gain.
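The difference between the two timers can be seen with a sketch like the following, which uses the standard SYSTEM_CLOCK intrinsic for elapsed (wall-clock) time. What CPU_TIME reports for a threaded program is processor-dependent; on many systems it is the total over all threads, as observed above.

```fortran
program wall_vs_cpu
  use iso_fortran_env, only: int64
  implicit none
  integer(int64) :: count0, count1, rate
  real :: cpu0, cpu1
  integer :: i
  double precision :: s

  call cpu_time(cpu0)
  call system_clock(count0, rate)

  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, 50000000
     s = s + 1.0d0 / dble(i)
  end do
  !$omp end parallel do

  call cpu_time(cpu1)
  call system_clock(count1)

  print *, "sum           =", s
  print *, "CPU time (s)  =", cpu1 - cpu0   ! may be summed over all threads
  print *, "wall time (s) =", dble(count1 - count0) / dble(rate)
end program wall_vs_cpu
```

OpenMP also provides OMP_GET_WTIME() in omp_lib for wall-clock timing, if the program is always built with OpenMP enabled.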

Re-ran the above with NUM_THREADS set 3 to check whether memory access bottleneck was an issue. CPU usage was typically ~75% as expected, with total analysis time (ex DATE_AND_TIME) 17.3% less than the non-parallel case. Marginally better (less CPU demand), but still not a worthwhile gain :-(


Despite the relatively 'straightforward' structure of this second loop, it appears something must still be wrong as the final results differ from the non-parallel case by ~10% (much less than the error in the intermediate values for the first inner loop attempt). Error for 3 thread case was slightly less.

The likely source of the error would seem to be in the final summation at the end of the loop? Simplified, this case is of the form:

DO i = 1,LargeNum
  k = kv(i)
  DO j = 1,15
    x = xFUNCTION(....) ! contains only local variables
    p = ... (depends on x, k & j)
    Res(j,iloc(k)) = Res(j,iloc(k))+ p
  END DO
END DO


x, p, j & k are declared PRIVATE, but not Res(:,:).
iloc(:) & kv(:) are SHARED (by default), which should be OK (not changed in loop).

The result summation is basically the same as R = R + p, where all threads aggregate the same R. I assumed R should be SHARED in this situation but given the error in the result ...????

OR ... does "xFUNCTION" somehow need to be declared PRIVATE also?

I attempted to, but got a compile syntax error ("name has not been declared as an array or a function"). It has been; ie. Integer xFUNCTION is locally declared.

If any elements of Res() are accessed by multiple threads, you have a "race" condition, where changes in the order of updating would account for inconsistent results. Running Thread Checker with your data sets should verify this.


Learning how to crawl...

Your original code had

...code outer before
do I=1,Icount
... code inner
end do
... code outer following

The crawl method

module MOD_outer
  ... shared variables
end module MOD_outer

subroutine WAS_original
  use MOD_outer
  ... code outer before
  !$OMP PARALLEL DO
  do I=1,Icount
    call WAS_code_inner(I)
  end do
  !$OMP END PARALLEL DO
  ... code outer following
end subroutine WAS_original

subroutine WAS_code_inner(I)
  use MOD_outer
  integer, intent(IN) :: I
  real :: A,B,C ! local vars
  ! ****** use AUTOMATIC for thread local arrays *****
  real, AUTOMATIC :: Array(100)
  integer :: J,K,L ! local vars
  ... code inner
end subroutine WAS_code_inner

Step 1: Rework the code and compile _without_ parallelization. Get code to work and compare run times. Runtime of reworked code should be almost identical to older single threaded code. If not, then something has to account for the difference (e.g. different compiler options). Resolve those differences to your satisfaction.

Step 2: Immediately in front of the !$OMP PARALLEL DO insert

call OMP_SET_NUM_THREADS(1)

to force the PARALLEL DO to use only 1 thread. Compile as parallel, rerun the performance test. This too should produce almost the same runtimes. If it does not, then some investigation is warranted to determine why the run time is slow.

Step 3: Comment out the call to set the number of threads (or change to use 2 threads). Run the performance test again. If the performance goes down then something is causing an interference between the threads. Using a profiler (timer based sampling) will (hopefully) show where the threads are slugging it out.

You should be aware that there are some runtime system library calls that are run in a critical section. READ and WRITE are obvious, ALLOCATE and DEALLOCATE will be serialized, but other things are not so obvious, such as the functions that return random numbers. As an example, the profiler will typically not show that you are in the random number generator; instead it will likely show that you are in a routine performing a SpinLock. You can find the location of the SpinLock code, then while running, randomly set a break point. Look at the call stack. If that is not descriptive enough then use step out until you reach the Fortran code. This may take a few iterations of remove break point, continue, set break point, diagnose, repeat until done.

Jim Dempsey


RE: tim18
>>If any elements of Res() are accessed by multiple threads, you have a "race" condition, where changes in the order of updating would account for inconsistent results. Running Thread Checker with your data sets should verify this.<<

What Tim is referring to is what are called temporal dependencies. Example:

Res(N) = Expression(Res(N-1))

Where you cannot compute the N'th result prior to computing the (N-1)'th result. There are many other coding conditions that are sensitive to sequence of operations.

Counter example

Res(N) = Expression(Res(N+1))

Where you must use the old value of the next cell in the output array. In this case you do not want a different thread to compute the next value of Res(N+1) prior to you using its old value. In this circumstance you can create a 2nd result array such as ResNew to hold the new results while Res maintains the old results

call Initialize(Res)
DO WHILE (.not. Done)
!$OMP PARALLEL DO
DO I=1,Icount
ResNew(I) = fn(Res,I,...)
END DO
!$OMP END PARALLEL DO
Res = ResNew
END DO

Depending on the complexity of the code the Thread Checker might not be able to detect the temporal dependency. You, being familiar with the code, should know of these issues.
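A runnable sketch of the two-array approach, with a hypothetical averaging update standing in for fn(Res,I,...). A fixed number of sweeps replaces the Done test.

```fortran
program double_buffer
  implicit none
  integer, parameter :: n = 8
  integer :: i, sweep
  real :: res(n), resnew(n)

  res = [(real(i), i = 1, n)]

  do sweep = 1, 3
     ! every iteration reads only the old values in res and writes only
     ! its own element of resnew, so the order of execution cannot matter
     !$omp parallel do
     do i = 1, n - 1
        resnew(i) = 0.5 * (res(i) + res(i + 1))
     end do
     !$omp end parallel do
     resnew(n) = res(n)      ! boundary value carried over unchanged
     res = resnew
  end do

  print *, res
end program double_buffer
```

The copy Res = ResNew happens outside the parallel loop, so no thread ever sees a partly updated Res.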

Jim Dempsey

Tim & Jim especially - many thanks for the detailed explanations. They are helping a lot.

Two brief specific questions from my posts last night:

+ Do I need to declare xFUNCTION private?
+ In my simplified code example, is Res(:,:) OK remaining SHARED?


Will review my code in the light of your comments shortly.

A couple of points I can clarify

1. Yes, elements of Res() will be updated by multiple threads. It is simple aggregation so the order that they are summed **should** have no impact (but yes, maybe I need to look at the precision issue - adding small bits to a large total ! I did not expect the total for the individual elements to be an issue, but I will review it more closely now).

2. There is no temporal dependency in Res()

3. I am certain there are no runtime system library calls in the parallel loop or the function called. Certainly no READ, WRITE, ALLOCATE, LOG, or RANDOM type stuff. Just simple add, subtract, multiply & divide, if & DO.

Have NUM_THREADS(1) case running now but it will be an hour or so before it finishes. CPU usage is ~25% so appears to be working correctly.

This has made me look more closely at the function which obviously consumes a significant chunk of the CPU time. It is a simple interpolation routine but it's a generic one I use elsewhere. I now realise some of the generality is not required in this instance (sequence interpolated in this case always monotonically increases), so I have now written a specific version with some of the code culled. Looking at the current progress of the NUM_THREADS(1) case it may be providing a c. 12% saving overall.

Ran several variations.

+ Aggregation of Res(). Order of summation confirmed not an issue; ie. changing Res() from single to double precision had only a very minor impact (diff <0.05% within range of interest). Hence changed summation order due to threading should have no impact.

+ NUM_THREADS(1) case produced identical answers to the non-threaded case.

+ Total analysis time very similar for NUM_THREADS(1) and the non-threaded cases. Hence (to my surprise) no OMP overhead. In fact time for NUM_THREADS(1) was marginally less (<1% difference, but consistent for both single & double precision runs).

+ NUM_THREADS(3) case results 5% - 7% lower than NUM_THREADS(1) / non-threaded case results (same difference for both double or single precision), so something is still wrong.

+ Total analysis time again 17.3% less than non-threaded case (but CPU running 3x harder).

Will persevere a little more, but will soon need to get back to more pressing issues :-(.

Thanks for all the help.

From scouring the manual & the internet it is clear that Res(j,iloc(k)) = Res(j,iloc(k))+ p is the offending statement.

eg from http://www.openmp.org/presentations/miguel/F95_OpenMPv1_v2.pdf clause 3.1.8 (p43), it is clear that unpredictable results will occur unless a REDUCTION clause is used.

In OpenMP v1 it appears only a scalar or an array element is permitted within a REDUCTION clause. The document notes that the computational overhead would be very large for a large array.

It seems v2 and IVF do allow arrays(?), but not of deferred shape or assumed size.

Unfortunately my array (Res) is declared allocatable in a module and is allocated in a different subroutine, so the compiler throws a syntax error (deferred shape or assumed size not permitted).

I 'restructured' the code so that Res() & its dimensions came into the subroutine via the argument list, then attempted to include a REDUCTION(+: Res) clause. However, that still throws a compiler error. In this case I only get a single line error statement

Error 1 Compilation Aborted (code 1)

with no further details (except it does show the name of the file at fault - ie. the one that includes the subroutine with the OMP stuff). Not a lot of help! Compiler error disappears if I remove the "REDUCTION(+: Res)" spec from the !$OMP statement.

Stumped !


[EDIT:] Build log includes a little more info:

fortcom: Fatal: There has been an internal compiler error (C0000005)

Also see http://www.hpcvl.org/faqs/mpi/OpenMP.html item 4.

6th paragraph below the example code:


Finally, we instruct the compiler to treat the value of mys
specially. The reduce(+:mys) instruction causes a private
value for mys to be initiatialized with the current
mys value before thread creation. After all loop iterations
have been completed, the different private values are
reduced to a single on by a sum (+ sign in the

Further digging solved the problem of invalid results. As Tim suggested, there was a "race" condition in the Res(j,iloc(k)) = Res(j,iloc(k)) + p statement. That prevented Res() from being updated in all DO loop iterations.

Solution was to precede the statement by the !$OMP ATOMIC directive, since REDUCTION could not be used for an array. With !$OMP ATOMIC set, the results for the NUM_THREADS(3) case are now identical to the non-threaded solution.
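A self-contained sketch of the ATOMIC fix, simplified from the loop described earlier. The contents of iloc and the constant p are hypothetical; the structure of the update is the same.

```fortran
program atomic_accumulate
  implicit none
  integer, parameter :: nrows = 15, ncols = 6, n = 60000
  integer :: i, j, k
  integer :: iloc(n)
  double precision :: p
  double precision :: res(nrows, ncols)

  ! hypothetical mapping: many values of i land on the same column
  iloc = [(mod(i, ncols) + 1, i = 1, n)]
  res = 0.0d0

  !$omp parallel do private(j, k, p)
  do i = 1, n
     k = iloc(i)
     do j = 1, nrows
        p = 1.0d0
        ! atomic protects only this one update of this one memory location
        !$omp atomic
        res(j, k) = res(j, k) + p
     end do
  end do
  !$omp end parallel do

  ! each of the 6 columns receives n/6 = 10000 contributions per row
  if (any(abs(res - 10000.0d0) > 1.0d-9)) error stop "lost updates"
  print *, "ok"
end program atomic_accumulate
```

ATOMIC applies to the single storage location being updated, not to the array as a whole, so threads updating different elements need not wait on each other, although each atomic update still costs more than a plain one.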

However, performance takes a further small hit. Total runtime for 3 threads is now only 12% less than the non-threaded case (at the expense of 3x the CPU demand). Looks like the heavy premium paid for the quad core might have been somewhat mis-spent!

Maybe there is a way to improve the performance further but the gains seem relatively limited. Not clear if ATOMIC locks the whole array (ie. in effect, by locking the assignment), or just locks the "active" array element (so other threads could update other elements at the same time). Hopefully the latter, but I have my doubts.

Pity that REDUCTION cannot be applied at least to small arrays. In my case, Res() is typically (25,6) or smaller. Storing 3 temp local copies would seem trivial & should be more efficient than ATOMIC or other restrictions.



If you require an ATOMIC on Res(j,iloc(k)) = Res(j,iloc(k)) + p then it would appear that multiple threads are updating the same cell in Res. Your prior explanations were not clear (to me) that multiple threads would be sharing the same locations in Res.

If Res is an accumulation array that gets updated many times per cell then do as your last paragraph suggests and create multiple Res arrays and consolidate them on termination of loops.

real :: Res(Nx, Ny)
real :: ResLocal(Nx, Ny)
Res = 0.0
!$OMP PARALLEL DO PRIVATE(ResLocal)
do I=1,NumberIterations
ResLocal = 0.0
call DoWork(ResLocal, I)
!$OMP CRITICAL
Res = Res + ResLocal
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO

The above assumes DoWork runs a relatively long time.
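When each iteration is short, a variant is to keep one local array per thread for the whole loop and merge it once at the end. The following is a sketch under that assumption, with a hypothetical index mapping standing in for iloc(kv(i)).

```fortran
program local_accumulate
  implicit none
  integer, parameter :: nx = 25, ny = 6, n = 60000
  integer :: i, j, k
  double precision :: res(nx, ny), reslocal(nx, ny)

  res = 0.0d0

  !$omp parallel private(reslocal, i, j, k)
  reslocal = 0.0d0                 ! each thread zeroes its own copy
  !$omp do
  do i = 1, n
     k = mod(i, ny) + 1            ! hypothetical stand-in for iloc(kv(i))
     do j = 1, nx
        reslocal(j, k) = reslocal(j, k) + 1.0d0   ! no locking needed here
     end do
  end do
  !$omp end do
  ! merge once per thread instead of locking on every update
  !$omp critical
  res = res + reslocal
  !$omp end critical
  !$omp end parallel

  ! each of the 6 columns receives n/6 = 10000 contributions per row
  if (any(abs(res - 10000.0d0) > 1.0d-9)) error stop "mismatch"
  print *, "ok"
end program local_accumulate
```

On compilers that accept array reductions, a REDUCTION(+:res) clause on the parallel loop expresses the same idea with less code.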

Jim Dempsey


You may want to download a trial of the Intel Thread Profiler. It can help you visualize your application's use of threads and point to specific lines of code that are causing stalls.

Retired 12/31/2016


Sorry if I did not make the situation clear. I tried to indicate that multiple threads update the same cells by using "iloc(k)" in "Res(j,iloc(k))" and by the comments that I thought the likely source of the error was in "the final summation at the end of the loop" and that "The result summation is basically the same as R = R + p where all threads aggregate the same R." (post 30246740).

I gather from your post that the key terminology should have been "accumulation" rather than "summation" or "aggregate".

I suspect the biggest savings will be if I can parallel code the outer loop I started with in my original post. That will require parts of the code to be restructured & will require more time than I have at the moment. But I am keen to get the parallel code working so will re-visit the situation sometime in the next couple of months. Both Tim's and your comments have been very helpful and I now have a much better appreciation of what I need to do.

Thanks to all who have helped


As a general rule, the further out you begin the parallelization the better the performance (memorize parallel outer - vector inner). That said, as you discovered there are ordering and interaction issues that must be addressed. As you get into parallel programming you will get accustomed to the issues.

The important concepts that you have just learned are:

If multiple threads update a location then an interlocking method is required.
Interlocking methods are not "free" (have overhead)
Sequence of execution may be important
and other issues addressed in this thread

To summarize

If the amount of computation time is significant as compared to the interlocking overhead then choose the simpler coding method containing ATOMIC or CRITICAL section.

If the amount of computation time is small compared to the interlocking overhead then choose a more complex coding method that avoids or reduces the interlocking overhead.

Get your application working first, then address the optimization issues later. This will give you a base line performance and also provide the reference data (keep a copy of the original code in a separate project area so you can produce different test data as needed).

Good luck,

Jim Dempsey


As per new thread I persevered a little more & at last made worthwhile progress.

