Reserving a core for a thread

My multi-threaded program works, but I find that running with several cores actually takes longer than does a single-thread run. Here's what is going on:

1. Several threads are started. They each go into a DO WHILE loop, waiting for a flag in an indexed variable in a named COMMON block.

2. When the flag is set, the thread starts some serious number crunching, using data taken from another indexed set of variables. When the routine finishes, it sets another flag, also in an indexed array.

3. The calling program waits until all threads have finished, checking the flags in another DO WHILE loop, and then reads out the answers from the indexed variables. Then it starts over with another set of input data, and so on.

This should run faster with several threads running in parallel, but it doesn't. Possible reasons:

1. One of the threads is in a core being used by Windows for something else. It cannot complete its task until Windows lets it run, and the calling program has to wait for that.

2. The several threads do not in fact run in parallel. If they go individually, that would also be very slow.

Is there any way to ensure that my threads get their own cores, and they all run in parallel?
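(For concreteness, here is a minimal sketch of the scheme described above; all names are hypothetical, and a hand-rolled spin loop like this generally needs the flag arrays declared VOLATILE before the compiler is obliged to re-read them on every pass.)

C     Hypothetical worker thread: spin on a go flag, crunch, set a done flag.
      SUBROUTINE WORKERLOOP(ITHREAD)
      INTEGER ITHREAD
      INTEGER GOFLAG(16), DONEFLAG(16)
      COMMON /MTFLAGS/ GOFLAG, DONEFLAG
      VOLATILE GOFLAG, DONEFLAG
   10 CONTINUE
      DO WHILE (GOFLAG(ITHREAD) .EQ. 0)
      END DO
      GOFLAG(ITHREAD) = 0
      CALL DOCRUNCH(ITHREAD)        ! the serious number crunching
      DONEFLAG(ITHREAD) = 1
      GO TO 10
      END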


Are you sure that your synchronisation method (spinning in a do while loop, waiting on a variable) works? Note that you'd need to look at the generated assembly in order to decide. I don't play at that low a level unless I've been a very bad boy, but inherently I doubt that you could write robust and efficient synchronisation methods in straight Fortran (pre F2003) - typically there would be a call to an operating system API or some other library.

Others have already done a lot of the hard work in this area - have a look at OpenMP (a good starting point if you have a threaded/shared memory view of the world) or coarrays (part of the standard Fortran language as of F2008).

How are you starting your threads? How many are there? What sort of machine are you running on (how many real cores does it have)? What is the serious number crunching - are the threads trying to read from/write to the same variables?

Yes, it's a big issue. Threads are started with



Each pass through this loop creates a thread that starts at the top of MTFUNCTION. That routine calls another, called DOMTRAYTRACE, which waits with a DO WHILE loop until flagged to start processing. My intent is that the latter program will execute in the thread (and in the core of the thread) which called it.

The number of threads is constrained to be no more than the number of cores minus two. That leaves two for Windows to play with. So on my 8-core system there can be up to six.

Each thread uses its own set of data, placed in an indexed array by the caller, plus its own automatic variables. So no thread directly changes any data used by any other thread.

The number crunching involves tracing a light ray through a lens, and if there are many elements it takes a while. So if I want to trace 100 rays, I can start six of them running in parallel, collect the answers, start a new six, and so on. The right answers come back so each thread is doing its job with its own data.

Every subroutine involved in the operation of the threads is compiled with


which (I hope) makes it run in the same thread as its caller. I have looked for details of the two USE directives so that I would know better what I am doing, but those terms are not in the help file.

The Windows Task Manager Performance tab shows my eight cores, normally most of them idle. When I start my threads, the requested number of cores shoots up to nearly full utilization -- which is what you would expect from the DO WHILE loops. So it looks like they have started and are running as I planned.

But how can I be sure that the routines called in each thread actually execute in the same core and in parallel? I am using IVF Composer XE 2011.

In C++ one has the SetThreadAffinityMask() option to keep a thread in a single core. What does Fortran offer?

Have you tried OpenMP?

Using FPP, it should be relatively easy for you to experiment with the same source code compiled for OpenMP or for your threads system.

Pseudo code:

#ifndef _OPENMP
      call createYourThreads()
#endif
      do while(fetchRays())
#ifndef _OPENMP
        call setYourGoFlags()
        call waitForAllToFinish()
#else
!$omp parallel do
        do i=1,nRays
          call traceRay(i)
        end do
!$omp end parallel do
#endif
      end do

Where traceRay(i) is called in your current code.

I wouldn't worry about reserving one thread for Windows. It will float from core to core.

You can specify the number of threads for a parallel region in OpenMP.

Jim Dempsey


That's a good suggestion, but I have a question: I have implemented my common blocks for 10 cores, so I cannot do more than that number of loops in parallel. When the OpenMP parallel construct finishes, I collect the data from those 10, and then I presume the OS terminates all of the threads. Then for the next set, it has to create them all over again -- and the overhead then exceeds the savings. That is why I create the threads initially and then reuse them over and over.

Does OpenMP do anything like that? I mean, create some threads and then reuse them over again? If your proposed traceRay() exits, then the thread evaporates, right?

SetThreadAffinityMask is a Win32 API function, not a C++ one, so you can use it from Fortran as well.
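For what it's worth, a rough sketch of that call from Intel Fortran (this assumes the KERNEL32/IFWINTY modules shipped with the compiler expose GetCurrentThread and SetThreadAffinityMask; the routine name here is hypothetical and the integer kinds may need adjusting to match the interfaces in your compiler version):

      SUBROUTINE PINCURRENTTHREAD(ICORE)
      USE IFWINTY
      USE KERNEL32
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: ICORE            ! 0-based logical processor number
      INTEGER(HANDLE) :: HTHREAD, IMASK, IOLD
      HTHREAD = GetCurrentThread()
      IMASK = ISHFT(INT(1, HANDLE), ICORE)    ! one bit per logical processor
      IOLD = SetThreadAffinityMask(HTHREAD, IMASK)   ! returns 0 on failure
      END SUBROUTINE PINCURRENTTHREAD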

OpenMP works with a pool of threads.

PROGRAM YourProgram
... any data
{optional call to omp_... function to query/set max threads}
! the following is the 1st time OpenMP is called
!$OMP ...
{hidden code does once-only allocation of thread pool}
{note, subsequent nested levels may add additional pool(s)}

IOW the thread pool is created once (unless you use nested parallel levels, and then only once again per nest-level branch).

Subsequent to the first time call, the threads get re-used as your code enters parallel regions.

As your code exits a parallel region, there is an implicit barrier/join. (NOWAIT can remove the barrier at the end of a worksharing construct such as !$OMP END DO, but not the one at the end of the parallel region itself.)

Upon exit of a parallel region, the (additional) threads either run something else in your application or, failing that, enter a spinwait (default 100ms-300ms) and will resume immediately should you enter another parallel region. Should the spinwait time expire before the next parallel region is entered, the thread suspends itself (but does not exit). You can change the spinwait time (KMP_BLOCKTIME environment variable or the kmp_set_blocktime library call).

OpenMP should offer you everything you need (from what you describe).

Jim Dempsey

For Intel OpenMP, the KMP_BLOCKTIME feature controls how long the thread pool persists after a parallel region is closed, default 200 (milliseconds). It's not a fully portable feature, although it's the same in all Intel OpenMP implementations.
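For example (Intel-specific, so treat it as a sketch rather than portable OpenMP):

! either set the environment variable before launching:  KMP_BLOCKTIME=200
! or call the Intel extension routine before the first parallel region:
      CALL KMP_SET_BLOCKTIME(200)   ! milliseconds of spin-wait before a thread sleeps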

These are all very pertinent replies, and I'm exploring what OpenMP can do. I have implemented a version much like the example by Jim Dempsey, and it runs as it should -- but is also slower than the single-thread version. Then I checked the Task Manager, to watch how busy my eight cores are. Most of them were doing nothing, even though I supposedly ran the loop for 10 cores.

So I probably need a directive to say how many cores to employ. I tried


and got a linker error. How does one get access to those functions? I want to be sure all my cores are running.

Intel (and gnu) OpenMP default num_threads to the number of logical processors seen. omp_get_num_procs will work only with the USE OMP_LIB or equivalent. It will only confirm the number of logical processors, will not check or determine the number of threads running. If you have HyperThreading, you should try setting OMP_NUM_THREADS so as to try not more than 1 thread per core, and set KMP_AFFINITY to spread out the threads across cores (e.g. KMP_AFFINITY=compact,1,1 or KMP_AFFINITY=scatter). It's quite difficult to get OpenMP performance from HyperThreading on Windows.
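A minimal sketch of doing that from the code rather than the environment (USE OMP_LIB, omp_get_num_procs and omp_set_num_threads are standard OpenMP; the cap of 8 is only an example):

      USE OMP_LIB
      INTEGER NP
      NP = OMP_GET_NUM_PROCS()                ! logical processors seen by the runtime
      CALL OMP_SET_NUM_THREADS(MIN(NP, 8))    ! thread count for subsequent parallel regions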

>> is also slower than the single-thread version

Can you post an outline of your code?
Include the OpenMP directives.
Include any code you wrote yourself for thread coordination
(it may still be in there from your prior coding attempt)

I will be out over this weekend but others may help.

Unless the work done in the parallel section is very short
(or you are calling functions performing serialization: allocate/deallocate, rand, R/W, ...)
the OpenMP version should be faster.

Jim Dempsey

Okay, here's an outline:



c NP = omp_get_num_procs() (This causes a link error:

error LNK2019: unresolved external symbol _omp_get_num_procs referenced in function _TRANS

so it is commented out.)

9499 (generate individual ray starting data)



CALL MTRAYTRACE(I) ! gets data from the indexed array filled by ZASABR()

GO TO 8801 ! read out results and process them sequentially; then set INDEX = 1, start over at 9499

INDEX = INDEX + 1 ! can start yet more threads immediately
GO TO 9499 ! SET UP NEXT RAY; comes back in above

There are two problems: first, why can't I link the omp_... routine? Is there another library I have to declare to the linker? Second, why are many of my cores idle?

I've checked the Threads window in the debugger, and before I call any of the OMP routines, there is a Main Thread and two Worker Threads. After I get to the first !$OMP PARALLEL DO, the same three threads show up. I would expect as many threads as I have cores. Clearly, the OpenMP feature is not working.

In case it's useful, here are the command lines for Fortran and the linker:

/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /recursive /reentrancy:none /extend_source:132 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c

/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE

Progress! I found a page that says I have to add VCOMPD.lib to the linker input. Now I can use the omp_... routines -- but I still don't get more than the usual three threads, even after the !$OMP PARALLEL DO.

So something's still wrong.

Yes, I'm a newbie, and I didn't know about linking the library VCOMPD.lib, and also setting the compiler options to recognize the omp_ calls. So I got most things working: it makes lots of threads and they all run when I get to the !$OMP PARALLEL DO statement. So far, so good. And the code comes back with the right answers.

But here are my timing specs:

serial mode: 0.371 seconds
10 cores, 10 passes in the DO loop each sequence, which is run about 985 times: 0.546 seconds.

It's hard to believe there is so much overhead in the OpenMP routines. There is some added work in my code, of course; there are 36 assignment statements associated with each core going into the calculation, and several hundred coming out. But if I run a simpler problem, where the calculations are faster but the overhead is the same, I get

serial: 0.156
10 cores: 0.215

So the overhead has to be no more than 0.059 seconds, even if the parallel execution was exactly the same speed as the serial. None of this makes sense.

Is there lots of overhead just triggering each pass through a given thread? That might do it.

If you didn't set /Qopenmp (there's a prominent option in Visual Studio GUI as well), your OpenMP directives should be reported with warnings. That would explain your failure to link libiomp5, thus your omp calls won't be resolved.

As Tim said, you need to set the option /Qopenmp. It is under Fortran > Language > Process OpenMP Directives in the properties menu. Without this option, the OpenMP directives will be ignored.

I've made some real progress. I now have the debug version running my eight cores with eight threads, and my test case runs 1.56x faster with multithreading enabled than it does in serial mode. A key point was to not enable recursive routines and not generate reentrant code. (This was found by dumb trial and error, since I have not seen that information given anywhere.) This is great news!

But the release version still runs 1.6x slower in multithread mode, and I don't know why. Here are the command lines:



/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /reentrancy:none /extend_source:132 /Qopenmp /Qopenmp-report1 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c


/ZI /nologo /W1 /WX- /Od /Ot /Oy- /D "WIN32" /D "_DEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /Gm /EHsc /MTd /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /Fp".\Debug\SYNOPSYS200.pch" /Fa".\Debug" /Fo".\Debug" /Fd".\Debug" /FR".\Debug" /Gd /analyze- /errorReport:queue


/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects

\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug

\SYNOPSYS200_lib_.lib" "VCOMPD.LIB" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug

\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb"




/nologo /Oy- /Qipo /I"Release/" /reentrancy:none /extend_source:132 /Qopenmp /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Release/" /object:"Release/" /Fd"Release\vc100.pdb" /check:none /libs:dll /threads /winapp /c


/Zi /nologo /W2 /WX- /O2 /Ot /Oy- /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /GF /Gm- /EHsc /MT /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp".\Release\SYNOPSYS200.pch" /Fa".\Release" /Fo".\Release" /Fd".\Release" /FR".\Release" /Gd /analyze- /errorReport:queue


/OUT:".\Release\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:

\SYNOPSYSV14\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "ODBC32.LIB" "ODBCCP32.LIB" "mpr.lib" "SentinelKeys.lib" "wsock32.lib"

"freeglut.lib" "VCOMP.lib" "Release\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMTd.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Release

\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /PDB:"C:\SYNOPSYSV14\Release\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS


Do you need the option /fp:precise in C++? This will disable a number of optimizations.

Also, is there a reason you are using /iface:cvf in Fortran? Are you calling libraries compiled with CVF?

I need the /iface directive. The program was converted from CVF, and although there are no libraries compiled there, I have calling conventions built in all over the place that require that option. The /fp directive results in floating-point answers that are nearly the same as the CVF version, while the other options are often quite different.

I'm not sure the issue is optimization anyway. The program runs faster than the CVF version, even in serial mode. The issue is, why I cannot get the OpenMP services to work in release mode as well as in debug mode. The latter seems to work fine, and I get faster results running parallel than serial. But the release version runs more slowly in parallel mode than in serial, which suggests that it is actually running the parallel DO loop in serial, with extra overhead slowing it down even more.

So that's the problem. Can you think of any way to fix it?

Quoting dondilworth
I've made some real progress. I now have the debug version running my eight cores with eight threads, and my test case runs 1.56x faster with multithreads enabled than it does in serial mode. A key point was to not enable recursive routines and not generate reentrant code....

AFAIK, use of those options means that your multithreaded program is now rather broken!

There's an overhead associated with multi-threading. Typically the amount of really independent work that can be done in parallel needs to be over a certain threshold before multithreading becomes worthwhile. Whether your notionally independent work is really independent work, and whether there's enough code to justify the overhead, can't be assessed by people not familiar with your code. If you could post examples that would help us understand. Ideally those examples would be a cut-down, self-contained and compilable program.

Why do you have /warn:none on your debug build? Outside of certain special use situations the warnings given by the compiler with /warn:all are pretty relevant. Ignore them at your peril. I typically use /warn:all on both debug and release builds.

How are you timing your runs?

Previously you wrote:

Every subroutine involved in the operation of the threads is compiled with


which (I hope) makes it run in the same thread as its caller.

Those USE statements don't change the execution of your program in a single-threaded/multithreaded sense at all. They simply make variable, type and procedure declarations available to your program in the scope that has the USE statement. Code that you subsequently write that uses those declarations then determines which thread runs what code.

The word "broken" scares me.

I know about the overhead issue and the importance of putting time-consuming code in the parallel part. In my case that code takes nearly all of the time, depending on user input, so there is potentially a lot to gain. I time my runs by calling a routine that reads the current system clock time. Then, when the test is done, it calls it again and subtracts the two. So I know pretty well how much time is required if I use serial mode or if I employ the multithread option. My whole project amounts to over 600 MB of code. I can of course zip it and make it available, but first I want to be sure that I have not made a dumb mistake.
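(For reference, a minimal wall-clock timing sketch; OMP_GET_WTIME is part of the OpenMP runtime and, unlike CPU_TIME, it measures elapsed time rather than CPU time summed over all threads. RUNTESTCASE is just a hypothetical stand-in for the code being timed.)

      USE OMP_LIB
      REAL(8) TSTART, TELAPSED
      TSTART = OMP_GET_WTIME()
      CALL RUNTESTCASE()                   ! hypothetical: the ray-trace test
      TELAPSED = OMP_GET_WTIME() - TSTART
      WRITE (*,*) 'ELAPSED SECONDS:', TELAPSED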

The subroutine that is called in the DO PARALLEL section uses only labeled common, indexed to the thread, and local variables. There are other common blocks, but the parallel routines do not write into them. Every routine that it calls, and so on, works the same way. So there are no race conditions and no routine should have to wait for any other. The order in which the threads are done makes no difference whatever. This is about as clean a piece of code (in that section) as you could want.

I'm bewildered by your comments about the USE statements. I never heard of OpenMP (or tried multicore programming) before about two weeks ago. So I'm bewildered by lots of things at this point. But I'm making serious progress: Just getting my debug version running correctly and multithreading correctly means I am very close. I tried taking those USE statements out, and the debug version ran about 10 times slower. So back in they went.

I shall put back the warn directives, however. You are correct about that. But I have a huge program and the warnings cover a great many pages. None of them are serious, mostly about jumping into and out of loops, so I thought to get rid of them. This is legacy spaghetti code like you never heard of, originating back when computers ran on vacuum tubes. A complete rewrite according to modern standards is out of the question.

Tell me, if I use the OpenMP parallel features, do I need to enable the recursive and reentrant options? I did that with the debug version, and execution was serial (judging by the timing), even though threads were created (according to the Threads pane in the debugger). When I disabled those options I got the speed improvement I was looking for. I have seen no documentation that addresses this issue. A newbie needs that kind of documentation.

Here are the data:

Debug, serial mode, test case: 1.26 seconds. Multicore with a do parallel of only one trip, 3.42 seconds. With 8 trips (for eight cores), 0.88 seconds. Nice.

Release mode: test case, serial mode, 0.28 seconds. Multithread, one trip, 0.48 seconds. Eight trips, 0.45 seconds. Not nice.

Why would the program work in debug and not in release? Here's my loop:



GO TO 8801



The above schedule(static,1) means each thread takes one iteration at a time.
Without schedule(static,1), each thread takes adjacent chunks of the iteration space. As to what the chunk size is, this depends on factors external to the !$OMP DO... and may include environment variables as well as what (if anything) you specify as the default scheduling mode (static, dynamic, ...).

When the iteration count is large, and the computation time per iteration is relatively small, then you would want to have each thread take larger counts of iterations per loop trip.
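A minimal illustration of the two forms, using the loop from the outline earlier in the thread (nRays is hypothetical):

!$omp parallel do schedule(static,1)   ! round-robin: thread k gets iterations k+1, k+1+nThreads, ...
      do i = 1, nRays
         call MTRAYTRACE(i)
      end do
!$omp end parallel do

!$omp parallel do                      ! default static: each thread gets one contiguous block
      do i = 1, nRays
         call MTRAYTRACE(i)
      end do
!$omp end parallel do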

Jim Dempsey


I tried the schedule(static,1), and it made no difference. The debug version is still faster in multithread mode, and the release version is still slower. Have you any other suggestions?

Let me supply some more details. In my test case, I am tracing almost 30,000 rays through a lens of 19 elements. The parallel loop farms the job out to a raytrace program, eight rays at a time. When they are all finished, it loads the next set and starts the DO loop all over again with another eight. So there are thousands of trips through that section of code.

I don't know if that logic sits well with the mechanism that creates threads. I gather from an earlier post that once the threads are started they stay defined and just wait for another trip through the parallel loop. If that is the case, then there should not be a whole lot of overhead with all of these trips.

Correct me if I'm wrong, please. Is that a valuable clue?

Quoting dondilworth
Let me supply some more details. In my test case, I am tracing almost 30,000 rays through a lens of 19 elements. The parallel loop farms the job out to a raytrace program, eight rays at a time. When they are all finished, it loads the next set and starts the DO loop all over again with another eight. So there are thousands of trips through that section of code.
This reads as if each thread is starting a separate program (a different exe?) - is that the case?

Not at all. The parallel DO calls a Fortran subroutine, and I want eight of those calls to execute in parallel. When they all finish, I collect the answers, prepare input for another eight, and come down to the DO statement again. Then I want the next eight to run in parallel again, and so on. That subroutine gets called almost 30,000 times. It's always the same routine, in the same exe.

How much code is executed by:


Note, I mean by the subroutine itself, not the loop including the call.

Note, from my understanding you have

repeat ~30,000/nThreads
read nThreads # rays
parallel DO nThreads # ways
CALL MTRAYTRACE(I) once per thread
end parallel DO
end repeat

If MTRAYTRACE is a relatively small work load, then the bulk of the time is waiting for the read-in.


real(8) :: t0, t1, readTime, mtrRayTraceTime(0:127)

readTime = 0.0
mtrRayTraceTime = 0.0
repeat ~30,000/nThreads
  t0 = omp_get_wtime()
  read nThreads # rays
  readTime = readTime + omp_get_wtime() - t0
  parallel DO nThreads # ways, private(t1)
    t1 = omp_get_wtime()
    CALL MTRAYTRACE(I) once per thread
    mtrRayTraceTime(omp_get_thread_num()) = &
        mtrRayTraceTime(omp_get_thread_num()) + omp_get_wtime() - t1
  end parallel DO
end repeat

write(*,*) "readTime ", readTime
do i = 0, omp_get_max_threads()-1   ! thread numbers are 0-based
  write(*,*) i, mtrRayTraceTime(i)
end do

Jim Dempsey

Hang on - 30000 iterations in 0.28 seconds? So each unit of work takes 9 microseconds? That's tiny.

The multithreading overhead would be swamping that.

You need to batch the job better. Try having each thread process something like 1000 rays (the more the better) before you collate results and restart the loop.
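A hedged sketch of that batching idea (nBatch, loadRays and collectResults are hypothetical stand-ins for however the indexed input arrays get filled and the answers read out):

      nBatch = 8000                     ! e.g. 1000 rays per thread on 8 cores
      call loadRays(nBatch)             ! fill the indexed input arrays for the whole batch
!$omp parallel do schedule(static)
      do i = 1, nBatch
         call MTRAYTRACE(i)
      end do
!$omp end parallel do
      call collectResults(nBatch)       ! read out and process all the answers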

These are all cogent comments, but consider this: If I run a simpler problem, with the same number of rays but a smaller lens, then the time for overhead is exactly the same but the time in the raytrace is much shorter. Then the execution is indeed much faster. That tells me that, for a complicated problem, the raytrace is indeed taking most of the time, as it should for parallel processing to buy anything.

The fact that the debug version runs faster in multithread mode than in serial also tells me something. That version works. Why would the release version not? That's the whole issue, and so far none of the responders have addressed it.

What we are suggesting you do is (outline)

Reduce the number of { !$omp parallel }
Reduce the number of { !$omp do }
Increase the ratio of {doWork : $omp }
{!$omp end do}
{$omp end parallel}

Presumably all your input data, regardless of lens size, is present. You stated you read this from a file. Therefore, assume you were reading in 1 ray per thread and had 8 threads (for the sake of argument). We are suggesting you:

100x {read in 1 ray per thread for 8 threads}
!$omp parallel do schedule(static,1)
do i=1,totalNumberOfRaysRead
  call traceRay(i)
end do
!$omp end parallel do

Or, doing the partitioning by hand:

100x {read in 1 ray per thread for 8 threads}
!$omp parallel
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
!$omp do schedule(static,1)
do i=1,100
  j = i*nThreads + iThread   ! this thread's ray index; trace ray j here
end do
!$omp end do
!$omp end parallel


double buffer your input (overlap reading with processing)

>>The fact that the debug version runs faster in multithread mode than in serial also tells me something.

Debug mode increases the processing time of your doWork(), and therefore increases the ratio of doWork : $omp.

Jim Dempsey

This is very interesting. You suggest that instead of doing a loop of eight passes, one ray each, all in parallel, I change the bookkeeping so I do a loop of 800 passes, and execute that 1/100 as many times as I do now. But if that is the case, then since I have only eight cores, each thread would have to trace 100 rays instead of just one. So each thread would be running its 100 in serial, not in parallel.

Yes, this could potentially work, but it implies that the overhead of starting a parallel DO is huge and you want to do as much as you can while you are in there. That's the interesting part.

I'll give it a try, and let you know what I find.

Here's what I got when increased the DO loop count from 8 to 100, in release version:

Serial mode only: 0.246 seconds
Parallel mode: 0.57 seconds

So the release version still takes longer if I run in multithread mode. In fact, those numbers are very similar to what happens with a loop count of 8. I don't see much effect from increasing it.

Even worse, the debug version now takes longer in parallel mode too:

Serial: 2.6 seconds
Parallel: 2.9 seconds

Recall that, with a loop count of 8, the numbers were 1.26 and 0.88. I don't know why the serial mode also takes longer; maybe because of larger array sizes and more memory faults. I see lots of disk activity when it runs.

If I run a very simple problem in release mode, where the loop count is the same but the time in the raytrace is very small, I get this:

Serial: 0.031 seconds
Parallel: 0.135

So the total overhead time is no more than 0.1 seconds. That's less than the time lost in the real parallel case. So the difference cannot be from overhead. Is Windows scheduling things in a way that one thread goes slowly? That would do it, since the program has to wait for them all to finish before moving on. The Resource monitor says that I have 103 threads going, while System has 157.

I would suggest using a simple loop to run 30000 times.
What does your routine do if a ray doesn't trace?
Are you splitting rays (for ghost image analysis etc.)?
What types of curved surfaces are you trying to trace?
Are you sure you have the most efficient algorithm for finding intersections for each type of surface?
What about multiple intersections etc. for surfaces of degree >= 2?
What type of transformation matrices are you using to transform from one surface's local frame to the next surface?
Are you using 3-vectors or 4-vectors (to do rotations and shifts)?

I could go on...but will stop here!

Of course all of those are considerations when one writes a code like this. But they were all addressed many years ago, and the code now is already much faster than the competition's. So those issues really do not bear on the present situation, which I have repeated many times above: Why does the release mode work more slowly in multithread mode than in serial mode? All my tests indicate that the bottleneck is not in the details of the lens or the raytrace algorithm, since the problem persists with all degrees of complexity. It appears to be some kind of scheduling problem in Windows -- which perhaps holds up one or more threads while the others are long finished.

If my guess is correct, and that is the problem, how does one fix it?

Here's a hypothetical explanation - details depend on implementation that I'm not familiar with in the slightest, but at least it should give you some ideas.

At the end of the parallel do construct your program will sit around and wait for all the threads to catch up. If seven of your threads got there more or less immediately, but the eighth was pre-empted (something else in the system got that core for the timeslice) then the clock face time will be governed by the time that it takes for that eighth thread to be rescheduled and finally complete.

Crudely (it depends on how the synchronisation in your program works, what else is happening in the system, what other threads are doing, etc) scheduling will be on the basis of timeslices. A timeslice on Windows is of the order (it varies) of 10 milliseconds. If your unit of work is only 9 microseconds, then the potential overhead in synchronising the threads at the end of the loop could be rather large - you have eight threads, but for 99 percent of their time they sit around waiting for their colleague to finally get its work done. In the serial case, the one core just keeps plugging on, and with eight cores on the system it's not likely that it will get preempted, ever.

(Even if the pre-empted timeslice aspect is a bit of a ruse - the message is the same - you want to minimise the amount of synchronisation that needs to happen - hence the suggestion to create much larger batches that can be processed independently.)

Note that it took almost three pages of forum posts to flush out some of the necessary detail. If you had a cutdown variant of your program that demonstrated the problem things would be progressing much faster. I'm still not clear on the relevant structure of your program.

Other notes:

- How many "real" cores do you have? Is it eight? Or is it four, with hyperthreading?

- For some problems the limiting aspect is memory access. Multithreading doesn't help much here - in fact it can make the problem worse.

- The total runtime of your program is still quite short. Could you construct an example that had a runtime of say 30 seconds or more? When a process starts in Windows there is a whole heap of things that go on - having a longer run time should eliminate that overhead. That will also help you understand whether the disk access that you are seeing is due to your program's calculations (if it is then multithreading is far less likely to help because disk access is so slow...).

Yes, it's a messy problem. This program has been under development since the early 60s, and is probably older than many of the people posting here. (First attempts ran on vacuum tubes!) The source package is around 650 MB, and it's not practical to repackage it in a digestible size. I wish I could. I can zip it up and send the whole Geschäft to anyone brave enough to get into it, if that will do the trick. I understand that finding a problem requires a manageable test bench.

I have eight real cores. I am not familiar enough with the other things the OS is doing to venture a guess as to the interactions. I was hoping that the writers of OpenMP would have sorted those things out so that their product would work reliably on many systems. If the details of the OS are critical to the success of my project, then just dealing with what's happening on my PC -- even if successful, will not help my customers, who have other setups. This has to work all the time.

I will see about getting an example that runs for a much longer time. That is a wise suggestion.

I see that one of the schedule options is STATIC, CHUNK. But I have not seen any definition of what a chunk is. A single machine instruction, timeslice, trip through a subroutine, loop cycle, or what?

I do not think that all of the disk accessing I see is due to my program. Looking at the Resource Monitor, I see lots of disk activity all the time, whether I am running my code or just sitting there. Looks like Windows is doing its own homework or something. Anyway, I have 16 gigs of real memory, and I assume that the virtual-memory OS will keep all of my code and data in memory or cache, so there should be no disk overhead on my account.

Here's a clue: If I run my test case and run 10 trips at a time through the DO loop, in debug mode it takes 3.3 seconds. The Task Manager shows that four of my cores are doing nothing, while the other four peak up. If I run eight trips (hoping for one core per trip) all eight of the cores peak up. But the run still takes 3.3 seconds. Those figures are somewhat chaotic; lots of runs produce lots of numbers, some requiring more, but rarely less, time. Does this make sense?

I've tried all of the schedule options: static, dynamic, guided, runtime, and auto. They all score about the same, except the last two, which are slightly slower.

But I'm running out of things to try. My numbers show no benefit from OpenMP, in spite of weeks of effort and patient advice from many experts on this (multithreaded!) forum. Is it worthwhile to continue?

>>But I have not seen any definition of what a chunk is

!$omp parallel do
do i=1,32000
  {with 8 threads, the first thread iterates i = 1:4000}
  {the second thread iterates i = 4001:8000, and so on}
end do
!$omp end parallel do

Each thread's iteration range is determined at the beginning of the loop.

In the above, the schedule defaults to static and the "chunk" is the number of iterations divided by the number of threads.

The above default works fine when

a) the amount of work per iteration is equal
b) the availability of threads is equal (other apps may be running and taking "your" time).

When work load or thread availability is at issue, you can program using dynamic (or guided) scheduling.
Dynamic scheduling partitions the iteration space into chunks smaller than n/nThreads and hands them out first-come, first-served as threads become free; with guided scheduling the chunk size shrinks as the remaining iteration space shrinks. (You can specify a starting/minimum chunk size.)

schedule(static, nnn)

says that each thread takes chunks of nnn iterations at a time (or whatever remains at the end), dealt out round-robin at the start of the loop.

You choose the technique that best serves your purpose.
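For instance, a dynamic schedule with a modest chunk size looks like this (the chunk of 100 is only an example; MTRAYTRACE and nRays are the names used earlier in the thread):

!$omp parallel do schedule(dynamic,100)   ! chunks of 100 iterations, handed out as threads become free
      do i = 1, nRays
         call MTRAYTRACE(i)
      end do
!$omp end parallel do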

>>> If I run eight trips (hoping for one core per trip) all eight of the cores peak up. But the run still takes 3.3 seconds

Then your serial code is taking the bulk of the time...
or your parallel code contains calls that internally serialize (for most of the processing time)

Do you have a profiler (Intel's VTune)?
If you do not, then you can get a 30-day trial version from Intel
Or... try AMD's Code Analyst, which can perform Timer-Based analysis on Intel processors.

Use a timer-based analysis (as opposed to event- or instruction-based).

Jim Dempsey


I want to thank you for the helpful reply. I found the comment "...parallel code contains calls that internally serialize" interesting. What kind of calls will internally serialize? I call my subroutine, it calls others, and so on. All of them change only local variables, and a given trip through the DO loop, with its string of calls, is supposed to be serial. I never planned on any further division of labor, once that trip starts. But I want lots of those trips in parallel.

I tried the suggestion to run a case that takes a great deal more time. That was telling. Here's what I got:

Serial mode: 12.2 seconds
Multi, DO index 1 to 1: 15 seconds
1 to 2: 9.6
1 to 3: 7.3
1 to 4: 7.7
1 to 5: 7.1
1 to 6: 6.7
1 to 7: 6.8
1 to 8: 17.8
1 to 9: 6.9
1 to 10: 6.6

This is in debug mode. Now it seems that the OpenMP feature is giving me a 50% speed increase! Wow.

But why on Earth would making the number of trips through the DO equal to the number of cores take longer than any other case? Eight cores, eight trips, takes 17.8 seconds. Ugh. Does that make sense?

I'll test the release version next. Stay tuned.

I made an even slower example and ran it in release mode. Here is what happened:

Loop count = number of threads = N

Serial: 2.63 seconds
N = 1: 2.02
N = 2: 1.9
N = 4: 2.1
N = 8: 2.3

Loop = 10, threads = 2: 1.9
Loop = 10, threads = 8: 2.3
Loop = 10, threads = 10: 2.15

So the best case was using only two threads, not eight as I would have guessed. If I watch the Task Manager, I can see that, for this case, five of my eight cores were busy. Weird.

>>What kind of calls will internally serialize?

allocate/deallocate (even implicitly via temporary array)
rand (or any one at a time random number generator)
file I/O
screen I/O

>>1 to 8: 17.8 ~2x

This does seem odd

I am assuming you are using schedule(static,1) on a system with 8 hardware threads.

Note also that 9 and 10 do not take additional time???

Is the time you list

a) the time per thread per iteration
b) the time of the parallel do loop
c) the time of the outer loop (~30000) containing the parallel do

Jim Dempsey

>>Loop = 10, threads = 10: 2.15

So the best case was using only two threads, not eight as I would have guessed. If I watch the Task Manager, I can see that, for this case, five of my eight cores were busy. Weird.

Do not configure OpenMP for more threads than you have hardware threads.
An 8 core processor will have 8 hardware threads without HT, or 16 hardware threads with HT
(more if you are on MIC).

module mod_foo
  real :: array(30000)
end module mod_foo

program scaling_test
  use mod_foo
  use omp_lib
  implicit none
  integer :: i, nThreads, iMaxThreads
  real(8) :: t0, t1

  array = 0.5
  iMaxThreads = omp_get_max_threads()
  ! if the above does not work due to a missing function in the library,
  ! this gets the same number from inside a parallel region:
  !$omp parallel
  iMaxThreads = omp_get_num_threads()
  !$omp end parallel

  ! time the same loop with 1, 2, ... iMaxThreads threads
  do nThreads = 1, iMaxThreads
    t0 = omp_get_wtime()
    !$omp parallel num_threads(nThreads)
    !$omp do
    do i = 1, size(array)
      call doWork(i)
    end do
    !$omp end do
    !$omp end parallel
    t1 = omp_get_wtime()
    write(*,*) nThreads, t1-t0
  end do
end program scaling_test

! dummy do-work subroutine
! *** written so that it does not get optimized out by the compiler
subroutine doWork(i)
  use mod_foo
  implicit none
  integer :: i
  integer :: j
  do j = 1, size(array)
    if (sqrt(array(j)) .eq. real(i)) array(i) = array(i) + 1.0
  end do
end subroutine doWork

Note, the above is memory intensive as opposed to compute intensive, so performance will drop after 3-4 threads. You can add compute statements to the loop, but take care that the compiler does not remove the code if it decides the results are not used.

See how the above performs.

Jim Dempsey

This is more or less what I experience: too many threads slows things down. Anyway, it's time to move on. I can get a slight improvement in speed with some problems, but a decrease with others. I conclude that OpenMP is a nice idea -- but one that does not work very well for the kind of problems I have. Had to try it, and I'm glad I posted in this forum. I appreciate all of the intelligent suggestions made by the posters. But there is no simple solution that works well and reliably. I'll use the feature in the cases when it works, and not worry about it otherwise.

Thank you, everyone.

>> I conclude that OpenMP is a nice idea -- but one that does not work very well for the kind of problems I have. Had to try it, and I'm glad I posted in this forum.

I am glad you gave it a try.

If compute time is a serious concern, then I would suggest you have a more experienced programmer look at the application as a whole. I expect that they will see something that you have been unable to convey to us on this forum.

Jim Dempsey
