Parallel processing much slower?

Posted by dajum

I have a code that I set up as follows

       SUBROUTINE OPI

!$OMP PARALLEL SECTIONS NUM_THREADS(2)

       CALL OPER

!$OMP SECTION

       CALL SUB

!$OMP END PARALLEL SECTIONS

       RETURN

This is the basic structure. I use a module with volatile variables to communicate between the two threads. SUB has a DO WHILE loop that runs until OPER tells it to quit. To test it I don't have SUB doing anything other than looping, so none of the flags change except the one that tells it to quit. All of the real computations are done in OPER. This takes about 100 seconds to run. If I run it without the parallel sections, it takes 81 seconds. Where do I look for all this overhead? Once I actually have SUB doing real work I expect it to happen in parallel with OPER, but the overhead is wiping out any improvement I can expect.
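
Schematically, the communication looks like this (simplified, with made-up names):

       MODULE FLAGS
         LOGICAL, VOLATILE :: QUIT      = .FALSE.  ! set by OPER when finished
         LOGICAL, VOLATILE :: HAVE_DATA = .FALSE.  ! never set in this test
         LOGICAL, VOLATILE :: BUF_FULL  = .FALSE.  ! never set in this test
       END MODULE FLAGS

       SUBROUTINE SUB
       USE FLAGS
       DO WHILE (.NOT. QUIT)
         IF (HAVE_DATA) THEN
           ! would write buffered data here
         END IF
         IF (BUF_FULL) THEN
           ! would drain the buffer here
         END IF
       END DO
       RETURN
       END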

Thanks!

Dave

Posted by jimdempseyatthecove

Can you show the code?

www.quickthreadprogramming.com
Posted by dajum

What part would be relevant? OPER is hundreds of lines calling thousands of routines. SUB is a DO WHILE loop with two IF tests that always evaluate to false right now, until the last line of OPER sets the DO WHILE condition false and SUB ends.

Posted by dajum

Jim,

The code is almost the same pattern as you suggested in this thread: http://software.intel.com/en-us/forums/topic/299766

But my goal is to buffer data in OPER that is written out in SUB. In my testing right now, though, nothing is written to the buffer and no output is done, hence the two IF tests evaluate to false in my code. The flags in use are in a module and all have the VOLATILE attribute. But why this threading makes it run 25% longer is puzzling me.

Dave

Posted by dajum

Interesting. I put a CALL SLEEP(1) inside the DO WHILE. Execution time: 84 seconds. Why does it matter so much if the thread goes to sleep?

Posted by IanH

From your description, the amount of work that SUB has to do depends on how long it takes OPER to finish ("SUB has a DO WHILE loop that goes until OPER tells it to quit").  In the serial case, doesn't that mean SUB does nothing - OPER has already finished?  If so, that means the two cases are far from equivalent. 

A spinning DO WHILE loop will tie up a core (in the absence of the compiler working out that the loop does nothing and eliminating it). If you put a sleep inside the loop then the core becomes available for work by other threads in the system - in the case where you have fewer cores than active system threads (how many cores do you have?) that could affect the number of timeslices given to the thread running OPER.

Perhaps I've misunderstood, but if not consider giving SUB a fixed unit of work.

Making a shared variable volatile is not on its own enough to avoid data race conditions and/or guarantee a consistent view of the variable between threads.  Based on your description I expect you would need explicit synchronisation and flush operations.  How you have those arranged can also make a significant difference to execution time.
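
For example, a flag hand-off is usually arranged along these lines - a sketch only, with made-up names; note that both the data and the flag get flushed, in this order:

       PROGRAM FLUSH_DEMO
       IMPLICIT NONE
       REAL :: payload = 0.0
       LOGICAL :: data_ready = .FALSE.
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
       payload = 42.0                ! producer: fill the data first
!$OMP FLUSH(payload)
       data_ready = .TRUE.           ! ...then publish the flag
!$OMP FLUSH(data_ready)
!$OMP SECTION
       DO                            ! consumer: spin on the flag
!$OMP FLUSH(data_ready)
         IF (data_ready) EXIT
       END DO
!$OMP FLUSH(payload)
       PRINT *, 'got', payload       ! safe to read the data now
!$OMP END PARALLEL SECTIONS
       END PROGRAM FLUSH_DEMO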

Posted by dajum

Yes, in the serial case SUB does nothing. But I didn't expect the two threads to compete for execution time. Isn't that the point of having separate threads doing separate work on a multicore machine? Does it matter if the work is just spinning in a loop or actually doing useful processing and output? I have an i7 Q720 processor (4 cores, 8 threads), so I expected the two threads to run on different cores. Do I have to do something special to make that happen?

In a real case I expect the SUB thread to actually get data to process, such that the SUB thread will have some variable fraction of the work - in a serial run it is as much as 20-40% of the total time. I expected to be able to reduce the total execution time by putting most of that effort in a second thread. Sitting and spinning seemed to be the best solution. Is there some mechanism to wake up the second thread when the first thread has data ready that would be a better idea?

I have built-in flags to handle the synchronisation, and that seems to work fine, but the process is slower than doing the work in serial code. So this was a test to try to determine why. But it just seems to raise more questions for me.

Any pointers to references that explain the details of the overhead would be appreciated.

Posted by IanH

The sleep response could be explained by your threads sharing the same physical core.  I think default thread affinity depends on what the operating system had for breakfast.  Out of my domain, others will know better. 

In the meantime open a command prompt, then:

set KMP_AFFINITY=verbose,scatter

then run your program in that command prompt and see what happens.  This should force threads to different physical cores and give you some diagnostics as well.
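
That is, the full sequence is just (myprog.exe standing in for your executable name):

set KMP_AFFINITY=verbose,scatter
myprog.exe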

This may be completely unrelated to your problems, but if by "built in flags" you mean ordinary Fortran variables in conjunction with ordinary Fortran statements, then it is probable that your synchronisation is not formally well defined.

Posted by dajum

I changed the affinity, but it didn't really make any difference. Run time was 102 seconds without any SLEEP.

My flags are all variables in a module, but they are configured so that only one thread writes any given variable and the other thread only reads it. I think that makes it well defined, if that is your meaning. Otherwise, could you clarify what you mean by "synchronisation is not formally well defined"?

OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}

Posted by jimdempseyatthecove

>>But my goal is to buffer data in OPER that is written out in SUB.

Do you intend to run your code with OpenMP Nested enabled?
(i.e. does OPER contain !$OMP PARALLEL...?)

Is OPI called from within a parallel region?
Is OPI called only once or many times?
Is the results data produced by OPER large or small?
What will the ratio be of time spent by SUB versus time spent by OPER?

The concept in the mentioned link is for a technique to overlap writes (or reads) with work.
And this is recommended only when the writes (or reads) are a significant portion of the work.

The construction of your code in this forum thread is non-overlapped (no advantage to parallelization).
Not knowing about your code it is difficult to recommend a technique.

Presumably OPER has a loop. If so, can the partial results be written (SUB) as they are accumulated (i.e. on/after each iteration)?
If OPER has a loop, then you have at least two different strategies:
a) single buffer (one results buffer, one intermediary copy for writing)
b) double buffer (two results buffer, no intermediary copy for writing)

Method a) is often easier to implement but has the overhead of copying data in memory.
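
To illustrate b), a minimal double-buffer sketch - all names are placeholders; each buffer's FULL flag is guarded by a lock, and the OpenMP lock acquire/release supplies the needed flushes. Note it relies on the two sections actually running concurrently (a serial build would spin at batch 3):

       MODULE DBUF
       USE OMP_LIB
       IMPLICIT NONE
       INTEGER, PARAMETER :: NB = 1024, NBATCH = 10
       REAL :: BUF(NB,2)                  ! two result buffers
       LOGICAL :: FULL(2) = .FALSE.       ! buffer state, guarded by LCK
       INTEGER(OMP_LOCK_KIND) :: LCK(2)   ! one lock per buffer
       END MODULE DBUF

       PROGRAM DEMO
       USE DBUF
       IMPLICIT NONE
       INTEGER :: I
       OPEN(10, FILE='results.dat', FORM='UNFORMATTED')
       DO I = 1, 2
         CALL OMP_INIT_LOCK(LCK(I))
       END DO
!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
       CALL PRODUCER                      ! plays the role of OPER
!$OMP SECTION
       CALL CONSUMER                      ! plays the role of SUB
!$OMP END PARALLEL SECTIONS
       DO I = 1, 2
         CALL OMP_DESTROY_LOCK(LCK(I))
       END DO
       CLOSE(10)
       END PROGRAM DEMO

       SUBROUTINE PRODUCER
       USE DBUF
       IMPLICIT NONE
       INTEGER :: IB, K
       LOGICAL :: OK
       DO IB = 1, NBATCH
         K = MOD(IB-1, 2) + 1             ! alternate between the two buffers
         DO                               ! wait until buffer K is drained
           CALL OMP_SET_LOCK(LCK(K))
           OK = .NOT. FULL(K)
           IF (OK) THEN
             BUF(:,K) = REAL(IB)          ! "compute" batch IB into buffer K
             FULL(K) = .TRUE.
           END IF
           CALL OMP_UNSET_LOCK(LCK(K))
           IF (OK) EXIT
         END DO
       END DO
       END SUBROUTINE PRODUCER

       SUBROUTINE CONSUMER
       USE DBUF
       IMPLICIT NONE
       INTEGER :: IB, K
       LOGICAL :: OK
       DO IB = 1, NBATCH                  ! batch count is known here; a real
         K = MOD(IB-1, 2) + 1             ! code would use a "done" flag
         DO                               ! wait until buffer K is filled
           CALL OMP_SET_LOCK(LCK(K))
           OK = FULL(K)
           IF (OK) THEN
             WRITE(10) BUF(:,K)           ! write the completed batch out
             FULL(K) = .FALSE.
           END IF
           CALL OMP_UNSET_LOCK(LCK(K))
           IF (OK) EXIT
         END DO
       END DO
       END SUBROUTINE CONSUMER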

Additional information will be required before we can make recommendations.

Jim Dempsey

www.quickthreadprogramming.com
Posted by dajum

Yes, OPER also has parallel regions. OPI is called only once. The ratio between SUB and OPER isn't known in advance, as it can vary widely: at times SUB will be greater, and at times OPER will be greater. But the targeted cases where I'm really looking to reduce overall duty cycle will have SUB using about 25% of the processing time in a serial run. The amount of data can exceed 1 GB, and most of the work in SUB is writing data.

I'm not sure why you think it has no advantage to being done in parallel. It is intended to do the same overlap of writing data with work. I have used your method a), as my code isn't structured to use b).

Posted by IanH

The OpenMP spec says "...if at least one thread reads from a memory unit and at least one thread writes without synchronization to that same memory unit ... then a data race occurs.  If a data race occurs then the result of the program is unspecified."  Beyond that, there's the need to ensure that your threads view of any shared variables is consistent.

I still think that the simple explanation is that your spinning DO WHILE loop is not equivalent to your serial case.  Again, out of my domain in terms of what happens at a hardware level, but I could imagine that the continuous access of a shared flag variable in the loop introduces reasonable overhead.

Discussion of this sort of topic is difficult without specific code examples (e.g. depending on the number of threads you start inside OPER you can be back in the situation of having fewer physical cores than threads wanting to run). Based on what I think you are trying to do, I've attached an example that uses OMP locks. I think this is formally correct, but note that I really only dabble in OpenMP - I can usually get myself into enough trouble when running things serially that I only contemplate running in parallel when I wish for almost certain disaster. For this specific example (noting that there are ten batches, sub and oper take about the same time per batch, and the output from the batches is somewhat distant in memory from batch to batch) the parallel case leaves the serial case choking in its dust. Vary things and you can eliminate or even slightly reverse that relative performance.

OMP tasks might also be suitable for this - I've used them successfully in an application which also has producers and consumers of data, but with more complicated dependencies between actors.

Attachments:

Download 2013-01-23-batchhandling.f90 (2.84 KB)
Posted by dajum

Ian,

Thanks for the code. To execute the parallel version am I supposed to edit all the !$ to be !$OMP, or is there some other way to do this?

BTW, in my test case there are no other parallel constructs, so I expected what happens in SUB not to really matter. Why it does is what I don't understand. I sort of expect the two threads to operate independently, with the elapsed time depending only on how long OPER takes to run. But it appears the SUB thread keeps the OPER thread from executing continuously. If it did execute continuously, it should run in the same time as the serial case. It must be stalled for some reason, and I'd like to understand that reason.

Dave

Posted by IanH

!$ is the OpenMP free-form source conditional compilation sentinel - see 2.2.2 of the OpenMP 3.1 spec. Those lines are comments if OpenMP is not enabled; they are normal statements if OpenMP is enabled. You shouldn't need to edit the source - it should (hopefully) compile and behave similarly regardless of OpenMP.
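
For example (a trivial illustration):

       PROGRAM SENTINEL_DEMO
       IMPLICIT NONE
       INTEGER :: n
       n = 1                  ! serial default
!$     n = 2                  ! becomes a statement only when OpenMP is on
       PRINT *, 'n =', n      ! prints 2 with /Qopenmp, 1 without
       END PROGRAM SENTINEL_DEMO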

Posted by dajum

Ian,

I don't think I'm getting the same results as you. Running each case a number of times, I saw elapsed times of .076-.093 seconds for the serial cases.

For the parallel cases: .056-.155 seconds. The parallel case has much wider variability, and most of the time it was much slower than the serial case, though once it was faster - which just makes no sense to me. In one trial it even did all the oper cases before the sub cases, and that run took .09 seconds. Is this what you saw?

Posted by IanH

No, but my hardware is nowhere near as capable as yours. Make the number of iterations (count) bigger, perhaps by a factor of 100, and then perhaps set the affinity to scattered and see what happens.

Posted by Tim Prince

If only one thread is doing work, as appears to be implied, OpenMP could be expected to slow it down. Reducing the value of KMP_BLOCKTIME might make a difference, as you are discussing times of that order of magnitude.
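
For example, at the command prompt before running (the value is in milliseconds; 0 sends idle OpenMP threads to sleep immediately instead of spin-waiting):

set KMP_BLOCKTIME=0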

It might help if you would state what you are trying to learn from this thread; but you've already declined to answer questions leading in that direction.

Posted by dajum

My code behaves like the run times I get for many of the runs of Ian's code (.155 seconds parallel versus .09 for serial). Doing work in parallel makes it take longer than doing it serially, and I don't understand that. What I'm trying to learn is why that happens. When I start two threads - OPER, which does a bunch of work, and SUB, which just loops and stops when OPER stops - there are no waits or stops or any flags changing between the two. Why does that take 25% longer than just calling OPER? If the threads are on different cores, why isn't the difference just the time it takes to get the two threads running? What makes the second thread slow the first one down so much? Yet making it sleep lets the other thread run faster. What is interrupting the OPER thread? If the SUB thread didn't slow down the OPER thread, I think it should take just a small fraction longer than OPER running alone.

I've tried to answer every question posed. I'm sorry if I missed something, but I don't see what it is, other than showing the code. I thought the snippet I posted was the relevant part.

Posted by Tim Prince

In the example you posted, doesn't OPER get executed by both threads (watch out for races), followed by SUB being executed by one thread, followed by a wait for KMP_BLOCKTIME timeout?

Posted by dajum

I don't think so. I read the SECTION documentation as "Specifies one or more blocks of code that must be divided among threads in the team. Each section is executed once by a thread in the team." So I think OPER is in one thread and SUB is in another, each getting executed once, concurrently. Since I don't do anything in either thread that makes the other wait at this point, I don't understand why OPER takes 25% longer just because another thread is running. And when I add a SLEEP(1) per internal loop in the SUB thread, OPER only takes 5% longer. If they are on different processors, and not causing any waits between the threads, I don't understand what is happening that causes this difference. The same goes for Ian's code: can you explain what makes it take .155 seconds when it runs in parallel mode? I understand the cases where it gets .056, but not .155. I don't see any explanation for that behavior. It's like the code decides it just needs to wait on something else. What is the something else? Is it just Windows letting other processes push it out? That seems strange, since the variability of the serial case is much tighter.

Posted by IanH

Your understanding (that there is an implicit section directive after the parallel sections directive) is the same as mine.
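
That is, the code at the top of the thread is equivalent to writing the first directive explicitly:

!$OMP PARALLEL SECTIONS NUM_THREADS(2)
!$OMP SECTION
       CALL OPER
!$OMP SECTION
       CALL SUB
!$OMP END PARALLEL SECTIONS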

When I tried my example code (with larger iteration counts) on a four-physical-core Windows 7 machine with hyperthreading (8 logical cores), I got similarly variable/unexpected results to yours. On my ancient two-physical-core, no-hyperthreading Vista machine I see consistent speedup in the OMP case.

The call to random_number might be (i.e. I don't know) invoking additional synchronisation in my example that I didn't count on.  More tomorrow.

Posted by jimdempseyatthecove

IanH,

I took the liberty to modify your program. I haven't tried all permutations of compile options.

Change summary:

Changed name of "sub" to "IOsub" to reflect purpose of subroutine.

Made Master thread call IOsub, (all) other thread(s) call OPER.

Outer level runs with all threads (you can set this to 2 threads if you want to enable nesting).

Locks are set around the region of work for each OPER worker thread that will perform work in OPER.

Moved RANDOM_NUMBER out of inner loop (both in IOSUB and OPER)

Added nThreads = omp_get_num_threads(); iThread = omp_get_thread_num()

The DO ib loop in OPER now iterates

DO ib=iThread, batches, nThreads-1

When more than 2 threads are in use, each non-master thread takes interleaved steps over the batches.

Note, the modifications assume all computation results are maintained by the application; IOsub outputs each batch as it is completed. The real application may not wish to hold all results (batches); in that event, double buffering or n-buffering could be considered.

Jim Dempsey

Attachments:

Download batchhandling.f90 (3.57 KB)
www.quickthreadprogramming.com
Posted by jimdempseyatthecove

I forgot to mention.

Should you want to use nested parallelism, then set the number of threads on the outer level to 2. For test purposes, modify the OPER DO i=1,count loop to have !$OMP PARALLEL DO SHARED(ib,count,r,rnd)

Jim Dempsey

www.quickthreadprogramming.com
Posted by dajum

I took the program and made it work more like mine. There are still some differences, but I don't think they are critical to understanding why this works the way it does. This version runs in 30 seconds for me when it is serial, but in parallel it takes 48 seconds. I adjusted the counts to make the serial timing interesting. It basically fills the array in OPER and writes it out in SUB. Can anyone explain why the parallel implementation is so slow?

Jim I'll take a look at your code next. Thanks!

Attachments:

Download bh2.f90 (2.48 KB)
Posted by jimdempseyatthecove

On my system (Core i7 4-core w/HT), with no thread limitation on the outer level (i.e. 8 threads):

 IOsubs at count/4
OpenMP stubs (sequential code) 1.29s
OpenMP parallel code .44s

IOsub at count/100
OpenMP stubs (sequential code) 1.085s
OpenMP parallel code .285s

Jim Dempsey

www.quickthreadprogramming.com
Posted by Steve Lionel (Intel)

Have you tried running this under Intel VTune Amplifier XE to look for thread and lock contention, etc.?

Steve
Posted by dajum

I actually demo'd it a few months ago to look at this same code. It wasn't useful then for figuring out what was happening, and I spent a few days trying to get it to help me. I asked for support a couple of times and got suggestions, but in the end I switched to GlowCode and found the bottleneck in a few minutes. I'm not a fan of the mechanics of how VTune works, so I don't own a copy of it.

Posted by dajum

Jim,

I tried your code.  It just crashes if I set OMP_NUM_THREADS=1.  Does it work that way for you?

Compiled with /Qopenmp it runs in .099 - .192 seconds

Without /Qopenmp .048-.064

So I see it as much worse running in parallel.  

Posted by jimdempseyatthecove

dajum,

Programming error. You do not want me to do all your work for you...
See fix attached.

Jim Dempsey

Attachments:

Download batchhandling.f90 (3.73 KB)
www.quickthreadprogramming.com
Posted by dajum

Jim,

I wasn't really worried about the crash. I'm still at the same point I was at the very beginning: I don't understand why it runs slower in parallel. I tested this one too, and basically I think the serial version is faster; this one was a little closer. So why doesn't the same code you run do the same thing for me? You seem to think it is faster in parallel; it definitely is not for me. Yet my machine appears to be much faster than yours. Any idea why that might be so? Did you compile with arguments other than /Qopenmp? I used /Qopenmp /O3. Every test case I run has the same basic characteristic: parallel is slower. Something other than the code must be causing that.

Anyone?

Posted by IanH

bh2 has the "issue" with data races that I was talking about before - you have writes to `index` and `done` potentially going on in parallel with reads of `index` and `data` (the IF statement). For your program to be formally well defined you need to synchronise those. Similarly, there's a formal requirement to ensure that the "view" of the shared variables is consistent between threads.

I qualify with "formally" because the compiler's generated code and your hardware may "happen" to achieve both of those aspects practically (I don't know - I don't play at that level - whenever the debugger throws up disassembly I have to go and have a good lie down).  Even if those requirements are being met practically, that will come at some cost in terms of execution speed.

My attempt at fixing those aspects (and then perhaps breaking other things) is attached. Measured over four runs on my two-core, no-HT system, the OMP case is faster by about 15% in a release build. On a four-core HT system, similar. Disabling HT on the four-core machine, the OMP run is 40% faster than serial (serial runtime is about the same HT to no HT). (I'm seeing occasional significant variability in both serial and OMP cases on the four-core machine, which may be due to background system activity, disk buffering, etc. - perhaps it's also possible it's due to a programming error.)
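
The sort of synchronisation I mean, schematically (OpenMP 3.1 atomics; local_index is a made-up local):

! producer thread publishes the new value
!$OMP ATOMIC WRITE
       index = ib

! consumer thread takes a consistent snapshot
!$OMP ATOMIC READ
       local_index = index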

Attachments:

Download bh3.f90 (2.5 KB)
Posted by jimdempseyatthecove

dajum,

I will be out of my office today, will look further and at bh3.f90.

In my tests I ran in parallel and with OpenMP stubs; I did not run a build compiled without OpenMP.

My system has one socket. I do not know if that is affecting the issue; it shouldn't.

Jim Dempsey

www.quickthreadprogramming.com
Posted by dajum

Ian,

Isn't this a non-critical race situation? Only one thread modifies the data. There is no possibility that it can ever end up at a non-deterministic value, as I understand it, so there can't be an impact on the final results from this. It may result in the thread reading the data having to make another loop, but I don't see that as an issue in the big picture.

But obviously your code and Jim's code are formally correct, yet I see the same behavior of the parallel versions running slower than the serial case. What exactly causes that to occur? This is the big picture in my mind: why should a parallel version ever take longer than the serial version? I don't understand this point.

Posted by dajum

Ian,

I ran your bh3.f90 code. For me it takes twice as long to run the parallel version as the serial version. On a co-worker's machine it took 2.5 times longer to run the parallel version. However, both these machines are laptops (Windows 7). I then ran it on a desktop machine - also a quad core, but only a Xeon E5405 (running Vista 64). That machine had the most consistent run times of all, with almost no spread in the elapsed times (.1 seconds), and was consistently 6% faster for the parallel version. So it would appear something is happening with parallel code on a laptop. All are Dell computers. Know anything about that?

Posted by jimdempseyatthecove

Put a break point in (but prior to the loop in) both OPER and SUB (IOsub). Run the parallel build and verify that you are a) running in two threads, and b) that both threads are on separate cores/HW threads. Item b) should be observable by starting the performance monitor and watching the CPU utilization during the run.

On the notebook and the single E5405 all threads will share the last-level cache (L3); on a multi-socket machine you may be running in different sockets, in which case a memory-intensive run could take longer (however, this should be the case for the 1st iteration of your inner loop (DO i)).

*** this assumes the RANDOM_NUMBER call has been moved outside the DO i loop. If you have not moved it outside the loop, then both (all) threads are passing through a critical section on each iteration of their respective DO i loops (as opposed to once per batch).
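
I.e., hoisted so the critical-section cost is paid once per batch, along these lines (r, rnd, count as in the attached code; the loop body is just a stand-in):

       CALL RANDOM_NUMBER(rnd)     ! once per batch; RANDOM_NUMBER serializes internally
       DO i = 1, count
         r(i) = r(i) + rnd         ! placeholder work reusing the one random value
       END DO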

Jim Dempsey

www.quickthreadprogramming.com
Posted by jimdempseyatthecove

Revised program.

Edited results

KMP_AFFINITY = verbose
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7}
Number of threads            1
Elapsed time: 18.56 s
Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 19.93 s
Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 10.55 s
Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 8.439 s
Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 6.533 s
Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.765 s
Number of threads            7
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.720 s
Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7}
Elapsed time: 4.720 s
Done
KMP_AFFINITY=verbose,compact
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
Number of threads            1
Elapsed time: 18.60 s
Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1}
Elapsed time: 20.33 s
Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {2,3}
Elapsed time: 10.14 s
Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {2,3}
Elapsed time: 12.07 s
Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {4,5}
Elapsed time: 7.059 s
Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {4,5}
Elapsed time: 5.014 s
Number of threads            7
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {6,7}
Elapsed time: 5.022 s
Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.995 s
Done
KMP_AFFINITY=verbose,scatter
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
Number of threads            1
Elapsed time: 18.66 s
Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
Elapsed time: 19.79 s
Number of threads            3
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
Elapsed time: 9.924 s
Number of threads            4
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
Elapsed time: 8.047 s
Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1}
Elapsed time: 6.172 s
Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {2,3}
Elapsed time: 5.061 s
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {4,5}
Number of threads            7
Elapsed time: 4.853 s
Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.704 s
Done
KMP_AFFINITY=verbose,compact,1,0
OMP: Info #204: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7}
OMP: Info #156: KMP_AFFINITY: 8 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 4 cores/pkg x 2 threads/core (4 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 0 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 1 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 1 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 2 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 3 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 3 thread 1
OMP: Info #144: KMP_AFFINITY: Threads may migrate across 1 innermost levels of machine
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1}
Number of threads            1
Elapsed time: 18.62 s
Number of threads            2
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {2,3}
Elapsed time: 19.93 s
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {4,5}
Number of threads            3
Elapsed time: 9.921 s
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {6,7}
Number of threads            4
Elapsed time: 8.193 s
Number of threads            5
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1}
Elapsed time: 6.046 s
Number of threads            6
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {2,3}
Elapsed time: 5.065 s
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {4,5}
Number of threads            7
Elapsed time: 4.944 s
Number of threads            8
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {6,7}
Elapsed time: 4.654 s
Done

Jim Dempsey

Attachments:

Download batchhandling.f90 (4.2 KB)
www.quickthreadprogramming.com
Posted by dajum

Jim,

I see similar results to yours with the latest code. I had previously verified that the code is indeed running on multiple processors.

Ian,

I implemented the ATOMIC statements, which are fine for 12.1 and 13.0, but when using version 11.1, no clauses are allowed for ATOMIC, and it won't take just an assignment statement. It appears ATOMIC is only meant for WRITE statements in 11.1. Is that the case? Is there some way this is meant to be implemented when using 11.1?
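
For example, 12.1 accepts this form, but 11.1 rejects the READ clause (local_index/index stand in for my actual flags):

!$OMP ATOMIC READ
       local_index = index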

Posted by jimdempseyatthecove

dajum,

The 2-thread issue (slower than 1) may be due to the code used to emulate the load for IOsub. I suggest you modify it to perform a formatted internal write (to a character variable), then call SLEEPQQ to emulate write latency. This may be more representative of your overhead for IOsub.
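
Something along these lines (a sketch; SLEEPQQ is in the IFPORT portability module and takes a duration in milliseconds):

       SUBROUTINE IOSUB_EMULATED(IB)
       USE IFPORT                         ! for SLEEPQQ
       IMPLICIT NONE
       INTEGER, INTENT(IN) :: IB
       CHARACTER(LEN=64) :: LINE
       WRITE(LINE,'(A,I6)') 'batch ', IB  ! formatted internal write (CPU cost)
       CALL SLEEPQQ(10)                   ! emulate ~10 ms of device write latency
       END SUBROUTINE IOSUB_EMULATED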

Jim Dempsey

www.quickthreadprogramming.com
