Use of only 25% of CPU with Auto-Parallelization

Hi,

I'm using Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel Core i5 machine.

Because I would like to parallelize the execution of the program, I use the "-c /Qparallel" options at the compilation step, and the "/Qpar-report" option reports that almost all the loops have been parallelized.

But when I execute my program, only 25% of the total CPU resources are allocated to the corresponding process, even though all the processors seem to work simultaneously. I've tried setting the priority of the process to "/high" when I execute the program, with no effect, and the affinity is set by default to all 4 processors.

I don't know what is going wrong; thanks in advance for any help.

JB


Did you examine with /Qpar-report whether the important parts of your program are parallelized, or get diagnostics on why not?

If your objective is simply to max out your multiple-thread meter, you might add /Qpar-threshold0. This asserts that you want to maximize parallelism at the expense of performance.
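
For reference, a typical compile line combining these options might look like this (the source file name is only a placeholder):

ifort /c /Qparallel /Qpar-report /Qpar-threshold0 mysource.f90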

Thank you for answering,

I actually tried the /Qpar-threshold0 option to ensure that all the loops are parallelized, but it doesn't change the CPU usage, even though all the loops are parallelized according to /Qpar-report.

It is as if everything were calculated on a single core: no processor is fully used, the computation seems spread out over the 4 processors, but with a maximum use of 25% of the total CPU capability...

Many thanks for your help!

What percentage of your program's time is spent in the loops? There could be memory bottlenecks or other issues preventing your program from fully utilizing each core.

Annalee

The program is a sequence of nested loops (at least 5 steps of 2-level loops). I guess this scheme fits well with auto-parallelism, doesn't it?

Do you think that using OpenMP may significantly increase the efficiency of the parallelization? What is weird is that the CPU allocation of my process is always stuck at precisely 25%!

Identify a processing-intensive loop that has been reported as parallelized. Run in Debug mode, place a breakpoint in the loop, and run to the breakpoint. Open the Debug window for Threads: how many threads are listed?

Jim Dempsey

www.quickthreadprogramming.com

Applying OpenMP may give you more insight; among other things you can check the number of threads assigned within a parallel region, and see whether your loops can be successfully parallelized without hidden transformations used by -Qparallel.
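
For example, a minimal self-contained check of the thread count inside a parallel region (a sketch using the standard omp_lib routines; compile with /Qopenmp):

PROGRAM checkThreads
    USE omp_lib
    IMPLICIT NONE
!$OMP PARALLEL
    ! Only the master thread reports, to avoid interleaved output
    IF (omp_get_thread_num() == 0) THEN
        PRINT *, 'Threads in parallel region: ', omp_get_num_threads()
    END IF
!$OMP END PARALLEL
END PROGRAM checkThreads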

I suspect you must set /O explicitly along with /Qparallel for it to operate in a debug build.

Thank you for your answer, I'm going to check that.

Hi all,

First, I would like to thank Jim and iliyapolak a lot; the debugger and xperf helped me find that there was no parallelization in my code. I found in this forum that I had to check data dependencies in my loops before using /Qparallel blindly :), and I realized that there's no magic tool for parallelization.

Because my code is fairly light, I tried to use OpenMP directives, mostly to parallelize independent implicit loops in a subroutine. The parallelization works, but my program is slower than before. Here is the code of this routine:

!    ========================================================
!    Streaming step: the population functions are shifted
!        one site along their corresponding lattice direction
!        (no temporary memory is needed)
!    ========================================================
SUBROUTINE stream(f)
    USE simParam
    implicit none
    double precision, INTENT(INOUT):: f(yDim,xDim,0:8)
    double precision:: periodicHor(yDim), periodicVert(xDim)
!$OMP PARALLEL SHARED(f,xDim,yDim) PRIVATE(periodicHor,periodicVert)
 !$OMP SECTIONS
    !$OMP SECTION
    !    -------------------------------------
    !    right direction
    periodicHor   = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = periodicHor
    
    !$OMP SECTION
    !    -------------------------------------
    !    up direction
    periodicVert    = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = periodicVert
    
    !$OMP SECTION
    !    -------------------------------------
    !    left direction
    periodicHor     = f(:,1,3)
    f(:,1:xDim-1,3) = f(:,2:xDim,3)
    f(:,xDim,3)     = periodicHor
    
    !$OMP SECTION
    !    -------------------------------------
    !    down direction
    periodicVert  = f(yDim,:,4)
    f(2:yDim,:,4) = f(1:yDim-1,:,4)
    f(1,:,4)      = periodicVert
    
    !$OMP SECTION
    !    -------------------------------------
    !    up-right direction
    periodicVert         = f(1,:,5)
    periodicHor          = f(:,xDim,5)
    f(1:yDim-1,2:xDim,5) = f(2:yDim,1:xDim-1,5)
    f(yDim,2:xDim,5)     = periodicVert(1:xDim-1)
    f(yDim,1,5)          = periodicVert(xDim)
    f(1:yDim-1,1,5)      = periodicHor(2:yDim)
    
    !$OMP SECTION
    !    -------------------------------------
    !    up-left direction
    periodicVert           = f(1,:,6)
    periodicHor            = f(:,1,6)
    f(1:yDim-1,1:xDim-1,6) = f(2:yDim,2:xDim,6)
    f(yDim,1:xDim-1,6)     = periodicVert(2:xDim)
    f(yDim,xDim,6)         = periodicVert(1)
    f(1:yDim-1,xDim,6)     = periodicHor(2:yDim)
        
    !$OMP SECTION
    !    -------------------------------------
    !    down-left direction
    periodicVert         = f(yDim,:,7)
    periodicHor          = f(:,1,7)
    f(2:yDim,1:xDim-1,7) = f(1:yDim-1,2:xDim,7)
    f(1,1:xDim-1,7)      = periodicVert(2:xDim)
    f(1,xDim,7)          = periodicVert(1)
    f(2:yDim,xDim,7)     = periodicHor(1:yDim-1)
    
    !$OMP SECTION
    !    -------------------------------------
    !    down-right direction
    periodicVert       = f(yDim,:,8)
    periodicHor        = f(:,xDim,8)
    f(2:yDim,2:xDim,8) = f(1:yDim-1,1:xDim-1,8)
    f(1,2:xDim,8)      = periodicVert(1:xDim-1)
    f(1,1,8)           = periodicVert(xDim)
    f(2:yDim,1,8)      = periodicHor(1:yDim-1)
  !$OMP END SECTIONS NOWAIT
!$OMP END PARALLEL
END SUBROUTINE stream

I think this must be caused by a scheduling issue, but I don't know what kind of directive is really efficient in this case. Thank you so much for your help!

JB

>>>It is as if everything were calculated on a single core: no processor is fully used, the computation seems spread out over the 4 processors, but with a maximum use of 25% of the total CPU capability...>>>

What load was reported by Xperf? Was the Idle thread consuming the remaining 75% of CPU time?

Hello everybody,

Sorry, I guess I messed up: I didn't realize that my first post wasn't immediately released, and so I posted a new one. That's why there are two conversations on this topic.

@Annalee:
>>>If your code sections are small, the overhead involved in running in parallel may be higher than the performance gains>>>

I think you must be right; this routine is just one of the 8 steps within a main loop. But I assumed that this step was the heaviest because there are nested implicit loops, and xDim and yDim are both close to 1000. By the way, is there a specific directive for this kind of array operation? Will OMP_NESTED=.TRUE. improve this kind of loop?
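
For instance, would a WORKSHARE construct be the right tool for one of these shifts? Something like this sketch (untested):

!$OMP PARALLEL WORKSHARE
f(:,2:xDim,1) = f(:,1:xDim-1,1)
!$OMP END PARALLEL WORKSHARE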

@TimP:
I think the tasks are quite well balanced because there is only 1 heavy operation in each section, for instance: f(2:ny,2:nx,8) = f(1:ny-1,1:nx-1,8). So according to you KMP_AFFINITY may help, but I think I should know my processor architecture better to use this parameter efficiently, shouldn't I? I tried OMP_SCHEDULE without any improvement.

@iliyapolak:
I'm at work at the moment and I still don't have access to XPerf, despite having asked my IT to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.

To better see how parallelization slows my execution, I tried setting OMP_THREAD_LIMIT from 4 down to 1 and noticed that speed decreases linearly as the number of threads increases.

Many thanks. I'm asking more and more questions not really related to the first topic; should I begin a new conversation?

>>...But when I execute my program, only 25% of the total CPU resources are allocated to the corresponding process, even though all the
>>processors seem to work simultaneously...
>>
>>...there was no parallelization in my code...

Did you check with Task Manager (I assume you use Windows) how many threads are used? Another question: are there any I/O operations on the file system during processing?

Hi Sergey,

I managed to see that there was only one thread running thanks to the debugger, but I don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.

Second, your question about I/O is interesting. I actually write data to a file each global iteration (my code is a main loop including 8 steps, at the heart of which there are nested loops). Does it influence parallelization? The step in which my program writes data to a file is not enclosed between parallelization directives.

Thank you so much for your help!

JB

>>...I don't know how to check it with the Task Manager?..

- Start Task Manager
- Select Processes property page
- Select View in main menu
- Select 'Select Columns...' and check Thread Count

>>...I actually write data on a file each golbal iteration (my code is a main loop including 8steps at the heart of which there are
>>nested loops). Does it influence parallelization?

In that case I would simply comment out that part of the code, rebuild the sources, and repeat all tests / verifications.

Bravo! Auto-parallelization works fine when I comment out the output step!!

So how can I keep the output and get auto-parallelization working too?

Another question: why is execution not faster (and even a little bit slower than single-threaded processing)?

>>>I'm at work at the moment and I still don't have access to XPerf, despite having asked my IT to install it. I tried on my PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.>>>

Can you post a screenshot from your PC (when you executed Xperf)?

I would not recommend looking at the percentage description of CPU load. Xperf and Process Explorer provide better and clearer information about the CPU load of your thread(s). This is done by counting CPU cycles instead of measuring the timer interval (~10 ms).

>>>I don't know how to check it with the Task Manager. Anyway, I'm working on OpenMP directives, and the Task Manager clearly shows me that the 4 cores are running.>>>

If you want to ensure that the running threads belong to your application, you can also use Process Explorer with its detailed view (including per-thread call stacks); more advanced information can be obtained with the debugger.

Hi JB,

>>Bravo! Auto-parallelization works fine when I comment out the output step!!
>>
>>So how can I keep the output and get auto-parallelization working too?
>>
>>Another question: why is execution not faster (and even a little bit slower than single-threaded processing)?

Thanks for the update; it looks like light at the end of the tunnel.

Regarding the performance problems, I wouldn't make any comments because there are too many unknowns for me; a check with performance utilities like Intel VTune or Inspector could show you why it happens.

Note: Is it possible to do a couple of tests with smaller data sets?

JB D

Looking at your stream(f) subroutine, it essentially rotates sections of an array. This is memory-bandwidth heavy. I cannot see the outer levels of your program, so I will throw something out for you to consider.

Rotation can be accomplished by using modulus arithmetic on the indices.

xBase = xBase + 1 ! rotate in +x
yBase = yBase + 1 ! rotate in +y
do yRing = 1, yDim
  do xRing = 1, xDim
    x = MOD(xBase + xRing - 1, xDim) + 1
    y = MOD(yBase + yRing - 1, yDim) + 1
    ! use x and y as indices as before
  end do
end do

Jim Dempsey

www.quickthreadprogramming.com

>>>This is done by counting CPU cycles instead of measuring the timer interval (~10 ms).>>>

This is a follow-up.

Sorry if it is not directly related to the topic, but I thought it could shed some light on measuring CPU load as a percentage of the time when the CPU was executing some thread. Because of the mentioned (quoted) timer interval, which can be measured with the clockres tool and is around ~10 ms, some tools will report the usage as 0% when in fact the thread runs for a shorter period than the timer interval, so its contribution is not counted. A better option is to use a monitoring tool that counts CPU cycles.

>>...it is not directly related to the topic, but I thought it could shed some light on measuring CPU load as a percentage of time...

The problem JB D experienced is related to auto-parallelization of processing that includes I/O operations (not to measuring CPU load); because of the I/O, the Intel Visual Fortran compiler didn't auto-parallelize, assuming that the integrity of the processing would not be preserved.

His post is indirectly related to measuring CPU load. Adding a few tips regarding measurement precision won't break any rules :)

>>>What is weird is that the CPU allocation of my process is always stuck at precisely 25%!>>>


>>...His post is indirectly related to measuring CPU load...

I really don't see a reason for all these irrelevant explanations. It is not clear to me whether you do any programming in Fortran.

This thread has been running for quite a while!

If your process is reporting 25% CPU in Task Manager on a Core i5, then no parallel threads are being effectively utilised; just one stream is fully committed.
There are two possibilities:
a) If the parallelisation is being achieved by the !$OMP SECTION commands, then one of the sections is running for a significantly longer time than the others, or there is a clash and only one section is running at a time, or
b) the !$OMP commands are being ignored and there is only one running stream.

You should run the program with and without OpenMP being selected and see what is different.
Is it possible to estimate the run time of each of the sections? Elapsed time with QueryPerformanceCounter or RDTSC might give the precision you require to identify what is happening. Ignore the complaints about these timing routines not being accurate; as bad as they are, they are probably the best you have available.
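
If you want to stay in standard Fortran, SYSTEM_CLOCK can bracket each candidate section in the same spirit; a rough sketch (the integer(8) arguments select the high-resolution counter where available):

integer(8):: t0, t1, rate
call SYSTEM_CLOCK(t0, rate)
call stream(f)                 ! section under test
call SYSTEM_CLOCK(t1)
print *, 'stream: ', dble(t1 - t0) / dble(rate), ' seconds'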

John

QueryPerformanceCounter and RDTSC are good alternatives to Task Manager. At least you will measure performance in CPU cycles (RDTSC).

Hello everybody!
Sorry, I wasn't able to follow these numerous posts this weekend; thanks again for your help.
A lot of issues have been raised as the posts went on, and I tried different leads:
1) compiler options (/Qparallel, /Qax, /Qopenmp);
2) environment variables adjustment;
3) CPU load control;
4) I/O control.

1): I managed (many thanks to Sergey) to run auto-parallelization, which is a bit more efficient than the OpenMP directives I had chosen to put in, but it still remains slower than compiling without any option!
Using /Qpar-report2 yields different cases:
      - existence of parallel dependence (unanswerable!)
      - insufficient computational work (if somebody could shed light on this I'd be glad)
      - LOOP WAS AUTO-PARALLELIZED just followed by: loop was not parallelized: insufficient inner loop
Is a loop really parallelized in the 3rd case? Moreover, the line numbers indicated don't match any meaningful line in my code (sometimes they even point at comment lines).

2): I tried to set KMP_AFFINITY with another configuration; results got worse.

3): Yesterday I implemented an RDTSC counter (thanks to this example), which allowed me to measure the time taken by each step and to notice that the stream step is one of the heaviest but not the only one. Fortunately, the longest loops are those which /Qparallel tends to parallelize.

4): I tried to understand why I/O could interact with auto-parallelization, but I didn't find any information on this. Sergey, why do you think it can prohibit parallelization? How can I solve this problem?

 

Hi,

>>...why do you think it can prohibit parallelization?..

My main concern with I/O in your case is the second part, that is O (Output): it looks like the compiler won't be able to synchronize these operations (executed in many threads), or doesn't "know" where in the output file some data needs to be saved.

>>...How can I solve this problem?..

I think the second part, the Output, needs to be delayed until all the processing is completed; only after that should the data be saved in a regular non-parallel way. However, I don't know if that is possible in your case.
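
A minimal sketch of that idea (nIter, results, and the file name are illustrative only; it assumes the per-iteration output fits in memory):

! Compute phase: no I/O inside, so it stays eligible for auto-parallelization
do iter = 1, nIter
    call stream(f)                 ! ...and the other steps of the main loop
    results(:,:,iter) = f(:,:,0)   ! buffer the data instead of WRITE in the loop
end do
! Output phase: one regular, non-parallel write at the very end
open(10, file='output.dat')
write(10,*) results
close(10)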

Regarding I/O, at least you can use xperf and identify the thread(s) performing I/O by call-stack examination. You can check whether there are interdependencies between the I/O thread(s) and the thread(s) performing calculation.

I suppose that full-scale debugging coupled with call-stack analysis of every thread (in case parallelism was achieved) could reveal the root cause of your problem. While running under the debugger you will need to single-step, trace calls, and observe the execution of your code. It is not an easy task, but it can help you understand the failure of auto-parallelism. Before using the debugger, if you are interested, please inspect your code's import table with dumpbin.
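
For example (the executable name is a placeholder):

dumpbin /IMPORTS yourprogram.exe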

<<<so my process isn't constrained by any other process>>>

Your process is only a memory-mapped container and cannot be constrained by another process. What can be preempted are your process's threads.

>>...How can I solve this problem?

Hi JB D,

I know that my proposal could lead to a redesign of your application. So, I would follow a 3-phase approach:

Phase 1 - Load data to be processed into memory ( Non-Parallel operation in an EXE-module / No Auto-Parallelization )

Phase 2 - Do data processing ( Parallel in a DLL-module / Compiled with Auto-Parallelization or usage of OpenMP )

Phase 3 - Store processed data to a file ( Non-Parallel operation in EXE-module / No Auto-Parallelization )

Unfortunately, I have no idea what your application actually does, and it is very hard to make the right decision / recommendation. Let me know if that approach is not applicable.

All right, thanks for your answer!

@Sergey: I think this can be suitable for me; I just have to learn how to interface my code with a Fortran DLL! (I'm a bit new to programming, and especially to Fortran.)

@iliyapolak: I'll try what you said about XPerf at work as soon as my IT decides to install it on my workstation. My own computer is not powerful enough to run my code properly, so XPerf would not give accurate information.

I still wonder what the line LOOP WAS AUTO-PARALLELIZED immediately followed by 'loop was not parallelized: insufficient inner loop' means when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?

>>...I still wonder what the line LOOP WAS AUTO-PARALLELIZED immediately followed by 'loop was not parallelized:
>>insufficient inner loop' means when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?..

Could you post some code? At least the part related to these two diagnostic messages.

Hi JB D,

Before running Xperf, please verify that no other program is using the Kernel Logger.

Does your loop have interdependencies, and does it use compile-time constant values? Looking at the compiler message, is it possible that your inner loop runs for a very short time and the overhead needed to parallelize it is simply too large?

Quote:

JB D. wrote:

I still wonder what the line LOOP WAS AUTO-PARALLELIZED immediately followed by 'loop was not parallelized: insufficient inner loop' means when I compile with /Qpar-report2. Is a loop really parallelized when these comments appear?

Without seeing the actual report, there are two possibilities:

  1. The compiler generated multiple code paths for this region. One was parallelized and the other was not.
  2. There is an inner and outer loop, and only the outer loop was parallelized. This includes loops created by the compiler for arrays.

Hi Annalee,

what can 'insufficient inner loop' mean?

@Sergey: here is a subroutine which generates this kind of warning and should qualify for auto-parallelization (I guess):

SUBROUTINE computeMacros(f,rho,u,uSqr) 
    USE simParam, ONLY: xDim, yDim 
    use omp_lib 
    implicit none 
 
    double precision, INTENT(IN):: f(yDim,xDim,0:8) 
    double precision, INTENT(INOUT):: u(yDim,xDim,0:1), rho(yDim, xDim), uSqr(yDim, xDim) 
    integer:: x,y 
    do x = 1, xDim 
        do y = 1, yDim 
            rho(y,x)  = f(y,x,0) + f(y,x,1) + f(y,x,2) + f(y,x,3) + f(y,x,4) + f(y,x,5) + f(y,x,6) + f(y,x,7) + f(y,x,8) 
            u(y,x,0)  = (f(y,x,1) - f(y,x,3) + f(y,x,5) - f(y,x,6) - f(y,x,7) + f(y,x,8)) / rho(y,x) 
            u(y,x,1)  = (f(y,x,2) - f(y,x,4) + f(y,x,5) + f(y,x,6) - f(y,x,7) - f(y,x,8)) / rho(y,x) 
            uSqr(y,x) = u(y,x,0) * u(y,x,0) + u(y,x,1) * u(y,x,1) 
        end do 
    end do     
END SUBROUTINE computeMacros

@Annalee: Is this problem related to the algorithm or to the compiler? I can't give you more details on this report because I'm away until Monday, but as far as I remember, for this routine for example, the first line of the related report was LOOP WAS AUTO-PARALLELIZED and the inner-loop warning appeared 3 or 4 times.

>>The compiler generated multiple code paths for this region. One was parallelized and the other was not>>
Does the number just before the warning (for instance: (649)) refer to these paths? Because I can't link it to any line number in my source.

>>what can 'insufficient inner loop' mean?
I wonder too!

Your program was successfully parallelized. In this case, what it means is that the compiler parallelized around the outer loop, "do x = 1, xDim", and the inner loop is left as is. This is exactly what should happen.

The number before the warning just tells us which warning is being displayed.

 "insufficient inner loop" means there was not enough work within the inner loop for it to be effecient to parallize it.  

 

 

Thank you Annalee

Yes, a thousand thanks! You definitely shed light on this point! I will try to follow Sergey's advice about the dynamic-link library. I hope this will improve my computation, because parallelization still slows my execution, which is not really the expected behaviour...

You indicate your ifort version is 11.1. I have found a significant improvement in /Qparallel when changing to Composer XE 2011, which has since been superseded by version 2012 and possibly version 2013.
I would recommend you upgrade from version 11.1, as I found problems with that version.

John
