Use of only 25% of CPU with auto-parallelization


I'm using the Intel Visual Fortran Compiler Pro 11.1 to compile my code on an Intel core i5 architecture.

Because I would like to parallelize the execution of my program, I compile with the "-c /Qparallel" options, together with "/Qpar-report", which reports that almost all loops have been auto-parallelized.

But when I run my program, only 25% of the total CPU resources are allocated to my process, even though all 4 processors seem to be working simultaneously. I tried starting it with "/high" priority, without any effect; the affinity is set to all 4 processors by default.

I have no idea what causes this issue. Thanks in advance for any help.


Which tool do you use to check for cpu load?

You are running user-mode code on a preemptive OS, so you cannot literally take full control of the CPU's processing resources. The OS scheduler decides which threads are in the ready state and which are scheduled to run next. There is also more privileged code than yours that must run, for example the scheduler's own code.

You can also set the priority of your thread to real-time, but it is not recommended because you can starve system threads.

For a clearer picture of CPU load, I advise you to use Process Explorer, which displays load data by counting the number of CPU cycles charged to a specific thread.

Hi iliyapolak !

Thank you for your answer,

I'm using the basic Windows Task Manager. I know that it is deeply insufficient, but I am not an administrator on this computer and I can't really install new tools. I understand that my code may not take precedence over system threads, but it's weird for a process to be limited to exactly 25% of the total CPU capacity, so that I can't use "virtually" more than 1 processor out of 4.

Yesterday (today is a French public holiday) I tried jimdempseyatthecove's idea, but I ran into a debugger issue that I'm going to fix quickly:


jimdempseyatthecove wrote:

Identify a process intensive loop that has been reported as being parallelized. Run in Debug mode, place break in loop, run to break point. Open the Debug Window for Threads, how many threads are listed?

Many thanks to everybody for this precious help!


>>>I'm using the basic windows task manager, I know that it is deeply insufficient but I am not administrator on this computer>>>

So the best option is to use what Jim advised.


You can also run the Xperf toolkit, which will give you a nice graphical breakdown of thread activity.
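The debugger check suggested above works for any threaded build. For code that uses OpenMP directives, the same question can also be asked from inside the program. The following is only a minimal sketch: the program name is hypothetical, it assumes the build has OpenMP enabled (for example with /Qopenmp), and it says nothing about loops handled by /Qparallel auto-parallelization.

```fortran
! Minimal sketch: report how many OpenMP threads a parallel region actually gets.
! The program name is illustrative only; build with OpenMP enabled.
program count_threads
    use omp_lib
    implicit none
    integer :: nthreads

    nthreads = 1
!$omp parallel
!$omp single
    ! One thread records the size of the current team.
    nthreads = omp_get_num_threads()
!$omp end single
!$omp end parallel

    print '(a,i0)', 'threads in parallel region:   ', nthreads
    print '(a,i0)', 'processors visible to OpenMP: ', omp_get_num_procs()
end program count_threads
```

If the first number is 1 on a 4-core machine, the region is not running in parallel at all, which would be consistent with a process stuck at 25% CPU.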

Hi all,

First, I would like to thank jim and iliyapolak a lot: the debugger and Xperf helped me find that there was no parallelization in my code. I found in this forum that I had to check the data dependencies in my loops before using /Qparallel savagely :), and I realized that there's no magic tool for parallelization.

Because my code is pretty light, I tried to use OpenMP directives in it, mostly to parallelize independent implicit loops in a subroutine. The parallelization works fine, but my program is slower than before. Here is the code of this routine:

!    ========================================================
!    Streaming step: the population functions are shifted
!        one site along their corresponding lattice direction
!        (no temporary memory is needed)
!    ========================================================
SUBROUTINE stream(f)
    USE simParam
    implicit none
    double precision, INTENT(INOUT):: f(yDim,xDim,0:8)
    double precision:: periodicHor(yDim), periodicVert(xDim)
!$OMP PARALLEL SHARED(f,xDim,yDim) PRIVATE(periodicHor,periodicVert)
    !    -------------------------------------
    !    right direction
    periodicHor   = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = periodicHor
    !    -------------------------------------
    !    up direction
    periodicVert    = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = periodicVert
    !    -------------------------------------
    !    left direction
    periodicHor     = f(:,1,3)
    f(:,1:xDim-1,3) = f(:,2:xDim,3)
    f(:,xDim,3)     = periodicHor
    !    -------------------------------------
    !    down direction
    periodicVert  = f(yDim,:,4)
    f(2:yDim,:,4) = f(1:yDim-1,:,4)
    f(1,:,4)      = periodicVert
    !    -------------------------------------
    !    up-right direction
    periodicVert         = f(1,:,5)
    periodicHor          = f(:,xDim,5)
    f(1:yDim-1,2:xDim,5) = f(2:yDim,1:xDim-1,5)
    f(yDim,2:xDim,5)     = periodicVert(1:xDim-1)
    f(yDim,1,5)          = periodicVert(xDim)
    f(1:yDim-1,1,5)      = periodicHor(2:yDim)
    !    -------------------------------------
    !    up-left direction
    periodicVert           = f(1,:,6)
    periodicHor            = f(:,1,6)
    f(1:yDim-1,1:xDim-1,6) = f(2:yDim,2:xDim,6)
    f(yDim,1:xDim-1,6)     = periodicVert(2:xDim)
    f(yDim,xDim,6)         = periodicVert(1)
    f(1:yDim-1,xDim,6)     = periodicHor(2:yDim)
    !    -------------------------------------
    !    down-left direction
    periodicVert         = f(yDim,:,7)
    periodicHor          = f(:,1,7)
    f(2:yDim,1:xDim-1,7) = f(1:yDim-1,2:xDim,7)
    f(1,1:xDim-1,7)      = periodicVert(2:xDim)
    f(1,xDim,7)          = periodicVert(1)
    f(2:yDim,xDim,7)     = periodicHor(1:yDim-1)
    !    -------------------------------------
    !    down-right direction
    periodicVert       = f(yDim,:,8)
    periodicHor        = f(:,xDim,8)
    f(2:yDim,2:xDim,8) = f(1:yDim-1,1:xDim-1,8)
    f(1,2:xDim,8)      = periodicVert(1:xDim-1)
    f(1,1,8)           = periodicVert(xDim)
    f(2:yDim,1,8)      = periodicHor(1:yDim-1)
!$OMP END PARALLEL
END SUBROUTINE stream

I think this must be caused by a scheduling issue, but I don't know which kind of directive is really efficient in that case.

Thank you so much for your help!


If your code sections are small, the overhead involved in running in parallel may be higher than the performance gains from running the sections in parallel. I would suggest breaking your code into larger chunks for better performance.
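One way to keep each unit of work coarse is to hand one whole lattice direction to each section inside a single parallel region. The sketch below is only an illustration under assumptions: it uses tiny hypothetical dimensions, only two directions ("right" and "up"), hypothetical names edgeHor and edgeVert for the boundary buffers, and it checks the result against the intrinsic cshift. It is not the original routine, and since these operations are pure memory copies, any speedup may be limited by memory bandwidth rather than by the number of cores.

```fortran
! Sketch: one parallel region, one SECTION per lattice direction.
! Sizes and names are illustrative assumptions, not the original code.
program stream_sections
    implicit none
    integer, parameter :: yDim = 6, xDim = 5
    double precision :: f(yDim,xDim,2), ref(yDim,xDim,2)
    double precision :: edgeHor(yDim), edgeVert(xDim)
    integer :: i, j

    do j = 1, xDim
        do i = 1, yDim
            f(i,j,1) = 10*i + j
            f(i,j,2) = -(10*i + j)
        end do
    end do

    ! Reference result computed serially with the cshift intrinsic.
    ref(:,:,1) = cshift(f(:,:,1), shift=-1, dim=2)   ! right, periodic
    ref(:,:,2) = cshift(f(:,:,2), shift=1,  dim=1)   ! up, periodic

!$omp parallel sections shared(f) private(edgeHor, edgeVert)
!$omp section
    ! right direction: each section touches only its own plane of f
    edgeHor       = f(:,xDim,1)
    f(:,2:xDim,1) = f(:,1:xDim-1,1)
    f(:,1,1)      = edgeHor
!$omp section
    ! up direction
    edgeVert        = f(1,:,2)
    f(1:yDim-1,:,2) = f(2:yDim,:,2)
    f(yDim,:,2)     = edgeVert
!$omp end parallel sections

    if (all(f == ref)) then
        print '(a)', 'sections result matches cshift reference'
    else
        print '(a)', 'MISMATCH against cshift reference'
    end if
end program stream_sections
```

Each section writes to a different plane of f and uses its own private buffer, so the sections do not interfere with each other.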

I tried to post the following suggestion:

If the work were well balanced among the sections, a KMP_AFFINITY environment variable setting might help.

If not, and, as you say, you need to change scheduling, adding the schedule(runtime) clause to an omp parallel do allows you to try variations by setting environment variables, such as OMP_SCHEDULE=dynamic,2. This would allow individual threads to pick up new work (in chunks of the specified number of iterations) as they become free.
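The schedule clause only takes effect on a work-shared loop, so the array assignment has to be written as an explicit DO. A minimal sketch follows, assuming a single hypothetical plane g and small dimensions. It shifts each column independently (the "up" direction), uses schedule(runtime), and prints the schedule the runtime picked up from OMP_SCHEDULE.

```fortran
! Sketch: explicit loop over columns with schedule(runtime).
! Array name, sizes and program name are illustrative assumptions.
program stream_schedule
    use omp_lib
    implicit none
    integer, parameter :: yDim = 8, xDim = 16
    double precision :: g(yDim,xDim), ref(yDim,xDim), tmp
    integer(kind=omp_sched_kind) :: sched_kind
    integer :: chunk, i, j

    do j = 1, xDim
        do i = 1, yDim
            g(i,j) = 100*j + i
        end do
    end do
    ref = cshift(g, shift=1, dim=1)   ! serial reference for the "up" shift

    ! Columns are independent and contiguous in memory, so they can be
    ! distributed over threads in any order.
!$omp parallel do schedule(runtime) private(tmp)
    do j = 1, xDim
        tmp            = g(1,j)
        g(1:yDim-1,j)  = g(2:yDim,j)
        g(yDim,j)      = tmp
    end do
!$omp end parallel do

    ! Report the schedule selected at run time (set via OMP_SCHEDULE).
    call omp_get_schedule(sched_kind, chunk)
    print '(a,i0,a,i0)', 'schedule kind ', sched_kind, ', chunk ', chunk

    if (all(g == ref)) then
        print '(a)', 'loop result matches cshift reference'
    else
        print '(a)', 'MISMATCH against cshift reference'
    end if
end program stream_schedule
```

Running it with different OMP_SCHEDULE values (for example static, or dynamic,2) changes how columns are handed out without recompiling.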

>>>If your code sections are small, the overhead  involved in running in parallel may be higher than the performance gains>>>

That is true. Thread creation and, in some cases, synchronization overhead can be larger than the execution time of small loops. In my case this was the main reason not to use Java multithreading for my library of special functions.

Hi JB D,

Can you post a screenshot from the Xperf graphical tool?

Hello everybody,

Sorry, I guess I messed up: my first post wasn't released immediately, so I mistakenly posted a new one. That's why there are two conversations on this topic.

>>>If your code sections are small, the overhead  involved in running in parallel may be higher than the performance gains>>>

I think you must be right: this routine is just one of the 8 steps within a main loop. But I assumed that this step was the heaviest because there are nested implicit loops, and xDim and yDim are almost equal to 1000. By the way, is there a specific directive for this kind of array operation? Would OMP_NESTED=.TRUE. improve this kind of loop?

I think the tasks are quite well balanced because there is only 1 heavy operation in each section, for instance: f(2:ny,2:nx,8) = f(1:ny-1,1:nx-1,8). So according to you KMP_AFFINITY may help, but I think I should know my processor architecture better to use this parameter efficiently, shouldn't I? I tried OMP_SCHEDULE without any improvement.

I'm at work at the moment and I still don't have access to Xperf, even though I asked my IT department to install it. I tried on my own PC and noticed that, as you said, all the remaining CPU usage (75%) is taken by the Idle process, so my process isn't constrained by any other process.

To better see how parallelization slows my execution, I varied OMP_THREAD_LIMIT from 4 down to 1 and noticed that the speed decreases linearly as the number of threads increases.

Sergey Kostrov warned me about I/O operations, and I do write data to a file at every main iteration. Does this influence parallelization? The step in which I write data to the file is not enclosed in parallel directives.

Many thanks. I'm asking more and more questions not really related to the first topic; should I begin a new conversation?
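On the question of a directive for array operations: OpenMP defines a WORKSHARE construct for Fortran array syntax. The sketch below is only an illustration with a hypothetical plane g, a hypothetical buffer edge and small sizes. Note that how well WORKSHARE is parallelized is implementation-dependent, and a compiler may execute it on a single thread, so it would need to be measured.

```fortran
! Sketch: array-syntax shift inside PARALLEL WORKSHARE.
! Names and sizes are illustrative assumptions, not the original code.
program stream_workshare
    implicit none
    integer, parameter :: yDim = 6, xDim = 5
    double precision :: g(yDim,xDim), ref(yDim,xDim), edge(yDim)
    integer :: i, j

    do j = 1, xDim
        do i = 1, yDim
            g(i,j) = 10*i + j
        end do
    end do
    ref = cshift(g, shift=-1, dim=2)   ! serial reference for the "right" shift

    ! The statements keep their sequential meaning; the work inside each
    ! array assignment may be divided among the threads of the team.
    ! edge is shared here, unlike the private buffers used with sections.
!$omp parallel workshare
    edge        = g(:,xDim)
    g(:,2:xDim) = g(:,1:xDim-1)
    g(:,1)      = edge
!$omp end parallel workshare

    if (all(g == ref)) then
        print '(a)', 'workshare result matches cshift reference'
    else
        print '(a)', 'MISMATCH against cshift reference'
    end if
end program stream_workshare
```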

Have you tried using the option /Qopenmp-report:2? This will give you diagnostics on how effectively your code was parallelized.

In order to use KMP_AFFINITY, if you know that you have just one CPU with just one last level cache, and no HyperThreading, you don't even need to know the numbers:

set KMP_AFFINITY=compact
