CPU time vs. time

I have a time simulation application running a fairly large model which requires a 64-bit executable to run. Without the 64-bit executable, I get virtual memory errors. Running the model in question, I note that, at this point, the model has been running for 6 days and yet the CPU time is only 42,000 seconds. With smaller models and runs, CPU time and real time are pretty much synchronized. What sort of thrashing is going on that is causing this discrepancy?


What do you mean by real-time?

I suppose that CPU time is the total activity of your process's thread(s). In that 6-day interval the threads will not be running the whole time; they will be swapped out and swapped in, so the total time accumulated cannot be exactly equal to 6 days. For a more precise measurement, can you use Xperf?

>>...I have a time simulation application running a fairly large model which requires a 64-bit executable to run. Without the 64-bit
>>executable, I get virtual memory errors...

32-bit applications are limited in how much memory they can allocate; for example, on a Windows platform the limit is 2GB per process by default. Also, for tests with 32-bit applications, check the Virtual Memory ( VM ) settings on your platform ( it is not clear what platform you're using ) and increase the Initial and Maximum sizes of the VM paging file.

Let's say you're using the CPU_TIME function. On a system with many cores the Fortran CPU_TIME function accumulates time across all CPUs used during processing. For example, if processing finishes in 30 seconds of elapsed time on a system with 8 logical CPUs and all 8 were used ( processing was done in parallel on 8 threads ), then CPU_TIME will return about 240 seconds ( 8 x 30 seconds ). If the same processing is done on 1 CPU ( 1 thread ), CPU_TIME could return a value even greater than 240 seconds.
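To make the difference concrete, here is a minimal, hedged sketch ( the program name and dummy workload are illustrative, not from the original application ) that prints CPU time from CPU_TIME next to elapsed time from SYSTEM_CLOCK:

program time_compare
  implicit none
  real :: cpu0, cpu1
  integer(kind=8) :: wall0, wall1, rate
  real(kind=8) :: x
  integer :: i

  call cpu_time(cpu0)              ! processor time charged to this process
  call system_clock(wall0, rate)   ! wall-clock tick count and ticks per second

  x = 0.0d0
  do i = 1, 50000000               ! dummy workload; replace with real processing
    x = x + sin(real(i, kind=8))
  end do

  call cpu_time(cpu1)
  call system_clock(wall1)

  print *, 'CPU time (s):     ', cpu1 - cpu0
  print *, 'Elapsed time (s): ', real(wall1 - wall0, kind=8) / real(rate, kind=8)
  print *, x                       ! keeps the compiler from discarding the loop
end program time_compare

In a healthy CPU-bound run the two numbers track each other closely; when the process is stalled on the page file or other I/O, the elapsed number keeps growing while the CPU number barely moves, which is exactly the 6-days-versus-42,000-seconds symptom.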

>>...For more precise measurement can you use Xperf?..

How is it going to help?

Was @NotThatItMatters talking about the Fortran timing routine CPU_TIME?

The 64-bit application is using 25% of the CPU all the time. This is from Task Manager. However, as noted in the first post, it has now been running for 7 days and has accumulated CPU_TIME = 49,000 seconds. In a "normal" run, meaning one with a far smaller memory footprint, the disparity between clock time and CPU time is negligible: a run which takes 49,000 seconds can be left at the close of the day and, on return the next morning, it will be complete.

I think that at the OS/hardware level a function like CPU_TIME relies on the real-time clock, exactly as Xperf does. Here I mean the Windows platform. On Linux it will also query the real-time clock, but the underlying OS-dependent implementation could be different.

Are you interested in total accumulated time in seconds?

We always post "elapsed time" when running our app; typically elapsed time and wall-clock time are comparable. Let me just point out that I gave you some bad information. The simulation is running on a Windows 8 machine and I was unfamiliar with the "new" Task Manager. According to what I am now reading, memory is being eaten at nearly 100% by the app, the CPU is being used in small increments ( 0.4%, for example ), and the disk is being used at 50%. I am not certain what this all means.

Regarding memory consumption: is that the total process working set? In one of my earlier posts I advised not to rely on Task Manager's percentage-of-time measurement, because it is based on the real-time clock interval and sometimes your thread will run for less than the timer interval. A better option is to measure the CPU cycles charged to your thread, which can be done with Process Explorer. The small increments are probably an artifact of the timer interval's low resolution. The same applies to the disk I/O percentage; a better option is to look at the number of I/O operations per second. The small increments reflect your application's activity, where the percentage of time the CPU spends running your app grows as a staircase function.

Is CPU_TIME exported by fortran.dll?

The app is a compiled exe.  CPU_TIME is the generic intrinsic subroutine called from within the code to report processor time.

>>...Running the model in question, I note that, at this point, the model has been running for 6 days and yet the CPU time is
>>only 42,000 seconds. With smaller models and runs, CPU time and real time are pretty much synchronized. What sort of
>>thrashing is going on that is causing this discrepancy?

1. You're mixing CPU time with I/O time (!).

2. If your application is very busy with reading from and writing to a model file, especially when the model is too big and Virtual Memory ( VM ) is used, then the CPU doesn't do anything for your application; it simply waits for the completion of these I/O operations.

3. Actually, the processing of your model was done in ~11.67 hours ( 42000 / 60 / 60 = ~11.67 ) and all the rest of the time was spent on I/O operations ( or, VM operations ).

4. Take a look at the Performance property page of Windows Task Manager and you will see lots of "pillars". Please make a screenshot and post it for review.

5. Let me know if you need an example of these "pillars" ( a screenshot of the processing of one of my algorithms with a heavy load of I/Os, that is, VM operations ); I could easily provide it.

6. What system do you use? How much Physical Memory is installed? What are the settings for Virtual Memory? What are the sizes of your models?

>>...With smaller models and runs, CPU time and real time are pretty much synchronized...

7. Because a smaller model fits in Physical Memory and doesn't use too much Virtual Memory. Again, use the Performance property page of Windows Task Manager to verify.

Do you have an expectation of how much memory your simulation requires? Does it exceed the physical memory installed, after allowing for the memory requirements of the operating system and other processes?
There is a suspicion that you are losing time due to I/O delays, hardware conflicts or virtual memory paging.
In Task Manager, include "Page Faults" and "PF Delta", as these indicate paging taking place in your process/simulation.
Also "Working Set (Memory)" and "Peak Working Set (Memory)", in comparison to installed memory, is another indicator.
The easy solution for this cause of delay is to have sufficient physical memory installed, and also to check that your Virtual Memory configuration is adequate. ( On Win 7 see Control Panel > System > Advanced System Settings > Performance Settings > Advanced > Virtual Memory. )
Memory is cheap, and unless your new memory demand is over the top ( compared to your 32-bit past ) this might solve your problem.

Alternatively, there could be hardware clashes with other processes over I/O, from virus checkers etc., although these should not be significant. How many simulation processes are you running?

Task manager might not be the best measure of performance, but it should be able to identify the most likely cause.

John

Hi NotThatItMatters,

What do you mean by "real time"? It has an ambiguous meaning.

As I pointed out in my first post, your code cannot fully utilize your CPU resources unless system kernel-mode threads are spawned and run at elevated IRQL = 0x2, or your thread(s) operate at real-time privilege level, which is not recommended because of the starvation it can induce in some system threads. Your application must share the CPU with hundreds of threads, many of them running with higher privilege. Moreover, on a multicore system CPU_TIME accounting can continue on one core while, for example, part of your code has been swapped out, so you will not really be measuring the performance of your code. For accurate performance measurement Task Manager is not recommended. You can profile your application with the Xperf toolkit, which provides a nice graphical breakdown of thread activity in both user mode and kernel mode.

>>...As I pointed out in my first post, your code cannot fully utilize your CPU resources unless system kernel-mode threads are
>>spawned and run at elevated IRQL = 0x2, or your thread(s) operate at real-time privilege level

The problem is not related to real-time processing; the expression should be changed to total processing time.

Memory load and store operations ( Port 2 and Port 3 ) can be performed in parallel with heavy floating-point calculations. The main problem will be waiting on I/O completion ( asynchronous waiting ); when you inspect your thread's call stack you will see kernel context-switching functions on top of the stack.

As John pointed out, please check the number of page-fault operations occurring per second, because resolving a page fault is very costly and can slow down your system ( significant overhead is created by interrupting the CPU, calling the default kernel-mode page-fault handler ( located via the IDT ) and calling the disk.sys driver ). Moreover, some AV software can interfere with I/O operations, from hooking ReadFile and WriteFile to installing filter drivers and inspecting the buffers being sent and received.

I meant the thread's privilege level, not an application running in real time ( in the domain of video simulations ).

>>...Memory load and store operations ( Port 2 and Port 3 ) can be performed in parallel...
>>...kernel context-switching functions...
>>...the CPU calling the default kernel-mode page-fault handler...
>>...located via the IDT...
>>...disk.sys driver...
>>...installing filter drivers...

Please be advised that this is a forum for Fortran developers, not a forum for low-level driver developers. Please wait for a response from NotThatItMatters and don't clutter the forum with low-level hardware details.

And how does my answer differ from, say, John's? He also used technical jargon.

NotThatItMatters,

Sergey's post on Tue, 05/07/2013 - 17:32 is right on target. Use John Campbell's recommendation to see if excessive page faults are taking place. There is one additional thing to investigate relating to page faults. On Windows you can set the page file at nnnn MB and permit it to grow as necessary. That sounds benign; however, when an app requires the page file to grow, I have personally observed that the expansion process slows the app down to the point where you may think it has stopped ( which appears to be what you are observing ). The partial fix is to set your page file to a proper working size for your application.

I say "partial fix" because this addresses only the ( potential ) page-file expansion issue. It will not fix excessive paging by a virtual application that is larger than physical RAM. The fix for that is to pay attention to data locality; in other words, get as much work as possible out of an area of data before moving on to different areas, i.e. partition the work. Also keep in mind that if you come from C/C++ you program with the right-most index of a multi-dimensioned array varying the quickest ( as the inner loop control variable ). With Fortran it is the other way around ( use the left-most index as the inner loop control variable ).

Recommendations:

a) Set page file lower limit to at least working size for your application
b) Assure loop order with respect to Array(InnerMostLoopIndex, MiddleLoopIndex, OuterLoopIndex)
c) If necessary, partition ( sometimes called tiling ) the work into smaller pieces, usually done by adding an outer loop ( or multiple outer loops ). Example:

! process cells
do iRow = 1, nRows
  do iCol = 1, nCols
    ! ... process cell (iRow, iCol)
  end do
end do

Becomes

! specify tile size
iRowChunkSize = mmm ! you determine this value
iColChunkSize = nnn ! you determine this value
! now process tiles
do iRowChunk = 1, nRows, iRowChunkSize
  do iColChunk = 1, nCols, iColChunkSize
    do iRow = iRowChunk, MIN(iRowChunk + iRowChunkSize - 1, nRows)
      do iCol = iColChunk, MIN(iColChunk + iColChunkSize - 1, nCols)
        ! process cells in tile
        ! ...
      end do
    end do
  end do
end do

Jim Dempsey

www.quickthreadprogramming.com

>>...c) If necessary, partition ( sometimes called tiling ) the work into smaller pieces, usually done by adding an outer loop
>>( or multiple outer loops )

I think Jim has just described the loop-blocking optimization technique, and it is highly efficient.

>>a) Set page file lower limit to at least working size for your application...

I'd like to provide two screenshots which demonstrate how terrible the situation is when there is excessive paging during processing:

Here is a short description:

- An older computer system has 1GB of Physical Memory ( PM ) and the VM settings are: Initial Size = 256MB and Maximum Size = 2048MB

- The application starts and allocates ~1.98GB of memory for a data set: ~0.90GB from PM and ~1.08GB from VM

- A red circle shows an area with "VM Pillars"; during that time the operating system is very busy allocating pages in the VM file, and the performance of the application varies from ~5% to ~75%, with an average of ~30% ( very close to your 25%! )

- As soon as the VM pages are allocated, processing performance increases ( you see a "Plateau" ) and CPU usage is 100% ( everything is back to normal )

- In total, 5 iterations of the same processing are done. During the "VM Pillars" phase it takes ~300 seconds to complete all calculations; each of the remaining 4 iterations is done in ~75 seconds ( about 4x faster! / "VM Plateau" phase )

- The "VM Pillars" phase could be classified as VM-bound processing ( significantly reduced performance )

- The "VM Plateau" phase could be classified as CPU-bound processing ( performance is not affected by VM operations )

Attachment: vm-pillars-1.jpg (83.56 KB)

Here is a complete overview of processing:

Three circles show the different phases of processing; let us know if you have any questions.

Attachment: vm-pillars-2.jpg (82.34 KB)

This is expected behaviour of a system "plagued" by heavy activity of the page-fault handler and the memory manager's allocation routines. I would like to see the percentage of time spent in kernel mode. I would add that Task Manager is not the most suitable tool for performance measurement; perfmon is recommended on Win XP. I suppose those spikes at the beginning could come from disk.sys DPC routines.

Sergey, can you add perfmon disk I/O statistics?

Thank you for all the insight. In order to get the process running expeditiously, I will evidently need to figure out how to partition memory effectively. In general I have been using left-to-right indexing in multi-dimensional arrays. There are several exceptions, specifically in the "solver" routine where the largest arrays are handled. Attached are three screenshots from Task Manager showing execution.

Before going into the hows and whys, let me ask a novice question about stack size.  Suppose I have a routine as follows:

SUBROUTINE FOO(II, JJ, KK, A)
INTEGER, INTENT(IN) :: II, JJ, KK
REAL (KIND = 8), DIMENSION(II, JJ, KK), INTENT(INOUT) :: A
END SUBROUTINE FOO

What is the comparison in stack usage between this and the following?

SUBROUTINE FOO(II, JJ, KK, A)
INTEGER, INTENT(IN) :: II, JJ, KK
REAL (KIND = 8), DIMENSION(II, JJ, *), INTENT(INOUT) :: A
END SUBROUTINE FOO

Attachments: 

>>...Sergey can you add perfmon disk I/O statistics?

Not at the moment. You could easily simulate the same "terrible" situation with "VM Pillars" with a couple of lines of Fortran or C/C++ code. So, please do your own research, coding, testing and analysis.

The Task Manager screenshots confirm what has already been said. I can add that reserved hardware MMIO is not high, mainly because of the low-end on-die GPU. Moreover, the page file is almost full, which means extensive disk I/O as large portions of paged system memory are moved to the page file. SuperFetch will also be affected because of the very small amount of free memory not allocated to any process or device.

Okay, now that it is resolved that page swapping is the likely cause of the slowdown, how can I fix this?

1) Will limiting stack size have an appreciable effect?  (See the previous post!)

2) Will array indexing have an appreciable effect?

This is a large executable with many routines.  I need to have some idea on how to solve this problem before I branch and test.

NotThatItMatters,

Two answers to some of your points,

1) There is not much difference between REAL (KIND = 8), DIMENSION(II, JJ, KK), INTENT(INOUT) :: A and REAL (KIND = 8), DIMENSION(II, JJ, *), INTENT(INOUT) :: A. The important thing is that you should be varying the II index in the inner loop, not the KK index. This keeps the memory being addressed local, as varying KK can span many MB or even GB, which drives paging demand and kills cache benefits. This can be a significant performance issue even when there is no paging, since jumping around memory loses the benefit of the processor cache. If this is a problem, changing the order of the array subscripts can help, but you MUST make sure you make this change for every use of the array.
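A minimal sketch of the favourable nesting ( the routine name is illustrative; only the loop order matters ):

subroutine scale_array(ii, jj, kk, a)
  implicit none
  integer, intent(in) :: ii, jj, kk
  real(kind=8), intent(inout) :: a(ii, jj, kk)
  integer :: i, j, k

  ! Favourable order: the left-most index I varies fastest, so consecutive
  ! iterations touch consecutive memory locations (Fortran is column-major).
  do k = 1, kk
    do j = 1, jj
      do i = 1, ii
        a(i, j, k) = 2.0d0 * a(i, j, k)
      end do
    end do
  end do
  ! Reversing the nesting (I outermost, K innermost) puts II*JJ elements
  ! between consecutive accesses, touching a different page on almost every
  ! iteration once the array exceeds physical memory.
end subroutine scale_array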

2) If paging is the problem, you still have not indicated the memory size of the program. You need some idea of this size to a) make sure your virtual paging size is big enough, and b) consider how much physical memory to install. The best solution to paging is to install more memory to reduce the amount of paging required. Another useful measure when paging occurs is to put pagefile.sys on an SSD, as this reduces disk I/O delays.

To answer your last post,
Not sure of your reference to stack size, but I don't think it is relevant. ALLOCATE is best for big arrays, as allocatable arrays do not use the stack ( see the sketch below ).
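A hedged sketch of the distinction ( the routine and array names are made up for illustration ):

subroutine build_workspace(n)
  implicit none
  integer, intent(in) :: n
  ! An automatic array, e.g.  real(kind=8) :: work(n, n),  would be created
  ! on the stack at entry and can overflow the stack for large n.
  real(kind=8), allocatable :: work(:, :)
  integer :: istat

  allocate(work(n, n), stat=istat)       ! heap allocation: bounded by virtual
  if (istat /= 0) stop 'allocate failed' ! memory, and failure can be tested
  work = 0.0d0
  ! ... use work ...
  deallocate(work)
end subroutine build_workspace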
The size of the executable is also not an issue, as the code size of many routines typically takes much, much less memory than the declared arrays. ( Initialising large arrays with DATA statements etc. can influence .exe size but should not affect run time; I initialise arrays at run time rather than at startup. )
Array indexing/subscript order is a big, possibly huge, factor.

You need to understand how much memory you are using in your simulation and understand the efficiency of memory addressing. Hope this helps.

John

John provided you with tips for questions 1 and 2, and I'd like to repeat the same: it would be nice to install more Physical Memory, for example 4GB or 8GB.

Note: that has already been said several times, as far as I remember.

Attached is a simple example of changing subscript order and loop order to change the run time. When mixing arrays of different sizes and different index orders, the result is not always clear. If you select li = lj and lk = 5*big, the performance changes a bit.
The attached example induces paging on my PC, to cause the time variations.

Edit: I updated the test to better show the effect of paging, and hopefully of memory caching, for the different test sizes. You can experiment with values of li, lj and lk to get different effects, but from my tests the best solution changes as the memory footprint increases. My paging test was with an SSD, so it would probably be slower on a HDD. I have not tested /Qparallel !!

John 

Attachments: subscript.zip (2.92 KB), subscript-ver2.f90 (4.55 KB)

{ Correction } There are two statements:

>>...Another useful measure when paging occurs is to put pagefile.sys on an SSD, as this reduces disk I/O delays...

and

>>...My paging test was with an SSD, so it would probably be slower on a HDD...

So, you've simply confirmed that the idea to use an SSD is good.

As Sergey suggested, you can add more memory. Regarding SSDs: they have faster access times and reduced latency, and it is advisable to put pagefile.sys on an SSD drive. But over a prolonged period you will see some performance degradation if the page file is used heavily.

Sergey and ilyapolak,

Yes, you have identified that using an SSD and adding more memory both help reduce paging delays. Array dimensions and subscript order also contribute significantly to the number of page faults incurred in the calculation.

I have also tried to show that by localising the memory addressing in the computation, cache usage can be improved significantly, further improving run-time performance. The contrary view is probably more relevant: referencing a large memory footprint can reduce the effectiveness of cache usage. This can affect performance even before extending into virtual memory paging.

The other areas for run-time improvement could be vectorisation and parallelisation, although these become more difficult to implement in a large simulation application.

There has been little mention of profiling. Where is all the run time occurring? Typically, in the simulation modelling that I perform, there are many iterations and the run-time bottleneck can be localised to very few routines. This could be investigated.

Scaling up a model from 32-bit to 64-bit can change where the bottlenecks occur. It could well be that the bottlenecks in a larger model move, and so a new area of the model needs to be improved.

NotThatItMatters, you have many options to investigate. Hopefully, our suggestions cover the cause of your problems.

John

NotThatItMatters,

In your subroutine FOO, array A is passed by reference. A(II,JJ,KK) should be processed:

DO K = 1, KK
DO J = 1, JJ
DO I = 1, II
   ! ... operate on A(I, J, K): the left-most index varies fastest ...
END DO; END DO; END DO

Also, your system has 4GB RAM. Committed shows 10.1/12.6 GB ( several times physical RAM ). Nothing is inherently wrong with this as long as you reduce page faults. Assume for the sake of argument you had two such arrays A(II,JJ,KK) and B(II,JJ,KK), each 4GB ( 8GB of your 10GB ); an example would be II=JJ=KK=800. Virtual memory on Intel64 can use either a 4KB page size or a 2MB large-page size. Lower-memory ( physical RAM ) systems tend to use the 4KB page size. The paging system may page one page at a time or multiple pages at a time. Regardless of this, the two arrays ( A, B ) require 4GB more than fits in RAM. A simple A=B could require between 4GB and 8GB of disk I/O. At 100MB/s this one statement could take 40 to 80 seconds; at 1000MB/s, 4 to 8 seconds. I do not know your disk system.

Many operations on such arrays involve a smaller array X interacting with the larger array A. In these cases you want to structure the code to run through X in your inner loops while advancing through A in the outer loops. In this manner you will reduce paging to at most one pass through A ( 4GB ). With A in the inner loops and X in the outer loops, you may require on the order of SIZE(X) passes through A ( 10's, 100's, ... more paging ).
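A hedged illustration of this point, using a matrix-vector product as a stand-in operation ( the sizes and names are mine, not from the original application ):

program matvec_sketch
  implicit none
  integer, parameter :: m = 20000, n = 20000   ! a is ~3.2GB at these sizes
  real(kind=8), allocatable :: a(:, :)
  real(kind=8) :: x(n), y(m)
  integer :: i, j

  allocate(a(m, n))
  call random_number(a)
  call random_number(x)

  ! Outer loop over columns, inner loop down a column: the huge array a is
  ! walked contiguously and exactly once, so each of its pages is faulted
  ! in at most one time, while the small x and y stay cache-resident.
  y = 0.0d0
  do j = 1, n
    do i = 1, m
      y(i) = y(i) + a(i, j) * x(j)
    end do
  end do

  print *, y(1), y(m)
end program matvec_sketch

Swapping the loops ( i outermost ) would stride across a row-wise, revisiting every page of a on each of the m passes, which is precisely the many-passes-through-A case described above.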

If you are performing rather standard operations, consider using something like Intel's MKL. This library is well written for cache locality, which has the side effect of also being page-file efficient.

On a different forum something else was touched upon. A user's application was crashing with out-of-memory errors even though the allocations were well within the limitations of the system. This would occur late in the run of a program, and the behavior looked like a memory leak. Running diagnostics showed the program had no memory leak. At issue was that the system page file was being consumed; as it turned out, not by the application, but by a "feature" of Windows whereby it buffers file writes into ( virtual ) RAM, and at some point this spills into the page file. Although in your situation you are not running out of memory, this too can greatly aggravate paging latencies. Fortunately, there is a way to turn this feature off ( I do not recall the specifics ).

Jim Dempsey

www.quickthreadprogramming.com

It is a good note.

>>... At issue was the system page file was being consumed...

Windows' VM manager extends the size of the VM paging file until it reaches the Maximum Size defined in the System applet of Control Panel. I usually set that value to at least 4x the amount of Physical Memory ( PM ); for example, on a system with 32GB of PM the value is 128GB for VM. Allocating 96GB is then not a problem, but processing is very slow, similar to the initial problem of this thread.

Instead of relying on slow paged memory, albeit on an SSD drive, the better option is to invest in additional RAM. The overhead of thousands of I/O operations performed in kernel mode, and the extensive context switching, can be reduced by adding more physical memory to the system.

Thank you all for your help. I now have some possibilities. As a first step, I will attempt to re-index the big arrays from

A(IJK, 3, 3)

to

A(3, 3, IJK)

This way the 3x3 array operations, such as matrix inversion and the like, will be handled in contiguous memory. That in itself may have a marked effect on performance.
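For what it's worth, a hedged sketch of what that re-indexing buys ( sizes and names are illustrative ): with A(3, 3, IJK) the nine entries of block N occupy one contiguous 72-byte run instead of being strided IJK elements apart.

program reindex_sketch
  implicit none
  integer, parameter :: ijk = 1000000
  real(kind=8), allocatable :: a(:, :, :)
  real(kind=8) :: block(3, 3)
  integer :: n

  allocate(a(3, 3, ijk))
  call random_number(a)

  do n = 1, ijk
    block = a(:, :, n)   ! nine contiguous values: one small memory region
    ! ... invert or multiply the 3x3 block here ...
    a(:, :, n) = block
  end do
  ! With the old layout a(ijk, 3, 3), the nine values of one block were
  ! strided ijk elements apart, so one 3x3 operation could touch nine
  ! different pages.
  print *, a(1, 1, 1)
end program reindex_sketch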

As far as suggestions for increasing RAM and the like go, you must realize that what I am simulating is a test; the software itself goes out to clients with many operating systems and RAM configurations. I would like this model to run in a stable environment without much extension, so that I can be assured that when our clients run similar models on machines without amazing cache or memory, they can do so without too much trouble.

NotThatItMatters,

The change to A(3,3,ijk) should have a significant effect, both for moderate ( cache ) and large ( paging ) memory usage. I'd be interested to find out how effective this is.
You might also find that Resource Monitor ( a button in Task Manager ) provides more info as the program runs.
My subscript-ver2.f90 gave a simple example of timing both CPU and elapsed time. Generating logs of run time { open (unit=99, file='runtime.log', position='append') } could provide a useful record for comparing changes. Make sure you include notes in runtime.log of what each run tested, as it's easy to forget what you changed. A small sketch of such a logger follows.
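A hedged sketch of such a logger, built around that open statement ( the routine name and output format are my own ):

subroutine log_runtime(tag, cpu_sec, elapsed_sec)
  implicit none
  character(len=*), intent(in) :: tag          ! note what this run tested
  real(kind=8), intent(in) :: cpu_sec, elapsed_sec

  ! Appends one line per run: a tag plus CPU and elapsed seconds.
  open (unit=99, file='runtime.log', position='append')
  write (99, '(a, t40, f12.2, 2x, f12.2)') tag, cpu_sec, elapsed_sec
  close (99)
end subroutine log_runtime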

Let us know how you go.

John

Sorry for the short follow-up.

Attached is a zip file ( MemTestApp.zip ) with the sources of a very simple test application that allows you to allocate different amounts of memory. Even though it is implemented in C/C++, please don't be afraid to use it, and let me know if you have any questions. I used that application a lot during the initial phase of porting 32-bit code to a 64-bit platform, because I wanted to see how memory management behaves in extreme cases ( for example, allocation of memory blocks larger than 64GB ).

Attachment: memtestapp.zip (6.85 KB)

The change from A(IJK, 3, 3) to A(3, 3, IJK) has been implemented ( it took a fair bit of rewriting ). I am testing it on a large model with the x64 executable. It is doing "better" than the original, but not great: it improves things by a factor of about 2.5 to 1. That still means that for every second of CPU time there are 10+ seconds of connect time. Reading through these posts, I am wondering if increasing the page size or cache might improve things. I have no insight into this, but I suppose I might expect marginal improvement.

>>...for every second of CPU time there are 10+ seconds of connect time...

What do you mean by connect time?

Clock time, Greenwich Mean Time, Coordinated Universal Time, ...

NotThatItMatters,

You have not indicated the memory demand of your large model. A few statistics that would help are:
(for your program, from Windows Task Manager)
Memory - Working Set
Memory - Peak Working Set
Page Faults
I/O Read Bytes
I/O Write Bytes
(For your PC)
physical memory installed
paging file size

You should review these as the program is running and memory demand peaks, if you are using ALLOCATE.

It would help to understand the problem. There has to be an explanation for this.

John

Ps: I found your Task Manager posts above. I presume this is Windows 8, as I am not familiar with the layout. I could not see how much virtual memory your program uses, but it did appear that you were getting 83% of the physical memory. I presume the virtual total is a lot more. I also note that disk transfers are 3 MB/sec, which over a week is a lot of transfers.
It would be good to identify the relative sizes of:
paging space available
physical memory available (4GB)
memory demand of the program (important)
memory allocated to the program ( 3.3GB, 83% of 4GB, if I am reading tmprocesses.png correctly )

Given the program is using 83%, there is not much left for the operating system and other things, such as the disk cache for I/O.
If you could replace the memory cards with larger-capacity cards, say 3 x 4GB, this would give 12GB of memory and would probably change the performance considerably.
Then again, if your program wants, say, 20GB of memory and this is a critical application, then you need to purchase a PC that can install, say, 24 to 32GB of memory. ( You need to get the memory demand of your application and make sure you are not too starved of physical memory, which the 83% allocation to your application implies. )

 

I think the term "connect time" dates back to when we used to use dial-up modems....

Steve - Intel Developer Support

Actually, my use of "connect time" dates back to VAX-11/780 Fortran 77, which used to give you CPU and connect time. I guess that dates me.

Given that I was on the VAX FORTRAN 77 project, and was its lead for a number of years, it dates me too. But that was the era of dialup and the concept of metered connect time was very common.

Steve - Intel Developer Support

@NotThatItMatters

Can you run perfmon? I presume that Win8 retained this profiling tool. Please post screenshots with the variables mentioned by John.

//www.computerperformance.co.uk/win8/windows8-performance-monitor.htm


Quote:

Steve Lionel (Intel) wrote:

Given that I was on the VAX FORTRAN 77 project, and was its lead for a number of years, it dates me too. But that was the era of dialup and the concept of metered connect time was very common.

Steve, you probably got to know Dave Cutler :)
