CPU time vs. time


Yes, I certainly did.

Retired 12/31/2016

Great:)  I was not even born at that time.

Okay, so I am once again running the app, but this time with Performance Monitor running concurrently.  Aside from the obvious inference from its name, what is this supposed to be telling me?

NotThatItMatters,

Have you included my recommendation to modify the paging file size?
You need to know the virtual memory demand of your simulation application and increase the paging file size to cope with it and with the other applications that are running and may require paging space.

John

>>>Performance Monitor running concurrently.>>>

Gather statistics about what John suggested in his response.

Modifying the paging file size has been done, with little or no effect on run time.

What are the rates of page faults/sec and disk I/O operations/sec? I am just curious.

Pagefaults/sec 25000, disk I/O 125 MB/sec. This is on a different machine with 6 GB RAM and a solid-state disk drive, but running a model on a remote drive. Again, I am attempting to get some benchmarks.

>>Pagefaults/sec 25000

This means either migrating to 64-bit with sufficient RAM, or spending the time to restructure the application to reduce the page faults/sec by 3 orders of magnitude. Some of the techniques to do this were posted near the front of this thread (see references to tiling or partitioning). I am surprised that VTune doesn't have a "collect page faults" option (at least it is not listed in the VTune index nor found with Search). It should be easy enough for VTune to point out which statements are causing excessive page faults. This information would be helpful in deciding what to reorganize.

Jim Dempsey

I wonder if a time-based profiler (as opposed to an event-based profiler) would exhibit a spike of hits on instructions that cause a page fault (note: the thread causing the page fault would suspend for paging, but the monitoring thread will continue to collect samples).

Jim Dempsey

NotThatItMatters,

Your last post surprises me. We still do not have an estimate of the memory demand of your "large" application.
Presumably it is more than 4 GB, given the dramatic paging number you have posted (25,000 page faults/sec).
I would recommend that you identify all the main arrays you are using, list them in a spreadsheet and calculate their memory usage in bytes. I gave an example of this recently in another post and it does help to identify where the problem could be. You could even calculate the memory demand in your program, using the SIZEOF intrinsic.
I assume there are lots of ALLOCATE'd arrays, as the static (module or common) arrays would be limited to 2 GB.
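
A minimal sketch of the kind of in-program tally I mean is below. The array names and sizes are only placeholders for your actual arrays; SIZEOF is an Intel Fortran extension, and the commented lines show the standard STORAGE_SIZE alternative.

program memory_tally
   implicit none
   ! A, B and C stand in for the real arrays of the simulation.
   real(8), allocatable :: A(:,:), B(:,:), C(:)
   integer(8) :: bytes

   allocate (A(3,3000), B(3000,3000), C(3000))

   ! SIZEOF (Intel extension) returns the size of each array in bytes.
   bytes = sizeof(A) + sizeof(B) + sizeof(C)
   ! Standard Fortran 2008 equivalent:
   ! bytes = size(A,kind=8)*storage_size(A)/8 + size(B,kind=8)*storage_size(B)/8 &
   !       + size(C,kind=8)*storage_size(C)/8

   write (*,'(a,f10.2,a)') 'Main arrays require ', real(bytes)/2.0**20, ' MB'
end program memory_tally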

This is an important measure of your program size and it does matter.

John

Jim explained it perfectly, so there is very little to add. I would like to see an exact breakdown of thread activity and the time spent in various subroutines; you can measure it with Xperf. I am mildly curious to see how much time is spent servicing those page faults. I bet it could be <= 5-10%.

Jim,

The Xperf tool can monitor page-fault activity, and I suppose there is also an option to get the exact code location (presumably inside a function) which causes the page faults.

This is a short follow up.

Here is a screenshot that demonstrates CPU Usage and Page File Usage when there are lots of I/O operations during some processing:

As you can see, the CPU is not busy.

Attachment: cpuandpagefileusage.jpg (90.33 KB)

Task Manager has low "granularity" of CPU load measurement. You can see that there is a low percentage of CPU time spent servicing that I/O, but from the user-mode code's "point of view" such operations are very costly.

>>...Task manager has low "granularity" of cpu load measurement...

Windows Task Manager has four selections for Update Speed:

Paused, Low, Normal and High

The last three always show the right CPU Usage, and when the I/O operations are completed CPU Usage increases to 99% on a one-CPU system. I could post a screenshot which demonstrates it.

Thank you all for your insight. I am busy taking the code apart and purging it of unnecessary arrays. This is going to take some time, since there are more than 10 linear solver routines compiled in the "ancient" code but really only 3 active ones, and actually just one appropriate set of memory is needed for all the solvers. I am trying to free up: 1) stack and 2) allocatable arrays. I am noting that simply increasing memory to 8 GB now makes the original model run with CPU time = clock time, although a simple look at memory requirements would indicate it is doing some page swapping to utilize VM. This is showing me the need for appropriate allocation and maintenance of arrays.
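
For reference, the "CPU time = clock time" comparison can be measured directly in Fortran. This is only a sketch (the work loop is arbitrary and just stands in for the real run): CPU_TIME reports processor time, SYSTEM_CLOCK reports elapsed wall-clock time, and its COUNT_RATE argument gives the resolution of that clock.

program cpu_vs_wall
   implicit none
   real(8) :: t_cpu0, t_cpu1, x
   integer(8) :: c0, c1, rate
   integer :: i

   call cpu_time(t_cpu0)
   call system_clock(c0, count_rate=rate)

   x = 0.0d0
   do i = 1, 50000000            ! arbitrary work, stands in for the model run
      x = x + sqrt(real(i,8))
   end do

   call cpu_time(t_cpu1)
   call system_clock(c1)

   write (*,*) 'CPU time  :', t_cpu1 - t_cpu0, ' s'
   write (*,*) 'Wall time :', real(c1 - c0, 8)/real(rate, 8), ' s'
   write (*,*) x                 ! keeps the work from being optimised away
end program cpu_vs_wall

When the two numbers agree the run is compute-bound; when wall time is much larger than CPU time, the process is waiting on something (paging or other I/O).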

I mean the timer resolution, which is around ~10 ms. The Clockres tool will report the actual timer resolution.

>>... I am noting that simply increasing memory to 8 Gb now makes the original model run with CPU time = clock time...

Do you mean Physical or Virtual memory?

Physical memory. Virtual memory has never been a problem. The machines in question have a great deal of disk space and a large allotment of virtual memory (32 GB if memory serves).

NotThatItMatters,

It is good to find that you have obtained improved performance, at least increased %CPU. This is probably due to the increase in physical memory.
The other change was changing the order of array index usage to improve sequential memory access, as you identified with A(KK,3,3) to A(3,3,KK). If processing with DO k; DO j; DO i, this should improve the availability of data in the cache, as Jim has identified in a number of posts. This might not change the %CPU (once page faults are corrected) but will significantly improve (cache) performance and run times.
The combination of both of these can be very effective.
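
For what it is worth, a minimal sketch of the access pattern being described (the array and loop bounds are invented): because Fortran stores arrays in column-major order, the first index should vary in the innermost loop so consecutive iterations touch consecutive memory locations.

program loop_order
   implicit none
   integer, parameter :: nk = 100000
   real(8), allocatable :: A(:,:,:)
   integer :: i, j, k

   allocate (A(3,3,nk))          ! A(3,3,KK) layout rather than A(KK,3,3)

   ! k outermost, i innermost: the array is walked sequentially in memory,
   ! which keeps cache lines (and, under paging, whole pages) fully used.
   do k = 1, nk
      do j = 1, 3
         do i = 1, 3
            A(i,j,k) = real(i + j + k, 8)
         end do
      end do
   end do

   write (*,*) sum(A)
end program loop_order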

There are two further levels of improvement: vectorisation and OpenMP. Vectorisation is easy, as it is just a compiler option, again assisted by sequential memory usage, while parallelism will probably result in more posts on this forum.
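
By way of illustration only (the loop below is invented, not taken from the application), the OpenMP step is roughly a directive like the one shown; the inner arithmetic is left to the compiler to vectorise. With Intel Fortran the directive needs /Qopenmp (Windows) or -qopenmp (Linux) to be honoured.

program omp_sketch
   use omp_lib
   implicit none
   integer, parameter :: n = 1000000
   real(8), allocatable :: x(:), y(:)
   integer :: i

   allocate (x(n), y(n))
   x = 1.0d0

   !$omp parallel do shared(x, y) private(i)
   do i = 1, n                   ! iterations split across threads; the body vectorises
      y(i) = 2.0d0*x(i) + 1.0d0
   end do
   !$omp end parallel do

   write (*,*) y(n), ' threads available: ', omp_get_max_threads()
end program omp_sketch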

Hopefully you have achieved a significant improvement,

John

>>>Windows Task Manager has four selections for Update Speed:>>>

I meant binding of the measurement to specific threads/interrupts. This functionality is not supported in Task Manager.

>>>>Windows Task Manager has four selections for Update Speed...
>>
>>I meant binding of the measurement to specific threads/interrupts. This functionality is not supported in Task Manager...

Task Manager was not designed to provide all that information. I simply don't understand how it could change matters in this case. It was already reported that additional physical memory helped and, I think, we reached the "Bottom of the Ocean" in that discussion.

>>> I think, we reached "Bottom of the Ocean" in that discussion.>>>

Agree with you.

>> I think, we reached "Bottom of the Ocean" in that discussion.

Only if all platforms the app runs on can install at least 8 GB of RAM. If this is not the case, then the programmer must take the paging aspects of a virtual memory system into consideration. Many larger-than-physical-RAM problems can be quite nicely solved using virtual memory and paging. It is unfortunate that too many of the younger programmers do not grasp this concept. This apparently was an old program to begin with. My guess is this old program ran quite nicely at one time on a system with a few MB and using sequential files. I am not suggesting that you revert to sequential files, rather that the programmer think of the process in sequential terms (and how to get the most out of each sequence within the physical memory constraints of the system).

Jim Dempsey

FWIW - Many of these "legacy FORTRAN" programs were written to run on a DEC KA10 (256 K words), DEC KI10/KL10 (8 MB), DEC VAX-11/780 (max RAM 8 MB), or DEC VAX 7000/10000 (~3.5 GB RAM). In the earlier days, large problems may have been 100x the size of physical RAM, and input/output was from/to magnetic tape. In those days, the programmer paid attention to the sequence of operations in the program to minimize the seeks and the number of reads/writes. The art of programming on these machines (read "low memory"/"less memory" than required) is lost on the current generation of programmers. A really good programmer will recognize situations where "more" is "less". A good example of this might be a matrix multiplication or FFT where:

Method A: fastest when run with sufficient RAM
Method B: slowest when run with sufficient RAM

Then, when a problem larger than physical RAM is presented to the methods, Method B becomes far superior. The general problem (IMHO) is that CS courses generally do not teach students when to choose Method B (more is less), nor do they discuss Method B (other than as an example of what not to use).

Jim Dempsey

That is true. Sadly, today's young generation of programmers works at a very high level of abstraction with scarce knowledge of the low-level details.

I am currently examining the written code for ways of minimizing allocation. Let me explain the problem: a 2x2 block matrix needs to be solved. The main block (item 1,1) is quite large but very banded. Hence its approximate solution is quite simple using standard techniques. The (1,2) and (2,1) entries are the main memory hogs. On input, these matrices are quite (emphasis here) sparse. However, the banded matrix solution of (1,1) yields a dense working matrix for (1,2). All of these matrices ((1,2), (2,1), (2,2)) along with their working counterparts are currently allocated as full matrices in the "main" and passed through several subroutines to fill and solve them. I am trying to modularize the initialization and use of these arrays. This is not a trivial matter, although several of you probably have insight into solving similar problems.
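
For what it is worth, one way to pull the allocation out of the "main" is a small module that owns the blocks, so the fill and solve routines USE it instead of taking the full matrices through their argument lists. The names and the band-storage choice below are invented for illustration, not taken from your code.

module block_system
   implicit none
   ! Hypothetical names for the 2x2 block system described above:
   ! A11 is the large banded block (stored in band form, not as a full matrix),
   ! A12 and A21 are the off-diagonal blocks, A22 the corner block.
   real(8), allocatable :: A11(:,:), A12(:,:), A21(:,:), A22(:,:)
contains
   subroutine alloc_blocks(n1, n2, band)
      integer, intent(in) :: n1, n2, band
      allocate (A11(2*band+1, n1), A12(n1, n2), A21(n2, n1), A22(n2, n2))
   end subroutine alloc_blocks

   subroutine free_blocks()
      if (allocated(A11)) deallocate (A11, A12, A21, A22)
   end subroutine free_blocks
end module block_system

With something like this, alloc_blocks is called once, any routine that needs the blocks simply USEs the module, and free_blocks releases the memory as soon as the solve is finished.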

As you go through the code, look at the subroutines to see if they can work on sections or, possibly, individual rows and columns, on a chunk-by-chunk basis. Current (high-paging) code:

subroutine sub1(A, B)
   do iRow = 1, nRows
      do iCol = 1, nCols
         ! doWork on element (iRow, iCol)
      end do
   end do
end subroutine sub1

subroutine sub2(A, B)
   do iRow = 1, nRows
      do iCol = 1, nCols
         ! doWork
      end do
   end do
end subroutine sub2
...
call sub1(A, B)
call sub2(A, B)
...

You might find it more beneficial to use:

subroutine sub1(A, B, iRow)
   do iCol = 1, nCols
      ! doWork on row iRow, column iCol
   end do
end subroutine sub1

subroutine sub2(A, B, iRow)
   do iCol = 1, nCols
      ! doWork
   end do
end subroutine sub2
...
do iRow = 1, nRows
   call sub1(A, B, iRow)
   call sub2(A, B, iRow)
   ...
end do

The above is tiling by row, but you could tile by square-ish sections of the matrix too. The idea here is to make as much use of the data as possible should paging be required.
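
A rough sketch of the square-tile variant follows (tile size, bounds and the work inside sub1/sub2 are all invented). The point is only that both passes over the matrix are broken into blocks, so each block is touched by both routines while it is still resident.

program tiled_sketch
   implicit none
   integer, parameter :: nRows = 2048, nCols = 2048, tile = 512
   real(8), allocatable :: A(:,:), B(:,:)
   integer :: i0, j0

   allocate (A(nRows,nCols), B(nRows,nCols))
   A = 1.0d0
   B = 0.0d0

   ! Square-ish tiles: sub1 and sub2 both work on the same tile before
   ! the loop moves on, maximising reuse of whatever is paged/cached in.
   do j0 = 1, nCols, tile
      do i0 = 1, nRows, tile
         call sub1(A, B, i0, min(i0+tile-1, nRows), j0, min(j0+tile-1, nCols))
         call sub2(A, B, i0, min(i0+tile-1, nRows), j0, min(j0+tile-1, nCols))
      end do
   end do

   write (*,*) sum(B)

contains
   subroutine sub1(A, B, i1, i2, j1, j2)     ! stands in for the first doWork pass
      real(8), intent(in)    :: A(:,:)
      real(8), intent(inout) :: B(:,:)
      integer, intent(in)    :: i1, i2, j1, j2
      B(i1:i2, j1:j2) = B(i1:i2, j1:j2) + A(i1:i2, j1:j2)
   end subroutine sub1

   subroutine sub2(A, B, i1, i2, j1, j2)     ! stands in for the second doWork pass
      real(8), intent(in)    :: A(:,:)
      real(8), intent(inout) :: B(:,:)
      integer, intent(in)    :: i1, i2, j1, j2
      B(i1:i2, j1:j2) = 2.0d0 * B(i1:i2, j1:j2)
   end subroutine sub2
end program tiled_sketch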

Jim Dempsey
