Figuring out Statistical Sampling Call Graph


I'm evaluating PTU. I am trying to use the SSCG to figure out my application's performance.

The application I'm trying to profile is a WMB extension. WMB is an IBM system which calls an extension I have built, so the application main is not mine. The main actually loads my shared library. My shared library uses additional shared libraries which I have developed. In addition, my code makes calls to the IBM modules already running in the process.
Running some initial tests without PTU has shown me that the overhead added by my libraries is about 50%, so I would expect to see about 33% of the samples go through my code.

Well, I am having a hard time understanding what is going on.

I looked at the caller/callee window and expected the sampling (TOTAL) of a function (in the following example, ot::ManagedObjectRelease()) to be equal to the sum of the sampling (TOTAL) of all its callees plus the function's self sampling.

I would also expect the sampling (TOTAL) of a function to be the sum of its callers' sampling (TOTAL).
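To make my expectation concrete, here is a small Python sketch of the accounting model I have in mind (my own toy illustration of aggregating sampled call stacks, not PTU code; the samples and most of the names are made up):

    from collections import Counter

    # Each sample is a call stack from the thread root down to the
    # instruction pointer (leaf frame) at the moment of the sample.
    samples = [
        ["thread_main_function", "ot::ManagedObjectRelease", "callee_A"],
        ["thread_main_function", "ot::ManagedObjectRelease", "callee_B"],
        ["thread_main_function", "ot::ManagedObjectRelease"],   # sampled inside the function itself
        ["thread_main_function", "other_work"],
    ]

    self_cnt, total_cnt, edge_cnt = Counter(), Counter(), Counter()
    for stack in samples:
        self_cnt[stack[-1]] += 1                 # SELF: only the leaf frame
        for fn in set(stack):                    # TOTAL: every function on the stack, once per sample
            total_cnt[fn] += 1
        for caller, callee in zip(stack, stack[1:]):
            edge_cnt[(caller, callee)] += 1      # caller -> callee contribution

    f = "ot::ManagedObjectRelease"
    callee_sum = sum(n for (caller, _), n in edge_cnt.items() if caller == f)
    caller_sum = sum(n for (_, callee), n in edge_cnt.items() if callee == f)
    print(total_cnt[f], self_cnt[f] + callee_sum, caller_sum)   # 3, 3, 3

With complete stacks both relations hold: TOTAL equals SELF plus the callee contributions, and the caller contributions also sum to TOTAL.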

However, this is not always true, as depicted in the attached 1.jpg.

Furthermore, I'm trying to do a top-down analysis using caller/callee by looking at the possible thread activities. If I know that my thread is the only one using OpenToThreadFunc(), can I conclude from the image below that my thread consumes about 10% of the total process time?


Just to clarify, I have my own thread running in the process, and in addition the process calls my functions from its other threads.

Is it possible to filter several libraries together (in addition to a single library and All)?

I have attached the results of the SSCG sampling in exp17.rar. My code is in the libraries libcf-xxxxx.so and libopentop-wmd.so. Can you please guide me on how I can verify that about 33% of the time is contributed by my code, assuming that the paths going through my functions are the ones contributed by my code?

What would be the most expensive calls?

Is there a good manual that would guide me through the analysis process in such a case?

Thanks

Oron.


Hello Oron,
thanks for your interest in PTU.

Your expectations about TOTAL are correct.
User_guide.pdf (in the PTU root dir) also has information on the Caller/Callee view (in case you have not looked there).

However, as you noticed, there might be discrepancies in the sample accounting. They are caused by the fact that in some cases the collection engine was not able to unwind the stack and identify the correct caller.

Unfortunately, it is not possible to filter several libraries at once in PTU. PTU is a research product and some functionality is missing.

Statistical call graph is helpful for identifying the hot call paths.

In terms of time accounting and hotspots, Event Based Sampling (EBS) is a powerful method, and it is also offered by PTU. There you can see how much active time was spent by each thread. Note that for EBS you will need to install the sampling kernel driver on the system.

The result that you attached is not a converted one. Please attach the converted one so I can look at it.

Here are materials that could be helpful for performance analysis with EBS:

For Intel Core 2 Duo

For Intel Core i7

You may want to check the Intel Parallel Amplifier product (although for now it is Windows-only). There, the collection technology and UI features (e.g. filtering) are significantly improved.
http://software.intel.com/en-us/intel-parallel-amplifier/

thanks,
Julia



exp17_cnv.rar contains the converted experiment.

I had already guessed that the caller/callee discrepancy has to do with failing to correctly interpret the stack.

My understanding is that

Function Sample(TOTAL) = Function Sample(SELF) + the sum of all callee Sample(TOTAL)
Function Sample(TOTAL) = the sum of all callers Sample(TOTAL)

I can understand why:
Function Sample(TOTAL) < Function Sample(SELF) + the sum of all callee Sample(TOTAL)

This would probably happen when some children of the callee have a broken interpretation of the call stack that does not lead all the way to the root (thread_main_function()); in such a case there would be a contribution to the callees but not to the edge on the graph that connects the function to the callee.
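Continuing the same toy model as above (again my own illustration with made-up names, not PTU internals), a single truncated stack is enough to create that kind of gap, assuming the callee pane aggregates the callees' TOTAL samples:

    from collections import Counter

    good   = [["thread_main_function", "ot::ManagedObjectRelease", "callee_A"]]
    broken = [["callee_A", "some_leaf"]]     # unwinding stopped below the real caller

    total = Counter()
    for stack in good + broken:
        for fn in set(stack):
            total[fn] += 1

    f = "ot::ManagedObjectRelease"
    edge_f_to_A = sum(1 for s in good + broken
                        for caller, callee in zip(s, s[1:])
                        if caller == f and callee == "callee_A")

    print(total["callee_A"], edge_f_to_A)    # 2 vs 1: the callee gained a sample that
                                             # is not visible on the edge from f

So the callee side can grow while the function's own TOTAL and the connecting edge do not, which is the direction of the inequality above.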

However, I have a hard time understanding how this can happen with the callers.

Anyhow the discrepancy is huge in the above case.

The sum of callers is about 4500 while the function Sample(TOTAL) is only about 1500, so in this case how reliable are the results and the assumptions I can make from them?

thanks,

Oron.

Such weird discrepancies are due to recursive function calls.
E.g. I looked into your data using the text viewer:

<..>\bin\vtssview.exe . -q -n 10000 -w function,65 >1.res   // to get the call graph data

And I saw the kind of discrepancy you referred to for ImbDataFlowTerminal::propagate(ImbMessageAssembly const&).

Then I looked at the raw stack data:

<..>\bin\vtssview.exe . -t -n 10000 -w function,65 >2.res

And I saw that in many cases the function ImbDataFlowTerminal::propagate(ImbMessageAssembly const&) has itself on the stack. Here is just one such stack:

Unfortunately, the GUI viewer has problems aggregating/representing cases with recursion.
I suggest you check with the text viewer as above whether you see similar discrepancies.
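To illustrate the effect (this is only my own sketch of how recursion can skew the caller side, not how PTU stores the data; the stack below is made up), consider a single sample in which the function appears on the stack several times:

    from collections import Counter

    f = "ImbDataFlowTerminal::propagate"
    # One made-up sample where f ends up (indirectly) calling itself again.
    stack = ["thread_root", f, "helper", f, "helper", f, "leaf"]

    total_f = 1 if f in stack else 0             # the sample counts once toward f's TOTAL
    caller_edges = Counter()
    for caller, callee in zip(stack, stack[1:]):
        if callee == f:
            caller_edges[caller] += 1            # but each occurrence of f adds a caller edge

    print(total_f, sum(caller_edges.values()))   # 1 vs 3 for this one sample

If many samples look like this, the sum over callers can end up several times larger than the function's TOTAL, which would be consistent with the ~4500 vs ~1500 numbers you observed.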

Also, to improve stack unwinding, compile your code (where possible) without -fomit-frame-pointer.
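For example, with gcc/g++ that could look like the line below (the exact flags of your build are an assumption on my side; the important part is keeping frame pointers, e.g. -fno-omit-frame-pointer, since higher optimization levels may omit them):

    g++ -g -O2 -fno-omit-frame-pointer -fPIC -shared -o libopentop-wmd.so <your sources>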

Julia
