No call stack information for some threads

I have two systems with VTune installed, and I am trying to collect hardware events and then generate a report grouped by thread. I use the following two commands:

amplxe-cl -collect general-exploration -knob enable-stack-collection=true -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

amplxe-cl -report hw-events -group-by thread -r vresult >result.tx

The two systems are

System A - Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz

System B -  Intel(R) Xeon(R) CPU E5530  @ 2.40GHz

On System A I get information for all threads: for example, if 8 threads were created, all 8 appear in the report. On System B, however, the generated report lists fewer threads than were created.

When I do the same thing through the GUI on System B, I see that some threads have no call stack information, and the hardware events for those threads are therefore empty.

I have dbg library packages installed as well. Appreciate any help. Thanks

That is true for your "System B -  Intel(R) Xeon(R) CPU E5530  @ 2.40GHz". This is a Nehalem-EP processor.

I can reproduce this on my side.

# amplxe-cl -collect general-exploration -knob enable-stack-collection=true -app-working-dir /home/peter/problem_report -- /home/peter/problem_report/primes.ia32
amplxe: Error: Cannot enable advanced capabilities for Hardware Event-based Sampling: problem with the driver (vtss/vtsspp). Check that the driver is running and the driver group is in the current user group list. See "Building and Managing the Sampling Driver" help topic for further details.

# amplxe-cl -collect general-exploration -knob enable-stack-collection=false -app-working-dir /home/peter/problem_report -- /home/peter/problem_report/primes.ia32

With stack collection disabled, the collection works properly.

Event-based sampling with stack collection works only on Sandy Bridge processors or later. You may want to try another supported processor.

 

Thank you for your fast reply. I checked that the driver is running and that I am a member of the group the driver runs under. With enable-stack-collection set to false I do see thread data for all threads; however, as expected, I see very few HW counters. So I suppose I will have to try another system to get these counters. Thanks again!

> In system A I get all thread information, for example if 8 threads were created I get all thread information; however, on the other I do not get the information for all the threads. The report generated has fewer threads than there should be.

I don't understand: you said you can get all thread information, so why does the generated report have fewer threads than there should be?

Can you please describe this in more detail and post the VTune result?

I have attached the output of

amplxe-cl -collect custom-hw-0  -knob enable-stack-collection=true -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

//here my app runs with 8 threads

amplxe-cl -report hw-events -group-by thread -r vresult >vtuneresults.txt

I have attached vtuneresults.txt; it contains information on only 7 threads, not 8.

I tried unloading the drivers (./rmmod-sep3 -s) and reloading them with permissions granted to all users (./insmod-sep3 -pu -p 666), as shown in the Build and manage sampling driver thread, and the thread information now goes missing far less often.

Attachment: vtuneresults.txt (23.76 KB)

It would be helpful if you could zip and attach the result directory, that is, the directory you pass to -r (vresult).

Thanks & Regards, Dmitry

Please find the result attached. 

Attachment: vresult_0.tar (2.92 MB)

Thanks for your result data.

The reason is simple: you can find all eight threads in the timeline panel, but one of the eight consumed very little CPU time (responsible for context switching?), so the functions of that thread did not appear in the hotspots report.

All the threads here execute the same code, so when you say it consumed less CPU time, do you mean that the time it consumed was not accounted for by VTune? Also, I do not see this issue on the other system.

The reason could be that your app ran only briefly (0.3 s?): the last thread was still waiting for a task assignment when the program ended. Please try adding more workload.

Hello,

VTune reports that one of the working threads had no samples because it was preempted and stayed inactive during the whole run.

To explore further, could you please run the following collection:

 amplxe-cl -collect general-exploration -analyze-system -data-limit=0 -d='unlimited' -target-duration-type=long -r vresult -app-working-dir . --search-dir sym:p=. -- ./myapp myarg

and provide the result dir.

Also, it seems that you are limiting the number of working threads to the number of physical cores. Do you use any affinity to pin threads exclusively to physical cores?

Thanks & Regards, Dmitry

 

Please find the compressed vresult directory attached. I also measured the ticks each thread ran using clock() in C, and all the threads executed for almost the same number of ticks, yet I do not get thread data for one of them. In this case the thread with ID 11034 is missing; this thread was created first (using pthreads) and finished last.

Attachment: vresult_0.tar (1.83 MB)
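
A note on the clock() measurement above: in C, clock() reports processor time used by the whole program, not by the calling thread, so near-identical tick counts across threads do not by themselves show that every thread got the same share of CPU. Below is a minimal sketch of the difference, written in Python rather than the application's C++ just to keep it short. It uses time.thread_time() (per-thread CPU time) and time.process_time() (process-wide CPU time); the busy and idle helper names are made up for illustration. In C on Linux, the per-thread counterpart would be clock_gettime() with CLOCK_THREAD_CPUTIME_ID.

```python
import threading
import time


def busy(results, name, seconds):
    """Spin on the CPU for about `seconds` of wall time, then record this thread's CPU time."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        pass
    results[name] = time.thread_time()


def idle(results, name, seconds):
    """Sleep instead of computing, then record this thread's CPU time."""
    time.sleep(seconds)
    results[name] = time.thread_time()


results = {}
workers = [
    threading.Thread(target=busy, args=(results, "busy", 0.2)),
    threading.Thread(target=idle, args=(results, "idle", 0.2)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

# Process-wide CPU time covers every thread, so it cannot tell the two apart.
process_cpu = time.process_time()
print("per-thread CPU seconds:", results)
print("process CPU seconds:", process_cpu)
```

The per-thread values differ sharply between the two workers, while the process-wide value is a single number shared by both.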

Also, I am limiting the number of threads to the number of cores, but I am not pinning them to particular cores. Please also find attached the source code of the app I am running. It is an N-Queens solver; however, instead of merging data from the different threads, each thread generates all the data itself, so that in theory every thread runs for the same amount of time.

Attachment: threadexamine.cpp (4.04 KB)

This can be reproduced on Sandy Bridge as well. I run the collection in a loop and count the threads in each report; if the count is lower than expected, I stop.

Thanks for your example code.

I recompiled your code and cannot reproduce the issue. (I have attached my binary so you can try again; could it be due to the gcc version?)

gcc-4.4.6-3.el6.x86_64

I used VTune Amplifier XE 2013 Update 17. See my attached VTune result.

There were 8 working threads + 1 main thread. Can other people reproduce this problem? 

Attachment: r007ge.zip (676.37 KB)

binary file

Attachment: threadexamine.zip (19.53 KB)

Hi Peter,

I used the binary you provided and ran it on the Intel(R) Xeon(R) CPU E3-1230 V2 @ 3.30GHz machine, which is less prone to this error.

I still see the error. Please find the result attached. Since this machine has 4 cores, I spawn 4 threads; however, I get information on only 3 threads, and the main thread's information shows up only rarely.

Did you run it in a loop? I ask because the issue sometimes does not reproduce when the number of runs is small.

Attachment: vresult_0.tar (2.24 MB)

It seems from the command "./NDAID 4" in the result that you renamed the binary. There are 8 logical cores, so why did you run it with 4 threads?

I also tried on an Ivy Bridge processor with 8 logical CPU cores at 3.5 GHz, and the result was as expected. The difference is:

My OS: 3.11.0-19-generic

Your OS: 3.13.0-24-generic

Could this be due to OS task scheduling? Can you try an older operating system?

I looked at the attached result and also experimented a bit on my IVT box, compiling the source provided.

First, the run seems to be quite short, and you use the target-duration-type=long knob, which sets a fairly coarse-grained sampling interval.

So I would recommend setting it to "veryshort".

When I did this I saw all 9 threads: 8 working threads and 1 main thread, which consumes only a small portion of CPU because it mostly waits in thread_join.

Thanks & Regards, Dmitry
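
To make the sampling-interval point concrete, here is a back-of-the-envelope sketch in Python. The interval values are hypothetical and chosen only for illustration; they are not the intervals VTune actually uses for "long" or "veryshort". The function names are also made up. The 0.3 s figure is the run time estimated earlier in this thread.

```python
def expected_samples(cpu_seconds: float, sampling_interval_s: float) -> float:
    """Rough number of samples taken while a thread is on a CPU."""
    return cpu_seconds / sampling_interval_s


def may_be_missed(cpu_seconds: float, sampling_interval_s: float) -> bool:
    """A thread that expects fewer than one sample can easily end up with none."""
    return expected_samples(cpu_seconds, sampling_interval_s) < 1.0


# Hypothetical intervals, for illustration only (not VTune's real values).
COARSE_INTERVAL_S = 0.1
FINE_INTERVAL_S = 0.001

for cpu_seconds in (0.3, 0.05):
    for label, interval in (("coarse", COARSE_INTERVAL_S), ("fine", FINE_INTERVAL_S)):
        print(cpu_seconds, label,
              round(expected_samples(cpu_seconds, interval), 1),
              may_be_missed(cpu_seconds, interval))
```

Under these assumed numbers, a thread that only gets a small slice of a short run can fall below one expected sample with the coarse interval, while the fine interval still gives it plenty.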

Hi Dmitry,

I tried with target-duration-type set to veryshort on the Nehalem-EP processor and could still reproduce the issue. I also found that after a system restart it can sometimes take more than 1000 runs to reproduce it. Attached is the result of the run in which I hit the issue.

Peter,

I am waiting for a system; I will run on an older kernel as soon as I get it.

Regards

Girish

 

Attachment: vresult_0.tar (3.15 MB)

Hi Peter,

I tried it on kernel 3.11.0-15-generic and I still see the problem.

 

Girish

I finally found the root cause after reviewing your code carefully and using your command to reproduce the problem. Pay attention to the "atoi()" call.

amplxe-cl -collect general-exploration -knob enable-stack-collection=true -app-working-dir . --search-dir sym:p=. -- ./NDAID "8"

should be changed to:

amplxe-cl -collect general-exploration -knob enable-stack-collection=true -app-working-dir . --search-dir sym:p=. -- ./NDAID 8

The problem has gone away on my side.
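
A side note on the quoting: in a POSIX shell, the quotes are removed before the program starts, so both command lines hand the same argument string to the application and to atoi(). A minimal sketch using Python's shlex module, which follows POSIX shell splitting rules; int() stands in for atoi() here, and the binary name is the one from the command above.

```python
import shlex

# Split both command tails the way a POSIX shell would.
quoted = shlex.split('./NDAID "8"')
unquoted = shlex.split('./NDAID 8')

print(quoted)
print(unquoted)

# int() plays the role of atoi() on the first program argument.
print(int(quoted[1]) == int(unquoted[1]))
```

This only covers POSIX-style shells; other command interpreters may treat quotes differently.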

I am not satisfied with the result, though, since the last thread has a tiny workload. If I work on a Red Hat Enterprise server, the workload on all 8 threads is balanced.

Hi Peter,

I tried the change; in fact I had not used "8" in quotes, which would have been erroneous. I have attached a tar file containing a small Python script that reproduces the bug. The command is as follows:

           ./CheckBug.py  application  number_of_threads  number_of_times_to_run

I had previously hardcoded these values, which also reproduced the bug. Can you run this script with number_of_times_to_run of at least 500 on your setup? I have also included the nqueens binary and source. One more thing to try would be an OS other than the Debian-based ones, as you mentioned.

Attachment: BugTest.tar (50 KB)
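
For readers without the attachment, the idea of such a loop can be sketched as follows. This is not the contents of CheckBug.py; it is a minimal sketch under stated assumptions. It assumes each thread row in the text report contains a "(TID: <number>)" marker, and the names count_threads, run_until_missing and run_once are hypothetical. In real use, run_once would wrap the two amplxe-cl commands from the first post and return the report text; here a stub supplies canned reports so the sketch runs on its own.

```python
import re
from typing import Callable, Optional

# Assumed report format: each thread row carries a marker such as "(TID: 11034)".
TID_PATTERN = re.compile(r"\(TID:\s*(\d+)\)")


def count_threads(report_text: str) -> int:
    """Count the distinct thread IDs mentioned in one report."""
    return len(set(TID_PATTERN.findall(report_text)))


def run_until_missing(run_once: Callable[[], str],
                      expected_threads: int,
                      max_runs: int) -> Optional[int]:
    """Return the 1-based run number where threads went missing, or None if never."""
    for run_number in range(1, max_runs + 1):
        if count_threads(run_once()) < expected_threads:
            return run_number
    return None


# Stub standing in for "collect, then report"; the report text is invented.
fake_reports = iter([
    "worker (TID: 101)\nworker (TID: 102)\n",
    "worker (TID: 201)\n",
])
failing_run = run_until_missing(lambda: next(fake_reports),
                                expected_threads=2, max_runs=2)
print("first run with missing threads:", failing_run)
```

Whether the expected count should include the main thread depends on how the report lists it, so that number is left to the caller.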

You used a custom analysis; I will change it to general-exploration and try later. I will be away for 6 hours.

OK. I changed your custom collection to general-exploration in your script and changed nothing else. I only ran it once, though, on Ubuntu and on RHEL:

# ./CheckBug.py threadStats 8 1

The result from RHEL was excellent: you can see 8 threads in both the hotspots report and the timeline report, and the workloads of the 8 threads are balanced. On Ubuntu, yes, I can still see the problem: only 7 threads in the hotspots report, but 8 threads in the timeline report.

As I explained before, this may be caused by the operating system's task scheduling. Ubuntu created 8 threads (see the timeline report), but some threads started early and kept picking up tasks that might otherwise have waited for other threads. I don't know the details, but you can see in the Ubuntu timeline report that 8 threads were created; the last thread had no task to run, so only 7 threads showed up in the hotspots report.

I think this is not a VTune bug, because it is OS behavior.

I attached two results.

 

Attachments:
vresult-ubuntu.zip (535.88 KB)
vresult-rhel.zip (791.99 KB)
