Intel® VTune™ Amplifier XE

store forwarding impact

I ran VTune to collect the store-forwarding performance impact for the application. The source view showed the following source and assembly code with a high performance impact:

Address   Line   Source                                         MOB Loads

0x39219   100    if ( ((long)*s & 3L) == 0) {                   1183
0x39219   100    CollectGarb+158: mov esi, DWORD PTR [ecx]      152
0x39219   100    mov edx, esi                                   736
0x39219   100    and edx, 0x3h                                  122
0x39219   100    jnz CollectGarb+3a7                            173
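
For readers unfamiliar with the event, here is a minimal C sketch of one access pattern that commonly defeats store forwarding: several narrow stores followed by a single wider load from the same bytes. It is only an illustration. Whether this pattern is what precedes the load in CollectGarb is not known from the listing, and the function names `low_two_bits_clear` and `narrow_stores_then_wide_load` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same test as source line 100: are the low two bits of the value zero? */
int low_two_bits_clear(long value)
{
    return (value & 3L) == 0;
}

/* Hypothetical illustration: four 1-byte stores followed by one 4-byte
 * load of the same bytes. The wide load cannot be satisfied from any
 * single pending store, which is the kind of access that can be blocked.
 * An optimizing compiler may merge the stores, so inspect the generated
 * assembly before drawing conclusions from a benchmark. */
uint32_t narrow_stores_then_wide_load(uint8_t *buf, uint32_t v)
{
    const uint8_t *src = (const uint8_t *)&v;
    uint32_t out;

    assert(buf != NULL);
    for (size_t i = 0; i < sizeof v; i++)
        buf[i] = src[i];            /* narrow (1-byte) stores */
    memcpy(&out, buf, sizeof out);  /* wide (4-byte) load */
    return out;
}
```

The value returned is always the value that was stored; only the timing differs, which is why the cost shows up in a profiler rather than as a bug.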

Half cache size

I read somewhere on your site that when the distance between siblings in a structure is less than half the cache size, a cache hit is more likely. Can you please point me to the link? I saw it in one of the Flash slides. A reference to a paper would also be appreciated.


I can't use rmtsvr. I think it may be the result of a different compiler version

We use VTune to analyze our target (PXA255 + Linux 2.4.19). In the past, we used arm-linux-gcc 2.95.3 to build our kernel and modules. The rmtsvr and vtlxsc you gave me were also built with 2.95.3, and they worked well. But recently we changed our arm-linux-gcc to 3.3.2, and I rebuilt vtune_drv.o. When I run rmtsvr on my target, it shows:
Copyright (C) 2001-2003 Intel Corporation. All rights reserved.
VTune Performance Analyzer Update for Intel XScale Techonlogy, PXA Linux*
Server is Starting

measuring remote misses

I am running VTune for Linux on a system that has two hyperthreaded Pentium Xeons, which means there are four virtual processors. I want to run a parallel application and measure the coherence traffic. Specifically, I want to measure the following:

a) The percentage of L2 read misses that are satisfied by reading data from another processor's cache, rather than reading it from main memory. Basically the number of remote misses and the number of main memory misses.

Any help is appreciated.
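
To make the requested metric concrete, a small C sketch of the ratio follows. It assumes two raw counts are already available: L2 read misses served from another processor's cache and L2 read misses served from main memory. Which hardware events, if any, provide those counts on this system is exactly the open question and is not answered here; `remote_miss_percent` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <limits.h>

/* Percentage of L2 read misses satisfied from another processor's cache.
 * remote_hits: misses served by a remote cache (hypothetical input)
 * memory_hits: misses served by main memory   (hypothetical input) */
double remote_miss_percent(unsigned long long remote_hits,
                           unsigned long long memory_hits)
{
    unsigned long long total;

    /* Guard against overflow when summing the two counters. */
    assert(remote_hits <= ULLONG_MAX - memory_hits);
    total = remote_hits + memory_hits;
    if (total == 0)
        return 0.0;  /* no misses recorded, so report 0% */
    return 100.0 * (double)remote_hits / (double)total;
}
```

For example, 25 remote-cache hits and 75 main-memory hits give 25%.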

Are the clock ticks of a sub-function always counted as part of its parent function?


Suppose function A calls its child functions A1, A2 and A3 in turn, and VTune reports the clock ticks for A, A1, A2 and A3. Are the clock ticks of those sub-functions always counted as part of the parent function?
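
The question comes down to the difference between "self" ticks (spent in a function's own code) and "total" ticks (self plus everything it calls). The C sketch below shows how the two relate for the A, A1, A2, A3 example. The `Func` type, the field names, and the tick values in the usage note are all hypothetical, and the sketch says nothing about which of the two a particular VTune view reports.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call-tree node: ticks spent in the function's own code,
 * plus pointers to the functions it calls. */
typedef struct Func {
    const char *name;
    unsigned long self_ticks;
    const struct Func *const *children;
    size_t n_children;
} Func;

/* Total (inclusive) ticks: the function's own ticks plus the total
 * ticks of every child, computed recursively. */
unsigned long total_ticks(const Func *f)
{
    unsigned long sum;

    assert(f != NULL);
    sum = f->self_ticks;
    for (size_t i = 0; i < f->n_children; i++)
        sum += total_ticks(f->children[i]);
    return sum;
}
```

If A has 5 self ticks and A1, A2, A3 have 10, 20 and 30, then the total for A is 65 while its self count stays at 5. A report of self ticks would show A as cheap; a report of total ticks would show it as the most expensive of the four.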



/opt/intel/vtune/shared/bin/PrintCpuInfo32 error

Hi, I have installed VTune for Linux 3.0 on an Intel Itanium 2 processor (64-bit) SMP system. The OS is Red Hat Enterprise Linux 3.0 and the kernel is 2.4.21-9.EL. I installed VTune and the VTune driver (VDK) without any errors, but when I run this command: vtl activity -d 30 -c sampling -c callgraph -master sampling -app /opt/intel/vtune/doc/samples/vtundemo/vtundemo -moi /opt/intel/vtune/doc/samples/vtundemo/vtundemo

VTune hangs on the 2.6.13-1 kernel when sampling the L2 cache miss event


I'm using VTune Performance Analyzer 7.2 for Windows with VTune RDC 3.0 for Linux. The Linux distribution I use is CentOS 4.1, which ships with the 2.6.9-11 kernel by default.

I downloaded the 2.6.13-1 kernel and found that VTune has some problems with this kernel version. Whenever I add the L2 cache miss event to the sampling activity, both VTune Performance Analyzer and RDC hang after sampling starts. If I sample only clock ticks and retired instructions, there is no problem.

However, this problem doesn't exist if the default kernel 2.6.9-11 is used.

Any ideas?

JITing & profiling only selected methods in a .NET assembly

A couple of questions...

  1. Do the timing metrics in a Call Graph result include the JIT times? [I am guessing that the instrumented module is pre-JITted, based on the module names listed, and therefore the metrics do not include them. A question came up as to whether we need to have a dummy test run to get the JIT timings and then run the actual tests.]
  2. Is there a way to enable profiling for a selected set of .NET assemblies rather than all of them?

