Intel VTune Amplifier and OS development

Can Intel VTune Amplifier be used to optimise an operating system kernel? I'm adding SSE, SSE2 and SSE3 support to my OS kernel. It would be nice if VTune worked with it so I can optimise SIMD operations.

Hi 5600!

In general, yes!  VTune Amplifier XE will take samples within the OS.  If you have symbols in a supported format, it should be able to give you performance metrics for your functions, etc.  You will need to run an app or some other workload to cause OS code to be executed, obviously.  But, for example, you can use the VTune Amplifier XE to optimize the Linux* kernel (and many people do ;).

Without more details, it is difficult to say more. :\

MOV EDI, [EBP + 0Ch]
MOV ESI, [EBP + 08h]


LEA EDI,  [EDI+16] ;Add 16 to Destination Address
LEA ECX, [ECX-16] ;Sub 16 from ECX
JNZ .xLoop

I'm trying to create a memory clear function using the above code. But it seems I might be doing something wrong, because rep stosb is much faster when I use it to clear memory.

Hi 5600:

That, my friend, is a totally different question!  Let's see if anyone has any suggestions.

Have you profiled this code using the VTune Amplifier XE?  Did you take a look at the bandwidth (assuming it is a processor that has bandwidth analysis support)?

I have an Intel Core 2 Duo E6600 2.4GHz. I tried it with Intel VTune Amplifier XE 2011, but it doesn't like Windows 8, so I get errors from VTune.

If you have the latest release, it supports Windows* 8.  Please see Release Notes and documentation for details.  If you are running in Metro mode, you will need to switch to desktop mode to run VTune Amplifier XE.

This looks like it uses non-temporal streaming stores. How do you measure the execution time of that code? The rep stosb writes could be cached because of the predictable behavior of the loop.

In general you should get more store bandwidth by using movntdq when the source data is cached and accessed consecutively. Did you try comparing the results against rep stosb by looking at front-end and back-end stalls in VTune?

I've got the latest VTune, but it supports only .exe files. My OS kernel is *.bin. For time measurement of the code I use the RTC time and check how many seconds it takes for rep stosb and the SIMD memory operations to complete 10000 memory clears. I usually get 1 second for rep stosb and 2 seconds for SIMD, using the above code to clear 64 bytes of memory. The test was done only with a 16-byte aligned memory address.
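Counting whole RTC seconds leaves a lot of room for error: a "1 second versus 2 seconds" reading covers anything from nearly equal to roughly three times slower. A finer-grained measurement could look like the sketch below. It is only an illustration, not the poster's code: `read_ticks`, `time_clears` and `clear_memset` are made-up names, `memset` merely stands in for the rep stosb routine, and reading the time-stamp counter through `__rdtsc()` assumes GCC or Clang on x86 (other targets fall back to the standard `clock()`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
/* x86: read the time-stamp counter (RDTSC instruction). */
static uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for other targets: processor time in clock() units. */
static uint64_t read_ticks(void) { return (uint64_t)clock(); }
#endif

/* Any routine that clears len bytes starting at dst. */
typedef void (*clear_fn)(void *dst, size_t len);

/* Baseline routine; a hypothetical stand-in for the rep stosb version. */
void clear_memset(void *dst, size_t len)
{
    memset(dst, 0, len);
}

/* Call fn `runs` times on the same buffer and return the elapsed ticks. */
uint64_t time_clears(clear_fn fn, void *buf, size_t len, unsigned runs)
{
    assert(fn != NULL && runs > 0);

    uint64_t start = read_ticks();
    for (unsigned i = 0; i < runs; ++i)
        fn(buf, len);
    uint64_t end = read_ticks();

    return end - start;
}
```

Calling `time_clears(clear_memset, buf, 64, 10000)` and then the same with the SIMD routine gives two tick counts that can be compared directly. Repeating each measurement a few times and keeping the smallest value helps filter out interrupts and other noise.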

I temporarily solved my problem by writing a console app with SIMD instructions in VC++ to test in VTune. Is movdqa faster than movntdq? It seems movdqa is 9x faster than movntdq under VTune.

Movntdq clocked at 0.065 seconds for moving 4KB of data

Movdqa clocked at 0.007 seconds for moving 4KB of data
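For anyone who wants to reproduce this comparison in a user-mode test app, here is a minimal sketch of the two store variants using SSE2 intrinsics. The function names `clear_cached` and `clear_streaming` are invented for illustration and this is not the code that was measured above. `_mm_store_si128` normally compiles to movdqa and `_mm_stream_si128` to movntdq. Both routines assume a 16-byte aligned destination and a length that is a multiple of 16, and they fall back to `memset` when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Clear with regular 16-byte stores (movdqa): the data goes through the cache. */
void clear_cached(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if HAVE_SSE2
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_store_si128(p + i, zero);
#else
    memset(dst, 0, len);
#endif
}

/* Clear with non-temporal 16-byte stores (movntdq): the stores bypass the
   cache, so a fence is issued before the data is treated as globally visible. */
void clear_streaming(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if HAVE_SSE2
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(p + i, zero);
    _mm_sfence();
#else
    memset(dst, 0, len);
#endif
}
```

With a buffer as small as 4KB the cached version has an advantage, because the whole buffer fits in L1 and the stores complete there, while the streaming version has to push every line out to memory. Streaming stores are more likely to pay off on buffers much larger than the cache that will not be read again soon.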


According to Agner Fog's instruction latency tables, movntdq seems to have a large latency of ~400 cycles, compared to one cycle for the stosb instruction. Throughput is the same for both instructions.

Yes, movdqa is faster: it takes three clocks, and on Haswell its throughput is two instructions per cycle (Agner Fog's tables). But that comes at the cost of cache pollution.

@5600, regarding the .bin issue, two comments:

1. There is a JIT API that would allow you to inform the VTune Amplifier XE where your code is loaded and what functions and statements exist in your code.  You can find all the information in the product help files (see "JIT Profiling APIs").

2. Whatever loads your .bin and begins executing the code is then what you would configure VTune Amplifier XE to launch and profile.

Just FYI.
