Software Tuning, Performance Optimization & Platform Monitoring

How many pipeline stages detect branch mispredictions?

Are there three stages in the pipeline which can detect branch mispredictions?

  1. Branch Target Buffer (raising BPU CLEAR signal)
  2. Branch Address Calculator (raising BA CLEAR signal)
  3. Branch Execution Unit (flushing entire pipeline)

I am not sure whether the BTB can actually detect branch mispredictions. Could somebody please confirm this? What is the difference between BPU CLEAR and BA CLEAR? Are they raised from different parts of the pipeline?

Processor microcode update: which processors are supported and which version does the update contain?

Hi! I asked this question on the community forums and was advised to ask here.

Intel provides processor microcode updates. Is it possible to find out which CPUs an update supports and which microcode version it contains? For example, which CPUs does this update (Intel® Download Center) support, and which microcode version should appear after the upgrade?
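One way to approach this, sketched under assumptions: the Intel SDM describes a 48-byte header at the start of each microcode update, carrying the update revision, a BCD-encoded date, the processor signature (the CPUID value the update targets) and the processor flags (a platform ID mask). The Python sketch below decodes that header from raw bytes. It assumes the update is available in binary form rather than as a text file of hex words, and it reads only the first header; an update may also carry an extended signature table listing further CPUs, which this sketch does not read. The names `parse_header` and `MicrocodeHeader` are illustrative, and the sample values in the demo are synthetic.

```python
import struct
from typing import NamedTuple

# Nine little-endian 32-bit fields followed by 12 reserved bytes (48 bytes total).
# Assumption: layout as described in the Intel SDM microcode update chapter.
HEADER = struct.Struct("<9I12x")


class MicrocodeHeader(NamedTuple):
    header_version: int
    update_revision: int      # microcode version reported after loading
    date: str                 # ISO form, decoded from BCD 0xMMDDYYYY
    processor_signature: int  # CPUID signature the update applies to
    checksum: int
    loader_revision: int
    processor_flags: int      # platform ID bit mask
    data_size: int
    total_size: int


def parse_header(blob: bytes) -> MicrocodeHeader:
    """Decode the first 48 bytes of a binary microcode update."""
    hv, rev, date, sig, csum, ldr, flags, dsize, tsize = HEADER.unpack_from(blob, 0)
    # The date is stored as BCD digits: month, day, then four-digit year.
    iso_date = f"{date & 0xFFFF:04x}-{date >> 24:02x}-{(date >> 16) & 0xFF:02x}"
    return MicrocodeHeader(hv, rev, iso_date, sig, csum, ldr, flags, dsize, tsize)


if __name__ == "__main__":
    # Synthetic header for illustration only; not taken from a real update.
    sample = struct.pack("<9I12x", 1, 0x1C, 0x06182015, 0x000206C2, 0, 1, 0x03, 0, 0)
    hdr = parse_header(sample)
    print(f"revision=0x{hdr.update_revision:x} date={hdr.date} "
          f"signature=0x{hdr.processor_signature:08x} flags=0x{hdr.processor_flags:02x}")
```

Comparing the decoded signature with the CPUID signature of the installed processor would indicate whether the update applies to it.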


Performance seems not stable after using AES-NI for data encryption/decryption

Hi, everyone.

I need to encrypt data written from a virtual machine on XenServer. I added a pure-software AES CBC encryption method to the Xen virtual disk read/write operation and tested the write throughput by running the following command in the VM:

dd if=/dev/zero of=/mnt/test_file bs=512 count=1048576

The measured throughput was about 55 MB/s.

I then modified the encryption method to use Intel AES-NI for encryption/decryption and ran the same test several times. The results are as follows:

Test 1: 85.1 MB/s

How does the hardware prefetcher change load and store buffer behavior in the processor pipeline?

Hi, Community! I am experimenting with a dual-socket Xeon E5620 server. I am using perf with the events RESOURCE_STALLS.LOAD and RESOURCE_STALLS.STORE (SDM chapter 19.7, page 2699).

I first turned off hardware prefetching, following the instructions at this URL:

The command I used is: wrmsr -a 0x1a4 0xf. Then I ran perf as: perf stat -e ra202,ra208 ./fft-m26
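For reference, the value written to MSR 0x1a4 is a bit mask. The sketch below decodes it, assuming the bit layout Intel has published for prefetcher control on Nehalem-through-Broadwell cores: bit 0 is the L2 hardware prefetcher, bit 1 the L2 adjacent cache line prefetcher, bit 2 the DCU prefetcher and bit 3 the DCU IP prefetcher, where a set bit disables that prefetcher. Treat this layout as an assumption to verify for the exact processor model; `disabled_prefetchers` is an illustrative name.

```python
from typing import List

# Assumed meaning of the low four bits of MSR 0x1a4 (set bit = disabled).
PREFETCH_DISABLE_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU hardware prefetcher",
    3: "DCU IP prefetcher",
}


def disabled_prefetchers(msr_value: int) -> List[str]:
    """Return the prefetchers that a given MSR 0x1a4 value turns off."""
    return [
        name
        for bit, name in sorted(PREFETCH_DISABLE_BITS.items())
        if (msr_value >> bit) & 1
    ]


if __name__ == "__main__":
    # 0xf is the value used in the wrmsr command above.
    for name in disabled_prefetchers(0xF):
        print("disabled:", name)
```

Under this assumption, 0xf disables all four prefetchers, and a value such as 0x4 would disable only the DCU hardware prefetcher, which allows the effect of each prefetcher to be measured separately.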

Counting native events


I am trying to count some performance events for part of an application written in C.

So far, I have used PAPI to count events. It works fine for preset events. However, when I profile native events, all of them are translated into the same event code, 0x40000022 (the output of papi_avail is below). This makes no sense, yet no error occurs when I profile them. What could be wrong? How could I debug this?

How does VTune compute bandwidth?

Hi, I am analyzing canneal, the simulated annealing program from PARSEC. The program often accesses elem data randomly, so it performs poorly. I added a prefetch instruction for elem, and I am glad to see that the run time of the multithreaded parallel region dropped from 31 seconds to 15 seconds. That is a good result. I currently prefetch the data only one iteration in advance, and I hoped to gain more. But after adjusting the prefetch parameter, I cannot get a much better result.

Error reading llc_misses event in Xeon D-1540

Hello everyone,

I am working on a tool that gives access to the different hardware events through performance counters (PMC). The tool works well; I have tested it on several Intel processors: Sandy Bridge, Haswell and Haswell-EP. Now I am working with a Broadwell processor that has some new cache monitoring features I need to use.

Trying my tool on this processor, I found that the events LLC Reference (2EH, Umask 4FH) and LLC Misses (2EH, Umask 41H), described in Table 19.1 of 64-ia-32-architectures-software-developer-manual-325462.pdf, report the same number.
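For cross-checking against another tool, the two architectural events above can be expressed as Linux perf raw event codes. The sketch below assumes the usual core-PMU raw encoding, in which the event select occupies bits 0-7 and the unit mask bits 8-15 of the config value. `perf_raw_event` is an illustrative helper, not part of any tool.

```python
def perf_raw_event(event_select: int, umask: int) -> str:
    """Build a perf raw event string (rUUEE) from event select and unit mask."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in one byte")
    # Unit mask goes in the high byte, event select in the low byte.
    return f"r{(umask << 8) | event_select:04x}"


if __name__ == "__main__":
    print("LLC Reference:", perf_raw_event(0x2E, 0x4F))  # r4f2e
    print("LLC Misses:   ", perf_raw_event(0x2E, 0x41))  # r412e
```

The resulting codes could then be passed to perf, for example `perf stat -e r4f2e,r412e` followed by the workload, to see whether perf also reports identical counts for both events.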
