Software Tuning, Performance Optimization & Platform Monitoring

Measuring memory bandwidth through IntelPerformanceCounterMonitorV2.8

Hi all,

I have been trying to measure memory bandwidth per core. Fortunately, I saw the article ( So, I tested measuring memory bandwidth by using pcm-memory.x on my Linux system (Intel(R) Core(TM) i5-4430 CPU @ 3.00GHz). However, the program produces an error like the one below.

Beginner PCM build issues..

Hi there,

I followed the howto and articles here as best I can and here is where I am at:

I built my PCM.exe and installed the PCM Service. I used the winring drivers from the web, in system32 and my PCM folder. I had issues getting Studio to build a RELEASE build: the PCM Service and intelpcm.dll would only build in DEBUG regardless of the platform and release configuration requested. PCM.exe built OK in RELEASE.

2) Both the Performance monitor and the PCM service take a very long time to start.  Perfmon seems to be looking for a config file

Branch predictor weird behaviour - how does it work?


I am trying to optimize software that uses a lot of branches, so I want to understand how the branch predictor and the BTB work.
For that, I use the performance counters (I enabled them with a kernel module).
More precisely, I use two events for the counters on an i7 CPU:
- Branch Instruction Retired (event select: C4H, Umask: 00H)
- Branch Misses Retired (event select: C5H, Umask: 00H)
Also, I count these events only in ring 3 (the OS flag is set to 0 in each IA32_PERFEVTSELx MSR).
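
As a rough illustration of the counter setup described above, the sketch below packs the two events into IA32_PERFEVTSELx values. It assumes the architectural field layout documented in the Intel SDM (event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22). The helper name `perfevtsel` is hypothetical, and the code only computes the value; it does not write any MSR.

```python
# Bit positions of the architectural IA32_PERFEVTSELx fields (Intel SDM).
USR_BIT = 16  # count while the core runs in ring 3
OS_BIT = 17   # count while the core runs in ring 0
EN_BIT = 22   # enable the counter


def perfevtsel(event_select, umask, usr=True, os_flag=False, enable=True):
    """Hypothetical helper: pack a value for an IA32_PERFEVTSELx MSR."""
    if not 0 <= event_select <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event select and umask are 8-bit fields")
    value = event_select | (umask << 8)
    value |= int(usr) << USR_BIT
    value |= int(os_flag) << OS_BIT
    value |= int(enable) << EN_BIT
    return value


# The two events from the post, counted in ring 3 only (OS flag = 0).
BRANCH_INST_RETIRED = perfevtsel(0xC4, 0x00)
BRANCH_MISS_RETIRED = perfevtsel(0xC5, 0x00)

if __name__ == "__main__":
    print(hex(BRANCH_INST_RETIRED))  # 0x4100c4
    print(hex(BRANCH_MISS_RETIRED))  # 0x4100c5
```

With both counters programmed this way, the misprediction ratio for a measured region is the change in the C5H counter divided by the change in the C4H counter.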

I'm doing two experiments.
The first one:

Cache Allocation Technology on Xeon E5 2608L v3

Hi all,

I am having trouble with Cache Allocation Technology (CAT) on Xeon E5 2608L v3. According to various documents, CAT is supported on this chip, but when I check its availability using CPUID(07, 0), the result suggests that CAT is not available.

I am using the CPUID primitive from PCM 2.8, which successfully detected the availability of CMT. I did the test on Ubuntu 14.04. The motherboard used is an Asus X99-A. Turbo Boost and SMT were disabled during the test.

Any help is greatly appreciated!
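
For anyone reproducing the check, here is a minimal sketch of how the EBX value returned by CPUID(07, 0) is usually interpreted, assuming the bit assignments given in the Intel SDM: bit 12 reports the monitoring capability (which covers CMT) and bit 15 reports the allocation capability (which covers CAT). The helper name and the sample EBX value are made up for illustration; reading the real register still needs a CPUID primitive such as the one in PCM.

```python
PQM_BIT = 12  # CPUID.(EAX=07H, ECX=0):EBX[12], monitoring capability (CMT)
PQE_BIT = 15  # CPUID.(EAX=07H, ECX=0):EBX[15], allocation capability (CAT)


def decode_rdt_flags(ebx):
    """Hypothetical helper: report the monitoring/allocation bits of EBX."""
    return {
        "monitoring": bool((ebx >> PQM_BIT) & 1),
        "allocation": bool((ebx >> PQE_BIT) & 1),
    }


if __name__ == "__main__":
    # Made-up EBX value with only bit 12 set, mirroring the situation in the
    # post: monitoring is reported, allocation is not.
    sample_ebx = 1 << PQM_BIT
    print(decode_rdt_flags(sample_ebx))
```

When bit 15 is set, the SDM describes further enumeration of the allocation features through CPUID leaf 10H.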


Building intel pcm on Mac


I get the following errors when I build Intel PCM 2.7 on my Mac with SDK 10.10.

In file included from /opt/pcm_neu/intel-pcm/src/MacMSRDriver/PcmMsr/PcmMsrClient.cpp:15:

In file included from /Applications/

µops and nops and LCPs

This question concerns Sandy Bridge, Haswell, and other Intel microarchitectures with a µop cache.

Since the pre-decode unit fetches 16-byte blocks, NOPs are necessary for alignment purposes. It is better for basic blocks to start at a 16-byte-aligned address and for instructions not to cross 16-byte boundaries. But NOPs consume resources (Optimization Manual For example, XCHG EAX is decoded and saved as a µop in the µop cache. It is then eventually scheduled and retired.
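
The alignment arithmetic behind this can be sketched in a few lines. This is only an illustration: the function names and the sample addresses are made up, and the 16-byte block size is the one stated above.

```python
FETCH_BLOCK = 16  # bytes per pre-decode fetch block, as stated above


def nop_padding(address, align=FETCH_BLOCK):
    """Bytes of NOP padding needed so the next instruction starts aligned."""
    return (-address) % align


def crosses_boundary(address, length, align=FETCH_BLOCK):
    """True if an instruction of `length` bytes spans two aligned blocks."""
    first_block = address // align
    last_block = (address + length - 1) // align
    return first_block != last_block


if __name__ == "__main__":
    # Made-up addresses, for illustration only.
    print(nop_padding(0x4005))          # 11 bytes up to 0x4010
    print(crosses_boundary(0x400E, 4))  # True: bytes 0x400E..0x4011
    print(crosses_boundary(0x4000, 16)) # False: fills exactly one block
```

Every padding byte counted by `nop_padding` is the cost being asked about: space in the instruction stream and, if the NOPs execute, decode and µop cache resources.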

Ivy Bridge CPU part performance degradation when the GPU part is used for computations

I have seen many reports (and run my own tests) showing considerable performance degradation of the CPU part of Ivy Bridge when the GPU part of the device is used for GPGPU (an OpenCL computational app is running).

Also, it seems that the Bay Trail APU does not suffer such a large performance degradation.

What could be the reasons for this behavior (both the very considerable performance hit on Ivy Bridge and the much smaller hit on Bay Trail)?

Ideal vectorization speed-up with SSE2 and MIC512 - not AVX?


In the process of optimizing a large Fortran research code, I have written a simple program that very closely reproduces the performance characteristics of the more complicated case. The code essentially ends up spending all its time evaluating exponential functions and square roots in a vectorizable manner, so it is a compute-bound problem that should be extremely well suited for Xeon Phi and wide vector units in general.
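
As a point of reference for what "ideal" means here, the sketch below computes the lane-count upper bound for each instruction set named in the title. It assumes 64-bit double-precision elements, which the post does not state, and it ignores clock frequency and the actual throughput of the exponential and square-root routines; the names are made up for illustration.

```python
# Vector register widths in bits for the instruction sets in the title.
VECTOR_BITS = {"SSE2": 128, "AVX": 256, "MIC512": 512}


def ideal_speedup(vector_bits, element_bits=64):
    """Upper bound on speed-up over scalar code: elements per vector."""
    return vector_bits // element_bits


if __name__ == "__main__":
    for name, bits in VECTOR_BITS.items():
        print(name, ideal_speedup(bits))
```

Under these assumptions the bound is 2 for SSE2, 4 for AVX, and 8 for the 512-bit MIC vectors; with 32-bit single-precision elements each figure doubles.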
