Software Tuning, Performance Optimization & Platform Monitoring

The uncore event about GQ


I found that there are four main uncore events for the GQ: UNC_GQ_CYCLES_FULL.X, UNC_GQ_CYCLES_NOT_EMPTY.X, UNC_GQ_OCCUPANCY.X, and UNC_GQ_ALLOC.X. I want to collect these events, but I don't know what the ALLOC and OCCUPANCY events mean.

Does anyone know?


Muzi Li

OFFCORE_RSP_1 always gives zero count of uncore events

I'm trying to count uncore events (specifically LLC Hits) on a Nehalem machine but when I use OFFCORE_RSP_1, it always gives a zero count. However, OFFCORE_RSP_0 seems to be working fine.

Here is how I'm trying to read uncore events via OFFCORE_RSP_1:

   PerfEvtSel0 (address: 0x186) -> set this to value 0x04101BB

   OFFCORE_RSP_1 (address: 0x1A7) -> set this to value 0x0700 (to count LLC Hits)

But when I read IA32_PMC0 (address: 0xC1), the count is always 0.
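To check what the values above actually program, here is a minimal sketch that splits them into bit fields. It assumes the architectural IA32_PERFEVTSELx layout (event select in bits 7:0, unit mask in bits 15:8, USR in bit 16, OS in bit 17, EN in bit 22, CMASK in bits 31:24) and the Nehalem-style offcore response layout (request type in bits 7:0, response type in bits 15:8). The function names are hypothetical helpers, not part of any Intel tool.

```python
def decode_perfevtsel(value):
    """Split an IA32_PERFEVTSELx value into its main bit fields."""
    return {
        "event_select": value & 0xFF,          # bits 7:0
        "umask": (value >> 8) & 0xFF,          # bits 15:8
        "usr": bool((value >> 16) & 1),        # count in user mode
        "os": bool((value >> 17) & 1),         # count in kernel mode
        "enable": bool((value >> 22) & 1),     # counter enable
        "cmask": (value >> 24) & 0xFF,         # bits 31:24
    }


def split_offcore_rsp(value):
    """Split an offcore response MSR value into request and response masks.

    Assumes the Nehalem layout: request type in bits 7:0,
    response type in bits 15:8.
    """
    return {
        "request_mask": value & 0xFF,
        "response_mask": (value >> 8) & 0xFF,
    }


if __name__ == "__main__":
    sel = decode_perfevtsel(0x04101BB)
    rsp = split_offcore_rsp(0x0700)
    print({k: (hex(v) if isinstance(v, int) and not isinstance(v, bool) else v)
           for k, v in sel.items()})
    print({k: hex(v) for k, v in rsp.items()})
```

Under these assumptions, 0x04101BB decodes to event 0xBB with unit mask 0x01, user-mode counting only, and the counter enabled, while 0x0700 sets three response-type bits and leaves the request-type byte at zero. It may be worth comparing both against the values used with OFFCORE_RSP_0.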

Offcore counters for each core in a socket


I have a question about off-core response event programming on Nehalem.

I'm working on an Intel(R) Xeon(R) CPU X5550 system with 2 sockets, each containing 4 cores (Hyperthreading is enabled).

I'm trying to count the number of LLC Hits and Misses encountered by each core in the system separately. In this regard, I saw the section in the Nehalem programming guide about off-core response programming.


PCM reports low QAT card PCIe traffic


I have a problem monitoring QAT card PCIe traffic with PCM. Two cards produce ~5800 MB/s of memory read traffic and ~3100 MB/s of memory write traffic over PCIe, but the numbers reported by ./pcm-pcie.x are not even close: only ~100 MB/s is reported for both reads and writes.

Could you clarify the possible reason?



Environment and tools output

PCM Processor not supported

Hi everyone,

I am a beginner in the Intel PCM environment and I am trying to understand what's going on :)

So I have downloaded and compiled Intel PCM on my laptop. I have started to write some code to see how it goes, and I am also trying to use the ready-made utilities shipped with PCM 2.8.

For example I am running this on my machine:

sudo ./pcm-power.x --external_program /home/mrjay/Desktop/Compilers/BerkeleyUPC/bin/upcrun cg.A.O3.4

And I get this output:

How to use PCM on Windows 8?


My system is Win 8.1 Pro. I downloaded Intel PCM 2.8 and ran "build_all.bat", and everything built.

But when I run PCM.exe in PCM_Win\Debug, it reports that winring0.dll and winring0.sys are missing.

Where can I get them?

I read the WINDOWS_HOWTO.rtf file, and it suggests compiling the WinMSRDriver directory. But that directory only contains Win7 and WinXP. If my OS is Win 8.1, what should I do?


Thanks! I look forward to your reply.

Good ballpark figures for front end and back end bound

Could someone please shed some light on what good ballpark figures are for the front end bound and back end bound metrics obtained via VTune? Below are the formulae I've managed to obtain for the CPU KPIs. If I have interpreted them correctly, both back end bound and front end bound should be as close to zero as possible. However, what are practical values for these figures (good rough ballparks), and is my interpretation correct?


Front end bound (smaller is better?) = IDQ_UOPS_NOT_DELIVERED.CORE / (4 * Clock ticks)
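As a rough illustration of the front end bound formula above, here is a minimal sketch that evaluates it from two raw counter readings. The function name, the `pipeline_width` parameter, and the sample counter values are hypothetical; the factor 4 is taken from the formula in the post and is assumed to be the number of issue slots per cycle.

```python
def front_end_bound(idq_uops_not_delivered_core, clockticks, pipeline_width=4):
    """Fraction of issue slots in which the front end delivered no uop.

    idq_uops_not_delivered_core: raw count of IDQ_UOPS_NOT_DELIVERED.CORE
    clockticks: raw count of unhalted core clock ticks
    pipeline_width: issue slots per cycle (4 in the formula above)
    """
    if clockticks <= 0:
        raise ValueError("clockticks must be positive")
    total_slots = pipeline_width * clockticks
    return idq_uops_not_delivered_core / total_slots


if __name__ == "__main__":
    # Made-up sample readings, for illustration only.
    ratio = front_end_bound(idq_uops_not_delivered_core=400, clockticks=1000)
    print(f"Front end bound: {ratio:.1%}")
```

The result is a fraction between 0 and 1 when the counters are consistent, so it can be read directly as a percentage of issue slots.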

User programmable DMA controller in the system with Xeon E3-1275v3 and C226 PCH?

We are using a Xeon E3-1275v3 and a C226 PCH on our board. I am aware that this system does not support I/OAT or NetDMA.


Is there any user-programmable DMA controller?


We want to perform DMA transfers from main memory (source) to PCIe devices (destination).

Note: The PCIe devices do not have a DMA controller. The PCIe destination can be either a PCIe BAR in an endpoint or a multicast BAR residing inside a PCIe switch.

Slower code execution if compiled with Intel Parallel studio

Hi, everyone,

I've just got the Intel Parallel Studio 2015 compiler, and I'm trying to compile code that I previously compiled with the Apple LLVM 6.0 compiler under Xcode 6.0. I did not change any build options; I only changed the compiler. Surprisingly, the execution time increased 2.5 times after that.

I have a MacBook Pro with an Intel i7 running OS X 10.10. The main part of the program works with the vector data type.
