Software Tuning, Performance Optimization & Platform Monitoring

µops and nops and LCPs

This question concerns Sandy Bridge, Haswell, and other Intel microarchitectures with a µop cache.

Since the pre-decode unit fetches 16-byte blocks, NOPs are necessary for alignment purposes. It is better for basic blocks to start at a 16-byte-aligned address, and it is better for instructions not to straddle 16-byte boundaries. But NOPs consume resources (see the Optimization Manual). For example, XCHG EAX, EAX is decoded and saved as a µop in the µop cache. It is then eventually scheduled and retired.
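The alignment arithmetic behind that statement can be sketched in a few lines of C. This is only an illustration: the 16-byte block size is taken from the post, and the names `FETCH_BLOCK`, `padding_to_block` and `crosses_block` are made up for the example, not taken from any Intel tool or manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block size assumed from the post: the pre-decoder fetches 16-byte blocks. */
#define FETCH_BLOCK 16u

/* Number of padding (NOP) bytes needed so that the code at `addr`
   is pushed forward to the next 16-byte boundary (0 if already aligned). */
size_t padding_to_block(uintptr_t addr)
{
    return (size_t)((FETCH_BLOCK - (addr % FETCH_BLOCK)) % FETCH_BLOCK);
}

/* Nonzero if an instruction of `len` bytes starting at `addr`
   straddles a 16-byte boundary. */
int crosses_block(uintptr_t addr, size_t len)
{
    assert(len > 0);
    return (addr / FETCH_BLOCK) != ((addr + len - 1) / FETCH_BLOCK);
}
```

For instance, a basic block that would start at 0x1003 needs 13 bytes of padding, and a 4-byte instruction at 0x100e spans two fetch blocks while the same instruction at 0x100c does not.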

IvyBridge CPU part performance degradation when GPU part used for computations

I have seen many reports (and run my own tests) showing considerable degradation in the performance of the CPU part of Ivy Bridge when the GPU part of the device is used for GPGPU (that is, while an OpenCL computational app is executing).

Also, it seems that the BayTrail APU does not suffer such a large performance degradation.

What could be the reasons for this behavior (a very considerable performance hit on Ivy Bridge and a much smaller one on BayTrail)?

Ideal vectorization speed-up with SSE2 and MIC512 - not AVX?


In the process of optimizing a large Fortran research code I have written a simple program that very closely resembles the performance characteristics of the more complicated case. The code essentially ends up spending all its time evaluating exponential functions and square roots in a vectorizable manner, so it is a compute-bound problem that should be extremely well suited for Xeon Phi and wide vector units in general.

Cache speed problems.

I have Windows 7 64-bit on a Dell N5110 laptop with an Intel i5-2450M. I was writing a program that uses a lot of memory and cache, and I discovered that my cache speeds are very slow.

Here is a benchmark (made with a tool called pmbw) showing the results I usually get (the first page has the bandwidth):

Calculation of DRAM Power using MSR

Hello All,
I am currently working with performance counters. I counted different cache events using model-specific registers (MSRs). I consulted the following papers in order to use these counts to evaluate the power consumption of DRAM. However, after getting the event counts I do not know what to do next. I don't understand the relationship between counts and power. Please let me know how to relate the counts to DRAM power.
Is it possible to use performance counters to estimate the DRAM power directly?

Referenced papers are as follows

How to find the Individual core L1 and L2 cache hit/miss on the multicore environment

Scenario: two processes are executing on two different cores of a processor. How can I measure the L1 and L2 cache hits and misses of each core individually, assuming hyper-threading is disabled? I believe the performance counter monitors are not giving me a per-core breakdown. So is there any way I can measure the L1 and L2 cache hits and misses for each core?

Non-temporal stores and fences

Hi all,

I'm trying to write a fast function for filling a large buffer with a 128-bit vector value. I'm using movntps and I was wondering if a fence instruction is necessary for correctness and/or performance. Things appear to work fine without it, but I wonder if that's just dumb luck or if the processor detects the lack of it and ensures correctness through some kind of costly interrupt and/or microcode?

If a fence is highly recommended, should I use sfence or mfence? I couldn't find any documents with straight answers.
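A minimal sketch of such a fill routine, using the SSE intrinsics for MOVNTPS and SFENCE. It rests on a stated assumption rather than on anything in this thread: per the memory-ordering rules in the Intel SDM as I understand them, non-temporal stores are weakly ordered, so a single thread reading its own buffer needs no fence, while an SFENCE is what orders the stores ahead of a later store (such as a "buffer ready" flag) that another thread might observe. The name `fill_nt` is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_stream_ps, _mm_sfence, _mm_set1_ps (SSE) */

/* Fill `count` 16-byte vectors starting at `dst` with `value`
   using non-temporal stores. `dst` must be 16-byte aligned,
   which MOVNTPS requires. */
void fill_nt(float *dst, size_t count, __m128 value)
{
    assert(((uintptr_t)dst & 15u) == 0);

    for (size_t i = 0; i < count; ++i)
        _mm_stream_ps(dst + 4 * i, value);   /* compiles to MOVNTPS */

    /* One store fence after the loop: it orders the weakly ordered
       streaming stores before any store issued afterwards. */
    _mm_sfence();
}
```

A call might look like `fill_nt(buf, n_vectors, _mm_set1_ps(0.0f))`. The fence is placed once after the loop instead of after every store, since only the ordering against later stores matters here. MFENCE additionally orders loads, which this store-only routine should not need.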


Can I force the compiler to pay attention to my intrinsics?

I have been working on optimization of a DGEMM kernel in the hopes of being able to understand power-limiting on Xeon E5 v3 processors.

Using the DGEMM from Intel's MKL library, I can typically cause a Xeon E5 v3 to enter power-limited frequency throttling when using more than about 1/2 of the cores on the chip.  The details depend a bit on the base frequency, the number of cores, and the TDP, but it is usually around 1/2 of the cores.

Some questions about performance counters and how to read them

Hello all,

I am working on CMP scheduling and need to obtain some information about the behaviour of multi-threaded applications executed on Intel CPUs. One CPU considered in my study is the Intel Core 2 Duo. I have some questions and would be thankful if somebody could answer them.

1- Could you tell me how many performance counter registers exist in Core 2 Duo?
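One way to check this on a given machine is to read CPUID leaf 0x0A, which on processors with architectural performance monitoring reports the number of counters. The sketch below assumes GCC or Clang on x86 (for `<cpuid.h>` and `__get_cpuid`); the struct and function names (`pmu_info`, `decode_pmu_leaf`, `query_pmu`) are invented for the example.

```c
#include <stdint.h>

/* Selected fields of CPUID leaf 0x0A (architectural performance monitoring). */
struct pmu_info {
    unsigned version;        /* EAX[7:0]   version ID                          */
    unsigned gp_counters;    /* EAX[15:8]  general-purpose counters per
                                           logical processor                   */
    unsigned gp_width;       /* EAX[23:16] bit width of those counters         */
    unsigned fixed_counters; /* EDX[4:0]   fixed-function counters, reported
                                           only when version >= 2              */
};

/* Decode the raw EAX and EDX values returned by CPUID leaf 0x0A. */
struct pmu_info decode_pmu_leaf(uint32_t eax, uint32_t edx)
{
    struct pmu_info p;
    p.version        = eax & 0xffu;
    p.gp_counters    = (eax >> 8) & 0xffu;
    p.gp_width       = (eax >> 16) & 0xffu;
    p.fixed_counters = (p.version >= 2) ? (edx & 0x1fu) : 0;
    return p;
}

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>  /* GCC/Clang helper, assumed available */

/* Returns 1 and fills *out if leaf 0x0A is supported, otherwise 0. */
int query_pmu(struct pmu_info *out)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx))
        return 0;
    *out = decode_pmu_leaf(eax, edx);
    return 1;
}
#endif
```

Calling `query_pmu(&info)` and printing `info.gp_counters` and `info.fixed_counters` gives the counts for the processor the program runs on. Inside a virtual machine the leaf may report zero counters.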
