Software Tuning, Performance Optimization & Platform Monitoring

Xeon E5 L3 cache is organized how?


Running a benchmark for memory performance that tests different chunk transfer sizes on an E5 that has 20MB of L3 cache and 8 cores, I am seeing five "shelves" if you will:

  • Shelf 1 is the L1 cache.
  • Shelf 2 is the L2 cache.
  • Shelf 3 extends up to 12 MB.
  • Shelf 4 extends to 20 MB.
  • Shelf 5 is main memory.

Why is there a shelf that goes to 12 MB? Has Intel partitioned the L3 somehow? Or does the L3 actually have two sections running at different speeds?

Thanks for any clues.
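For anyone who wants to reproduce this kind of curve, here is a minimal sketch of a chunk-size sweep. It is not the benchmark used above: the function names `copy_bandwidth` and `sweep` are made up for illustration, and interpreter overhead will blur the small-size shelves. Note that a copy touches both a source and a destination buffer, so the cache footprint at each step is roughly twice the chunk size.

```python
import time

def copy_bandwidth(size_bytes, min_seconds=0.05):
    """Return approximate copy bandwidth in bytes/second for one chunk size."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    copies = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        dst[:] = src                      # one full pass over both buffers
        copies += 1
        elapsed = time.perf_counter() - start
    return copies * size_bytes / elapsed

def sweep(sizes):
    """Measure every chunk size and return (size, bandwidth) pairs."""
    return [(size, copy_bandwidth(size)) for size in sizes]

if __name__ == "__main__":
    sizes = [2 ** k * 1024 for k in range(2, 16)]   # 4 KiB .. 32 MiB
    for size, bw in sweep(sizes):
        print("%8d KiB  %8.2f GB/s" % (size // 1024, bw / 1e9))
```

Plotting bandwidth against chunk size on a log axis should make the shelves visible as plateaus; where they fall depends on the machine and on how the benchmark touches memory.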



I have a cPCI board with an E6xx processor. I need to write an application to talk to the hardware watchdog, and another application to control an LED connected via GPIO. My operating system is Debian squeeze.

I have no idea where to start. Are there sample programs available, or can someone point me in the right direction?

Kind regards,
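As a possible starting point, here is a rough sketch that uses the generic Linux interfaces: the `/dev/watchdog` character device and the sysfs GPIO files under `/sys/class/gpio`. It assumes the kernel on the board has drivers that expose the E6xx watchdog and GPIO through those interfaces, which may not be the case on a stock Debian squeeze kernel. The function names, the GPIO number, and the timing values are placeholders.

```python
import os
import time

def pet_watchdog(device="/dev/watchdog", interval=1.0, beats=10):
    """Keep the watchdog alive for `beats` intervals, then ask it to disarm."""
    fd = os.open(device, os.O_WRONLY)    # opening the device arms the timer
    try:
        for _ in range(beats):
            os.write(fd, b"\0")          # any write restarts the countdown
            time.sleep(interval)
        # "Magic close": drivers that support it disarm on close after a 'V',
        # unless the kernel was built or loaded with nowayout.
        os.write(fd, b"V")
    finally:
        os.close(fd)

def set_gpio(number, on, base="/sys/class/gpio"):
    """Drive one GPIO line high or low through the sysfs interface."""
    pin = os.path.join(base, "gpio%d" % number)
    if not os.path.isdir(pin):
        # Ask the kernel to expose the line as gpioN/ (needs root).
        with open(os.path.join(base, "export"), "w") as f:
            f.write(str(number))
    with open(os.path.join(pin, "direction"), "w") as f:
        f.write("out")
    with open(os.path.join(pin, "value"), "w") as f:
        f.write("1" if on else "0")
```

Calling `set_gpio(5, True)` would switch on an LED wired to GPIO 5, if that is how the board numbers it; the actual line number has to come from the board documentation.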


Accessing Uncore performance counters


I am trying to access performance counters inside the uncore, especially the IMC and QPI ones. I have a Core i7-3770 running Fedora release 17 with kernel version 3.6.0-rc1-xxxx.

I have used the following tools to query the available uncore counters, but have had no luck so far:

1. PAPI : papi_native_avail doesn’t return anything related to uncore.

2. libpfm4 4.3.0 : showevtinfo does return a few supported events for LLC but nothing related to IMC or QPI.

3. perf 3.5.0 : perf list also does not give any interesting events.
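One more thing that can be checked independently of the three tools above is whether the running kernel registered any uncore PMUs at all. The sketch below lists entries under `/sys/bus/event_source/devices` whose names start with `uncore`. The helper name is made up, and which entries appear (if any) depends on the kernel version and the processor model.

```python
import os

def list_uncore_pmus(root="/sys/bus/event_source/devices"):
    """Return the names of registered PMUs whose name starts with 'uncore'."""
    if not os.path.isdir(root):
        return []
    return sorted(name for name in os.listdir(root)
                  if name.startswith("uncore"))

if __name__ == "__main__":
    pmus = list_uncore_pmus()
    if pmus:
        for name in pmus:
            print(name)
    else:
        print("no uncore PMUs registered by this kernel")
```

If the list is empty, perf has no uncore PMU to report on for this kernel and processor.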

SandyBridge serial vector performance

I was attempting to optimize some code for the Nehalem/Westmere/SandyBridge Xeons, and I was surprised to find that the vector code was slower than the scalar code. So I wrote a small serial test code to compare the performance of scalar versus vector code. On all of the above Xeons, the vector code generally performed worse unless math functions were involved. I'm guessing this is the memory wall: the loops that call vector math functions (which have many more floating-point operations per memory reference) run around twice as fast as the scalar versions, as we might expect.

Reading QPI Routing Table

Hi there!

Is there a way to read out the QPI routing table of a processor? I'm looking for something similar to the "cpuid" instruction that can be used to query the hardware directly. Since I'm not working with either Linux or Windows, I cannot use existing programs.

Is there a way to get this information from the hardware? I guess this is done differently for different processors; at the moment I'm especially interested in the Xeon 7500 series.

Thanks in advance for any input!


Formatting Raw Events for Perf...

I am trying to run perf with raw events and have a couple of questions. First, in the perf code that I downloaded, I see the following definition for one of the Westmere events. How do I encode it in raw format? Without the inv and cmask fields I know it is "-r3fb1", but I can't figure out what the value is with inv and cmask included.

intel_perfmon_event_map[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] =
    X86_CONFIG(.event=0xb1, .umask=0x3f, .inv=1, .cmask=1);
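The raw value can be worked out from the x86 event-select layout that perf uses for core events: event in bits 0-7, umask in bits 8-15, edge in bit 18, inv in bit 23, and cmask in bits 24-31. Below is a small sketch of that arithmetic; the helper name `raw_event` is made up, and the layout should be checked against the format files under `/sys/bus/event_source/devices/cpu/format` on kernels that provide them.

```python
def raw_event(event, umask, edge=0, inv=0, cmask=0):
    """Pack event-select fields into a perf raw event string (without the dash)."""
    config = ((event & 0xFF)            # bits 0-7:   event select
              | ((umask & 0xFF) << 8)   # bits 8-15:  unit mask
              | ((edge & 1) << 18)      # bit 18:     edge detect
              | ((inv & 1) << 23)       # bit 23:     invert cmask comparison
              | ((cmask & 0xFF) << 24)) # bits 24-31: counter mask
    return "r%x" % config

if __name__ == "__main__":
    print(raw_event(0xb1, 0x3f))                    # r3fb1
    print(raw_event(0xb1, 0x3f, inv=1, cmask=1))    # r1803fb1
```

Under that layout, the definition above would be passed to perf as `-e r1803fb1`: 0x3fb1 plus 0x800000 for inv plus 0x1000000 for cmask=1.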


PCM: Unsupported processor error


I'm running Fedora 17 on an Ivy Bridge. When I downloaded PCM, built it, and tried to run it, I got the following error.

Any thoughts?


# ./pcm-memory.x

 Intel(r) Performance Counter Monitor: Memory Bandwidth Monitoring Utility

 Copyright (c) 2009-2012 Intel Corporation
 This utility measures memory bandwidth per channel in real-time
