Intel® Developer Zone:
Platform Monitoring

Welcome to Intel Platform Monitoring Community!

Here you will find information on performance monitoring, software tuning, and platform monitoring. Performance monitoring covers a variety of topics, including an introduction to monitoring and software tuning methodologies, as well as software optimization techniques and best known methods (BKMs) for novice and more advanced users.

For developers, programming reference manuals are available with the latest information describing the hardware interface of the Performance Monitoring Unit (PMU) of Intel microprocessors, including core and uncore monitoring resources, as well as the definitive source of information on the performance events that may be monitored.

Platform monitoring includes machine monitoring topics such as monitoring CPU cores, graphics processors, and other system coprocessors, as well as metering and quality of service.

Question about Core Specificity Encoding option for reading CPU performance counters
By Hamid Reza K.1
Hi list, I want to obtain the number of core cycles during which the data bus is busy for a multi-threaded application running on a Core 2 Duo. I found that the performance event "Dbus_Busy" meets my purpose. But, as you know, to use the event you have to specify the core-specificity encoding. There are two options for Core Specificity Encoding: All cores and This core. Could you tell me what the "This core" option means for a multi-threaded application? Best regards, H. R. Khaleghzadeh
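For readers who have not seen the encoding the post refers to, here is a minimal sketch of how a core-specificity field could be folded into an event-select value. It assumes the layout described in the Intel SDM for Core Duo/Core 2 class processors: the field occupies bits 15:14 of IA32_PERFEVTSELx (the top two bits of the unit mask), with 11B meaning "all cores" and 01B meaning "this core". The helper name `perfevtsel` and the event code `0xAB` are hypothetical placeholders, not the real code for Dbus_Busy, which should be taken from the SDM event tables.

```python
# Core-specificity encodings (assumed: bits 15:14 of IA32_PERFEVTSELx).
CORE_ALL = 0b11   # count events from all cores
CORE_THIS = 0b01  # count events from the core that owns this counter


def perfevtsel(event_select, umask_low, core_spec,
               usr=True, os_mode=False, enable=True):
    """Compose an IA32_PERFEVTSELx value (hypothetical helper).

    event_select: bits 7:0
    umask_low:    bits 13:8 (the part of the unit mask below core specificity)
    core_spec:    bits 15:14
    usr / os_mode / enable: bits 16, 17 and 22
    """
    umask = ((core_spec & 0x3) << 6) | (umask_low & 0x3F)
    value = (event_select & 0xFF) | (umask << 8)
    if usr:
        value |= 1 << 16
    if os_mode:
        value |= 1 << 17
    if enable:
        value |= 1 << 22
    return value


PLACEHOLDER_EVENT = 0xAB  # made-up event code, for illustration only

this_core = perfevtsel(PLACEHOLDER_EVENT, 0, CORE_THIS)
all_cores = perfevtsel(PLACEHOLDER_EVENT, 0, CORE_ALL)
print(hex(this_core), hex(all_cores))
```

Under this reading, a counter programmed with "this core" only sees activity attributed to the core it belongs to, so a multi-threaded application spread over both cores would need one such counter per core, while "all cores" folds both into a single count.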
MLC to support CoD for Haswell-EP or -EX
By drMikeT1
I was wondering if there is a version of the MLC (Memory Latency Checker) utility that understands the CoD (Cluster on Die) snooping mode for the Haswell platform. On a 2-socket Haswell-EP system with CoD enabled, numactl -H shows 4 memory domains and correctly accounts for the cores associated with each. MLC, however, treats the same host as a system with just 2 memory domains, and the tests DO NOT DIFFERENTIATE between the two memory controllers (and domains) that are available on each socket. Thanks, Michael
an issue on performance optimization by Intel compiler
By WEI Z. (Intel)14
Hi,

I am learning to use Intel C++ Compiler XE 15.0 integrated with VS 2013. I wrote the simple example below to look into its performance.

    void dataCopy(float *codeWord0Ptr, float *codeWord1Ptr, int numDataCopy, float *outputPtr)
    {
        float *outputPtr1 = &outputPtr[numDataCopy];
        __assume_aligned(codeWord0Ptr, 64);
        __assume_aligned(codeWord1Ptr, 64);
        __assume_aligned(outputPtr, 64);
        __assume_aligned(outputPtr1, 64);
        #pragma ivdep
        #pragma vector aligned
        for (idxData = 0; idxData < numDataCopy; idxData++)
        {
            outputPtr[idxData] = codeWord0Ptr[idxData];
            outputPtr1[idxData] = codeWord1Ptr[idxData];
        }
    }

I enabled Release and x64 mode, and enabled the related optimization and AVX settings in the project properties. I also enabled the optimization report in the project properties, and it reports that the loop was vectorized. When I run it on my host PC (the core is an i5-3320M) and do some profiling on function dat...
Change in Turbo policy for Xeon E5 v3?
By John D. McCalpin7
I ran across a change in the behavior of Xeon E5 v3 processors relative to Xeon E5 v1 processors and am confused about several aspects....

On Xeon E5 v1 (Sandy Bridge EP) and Xeon E5 v3 (Haswell EP) processors, the maximum non-turbo clock multiplier ratio is contained in bits 15:8 of MSR_PLATFORM_INFO (MSR 0xCE). This defines the rate at which the TSC increments, and matches the "nominal" frequency of the processor. The Linux "cpufreq" controls can be used to set specific target ratios in MSR 0x199 (IA32_PERF_CTL). For frequencies at or below the "nominal" multiplier in MSR_PLATFORM_INFO, this results in fixed-frequency operation at the specified frequency. For frequencies above the nominal multiplier in MSR_PLATFORM_INFO, the OS programs the highest allowable value into these fields -- i.e., the maximum single-core Turbo ratio from bits 7:0 of MSR_TURBO_RATIO_LIMIT (MSR 0x1AD). The hardware then provides the highest frequency that it is able to provide, subject to the numb...
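The bit fields named in this post can be decoded with a few lines of code. The following is a minimal sketch, not taken from the post: the helper names are hypothetical, the two raw MSR values are made-up samples rather than readings from real hardware, and the 100 MHz base clock is an assumption that should be checked for the processor in question.

```python
BCLK_MHZ = 100  # assumption: 100 MHz base clock


def bit_field(value, high, low):
    """Return bits high..low (inclusive) of an integer register value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def max_non_turbo_ratio(platform_info):
    """Bits 15:8 of MSR_PLATFORM_INFO (MSR 0xCE)."""
    return bit_field(platform_info, 15, 8)


def max_single_core_turbo_ratio(turbo_ratio_limit):
    """Bits 7:0 of MSR_TURBO_RATIO_LIMIT (MSR 0x1AD)."""
    return bit_field(turbo_ratio_limit, 7, 0)


# Made-up sample register contents, for illustration only.
sample_platform_info = 0x1A00     # ratio 0x1A = 26 in bits 15:8
sample_turbo_ratio_limit = 0x21   # ratio 0x21 = 33 in bits 7:0

nominal_mhz = max_non_turbo_ratio(sample_platform_info) * BCLK_MHZ
turbo_mhz = max_single_core_turbo_ratio(sample_turbo_ratio_limit) * BCLK_MHZ
print(f"nominal: {nominal_mhz} MHz, max single-core turbo: {turbo_mhz} MHz")
```

Keeping the decoding separate from however the raw 64-bit values are obtained makes the same helpers usable with any MSR-reading tool.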
Information about PCM PCIe counters
By Anuj K.3
Hi everyone. I have been working on measuring the PCIe activity of network cards and I wanted to understand PCM counters better. I'm running the pcm-pcie.x executable on a Haswell server which displays the following counters (full event description here: http://pastebin.com/pnuj1eKu):

- PCIeRdCur (PCIe read current transfer (full cache line))
- RFO (Demand Data RFO)
- CRd (Demand Code Read)
- DRd (Demand Data Read)
- ItoM (PCIe write full cache line)
- PRd (MMIO Read)
- WiL (MMIO Write)

I had these questions:

- What is the difference between PCIeRdCur and DRd? PCIeRdCur measures the number of partial and full cache line reads. Does it miss any PCIe reads that are captured by DRd, or does PCIeRdCur include DRd? I'm seeing non-zero values for both these counters.
- The description printed by pcm-pcie.x says that WiL measures traffic for "PCI devices writing to memory - application reads from disk/network/PCIe device", but it also describes it as "MMIO Writes (Full/Partial)". Aren't these two descript...
RAPL analysis/tests on my laptop
By Carlos P.17
Hi, I am testing the RAPL feature on my laptop to try to read some CPU consumption values, and I need someone to help me get some answers. I am running a RAPL sample code simultaneously with a total battery power consumption code, and I am getting the following values:

    ##RAPL:
    Package energy before: 5743.516907J
    PowerPlane0 (core) for core 0 energy before: 2655.314636J
    PowerPlane0 (core) for core 0 policy: 0
    PowerPlane1 (on-core GPU if avail) before: 1199.609863J
    PowerPlane1 (on-core GPU if avail) 0 policy: 16
    DRAM energy before: 33766.502869J
    Sleeping 1 second
    Package energy after: 5747.005737  (3.488831J consumed)
    PowerPlane0 (core) for core 0 energy after: 2655.366577  (0.051941J consumed)
    PowerPlane1 (on-core GPU if avail) after: 1199.689941  (0.080078J consumed)
    DRAM energy after: 33767.247131  (0.744263J consumed)
    ##Total battery power consumption: 8 watts average

I suppose that Package energy is the CPU total energy without DRAM energy, right? (PKG=unc...
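Because RAPL reports cumulative energy in joules while the battery figure is in watts, the two are easier to compare after converting each energy difference to average power over the sampling interval. The sketch below does that with the before/after values quoted in the post; the function name and dictionary keys are hypothetical, and it assumes the sleep lasted exactly 1 second and that no counter wrapped around in between.

```python
def average_power_watts(energy_before_j, energy_after_j, interval_s):
    """Average power over an interval: (E_after - E_before) / dt."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (energy_after_j - energy_before_j) / interval_s


# (before, after) energy readings in joules, copied from the post.
readings = {
    "package": (5743.516907, 5747.005737),
    "pp0_core": (2655.314636, 2655.366577),
    "pp1_gpu": (1199.609863, 1199.689941),
    "dram": (33766.502869, 33767.247131),
}

INTERVAL_S = 1.0  # assumption: the sample slept for exactly one second

power = {
    domain: average_power_watts(before, after, INTERVAL_S)
    for domain, (before, after) in readings.items()
}

for domain, watts in power.items():
    print(f"{domain}: {watts:.3f} W")
```

With a 1-second interval each wattage is numerically equal to the "consumed" joule figure in the output above, so the package domain comes to roughly 3.5 W against the 8 W average reported for the whole battery.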
Sample code for PCIe Burst Transfer white paper by Intel?
By Sonny G.4
Hi, I came across a white paper by Intel: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers... Is there sample code for Linux on a Xeon (E5-2600) processor that I can take a look at, instead of the general idea outlined in the paper? For example, the basic steps are:

1. Mark the memory region as WC - Is there any sample code for this?
2. Burst transfer - Sample code that uses the _mm256_store_si256() function

Any help is appreciated. Thanks!
Block Matrix Multiplication With Cilk?
By Patrick P.1
I'm trying to tackle the same problem every HPC student gets: multiply matrices faster with as few memory accesses as possible. I've started with the dumb 6-deep nested for-loop block algorithm, but I feel like you can eliminate the 3 innermost loops (or should be able to) with Cilk notation and take advantage of SSE/AVX. This is what I came up with, but I get a compile error on icpc that the array bases are invalid. I've seen tons of cilk examples with variables for array bounds, so I'm REALLY confused as to why this is invalid. Our Matrix is implemented generally using std::vector<std::vector<Val>> where in our case Val is an int but can change.

    Matrix Matrix::operator*(const Matrix& src) const throw(std::exception) {
        if (data.size() != src.data[0].size()) {
            throw std::runtime_error("Incompatible Matrix Dimensions!");
        }
        unsigned int BS = blockSize, m1x = data[0].size(), m1y = data.size(),
            m2x = src.data[0].size();
        Matrix toRet ...
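For reference, the "6-deep nested for-loop block algorithm" mentioned in this post can be sketched as follows. This is a plain illustration of loop blocking (tiling), written in Python with nested lists, not the poster's Cilk code and not a performance-tuned implementation. The function name `blocked_matmul` and its parameters are hypothetical.

```python
def blocked_matmul(a, b, block_size):
    """Multiply matrices a (n x k) and b (k x m) one tile at a time.

    The three outer loops walk over tiles; the three inner loops do an
    ordinary multiply-accumulate restricted to the current tile.
    """
    n, k, m = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("incompatible matrix dimensions")
    c = [[0] * m for _ in range(n)]
    for ii in range(0, n, block_size):
        for kk in range(0, k, block_size):
            for jj in range(0, m, block_size):
                for i in range(ii, min(ii + block_size, n)):
                    row_c = c[i]
                    for p in range(kk, min(kk + block_size, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(jj, min(jj + block_size, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


a = [[1, 2, 3],
     [4, 5, 6]]
b = [[7, 8],
     [9, 10],
     [11, 12]]
product = blocked_matmul(a, b, block_size=2)
print(product)
```

The innermost loop runs over a contiguous stretch of one row of `b` and one row of `c`, which is the part a vectorizing compiler or array notation would typically target.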

Videos


Software Performance Monitoring

Highlights from the Community Manager

On Jan 5, 2011, Intel launched the 2nd Generation Intel® Core™ processor family (formerly code-named Sandy Bridge) for laptops and PCs. The new processors have a revolutionary new architecture that combines the computing “brain,” or microprocessor, with a graphics engine on the same die for the very first time. New features include Intel® Insider™, Intel® Quick Sync Video, and a new version of the company's award-winning Intel® Wireless Display (WiDi), which now adds 1080p HD and content protection for those wishing to beam premium HD content from their laptop screen to their TV.

Stay connected. Visit often. We will be posting the PMU programming guides and updated tools to give you the latest information on the new Intel microarchitecture innovations.