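Platform monitoring covers machine monitoring topics such as monitoring CPU cores and graphics processors, other system microprocessors, and service measurement and quality.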
By Angela Schmid (Intel), posted 05/06/2013
Audience: Anyone collecting event-based performance data on a platform based on the Intel® Xeon® processor E5 family. There is a performance monitoring unit (PMU) erratum on the Intel® Xeon® processor E5 family that affects the events used for memory and cache profiling. To collect data on the events...
By Hussam Mousa (Intel), posted 06/15/2012
Summary of recent Intel processors' CPUID values (model and family numbers), linked to the architecture codename and processor codename as well as their brand names and models. The summary covers mainline IA x86 and x64 processors at 90nm, 65nm, 45nm, and 32nm.
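As a rough illustration of how family and model numbers like those in the summary above are derived, here is a minimal C sketch that decodes a raw CPUID leaf 1 EAX value using the field layout documented for the CPUID instruction. The type name `cpu_signature`, the function name `decode_cpu_signature`, and the sample value are hypothetical and chosen for illustration only; the sketch does not execute CPUID itself.

```c
#include <stdint.h>

/* Decoded fields of a CPUID.(EAX=1):EAX processor signature. */
typedef struct {
    unsigned family;   /* display family */
    unsigned model;    /* display model */
    unsigned stepping; /* stepping ID */
} cpu_signature;

/* Decode a raw CPUID leaf 1 EAX value (hypothetical helper). */
static cpu_signature decode_cpu_signature(uint32_t eax)
{
    unsigned stepping    = eax & 0xF;          /* bits 3:0   */
    unsigned base_model  = (eax >> 4) & 0xF;   /* bits 7:4   */
    unsigned base_family = (eax >> 8) & 0xF;   /* bits 11:8  */
    unsigned ext_model   = (eax >> 16) & 0xF;  /* bits 19:16 */
    unsigned ext_family  = (eax >> 20) & 0xFF; /* bits 27:20 */

    cpu_signature sig;
    sig.stepping = stepping;

    /* The extended family is added only when the base family is 0xF. */
    sig.family = (base_family == 0xF) ? base_family + ext_family
                                      : base_family;

    /* The extended model applies only to base families 0x6 and 0xF. */
    sig.model = (base_family == 0x6 || base_family == 0xF)
                    ? (ext_model << 4) + base_model
                    : base_model;
    return sig;
}
```

For the sample value 0x000206D7, this yields family 6, model 45 (0x2D), stepping 7.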
Introduction This is an organic document, meaning that it will expand as need and request dictate. The purpose is to help establish a baseline understanding of the terms used in Platform Monitoring, the concepts described, and the utilizations or capabilities comprehended. Performance Monitoring Terminolog...
Introduction This discussion covers some of the needs and implications that drive one to optimize and manage one’s platform. In this, we disclose opportunities to influence the ultimate performance of a computer system at the architectural, platform and software levels, and provide a rationale fo...
By John S.
I was looking for some quick statistics on turbo mode on the new E5-v2 processors. The turbo frequency listed on the ARK site only applies when one core is being turbo'd, right? I would like to know what frequency all cores can safely turbo to at all times for the E5-2667v2 and E5-2643v2. Does anyone have these numbers?
Hi, I'm trying to measure the number of floating-point operations executed on a processor with the Haswell microarchitecture using performance counters under Linux. So far, I have managed to measure the number of single-precision float multiplications using the UOPS_ISSUED.SINGLE_MUL performance event. However, this event counts only float multiplications; it ignores float additions and comparisons as well as any operations involving doubles. I also tried accessing the FP_COMP_OPS_EXE.* events (such as FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE) from the Sandy Bridge and Ivy Bridge microarchitectures via their raw event codes using the perf tool under Linux. The resulting counter values were always 0. Is there any way to measure the number of float additions, comparisons, and divisions, as well as operations on doubles, on the Haswell microarchitecture? Thanks in advance
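For readers unfamiliar with the raw event codes mentioned in the post above, here is a minimal C sketch of how a perf raw event string is formed: the unit mask goes in bits 15:8 and the event select in bits 7:0, printed in hex after an `r` prefix. The helper names are hypothetical. The example values (event select 0x10, umask 0x20 for FP_COMP_OPS_EXE.SSE_FP_SCALAR_SINGLE) are an assumption and should be verified against the event tables in the Software Developer's Manual. As the post notes, these events returned 0 on Haswell, so the sketch only shows the encoding, not a working FLOP counter.

```c
#include <stdint.h>
#include <stdio.h>

/* Combine event select and unit mask into the low 16 bits of the
 * raw event config: (umask << 8) | event_select. */
static uint32_t raw_event_config(uint8_t event_select, uint8_t umask)
{
    return ((uint32_t)umask << 8) | event_select;
}

/* Write the "rNNNN" string accepted by `perf stat -e` into buf.
 * Returns the snprintf result: characters written (excluding the
 * terminating NUL), or a negative value on error. */
static int format_raw_event(char *buf, size_t len,
                            uint8_t event_select, uint8_t umask)
{
    return snprintf(buf, len, "r%04x",
                    (unsigned)raw_event_config(event_select, umask));
}
```

With the assumed values above, `format_raw_event(buf, sizeof buf, 0x10, 0x20)` produces `r2010`, which would be passed as `perf stat -e r2010`.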
Hi all, I got a new server with an Atom C2750 (Avoton) and I was looking at the performance counters with pcm.x (IntelPerformanceCounterMonitor version 2.6, which should support this chip). Most of the data seems OK, but the reading for consumed energy seems wrong, as I see a roughly constant ~7.81 Joules of consumption when running the tool at a one-second interval. The reading is proportional to the sampling interval, so if I run $ ./pcm.x 2 I always get ~15.6 Joules, regardless of the actual CPU load. Is there anyone with the same hardware who can properly read the energy counter? Thanks!
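To make the arithmetic in the post above concrete, here is a minimal C sketch of how an energy reading is typically derived from two samples of a 32-bit energy-status counter and then turned into average power. It assumes the common RAPL convention that one count equals 1/2^ESU joules, where ESU is an exponent read from a power-unit register. That convention is an assumption here and may not hold for every part, so it should be checked against the manual for the specific chip. The function names are hypothetical and this is not how pcm.x itself is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Energy consumed between two raw 32-bit counter samples, in joules.
 * esu is the energy-status-unit exponent; one count is assumed to be
 * 1 / 2^esu joules. */
static double energy_joules(uint32_t prev, uint32_t curr, unsigned esu)
{
    assert(esu < 32);
    /* Unsigned subtraction gives the right delta across one wraparound. */
    uint32_t delta = curr - prev;
    return (double)delta / (double)(1u << esu);
}

/* Average power over a sampling interval, in watts. */
static double average_watts(double joules, double seconds)
{
    assert(seconds > 0.0);
    return joules / seconds;
}
```

Applied to the figures in the post, `average_watts(7.81, 1.0)` and `average_watts(15.6, 2.0)` both come out at roughly 7.8 W, which is why a reading that scales exactly with the interval suggests a constant reported power rather than a load-dependent measurement.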
By Marcin K.
Hello, In Agner Fog's excellent microarchitecture.pdf (section 9.14) I read that store forwarding works in the following cases: "[...] When a write of 128 or 256 bits is followed by a read of the same size and the same address, aligned by 16." On the other hand, in Intel's Architecture Optimization Reference Manual (22.214.171.124 Intel Sandy Bridge, L1 DCache) I read that stores cannot forward to loads in the following cases: "[...] Any load that crosses a 16-byte boundary of a 32-byte store." It would seem that a 32-byte load does cross a 16-byte boundary, so it should not be forwarded. However, table 2.16, section 126.96.36.199 does indicate that forwarding takes place for a 32-byte store/load when the load is from the same address. I wrote the following simple code to test this, and it seems that 32-byte stores have a small penalty when forwarded to subsequent 32-byte loads on the Sandy Bridge (and Ivy Bridge) architecture. Here is the code: #include <stdlib.h> #include <malloc.h...
Hi, We are developing and optimizing a codec on Intel architecture using assembly optimization, finding the most time-consuming functions/modules with VTune Amplifier. I have some basic questions; please clarify. How do I find the stalls present in the assembly, and how do I remove them? Is reordering the only solution? Is there any way to know which instructions are pipelined? I am unsure whether intrinsics or assembly programming gives better performance. Of course, if portability is required, intrinsics are a good choice, but I am looking for the best performance. Are the IPPs license-free? What are the basic strategies/steps in writing and optimizing an assembly function? If you have any documentation on this from the IPP implementation, please share. Regards, Srinivasu
Hi, Is there an equivalent of the FP_COMP_OPS_EXE family of events (.X87, .SSE_FP_PACKED_DOUBLE, etc.) for Silvermont chips? I am trying to use Linux "perf" to get at performance statistics of a scientific code, but from the non-architectural events listed in Table 19-18 of the Software Developer's Manual (Vol. 3), I don't see anything. If there isn't, does anyone have advice on counting FlOps and integer ops on Silvermont? Thanks, Chris
By Redblack B.
Turbostat on Linux is able to display GPU power consumption, but none of the Intel tools currently do so (e.g. PCM, BLA & Power Gadget). Is there a reason for this? Thanks