Authors: Suresh Srinivas, Uttam Pawar, Catalin Manciu, Gabriel Schulhof
This document is a Runtime Optimization Blueprint that illustrates how Last Branch Record (LBR) on Intel® architecture can be used to improve the performance and observability of runtimes. The intended audience is runtime implementers and customers/providers who deploy runtimes at scale. The Overview section introduces the problem: runtimes suffer from high Instruction Cache (I$) miss stalls (on average, 12% of CPU cycles are stalled across seven runtime workloads). The Diagnosis section shows how to diagnose this problem using performance monitoring unit (PMU) counters on Intel architecture and sample tools. The Solution section describes how to solve it. The Case Studies section details how this optimization improves performance and reduces I$ misses as well as Instruction TLB (ITLB) misses (up to 50%) in three applications in three environments. The last section summarizes the blueprint and provides a call to action for runtime developers/implementers.