How to Tune Applications Using a Top-Down Characterization of Microarchitectural Issues


Applies to: Platforms based on 2nd Generation Intel® Core™ processor family, 3rd Generation Intel® Core™ processor family, Intel® Xeon® processor E5 family

Authors: Jackson Marusarz, Shannon Cepeda, Ahmad Yasin


1 Introduction

Optimizing applications to take advantage of increasingly complex CPU microarchitectures can be a difficult and intimidating task.  In addition to knowledge of an application, its data layout, and its algorithms, performance tuners need to know how the application is utilizing the available hardware resources.  One way to obtain this knowledge is by using on-chip Performance Monitoring Units (PMUs).  PMUs are dedicated pieces of logic within a CPU core that count specific hardware events as they occur on the system.  Examples of such events include cache misses and branch mispredictions.  These events can be observed and combined to create useful high-level metrics such as cycles-per-instruction (CPI).
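For example, CPI can be computed directly from two raw event counts (the event names shown here are those used on recent Intel microarchitectures; exact names vary by microarchitecture):

    CPI = CPU_CLK_UNHALTED.THREAD / INST_RETIRED.ANY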

A specific microarchitecture may make hundreds of events available through its PMU.  However, it is frequently non-obvious which events are useful for detecting and fixing specific performance issues, and extracting useful information from raw event data often requires in-depth knowledge of both the microarchitecture design and the PMU specification.  The use of predefined events and metrics, and the top-down characterization method described here, convert that data into actionable information.


2 The Top-Down Characterization Overview

Modern CPUs employ pipelining as well as techniques like hardware threading, out-of-order execution, and instruction-level parallelism to utilize resources as effectively as possible.  In spite of this, some software patterns and algorithms still result in inefficiencies.  For example, linked data structures are commonly used in software, but they cause indirect addressing that can defeat hardware prefetchers.  In many cases, this behavior creates bubbles of idleness in the pipeline while data is retrieved and there are no other instructions to execute.  Linked data structures may be an appropriate solution to a software problem, but they can result in inefficiencies.  There are many other examples at the software level that have implications for the underlying CPU pipelines.  Methodologies like the Top-Down Characterization aim to give developers insight into whether they have made wise choices with their algorithms and data structures.
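As a contrived illustration of this kind of pattern, consider a linked-list traversal compared with an equivalent array traversal (the structures and function names below are invented for illustration only):

    #include <stddef.h>

    /* Hypothetical illustration: pointer chasing vs. sequential access. */
    struct node {
        int          value;
        struct node *next;   /* the next element may be anywhere in memory */
    };

    /* Each iteration depends on a load whose address is not known until the
       previous load completes, so the hardware prefetchers cannot help and
       the pipeline may sit idle waiting for data. */
    long sum_list(const struct node *head)
    {
        long sum = 0;
        for (const struct node *n = head; n != NULL; n = n->next)
            sum += n->value;
        return sum;
    }

    /* The same data stored contiguously is accessed with a predictable
       stride, which the hardware prefetchers handle well. */
    long sum_array(const int *values, size_t count)
    {
        long sum = 0;
        for (size_t i = 0; i < count; i++)
            sum += values[i];
        return sum;
    }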

The Top-Down Characterization is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application.  Its aim is to show, on average, how well the CPU's pipeline(s) were being utilized while running an application.  Previous frameworks for interpreting events relied on accounting for CPU clockticks - determining what fraction of the CPU's clockticks was spent on which type of operation (for example, retrieving data because of L2 cache misses).  This framework is instead based on accounting for the pipeline's resources.  To explain the Top-Down Characterization, we will review a few microarchitectural concepts at a high level.  Many details of the microarchitecture are abstracted in this framework, allowing a developer to use and understand it without needing to be a hardware expert.

The pipeline of a modern high-performance CPU is quite complex.  In the simplified view shown in Figure 1, the pipeline is divided conceptually into two halves, the Front-end and the Back-end.  The Front-end is responsible for fetching the program code, represented as architectural instructions, and decoding it into one or more low-level hardware operations called micro-ops (uops).  The uops are then fed to the Back-end in a process called allocation.  Once allocated, the Back-end is responsible for monitoring when a uop's data operands are available and for executing the uop in an available execution unit.  The completion of a uop's execution is called retirement, and this is where the results of the uop are committed to the architectural state (written to CPU registers or back to memory).  Usually, most uops pass completely through the pipeline and retire, but sometimes speculatively fetched uops are cancelled before retirement – as in the case of mispredicted branches.


Figure 1: A modern out-of-order Intel CPU pipeline


The Front-end of the pipeline on recent Intel microarchitectures can allocate four uops per cycle, while the Back-end can retire four uops per cycle.  From these capabilities we derive the abstract concept of a pipeline slot.  A pipeline slot represents the hardware resources needed to process one uop.  The Top-Down Characterization assumes that for each CPU core, on each clock cycle, there are four pipeline slots available.  It then uses specially designed PMU events to measure how well those pipeline slots were utilized.  The status of the pipeline slots is taken at the allocation point (marked with a star in Figure 1), where uops leave the Front-end for the Back-end.  Each pipeline slot available during an application’s runtime will be classified into one of four categories based on the simplified pipeline view described above.
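In event terms, the number of slots accounted for on each hardware thread is simply the unhalted cycle count scaled by the pipeline width (the same definition used in Section 6.2):

    SLOTS = 4 * CPU_CLK_UNHALTED.THREAD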

During any cycle, a pipeline slot can either be empty or filled with a uop.  If a slot is empty during a clock cycle, this is attributed to a stall.  The next step in classifying the slot is to determine whether the Front-end or the Back-end portion of the pipeline caused the stall.  This is done using designated PMU events and the formulas given in Section 6.2.  The goal of the Top-Down Characterization is to identify dominant bottlenecks, so attributing the stall to either the Front-end or the Back-end is a critical point of consideration.  Generally, if the stall is caused by the Front-end's inability to fill the slot with a uop, the slot is classified as Front-End Bound for that cycle, meaning that performance was limited by some bottleneck under the Front-End Bound category.  If the Front-end has a uop ready but cannot deliver it because the Back-end is not ready to accept it, the empty pipeline slot is classified as Back-End Bound.  Back-end stalls are generally caused by the Back-end running out of some resource, for example, load buffers.  If both the Front-end and the Back-end are stalled, the slot is classified as Back-End Bound, because in that case fixing the stall in the Front-end would most likely not help the application's performance.  The Back-end is the blocking bottleneck, and it must be removed first before fixing issues in the Front-end will have any effect.

If the processor is not stalled, then a pipeline slot is filled with a uop at the allocation point.  In this case, the determining factor for how to classify the slot is whether the uop eventually retires.  If it retires, the slot is classified as Retiring.  If it does not, either because of an incorrect branch prediction by the Front-end or because of a clearing event like a pipeline flush due to Self-Modifying Code, the slot is classified as Bad Speculation.  These four categories make up the top level of the Top-Down Characterization.  To characterize an application, each pipeline slot is classified into exactly one of these four categories.  Figure 2 summarizes this classification in a flow chart.


Figure 2: Pipeline Slot Classification Flow Chart
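In code form, the decision for each slot can be sketched roughly as follows.  This is a minimal illustration of the flow in Figure 2, not the actual implementation; the real classification is computed from the PMU event formulas listed in Section 6.2, not per slot in software.

    /* Illustrative sketch of the per-slot classification in Figure 2. */
    enum category { FRONT_END_BOUND, BACK_END_BOUND, BAD_SPECULATION, RETIRING };

    enum category classify_slot(int uop_allocated, int back_end_stalled,
                                int uop_eventually_retires)
    {
        if (!uop_allocated) {
            /* Empty slot: attribute the stall.  If the Back-end could not
               accept a uop, the slot is Back-End Bound even if the Front-end
               was also stalled. */
            return back_end_stalled ? BACK_END_BOUND : FRONT_END_BOUND;
        }
        /* Slot was filled: it counts as useful work only if the uop retires. */
        return uop_eventually_retires ? RETIRING : BAD_SPECULATION;
    }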


The distribution of pipeline slots across these four categories is very useful for developers.  Although event-based metrics have been available for many years, before this characterization there was no systematic approach for identifying which potential performance issues were the most impactful.  When performance metrics are placed into this framework, as shown in the next sections, a developer can see which issues need to be tackled first.  The events needed to classify pipeline slots into the four categories are available beginning with Intel® microarchitecture code name Sandy Bridge, which is used in the 2nd Generation Intel Core processor family and the Intel Xeon processor E5 family.  Subsequent microarchitectures may allow further decomposition of these high-level categories into more detailed performance metrics, some of which are discussed in Section 4.


3 Top-Down Characterization using Intel® VTune™ Amplifier XE 2013

The Top-Down Characterization can be applied by any developer or tool that is able to collect the required PMU events.  For ease of use, Intel® VTune™ Amplifier XE 2013 has built-in support for the Top-Down Characterization.  VTune Amplifier XE is Intel's performance analyzer and can be obtained at the link given in Section 6.1.  VTune Amplifier XE's General Exploration analysis type, shown in Figure 3, is pre-configured to collect the events used in the Top-Down Characterization, starting with Intel microarchitecture code name Sandy Bridge.  General Exploration also collects the events required to calculate many other useful performance metrics.  The results of a General Exploration analysis are displayed by default in the General Exploration viewpoint of the GUI.  For details on running this analysis, see the VTune Amplifier XE documentation.
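For reference, a General Exploration collection can typically also be started from the command line with the amplxe-cl tool included with VTune Amplifier XE (the application name below is a placeholder; consult the documentation for the exact options in your version):

    amplxe-cl -collect general-exploration -- ./my_app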



Figure 3: VTune Amplifier XE Analysis Type Selection


General Exploration results are displayed in hierarchical columns to reinforce the top-down nature of the characterization.  A Summary Tab (not shown) gives the percentage of pipeline slots in each category for the whole application.  A developer can explore results in multiple ways.  The most common way to explore results is to view metrics at the function level as shown in Figure 4.  For each function, the fraction of pipeline slots in each category is shown.  For example, the function called grid_intersect(), selected below, had 37.3% of its pipeline slots in the Retiring category, 8.3% in Bad Speculation, 47.1% in Back-End Bound, and 7.3% in Front-End Bound.  Each category can be expanded to view metrics underneath that category.  Automatic highlighting is also used to draw a developer’s attention to potential problem areas, in this case, to the high percentage of Back-End Bound pipeline slots for grid_intersect().


Figure 4: VTune Amplifier XE General Exploration Viewpoint



4 Microarchitectural Tuning Methodology

When doing any performance tuning, it is important to focus on the top hotspots of the application.  Hotspots are the functions taking the most CPU time.  Focusing on these spots ensures that optimizations have an impact on overall application performance.  VTune Amplifier XE has two specific analysis types for this, called Hotspots and Lightweight Hotspots.  Within the General Exploration viewpoint, hotspots can be identified as the functions or modules with the highest counts of the CPU_CLK_UNHALTED event, which measures the number of CPU clockticks.  To obtain maximum benefit from microarchitectural tuning, developers should ensure that algorithmic optimizations, such as adding parallelism, have already been applied.  Generally, system tuning is performed first, then application-level algorithm tuning, then architectural and microarchitectural tuning.  This process is also referred to as “Top-Down”, as in the Top-Down software tuning methodology.  It, as well as other important aspects of performance tuning like workload selection, is described in the whitepaper De-Mystifying Software Performance Optimization.  Once the hotspots have been identified, the following process is used for microarchitectural tuning:

  1. Select a hotspot function (one with a large percentage of the application’s total clockticks).

  2. Evaluate the efficiency of that hotspot using the Top-Down Characterization and the guidelines given below.

  3. If it is inefficient, drill down into the category representing the primary bottleneck, and use the next levels of sub-bottlenecks to identify the causes.  The rest of Section 4 describes how to use the additional metrics in each category.

  4. Optimize the issue(s).  VTune Amplifier XE’s tuning guides contain specific tuning suggestions for many of the underlying performance issues in each category.

  5. Repeat until all significant hotspots have been evaluated.

VTune Amplifier XE will automatically highlight metric values in the GUI if they are outside a predefined threshold and occur in a hotspot.  VTune Amplifier XE classifies a function as a hotspot if more than 5% of the application's total clockticks accrued within it.  Determining whether a given fraction of pipeline slots in a particular category constitutes a bottleneck can be workload-dependent, but some general guidelines are given in Table 1 below.  These thresholds are based on analysis of a number of workloads in Intel's labs.  If the fraction of time spent in a category (other than Retiring) for a hotspot is at the high end of, or greater than, the indicated range, an investigation might be useful.  If this is true for more than one category, the category with the highest fraction of time should be investigated first.  Note that hotspots are expected to have some fraction of time spent in each category, and values within the normal ranges below may not indicate a problem.

The important thing to realize about the Top-Down Characterization is that a developer need not spend time optimizing issues in a category that is not identified as a bottleneck – doing so will likely not lead to a significant performance improvement.  The rest of this section describes the categories in greater detail and the types of performance issues found in each.


Table 1: Top-Down Category Guidelines by Workload Type


4.1 Tuning for the Back-End Bound Category

The majority of un-tuned applications will be Back-End Bound.  Resolving Back-end issues is often about resolving sources of latency, which cause retirement to take longer than necessary.  On Intel microarchitecture code name Sandy Bridge, VTune Amplifier XE provides Back-End Bound metrics that find the sources of this latency.  For example, the LLC Miss (Last-Level Cache Miss) metric identifies regions of code that need to access DRAM for data, and the Split Loads and Split Stores metrics point out memory access patterns that can harm performance.  For more details on the Intel microarchitecture code name Sandy Bridge metrics, see the Tuning Guide.  Starting with Intel® microarchitecture code name Ivy Bridge (which is used in the 3rd Generation Intel Core processor family), events are available to break down the Back-End Bound classification into Memory Bound and Core Bound sub-metrics, as shown in Figure 5.  A metric beneath the top four categories may use a domain other than the pipeline-slots domain; each metric uses the most appropriate domain based on its underlying PMU events.  For more details, see the documentation for each metric or category.

The Memory and Core Bound sub-metrics are determined using events corresponding to the utilization of the execution units – as opposed to the allocation stage used in the top-level classifications.  Therefore, the sum of these metrics will not necessarily match the Back-End Bound ratio determined at the top-level (though they correlate well). 

Stalls in the Memory Bound category have causes related to the memory subsystem.  For example, cache misses and memory accesses can cause Memory Bound stalls.  Core Bound stalls are caused by a less-than-optimal use of the available execution units in the CPU during each cycle.  For example, several multi-cycle divide instructions in a row competing for the divide units could cause Core Bound stalls.  For this breakdown, cycles are only classified as Core Bound if they are stalled AND there are no uncompleted memory accesses.  For example, if there are pending loads, the cycle is classified as Memory Bound because the execution units are being starved while the loads have not yet returned data.  PMU events were designed into the hardware specifically to allow this type of breakdown, which helps identify the true bottleneck in an application.  The majority of Back-End Bound issues fall into the Memory Bound category.

Most of the metrics under the Memory Bound category identify which level of the memory hierarchy, from the L1 cache out to system memory, is the bottleneck.  Again, the events used for this determination were carefully designed.  Once the Back-end is stalled, the metrics attribute the stalls of pending loads to a particular cache level or to in-flight stores.  If a hotspot is bound at a given level, it means that most of its data is being retrieved from that cache or memory level.  Optimizations should focus on moving data closer to the core.  Store Bound is also called out as a sub-category, which can indicate dependencies, such as loads in the pipeline that depend on prior stores.  Under each of these categories there are metrics that identify specific application behaviors resulting in Memory Bound execution.  For example, Loads Blocked by Store Forwarding and 4K Aliasing are metrics that flag behaviors which can cause an application to be L1 Bound.
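As an illustration of the kind of pattern the store-forwarding metric flags, a load that reads more bytes than an immediately preceding store wrote to the same location generally cannot be serviced from the store buffer.  The code below is a contrived sketch (the function name is invented, and buf is assumed to point to at least 8 readable bytes):

    #include <string.h>

    /* Contrived illustration of a pattern that can block store-to-load
       forwarding: a wide load reads bytes that a narrower store just wrote,
       so the load cannot be forwarded from the store buffer and must wait
       for the store to complete. */
    unsigned long long touch(unsigned char *buf)
    {
        unsigned long long wide;
        buf[0] = 42;                      /* 1-byte store                    */
        memcpy(&wide, buf, sizeof wide);  /* 8-byte load of the same bytes   */
        return wide;
    }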

Core Bound stalls are typically less common within Back-End Bound.  They occur when the available execution units are used sub-optimally and there are no significant outstanding memory requirements; for example, a tight loop doing floating point (FP) arithmetic on data that fits within the caches.  VTune Amplifier XE provides metrics to detect behaviors in this category.  For example, the Divider metric identifies cycles when the divider hardware is heavily used, and the Port Utilization metric identifies competition for discrete execution units.
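A contrived example of code that would tend to be Core Bound rather than Memory Bound is a loop whose throughput is limited by the divide unit while all of its data fits in cache (the function name below is invented for illustration):

    /* Contrived Core Bound illustration: back-to-back FP divides on in-cache
       data keep the long-latency divider busy while the memory subsystem is
       essentially idle. */
    double divide_many(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i] / b[i];   /* each iteration issues an FP divide */
        return sum;
    }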


Figure 5: The Back-End Bound Category for Intel® microarchitecture code name Ivy Bridge in VTune Amplifier XE


4.2 Tuning for the Front-End Bound Category

The Front-End Bound category covers several other types of pipeline stalls.  It is less common for the Front-end portion of the pipeline to become the application's bottleneck; however, there are cases where the Front-end contributes significantly to machine stalls.  For example, JITed code and interpreted code can cause Front-end stalls because the instruction stream is created dynamically, without the benefit of compiler code layout in advance.  Improving performance in the Front-End Bound category generally relates to code layout (co-locating hot code) and compiler techniques.  For example, branchy code or code with a large footprint may highlight the Front-End Bound category.  Techniques like code size optimization and compiler profile-guided optimization (PGO) are likely to reduce these stalls in many cases.
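For example, a profile-guided build with the Intel compiler typically follows a pattern like the following (the option names shown are the commonly used ones and the file names are placeholders; with gcc the corresponding options are -fprofile-generate and -fprofile-use):

    icc -prof-gen -o app source.c        # build with profiling instrumentation
    ./app training_input                 # run a representative workload
    icc -prof-use -o app source.c        # rebuild using the collected profile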

The Top-Down Characterization on Intel microarchitecture code name Ivy Bridge and beyond divides Front-End Bound stalls into two categories, Front-End Latency and Front-End Bandwidth.  The Front-End Latency metric reports cycles in which no uops were issued by the Front-end while the Back-end was ready to consume them.  Recall that the Front-end cluster can issue up to four uops per cycle.  The Front-End Bandwidth metric reports cycles in which fewer than four uops were issued, representing an inefficient use of the Front-end's capability.  Further metrics are identified beneath each of these categories.

Branch mispredictions, which are mostly accounted for in the Bad Speculation category, can also lead to inefficiencies in the Front-end, as denoted by the Branch Resteers metric underneath Front-End Latency, starting with Intel microarchitecture code name Ivy Bridge.


Figure 6: The Front-End Bound Category for Intel® microarchitecture code name Ivy Bridge in VTune Amplifier XE


VTune Amplifier XE lists metrics that may identify causes of Front-End Bound code. If any of these categories shows up significantly in the results, dig deeper into the metrics to determine the causes and how to correct them. For example, the ITLB Overhead (Instruction Translation Lookaside Buffer Overhead) and ICache Miss (Instruction Cache miss) metrics may point out areas suffering from Front-End Bound execution.  For tuning suggestions see the VTune Amplifier XE tuning guides.

4.3 Tuning for the Bad Speculation Category

The third top-level category, Bad Speculation, denotes slots in which the pipeline is busy fetching and executing non-useful operations.  Bad Speculation pipeline slots are slots wasted by issuing uops that never retire, or slots in which the allocation pipeline is stalled while the machine recovers from an incorrect speculation.  Bad Speculation is caused by branch mispredictions and machine clears, and less commonly by cases like Self-Modifying Code.  Bad Speculation can be reduced through compiler techniques such as Profile-Guided Optimization (PGO), by avoiding indirect branches, and by eliminating error conditions that cause machine clears.  Correcting Bad Speculation issues may also help decrease the number of Front-End Bound stalls.  For specific tuning techniques, refer to the VTune Amplifier XE tuning guide appropriate for your microarchitecture.

Figure 7: The Bad Speculation Category for Intel® microarchitecture code name Ivy Bridge in VTune Amplifier XE


4.4 Tuning for the Retiring Category

The last category at the top level is Retiring.  It denotes slots in which the pipeline is busy with typically useful operations.  Ideally an application would have as many slots classified in this category as possible.  However, even regions of code that retire a large portion of their pipeline slots may have room for improvement.  One performance issue that falls under the Retiring category is heavy use of the micro-sequencer, which assists the Front-end by generating a long stream of uops to handle a particular condition.  In this case, although many uops retire, some of them could have been avoided.  For example, FP assists triggered by denormal values can often be reduced by flushing denormals to zero (the DAZ and FTZ modes), which is typically enabled through compiler options.  Code generation choices can also help mitigate these issues – for more details see the VTune Amplifier XE tuning guides.  In Intel microarchitecture code name Sandy Bridge, Assists are identified as a metric under the Retiring category.  In Intel microarchitecture code name Ivy Bridge and beyond, the pipeline slots in the ideal category of retirement are broken out into a sub-category called General Retirement, and uops from the Microcode Sequencer are identified separately.
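When an application does not rely on denormal values, these modes can also be enabled directly in code through intrinsics.  A minimal sketch (the helper function name is invented for illustration):

    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE      */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE  */

    /* Flush denormal results to zero and treat denormal inputs as zero,
       avoiding the FP assists (microcode) that denormals otherwise trigger.
       Only appropriate when the application can tolerate losing denormals. */
    static void disable_denormals(void)
    {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }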

 Figure 8: The Retiring Category for Intel® microarchitecture code name Ivy Bridge in VTune Amplifier XE

If not already done, algorithmic tuning techniques like parallelization and vectorization can help improve the performance of code regions that fall into the Retiring category.  For information on adding parallelism or vectorization to an application, see the resources listed in Section 6.1.

5 Example

This section will briefly give an example of the sequence a developer should follow to use the Top-Down Characterization method in VTune Amplifier XE for performance tuning.  A matrix multiply application is used to demonstrate the method, both because it is simple to understand and because it is available as sample code with VTune Amplifier XE.  Users can find this code in the VTune Amplifier XE install directory under the samples folder.  Figure 9 shows the results of a General Exploration analysis on an un-tuned version of the application.



 Figure 9: VTune Amplifier XE General Exploration Results: Un-tuned Matrix Multiply Application


Step 1 of the method given in Section 4 is to select a hotspot.  In this case the application has only one, the multiply1() function.  The result shows that 97% of the pipeline slots used by multiply1() were Back-End Bound.  This metric is highlighted in red in the VTune Amplifier XE GUI, which suggests further investigation.  VTune Amplifier XE uses pre-defined thresholds for each category and highlights a category when its percentage is above the threshold.  However, developer knowledge and insight into the profiled application, along with guidelines like the ones given in Section 4, must be used to determine whether a highlighted value warrants further investigation.


Step 2 of the method is to evaluate the efficiency of the hotspot; in this case, having 97% of slots Back-End Bound is a clear indication that the code is not efficient.  The next step is to expand this category to find more detailed information.  VTune Amplifier XE indicated that the execution was Memory Bound and not Core Bound.  Figure 10 shows the expanded Back-End Bound category and the Memory Bound sub-category with details about the stalls caused by the memory subsystem.





Figure 10: Expansion of Back-End Bound Category




This figure shows that the majority of the execution stalls were DRAM Bound, as highlighted.  The DRAM Bound metric indicates the fraction of clockticks during which the application was limited by accesses to DRAM.  To overcome this issue, it is useful to determine where these accesses occur and try to improve the memory access patterns to better exploit the caches.  Double-clicking on the function in the VTune Amplifier XE GUI opens the source view and shows the code locations contributing to this metric.


 Figure 11: Source View Showing Last Level Cache Miss Events


Each metric is composed of events, which are attributed to lines of source.  To see which events are used in a metric, hover over the metric name in the General Exploration results viewpoint.  The primary event used in the DRAM Bound metric is MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS.  This is a precise event, meaning that it will be accurately tagged to the line of code that caused it.  For more information on precise events, see the Intel® 64 and IA-32 Architectures Optimization Reference Manual, section B.3.1.  In this small function it is trivial to spot the line of source causing the last-level cache misses; in real-world applications, however, the ability to drill down to lines of source and view the events for each line is a significant advantage for performance tuning.  In this implementation, the matrix accesses are done in column-wise order, which does not take advantage of data locality.  Interchanging the loops is a well-known technique that improves the locality of the data accesses in the cache, significantly reducing the DRAM accesses.
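As an illustrative sketch of the access-pattern change (the function and parameter names here are invented, and the actual sample code differs in its details):

    /* Illustrative sketch only; not the exact sample code. */
    void multiply_naive(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        /* i-j-k order: the innermost loop walks b column-wise (stride n),
           which has poor locality and leads to DRAM Bound execution. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    void multiply_interchanged(int n, double a[n][n], double b[n][n], double c[n][n])
    {
        /* i-k-j order: the innermost loop now walks b and c row-wise
           (unit stride), so the caches and prefetchers are used effectively. */
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }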


Step 4 of the general process is to optimize the issue.  The multiply2() function in the sample code implements the loop interchange discussed above.  Figure 12 shows a General Exploration result with the new memory access pattern.  Compared to Figure 9, the Back-End Bound percentage has been reduced from 97% to 57%, while the Retiring percentage has improved from 2% to 35%.





Figure 12: Results for First Tuned Implementation of Matrix Multiply




Figure 13 shows the expanded Back-End Bound category.  The DRAM Bound fraction has been reduced dramatically from 75% in multiply1() down to 7% in multiply2().  The application is now impacted by Core Bound stalls, particularly Port Utilization.


 Figure 13: Back-End Bound Results for First Tuned Implementation of Matrix Multiply

At step 5 in the process, the developer has another judgment call to make.  The Retiring slots for this hotspot are now within the guidelines.  However, the hotspot is still around 60% Back-End Bound, which is higher than expected for a High-Performance Computing type of application.

Reviewing the new results, the methodology indicates sub-optimal Port Utilization.  Applying this information and some knowledge of the algorithm, a developer can interpret the new situation.  Now that the memory issues have been fixed, the application is processing many double-precision operations sequentially and saturating the available FP execution units.  In this case, applying vectorization can reduce the strain on the execution units by allowing each multiply or add operation to apply to multiple elements at once.  Figure 14 shows the result after the loop has been vectorized using a compiler pragma, which can be seen in the multiply2() function of the sample code.
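A sketch of what such a vectorization hint looks like on the interchanged loop nest (illustrative only; #pragma simd is an Intel compiler pragma, and the exact pragma used in the sample may differ):

    /* Ask the compiler to vectorize the unit-stride innermost loop so that
       each SIMD instruction processes several doubles at once. */
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
    #pragma simd
            for (int j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }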


Figure 14: Results for Second Tuned Implementation of Matrix Multiply


The overall clockticks have dropped from ~44 billion in Figure 12 down to ~31 billion after the loop has been vectorized.  This represents a significant improvement of ~30% in overall runtime.  The number of instructions retired also decreased, from 51 billion to 25 billion, as a result of using vectorized instructions.  This explains why the Retiring category has decreased and the DRAM Bound category is highlighted again.  This is not a negative result; it simply shows that in the new vectorized version the memory subsystem has become the bottleneck again because the core is now being used efficiently.  At this point, a developer can evaluate whether further optimization is warranted.  It may not be possible to eliminate DRAM Bound stalls in a workload that is essentially streaming from memory, as is the case with a matrix multiply.  With performance improved around 16x over the original version, this is a good place to stop the process.  Performance tuning is iterative, and it is up to the developer to determine how much effort to invest.


6 Conclusion 

The Top-Down Characterization and its availability in VTune Amplifier XE represent a new direction for performance tuning using PMUs.  Developer time invested in becoming familiar with this characterization will be worth the effort, since support for it is designed into recent PMUs and, where possible, the hierarchy is expanded further on new Intel microarchitectures.  For example, the characterization was significantly expanded between Intel microarchitecture code name Sandy Bridge and Intel microarchitecture code name Ivy Bridge.  Further expansions of both the hierarchy and the visualization capabilities are planned.

The goal of the Top-Down Characterization is to identify the dominant bottlenecks in an application's performance.  The goal of VTune Amplifier XE's General Exploration analysis and visualization features is to give developers actionable information for improving their applications.  Together, these capabilities can significantly boost not only application performance, but also the productivity of developers optimizing software.



6.1 Resources


Intel® VTune™ Amplifier XE 2013 Evaluation Center

VTune Amplifier XE 2013 Product Page

Whitepaper: De-Mystifying Software Performance Optimization

VTune Amplifier XE Tuning Guides

Intel® 64 and IA-32 Architectures Optimization Reference Manual

Intel Vectorization Toolkit

Intel Guide for Developing Multithreaded Applications


6.2 List of Metrics Used in this Guide

This list describes the events and formulas used to calculate several of the metrics described in this guide.  For these metrics, the term SLOTS refers to the total number of pipeline slots available during execution.  It is defined as 4 * CPU_CLK_UNHALTED.THREAD, where 4 is the machine's pipeline width.

Front-End Bound




Bad Speculation


Back-End Bound

1 - (Front-End Bound + Retiring + Bad Speculation)

BackEndBoundatEXEStalls (an auxiliary ratio used to break down Back-end stalls)


Memory Bound


Core Bound

(BackEndBoundatEXEStalls / CPU_CLK_UNHALTED.THREAD) – Memory Bound

Front-End Latency


Front-End Bandwidth

Front-End Bound – Front-End Latency
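For reference, the formulas published for the remaining top-level metrics on recent Intel microarchitectures are approximately as follows.  This is a sketch based on the publicly documented Top-Down method; the exact formulas and events used by VTune Amplifier XE may differ.

    Front-End Bound   = IDQ_UOPS_NOT_DELIVERED.CORE / SLOTS
    Bad Speculation   = (UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / SLOTS
    Retiring          = UOPS_RETIRED.RETIRE_SLOTS / SLOTS
    Front-End Latency = IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE / CPU_CLK_UNHALTED.THREAD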





 Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

 The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

 Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

 Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to:

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2013 Intel Corporation. All rights reserved.
