Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events

Abstract

This document assists the user in optimizing applications on the Intel® Xeon Phi™ Coprocessor. It is intended for use with the Intel® VTune™ Amplifier XE performance profiler. It gives an architectural overview and details about which events and metrics to use to analyze performance, along with tuning suggestions.

1 Introduction

The Intel® Xeon Phi™ coprocessor is the first product based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). Developers of applications for the Intel Xeon Phi coprocessor can tune the performance of their software using the Intel® VTune™ Amplifier XE performance profiler.

Software performance optimization can occur at many levels: application tuning, system tuning, operating system tuning, etc. Generally, a top-down approach is the most efficient: tune the system first, then optimize the algorithms of the application, and finally tune at the micro-architecture level. System tuning, including tuning of the operating system, is normally done to remove hardware bottlenecks. Algorithmic tuning involves adding parallelism, tuning I/O, or choosing more efficient data structures or library routines. Algorithmic tuning generally relies on knowledge of the application’s hotspots and familiarity with the source code, and aims to improve performance for the application in general.

Part 1 of this paper discusses some of the algorithmic optimizations that should be considered for applications running on the Intel Xeon Phi coprocessor. Part 2 focuses on micro-architectural optimizations and how to identify where they are needed. Micro-architectural tuning relies on knowledge of how the application is executing on the hardware – how the pipeline, caches, etc. are being utilized. Tuning at this level can be specific to the architecture and underlying hardware being used. For developers to complete micro-architectural tuning, they need access to the real-time performance information gathered by the computing hardware itself while the application runs.

This information is stored in a processing core’s Performance Monitoring Unit (PMU), which can be programmed to count occurrences of particular events. VTune Amplifier XE 2013 gives developers the ability to both collect and view sampled data from an Intel Xeon Phi coprocessor. This guide will give a framework for analyzing the event data collected from an application run on the coprocessor.

2 Intel® Xeon Phi™ Coprocessor Overview

The Intel Xeon Phi coprocessor is ideally suited for highly parallel applications that feature a high ratio of computation to data access. It is composed of up to 61 CPU cores connected on-die via a bi-directional ring bus. Each core is capable of switching between up to 4 hardware threads in a round-robin manner, resulting in a total of up to 244 hardware threads available. Each core consists of an in-order, dual-issue x86 pipeline, a local L1 and L2 cache, and a separate vector processing unit (VPU). Details of the cache hierarchy are below.

Type | Size | Info
L1 Instruction | 32 KB | 8-way, 64B line size
L1 Data | 32 KB | 8-way, 64B line size
L2 Instruction + Data | 512 KB | 8-way, 64B line size

Table 1: Cache Information

The Intel Xeon Phi coprocessor includes hardware prefetching, which is present in the form of 5 separate mechanisms, each targeting a specific data access pattern. This extensive hardware prefetching, plus compiler-generated software prefetches, remove some of the burden on software developers to use prefetch instructions. Therefore optimization of software prefetching is not discussed in this guide.

Another feature of the Intel Xeon Phi coprocessor is a second-level data Translation Lookaside Buffer (DTLB). TLBs are caches that hold virtual-to-physical memory address translations, so that the translation doesn’t have to be performed on each memory access. Each core’s L2 DTLB contains 64 entries, which can be used to cache 2M page translations or Page Directory Entries (PDEs) for 4K pages. The L2 DTLB can significantly reduce the latency incurred for an L1 DTLB miss.

Type | Entries | Page Size | Maps
L1 Instruction | 32 | 4 KB | 128 KB
L1 Data | 64 | 4 KB | 256 KB
L1 Data | 8 | 2 MB | 16 MB
L2 Data | 64 | 4 KB or 2 MB | 128 MB

Table 2: TLB Information

Intel Xeon Phi coprocessors also contain 8 dual-channel GDDR5 memory controllers. Each channel can deliver data at a rate of 5.5 GT/s. The theoretical aggregate memory bandwidth available is 352 GB/s.

3 General Exploration Method

The current process for using Intel® VTune™ Amplifier XE to collect and view data from an Intel Xeon Phi coprocessor is detailed in several documents listed in the Resources section. This guide will focus on the process for analyzing data that is already collected and displayed in the VTune Amplifier XE interface. Data may need to be collected over multiple runs, and metrics will need to be calculated outside of VTune Amplifier XE. Support within VTune Amplifier XE for the Intel® Xeon Phi™ product family will continue to improve.

Although looking at the individual counts of various events can be useful, in this document most events will be used within the context of metrics. Section 4 lists some general measures of efficiency that can help in evaluating when to start and stop optimization for a particular piece of code. Section 5 details a set of metrics that are valuable for application analysis. Along with each metric and its description is a formula for calculating the metric from available events, a threshold for determining when the value for a metric may indicate a performance problem and some tuning suggestions.

The general method to follow for performance analysis with Intel® VTune™ Amplifier XE is:

  1. Select a hotspot (a function with a large percentage of the application’s total CPU cycles).
  2. Evaluate the efficiency of that hotspot using the metrics in Section 4.
  3. If inefficient, check each applicable metric in Section 5. If a metric’s value is below the suggested threshold, or unacceptable by other standards, use the additional information in this guide to find and fix the problem.
  4. Repeat until all significant hotspots have been evaluated.

When following this method, it is important to carefully select a representative workload. Many of the metrics involve collecting several events, and this may require running the workload multiple times to collect data. An ideal workload should have some steady-state phase(s), where behavior is constant for a duration longer than the data collection interval. The workload should also give consistent, repeatable results, and be the only application consuming a significant portion of CPU time during data collection. If the workload is being run multiple times to collect data, ensure that there are no warm-cache effects or other factors that change behavior between runs. Finally, before beginning analysis, a sanity check with the basic CPU clock cycle and instructions executed events is encouraged – ensure the event counts are consistent from run to run and fall within expectations. The metrics and descriptions for events in this document are based on B-stepping coprocessors and should apply to B-step or more recent.

4 Efficiency Metrics

There are several metrics which can be used to measure general efficiency on the Intel Xeon Phi coprocessor. Developers should look at these metrics first, to get an idea of how well their application is utilizing the resources available. These metrics (except where noted) can also be used to assess the impact of various optimizations as part of an iterative tuning process.

The formulas given for each metric are meant to be calculated at the function level (using the sum of samples from all hardware threads running). The VTune Amplifier XE interface performs this summation automatically if using the “Custom Analysis” Hardware Event-based Sampling analysis type, and the “PMU events” tab with the “Function/Call stack” grouping. The summed values from this interface (per function) can be used to calculate the metrics in this guide.

4.1 CPI

Events Used

Event Meaning
CPU_CLK_UNHALTED The number of cycles executed by the core
INSTRUCTIONS_EXECUTED The number of instructions executed by the thread

Formula(s)

Metric Formula
Average CPI per Thread CPU_CLK_UNHALTED/INSTRUCTIONS_EXECUTED
Average CPI per Core (CPI per Thread) / Number of hardware threads used per core

Threshold(s)

Metric Investigate if:
Average CPI per Thread >4.0, or generally increasing
Average CPI per Core >1.0, or generally increasing

Description and Usage

Cycles per instruction, or CPI, is a metric that has been a part of the VTune Amplifier XE interface for many years. It tells the average number of CPU cycles required to retire an instruction, and therefore is an indicator of how much latency in the system affected the running application. Since CPI is a ratio, it will be affected by either changes in the number of CPU cycles that an application takes (the numerator) or changes in the number of instructions executed (the denominator). For that reason, CPI is best used for comparison when only one part of the ratio is changing. For instance, changes might be made to a data structure in one part of the code that lower CPI in a (different) hotspot. “New” and “former” CPI could be compared for that hotspot as long as the code within it hasn’t changed. The goal is to lower CPI, both in hotspots and for the application as a whole.

In order to make full use of the metric, it is important to understand how to interpret CPI when using multiple hardware threads. On the Intel® MIC Architecture, CPI can be analyzed in two ways: “per-core” or “per-thread”. Each way of analyzing CPI is useful. The per-thread analysis, which calculates CPI per hardware thread, is the most straight-forward. It is calculated from two events: CPU_CLK_UNHALTED (also known as clock ticks or cycles) and INSTRUCTIONS_EXECUTED. CPU_CLK_UNHALTED counts ticks of the CPU core’s clock. Since the clock is implemented in hardware on the core, all hardware threads on a core would see the same clock. This event is counted at the core level – for a particular sample, all the threads running on the same core will have the same value.

The other event used is INSTRUCTIONS_EXECUTED, and this event is counted at the thread level. On a sample, each thread executing on a core could have a different value for this event, depending on how many instructions from each thread have really been retired. Calculating CPI per thread is easy: it is just the result of dividing CPU_CLK_UNHALTED by INSTRUCTIONS_EXECUTED. For any given sample, this calculation will use the core’s value for clock ticks and an individual hardware thread’s value for instructions executed. This calculation is typically done at the function level, using the sum of all samples for each function, and so will calculate an average CPI per hardware thread, averaged across all hardware threads running for the function.

CPI per core is slightly more complex. Again, all hardware threads running on a core share a common value for clock ticks, and they each have individual values for instructions executed. To calculate an “Aggregate” CPI, or Average CPI per core, you divide the core’s CPU_CLK_UNHALTED value by the sum of all the threads’ INSTRUCTIONS_EXECUTED values. For example, assume an application that is using two hardware threads per core on the Intel Xeon Phi coprocessor. One hot function in the application takes 1,200 clock ticks to complete. During those 1,200 cycles, each thread executed 600 instructions. The CPI per thread for this function would be (1200 / 600) or 2.0. The CPI per core for this function would be (1200 / (600 + 600)) or 1.0. Now assume the application was run again using three hardware threads and the same workload. Now each thread retired 400 instructions, for a total of 1200, in the same amount of time. The CPI per thread for the function would be different – (1200 / 400) or 3.0 for each thread. The CPI per core would stay the same: (1200 / (400 + 400 + 400)) or 1.0. Again, the calculation of average CPI per core is typically done at a function level using samples from all cores and threads running. Therefore an easy way to derive average CPI per core is to divide the average CPI per thread value by the number of hardware threads per core used.
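As a minimal illustration of the two formulas, the sketch below (plain C; the event totals are the hypothetical values from the example above, not measured data) computes both CPI variants for the two-thread and three-thread cases:

    #include <stdio.h>

    /* Average CPI per Thread = CPU_CLK_UNHALTED / INSTRUCTIONS_EXECUTED */
    static double cpi_per_thread(double cpu_clk_unhalted, double instructions_executed)
    {
        return cpu_clk_unhalted / instructions_executed;
    }

    /* Average CPI per Core = CPI per Thread / hardware threads used per core */
    static double cpi_per_core(double cpi_thread, int hw_threads_per_core)
    {
        return cpi_thread / hw_threads_per_core;
    }

    int main(void)
    {
        /* Two hardware threads per core: 1200 cycles, 600 instructions per thread. */
        double t2 = cpi_per_thread(1200.0, 600.0);
        printf("2 threads: CPI/thread = %.2f, CPI/core = %.2f\n", t2, cpi_per_core(t2, 2));

        /* Three hardware threads per core: 1200 cycles, 400 instructions per thread. */
        double t3 = cpi_per_thread(1200.0, 400.0);
        printf("3 threads: CPI/thread = %.2f, CPI/core = %.2f\n", t3, cpi_per_core(t3, 3));
        return 0;
    }

Running this prints 2.00/1.00 for the two-thread case and 3.00/1.00 for the three-thread case, matching the discussion above.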

The hypothetical example above illustrates how CPI per thread and CPI per core can react in different ways as the result of an application change. In the case above, after adding an additional hardware thread, CPI per thread degraded and CPI per core remained the same. In reality, the third hardware thread allowed the application to complete the same amount of work in the same time. Looking at the data gives a deeper understanding of the situation. CPI per core remained the same, indicating that the core itself was executing instructions at the same rate as before. CPI per thread degraded from 2.0 to 3.0 for each thread, revealing that each hardware thread was less efficient than before. Both of these analyses are true – the core performance remained the same running 2 hardware threads more efficiently or 3 hardware threads less efficiently. But if a developer was only looking at CPI per thread, it would appear that performance got worse. In typical usage scenarios on the Intel Xeon Phi coprocessor, it would be possible to make changes that affect CPI per thread and CPI per core differently, and it is important to measure and understand them both.

The Intel Xeon Phi coprocessor supports running up to four hardware threads on each physical core. However, the front-end of the pipeline can only issue up to two instructions per cycle. (This is the opposite of the traditional Intel® Xeon® processor pipelines, which currently support two hardware threads and feature front ends that can issue four instructions per cycle.) The availability of four hardware threads on the Intel Xeon Phi coprocessor can be useful for absorbing some of the latency of a workload’s data access. Since the Intel Xeon Phi coprocessor pipeline executes instructions in order (meaning an instruction that is waiting for its operands stalls the instructions behind it in the same thread), the support for additional hardware threading may be particularly important for some types of applications. While one hardware thread is waiting on data, others can be executing.

Another important thing to know about the front-end of the Intel Xeon Phi coprocessor pipeline is that it does not issue instructions from the same hardware context (hardware thread) for two clock cycles in a row, even if that hardware context is the only one executing. So, in order to achieve the maximum issue rate, at least two hardware contexts must be running. With multiple contexts running, the front-end will switch between them in a round-robin fashion. Given these requirements and the ability to issue 2 instructions per clock, the minimum theoretical CPIs of any application running on the Intel Xeon Phi coprocessor can be calculated, and are listed below in Table 3.

Number of Hardware Threads / Core | Minimum (Best) Theoretical CPI per Core | Minimum (Best) Theoretical CPI per Thread
1 | 1.0 | 1.0
2 | 0.5 | 1.0
3 | 0.5 | 1.5
4 | 0.5 | 2.0

Table 3: Minimum Theoretical CPIs

Some applications have enough latency inherent in their data access that all four hardware threads can be utilized, with each adding performance. In this case, the addition of each thread would decrease the per-core CPI on the same workload. It can be tricky to look at CPI when increasing or decreasing the amount of work processed, because these changes also affect the number of instructions executed. A general rule is that, if the amount of work completed is increasing and each hardware thread is beneficial, then CPI per core should increase at a rate less than the increase in work processed. CPI per core is useful in analyzing the benefit of each additional hardware thread. Even when CPI per core is decreasing (good), CPI per thread might be increasing, and this is useful to know as well, because many of the code optimizations a developer may apply will be addressing CPI at the thread level.

Table 4 shows the CPI per Core and per Thread for a real workload run in the Intel lab as the number of hardware threads per core is scaled from 1 to 4. For this application, performance increased with the addition of each thread, although the addition of the 4th thread did not add as much performance as did the 2nd or 3rd. The data shows that the CPI per thread is increasing as threads are added – meaning each thread is becoming less efficient – but the CPI per core is decreasing overall, as expected since each thread adds performance. For this workload, the number of instructions executed was roughly constant across all the hardware thread configurations, so the CPI directly affected execution time. When CPI per core decreased, that translated to a reduction in total execution time for the application.

Metric | 1 hardware thread / core | 2 hardware threads / core | 3 hardware threads / core | 4 hardware threads / core
CPI per Thread | 5.24 | 8.80 | 11.18 | 13.74
CPI per Core | 5.24 | 4.40 | 3.73 | 3.43

Table 4: CPI Example

It is important to note that the thresholds for the CPI per Core and CPI per Thread metrics are very conservative. Many applications may have higher CPI values and still be running optimally. In general, applications that operate within the cores (i.e., doing computations on cacheable working sets) should be able to obtain CPIs at or below the given thresholds. Applications that must operate at least partly out of other cores’ caches or out of memory may have higher CPIs than the thresholds given.

Tuning Suggestions

Any changes to an application will affect CPI, since it is likely that either the number of instructions executed or the time taken to complete them will change. The goal in general should be to reduce CPI per core (and therefore execution time), especially when compared to previous versions of the application. Most of the performance suggestions for the issues in section 5 can be used to try to reduce CPI. Keep in mind that some beneficial optimizations, such as ones undertaken to increase Vectorization Intensity (section 5.3), may actually increase CPI because the amount of work done with a single instruction increases, and thus the number of instructions executed overall can decrease. CPI is most useful as a general comparison and efficiency metric rather than as a sole determinant of performance.

4.2 Compute to Data Access Ratio

Events Used

Event Meaning
VPU_ELEMENTS_ACTIVE The number of VPU operations executed by the thread
DATA_READ_OR_WRITE The number of loads and stores seen by a thread’s L1 data cache
DATA_READ_MISS_OR_WRITE_MISS The number of demand loads or stores that miss a thread’s L1 cache
L1_DATA_HIT_INFLIGHT_PF1 The number of demand loads or stores that are for a cacheline already being prefetched from L2 into L1

Formula(s)

Metric Formula
L1 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / DATA_READ_OR_WRITE
L2 Compute to Data Access Ratio VPU_ELEMENTS_ACTIVE / (DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1)

Threshold(s)

Metric Investigate if:
L1 Compute to Data Access Ratio < Vectorization Intensity (See section 5.3)
L2 Compute to Data Access Ratio < 100x L1 Compute to Data Access Ratio

Description and Usage

These metrics are a way to measure the computational density of an application, or how many computations it is performing on average for each piece of data loaded. The first, L1 Compute to Data Access Ratio, should be used to judge suitability of an application for running on the Intel® MIC Architecture. Applications that will perform well on the Intel® MIC Architecture should be vectorized, and ideally be able to perform multiple operations on the same pieces of data (or same cachelines). The L1 ratio calculates an average of the number of vectorized operations that occur for each L1 cache access. All vectorized operations, including data operations, are included in the numerator by definition of the VPU_ELEMENTS_ACTIVE event. VPU_ELEMENTS_ACTIVE was used instead of VPU_INSTRUCTIONS_EXECUTED because it gives a more accurate picture of how many operations occurred – for example, an instruction applied to a register packed with 16 floats will count as 16 operations. All demand loads and stores are included in the denominator, and no prefetches.

The threshold for the L1 metric is a guideline. Most codes that run well on the Intel® MIC Architecture should be able to achieve a ratio of computation to L1 access that is greater than or equal to their Vectorization Intensity (see section 5.3). This is similar to a 1:1 ratio – one data access for one computation – except that by vectorizing each computation should be operating on multiple elements at once. An application that cannot achieve a ratio above this threshold may not be computationally dense enough to fully utilize the Intel® MIC Architecture.

Computational density at the L1 level is critical. At the L2 level it is an indicator of whether code is operating efficiently. Again, the threshold given is a guideline. For best performance, data should be accessed from L1. This doesn’t mean that data can’t be streamed from memory – the high bandwidth on Intel Xeon Phi coprocessors is advantageous for this. But, ideally, data should be streamed from memory into the caches using prefetches, and then should be available in L1 when the demand load occurs. This is even more important for the Intel Xeon Phi coprocessor than for traditional processors. Long data latency mitigates the performance benefits of vectorization, which is one of the cornerstones of Intel® MIC Architecture performance. The L2 Compute to Data Access Ratio shows the average number of vectorized operations that occur for each L2 access. Applications that are able to block data for the L1 cache, or reduce data access in general, will have higher numbers for this ratio. As a baseline, the threshold of 100x the L1 ratio has been used, meaning there should be roughly 1 L2 data access for every 100 L1 data accesses. Like the L1 metric, it includes all vectorized operations (including data movement) in the numerator.

The denominator for the L1 metric includes all demand loads and stores – all L1 data cache accesses. The denominator for the L2 metric is slightly more complicated – it uses all the demand data accesses that missed L1, as only these will be requested from L2. It will be strongly related to the L1 Hit Rate discussed in Section 5.1.

Tuning Suggestions

For the L1 computational density metric, if the value is less than the Vectorization Intensity, general tunings to reduce data access should be applied. This is best accomplished by aiming to reduce the number of instructions on the critical path in general. Remove conditionals, initialization, or anything not needed in inner loops. Streamline data structures. Align data and ensure the compiler is assuming alignment in generating loads and stores. Ensure the compiler is generating good vectorized code – for example, not register spilling. Eliminate task or thread management overhead as much as possible.
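To illustrate the alignment suggestion, the sketch below allocates 64-byte aligned buffers (matching the cache line size and the 512-bit vector registers) and uses the Intel compiler's __assume_aligned hint so that aligned vector loads and stores can be generated. The function, array names, and sizes are hypothetical; _mm_malloc could be used in place of posix_memalign.

    #include <stdlib.h>   /* posix_memalign, free */

    #define N 100000

    void scale(float *restrict a, const float *restrict b, float s)
    {
        /* Tell the Intel compiler the pointers are 64-byte aligned so it can
           emit aligned vector loads/stores instead of peel/remainder code. */
        __assume_aligned(a, 64);
        __assume_aligned(b, 64);
        for (int i = 0; i < N; i++)
            a[i] = s * b[i];
    }

    int example(void)
    {
        float *a, *b;
        /* 64-byte alignment matches the cache line size and the 512-bit vectors. */
        if (posix_memalign((void **)&a, 64, N * sizeof(float)) != 0) return 1;
        if (posix_memalign((void **)&b, 64, N * sizeof(float)) != 0) { free(a); return 1; }
        for (int i = 0; i < N; i++) b[i] = (float)i;
        scale(a, b, 2.0f);
        free(a);
        free(b);
        return 0;
    }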

For the L2 computational density metric, try to improve data locality for the L1 cache using the techniques described in section 5.1. Re-structuring code using techniques or pragmas from Intel® Cilk™ Plus can also enable the compiler to generate more efficient vectorized code, and can help improve both the L1 and L2 metrics.
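For example, the same kind of loop could be expressed with Intel Cilk Plus array notation, which states the data-parallel operation directly and leaves the compiler free to generate packed vector code; the function and array names here are hypothetical.

    /* Intel Cilk Plus array notation: c = a + s*b over the full extent.
       Equivalent to a unit-stride loop, but the data-parallel intent is explicit. */
    void triad_an(float *restrict c, const float *restrict a,
                  const float *restrict b, float s, int n)
    {
        c[0:n] = a[0:n] + s * b[0:n];
    }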

5 Potential performance issues

This section highlights several possible performance issues that can be detected using events. For each issue, the events needed are listed along with their descriptions. Each issue is identified using metrics and thresholds. Like the metrics given in Section 4, the formulas given for the metrics below are meant to be calculated at the function level (using the sum of samples from all hardware threads running). The VTune Amplifier XE interface performs this summation automatically if using the “Custom Analysis” Hardware Event-based Sampling analysis type, and the “PMU events” tab with the “Function/Call stack” grouping. The summed values from this interface (per function) can be used to calculate the metrics in this guide.

The value computed for each metric should then be compared to the threshold value. The thresholds given in this document are generally chosen conservatively. This means that an application is more likely to trigger the threshold criteria without having a problem than to have one of the given issues without triggering the threshold. The thresholds only indicate that a developer may want to investigate further. All of the metrics in section 5 are also designed to be used after the execution environment is fixed. Changes to the number of hardware threads or cores used may affect the predictability of the metrics.

5.1 General cache usage

Events Used

Event Meaning
CPU_CLK_UNHALTED The number of cycles in which the core was executing
DATA_READ_MISS_OR_WRITE_MISS The number of demand loads or stores that missed the L1 data cache
L1_DATA_HIT_INFLIGHT_PF1 The number of demand loads or stores that are for a cacheline already being prefetched from L2 into L1.
DATA_READ_OR_WRITE The number of loads and stores seen by a thread’s L1 data cache.
EXEC_STAGE_CYCLES The number of cycles when the thread was executing computational operations
L2_DATA_READ_MISS_CACHE_FILL / L2_DATA_WRITE_MISS_CACHE_FILL Counts L2 read or read for ownership misses that were serviced by another core’s L2 cache (on the same card). Includes L2 prefetches that missed the local L2 cache, and so is not useful for determining demand cache fills.
L2_DATA_READ_MISS_MEM_FILL / L2_DATA_WRITE_MISS_MEM_FILL Counts L2 read or read for ownership misses that were serviced by memory (on the same card). Includes L2 prefetches that missed the local L2 cache, and so is not useful for determining demand memory fills.

Formula(s)

Metric Formula
L1 Misses DATA_READ_MISS_OR_WRITE_MISS + L1_DATA_HIT_INFLIGHT_PF1
L1 Hit Rate (DATA_READ_OR_WRITE – L1 Misses) / DATA_READ_OR_WRITE
Estimated Latency Impact (CPU_CLK_UNHALTED – EXEC_STAGE_CYCLES – DATA_READ_OR_WRITE) / DATA_READ_MISS_OR_WRITE_MISS

Threshold(s)

Metric Investigate if:
L1 Hit Rate < 95%
Estimated Latency Impact > 145

Description and Usage

For applications running on the Intel Xeon Phi coprocessor, good data locality is critical for achieving their performance potential. In order to realize the benefit from vectorizing applications, the data must be accessible to be packed into VPU registers at as low a latency as possible – otherwise, the time to pack the registers dominates the time to do the computation. Although being able to switch execution among four hardware threads does hide some data access latency, it can still have a significant impact on performance. Therefore, improving data locality is one of the most worthwhile optimization efforts for the Intel Xeon Phi coprocessor. Both L1 and L2 locality are important. Program changes that result in data being accessed from local L2 cache as opposed to a remote cache or memory save at least 250 cycles of access time. Under load, the savings are even greater. Accessing data from L1 as opposed to L2 saves about 20 cycles.

Traditionally, Hit Rate metrics indicate how well each level of cache is being used. It is normally calculated by dividing the number of hits by the total number of accesses for that level of cache. Hit rates also typically only apply to “demand” accesses – meaning, true loads from the application as opposed to software or hardware prefetches. It is possible to determine the demand hit rate for the Data (or L1) cache, but the formula requires some explanation. Data cache accesses can be either a standard hit, a miss, or a hit to an in-flight prefetch, which is counted separately. Hits to an in-flight prefetch occur when the data was not found in the cache, and was a match for a cacheline already being retrieved for the same cache level by a prefetch. These types of hits have a longer latency than a standard hit, but less than a miss. To be conservative with the hit rate, in this document they are treated like misses and thus subtracted in the numerator.

The L2 and FILL events on the Intel Xeon Phi Coprocessor are counting both demand loads and stores as well as multiple types of prefetches. Not all of the prefetches are accurately counted by other events – so, the formulas can’t be adjusted to calculate real demand L2 hits or misses. This document does not recommend any metrics that depend on the L2 or FILL events, except for memory bandwidth (where including prefetches is OK). The Estimated Latency Impact metric is given in an attempt to work around the lack of L2 metrics. This metric is a rough approximation of the amount of clock cycles devoted to each L1 cache miss. The numerator is computed by using the total CPU cycles and subtracting one for each L1 cache hit (because each L1 access should take 1 cycle), and one for each cycle that the EXEC_STAGE_CYCLES event is active. EXEC_STAGE_CYCLES should be active for many computations and is used to partially filter out computation cycles. What’s left are considered to be cycles devoted to data access beyond the L1 cache. The denominator is L1 cache misses – giving an estimate of the number of CPU cycles spent on each L1 cache miss. It should be stressed that this is only an approximation, and is not fully accurate for many reasons, including pipeline effects, un-accounted for cycles, and overlapping memory accesses.

The Estimated Latency Impact metric can give an indication of whether the majority of L1 data misses are hitting in L2. Given that the L2 data access latency is 21 cycles, values of Estimated Latency Impact that approach that number indicate a high proportion of L2 hits. The threshold is set at 145 because it is the average of the unloaded L2 and memory access times. The other important thing to note about the Estimated Latency Impact is that, like all ratios, it is affected by changes in either the numerator or the denominator. In most cases, an optimization that positively affects data access should result in a decrease in this metric’s value. However, some changes that are positive, such as a decrease in L1 misses, may result in a value for this metric that is unchanged – as it would reduce both the numerator and the denominator. This type of change would affect the L1 Hit Rate metric instead.
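Concretely, the L1 Hit Rate and Estimated Latency Impact formulas can be evaluated from the summed per-function event counts as in the sketch below; the counts are hypothetical placeholders.

    #include <stdio.h>

    struct cache_events {
        double cpu_clk_unhalted;
        double data_read_or_write;            /* all L1 demand accesses        */
        double data_read_miss_or_write_miss;  /* demand L1 misses              */
        double l1_data_hit_inflight_pf1;      /* demand hits to in-flight PFs  */
        double exec_stage_cycles;
    };

    int main(void)
    {
        struct cache_events e = { 1.0e9, 2.0e8, 4.0e6, 1.0e6, 3.0e8 };

        /* Hits to in-flight prefetches are treated as misses (conservative). */
        double l1_misses   = e.data_read_miss_or_write_miss + e.l1_data_hit_inflight_pf1;
        double l1_hit_rate = (e.data_read_or_write - l1_misses) / e.data_read_or_write;

        /* Subtract compute cycles and one cycle per L1 access, divide by L1 misses. */
        double est_latency_impact =
            (e.cpu_clk_unhalted - e.exec_stage_cycles - e.data_read_or_write)
            / e.data_read_miss_or_write_miss;

        printf("L1 hit rate: %.3f (investigate if < 0.95)\n", l1_hit_rate);
        printf("Estimated latency impact: %.1f cycles (investigate if > 145)\n",
               est_latency_impact);
        return 0;
    }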

Although not used in any of the metrics, the L2_DATA_READ_MISS_CACHE_FILL, L2_DATA_WRITE_MISS_CACHE_FILL, L2_DATA_READ_MISS_MEM_FILL, and L2_DATA_WRITE_MISS_MEM_FILL events can also be helpful for tuning data locality. As mentioned in the descriptions for these events, they cannot be used to compute any L2-related metrics because they include some prefetching. The quantities for these events should not be considered accurate – but the general ratio of CACHE_FILLs to MEM_FILLs may indicate that too much data is being accessed from other cores’ caches. Since remote cache accesses have as high a latency as memory accesses, they should be avoided if possible.

Tuning Suggestions

Many traditional techniques for increasing data locality apply to the Intel Xeon Phi coprocessor: cache blocking, software prefetching, data alignment, and using streaming stores can all help keep more data in cache. For issues with data residing in neighboring caches, using cache-aware data decomposition or private variables can help. Set associativity issues (conflict misses) are another type of data locality problem, and one that can be difficult to detect. If hit rates remain low in spite of applying some of the above techniques, conflict misses caused by too many cachelines mapping to the same set may be the culprit. On the Intel Xeon Phi coprocessor, conflict misses can occur when an application accesses data in L1 with a 4KB stride or data in L2 with a 64KB stride. This specific type of miss cannot be separated from general misses by the available events. If set associativity issues are suspected, try padding data structures (while maintaining alignment) or changing the access stride.
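As an illustration of cache blocking, the sketch below tiles a hypothetical matrix-multiply loop nest so that the block touched in the inner loops can stay resident in a core's cache; the block size is illustrative and should be tuned to the 32 KB L1 and 512 KB L2.

    #define BS 32  /* illustrative block size; tune so blocks fit in L1/L2 */

    /* Blocked (tiled) matrix multiply: C += A * B, all n x n, row-major. */
    void matmul_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* Work on one BS x BS block at a time so the data stays
                       resident in cache across the inner iterations. */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }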

5.2 TLB misses

Events Used

Event Meaning
DATA_PAGE_WALK The number of L1 TLB misses
LONG_DATA_PAGE_WALK The number of L2 TLB misses
DATA_READ_OR_WRITE The number of read or write operations

Formula(s)

Metric Formula
L1 TLB miss ratio DATA_PAGE_WALK / DATA_READ_OR_WRITE
L2 TLB miss ratio LONG_DATA_PAGE_WALK / DATA_READ_OR_WRITE
L1 TLB misses per L2 TLB miss DATA_PAGE_WALK / LONG_DATA_PAGE_WALK

Threshold(s)

Metric Investigate if:
L1 TLB Miss Ratio > 1%
L2 TLB Miss Ratio > .1%
L1 TLB misses per L2 TLB miss Near 1

Description and Usage

The Intel Xeon Phi coprocessor has a two-level TLB and two page sizes (4 KB and 2 MB). By default, under current versions of the OS, programs use 4 KB pages. In this case the L2 TLB acts as a page table cache and reduces the L1 TLB miss penalty (for an L2 TLB hit) to around 25 clock cycles. For large (2 MB) pages, the L2 TLB acts as a standard TLB, and the L1 miss penalty (for an L2 TLB hit) is only around 8 cycles.

The L2 TLB miss penalty is at least 100 clocks; furthermore, it is impossible to hide this latency with prefetches, so it is important to try to avoid L2 TLB misses. L1 TLB misses that hit in the L2 TLB are of less concern.

Since there are 64 cache lines in a 4 KB page, the L1 TLB miss ratio for sequential access to all the cachelines in a page is 1/64. Thus any significant L1 TLB miss ratio indicates lack of spatial locality; the program is not using all the data in the page. It may also indicate thrashing; if multiple pages are accessed in the same loop, the TLB associativity or capacity may not be sufficient to hold all the TLB entries. Similar comments apply to large pages and to the L2 TLB.

If the L1 to L2 TLB miss ratio is high, then there are many more L1 TLB misses than there are L2 TLB misses. This means that the L2 TLB has the capacity to hold the program’s working set, and the program may benefit from large pages.

Tuning Suggestions

For loops with multiple streams, it may be beneficial to split them into multiple loops to reduce TLB pressure (this may also help cache locality). When the addresses accessed in a loop differ by multiples of large powers of two, the effective size of the TLBs will be reduced because of associativity conflicts. Consider padding between arrays by one 4 KB page.
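For example, when several large arrays whose sizes are multiples of a large power of two are accessed in the same loop, inserting one 4 KB page of padding between them changes their relative page and set mapping. A minimal sketch (array names and sizes are hypothetical; posix_memalign keeps the 64-byte alignment):

    #include <stdlib.h>   /* posix_memalign, free */

    #define N   (1 << 20)                    /* power-of-two element count */
    #define PAD (4096 / sizeof(float))       /* one 4 KB page, in elements */

    /* One backing allocation with a page of padding between the logical arrays,
       so a[i], b[i], and c[i] no longer collide in the same TLB/cache sets. */
    static float *alloc_padded_triple(float **a, float **b, float **c)
    {
        float *base;
        if (posix_memalign((void **)&base, 64, (3 * N + 2 * PAD) * sizeof(float)) != 0)
            return NULL;
        *a = base;
        *b = base + N + PAD;
        *c = base + 2 * (N + PAD);
        return base;                         /* caller releases with free(base) */
    }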

If the L1 to L2 ratio is high then consider using large pages.
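One way to experiment with large pages (an illustrative sketch, assuming a Linux-based environment where huge pages are configured) is to back large buffers with 2 MB pages through mmap; a huge-page-aware allocator or transparent huge pages are alternatives.

    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    /* Request an anonymous mapping backed by 2 MB pages (Linux MAP_HUGETLB).
       Returns NULL on failure; real code would fall back to ordinary pages. */
    static void *alloc_huge(size_t bytes)
    {
        size_t huge = (size_t)2 << 20;                  /* 2 MB page size           */
        size_t len  = (bytes + huge - 1) & ~(huge - 1); /* round up to a 2 MB page  */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;            /* release with munmap(p, len) */
    }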

In general, any program transformation that improves spatial locality will benefit both cache utilization and TLB utilization. The TLB is just another kind of cache.

5.3 VPU usage

Events Used

Event Meaning
VPU_INSTRUCTIONS_EXECUTED The number of VPU instructions executed by the thread
VPU_ELEMENTS_ACTIVE The number of vector elements active for a VPU instruction, or, the number of vector operations (since each instruction performs multiple vector operations). 

Formula(s)

Metric Formula
Vectorization Intensity VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED

Threshold(s)

Metric Investigate if:
Vectorization Intensity < 8 (DP), < 16 (SP)

Description and Usage

We would like to be able to measure efficiency in terms of floating-point operations per second, as that can easily be compared to the peak floating-point performance of the machine. However, the Intel Xeon Phi coprocessor does not have events to count floating-point operations. An alternative is to measure the number of vector instructions executed.

Vector instructions include instructions that perform floating-point operations, instructions that load vector registers from memory and store them to memory, instructions to manipulate vector mask registers, and other special purpose instructions such as vector shuffle.

Vector operations that operate on full vectors use the hardware’s “all-ones” mask register %k0. Thus when a vector operation on two full vectors is performed, the VPU_ELEMENTS_ACTIVE event is incremented by 16 (for single precision) or 8 (for double precision). Scalar FP operations are generally implemented by the compiler using the vector registers, but with a mask indicating that they apply to only 1 vector element.

So a reasonable rule of thumb to see how well a loop is vectorized is to add up the values of VPU_ELEMENTS_ACTIVE and VPU_INSTRUCTIONS_EXECUTED for every assembly instruction in the loop and take the ratio. If this number approaches 8 or 16 then there’s a good chance that the loop is well vectorized. Vectorization intensity cannot exceed 8 for double-precision code or 16 for single-precision code. If the number is much smaller, then the loop was not well vectorized.

This method should be used in conjunction with the compiler’s vectorization report.

Care should be taken when attempting to apply this method to larger pieces of code. Various vagaries in code generation and the fact that mask manipulation instructions count as vector instructions can skew the ratio and lead to incorrect conclusions.

Tuning Suggestions

Low vectorization intensity may indicate that the compiler failed to vectorize a particular loop, or that the vectorization was inefficient. Examination of the vectorization report may provide insight into the problems. Problems are typically one or more of:

  1. Unknown data dependences. #pragma simd and #pragma ivdep can be used to tell the compiler to ignore unknown dependences or to tell it that dependences are of a certain type, such as a reduction (a short sketch follows this list).
  2. Non unit-stride accesses. These can be due to indexing in multi-dimensional arrays, or due to accessing fields in arrays of structures. Loop interchange and data structure transformations can eliminate some of these.
  3. True indirection (indexing an array with a subscript that is also an array element). These are typically algorithmic in nature and may require major data structure reorganization to eliminate.
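The fragment below sketches case 1: loops whose independence the compiler cannot prove (the pointers might alias), annotated so that they vectorize anyway. #pragma simd with a reduction clause and #pragma ivdep are Intel compiler features; the loops themselves are hypothetical, and the pragmas should only be used when independence really is guaranteed.

    /* Case 1: assert independence so the compiler can vectorize. */
    float dot_ignore_deps(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
    #pragma simd reduction(+:sum)   /* vectorize; sum is a reduction */
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    void scale_ivdep(float *dst, const float *src, float s, int n)
    {
    #pragma ivdep                   /* ignore assumed (unproven) dependences */
        for (int i = 0; i < n; i++)
            dst[i] = s * src[i];
    }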

5.4 Memory bandwidth

Events Used

Event Meaning
L2_DATA_READ_MISS_MEM_FILL The number of read operations that resulted in a memory read (includes prefetches).
L2_DATA_WRITE_MISS_MEM_FILL The number of write operations that resulted in a memory read. Writes are implemented using a memory Read for Ownership (RFO) transaction to maintain coherency. Includes prefetches.
L2_VICTIM_REQ_WITH_DATA The number of evictions that resulted in a memory write operation
HWP_L2MISS The number of hardware prefetches that missed L2
SNP_HITM_L2 The number of incoming snoops that hit modified data in L2 (thus resulting in an L2 eviction)
CPU_CLK_UNHALTED The number of cycles

Formula(s)

Metric Formula
Read bandwidth (bytes/clock) (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED
Write bandwidth (bytes/clock) (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
Bandwidth (GB/sec) (Read bandwidth + Write bandwidth) * frequency (in GHz)

Threshold(s)

Metric Investigate if:
Bandwidth < 80 GB/sec

Description and Usage

This formula computes bandwidth by summing up the data transfers from all the different types of events that cause memory reads or writes. It does not take into account streaming stores. For an application using streaming stores, bandwidth will be underestimated.

When the core executes an instruction that reads memory, it must fill both the L2 and the L1 cache with the data. If the data is in neither cache, the core will read the data from either another core’s cache or from memory. The latter case results in an L2_DATA_READ_MISS_MEM_FILL event. When the core executes an instruction that writes memory, it must first execute a Read for Ownership (RFO) to bring the data into the cache hierarchy. If that data is fulfilled from memory the write operation results in an L2_DATA_WRITE_MISS_MEM_FILL event. As noted in section 5.1, the FILL events include some types of prefetches. Although this makes them inappropriate for use in calculating Hit Rates, which assume demand data only, they can still be used in bandwidth calculations, as a prefetch does use real bandwidth.

When an L2 entry is required to hold a datum and there are no available lines, the core must evict a line; if that line has been modified then it must be written to memory. This results in an L2_VICTIM_REQ_WITH_DATA event. If data has been modified in one core’s cache and another core needs that data, the first core receives a snoop Hit Modified (HITM) event which causes it to evict that data. This results in an SNP_HITM_L2 event. Normally the snoop would result in a cache-to-cache transfer to the second core (see section 5.1) but if the core is using the clevict instructions then they appear as incoming snoops even though they were generated by the same core. It is usually safe to ignore this event but there are some cases in which the compiler or runtime will use clevict instructions – usually in conjunction with streaming stores. If there is a lot of modified data shared between two cores, including this event can result in overestimation of memory bandwidth (by including cache-to-cache transfers).

This method of calculating bandwidth uses core events. An alternate method exists which collects samples from uncore events found on the memory controllers. The VTune Amplifier XE “Bandwidth” profile uses the uncore sampling method. Both methods should result in approximately the same values for memory bandwidth in most cases.
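Putting the formulas together, the sketch below evaluates read, write, and total bandwidth from summed event counts; the event totals and the 1.1 GHz core frequency are hypothetical placeholders.

    #include <stdio.h>

    /* Estimate memory bandwidth from the core events in this section. */
    int main(void)
    {
        double l2_read_mem_fill  = 6.0e8;   /* L2_DATA_READ_MISS_MEM_FILL   */
        double l2_write_mem_fill = 1.0e8;   /* L2_DATA_WRITE_MISS_MEM_FILL  */
        double hwp_l2miss        = 4.0e8;   /* HWP_L2MISS                   */
        double l2_victim_data    = 3.0e8;   /* L2_VICTIM_REQ_WITH_DATA      */
        double snp_hitm_l2       = 2.0e7;   /* SNP_HITM_L2                  */
        double cpu_clk_unhalted  = 1.0e9;   /* CPU_CLK_UNHALTED             */
        double freq_ghz          = 1.1;     /* core frequency, hypothetical */

        /* 64 bytes moved per cache-line transfer. */
        double read_bpc  = (l2_read_mem_fill + l2_write_mem_fill + hwp_l2miss)
                           * 64.0 / cpu_clk_unhalted;
        double write_bpc = (l2_victim_data + snp_hitm_l2) * 64.0 / cpu_clk_unhalted;

        printf("Bandwidth: %.1f GB/sec\n", (read_bpc + write_bpc) * freq_ghz);
        return 0;
    }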

Tuning Suggestions

The user must know how much memory bandwidth their application should be using. If data sets fit entirely in a core’s L2 cache, then the memory bandwidth numbers will be small. If the application is expected to use a lot of memory bandwidth (for example by streaming through long vectors) then this method provides a way to estimate how much of the theoretical bandwidth is achieved.

In practice achieved bandwidth of >140GB/sec is near the maximum that an application is likely to see. If the achieved bandwidth is substantially less than this it is probably due to poor spatial locality in the caches, possibly because of set associativity conflicts, or because of insufficient prefetching. In the extreme case (random access to memory), many TLB misses will be observed as well.
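If the application is expected to be bandwidth-bound, a long-vector triad such as the sketch below (assuming OpenMP for threading; the array names are hypothetical) provides a steady, known bandwidth demand against which this metric can be sanity-checked; n should be much larger than the aggregate L2 so the data must stream from memory.

    /* Long-vector triad used to exercise memory bandwidth. */
    void triad(double *restrict a, const double *restrict b,
               const double *restrict c, double s, long n)
    {
    #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }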

6 Conclusion

As mentioned in Part 1, to effectively utilize the performance potential of the Intel Xeon Phi coprocessor, applications need to be well parallelized, well vectorized, and to exploit data locality. The metrics in this guide help to identify the microarchitectural effects of problems in the above three areas. Many resources exist to help developers optimize software for the Intel Xeon Phi coprocessor. Section 6.1 lists some useful links, and section 6.2 gives a table of all the events needed to calculate the metrics in this guide.

6.1 Resources

Optimization and Performance Tuning for Intel® Xeon Phi™ coprocessors, Part 1: Optimization Essentials http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization
Intel® VTune™ Amplifier XE 2013 Evaluation Center http://software.intel.com/en-us/intel-vtune-amplifier-xe-2013-evaluation-options/
Intel® VTune™ Amplifier XE 2013 Product Page http://software.intel.com/en-us/intel-vtune-amplifier-xe/
Intel® Xeon Phi™ coprocessor developer portal http://software.intel.com/mic-developer
Compiler Methodology (including performance optimization) for Intel Many Integrated Core architecture http://software.intel.com/en-us/articles/programming-and-compiling-for-intel-many-integrated-core-architecture

6.2 List of Events Used in this Guide

These events can be collected with VTune Amplifier XE 2013 by creating a “Custom Analysis” and then selecting “New Knights Corner Hardware Event-based Sampling Analysis”. The events can be added individually.

CPU_CLK_UNHALTED The number of cycles executed by the core
INSTRUCTIONS_EXECUTED The number of instructions executed by the thread
VPU_ELEMENTS_ACTIVE The number of VPU operations executed by the thread
DATA_READ_OR_WRITE The number of loads and stores seen by a thread’s L1 data cache
DATA_READ_MISS_OR_WRITE_MISS The number of demand loads or stores that miss a thread’s L1 cache
L1_DATA_HIT_INFLIGHT_PF1 The number of demand loads or stores that are for a cacheline already being prefetched from L2 into L1
EXEC_STAGE_CYCLES The number of cycles when the thread was executing computational operations
L2_DATA_READ/WRITE_MISS_CACHE_FILL Counts L2 read or read for ownership misses that were serviced by another core’s L2 cache (on the same card). Includes L2 prefetches that missed the local L2 cache and so is not useful for determining demand cache fills.
L2_DATA_READ/WRITE_MISS_MEM_FILL Counts L2 read or read for ownership misses that were serviced by memory (on the same card). Includes L2 prefetches that missed the local L2 cache, and so is not useful for determining demand memory fills.
DATA_PAGE_WALK The number of L1 TLB misses
LONG_DATA_PAGE_WALK The number of L2 TLB misses
VPU_INSTRUCTIONS_EXECUTED The number of VPU instructions executed by the thread
L2_VICTIM_REQ_WITH_DATA The number of evictions that resulted in a memory write operation
HWP_L2MISS The number of hardware prefetches that missed L2
SNP_HITM_L2 The number of incoming snoops that hit modified data in L2 (thus resulting in an L2 eviction)

Table 5: Intel Xeon Phi coprocessor events used in this guide

Notices

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Intel, the Intel logo, VTune, Cilk and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others

Copyright© 2012 Intel Corporation. All rights reserved.

Optimization Notice

http://software.intel.com/en-us/articles/optimization-notice/

Performance Notice

For more complete information about performance and benchmark results, visit www.intel.com/benchmarks

For more information about compiler optimizations, see the Optimization Notice.