We are all familiar with the flow of cars along roads. We often see a bottleneck at one point on a route that has caused everything before it to slow down, while everything after it moves at full speed on an underutilized road.
The bottleneck itself is the last place where the cars are moving slower than they could be. If there are queues, it is the last place that has a full queue. Areas before the bottleneck may also have slow traffic and full queues, but that is because the traffic is blocked from entering the next section.
Similarly, data and instructions move from caches or dual in-line memory modules (DIMMs), along buses, into the registers that feed the arithmetic logic units (ALUs), and out again. Like cars trying to leave a parking lot after a big game, they jostle each other, compete for the same space, and slow each other down.
Your application is computing if the operating system or other tools show that your threads are running rather than waiting on I/O or other resources. Even then, some bottleneck must be limiting how fast the application can run. Such an application is called compute bound. If that bottleneck can be relieved and the next bottleneck is sufficiently less restrictive, your application runs faster.
The major bottlenecks for a compute-bound application occur when:
- Moving instructions into the cores
- Translating input or output data virtual addresses into physical addresses
- Moving data into the cores
- Performing arithmetic operations in the cores
- Moving data out of the cores
The operations that ultimately become the bottlenecks depend on the hardware, the application, and the workload. For example, an application where the cores cannot do the operations fast enough on a quad-core processor (bottlenecked in the ALU part of the core) may be limited by data movement on a many-core processor (bottlenecked moving data into the cores) until the data becomes too sparse (bottlenecked translating addresses).
Assembling information from various sources to discover exactly what is in your hardware can be tedious. CPU-Z freeware can make this process easier.
Most applications have no trouble moving instructions into the cores, but the rare piece of code that is not dominated by tight loops or small recursive algorithms may.
The front end of a core is the instruction fetching and decoding section. Front-end bound execution results in high miss rates in the portion of the L1 cache dedicated to instructions. It is hard to work around this problem – often it requires reworking your code.
The General Exploration Analysis Type in the Intel® VTune™ Amplifier produces a result with Front-End Bound metric information.
Note: For long-running, multi-threaded applications, you must attach and detach the VTune Amplifier from your application at the right times to avoid massive collections. Consider adding code to your application to facilitate this.
A more common situation is code containing conditional branches that do not go the same way for many consecutive executions. Such erratic branches cause the core to mispredict whether the branch will be taken, so the core either skips speculative execution of the instructions following the branch or starts it and then has to undo it. The General Exploration Analysis Type reports on this case.
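As a rough illustration of why erratic branches are costly (a toy model, not VTune Amplifier output), the sketch below runs a 2-bit saturating counter over the outcomes of one branch. Real branch predictors are far more sophisticated; the predictor, the data, and the name `mispredict_rate` are assumptions made only for this example.

```python
import random


def mispredict_rate(outcomes):
    """Return the fraction of branches a 2-bit saturating counter mispredicts.

    Toy model only: states 0-1 predict "not taken", states 2-3 predict "taken".
    """
    state = 1  # start at "weakly not taken"
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses / len(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.randrange(256) for _ in range(10_000)]

# The branch under study is "value >= 128", evaluated once per element.
erratic = [v >= 128 for v in values]          # outcome flips unpredictably
regular = [v >= 128 for v in sorted(values)]  # long runs of the same outcome

erratic_rate = mispredict_rate(erratic)
regular_rate = mispredict_rate(regular)
print(f"erratic order: {erratic_rate:.1%} mispredicted")
print(f"sorted order:  {regular_rate:.1%} mispredicted")
```

In this model the same branch over the same values is mispredicted only at the single switch from "not taken" to "taken" when the data is sorted, but roughly every other time when the data arrives in random order.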
A translation bottleneck is a rare problem caused by spreading data over many pages and doing very little processing on each data item after it has been accessed. The translation lookaside buffer (TLB) discussion in Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events describes how to see the problem using the VTune Amplifier, and provides suggestions for alleviating the problem. Simply put, you must reorder your data and/or accesses to reduce the number of pages needed. The hardware and OS support for large pages and huge pages is specifically aimed at this problem.
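To see how data layout and page size change the number of translations needed, here is a minimal sketch that counts the distinct pages an access pattern touches. The 4 KiB and 2 MiB page sizes, the item count, and the helper `pages_touched` are assumptions for illustration; check the page sizes your platform actually offers.

```python
def pages_touched(addresses, page_size):
    """Count the distinct pages containing the given byte addresses."""
    return len({addr // page_size for addr in addresses})


SMALL_PAGE = 4 * 1024         # 4 KiB, assumed base page size
LARGE_PAGE = 2 * 1024 * 1024  # 2 MiB, assumed large page size
ITEMS = 1024                  # hypothetical number of 8-byte items

# Scattered layout: consecutive items are 4096 bytes apart.
scattered = [i * 4096 for i in range(ITEMS)]
# Packed layout: the same items stored back to back.
packed = [i * 8 for i in range(ITEMS)]

# Items are 8-byte aligned, so each one lies entirely within one page.
scattered_small = pages_touched(scattered, SMALL_PAGE)
packed_small = pages_touched(packed, SMALL_PAGE)
scattered_large = pages_touched(scattered, LARGE_PAGE)
print(scattered_small, packed_small, scattered_large)
```

Under these assumptions the scattered layout needs a separate translation for every item with small pages, while either packing the data or switching to large pages brings the whole set within two pages.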
Data paths in modern computers are complex, and there are many variations between platforms. While modeling can estimate cache hit rates, it is usually easier to simply measure the caches and memory subsystem. Your objective should be to get almost all accesses to hit in the L1 or L2 caches.
The Memory Access Analysis Type in VTune Amplifier 2016 focuses on the loads and stores in and beyond the last level cache (LLC).
To examine the L1 and L2 caches instead, use a custom analysis type specifying the appropriate L1 and L2 counters. The analysis result shows whether the core is stalled while waiting for data to arrive or leave.
If the vast majority of the data is not coming from the L1 cache, and the calculations your code is doing are not complicated, it is highly likely your code is stalled on memory traffic. For example, consider an Intel® Xeon Phi™ processor with 60 cores, where there is no L3 cache and the Multi-Channel DRAM (MCDRAM) is shared by all the cores. The 400+ GB/s that the MCDRAM can deliver does not meet the needs of the 6e12 floating-point operations per second that the vector processing units (VPUs) can execute, so the units stall while awaiting data.
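The gap between those two figures can be worked out directly. The sketch below uses the numbers quoted above, treats "400+" as exactly 400e9 bytes per second (a lower bound), and assumes 8-byte double-precision operands.

```python
PEAK_FLOPS = 6e12     # floating-point operations per second (from the text)
BANDWIDTH = 400e9     # bytes per second; the text says "400+", so a lower bound
BYTES_PER_DOUBLE = 8  # assumption: double-precision operands

# Operations the VPUs can perform in the time it takes to deliver one byte.
flops_per_byte = PEAK_FLOPS / BANDWIDTH
# The same ratio expressed per double-precision value delivered.
flops_per_double = flops_per_byte * BYTES_PER_DOUBLE
print(f"{flops_per_byte:.0f} operations per byte delivered")
print(f"{flops_per_double:.0f} operations per double delivered")
```

Under these assumptions, code has to perform on the order of 120 floating-point operations on each double it fetches from MCDRAM to keep the VPUs busy. Code that does less work per value is limited by memory traffic, which is why hits in the L1 and L2 caches matter so much.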
For more information, see Why Efficient Use of the Memory Subsystem Is Critical to Performance.
The front end of each core can request arithmetic operations – multiplies, adds, etc. – faster than any of the adders and multipliers in the back end can perform the operations. This is not a problem if the operations do not use the same arithmetic unit – one instruction can use the integer adder while another does a floating-point multiply. However, if two operations need the same unit, one or the other must wait. This appears like so in the VTune Amplifier:
If this is your problem, you must use vector instructions, more cores, faster or more processors, or more cluster nodes to increase the number of arithmetic units available to your application.
Writing data to a cache line starts by reading the cache line into the L1 cache and claiming ownership of it so that all other caches invalidate their now-obsolete values. This is expensive.
Further writes to the same cache line do not need to reacquire ownership, but the cache line may be evicted by another device acquiring access to write it. False sharing, a well-known problem with memory caches, results if these accesses are to two or more different variables that share a cache line.
Complicating this picture are the write-combining buffers that merge multiple writes to cache lines or partial cache lines. See chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual for more information.
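To make the false sharing condition concrete, this minimal sketch checks whether two variables occupy a common cache line. The 64-byte line size, the offsets (measured from a cache-line-aligned base), and the helper names are assumptions for illustration, not values taken from any particular processor manual.

```python
CACHE_LINE = 64  # bytes; an assumption, check the value for your hardware


def cache_lines(offset, size, line_size=CACHE_LINE):
    """Return the set of cache-line indices covered by a variable."""
    first = offset // line_size
    last = (offset + size - 1) // line_size
    return set(range(first, last + 1))


def may_false_share(var_a, var_b, line_size=CACHE_LINE):
    """True if two (offset, size) variables occupy a common cache line."""
    return bool(cache_lines(*var_a, line_size) & cache_lines(*var_b, line_size))


# Hypothetical layout: two 8-byte counters, each written by a different thread.
adjacent = may_false_share((0, 8), (8, 8))  # packed next to each other
padded = may_false_share((0, 8), (64, 8))   # second counter padded to its own line
print(adjacent, padded)
```

The packed counters share a line, so each thread's write forces the other core to reacquire ownership. Padding the second counter to the next line boundary removes the conflict at the cost of some unused space.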
The bottleneck appears differently in the VTune Amplifier General Exploration Analysis Type result depending on whether the bottleneck is writing into L2, L3, or beyond.
Bottlenecks writing into L2 look like this:
Bottlenecks writing into L3 look like this:
Bottlenecks writing beyond L3 look like this:
The previous article, How Memory is Accessed, discusses how knowing the number of devices, the buses, and data sharing opportunities can help you understand:
- The origin of access latency and bandwidth variations
- Why planning the routes data takes through the network can make a huge difference in execution times
This article explains how hard it is to predict bottlenecks – although it is easy to estimate limits. Initially estimate how much hardware you need based on the number of operations you will do: arithmetic, storage, I/O, etc. Beyond that, measure real code processing real data on real machines to find the surprising bottlenecks before you put a lot of effort into fixing something that may not be the problem at all. Extrapolating from experiments with smaller core counts or smaller data sets is risky.
This is the end of this series of articles, but expect changes in the future as tools improve to help you get great performance out of your modern memory subsystem.
A second series of articles, starting with Performance Improvement Opportunities with NUMA Hardware, provides an introduction to all aspects of modern NUMA hardware and how it can be used within various application domains.
About the Author
Bevin Brett is a Principal Engineer at Intel Corporation, working on tools to help programmers and system users improve application performance. He is currently alternating between developing the necessary techniques for measuring and improving the performance code gets out of the memory subsystem, preparing the next round of tools to support these techniques, and applying these techniques to code that will change people’s lives for the better.