Intel(r) Transactional Synchronization Extensions (Intel(r) TSX) profiling with Linux perf

Intel TSX exposes a speculative execution mode to the programmer to improve locking performance. Tuning speculation relies heavily on a PMU profiler. This document describes TSX profiling using the Linux perf (or “perf events”) profiler, which comes integrated with newer Linux systems. More details on TSX are available at ACM Queue, Wikipedia, at LWN for the Linux glibc implementation, and in the specification (chapter 8).

Preliminaries

The techniques described in this guide need an updated perf version with TSX and Haswell support. Perf is integrated into the Linux kernel, so this requires updating to a new kernel. The 3.13 kernel includes all the features described here. Please get it from http://www.kernel.org/pub/linux/kernel/v3.x. At the time of this writing an RC pre-release is available. The earlier 3.11 and 3.12 kernels contain a subset. For best results please use 3.13 or later.

After downloading the kernel tree it should be installed and booted. In addition the perf binary needs to be built (in tools/perf) and installed. Some perf features may require additional library packages to be installed. The build procedure suggests the package names.

Some Linux distributions may also provide updated kernel and perf packages. For example, Fedora 20 with the latest updates has full support.

Basic cycle sampling to understand the program

First the program has to be enabled for TSX. Once it is running with TSX it can be profiled using perf.

Normal cycle sampling aborts transactions, so it may impact the TSX performance. Some initial cycle sampling (perf record ; perf report) is still a good idea to get a basic overview of the expensive parts of the workload, but it should be understood that it affects the TSX performance.

perf record -e cycles:pp -g program
perf report

Use the interactive browser or generate a report in a file (perf report > file).

When sampling TSX it is important to use the :pp qualifier for cycles to enable PEBS. Otherwise, when a sample hits a transaction, the sampled instruction pointer will always be in the abort handler (or near the lock instruction for HLE).

Measuring basic transactional success with perf stat -T

The first step after the program is running with TSX is to use perf stat -T to measure the basic transactional success.

perf stat -T program

Or, if the program is long running and in a steady state, run the following in parallel from another terminal:

perf stat -T -a sleep 1

Using -a may require being root or setting /proc/sys/kernel/perf_event_paranoid to -1 first. With -a the complete machine is measured. Alternatively it is also possible to attach to specific pids with -p.

The -T option reports the number of transactional cycles. When the number is low the program may not spend much time in locks or the locks are not enabled for TSX lock elision.

-T also reports the aborted cycles, that is cycles spent in doomed transactions that did not commit. The goal of TSX tuning is normally to make that number as small as possible, that is to make the commit rate of transactions as large as possible.

These numbers should only be trusted for relatively long running processes. At startup there are typically various transient abort causes (for example faulting in the working set) that will disappear later. If the startup phase of the application is very expensive it is preferable to use -a or -p in parallel to only measure when the program is past the start-up phase.

Newer versions of perf also have a -I option to enable interval sampling. For programs with very different phases it can be useful to combine this with -T to get separate measurements for the different phases.

In addition -T reports the number of transactions, separated for HLE (el) and RTM (tx), and their average length. In general it is preferable if transactions are not too short.

The overhead of perf stat -T counting is normally low and should not affect the run time of the program significantly. perf stat -T counts both kernel and user transactions. When an RTM-enabled kernel is used but only the user program should be measured, it is possible to specify the events used by -T manually with -e, adding a :u qualifier to count them only for ring 3. The computations for the various ratios will still be done.
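
As a rough sketch of the manual variant, the helper below appends the user-only qualifier to a list of events and prints a perf stat command line. The user_only function is hypothetical, and the event list is only an assumed approximation of what -T measures; the exact set depends on the perf version.

```shell
# Hypothetical helper: add the user-space-only modifier to each event.
# Names in cpu/.../ form take a trailing "u"; plain names take ":u".
user_only() (
    out=""
    for ev in "$@"; do
        case "$ev" in
            */) ev="${ev}u" ;;
            *)  ev="${ev}:u" ;;
        esac
        out="${out:+$out,}$ev"
    done
    printf '%s\n' "$out"
)

# Assumed approximation of the event set behind -T; check your perf version.
events=$(user_only cycles instructions cpu/cycles-t/ cpu/cycles-ct/ cpu/tx-start/ cpu/el-start/)

# Print the resulting command line instead of running it.
echo "perf stat -e '{$events}' program"
```

The printed command can be pasted into a terminal once the event list has been checked against the perf version in use.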

Profiling for missing locks

When the number of transactional cycles reported by -T is low, not all locks may be elided. To find all locks in a program and make sure they are elided, it's useful to count the MEM_UOPS_RETIRED.LOCK_LOADS event in comparison with RTM_RETIRED.START or HLE_RETIRED.START.

perf stat -e '{r21d0,tx-start,el-start,cycles}' program

When the number of lock loads is significantly higher than the number of started transactions, it is possible that not all locks are elided. In this case sample for the common locks and make sure they are all elided:

perf record -g -e r21d0:p …
perf report
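
The comparison between lock loads and started transactions can be reduced to a single number. The sketch below uses a hypothetical helper, not part of perf, that takes the three counts copied by hand from the perf stat output and prints the number of lock loads per started transaction (RTM plus HLE). What counts as significantly higher depends on the workload.

```shell
# Hypothetical helper: lock_loads_per_start LOCK_LOADS TX_START EL_START
# Prints lock loads divided by the sum of RTM and HLE transaction starts.
lock_loads_per_start() {
    awk -v loads="$1" -v tx="$2" -v el="$3" 'BEGIN {
        starts = tx + el
        if (starts == 0) { print "no transactions started"; exit }
        printf "%.2f\n", loads / starts
    }'
}

# Made-up counts for illustration only.
lock_loads_per_start 3000 1000 500
```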


Profiling abort causes with perf record sampling

When the number of aborted cycles reported by perf stat -T is high, the location of aborts should be profiled using sampling. Abort sampling does not affect the transaction commit rate, because the transactions have already aborted when sampled (it does, however, still add some overhead).

HLE and RTM use different sampling events. perf stat -T reports whether HLE or RTM is used (el-start or tx-start). When el-start is high, el-abort is the abort event; when tx-start is high, tx-abort is the abort event. It is also possible to sample for both at the same time, but it is recommended not to specify an event that is not needed. The example below samples for RTM aborts.

perf record -e cpu/tx-abort/pp program # measure program
perf record -e cpu/tx-abort/pp -a sleep 1 # measure whole system for 1 second
perf report # display samples

PEBS monitoring needs to be enabled explicitly with pp (otherwise only the abort handler is sampled). This is required for correct abort locations. When the event is specified in another way, precise needs to be 2 (either with :pp or with precise=2). The PEBS record contains additional information that needs to be explicitly enabled. The important information for aborts is the weight (--weight), the transaction flags (--transaction) and often the call chain (-g).

The program should have debugging information and a symbol table available. This can be done either by compiling with -g and keeping the object files available or, for a distribution program, by installing the debuginfo package. This allows perf to display the symbols and to browse sample results for individual lines. When the source code is available on the system, perf is also able to report samples in a source listing.

By default the assembler code is displayed in AT&T syntax. Intel assembler syntax can be enabled for “perf report” with -M intel. The overview displays the samples by symbol. When srcline is added to the sort argument below, samples can also be reported by source line.

The additional information needs to be explicitly specified for perf report using the --sort option so that it is displayed:

perf record -g --transaction --weight -e cpu/tx-abort/pp program
perf report --sort symbol,transaction,weight

The transaction weight is the cost in cycles that the transaction spent before aborting. Aborts with a high weight are more expensive. Note that the current perf version does not sort on weight, so the top entries are not necessarily the most expensive. There is a global_weight, which is the global sum of the weights, and a local_weight, which is the average weight of the sample. The default weight is global.

perf currently splits samples by weight, which may lead to a lot of entries. After weight has been examined it is sometimes useful to remove it from the sort keys to collapse the output.

The transaction flags describe the type of abort. The tuning strategy varies depending on the type.

| Name | Description |
| --- | --- |
| EL | Abort in a HLE transaction. |
| TX | Abort in a RTM transaction. |
| SYNC | Synchronous abort. The abort was caused by the specified instruction, for example a system call or other unfriendly instruction (see 14.3.8.1 in the specification). |
| ASYNC | Asynchronous abort. The abort was caused by another thread and the instruction pointed to just happened to be running at that time. |
| RETRY | The transaction is retry-able. |
| CON | Conflict: the abort was caused by a write conflict, typically caused by another thread. The location can be random in the transaction, but is often near a conflict causing pattern. |
| CAP-WRITE | Capacity: the abort was caused by the local transaction exceeding the maximum write buffering capacity of the transaction. |
| CAP-READ | Capacity: the abort was caused by the local transaction exceeding the maximum read buffer capacity of a transaction. Rarely this can also be caused by other abort types. |
| :NUMBER | Abort code. The program explicitly aborted with the XABORT instruction with code NUMBER. A common code is 0xff for lock busy in the lock library, that is, the lock not being elided for a long time. |

Call chains are often needed to understand the context in the program. The record call chain option (-g) requires compiling the program and the used libraries with -fno-omit-frame-pointer, or using -g dwarf (--call-graph dwarf in newer perf versions) when dwarf2 unwind information is available. Using dwarf2 is slower than using frame pointers. It also requires compiling the perf tool with the unwind library.

The call chain of an abort is only recorded after the transaction has aborted, which is typically in the lock library. The sample hit is the actual abort point. So the call chain is discontinuous: it starts with the abort point and continues with the call graph of the lock library lock function. This is important to keep in mind when looking at the call graphs, as it differs from samples not hitting transactions.

For asynchronous (and to a lesser degree capacity) aborts it is often more useful to look at the whole critical section than at the specific abort point. The abort is triggered by another CPU and appears randomly in the transaction, so the instruction reported in the sample may have nothing to do with the abort cause. The memory accesses of the whole transaction need to be examined to minimize read-write or write-write memory sharing. This can be done by sampling for the return IP of the abort, using tx-aborts-count or el-aborts-count with a call graph. For non-inlined locks, typically the first caller outside the locking library defines the critical section. The non-PEBS *-count events do not support weight and transaction flags. If those are needed together with the return IP, the non-eventingrip PEBS versions of these abort events can be used: r4c8:p (HLE abort) or r4c9:p (RTM abort). They currently only exist in raw form, but this may change in future perf versions.

Last Branch Records (LBRs) to look inside the transaction

To see the control flow inside a transaction that led to an abort, LBRs can be used. When enabled, the CPU stores the last 16 branches in the LBR registers. Perf record can sample them for aborts with the -b option. The default display uses basic block histograms and collapses all paths, so it is often not very useful for analyzing individual abort samples. A workaround is to use perf report -D, extract the branches manually from the samples and translate the addresses with addr2line. Future perf versions will hopefully improve this cumbersome procedure.

Update: an experimental patchkit for perf report to enable this with --branch--call-graph is available.

Additional TSX events and qualifiers

Perf has some additional builtin TSX events. The counting events are separate for HLE and RTM. All the available builtin events can be listed with “perf list”. Except for the abort events, these events are not precise. When they are sampled and hit a transaction, perf report will show an instruction after the abort, which is often not useful.

| RTM event | HLE event | Description |
| --- | --- | --- |
| cpu/tx-abort/pp | cpu/el-abort/pp | Precise abort event for sampling. Use this to profile for abort locations. |
| tx-abort | el-abort | Abort event for counting (not using PEBS). This event should be used with perf stat instead of the precise version. When sampled it will sample the point of the abort handler or original lock for HLE, and not the abort point. |
| tx-capacity | el-capacity | Transactions that exceeded the buffering capacity. Not precise, use for counting. |
| tx-commit | el-commit | Successful transaction commits. |
| tx-start | el-start | Transaction starts. Only use for counting. |
| tx-conflict | el-conflict | Abort due to a memory conflict. Only use for counting. |
| cycles-t | | Transaction cycles. Only use for counting. |
| cycles-ct | | Transactional cycles minus cycles of aborted transactions (committed cycles). Only use for counting. |
| cpu/instructions,in_tx=1/ | | Transactional instructions. Only use for counting. |
| cpu/instructions,in_tx=1,in_tx_cp=1/ | | Transactional instructions minus aborted transactions (committed instructions). Only use for counting. |

For analyzing capacity and conflict aborts it is usually preferable to sample aborts with --transaction and examine the transaction abort types in the sample browser. Most of these events are more useful for counting (perf stat) which does not cause additional aborts.

Specifying raw perf events

Perf has only a limited builtin event list. Additional events can be specified in a raw hex form:

rUUEE

where EE is the hex event code and UU is the unit mask. See the SDM for a full list of valid events on Haswell. Additional qualifiers can be added to the mask, for example 0x100000000 to count the event only inside a transaction and 0x200000000 to set a checkpointed event (see section 18.10.5 in the SDM for more details). The raw mask is directly mapped to the control register of the performance counter.
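
To make the encoding concrete, here is a minimal sketch that assembles the raw string from its parts with shell arithmetic. The raw_event helper and the two variable names are hypothetical; the bit values are the ones quoted above, and the event codes are the ones already used in this guide.

```shell
# Qualifier bits quoted in the text.
IN_TX=0x100000000      # count only inside a transaction
IN_TX_CP=0x200000000   # checkpointed event

# Hypothetical helper: raw_event EVENT_CODE UMASK [EXTRA_BITS]
# Prints the rUUEE string, with any extra qualifier bits ORed in.
raw_event() {
    printf 'r%x\n' $(( ($2 << 8) | $1 | ${3:-0} ))
}

raw_event 0xd0 0x21            # prints r21d0 (mem_uops_retired.lock_loads)
raw_event 0xc9 0x04            # prints r4c9 (RTM abort)
raw_event 0xd0 0x21 "$IN_TX"   # prints r1000021d0 (lock loads inside transactions only)
```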

Some alternative frontends for perf provide a full symbolic event list (for example ocperf in pmu-tools), which avoids this complicated procedure. ocperf can also generate raw event strings for later use.

Perf supports multiplexing to measure more events than the 4 hardware counters (8 without HyperThreading) available. When events are used in equations that depend on each other, it is important to run all the events of an equation at the same time. This can be done by specifying event groups with {}. It is valid to run multiple groups at a time; they will then be multiplexed.

perf stat -e '{cycles,cycles-t,cycles-ct}' program
perf stat -e '{r154,r254,r21d0},{r1c9,r1c8,r4c8,r4c9}' -p PID

Newer perf versions also have an alternative, more verbose syntax:

perf stat -e 'cpu/event=0x12,umask=0x34,in_tx=1,in_tx_cp=1/' -a sleep 1

When the events are post-processed by a program, it is useful to enable CSV mode for perf stat (-x,).
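
As an illustration of such post-processing, the sketch below pulls two counts out of CSV output and derives the share of transactional cycles that was spent in aborted transactions. Both helper names are hypothetical, the sample data is made up, and the column layout of -x, output differs between perf versions, so the parser only assumes that the count is the first field.

```shell
# Hypothetical helper: read "perf stat -x," output on stdin and print the
# count for one event. The count is assumed to be the first field; the
# event name may sit in any later field.
csv_count() {
    awk -F, -v ev="$1" '{
        for (i = 2; i <= NF; i++)
            if ($i == ev) { print $1; exit }
    }'
}

# Hypothetical helper: percentage of transactional cycles that were aborted,
# computed as (cycles-t - cycles-ct) / cycles-t.
aborted_cycle_percent() {
    awk -v t="$1" -v ct="$2" 'BEGIN {
        if (t == 0) { print "no transactional cycles"; exit }
        printf "%.1f\n", 100 * (t - ct) / t
    }'
}

# Made-up sample standing in for a real measurement, for example from
#   perf stat -x, -o stat.csv -e '{cycles,cycles-t,cycles-ct}' program
sample='8000,,cycles
2000,,cycles-t
1500,,cycles-ct'

t=$(printf '%s\n' "$sample" | csv_count cycles-t)
ct=$(printf '%s\n' "$sample" | csv_count cycles-ct)
aborted_cycle_percent "$t" "$ct"
```

With real data, replace the sample variable by the contents of the file written with -o.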

Additional raw TSX events diagnosing various abort conditions

Most of these events are speculative and may over-count. They are not precise and will only report an IP after the abort. They can be used to count specific abort reasons. In most cases it is better to analyze aborts from tx-abort/el-abort PEBS sampling, as that can be done by source line. The HLE events may be useful to confirm specific HLE commit problems when debugging a new HLE-enabled lock library.

| SDM Name | Perf raw event | Description |
| --- | --- | --- |
| tx_exec.misc4 | r85d | HLE executed inside RTM and other rare abort causes |
| tx_exec.misc3 | r45d | Transaction nesting limit overflow and other rare abort causes |
| tx_mem.abort_hle_elision_buffer_full | r4054 | Too deep nesting for HLE |
| tx_mem.abort_hle_elision_buffer_unsupported_alignment | r2054 | Read from lock value inside HLE region using unsupported alignment. |
| tx_mem.abort_hle_elision_buffer_mismatch | r1054 | XRELEASE address or value did not match XACQUIRE |
| tx_mem.abort_hle_elision_buffer_not_empty | r854 | HLE XRELEASE without matching XACQUIRE |
| tx_mem.abort_hle_store_to_elided_lock | r454 | Store to lock inside HLE region |

Additional useful perf events. Using the software trace points may require running as root or making /sys/kernel/debug/tracing world readable first. The kernel also needs to support system call tracing. For more available events see “perf list”.

| Name | Perf event | Description |
| --- | --- | --- |
| syscalls:sys_enter_futex | syscalls:sys_enter_futex | Count futex syscalls, which give a rough estimate of how often a sleeping futex lock (e.g. a pthread mutex) blocked. This also counts lock wake-ups on unlock when another thread is waiting. For adaptive pthread mutexes it will under-count contention. |
| mem_uops_retired.lock_loads | r21d0 | Count atomic operations. Useful to find locks that are not elided. Can also be used as a PEBS event (:p) for sampling. Should not be used with multiplexing. |



Updated 09/13 for latest perf code

Updated 09/25 to fix some mistakes and clarify some descriptions.

Updated 11/07 to report merge status

Updated 12/15 to remove mention of git tree, just point to released kernels

Updated 01/11 to fix a broken reference, point to rawhide and add pointer to LBR callgraph patch.

Updated 04/29 to fix another broken reference, add a pointer to ACM queue article and update FC20 reference.


Comments


Dmitry,

Yes 3.2 is too old for dwarf support. It was added with the 3.8 kernel. Before that you have to use frame pointers.

cycles:pp is useful even without transactions as it has less "skid". It takes some time for the sample interrupt to fire, so the sample can be off by a significant amount. :p and :pp use a mechanism to let the CPU record the IP in advance before triggering the sample interrupt. This still has some skid, but much less.

However before Haswell the implementation of :pp is very costly, so I would rather recommend using :p only, which has one instruction more skid. On Haswell they both cost the same.

precise=2 can only be used with the new verbose perf event syntax cpu/event=XX,umask=YY,precise=2/ and currently only with my patchkit. On older perf you can always use rYYXX:p and :pp, which is equivalent.

-Andi


If I sample with "-e cycles" or with "-e cycles:pp", am I intended to see any difference? I am asking because I do not see any. I understand that it is useful for transactions (to not cause aborts), but is it useful otherwise?
Also my perf does not seem to understand "-precise=2" flag.
Thanks!

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net


Hi Andy,

You've mentioned "-g dwarf" option to unwind stacks even w/o frame pointers. I have perf 3.2.5 and it does not recognize this option. Do I need a newer perf?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

