Monitoring Intel® Transactional Synchronization Extensions with Intel® PCM

After adopting a new technology (a new processor, a hardware accelerator, a new instruction, etc.), measuring the immediate performance delta is not enough: one also needs a way to verify that the technology has been applied correctly and efficiently. Intel® Transactional Synchronization Extensions (Intel® TSX: instructions for speculative execution of critical sections protected by locks) are no exception.

In fact, the 4th generation Intel® Core™ processors (with Intel TSX) introduced special hardware monitoring capabilities to measure the success of Intel TSX execution and to provide information about speculation failures. These capabilities are documented in the Intel® 64 and IA-32 Architectures Optimization Reference Manual and are already supported by the Linux perf profiler and by Intel® Performance Counter Monitor (Intel® PCM). Intel PCM is a simple open-source monitoring API and a collection of sample tools based on it (running on Windows, FreeBSD, Mac OS X and arbitrary/old Linux kernels).

In this blog post I will show a few examples of how the Intel PCM-TSX tool can be used.

Building Intel PCM-TSX Tool

On Windows, the Intel PCM-TSX tool (pcm-tsx.exe) can be built using the MS Visual Studio project in the PCM-TSX_Win directory of the PCM package (please also see WINDOWS_HOWTO.rtf for instructions on obtaining the required Windows kernel driver). On all other supported operating systems, running the 'make' command in the main PCM directory builds the pcm-tsx.x executable.

Measuring Basic Transactional Success

The first step after enabling Intel TSX in an application is to measure basic transactional success. The measurement is similar to the Linux "perf stat -T":

./pcm-tsx.x ./program
 Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility
 Copyright (c) 2013 Intel Corporation
 Executing "./program" command:
Time elapsed: 42 ms
Core | IPC  | Instructions | Cycles  | Transactional Cycles | Aborted Cycles  | #RTM  | #HLE  | Cycles/Transaction
   0   0.58         47 M       81 M        33 M (40.81%)        127 K ( 0.16%)  7239        0      4583
   1   1.13       3278 K     2905 K         0   ( 0.00%)          0   ( 0.00%)     0        0       N/A
   2   0.84       3831 K     4566 K      2659 K (58.24%)       1460   ( 0.03%)   576        0      4617
   3   0.74         33 M       45 M        32 M (70.23%)         85 K ( 0.19%)  7233        0      4446
-------------------------------------------------------------------------------------------------------------------
   *   0.66         88 M      134 M        68 M (50.56%)        214 K ( 0.16%)   15 K       0      4519

The output reports a few metrics for each core and the same metrics aggregated for the whole system. The metrics are IPC (instructions per cycle), the number of instructions and cycles, the number of cycles executed inside a transaction, the number of transactional cycles that were aborted, the number of started RTM speculations, the number of started HLE speculations, and the average number of transactional cycles per transaction (average transaction length). When the percentage of transactional cycles is low, either the program does not spend much time in critical sections or its locks are not enabled for TSX lock elision. The goal of TSX tuning is normally to make the percentage of aborted cycles as small as possible, that is, to make the commit rate of transactions as large as possible.
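The percentage and per-transaction columns can be re-derived from the raw counts, which is handy when post-processing the numbers yourself. The following Python sketch is only an illustration: the function name tsx_summary is hypothetical, and dividing transactional cycles by the sum of RTM and HLE starts is my assumption about how the last column is formed. Because the table prints rounded values (for example "33 M"), the results only approximate the printed percentages.

```python
def tsx_summary(cycles, transactional_cycles, aborted_cycles, rtm_starts, hle_starts):
    """Derive the percentage columns and the average transaction length.

    Assumption: a "transaction" is any started RTM or HLE speculation.
    """
    transactions = rtm_starts + hle_starts
    return {
        # Share of all cycles that were executed inside a transaction.
        "transactional_pct": 100.0 * transactional_cycles / cycles if cycles else 0.0,
        # Share of all cycles that were spent in transactions that later aborted.
        "aborted_pct": 100.0 * aborted_cycles / cycles if cycles else 0.0,
        # Average transaction length; None mirrors the "N/A" printed by the tool.
        "cycles_per_transaction": (
            transactional_cycles / transactions if transactions else None
        ),
    }


# Core 0 from the table above, using the rounded values as printed.
core0 = tsx_summary(81e6, 33e6, 127e3, 7239, 0)
print(core0)
```

A core with no started transactions (core 1 in the table) yields None for the average transaction length instead of dividing by zero.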

One can also run pcm-tsx.x in the background while your application is running:

pcm-tsx.x <update_delay_in_seconds>


Counting Abort Reasons with Intel TSX Events

Processors with TSX have hardware monitoring events that can be used to build a distribution of transaction abort reasons. The list of TSX events can be obtained by running pcm-tsx without any parameters:

Usage: pcm-tsx.exe (delay | "external_program") [-C] [-e event1 ] [-e event2 ] [-e event3 ] [-e event4 ]
  <delay>            - delay in seconds between updates. Either delay or "external program" parameters must be supplied
  "external_program" - start external program and print the performance metrics for the execution at the end
  -C             - output in csv format (optional)
  -e eventX      - monitor custom TSX event (up to 4) - optional. List of supported events: 
RTM_RETIRED.START Number of times an RTM execution started.
RTM_RETIRED.COMMIT Number of times an RTM execution successfully committed
RTM_RETIRED.ABORTED Number of times an RTM execution aborted due to any reasons (multiple categories may count as one)
RTM_RETIRED.ABORTED_MISC1 Number of times an RTM execution aborted due to various memory events
RTM_RETIRED.ABORTED_MISC2 Number of times an RTM execution aborted due to uncommon conditions
RTM_RETIRED.ABORTED_MISC3 Number of times an RTM execution aborted due to HLE-unfriendly instructions
RTM_RETIRED.ABORTED_MISC4 Number of times an RTM execution aborted due to incompatible memory type
RTM_RETIRED.ABORTED_MISC5 Number of times an RTM execution aborted due to none of the previous 4 categories (e.g. interrupt)
HLE_RETIRED.START Number of times an HLE execution started.
HLE_RETIRED.COMMIT Number of times an HLE execution successfully committed
HLE_RETIRED.ABORTED Number of times an HLE execution aborted due to any reasons (multiple categories may count as one)
HLE_RETIRED.ABORTED_MISC1 Number of times an HLE execution aborted due to various memory events
HLE_RETIRED.ABORTED_MISC2 Number of times an HLE execution aborted due to uncommon conditions
HLE_RETIRED.ABORTED_MISC3 Number of times an HLE execution aborted due to HLE-unfriendly instructions
HLE_RETIRED.ABORTED_MISC4 Number of times an HLE execution aborted due to incompatible memory type
HLE_RETIRED.ABORTED_MISC5 Number of times an HLE execution aborted due to none of the previous 4 categories (e.g. interrupt)
TX_MEM.ABORT_CONFLICT Number of times a transactional abort was signaled due to a data conflict on a transactionally accessed address
TX_MEM.ABORT_CAPACITY_WRITE Number of times a transactional abort was signaled due to limited resources for transactional stores
TX_MEM.ABORT_HLE_STORE_TO_ELIDED_LOCK Number of times a HLE transactional region aborted due to a non XRELEASE prefixed instruction writing to an elided lock in the elision buffer
TX_MEM.ABORT_HLE_ELISION_BUFFER_NOT_EMPTY Number of times an HLE transactional execution aborted due to NoAllocatedElisionBuffer being nonzero.
TX_MEM.ABORT_HLE_ELISION_BUFFER_MISMATCH Number of times an HLE transactional execution aborted due to XRELEASE lock not satisfying the address and value requirements in the elision buffer.
TX_MEM.ABORT_HLE_ELISION_BUFFER_UNSUPPORTED_ALIGNMENT Number of times an HLE transactional execution aborted due to an unsupported read alignment from the elision buffer.
TX_MEM.ABORT_HLE_ELISION_BUFFER_FULL Number of times HLE lock could not be elided due to ElisionBufferAvailable being zero.
TX_EXEC.MISC1 Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
TX_EXEC.MISC2 Counts the number of times a class of instructions that may cause a transactional abort was executed inside a transactional region
TX_EXEC.MISC3 Counts the number of times an instruction execution caused the nest count supported to be exceeded
TX_EXEC.MISC4 Counts the number of times an HLE XACQUIRE instruction was executed inside an RTM transactional region
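
When the tool is driven from a script, the command line can be assembled programmatically. The Python sketch below follows only what the usage text above states (a delay or an external program, an optional -C, and up to four -e options). The helper name build_pcm_tsx_command and the default tool path are hypothetical, and the sketch does not check event names against the list.

```python
import shlex

MAX_EVENTS = 4  # the usage text allows up to four -e options


def build_pcm_tsx_command(target, events=(), csv=False, tool="./pcm-tsx.x"):
    """Return an argv list: tool, delay or external program, [-C], [-e event]...

    `target` is either a delay in seconds or the external program to run.
    """
    if len(events) > MAX_EVENTS:
        raise ValueError(f"at most {MAX_EVENTS} custom events can be monitored")
    argv = [tool, str(target)]
    if csv:
        argv.append("-C")  # CSV output, as listed in the usage text
    for event in events:
        argv += ["-e", event]
    return argv


cmd = build_pcm_tsx_command(
    "./program", ["RTM_RETIRED.ABORTED", "TX_MEM.ABORT_CONFLICT"]
)
print(shlex.join(cmd))
# prints: ./pcm-tsx.x ./program -e RTM_RETIRED.ABORTED -e TX_MEM.ABORT_CONFLICT
```

The resulting list can be passed to subprocess.run as is. Keeping the external program as a single list element corresponds to the quoted "external_program" argument in the usage text.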

The following example shows how to estimate, for RTM execution, the number of aborts due to conflicts, transactional buffer overflows, TSX-unfriendly instructions and other causes.



pcm-tsx.x ./program -e RTM_RETIRED.ABORTED -e RTM_RETIRED.ABORTED_MISC1 -e TX_MEM.ABORT_CONFLICT -e RTM_RETIRED.ABORTED_MISC3
 Intel(r) Performance Counter Monitor: Intel(r) Transactional Synchronization Extensions Monitoring Utility
Executing "./program" command:
Time elapsed: 9549 ms
Event0: RTM_RETIRED.ABORTED Number of times an RTM execution aborted due to any reasons (multiple categories may count as one) (raw 0x4c9)
Event1: RTM_RETIRED.ABORTED_MISC1 Number of times an RTM execution aborted due to various memory events (raw 0x8c9)
Event2: TX_MEM.ABORT_CONFLICT Number of times a transactional abort was signalled due to a data conflict on a transactionally accessed address (raw 0x154)
Event3: RTM_RETIRED.ABORTED_MISC3 Number of times an RTM execution aborted due to HLE-unfriendly instructions (raw 0x20c9)
Core | Event0  | Event1  | Event2  | Event3
   0   8707 K    8701 K    8810 K       0
   1      0         0         0         0
   2      1         0         0         0
   3   8247 K    8242 K    9231 K       0
--------------------------------------------------
   *     16 M      16 M      18 M       0

The number of conflicts can be read directly from Event2 (TX_MEM.ABORT_CONFLICT) and the number of aborts due to RTM-unfriendly instructions from Event3 (RTM_RETIRED.ABORTED_MISC3). The total number of transactional buffer overflows is ~= RTM_RETIRED.ABORTED_MISC1 - TX_MEM.ABORT_CONFLICT. Note that the counters are not strictly additive: multiple abort signals may be counted as one abort in a different category, so TX_MEM.ABORT_CONFLICT > RTM_RETIRED.ABORTED_MISC1 is possible (as in the output above). The other aborts are ~= RTM_RETIRED.ABORTED - RTM_RETIRED.ABORTED_MISC1 - RTM_RETIRED.ABORTED_MISC3. To look further into the reasons for these other aborts, choose different TSX events from the list above.
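
The arithmetic above can be written down as a short Python sketch. The function name rtm_abort_breakdown is hypothetical, and clamping negative estimates to zero is my own choice for the case where TX_MEM.ABORT_CONFLICT exceeds RTM_RETIRED.ABORTED_MISC1; the tool itself does not do this. The results are estimates, not exact counts.

```python
def rtm_abort_breakdown(aborted, misc1, conflict, misc3):
    """Estimate the RTM abort reason distribution from four event counts.

    aborted  -- RTM_RETIRED.ABORTED
    misc1    -- RTM_RETIRED.ABORTED_MISC1 (memory events)
    conflict -- TX_MEM.ABORT_CONFLICT
    misc3    -- RTM_RETIRED.ABORTED_MISC3 (unfriendly instructions)
    """
    # Buffer overflows ~= memory-event aborts that were not conflicts.
    # Clamped at zero (assumption) because conflict signals can exceed misc1.
    capacity = max(misc1 - conflict, 0)
    # Everything that is neither a memory event nor an unfriendly instruction.
    other = max(aborted - misc1 - misc3, 0)
    return {
        "conflict": conflict,
        "capacity": capacity,
        "unfriendly_instructions": misc3,
        "other": other,
    }


# Core 0 from the output above (values rounded to thousands as printed).
core0 = rtm_abort_breakdown(8_707_000, 8_701_000, 8_810_000, 0)
print(core0)
```

For core 0 the capacity estimate is clamped to zero because more conflict signals than memory-event aborts were counted, so nearly all aborts on that core are attributable to data conflicts.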

TSX Tuning

The next step in the TSX tuning process is to find the source of aborts in the application code (the code that "kills" transactions). Here one needs a sampling profiler with TSX PEBS support (best for synchronous aborts: TSX-unfriendly instructions, faults, etc.) and/or the TSX emulator (for some synchronous aborts and for asynchronous aborts, which are conflicts and transactional buffer overflows). Once the source is found, consult the Intel TSX enabling and optimization recommendations (Chapter 12) for methods to avoid the aborts.

Best regards,

Roman