Processor Tracing

Intel® Processor Trace (Intel® PT) is an exciting new feature coming in future processors that can be enormously helpful in debugging because it will expose an accurate and detailed trace of activity with triggering and filtering capabilities to help with isolating the tracing that matters. We released specifications recently, and now a library is available to enable tool development as well as a talk this week on the work to make these capabilities available in Linux. Tool and operating system developers have specifications and our library to enable development.

We have released a library along wth sample tools to enable use of Intel® Processor Trace (Intel® PT) available as the "Processor Trace Decoder Library" available as a free download.  I can tell you a little about this project and I will also explain Intel PT to motivate the decoding capabilities of the Processor Trace Decoder Library.

The project itself will be able to support any operating system which itself is enabled for using Intel PT.  Intel PT is presented as a performace event, therefore support in an operating system is easy to detect by seeing if that event is available to configure/use. Changes for Linux have been worked on; the status of some Linux work was presented this week (presentation available at linuxfoundation-PT for LinuxCon.pdf). In time, we expect other operating systems including Windows and OS X will include support for Intel PT too and our Processor Trace Decoder Library is ready for that. The decoder library currently has been verified to build on Linux, Windows and OS X so it is ready!

The project for Processor Trace Decoder Library contains a library for decoding Intel PT together with sample implementations of simple tools built on top of the library that show how to use the library in your own tool.  The following are included in the download:

  • libipt: A packet encoder/decoder library plus a document describing the usage of the decoder library.
  • Optional Contents and Samples:
    • ptdump: Example implementation of a packet dumper.
    • ptxed: Example implementation of a trace disassembler.
    • pttc: A trace test generator.
    • script: A collection of scripts.

Processor Trace

Intel recently released details about Intel Processor Trace in the latest Intel® Architecture Instruction Set Extensions Programming Reference as Chapter 11. Intel Processor Trace is a low-overhead execution tracing feature that will be supported by some processors in the future.  It works by capturing information about software execution on each hardware thread using dedicated hardware facilities so that after execution completes software can do processing of the captured trace data and reconstruct the exact program flow. Intel PT is not free with respect to execution overhead, but the overhead is low enough that it should work well in production builds for most applications.

The captured information is collected in data packets. The first implementation of Intel PT offers control flow tracing, which includes in these packets timing and program flow information (e.g. branch targets, branch taken/not taken indications) and program-induced mode related information (e.g., Intel TSX state transitions, CR3 changes). These packets may be buffered internally before being sent to the memory subsystem.

Why is this useful?

Intel PT provides the context around all kinds of events. Performance profilers can use PT to discover the root causes of 'response-time' issues -  performance issues which affect the quality of execution, if not the overall runtime.  For example, using PT, video application developers can explore, in very fine detail, the execution of problematic individual frames, something not generally possible with more traditional sampling-based collection.

Furthermore, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific backedges and loop tripcounts, is easy to extract and report.

Debuggers can use it to reconstruct the code flow that led to the current location.  Whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over.  They may even allow navigating in the recorded execution history via reverse stepping commands.

Another important use case is debugging stack corruptions.  When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results.  Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.

Operating systems could include Intel PT into core files.  This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally so that when an operating system crash occurs the trace can be saved as part of a operating system crash dump mechanism and then used later to reconstruct the failure.

Intel PT can also help to narrow down data races in multi-threaded operating system and user program code. It can log the execution of all threads with a rough time indication.  While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.


Trace Buffer Management

The trace data can be collected into operating system provided circular buffers. To simplify memory management and to make it easier for the operating system to find a suitably large piece of memory, the buffer need not be contiguous.

The logical buffer consists of a collection of memory pages and a control structure that describes the page layout.  The operating system may configure Intel PT to generate an interrupt when any of the sections is near full.

This enables a variety of different use cases:

  •   a single circular buffer
  •   a single buffer with copy-out
  •   a single buffer with copy-out section by section

While Intel PT generates too much data to store the execution trace over a long period of time to disk, shorter snippets can be saved.

Intel is enabling Linux to provide support for Intel PT through the perf_event interface (a presentation reagrding this work is available at linuxfoundation-PT for LinuxCon.pdf).

Execution flow reconstruction

Intel PT uses a compact format to store the execution trace.  It omits everything that can be deduced directly from the code or from previous trace.

You can compare this with a brief list of instructions for navigating a maze.  As long as the way is obvious, you simply follow the twists and turns of the maze.  When you come to a junction you need to know whether to turn left or right.  In order to navigate the maze, all you really need is a short list of left or right directions. Similar to that, Intel PT uses a single bit to indicate whether a conditional branch has been taken or not taken.  Unconditional jumps and linear code are not represented in the trace, at all.

The PT trace consists of a sequence of packets (which come in different types).  To represent a selection of conditional branches, for example, Intel PT uses the TNT packet that comes in two different sizes: 8 bits and 64 bits. For reconstructing execution flow, there are a few more things to consider such as indirect branches, function returns, or interrupts. To model these, Intel PT adds more packets like TIP for indirect branches and function returns, and FUP for asynchronous event locations.  An interrupt will then be represented as a FUP followed by a TIP, giving the source and destination of the asynchronous branch, respectively. Intel PT also gives information about transactional synchronization. Whenever a transaction is started, committed, or aborted, Intel PT will generate two packets: a MODE.TSX packet giving the new transactional state, and a FUP packet giving the code location at which the new state is effective.  For a transaction abort, an additional TIP packet will be generated giving the location of the corresponding abort handler.

Please refer to the specification (Chapter 11 of the Intel® Architecture Instruction Set Extensions Programming Reference) for a full list of supported packets.

In order to reconstruct the execution flow, a decoder therefore needs to decode the instructions in the traced executable or library as well as the PT trace packets.  To handle dynamic libraries, the decoder also needs to consider sideband information provided by the operating system.

Intel provides an Open Source reference implementation for decoding PT packets and for reconstructing the execution flow. The Processor Trace Decoder Library (a collection of tools and libraries to enable use of Intel® Processor Trace) is available as a free download. Intel is currently working to help enabling GDB, the GNU* debugger. Additional intregration with other tools are being considered as well.

Summary

Intel provides a low-overhead tracing feature that allows recording the execution flow and reconstructing it at a later time.  This feature has applications for functional as well as for performance debugging.

For more complete information about compiler optimizations, see our Optimization Notice.