Debug Intel® Transactional Synchronization Extensions

One of the most popular ad-hoc functional debugging techniques is to use the printf or fprintf functions to display the state of variables. However, if these functions are used inside an Intel® Transactional Synchronization Extensions (Intel® TSX) transaction they can cause transaction aborts. The reason is that flushing the print output buffer involves an operating system call and an I/O operation: operations that cannot be roll backed by Intel® TSX. That means that the (f)printf output from transaction may be lost due to the machine state roll-back as a result of the transaction abort caused by the attempt to flush the I/O buffer inside the transaction. If the flush happens after a committed transaction then the printf output won’t be lost.1

In general, any transaction abort handler needs to use a fall-back synchronization mechanism that does not involve Intel TSX.  It should, therefore, be possible to see the problem that is being debugged there where printf works as expected. However, what can you do if, for some reason, the problem is not reproducible in the fall-back execution? So far I haven’t had this problem, but if you do please consider the trick shown below.


The idea is to use Intel® Processor Trace (Intel® PT) technology. This records an instruction trace that includes execution inside transactions, but it doesn’t record any data values. To overcome this limitation I map a data value into a code execution path that I can then see in the captured trace.


The basic building block of the method is function that “prints” a char (1 byte) into the processor instruction trace. It takes the character (in the %rdi register) and jumps to that offset in a special code block defined by character’s value. The code block contains 256 1-byte “ret” instructions (return from the function).

tsx_print_char:
    movzbq %dil,%rdi # zero-extend the lowest 8-bits of rdi (input character)
                     # to 64-bits in rdi
    leaq tsx_print_char_0(%rip), %r8 # get the address of ret table below into r8
    addq %rdi, %r8  # add rdi to r8
    jmp *%r8        # jump to r8
    tsx_print_char_0:
    ret # 256x ret
    ret
    ret
    ret
    ret
    ret
    ret
    ...
    ret
    ret
    ret
    ret


The decoder of the processor trace dump can then read the records indicating control transfer from the jmp *%r8 instruction offset to a position in the ‘ret’ table (tsx_print_char_0 label) and interpret them as characters. Using Linux perf this can be done as follows ($7 and $11 are fields in the perf Intel PT dump that contain source and destination of an execution control transfer, 3 is the length of the “jmp *%r8” instruction encoding):

perf script | egrep 'tsx_print_char.*tsx_print_char_0' | awk '{ printf("%c", (strtonum("0x"$11)-strtonum("0x"$7)-3))} '

Using this primitive one can built a function that is compatible with the normal printf:

#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>

extern "C" void tsx_print_char(char c);

#define TSX_PRINTF_BUF_LEN (1024)

int tsx_printf(const char* format, ...)
{
    va_list list;
    char str[TSX_PRINTF_BUF_LEN];
    va_start(list, format);
    int ret = vsnprintf(str, TSX_PRINTF_BUF_LEN, format, list);
    va_end(list);
    ret = (ret < TSX_PRINTF_BUF_LEN)? ret : TSX_PRINTF_BUF_LEN;
    for(int i = 0; i < ret; ++i)
    {
        tsx_print_char(str[i]);
    }
    return ret;
}

Here is a performance micro-test for the tsx_printf function:

int main(int argn, char * arg[])
{
   cpu_set_t cpuset;
   CPU_ZERO(&cpuset);
   CPU_SET(2, &cpuset); // thread 2
   pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
   tsx_printf("Some warm up to resolve symbols %llu\n", RDTSCP());
   printf("Some warm up to resolve symbols %llu\n", RDTSCP());
   unsigned long long int totaltx = (argn >=2)?atoi(arg[1]):100;
   int  print_type = (argn >=3)?atoi(arg[2]):0;
   unsigned long long int begin, end, i = 0;
   int tmp;
   tbb::speculative_spin_rw_mutex mutex; // TBB mutex using TSX
   begin = RDTSCP();

   for(; i < totaltx; ++i)
   {
       tbb::speculative_spin_rw_mutex::scoped_lock _(mutex);
       if(_xtest()) // print inside transaction
       {
           switch(print_type)
           {
               case 1:
                   printf("Hello from an aborted transaction %d!\n", i);
                   break;
               case 2:
                   tsx_printf("Hello from an aborted transaction %d!\n", i);
                   break;
           }
       }
       _mm_stream_si32(&tmp, 0); // causes abort due to incompatible
                                 // memory type access (non-temporal store)
   }
   fflush(stdout);
   end = RDTSCP();

   printf("total     iterations: %d\n", totaltx);
   printf("cycles per iteration: %llu\n", (end-begin)/totaltx);

   return 0;
}

Compiling and running the test:

g++ -c tsx_print_char.s -o tsx_print_char.o
g++ -Itbb/build/linux_intel64_gcc_cc4.3_libc2.11.3_kernel3.13.5_release -Ltbb/linux_intel64_gcc_cc4.3_libc2.11.3_kernel3.13.5_release -ltbb –O3 test.cpp tsx_print_char.o -o test.x -pthread
perf record -e intel_pt//u ./test.x 10 2
perf script | egrep 'tsx_print_char.*tsx_print_char_0' | awk '{ printf("%c", (strtonum("0x"$11)-strtonum("0x"$7)-3))} }'

The output:

Some warm up to resolve symbols 14970613345637616
Hello from an aborted transaction 0!
Hello from an aborted transaction 1!
Hello from an aborted transaction 2!
Hello from an aborted transaction 3!
Hello from an aborted transaction 4!
Hello from an aborted transaction 5!
Hello from an aborted transaction 6!
Hello from an aborted transaction 7!
Hello from an aborted transaction 8!
Hello from an aborted transaction 9!

The table below shows cycles per iteration when the test runs 10000 iterations on  an Intel® Core™ i7-6700 CPU @ 3.40GHz (“Skylake”) with Linux kernel 4.4 and “performance” power governor (“cpupower frequency-set -g performance”). The output of the normal printf (stdout) is redirected to a file to use buffering and improve speed (flushed just once before the end of the test).

printf variantcycles per iteration

cycles per iteration

with Intel PT collection running

no printf495583
buffered printf697848
tsx_printf10151356

As one can see the overhead of tsx_printf is similar to that of the normal buffered printf.

The source code is available under conditions of the Intel Sample Source Code license.

 

Web Resources about Intel® Transactional Synchronization Extensions


1 If the functional issue one wants to debug were transaction aborts in the first place then all printf output before the debugged abort point will be lost. However, the performance monitoring unit is able to provide a good guidance on the type of instruction abort (TSX-unfriendly instruction, access to TSX-incompatible memory type) and its location in the code.

.

AdjuntoTamaño
Archivo tsx_printf.tgz1.36 KB
standard
Para obtener información más completa sobre las optimizaciones del compilador, consulte nuestro Aviso de optimización.