Applying Intel® Threading Building Blocks observers for thread affinity on Intel® Xeon Phi™ coprocessors.

Although the Intel® Threading Building Blocks (Intel® TBB) library [1] [2] provides high-level task-based parallelism intended to hide software thread management, thread-related problems sometimes arise. One of them is thread affinity [3]. Since thread affinity may help to optimize cache performance [3] [4], and consequently overall performance, the topic cannot be ignored. In contrast to OpenMP* [5] [6], the Intel® TBB library has no native facilities for managing thread affinity, but the Intel® TBB task_scheduler_observer feature [7] [8] can fill the gap. In this article I demonstrate how an observer can be used to bind threads on an Intel® Xeon Phi™ coprocessor [9].

Since the library does not provide any way to bind threads, system-specific APIs have to be used. The Intel® Xeon Phi™ coprocessor runs the Linux* operating system (Linux* OS), and according to the Linux* man pages, sched_setaffinity [10] serves this purpose. Before continuing with the observer approach, let us consider several peculiarities of thread affinity on Linux* OS. On the one hand, the sched_setaffinity API seems to be a rather simple interface: it is enough to specify the necessary mask for the current thread and the job is done.

int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
int sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);

But on the other hand, there is no API to determine how much memory should be allocated for the mask. One way to find out is to allocate space and try to get the affinity mask of the current process [10]. If that fails with “EINVAL” it means that the mask is too small and should be extended. The code snippet below implements this idea:

cpu_set_t *mask;
int ncpus;
for ( ncpus = sizeof(cpu_set_t)/CHAR_BIT; ncpus < 16*1024 /* some reasonable limit */; ncpus <<= 1 ) {
	mask = CPU_ALLOC( ncpus );
	if ( !mask ) break;
	const size_t size = CPU_ALLOC_SIZE( ncpus );
	CPU_ZERO_S( size, mask );
	const int err = sched_getaffinity( 0, size, mask );
	if ( !err ) break;

	CPU_FREE( mask );
	mask = NULL;
	if ( errno != EINVAL )  break;
}
if ( !mask )
	std::cout << "Warning: Failed to obtain process affinity mask. Thread affinitization is disabled." << std::endl;

One more caveat is that the Intel® Xeon Phi™ coprocessor has a specific topology: logical processor 0 is actually hardware thread zero on the last core, while logical processors 1-4 are the hardware threads on core zero. It is advised to avoid logical processor zero (best to avoid the last core completely), since that is where the OS boots and runs most daemons; see the BKM at [16].

The Intel® TBB library provides the task_scheduler_observer class with a few virtual methods that allow clients to observe when a thread starts or stops participating in task scheduling. In our case we only need the notification when a thread starts its work: that is a good moment to set the operating-system affinity of the current thread. Let's implement one of the simplest examples that calculates π [11], using an Intel® TBB observer to enable thread affinity. First, we should include the required header:

#include <tbb/task_scheduler_observer.h>

When a thread starts its work, on_scheduler_entry is called. The basic idea is to count how many threads have already started; the value of the counter provides a unique number for each thread. on_scheduler_entry can be called concurrently from multiple threads, so tbb::atomic should be used for the counter to prevent data races. Since Intel TBB 4.2 Update 1 you can instead use the tbb::task_arena::current_slot() method to obtain the unique number of the thread slot in the current arena. It should also be noted that it is good practice to bind threads only within the process mask (which the OS specifies for each process). We have already shown the code that obtains this process mask in the first code snippet; thus mask contains the process mask and ncpus reflects its length. The following code snippet obtains the unique thread number:

tbb::atomic<int> thread_index;
/*override*/ void on_scheduler_entry( bool ) {
    if ( !mask ) return;

    const size_t size = CPU_ALLOC_SIZE( ncpus );
    const int num_cpus = CPU_COUNT_S( size, mask );
    int thr_idx =
#if USE_TASK_ARENA_CURRENT_SLOT
        tbb::task_arena::current_slot();
#else
        thread_index++;
#endif
#if __MIC__
    thr_idx += 1; // To avoid logical thread zero for the master thread on Intel(R) Xeon Phi(tm)
#endif
    thr_idx %= num_cpus; // To limit unique number in [0; num_cpus-1] range

We assume that the number of threads requested from Intel® TBB can exceed the hardware resources available in the process mask. But our unique number is in the [0; num_cpus-1] range and can be unambiguously matched with the process mask. In order to bind threads with contiguous numbers to different cores (not to be confused with different hardware threads, which can be on the same core), a pinning-step variable is introduced.

// Place threads with specified step
int cpu_idx = 0;
for ( int i = 0, offset = 0; i < thr_idx; ++i ) {
    cpu_idx += pinning_step;
    if ( cpu_idx >= num_cpus )
        cpu_idx = ++offset;
}

As a result, cpu_idx contains the number of a bit that should be set in the affinity mask. Since we want to take the process mask into account, we find the set bit with this index:

// Find index of 'cpu_idx'-th bit equal to 1
int mapped_idx = -1;
while ( cpu_idx >= 0 ) {
    if ( CPU_ISSET_S( ++mapped_idx, size, mask ) )
        --cpu_idx;
}

mapped_idx contains the position of the required bit. Set it in target mask and tell the OS to bind current thread with this mask:

cpu_set_t *target_mask = CPU_ALLOC( ncpus );
CPU_ZERO_S( size, target_mask );
CPU_SET_S( mapped_idx, size, target_mask );
const int err = sched_setaffinity( 0, size, target_mask );

if ( err ) {
    std::cout << "Failed to set thread affinity!\n";
    exit( EXIT_FAILURE );
}
#if LOG_PINNING
else {
    std::stringstream ss;
    ss << "Set thread affinity: Thread " << thr_idx << ": CPU " << mapped_idx << std::endl;
    std::cerr << ss.str();
}
#endif
CPU_FREE( target_mask );

If we gather all the parts together, we get the pinning_observer implementation:

class pinning_observer: public tbb::task_scheduler_observer {
    cpu_set_t *mask;
    int ncpus;

    const int pinning_step;
    tbb::atomic<int> thread_index;
public:
    pinning_observer( int pinning_step=1 ) : pinning_step(pinning_step), thread_index() {
        for ( ncpus = sizeof(cpu_set_t)/CHAR_BIT; ncpus < 16*1024 /* some reasonable limit */; ncpus <<= 1 ) {
            mask = CPU_ALLOC( ncpus );
            if ( !mask ) break;
            const size_t size = CPU_ALLOC_SIZE( ncpus );
            CPU_ZERO_S( size, mask );
            const int err = sched_getaffinity( 0, size, mask );
            if ( !err ) break;

            CPU_FREE( mask );
            mask = NULL;
            if ( errno != EINVAL )  break;
        }
        if ( !mask )
            std::cout << "Warning: Failed to obtain process affinity mask. Thread affinitization is disabled." << std::endl;
    }

    /*override*/ void on_scheduler_entry( bool ) {
        if ( !mask ) return;

        const size_t size = CPU_ALLOC_SIZE( ncpus );
        const int num_cpus = CPU_COUNT_S( size, mask );
        int thr_idx =
#if USE_TASK_ARENA_CURRENT_SLOT
            tbb::task_arena::current_slot();
#else
            thread_index++;
#endif
#if __MIC__
        thr_idx += 1; // To avoid logical thread zero for the master thread on Intel(R) Xeon Phi(tm)
#endif
        thr_idx %= num_cpus; // To limit unique number in [0; num_cpus-1] range

        // Place threads with specified step
        int cpu_idx = 0;
        for ( int i = 0, offset = 0; i<thr_idx; ++i ) {
            cpu_idx += pinning_step;
            if ( cpu_idx >= num_cpus )
                cpu_idx = ++offset;
        }

        // Find index of 'cpu_idx'-th bit equal to 1
        int mapped_idx = -1;
        while ( cpu_idx >= 0 ) {
            if ( CPU_ISSET_S( ++mapped_idx, size, mask ) )
                --cpu_idx;
        }

        cpu_set_t *target_mask = CPU_ALLOC( ncpus );
        CPU_ZERO_S( size, target_mask );
        CPU_SET_S( mapped_idx, size, target_mask );
        const int err = sched_setaffinity( 0, size, target_mask );

        if ( err ) {
            std::cout << "Failed to set thread affinity!\n";
            exit( EXIT_FAILURE );
        }
#if LOG_PINNING
        else {
            std::stringstream ss;
            ss << "Set thread affinity: Thread " << thr_idx << ": CPU " << mapped_idx << std::endl;
            std::cerr << ss.str();
        }
#endif
        CPU_FREE( target_mask );
    }

    ~pinning_observer() {
        if ( mask )
            CPU_FREE( mask );
    }
};

The following code snippet demonstrates one of the simplest implementations of the π calculation with the help of parallel_reduce from Intel® TBB:

template <typename R, typename S>
R tbb_pi( S num_steps )
{
    const R step = R(1) / num_steps;
    return step * tbb::parallel_reduce( tbb::blocked_range<S>( 0, num_steps ), R(0),
        [step] ( const tbb::blocked_range<S>& r, R local_sum ) -> R {
            for ( S i = r.begin(); i < r.end(); ++i ) {
                R x = (i + R(0.5)) * step;
                local_sum += R(4) / (R(1) + x*x);
            }
            return local_sum;
        },
        std::plus<R>()
    );
}

Of course it is not the best algorithm (in terms of performance or accuracy). Moreover, there is no need to calculate π at all, since more than 10 trillion (10^13) digits of π have already been calculated [11]. But it is a good study example and can be useful as an overhead and scalability benchmark.

It is interesting to compare the performance of the example with pinning enabled and disabled. The Intel® Xeon Phi™ coprocessor has several hundred hardware threads, and creating hundreds of pthreads takes noticeable time; therefore I measured only the calculation time, without the time needed to create the threads. To be sure that all requested threads are created, the task_scheduler_observer can again be called on for help. I implemented a concurrency-tracker observer which counts how many threads have already started their work:

class concurrency_tracker: public tbb::task_scheduler_observer {
    tbb::atomic<int> num_threads;
public:
    concurrency_tracker() : num_threads() { observe(true); }
    /*override*/ void on_scheduler_entry( bool ) { ++num_threads; }
    /*override*/ void on_scheduler_exit( bool ) { --num_threads; }

    int get_concurrency() { return num_threads; }
};

And a special "warming" loop, which ensures that the requested number of threads is created, was added to the main function:

int main(int argc, char* argv[])
{
    const size_t N = 10L * 1000 * 1000 * 1000;
    const int threads = argc > 1 ? atoi( argv[1] ) : tbb::task_scheduler_init::default_num_threads();
    const bool use_pinning = argc > 2 ? atoi( argv[2] ) : false;

    tbb::task_scheduler_init init( threads );

    pinning_observer pinner( 4 /* the number of hyper threads on each core */ );
    pinner.observe( use_pinning );

    // Warmer
    concurrency_tracker tracker;
    while ( tracker.get_concurrency() < threads ) tbb_pi<double>( N );

    tbb::tick_count t0 = tbb::tick_count::now();
    const double pi = tbb_pi<double>( N );
    const double time = (tbb::tick_count::now()-t0).seconds();

    const double eps = 1e-10;
    const double PI = 3.1415926536;

    const double err = fabs(pi/PI-1.0);
    if ( err > eps ) {
        std::cout << "Error: " << err << std::endl;
        return -1;
    }

    std::cout << "Pi = " << pi << " Threads: " << threads << " Time: " << time << " sec. (use_pinning = " << use_pinning << ")" << std::endl;

    // Always disable observation before observers destruction
    tracker.observe( false );
    pinner.observe( false );

    return 0;
}

To compile the example, source the compiler environment script:

source /opt/intel/composer_xe_2013_sp1/bin/compilervars.sh intel64

And build the example to be executed natively on the Intel® Xeon Phi™ coprocessor:

icc -o pi.exe -mmic -std=c++11 -tbb -pthread -lrt pi.cpp

Before running the example, the binary and the dependent libraries should be copied to the Intel® Xeon Phi™ coprocessor:

scp pi.exe mic0:/tmp
scp /opt/intel/composer_xe_2013_sp1/tbb/lib/mic/libtbb* mic0:/tmp

And to run the example:

ssh mic0 LD_LIBRARY_PATH=/tmp /tmp/pi.exe [num_threads] [use_pinning]

I gathered calculation times on an Intel® Xeon Phi™ Coprocessor 7120X (16GB, 1.238 GHz, 61 cores) for all values of num_threads in the range [1;244] and use_pinning in the range [0;1]. The results are presented in the speedup chart and the efficiency chart (which is supposed to be the more useful of the two [12]):

[Chart: speedup and efficiency of the Pi example on the Intel® Xeon Phi™ coprocessor]

It turned out that the results with pinning enabled and disabled are the same. That should be expected, since this implementation of the π calculation is mainly a compute-bound problem which is almost independent of cache performance (which is what affinity is supposed to optimize [4]).

I carried out another experiment: I ran the example 100 times with the maximum number of threads, with pinning enabled and disabled, and calculated two statistics over the measured times: the expected value [13] (mean [14]) and the standard deviation [15]:

µ = (1/n) Σᵢ tᵢ,    σ = sqrt( (1/n) Σᵢ (tᵢ − µ)² )

where µ₀ and σ₀ denote the expected value and the standard deviation with pinning disabled, and µ₁ and σ₁ the same statistics with pinning enabled. As you can see, the expected values are almost the same, which matches the charts. The deviations are too small to estimate visually from the charts, but the statistics show that with pinning enabled the deviation is notably smaller – by about 17-18%.

In conclusion, a note of caution: although thread affinity can help improve application performance, it does not always do so; e.g. the π example demonstrated its independence from thread affinity. Moreover, if the machine layout is unclear, or it is unknown where OS daemons run, it is entirely possible to use pinning to force your threads to run in precisely the worst place. And if your application uses different parallel runtimes simultaneously, thread affinity will in most cases make their composability worse.

References:

  1. Intel® Threading Building Blocks (Intel® TBB)
  2. Intel® Threading Building Blocks (Intel® TBB) (open source)
  3. Wikipedia: Processor affinity
  4. Linux* Journal: Processor affinity
  5. OpenMP*
  6. OpenMP* Thread Affinity Control
  7. Intel TBB documentation: task_scheduler_observer
  8. Under the hood: Building hooks to explore TBB task scheduler
  9. Intel® Xeon Phi™ Product Family
  10. Fedora Manpages: SCHED_SETAFFINITY(2)
  11. Wikipedia: Pi
  12. Wikipedia: Speedup
  13. Wikipedia: Expected value
  14. Wikipedia: Mean
  15. Wikipedia: Standard deviation
  16. FAQs: Compiler

For more complete information about compiler optimizations, see our Optimization Notice.

Attachment: pi.cpp (6.19 KB)