Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors - Part 1: Optimization Essentials

Abstract

The Intel® Xeon Phi™ coprocessors are a new addition to the Intel family of processors and platforms, and the first products based on the Intel® Many Integrated Core (Intel® MIC) Architecture. The first Intel® Xeon Phi™ coprocessor has numerous cores with wide vector (SIMD) registers. Software running on these new processors must utilize the numerous cores as well as take advantage of the wide SIMD (vector) operations. This document introduces a process by which developers can tune their software to run faster on Intel® Xeon Phi™ coprocessors, and it provides links to other resources for more in-depth information.

Overview

Intel® Xeon Phi™ coprocessors offer high computational numerical performance. Getting that performance requires properly tuned software: it must be highly scalable, highly vectorized, and make efficient use of memory. This document leads developers through the basic process needed to tune and optimize software for Intel Xeon Phi coprocessors. Three basic factors most influence performance on Intel® Xeon Phi coprocessor systems: scalability, vectorization, and memory utilization.

Intel® Xeon Phi™ coprocessors

The list below highlights key features of the first Intel Xeon Phi product family, formerly known as “Knights Corner”:

  • 50+ cores which run the Intel instruction set architecture (commonly called the x86 or Intel Architecture instruction set)
  • 4 hardware threads per physical core
  • 512-bit registers for SIMD (vector) operations
  • 512 KB of L2 cache per core
  • A high-speed bi-directional ring connecting the 50+ cores

These physical cores are simpler than those in Intel® Xeon® processors. They have a dual in-order execution pipeline, in contrast to the out-of-order execution model on Intel Xeon processors. The 4 hardware threads per physical core help to mask the effects of latencies on the in-order instruction execution. Application scalability is important because applications may have 200+ active threads on the MIC system. The computational power comes from the wide 512-bit registers; high performance codes on the Intel Xeon Phi coprocessor will want to utilize these wide SIMD instructions to reach the desired performance levels. The best performance will only be achieved when the cores, threads, and SIMD (vector) operations are all used effectively.

The combined L2 cache, at 512 KB per core, provides over 25 MB of L2 cache. The bi-directional ring provides a higher throughput capacity than is available on standard Intel Xeon platforms. The main memory for the Intel Xeon Phi coprocessor resides on the same physical card as the coprocessor, and is completely separate from, and not synchronized with, the memory on the coprocessor’s host system. The Intel Xeon Phi coprocessor runs a Linux* OS on the card. Developers will find the OS and development tools to be standard fare for HPC and enterprise customers. Readers may find more information about the Intel Xeon Phi coprocessor architecture at: http://www.slideshare.net/IntelXeon/under-the-armor-of-knights-corner-intel-mic-architecture-at-hotchips-2012

Scalability – using many cores

Amdahl’s Law

This section reviews Amdahl’s Law and Gustafson’s Corollary, along with some basic tuning tips for scalability. For the purposes of this paper, consider scalability to be the ability to decrease runtime in inverse proportion to the number of processors. For example, perfectly scaling code will run on 1000 cores in 1/1000th of the time it takes to run on 1 core. That is the ideal case. A more realistic model is represented by Amdahl’s law, which can be expressed:

TP = TS( P/n + S) + OVH

Where:

  • TP is the time to execute in parallel
  • TS is the time to execute sequentially
  • P is the percentage of the sequential execution time spent in the regions that will be parallelized
  • n is the number of processors the work is distributed across (assume homogeneous processors)
  • S is the percentage of the sequential execution time spent in the regions that remain sequential
  • OVH is the overhead associated with setting up parallel tasks, synchronization costs or any other operations introduced to handle parallelism

Amdahl’s law is great for setting expectations. Let’s consider a case where P is 90% and S is 10%. Even if an infinite number of cores were applied to this fixed problem size, the maximum speedup will never exceed 10X because S is 10%. S must decrease below 10% in order to get more than a 10X speedup (and, of course, there must be more than 10 processors or cores). A developer needs to consider requirements and all resources (processor cores are not the only resource to count). If S is 10%, does this mean a developer should not consider more than 10 cores? No. What if TS is 100 seconds and the requirement is to complete the task in less than 13 seconds? 31 processors can achieve that performance target. Even if it isn’t using all 31 cores as efficiently as some may like, it meets the specified performance requirement, and that is satisfactory.

Most developers on Intel® Xeon Phi™ coprocessors have a large percentage of work in the parallel region P (and thus a smaller S). Different development projects, with different requirements and goals, preclude specifying a single minimum value for S. Each project team must define its goals and the necessary levels of scalability, then make some rough projections on what is reasonable to achieve. Many challenging high performance computing tasks are trying to solve problems that will not reside in memory on a single system. The aggregate resources of a system or cluster are required to solve the problem; this includes processors, cores, memory, interconnect, and disk space as well as time. In these cases TS may not even be a reasonable number to calculate, because running the whole problem serially just couldn’t be done. Many Intel Xeon Phi coprocessors will be part of large high-performance clusters.
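As a quick check of the arithmetic above, a few lines of C evaluate Amdahl’s law for TS = 100 seconds, P = 0.9, and S = 0.1, ignoring overhead:

#include <stdio.h>

int main(void)
{
    const double Ts = 100.0, P = 0.9, S = 0.1;   /* values from the example */
    for (int n = 1; n <= 64; n *= 2)
        printf("n = %2d  Tp = %6.2f s  speedup = %5.2fx\n",
               n, Ts * (P / n + S), 1.0 / (P / n + S));
    /* With n = 31: Tp = 100 * (0.9/31 + 0.1) = 12.9 s, meeting the 13 second target. */
    return 0;
}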

Gustafson’s Corollary

This leads to Gustafson’s corollary. John Gustafson pointed out that the problems solved in parallel are larger than those attempted on a sequential system: as n (the number of cores or processors) increases, the problem size increases, and as the problem size increases, the percentage of time spent in the parallel region (P) increases as well, so scaling improves. A shortcoming of Amdahl’s law is that it doesn’t consider changes in P and S as problem size increases. Frequently the goal is to solve a larger problem in parallel in the same amount of time it takes to solve a smaller problem sequentially. Gustafson observed that as problem size increases, the proportion of the work done in parallel increases at a faster rate than the amount of time spent in sequential regions. Thus as problem size increases, scaling improves, along with relative speedup and efficiency. For a simple case to illustrate this, consider LU matrix factorization with partial pivoting. As the matrix dimension n increases, the amount of memory required grows on the order of n², but the computation required grows on the order of n³, and this computation runs very well in parallel. Thus LU matrix factorization has become the most common parallel benchmark for supercomputers. So Amdahl’s law should not be ignored, but it should be understood that P may increase as problem size increases, and so the scaling may improve as well.

Scalable Optimization Tips: granularity, balance, barriers, and false sharing

Granularity

If the problem size remains constant as the number of cores increases, the amount of work for each core decreases, becoming smaller and smaller (we refer to this as “granularity”). If overhead remains constant for each unit of work, the overhead will make up a larger fraction of the runtime for smaller units of work. Thus, as the grain size decreases, the efficiency decreases. One way to check for this on a sequential system is to think about the problem size and the amount of work that will be done in parallel. Consider a smaller sequential problem whose work matches the amount of work that will be done per core for the planned parallel workload. Determine the ratios using Amdahl’s law and consider whether it will scale or not.

Workload balance

On Intel MIC Architecture, processes are likely to have 200+ active threads. If all of the threads are issuing and executing SIMD instructions, things are very efficient. But if a small group of threads has more computation to complete and the remaining threads must wait idly for these few threads to catch up, the overhead and inefficiency increase. In other words, the application fails to scale to use the available resources. One imbalanced task assignment can cause many threads or processes to sit idle and thus decrease system performance. Some code contains numerous for-loops with a fixed iteration count where each iteration performs the exact same computation. These cases, ideally parallel or embarrassingly parallel, usually achieve good workload balance, and OpenMP* constructs handle them well. For other cases, the work-stealing attributes of Intel® Cilk™ Plus parallel extensions or Intel® Threading Building Blocks may be able to achieve a better workload balance. If a developer sees poor workload balance across an application, the developer may want to explore techniques for improving it. If you are not familiar with Intel Cilk Plus technology or Intel Threading Building Blocks, now is a good time to look them up. See www.threadingbuildingblocks.org or http://software.intel.com/en-us/intel-cilk-plus-archive for more information.
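For loops whose iterations vary in cost, OpenMP’s dynamic schedule is one simple way to improve balance. A minimal sketch follows; the per-iteration work here is just a stand-in for a computation whose cost varies with i.

#include <omp.h>

void process_rows(double *rows, int n)
{
    /* schedule(dynamic, 16) hands out chunks of 16 iterations to threads as
       they finish, so a few expensive iterations do not leave the remaining
       threads idle at the end of the loop. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i) {
        rows[i] = rows[i] * rows[i];   /* stand-in for uneven work */
    }
}

The chunk size is a tuning knob: larger chunks lower scheduling overhead, smaller chunks balance better.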

Barriers

Any fork, join, mutex, lock, or barrier potentially reduces efficiency. It is best to use only the locks/barriers or controls necessary to ensure there are no data races in the code. Don’t remove barriers at the risk of introducing data races, though. Using products such as Intel® Parallel Advisor XE 2013 on a desktop system can help model parallelizing a sequential region and identify shared data before committing to a threading model. Once code is threaded, Intel® Inspector XE can help find common data races and memory issues. Remember that synchronizing 16 threads at a barrier takes less time than synchronizing 200+ threads. As your application scales to run on hundreds of threads, it is particularly important to minimize or coalesce global barriers and locks. Sometimes locks can be eliminated by giving each thread a local copy of essential data and synchronizing it only at certain points, rather than locking on each reference to the global address. You may also want to reconsider the type of lock you are using. For example, if many threads access some data but rarely modify it, a general purpose mutex may have been just fine with only 2 to 16 threads, but it may not work well for 200+ threads. In this case, a reader-writer lock may be appropriate: it allows multiple threads to access the data concurrently when they are only reading it, while updates or modifications (writers) are rare. Intel Threading Building Blocks includes reader-writer locks as well as user-space optimized barriers and controls.
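As a sketch of that reader-writer pattern, assuming Intel Threading Building Blocks’ spin_rw_mutex; the shared table and accessor routines are hypothetical.

#include "tbb/spin_rw_mutex.h"

tbb::spin_rw_mutex table_mutex;      /* protects shared_table */
double shared_table[1024];           /* hypothetical shared, mostly-read data */

double read_entry(int i)
{
    /* false = reader lock: many threads may hold it at the same time */
    tbb::spin_rw_mutex::scoped_lock lock(table_mutex, false);
    return shared_table[i];
}

void update_entry(int i, double v)
{
    /* true = writer lock: exclusive, taken only on the rare update path */
    tbb::spin_rw_mutex::scoped_lock lock(table_mutex, true);
    shared_table[i] = v;
}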

System calls are barriers that many developers frequently overlook. The two most common unintended calls that impact scalability are malloc and gettimeofday. Calls to malloc encounter internal locks that serialize its callers. If a thread allocates a large block of memory and then operates on it for a long time, this overhead is not as critical. Applications that make many calls to malloc will be well served by a more efficient memory allocator. Intel Threading Building Blocks includes memory allocation calls that scale extremely well for this purpose, and other third-party memory allocation libraries also do better than standard malloc. In the second example, when 200+ threads call gettimeofday at the same time, it may also behave like a sequential region. Have only one thread call gettimeofday and use local timers for timing within a thread. (gettimeofday can be configured to use a local core counter or a global processor counter; this tip assumes it is using a global processor counter rather than the local core TSC.) Consider other system calls and libraries that may act as barriers and use an alternative or minimize their usage.
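A minimal sketch of the first remedy, assuming the Intel Threading Building Blocks scalable allocator (link with -ltbbmalloc); the buffer size and its use are illustrative.

#include "tbb/scalable_allocator.h"

void worker(void)
{
    /* scalable_malloc maintains per-thread memory pools, so concurrent
       allocations do not serialize on the single lock inside standard malloc. */
    double *buf = (double *) scalable_malloc(1024 * sizeof(double));
    /* ... per-thread work using buf ... */
    scalable_free(buf);
}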

False Sharing

One more item to consider is false sharing. This happens when two different cores read and write adjacent data in the same cache line. For example:

float a[32];
#pragma omp parallel for num_threads(32)
for (int i = 0; i < 100000; ++i)
    a[omp_get_thread_num()] += 1.0f;

In the above example, assume that the array starts at a cache line boundary and that each thread runs on a different core (this can be accomplished using the compiler’s OpenMP* API extensions for affinity; see the compiler documentation for details). Then every core will access a different element, so there is no true sharing. However, since a cache line consists of 16 four-byte floats, 16 of the cores will be accessing one cache line, and the other 16 cores will be accessing the second cache line. This results in terrible performance, because the cache lines continually move among the caches of different cores as one core after another tries to update its data element. False sharing is usually fixed either by padding to cache line boundaries or by using private variables.
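One hedged sketch of the padding fix for the loop above: each counter is placed in its own 64-byte cache line so that no two threads write to the same line. A thread-private local accumulator written back once at the end is often an even better remedy.

struct padded_counter {
    float value;
    char pad[64 - sizeof(float)];    /* fill the rest of the cache line */
};

__declspec(align(64)) struct padded_counter a[32];

#pragma omp parallel for num_threads(32)
for (int i = 0; i < 100000; ++i)
    a[omp_get_thread_num()].value += 1.0f;   /* each thread owns a full line */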

Parallel Programming References:

The following books are good resources to learn more about parallel programming.

Intel Guide for Developing Multithreaded Applications - http://software.intel.com/en-us/articles/intel-guide-for-developing-multithreaded-applications

Structured Parallel Programming: Patterns for Efficient Computation by Michael McCool, James Reinders, Arch Robison, published by Morgan Kaufmann, 2012

The Art of Concurrency: A Thread Monkey’s Guide to Writing Parallel Applications by Clay Breshears, published by O’Reilly Media, 2009

Intel Threading Building Blocks tutorial (90 pages) www.threadingbuildingblocks.org – select documentation

Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism, by James Reinders, published by O’Reilly Media, 2007

Vectorization – using SIMD registers and operations

Array Notation

As mentioned earlier, one of the keys to the performance value of Intel Xeon Phi coprocessors is the set of 512-bit registers and associated SIMD operations. This section explains some approaches to tune software to vectorize, that is, to effectively utilize the wide SIMD operations.

One of the first things to do to tune existing code is to understand where the time is spent. The sections of code which consume the most compute cycles are the hotspots and are the places to focus tuning effort first. Intel® VTune™ Amplifier XE can collect clock ticks and map these back to the source code to show where most CPU cycles are spent executing. Focus on the hotspots identified by VTune Amplifier XE. Use the steps below to make sure that the optimizations are working well in these regions.

The best method to take advantage of the 512-bit wide SIMD instructions is to write code using an array notation style such as is available in Intel Cilk™ Plus or Fortran 90. When array notation is used, the compiler will vectorize, that is, utilize the SIMD instruction set. Fortran adopted an array notation syntax long ago, and Fortran developers are encouraged to become familiar with this feature of Fortran 90. Intel introduced an array syntax for C/C++ with Intel® Cilk™ Plus. In order to promote the adoption of methods for writing code that compilers can vectorize on different instruction sets, Intel opened the Cilk Plus specifications. Intel Cilk Plus features, including array notation and threading, are available in branches of the gcc 4.8 compiler.

Under Intel Cilk Plus, to reference an array or a segment of an array, the syntax is as follows: [<lower bound> : <length> : <stride>]. The lower bound is the index of the first element included in the operation, length is the number of elements involved, and stride is the distance between consecutive elements used (most commonly 1). If the entire array is operated on, the values may be omitted. So if A and B are two one-dimensional arrays of length n and c is a scalar, the following statements are equivalent:

A[:] = c * B[:];          // for (i = 0; i < n; i++) A[i] = c * B[i];
A[0:n:1] = c * B[0:n:1];  // for (i = 0; i < n; i++) A[i] = c * B[i];

The array notational syntax is the preferred method for ensuring the compiler will effectively utilize the SIMD operations on Intel MIC Architecture (or any other Intel platform).

Elemental functions

Just as Fortran allows for user-defined elemental functions, the Intel Cilk Plus parallel extensions to C/C++ also support user-defined elemental functions. An elemental function is a regular function which can be invoked on either scalar or array elements in parallel. When a developer declares an elemental function, the compiler generates two versions of the function: a normal scalar function of that name as well as a data parallel version that will be invoked when the function is called from a for loop or with vector input. An elemental function is defined by adding the attribute vector to its declaration, as shown in the example below. The function MyVecMult could be written as follows:

__attribute__((vector))   /* optional clauses may be added inside vector(...) */
double MyVecMult(double a, double b)
{ return a * b; }

It can be invoked like this:

for (i = 0; i < n; i++) c[i] = MyVecMult(a[i], b[i]);

or

c[:] = MyVecMult(a[:], b[:]);

Additional vector clauses may be added to specify vector length and other hints. The use of user-defined elemental functions is another technique developers may use to express operations that can take advantage of SIMD instructions. There are restrictions on what a user-defined elemental function may do (no switch statements, for loops, goto statements, etc.; see http://software.intel.com/sites/default/files/article/181418/whitepaperonelementalfunctions.pdf for more details). The Intel compiler documentation also includes more information on array notation with Intel Cilk Plus.

Directives and Pragmas

Most software does not need to be rewritten in array notation to benefit from vectorization. The Intel compiler automatically vectorizes many for loops and constructs. However, it is typically insufficient or inadvisable to rely solely on auto-vectorization; developers should be prepared to help the compiler generate efficient vector code. In these cases the addition of pragmas or directives can provide the compiler with sufficient information to vectorize the code. Before making code changes, collect performance data using Intel VTune Amplifier XE and focus your efforts on the hotspots it identifies. In addition, turn on vector reports when compiling (I typically use -vec-report3). Look at the compiler report for the source files with identified hotspots and verify that the hotspots are being vectorized. If not, consider rewriting them with array notation or adding pragmas or directives to help the compiler vectorize efficiently.

The first step is to make sure the compiler vectorizes all of the identified hotspots, if possible. The simplest pragma to add above a for-loop is #pragma ivdep or cDIR$ IVDEP (note the “c” here represents the Fortran comment character in fixed and tab-source forms; free form syntax uses “!”). This informs the compiler to ignore potential or assumed pointer dependencies (literally, “Ignore Vector DEPendence”). Only use this where you know the pointers always dereference independent memory areas. A commonly used pragma/directive is #pragma simd or cDIR$ SIMD. This pragma/directive tells the compiler to ignore all conflicts and produce code with SIMD operations if at all possible. Because the developer is instructing the compiler to ignore conflicts it cannot disambiguate, the developer should be certain about potential dependencies in the code before using this directive or pragma.
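As an illustration, a small sketch of the ivdep pragma; the routine and array names are made up. The compiler cannot prove that dst and src never overlap, so without help it may refuse to vectorize the loop.

void scale(float *dst, const float *src, float c, int n)
{
    /* The programmer asserts that dst and src reference independent memory.
       #pragma simd in the same position is a stronger request: vectorize
       regardless of the compiler's own dependence analysis. */
    #pragma ivdep
    for (int i = 0; i < n; ++i)
        dst[i] = c * src[i];
}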

Efficient vectorization is important, too. Giving the compiler more information can help it generate far better vectorized code. So after verifying that the compiler reports a section is vectorized, check that it is doing so efficiently. Check that the code is not using split loads or stores and that it avoids gather/scatter operations (unless the data layout genuinely requires gather/scatter, as many sparse matrix codes do). Vector alignment is very important. When declaring arrays, make sure they are aligned on 64-byte boundaries, for example: __declspec(align(64)) float A[1000]. In Fortran, use the cDIR$ ATTRIBUTES ALIGN directive. For dynamic memory allocation in C you can use _mm_malloc(). When you pass pointers into functions or routines, you may place #pragma vector aligned immediately before a loop to let the compiler know all the pointers in that loop are aligned. If this holds for all loops in a routine, instead of inserting the pragma before each loop you can call __assume_aligned() for each pointer, and the compiler will know it is aligned for all loops in the routine. In Fortran there is the cDIR$ ASSUME_ALIGNED directive. Please see the compiler documentation for use of these commands.
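A brief sketch of these alignment hints; the array and routine names are illustrative, and the data passed in are assumed to be allocated on 64-byte boundaries.

__declspec(align(64)) float A[1000];   /* statically allocated, 64-byte aligned */

void scale_pragma(float *x, float *y, float c, int n)
{
    /* Promise that every pointer used in the next loop is aligned, so the
       compiler can emit aligned vector loads and stores. */
    #pragma vector aligned
    for (int i = 0; i < n; ++i)
        y[i] = c * x[i];
}

void scale_assume(float *x, float *y, float c, int n)
{
    /* Alternative: assert alignment once; it then holds for every loop
       in the routine. */
    __assume_aligned(x, 64);
    __assume_aligned(y, 64);
    for (int i = 0; i < n; ++i)
        y[i] = c * x[i];
}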

When two nested loops have short loop counts, the compiler can frequently do a better job if the two nested loops are merged into a single loop. Although some developers perform this transformation themselves, it is not necessary, as the same effect can be achieved with directives/pragmas. The pragma #pragma unroll_and_jam (cDIR$ UNROLL_AND_JAM) asks the compiler to unroll the outer loop and jam (fuse) the resulting copies of the inner loop together, while #pragma nounroll_and_jam (cDIR$ NOUNROLL_AND_JAM) disables that transformation.
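For illustration, the manual version of that merge on two short nested loops; the dimensions are made up. The unroll_and_jam directives let the compiler apply a comparable transformation without rewriting the source.

#define M 8
#define K 8

void add_nested(float a[M][K], float b[M][K])
{
    /* Original form: two nested loops with short trip counts. */
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < K; ++j)
            a[i][j] += b[i][j];
}

void add_merged(float a[M][K], float b[M][K])
{
    /* Hand-merged form: one loop of M*K iterations over the same contiguous
       data, giving the vectorizer a longer trip count to work with. */
    float *pa = &a[0][0], *pb = &b[0][0];
    for (int ij = 0; ij < M * K; ++ij)
        pa[ij] += pb[ij];
}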

Memory and caching

Addressing and prefetching

Code will run best when data are accessed in sequential address order from memory. Frequently developers will change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays (AoS to SoA).
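A small sketch of that transformation; the field names and element count are illustrative.

/* Array of structures (AoS): the x, y, z of one point are adjacent, so a loop
   over all x values strides through memory. */
struct PointAoS { float x, y, z; };
struct PointAoS points_aos[1024];

/* Structure of arrays (SoA): all x values are contiguous, giving the
   unit-stride access that vectorizes and prefetches well. */
struct PointsSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};
struct PointsSoA points_soa;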

Access to data is always important, and prefetching minimizes delays waiting for the data needed to perform computations. The Intel compiler automatically prefetches data for loops it vectorizes. If your loop has a calculable memory access pattern that may not be clear at compile time, your code may benefit from explicit prefetching. This can be done with pragmas/directives or with intrinsics. Remember that if a developer chooses to insert prefetch instructions in a for loop, the prefetches typically should target a future iteration of the loop, not the current iteration. The developer is directed to the Intel compiler documentation for details on implementation.
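As an illustration, a hedged sketch of explicit prefetching with the _mm_prefetch intrinsic in a loop whose indirect access pattern is known to the program but not to the compiler; the routine name and the prefetch distance are illustrative and would need tuning.

#include <xmmintrin.h>

void gather_scale(float *out, const float *in, const int *idx, int n)
{
    const int DIST = 16;   /* prefetch this many iterations ahead */
    for (int i = 0; i < n; ++i) {
        /* Request the element a future iteration will need, not the current one. */
        if (i + DIST < n)
            _mm_prefetch((const char *) &in[idx[i + DIST]], _MM_HINT_T0);
        out[i] = 2.0f * in[idx[i]];
    }
}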

Blocking or Tiling

Code runs faster when data are reused while they are still in the processor registers or the processor cache. It is frequently possible to block or tile operations so that data are reused before they are evicted from cache. Two examples below using array notation illustrate this.

Example 1 – not blocking for data reuse for large N:

A[0:N]=B[0:N]+C[0:N];
D[0:N]=E[0:N]+A[0:N];

Example 2 – Access to vector A is blocked or tiled for reuse:

#define VLEN 4
for(i=0;i<N;i+=VLEN){
    A[i:VLEN]=  B[i:VLEN]+C[i:VLEN];
    D[i:VLEN]=  E[i:VLEN]+A[i:VLEN];
}

A small set of applications achieve a speedup greater than the number of cores on Intel Xeon Phi coprocessors. Typically these applications are trivially parallel (also known as embarrassingly parallel), have very little synchronization, and have a data set that fits into the cache on the Intel Xeon Phi coprocessor. Such workloads may not fit into cache on, say, systems based on the Intel microarchitecture code named Sandy Bridge, and so suffer more from delays retrieving data from main memory. These cases are not common, but developers should enjoy this capability when it applies. Tiling or blocking for data reuse is beneficial even if you cannot always remain entirely in cache.

Some applications may benefit from large page sizes. Developers are encouraged to refer to the event-based tuning guide to determine which data to collect in order to decide whether an application may benefit from large page sizes.

Bandwidth

Applications which are bandwidth-bound can run faster on Intel Xeon Phi coprocessor systems. In such cases the speedup is not a ratio of the number of additional cores; it is more closely related to the total aggregate bandwidth available on Intel Xeon Phi coprocessors, which exceeds the bandwidth of the current class of Intel Xeon processors (based on the microarchitecture code named Sandy Bridge). Good memory access patterns (unit stride) and prefetching help to maximize the achieved memory bandwidth.

Summary

The Intel Xeon Phi coprocessors introduce wider SIMD registers, associated SIMD vector operations, and 50+ cores. Developers will find it worthwhile to tune their software for this new coprocessor. The tuned code will also run well on Intel Xeon host platforms, allowing developers to maintain a single code base. The steps listed in this document are an introductory approach to tuning software. Those desiring to go deeper into optimization and understanding performance are encouraged to read and follow Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors, Part 2: Understanding and Using Hardware Events, which takes an event-based approach to tuning and goes into a more detailed description of important tuning steps.

For more complete information about compiler optimizations, see our Optimization Notice.