Memory Management for Optimal Performance on Intel® Xeon Phi™ Coprocessor: Alignment and Prefetching

In this article, we explore the memory management subsystem, specifically alignment and prefetching, and the way it relates to application performance optimization. We start with the basic architectural features of the Intel® Xeon Phi™ coprocessor’s (the coprocessor) memory hierarchy and its performance characteristics. Next we discuss memory alignment and how aligned memory is achieved either dynamically or statically. We then take a detailed look at different techniques for inserting prefetch instructions in application code, including the compiler’s default behavior, user-specified prefetch directives, and prefetch intrinsic functions. Finally, we look at the streaming non-temporal store instructions and discuss how to use them to optimize memory bandwidth.

1) Memory Architecture

Memory Hierarchy
Memory management on Intel Xeon Phi coprocessors closely resembles that of Intel® Xeon® processors. At the center of the memory architecture is a two-level, hardware-managed, coherent cache hierarchy organized around a ring style interconnect.

Figure 1: Memory Hierarchy for Intel® Xeon Phi™ Coprocessor

The top of the memory hierarchy is the coprocessor cores and their vector registers. Each core is an in-order Intel® Architecture core with a completely redesigned vector processing unit (VPU) that operates on 512-bit-wide vector data, enough to hold 16 single precision floating point numbers. Each core supports 4 execution contexts (threads), and each thread has its own set of 32 vector registers. Ideally, the majority of the floating point operations in an application should be carried out by the VPU, due to its high throughput, instead of the legacy x87 unit.

To facilitate low-latency, high-speed access to memory, the L1 cache is integrated directly into the core. The L1 data and instruction caches are the second level in the coprocessor’s memory hierarchy. The L1 data cache holds 32 KB and has a 3-cycle access time. Together with the coprocessor’s VPU load and store instructions, the L1 cache behaves like an extended register file, which significantly improves the performance of many algorithms.

When a coprocessor core references a memory location that is not in the L1 cache, it goes to the next level in the memory hierarchy: first the core’s local L2 cache, and then the L2 caches of all the other cores via the ring interconnect. Each core’s local L2 cache has a capacity of 512 KB, and across the interconnect the total available can be as high as 31 MB. Cache coherence is automatically maintained among the L2 caches of all the cores, effectively creating a virtual 31 MB cache for the application programmer. The L2 cache is composed of 64-byte cache lines with 8-way associativity and has an access time of approximately 11 cycles; accessing the L2 cache of a remote core takes longer than accessing the core’s own L2 cache. Align data accesses so that related data falls within a single cache line, avoiding cache-line splits. This practice can significantly improve application performance.

Cache Control Instructions and Hardware Prefetcher
One important feature of the Intel Xeon Phi coprocessor is that it provides explicit cache control instructions and instruction modes for cache optimization hints. These include instructions to prefetch data into the L1 or L2 caches and instruction modes to reduce the priority of a cache line. For example, streaming data typically sweeps existing data out of a cache, so the coprocessor can mark each streaming cache line for early eviction after it is accessed. These cache control instructions also allow the L2 cache to be used much like a scratchpad memory while remaining fully coherent.

The coprocessor’s L2 cache has a streaming hardware prefetcher that can selectively prefetch code, read, and read-for-ownership cache lines into the L2 cache. It supports 16 streams, each of which can bring in up to a 4-KB page of data. The prefetcher observes the L2 cache miss pattern and, upon detecting a pattern, starts issuing prefetch requests to memory.

Prefetch instructions and the L2 hardware prefetcher work together to improve application performance by drastically reducing cache misses during program execution. By default, the Intel® compiler analyzes memory access patterns and inserts L1 and L2 prefetch instructions into the executables it generates. For most applications, this improves performance by converting cache misses into cache hits. Because those accesses then hit in the cache, the L2 hardware prefetcher is not trained on them. For the memory accesses the compiler misses, as indicated by a high level of cache misses, the L2 hardware prefetcher is trained and helps reduce L2 cache misses.

Streaming Store and Cache Line Eviction Instruction
New with the Intel Xeon Phi coprocessor are the streaming non-temporal store instructions. These special vector instructions operate on data whose length is a multiple of the coprocessor’s L2 cache line size and whose starting address is cache-line aligned. A streaming non-temporal store simply allocates a cache line in the L2 cache and overwrites its entire contents; because the whole line is written, no data needs to be read from memory first, which is what saves bandwidth.

Any memory instruction that operates on a location that will not be reused in the immediate future is a non-temporal memory access. There is no benefit in caching non-temporal data and keeping it in the cache until the least-recently-used policy catches it. Evicting those cache lines sooner frees up lines that can be used for caching more important data. The coprocessor provides special instructions for evicting cache lines from the L1 and L2 caches (clevict0 and clevict1). The Intel compiler generates these instructions for loops that demonstrate non-temporal behavior. Generating cache line eviction instructions for loops on the coprocessor is easy because its vector length equals the cache line size, which means that one cache line can be evicted per vector loop iteration.

In the next few sections, we will look at a few memory-related optimization techniques that can help application developers achieve higher performance on Intel Xeon Phi coprocessors.

2) Memory Alignment: Requirements and Usage

One of the most effective ways of working with the coprocessor memory subsystem is to understand and adhere to its memory alignment requirements. Vector memory instructions on the coprocessor must operate on aligned addresses to avoid faults. Lack of appropriate memory alignment can result in performance degradation and, in serious cases, program failure. For floating point vector instructions, the alignment requirement depends on the instruction operands; for example, a packed operation requires 64-byte alignment. Misaligned scalar operations are allowed, but incur an 8-10 cycle penalty. Some instructions, such as gather/scatter and the vloadunpack/vpackstore unaligned load/store instructions, only require element alignment.

Application programmers can use the compiler-provided language extensions or the runtime-library API to declare static or dynamically allocated memory with the appropriate memory alignment.

Language Extensions

In C/C++, __attribute__((align(n))) can be placed in front of any statically allocated variable to request alignment at an n-byte boundary. For example, the following statement specifies that darray consists of 8 elements aligned on a 64-byte boundary.

__attribute__((align(64))) double darray[] = {0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0};

The alignment rule also applies to global static variables and local static variables. The storage duration will, of course, be different.

You can request specific alignments for scalar variables as well. However, by default, the compiler always aligns a scalar variable on the natural boundary of its data type. If the value specified in align(n) is less than the compiler’s default alignment for the data type, it has no effect. In other words, data is aligned to the larger of its natural alignment and the alignment specified with __attribute__((align(n))).

The Intel Compiler also supports the older syntax __declspec(align(n)), which has the same semantics as __attribute__((align(n))).

Runtime-Library APIs

To dynamically allocate a piece of aligned memory, use posix_memalign, which is supported by GCC as well as the Intel Compiler. Its benefit is that you don’t have to change the memory deallocation call: you can use free() as you always do. But pay attention to the parameter order:

  int posix_memalign (void **memptr, size_t align, size_t size);

The Intel Compiler also provides another set of memory allocation APIs. C/C++ programmers can use _mm_malloc and _mm_free to allocate and free aligned blocks of memory. For example, the following statement requests a 64-byte aligned memory block for 8 floating point elements.

  farray = (float *)_mm_malloc(8*sizeof(float), 64);

Memory that is allocated using _mm_malloc must be freed using _mm_free. Calling free on memory allocated with _mm_malloc or calling _mm_free on memory allocated with malloc will result in unpredictable behavior.

Intel® Threading Building Blocks
Intel® Threading Building Blocks (Intel® TBB) provides a set of aligned memory allocation routines that scale with the number of processor cores (the memory pool is managed independently for each thread, instead of via a single process-wide global lock).

void *scalable_aligned_malloc(size_t size, size_t align);
void  scalable_aligned_free(void *ptr);
void *scalable_aligned_realloc(void *ptr, size_t size, size_t align);

These memory allocation routines have exactly the same signatures as the _mm_* routines. You must use scalable_aligned_free to free memory returned by scalable_aligned_malloc. To use the memory allocation routines in Intel TBB, include "tbb/tbb_allocator.h" in the source file and link dynamically against the Intel TBB scalable memory allocator library (tbbmalloc).

Aligning Data within Data Structures

Data alignment can also be used to optimize cache line usage. By clustering small objects that are commonly used together into a struct and forcing the struct to be allocated at the beginning of a cache line, you can effectively guarantee that all objects are loaded into the cache as soon as any one object is accessed, resulting in a significant performance advantage.

You cannot adjust the alignment of a parameter or field within a struct or class. You can, however, specify the alignment of a struct (or union or class), in which case every object of that type is affected.

As an example, suppose that a function uses local variables i and j as subscripts into a 2-dimensional array. They might be declared as follows:
int i, j;

These variables are commonly used together, but they may end up in different cache lines, which could degrade performance. Instead, declare them as:
__attribute__((align(16))) struct { int i, j; } ind;

By placing the __attribute__((align(16))) before the keyword struct, you are requesting the appropriate alignment for all objects of that struct type. The compiler now ensures that the objects are allocated in the same cache line. In C++, you can omit the struct variable name. In C, however, it is required, and you must write references to i and j as ind.i and ind.j.

If you use many functions with such subscript pairs, it is more convenient to declare and use a struct type for them, as in the following example:
  typedef struct __attribute__((align(16))) { int i, j; } ind;

3) Prefetching Instructions

As we mentioned earlier, another optimization technique to achieve high performance in the memory subsystem is to use prefetch instructions to ensure data is nearby while being accessed. Prefetching data on the coprocessor can be as simple as letting the compiler automatically generate prefetch instructions, giving hints to the compiler in the form of compiler pragmas and directives, or passing the prefetch distance via compiler invocation switches. When everything else fails, you can manually insert the prefetch instructions in your code.
Let’s first look at the architecture for prefetch instructions.

L1 and L2 Prefetching
The Intel Xeon Phi coprocessor provides several types of software prefetch instructions. The L2 prefetch instructions, such as vprefetch1, bring a 64-byte line of data from memory into L2. The L1 prefetch instructions, such as vprefetch0, bring the data further into the L1 cache. The cache line allocated for prefetched data is put into the shared coherence state of the MESI protocol. Other variants, such as vprefetche0 and vprefetche1, mark the cache line as exclusive in the tag directory.

Instruction      Cache Level   Non-temporal   Bring as exclusive
vprefetch0       L1            No             No
vprefetchnta     L1            Yes            No
vprefetch1       L2            No             No
vprefetch2       L2            Yes            No
vprefetche0      L1            No             Yes
vprefetche1      L2            No             Yes
vprefetche2      L2            Yes            Yes
vprefetchenta    L1            Yes            Yes

The syntax for prefetch instructions is:
  vprefetch1 | vprefetch2 ptr hints

The hints field is an immediate control that indicates the kind of data prefetch being performed. It can be one of the following:

  • Exclusive: Loads the data for modification
  • Non-temporal: Specifies data that is not expected to be re-used
  • Miss hint: Indicates that the target cache line is not expected to be in the cache

While the L1 prefetch instruction supports all three hints, the L2 prefetch instruction only supports the exclusive and non-temporal hints.

The L1 and L2 prefetch instructions are considered to be microarchitecture “hints”; the hardware may either defer or drop the operation. If the requested cache line containing the specified address is already in the L1 or L2 data cache, the prefetch is dropped. Similarly, any attempt to prefetch uncacheable objects is ignored.

Compiler-based Prefetching
Prefetching makes a huge difference in loop constructs where memory access is repetitive and intensive. Within loops, temporal locality in L1 and register promotion can hide the latency of scalar memory accesses. Aggregate data such as arrays, on the other hand, relies on prefetch instructions to hide memory latency. To accomplish this, the compiler typically attempts to prefetch data a few iterations ahead of the target accesses; in other words, it tries to ensure that the array location accessed in the current iteration was prefetched into the cache hierarchy a few iterations earlier. The number of loop iterations into the future that the compiler prefetches is known as the prefetch distance.

The compiler uses two factors to calculate the prefetch distance: the latency of the prefetch instruction and the latency of the loop body. The latency of a prefetch instruction has to account for the load on the memory system as well as the minimum latency of accessing the target level in the memory hierarchy. The latency of a loop is the accumulated latency of all the instructions in the loop, including any inner loop with known loop bounds. Any hint from the source code about the trip count of the inner loop is taken into account in the prefetch distance calculation.

The optimal prefetch happens when the prefetched line arrives just before the memory access takes place, which makes the prefetch distance calculation critical. Undershooting this optimal value exposes memory latency; overshooting it risks useful cache lines being evicted by the LRU policy before they are used.

By default, the compiler generates prefetch instructions for the coprocessor at optimization level -O2 and higher. The programmer can find out the prefetch distance chosen for each loop by using -opt-report-phase hlo and -opt-report 3. To turn off the compiler’s default prefetching, use either -opt-prefetch=0 or -no-opt-prefetch.

The programmer can also explicitly tell the compiler what prefetch distances to use with -opt-prefetch-distance=d1[,d0], where d1 is the L2 prefetch distance and d0 is the L1 prefetch distance. For example, -opt-prefetch-distance=64,32 tells the compiler to use 64 as the prefetch distance for memory-to-L2 prefetches and 32 as the prefetch distance for L2-to-L1 prefetches.

The following is sample output from compiling the Black-Scholes code, a popular benchmark in the financial services industry.

Total #of lines prefetched in main for loop at line 185=10
   # of initial-value prefetches in main for loop at line 185=5
   # of dynamic_mapped_array prefetches in main for loop at line 185=10, dist=35
Estimate of max_trip_count of loop at line 200=1024
Estimate of max_trip_count of loop at line 207=1048576000
Total #of lines prefetched in main for loop at line 207=12
   # of dynamic_mapped_array prefetches in main for loop at line 207=12, dist=8
Total #of lines prefetched in main for loop at line 247=8
   # of initial-value prefetches in main for loop at line 247=6
   # of dynamic_mapped_array prefetches in main for loop at line 247=8, dist=8

Prefetch Directives
Prefetch directives give programmers more options for specifying the prefetch hints and prefetch distance to any variables in their code. Prefetch directives have the following syntax:
#pragma prefetch var:hint:distance

The benefit of using prefetch directives is that the same code still compiles with compilers that do not support them (though such compilers will issue warnings about the unrecognized pragmas).

Here is an example in which the programmer tells the compiler:

  1. Not to prefetch for col and x
  2. To prefetch value for 12 iterations ahead into the L2 cache.
    for (i=i0; i!=i1; i+=is) {
      float sum = b[i];
      int ip = srow[i];
      int c = col[ip];
    #pragma noprefetch col
    #pragma prefetch value:1:12
    #pragma noprefetch x
      for (; ip<srow[i+1]; c=col[++ip])
        sum -= value[ip] * x[c];
      y[i] = sum;
    }

The following example again shows the programmer explicitly specifying the prefetch distance, this time into both the L1 and L2 caches. The L2 prefetch (vprefetch1) is specified for htab_p with a distance of 16 vectorized iterations ahead, and the L1 prefetch (vprefetch0) for htab_p with a distance of 6 vectorized iterations ahead. If the pragmas are not present, the compiler chooses both the L1 and the L2 prefetch distances based on its analysis of the loop’s memory access patterns.

void foo(int *htab_p, int m1, int N)
{
  int i, j;
  for (i=0; i<N; i++) {
#pragma prefetch htab_p:1:16
#pragma prefetch htab_p:0:6
    for (j=0; j<2*N; j++) {
      htab_p[i*m1 + j] = -1;
    }
  }
}

Prefetching Intrinsics
The Intel® C/C++ and Fortran compilers provide prefetch intrinsics as a language extension. The following example shows the use of prefetch intrinsics. Note that the “noprefetch” pragma does not affect any prefetches you explicitly include, such as _mm_prefetch.

#include <stdio.h>
#include <immintrin.h>
#define N 1000

int main(int argc, char **argv)
{
  int i, j, htab[N][2*N];
  for (i=0; i<N; i++) {
#pragma noprefetch // Turn off compiler prefetches for this loop
    for (j=0; j<2*N; j++) {
      _mm_prefetch((const char *)&htab[i][j+20], _MM_HINT_T1); // vprefetch1
      _mm_prefetch((const char *)&htab[i][j+2],  _MM_HINT_T0); // vprefetch0
      htab[i][j] = -1;
    }
  }
  printf("htab element is %d\n", htab[3][40]);
  return 0;
}

Measuring Prefetching Effectiveness
You can measure the effectiveness of prefetch instructions with the Intel® VTune™ Amplifier performance analyzer (VTune). VTune collects runtime behavior data from the Performance Monitoring Unit (PMU) based on the performance events you select; its sampling collector (SEP) returns the counter values for these events upon completion of the program.

Listed below are the counters that have been validated to measure the effectiveness of software prefetching.

L1 Prefetches = L1_DATA_PF1/time
L1 Prefetch Miss Ratio = L1_DATA_PF1_MISS/L1_DATA_PF1
L2 Prefetch Miss Ratio = L2_DATA_PF2_MISS/L2_DATA_PF2

4) Streaming Non-temporal Stores and Cache Eviction Instruction

Streaming non-temporal store instructions are specialized memory store instructions designed to optimize the utilization of memory bandwidth in cases where data with no temporal locality is being streamed into memory. Unlike regular stores, such store instructions do not perform a read-for-ownership (RFO) for the target cache line before the actual store. The rationale behind this is that any data read from memory for this purpose will not be used and will get overwritten by the data stream.

The compiler has heuristics to identify non-temporal loops fully automatically. When these heuristics identify a loop as non-temporal, the default behavior of the compiler is to generate NR.NGO stores and a fence (a lock instruction) after the loop to ensure safety. However, these fully automatic heuristics rarely trigger, because the compiler must know statically at compile time that the trip count of the target loop is large. The programmer can assist the compiler by manually marking loops as non-temporal using
   #pragma vector nontemporal
or by using the -opt-streaming-stores always compiler option, which declares all loops in the target compilation unit as non-temporal. The compiler will then generate NR.NGO instructions for these non-temporal loops and will not generate fences after them. In this case, the programmer is responsible for making sure that this use of NR.NGO instructions is safe.

The compiler also generates cache eviction instructions for loops identified as showing non-temporal behavior. Generating cache line eviction instructions for loops on the coprocessor is rather easy because its vector length is equal to the cache line size, which means that one cache line can be evicted per vector loop iteration. However, the cost of executing the cache eviction instructions may, in some cases, negate any benefit. The programmer can use -opt-streaming-cache-evict=x to control whether the cache eviction instructions are generated:

x = 0              use no cache eviction level
x = 1              use the L1 cache eviction level
x = 2              use the L2 cache eviction level
x = 3 (default)    use the L1 and L2 cache eviction levels

In the following example, the computation core of Black-Scholes takes 3 floating point values from the input arrays OptionYears, OptionStrike, and StockPrice and writes two values to the output arrays CallResult and PutResult. The two output arrays are the targets for streaming non-temporal stores. The programmer uses a pragma directive to highlight this opportunity:

#pragma simd vectorlength(CHUNKSIZE)
#pragma vector aligned
#pragma vector nontemporal (CallResult, PutResult)
for(int opt = chunkBase; opt < (chunkBase+CHUNKSIZE); opt++)
{
   float CNDD1;
   float CNDD2;
   float T = OptionYears[opt];
   float X = OptionStrike[opt];
   float S = StockPrice[opt];
   float rsqrtT = 1/sqrtf(T);
   float sqrtT = 1/rsqrtT;
   float d1SQRT1_2 = log2f(S / X) * RVLOG2ESQRT1_2 * rsqrtT + RVVSQRT1_2 * sqrtT;
   float d2SQRT1_2 = d1SQRT1_2 - VOLATILITYSQRT1_2 * sqrtT;
   CNDD1 = HALF + HALF*erff(d1SQRT1_2);
   CNDD2 = HALF + HALF*erff(d2SQRT1_2);
   float XexpRT = X*exp2f(RLOG2E * T);
   float CallVal = S * CNDD1 - XexpRT * CNDD2;
   float PutVal  = CallVal + XexpRT - S;
   CallResult[opt] = CallVal;
   PutResult[opt]  = PutVal;
}

In this particular case, executing the cache eviction instructions takes some time, which can negate most of the benefit, so the programmer uses -opt-streaming-cache-evict=0 to avoid them. The resulting assembly therefore contains vmovnrngoaps streaming-store instructions but no clevict0 or clevict1 instructions:

   .loc    1  229  is_stmt 1
        vpermf32x4 $238, %zmm2, %zmm3                           #229.20 c49
   .loc    1  231  is_stmt 1
        movq      48(%rsp), %rax                                #231.4 c49
   .loc    1  229  is_stmt 1
        vcvtps2pd %zmm2, %zmm6                                  #229.20 c53
        vcvtps2pd %zmm3, %zmm7                                  #229.20 c57
   .loc    1  230  is_stmt 1
        vmovnrngoaps %zmm6, (%r14,%r15,8)                       #230.4 c57
   .loc    1  229  is_stmt 1
        vfmadd213pd %zmm7, %zmm4, %zmm26                        #229.32 c61
   .loc    1  230  is_stmt 1
        vmovnrngoaps %zmm7, 64(%r14,%r15,8)                     #230.4 c61
   .loc    1  229  is_stmt 1
        vfmadd213pd %zmm6, %zmm16, %zmm27                       #229.32 c65

5) Conclusion

Understanding memory performance and the memory hierarchy is crucial to optimizing the performance of your application on any Intel® architecture, Multicore or Many Integrated Core. The memory hierarchy of the Intel Xeon Phi coprocessor bears a strong resemblance to that of the Intel Xeon host processors. However, the wider SIMD vector unit places stronger alignment requirements on the data it consumes. Software prefetching creates opportunities for higher cache utilization, resulting in more efficient use of the memory hierarchy. Streaming stores, together with explicit cache eviction instructions, further optimize the memory subsystem. When applied correctly, they can deliver higher application performance.

About the Author

Shuo Li

Shuo Li works in the Software and Services Group at Intel Corporation. His main interests are parallel programming and application software performance. In his recent role as a software performance engineer covering the financial services industry, Shuo works closely with software developers and modelers to help them achieve the best possible performance with their software solutions. Shuo holds a Master's degree in Computer Science from the University of Oregon and an MBA from Duke University.

Sudha Udanapalli Thiagarajan

Sudha Udanapalli Thiagarajan received a Bachelor’s degree in Computer Science and Engineering from Anna University Chennai, India in 2008 and a Master’s degree in Computer Engineering from Clemson University in May 2010. She joined Intel in 2010 and has been working as an enabling Application Engineer, focusing on optimizing applications for ISVs and developing collateral for Intel® Many Integrated Core architecture.


Intel, the Intel logo, Xeon, and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.