effect of using array alignment

Hi,

I am trying to figure out the effect of using array alignment in vectorization of MIC code.

Here is a simple piece of offload code from Intel

  #pragma offload target(mic:cardId)
  #pragma omp parallel for private(j,k)
  for (i=0; i<numthreads; i++)
  {
    int offset = i*LOOP_COUNT;
    for (j=0; j<MAXFLOPS_ITERS; j++)
    {
      #pragma vector aligned
      for (k=0; k<LOOP_COUNT; k++)
      {
        fa[k+offset]=a*fa[k+offset]+fb[k+offset];
      }
    }
  }

This program gets ~1900 GFlops, which is very promising. However, if I change the "#pragma vector aligned" directive to

#pragma vector unaligned

or

#pragma simd

the performance drops significantly, to ~60 GFlops.

From the documentation I learned that "aligned" tells "compilers to use aligned data movement instructions for all array references when vectorizing". Could you elaborate on this explanation a little?

Also, I noticed that the changes mentioned above don't make much difference when the program runs on the host machine. Why is that?

Thanks!


Hi JS,

On the MIC architecture, vector load/store operations must be called on 64-byte aligned memory addresses. On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary.

From the programmer's perspective, "#pragma vector aligned" tells the compiler: "Compiler, I promise that the first iteration of this loop hits data on an aligned boundary, so you can just use aligned load/store instructions". In the words of James Reinders, this pragma "tells the compiler to chill". If at runtime the data are not aligned as promised, you will get a segmentation fault. My experience shows that on Xeon Phi it is always good to align data, and it is always good to tell the compiler "#pragma vector aligned" in performance-critical loops.

Without "#pragma vector aligned", the compiler will try to determine at runtime whether the first loop iteration hits an aligned boundary. If it does not, the compiled code may peel off up to 15 iterations (in single precision), so that the remainder of the loop is aligned. The alignment check and the peeled unaligned accesses take extra time; both can be eliminated by aligning the data and then using "#pragma vector aligned".

Now, "#pragma simd" tells the compiler "vectorize whatever it takes", and, as I understand, it may place an additional burden on the compiler to do runtime checks of the memory alignment situation, loop count, pointer disambiguation, etc. I find this pragma useful for vectorization of outer loops, but inner loops are usually ok without it.

I honestly don't know what "#pragma vector unaligned" does and when it can be useful.

A

You can refer to this article that discusses this topic in depth: http://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization

--Amanda

 

Quote:

Andrey Vladimirov wrote:

On the Xeon architecture with AVX/AVX2 instruction sets (Sandy Bridge, Ivy Bridge or Haswell), alignment does not matter. In earlier architectures (Nehalem, Westmere) alignment did matter, but a 32-byte alignment was necessary.

 

Alignment is always a key part of reaching optimal performance, especially for memory bound kernels -- without alignment the processor has to load several cache lines before all "needed" values can be loaded into a vector register. This stays true for Sandy and Ivy Bridge CPUs.

#pragma vector unaligned

tells the compiler not to peel iterations to align stores to the chosen array, but to allow for the possibility of unaligned data.  This might be advantageous for a case where the array on which the peeling would be keyed is not aligned, while most arrays in the loop are aligned. It might also have an advantage on a short array (too short to approach full performance) or might save on generated code size.  I have seen one case out of more than 100 where this confers some advantage.  In your case, it is obviously inadvisable, since run-time alignment adjustment takes care of at least 67% of the data access. 
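A hypothetical sketch of such a case (the function and names are mine, not from the original post): when the store target is known to be off the alignment boundary, peeling keyed on it cannot help, and "#pragma vector unaligned" tells the Intel compiler to skip the peel loop and use unaligned moves throughout. Other compilers ignore the pragma with a warning.

```c
/* y is known to be unaligned; skip the peel loop and emit unaligned
   vector moves throughout ("#pragma vector unaligned" is an Intel
   compiler directive; other compilers ignore it). */
void saxpy_unaligned(float a, const float *x, float *y, int n)
{
#pragma vector unaligned
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```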

#pragma simd

tells the compiler nothing about alignment. It simply tells the compiler to ignore all considerations which would impede vectorization (in your case, the possibility of overlap between fa[] and fb[]).
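As a side note, if possible overlap between fa[] and fb[] is the only obstacle, the C99 restrict qualifier removes it without discarding the compiler's other safety checks. A minimal sketch (my own, not from the post):

```c
/* restrict promises the compiler that fa and fb never alias, so this
   loop can vectorize without "#pragma simd" overriding every check. */
void triad(float a, float *restrict fa, const float *restrict fb, int n)
{
    for (int k = 0; k < n; k++)
        fa[k] = a * fa[k] + fb[k];
}
```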

I agree with Andrey about 32-byte alignment being more important on the early core-i7 architectures than on the current ones, but 64-byte alignment is the key to achieve full-cache-line moves on MIC even at the beginning of the loop.

Patrick is correct, I should not have translated "relaxed memory alignment requirements" to "does not matter". However, in practice, for most (if not all) applications that I have dealt with, vectorized compute-bound and bandwidth-bound workloads on Sandy Bridge and Ivy Bridge perform about the same whether or not I enforce alignment. This is definitely not the case for Xeon Phi, which really likes aligned data.

Quote:

Intel® AVX has relaxed some memory alignment requirements, so now Intel AVX by default allows unaligned access; however, this access may come at a performance slowdown, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the Intel® Advanced Vector Extensions Programming Reference (see "For More Information" for a link).

Source: http://software.intel.com/en-us/articles/introduction-to-intel-advanced-...

Thank you all for the informative replies!

Although I understand that explicit alignment reduces the number of cache lines that need to be loaded, I am still not sure why #pragma vector unaligned or #pragma simd would bring the performance down by almost two orders of magnitude (1900 GFlops to 60 GFlops). In fact, if no #pragma is present, the compiler automatically vectorizes this loop and reaches the 1900 GFlops performance under -O3.

Also, I found that #pragma simd can vectorize some math functions such as exp and log, while #pragma vector cannot. Therefore, if I have a loop that contains both array computations (like a*fa[i]+fb[i]) and math functions (like exp(a) and log(a)), how should I vectorize it?

Thanks again!

What is likely happening is that the computation uses scalar instructions rather than vector instructions, and also does not use streaming stores. Your computation takes approximately 32x longer; this is a multiple of the vector size (the relationship between a scalar operand and a vector operand).

Jim Dempsey

As the example posted here shows all threads over-writing the same data region (potential race condition), it doesn't look useful for evaluating the practical effects of these directives. 

In case it's not evident from the documentation, the point of commonality between #pragma simd and #pragma vector is that both remove the cost threshold so the compiler doesn't attempt to decide how advantageous vectorization will be.  #pragma simd has no options to control handling of alignment, but these pragmas may be used together in cases where both qualify.

In my examples, #pragma vector nontemporal can improve performance of svml math functions, where each thread gets a vector chunk of length 200 or more.  

#pragma simd includes additional effects such as ignoring dependencies (even some proven ones), turning off automatic memset and memcpy substitutions (so you can take control of the use of streaming stores), and offers clauses like firstprivate, lastprivate, reduction, .... It may break your code when there is an undeclared firstprivate, lastprivate, or reduction variable; note that max and min reductions are among those which may break, even though there is no support for them in the reduction clause. A questionable use of #pragma simd may work on one platform target and break on another.

#pragma omp simd has the potential advantage of corresponding to an adopted standard (OpenMP 4). For example, there is some support for it in gcc 4.9. It is similar to the earlier Intel-specific version (without the omp), except that it does not affect the memset and memcpy substitutions.
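For example, here is a minimal "#pragma omp simd" sketch (my own) that compiles under both icc and gcc >= 4.9, using the standardized reduction clause; compilers without OpenMP support simply ignore the pragma.

```c
#include <stddef.h>

/* OpenMP 4 SIMD construct: the reduction clause makes the
   vectorization contract for "sum" explicit and portable. */
float dot(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```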

I have an example with svml using #pragma omp parallel for simd and #pragma vector nontemporal together.

In summary, #pragma simd is the least safe of those discussed here, but all of them should be used with care and effort to understand their implications in the intended usage.

Quote:

jimdempseyatthecove wrote:

What likely is happening is the computation is using scalars not vectors and also not using streaming stores. Your computation is taking approximately 32x longer. This is a multiple of the vector size (relationship between scalar operand and vector operand).

Jim Dempsey

The slow case may have been forced into using pure gather-scatter memory access even though it may report vectorization.  You'd also need to look into whether one case was able to shortcut a loop which another didn't.
