SSE and AVX behavior with aligned/unaligned instructions

We've learned that if the compiler emits an aligned SSE memory move instruction for an unaligned address, it causes a SEGV. Will the same occur with AVX? Or in the case of AVX, is the worst resulting behavior merely degraded performance?


I think you answered the question yourself: aligned instructions require alignment!

In more detail:

The Fine Manual gives details of the properties of each instruction, as does the online intrinsics guide, where you can see the properties of the AVX load instructions. You will observe that there are both aligned and unaligned loads, for instance:

__m256d _mm256_load_pd (double const * mem_addr)

Synopsis

__m256d _mm256_load_pd (double const * mem_addr)
#include "immintrin.h"
Instruction: vmovapd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated. 

and 

__m256d _mm256_loadu_pd (double const * mem_addr)

Synopsis

__m256d _mm256_loadu_pd (double const * mem_addr)
#include "immintrin.h"
Instruction: vmovupd ymm, m256
CPUID Flags: AVX

Description

Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst. mem_addr does not need to be aligned on any particular boundary.

 

Aligned VEX-encoded loads and stores (e.g. vmovdqa) still require aligned memory operands. However, memory operands for other VEX-encoded instructions (e.g. vpaddd) need not be aligned. You will still pay a performance penalty for unaligned memory access, though. Refer to the Intel Software Developer's Manual for the description of particular instructions.

 

SSE instructions on an AVX CPU accept more cases of misaligned data than on earlier CPUs. I don't think this is well documented, in case that is part of your question.

@Tim P. AFAIK, legacy SSE instructions (i.e. non-VEX-encoded) haven't changed and still require aligned memory operands where they previously did. Only the VEX-encoded equivalents have relaxed requirements.

SSE arithmetic instructions may accept an unaligned operand on an AVX CPU. It goes without saying that it is inadvisable to try to take advantage of this; it raises the possibility of unexpected failure when changing CPUs.
Then again, Intel may have changed undocumented behavior on more recent CPUs after originally planning to match AMD (who might also have changed).

I addressed this recently in a different forum topic, but I can't find the reference right now....

In the beginning, SSE supported unaligned 128-bit loads/stores only via the MOVUPS instruction.  All 128-bit memory references that were input arguments to other instructions were required to be 128-bit aligned to avoid a protection fault.   In the earliest SSE systems, MOVAPS was faster, so it was preferred when the data was known to be aligned.    Later systems eliminated the performance penalty of MOVUPS in the case where the data was aligned, so the compiler switched to generating MOVUPS even in the cases where it knew the data was aligned. 

AVX relaxed the alignment restrictions for input arguments for both 128-bit and 256-bit loads.   BUT, every generation of processor had different performance penalties for executing these memory references without natural alignment.

From memory:

  • Sandy Bridge

    • Loads

      • 2 loads per cycle (up to 128-bit) in the absence of bank conflicts or cache line crossing.

        • I.e., no penalty for unaligned loads that do not cross a cache line boundary.
      • 128-bit loads that cross a cache line boundary reduce the rate to 1 load every 2 cycles.
      • 256-bit loads take 2 cycles, but two can execute in parallel in the absence of bank conflicts or cache line crossing.
      • 256-bit loads that cross a cache line boundary reduce the rate to 1 load every 4 cycles.
      • Loads that cross a 4KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads.  The detailed mechanisms are not clear.
    • Stores
      • Big (?) penalty for any sized store that crosses a cache line boundary.
      • Huge (>100 cycle) penalty for any store that crosses a 4KiB page boundary.
    • Because there are only 2 address generation units, it is not possible to perform 2 loads and 1 store per cycle.
      • 2 256-bit loads plus 1 256-bit store every 2 cycles is supported, but it is extremely difficult to avoid bank conflicts in this case.
  • Ivy Bridge
    • I think there were reductions in the penalties for cache-line and page crossing, but I don't recall that I ever measured them in detail.
  • Haswell
    • Loads

      • 2 loads per cycle (up to 256-bit) for any alignment in the absence of cache line crossing.
      • 1 load per cycle for any sized load that crosses a cache line boundary.
      • Loads that cross a 4KiB page boundary have a larger penalty, but at least part of that penalty can be overlapped with subsequent loads.  The detailed mechanisms are not clear.
    • Stores
      • One store per cycle (any size or alignment) as long as it does not cross a cache line boundary.
      • I think that the penalties for cache-line-crossing and 4KiB-page-crossing are much smaller than on SNB, but I don't have the numbers handy.
    • A 3rd address generation unit was added to allow 2 loads plus 1 store per cycle.
  • Skylake Xeon
    • I have not tested this yet, but it certainly supports 2 512-bit aligned loads per cycle, or 1 512-bit aligned load plus any other load that does not cross a cache line boundary.

      • This could be built on the same physical interface that Haswell uses -- dual-read-port, 512-bit port width.
    • Skylake Xeon does not appear to be able to support 2 512-bit loads plus 1 512-bit store per cycle, but the reported performance is slightly higher than 2 512-bit loads per cycle.   I have not checked to see whether this inability to fully overlap also applies to 128-bit and/or 256-bit 2-load-plus-1-store combinations.
"Dr. Bandwidth"

Elimination of compiler use of aligned loads applies only to AVX and newer targets, as older ISA code selection in Intel compilers is generally optimized for the oldest corresponding CPU.
The penalty for 256-bit unaligned access on Sandy Bridge was so large that compilers would always split such an access into a pair of 128-bit accesses. Ivy Bridge greatly reduced the penalty, but not to the extent that compilers needed to eliminate the splitting. Intel compilers, when directed to generate both Sandy Bridge and Ivy Bridge paths, should produce only the path optimized for Sandy Bridge.
I suppose the incentive for some of us to learn about AVX-512 specifics is reduced because of the lack of client CPU support.
