_mm_load_ps generates VMOVUPS

Hi all,

I've tested the following case using Intel XE Compiler 2011.3 and 2013.4.

I have a question. Take a very basic SSE function:

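#include <stdio.h>
#include <xmmintrin.h>
/* pool must be 16-byte aligned and hold at least 20 floats */
void test1(float * pool)
{
    __m128 v = _mm_load_ps(pool);
    __m128 a = _mm_load_ps(pool + 8);
    _mm_store_ps(pool + 16, _mm_add_ps(v, a));
    printf("test1: %g\n", pool[16]);
}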

If I compile it without specific flags, I get the expected SSE code: an aligned load (explicit for pool, implicit for pool + 20h, since addps requires an aligned memory operand) and an aligned store (pool + 40h):

00E410A3  movaps      xmm0,xmmword ptr [eax] 
00E410A6  addps       xmm0,xmmword ptr [eax+20h] 
00E410AA  movaps      xmmword ptr [eax+40h],xmm0 
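The 20h and 40h displacements in the listing are just the float indices 8 and 16 scaled by sizeof(float). A minimal sketch of that arithmetic, plus a way to check that a pointer really is 16-byte aligned before calling test1, could look like this. The helper names (is_aligned16, float_byte_offset, alloc_pool) are made up for illustration, and aligned_alloc assumes a C11 runtime; with MSVC something like _aligned_malloc would be needed instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: byte offset of element idx in a float array.
 * With 4-byte floats, index 8 gives 0x20 and index 16 gives 0x40. */
static size_t float_byte_offset(size_t idx)
{
    return idx * sizeof(float);
}

/* Hypothetical helper: allocate n floats on a 16-byte boundary.
 * Assumes C11 aligned_alloc; the size is rounded up because
 * aligned_alloc expects a multiple of the alignment. */
static float *alloc_pool(size_t n)
{
    size_t bytes = (n * sizeof(float) + 15u) & ~(size_t)15u;
    float *pool = aligned_alloc(16, bytes);
    assert(pool == NULL || is_aligned16(pool));
    return pool;
}
```

If the base pointer is 16-byte aligned, then pool + 8 and pool + 16 are aligned as well, since both byte offsets are multiples of 16. Release the pool with free().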

If I compile it using AVX, I get an unaligned load for pool, an implicit load for pool + 20h (folded into vaddps), and an unaligned store for pool + 40h:

002F10A3  vmovups     ymm0,xmmword ptr [eax]
002F10A7  vaddps      ymm1,ymm0,xmmword ptr [eax+20h]
002F10AC  vmovups     xmmword ptr [eax+40h],xmm1

Is this expected? Does this affect performance?

Kind regards

When I say "I compile it using AVX", I mean /QxAVX under Windows. My project uses AVX elsewhere, so dropping this flag would mean either emulating AVX instructions with SSE or mixing legacy and VEX-encoded instructions, which is a performance disaster.

OK, after benchmarking random-access loads and stores, it seems that VMOVUPS on an XMM register takes the same computation time as MOVAPS when the memory is aligned.
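For anyone who wants to repeat the comparison, here is a rough sketch of one way to time it. It is not the benchmark I ran: it walks the buffer sequentially rather than randomly, and the function names (sum_load_aligned, sum_load_unaligned, seconds_for) are made up for illustration. Since the AVX build may emit VMOVUPS for both intrinsics, the more telling comparison is probably to build the same source once without and once with /QxAVX and compare the timings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#include <xmmintrin.h>

/* Sum n floats using aligned loads. data must be 16-byte aligned;
 * any tail of fewer than 4 elements is ignored. */
static float sum_load_aligned(const float *data, size_t n)
{
    assert(((uintptr_t)data & 15u) == 0);
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Same sum, but with loads that do not require alignment. */
static float sum_load_unaligned(const float *data, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

/* Run fn `repeats` times and return the elapsed CPU time in seconds.
 * The accumulated result is written to *result so the compiler
 * cannot discard the calls. */
static double seconds_for(float (*fn)(const float *, size_t),
                          const float *data, size_t n,
                          int repeats, float *result)
{
    clock_t start = clock();
    float r = 0.0f;
    for (int k = 0; k < repeats; ++k)
        r += fn(data, n);
    *result = r;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call seconds_for once with each function on the same 16-byte-aligned buffer (from _mm_malloc, for example), using a buffer and repeat count large enough that clock() can resolve the difference.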

Thanks a lot

You are welcome.
