AVX Unaligned Loads/Stores w/ ComposerXE

Hi,

I am trying to figure out which compiler flags to use in order to get correct and optimal results from AVX intrinsics. I have put together a small example below showing the problems I am having. Basically, using unaligned AVX loads/stores does not produce optimal code, and I can't seem to find the right combination of flags to get the compiler to do the right thing.

I have included 3 sample outputs from very simple test code which illustrate the problems.

Any help or insight would be greatly appreciated. Thank you.

Mike

//-----------------------------------------------------------------------------
// Below shows that using unaligned loads/stores produces suboptimal code.
// Why do these not simply translate to a single "vmovups (%r##), %ymm#"?
//

Using:
icc -S -use-msasm -xavx -fsource-asm main.c

..___tag_value_vavx_test.32: #17.1

### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_loadu_ps( a );

vmovups (%rdi), %xmm0 #21.28

### vb0 = _mm256_loadu_ps( b );

vmovups (%rsi), %xmm1 #22.28
vinsertf128 $1, 16(%rsi), %ymm1, %ymm3 #22.28
vinsertf128 $1, 16(%rdi), %ymm0, %ymm2 #21.28

### vc0 = _mm256_add_ps( va0, vb0 );

vaddps %ymm3, %ymm2, %ymm4 #23.11

### _mm256_storeu_ps( c, vc0 );

vmovups %xmm4, (%rdx) #24.23
vextractf128 $1, %ymm4, 16(%rdx) #24.23

###
### vc128 = _mm256_extractf128_ps( vc0, 0 );
### _mm_store_ss( c, vc128 );

vmovss %xmm4, (%rdx) #27.5

###
### _mm256_zeroupper();

vzeroupper #29.2

###
###
### return;

ret #32.5

//-----------------------------------------------------------------------------
// Below demonstrates that using aligned loads/stores produces unaligned
// load/stores in assembler. This is fine, but is of concern if switching
// compilers or flags as other settings produce aligned instructions.
//

Using:
icc -S -use-msasm -xavx -fsource-asm main.c

..___tag_value_vavx_test.32: #17.1

### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_load_ps( a );

vmovups (%rdi), %ymm0 #21.27

### vb0 = _mm256_load_ps( b );
### vc0 = _mm256_add_ps( va0, vb0 );

vaddps (%rsi), %ymm0, %ymm1 #23.11

### _mm256_store_ps( c, vc0 );

vmovups %ymm1, (%rdx) #24.22

###
### vc128 = _mm256_extractf128_ps( vc0, 0 );

### _mm_store_ss( c, vc128 );

vmovss %xmm1, (%rdx) #27.5

###
### _mm256_zeroupper();

vzeroupper #29.2

###
###
### return;

ret #32.5

//-----------------------------------------------------------------------------
// Below demonstrates that using -xhost with unaligned loads/stores produces the
// correct unaligned loads/stores in assembler. But the compiler also mixes SSE
// instructions into the code and does not even zeroupper, causing severe
// transition penalties.
// (note, this is compiled on a SandyBridge system running RHEL6.1)
//

Using:
icc -S -use-msasm -xhost -fsource-asm main.c

..___tag_value_vavx_test.32: #17.1

### __m256 va0, vb0, vc0;
### __m128 vc128;
###
### va0 = _mm256_loadu_ps( a );

vmovups (%rdi), %ymm0 #21.28

### vb0 = _mm256_loadu_ps( b );
### vc0 = _mm256_add_ps( va0, vb0 );

vaddps (%rsi), %ymm0, %ymm1 #23.11

### _mm256_storeu_ps( c, vc0 );

vmovups %ymm1, (%rdx) #24.23

###
### vc128 = _mm256_extractf128_ps( vc0, 0 );

### _mm_store_ss( c, vc128 );

movss %xmm1, (%rdx) #27.5

###
### _mm256_zeroupper();

vzeroupper #29.2

###
###
### return;

ret #32.5

Brandon Hewitt (Intel) replied:

For case 1:

The compiler is trying to avoid the performance penalty of loads that split a cache line. That is why it splits the 256-bit load into two 128-bit loads. If you can get the compiler to recognize that the pointers are 32-byte aligned, it won't do that; but then, that is why you are using unaligned load intrinsics in the first place: you don't know the alignment for sure.

For case 2:

Since you're using aligned load intrinsics, the compiler knows there's no risk of a split cache line and can do the entire 256-bit load at once. It still uses vmovups because there is no significant cost to using vmovups vs. vmovaps on aligned data, and using vmovups avoids a crash on the chance that the data really is unaligned after all.

For case 3:

I've just confirmed that -xhost maps to -xavx on my Core i7 2nd generation system here. What exact compiler are you using (-V will show this), and can you send me the output you get after adding the -dryrun option?

Brandon Hewitt, Technical Consulting Engineer
Tools Knowledge Base: http://software.intel.com/en-us/articles/tools
Software Product Support info: http://www.intel.com/software/support
