v13 beta - BUG? AVX alignment for stack variables

Hello,

Running in to some issues with Composer XE 13 Beta code generation for AVX.

Specifically, in release builds on x86 it's not 32-byte-aligning function-scope AVX data on the stack. This causes an illegal instruction when the program executes.

This issue does not occur:-
* When using the Microsoft C++ compiler for x86 or x64 (though they have their own issues...)
* When using the Intel C++ compiler for x64, provided AVX code generation is switched on in the options.

Compiler version: 2013_beta_0.060
OS: Windows 8 release preview
CPU: Sandy Bridge i5-2500
Architecture: x86

Compiler command line options (excerpt - removed some include paths):-
/GS- /Qftz /W3 /QxAVX /Gy /Zc:wchar_t /Zi /Ox /Ob1 /fp:fast /D "__INTEL_COMPILER=1300" /Zc:forScope /GR /arch:AVX /Gd /Oy /Oi /MT /EHsc /nologo /FAs /Ot

Excerpt from generated code:-
5518AD0D vmovdqu ymmword ptr [esp+550h],ymm5
5518AD16 vmovaps ymm0,ymmword ptr [esp+530h]
5518AD1F vmovaps ymm1,ymmword ptr [esp+550h]

Note that esp is 32-byte aligned at this point, but the offset addresses (530, 550 etc.) being generated are not.

Also note that the compiler appears to be generating a lot of unaligned loads/stores, even though the data are declared as being aligned.

The local variables are declared as follows:-

// MSVC style alignment
#ifdef WIN32
#define vecpre __declspec(align(32))
#define vecpost
#endif

// AVX typedef for vector type
typedef __m256 vec_float;

// Aligned version
typedef vecpre vec_float align_vec_float vecpost ;

void Foo()
{
...
align_vec_float bar;
...
}

Anyone care to shed any light?

Cheers,
Angus.


Hello Angus,

Is it really an "illegal instruction" error? If so, then the code you created (using AVX via option "/QxAVX") is being executed on a system that lacks the AVX instruction set extension.
However, I don't think that's the case, because "/QxAVX" normally adds a check to the main routine that verifies the processor can execute AVX instructions, and the application would refuse to run at all. If the code is in a library, though, no such check is inserted.

More likely you're seeing a SEGV caused by a GP fault when an aligned load/store is executed on unaligned memory. We seem to have a report for that already, and engineering is currently looking into it; please see:
http://software.intel.com/en-us/forums/showthread.php?t=105284

Another note:
It's OK that unaligned loads/stores are used even when the data is aligned. The reason is that 2nd and 3rd generation Intel Core processors execute aligned and unaligned loads/stores in the same cycle count when the data accessed is actually aligned. That's an optimization of the underlying HW. The compiler favors unaligned accesses because they're more flexible/portable (e.g. when using 3rd party libraries).

Edit:
To be sure, would it be possible for you to create a small test case/reproducer? I'm somewhat afraid that you're only seeing this problem with complex code, as in the other thread above.
One indicator of whether we're facing the same problem would be to compile without optimization. If I'm right, the problem should disappear. Can you verify this?

Thank you & best regards,

Georg Zitzlsberger

Hoping to clarify what is being reported here:
I think the OP is saying that __m256 data types don't get default 32-byte alignment unless the AVX option is on, which can produce an alignment fault when an AVX-compiled function is called by a non-AVX one.
If the Microsoft compiler requires __declspec(align(32)) in order for a function to call an ICL-compiled AVX function, one hopes the same thing will work for ICL. As the OP implies, it would be useful if the declspec were not required.
It would be useful to post a minimal example demonstrating the problem on premier.intel.com in order to get an assessment from the compiler team as to whether this situation could be improved upon.
Georg seems to be diverting the subject toward how the compiler uses unaligned instructions in many cases even though it expects alignment. Even if the case in question were made to run correctly by such means, there could be a hidden performance penalty if the alignment isn't corrected.

Yes, you're right: it's a SIGSEGV, an access violation reading 0xFFFFFFFF.

The generated asm code is quite hard to disentangle, but the problem appears to be related to parameter passing for an inline function. The temporary stack variables being generated to pass the data aren't aligned properly, so when the inlined code attempts to access them as if they were aligned, *boom*.

Specifically, this code is generating an error:-

align_vec_float rg0 = mulps(rand[0].white1(), set1ps(uncorrh));

compiles to:-

5518A836 vmovdqu ymmword ptr [esp+550h],ymm5
5518A83F vmovaps ymm0,ymmword ptr [esp+530h] <=== ACCESS VIOLATION
5518A848 vmovaps ymm1,ymmword ptr [esp+550h]
5518A851 vextractf128 xmm2,ymm0,0
5518A857 vextractf128 xmm3,ymm0,1
5518A85D vextractf128 xmm4,ymm1,0
5518A863 vextractf128 xmm5,ymm1,1
5518A869 vzeroupper
5518A86C vpmulld xmm2,xmm2,xmm4

where rand[0].white1() is:-

vforceinline vec_float white1()
{
align_vec_int l_seed = m_seed;
l_seed = my_mulepi32(l_seed, set1epi32(196314163));
.....
}

and my_mulepi32 is (written in asm because a certain other vendor's compiler wasn't generating vzeroupper instructions in all the necessary places):-

vforceinline vector_reg_i my_mulepi32(vector_reg_i q1, vector_reg_i q2)
{
vecpre __m128i vecpost loword;
vecpre __m128i vecpost hiword;

__asm
{
vmovaps ymm0, q1
vmovaps ymm1, q2
vextractf128 xmm2, ymm0, 0x0
vextractf128 xmm3, ymm0, 0x1
vextractf128 xmm4, ymm1, 0x0
vextractf128 xmm5, ymm1, 0x1
vzeroupper
pmulld xmm2, xmm4
pmulld xmm3, xmm5
movdqa loword, xmm2
movdqa hiword, xmm3
}
return _mm256_insertf128_si256(_mm256_castsi128_si256(loword),hiword,1);
}

If I change the 'vmovaps' to 'vmovups', the code executes OK - but presumably it's incurring the penalty for unaligned loads/stores?

Sandy Bridge has a large penalty for 256-bit unaligned access which crosses a cache line boundary. The compiler normally splits such accesses down to 128-bit instructions when it expects frequent misalignment. Ivy Bridge is supposed to make a big improvement on misaligned moves, such that the single 256-bit instruction could be preferred over the split moves.
As Georg pointed out, there should be no penalty for using movups in the case where the data are aligned.

Thanks! It seems the best solution with the Intel compiler is to get rid of the inline asm and use the 128-bit VEX version of "vpslld" to perform the shift operations (I didn't realise to begin with that the 128-bit "vpslld" intrinsic is allowed under AVX whereas the 256-bit version is not; I'm new to all this stuff). The code is now running nice and smooth.
