misaligned F32vec4

I'm using the evaluation version of the 7.0 C++ compiler for IA-32, integrated with MSVC6 SP5. I have found that operator new doesn't always return a properly aligned object.

The following program *sometimes* exhibits the issue:

#include <iostream>
#include <fvec.h>   // F32vec4

using namespace std;

int main()
{
    cout << "F32vec4 is " << sizeof(F32vec4) << " bytes." << endl;

    F32vec4 *foo = new F32vec4;
    cout << "pointer foo is 0x" << foo << endl;

    F32vec4 *bar = new F32vec4;
    cout << "pointer bar is 0x" << bar << endl;

    // Crashes here when either pointer is not 16-byte aligned.
    *foo += *bar;

    return 0;
}

operator new should align an F32vec4 to 16 bytes, but often the result is only 8-byte aligned. The program then crashes when it reaches the 16-byte movaps generated for the += operation, because movaps requires a 16-byte-aligned memory operand.

The program may work, by chance, if the two allocations turn out to be aligned. I get different alignment results when I run with the debugger, without the debugger, compile for debug single-threaded, compile for debug multi-threaded DLL, etc. It really should work in all these cases. I have seen similar results under the .NET environment.
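To see which runs got lucky, the pointers can be tested directly instead of read off the hex output. This is only a small sketch; the helper name is_aligned is made up for illustration, and it assumes a compiler that provides <cstdint> (MSVC6 itself does not).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when p sits on an `alignment`-byte boundary.
// `alignment` must be a power of two (16 for an SSE register).
inline bool is_aligned(const void* p, std::size_t alignment = 16)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Printing is_aligned(foo) and is_aligned(bar) right after each new in the test program would show whether a given build configuration happened to hand back 16-byte-aligned blocks.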

I guess there is some question over whether new alignment is part of the Intel compiler or the Microsoft runtime. Even if it is determined to be a problem in the runtime, it would be good to get a workaround. Otherwise it is difficult to use SSE and SSE2 intrinsics/libraries in dynamically allocated classes.

Okay, a workaround that works is to override operators new and delete, either globally or in each class that contains F32vec4/__m128 member variables, so that they use the Processor Pack functions _aligned_malloc() and _aligned_free(). If the global new and delete are not overridden, using STL containers also requires an allocator class built on the same functions.

It still would be much nicer if new put a 16-byte object on a 16-byte boundary.
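For the STL side of this workaround, an allocator along the following lines might do. It is a rough sketch, not tested against MSVC6: the name Aligned16Allocator is invented here, the minimal interface shown relies on C++11 std::allocator_traits (an older library would need the full allocator interface with rebind, construct, and so on), and the non-Microsoft branch assumes a POSIX system with posix_memalign.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#if defined(_MSC_VER)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Hypothetical allocator that hands out 16-byte-aligned storage.
template <class T>
struct Aligned16Allocator
{
    typedef T value_type;

    Aligned16Allocator() {}
    template <class U> Aligned16Allocator(const Aligned16Allocator<U>&) {}

    T* allocate(std::size_t n)
    {
        // Guard against overflow in n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
#if defined(_MSC_VER)
        void* p = _aligned_malloc(n * sizeof(T), 16);
#else
        void* p = 0;
        if (posix_memalign(&p, 16, n * sizeof(T)) != 0) p = 0;
#endif
        if (!p) throw std::bad_alloc();
        assert(reinterpret_cast<std::size_t>(p) % 16 == 0);
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t)
    {
#if defined(_MSC_VER)
        _aligned_free(p);
#else
        std::free(p);
#endif
    }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U>
bool operator==(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Aligned16Allocator<T>&, const Aligned16Allocator<U>&) { return false; }
```

A container would then be declared as, for example, std::vector<F32vec4, Aligned16Allocator<F32vec4> >.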

The problem you're raising is indeed a nasty one: spurious crashes and other goodies...

As you point out, overriding "operator new" on a per-class basis is a sensible solution IMHO (I do just that in my code...). You can define a macro (say "ALLOC16") containing the definitions of operators new, new[], delete, and delete[], and just include it in your class, as in:

class Stuff
{
    __m128 a, b, c;
public:
    ALLOC16
};

Typically, operator new[]/delete[] are invalid if sizeof(your class) isn't a multiple of 16, because every element after the first would then start off a 16-byte boundary, so it's a good idea to report a warning if they are called at runtime.
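The post doesn't show the macro itself, so here is one guess at what an ALLOC16-style macro could look like. The helpers alloc16_or_throw and free16 and the struct AlignedStuff are made-up names, the non-Microsoft branch assumes POSIX posix_memalign, and float[4] members stand in for __m128 so the sketch doesn't depend on the SSE headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#if defined(_MSC_VER)
#include <malloc.h>   // _aligned_malloc, _aligned_free
#endif

// Hypothetical helper: 16-byte-aligned allocation that throws on failure.
inline void* alloc16_or_throw(std::size_t bytes)
{
#if defined(_MSC_VER)
    void* p = _aligned_malloc(bytes, 16);
#else
    void* p = 0;
    if (posix_memalign(&p, 16, bytes) != 0) p = 0;
#endif
    if (!p) throw std::bad_alloc();
    assert(reinterpret_cast<std::size_t>(p) % 16 == 0);
    return p;
}

// Hypothetical helper: releases a block obtained from alloc16_or_throw.
inline void free16(void* p)
{
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    std::free(p);
#endif
}

// Class-level operators; must be expanded in a public section of the class.
#define ALLOC16                                                                \
    static void* operator new(std::size_t n)   { return alloc16_or_throw(n); } \
    static void* operator new[](std::size_t n) { return alloc16_or_throw(n); } \
    static void  operator delete(void* p)      { free16(p); }                  \
    static void  operator delete[](void* p)    { free16(p); }

// Example class; float[4] stands in for __m128 here.
struct AlignedStuff
{
    ALLOC16
    float a[4], b[4], c[4];
};
```

This version leaves out the runtime warning for new[], since the macro as written has no way to know sizeof of the enclosing class; a variant that takes the class name as a parameter could compare sizeof against 16 inside operator new[].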

By the way, handling the allocation yourself has big advantages for highly optimized code. If you target SMP systems or hyperthreaded CPUs, for example, a shared heap allocator is a major source of useless synchronization between threads competing for the allocator. Another issue is that you're not guaranteed that each thread will get allocation addresses at least 128 B apart (one L2 line). So here again, a good solution is to devise a pool of allocators yourself, typically one allocator dedicated to each physical or logical CPU; you will maximize the L2 hit ratio this way.
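As a very rough illustration of the idea, here is a bump allocator with one instance per thread rather than per CPU, which is a simplification of what the post describes. Everything in it is hypothetical: the names ThreadArena and local_arena, the 16 KB capacity, and the reliance on C++11 thread_local and alignas. Blocks are rounded up to 128 bytes so that two allocations never share a line, and no locking is needed because each thread only touches its own arena.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-thread bump arena; individual blocks are never freed.
class ThreadArena
{
public:
    static const std::size_t kLine = 128;            // spacing taken from the post
    static const std::size_t kCapacity = 16 * 1024;  // arbitrary size for the sketch

    ThreadArena() : used_(0) {}

    // Returns a kLine-aligned block, or a null pointer when the arena is full.
    void* allocate(std::size_t bytes)
    {
        if (bytes > kCapacity) return 0;
        std::size_t rounded = (bytes + kLine - 1) / kLine * kLine;
        if (rounded == 0) rounded = kLine;            // give zero-byte requests a line
        if (rounded > kCapacity - used_) return 0;
        assert(used_ % kLine == 0);
        void* p = storage_ + used_;
        used_ += rounded;
        return p;
    }

    // Discards every block handed out so far.
    void reset() { used_ = 0; }

private:
    alignas(128) unsigned char storage_[kCapacity];
    std::size_t used_;
};

// One arena per thread, created on first use in that thread.
inline ThreadArena& local_arena()
{
    thread_local ThreadArena arena;
    return arena;
}
```

A class-level operator new could forward to local_arena().allocate(n), with the caveat that a block must not be released through a different thread's arena.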
