data-misaligned with asm-inlined fabs()


EXC_BAD_ACCESS generated by inlined fabs() function.

icc 10.1 (only 32-bit compiler installed)

MacOSX 10.5

XCode 3.0


void foo( double b, ... )


... fabs(b) ...


sometimes generates data-misaligned inline-fabs()

0x19d54157 call 0x19d5415c

0x19d5415c pop %edx

0x19d5415d andps 0xe825b(%edx),%xmm5

0xe825b(%edx) = [0x19d5415c+0xe825b] = 0x19e3c3b7, which is NOT aligned on a 16-byte boundary, as required by "andps".

If you need any more information, please ask.

EDIT: does not occur with -O0, but with -O1/2/3


According to what you have shown, the alignment of b is determined prior to the call. The compiler would not assume it is aligned unless you insert #pragma vector aligned. If you do that, or if you supply a function of your own which uses parallel SSE, it is your responsibility to ensure that the data is aligned before reaching code which is vectorized under that pragma. Otherwise, the compiler would issue andps only after an adjustment loop which takes care of unaligned data beginning at b.
If the memory operand is a mask created by the compiler, the compiler would issue directives telling the linker to align that mask. If the OS or binutils did not correctly support the directive for 16-byte alignment, you would see such a fault.
If you defined the mask yourself, declaring it as an __m128 data type, as your reference page indicates, would instruct the linker to align it.
If you can make an example showing that the compiler does not work correctly, please submit it on your account.


The code sequence listed above uses program-counter-relative coding practices (note the call to the location immediately following the call, which then pops the PC of the pop into %edx). This means that the program and accompanying data can be moved about in memory as opposed to being bound to addresses at load time. This style of programming is typically used in device handlers, or in what used to be MS-DOS .COM files. Yes, I notice you are on MacOSX and not on a DOS machine.

Your compiler option switches were not specified.

The fault may lie with program placement by whatever loads your program. Or it could be with the C++ compiler. Or with the programmer not being specific about option switches, declspec, etc. in the program.

If the compiler generated the offset correctly (0xe825b), then the pop %edx would have been required to load at 0x???????5.

If you can produce an ASM listing with generated code bytes, then examine the (relative) location of that pop %edx. If it is located at an address ending in 5, then the problem lies with whatever is loading your PC-relative code.

Jim Dempsey

Hello, Tim, Jim,

thank you for your quick answers.

I am not providing my own implementation of "fabs()", so the compiler is generating it itself (typically inlined as an AND with a mask that keeps everything except the sign bit).

This is the full compiling stage as shown in XCode:

CompileICC "/Users/patrick/Code_Projects/EIU_RiBoDy_2/Output Mac/universalXCode/" /Users/patrick/Code_Projects/EIU_RiBoDy_2/../EIU_Stuff/nifty/nifty_quaternion.cpp

cd /Users/patrick/Code_Projects/EIU_RiBoDy_2

/usr/bin/icc-10.1-base/bin/icc -x c++ -arch i386 -dev-usr-root=/Developer/usr -g -O1 -w1 -fno-omit-frame-pointer -no-parallel -fvisibility=default -traceback -fpascal-strings "-I/Users/patrick/Code_Projects/EIU_RiBoDy_2/Output Mac/universalXCode/" "-F/Users/patrick/Code_Projects/EIU_RiBoDy_2/Output Mac/universalXCode/Release" "-I/Users/patrick/Code_Projects/EIU_RiBoDy_2/Output Mac/universalXCode/Release/include" -I../EIU_Stuff/ode-0.5.02/include -IInterface -D__ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__=1040 -falign-functions=16 -include RiBoDy_Prefix.pch -c /Users/patrick/Code_Projects/EIU_RiBoDy_2/../EIU_Stuff/nifty/nifty_quaternion.cpp -o "/Users/patrick/Code_Projects/EIU_RiBoDy_2/Output Mac/universalXCode/"

No errors/warnings. The next command issued is the linker "icpc".

As you can see, I added -falign-functions=16, which took care of some misalignments, but sadly not all. The functions are loaded by the MacOSX-provided dyld, which loads them on their 16-byte boundaries properly. I would not suspect dyld to be faulty in this case, nor would I suspect my binutils, as all other binaries that I produce (using Apple's GNU tools) work properly.

What would be the switch to get the intermediate asm listing? Would that listing reveal a possible mis-alignment generated by the compiler?

Would issuing a #pragma vector aligned help for compiler-generated code (like the asm-inlined fabs())?


>> What would be the switch to get the intermediate asm listing? Would that listing reveal a possible mis-alignment generated by the compiler?

What this will tell you is (potentially) what is to blame.

try ICC -? to find out.

"Normal" code on an i386 is not program-counter-relative, or should I say it is not written to be insensitive to movement after load (conversely, the loader is not insensitive to placement, because relocations are performed as part of the load process). Instead, "normal" code is relative to the base of the virtual address space. Your inlined function would therefore NOT contain the CALL (to the next instruction) and POP %edx (to get eip into edx). Instead it would simply contain the SSE AND instruction, using for the mask an address that is relative to virtual address 0x00000000 (or, knowing the load location of the instruction, a fixed offset to that address). It would not need to make the relatively expensive CALL and POP to determine where eip is located.

Jim Dempsey

Jim, do not worry about the PIC style. I am building a dylib, and this is then probably the default compiler behaviour. Aligning constants to boundaries should be independent of PIC/non-PIC.

The switch to get the asm output is "-S".

Without optimization (-O0) I get this for a "fabs()":


fabs #444.6


Everything is fine.


With O1,2,3 I get:



call L_L390 # Prob 100% #440.1

L_L390: #

popl %edx #440.1


andps L_2il0floatpacket.38-"L_L390"(%edx), %xmm5 #444.6



.section __DATA,__data

.align 5


.long 0xffffffff,0x7fffffff,0x00000000,0x00000000

So the compiler correctly emits an ".align 5" for the constant (the bit-mask for the "fabs()" operation), although an ".align 4" (16 bytes) would be sufficient.


But in the object file I get this for the DATA,data section:



sectname __data

segname __DATA

addr 0x00001877

size 0x00000200

offset 7335

align 2^5 (32)

reloff 0

nreloc 0

flags 0x00000000

reserved1 0

reserved2 0


Everything is neatly aligned within the section, but the section itself is misaligned (its address is 0x1877, which is NOT aligned to 2^5)!


I finally found the option to get rid of this, and also this problem:


Apparently, the compiler tries to generate the binary directly by default.

Going via some assembler (I guess gas?), will produce correct binaries on MacOSX 10.5.


Good detective work, and congratulations on finding a workaround.

I am not a user of ICC, nor do I use MacOSX (I am a Windoz user MSVC and Intel Visual Fortran).

On Intel Visual Fortran I have a similar outstanding issue relating to alignment. I've given up on getting a resolution for the problem. Apparently the engineer in charge of the issue does not see the problem. I've tried to explain the problem several ways without success.

The problem is not in the compiler: it generates offsets within the output segment (data segment) that honor the alignment specified. The problem is that the linker is not being directed to align the data segment to anything other than a 4-byte boundary. The alignment of the data segment should be the largest of all the alignment requirements within it. Part of the problem with interpretation is that the support engineer is assuming "This is flat model - there ain't no stink'n segments".

The assumption is correct, but the reality is that the linker assembles the segments into this one big flat virtual address space by concatenating the object files' segment data according to alignment restrictions (byte, word, dword, qword, paragraph, 4K page, ...). When IVF (via the IDE) manages to get through the linker, you end up with dword alignment restrictions on the object files' data segments. This means that if the data segment isn't aligned correctly, none of the aligned data within the data segment is aligned within the virtual address space. As a result, the SSE instructions that require alignment become a hit-or-miss situation.

The problem you are encountering smells much like the problem that exists with IVF.

Jim Dempsey

Hello Jim,

yes, that sounds "familiar"... Note that the compiler is producing a correct asm file. My assumption is, the "direct-binary-creation" is faulty with ICC. Making the effort of creating an .s first and passing that to a "proper" assembler is then a good way out. (that is what the -use-asm is doing???).

I am using ifort at university (linux). Checking..... aah, ifort does understand the "-use-asm" switch as well!

Maybe for Windows the switch will cure it? (/Quse-asm???) But I don't know which assembler needs to be installed (cygwin???).



The assembler will depend (for me) on which system I do the compilation on: Intel IA32 or AMD Opteron x64.

The version of IFORT for Windows I have (V10.1.013) does not list /Quse-asm

However, if I issue

ifort /Barf /? I get "illegal option /Barf" followed by the help

ifort /Quse-asm /? I get no mention of an illegal option, followed by the help

Therefore /Quse-asm is an undocumented option.

Haven't run a test to see if it will work as you suggest. May experiment later when I have the time.

Eking the last little bit of performance out of the system is the least of my worries at this time.

Jim Dempsey


I tested it the same way you did ("ifort -asd" throws an error, "ifort -use-asm" does not throw an error). So I don't know whether it has any effect under Linux either.

I did not get the alignment errors when I did not use optimization (-O0). Maybe that is an option for you. Good luck under windows, and with Intel's engineers!

Under linux icc/ifort is really an unbeatable option in the scientific computation area. Luckily, Intel provides a free academic version. They are really good at producing good+optimized code, but apparently they have problems with the different binary formats (namely mach-o and PE alignment probs).


If you submitted the same data you did in this thread to the premier support your issue would probably have been at least "targeted to be fixed" by now.

I presume that they get a lot of false reports from inexperienced programmers so they are sometimes quick to brush you off but if you attach a sound test case with the instructions for reproducing the problem you will get a fix.


>> I did not get the alignment errors when I did not use optimization (-O0). Maybe that is an option for you.

This may have been more of a case of a quote from "Dirty Harry"

"Do you feel lucky, punk."

On my system it would intermittently fail on Debug vs Release. As more often than not, it would not fail in Debug, but would fail in Release. My Debug code compiled with additional statistics variables in some modules. Thus alignment was more by chance than by design (no matter how well my alignment directives were written).

My "fix" was to have the initialization code check the alignment of one of the supposedly aligned variables, and if there was a skew it would announce the skew and stop. Then the "fix" would be to add a pad variable, compile, re-link and cross fingers. Eventually it would pass the skew test. Maybe you can do something like that for now.

I am very happy with IFORT. My simulation code consists of ~750 source files. Late 1980's F77 code revised by me to F95 with OpenMP, 40x faster than original code (forty x). Uses OpenMP on 4 core system (wish I had 8). It cooks along quite well now. Some of my simulation runs take weeks of computation time.

I've been looking at replacing my dual Opteron 270 (2 processors with 2 cores each, 2GHz) with a 2x4 core Intel platform. Something using the Tyan Tempest i5400PW (S5397) motherboard using two Intel Xeon 5400 series processors. Xeon E5410 looks attractive. Plenty of room for RAM (16 slots) and later I could replace the processors if the prices drop. 8 cores at 2.33 GHz and 24MB of L2 cache should work much better than 4 cores at 2GHz and 4MB of L2 cache. Faster FSB will help too. If anyone has experience with this motherboard I would appreciate your comments.

My current motherboard is a Tyan Thunder K8SR S2881. OK for its time. I have problems finding drivers for the disk controller. Windows Server 2003 works OK; I don't know what will happen with Windows Server 2008 (or XP x64 Pro). I do know Vista couldn't load: everything would load up until the final boot (after all the downloads), then it would crap out with disk driver problems. Hopefully upgrading my system will help.

Jim Dempsey


I am currently evaluating the product, there is no "premier support" for me. Maybe support is reading the forums?

I am comparing icc against gcc in terms of speed for an existing commercial product. Sadly, the icc version of the product is about 1.5 times slower (with the best settings I could find) than the gcc (simply -O3) version. It might have to do with the "-use-asm" switch, because icc might get some more optimizations out of the direct binary path, so the comparison might not be fair ATM. I think I will wait for the next release, which will hopefully be announced as "Leopard compatible", and beg for another trial period at that time.


EDIT: I have sent a PM to "Tim18", asking to forward this thread to the support team.

EDIT2: Tim18's email address is incorrect. Tim18, here is the email I was trying to send you via this forum:

Hello Tim,

I understand the preview versions of MacOSX 10.5 Leopard had quite some differences from the shipped version. The Intel 32-bit C compiler has some problems on the shipped systems. I want to draw your attention to these two threads I have created; they also contain a workaround for the current version:

Technical details are found within the threads.

I am only evaluating the product. There is no speed increase to my GCC build, so I am not interested in buying the compiler ATM. I understand it might not be a fair comparison to GCC at this time because of possible issues.

When icc is working under Leopard properly, and when I have time to spare like now, I will get back to Intel and request another trial period. That might be already Q3 2008. Can you give me a release date, or a ball park estimate?

As an additional request, it would be nice if the compiler understood (at least for mach-o targets) the __attribute__((constructor)) and __attribute__((destructor)) attributes. The function vectors are simply collected in the appropriate section of the file. Just check out a sample gcc dylib build with the command line utility "otool -l". So this is quite a primitive feature request.


icc 10.1.007 fixes this problem.

This version was released on the same date I registered for the evaluation (Dec 6th), so the download center gave me the old version, icc 10.1.006.

So all is well. Take note that even with an evaluation version, premier support is free, so you can still register and submit issues.

Hehe, yes, I discovered that when I clicked my personal download page.

BTW: the fixed compiler (and without the "-use-asm" switch) generates faster code than GCC as expected.
