Question about example on Optimization manual---AVX mask move to avoid branch penalty

Hi all,

I am trying to run Example 11-14 from page 11-23 of the Optimization Manual (June 2013). I wrote the function in a separate .s file and put main in main.c. The code only runs correctly in debug mode. Please see the attachments for my code. cond_loop.c is actually cond_loop.s, but the forum won't accept that extension.

  • icc main-2.c cond_loop.s -g          Everything works fine. 
  • icc main-2.c cond_loop.s              Segmentation Fault with failure to access array members at the end of the code.

After the function void cond_loop(const float *a, float *b, const float *c, const float *d, const float *e, const int length) returns, all the array pointers are lost, so I can no longer access the arrays. The problem only occurs without the -g compile option, so it is a release-only bug and I am not able to debug it. From some research, this may be because the stack frame pointer is always saved in debug mode but not in release mode. I am not sure that is my problem, and I don't really know how to solve it. I tried pushing rbp and rsp, but that doesn't help. Would anyone please help me look at it? Any advice is appreciated. Thank you all!

BTW: in the attachments, cond_loop_c.c is the corresponding C version of the assembly, and of course that one works perfectly. I am using Linux, so the calling convention is the x86-64 System V ABI. Thanks again.

Best

xiangpisai

Attachments:
  • main-2.c (1.5 KB)
  • cond-loop.c (1.04 KB)
  • cond-loop-c.c (249 bytes)

Have you verified that the (assembler) call statement in your main has not pushed any arguments onto the stack, requiring the called routine to fix up the stack? Depending on the calling convention you may need to make an adjustment (do not conclude that the caller, main, is not also pushing args onto the stack just because you see the correct values in the incoming registers).

Place a printf in front of the cond_loop call in main and compile a Debug build, but with all optimizations enabled and no runtime checks. You should be able to place a breakpoint on the printf. Open a disassembly window and step past the return from printf (and any potential stack cleanup). Check whether the next code only places the args into registers or additionally places them onto the stack. If they go onto the stack, check whether the stack is cleaned up after the call (meaning the callee need not clean up); if it is not, the callee is responsible for cleanup. Note: depending on your compiler, you used to be able to declare the calling convention as "naked". You might want to read up on this.

Jim Dempsey

www.quickthreadprogramming.com

Why not permit the compiler to generate code using vblendps from C source code?  Are you trying to verify the expectation that vmaskmovps will be slower?

I suppose you might argue that it's ugly to use C code with restrict pointers or #pragma ivdep, but you must make equivalent assumptions when you write the .s code. You will either need to simplify the source code or use #pragma vector always or simd as well, to avoid the compiler's "protects against exceptions" report. If you want it all with no restrict or pragma, there's CEAN.

b[0:length] = c[0:length]* (a[0:length]<0 ? d[0:length] : e[0:length])
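
For readers without CEAN support, the array-notation line above corresponds roughly to the scalar loop below. This is only a sketch: the name cond_loop_ref and the use of restrict are assumptions for illustration, not code from the attachments, and whether a given compiler turns the select into vblendps depends on the compiler and its flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference for b = c * (a < 0 ? d : e).
 * restrict promises the compiler that b does not alias the inputs,
 * the same assumption hand-written assembly makes silently. */
void cond_loop_ref(const float *restrict a, float *restrict b,
                   const float *restrict c, const float *restrict d,
                   const float *restrict e, int length)
{
    assert(length >= 0);
    for (int i = 0; i < length; i++) {
        /* A per-element select that a vectorizing compiler may map to a blend. */
        b[i] = c[i] * (a[i] < 0.0f ? d[i] : e[i]);
    }
}
```

Because the function takes the same six arguments in the same order as cond_loop, it can also serve as a check on the output of an assembly version.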

Quote:

TimP (Intel) wrote:

Why not permit the compiler to generate code using vblendps from C source code?  Are you trying to verify the expectation that vmaskmovps will be slower?

I suppose you might argue that it's ugly to use C code with restrict pointers or #pragma ivdep, but you must make equivalent assumptions when you write the .s code. You will either need to simplify the source code or use #pragma vector always or simd as well, to avoid the compiler's "protects against exceptions" report. If you want it all with no restrict or pragma, there's CEAN.

b[0:length] = c[0:length]* (a[0:length]<0 ? d[0:length] : e[0:length])

Hi Tim,

Thanks for your reply. I know vblendps is a good choice here. I am just practicing with vmaskmovps because I might need it in future code, so I basically copied the code from the Intel Optimization Manual. I have nothing against restrict, #pragma ivdep or #pragma simd.

As the previous comment suggests, I should check in the disassembly window whether main is quietly pushing something onto the stack before calling cond_loop in debug mode. The answer is yes. I just want to understand what is going on in these steps. This code isn't really about speedup; it is more like tutorial code. But yes, your suggestions are very valuable. Thank you very much!

Best

xiangpisai

Quote:

jimdempseyatthecove wrote:

Have you verified that the (assembler) call statement in your main has not pushed any arguments onto the stack, requiring the called routine to fix up the stack? Depending on the calling convention you may need to make an adjustment (do not conclude that the caller, main, is not also pushing args onto the stack just because you see the correct values in the incoming registers).

Place a printf in front of the cond_loop call in main and compile a Debug build, but with all optimizations enabled and no runtime checks. You should be able to place a breakpoint on the printf. Open a disassembly window and step past the return from printf (and any potential stack cleanup). Check whether the next code only places the args into registers or additionally places them onto the stack. If they go onto the stack, check whether the stack is cleaned up after the call (meaning the callee need not clean up); if it is not, the callee is responsible for cleanup. Note: depending on your compiler, you used to be able to declare the calling convention as "naked". You might want to read up on this.

Jim Dempsey

Hi Jim

Thank you so much for your suggestion.

First thing: the bug is actually there even in debug mode, as long as you turn on -O3. I did what you told me to and found that main stores edi and rsi at addresses relative to rbp, the base pointer I believe. There are two suspicious instructions:

  • mov dword ptr [rbp-0x20], edi
  • mov qword ptr [rbp-0x18], rsi

I think they are trying to protect rdi and rsi, which are the first two parameters of cond_loop, but I don't see why, because those addresses never seem to be loaded again.

The above two instructions come from the working build. The non-working build doesn't save anything; it just makes the call directly. So what should I do to make my code work correctly? Is there anything I should change in main.c or in the assembly file? Thanks again for your help.

Best

xiangpisai

  • mov dword ptr [rbp-0x20], edi
  • mov qword ptr [rbp-0x18], rsi

RBP is a base pointer, and it looks like edi and rsi are saved in the function's local variable storage area. You have a segmentation fault, so the right approach is to locate the faulting rip (instruction pointer) and try to resolve the memory address. Sometimes the return address is overwritten by junk and not cleaned up by the callee. Another possibility is an attempt to read from or write to inaccessible memory.

Can you run your code under GDB?

Hi Folks, 

Thank you so much for your help! I just found the bug and fixed it!

The optimized code doesn't save the pointers to a, b, c, d and e anywhere safe. It just keeps them in r12, r13, rbx, r15 and r14. I don't know why the compiler assumes that the function it is calling won't touch these registers. When the program calls my function written in assembly, of course I use these registers... Knowing this, I modified my cond_loop.s file and added

  • push r12
  • push r13
  • push rbx
  • push r15
  • push r14

at the beginning of the code and added 

  • pop r14
  • pop r15
  • pop rbx
  • pop r13
  • pop r12

before the ret instruction and everything works now. The remaining problem is: 

I don't really think this is a good solution, because who knows where main will place my pointers the next time I run the code? So, what do you usually do when you write functions in assembly? How do you protect your pointers? Thanks a lot for your kind help and fast response.

Best

xiangpisai

Quote:

iliyapolak wrote:

Can you run your code under GDB?

Hi iliyapolak,

Thank you for your help. I did try debugging with gdb, but it is no different from idb, and idb has a GUI that is better for tracking registers. I have basically solved the problem, as mentioned in the previous comment, yet there are still some unanswered questions. Would you please shed some more light on them? Thanks a ton!

Best

xiangpisai

>>>I did try to debug it with gdb but it makes no difference than idb and idb has a GUI which is better at tracking>>>

If you are asking about GDB, I must say that I do not know Linux debugging very well. I base my knowledge on similar issues (bugs) that usually occur on the Windows platform. In general, debugging under Windows is easier because of the great WinDbg extensions, but at the expense of closed source code.

AFAIK the r12-r15 registers are nonvolatile and must be preserved by a callee if used, but that is under Windows; I am not sure whether Linux follows the same convention.

Quote:

iliyapolak wrote:

>>>I did try to debug it with gdb but it makes no difference than idb and idb has a GUI which is better at tracking>>>

If you are asking about GDB, I must say that I do not know Linux debugging very well. I base my knowledge on similar issues (bugs) that usually occur on the Windows platform. In general, debugging under Windows is easier because of the great WinDbg extensions, but at the expense of closed source code.

AFAIK the r12-r15 registers are nonvolatile and must be preserved by a callee if used, but that is under Windows; I am not sure whether Linux follows the same convention.

Hi iliyapolak,

Hmm, actually I am working under Linux and debugging with the Intel debugger, idb.

Oh yes. I guess I have run into the trouble of "volatile" registers! Thank you so much for mentioning it! Windows and Linux use different calling conventions, so I guess I will read the System V ABI document now and try to find something useful. I will reply later with my findings.

Best

xiangpisai

Quote:

iliyapolak wrote:

>>>I did try to debug it with gdb but it makes no difference than idb and idb has a GUI which is better at tracking>>>

If you are asking about GDB, I must say that I do not know Linux debugging very well. I base my knowledge on similar issues (bugs) that usually occur on the Windows platform. In general, debugging under Windows is easier because of the great WinDbg extensions, but at the expense of closed source code.

AFAIK the r12-r15 registers are nonvolatile and must be preserved by a callee if used, but that is under Windows; I am not sure whether Linux follows the same convention.

Here it is:

Quoting from: http://people.freebsd.org/~lstewart/references/amd64.pdf

Most registers are overwritten by a procedure call, but the values in the following registers must be preserved:

%rbx %rsp %rbp %r12 %r13 %r14 %r15

But this document is for FreeBSD and not for Linux. I bet the two share something in common, but I am not sure what applies on Linux. I cannot find anything useful on Google. Thanks a lot!
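
The quoted list can be written down as a small lookup, shown here as a sketch. The helper name is_callee_saved_sysv is hypothetical and only for illustration; the register names are the ones in the list quoted above. The same set applies under the System V AMD64 ABI that Linux uses on x86-64.

```c
#include <assert.h>
#include <string.h>

/* Registers a callee must preserve, taken from the list quoted above. */
static const char *const sysv_callee_saved[] = {
    "rbx", "rsp", "rbp", "r12", "r13", "r14", "r15"
};

/* Returns 1 if a function that uses `reg` must save and restore it,
 * 0 if the caller has to assume the register is overwritten by a call. */
int is_callee_saved_sysv(const char *reg)
{
    assert(reg != NULL);
    size_t n = sizeof sysv_callee_saved / sizeof sysv_callee_saved[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(reg, sysv_callee_saved[i]) == 0)
            return 1;
    }
    return 0;
}
```

All five registers pushed in cond_loop.s (r12, r13, rbx, r15, r14) are in this set, which is why the compiler felt free to keep the array pointers in them across the call.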

Best

xiangpisai

Google here does indicate that http://x86-64.org/documentation/abi.pdf should cover this (with only Windows differing from other x86_64 OS), but here too that reference doesn't load.

Still, you could start by saving the .s file from a similar compilation from C source, checking that it works as a replacement for the .c, then trying your variations. A requirement for binutils update is not unusual on the more "stable" linux distros.

Even some experts have forgotten that RHEL 5.x, for example, has committed never to support AVX, so even with updated binutils you may expect problems with function interfaces.

>>>How do you protect your pointers>>>

I would do what you did: save them on the stack and restore them later.

Quote:

TimP (Intel) wrote:

Google here does indicate that http://x86-64.org/documentation/abi.pdf should cover this (with only Windows differing from other x86_64 OS), but here too that reference doesn't load.

Still, you could start by saving the .s file from a similar compilation from C source, checking that it works as a replacement for the .c, then trying your variations. A requirement for binutils update is not unusual on the more "stable" linux distros.

Even some experts have forgotten that RHEL 5.x, for example, has committed never to support AVX, so even with updated binutils you may expect problems with function interfaces.

Thank you, Tim, for coming back to help me again.

Though the website won't load, I was able to download the document from somewhere else, and there is nothing about "volatile" in it. My future code will specifically involve AVX and AVX2 acceleration and will only run on a specific machine, so I won't need to worry about portability.

BTW: Thank you for mentioning RHEL. I guess I will make sure my code doesn't run on any of those machines :D

One more question: Does Xeon Phi support AVX and AVX2 instructions and functions like mask, blend and gather/scatter? Thanks!

Best

xiangpisai

On Windows platforms, x64 there are calling convention rules as to what registers need to be preserved across calls, and what do not, as well as the input and output variable assignments. There will be a similar set of rules for Linux and/or other O/S. Find the rules, obey the rules.

; Rules for argument passing on x64 platform:
;
; rax, volatile, return value
; rcx, volatile, first integer argument
; rdx, volatile, second integer argument
; r8, volatile, third integer argument
; r9, volatile, fourth integer argument
; r10, volatile
; r11, volatile
; xmm0,volatile, first fp argument
; xmm1,volatile, second fp argument
; xmm2,volatile, third fp argument
; xmm3,volatile, fourth fp argument
; xmm4,volatile
; xmm5,volatile
;
; all other registers non-volatile
; (must be preserved/restored if used)
;

Note, the registers you are now pushing are on the "must be preserved" list
*** you may also need to save and restore some of the xmm registers too ***

Do not assume the above list is valid on Linux, dig through the man pages to find out what is required.

Jim Dempsey

Quote:

jimdempseyatthecove wrote:

On Windows platforms, x64 there are calling convention rules as to what registers need to be preserved across calls, and what do not, as well as the input and output variable assignments. There will be a similar set of rules for Linux and/or other O/S. Find the rules, obey the rules.

; Rules for argument passing on x64 platform:
;
; rax, volatile, return value
; rcx, volatile, first integer argument
; rdx, volatile, second integer argument
; r8, volatile, third integer argument
; r9, volatile, fourth integer argument
; r10, volatile
; r11, volatile
; xmm0,volatile, first fp argument
; xmm1,volatile, second fp argument
; xmm2,volatile, third fp argument
; xmm3,volatile, fourth fp argument
; xmm4,volatile
; xmm5,volatile
;
; all other registers non-volatile
; (must be preserved/restored if used)
;

Note, the registers you are now pushing are on the "must be preserved" list
*** you may also need to save and restore some of the xmm registers too ***

Do not assume the above list is valid on Linux, dig through the man pages to find out what is required.

Jim Dempsey

Thanks Jim,

As discussed in the previous comments, I have already noticed that. After pushing the necessary registers onto the stack and popping them back before the function returns, everything works fine now.

Best

xiangpisai

Since the official ABI link (http://www.x86-64.org/documentation/abi.pdf) is inaccessible for the moment, the following link may be of use:

http://www.classes.cs.uchicago.edu/archive/2009/spring/22620-1/docs/hand...

What the document doesn't mention, though, is that the frame pointer (rbp) is omitted by default, at least on Linux, so you should not rely on it when accessing arguments on the stack. Use rsp for that instead. However, you still have to save and restore rbp if your asm routine clobbers it, so that the routine works correctly when frame pointers are enabled by compiler switches. Also, some types of arguments are always passed on the stack, but that's not the case for pointers, as in your example.
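
To make the point about pointer arguments concrete, here is a small sketch of where the System V AMD64 ABI places the first six integer or pointer arguments. The helper name sysv_int_arg_reg is hypothetical and exists only for illustration; floating-point and aggregate arguments follow separate rules that this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* System V AMD64: the first six integer/pointer arguments are passed
 * in these registers, in this order; later ones go on the stack. */
static const char *const sysv_int_arg_regs[] = {
    "rdi", "rsi", "rdx", "rcx", "r8", "r9"
};

/* Returns the register holding integer/pointer argument `index`
 * (0-based), or NULL when that argument is passed on the stack. */
const char *sysv_int_arg_reg(int index)
{
    assert(index >= 0);
    size_t n = sizeof sysv_int_arg_regs / sizeof sysv_int_arg_regs[0];
    if ((size_t)index < n)
        return sysv_int_arg_regs[index];
    return NULL;
}
```

For cond_loop(a, b, c, d, e, length) this gives a in rdi, b in rsi, c in rdx, d in rcx, e in r8 and length in r9 (its low 32 bits, r9d), so none of the six arguments is on the stack.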

>>>What the document doesn't mention though is that the frame pointer (rbp) is omitted by default at least on Linux, so you should not rely on it when accessing the arguments on stack.>>>

It is also the case on Windows.

Btw, the omission of ebp makes debugging harder.

>>...After the function void cond_loop(const float *a, float *b, const float *c, const float *d, const float *e, const int length) returns,
>>all the array pointers will be lost so I cannot access the old arrays anymore...

Something is really wrong and I'll take a look at C-versions of your test case. Thanks for the reproducer.

...
void cond_loop( const float *a, float *b, const float *c, const float *d, const float *e, const int length );
...
void cond_loop_c( float *a, float *b, float *c, float *d, float *e, int length )
{
...
}
...

There is a difference between the forward declaration of the function and its declaration in the implementation: the const qualifier is used for most parameters, except for b. Please post the correct declaration for the function. Thanks.

Actually it seems that there are two different functions: void cond_loop() and void cond_loop_c().

>>...there are two different functions: void cond_loop() and void cond_loop_c()...

xiangpisai made a note that

...in attachment, cond_loop_c.c is the corresponding C version of the assembly...

Quote:

Sergey Kostrov wrote:

>>...there are two different functions: void cond_loop() and void cond_loop_c()...

xiangpisai made a note that

...in attachment, cond_loop_c.c is the corresponding C version of the assembly...

Hmm... Might be my bad English. Actually, I don't have a good idea of what const means in assembly. My assembly doesn't perform any check for const, so I cannot tell whether using const or not makes any difference in the C version of the code.

Try to verify that the GCC compiler generates identical assembly code for these two versions of the cond_loop function:

[ Version 1 ]

void cond_loop( const float *a, const float *b, const float *c, const float *d, const float *e, const int length );

void cond_loop( const float *a, const float *b, const float *c, const float *d, const float *e, const int length )
{
int i;
for(i=0;i

[ Version 2 ]

void cond_loop( float *a, float *b, float *c, float *d, float *e, int length );

void cond_loop( float *a, float *b, float *c, float *d, float *e, int length )
{
int i;
for(i=0;i

Xiangpisai,

Here is a modified test case and I'd like to inform you that I didn't have any issues or problems on a 64-bit Windows 7 Professional platform on a Dell Precision Mobile M4700 with Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ).

I've completed 4 tests, and here are the Intel C++ compiler options:

icl.exe /QxAVX /Od /MDd /D"_DEBUG" Test18.cpp
icl.exe /QxAVX /O1 /MD /D"NODEBUG" Test18.cpp
icl.exe /QxAVX /O2 /MD /D"NODEBUG" Test18.cpp
icl.exe /QxAVX /O3 /MD /D"NODEBUG" Test18.cpp

I hope that my results will be useful for you.

Attachments:
  • test18.cpp (2.67 KB)

[ Outputs ]

Test 1 - icl.exe /QxAVX /Od /MDd /D"_DEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 2 - icl.exe /QxAVX /O1 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 3 - icl.exe /QxAVX /O2 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 4 - icl.exe /QxAVX /O3 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Note: Since the seed is the same for all 4 tests, the results are identical.

Quote:

Sergey Kostrov wrote:

[ Outputs ]

Test 1 - icl.exe /QxAVX /Od /MDd /D"_DEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 2 - icl.exe /QxAVX /O1 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 3 - icl.exe /QxAVX /O2 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Test 4 - icl.exe /QxAVX /O3 /MD /D"NODEBUG" Test18.cpp

a[14] = 0.849280
c[14] = 0.657335
d[14] = 0.189015
e[14] = 0.729507
f[14] = 0.479530

Note: Since the seed is the same for all 4 tests, the results are identical.

Thanks for helping me test the code. I am also using a Dell Precision M4700. The only difference is that I am using Linux (Debian Wheezy). I guess that's the main reason I am getting bugs. Anyway, after pushing the corresponding registers onto the stack, things are working fine now.

>>... I am also using a Dell Precision M4700. the only thing different is that I am using Linux---Debian Wheezy. I guess that's
>>the main point why I am getting bugs. Anyway, after pushing corresponding registers to stack, things are working fine now...

Thanks for confirming that the problem is resolved.

Just wanted to point you to a great optimization resource:

http://agner.org/optimize/#manuals

Particularly this manual:
http://agner.org/optimize/optimizing_assembly.pdf

It explains Windows and Linux 32-bit and 64-bit ABI, calling conventions, name mangling, etc.

-- Regards, Igor Levicki. If you find my post helpful, please rate it and/or select it as the best answer where applicable. Thank you.

Igor,

Thanks for the link. I've enjoyed Agner Fog's posts and web pages for many years and have considered him a valuable programming resource.

Jim Dempsey