BTC family intrinsics - code generation needs to be improved

gpseek

// BT.cpp : Test bit intrinsics

#include <intrin.h>

#if defined(_M_X64)
#pragma intrinsic(_bittestandcomplement64)
#else
#pragma intrinsic(_bittestandcomplement)
#endif

int global_static_integer;

// "Function style": toggle bit i in a by-value copy and return the result.
__inline int btc_func_style(int mask, int i)
{
	_bittestandcomplement((long*) &mask,  i);
	return mask;
}

// "Memory style": toggle bit i directly in the pointed-to integer.
__inline void btc_mem_style(int* mask, int i)
{
	_bittestandcomplement((long*) mask,  i);
}

__declspec(noinline) void test_mem_style(int i)
{
	 btc_mem_style(&global_static_integer, i);
}

__declspec(noinline) void test_func_style(int i)
{
	 global_static_integer = btc_func_style(global_static_integer, i);
}

int main(int argc, char* argv[])
{
	test_mem_style(argc + 1);
	test_func_style(argc);
	return global_static_integer;
}
Here is the assembly generated for these two simple test functions:
; mark_description "Intel C++ Compiler for applications running on IA-32, Version 12.1.3.300 Build 20120130";
?test_mem_style@@YAXH@Z	PROC NEAR PRIVATE
; parameter 1(i): eax
        sub       esp, 12                                       ;27.1
        mov       edx, OFFSET FLAT: ?global_static_integer@@3HA ;28.3
        btc       DWORD PTR [edx], eax                          ;28.3
        setb      al                                            ;28.3
        add       esp, 12                                       ;29.1
        ret                                                     ;29.1
?test_mem_style@@YAXH@Z ENDP

?test_func_style@@YAXH@Z	PROC NEAR PRIVATE
; parameter 1(i): eax
        sub       esp, 12                                       ;33.1
        mov       ecx, DWORD PTR [?global_static_integer@@3HA]  ;34.3
        lea       edx, DWORD PTR [esp]                          ;34.3
        mov       DWORD PTR [esp], ecx                          ;34.3
        btc       DWORD PTR [edx], eax                          ;34.3
        setb      al                                            ;34.3
        mov       eax, DWORD PTR [esp]                          ;34.3
        mov       DWORD PTR [?global_static_integer@@3HA], eax  ;34.3
        add       esp, 12                                       ;35.1
$LN51:
        ret                                                     ;35.1
?test_func_style@@YAXH@Z ENDP

As you can easily see, the generated code is far from optimal. The compiler emits several useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason. Could somebody forward this message to the compiler development group as an improvement request? Thanks!
The same problem exists for the BTR and BTS intrinsics, and the same issue has also been confirmed on x64.
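
For comparison, the effect that both test functions are meant to have can be written in portable C++ as a single XOR. This is only a sketch: the helper names and the g_word variable are hypothetical stand-ins, the bit index is assumed to be in the range 0..31, and nothing here says what any particular compiler will emit for it.

```cpp
#include <cstdint>

// Hypothetical stand-in for global_static_integer from the test program.
static std::uint32_t g_word = 0;

// "Function style": take the value and return it with bit i toggled.
// Assumes 0 <= i <= 31.
inline std::uint32_t toggle_bit_value(std::uint32_t mask, unsigned i)
{
    return mask ^ (std::uint32_t{1} << i);
}

// "Memory style": toggle bit i of the word that p points to.
// Assumes 0 <= i <= 31.
inline void toggle_bit_in_place(std::uint32_t* p, unsigned i)
{
    *p ^= (std::uint32_t{1} << i);
}
```

Both helpers leave the same value in the word, which is why the two listings above would ideally be close to identical.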

Jennifer J. (Intel)

What compile options were used?
Which OS, and which gcc or VC version?

Thanks,
Jennifer

gpseek

I tested O2, O3 and Ox in Visual Studio 2005. I think the compiler will do the same thing with different setups.
I can also test with VS 2010, but I don't think it will make any difference at all.

gpseek

I just ran the test again with the newest version available at this moment to see whether this has improved.

Unfortunately, there has been no improvement at all in almost a year. The generated code is the same.

Could Jennifer help me raise this again? :) Thanks!

Compiler and options:

Intel(R) C++ Compiler for applications running on IA-32, Version 13.1.0.149 Build 20130118

; mark_description "-c -Qvc10 -Qlocation,link,$(VCInstallDir)\\bin -Zi -nologo -W3 -O2 -Oi -Qipo -Qftz- -D __INTEL_COMPILER=1310";

; mark_description " -D WIN32 -D NDEBUG -D _CONSOLE -D _UNICODE -D UNICODE -EHs -EHc -MD -GS -Gy -fp:precise -Zc:wchar_t -Zc:for";

; mark_description "Scope -Qansi-alias -YuStdAfx.h -FpRelease\\bt.pch -FA -FaRelease\\ -FoRelease\\ -FdRelease\\vc100.pdb -Gd -T";

; mark_description "P";

Marián "VooDooMan" Meravý

OT: I am very sorry to make an off-topic post, but I just want to ask how gpseek can use

#pragma intrinsic(_bittestandcomplement64)

while I need to use:

#if defined(_MSC_VER) && !defined(__INTEL_COMPILER)
#   pragma intrinsic(abs)
#endif

and likewise for others (memset, memcpy, etc.), since otherwise I get a warning from ICC (or maybe an error; I don't recall exactly why I disabled these MS-specific pragmas in the past).

-- With best regards, VooDooMan - If you find my post helpful, please rate it and/or select it as a best answer where applies. Thank you.
Jennifer J. (Intel)

I wasn't aware of your earlier post about the compiler options. I have it now and am checking on it.

Jennifer

Jennifer J. (Intel)

Quote:

gpseek wrote:

Unfortunately, there has been no improvement at all in almost a year. The generated code is the same.

According to our compiler engineer, the reason is that the second parameter of "_bittestandcomplement()" is not an immediate value; otherwise the call would be optimized.
The memory forms of these "bit test" instructions do not perform well. It is best to use the following to get better performance:

// Make sure n is an unsigned int, not a signed int
x[n / 32] |= (1 << (n % 32));

Does this workaround work for you?

Jennifer
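
A slightly fuller sketch of the shift-and-mask approach suggested above, covering set, reset, complement and test on an array of words. The function names are hypothetical and the words are assumed to be 32 bits wide. Note that the one-line example above sets a bit (the BTS equivalent); the BTC equivalent uses ^= instead of |=.

```cpp
#include <cstddef>
#include <cstdint>

// Bit n lives in word n / 32, at position n % 32.
// n is unsigned, so the division and remainder reduce to a shift and a mask.
inline void set_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] |= (std::uint32_t{1} << (n % 32));
}

inline void reset_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] &= ~(std::uint32_t{1} << (n % 32));
}

inline void complement_bit(std::uint32_t* x, std::size_t n)
{
    x[n / 32] ^= (std::uint32_t{1} << (n % 32));
}

inline bool test_bit(const std::uint32_t* x, std::size_t n)
{
    return ((x[n / 32] >> (n % 32)) & 1u) != 0;
}
```

The caller is responsible for making sure the array is large enough for the highest bit index used.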

gpseek

Quote:

Jennifer J. (Intel) wrote:

According to our compiler engineer, the reason is that the second parameter of "_bittestandcomplement()" is not an immediate value; otherwise the call would be optimized.
The memory forms of these "bit test" instructions do not perform well. It is best to use the following to get better performance:

// Make sure n is an unsigned int, not a signed int
x[n / 32] |= (1 << (n % 32));

Does this workaround work for you?

Jennifer

Jennifer,

This can hardly be called a workaround.

1. Using unsigned won't help at all, because these intrinsic functions are not declared with unsigned parameters. I believe that in this case Microsoft, not Intel, is to blame. Intel just used the Microsoft declarations to stay compatible:

unsigned char _bittestandcomplement(long *a, long b);

unsigned char _bittestandcomplement64(__int64 *a, __int64 b);

Refer to http://msdn.microsoft.com/en-us/library/zbdxdb11(v=vs.90).aspx

2. Your engineer's example is beside the point, or outside the scope of the problem. It does just what everybody does now: avoid the BT family instructions! I believe this is no longer best practice, because BT family instructions have been really fast since Core 2 Duo, yet the compiler is still not good at generating them. What I am asking is to use these fast BTs in the cases where they fit. Using BTs can reduce port contention, for example, and thus create further optimization opportunities.

 

Sergey Kostrov

>>...
>>unsigned char _bittestandcomplement( long *a, long b );
>>unsigned char _bittestandcomplement64( __int64 *a, __int64 b );
>>...

You can't blame Intel for the declarations of these intrinsic functions because they are Microsoft-specific; take a look at the copyright note in the intrin.h header file. In the Intel-specific headers for intrinsic functions they are not declared at all.

To make your statements about the effectiveness of code generation more valuable (and fair!), you need to provide examples of code generation with different C++ compilers.

I hope that your comment will be taken into account by Intel software engineers. Thanks.

gpseek

Quote:

Sergey Kostrov wrote:

To make your statements about the effectiveness of code generation more valuable (and fair!), you need to provide examples of code generation with different C++ compilers.

I hope that your comment will be taken into account by Intel software engineers. Thanks.

I don't think code generation examples from other compilers are really needed for my case at all.

I'm saying that optimizing the generation of BT family instructions helps the Intel C/C++ compiler and, more importantly, the Intel Core platform. Is this still not enough?

The last time I checked the MS compiler, its BT code was bad/slow too. However, this should not be an excuse for the Intel compiler. Unless something has changed since I last checked the AMD platform, AMD's BTs are not as fast as Intel's have been since the introduction of C2D. However, these fast BTs are mostly wasted simply because of lousy compilers, both MS's and Intel's included.

 

Sergey Kostrov

>>The last time I checked the MS compiler, its BT code was bad/slow too. However, this should not be an excuse for the Intel
>>compiler. Unless something has changed since I last checked the AMD platform, AMD's BTs are not as fast as Intel's have been
>>since the introduction of C2D. However, these fast BTs are mostly wasted simply because of lousy compilers, both MS's and Intel's included.

That is really interesting to know.

iliyapolak

>>>The compiler emits several useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason>>>

I agree that the stack reads/writes are useless, but the setb instruction could be related to the btc instruction and inserted automatically by the compiler to capture CF (setb stores 1 when CF == 1). Maybe the use of setb is hardcoded by the compiler designers?

gpseek

Quote:

iliyapolak wrote:

>>>The compiler emits several useless stack memory reads and writes, stack pointer adjustments, and a setb instruction for no obvious reason>>>

I agree that the stack reads/writes are useless, but the setb instruction could be related to the btc instruction and inserted automatically by the compiler to capture CF (setb stores 1 when CF == 1). Maybe the use of setb is hardcoded by the compiler designers?

The generated setb instructions are useless here for these calls.

I think I know why a setb instruction is generated there in the first place. Take a look at the _bittestandcomplement declaration on the MS site. You will see that _bittestandcomplement returns "the bit at the position specified". I think MS overdid it: _bittestandcomplement does not really need such a return value in most of the cases you can think of. In the test cases shown in my original post, the return value is not used at all.

However, the compiler is not sophisticated enough to recognize that when the return value is never referenced, there is no need to compute it.

That is why you see the redundant setb there.
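
To make the return value concrete, here is a portable emulation of the behaviour described above: report the previous state of the bit, then flip it. This is a sketch under assumptions, not the real intrinsic: the function name is hypothetical, std::int32_t stands in for the 32-bit long of the Windows declaration, and the index is assumed to be in the range 0..31.

```cpp
#include <cstdint>

// Hypothetical emulation: returns the old value of bit b in *a (0 or 1),
// then toggles that bit. Assumes 0 <= b <= 31.
inline unsigned char emulated_bittestandcomplement(std::int32_t* a, std::int32_t b)
{
    const std::uint32_t bit = std::uint32_t{1} << b;
    const std::uint32_t old = static_cast<std::uint32_t>(*a);
    *a = static_cast<std::int32_t>(old ^ bit);
    // This "old bit" result is what setb materializes from CF after btc.
    return (old & bit) != 0 ? 1 : 0;
}
```

When a caller discards the result, as both test functions in the original post do, the returned value is dead and only the toggle matters.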

 

iliyapolak

>>>The generated setb instructions are useless here for these calls>>>

Yes, I agree with you on the setb instruction.

gpseek

bump up:)

Any update on this? Thanks

gpseek

Any update on this?! Thanks!

Matthew Oliver

I too would be interested in the compiler being updated to properly optimize the BT family of intrinsics. If both inputs are in registers, then this intrinsic should be able to generate a single BTR etc. instruction and just leave it at that. The way it is currently handled makes the whole thing far slower than it needs to be.

Also, given that these instructions are actually used by Intel in their Embree RT code, I would have thought that this sort of thing would have been fixed, as it is also affecting them.
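
As a sketch of the register-only case described above: with a 32-bit register destination, BTR takes the bit index modulo 32 and clears that bit, which corresponds to the portable expression below. The function name is hypothetical, and whether a given compiler turns this into a single BTR is not something this sketch claims.

```cpp
#include <cstdint>

// Hypothetical helper: clear bit (i mod 32) of value and return the result,
// mirroring what the register form of BTR does to its destination operand.
inline std::uint32_t reset_bit_value(std::uint32_t value, unsigned i)
{
    return value & ~(std::uint32_t{1} << (i & 31u));
}
```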
