Bit test intrinsics functions wanted!

BT, BTS and BTC instructions are fast again in Core 2. Could your compiler developers implement bit test intrinsics for them? I think the BT instruction should be implemented at least.

The benefits of such intrinsics are quite obvious. Suppose we want to test whether bit i is set in an integer bitmap. We usually do this in C/C++:

if (bitmap & (1 << i))


The problems with the above C test are:

1. more instructions are generated, and

2. register cl is needed, which increases register pressure. More register swap/save instructions are often needed as well, because rcx/ecx is often used for a function parameter.

Another plus for _bit_test(integer, index) is that it reduces code size.

One additional suggestion to the compiler optimization:

Sometimes (not always) bitmap & (1 << i) should be compiled as a BT instruction.
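To make the request concrete, here is a minimal portable sketch of the test being discussed. The names bit_test_mask and bit_test_shift are hypothetical helpers, not compiler intrinsics, and whether a given compiler turns either form into a BT instruction is up to its optimizer. The sketch uses an unsigned 1 so that it does not rely on signed-shift behaviour when i is 31.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an intrinsic): the idiom from the post,
// building a mask with a variable shift and ANDing it with the bitmap.
inline bool bit_test_mask(std::uint32_t bitmap, unsigned i)
{
    return (bitmap & (std::uint32_t{1} << i)) != 0;  // requires i < 32
}

// Same result, written as "shift the bitmap down and keep bit 0".
inline bool bit_test_shift(std::uint32_t bitmap, unsigned i)
{
    return ((bitmap >> i) & 1u) != 0;                // requires i < 32
}

// Small usage example: both forms agree for every bit index.
inline void bit_test_example()
{
    const std::uint32_t bitmap = 0x01000005u;        // bits 0, 2 and 24 set
    for (unsigned i = 0; i < 32; ++i)
        assert(bit_test_mask(bitmap, i) == bit_test_shift(bitmap, i));
    assert(bit_test_mask(bitmap, 24));
    assert(!bit_test_mask(bitmap, 1));
}
```

An intrinsic such as the requested _bit_test(integer, index) would express the same operation while letting the compiler pick the instruction directly.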



As you appear to be referring to a specific compiler version, you should specify which one. If you have specific suggestions, there are appropriate places for you to post the details (your account for Intel C++, gcc bugzilla for gnu). In either case, you would want to test a version which is currently being enhanced (usually, the latest available).


Thank you for the reply.

I tested what I believe is the latest official version of Intel C++ Compiler 9.1 for Windows, which I downloaded last week.

What's an account? I don't think I have one.

I just want to suggest that your compiler developers implement intrinsics like these, which can improve Core 2 CPU performance.

At the end of your installation, you should have been invited to set up a support account on Also, you should be able to go to to open an account or get updates. This is how you would submit bug reports and feature requests.


I assume you are a member of the C/C++ compiler team. I'm happy as long as someone on your team knows about this request.

I've tested a few things with ICL 9.1. It's a good compiler that can beat the MS one in most cases.

However, I'm sure you can make it even better.

No, I'm not on the compiler team, but I do a lot of testing and work with customers. If you want your requests to go forward, someone has to submit them. It may as well be you, so that you get progress reports.

Tim, Thanks again!

I don't have such an account to file the request. I'm still an evaluation user. Could you please send this thread to them as a feature request? I think this feature would boost Core 2 processors' SPECint2000 or similar scientific benchmark results a little bit. And the implementation is not difficult at all, considering that they have already implemented _bit_scan_forward, _bit_scan_reverse and even a _popcnt!

When you get the eval, it asks if you'd like the free support. If you select "Yes", you'll have an account with Premier Support at "". Then you can submit issues or feature requests.

About "bitmap & (1 << i) should be compiled as a BT instruction": it's a good suggestion for our future compiler.

About the _bit_test, will the following intrinsics work better for your case? If yes, I'll submit the feature for you.

int _bit_test(int val, int cnt); // returns either 0 or 1 the bit in val specified by cnt
int _bit_test_and_set(int *val, int cnt); // returns either 0 or 1 the bit in *val specified by cnt. That bit is then set.
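The intended results of the two proposed functions can be sketched in portable C++ as follows. The names emu_bit_test and emu_bit_test_and_set are hypothetical stand-ins chosen for illustration; the bodies only describe what the calls would return, not what code a compiler would generate, and they assume a 32-bit int with 0 <= cnt < 32.

```cpp
#include <cassert>

// Hypothetical emulation of the proposed _bit_test:
// returns 1 if bit `cnt` of `val` is set, otherwise 0.
inline int emu_bit_test(int val, int cnt)
{
    return static_cast<int>((static_cast<unsigned>(val) >> cnt) & 1u);
}

// Hypothetical emulation of the proposed _bit_test_and_set:
// returns the old value (0 or 1) of bit `cnt` in *val, then sets that bit.
inline int emu_bit_test_and_set(int* val, int cnt)
{
    const unsigned old = static_cast<unsigned>(*val);
    *val = static_cast<int>(old | (1u << cnt));
    return static_cast<int>((old >> cnt) & 1u);
}

// Small usage example.
inline void emu_bit_test_example()
{
    int flags = 0;
    assert(emu_bit_test_and_set(&flags, 3) == 0);  // bit was clear
    assert(flags == 8);                            // bit 3 is now set
    assert(emu_bit_test_and_set(&flags, 3) == 1);  // bit was already set
    assert(emu_bit_test(flags, 3) == 1);
    assert(emu_bit_test(flags, 2) == 0);
}
```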



Jennifer, thanks a lot.

int _bit_test(int val, int bit_index) looks much better than the second one, which is Microsoft syntax. _bit_test returns 0 if the specified bit is not set and nonzero if it is set (it need not return exactly 1, because the compiler may actually generate a conditional jump on CF when it is used in a condition clause).

The _bit_test intrinsic should map to the BT instruction as closely as possible. I think you can safely assume that intrinsics users are at least assembly-aware programmers who know what they are doing. The MS version is quite inefficient: it generated a dummy memory read the last time I checked a piece of 32-bit code produced by MS 8.0.

Each function has its strengths and weaknesses. In a multi-threaded single-processor system you would use bit_test_and_set; on SMP you would use the interlocked version of the intrinsic. Lacking this, you would have to use a critical section or spinlock, which is much more costly than using a memory temp.
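As an illustration of the interlocked variant, here is a minimal sketch in standard C++ using std::atomic. The name atomic_bit_test_and_set is hypothetical; it is not the intrinsic discussed in this thread, and the instructions a compiler emits for fetch_or vary by compiler and target.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically sets bit `bit` in `word` and reports
// whether it was already set. Requires bit < number of bits in unsigned.
inline bool atomic_bit_test_and_set(std::atomic<unsigned>& word, unsigned bit)
{
    const unsigned mask = 1u << bit;
    const unsigned old = word.fetch_or(mask, std::memory_order_acq_rel);
    return (old & mask) != 0;
}

// Small usage example: the first caller sees "was clear" and owns the flag;
// any later caller sees "was set" and knows somebody else got there first.
inline void atomic_bit_example()
{
    std::atomic<unsigned> flags{0};
    assert(!atomic_bit_test_and_set(flags, 5));  // we claimed bit 5
    assert(atomic_bit_test_and_set(flags, 5));   // already claimed
    assert(flags.load() == 32u);
}
```

Because the read and the write happen as one atomic operation, no separate critical section is needed around the flag itself.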


Have these intrinsics been implemented in version 10?

Thanks for checking back. But sorry. It's not in 10.0. :(

I've sent a note to the engineer.

The new intrinsic "_bittest" and some others like "_bittestandset" will be added later this year.

Once it's available, I'll post the news here.

I have to add that the new intrinsic "_bittest" and the others will be added later this year, but these intrinsics may not meet your requirements. A better version will be added after that, which will take some more time. Again, I'll post the news here.

Thanks Jennifer for the update and communications

I like Core, which is much better than NetBurst :(

The current compilers still carry on the tradition of avoiding certain instructions that are slow on P4.

Glad to hear we are going to get new intrinsics. Thanks again!

How about naming them shorter to save some typing?

For example _bt instead of _bittest and _bts instead of _bittestandset which is awkward to type?

Igor Levicki

That's fine and makes perfect sense. However, if you consider they already name _bit_scan_forward for bsf and _bit_scan_reverse for bsr, the longer ones make it consistent.

So you are suggesting that they also rename _mm_move_ps to _move_aligned_packed_single_precision_floating_point for consistency with those longer names?

I would rather introduce short names for _bit_scan_forward and _bit_scan_reverse, and leave the old ones as aliases for compatibility reasons. As you see consistency can be satisfied both ways.

Igor Levicki

Definitely not

I also prefer short names that match their asm counterparts with a leading underscore. The big plus for short names is that you can remember them easily because you already know the asm instructions. So you have a really good idea: introduce short names for the existing awkward long names like _bit_scan_forward and still keep consistency. Over time the long ones can be deprecated.

_bsf makes more sense to me.


any update on this?


Sorry, not yet in the product. I'll keep pressing.

We have added more support for the new intrinsics of VS 2005. This feature is in the next version of the 10.0 product. I'll post the release news here when it is available.


A new compiler, 10.1, was released yesterday. You should be able to download it from the Registration Center now.

This 10.1 compiler supports the following new intrinsics that might be of interest to you. The definitions of those intrinsics are in the VS2005 header file:













Why are there no short names, and why is the capitalization not consistent?

Igor Levicki

Note that those intrinsics are defined by VS2005. ICL supports them now in 10.1 release.

Jennifer, thank you very much for getting those intrinsics implemented.

I don't think the long names are a big deal here, though they are a little bit hard to remember. I can use inline functions to wrap them up the way I like. The good thing is that one does not need to add #if blocks to differentiate between Intel and MS builds.

However, more optimization work needs to be done with those intrinsics to make them useful.

Here is a simple and quick test:

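// BT.cpp : Test bt intrinsics

#include <intrin.h>   // declares _bittest

inline int bt(const long m, int i)
{
    return _bittest(&m, i);
}

int main(int argc, char* argv[])
{
    // Test bit 24 of argc through the memory-operand intrinsic.
    return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
}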


The asm output from the Intel C/C++ 10.1 compiler for the x64 build (the 32-bit output is basically the same):

; -- Machine type EFI2
; mark_description "Intel C++ Compiler for applications running on Intel 64, Version 10.1 Build 20070913 %s";
;ident "/manifestdependency:"type='win32' name='Microsoft.VC80.CRT' version='8.0.50727.762' processorArchitecture='amd64' publicKeyToken='1fc8b3b9a1e18e3b'""
; COMDAT main
; -- Begin main
; mark_begin;
IF @Version GE 800
ELSEIF @Version GE 612
IF @Version GE 800
ELSEIF @Version GE 614
; parameter 1: ecx
; parameter 2: rdx
$B1$1: ; Preds $B1$0
;;; {
sub rsp, 40 ;12.1
mov DWORD PTR [rsp+48], ecx ;12.1
mov ecx, 3 ;12.1
call __intel_new_proc_init ;12.1
; LOE rbx rbp rsi rdi r12 r13 r14 r15 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$5: ; Preds $B1$1
stmxcsr DWORD PTR [rsp+32] ;12.1
or DWORD PTR [rsp+32], 32832 ;12.1
ldmxcsr DWORD PTR [rsp+32] ;12.1
;;; return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
lea rdx, QWORD PTR [rsp+48] ;13.9
mov eax, 24 ;13.9
bt DWORD PTR [rdx], eax ;13.9
setb al ;13.9
; LOE rbx rbp rsi rdi r12 r13 r14 r15 eax xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
$B1$2: ; Preds $B1$5
movzx eax, al ;13.9
xor ecx, ecx ;13.46
mov edx, -1 ;13.46
cmp eax, 0 ;13.46
cmovne ecx, edx ;13.46
mov eax, ecx ;13.46
add rsp, 40 ;13.46
ret ;13.46
; mark_end;
main ENDP
; COMDAT xdata
$unwind$main DD 010401H
DD 04204H
xdata ENDS
; COMDAT pdata
$pdata$main DD @imagerel(main#)
DD @imagerel(main#+76)
DD @imagerel($unwind$main#)
pdata ENDS
;main ENDS
; -- End main
EXTRN __intel_new_proc_init:PROC
As you can see, it is not optimized at all, which defeats the purpose of the intrinsic function.
_bittest only uses the memory form of the bt instruction, even when the index is a literal constant within the bit width of an integer.
lea rdx, QWORD PTR [rsp+48]
mov eax, 24
setb al 
movzx eax, al
The above instructions are not needed at all if the bt reg, reg/const form is chosen here; a conditional jump/set/move would do the job nicely right after the bt instruction, which changes the carry flag.
The Microsoft VC compiler does the same thing:

; Listing generated by Microsoft Optimizing Compiler Version 14.00.50727.762




; Function compile flags: /Ogtpy

; File c:documents and settingsdzhaomy documentsvisual studio 2005projectsttt.cpp

; COMDAT main


argc$ = 8

argv$ = 16


; 13 : return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;

00000 48 8d 44 24 08 lea rax, QWORD PTR argc$[rsp]

00005 89 4c 24 08 mov DWORD PTR [rsp+8], ecx

00009 0f ba 20 18 bt DWORD PTR [rax], 24

0000d 0f 92 c0 setb al

00010 f6 d8 neg al

00012 1b c0 sbb eax, eax

; 14 : }

00014 c3 ret 0

main ENDP



You're right. More improvements will be in the next major release and icl will support some bitscan and bittest intrinsics without memory reference. Sorry no schedule on next release.

Example is below.

Source code - a.c
unsigned char _bittest(long *, long);
unsigned char my_bittest(long num)
{
    return _bittest(&num, 15);
}

With the next version of icl, you'll get something like the following:
my_bittest PROC
; parameter 1: ecx
$B1$1:: ; Preds $B1$0
mov edx, 15 ;25.12
xor eax, eax ;25.12
mov r8d, 1 ;25.12
bt ecx, edx ;25.12
cmovc eax, r8d ;25.12
ret ;25.12

Jennifer, Instead of:

mov	edx, 15
bt	ecx, edx

If 15 is a constant known at compile time it would be better just:

bt	ecx, 15

Note that you can also save some bytes in long mode by using edx instead of r8d:

	xor	eax, eax
	mov	edx, 1
	bt	ecx, 15
	cmovc	eax, edx

Or how about this:

	xor	eax, eax
	bt	ecx, 15
	rcl	eax, 1	; this shifts the carry in giving 0 or 1 in eax

Seems a lot shorter, no? If there are no false dependencies on the flags (from partial access) it would probably also run faster.

Igor Levicki

I ran the same test again today and wanted to report the great results from the Intel C++ Compiler.

// BT.cpp : Test bt intrinsics

inline int bt(const long m, int i)
{
    return _bittest(&m, i);
}

int main(int argc, char* argv[])
{
    return _bittest((const long*) &argc, 24) == 0 ? 0 : -1;
}

Intel asm output snip:

xor eax, eax
mov ecx, -1
bt edx, 24 ;18.2
cmovc eax, ecx

The Intel asm output is simply great!

Now compare it to the asm code snip from Microsoft C++ Compiler Version 16.00.31118.01 (shipped with VS10):

lea eax, DWORD PTR _argc$[ebp]
bt DWORD PTR [eax], 24
setb al
movzx eax, al
neg eax
sbb eax, eax

Great work! Thanks everybody!

How about

xor eax,eax
bt edx,24
adc eax,eax

Jim Dempsey

Quoting jimdempseyatthecove:

How about

xor eax,eax
bt edx,24
adc eax,eax

Jim Dempsey

That is better. However, it is not really bit test specific. You're addressing a more general problem: How to handle conditional jump more efficiently.

The compiler's conditional handling is OK but not perfect.
Anyway, I'm already pleased with the bit test code generation: it avoids memory whenever possible.

The above has three attractive features

1) it removes one instruction
2) it removes one immediate value
3) works on processors that do not support conditional move

Jim Dempsey
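For readers following along in C++, the branchless sequences discussed above correspond roughly to the two expressions sketched below. The function names are hypothetical. The sketch also makes one difference visible: xor/bt/adc leaves 0 or 1 in eax, which matches a function returning the bit itself, whereas the test program's "== 0 ? 0 : -1" needs 0 or -1, which is the negated form (what the cmovc with -1 and the sbb eax, eax sequences in the listings produce).

```cpp
#include <cassert>
#include <cstdint>

// 0 or 1: the value that "xor eax,eax / bt reg,i / adc eax,eax" leaves in eax.
inline int bit_as_01(std::uint32_t x, unsigned i)
{
    return static_cast<int>((x >> i) & 1u);   // requires i < 32
}

// 0 or -1: what `_bittest(&x, i) == 0 ? 0 : -1` evaluates to.
inline int bit_as_mask(std::uint32_t x, unsigned i)
{
    return -bit_as_01(x, i);
}

// Small usage example with bit 24, the index used in the test program.
inline void branchless_example()
{
    const std::uint32_t with_bit = std::uint32_t{1} << 24;
    assert(bit_as_01(with_bit, 24) == 1);
    assert(bit_as_mask(with_bit, 24) == -1);
    assert(bit_as_01(0u, 24) == 0);
    assert(bit_as_mask(0u, 24) == 0);
}
```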
