When I compile the following C function containing some variants of x ^= x >> n:
===>
typedef unsigned int t; // or other integer types such as char or int
t test(t x)
{
t x1;
x = x ^ (x >> 1);
x = (x >> 2) ^ x;
x ^= x >> 3;
x1 = x;
x = x >> 4;
x = x ^ x1;
x1 = x;
x = x >> 5;
x = x1 ^ x;
x1 = x;
x >>= 6;
x ^= x1;
return x;
}
<===
with maximum optimization most C-compilers generate code for *all* variants of x ^= x >> n such as (registers may vary)
mov edx,eax // make a copy of eax
shr edx,5 // modify the copy
xor eax,edx
These are 3 dependent statements whereas my preferred variant is
mov edx,eax // make a copy of eax
shr eax,5 // modify original register
xor eax,edx
where the mov and the shr command easily can be processed simultaneously.
I have tested GCC, AMD's C compiler, MS-C, and Intel-C with the same result.
A speed test revealed that my variant only costs about 2/3 of the time on Intel Atom N450 and 3/4 of the time on Intel i7. I'm quite sure that other processors behave similarly.
Code like this occurs quite often in a program. In many cases the compiler can avoid modifying a copy. It might happen that the association of registers to variables are exchanged by such optimization.
This trick is not new ... but why is it not applied? Have I missed something?
BTW: My similar inquiry in the GCC forum has not been commented on for about 2 months now, see http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48064


