I have an application in which I need to perform many unary reduction XORs (parity computations) in parallel and then concatenate the result bits. I believe the fastest way to do this is to do a popcount and then deposit the LSB into the results register. However, popcount and deposit are both I0 instructions and cannot be issued simultaneously. Still, having each pair of bundles contain either a deposit or a popcount, with the remaining slots performing the operation more explicitly (i.e. shifts + XOR for the parity, then mask, shift, and OR for the deposit), would likely maximize parallelism.

Unfortunately, I can't seem to code the operation so that the compiler will choose to do this. If I use the popcount and deposit intrinsics, the compiler just serializes all of them and won't decompose the operations. If I try to code the operations explicitly, the compiler doesn't select a popcount or deposit instruction. The popcount is somewhat understandable, as the operation itself is different, but I don't understand why code like this:
j = j & 0x0000000000000001;
i = i & 0xFFFFFFFFFFFFFEFF;
j = j << 8;
m = j | i;
won't yield a single deposit instruction: dep m = j, i, 8, 1.
Anyone have any ideas or suggestions?
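
For reference, here is a minimal sketch in plain C of what I mean by the "explicit" versions: parity by shift + XOR folding, and a one-bit deposit by mask, shift, and OR. The names parity64 and deposit_bit are just illustrative helpers of mine, not intrinsics, and I am assuming 64-bit unsigned operands. Whether a given compiler turns either of these into a popcount or a dep is exactly what I can't control.

```c
#include <assert.h>
#include <stdint.h>

/* Parity (unary reduction XOR) of a 64-bit word by shift + XOR folding.
 * Returns 1 if an odd number of bits are set, 0 otherwise.
 * This is the same value as the LSB of the population count. */
static inline uint64_t parity64(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;
}

/* Explicit one-bit deposit: copy bit 0 of src into bit `pos` of dst,
 * leaving all other bits of dst unchanged. With pos == 8 this is the
 * mask / shift / OR sequence from the question. */
static inline uint64_t deposit_bit(uint64_t dst, uint64_t src, unsigned pos)
{
    assert(pos < 64);                    /* shifting by >= 64 is undefined */
    uint64_t mask = (uint64_t)1 << pos;  /* the single target bit */
    return (dst & ~mask) | ((src & 1u) << pos);
}
```

Concatenating results is then something like acc = deposit_bit(acc, parity64(w), k) for the k-th input word w.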