Thanks for the reply. I already use the SSE registers quite a bit for number-crunching functions, but some things are still inefficient with them. For example, the naive task of counting how often each byte value occurs in a block of data requires a temporary array (256 entries x 16 or 32 bits) in main memory.
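To make the byte-counting example concrete, here is a minimal scalar sketch in C. The function name byte_histogram and the choice of 32-bit counters are assumptions made for illustration, not taken from any particular library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts how often each byte value occurs in
 * data[0..len). The 256-entry table is the temporary array in memory. */
void byte_histogram(const uint8_t *data, size_t len, uint32_t counts[256])
{
    assert(counts != NULL);

    /* Clear all 256 counters (256 x 32 bits = 1 KiB). */
    memset(counts, 0, 256 * sizeof counts[0]);

    for (size_t i = 0; i < len; ++i) {
        /* The table index comes from the data itself, so each step is a
         * read-modify-write on a memory location chosen at run time. */
        counts[data[i]]++;
    }
}
```

Because the counter to update is selected by the data, the table cannot simply be held in SSE registers: which register an instruction uses is fixed in its encoding, not chosen by a run-time value. The table therefore sits in memory (in practice usually in L1 cache).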
Now, I'm by no means an expert on the intricacies of Intel microcode, but I think there are enough clever people around to overcome current design limitations.
Again, thanks for the reply,
James