I'm optimizing a series of slow math operations into a single vectorized routine using SSE[x] instructions / intrinsics, in 32-bit code. I'm trying to find an elegant way to convert 64-bit integers to double-precision floating point without using the FPU, which incurs the hit of storing to memory and then reloading into xmm registers to get back into the SSE world. As far as I can tell, SSE doesn't support 64-bit int <-> double conversion, so using the FPU is the only obvious way. Does anyone know of an elegant way to do this that would be faster than going through memory with the FPU?
From timing tests I've done, avoiding the FPU load & store that this int->float conversion requires would make the routine almost three times as fast with large data sets.
This routine does a bunch of 64 bit integer math with SSE, converts to floating point (doing a store / load / store / load combo with the FPU), does a bunch of floating point with SSE, then finally returns to 64 bit integers and stores the results.
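For reference, the round trip through memory described above looks roughly like the sketch below. The helper name `int64_to_double_via_memory` is made up for illustration, and the exact instructions emitted for the casts depend on the compiler (32-bit x86 compilers typically use FILD / FSTP for them).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical baseline: converts two signed 64-bit ints to doubles
   by leaving the xmm registers and going through memory. */
static __m128d int64_to_double_via_memory(__m128i v)
{
    int64_t tmp_i[2];
    double  tmp_d[2];

    _mm_storeu_si128((__m128i *)tmp_i, v);  /* store: xmm -> memory        */
    tmp_d[0] = (double)tmp_i[0];            /* scalar conversion, lane 0   */
    tmp_d[1] = (double)tmp_i[1];            /* scalar conversion, lane 1   */
    return _mm_loadu_pd(tmp_d);             /* load: memory -> xmm         */
}
```

This is the path the question is trying to avoid; it is only here to make the cost being measured concrete.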
I've got that last step, going from 64-bit floats to 64-bit ints, quite fast within SSE: since I know the range of the data (<2^52), I can just add rounding and normalizing constants and then mask out the mantissa to extract it as an integer.
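A sketch of the kind of trick meant here, not the exact code from the routine. It assumes non-negative inputs no larger than 2^52 - 1 and the default round-to-nearest mode; the helper name `double_to_int64_small` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: converts two doubles in [0, 2^52 - 1] to 64-bit ints.
   Adding 2^52 forces the exponent to 52, so the rounded integer value ends up
   in the low 52 mantissa bits, which are then masked out. */
static __m128i double_to_int64_small(__m128d x)
{
    const __m128d magic = _mm_set1_pd(4503599627370496.0);  /* 2^52 */
    /* Low 52 bits of each 64-bit lane (high dword 0x000FFFFF, low dword all ones). */
    const __m128i mantissa_mask = _mm_set_epi32(0x000FFFFF, -1, 0x000FFFFF, -1);

    __m128d shifted = _mm_add_pd(x, magic);  /* rounds x to an integer */
    return _mm_and_si128(_mm_castpd_si128(shifted), mantissa_mask);
}
```

Signed inputs would need a different constant and a subtraction instead of a mask; that variant isn't shown.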
Does anyone know of any easy tricks (i.e., ones that stay within xmm registers and don't require branching) to convert 64-bit ints to 64-bit floats, given that there's no SSE instruction for doing this (and please correct me if I'm wrong)?
I was thinking that CVTDQ2PD (convert dword ints to doubles) could be utilised for this by converting the high dwords and low dwords separately and combining them after adjusting the exponent, but that got messy. There's also the BSR (bit scan reverse) instruction, which works on general-purpose registers and could help figure out the exponent (once the value was made positive and the sign stored away), but that was also less than satisfying. Both of these lines of thought were so complicated that I'm not sure they're faster than just going back and forth to memory.
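One way the high/low dword split could be done without CVTDQ2PD, BSR, or branching is a magic-constant variant: place each dword directly into the mantissa of a double with a known exponent, then remove the constants with ordinary floating-point arithmetic. The sketch below is an illustration under stated assumptions, not a measured solution: it uses SSE2 only, assumes the default rounding mode, and assumes the compiler does not reassociate floating-point additions (so no `-ffast-math` or similar). The helper name `int64_to_double_sse2` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: converts two signed 64-bit ints to doubles in xmm registers. */
static __m128d int64_to_double_sse2(__m128i v)
{
    /* Bit patterns of doubles, written as (high dword, low dword) per lane:
       0x43300000_00000000 = 2^52
       0x45300000_80000000 = 2^84 + 2^63
       0x45300000_80100000 = 2^84 + 2^63 + 2^52 */
    const __m128i magic_lo  = _mm_set_epi32(0x43300000, 0, 0x43300000, 0);
    const __m128i magic_hi  = _mm_set_epi32(0x45300000, (int)0x80000000u,
                                            0x45300000, (int)0x80000000u);
    const __m128i magic_all = _mm_set_epi32(0x45300000, (int)0x80100000u,
                                            0x45300000, (int)0x80100000u);
    const __m128i lo_mask   = _mm_set_epi32(0, -1, 0, -1);

    /* Low dword into the mantissa of 2^52: value is 2^52 + lo. */
    __m128i lo = _mm_or_si128(_mm_and_si128(v, lo_mask), magic_lo);

    /* High dword moved down, sign bit flipped, placed in the mantissa of 2^84:
       value is 2^84 + 2^63 + hi * 2^32, with hi read as a signed dword. */
    __m128i hi = _mm_xor_si128(_mm_srli_epi64(v, 32), magic_hi);

    /* (hi part - all constants) is exact: hi * 2^32 - 2^52. */
    __m128d hi_d = _mm_sub_pd(_mm_castsi128_pd(hi), _mm_castsi128_pd(magic_all));

    /* Adding 2^52 + lo cancels the remaining constant; this is the only rounding step. */
    return _mm_add_pd(hi_d, _mm_castsi128_pd(lo));
}
```

The order of the last two operations matters: subtracting the combined constant first keeps the intermediate exact, so the result is rounded only once. Whether this actually beats the memory round trip would have to be timed on the target CPU.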