I benchmarked vsSqrt and vsInvSqrt against SSE loops that use sqrtps and sqrtps/divps, respectively. Each routine is passed a 2^20-element array and is run 256 times. My platform is a RevE (SSE3) Athlon64 on XP64. The compiler is msvc8 beta2. The SSE code runs at the same speed on 32 and 64bit, but the VML code is much slower on 64bit: 64bit vsInvSqrt is 4x slower than its 32bit counterpart, and 64bit vsSqrt is almost 8x slower than its 32bit counterpart. The errors are the same between 32 and 64bit, so I'm guessing that the libraries are implementing the same algorithm. In both cases I am linking against the static, not the DLL, libraries. The timings listed here:
don't show any significant difference between the 32bit and 64bit libs. Has anyone else seen these sorts of timings on 64bit?
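For anyone who wants to reproduce the comparison, here is a minimal sketch of what SSE loops of this kind might look like. It is not the exact code that was timed: the function names sse_sqrt and sse_inv_sqrt are hypothetical, and the sketch assumes the element count is a multiple of 4 and uses unaligned loads and stores so that the arrays need no special alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ps, _mm_div_ps */

/* Hypothetical helper: out[i] = sqrt(in[i]), four floats per iteration.
 * _mm_sqrt_ps compiles to the sqrtps instruction.
 * Assumption: n is a multiple of 4 (true for a 2^20-element array). */
void sse_sqrt(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);          /* unaligned load */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(x));   /* sqrtps */
    }
}

/* Hypothetical helper: out[i] = 1 / sqrt(in[i]) at full precision,
 * i.e. sqrtps followed by divps (not the approximate rsqrtps). */
void sse_inv_sqrt(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        __m128 r = _mm_div_ps(one, _mm_sqrt_ps(x));  /* sqrtps + divps */
        _mm_storeu_ps(out + i, r);
    }
}
```

Calling each function 256 times on a 2^20-element float array and timing the whole run would approximate the setup described above. The VML side would call vsSqrt or vsInvSqrt on the same arrays.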

32bit INVSQRT (sqrtps + divps)
SSE time: 2.801254s.
Standard deviation: 1.040377e-013.
VML time: 1.329339s.
Standard deviation: 7.497016e-014.

32bit SQRT (sqrtps)
SSE time: 1.513263s.
Standard deviation: 8.583913e-009.
VML time: 1.847221s.
Standard deviation: 9.348521e-009.

64bit INVSQRT (sqrtps + divps)
SSE time: 2.809789s.
Standard deviation: 1.050167e-013.
VML time: 5.318174s.
Standard deviation: 7.547861e-014.

64bit SQRT (sqrtps)
SSE time: 1.531398s.
Standard deviation: 8.583939e-009.
VML time: 14.243044s.
Standard deviation: 9.348549e-009.