Double precision Newton-Raphson

I have never seen a compiler (GNU or Intel) generate Newton-Raphson (NR) constructs for faster double precision (DP) divides or square roots. I know that there are no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS. Three questions:
 - Why are there no DP equivalents of RCPSS, RCPPS, RSQRTSS and RSQRTPS?
 - Is it possible, with compiler flags, to generate NR constructs for DP using the existing fast single precision RCP and RSQRT instructions (with more NR iterations, probably 4 or 5 instead of 2)?
 - If it is not possible, why? Is it not efficient? Is there no demand for faster DP divides or square roots with precision close to full DP?
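To make the second question concrete, here is a minimal hand-written sketch of the construct in question, assuming an x86 target with SSE. The name nr_recip and its iters parameter are made up for illustration; this is not something either compiler is known to emit for DP, and it ignores zero, infinity, NaN and values outside the normal float range.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rcp_ss and friends (x86 only) */

/* Hypothetical helper, not a compiler builtin: approximate 1.0/x in double
 * precision, seeded by the single precision RCPSS estimate.
 * Assumes x is finite, nonzero and representable as a normal float. */
static double nr_recip(double x, int iters)
{
    assert(iters >= 0);
    /* RCPSS seed: relative error at most 1.5 * 2^-12. */
    double y = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss((float)x)));
    /* Each step y <- y * (2 - x*y) roughly squares the relative error. */
    for (int i = 0; i < iters; ++i)
        y = y * (2.0 - x * y);
    return y;
}
```

Since each step roughly squares the relative error, a seed good to about 2^-12 should give about 2^-23 after one step, about 2^-45 after two, and should reach the rounding limit of double after three. The result is close to DP but not guaranteed to be correctly rounded, which is why this would only fit a non-precise FP model. A divide a/b would then be a * nr_recip(b, 3).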


Thank you in advance


Once in a while, consideration is given to making rcpps et al. sufficiently accurate (as the original AMD version was) to get a double N-R result in 2 iterations (I guess you would count 3). There seems to be consensus that it's not worthwhile.
In the Sandy Bridge, you might consider that the lack of an AVX-256 parallel divide leaves an opening. The improvements in Ivy Bridge et al. seem to be a better method to fix this than adoption of N-R.
Maybe you can see your wish partly granted in the Intel(c) Xeon Phi(tm) implementation.

Thanks for your answer.

According to [link], strong speedups (up to 3.5x) can be gained by using DP NR. I will try to reproduce them on my own. If I can get speedups greater than 1.2x, I think it is well worth making the compiler generate NR constructs for divides and square roots by default (with non-precise FP models), exactly as it does for single precision. Vectorization is orthogonal: packed versions are available for both the IEEE (slow) and non-IEEE (fast) instructions. I do know that on Ivy Bridge VDIVPS/D (ymm) will be natively 256 bits wide, unlike on Sandy Bridge, implying a 2x speedup for this instruction on Ivy Bridge compared to Sandy Bridge.
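For the square root case, the construct I have in mind would look roughly like the sketch below, again assuming x86 with SSE. The names nr_rsqrt and nr_sqrt are invented for illustration, and the code assumes x is positive, finite and within the normal float range; it is a scalar sketch only, not a tuned or packed version.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_rsqrt_ss and friends (x86 only) */

/* Hypothetical helper: approximate 1/sqrt(x) in double precision,
 * seeded by the single precision RSQRTSS estimate.
 * Assumes x is positive, finite and representable as a normal float. */
static double nr_rsqrt(double x, int iters)
{
    assert(iters >= 0);
    /* RSQRTSS seed: relative error at most 1.5 * 2^-12. */
    double y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss((float)x)));
    /* NR step for f(y) = 1/y^2 - x: y <- y * (1.5 - 0.5 * x * y^2). */
    for (int i = 0; i < iters; ++i)
        y = y * (1.5 - 0.5 * x * y * y);
    return y;
}

/* Hypothetical helper: sqrt(x) recovered as x * (1/sqrt(x)). */
static double nr_sqrt(double x, int iters)
{
    return x * nr_rsqrt(x, iters);
}
```

Convergence is quadratic here too, so three steps from the RSQRTSS seed should get close to DP rounding, while two steps leave a few bits short of that. Whether this beats the hardware SQRTSD/DIVSD depends on the microarchitecture, which is what I want to measure.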

The much faster division and sqrt on Ivy Bridge would greatly alleviate the problem (nearly 2x speedup), but they are still sequenced 128-bit wide operations for now.
According to the URL you posted, about 48 bits of accuracy was all that was desired from the "double" division. That would correspond to the ICL option /Qimf-accuracy-bits:48 (or maybe 44), in case you have a context where that option is implemented. I couldn't see whether they were considering vectorized code, which is the situation where Intel compilers make use of the lower accuracy options.
