vectorization of sin/cos results in wrong values

67 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Quoting iliyapolak

Windows Calculator was used:a top result iswithout a range reduction and bottom results is with a "simulated" range reduction

Hi Sergey!
How were you able to disable a range reduction in Windows Calculator?
I need to check Calc.exe imports...
Please check it andlet us knowif you find something interesting. To be honest, I'm very impressed with capabilities
of Windows Calculator andI finally understood why it is so accurate. Please take a look at its help:

WinCalc Help -> Index -> precision

and this is what it displays:

Understanding Extended Precision [SergeyK] Is that a128-bit precision?

Extended Precision, a feature of Calculator, means that all operations are accurate to
at least 32 digits. Calculator also stores rational numbers as fractions to retain accuracy.
For example, 1/3 is stored as 1/3, rather than .333. However, errors accumulate during
repeated operations on irrational numbers. For example, Calculator will truncate pi to 32 digits,
so repeated operations on pi will lose accuracy as the number of operations increases.

Quoting TimP (Intel)...I suppose you can perform range reduction yourself before calling the library function...

There is a small problem here. It is impossible to use '%' on any FP-data types.A call to 'fmod' CRT function
creates additional overhead. An example of acompilation error with '%' operator is below:

...
float fValue1 = 765.0f;
float fValue2 = fValue1 % ( int )360;
...

Compilation error:
...
..\Test\DspSetTest.cpp(824) : error C2296: '%' : illegal, left operand has type 'float'
...

Quoting TimP (Intel)
...You could profile or disassemble to see whether the library function uses fsin or fsincos...

It uses SSE2 by default ( SSE2 support is enabled ),or Intel instruction'fsin' when SSE2 support is disabled.
Please take a look at http://software.intel.com/en-us/forums/showpost.php?p=190884

Quoting iliyapolak...sin() functions which in turn calls x87 fsin. So disabling a range reduction is not possible because fsin
performs in range reduction in the hardware...
I hope that you've already looked at:

http://software.intel.com/en-us/forums/showpost.php?p=190884
and
http://software.intel.com/en-us/forums/showpost.php?p=190885

Microsoft software developers implemented CRT'sin'functionin a different way. I'll provide technicaldetails later.

Best regards,
Sergey

Hi Sergey!
I have also investigated MSVCRT.DLL dynamic library which is responsible for runtime math function.
I disassembled this library and counted at least three sin function.One of them uses x87 fsin to perform calculation.Later I will post code snippets.

My search engine skills aren't up to the task of researching this, but surely after all the discussion about how fsin does range reduction you don't expect a single standard library function or x87 intrinsic (fprem) to do the job. If it did, we wouldn't have to settle for these things being buried in proprietary code, or obscured beyond belief in open source code.
I didn't think you'd take my suggestion of performing your own range reduction and then calling your compiler's library function as a way to increase speed. If I wasn't clear, I meant it as a step to comparing the accuracy of your method against the one in the library function.
Did you look at any posted code examples such as Moshier's cephes library on netlib?
You could look in glibc or newlib, but those seem to set a priority on obscurity.
Plauger's "The Standard C Library" gives the shortest useful explanation I've seen of a typical compromise between speed and accuracy.
For example, my quick and dirty method for 128-bit long double involves these steps:
calculate how many multiples of Pi/2 you must subtract to produce |x| < Pi/4.
Take a value of Pi/2 correctly rounded to double precision (best double approximation to Pi/2); working in long double, subtract the multiple of the double value, then subtract the multiple of the difference between Pi/2 and your double rounded value of Pi/2. You must use compiler options which follow the standards on associativity with parentheses (not icc default). In effect, you are using a value of Pi/2 which is correct to the best of your ability to double plus long double precision. Long double and a half, in the best case.
There have been published implemented algorithms for numerically exact range reduction; but I haven't got my hands on them.
I don't think Bill Plauger got rich trying to explain this, so I'm not expecting much.

What Windows OS did you use? Here is a screenshot with a version of Windows Calculator I used

I have Windows 7.
I tried three times to calculate sine of 10^1024 radians ,but unfortunately Calc.exe always has been in "hang" state.

I hope that you've already looked at:

http://software.intel.com/en-us/forums/showpost.php?p=190884
and
http://software.intel.com/en-us/forums/showpost.php?p=190885

Microsoft software developers implemented CRT'sin'functionin a different way. I'll provide technicaldetails later

Here is the disassembled code from one of the CRT library sin() functions.You can clearly see the x87 'fsin' call.
The investigated CRT sine implementation is called
_CIsin which is compiler optimized version of many sin() implementations.Please read this post:
http://software.intel.com/en-us/forums/showthread.php?t=105474post #201

               push    edx
.text:6FF759A3                 fstcw   [esp+4+var_4]
.text:6FF759A7                 jz      loc_6FF7E5AC
.text:6FF759AD                 cmp     [esp+4+var_4], 27Fh
.text:6FF759B3                 jz      short loc_6FF759BB
.text:6FF759B5                 fldcw   ds:word_6FF5C0D6
.text:6FF759BB
.text:6FF759BB loc_6FF759BB:                           ; CODE XREF: sub_6FF759A2+11j
.text:6FF759BB                 fsin ; x87 instruction call .text:6FF759BD                 fstsw   ax
.text:6FF759C0                 sahf
.text:6FF759C1                 jp      loc_6FF7E552
.text:6FF759C7
.text:6FF759C7 loc_6FF759C7:                           ; CODE XREF: sub_6FF759A2+8BC4j
.text:6FF759C7                 cmp     dword_6FFF50C4, 0
.text:6FF759CE                 jnz     loc_6FF6C003
.text:6FF759D4                 mov     edx, 1Eh
.text:6FF759D9                 lea     ecx, unk_6FFF5390
.text:6FF759DF                 jmp     loc_6FF5EE3C
.text:6FF759DF sub_6FF759A2    endp

I've spent some time today investigating it and my answer is No. However, I've discovered a very interesting
feature ofCRTmath functions from Microsoft

MSVCRT.DLL has three sine functions:

1.__libm_sse2_sin >> This function uses SSE instructions which are part of the Horner scheme approximation of the sine function.

2._CIsin>> which is compiler optimized version of many sin() implementations please look here:http://software.intel.com/en-us/forums/showthread.php?t=105474post #201

3.sin function>> asfar as I was able to understand assembly code this function also calls 'fsin'.

Did you look at any posted code examples such as Moshier's cephes library on netlib?
You could look in glibc or newlib, but those seem to set a priority on obscurity

You can also look at FDLIBM 5 free math library which implements range-reduction for trigonometric functions.
My method is slightly different.
I start with a large radian argument and divide it by 2*Pi radians.Next I have a value which represents a number of full 2*Pi cycles and the decimal part of this number I isolate and extract.
Next this decimal part is multiplied by 2*Pi in order to obtain the proper quadrant from this value I extract decimal part which is multiplied by Pi/2 and the result is subtracted from Pi/2.By knowing the exact part of the full sine cycle which is obtained from the first decimal value I use if- statement which returns a proper quadrant sign.

Understanding Extended Precision [SergeyK] Is that a128-bit precision?

It is hard to tell without seeing the source code or doing the full disassembling.
It seems to me that probably when the "Extended Precision" mode is enabled Calc.exe could possibly switch to some kind of "poor's man" arbitrary precision arithmetics.
Afaik some of the arbitrary precision implementations storerational numbers as a fractions.

Here is link to Microsoft page describing '_set_SSE2_enable' : http://msdn.microsoft.com/en-us/library/bthd138d.aspx
Interesting that there is no support for trigonometric functions beside the 'atan'.
From my own investigation I know that run-time library implements SSE - coded sin() function.

Quoting TimP (Intel)My search engine skills aren't up to the task of researching this, but surely after all the discussion about how fsin does range reduction you don't expect a single standard library function or x87 intrinsic (fprem) to do the job...

[SergeyK] IfI will decide to implementmy own Sine based on Intel fsin instruction I won't do a range
reduction at the beginning. There is a high probability ( let's say 95% )that an input argument in radians
will be in a range from -2^63 to +2^63. So, simply try to use fsin and check for anerror. If there is
some error than do a range reduction and call fsin again. If there is no error afterthe 1st call to fsinthen return a result.

For example, my quick and dirty method for 128-bit long double...

[SergeyK] Unfortunately,I don't use a'long double' FP-type. I've done a couple of tests yesterday
for a 'double' FP-type ( 53-bit precision ) and my range reduction codeworks well up to 10^12*Radians.
At 10^14*Radians accuracy drops ( just 2 or 3 digits are correct depending on a value of an angle ),
and after 10^16*Radians all resultsare unacceptable.

[SergeyK] IfI will decide to implementmy own Sine based on Intel fsin instruction I won't do a range
reduction at the beginning. There is a high probability ( let's say 95% )that an input argument in radians
will be in a range from -2^63 to +2^63. So, simply try to use fsin and check for anerror. If there is
some error than do a range reduction and call fsin again. If there is no error afterthe 1st call to fsinthen return a

I suppose that 'fsin' is doing range reduction already on the values larger than [-Pi/4,Pi/4].
You can also choose the Java approach anddevelop your own range reduction method and call it if the range is larger than [-Pi/4,Pi/4] after method returns the properly reduced value you can pass it to the fsin.

Quoting iliyapolak...Interesting that there is no support for trigonometric functions beside the 'atan'...
I just checked a description for '_set_SSE2_enable' function in online version of MSDN andin my local installationMSDN and
I see that both are not up-to-date. The list of CRT functions with enabled support for SSE2 instructions is not complete.

The list of CRT functions with enabled support for SSE2 instructions is not complete

Sergey please read this post #71 there you can clearly see that Microsoft has various implementation of elementary functions.
Take a look at this function '__libm_sse2_sin' this is a library of SSE2 coded elementary functions.
You can use dumpbin.exe and study the MSVCRT.DLL imports.

As Jennifer mentioned previously was coming, the Intel(R) C++ Composer XE 2013 product is now available and resolves the reported issue with vectorization of sin/cos. Let us know if you run into any problems with that version.

Pages

Leave a Comment

Please sign in to add a comment. Not a member? Join today